DEV Community: amir

The Rust Performance Trap I Hit While Sorting Small Network Datasets

amir — Sat, 20 Jun 2026 04:58:26 +0000

I recently ran into one of those performance problems that looks completely obvious in hindsight, but was surprisingly hard to notice while I was inside the code.

I was working on a deep search and ranking part of a Rust project that processed network-related data.

The system had to evaluate many small groups of candidates repeatedly. You can think of these candidates as possible network paths, intermediate nodes, route-like records, latency candidates, or weighted decisions inside a search tree.

At each step, I had a small list of candidates.

Not thousands of items.

Usually something like 10, 20, maybe 40 or 50.

Each candidate needed to be scored and ordered before the search could continue.

The first version of the code was simple and readable:

let mut candidates = Vec::new();

for item in input {
    candidates.push(item);
}

candidates.sort_by_key(|item| compute_score(item));

It worked.

It was clean.

But after profiling it with a flamegraph, I noticed something uncomfortable: the program was spending a large share of its time allocating and freeing temporary vectors.

At first, that sounds strange. A small Vec does not look expensive.

But in a deep search algorithm, a "small allocation" repeated millions of times is no longer small.

That was the beginning of the rabbit hole.

The First Bottleneck: Heap Allocation

The first problem was heap allocation.

Every time the search visited a node, it created temporary vectors to collect, score, and sort candidates. Since the search tree was deep, this happened again and again.

One allocation does not matter much.

Thousands of allocations may still be fine.

But millions of tiny allocations inside a hot path can become a serious bottleneck.

The important thing is this:

Heap allocation is not just "getting some memory".

There is real machinery behind it.

When a program allocates memory on the heap, the allocator has to find a suitable block of memory, update internal metadata, sometimes handle alignment, sometimes split or reuse blocks, and later track that memory so it can be freed safely.

Even if the allocator is very optimized, this is still more expensive than using memory that is already available on the stack.

With stack memory, the cost is usually very small. The program mostly adjusts the stack pointer. It is simple, predictable, and extremely cache-friendly.

Heap memory is more flexible, but that flexibility has a price.

In my case, I did not need flexibility.

I already knew the candidate lists were small. I did not need an unbounded dynamic vector for every step of the search. I needed a tiny temporary buffer.

But I was still paying the cost of heap allocation over and over again.

Why Heap Allocation Can Be Expensive in a Hot Path

Heap allocation becomes especially problematic when it happens inside a tight loop or a deeply repeated algorithm.

There are a few reasons for that.

1. Allocator Overhead

A heap allocator has to manage memory dynamically.

That means it needs bookkeeping.

When I create a Vec, Rust itself is not doing something inefficient. Vec is a great data structure. But when it needs capacity, it asks the allocator for heap memory.

The allocator has to find a free region that fits the requested size. It may reuse an existing block, split a larger block, or request more memory from the system.

Even when this is fast, it is not free.

In normal application code, this cost may not matter. In a hot path, it can dominate the runtime.

2. Memory Locality

Stack memory is usually very close to the current execution context.

Small arrays on the stack tend to be friendly to the CPU cache. The CPU can load nearby memory efficiently, and the access pattern is usually predictable.

Heap memory can be more scattered.

If temporary vectors are allocated and freed repeatedly, the actual memory locations may not be as local as a compact stack buffer.

This matters because modern CPUs are extremely fast, but memory is relatively slow. A lot of real-world performance comes down to whether the CPU can keep useful data in cache.

If the CPU has to wait for memory, the algorithm may look good on paper but still run slower in practice.

3. Fragmentation and Metadata

Heap allocators also maintain metadata.

They need to know which blocks are free, which blocks are used, and how large those blocks are.

Over time, allocation and deallocation patterns can create fragmentation. For small short-lived allocations, modern allocators are usually smart, but they still have to manage the lifecycle of memory.

For a deep search algorithm, this was unnecessary noise.

The lifetime of the data was very short. Each temporary list only lived during one search step.

That is exactly the kind of data that usually fits better on the stack.

4. More Pressure on the CPU Cache

The real issue was not only the cost of allocation itself.

The bigger problem was the combination of allocation, sorting, scoring, and repeated traversal.

The algorithm was doing many tiny operations. Each operation looked harmless, but together they created pressure on the CPU cache.

Temporary heap allocations increased the amount of memory traffic. The CPU had to touch more memory, follow more pointers, and deal with less predictable access patterns.

When code runs once, this is not a big deal.

When code runs inside the hottest part of a search engine, it matters.

The First Optimization: Remove Temporary Heap Allocation

So my first idea was simple:

Stop creating temporary vectors in the hot path.

Instead of collecting candidates into a heap-allocated Vec, I moved toward a fixed-size temporary buffer.

Conceptually, something like this:

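use std::mem::MaybeUninit;

const MAX_CANDIDATES: usize = 64;

// An array of MaybeUninit slots needs no initialization, so this assume_init is sound.
// Each slot must still be written before it is read.
let mut buffer: [MaybeUninit<ScoredCandidate>; MAX_CANDIDATES] = unsafe {
    MaybeUninit::uninit().assume_init()
};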

The exact implementation depends on the project, of course. The important part is the idea:

the number of candidates was bounded
the temporary data had a very short lifetime
the buffer could live on the stack
no heap allocation was needed for every search node

After this change, I expected the program to become faster.

And partially, I was right.

The allocation cost disappeared from the flamegraph.

But then I hit the second trap.

The Surprising Part: Removing Allocations Made It Slower

After removing the temporary heap allocation, I used Rust's standard sorting method:

items.sort_unstable_by_key(|item| compute_score(item));

At first glance, this looked like the right choice.

It is in-place. It avoids extra allocation. It is part of the standard library. It uses a serious optimized sorting algorithm.

But the program became slower.

That was confusing.

I had removed heap allocation. I had improved memory usage. I was using a well-optimized standard method.

So why did performance drop?

The flamegraph gave me the answer.

The allocation cost was gone, but now the program was spending too much time recomputing scores.

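The key detail is that sort_unstable_by_key does not cache the key for every item.

The closure is called again for each comparison, so the key of a single item can be computed many times during one sort. The standard library does offer sort_by_cached_key, which computes each key once, but it allocates a temporary vector to hold the keys, which would bring heap allocation back into the hot path.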

That is completely fine when the key is cheap, like reading an integer field:

items.sort_unstable_by_key(|item| item.priority);

But my key was not cheap.

The score calculation was based on multiple network-related signals, such as:

estimated latency
hop penalty
node priority
route freshness
failure probability
domain-specific weights

So every comparison could trigger real computation again.

And again.

I had removed heap allocation, but I had accidentally introduced repeated CPU work in the hottest part of the program.

The code looked clean, but it was doing more work than it appeared.

Why the Default Sort Was Not the Best Tool Here

Rust's sort_unstable_by_key is not bad.

Actually, it is very good.

The problem was not Rust.

The problem was that I used a general-purpose sorting method for a very specific workload.

Modern sorting algorithms are excellent for large arrays and general cases. Rust's unstable sort is designed to perform well across many real-world patterns.

But my workload had two special properties:

The arrays were very small.
The key calculation was relatively expensive.

For large arrays, an O(N log N) sort is usually the obvious choice.

But for tiny arrays, Big-O can be misleading.

When N is 10, 20, or 40, constant factors matter a lot.

At that size, the overhead of the sorting algorithm, the number of comparisons, branch behavior, cache locality, and repeated score calculation can matter more than the theoretical complexity.

In my case, the theoretical advantage of the sorting algorithm was less important than the practical cost of using it in a tiny hot loop.

The Fix: Cache the Scores and Sort a Small Stack Buffer

The final solution was simple:

Compute the score for each candidate exactly once.
Store the candidate and its score in a small stack buffer.
Sort the scored candidates using insertion sort.

Conceptually:

struct ScoredCandidate<T> {
    score: i32,
    item: T,
}

Instead of asking the sorting algorithm to repeatedly call compute_score, I made the score part of the temporary data.

The scoring step became linear:

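for candidate in candidates {
    // compute_score borrows the candidate, as in the sort closures above,
    // so the candidate can still be moved into the buffer afterwards.
    let score = compute_score(&candidate);

    scored.push(ScoredCandidate {
        score,
        item: candidate,
    });
}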

After that, sorting only compared already-computed integers.

That completely changed the cost profile.

The expensive part happened once per candidate, not many times per comparison.

Why Insertion Sort Worked Better

For the sorting step, I used insertion sort.

Yes, insertion sort.

The algorithm that is usually introduced early in computer science classes and then quickly ignored because it is O(N²).

But for small arrays, insertion sort can be extremely fast.

Why?

Because it is simple.

It has very little overhead. It works well with contiguous memory. It does not need extra allocation. It has predictable behavior. And when the array is small, the quadratic complexity does not have enough room to become a problem.

For example, sorting 20 or 30 items with insertion sort is not scary.

Especially when the data is already in a small stack buffer and comparisons are cheap integer comparisons.

In this case, insertion sort matched the workload better than a more advanced general-purpose sorting algorithm.

This is one of those cases where the "worse" algorithm on paper was better for the real machine.

The Result

After the change, the flamegraph looked completely different.

The repeated scoring calls disappeared from the hot path.

The allocation-related noise was gone.

The middle layers of the search became much faster.

The biggest improvement did not come from one magical trick. It came from aligning the implementation with the real shape of the workload:

small candidate lists
deep repeated search
short-lived temporary data
expensive scoring
no need for heap flexibility
stack-friendly memory layout
simple sorting over cached scores

The final version was not more complicated in spirit.

It was actually more honest about what the program was doing.

The program did not need a dynamic vector and a general-purpose sort at every search node.

It needed a tiny local ranking buffer.

The Lesson

The main lesson for me was this:

The fastest algorithm in theory is not always the fastest algorithm on real hardware.

Big-O matters, but it is not the whole story.

In performance-sensitive code, especially backend, networking, infrastructure, or systems code, the real questions are often more practical:

How large is the data really?
How often does this code run?
Is this allocation inside a hot path?
Is the temporary memory short-lived?
Is the key cheap or expensive?
Are we recomputing something that could be cached?
Is the memory layout friendly to the CPU cache?
Does a general-purpose standard method match this exact workload?

Standard library methods are great defaults. Most of the time, they are exactly what we should use.

But "default" does not mean "always optimal".

In my case, the better solution was:

avoid repeated heap allocations
compute scores once
store small temporary data on the stack
use insertion sort for tiny arrays

That was enough to beat the more advanced-looking approach.

Final Thought

This experience reminded me that performance work is not only about knowing algorithms.

It is about understanding the shape of the data, the lifetime of memory, and how the CPU actually executes the code.

Sometimes the bottleneck is not where you expect it to be.

Sometimes removing an allocation exposes another hidden cost.

And sometimes the "bad" O(N²) algorithm is exactly what your CPU wanted.

Building a kubectl-Style Go CLI: Factory, IOStreams, Prompt Policy, and Command Lifecycles

amir — Tue, 09 Jun 2026 17:43:40 +0000

Most Go CLIs start simple.

You add Cobra, create a few commands, put the logic inside RunE, call fmt.Println, read a couple of flags, and ship it.

For a small tool, that is perfectly fine.

But as the CLI grows, this style starts to collapse.

One command needs authentication.

Another command needs config resolution.

Another command needs interactive input.

Another one must support JSON output for automation.

Some commands are used by humans in terminals.

Some are used by CI pipelines.

Some should prompt.

Some must never prompt.

Eventually, every RunE becomes a mini framework.

That is the moment when a CLI stops being "just some commands" and becomes a real application runtime.

Recently, I refactored a Go CLI built with Cobra in exactly this direction. The goal was to move away from ad hoc command logic and toward a more mature foundation inspired by large Go CLIs such as kubectl.

This article walks through the architecture, the trade-offs, and the Go patterns behind that refactor.

This is not a beginner Cobra tutorial. I assume you already know how to create commands, flags, and RunE handlers. The focus here is on long-term maintainability, testability, terminal behavior, and command architecture.

The Original Problem

The CLI originally had a common shape:

cmd := &cobra.Command{
    Use: "login",
    RunE: func(cmd *cobra.Command, args []string) error {
        username, _ := cmd.Flags().GetString("username")

        if username == "" {
            return fmt.Errorf("username is required")
        }

        fmt.Println("Logging in...")

        session, err := createSession(cmd)
        if err != nil {
            return err
        }

        return session.Login(username)
    },
}

This looks harmless.

But the problems grow quickly:

fmt.Println("normal output")
fmt.Fprintln(os.Stderr, "warning")
fmt.Scanln(&value)
configPath, _ := cmd.Flags().GetString("config")
session := newSessionFromCommand(cmd)
client := newClientFromCommand(cmd)

Now each command knows too much.

It knows:

how to read flags
how to resolve config
how to create sessions
where output goes
how to prompt users
how to detect missing input
how to behave in CI
how to handle terminal vs non-terminal execution

This creates several long-term issues.

First, behavior becomes inconsistent. One command may prompt when input is missing. Another may fail immediately. Another may print prompts to stdout. Another may write errors using fmt.Println.

Second, testing becomes painful. If a command directly reads from os.Stdin and writes to os.Stdout, tests must either patch globals or execute the command as a subprocess.

Third, automation becomes risky. A command that unexpectedly prompts in CI can hang a pipeline. A prompt printed to stdout can corrupt JSON output. A missing required value may fail in one command but trigger interactive input in another.

The refactor was designed to solve these issues at the architectural level.

The Target Shape

The direction was to move the CLI toward this model:

Cobra command
    ↓
Options struct
    ↓
Complete(factory, cmd, args)
    ↓
Validate()
    ↓
Run()

And to introduce a shared runtime foundation:

Factory
├── IOStreams
├── ConfigPath
└── PromptPolicy

This gives every command the same execution model.

Cobra remains responsible for command registration, flag parsing, and command routing.

The application logic moves into explicit command options.

Runtime concerns like IO, prompting, config path, and interactivity policy are centralized.

That is the essence of a kubectl-style CLI architecture.

Not because it copies kubectl line-for-line, but because it follows the same conceptual direction:

central runtime factory
injected IO streams
command options
explicit lifecycle
separation between command wiring and behavior
testable execution paths
predictable terminal semantics

1. Factory: Centralizing CLI Runtime Dependencies

The first key abstraction is the Factory.

A simplified version looks like this:

type Factory struct {
    IOStreams    IOStreams
    ConfigPath   string
    PromptPolicy PromptPolicy
}

The Factory exists to answer one question:

What runtime context does a command need in order to execute?

Without a factory, every command eventually starts doing this:

configPath, _ := cmd.Flags().GetString("config")
client := newClient(configPath)
session := newSession(configPath)
interactive := detectTerminal(os.Stdin, os.Stderr)

That logic gets duplicated, slightly modified, and eventually becomes inconsistent.

With a Factory, commands receive a shared runtime object:

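func NewLoginCmd(f *Factory) *cobra.Command {
    opts := &LoginOptions{}

    cmd := &cobra.Command{
        Use: "login",
        RunE: func(cmd *cobra.Command, args []string) error {
            return runOptions(f, cmd, args, opts)
        },
    }

    // Register the flag that the options struct reads during Complete.
    cmd.Flags().String("username", "", "username to log in as")

    return cmd
}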

Now the command is no longer responsible for discovering the whole world.

It can ask the factory for what it needs.

This makes the command easier to test and easier to evolve.

The Risk: Factory Can Become a God Object

A Factory is useful, but it has a dangerous failure mode.

It can become a dumping ground.

Bad direction:

type Factory struct {
    IOStreams IOStreams
    Config    *Config
    Client    *Client
    Session   *Session
    Logger    *Logger
    Cache     *Cache
    Auth      *AuthService
    Search    *SearchService
    Submit    *SubmitService
}

At that point, Factory becomes a service locator.

That is usually a smell.

A better rule:

Factory should own CLI runtime concerns, not business logic.

It is reasonable for Factory to provide access to streams, config path, terminal policy, and construction helpers.

But domain services should still have explicit dependencies.

Prefer:

submitService := submit.NewService(client, config)

over:

submitService := submit.NewService(factory)

The Factory should make command construction easier. It should not hide every dependency behind one giant object.

2. IOStreams: Stop Talking Directly to the Process

One of the most important changes was adding an IOStreams abstraction.

type IOStreams struct {
    In     io.Reader
    Out    io.Writer
    ErrOut io.Writer

    IsTerminalIn     bool
    IsTerminalOut    bool
    IsTerminalErrOut bool
}

At first, this looks like a small change.

It is not.

It changes the CLI from being coupled to the current process into something testable and composable.

Why Direct `os.Stdout` Usage Becomes a Problem

This is easy:

fmt.Println("Login successful")

But it is also global state.

It writes to the real process stdout.

In tests, this is annoying. In commands that support machine-readable output, it is risky. In long-term CLI architecture, it creates inconsistent output behavior.

Instead:

fmt.Fprintln(streams.Out, "Login successful")

Now output can be redirected in tests:

out := &bytes.Buffer{}
errOut := &bytes.Buffer{}
in := strings.NewReader("amir\n")

streams := IOStreams{
    In:     in,
    Out:    out,
    ErrOut: errOut,
}

This makes command tests much simpler.

stdout vs stderr Is Not Cosmetic

A mature CLI treats stdout and stderr differently.

stdout is for the command result.

stderr is for prompts, warnings, diagnostics, and errors.

Why?

Because stdout is often piped.

Example:

mycli search users --output json > users.json

If a prompt is printed to stdout, the output file may become invalid:

Enter username:
{"users":[...]}

That is broken JSON.

The prompt must go to stderr:

fmt.Fprint(streams.ErrOut, "Enter username: ")

Then stdout stays clean:

{"users":[...]}

This is a critical CLI contract.

It matters even more when a CLI is used in scripts, GitHub Actions, Docker containers, or CI pipelines.
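
The contract above can be checked with in-memory streams. This is a minimal sketch, not the real command: searchUsers, captureSearch, and isJSON are hypothetical names, and the IOStreams struct is reduced to the three streams. It only shows that a prompt written to ErrOut leaves Out as a valid JSON document.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// IOStreams is a reduced version of the struct from this article.
type IOStreams struct {
	In     io.Reader
	Out    io.Writer
	ErrOut io.Writer
}

// searchUsers is a hypothetical command body: the prompt goes to ErrOut,
// and only the JSON result is written to Out.
func searchUsers(s IOStreams) error {
	fmt.Fprint(s.ErrOut, "Enter username: ")

	line, err := bufio.NewReader(s.In).ReadString('\n')
	if err != nil && line == "" {
		return fmt.Errorf("reading username: %w", err)
	}
	username := strings.TrimSpace(line)

	return json.NewEncoder(s.Out).Encode(map[string][]string{"users": {username}})
}

// captureSearch runs searchUsers against buffers and returns what landed
// on each output stream.
func captureSearch(input string) (stdout, stderr string, err error) {
	var out, errOut bytes.Buffer
	err = searchUsers(IOStreams{In: strings.NewReader(input), Out: &out, ErrOut: &errOut})
	return out.String(), errOut.String(), err
}

// isJSON reports whether s is one valid JSON document.
func isJSON(s string) bool {
	return json.Valid([]byte(s))
}

func main() {
	stdout, stderr, err := captureSearch("amir\n")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("stdout is valid JSON: %t\n", isJSON(stdout))
	fmt.Printf("stderr carried the prompt: %q\n", stderr)
}
```

If the prompt were written to s.Out instead, the same check would fail, which is the broken users.json case shown earlier.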

Terminal Detection Belongs in the Runtime

The terminal flags are also important:

IsTerminalIn
IsTerminalOut
IsTerminalErrOut

These allow the CLI to distinguish between:

mycli login

and:

echo "token" | mycli login

or:

mycli search test > output.json

In interactive mode, prompting may be acceptable.

In a pipeline, it may be dangerous.

This is where IOStreams and PromptPolicy work together.
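
One way to fill in those three flags is sketched here. It is an assumption on my part, not the code from the refactor: NewIOStreams and isCharDevice are hypothetical names, and the check is a standard-library heuristic (is the stream an *os.File backed by a character device?). That heuristic also reports true for /dev/null, so term.IsTerminal from golang.org/x/term is the more precise choice in a real CLI.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// IOStreams matches the struct shown earlier in this article.
type IOStreams struct {
	In     io.Reader
	Out    io.Writer
	ErrOut io.Writer

	IsTerminalIn     bool
	IsTerminalOut    bool
	IsTerminalErrOut bool
}

// isCharDevice reports whether stream is an *os.File backed by a character
// device. Buffers, pipes, and regular files all return false.
func isCharDevice(stream any) bool {
	f, ok := stream.(*os.File)
	if !ok {
		return false
	}
	info, err := f.Stat()
	if err != nil {
		return false
	}
	return info.Mode()&os.ModeCharDevice != 0
}

// NewIOStreams detects terminal state once, at startup, so commands never
// have to inspect the process streams themselves.
func NewIOStreams(in io.Reader, out, errOut io.Writer) IOStreams {
	return IOStreams{
		In:               in,
		Out:              out,
		ErrOut:           errOut,
		IsTerminalIn:     isCharDevice(in),
		IsTerminalOut:    isCharDevice(out),
		IsTerminalErrOut: isCharDevice(errOut),
	}
}

func main() {
	s := NewIOStreams(os.Stdin, os.Stdout, os.Stderr)
	fmt.Fprintf(s.ErrOut, "stdin=%t stdout=%t stderr=%t\n",
		s.IsTerminalIn, s.IsTerminalOut, s.IsTerminalErrOut)
}
```

Running the binary directly, then with stdin piped, then with stdout redirected to a file, should flip the corresponding flag each time.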

3. PromptPolicy: Make Interactivity Explicit

The CLI added a global flag:

--interactive=auto|always|never

This is one of the most important UX decisions in the whole refactor.

The policy means:

auto   → prompt only when stdin and stderr are attached to a terminal
always → require interactive prompting; fail clearly if terminal is unavailable
never  → never prompt; require all values explicitly

This prevents a very common CLI problem:

A command behaves nicely on a developer laptop but hangs forever in CI.

For example:

mycli login

In a real terminal, it is fine to ask:

Username:
Password:

But inside CI, that same command should not wait for input forever.

With --interactive=never, the behavior becomes deterministic:

mycli login --interactive=never

If required input is missing, the command fails immediately.

That is exactly what automation needs.

Why `auto` Should Require stdin and stderr

A good auto policy should usually require at least:

streams.IsTerminalIn && streams.IsTerminalErrOut

Why stderr?

Because prompts are written to stderr. If stderr is not a terminal, interactive prompting may not be visible to the user.

Depending on your CLI, you may also consider stdout. But for prompting specifically, stdin and stderr are usually the key streams.
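
The three policies reduce to one small decision function. This is a sketch under assumptions: ShouldPrompt, PromptAlways, and ErrNoTerminal are names I made up for illustration (PromptAuto and PromptNever appear in the tests later in this article), and the two booleans stand in for IsTerminalIn and IsTerminalErrOut.

```go
package main

import (
	"errors"
	"fmt"
)

// PromptPolicy is the value of the --interactive flag.
type PromptPolicy string

const (
	PromptAuto   PromptPolicy = "auto"
	PromptAlways PromptPolicy = "always"
	PromptNever  PromptPolicy = "never"
)

// ErrNoTerminal is returned when prompting is required but impossible.
var ErrNoTerminal = errors.New("interactive prompting requested but no terminal is attached")

// ShouldPrompt decides whether a missing value may be asked for.
// A terminal counts as available only when both stdin and stderr are
// terminals, because prompts are written to stderr and read from stdin.
func ShouldPrompt(policy PromptPolicy, stdinTTY, stderrTTY bool) (bool, error) {
	hasTerminal := stdinTTY && stderrTTY

	switch policy {
	case PromptNever:
		return false, nil
	case PromptAlways:
		if !hasTerminal {
			return false, ErrNoTerminal
		}
		return true, nil
	case PromptAuto:
		return hasTerminal, nil
	default:
		return false, fmt.Errorf("unknown prompt policy %q", policy)
	}
}

func main() {
	// Simulate `mycli login 2> log.txt`: stdin is a terminal, stderr is not.
	for _, p := range []PromptPolicy{PromptAuto, PromptAlways, PromptNever} {
		ok, err := ShouldPrompt(p, true, false)
		fmt.Printf("%-6s prompt=%t err=%v\n", p, ok, err)
	}
}
```

Keeping this in one function means every command gets the same answer for the same environment, which is the whole point of the policy.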

Should the Flag Be Named `--interactive`?

--interactive=auto|always|never is clear and explicit.

Alternative names could be:

--prompt=auto|always|never
--input-mode=interactive|non-interactive|auto
--non-interactive

A boolean --non-interactive is common, but less expressive.

The tri-state model is more powerful because it allows a user to say:

"auto-detect"
"force prompting"
"never prompt"

For serious CLIs, the tri-state model is often worth it.
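
A tri-state flag also needs validation at parse time. The sketch below is an assumption about how that could look, not the code from the refactor: it gives PromptPolicy the String, Set, and Type methods that pflag's Value interface expects, and demonstrates them with the standard library's flag package so the example has no third-party imports. With Cobra, the same value could be registered through the flag set's Var method. parseInteractive is a hypothetical helper.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// PromptPolicy is the value of the --interactive flag.
type PromptPolicy string

const (
	PromptAuto   PromptPolicy = "auto"
	PromptAlways PromptPolicy = "always"
	PromptNever  PromptPolicy = "never"
)

// String returns the current value, defaulting to auto when unset.
func (p *PromptPolicy) String() string {
	if p == nil || *p == "" {
		return string(PromptAuto)
	}
	return string(*p)
}

// Set rejects anything other than the three supported modes, so a typo
// fails during flag parsing instead of deep inside a command.
func (p *PromptPolicy) Set(v string) error {
	switch PromptPolicy(v) {
	case PromptAuto, PromptAlways, PromptNever:
		*p = PromptPolicy(v)
		return nil
	}
	return fmt.Errorf("must be one of auto, always, never (got %q)", v)
}

// Type is used by pflag in help output; the stdlib flag package ignores it.
func (p *PromptPolicy) Type() string { return "string" }

// parseInteractive parses only the --interactive flag from args.
func parseInteractive(args []string) (PromptPolicy, error) {
	policy := PromptAuto

	fs := flag.NewFlagSet("mycli", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the demo output
	fs.Var(&policy, "interactive", "prompting policy: auto|always|never")

	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return policy, nil
}

func main() {
	for _, args := range [][]string{{}, {"--interactive=never"}, {"--interactive=sometimes"}} {
		policy, err := parseInteractive(args)
		fmt.Printf("args=%v policy=%q err=%v\n", args, policy, err)
	}
}
```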

4. Prompt Layer: Centralize Human Input

Before the refactor, prompting could easily spread into command files:

fmt.Print("Enter name: ")
fmt.Scanln(&name)

That approach does not scale.

Prompt behavior should be reusable and policy-driven.

The prompt layer supports:

text prompts
secret prompts
select prompts

Conceptually:

type Prompter struct {
    Streams IOStreams
    Policy  PromptPolicy
}

Then command code can do:

name, err := prompter.Text(ctx, "Name")
if err != nil {
    return err
}

For secrets:

password, err := prompter.Secret(ctx, "Password")
if err != nil {
    return err
}

Secret input should use:

golang.org/x/term

For example:

func readSecret(fd int) ([]byte, error) {
    return term.ReadPassword(fd)
}

This is one place where Go CLI design gets tricky.

term.ReadPassword works with file descriptors, not just io.Reader.

That means a clean prompt abstraction may need to separate generic testable prompting from terminal-specific password reading.

A common pattern is to inject the password reader:

type SecretReader interface {
    ReadPassword(fd int) ([]byte, error)
}

or use a function field:

type SecretReadFunc func(fd int) ([]byte, error)

That makes secret prompt behavior testable without requiring a real terminal.
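
Here is a minimal sketch of the function-field variant. SecretPrompter and promptWithFake are hypothetical names for illustration. In production code the ReadSecret field would be set to term.ReadPassword and Fd to the stdin file descriptor; in the example a fake reader stands in, so no terminal is needed.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"io"
)

// SecretReadFunc has the same shape as term.ReadPassword.
type SecretReadFunc func(fd int) ([]byte, error)

// SecretPrompter is a hypothetical, reduced secret prompt.
type SecretPrompter struct {
	ErrOut     io.Writer
	Fd         int
	ReadSecret SecretReadFunc
}

// Secret writes the label to ErrOut and reads the value through the
// injected reader. The context is only checked before prompting; a
// blocking terminal read is not interrupted by cancellation here.
func (p SecretPrompter) Secret(ctx context.Context, label string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	if p.ReadSecret == nil {
		return "", errors.New("no secret reader configured")
	}

	fmt.Fprintf(p.ErrOut, "%s: ", label)
	raw, err := p.ReadSecret(p.Fd)
	// Echo is off while the secret is typed, so the Enter key printed
	// nothing. Emit the newline ourselves.
	fmt.Fprintln(p.ErrOut)

	if err != nil {
		return "", fmt.Errorf("reading secret: %w", err)
	}
	if len(raw) == 0 {
		return "", errors.New("secret must not be empty")
	}
	return string(raw), nil
}

// promptWithFake runs the prompt with a fake reader that "types" typed.
func promptWithFake(typed string) (secret, stderr string, err error) {
	var errOut bytes.Buffer
	p := SecretPrompter{
		ErrOut:     &errOut,
		ReadSecret: func(int) ([]byte, error) { return []byte(typed), nil },
	}
	secret, err = p.Secret(context.Background(), "Password")
	return secret, errOut.String(), err
}

func main() {
	secret, stderr, err := promptWithFake("hunter2")
	fmt.Printf("secret=%q stderr=%q err=%v\n", secret, stderr, err)
}
```

The fake never echoes anything, so the test can also assert that the secret itself does not appear in the captured stderr.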

Prompting Should Not Leak Into Commands

The command should not decide terminal behavior manually.

Bad:

if term.IsTerminal(int(os.Stdin.Fd())) {
    fmt.Print("Username: ")
    fmt.Scanln(&username)
}

Better:

username, err = prompter.Text(ctx, "Username")

The prompt layer handles:

policy
terminal checks
stderr output
input reading
empty input behavior
cancellation behavior
consistent errors

That keeps command code focused on command semantics.
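
To make that list concrete, here is a reduced text prompt that handles policy, terminal checks, stderr output, input reading, and empty input in one place. It is a sketch under assumptions, not the real prompt package: the error values and the ask helper are hypothetical, and cancellation is only checked before the prompt is printed.

```go
package main

import (
	"bufio"
	"bytes"
	"context"
	"errors"
	"fmt"
	"io"
	"strings"
)

type PromptPolicy string

const (
	PromptAuto   PromptPolicy = "auto"
	PromptAlways PromptPolicy = "always"
	PromptNever  PromptPolicy = "never"
)

type IOStreams struct {
	In               io.Reader
	Out              io.Writer
	ErrOut           io.Writer
	IsTerminalIn     bool
	IsTerminalErrOut bool
}

var (
	ErrPromptDisabled = errors.New("prompting is disabled; pass the value explicitly")
	ErrNoTerminal     = errors.New("--interactive=always requires a terminal")
)

// Prompter keeps one buffered reader so input read ahead by one prompt is
// still available to the next one.
type Prompter struct {
	streams IOStreams
	policy  PromptPolicy
	in      *bufio.Reader
}

func NewPrompter(s IOStreams, p PromptPolicy) *Prompter {
	return &Prompter{streams: s, policy: p, in: bufio.NewReader(s.In)}
}

// allowed applies the policy before anything is written to the terminal.
func (p *Prompter) allowed() error {
	hasTerminal := p.streams.IsTerminalIn && p.streams.IsTerminalErrOut
	switch p.policy {
	case PromptNever:
		return ErrPromptDisabled
	case PromptAlways:
		if !hasTerminal {
			return ErrNoTerminal
		}
		return nil
	default: // PromptAuto
		if !hasTerminal {
			return ErrPromptDisabled
		}
		return nil
	}
}

// Text writes the label to ErrOut and reads one trimmed line from In.
func (p *Prompter) Text(ctx context.Context, label string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	if err := p.allowed(); err != nil {
		return "", err
	}

	fmt.Fprintf(p.streams.ErrOut, "%s: ", label)
	line, err := p.in.ReadString('\n')
	if err != nil && line == "" {
		return "", fmt.Errorf("reading %s: %w", label, err)
	}
	value := strings.TrimSpace(line)
	if value == "" {
		return "", fmt.Errorf("%s must not be empty", label)
	}
	return value, nil
}

// ask runs one Username prompt against in-memory streams.
func ask(policy PromptPolicy, tty bool, input string) (value, stdout, stderr string, err error) {
	var out, errOut bytes.Buffer
	s := IOStreams{In: strings.NewReader(input), Out: &out, ErrOut: &errOut,
		IsTerminalIn: tty, IsTerminalErrOut: tty}
	value, err = NewPrompter(s, policy).Text(context.Background(), "Username")
	return value, out.String(), errOut.String(), err
}

func main() {
	value, stdout, stderr, err := ask(PromptAuto, true, "amir\n")
	fmt.Printf("value=%q stdout=%q stderr=%q err=%v\n", value, stdout, stderr, err)
}
```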

5. Options Pattern: Complete, Validate, Run

The next major change was introducing command options.

The pattern looks like this:

type LoginOptions struct {
    Username string
    Password string

    Streams IOStreams
    Client  *client.Client
}

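func (o *LoginOptions) Complete(f *Factory, cmd *cobra.Command, args []string) error {
    username, err := cmd.Flags().GetString("username")
    if err != nil {
        return err
    }

    o.Username = username
    o.Streams = f.IOStreams

    // Prompt only for values that were not supplied as flags.
    if o.Username == "" {
        value, err := f.Prompter().Text(cmd.Context(), "Username")
        if err != nil {
            return err
        }
        o.Username = value
    }

    // Run passes o.Password to the client, so it has to be collected here too.
    // This assumes Secret returns a string, like the prompt calls shown earlier.
    if o.Password == "" {
        value, err := f.Prompter().Secret(cmd.Context(), "Password")
        if err != nil {
            return err
        }
        o.Password = value
    }

    o.Client = f.NewClient()

    return nil
}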

func (o *LoginOptions) Validate() error {
    if o.Username == "" {
        return fmt.Errorf("username is required")
    }
    return nil
}

func (o *LoginOptions) Run(ctx context.Context) error {
    if err := o.Client.Login(ctx, o.Username, o.Password); err != nil {
        return err
    }

    fmt.Fprintln(o.Streams.Out, "Login successful")
    return nil
}

And a shared runner:

type Options interface {
    Complete(f *Factory, cmd *cobra.Command, args []string) error
    Validate() error
    Run(ctx context.Context) error
}

func runOptions(f *Factory, cmd *cobra.Command, args []string, opts Options) error {
    ctx := cmd.Context()

    if err := opts.Complete(f, cmd, args); err != nil {
        return err
    }

    if err := opts.Validate(); err != nil {
        return err
    }

    return opts.Run(ctx)
}

This lifecycle gives commands clear boundaries.

Complete

Complete collects input and dependencies.

It can read:

args
flags
config path
prompted values
clients
sessions
runtime dependencies

It should not execute major side effects.

Validate

Validate checks semantic correctness.

It should answer:

Do we have enough valid information to run?

Examples:

if o.Username == "" {
    return errors.New("username is required")
}

if !validProvider(o.Provider) {
    return fmt.Errorf("unsupported provider %q", o.Provider)
}

Run

Run performs side effects.

Examples:

API requests
writing config
creating sessions
submitting data
rendering final output

This split makes tests more focused.

You can test Validate without a fake API.

You can test Complete with fake streams.

You can test Run with a fake client.

This is significantly better than testing a huge RunE closure.
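
A sketch of what those focused tests exercise, without Cobra in the picture. Two things here are my assumptions rather than the article's code: the options struct depends on a narrow loginClient interface instead of *client.Client so that a fake can be injected, and runLogin is a hypothetical helper that plays the role of the shared runner after Complete.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"io"
)

// loginClient is a hypothetical narrow interface covering only what Run needs.
type loginClient interface {
	Login(ctx context.Context, username, password string) error
}

// fakeClient records calls instead of talking to an API.
type fakeClient struct {
	calls int
	err   error
}

func (f *fakeClient) Login(_ context.Context, _, _ string) error {
	f.calls++
	return f.err
}

type LoginOptions struct {
	Username string
	Password string
	Out      io.Writer
	Client   loginClient
}

// Validate needs no client and no streams, so it can be tested alone.
func (o *LoginOptions) Validate() error {
	if o.Username == "" {
		return errors.New("username is required")
	}
	return nil
}

// Run performs the side effect and renders the result.
func (o *LoginOptions) Run(ctx context.Context) error {
	if err := o.Client.Login(ctx, o.Username, o.Password); err != nil {
		return err
	}
	fmt.Fprintln(o.Out, "Login successful")
	return nil
}

// runLogin runs Validate and then Run with a fake client, and reports the
// output plus how often the client was called.
func runLogin(username string) (stdout string, clientCalls int, err error) {
	var out bytes.Buffer
	fake := &fakeClient{}
	opts := &LoginOptions{Username: username, Out: &out, Client: fake}

	if err := opts.Validate(); err != nil {
		return "", fake.calls, err
	}
	err = opts.Run(context.Background())
	return out.String(), fake.calls, err
}

func main() {
	stdout, calls, err := runLogin("amir")
	fmt.Printf("stdout=%q calls=%d err=%v\n", stdout, calls, err)

	_, calls, err = runLogin("")
	fmt.Printf("calls=%d err=%v\n", calls, err)
}
```

The useful property is the second case: when validation fails, the client is never touched.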

6. Cobra Should Wire Commands, Not Own the Application

Cobra is excellent for:

command tree
flags
args
help text
shell completion
command dispatch

But Cobra should not become your application architecture.

A common problem is passing *cobra.Command deep into the application:

func NewSession(cmd *cobra.Command) (*Session, error) {
    configPath, _ := cmd.Flags().GetString("config")
    // ...
}

This couples your session layer to Cobra.

That is a design smell.

A session package should not know that Cobra exists.

Better:

func NewSession(configPath string) (*Session, error) {
    // ...
}

or:

func NewSession(cfg Config) (*Session, error) {
    // ...
}

The command layer can use Cobra to resolve the flag, but lower layers should receive plain Go values.

Good dependency direction:

cmd → runtime/config/client/session

Bad dependency direction:

session → cobra
client → cobra
config → cobra

The deeper your business code knows about Cobra, the harder it becomes to test and reuse.

7. Stream-Aware Output and Table Rendering

Another important refactor was changing output rendering to accept a writer.

Bad:

func RenderTable(rows []Row) {
    fmt.Println("NAME\tSTATUS")
    for _, row := range rows {
        fmt.Printf("%s\t%s\n", row.Name, row.Status)
    }
}

Better:

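// Lives in package table, so callers write table.Render(w, rows).
func Render(w io.Writer, rows []Row) error {
    // tabwriter aligns the tab-separated cells into columns.
    tw := tabwriter.NewWriter(w, 0, 0, 2, ' ', 0)

    fmt.Fprintln(tw, "NAME\tSTATUS")
    for _, row := range rows {
        fmt.Fprintf(tw, "%s\t%s\n", row.Name, row.Status)
    }

    // Flush writes the aligned output and reports any write error.
    return tw.Flush()
}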

Then commands can call:

return table.Render(o.Streams.Out, rows)

Now rendering is testable:

var out bytes.Buffer

err := table.Render(&out, rows)
require.NoError(t, err)

assert.Contains(t, out.String(), "NAME")

This also prepares the CLI for future output modes:

mycli search query --output table
mycli search query --output json
mycli search query --output yaml

A mature CLI usually needs a clear output strategy.

For example:

type OutputFormat string

const (
    OutputTable OutputFormat = "table"
    OutputJSON  OutputFormat = "json"
    OutputYAML  OutputFormat = "yaml"
)

Then a renderer can own output behavior:

type Renderer interface {
    Render(w io.Writer, value any) error
}

Do not mix API logic with table formatting.

That separation becomes very valuable when command output grows.
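
One possible shape for that separation is sketched below. It is an assumption, not the renderer from the refactor: the interface takes []Row instead of any to keep the example type-safe, rendererFor and renderString are hypothetical helpers, and YAML is left out because it needs a third-party library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"text/tabwriter"
)

type OutputFormat string

const (
	OutputTable OutputFormat = "table"
	OutputJSON  OutputFormat = "json"
)

// Row is a hypothetical result record.
type Row struct {
	Name   string `json:"name"`
	Status string `json:"status"`
}

// Renderer owns output behavior. Commands only choose a format.
type Renderer interface {
	Render(w io.Writer, rows []Row) error
}

type tableRenderer struct{}

func (tableRenderer) Render(w io.Writer, rows []Row) error {
	tw := tabwriter.NewWriter(w, 0, 0, 2, ' ', 0)
	fmt.Fprintln(tw, "NAME\tSTATUS")
	for _, row := range rows {
		fmt.Fprintf(tw, "%s\t%s\n", row.Name, row.Status)
	}
	return tw.Flush()
}

type jsonRenderer struct{}

func (jsonRenderer) Render(w io.Writer, rows []Row) error {
	return json.NewEncoder(w).Encode(rows)
}

// rendererFor maps the --output value to a renderer and rejects unknown
// formats before any API call is made.
func rendererFor(format OutputFormat) (Renderer, error) {
	switch format {
	case OutputTable:
		return tableRenderer{}, nil
	case OutputJSON:
		return jsonRenderer{}, nil
	default:
		return nil, fmt.Errorf("unsupported output format %q", format)
	}
}

// renderString renders rows into a string, which is convenient in tests.
func renderString(format OutputFormat, rows []Row) (string, error) {
	r, err := rendererFor(format)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := r.Render(&buf, rows); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	rows := []Row{{Name: "api", Status: "ready"}, {Name: "worker", Status: "pending"}}
	for _, format := range []OutputFormat{OutputTable, OutputJSON} {
		r, err := rendererFor(format)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		if err := r.Render(os.Stdout, rows); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}
```

The API layer returns []Row and never sees a writer, so adding another format later should only touch rendererFor and one new type.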

8. Testing Strategy for This Architecture

This architecture enables better tests.

But it also creates new test responsibilities.

Prompt Tests

Prompt tests should cover:

text prompt writes prompt text to stderr
text prompt reads from input
empty value behavior
select prompt valid selection
select prompt invalid selection
secret prompt does not echo input
prompt disabled by policy
prompt fails without a terminal when the policy is always
prompt is skipped or fails when the policy is never

Example:

func TestPromptTextWritesToErrOut(t *testing.T) {
    in := strings.NewReader("amir\n")
    out := &bytes.Buffer{}
    errOut := &bytes.Buffer{}

    streams := IOStreams{
        In:               in,
        Out:              out,
        ErrOut:           errOut,
        IsTerminalIn:     true,
        IsTerminalErrOut: true,
    }

    p := NewPrompter(streams, PromptAuto)

    value, err := p.Text(context.Background(), "Username")
    require.NoError(t, err)

    require.Equal(t, "amir", value)
    require.Empty(t, out.String())
    require.Contains(t, errOut.String(), "Username")
}

The important assertion is not just that the prompt works.

The important assertion is:

prompt text does not go to stdout.

Non-Interactive Tests

Every command that supports prompting should have tests for non-interactive mode.

Example:

func TestLoginMissingUsernameNonInteractiveFails(t *testing.T) {
    streams := IOStreams{
        In:               strings.NewReader(""),
        Out:              &bytes.Buffer{},
        ErrOut:           &bytes.Buffer{},
        IsTerminalIn:     false,
        IsTerminalErrOut: false,
    }

    f := NewFactory(streams, PromptNever)

    cmd := NewLoginCmd(f)
    cmd.SetArgs([]string{})

    err := cmd.Execute()
    require.Error(t, err)
    require.Contains(t, err.Error(), "username is required")
}

This protects CI behavior.

Golden Tests

For table output, golden tests are very useful.

Example:

testdata/search_table.golden

Then:

got := out.String()
want := readGolden(t, "testdata/search_table.golden")

require.Equal(t, want, got)

Golden tests help prevent accidental output changes.

That matters because CLI output is a user interface.

Integration Tests

You should also have command-level integration tests that execute Cobra commands with fake streams.

Test cases:

login with flags only
login with prompt
login missing input with --interactive=never
search output redirected
plates add with prompted missing values
SSO provider selection
config path override
invalid config path
stdout/stderr separation

For a CLI, these tests are often more valuable than many small unit tests.

9. Recommended Package Structure

At the beginning, keeping everything under cli/cmd is acceptable.

But as the CLI grows, cmd can become a junk drawer.

A more scalable structure:

cli/
  cmd/
    root.go
    login.go
    logout.go
    search.go
    submit.go
    plates.go
    my.go

internal/
  cli/
    runtime/
      factory.go
      context.go

    streams/
      streams.go

    prompt/
      prompt.go
      policy.go
      secret.go

    table/
      table.go
      renderer.go

    options/
      runner.go

  config/
    loader.go
    model.go
    writer.go

  session/
    session.go
    store.go

  client/
    client.go
    auth.go

  auth/
    login.go
    sso.go

  plates/
    service.go

  search/
    service.go

The goal:

cmd = Cobra wiring
internal/cli = CLI runtime and UX infrastructure
internal/config = config ownership
internal/client = API transport
internal/session = session persistence
domain packages = business behavior

This does not need to happen all at once.

A good extraction order:

Move IOStreams into internal/cli/streams
Move PromptPolicy and prompt code into internal/cli/prompt
Move table rendering into internal/cli/table
Move config/session/client out of cmd
Move business-heavy command logic into domain packages

Do not over-engineer too early.

But do prevent cmd from becoming the only package in the application.

10. What Will Hurt First as the CLI Grows?

The first pain point will probably be Factory growth.

If every new feature adds another field to Factory, the abstraction will become too broad.

The second pain point will be command option duplication.

If every options struct manually repeats the same config/client/session setup, you will need shared helpers.

For example:

type ClientOptions struct {
    ConfigPath string
    Client     *client.Client
}

func (o *ClientOptions) CompleteClient(f *Factory) error {
    o.ConfigPath = f.ConfigPath
    o.Client = f.NewClient()
    return nil
}

The third pain point will be output consistency.

As soon as users depend on CLI output, changes become breaking changes.

You may eventually need:

stable table columns
JSON schema guarantees
--output
--quiet
--verbose
structured errors
exit code conventions

The fourth pain point will be prompt semantics.

Some commands may need required prompts. Some may need optional prompts. Some should never prompt even in auto mode.

You may eventually need command-level prompt declarations:

type PromptRequirement int

const (
    PromptOptional PromptRequirement = iota
    PromptRequired
    PromptForbidden
)

But I would not add this until real commands need it.

11. What Not to Over-Engineer Yet

A good architecture is not the same as adding layers everywhere.

I would avoid these too early:

Do not build a full dependency injection framework

Go does not need a DI container here.

Constructor injection and small factories are enough.

Do not create generic abstractions before repetition exists

If only two commands need something, duplication may be acceptable.

Wait until the pattern is obvious.

Do not move every command into a separate package immediately

That can make navigation harder.

Start by extracting infrastructure:

streams
prompt
table
config
client
session

Then extract domain logic when commands become large.

Do not make Factory own all services

Factory should help create runtime dependencies.

It should not become the application.

Do not hide Cobra too aggressively

Cobra is fine in the command layer.

The important part is preventing Cobra from leaking into lower layers.

12. Senior-Level Review of the Architecture

Overall, this is a strong direction.

The combination of:

Factory
IOStreams
PromptPolicy
prompt layer
Complete/Validate/Run
writer-based rendering

is a solid foundation for a larger Go CLI.

The architecture improves:

testability
automation safety
stdout/stderr correctness
command consistency
future extensibility
separation of concerns

It also aligns conceptually with mature CLIs like kubectl.

But the design must be kept disciplined.

The most important rules going forward:

1. Keep Cobra in cmd.
2. Keep stdout clean.
3. Keep prompts on stderr.
4. Keep Factory focused.
5. Keep business logic out of command wiring.
6. Keep command lifecycle consistent.
7. Keep lower packages free from Cobra.
8. Keep tests stream-aware and policy-aware.

If those rules hold, the CLI can grow to dozens of commands without becoming a maintenance problem.

Final Thoughts

A CLI is not just a thin wrapper around functions.

A serious CLI is an interface contract.

It is used by humans, scripts, CI systems, terminals, pipes, and other tools.

That means small decisions matter:

stdout or stderr?
prompt or fail?
terminal or pipe?
config from where?
error with what exit behavior?
table or JSON?
command-specific logic or shared runtime?

The refactor described here is not just cleanup.

It creates a foundation.

The CLI moves from this:

Each command does everything itself.

to this:

Commands share a runtime, follow a lifecycle, and behave predictably.

That is the difference between a small Cobra app and a maintainable Go CLI platform.

Building Hybrid Search with PostgreSQL, pgvector, and Citus

amir — Sun, 07 Jun 2026 16:22:15 +0000

Search looks simple from the outside.

A user types something like:

short-range copper module

and expects the system to return the right product, maybe even an exact SKU like:

QDD-2Q200-CU3M

But when you work with a real product catalog, especially something technical like networking hardware, search becomes much harder.

You are not only searching titles. You are searching SKUs, brands, specs, categories, descriptions, compatibility notes, datasheets, and sometimes even weird abbreviations that only people in that industry understand.

For simple websites, keyword search is usually enough. But for a serious catalog with hundreds of thousands of products, keyword search alone starts to break down.

That is where hybrid search becomes interesting.

In this article, I want to explain how I would design a high-performance hybrid search system using:

PostgreSQL
pgvector
full-text search
HNSW indexes
Reciprocal Rank Fusion, or RRF
Citus for scaling later

The goal is not to make the architecture more complicated than necessary. The goal is to keep the system practical, fast, and maintainable.

Why keyword search is not enough

Traditional search usually works by matching words.

For example, if the user searches:

Cisco Catalyst 9300 48-port PoE

keyword search can do a good job because the query contains exact terms like Cisco, Catalyst, 9300, and PoE.

But what happens when the user searches like this?

short range copper module

Maybe the product description does not use those exact words. Maybe it says direct attach cable, DAC, or only contains the SKU and technical specs.

A pure keyword engine may miss good results because the words do not match exactly.

This is the main reason vector search is useful. Vector search is not only looking at exact words. It tries to understand the meaning behind the query.

What pgvector gives us

pgvector allows PostgreSQL to store and search embeddings directly inside the database.

An embedding is basically a list of numbers that represents the meaning of a text.

For example, this text:

Cisco 10G short range SFP transceiver

can be converted by an embedding model into a vector like this:

[0.021, -0.182, 0.441, ...]

Of course, real vectors usually have hundreds or thousands of dimensions.

The nice part is that PostgreSQL can now compare these vectors and find the closest products to the user's query.

So instead of only asking:

Which rows contain these exact words?

we can also ask:

Which rows are semantically close to this query?

This is very powerful for technical catalogs.

Why I still prefer hybrid search

Vector search is powerful, but it is not perfect.

One of its weaknesses is exact matching.

For example, if someone searches:

Catalyst 9300 48-port PoE

a vector search might return products that are semantically close, such as:

Catalyst 9300 24-port models
Catalyst 9400 models
non-PoE switches
similar Cisco switches

From a semantic point of view, these products are close.

But from a buyer's point of view, they may be wrong.

If the user typed 9300, 48-port, and PoE, those details matter.

This is why I would not build this system with vector search alone.

A better approach is hybrid search:

Use PostgreSQL full-text search for exact and lexical matching.
Use pgvector for semantic matching.
Merge both result lists into one final ranking.

This gives us both precision and recall.

Choosing the distance metric

When we compare vectors, we need a distance metric.

pgvector supports multiple options, but the most common ones are:

Euclidean distance
Cosine distance
Inner product

For text embeddings, cosine similarity is usually a strong default because it focuses more on the direction of the vectors instead of their size.

That matters because a long product description and a short user query can have very different vector magnitudes.

In pgvector, cosine distance uses the <=> operator:

ORDER BY embedding <=> $1

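If the embedding model already outputs normalized vectors, inner product can also be a good option because it produces the same ordering and is usually cheaper to compute. In pgvector that is the <#> operator, which returns the negative inner product so that ascending order still means "closest first".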

But for a simple and safe implementation, I would start with cosine distance and benchmark from there.
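
To make the relationship between the two metrics concrete, here is a small sketch in plain Go (no database involved). It assumes equal-length, non-zero vectors, and the function names are mine. It shows that cosine distance ignores magnitude, and that for unit-length vectors it equals one minus the inner product.

```go
package main

import (
	"fmt"
	"math"
)

// dot returns the inner product of two equal-length vectors.
func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// cosineDistance returns 1 - cos(angle between a and b).
// Assumes neither vector is all zeros.
func cosineDistance(a, b []float64) float64 {
	return 1 - dot(a, b)/(math.Sqrt(dot(a, a))*math.Sqrt(dot(b, b)))
}

// normalize returns a copy of v scaled to unit length.
func normalize(v []float64) []float64 {
	n := math.Sqrt(dot(v, v))
	out := make([]float64, len(v))
	for i, x := range v {
		out[i] = x / n
	}
	return out
}

func main() {
	query := []float64{1, 2, 2}
	product := []float64{2, 3, 6}

	fmt.Printf("cosine distance:          %.6f\n", cosineDistance(query, product))
	fmt.Printf("1 - dot(normalized):      %.6f\n", 1-dot(normalize(query), normalize(product)))

	// Scaling a vector changes its magnitude but not the cosine distance.
	scaled := []float64{20, 30, 60}
	fmt.Printf("cosine distance (scaled): %.6f\n", cosineDistance(query, scaled))
}
```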

Picking an embedding model

The embedding model is one of the most important decisions in the whole system.

A generic model may understand normal English very well, but it may not understand the difference between technical networking terms like:

SMF and MMF
SFP and QSFP
DAC and AOC
10G, 25G, 40G, and 100G
similar SKUs that differ by only a few characters

For this kind of catalog, I would choose a model that performs well on technical and retrieval tasks.

Two interesting options are:

Qwen embedding models
EmbeddingGemma-style compact embedding models

A large model can give better semantic quality, but it also needs more infrastructure. For many real systems, a smaller model with good retrieval quality is easier to operate.

For a catalog around 40,000 products, I would probably start with a 512-dimensional or 768-dimensional embedding model and measure quality before jumping to a very large model.

The practical question is not only:

Which model has the best benchmark score?

The real question is:

Which model gives good search quality with acceptable latency, storage, and operational cost?

Preparing product data before embedding

One mistake I often see is embedding only the raw description.

For e-commerce search, that is usually not enough.

A product has more useful context:

SKU
brand
category
product family
technical specs
compatibility
short description
long description

Before creating the embedding, I would build a clean text payload like this:

Category: Optical Transceivers
Brand: Cisco
SKU: QDD-2Q200-CU3M
Product Family: QSFP-DD
Features: short range copper module, 200G, direct attach cable
Description: ...

This helps the embedding model understand the product better.

It also makes similar products cluster closer together in vector space.
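
As a sketch of how such a payload could be assembled before calling the embedding model, assuming a simple product struct (the struct, its field names, and the function name are illustrative, not a schema from this article):

```go
package main

import (
	"fmt"
	"strings"
)

// Product holds the fields that feed the embedding text.
// The shape of this struct is an assumption made for the example.
type Product struct {
	Category, Brand, SKU, Family string
	Features                     []string
	Description                  string
}

// EmbeddingText renders one labelled line per field and skips empty
// fields, so the model never sees a label without a value.
func EmbeddingText(p Product) string {
	var b strings.Builder
	add := func(label, value string) {
		value = strings.TrimSpace(value)
		if value != "" {
			fmt.Fprintf(&b, "%s: %s\n", label, value)
		}
	}
	add("Category", p.Category)
	add("Brand", p.Brand)
	add("SKU", p.SKU)
	add("Product Family", p.Family)
	add("Features", strings.Join(p.Features, ", "))
	add("Description", p.Description)
	return strings.TrimRight(b.String(), "\n")
}

func main() {
	p := Product{
		Category: "Optical Transceivers",
		Brand:    "Cisco",
		SKU:      "QDD-2Q200-CU3M",
		Family:   "QSFP-DD",
		Features: []string{"short range copper module", "200G", "direct attach cable"},
	}
	fmt.Println(EmbeddingText(p))
}
```

Keeping the field order and labels fixed matters more than the exact wording: every product should be rendered through the same template so that differences between vectors come from the data, not from formatting.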

For long descriptions or datasheets, I would not blindly split every 500 characters. That can break important context.

Instead, I would use chunks with overlap.

For example:

chunk size: 500 to 800 tokens
overlap: 100 to 150 tokens

The overlap helps preserve context between chunks.
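
A minimal sketch of overlapping chunks is shown below. It works on a slice of tokens; in the demo, whitespace-separated words stand in for model tokens, which is an assumption. A real pipeline would use the tokenizer of the chosen embedding model.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkWithOverlap splits tokens into windows of at most size tokens,
// where each window starts (size - overlap) tokens after the previous one.
// It returns nil for invalid parameters or empty input.
func chunkWithOverlap(tokens []string, size, overlap int) [][]string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil
	}
	step := size - overlap
	var chunks [][]string
	for start := 0; start < len(tokens); start += step {
		end := start + size
		if end > len(tokens) {
			end = len(tokens)
		}
		chunks = append(chunks, tokens[start:end])
		if end == len(tokens) {
			break // the last window already reaches the end of the text
		}
	}
	return chunks
}

func main() {
	text := "the module supports 200G breakout over passive copper with a three meter reach"
	tokens := strings.Fields(text) // words as a stand-in for real tokens

	// Tiny numbers for readability; the article suggests 500 to 800 tokens
	// per chunk with 100 to 150 tokens of overlap.
	for i, c := range chunkWithOverlap(tokens, 6, 2) {
		fmt.Printf("chunk %d: %s\n", i, strings.Join(c, " "))
	}
}
```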

A simple PostgreSQL schema

Here is a simplified schema for this kind of system:

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE network_equipment (
    product_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    sku VARCHAR(100) UNIQUE NOT NULL,
    manufacturer VARCHAR(100) NOT NULL,
    category_id INT NOT NULL,
    price DECIMAL(10, 2),
    stock_quantity INT DEFAULT 0,
    description TEXT,

    search_vector tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(sku, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(manufacturer, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(description, '')), 'C')
    ) STORED,

    embedding vector(768)
);

CREATE INDEX idx_network_category
ON network_equipment (category_id);

CREATE INDEX idx_network_search
ON network_equipment
USING GIN (search_vector);

The search_vector column is for PostgreSQL full-text search.

The embedding column is for semantic search with pgvector.

I also like this design because the product data and the vector live in the same database. That means fewer synchronization problems compared to pushing everything into a separate vector database.

Of course, external vector databases can be useful. But for many backend systems, keeping this inside PostgreSQL is simpler and more than enough.

IVFFlat vs HNSW

pgvector supports approximate nearest neighbor indexes.

The two important options are:

IVFFlat
HNSW

IVFFlat

IVFFlat works by dividing the vector space into lists or clusters.

It is usually faster to build and smaller in memory, but it has one important problem: it depends heavily on the data distribution at the time the index is created.

If your product catalog changes a lot, recall can degrade over time.

So IVFFlat can be useful, but I would be careful with it in a dynamic e-commerce system.

HNSW

HNSW builds a graph of vectors.

At query time, it navigates through this graph to find close neighbors quickly.

Compared with IVFFlat, it usually has:

better recall
faster queries
larger index size
slower index build time

For a 40,000-row catalog, I would usually choose HNSW unless there is a strong memory limitation.

A typical index could look like this:

CREATE INDEX CONCURRENTLY idx_network_embedding_hnsw
ON network_equipment
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

At query time, we can tune recall and latency using:

SET hnsw.ef_search = 64;

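Higher ef_search usually means better recall but slower queries. It also caps how many rows a single HNSW index scan can return, so it should be at least as large as the LIMIT of the vector query.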

This is something I would always benchmark with real data.

Hybrid search with RRF

Now comes the important part: merging keyword search and vector search.

The problem is that full-text search and vector search produce different scores.

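PostgreSQL's built-in ranking functions are ts_rank and ts_rank_cd rather than BM25, but the problem is the same for any lexical score: it is not directly comparable with cosine distance.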

So instead of trying to add raw scores together, we can use Reciprocal Rank Fusion, or RRF.

RRF uses the rank position of each result, not the raw score.

The idea is simple:

If a product ranks high in vector search, it gets points.
If it ranks high in keyword search, it gets points.
If it ranks high in both, it gets more points.

Here is a simplified query:

WITH vector_search AS (
    SELECT
        product_id,
        ROW_NUMBER() OVER (
            ORDER BY embedding <=> $1::vector
        ) AS rank_vector
    FROM network_equipment
    WHERE category_id = $2
    ORDER BY embedding <=> $1::vector
    LIMIT 100
),
keyword_search AS (
    SELECT
        product_id,
        ROW_NUMBER() OVER (
            ORDER BY ts_rank_cd(
                search_vector,
                websearch_to_tsquery('english', $3)
            ) DESC
        ) AS rank_keyword
    FROM network_equipment
    WHERE search_vector @@ websearch_to_tsquery('english', $3)
      AND category_id = $2
    ORDER BY ts_rank_cd(
        search_vector,
        websearch_to_tsquery('english', $3)
    ) DESC
    LIMIT 100
)
SELECT
    e.product_id,
    e.sku,
    e.manufacturer,
    e.description,
    e.price,
    COALESCE(1.0 / (60 + v.rank_vector), 0.0) +
    COALESCE(1.0 / (60 + k.rank_keyword), 0.0) AS rrf_score
FROM vector_search v
FULL OUTER JOIN keyword_search k
    ON v.product_id = k.product_id
JOIN network_equipment e
    ON e.product_id = COALESCE(v.product_id, k.product_id)
ORDER BY rrf_score DESC
LIMIT 15;

This query does two searches:

Finds the top semantic matches using pgvector.
Finds the top keyword matches using PostgreSQL full-text search.

Then it combines them using RRF.

The result is usually much better than using only one search method.
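
The same fusion can also be done in application code, for example when the two result lists come from separate queries. Here is a minimal sketch in Go. The function and type names are mine, k = 60 matches the constant in the SQL above, and each input list is assumed to contain unique IDs ordered best first.

```go
package main

import (
	"fmt"
	"sort"
)

// Scored is one fused result.
type Scored struct {
	ID    string
	Score float64
}

// FuseRRF merges ranked ID lists using Reciprocal Rank Fusion.
// Each list contributes 1/(k+rank) with 1-based ranks. An ID that is
// missing from a list contributes nothing for that list, which mirrors
// the COALESCE(..., 0.0) in the SQL version.
func FuseRRF(k float64, lists ...[]string) []Scored {
	scores := make(map[string]float64)
	for _, list := range lists {
		for i, id := range list {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}
	out := make([]Scored, 0, len(scores))
	for id, s := range scores {
		out = append(out, Scored{ID: id, Score: s})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID // deterministic tie-break
	})
	return out
}

func main() {
	vector := []string{"dac-3m", "dac-1m", "aoc-3m"}
	keyword := []string{"sfp-sr", "dac-3m"}

	for _, r := range FuseRRF(60, vector, keyword) {
		fmt.Printf("%-8s %.5f\n", r.ID, r.Score)
	}
}
```

In this toy input, dac-3m appears in both lists, so it ends up first even though it is only second in the keyword list.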

Filtering matters a lot

One thing that can make or break performance is filtering.

If the user is clearly searching inside one category, do not run vector search across the whole table.

For example, if the query is about transceivers, there is no reason to compare it against server racks or firewalls.

This is why I would always try to apply filters before ranking:

WHERE category_id = $2

In larger systems, I would also consider partitioning by category or tenant.

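This reduces the search space and keeps latency more predictable.

One caveat: with an approximate index such as HNSW, pgvector applies the WHERE condition after the index scan, so a selective filter can return fewer rows than the LIMIT asks for. pgvector 0.8.0 and later can keep scanning with iterative index scans (SET hnsw.iterative_scan = strict_order), and partitioning or partial indexes are the other common workarounds.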

What about Citus?

For 40,000 products, a single well-tuned PostgreSQL instance can be enough.

But if the catalog grows to tens of millions of products, or if the system becomes multi-tenant, we may need horizontal scaling.

This is where Citus can help.

Citus distributes PostgreSQL tables across worker nodes.

For example:

SELECT create_distributed_table('network_equipment', 'product_id');

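With this setup, each worker can search its own shard, and the coordinator can merge the results.

Note that Citus requires primary key and unique constraints to include the distribution column, so the UNIQUE constraint on sku in the schema above would have to be dropped or redefined before distributing by product_id.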

For multi-tenant systems, distributing by tenant_id may be even better because most queries are usually scoped to one tenant.

The important point is that we can start simple with PostgreSQL and pgvector, then scale out with Citus when the data size really requires it.

My practical recommendation

If I were building this for a real technical e-commerce catalog, I would start with this architecture:

PostgreSQL as the main database
pgvector for semantic search
PostgreSQL full-text search for exact keyword search
HNSW index for vector search
RRF to merge vector and keyword results
metadata-enriched embedding text
category or tenant filtering before ranking
Citus only when a single node is no longer enough

This gives a good balance between performance, search quality, and operational simplicity.

I would not start with a complicated distributed system on day one.

First, I would make the search quality good on one node. Then I would benchmark with real queries, real products, and real user behavior.

Only after that, I would scale the architecture.

Conclusion

Hybrid search is one of the most practical ways to improve product discovery.

Keyword search gives us precision.

Vector search gives us semantic understanding.

RRF gives us a clean way to combine both without fighting with incompatible scoring systems.

And pgvector makes the whole architecture easier because we can keep product data, metadata, full-text search, and embeddings inside PostgreSQL.

For a technical catalog like networking hardware, this approach is especially useful because users often search in different ways:

exact SKUs
product families
technical specs
short descriptions
vague concepts
compatibility terms

A good search engine should handle all of them.

For me, the best architecture is not the one with the most tools. It is the one that gives accurate results, keeps latency low, and stays simple enough to operate in production.

From Elasticsearch Bottlenecks to Weaviate: How We Built Fast Hybrid Search in Production

amir — Wed, 03 Jun 2026 09:36:55 +0000

For years, Elasticsearch was one of those tools I would almost automatically reach for whenever a system needed search.

And honestly, for many use cases, it is still excellent.

If you need full-text search, filtering, aggregations, faceting, observability queries, or log exploration, Elasticsearch is a very mature and powerful engine. Its lexical search capabilities, especially through BM25 and inverted indexes, are battle-tested.

But my problem started when the product requirement changed from:

“Find documents that contain these words.”

to:

“Find documents that match the meaning of what the user is asking, but still respect exact keywords when they matter.”

That is where pure keyword search was no longer enough.

We needed real hybrid search.

Not just semantic search.

Not just BM25.

Not a manually glued-together ranking system.

We needed a search engine that could combine:

exact keyword matching,
semantic similarity,
metadata filtering,
predictable latency,
and production-level throughput.

At first, I tried to make Elasticsearch do it.

That decision taught me a lot.

Eventually, it also pushed me toward Weaviate.

This article is a practical breakdown of what went wrong, why hybrid search is harder than it looks, how Weaviate approaches the problem, and how I think about evaluating vector search quality in production.

The problem: keyword search was no longer enough

Traditional search engines are very good at lexical matching.

If a user searches for:

linux kernel tuning

BM25 can rank documents that contain terms like linux, kernel, and tuning very well.

But what happens when the user searches for:

how to reduce context switching overhead in a container runtime

The best document might not contain that exact sentence.

It might talk about:

Linux namespaces
cgroups
scheduler behavior
CPU pressure
process isolation
runtime overhead

A pure keyword engine may miss or rank it lower because the vocabulary is different.

This is where semantic search becomes useful.

Instead of only matching words, semantic search converts both documents and queries into vectors, usually called embeddings. These embeddings represent the meaning of the text in a high-dimensional space.

Documents with similar meaning end up close to each other in that vector space.

So now the search engine can understand that:

"improve API latency under load"

is related to:

"reduce p99 response time during high traffic"

even if the exact words are different.

But semantic search alone is also not perfect.

Sometimes exact terms matter a lot.

For example:

PostgreSQL jsonb index performance

In this query, the terms PostgreSQL, jsonb, and index are not optional. A purely semantic result about generic database performance may sound similar, but it is not good enough.

That is why hybrid search matters.

A strong search system should understand both:

What the user literally typed
What the user actually means

My first approach: forcing Elasticsearch to behave like a vector search engine

The first implementation was based on Elasticsearch.

The idea looked simple:

Use BM25 for keyword scoring.
Use vector similarity for semantic scoring.
Combine both scores into one final score.
Return the top results.

Conceptually, it made sense.

In practice, it became painful.

The challenge was score fusion.

BM25 scores and vector similarity scores are not naturally comparable.

BM25 might produce values like:

12.4
27.8
43.1

while cosine similarity might produce values like:

0.71
0.82
0.91

If you combine them naively, the ranking becomes unstable.

You cannot simply do:

final_score = bm25_score + cosine_similarity

because the scales are completely different.

You need normalization, weighting, ranking logic, and careful testing.

In my first version, I tried using custom scripting logic to combine lexical and vector scores at query time.

That was the mistake.

The production bottleneck

Custom scoring logic can look elegant in a prototype.

Under production traffic, it can become expensive very quickly.

The main issue was that vector math and score fusion were happening during query execution. That meant every search request had to do extra scoring work on top of the normal search pipeline.

As traffic increased, the symptoms became obvious:

CPU usage started spiking.
Tail latency became unstable.
p95 and p99 response times became much worse.
Scaling required more resources than expected.
Search quality improvements were difficult to test safely.

The biggest lesson was not that Elasticsearch is bad.

The lesson was:

A system optimized for lexical search should not always be forced to become the core of a high-throughput vector search architecture.

Elasticsearch can support vector search in modern versions, and for many teams it may be enough. But in my case, the implementation became too complex and too expensive for the kind of hybrid search behavior we needed.

I wanted a system where hybrid search was not an afterthought.

That is when I started looking seriously at Weaviate.

Why Weaviate felt like the right tool

Weaviate is an open-source vector database written in Go.

That immediately caught my attention because I work a lot with Go, backend architecture, and performance-sensitive systems. But the language itself was not the main reason.

The real reason was the architecture.

Weaviate was designed around vector search from the beginning.

Instead of treating embeddings as an extra field attached to a traditional search engine, Weaviate treats vectors as a first-class part of the storage and search model.

At a high level, it gives you:

vector search,
BM25 keyword search,
hybrid search,
metadata filtering,
schema-based data modeling,
GraphQL and REST APIs,
and production-oriented indexing options.

For my use case, the most important part was that Weaviate could combine semantic and keyword search natively.

No custom query-time score script.

No fragile manual scoring layer.

No forcing two different ranking systems together in application code.

Quick mental model: how vector search works

Before going deeper into Weaviate, it is useful to understand the basic idea behind vector search.

A machine learning model converts text into an embedding:

"Linux kernel performance tuning"

becomes something like:

[0.012, -0.431, 0.227, 0.091, ...]

That vector may have hundreds or thousands of dimensions depending on the embedding model.

The same thing happens to every document in your database.

When a user sends a query, the query is also converted into a vector. Then the search engine tries to find the nearest vectors.

The simplest way to think about this is distance.

For example, Euclidean distance between two vectors can be represented as:

d(p, q) = sqrt(sum((q_i - p_i)^2))

Another common approach is cosine similarity, which compares the angle between two vectors.

In real production systems, searching every vector one by one would be too slow at scale. If you have millions of documents, exact brute-force search is usually not practical.

That is why vector databases use Approximate Nearest Neighbor algorithms, usually called ANN.

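ANN algorithms trade a small amount of recall for a very large improvement in speed: they may occasionally miss one of the true nearest neighbors, but they avoid comparing the query against every stored vector.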

One of the most popular ANN algorithms is HNSW.
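
For reference, the exact search that ANN indexes approximate is easy to write down. The sketch below uses the Euclidean distance formula from above and scans every vector, which is precisely the linear cost that becomes impractical at millions of documents. The function names are mine, and vectors are assumed to have the same length as the query.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// euclidean implements d(p, q) = sqrt(sum((q_i - p_i)^2)).
func euclidean(p, q []float64) float64 {
	var sum float64
	for i := range p {
		d := q[i] - p[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// nearest returns the indices of the k vectors closest to query.
// It is exact: every vector is compared, so the cost grows linearly
// with the number of stored vectors.
func nearest(query []float64, vectors [][]float64, k int) []int {
	dists := make([]float64, len(vectors))
	idx := make([]int, len(vectors))
	for i, v := range vectors {
		dists[i] = euclidean(query, v)
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool {
		return dists[idx[a]] < dists[idx[b]]
	})
	if k < 0 {
		k = 0
	}
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

func main() {
	docs := [][]float64{
		{0.9, 0.1, 0.0},
		{0.1, 0.8, 0.3},
		{0.8, 0.2, 0.1},
	}
	query := []float64{1.0, 0.0, 0.0}
	fmt.Println(nearest(query, docs, 2))
}
```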

HNSW: the graph behind fast vector search

Weaviate uses HNSW, which stands for Hierarchical Navigable Small World.

You can think of HNSW as a graph structure built on top of your vectors.

Instead of scanning every vector, the engine navigates through a graph to quickly move toward the nearest neighbors.

A simplified mental model:

Similar vectors are connected.
The graph has multiple layers.
Higher layers allow faster long-distance jumps.
Lower layers refine the search near the best candidates.

This makes search much faster than brute force while still returning high-quality results.

For large-scale semantic search, this matters a lot.

A vector database is not only about storing embeddings. The real value is in how efficiently it can index, traverse, filter, and rank them under load.

Hybrid search: why score fusion is harder than it looks

The hardest part of hybrid search is not running BM25.

It is not running vector search either.

The hard part is combining the results correctly.

BM25 and vector search produce different types of scores.

BM25 is based on term frequency, inverse document frequency, field length, and lexical relevance.

Vector similarity is based on distance or similarity in embedding space.

These scores do not mean the same thing.

So instead of directly adding scores together, a better strategy is often to combine rankings.

This is where rank fusion becomes useful.

Reciprocal Rank Fusion: a better way to combine results

One common technique for hybrid ranking is Reciprocal Rank Fusion, usually shortened to RRF.

Instead of saying:

“This document has a BM25 score of 30 and a vector score of 0.88, so let’s add them.”

RRF says:

“Where did this document rank in the keyword result list, and where did it rank in the vector result list?”

Then it combines ranks.

A simplified formula looks like this:

RRF_score = 1 / (k + rank_keyword) + 1 / (k + rank_vector)

The exact implementation details can vary, but the idea is powerful: ranking position becomes more important than raw score scale.

This avoids many of the problems caused by incompatible scoring systems.

In Weaviate, hybrid search also exposes an alpha parameter, which lets you control the balance between keyword and vector search.

Conceptually:

alpha = 0.0  -> pure keyword (BM25) search
alpha = 1.0  -> pure vector search
alpha = 0.5  -> equal weight for both

That is extremely useful in real systems.

Different product areas may need different search behavior.

For example:

Documentation search may need more semantic matching.
SKU or product-code search may need stronger keyword matching.
Support ticket search may need a balance of both.
Log or error search may need exact matching for IDs and stack traces.

Being able to tune this without rewriting the whole scoring system is a big win.
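
To build intuition for what an alpha weight does, here is a small sketch that min-max normalizes each score list to the range 0 to 1 and then blends them. This is my own illustration of the general idea, not Weaviate's implementation, and the document IDs are made up. The score values are the BM25 and cosine examples from earlier in this article.

```go
package main

import (
	"fmt"
	"math"
)

// minMax rescales scores to [0, 1]. If all scores are equal, every
// entry gets 1 so that a single-result list still counts.
func minMax(scores map[string]float64) map[string]float64 {
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, s := range scores {
		lo = math.Min(lo, s)
		hi = math.Max(hi, s)
	}
	out := make(map[string]float64, len(scores))
	for id, s := range scores {
		if hi == lo {
			out[id] = 1
			continue
		}
		out[id] = (s - lo) / (hi - lo)
	}
	return out
}

// blend combines normalized keyword and vector scores.
// alpha = 0 uses only keyword scores, alpha = 1 only vector scores.
func blend(alpha float64, keyword, vector map[string]float64) map[string]float64 {
	out := make(map[string]float64)
	for id, s := range minMax(keyword) {
		out[id] += (1 - alpha) * s
	}
	for id, s := range minMax(vector) {
		out[id] += alpha * s
	}
	return out
}

func main() {
	bm25 := map[string]float64{"doc-a": 43.1, "doc-b": 27.8, "doc-c": 12.4}
	cosine := map[string]float64{"doc-b": 0.91, "doc-c": 0.82, "doc-d": 0.71}

	for _, alpha := range []float64{0.0, 0.5, 1.0} {
		fmt.Printf("alpha=%.1f %v\n", alpha, blend(alpha, bm25, cosine))
	}
}
```

Running it with different alpha values shows how the same two result lists can produce a different winner, which is why it is worth tuning per product area.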

Example: hybrid search with the Go client

Here is a simplified example using the Weaviate Go client.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/weaviate/weaviate-go-client/v5/weaviate"
    "github.com/weaviate/weaviate-go-client/v5/weaviate/graphql"
)

func main() {
    cfg := weaviate.Config{
        Host:   "localhost:8080",
        Scheme: "http",
    }

    client, err := weaviate.NewClient(cfg)
    if err != nil {
        log.Fatal(err)
    }

    ctx := context.Background()

    result, err := client.GraphQL().Get().
        WithClassName("Article").
        WithFields(
            graphql.Field{Name: "title"},
            graphql.Field{Name: "summary"},
            graphql.Field{Name: "_additional", Fields: []graphql.Field{
                {Name: "score"},
            }},
        ).
        WithHybrid(client.GraphQL().HybridArgumentBuilder().
            WithQuery("best practices for linux kernel tuning").
            WithAlpha(0.7),
        ).
        WithLimit(10).
        Do(ctx)

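    if err != nil {
        log.Fatal(err)
    }

    // GraphQL-level errors are reported in the response body, not in err.
    if len(result.Errors) > 0 {
        log.Fatalf("graphql error: %s", result.Errors[0].Message)
    }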

    fmt.Printf("%+v\n", result)
}

In this example, alpha = 0.7 means the query is more semantic than lexical, but keyword matching still contributes to the final ranking.

That is exactly the kind of control I wanted.

Production benchmark: what changed after migration

In one internal test, I used a dataset of around 1.5 million text documents, including articles, technical notes, and internal documentation.

The goal was not to create a perfect academic benchmark. The goal was to compare the behavior of the previous implementation with the new architecture under similar load.

The difference was significant.

System	Search Type	Fusion Strategy	Approx. p99 Latency	CPU Behavior Under Load
Elasticsearch	Hybrid	Custom query-time scoring	~850 ms	Frequent CPU spikes
Weaviate	Hybrid	Native hybrid ranking	~45 ms	Much more stable

The exact numbers will depend on hardware, schema, vector dimensions, indexing configuration, filters, and query patterns.

But the important part was not only the latency improvement.

The bigger win was operational simplicity.

After the migration:

ranking logic became easier to reason about,
query latency became more predictable,
CPU usage was more stable,
scaling was easier,
and experiments with different alpha values became much safer.

That matters a lot in production.

Performance is not only about the fastest possible average response time.

It is also about predictability.

A search system that is fast for most requests but unstable at p99 can still hurt the user experience badly.

The hidden challenge: evaluating search quality

Performance is only half of the story.

A fast search engine that returns bad results is still a bad search engine.

After moving to vector or hybrid search, one of the most important questions becomes:

How do we know the new search is actually better?

This gets especially important when changing embedding models.

For example, imagine moving from one embedding model to a newer one.

The new model may produce better vectors.

Or it may be worse for your specific domain.

You cannot rely only on intuition.

You need evaluation.

Metric 1: Mean Reciprocal Rank

Mean Reciprocal Rank, or MRR, measures how early the first relevant result appears.

If the correct result is ranked first, the reciprocal rank is:

1 / 1 = 1.0

If it appears in position 5:

1 / 5 = 0.2

Then you average this across many test queries.

MRR is useful when the user usually needs one best answer.

For example:

"How do I configure cgroups v2 memory limits?"

If the best document is result number one, the search is doing well.

If the best document is buried on page three, the user experience is poor.
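
The calculation above can be sketched in a few lines of Go. The `queryRun` type and the document IDs are hypothetical, chosen only to reproduce the 1.0 and 0.2 examples from the text.

```go
package main

import "fmt"

// queryRun is a hypothetical evaluation record for one test query.
type queryRun struct {
	results  []string        // ranked document IDs returned by the search
	relevant map[string]bool // document IDs judged relevant for the query
}

// reciprocalRank returns 1/rank of the first relevant result (ranks start
// at 1), or 0 when no relevant document was returned.
func reciprocalRank(run queryRun) float64 {
	for i, id := range run.results {
		if run.relevant[id] {
			return 1.0 / float64(i+1)
		}
	}
	return 0
}

// meanReciprocalRank averages the reciprocal rank over all test queries.
func meanReciprocalRank(runs []queryRun) float64 {
	if len(runs) == 0 {
		return 0
	}
	var sum float64
	for _, run := range runs {
		sum += reciprocalRank(run)
	}
	return sum / float64(len(runs))
}

func main() {
	runs := []queryRun{
		// Relevant document at position 1: reciprocal rank 1.0.
		{results: []string{"doc-a", "doc-b"}, relevant: map[string]bool{"doc-a": true}},
		// Relevant document at position 5: reciprocal rank 0.2.
		{results: []string{"d1", "d2", "d3", "d4", "d5"}, relevant: map[string]bool{"d5": true}},
	}
	fmt.Printf("MRR = %.2f\n", meanReciprocalRank(runs)) // (1.0 + 0.2) / 2
}
```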

Metric 2: Recall@K

Recall@K answers a different question:

Out of all relevant documents, what fraction did we return in the top K results?

For example, Recall@10 measures what fraction of the relevant documents appear in the first 10 results.

This is useful when there may be multiple correct results.

For documentation, research, support tickets, or internal knowledge bases, Recall@K can be more useful than only checking the top result.
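
A minimal sketch of Recall@K, assuming each result list contains unique document IDs and that every key in the `relevant` map is set to true. The IDs are made up for illustration.

```go
package main

import "fmt"

// recallAtK returns the fraction of relevant documents that appear in the
// first k results. It assumes result IDs are unique.
func recallAtK(results []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	if k < 0 {
		k = 0
	}
	if k > len(results) {
		k = len(results)
	}
	found := 0
	for _, id := range results[:k] {
		if relevant[id] {
			found++
		}
	}
	return float64(found) / float64(len(relevant))
}

func main() {
	// Hypothetical ranking and relevance judgments for one query.
	results := []string{"d1", "x1", "d3", "x2", "d2"}
	relevant := map[string]bool{"d1": true, "d2": true, "d3": true, "d4": true}

	fmt.Printf("Recall@3 = %.2f\n", recallAtK(results, relevant, 3)) // 2 of 4 found
	fmt.Printf("Recall@5 = %.2f\n", recallAtK(results, relevant, 5)) // 3 of 4 found
}
```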

Metric 3: NDCG

NDCG stands for Normalized Discounted Cumulative Gain.

It is useful when relevance is not binary.

A result can be:

perfect,
good,
somewhat related,
or irrelevant.

NDCG rewards highly relevant documents appearing near the top of the result list.

This is closer to how real users experience search.

A search result ranked #1 matters more than a result ranked #9.
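
Here is a small sketch of NDCG using the linear-gain form, where each graded result contributes grade / log2(position + 1). Two assumptions to keep in mind: some teams use 2^grade - 1 as the gain instead, and this version builds the ideal ordering only from the grades of the returned results, which is a simplification when relevant documents are missing from the list. The grade scale (3 = perfect, 0 = irrelevant) is hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// dcg computes discounted cumulative gain for grades listed in ranked order.
// Position 1 is divided by log2(2) = 1, position 2 by log2(3), and so on.
func dcg(grades []float64) float64 {
	var sum float64
	for i, g := range grades {
		sum += g / math.Log2(float64(i)+2)
	}
	return sum
}

// ndcg divides the DCG of the actual ranking by the DCG of the ideal
// ranking (the same grades sorted from highest to lowest).
func ndcg(grades []float64) float64 {
	ideal := append([]float64(nil), grades...)
	sort.Sort(sort.Reverse(sort.Float64Slice(ideal)))

	best := dcg(ideal)
	if best == 0 {
		return 0
	}
	return dcg(grades) / best
}

func main() {
	// Hypothetical grades for the top 4 results of one query:
	// 3 = perfect, 2 = good, 1 = somewhat related, 0 = irrelevant.
	ranked := []float64{3, 0, 2, 1}
	fmt.Printf("NDCG = %.3f\n", ndcg(ranked))
}
```

A ranking that is already in ideal order scores 1.0, and pushing a highly relevant document further down lowers the score.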

Metric 4: visual inspection with UMAP or t-SNE

Metrics are important, but visual inspection can also help.

One useful technique is to export a sample of your embeddings and reduce them to two dimensions using UMAP or t-SNE.

Then you can plot them and inspect whether similar documents form clear clusters.

For example, if you are indexing technical articles, you might expect clusters like:

Linux kernel
PostgreSQL
Kubernetes
Go concurrency
distributed systems
observability

If everything is mixed together randomly, your embedding model may not be representing your domain well.

This does not replace proper evaluation, but it can reveal issues quickly.

Practical lessons from the migration

Here are the biggest lessons I took from this project.

1. Do not treat vector search as just another field type

Embeddings change the architecture of search.

They are not just another column in your database.

You need to think about indexing, memory, distance metrics, filtering, ranking, and evaluation.

2. Hybrid search is usually better than pure semantic search

Pure semantic search can be impressive in demos, but production users often search with exact terms.

They search for:

product codes,
error messages,
function names,
ticket IDs,
file names,
database fields,
framework names.

A good system should not ignore those signals.

3. Score fusion deserves serious attention

Combining BM25 and vector similarity incorrectly can make search worse.

Rank fusion techniques are often safer than manually combining raw scores.
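
One widely used rank fusion technique is Reciprocal Rank Fusion (RRF), where each document scores the sum of 1 / (k + rank) over the lists it appears in. The sketch below is a generic illustration, not a description of how any particular search engine implements hybrid ranking. The constant k = 60 is a common convention, and the document IDs are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// reciprocalRankFusion merges several ranked lists of document IDs.
// Only positions are used, so BM25 scores and vector distances never
// need to be put on the same scale.
func reciprocalRankFusion(rankings [][]string, k float64) []string {
	scores := make(map[string]float64)
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1.0 / (k + float64(i+1)) // ranks start at 1
		}
	}

	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b] // deterministic tie-break by ID
	})
	return ids
}

func main() {
	keyword := []string{"doc-a", "doc-b", "doc-c"} // hypothetical BM25 order
	vector := []string{"doc-b", "doc-c", "doc-a"}  // hypothetical vector order

	fmt.Println(reciprocalRankFusion([][]string{keyword, vector}, 60))
}
```

Here doc-b wins because it is near the top of both lists, even though it is first in only one of them.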

4. p99 latency matters more than a beautiful demo

A search demo with 100 documents is easy.

A production system with millions of documents, filters, concurrent users, and unpredictable queries is different.

Always test tail latency.

5. Measure quality before changing embedding models

A newer model is not automatically better for your data.

Build a small evaluation dataset with real queries and expected results.

Even 50 to 100 carefully selected queries can reveal a lot.

When I would still use Elasticsearch

This migration does not mean I would never use Elasticsearch again.

I would still choose Elasticsearch for many use cases:

log search,
observability,
analytics-heavy search,
faceted search,
complex aggregations,
mature operational environments already built around Elastic.

Elasticsearch is still a great tool.

The point is not “Elasticsearch bad, Weaviate good.”

The point is:

Choose the system that matches the shape of your problem.

For my specific case, the problem was high-throughput hybrid semantic search.

For that, Weaviate was a better fit.

Final thoughts

The biggest engineering mistake is often not choosing a bad tool.

It is choosing a good tool for the wrong job.

Elasticsearch is a powerful search engine, but when I needed production-grade hybrid search with strong vector search behavior, custom score fusion became too expensive and too hard to maintain.

Weaviate gave me a cleaner architecture:

native vector search,
BM25 support,
hybrid ranking,
HNSW indexing,
metadata filtering,
and a much simpler path to experimentation.

The migration improved latency, reduced operational complexity, and made ranking behavior easier to tune.

But the most important lesson was deeper:

Search quality is a product feature, not just an infrastructure feature.

You need to measure it, tune it, and treat it as part of the user experience.

If you are building search for technical documentation, e-commerce, internal knowledge bases, support systems, or AI-powered products, hybrid search is probably worth serious consideration.

And if you are currently fighting custom score scripts, unstable p99 latency, and hard-to-debug ranking behavior, it may be time to ask whether your search engine is doing the job it was originally designed to do.

Discussion

Have you implemented hybrid search in production?

I would be interested to hear:

whether you used Elasticsearch, OpenSearch, Weaviate, Qdrant, Pinecone, or another system,
how you handled score fusion,
and how you measured search quality beyond just latency.

Demystifying the Linux Page Cache: The Kernel Optimization Hiding Behind Every Fast I/O

amir — Tue, 02 Jun 2026 15:48:08 +0000

For the last few months, I have been spending a lot of time reading and researching Linux kernel internals.

Not just from the surface.

I mean going deeper into how the kernel actually manages memory, file I/O, processes, namespaces, cgroups, filesystems, and all the invisible mechanisms that make our backend systems feel fast.

As backend engineers, we usually talk about performance in terms of application code:

optimize the query
add Redis
reduce allocations
use a better index
improve concurrency
tune the API response time

And all of those things matter.

But after working on production systems for years, I have learned that sometimes the real bottleneck is not inside your application code.

Sometimes the real story is happening below your code, inside the operating system.

One of the most important examples of that is the Linux Page Cache.

It is one of those kernel features that quietly improves almost every backend system we run. It makes file reads faster, batches writes, reduces disk pressure, and gives us the illusion that storage is much faster than it actually is.

But it also comes with trade-offs.

And if you do not understand those trade-offs, you can easily misread performance numbers, misunderstand memory usage, or even risk losing data in the wrong failure scenario.

This article is my attempt to explain the Linux Page Cache from a backend engineer’s point of view.

Not as a kernel developer writing C inside mm/filemap.c.

But as someone who builds systems on top of Linux and wants to understand what the kernel is really doing behind the scenes.

Why Page Cache Exists

The reason Page Cache exists is simple:

Disk is slow. RAM is fast.

Even with modern SSDs and NVMe drives, accessing persistent storage is still much slower than accessing memory.

When our application reads from a file, the kernel could theoretically go to the disk every single time.

But that would be extremely inefficient.

So Linux uses available memory as a cache for file data.

That cache is called the Page Cache.

When a process reads a file, Linux usually does not just copy data directly from disk to the application and forget about it.

Instead, the kernel loads file data into memory pages and keeps those pages around.

So the next time the same file data is requested, Linux can serve it directly from RAM instead of touching the disk again.

That is the basic idea.

But the impact is huge.

Imagine a web server reading the same static files again and again.

Or a database repeatedly touching the same data files.

Or a log processor scanning files that were recently written.

Without Page Cache, every access would become much more expensive.

With Page Cache, many of those reads become memory-speed operations.

The First Important Idea: Linux Uses Free RAM Aggressively

One thing that confuses many developers is Linux memory usage.

You run free -h, and it looks like most of your RAM is used.

At first, it feels scary.

But often, that memory is not “wasted” or permanently consumed by applications.

A large part of it may be used by the kernel for cache.

Linux has a very practical philosophy:

Empty RAM is wasted RAM.

So if memory is available, the kernel uses it to cache useful data.

If an application later needs more memory, Linux can reclaim cache pages and give that memory back.

This is why a Linux server may look like it is using a lot of RAM even when your applications are not actually consuming that much.

The kernel is trying to help you.

It is using memory to avoid unnecessary disk I/O.

That is Page Cache doing its job.

What Actually Happens During a File Read?

Let’s say your application calls read() on a file.

At a high level, Linux checks whether the requested file data already exists in the Page Cache.

There are two possible outcomes.

Cache Hit

If the data is already in memory, the kernel can copy it to your application without going to disk.

This is fast.

Very fast compared to storage.

Cache Miss

If the data is not in memory, Linux has to read it from disk.

But after reading it, the kernel stores it in the Page Cache.

So the first read may be slower, but future reads can be much faster.

This is one reason benchmarks can be misleading.

If you run the same file-read benchmark twice, the second run may look much faster because the data is already cached.

Your application did not magically become better.

The kernel just remembered the file data.

The Kernel Also Predicts What You Might Read Next

Linux does not only cache what you already read.

It also tries to predict what you are going to read.

For sequential file access, the kernel may perform readahead.

That means when you read one part of a file, Linux may load the next parts into memory before you explicitly ask for them.

This is extremely useful for workloads like:

reading large logs
streaming files
scanning datasets
serving static assets
processing backups
importing CSV files

From the application’s point of view, it may feel like the disk is faster than expected.

But in reality, the kernel is doing smart work in the background.

It sees a pattern and tries to stay ahead of your application.

Writes Are Even More Interesting

Reads are easy to understand:

If data is cached, serve it from RAM.

Writes are more subtle.

When your application writes data to a file, Linux usually does not immediately force that data to physical storage.

Instead, the kernel writes the data into memory and marks the affected pages as dirty.

A dirty page means:

This page has been modified in memory, but the change has not necessarily been persisted to disk yet.

This is where the kernel gives your application a very useful illusion.

Your application calls write().

The kernel accepts the data.

The call returns successfully.

But that does not always mean the data is already safely stored on disk.

It may only mean the data is now in the kernel’s memory and scheduled to be written later.

This is one of the biggest performance tricks in the operating system.

And also one of the most important trade-offs.

Why Delayed Writes Are So Powerful

Imagine an application appending tiny log entries to a file thousands of times per second.

If Linux forced every small write to disk immediately, performance would be terrible.

Storage devices are much better at handling larger, sequential writes than many tiny random writes.

So Linux delays and combines writes.

Many small writes can be collected in memory and later flushed to disk in a more efficient way.

This process is called writeback.

The kernel has background mechanisms that periodically flush dirty pages to storage.

This gives us much better throughput.

Instead of turning every write() call into an expensive physical I/O operation, Linux turns many small writes into fewer, larger writes.

That is a huge win for performance.

The Dangerous Part: `write()` Is Not `fsync()`

This is where many bugs and misunderstandings happen.

A successful write() does not always mean your data is durable.

It usually means the kernel accepted your data.

If the machine loses power before dirty pages are flushed, some recently written data may be lost.

That is why databases, queues, and storage engines care so much about fsync().

When durability matters, you need to force the kernel to flush data to stable storage.

For example:

write(fd, data, size);
fsync(fd);

write() says:

Please accept this data.

fsync() says:

Now make sure it is actually persisted.

This distinction is extremely important when building systems where data loss is not acceptable.

For normal logs, delayed writeback may be fine.

For a financial transaction, it is not something you can ignore.

Why Databases Behave Differently

Databases like PostgreSQL, MySQL, RocksDB, and others are very careful with disk I/O.

They know the kernel is caching data.

They know writes may be delayed.

They know crashes can happen.

So they use techniques like:

write-ahead logging
fsync
checkpoints
direct I/O in some configurations
controlled flushing
careful ordering of writes

A database cannot simply trust that because write() returned successfully, everything is safe.

It needs stronger guarantees.

This is also why database performance tuning is complicated.

You are not only tuning SQL queries.

You are tuning the interaction between:

application
database engine
filesystem
Linux kernel
Page Cache
storage device
cloud virtualization layer

That stack is deep.

And Page Cache sits right in the middle of it.

Page Cache and `mmap`

Another important part of this story is mmap.

With normal file I/O, you call read() and write().

With mmap, a file can be mapped into a process’s virtual memory.

Then the application can access file data almost like normal memory.

But behind the scenes, Page Cache is still involved.

When the process touches a memory-mapped page, the kernel may load the corresponding file data into the Page Cache.

This is powerful because it can reduce copying and make file access feel very natural from the application side.

But it also means that memory-mapped I/O is deeply connected to the kernel’s virtual memory system.

This is where the boundary between “file” and “memory” becomes very thin.

And that is one of the reasons Linux I/O is such a fascinating topic.

Page Cache Can Make Benchmarks Lie

When I started looking deeper into kernel internals, one of the first practical lessons was this:

Never trust a file I/O benchmark unless you understand the cache state.

For example, if you benchmark reading a large file once, the first run may include real disk I/O.

The second run may mostly hit Page Cache.

So it looks much faster.

But that does not necessarily mean your code improved.

It may only mean the file is already cached.

This matters when testing:

file parsers
log processors
database imports
backup tools
search indexing jobs
media processing pipelines

If you want to test cold disk performance, you need to be intentional.

If you want to test warm cache performance, that is also valid.

But you should know which one you are measuring.

Otherwise, you are not benchmarking your application.

You are benchmarking a combination of your application and the kernel’s memory state.

Page Cache Can Also Hurt You

Page Cache is usually helpful.

But not always.

There are cases where it can create problems.

1. Cache Pollution

Imagine a server running a database.

Most of the time, the database benefits from hot data staying in memory.

Now imagine a backup process reads a huge 500GB file sequentially.

That large read can fill the Page Cache with data that may never be used again.

As a result, more important cached pages may be evicted.

This is called cache pollution.

The kernel tries to manage this intelligently, but no heuristic is perfect.

2. Memory Pressure

Because Page Cache uses RAM, it competes with applications for memory.

Usually Linux can reclaim cache pages when needed.

But under heavy memory pressure, the system may start doing more reclaim work, causing latency spikes.

For backend services, this can show up as random performance drops.

3. Dirty Page Spikes

If an application writes faster than storage can flush, dirty pages can accumulate.

At some point, Linux may throttle writers to prevent memory from being overwhelmed by dirty data.

From the application’s point of view, write latency may suddenly increase.

This is not because your code changed.

It is because the kernel is protecting the system.

Useful Commands to Observe Page Cache Behavior

You do not need to be a kernel developer to observe some of this behavior.

Linux exposes useful information through /proc and common tools.

Check memory usage

free -h

Look at the buff/cache column.

That is where you often see memory used for kernel buffers and cache.

Check dirty pages

grep -E "^(Dirty|Writeback):" /proc/meminfo

Example fields:

Dirty:              123456 kB
Writeback:             0 kB

Dirty shows memory waiting to be written back.

Writeback shows memory currently being written.
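
If you want the same numbers inside a service or an exporter, a small parser is enough. This is a sketch under the assumption that lines look like `Name:   value kB`, as in the example fields above. The function name `parseMeminfoKB` is hypothetical, and the program falls back to sample text when `/proc/meminfo` is not available.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMeminfoKB extracts the kB values of the requested fields from
// /proc/meminfo-style text. Field names are matched exactly, so asking
// for "Writeback" does not pick up "WritebackTmp".
func parseMeminfoKB(text string, fields ...string) map[string]int64 {
	want := make(map[string]bool, len(fields))
	for _, f := range fields {
		want[f] = true
	}

	out := make(map[string]int64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		name, rest, ok := strings.Cut(sc.Text(), ":")
		if !ok || !want[name] {
			continue
		}
		parts := strings.Fields(rest) // e.g. ["123456", "kB"]
		if len(parts) == 0 {
			continue
		}
		if v, err := strconv.ParseInt(parts[0], 10, 64); err == nil {
			out[name] = v
		}
	}
	return out
}

func main() {
	text := "Dirty:              123456 kB\nWriteback:             0 kB\n" // sample fallback
	if data, err := os.ReadFile("/proc/meminfo"); err == nil {
		text = string(data)
	}

	stats := parseMeminfoKB(text, "Dirty", "Writeback")
	fmt.Printf("Dirty=%d kB Writeback=%d kB\n", stats["Dirty"], stats["Writeback"])
}
```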

Watch I/O activity

iostat -xz 1

This helps you see disk utilization, await time, and whether the storage device is under pressure.

Watch process I/O

pidstat -d 1

This helps you understand which processes are doing reads and writes.

A Simple Experiment

You can see Page Cache behavior with a simple test.

Create a large file:

dd if=/dev/zero of=testfile bs=1M count=1024

Now read it:

time cat testfile > /dev/null

Run the same command again:

time cat testfile > /dev/null

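The second run may be faster because the file data is already in Page Cache. On many systems the first run is already warm too, because dd wrote the file through the Page Cache and those pages may still be in memory.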

Now, depending on your system and permissions, you can drop caches for testing:

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

Then read again:

time cat testfile > /dev/null

Important note:

Do not randomly drop caches on production servers.

This is only for controlled testing.

How This Changed the Way I Think About Backend Performance

The more I study Linux internals, the more I realize that backend engineering is not only about writing application code.

A production system is a conversation between many layers:

your code
runtime
memory allocator
database
filesystem
kernel
storage
network
container runtime
orchestration platform

If you only look at your code, you may miss the real reason behind a performance issue.

For example:

An API becomes slow because disk writeback is saturated.
A database has latency spikes because dirty pages are being flushed.
A benchmark looks amazing because everything is cached.
A log processor slows down because it is causing cache pollution.
A container looks memory-heavy because the page cache for files it touched is counted in its cgroup memory usage.

Understanding Page Cache gives you better intuition.

It helps you ask better questions.

It helps you debug production issues with more confidence.

And it reminds you that the operating system is not just a passive layer.

The kernel is constantly making decisions on your behalf.

Practical Lessons for Backend Engineers

Here are the lessons I keep in mind now:

1. Do not panic when Linux uses RAM

High memory usage is not always bad.

Check whether memory is used by applications or by cache.

2. Benchmark carefully

Always understand whether your benchmark is testing cold reads or cached reads.

3. `write()` is not durability

If data must survive a crash, understand fsync() and the durability model of your storage system.

4. Watch dirty pages

Dirty pages can explain sudden write latency spikes.

5. Be careful with large sequential jobs

Backups, imports, and scans can affect cache behavior for other workloads.

6. Learn the kernel slowly

You do not need to understand everything at once.

But each concept you learn gives you better production intuition.

Final Thoughts

The Linux Page Cache is one of the most important performance features in the operating system.

It hides disk latency.

It makes repeated reads fast.

It batches writes.

It uses free memory intelligently.

And it does all of this quietly, behind almost every backend system we deploy.

But like every powerful abstraction, it has trade-offs.

It can make benchmarks misleading.

It can delay durability.

It can create latency spikes under write pressure.

It can affect databases, log processors, and large file workloads in unexpected ways.

For me, studying the Page Cache is part of a bigger journey: understanding Linux not just as a server environment, but as a complex engineering system.

The deeper I go into the kernel, the more respect I have for the invisible work it does every second.

And the more I believe that strong backend engineers should not stop at frameworks, databases, and APIs.

At some point, we also need to understand the machine underneath.

Because sometimes the bug is not in your code.

Sometimes the answer is in the kernel.

The 1978 Paper Behind Go’s Concurrency Model

amir — Sun, 31 May 2026 11:31:35 +0000

Every Go developer eventually hears this sentence:

Do not communicate by sharing memory; instead, share memory by communicating.

At first, it sounds like a nice Go proverb.

But the more concurrent systems you build, the more you realize it is not just a slogan. It is a completely different way of thinking about software design.

When I first started working deeply with Go concurrency, I mostly thought about goroutines as “lightweight threads” and channels as “safe queues.” That mental model is useful at the beginning, but it is not the full story.

The real story goes much deeper.

Go’s concurrency model is heavily inspired by a paper from 1978: Communicating Sequential Processes, written by Tony Hoare.

This article is my attempt to explain that idea in a practical way: why shared-memory concurrency becomes painful, what CSP was trying to solve, and how Go turned that theory into something we can use every day in production systems.

The problem: shared state looks simple until it does not

Most concurrency problems start innocently.

You have multiple workers. They need access to the same data. So you put the data in memory and let the workers read and write it.

Something like this:

type Counter struct {
    value int
}

func (c *Counter) Inc() {
    c.value++
}

Then concurrency enters the system:

counter := &Counter{}

for i := 0; i < 1000; i++ {
    go counter.Inc()
}

Now the code is broken.

Multiple goroutines can read and write value at the same time. The final result becomes unpredictable. So we add a mutex:

type Counter struct {
    mu    sync.Mutex
    value int
}

func (c *Counter) Inc() {
    c.mu.Lock()
    defer c.mu.Unlock()

    c.value++
}

This fixes the data race.

But this is where the next problem starts.

Mutexes are not bad. They are necessary in many real systems. The issue is that shared state plus locks can easily become the default architecture.

And once that happens, your system becomes harder to reason about.

You start asking questions like:

Who owns this data?
Which goroutine is allowed to update it?
How long is this lock held?
Can this function call another function that also needs the same lock?
Can this code deadlock under load?
Why is p99 latency suddenly worse?

This is especially painful in backend systems with high-throughput pipelines, network services, log processors, container runtimes, or in-memory systems where thousands of operations happen concurrently.

The code may look safe because it has locks.

But safe does not always mean simple.

And simple is what keeps production systems maintainable.

The hidden cost of “just add a mutex”

A mutex protects memory, but it also serializes access.

That means one slow operation inside a lock can block many other goroutines.

I have seen patterns like this in real backend code:

mu.Lock()

user.Name = "Test User"

sendEmail(user)
callExternalAPI(user)
writeAuditLog(user)

mu.Unlock()

Technically, the shared data is protected.

Architecturally, this is a problem.

The lock is not only protecting the small mutation of user.Name. It is now protecting email sending, external API calls, logging, and anything else that happens inside that critical section.

That means every goroutine waiting for this lock must wait for the whole flow.

The service may have goroutines. It may look concurrent. But a big part of the request path has silently become sequential.

This is why concurrency bugs are not always about missing locks.

Sometimes the problem is too much locking.

Sometimes the real bug is ownership.
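
One way to shrink the oversized critical section shown above is to hold the lock only for the mutation, take a copy, and run the slow work on that copy after unlocking. This is a sketch with hypothetical names: `UserStore` and `Rename` are my own, and `notify` stands in for sendEmail, callExternalAPI, and writeAuditLog.

```go
package main

import (
	"fmt"
	"sync"
)

// User is a hypothetical record shared between goroutines.
type User struct {
	Name string
}

// UserStore guards a single user with a mutex.
type UserStore struct {
	mu   sync.Mutex
	user User
}

// Rename mutates the shared user under the lock and returns a copy, so
// callers can do slow follow-up work without holding the lock.
func (s *UserStore) Rename(name string) User {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.user.Name = name
	return s.user // User is a plain struct, so this is a value copy
}

// notify is a stand-in for the slow side effects: email, external API
// calls, audit logging.
func notify(u User) string {
	return "notified " + u.Name
}

func main() {
	store := &UserStore{}

	snapshot := store.Rename("Test User") // lock is held only for the mutation
	fmt.Println(notify(snapshot))         // slow work runs after the lock is released
}
```

The trade-off is that the side effects now see a snapshot rather than the live value, which is usually what you want for emails and audit logs anyway.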

Tony Hoare’s idea: stop sharing memory directly

In 1978, Tony Hoare introduced Communicating Sequential Processes, usually called CSP.

The idea was very different from the classic shared-memory model.

Instead of many processes fighting over the same memory, CSP describes a system as independent sequential processes that communicate by sending messages.

The important parts are:

Each process has its own local state.
Processes do not casually share variables.
When they need to coordinate, they communicate explicitly.
Communication is part of the design, not an afterthought.

That sounds simple, but it changes how you design systems.

Instead of asking:

Which lock protects this data?

You start asking:

Which goroutine owns this data?

That is a much more powerful question.

Because if one goroutine owns the state, other goroutines do not need to mutate it directly. They send messages to the owner.

This is the mental shift behind Go channels.
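
As a minimal sketch of that ownership idea, here is the earlier counter rewritten so that exactly one goroutine owns the count and everyone else sends it messages. The function names `runCounter` and `countWithOwner` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// runCounter owns the count. No other goroutine can read or write it
// directly; they can only send increment messages on incs.
func runCounter(incs <-chan struct{}, final chan<- int) {
	count := 0
	for range incs {
		count++
	}
	final <- count // incs was closed: report the result and exit
}

// countWithOwner starts n goroutines that each request one increment.
func countWithOwner(n int) int {
	incs := make(chan struct{})
	final := make(chan int)
	go runCounter(incs, final)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			incs <- struct{}{}
		}()
	}

	wg.Wait()   // every increment has been received by the owner
	close(incs) // tell the owner that no more messages are coming
	return <-final
}

func main() {
	fmt.Println(countWithOwner(1000))
}
```

There is no mutex around count because only one goroutine ever touches it. For a plain counter a mutex or an atomic is simpler, so treat this as an illustration of the pattern rather than a recommendation for counters.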

Go did not just add concurrency. It gave concurrency a shape.

Go could have chosen the same path as many other languages:

threads
locks
shared memory
concurrency libraries
complex abstractions built on top

But Go made a different design decision.

It gave concurrency first-class language support with:

go for starting goroutines
chan for communication
select for coordinating multiple communication operations

This matters because concurrency is not treated as a library feature bolted onto the language.

It is part of the language’s design.

A goroutine is not exactly the same as a CSP process, and Go is not a pure CSP language. Go still allows shared memory, mutexes, atomics, and low-level synchronization when needed.

But the spirit of CSP is clearly there:

Build independent units of execution and make them communicate explicitly.

That is why Go feels so natural for network servers, infrastructure tools, distributed systems, streaming pipelines, and cloud-native software.

A simple CSP-style pipeline in Go

Let’s look at a practical example.

Imagine a small pipeline:

Generate jobs.
Process jobs.
Return results.

A shared-memory mindset might create a global queue, protect it with a mutex, and let workers pull from it.

A CSP-style mindset is different.

The data flows through channels.

package main

import (
    "fmt"
    "time"
)

type Job struct {
    ID int
}

type Result struct {
    JobID int
    Value string
}

func producer(out chan<- Job) {
    defer close(out)

    for i := 1; i <= 5; i++ {
        out <- Job{ID: i}
    }
}

func worker(in <-chan Job, out chan<- Result) {
    for job := range in {
        time.Sleep(100 * time.Millisecond)

        out <- Result{
            JobID: job.ID,
            Value: fmt.Sprintf("processed job %d", job.ID),
        }
    }
}

func main() {
    jobs := make(chan Job)
    results := make(chan Result)

    go producer(jobs)

    go func() {
        defer close(results)
        worker(jobs, results)
    }()

    for result := range results {
        fmt.Printf("job=%d result=%q\n", result.JobID, result.Value)
    }
}

There is no shared slice.
There is no global queue.
There is no explicit mutex.

The producer owns job creation.
The worker owns processing.
The main goroutine owns result collection.

The data moves through the system.

That is the important part.

We are not asking multiple goroutines to mutate the same object at the same time. We are designing a flow of ownership.

Scaling the pipeline with multiple workers

Now let’s make it more realistic.

A single worker is not enough for heavy workloads. We want multiple workers processing jobs concurrently.

package main

import (
    "fmt"
    "sync"
    "time"
)

type Job struct {
    ID int
}

type Result struct {
    WorkerID int
    JobID    int
    Value    string
}

func producer(out chan<- Job, count int) {
    defer close(out)

    for i := 1; i <= count; i++ {
        out <- Job{ID: i}
    }
}

func worker(workerID int, jobs <-chan Job, results chan<- Result) {
    for job := range jobs {
        time.Sleep(100 * time.Millisecond)

        results <- Result{
            WorkerID: workerID,
            JobID:    job.ID,
            Value:    fmt.Sprintf("processed job %d", job.ID),
        }
    }
}

func main() {
    const workerCount = 3
    const jobCount = 10

    jobs := make(chan Job)
    results := make(chan Result)

    go producer(jobs, jobCount)

    var wg sync.WaitGroup

    for i := 1; i <= workerCount; i++ {
        wg.Add(1)

        go func(workerID int) {
            defer wg.Done()
            worker(workerID, jobs, results)
        }(i)
    }

    go func() {
        wg.Wait()
        close(results)
    }()

    for result := range results {
        fmt.Printf(
            "worker=%d job=%d result=%q\n",
            result.WorkerID,
            result.JobID,
            result.Value,
        )
    }
}

This is one of the places where Go shines.

The design is still readable:

The producer sends jobs.
Workers receive jobs.
Workers send results.
The result channel closes when all workers finish.

We still use sync.WaitGroup, because Go is pragmatic. CSP-style design does not mean “never use the sync package.”

It means we use synchronization intentionally.

The channel handles data flow.
The wait group handles lifecycle coordination.

That separation is clean.

The real value: ownership becomes visible

The biggest advantage of CSP-style design is not that it removes every mutex.

The biggest advantage is that it makes ownership visible.

In shared-memory systems, ownership is often hidden.

You see a pointer passed around. You see a struct used in multiple places. You see a lock somewhere. But it is not always obvious who is responsible for the state.

With channels, ownership is easier to see.

For example:

jobs <- job

This line says:

I am sending this job to another part of the system.

And this line:

job := <-jobs

says:

I am receiving this job and now I am responsible for processing it.

That clarity matters.

In production systems, many bugs are not caused by complex algorithms. They are caused by unclear ownership.

Who closes this channel?
Who updates this state?
Who retries this job?
Who owns cancellation?
Who handles backpressure?

A channel-based design forces you to answer these questions earlier.

Backpressure is built into the model

Another underrated benefit of channels is backpressure.

An unbuffered channel forces the sender and receiver to synchronize:

jobs := make(chan Job)

If the producer sends faster than the worker receives, the producer blocks.

That is not a bug. That is backpressure.

Buffered channels allow some temporary queueing:

jobs := make(chan Job, 100)

Now the producer can get ahead by 100 jobs, but not forever.

This is powerful in real systems.

For example, imagine a log processing service:

HTTP requests receive logs.
A parser normalizes them.
A batcher writes them to storage.
A separate worker sends alerts.

Without backpressure, one fast layer can overwhelm a slower layer.

With channels, you can make pressure visible and controlled.

You can decide where the system should block, buffer, drop, retry, or shed load.

That is architecture, not just syntax.
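
As a minimal sketch of the "drop or shed load" choice, a non-blocking send with select and default lets the producer find out that the queue is full instead of waiting. The Job type and the trySubmit helper here are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Job is a hypothetical unit of work used only for this sketch.
type Job struct{ ID int }

// trySubmit attempts a non-blocking send. It reports false when the
// buffer is full, so the caller can decide to drop, retry, or shed load.
func trySubmit(jobs chan<- Job, job Job) bool {
    select {
    case jobs <- job:
        return true
    default:
        return false
    }
}

func main() {
    // Room for two queued jobs; nothing is consuming yet.
    jobs := make(chan Job, 2)

    for i := 1; i <= 3; i++ {
        if trySubmit(jobs, Job{ID: i}) {
            fmt.Printf("job %d queued\n", i)
        } else {
            fmt.Printf("job %d shed: queue full\n", i)
        }
    }
}
```

With a plain `jobs <- job` the third send would block until a worker catches up. Which behavior is right depends on whether losing a job is acceptable at that point in the pipeline.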

Where channels are a bad fit

It is also important to be honest: channels are not magic.

Not every concurrency problem becomes better with channels.

Sometimes a mutex is simpler and faster.

For example, this is perfectly reasonable:

type Metrics struct {
    mu       sync.Mutex
    requests int64
    errors   int64
}

func (m *Metrics) IncRequests() {
    m.mu.Lock()
    m.requests++
    m.mu.Unlock()
}

For small, local, short-lived critical sections, a mutex is often the cleanest solution.
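
For a plain counter like this, the sync/atomic package is another option. The sketch below is a hypothetical variant of the Metrics struct, assuming Go 1.19 or newer for atomic.Int64; the countConcurrently helper exists only to exercise it.

```go
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

// AtomicMetrics is a hypothetical lock-free variant of Metrics.
type AtomicMetrics struct {
    requests atomic.Int64
    errors   atomic.Int64
}

func (m *AtomicMetrics) IncRequests()    { m.requests.Add(1) }
func (m *AtomicMetrics) IncErrors()      { m.errors.Add(1) }
func (m *AtomicMetrics) Requests() int64 { return m.requests.Load() }

// countConcurrently increments the counter from n goroutines and
// returns the final value.
func countConcurrently(n int) int64 {
    var m AtomicMetrics
    var wg sync.WaitGroup

    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            m.IncRequests()
        }()
    }

    wg.Wait()
    return m.Requests()
}

func main() {
    fmt.Println(countConcurrently(100))
}
```

Atomics fit independent counters. Once two fields must change together, a mutex is usually the simpler choice.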

Channels can become messy when they are used only because they feel “more Go-like.”

Bad channel usage can create:

goroutine leaks
unclear lifecycle
blocked sends
blocked receives
complicated shutdown logic
over-engineered code

The point is not:

Always use channels.

The point is:

Use channels when communication and ownership are the core problem.

Use mutexes when protecting a small piece of shared state is the core problem.

Senior Go engineering is knowing the difference.

A practical rule I use

When I design concurrent Go code, I usually ask myself:

1. Is this state owned by one goroutine?

If yes, channels can be a great fit. Other goroutines can send commands or data to the owner.

2. Is this just a small shared counter or cache?

A mutex or atomic may be better.

3. Do I need backpressure?

Channels make this easier to model.

4. Do I need cancellation?

Then I design the channel flow together with context.Context.

5. Who closes the channel?

If I cannot answer this clearly, the design is not finished.

That last question is very important.

Channel ownership is not only about sending data. It is also about lifecycle.

Usually, the sender closes the channel. Receivers should not close a channel they do not own.
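
Questions 4 and 5 can be sketched together. In this minimal example the produce function is the only sender, so it is the only place that closes the channel, and it watches context.Context so it can stop early. The function names are hypothetical and chosen for illustration.

```go
package main

import (
    "context"
    "fmt"
)

// produce owns the out channel: it is the only sender, so it is the
// only place that closes it. It stops early if ctx is cancelled.
func produce(ctx context.Context, n int) <-chan int {
    out := make(chan int)

    go func() {
        defer close(out) // the sender closes, never the receiver

        for i := 1; i <= n; i++ {
            select {
            case out <- i:
            case <-ctx.Done():
                return
            }
        }
    }()

    return out
}

// sum is a receiver: it only reads until the channel is closed.
func sum(in <-chan int) int {
    total := 0
    for v := range in {
        total += v
    }
    return total
}

// sumWithOwner wires the producer and the receiver together.
func sumWithOwner(n int) int {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    return sum(produce(ctx, n))
}

func main() {
    fmt.Println(sumWithOwner(5))
}
```

Returning a receive-only channel makes the ownership rule visible in the type: callers cannot close it even by accident.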

CSP thinking in modern backend systems

The reason I like CSP is that it maps well to real backend architecture.

A backend service is not just functions calling functions.

It is a system of flows:

requests flow into handlers
jobs flow into queues
events flow into processors
logs flow into pipelines
metrics flow into aggregators
commands flow into state owners
results flow back to clients

When you think in flows, Go becomes very natural.

This is also why Go became popular in infrastructure software. Tools like Docker, Kubernetes, and Terraform are written in Go not only because Go compiles to a static binary, but also because its concurrency model fits the kind of problems infrastructure software needs to solve.

Infrastructure software is full of concurrent work:

watching state
reconciling resources
handling network calls
streaming logs
scheduling tasks
managing timeouts
coordinating workers

CSP-style thinking gives these systems a clean structure.

Final thoughts

Tony Hoare’s CSP paper is almost 50 years old, but the idea still feels modern.

That is rare in software.

Many technologies become outdated quickly, but good mental models survive.

Go did not invent the idea of communicating sequential processes. But Go made the idea practical for everyday engineers.

That is the beauty of Go’s concurrency model.

It takes a deep computer science concept and gives us a small set of simple tools:

go func()
chan
select

The tools are simple.

The design thinking behind them is powerful.

For me, the biggest lesson is this:

Concurrency becomes easier when we stop thinking only about shared state and start thinking about ownership, communication, and flow.

That is the real value of CSP in Go.

Not just fewer mutexes.

Better architecture.

References

Tony Hoare — Communicating Sequential Processes (1978)
Rob Pike — Go Concurrency Patterns
Rob Pike — Concurrency Is Not Parallelism
Go Blog — Share Memory By Communicating

How sync.Pool Helped Me Stabilize p99 Latency in a High-Throughput Log Processing Pipeline

amir — Fri, 29 May 2026 09:48:11 +0000

A while ago, I was working on a log processing system that was very similar in spirit to Sentry.

The system was responsible for receiving events from different services, parsing JSON payloads, normalizing metadata, enriching logs with extra context, and then pushing the final events into storage and downstream queues.

At first, everything looked fine.

The code was clean. The architecture was simple. The throughput was acceptable in local testing.

But when the traffic increased, something interesting happened:

The average latency was still okay, but the p99 latency started to become unstable.

That was the first sign that the bottleneck was not just CPU or database performance. Something deeper was happening inside the runtime.

In this article, I want to explain how I found the problem, why the system became slower under load, and how using sync.Pool helped reduce allocation pressure, make the garbage collector work less, and keep latency more stable.

This is not a magical optimization. It is a very specific tool for a very specific kind of problem.

But when you are building high-throughput systems, especially systems that process JSON, logs, network payloads, or temporary buffers, understanding object pooling can make a big difference.

The System I Was Building

The service was a log ingestion pipeline.

The flow looked something like this:

Client / SDK
   ↓
HTTP ingestion API
   ↓
JSON decode
   ↓
Validation
   ↓
Normalization
   ↓
Enrichment
   ↓
Batching
   ↓
Queue / Storage

Each incoming request contained one or more log events.

A simplified event looked like this:

{
  "service": "payment-api",
  "level": "error",
  "message": "failed to charge customer",
  "timestamp": "2026-05-28T12:40:00Z",
  "trace_id": "6f9c9b9d7a1a",
  "metadata": {
    "customer_id": "cus_123",
    "region": "eu-west",
    "retry_count": 2
  }
}

For every request, the service had to:

read the request body
decode JSON
create internal event structs
normalize fields
create temporary buffers
encode the final payload again
send it to another component

This kind of workload creates many short-lived objects.

And in Go, short-lived objects are usually fine — until you create too many of them in a hot path.

The First Symptom: Average Latency Looked Fine, p99 Did Not

At low traffic, the service looked healthy.

Something like this:

Requests/sec:        2,000
Average latency:     8ms
p95 latency:         18ms
p99 latency:         35ms
CPU usage:           Normal
Memory usage:        Stable

But when the traffic increased, the picture changed:

Requests/sec:        15,000+
Average latency:     16ms
p95 latency:         90ms
p99 latency:         280ms - 400ms
CPU usage:           Higher than expected
Memory usage:        Sawtooth pattern
GC activity:         Frequent

The average latency was not terrible, but the tail latency was bad.

And for this kind of system, p99 matters a lot.

If a log processing system becomes slow during incidents, it creates a very bad situation: when you need observability the most, your observability pipeline becomes the bottleneck.

That is exactly the kind of failure mode I wanted to avoid.

Why p99 Latency Matters More Than Average Latency

Average latency can hide real production problems.

For example, imagine this:

95 requests finish in 10ms
4 requests finish in 50ms
1 request finishes in 500ms

The average is 16.5ms, which still looks acceptable.

But that one slow request sits in the tail that p99 is designed to expose.

In high-throughput backend systems, p99 latency usually tells a more honest story than average latency.
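
To put numbers on the 100-request example, here is a small sketch that computes the mean and a few percentiles. It assumes the nearest-rank percentile definition, which is only one of several in use. The helper names are hypothetical.

```go
package main

import (
    "fmt"
    "sort"
)

// sampleLatencies builds the 100-request example in milliseconds:
// 95 requests at 10ms, 4 at 50ms, and 1 at 500ms.
func sampleLatencies() []float64 {
    var ms []float64
    for i := 0; i < 95; i++ {
        ms = append(ms, 10)
    }
    for i := 0; i < 4; i++ {
        ms = append(ms, 50)
    }
    return append(ms, 500)
}

func mean(ms []float64) float64 {
    sum := 0.0
    for _, v := range ms {
        sum += v
    }
    return sum / float64(len(ms))
}

// percentile uses the nearest-rank definition: the value at position
// ceil(p/100 * n) in the sorted samples. p must be between 1 and 100.
func percentile(ms []float64, p int) float64 {
    sorted := append([]float64(nil), ms...)
    sort.Float64s(sorted)

    rank := (p*len(sorted) + 99) / 100 // integer ceiling
    return sorted[rank-1]
}

func main() {
    ms := sampleLatencies()
    fmt.Printf("mean=%.1fms p50=%.0fms p99=%.0fms max=%.0fms\n",
        mean(ms), percentile(ms, 50), percentile(ms, 99), percentile(ms, 100))
}
```

With these samples the mean is 16.5ms and the median is 10ms. Under nearest rank, p99 is 50ms and the 500ms request is the maximum. The mean hides a request that took fifty times longer than the typical one, which is why tail metrics are worth watching.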

When p99 starts jumping under load, it usually means some part of the system occasionally blocks, pauses, waits, or does too much work.

In my case, one of the main causes was allocation pressure and frequent garbage collection.

The Initial Code: Simple, But Allocation Heavy

The first version of the code was easy to read.

For every request, I created new buffers and temporary objects.

Something like this:

package main

import (
    "bytes"
    "encoding/json"
    "io"
    "net/http"
)

type LogEvent struct {
    Service   string                 `json:"service"`
    Level     string                 `json:"level"`
    Message   string                 `json:"message"`
    Timestamp string                 `json:"timestamp"`
    TraceID   string                 `json:"trace_id"`
    Metadata  map[string]interface{} `json:"metadata"`
}

func ingestHandler(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, "failed to read body", http.StatusBadRequest)
        return
    }

    var events []LogEvent
    if err := json.Unmarshal(body, &events); err != nil {
        http.Error(w, "invalid json", http.StatusBadRequest)
        return
    }

    normalized := make([]LogEvent, 0, len(events))

    for _, event := range events {
        normalized = append(normalized, normalizeEvent(event))
    }

    var buf bytes.Buffer
    if err := json.NewEncoder(&buf).Encode(normalized); err != nil {
        http.Error(w, "failed to encode events", http.StatusInternalServerError)
        return
    }

    // sendToQueue(buf.Bytes())

    w.WriteHeader(http.StatusAccepted)
}

func normalizeEvent(event LogEvent) LogEvent {
    if event.Level == "" {
        event.Level = "info"
    }

    if event.Metadata == nil {
        event.Metadata = make(map[string]interface{})
    }

    return event
}

At first glance, this is not bad code.

Actually, for many applications, this is completely fine.

But under heavy load, the problem was that this path was executed thousands of times per second.

That means the service was constantly creating:

new byte slices
new buffers
new maps
new event slices
new encoder objects
new temporary JSON structures

Most of these objects were short-lived.

They were created, used for a few milliseconds, and then became garbage.

The garbage collector had to clean them again and again.

The Real Problem: Too Many Temporary Objects

In Go, allocation itself is not always expensive.

The real cost often appears later.

Every object that escapes to the heap becomes something the garbage collector may need to track.

When the system creates too many temporary objects, the GC has more work to do.

That can lead to:

more frequent GC cycles
more CPU used by the runtime
less CPU available for actual request processing
latency spikes during high traffic
unstable p95 and p99 latency

This was exactly what I saw.

The service was not slow because the logic was complex.

It was slow because the hot path was creating too much garbage.

That is an important distinction.

Sometimes performance problems are not caused by bad algorithms. Sometimes they are caused by too much memory churn.

How I Confirmed the Problem

Before changing anything, I wanted to confirm the source of the issue.

I checked runtime metrics and profiling data.

The signs were clear:

high allocation rate
frequent GC cycles
bytes.Buffer allocations in the hot path
JSON processing allocations
temporary slices created per request
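
One rough way to watch allocation rate and GC cycles from inside a process is runtime.ReadMemStats. The sketch below is not the tooling used in the service. The allocStats and churn helpers are hypothetical, and churn only imitates a hot path that creates a fresh buffer per iteration.

```go
package main

import (
    "bytes"
    "fmt"
    "runtime"
)

// allocStats reports how many heap objects were allocated and how many
// GC cycles completed while fn ran.
func allocStats(fn func()) (mallocs uint64, gcCycles uint32) {
    var before, after runtime.MemStats

    runtime.ReadMemStats(&before)
    fn()
    runtime.ReadMemStats(&after)

    return after.Mallocs - before.Mallocs, after.NumGC - before.NumGC
}

// sink keeps results reachable so the allocations are not optimized away.
var sink []byte

// churn is a hypothetical stand-in for a per-request hot path that
// creates a new buffer every time.
func churn(n int) {
    for i := 0; i < n; i++ {
        var buf bytes.Buffer
        buf.WriteString("temporary payload")
        sink = buf.Bytes()
    }
}

func main() {
    mallocs, cycles := allocStats(func() { churn(100000) })
    fmt.Printf("heap objects allocated: %d, GC cycles: %d\n", mallocs, cycles)
}
```

ReadMemStats briefly stops the world, so it suits ad hoc checks better than per-request calls. For continuous monitoring, pprof heap profiles are usually a better fit.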

A simplified benchmark showed the same pattern.

func BenchmarkIngestWithoutPool(b *testing.B) {
    payload := generatePayload(100)

    b.ReportAllocs()
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        _, err := processLogsWithoutPool(payload)
        if err != nil {
            b.Fatal(err)
        }
    }
}

The result looked like this in my local benchmark:

BenchmarkIngestWithoutPool-10       8200    145000 ns/op    128 KB/op    420 allocs/op

The exact numbers are not the important part.

The important part was the pattern:

allocs/op was high
bytes/op was high
GC activity increased with throughput
p99 latency became unstable under pressure

That told me the optimization target was not just request logic.

The target was allocation behavior.

Where `sync.Pool` Fits

sync.Pool is a temporary object pool provided by Go.

It allows you to reuse objects instead of allocating new ones every time.

A good mental model is this:

Instead of creating and throwing away the same type of object thousands of times per second, you keep reusable objects in a pool.

You get one when you need it.

You reset it.

You use it.

Then you put it back.

Get  →  Reset  →  Use  →  Put back

This can reduce pressure on the garbage collector because fewer temporary objects are allocated in the hot path.

But there is an important warning:

sync.Pool is not a cache.

Objects inside the pool can be removed by the garbage collector at any time.

So you should not use it to store important state.

It is best for temporary, reusable objects like:

bytes.Buffer
[]byte buffers
temporary encoders
scratch objects
serialization helpers

That made it a good fit for my log processing pipeline.

The Improved Version: Reusing Buffers

The first improvement was to reuse bytes.Buffer objects.

Instead of creating a new buffer for every request, I created a pool:

package main

import (
    "bytes"
    "encoding/json"
    "io"
    "net/http"
    "sync"
)

var bufferPool = sync.Pool{
    New: func() any {
        return new(bytes.Buffer)
    },
}

type LogEvent struct {
    Service   string                 `json:"service"`
    Level     string                 `json:"level"`
    Message   string                 `json:"message"`
    Timestamp string                 `json:"timestamp"`
    TraceID   string                 `json:"trace_id"`
    Metadata  map[string]interface{} `json:"metadata"`
}

func ingestHandler(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, "failed to read body", http.StatusBadRequest)
        return
    }

    var events []LogEvent
    if err := json.Unmarshal(body, &events); err != nil {
        http.Error(w, "invalid json", http.StatusBadRequest)
        return
    }

    normalized := make([]LogEvent, 0, len(events))

    for _, event := range events {
        normalized = append(normalized, normalizeEvent(event))
    }

    buf := bufferPool.Get().(*bytes.Buffer)
    buf.Reset()

    defer func() {
        buf.Reset()
        bufferPool.Put(buf)
    }()

    if err := json.NewEncoder(buf).Encode(normalized); err != nil {
        http.Error(w, "failed to encode events", http.StatusInternalServerError)
        return
    }

    // Important:
    // If the downstream function stores this data or uses it asynchronously,
    // copy it before returning the buffer to the pool.
    payload := append([]byte(nil), buf.Bytes()...)

    // sendToQueue(payload)
    _ = payload

    w.WriteHeader(http.StatusAccepted)
}

func normalizeEvent(event LogEvent) LogEvent {
    if event.Level == "" {
        event.Level = "info"
    }

    if event.Metadata == nil {
        event.Metadata = make(map[string]interface{})
    }

    return event
}

This change looks small, but it matters under load.

The buffer is no longer allocated from scratch for every request.

Instead, the service reuses an existing buffer.

That means fewer allocations, less memory churn, and less GC pressure.

The Most Important Rule: Never Reuse Dirty Objects

When using a pool, always reset the object before reusing it.

This is critical.

For bytes.Buffer, call:

buf.Reset()

For a slice, reset it like this:

items = items[:0]

For a struct, clear the fields manually or create a reset method:

type EventBuilder struct {
    service string
    level   string
    message string
    fields  map[string]string
}

func (b *EventBuilder) Reset() {
    b.service = ""
    b.level = ""
    b.message = ""

    for k := range b.fields {
        delete(b.fields, k)
    }
}

If you forget to reset objects properly, you can accidentally leak data between requests.

That is not just a performance bug.

In a log processing system, it can become a serious correctness or security issue.

Imagine one customer's metadata appearing inside another customer's event because a pooled object was not cleaned correctly.

That is why object pooling should be used carefully.
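
For the slice case, a minimal sketch could look like the following. It assumes a pool of *[]byte, a common pattern that avoids an extra allocation when the slice is handed to Put. The names scratchPool and encodeLine are hypothetical.

```go
package main

import (
    "fmt"
    "sync"
)

// scratchPool stores *[]byte rather than []byte so that Put does not
// need to allocate in order to box the slice header.
var scratchPool = sync.Pool{
    New: func() any {
        b := make([]byte, 0, 4096)
        return &b
    },
}

// encodeLine appends "key=value\n" using a pooled scratch slice and
// returns a copy that the caller owns.
func encodeLine(key, value string) []byte {
    bp := scratchPool.Get().(*[]byte)
    b := (*bp)[:0] // reset the length, keep the capacity

    b = append(b, key...)
    b = append(b, '=')
    b = append(b, value...)
    b = append(b, '\n')

    // Copy before the scratch slice goes back to the pool.
    out := append([]byte(nil), b...)

    *bp = b[:0] // keep the possibly grown backing array, emptied
    scratchPool.Put(bp)

    return out
}

func main() {
    fmt.Printf("%q\n", encodeLine("region", "eu-west"))
    fmt.Printf("%q\n", encodeLine("level", "error"))
}
```

Note that resetting the length does not erase the old bytes in the backing array. They simply become unreachable through the slice, which is another reason not to hand pooled memory to code you do not control.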

A Better Pool for Reusable Event Builders

In my case, buffers were only one part of the problem.

The pipeline also had a temporary event builder used during normalization and enrichment.

A simplified version looked like this:

type EventBuilder struct {
    Service string
    Level   string
    Message string
    TraceID string
    Fields  map[string]string
}

func NewEventBuilder() *EventBuilder {
    return &EventBuilder{
        Fields: make(map[string]string, 16),
    }
}

Creating this for every event caused extra allocations, especially because of the map.

So I moved it to a pool as well:

var eventBuilderPool = sync.Pool{
    New: func() any {
        return &EventBuilder{
            Fields: make(map[string]string, 16),
        }
    },
}

type EventBuilder struct {
    Service string
    Level   string
    Message string
    TraceID string
    Fields  map[string]string
}

func (b *EventBuilder) Reset() {
    b.Service = ""
    b.Level = ""
    b.Message = ""
    b.TraceID = ""

    for k := range b.Fields {
        delete(b.Fields, k)
    }
}

func buildEvent(raw LogEvent) *EventBuilder {
    builder := eventBuilderPool.Get().(*EventBuilder)
    builder.Reset()

    builder.Service = raw.Service
    builder.Level = raw.Level
    builder.Message = raw.Message
    builder.TraceID = raw.TraceID

    for k, v := range raw.Metadata {
        builder.Fields[k] = stringify(v)
    }

    return builder
}

func releaseEventBuilder(builder *EventBuilder) {
    builder.Reset()
    eventBuilderPool.Put(builder)
}

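func stringify(v interface{}) string {
    // Needs "fmt" and "strconv" in the import block.
    switch value := v.(type) {
    case string:
        return value
    case float64:
        // encoding/json decodes JSON numbers into float64 when the target
        // is interface{}. Precision -1 prints 2 as "2", not "2.000000".
        return strconv.FormatFloat(value, 'f', -1, 64)
    case int:
        return strconv.Itoa(value)
    case bool:
        return strconv.FormatBool(value)
    default:
        return fmt.Sprintf("%v", value)
    }
}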

The idea is simple:

Do not allocate a new builder for every event.
Reuse a builder.
Clean it carefully.
Return it to the pool.

This helped reduce the number of allocations per event.

But again, this requires discipline.

If the builder is still used somewhere else, do not put it back into the pool.

Only return an object to the pool when you are 100% sure nobody else will use it.
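
One way to keep that guarantee is to make sure nothing that leaves the function points back into the pooled object. The sketch below is a simplified, hypothetical variant of the builder. The describe function borrows a builder, derives an owned string from it, and releases it before returning.

```go
package main

import (
    "fmt"
    "sync"
)

// EventBuilder is a trimmed-down version of the builder for this sketch.
type EventBuilder struct {
    Service string
    Level   string
    Fields  map[string]string
}

func (b *EventBuilder) Reset() {
    b.Service = ""
    b.Level = ""
    for k := range b.Fields {
        delete(b.Fields, k)
    }
}

var builderPool = sync.Pool{
    New: func() any {
        return &EventBuilder{Fields: make(map[string]string, 16)}
    },
}

// describe returns a new string, so the caller never holds a reference
// to the pooled builder or its map.
func describe(service, level string, meta map[string]string) string {
    b := builderPool.Get().(*EventBuilder)
    b.Reset()

    defer func() {
        b.Reset()
        builderPool.Put(b)
    }()

    b.Service = service
    b.Level = level
    for k, v := range meta {
        b.Fields[k] = v
    }
    if b.Level == "" {
        b.Level = "info"
    }

    return fmt.Sprintf("%s level=%s fields=%d", b.Service, b.Level, len(b.Fields))
}

func main() {
    fmt.Println(describe("payment-api", "", map[string]string{"region": "eu-west"}))
    fmt.Println(describe("auth-api", "error", nil))
}
```

The second call reports zero fields whether or not it receives the same builder as the first, because the builder is reset on both sides of the pool.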

A Common Bug: Returning a Buffer Too Early

This is one of the most dangerous mistakes with sync.Pool.

Bad example:

buf := bufferPool.Get().(*bytes.Buffer)
buf.Reset()

defer bufferPool.Put(buf)

json.NewEncoder(buf).Encode(event)

sendAsync(buf.Bytes()) // dangerous

This is dangerous because sendAsync may use the byte slice later, after the buffer has already been returned to the pool.

Another request can get the same buffer and overwrite the data.

The result can be corrupted payloads.

The safer version is:

buf := bufferPool.Get().(*bytes.Buffer)
buf.Reset()

defer func() {
    buf.Reset()
    bufferPool.Put(buf)
}()

json.NewEncoder(buf).Encode(event)

payload := append([]byte(nil), buf.Bytes()...)
sendAsync(payload)

Yes, this copy creates an allocation.

But correctness is more important.

Pooling should not create hidden data races or corrupted messages.

The goal is not to remove every allocation from the system.

The goal is to remove unnecessary allocations safely.

Benchmark Before and After

After applying pooling to the hottest temporary objects, the benchmark improved.

The simplified benchmark before pooling:

BenchmarkIngestWithoutPool-10       8200    145000 ns/op    128 KB/op    420 allocs/op

After pooling reusable buffers and temporary event builders:

BenchmarkIngestWithPool-10         13500     89000 ns/op     62 KB/op    210 allocs/op

In this benchmark, the improvement was roughly:

~38% lower processing time per operation
~51% lower memory usage per operation
~50% fewer allocations per operation

But the real win was visible under load.

Before:

Requests/sec:        15,000+
Average latency:     16ms
p95 latency:         90ms
p99 latency:         280ms - 400ms
GC cycles:           Frequent

After:

Requests/sec:        15,000+
Average latency:     11ms
p95 latency:         42ms
p99 latency:         95ms - 140ms
GC cycles:           Reduced

These numbers are from a simplified internal benchmark scenario, not a universal promise.

Your results will depend on payload size, CPU, memory, JSON structure, traffic pattern, and how many allocations exist in your hot path.

But the direction was clear:

less allocation pressure
less GC work
more stable tail latency
better throughput under pressure

That was exactly what the system needed.

Why This Works Well for Log Processing Systems

Log processing systems are a good use case for pooling because they usually process many similar objects repeatedly.

For example:

network buffers
JSON payload buffers
temporary event builders
batch buffers
compression buffers
serialization buffers

The pattern repeats thousands of times per second.

That makes object reuse valuable.

In a normal CRUD API, this optimization may not matter.

But in ingestion systems, queues, observability pipelines, proxies, gateways, and stream processors, small allocation costs can become very large at scale.

This is where senior-level performance work usually starts:

Not by guessing.

Not by making the code complex immediately.

But by measuring the system, finding the hot path, and reducing unnecessary work where it actually matters.

When Not to Use `sync.Pool`

sync.Pool is powerful, but it is not something I use everywhere.

I avoid it when:

the object is very small
the code is not in a hot path
allocation rate is already low
pooling makes the code harder to understand
the object contains sensitive data and cleanup is risky
ownership is unclear

For example, pooling a tiny struct in a low-traffic admin endpoint is probably useless.

It makes the code more complex without improving the system.

That is not good engineering.

Good performance engineering is not about adding clever tricks everywhere.

It is about knowing where the system actually pays the cost.

Practical Rules I Follow

Here are the rules I personally follow when using sync.Pool:

1. Measure first

Do not add pooling just because it looks professional.

Check allocations with benchmarks:

b.ReportAllocs()

Run the benchmarks with allocation statistics enabled:

go test -bench=. -benchmem

For production-like analysis, use pprof and runtime metrics.

2. Pool only hot-path temporary objects

Good candidates:

bytes.Buffer
large []byte slices
temporary builders
compression buffers
serialization helpers

Bad candidates:

business state
long-lived objects
request-specific objects still used asynchronously
objects with unclear ownership

3. Always reset before reuse

Never trust the object you get from the pool.

Clean it before using it.

Clean it before putting it back.

4. Be careful with references

Do not return an object to the pool while another goroutine, function, or queue still references it.

This is especially important with:

[]byte
bytes.Buffer
maps
slices
pointers

5. Do not use `sync.Pool` as a cache

The garbage collector can clear pooled objects.

So never depend on the pool for correctness.

The system must work even if the pool is empty.
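
Rules 3 and 5 can be folded into a pair of small helpers. This sketch deliberately leaves the pool's New function unset to show that Get may return nil and that the code still works when it does. The getBuffer, putBuffer, and render names are hypothetical.

```go
package main

import (
    "bytes"
    "fmt"
    "sync"
)

// pool has no New function, so Get returns nil whenever it is empty.
var pool sync.Pool

// getBuffer never depends on the pool holding anything: an empty pool
// just means a fresh allocation.
func getBuffer() *bytes.Buffer {
    if v := pool.Get(); v != nil {
        buf := v.(*bytes.Buffer)
        buf.Reset() // never trust what comes out of the pool
        return buf
    }
    return new(bytes.Buffer)
}

// putBuffer cleans the buffer before handing it back.
func putBuffer(buf *bytes.Buffer) {
    buf.Reset()
    pool.Put(buf)
}

// render builds a string with a pooled buffer. buf.String() copies the
// bytes, so the result stays valid after the buffer is returned.
func render(msg string) string {
    buf := getBuffer()
    defer putBuffer(buf)

    buf.WriteString("msg=")
    buf.WriteString(msg)
    return buf.String()
}

func main() {
    fmt.Println(render("first"))
    fmt.Println(render("second"))
}
```

Keeping Get and Put behind helpers like these means the reset logic lives in one place instead of being repeated at every call site.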

The Senior Engineering Lesson

The biggest lesson for me was this:

Performance problems are not always about slow code.

Sometimes they are about how much temporary work the code creates.

In my log processing service, the business logic was not very complicated.

The real issue was that every request created too many short-lived objects.

At low traffic, this was invisible.

At high traffic, it became a GC problem.

And once GC became more active, p99 latency became unstable.

Using sync.Pool did not magically make the system fast.

But it removed unnecessary pressure from the runtime.

That gave the service more breathing room under load.

The final architecture was still simple:

receive logs
parse JSON
normalize events
reuse temporary buffers
batch efficiently
send downstream

But the implementation became more careful about memory.

That is the difference between code that works and code that survives real production traffic.

Final Thoughts

sync.Pool is not something every Go application needs.

But if you are building high-throughput systems like:

log processors
observability pipelines
API gateways
stream processors
network services
JSON-heavy ingestion APIs

then object pooling is worth understanding.

The key is not to use it blindly.

The key is to measure first, identify allocation-heavy hot paths, and then reuse objects safely.

For my Sentry-like log processing pipeline, pooling buffers and temporary builders helped reduce allocations, lower GC pressure, and stabilize p99 latency under high load.

And in production systems, stable p99 latency is often more important than a beautiful average latency number.

Because users do not feel your average.

They feel the slow requests.

Why We Accidentally Blocked Our Users: A Deep Dive into Idempotency in Distributed Systems

amir — Thu, 28 May 2026 10:31:54 +0000

I learned one of my most important distributed-systems lessons the hard way.

We were working on a payment flow connected to an external payment gateway. On paper, the architecture looked solid: microservices, clean database transactions, retry logic, monitoring, and enough security checks to make us feel safe before deployment.

Then production reminded us that real users do not live inside clean architecture diagrams.

Support tickets started coming in. Some users could not complete payments. Some accounts were being blocked too aggressively. At first, it looked like suspicious behavior: multiple payment attempts, repeated payloads, and requests arriving only seconds apart.

But when I dug into the logs, the real problem was not fraud.

It was our own backend.

A user with a slow network clicked the Pay button, waited, saw nothing happen, and clicked again. In another case, the browser retried a request after a timeout. Our backend received multiple identical payment requests within a very short window, and our naive security logic treated them like duplicate transaction anomalies or replay attempts.

We were punishing users for having bad internet.

That incident changed the way I think about payment systems, retries, APIs, and side effects. In distributed systems, you cannot control the network. You cannot control the user's browser. You cannot guarantee that a response will reach the client.

But you must control what happens when the same intent reaches your backend more than once.

That is where idempotency becomes essential.

The Problem Is Not the Retry. The Problem Is the Side Effect.

Retries are normal.

Clients retry. Browsers retry. Mobile networks fail. Gateways timeout. Load balancers drop connections. Users double-click buttons. Background workers reprocess jobs. Message queues deliver the same message more than once.

The dangerous part is not receiving the same request multiple times.

The dangerous part is executing the same side effect multiple times.

For example:

charging a card twice
creating two orders
sending duplicate invoices
blocking a user after repeated attempts
reducing inventory multiple times
creating duplicate ledger entries
triggering the same notification several times

In backend engineering, especially around payments and financial workflows, the real question is not:

How do I stop duplicate requests?

The better question is:

How do I make duplicate requests safe?

What Idempotency Actually Means

In mathematics, an operation is idempotent when applying it multiple times produces the same result as applying it once.

In API design, some HTTP methods are naturally expected to be idempotent.

GET is safe and idempotent: it should not change state.

PUT is idempotent because replacing a resource with the same representation multiple times leaves the system in the same final state.

DELETE is also idempotent because deleting the same resource multiple times still means the resource does not exist, even if a repeated call returns a different status code such as 404.

But POST is different.

A POST /payments request usually means:

Create a new payment.

If the client sends the same request twice, the backend may create two payments unless we intentionally design against it.

That is the core problem.

A payment request represents an intent, not just an HTTP payload. If the user intended to pay once, the system should execute that intent once, even if the request arrives multiple times.

Enter the Idempotency Key

An idempotency key is a unique value generated by the client and sent with the request, usually as an HTTP header:

Idempotency-Key: 7f3f0f4c-98a4-4d8c-9b91-7a2f9e4c5d11

The key says:

This request represents one specific user action. If you see this key again, do not execute the action again. Return the result of the first execution.

That simple idea changes the system from:

I received another request, so I will process it again.

to:

I received the same intent again, so I will return the same result.

This is especially important for:

payments
order creation
wallet transfers
account provisioning
invoice generation
message publishing
background job processing
any workflow where duplicate execution is dangerous

A Good Idempotency Layer Is More Than a Boolean Flag

A common mistake is implementing idempotency like this:

if key exists:
    reject request
else:
    process request

That looks simple, but it is not enough for production systems.

A real implementation needs to answer several questions:

Is the request currently being processed?
Did the request complete successfully?
Should we return a cached response?
Did the request fail with a retryable error?
Is the same key being reused with a different payload?
What happens if two identical requests arrive at the exact same millisecond?

For that reason, I prefer thinking about idempotency as a small state machine.

The State Machine I Like to Use

A practical idempotency record can have these states:

STARTED
COMPLETED
FAILED

`STARTED`

The server received the key and started processing the request.

This state is important because it protects you from concurrent duplicates. If another request arrives with the same key while the first request is still running, the system should not execute the action again.

Usually, I return something like:

409 Conflict

with a response explaining that the request is already in progress.

`COMPLETED`

The operation finished successfully.

At this point, the server stores the final response and returns that same response for future requests with the same key.

This is the key behavior that makes retries safe.

`FAILED`

The operation failed with a server-side or retryable error.

This part depends on your business rules, but in many systems I prefer allowing the client to retry after a true internal failure. The important thing is to be explicit about which failures are cached and which failures are not.
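
The three states can be sketched as a small decision function. This is an illustration in Go rather than production code. The Record and Action types are hypothetical, and allowing a retry after FAILED is an assumption that matches the preference described above, not a universal rule.

```go
package main

import "fmt"

type State string

const (
    Started   State = "STARTED"
    Completed State = "COMPLETED"
    Failed    State = "FAILED"
)

// Record is a hypothetical idempotency record as it might be stored.
type Record struct {
    State      State
    StatusCode int
    Body       string
}

// Action is what the handler should do for an incoming request.
type Action string

const (
    Execute Action = "execute the operation"
    Replay  Action = "return the stored response"
    Reject  Action = "reject: already in progress"
)

// decide maps the existing record (nil when the key was never seen)
// to the action the handler should take.
func decide(rec *Record) Action {
    if rec == nil {
        return Execute
    }

    switch rec.State {
    case Started:
        return Reject // for example, 409 Conflict
    case Completed:
        return Replay
    case Failed:
        return Execute // assumption: retries are allowed after internal failures
    default:
        return Reject // unknown state: refuse rather than risk a duplicate
    }
}

func main() {
    fmt.Println(decide(nil))
    fmt.Println(decide(&Record{State: Started}))
    fmt.Println(decide(&Record{State: Completed, StatusCode: 201}))
    fmt.Println(decide(&Record{State: Failed}))
}
```

Keeping the decision in one pure function makes the business rule for failed requests easy to find and easy to test.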

Why Redis Works Well Here

You can implement idempotency using PostgreSQL with a unique constraint. In some systems, that is perfectly fine.

But in high-throughput APIs, I usually prefer Redis for the idempotency layer because it gives you:

fast key lookup
atomic operations
natural TTL support
simple distributed locking primitives
low overhead for temporary request metadata

The TTL part matters a lot.

Idempotency keys should not live forever. For many payment-style workflows, a 24-hour TTL is a reasonable starting point. Some systems may need shorter or longer retention depending on reconciliation, compliance, and product behavior.

The Race Condition You Must Handle

The biggest bug in naive idempotency implementations is the race condition.

Imagine two identical requests arrive at the same time:

Request A checks key -> key does not exist
Request B checks key -> key does not exist
Request A processes payment
Request B processes payment

Now you have two charges.

This is why the first write must be atomic.

In Redis, you can use SET with NX and EX:

const lockKey = `idempotency:${key}`;

const acquired = await redis.set(
  lockKey,
  JSON.stringify({
    state: "STARTED",
    payloadHash,
    createdAt: Date.now()
  }),
  "NX",
  "EX",
  86400
);

if (!acquired) {
  // Another request already created this key.
  // Now inspect its current state.
}

NX means "only set this key if it does not already exist."

That one detail is critical. It turns the check-and-set operation into one atomic step.
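To make the atomicity point concrete, here is a minimal sketch in Go. It uses an in-memory map guarded by a mutex as a stand-in for Redis, so `memoryStore` and `setNX` are hypothetical names for illustration and not a Redis client API. The only thing it tries to show is that when check-and-set is one step, exactly one of many concurrent callers wins.

```go
package main

import (
	"fmt"
	"sync"
)

// memoryStore is a hypothetical in-memory stand-in for Redis.
// It exists only to illustrate the semantics of SET ... NX.
type memoryStore struct {
	mu   sync.Mutex
	data map[string]string
}

func newMemoryStore() *memoryStore {
	return &memoryStore{data: make(map[string]string)}
}

// setNX stores value under key only if the key is absent.
// The check and the write happen under one lock, so they are atomic.
func (s *memoryStore) setNX(key, value string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.data[key]; exists {
		return false
	}
	s.data[key] = value
	return true
}

// countWinners fires n concurrent claims for the same key and
// reports how many of them acquired it.
func countWinners(n int) int {
	store := newMemoryStore()
	var wg sync.WaitGroup
	var mu sync.Mutex
	winners := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if store.setNX("idempotency:abc-123", "STARTED") {
				mu.Lock()
				winners++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return winners
}

func main() {
	fmt.Println("winners:", countWinners(100))
}
```

If the existence check and the write were two separate steps, several goroutines could pass the check before any of them wrote, which is the double-charge scenario described above.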

Returning the Cached Response

When the first request completes, we update the idempotency record:

await redis.set(
  `idempotency:${key}`,
  JSON.stringify({
    state: "COMPLETED",
    payloadHash,
    statusCode: 201,
    responseBody: {
      paymentId: "pay_123",
      status: "succeeded"
    },
    completedAt: Date.now()
  }),
  "EX",
  86400
);

Then, if the same request arrives again:

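const raw = await redis.get(`idempotency:${key}`);
// The key may have expired or never existed, so guard against null.
const record = raw ? JSON.parse(raw) : null;

if (record && record.state === "COMPLETED") {
  return res.status(record.statusCode).json(record.responseBody);
}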

The client gets a successful response, but the payment is not executed again.

That is the whole point.

The retry becomes harmless.

Payload Fingerprinting: The Edge Case Many People Miss

There is one subtle bug that can become very dangerous.

What if the client reuses the same idempotency key for a different request?

Example:

Key: abc-123
Amount: $10

Then later:

Key: abc-123
Amount: $1,000

If your server only checks the key, it may return the cached response for the old request. That can create serious data integrity problems.

The fix is payload fingerprinting.

When the request first arrives, create a stable hash of the meaningful request body:

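import crypto from "crypto";

// JSON.stringify is sensitive to key order, so canonicalize the payload
// (for example, by sorting keys) before hashing if clients may reorder fields.
function createPayloadHash(payload) {
  return crypto
    .createHash("sha256")
    .update(JSON.stringify(payload))
    .digest("hex");
}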

Then store that hash with the idempotency key.

On every retry, compare the incoming payload hash with the stored hash.

if (record.payloadHash !== incomingPayloadHash) {
  return res.status(400).json({
    error: "Idempotency key was reused with a different payload."
  });
}

This prevents key reuse bugs from silently corrupting your workflow.
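One detail worth illustrating is that the fingerprint only works if it is computed over a canonical form of the payload. Otherwise two logically identical requests with a different field order would produce different hashes. The following is a minimal sketch in Go, relying on the fact that `encoding/json` writes map keys in sorted order. `payloadHash` and `mustHash` are hypothetical helper names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// payloadHash returns a SHA-256 fingerprint of the payload.
// encoding/json marshals map keys in sorted order, so the same
// fields always produce the same bytes regardless of insertion order.
func payloadHash(payload map[string]any) (string, error) {
	canonical, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), nil
}

// mustHash is a small convenience wrapper for the example.
func mustHash(payload map[string]any) string {
	h, err := payloadHash(payload)
	if err != nil {
		panic(err)
	}
	return h
}

func main() {
	first := mustHash(map[string]any{"amount": 1000, "currency": "USD"})
	reordered := mustHash(map[string]any{"currency": "USD", "amount": 1000})
	changed := mustHash(map[string]any{"amount": 100000, "currency": "USD"})

	fmt.Println("same payload, same hash:", first == reordered)
	fmt.Println("different amount, different hash:", first != changed)
}
```

In a real service you would probably hash only the fields that define the user's intent and leave out volatile ones such as client timestamps.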

In my opinion, this is not optional for financial systems.

Be Careful with 4xx and 5xx Responses

Not every response should be cached the same way.

2xx responses

Usually safe to cache.

The operation succeeded, and future retries should return the same response.

5xx responses

This depends on where the failure happened.

If your service failed before executing the side effect, a retry may be safe.

If your service timed out after calling the payment gateway, you may not know whether the external side effect happened. In that case, you need reconciliation, gateway lookup, or a more careful state transition.

This is where many systems get complicated.

4xx responses

Be very careful.

If the user sent invalid input, you may not want to cache that failure forever. Maybe the user fixes the payload and retries. Maybe the frontend generated the key before validation was complete.

Personally, I do not like blindly caching all 4xx responses for long periods. For many product flows, it is better to reject the invalid request and ask the client to generate a new idempotency key after the user changes the input.

The important thing is to define this behavior intentionally.
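One way to make that behavior intentional is to write the policy down as a single function. This is a sketch in Go of the rules described above, not a universal recommendation. The `CachePolicy` type, its values, and the `sideEffectUncertain` flag are hypothetical names, and the handling of unexpected status codes is my own assumption.

```go
package main

import "fmt"

// CachePolicy is a hypothetical name for what the idempotency layer
// does with a response once the handler has finished.
type CachePolicy string

const (
	CacheAndReplay      CachePolicy = "cache-and-replay"
	AllowRetry          CachePolicy = "allow-retry"
	RequireNewKey       CachePolicy = "require-new-key"
	NeedsReconciliation CachePolicy = "needs-reconciliation"
)

// policyFor maps an HTTP status code to a caching decision.
// sideEffectUncertain should be true when the service cannot tell
// whether the external side effect happened, for example after a
// timeout while calling the payment gateway.
func policyFor(status int, sideEffectUncertain bool) CachePolicy {
	switch {
	case status >= 200 && status < 300:
		return CacheAndReplay
	case status >= 400 && status < 500:
		// Invalid input: do not replay it; the client should fix the
		// payload and send a new key.
		return RequireNewKey
	case status >= 500 && sideEffectUncertain:
		return NeedsReconciliation
	case status >= 500:
		return AllowRetry
	default:
		// Assumption: other codes are not expected here, so nothing is cached.
		return AllowRetry
	}
}

func main() {
	fmt.Println(201, policyFor(201, false))
	fmt.Println(422, policyFor(422, false))
	fmt.Println(500, policyFor(500, false))
	fmt.Println(504, policyFor(504, true))
}
```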

Client-Side Key Generation

The idempotency key should usually be generated on the client.

For example, in a checkout page, generate the key when the user starts a specific payment attempt and reuse that key for retries of the same attempt.

const idempotencyKey = crypto.randomUUID();

await fetch("/api/payments", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Idempotency-Key": idempotencyKey
  },
  body: JSON.stringify({
    amount: 1000,
    currency: "USD"
  })
});

If the server generates the key too late, it may not help with network failures between the client and server.

The client needs to be able to say:

I am retrying the same action.

That requires the client to own the key for that action.

Idempotency Is Not a Replacement for Database Integrity

One mistake I see sometimes is treating idempotency as the only protection layer.

It should not be.

Idempotency is part of the design, but you should still use database constraints and transactional boundaries where appropriate.

For example:

unique constraints for order numbers
unique transaction references
ledger constraints
gateway transaction IDs
consistent status transitions
outbox patterns for event publishing

In serious systems, correctness usually comes from layers.

The API idempotency layer prevents duplicate execution at the request boundary.

The database protects final state.

The message queue and worker design protect asynchronous processing.

The reconciliation process protects you when external systems behave unexpectedly.

A Simple Middleware Flow

A clean idempotency middleware can follow this flow:

1. Read Idempotency-Key from the request.
2. Validate that the key is present for dangerous POST endpoints.
3. Create a hash of the request payload.
4. Atomically create a STARTED record with Redis SET NX EX.
5. If the key already exists:
   - compare payload hash
   - if STARTED, return 409 Conflict
   - if COMPLETED, return cached response
   - if FAILED, allow or reject retry based on policy
6. Execute the business operation.
7. Store the final response as COMPLETED.
8. Return the response to the client.

This flow is simple enough to understand, but strong enough to avoid many production problems.
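Step 5 is where most of the branching lives, so it can help to see it as one pure function. The sketch below is written in Go. The `Record` shape and the `Action` names are hypothetical, and treating an unknown state conservatively is my own assumption.

```go
package main

import "fmt"

type State string

const (
	Started   State = "STARTED"
	Completed State = "COMPLETED"
	Failed    State = "FAILED"
)

// Record is a hypothetical shape for the stored idempotency entry.
type Record struct {
	State       State
	PayloadHash string
}

type Action string

const (
	Execute        Action = "execute"
	RejectMismatch Action = "reject-payload-mismatch"
	ReturnConflict Action = "return-409-conflict"
	ReplayResponse Action = "replay-cached-response"
	RetryPerPolicy Action = "retry-per-policy"
)

// decide covers step 5 of the flow. existing is nil when the atomic
// create in step 4 succeeded, meaning this request owns the key.
func decide(existing *Record, incomingHash string) Action {
	if existing == nil {
		return Execute
	}
	// The payload check comes first so a reused key never replays
	// a response that belongs to a different request.
	if existing.PayloadHash != incomingHash {
		return RejectMismatch
	}
	switch existing.State {
	case Started:
		return ReturnConflict
	case Completed:
		return ReplayResponse
	case Failed:
		return RetryPerPolicy
	default:
		// Assumption: an unknown state is handled conservatively.
		return ReturnConflict
	}
}

func main() {
	done := &Record{State: Completed, PayloadHash: "h1"}
	fmt.Println(decide(nil, "h1"))
	fmt.Println(decide(done, "h1"))
	fmt.Println(decide(done, "h2"))
}
```

Keeping this logic free of Redis and HTTP calls makes it easy to unit test every branch.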

Practical Rules I Follow Now

After dealing with this in real systems, these are the rules I try to follow:

Any endpoint that creates money movement must support idempotency.
Any endpoint that creates orders, invoices, or irreversible side effects should support idempotency.
The key should represent one user intent, not one HTTP request.
The first write for the idempotency key must be atomic.
Store and compare a payload hash.
Cache successful responses.
Be very careful when caching failures.
Use TTL intentionally.
Do not rely only on frontend button disabling.
Do not use rate limiting as a replacement for idempotency.

That last point is important.

Rate limiting protects infrastructure.

Idempotency protects correctness.

They are not the same thing.

The Real Lesson

The day we accidentally blocked users was not just a payment bug. It was a design lesson.

Our system was technically trying to protect users, but because we did not model retries correctly, we created a worse experience for legitimate customers.

Distributed systems are messy. Networks fail. Clients retry. Users double-click. Gateways time out. Workers reprocess jobs.

A mature backend does not pretend these things will not happen.

A mature backend absorbs them.

Idempotency is one of those patterns that looks simple from the outside, but when you implement it properly, it changes the reliability of the whole system.

For me, the biggest mindset shift was this:

Do not fight duplicate requests. Make duplicate requests safe.

That is the difference between a backend that works in ideal conditions and a backend that survives production.

Go Modules in Practice: Init, Tidy, Vendor, and Publishing Packages

amir — Thu, 28 May 2026 09:35:32 +0000

As a backend engineer, I have worked on many services where the hard part was not only writing the code. The hard part was keeping the project clean, reproducible, easy to build, and safe to maintain as the team and codebase grew.

In Go, a big part of that discipline comes from understanding Go Modules.

At first, Go Modules may look simple: a go.mod file, a go.sum file, and a few commands like go mod init and go mod tidy. But in real projects, these small tools decide how your service builds in CI, how your dependencies are verified, how private repositories are handled, and how other developers can use your package.

In this article, I want to explain Go Modules in a practical way, from the mindset of someone building production backend systems.

We will cover:

What Go Modules are and why Go does not work like npm or pip
How go mod init, go mod tidy, and go mod vendor actually help
How I think about Go project structure without over-engineering
How to publish your own Go package
A few production tips that matter in real teams

What Are Go Modules?

A Go module is a versioned collection of Go packages.

In simple words, it is the boundary of your project. It tells Go:

what your project is called
which Go version it targets
which dependencies it needs
which versions of those dependencies should be used

A Go module is defined by the go.mod file.

Before Go Modules, Go projects were commonly managed inside GOPATH. That worked, but it created friction around dependency versions and project location. Go Modules solved that by making dependency management explicit and project-based.

Today, when I start a serious Go project, one of the first things I do is initialize a module.

Why Go Packages Feel Different From npm or pip

If you come from JavaScript or Python, Go package management may feel a little strange at first.

In Node.js, packages are usually published to the npm registry.

In Python, packages are usually published to PyPI.

Go is different.

Go uses the module path as an import path, and that path is usually a version control URL, for example:

import "github.com/google/uuid"

This means Go can resolve packages directly from places like GitHub, GitLab, or Bitbucket.

That is why many Go packages look like URLs. The module path is not just a name. It is also the identity of the package.

This design makes publishing very simple. In many cases, publishing a Go package is just:

push your code to GitHub
create a proper Git tag
let Go tooling resolve it

No separate package registry account is required for basic public modules.

Starting a Project With `go mod init`

Every Go module starts with this command:

go mod init <module-path>

For example:

mkdir my-awesome-project
cd my-awesome-project
go mod init github.com/yourusername/my-awesome-project

This creates a go.mod file:

module github.com/yourusername/my-awesome-project

go 1.22

The module path matters.

For local experiments, you can use something simple:

go mod init myapp

But for real packages, especially packages you may publish later, I prefer using the final repository path from the beginning:

go mod init github.com/yourusername/my-awesome-project

That prevents painful import-path changes later.

Adding Third-Party Packages

Let’s say I am building a small HTTP service and I want to use Gin:

package main

import (
    "net/http"

    "github.com/gin-gonic/gin"
)

func main() {
    r := gin.Default()

    r.GET("/ping", func(c *gin.Context) {
        c.JSON(http.StatusOK, gin.H{
            "message": "pong",
        })
    })

    if err := r.Run(); err != nil {
        panic(err)
    }
}

At this point, Go sees an external import:

"github.com/gin-gonic/gin"

Now the module needs to know about this dependency.

You can run:

go mod tidy

or directly:

go get github.com/gin-gonic/gin

In daily work, I use go mod tidy a lot because it does more than just add packages.

What `go mod tidy` Really Does

go mod tidy is one of the commands I run before almost every commit in a Go project.

It scans your code and tests, then updates go.mod and go.sum based on what is actually used.

It does three important things:

1. Adds Missing Dependencies

If your code imports a package but your module does not list it yet, go mod tidy adds it.

Example:

go mod tidy

After that, your go.mod may include something like:

require github.com/gin-gonic/gin v1.10.0

2. Removes Unused Dependencies

If you removed a package from your code but it is still listed in go.mod, go mod tidy cleans it up.

This is important because old dependencies are not harmless. They can increase build complexity, create security review noise, and confuse other engineers.

3. Updates `go.sum`

go.sum stores cryptographic checksums for module versions.

This helps Go verify that the dependency being downloaded is the same one expected by your project.

In production, this matters. Reproducible builds are not a luxury. They are part of engineering discipline.

My rule is simple:

If I changed imports, removed packages, upgraded dependencies, or touched tests, I run go mod tidy before committing.

Understanding `go.mod` and `go.sum`

A minimal go.mod file looks like this:

module github.com/yourusername/my-awesome-project

go 1.22

require github.com/gin-gonic/gin v1.10.0

This file is human-readable and should be committed.

go.sum is also committed. Some developers think go.sum is like a cache file and should be ignored. That is a mistake.

go.sum is part of dependency verification. go.mod selects the versions; go.sum records the expected checksum of each one, so machines, CI pipelines, and production builds can detect when downloaded content differs from what the project expects.

So in real Go projects, I commit both:

go.mod
go.sum

Vendoring Dependencies With `go mod vendor`

By default, Go downloads dependencies into the module cache on your machine.

But sometimes you want dependencies copied directly into your project. For that, Go provides:

go mod vendor

This creates a vendor/ directory and copies dependency source code into it.

Your project may look like this:

my-awesome-project/
├── go.mod
├── go.sum
├── main.go
└── vendor/

When Vendoring Makes Sense

I do not vendor dependencies in every project, but it can be useful in specific cases:

CI/CD environments with restricted internet access
companies with strict dependency review processes
projects that must build in isolated networks
teams that want dependency source code available inside the repository

For most normal public projects, vendoring is not required. Go’s module proxy and checksum database already solve many reliability problems for public dependencies.

But for enterprise systems, especially where builds must be predictable and auditable, vendoring can still be a valid choice.

If you use vendoring, build with:

go build -mod=vendor ./...

or test with:

go test -mod=vendor ./...

Project Structure: Start Simple

One of the mistakes I often see is copying project structures from other ecosystems.

A developer comes from Laravel, Django, NestJS, or Spring Boot and immediately creates folders like:

controllers/
services/
repositories/
models/
helpers/
utils/
managers/

This is not always wrong, but in Go it often becomes unnecessary complexity too early.

Go projects usually become cleaner when they start simple and grow based on real pressure.

For a small service, this can be enough:

my-project/
├── go.mod
├── go.sum
└── main.go

That structure is not “too simple”. It is honest.

You do not need architecture before you have a problem that needs architecture.

A Practical Production Structure

When the project grows, I usually move toward a structure like this:

my-project/
├── cmd/
│   └── api/
│       └── main.go
├── internal/
│   ├── config/
│   │   └── config.go
│   ├── database/
│   │   └── postgres.go
│   ├── http/
│   │   └── router.go
│   └── user/
│       ├── handler.go
│       ├── service.go
│       └── repository.go
├── pkg/
│   └── logger/
│       └── logger.go
├── go.mod
└── go.sum

Here is how I think about these folders:

`cmd/`

This contains application entry points.

For example:

cmd/api/main.go
cmd/worker/main.go
cmd/migrate/main.go

Each subdirectory represents a runnable program.

`internal/`

This is where most application code should live.

The important thing about internal/ is that the Go toolchain enforces its privacy. Code inside an internal/ directory can only be imported by code rooted at the parent of that internal/ directory, so other modules cannot import it.

That is powerful.

It means your business logic, database code, handlers, and internal services are protected from becoming accidental public APIs.

`pkg/`

I use pkg/ only for code that is intentionally reusable by other projects.

For example:

a logger package
a small validation package
a reusable client
shared utilities that are stable enough to expose

I do not put everything into pkg/. If the code is only for this application, it belongs in internal/.

Publishing Your Own Go Package

One thing I really like about Go is how easy it is to publish packages.

Let’s say I wrote a small package for PostgreSQL migration helpers and I want other developers to use it.

Step 1: Create the Module

mkdir pgmigrate
cd pgmigrate
go mod init github.com/yourusername/pgmigrate

Step 2: Write a Small Package

package pgmigrate

import "database/sql"

type Migration struct {
    Version int
    Name    string
    Up      string
    Down    string
}

func Apply(db *sql.DB, migration Migration) error {
    _, err := db.Exec(migration.Up)
    return err
}

Step 3: Add Tests

go test ./...

Even for small packages, tests matter. A package without tests is harder to trust.

Step 4: Push to GitHub

git add .
git commit -m "initial release"
git push origin main

Step 5: Tag a Version

Go Modules use semantic versioning.

Create a Git tag:

git tag v0.1.0
git push origin v0.1.0

Now other developers can use it:

go get github.com/yourusername/pgmigrate@v0.1.0

That is the beauty of Go package publishing. Git tags are releases.

Semantic Versioning in Go

Versioning is very important in Go.

A normal release tag looks like this:

v1.2.3

The meaning is:

MAJOR.MINOR.PATCH

PATCH: bug fixes
MINOR: new backward-compatible features
MAJOR: breaking changes

One Go-specific detail is important: if your module reaches v2 or higher, the module path must include the major version.

Example:

module github.com/yourusername/pgmigrate/v2

And users import it like this:

import "github.com/yourusername/pgmigrate/v2"

This may feel strange in the beginning, but it makes breaking changes explicit.
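The rule is mechanical enough to express in a few lines. The sketch below is a simplified illustration in Go: `modulePathFor` is a hypothetical helper, it does not validate the full semantic version, and it ignores special cases such as gopkg.in paths.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// modulePathFor returns the module path that a release tag implies.
// v0 and v1 use the bare path; v2 and above append a /vN suffix.
func modulePathFor(basePath, tag string) (string, error) {
	if !strings.HasPrefix(tag, "v") {
		return "", fmt.Errorf("tag %q must start with v", tag)
	}
	// Take everything between the leading "v" and the first dot.
	majorPart, _, _ := strings.Cut(strings.TrimPrefix(tag, "v"), ".")
	major, err := strconv.Atoi(majorPart)
	if err != nil {
		return "", fmt.Errorf("tag %q has no numeric major version: %w", tag, err)
	}
	if major < 2 {
		return basePath, nil
	}
	return fmt.Sprintf("%s/v%d", basePath, major), nil
}

func main() {
	for _, tag := range []string{"v0.1.0", "v1.2.3", "v2.0.0"} {
		path, err := modulePathFor("github.com/yourusername/pgmigrate", tag)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(tag, "->", path)
	}
}
```

In practice this means that a v2 release involves changing the module line in go.mod as well as creating the tag.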

Working With Private Modules

In real companies, not all Go modules are public.

If your company has private repositories, you should configure GOPRIVATE.

Example:

export GOPRIVATE=github.com/mycompany/*

This tells Go not to use the public proxy or checksum database for those module paths.

For private GitHub modules, your machine or CI runner also needs Git authentication configured correctly, usually through SSH keys or tokens.

In CI/CD, this is one of the first things I check when a build fails with private dependencies.

A common symptom is:

terminal prompts disabled

or:

repository not found

The dependency may exist, but the CI runner does not have permission to access it.

Local Multi-Module Development With `go work`

Sometimes I work on two Go modules at the same time.

For example:

workspace/
├── api-service/
└── shared-lib/

Before Go workspaces, developers often used replace in go.mod:

replace github.com/mycompany/shared-lib => ../shared-lib

This works, but it is easy to accidentally commit local paths.

A better option for local development is go work.

Example:

# Run from the directory that already contains both modules.
cd workspace
go work init ./api-service ./shared-lib

This creates a go.work file that connects local modules during development.

I like this approach because it keeps local development clean without polluting the module definition.
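For reference, the go.work file generated for the two modules above typically looks like the following. The go version line reflects the local toolchain, so the value shown here is an assumption.

```
go 1.22

use (
	./api-service
	./shared-lib
)
```

Because the file describes a local directory layout, it is often left out of version control.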

My Production Checklist for Go Modules

Before I push Go code, I usually check these things:

go fmt ./...
go test ./...
go mod tidy

For larger services, I also check:

go vet ./...

And if vendoring is used:

go mod vendor
go test -mod=vendor ./...

This small routine prevents many annoying problems in CI.

My practical checklist is:

commit both go.mod and go.sum
run go mod tidy before pushing
avoid unnecessary dependencies
do not expose internal application code as public packages
use GOPRIVATE for private company modules
tag releases properly when publishing packages
be careful with replace directives before committing

Common Mistakes I See

Here are some mistakes I have seen many times in real projects.

Mistake 1: Ignoring `go.sum`

Do not ignore go.sum.

It is part of module verification and should be committed.

Mistake 2: Creating Too Many Folders Too Early

A complex folder structure does not automatically make a project clean.

In Go, simple structure is usually better until the codebase proves it needs more separation.

Mistake 3: Putting Everything in `pkg/`

pkg/ should not mean “all packages”.

Use it for code you intentionally want other modules to import. Otherwise, prefer internal/.

Mistake 4: Forgetting Private Module Configuration

If your dependency is private, the build environment must know that.

Use GOPRIVATE and make sure Git authentication works in CI.

Mistake 5: Committing Local `replace` Paths

This one is very common.

A local replace like this can break everyone else’s build:

replace github.com/mycompany/shared-lib => ../shared-lib

Use go work for local multi-module development when possible.

Final Thoughts

Go Modules are not just a dependency tool. They are part of how Go projects stay clean, buildable, and maintainable.

For me, the main lesson is this:

Keep the module simple, keep dependencies intentional, and let the project structure grow only when the project really needs it.

Start with go mod init.

Use go mod tidy as a habit.

Understand when vendor/ is useful.

Use internal/ to protect your application code.

Tag releases properly when you publish packages.

These small habits make a big difference when your Go project moves from a local experiment to a production service used by real users.

Happy coding.

Mastering Context in Go: An Engineer’s Playbook for Lifecycle Management

amir — Tue, 26 May 2026 18:39:04 +0000

When you are architecting backend systems, distributed architectures, and microservices, one of the biggest challenges is not just making your code run fast.

It is knowing exactly when to stop it.

Throughout my years working with backend systems and Go’s concurrency model, I have seen how ignoring a seemingly simple concept can slowly create serious production problems: goroutine leaks, wasted CPU cycles, exhausted memory, hanging requests, and services that become harder to reason about under load.

In Go, the context package is not just an annoying extra parameter we are forced to pass around.

It is the lifecycle control system of a request.

In this article, I want to skip the boring textbook definition and look at context from a practical engineering perspective:

When does context save your system, and when can it become an architectural trap?

Why Context Exists: Beyond a Simple Timeout

In a distributed architecture, a single incoming HTTP request might trigger a cascade of operations:

an authentication check
a database query
a Redis lookup
a call to another internal service
a gRPC request
a message published to a queue
a third-party API call

Now imagine the client closes the browser tab, the mobile app loses network connection, or the upstream service cancels the request.

Should your backend continue doing all that work?

Usually, no.

Continuing heavy work for a response nobody is waiting for is just burning CPU, memory, database connections, and network bandwidth.

This is where context becomes important.

We pass context down the call stack to support cancellation propagation. It allows the whole request tree to receive the same signal:

This work is no longer needed. Stop as soon as possible.

That one idea becomes extremely powerful in real-world backend systems.
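A small experiment shows the propagation. This sketch derives a few contexts from one parent, cancels the parent, and checks that every derived context saw it. The variable names are illustrative, and real layers would do I/O where this example only inspects `ctx.Err()`.

```go
package main

import (
	"context"
	"fmt"
)

// cancelReachesChildren derives a small tree of contexts from one
// parent, cancels the parent, and reports whether every derived
// context observed the cancellation.
func cancelReachesChildren() bool {
	request, cancelRequest := context.WithCancel(context.Background())

	// Each layer derives its own context from the one it received.
	dbCtx, cancelDB := context.WithCancel(request)
	defer cancelDB()
	rpcCtx, cancelRPC := context.WithCancel(request)
	defer cancelRPC()
	nestedCtx, cancelNested := context.WithCancel(rpcCtx)
	defer cancelNested()

	// Simulate the client going away.
	cancelRequest()

	for _, ctx := range []context.Context{dbCtx, rpcCtx, nestedCtx} {
		if ctx.Err() == nil {
			return false // this layer never received the signal
		}
	}
	return true
}

func main() {
	fmt.Println("all layers cancelled:", cancelReachesChildren())
}
```

The signal only travels downward: cancelling a derived context does not cancel its parent.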

The Mental Model: Context Is a Request Lifecycle Signal

A good way to think about context.Context is this:

Context is not business data. Context is a lifecycle signal.

It can tell your code:

the request was cancelled
the deadline expired
the timeout was reached
the caller does not need the result anymore
some request-scoped metadata is available

But it should not become a hidden container for everything your function needs.

That distinction is where many Go codebases either stay clean or slowly become painful to maintain.

Battle-Tested Patterns: Coding Like a Senior

1. Bulletproofing Against Goroutine Leaks

One common mistake I have seen in Go services is starting a goroutine without thinking about what happens if the parent request is cancelled.

Look at this simplified example:

func FetchData(ctx context.Context, id string) (Data, error) {
    ch := make(chan Data, 1)
    errCh := make(chan error, 1)

    go func() {
        // Imagine this is a heavy I/O operation or a slow DB query.
        data, err := expensiveDatabaseCall(id)
        if err != nil {
            errCh <- err
            return
        }

        ch <- data
    }()

    select {
    case <-ctx.Done():
        // If the client aborts or a timeout occurs,
        // the caller can return immediately.
        return Data{}, ctx.Err()

    case err := <-errCh:
        return Data{}, err

    case result := <-ch:
        return result, nil
    }
}

The important detail here is this line:

ch := make(chan Data, 1)

The channel is buffered with capacity 1.

Why does this matter?

Because if ctx.Done() is triggered and FetchData returns early, the internal goroutine might still finish later and try to send the result into the channel.

If the channel is unbuffered, that goroutine may block forever because nobody is receiving anymore.

That is a classic goroutine leak.

Both ch and errCh are buffered for this reason. Whichever send the goroutine attempts completes immediately, the goroutine exits, and the channels become eligible for garbage collection.

This is one of those small details that does not look important in a tutorial, but it matters a lot in production systems.
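The effect can be observed directly. In this reduced sketch, `fetch` is a simplified variant of FetchData, and the `exited` channel exists only so the example can check that the worker goroutine returned. Both are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetch runs work in a goroutine and returns early when ctx is done.
// exited is closed when the worker goroutine returns.
func fetch(ctx context.Context, work func() string, exited chan<- struct{}) (string, error) {
	ch := make(chan string, 1) // capacity 1: the send below never blocks

	go func() {
		defer close(exited)
		ch <- work()
	}()

	select {
	case <-ctx.Done():
		return "", ctx.Err()
	case v := <-ch:
		return v, nil
	}
}

// runDemo lets the caller time out before the work finishes, then
// checks that the worker goroutine still managed to exit.
func runDemo() (timedOut, workerExited bool) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	exited := make(chan struct{})
	_, err := fetch(ctx, func() string {
		time.Sleep(100 * time.Millisecond) // slower than the timeout
		return "late result"
	}, exited)
	timedOut = errors.Is(err, context.DeadlineExceeded)

	select {
	case <-exited:
		workerExited = true
	case <-time.After(2 * time.Second):
		workerExited = false // the worker is stuck on its send
	}
	return timedOut, workerExited
}

func main() {
	timedOut, workerExited := runDemo()
	fmt.Println("caller timed out:", timedOut)
	fmt.Println("worker exited:", workerExited)
}
```

Changing the channel capacity to 0 makes the worker block on its send after the caller has returned, so the second check fails. Note that the buffer only prevents the leak; the slow work itself still runs to completion.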

2. Prefer Context-Aware APIs

The previous example is useful to understand the pattern, but in real code, you should prefer APIs that already accept context.Context.

For example, database calls should usually look like this:

func FindUser(ctx context.Context, db *sql.DB, userID int64) (*User, error) {
    row := db.QueryRowContext(ctx, `
        SELECT id, name, email
        FROM users
        WHERE id = $1
    `, userID)

    var user User
    if err := row.Scan(&user.ID, &user.Name, &user.Email); err != nil {
        return nil, err
    }

    return &user, nil
}

This is better than wrapping a blocking database call inside your own goroutine.

Why?

Because QueryRowContext gives the database driver a chance to stop the work when the context is cancelled or the deadline expires.

That means cancellation is not just happening in your Go code. It can also propagate to the I/O layer.

That is what you want.

3. Timeout Everything That Talks to the Outside World

Any operation that depends on the outside world can hang longer than expected:

database queries
HTTP calls
gRPC calls
Redis commands
file storage requests
third-party APIs

A senior mindset is simple:

If it crosses a process or network boundary, it needs a timeout.

Example:

func CallPaymentService(ctx context.Context, client *http.Client, url string) error {
    ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return err
    }

    resp, err := client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode >= 500 {
        return fmt.Errorf("payment service returned status: %d", resp.StatusCode)
    }

    return nil
}

This gives the operation a clear budget.

Without timeouts, a slow downstream dependency can slowly consume your worker pool, database pool, or goroutines until the whole service becomes unstable.

4. Always Call Cancel

When you create a context with WithCancel, WithTimeout, or WithDeadline, always call the returned cancel function.

ctx, cancel := context.WithTimeout(parentCtx, 3*time.Second)
defer cancel()

Even if the timeout will eventually expire, calling cancel() releases resources associated with that context earlier.

This is especially important in hot paths where functions are called many times per second.

A missing cancel() may not break your service immediately, but it can create unnecessary resource pressure over time.

The Dark Side of `context.WithValue`

One of the features of context is the ability to carry values across the request lifecycle.

This is useful, but it is also one of the easiest ways to make a Go codebase messy.

The golden rule is:

Use context.Value only for request-scoped data.

Good examples:

request ID
trace ID
user ID parsed from a JWT
tenant ID
client IP
correlation ID

Bad examples:

database connections
loggers as hidden dependencies
configuration objects
service clients
repositories
feature flag clients

Do not use context as a dependency injection container.

When dependencies are hidden inside context, your function signature lies. The function looks simple, but it secretly depends on multiple things. That hurts readability, testing, and type safety.

Safe Context Values with Custom Key Types

Another mistake is using raw strings as context keys.

ctx = context.WithValue(ctx, "userID", userID)

This can cause collisions between packages.

A safer pattern is to define an unexported custom type:

type contextKey string

const userIDKey contextKey = "userID"

func WithUserID(ctx context.Context, userID string) context.Context {
    return context.WithValue(ctx, userIDKey, userID)
}

func GetUserID(ctx context.Context) (string, bool) {
    userID, ok := ctx.Value(userIDKey).(string)
    return userID, ok
}

This keeps the usage safer and prevents accidental key collisions with other packages.
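The reason this works is that context compares keys by both type and value. The sketch below simulates two packages with the hypothetical key types `authKey` and `billingKey`, which share the same underlying string but do not see each other's values.

```go
package main

import (
	"context"
	"fmt"
)

// Two hypothetical packages each define their own key type.
// Both happen to use the same underlying string.
type authKey string
type billingKey string

// typedKeysCollide stores a value under authKey("userID") and then
// looks it up with billingKey("userID").
func typedKeysCollide() bool {
	ctx := context.WithValue(context.Background(), authKey("userID"), "u-42")
	_, found := ctx.Value(billingKey("userID")).(string)
	return found
}

// rawStringsCollide repeats the experiment with plain string keys,
// which is the pattern to avoid.
func rawStringsCollide() bool {
	ctx := context.WithValue(context.Background(), "userID", "u-42")
	_, found := ctx.Value("userID").(string)
	return found
}

func main() {
	fmt.Println("typed keys collide:", typedKeysCollide())
	fmt.Println("raw string keys collide:", rawStringsCollide())
}
```

With raw strings, any package that picks the same key name can read or shadow the value. With an unexported key type, only the defining package can construct a matching key.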

For larger systems, I usually prefer wrapping this inside a small internal package so the rest of the codebase does not directly touch raw context keys.

Context in HTTP Handlers

In HTTP services, the incoming request already has a context.

func GetUserHandler(db *sql.DB) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context()

        user, err := FindUser(ctx, db, 123)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }

        _ = json.NewEncoder(w).Encode(user)
    }
}

The important part is this:

ctx := r.Context()

If the client disconnects, the request context can be cancelled, and downstream operations that respect context can stop earlier.

That is why passing context.Background() inside handlers is usually wrong.

This is bad:

user, err := FindUser(context.Background(), db, 123)

Why?

Because you just detached the database query from the request lifecycle.

The client may be gone, but your backend keeps working.

Background Jobs Are Different

Not everything should use the request context.

Sometimes you intentionally want work to continue after the request ends.

For example:

writing audit logs
publishing analytics events
sending async notifications
queueing background jobs

In those cases, using the request context may be wrong because the work would be cancelled as soon as the request ends.

But this does not mean you should ignore context completely.

It means you should create a new, intentional context with its own timeout:

func PublishAuditEvent(producer EventProducer, event AuditEvent) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    _ = producer.Publish(ctx, event)
}

The key idea is intentionality.

Do not accidentally detach work from the request.
Do not accidentally bind background work to a request that may disappear.

Choose the lifecycle explicitly.

The Trade-Offs: A Realistic View

No tool in software engineering is perfect. context solves real problems, but it also brings trade-offs.

Pros

Standardization

context is the common language of lifecycle management in Go. It is used across packages like net/http, database/sql, gRPC, cloud SDKs, and many third-party libraries.

Lifecycle control

With context.WithTimeout, context.WithDeadline, and context.WithCancel, you can build services that do not hang forever.

Cancellation propagation

One cancellation signal can flow through multiple layers of your system.

Observability support

Request IDs, trace IDs, and correlation IDs become much easier to propagate across service boundaries.

Cons

Viral nature

Once you add context to a low-level function, you often need to pass it through every function above it.

This can feel noisy, but in backend systems, that noise is usually worth the explicit lifecycle control.

Weak type safety for values

ctx.Value() returns any, so incorrect type assertions can cause bugs or panics.

That is why context values should be used carefully and kept small.

Easy to misuse

The biggest danger is treating context as a magic bag for dependencies.

That creates hidden coupling and makes the code harder to understand.

My Code Review Checklist for Context

When I review Go code, these are my non-negotiable rules around context.

1. Context Must Be the First Argument

func DoSomething(ctx context.Context, userID string) error

This is the standard Go convention.

Do not hide it in the middle of the argument list.

2. Do Not Store Context in a Struct

Avoid this:

type Service struct {
    ctx context.Context
}

Contexts should flow through function calls. They should not usually live inside structs.

A struct normally represents a longer-lived object. A context usually represents a specific operation or request lifecycle.

Mixing those lifetimes creates confusion.

3. Never Pass Nil Context

This is bad:

DoSomething(nil, "123")

If you are not sure what context to use yet, use:

context.TODO()

If you are starting a root-level process, use:

context.Background()

TODO() is also useful because it makes unfinished context decisions searchable during refactoring.

4. Always Defer Cancel Immediately

ctx, cancel := context.WithTimeout(parentCtx, time.Second)
defer cancel()

Do this immediately after creating the context.

It prevents forgetting it later when the function grows.

5. Do Not Ignore ctx.Err()

When a context is cancelled, ctx.Err() tells you why.

select {
case <-ctx.Done():
    return ctx.Err()
default:
}

The error is usually one of:

context.Canceled
context.DeadlineExceeded

This distinction is useful for logging, metrics, and debugging.

A timeout means the operation exceeded its budget.
A cancellation may simply mean the client disconnected or the caller stopped needing the result.

Those are not always the same kind of failure.

A Practical Rule I Use

Here is the simple rule I use in production systems:

Pass context for cancellation, deadlines, and request-scoped metadata. Do not pass it for business data or dependencies.

That one sentence prevents many bad patterns.

For example:

// Good
func CreateOrder(ctx context.Context, order Order) error

// Suspicious
func CreateOrder(ctx context.Context) error

In the second version, I immediately wonder:

Where is the order coming from?
Is it hidden inside context?
Is this function depending on invisible data?

That is usually a design smell.

Final Thoughts

Mastering context means mastering your application’s control flow.

In real backend systems, where servers handle thousands of concurrent requests and distributed services introduce unpredictable latency, using context correctly can be the difference between a resilient service and a 3 AM pager alert for an out-of-memory crash.

For me, context is not just a Go package.

It is a discipline.

It forces you to think clearly about lifecycle, ownership, cancellation, timeouts, and the cost of work that no one needs anymore.

Used correctly, it makes your systems more predictable.

Used carelessly, it hides dependencies and creates architectural confusion.

That is why I treat context as one of the most important tools in production Go programming.

If you have worked with Go in production, I would love to hear your own experience with context, cancellation, and goroutine leaks.

Memory Under the Hood: Why Go Often Feels Faster Than Python

amir — Mon, 25 May 2026 12:34:33 +0000

After years of building backend systems, working with data pipelines, debugging production issues, and watching servers behave differently under load, I learned one lesson the hard way:

Performance problems rarely start from syntax.

They usually start from how we think about memory.

When I was earlier in my career, I used to compare languages mostly by developer experience. Python felt clean, expressive, and fast to write. Go felt simple, strict, and very practical for backend services. But once I started dealing with large files, high-throughput APIs, queues, workers, containers, and memory pressure in production, I realized that the real difference is not just language design.

The real difference is what happens under the hood.

Why can a small Python script suddenly consume a huge amount of RAM?

Why can the same type of data processing in Go run with much lower memory usage?

Why does iterating over a slice of structs in Go usually feel much cheaper than iterating over a list of Python objects?

And why does reading a 10GB file the wrong way break programs in both languages, even if one of them is “faster”?

This article is my practical breakdown of memory layout, allocation, garbage collection, cache locality, and large-file processing in Go and Python. I am not writing this as a language war. I use both languages. Python is still one of my favorite tools for automation, scripting, data work, and fast iteration.

But when you are building backend systems where memory, latency, and throughput matter, understanding these details can change how you write code.

The First Misunderstanding: A Dynamic Array Is Not a Linked List

One mistake I have seen many developers make is assuming that dynamic collections are somehow linked lists internally.

They are not.

A Python list is not a linked list.

A Go slice is not a linked list.

Both are built around the idea of a dynamic array, although their internal models are very different.

In Python, a list is basically a resizable array of references. The list itself stores pointers to objects. Those objects live somewhere else in memory.

In Go, a slice is a small header that points to an underlying array. That header contains three important pieces of information:

type SliceHeader struct {
    Data uintptr
    Len  int
    Cap  int
}

Conceptually, a Go slice contains:

a pointer to the underlying array
the current length
the current capacity

If you really need a linked list in Go, there is container/list, or you can implement your own node-based structure. But for most backend workloads, linked lists are not as useful as people think. They often hurt cache locality and add pointer chasing overhead.

That is an important point: Big-O complexity is not the whole story.

In real production systems, CPU cache behavior matters a lot.

Python Lists: Simple API, Expensive Objects

Python gives us a very clean programming model:

numbers = [1, 2, 3, 4, 5]

for n in numbers:
    print(n)

It looks like a list of integers.

But internally, it is not a compact array of raw integers like you might expect in C or Go.

A Python list stores references to Python objects. Each integer is a full Python object with metadata, reference count information, type information, and value storage.

So when you write:

numbers = [1, 2, 3]

You should not imagine this as:

[1][2][3]

It is closer to:

list
 ├── pointer ──> PyObject(1)
 ├── pointer ──> PyObject(2)
 └── pointer ──> PyObject(3)

This design gives Python a lot of flexibility. A single list can contain integers, strings, dictionaries, custom classes, and even functions:

items = [1, "hello", {"active": True}, lambda x: x * 2]

That flexibility is powerful, but it has a cost.

Every item access can involve pointer dereferencing. The CPU may need to jump to different memory locations. This causes more cache misses, and cache misses are expensive.

Modern CPUs are extremely fast when they can read predictable, contiguous memory. They are much slower when they constantly chase pointers across the heap.

This is one reason why pure Python loops over large collections can be slow compared to Go, C, Rust, or even NumPy-based Python code.

NumPy is fast not because Python magically became faster, but because NumPy stores data in compact native arrays and runs optimized native code.

Go Slices: Contiguous Memory and Better Cache Locality

Now compare that with Go:

numbers := []int{1, 2, 3, 4, 5}

for _, n := range numbers {
    fmt.Println(n)
}

A slice of integers in Go points to an underlying array of integers stored contiguously in memory.

Conceptually:

[1][2][3][4][5]

That is much friendlier for the CPU.

The CPU can load a cache line and read multiple nearby values efficiently. This is called cache locality, and it is one of those low-level concepts that directly affects high-level backend performance.

This becomes even more interesting with structs.

type User struct {
    ID     int64
    Active bool
    Score  float64
}

users := []User{
    {ID: 1, Active: true, Score: 91.2},
    {ID: 2, Active: false, Score: 72.5},
}

In this case, the actual User values are stored in the underlying array.

But if you write:

users := []*User{
    &User{ID: 1, Active: true, Score: 91.2},
    &User{ID: 2, Active: false, Score: 72.5},
}

Now you have a slice of pointers. This can be useful when you need shared mutable objects or want to avoid copying large structs, but it also means more pointer chasing.

This is why I try to be intentional with this decision.

A slice of values:

[]User

is not the same performance model as:

[]*User

Both are valid. But they are not the same.

In production systems, these small decisions start to matter when you process hundreds of thousands or millions of records.

Value Semantics vs Reference Semantics

Another major difference between Go and Python is how they treat values.

Python is reference-oriented. Variables are names bound to objects.

a = [1, 2, 3]
b = a
b.append(4)

print(a)  # [1, 2, 3, 4]

Both a and b refer to the same list object.

Go has stronger value semantics by default.

a := [3]int{1, 2, 3}
b := a

b[0] = 99

fmt.Println(a) // [1 2 3]
fmt.Println(b) // [99 2 3]

The array is copied.

But slices are different:

a := []int{1, 2, 3}
b := a

b[0] = 99

fmt.Println(a) // [99 2 3]
fmt.Println(b) // [99 2 3]

Why?

Because the slice header is copied, but both slice headers still point to the same underlying array.

This is one of the most important things every Go developer should deeply understand.

A slice is not the array itself. It is a descriptor over an array.

That is why bugs can happen when you pass slices around and mutate them without thinking about who else shares the same underlying array.
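Here is a minimal sketch of how that sharing can bite with append, and one way to avoid it. The cloneInts helper is a hypothetical name; on Go 1.21+ slices.Clone does the same job.

```go
package main

import "fmt"

// cloneInts returns a copy backed by its own array.
func cloneInts(s []int) []int {
	return append([]int(nil), s...)
}

func main() {
	base := make([]int, 3, 10) // len 3, cap 10: room to grow in place
	copy(base, []int{1, 2, 3})

	a := append(base, 4)    // writes into base's spare capacity
	b := append(base, 5)    // writes into the same slot, overwriting a[3]
	fmt.Println(a[3], b[3]) // 5 5

	// Cloning first gives each slice its own backing array.
	c := append(cloneInts(base), 4)
	d := append(cloneInts(base), 5)
	fmt.Println(c[3], d[3]) // 4 5
}
```

Nothing here is a data race or a compiler error. The first pair simply shares one array because base had spare capacity.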

Allocation: The Hidden Cost Behind Simple Code

Allocation is one of the biggest silent performance costs in backend systems.

When code allocates too much memory, the garbage collector has more work to do. More GC work means more CPU overhead and sometimes more latency.

In Go, when a slice grows beyond its capacity, Go allocates a new underlying array and copies the existing elements into it.

Example:

items := []int{}

for i := 0; i < 1_000_000; i++ {
    items = append(items, i)
}

This works, but the slice may grow multiple times.

A better version:

items := make([]int, 0, 1_000_000)

for i := 0; i < 1_000_000; i++ {
    items = append(items, i)
}

Now the backing array is allocated once with the full capacity, so append never has to grow and copy it inside the loop.

This does not mean you should always preallocate everything. But when you know the approximate size, preallocation is one of the simplest performance wins.
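You can observe the growth directly by watching the capacity change. The sketch below uses a hypothetical helper, countGrowths; the exact number of growths without preallocation depends on the Go version, so only the preallocated result is stated.

```go
package main

import "fmt"

// countGrowths appends n ints to s and counts how often the capacity changes,
// i.e. how often append had to allocate a new backing array.
func countGrowths(s []int, n int) int {
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			growths++
		}
	}
	return growths
}

func main() {
	const n = 1_000_000
	fmt.Println("no preallocation:", countGrowths(nil, n))
	fmt.Println("preallocated:", countGrowths(make([]int, 0, n), n)) // 0
}
```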

The same idea exists in many systems:

preallocate buffers
reuse memory when safe
avoid unnecessary temporary objects
avoid converting []byte to string too early
avoid building huge in-memory arrays when streaming is enough

Performance engineering is often not about writing complicated code.

It is about not forcing the runtime to clean up avoidable garbage.

Python Allocation and Object Overhead

Python has its own memory manager. For small objects, CPython uses specialized allocation strategies to make object creation faster.

But Python still pays the cost of object-heavy design.

A Python integer is not just a raw machine integer. A Python string is an object. A Python dict is very flexible, but it is not cheap. A Python class instance has overhead too, unless you optimize with tools like __slots__ or use more compact structures.

This is why Python can be very fast when the heavy work is pushed into optimized native libraries, but slower when the workload is pure Python object processing.

For example:

total = 0

for item in huge_list:
    total += item

If huge_list is a normal Python list of Python integers, every iteration involves Python-level object handling.

But with NumPy:

import numpy as np

arr = np.array(huge_list)
total = arr.sum()

The expensive loop is moved into optimized native code.

This is not just a library trick. It is a memory-layout trick.

Compact memory layout changes everything.

Garbage Collection: Python vs Go

Garbage collection is another area where the difference matters.

Python, specifically CPython, mainly uses reference counting. Every object tracks how many references point to it. When the reference count reaches zero, the object can be freed immediately.

Example:

a = []
b = a

del a
del b

After both references are gone, the list can be cleaned up.

But reference counting alone cannot handle reference cycles.

a = {}
b = {}

a["b"] = b
b["a"] = a

Now a references b, and b references a.

Even if nothing else references them, their reference counts never reach zero on their own, because each object still holds a reference to the other. That is why Python also has a cyclic garbage collector.

Go uses a concurrent mark-and-sweep garbage collector.

At a high level, Go's GC finds reachable objects, marks them as live, and then frees unreachable memory. It is designed to keep pause times low and run concurrently with the application as much as possible.

This is one of the reasons Go works well for backend services. You can build long-running APIs, workers, and network services with predictable latency when you write allocation-conscious code.

But Go's GC is not magic.

If your code allocates too much, creates too many temporary objects, or keeps references alive longer than needed, the GC still has to work harder.

In Go, good performance often comes from writing boring code:

buf := make([]byte, 64*1024)

Reuse buffers when appropriate.

Avoid unnecessary conversions.

Keep data structures simple.

Do not create object graphs when arrays or slices are enough.
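For buffer reuse, one common option is sync.Pool. This is a sketch under assumptions: formatRecord is a hypothetical hot-path function, and whether a pool actually helps should be confirmed with a benchmark.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so hot paths do not allocate a new one per call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// formatRecord is a hypothetical hot-path function that needs a scratch buffer.
func formatRecord(id int, name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may still contain old data
	defer bufPool.Put(buf) // return it for the next caller

	fmt.Fprintf(buf, "%d:%s", id, name)
	return buf.String() // String copies, so the result is safe after Put
}

func main() {
	fmt.Println(formatRecord(1, "alice")) // 1:alice
	fmt.Println(formatRecord(2, "bob"))   // 2:bob
}
```

The ownership rule is the important part: after Put, the caller must not touch the buffer again.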

The 10GB File Problem

One of the most common backend mistakes is reading a large file into memory.

In Python:

with open("large.log", "r") as f:
    data = f.read()

In Go:

data, err := os.ReadFile("large.log")
if err != nil {
    log.Fatal(err)
}

This may work for small files.

It may work for 100MB.

It may even work in development.

Then production gets a 10GB file, and the process gets killed by the operating system.

The problem is not Python or Go.

The problem is the strategy.

If the file is large, you should usually stream it.

Streaming Files in Python

Python gives you a very clean way to process files line by line:

with open("large.log", "r") as f:
    for line in f:
        process(line)

This does not load the whole file into memory.

For CSV files:

import csv

with open("large.csv", newline="") as f:
    reader = csv.reader(f)

    for row in reader:
        process(row)

For huge CSV files with Pandas:

import pandas as pd

for chunk in pd.read_csv("large.csv", chunksize=100_000):
    process(chunk)

For large JSON files, avoid loading everything with:

json.load(f)

if the file is too big.

For stream-style JSON parsing, libraries like ijson can process JSON incrementally.

Another powerful pattern in Python is generators:

def read_lines(path):
    with open(path, "r") as f:
        for line in f:
            yield line

for line in read_lines("large.log"):
    process(line)

The benefit is simple: only a small part of the data exists in memory at a time.

Streaming Files in Go

Go is extremely practical for this kind of workload.

For line-by-line processing:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    file, err := os.Open("large.log")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)

    for scanner.Scan() {
        line := scanner.Text()
        process(line)
    }

    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}

func process(line string) {
    fmt.Println(line)
}

This is simple and memory efficient.

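But there is one important detail: bufio.Scanner has a default maximum token size of 64 KB (bufio.MaxScanTokenSize). A longer line makes Scan stop and scanner.Err() return bufio.ErrTooLong. For very long lines, you should increase the buffer: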

scanner := bufio.NewScanner(file)

buf := make([]byte, 1024*1024)
scanner.Buffer(buf, 10*1024*1024)

For more control, I often prefer bufio.Reader:

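reader := bufio.NewReaderSize(file, 64*1024)

for {
    // ReadString can return data together with io.EOF
    // when the last line has no trailing newline.
    line, err := reader.ReadString('\n')
    if len(line) > 0 {
        process(line)
    }

    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err) // a real read error, not end of file
    }
}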

For JSON streaming, avoid reading the full file and unmarshalling everything:

var items []Item
json.Unmarshal(data, &items)

For large files that contain a stream of JSON values, one after another (for example newline-delimited JSON), use a decoder:

decoder := json.NewDecoder(file)

for decoder.More() {
    var item Item
    if err := decoder.Decode(&item); err != nil {
        log.Fatal(err)
    }

    process(item)
}

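Depending on the JSON structure, you may need to manually read tokens using Token(). If the file is a single top-level array, for example, Decode would try to read the whole array into one Item, so you first consume the opening [ with Token() and then decode the elements one by one.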

The main point is this:

Do not make memory responsible for holding data that can be streamed.

Why Go Can Be Much Faster in File Processing

In one of my own experiments, I processed and filtered a large CSV log file using both Python and Go.

The Python version used a generator and csv.reader.

The Go version used buffered I/O and goroutines for parallel processing.

The difference was significant.

Python was clean and memory efficient, but slower because every row and cell still became Python-level objects. Go was faster because I could process bytes more directly, reduce allocations, and use multiple CPU cores more naturally.

The exact numbers always depend on the machine, disk, file format, parsing logic, and implementation. But the pattern is very common:

Python is excellent for developer speed.
Go is excellent for predictable resource usage and backend throughput.
Python becomes much faster when heavy work is moved into native libraries.
Go performs very well when memory layout and allocations are controlled.

This is not about saying one language is always better.

It is about knowing where each language shines.

A Practical Go Pattern: Worker Pool for Large Files

When processing huge files, I usually avoid creating one goroutine per line. That sounds concurrent, but it can destroy memory and scheduling performance.

Instead, I prefer a bounded worker pool.

package main

import (
    "bufio"
    "log"
    "os"
    "sync"
)

func main() {
    file, err := os.Open("large.log")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    lines := make(chan string, 10_000)

    var wg sync.WaitGroup
    workerCount := 8

    for i := 0; i < workerCount; i++ {
        wg.Add(1)

        go func() {
            defer wg.Done()

            for line := range lines {
                process(line)
            }
        }()
    }

    scanner := bufio.NewScanner(file)
    scanner.Buffer(make([]byte, 1024*1024), 10*1024*1024)

    for scanner.Scan() {
        lines <- scanner.Text()
    }

    close(lines)
    wg.Wait()

    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}

func process(line string) {
    // parse, filter, transform, send to DB, etc.
}

The important part is this:

lines := make(chan string, 10_000)

The channel is bounded.

That means the reader cannot infinitely push data into memory. If workers are slower than the reader, backpressure naturally happens.

This is one of the most important backend engineering concepts:

Fast producers must not be allowed to destroy slow consumers.

Whether you are reading files, consuming Kafka messages, processing HTTP requests, or sending jobs to workers, you need backpressure.

Avoiding Unnecessary String Allocation in Go

One hidden cost in Go file processing is converting bytes to strings too early.

For example:

line := scanner.Text()

This returns a newly allocated string for every line.

If you need better performance and can safely process bytes, use:

line := scanner.Bytes()

But be careful: the byte slice returned by scanner.Bytes() is only valid until the next scan. If you need to keep it, you must copy it.

This is a classic Go tradeoff.

You can reduce allocations, but you must understand ownership and lifetime.

That is why I always say Go is simple, but not shallow.

Escape Analysis: The Invisible Performance Tool

One of the most powerful Go concepts is escape analysis.

Go decides whether a variable can stay on the stack or must move to the heap.

Example:

func createUser() *User {
    user := User{Name: "Amir"}
    return &user
}

Here, user escapes because we return its address. It cannot live only on the stack.

You can inspect escape decisions:

go build -gcflags="-m" ./...

You may see output like:

moved to heap: user

This does not automatically mean the code is bad. Sometimes heap allocation is necessary.

But when you are optimizing hot paths, escape analysis helps you understand why your GC pressure is high.

In backend systems, this matters in places like:

request parsing
JSON encoding/decoding
logging
metrics
queue consumers
data transformation pipelines
tight loops
large batch processing

If a function runs millions of times, a small allocation becomes a production problem.

Memory Is Not Just a Language Problem

A senior engineer should not only ask:

Which language is faster?

A better question is:

What memory model does this workload need?

For example:

If I am building an internal script, Python is probably perfect.

If I am writing data analysis code, Python with Pandas, Polars, DuckDB, or NumPy can be excellent.

If I am building a high-throughput API service with predictable latency, Go is often a strong choice.

If I am processing huge files and need simple deployment, Go gives me a very practical balance.

If I am building machine learning workflows, Python still has the ecosystem advantage.

The language is only one layer.

The real engineering decision is about workload, memory behavior, concurrency model, ecosystem, deployment, and operational cost.

Practical Rules I Use in Production

Here are some rules I personally follow when working with memory-heavy backend systems.

1. Never load a huge file fully unless you really need to

Stream it.

Line by line.

Chunk by chunk.

Token by token.

2. Preallocate when you know the size

In Go:

items := make([]Item, 0, expectedSize)

This can reduce allocations and copying.

3. Be careful with slices sharing the same underlying array

This can create subtle bugs and unexpected memory retention.

For example:

small := big[:10]

If big is huge, small may keep the whole underlying array alive.

If you only need the small part, copy it:

smallCopy := append([]byte(nil), big[:10]...)

4. Avoid unnecessary pointer-heavy structures

A slice of pointers is not always better.

Sometimes a slice of values is faster and simpler.

5. Measure before and after optimization

Use tools:

go test -bench=. -benchmem
go tool pprof

For Python:

python -m cProfile app.py

Also check memory profiling tools when memory matters.

6. Design with backpressure

Bound your queues.

Limit your workers.

Do not let fast input destroy your service.

7. Optimize hot paths, not everything

Readable code still matters.

Do not turn the whole codebase into a low-level memory puzzle.

Find the hot path, measure it, then optimize it.

Final Thoughts

The deeper I go into backend engineering, the more I respect memory.

Not because every developer needs to become a systems programmer, but because every production system eventually becomes a resource management problem.

CPU is limited.

Memory is limited.

Disk I/O is limited.

Network bandwidth is limited.

The job of a backend engineer is not just writing business logic. It is building software that behaves well under pressure.

Python gives us speed of development and a massive ecosystem.

Go gives us simplicity, strong concurrency, predictable deployment, and great performance for many backend workloads.

Both are useful.

But if you want to build scalable backend systems, process large files, reduce memory pressure, and understand why your service behaves the way it does, you need to look under the hood.

My personal rule is simple:

Write clean code first.

Understand memory second.

Measure everything.

Then optimize only what matters.

That mindset has helped me more than any single framework, library, or language feature.

Discussion

Have you ever had a service crash because it loaded too much data into memory?

Or have you used Go escape analysis, pprof, or Python profiling tools to debug a real production issue?

I would love to hear your experience.

The Silent Killers of Go Concurrency: Mutexes, Semaphores, and Goroutine Leaks

amir — Sun, 24 May 2026 17:15:55 +0000

Go makes concurrency look simple.

You write:

go func() {
    // do something concurrently
}()

And suddenly your code is running in another goroutine.

That simplicity is one of the reasons I like Go so much. But after working on backend systems, notification pipelines, high-traffic APIs, and production services under real load, I learned something important:

Most concurrency problems in Go do not come from not using concurrency.

They come from using concurrency without understanding where the bottleneck actually is.

Sometimes the issue is a missing lock.

But very often, especially in production Go services, the issue is the opposite:

too much locking
locks held for too long
network I/O inside critical sections
goroutines that never exit
unbounded goroutine creation
WaitGroups copied by value
channels used without a cancellation strategy

In this article, I want to walk through the concurrency problems I have seen in real systems, how I reason about mutexes and semaphores, and how I usually debug these issues before they become production incidents.

The Real Problem: Concurrency That Accidentally Becomes Sequential

A service can look concurrent from the outside and still behave like a single-threaded application internally.

This usually happens when a large part of the request flow is hidden behind one shared lock.

A pattern like this is more common than many developers admit:

mu.Lock()
user.Name = "Test User"
sendEmail(user)
callDatabase(user)
mu.Unlock()

At first glance, it may look safe.

The developer wanted to protect shared state. That part is reasonable. But the lock is now protecting much more than shared memory. It is protecting the entire flow:

update a field
send an email
call the database
maybe wait on network I/O
maybe retry
maybe block other goroutines for a long time

That is not just a mutex anymore.

That is a traffic jam.

Every goroutine that needs the same lock must wait until the whole flow finishes. So even if your service has hundreds or thousands of goroutines, a big part of the system becomes sequential.

The dangerous part is that CPU usage may still look normal or even low. Memory may also look fine. But latency increases, throughput drops, and p95/p99 response times become unstable.

This is why lock contention is sometimes difficult to notice from basic infrastructure metrics alone.

A Production-Style Example: Email Inside a Mutex

Imagine we have a service that updates user state and sends notifications.

type Service struct {
    mu    sync.Mutex
    state map[int]string
}

func (s *Service) ProcessUsers(users []User) {
    s.mu.Lock()
    defer s.mu.Unlock()

    for _, user := range users {
        s.state[user.ID] = "processed"
        sendEmail(user) // slow network I/O inside the lock
    }
}

This code is safe from a data race perspective.

But it is dangerous from a performance perspective.

A mutex should protect the smallest possible shared memory operation. It should not protect slow external work like:

sending email
calling another microservice
database queries
HTTP requests
file uploads
logging to a slow external sink
waiting on a third-party API

The memory update may take nanoseconds or microseconds. The email call may take milliseconds or seconds.

That difference matters.

If the lock is held while sendEmail runs, every other goroutine that needs s.mu is blocked behind a network call.

A better version separates shared-state mutation from slow work:

func (s *Service) ProcessUsers(users []User) {
    emails := make([]User, 0, len(users))

    s.mu.Lock()
    for _, user := range users {
        s.state[user.ID] = "processed"
        emails = append(emails, user)
    }
    s.mu.Unlock()

    for _, user := range emails {
        sendEmail(user)
    }
}

This is already better because the lock only protects the shared map.

But in a real production system, I usually prefer pushing the slow work to a queue or bounded worker pool:

func (s *Service) ProcessUsers(users []User, jobs chan<- EmailJob) {
    s.mu.Lock()
    for _, user := range users {
        s.state[user.ID] = "processed"
    }
    s.mu.Unlock()

    for _, user := range users {
        jobs <- EmailJob{UserID: user.ID, Email: user.Email}
    }
}

Now the request path does not directly depend on the email provider latency.

That is the real fix.

Not just “use goroutines.”

The fix is designing the boundary between shared memory, external I/O, and backpressure.

Mutexes Are Not Bad. Large Critical Sections Are Bad.

I sometimes see developers become afraid of mutexes.

That is the wrong lesson.

sync.Mutex is simple, fast, and perfectly fine when used correctly. The problem is not the mutex. The problem is the size of the critical section.

This is what I try to keep in mind:

mu.Lock()
// only touch shared memory here
mu.Unlock()

Not this:

mu.Lock()
// shared memory
// database call
// HTTP call
// email call
// JSON encoding
// logging
// metrics push
mu.Unlock()

A good critical section should be boring.

It should usually do one of these:

read shared state
update shared state
copy shared state into a local variable
swap a pointer
increment a counter
append to a protected slice/map

Then unlock.

After that, do the expensive work outside the lock.
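A minimal sketch of the "copy shared state into a local variable" case, using a hypothetical Registry type. The lock is held only for the map write or the copy; anything slow happens on the snapshot.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a hypothetical type guarding a map with a mutex.
type Registry struct {
	mu    sync.Mutex
	state map[int]string
}

// Set updates shared state; the critical section is a single map write.
func (r *Registry) Set(id int, v string) {
	r.mu.Lock()
	r.state[id] = v
	r.mu.Unlock()
}

// Snapshot copies the shared map under the lock and returns the copy.
func (r *Registry) Snapshot() map[int]string {
	r.mu.Lock()
	defer r.mu.Unlock()

	out := make(map[int]string, len(r.state))
	for k, v := range r.state {
		out[k] = v
	}
	return out
}

func main() {
	r := &Registry{state: map[int]string{}}
	r.Set(1, "processed")
	r.Set(2, "pending")

	snap := r.Snapshot() // the lock is already released here
	for id, s := range snap {
		// Slow work (I/O, encoding, logging) would go here, outside the lock.
		fmt.Println(id, s)
	}
}
```

The copy costs memory, so this fits small or medium maps; for very large state a different design may be a better fit.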

Under the Hood: What a Mutex Gives You

At a high level, a mutex gives you mutual exclusion: only one goroutine can enter a protected section at a time.

But it also gives you memory ordering guarantees.

In Go's memory model, an unlock is synchronized before any later lock on the same mutex returns. In practical terms, that means if one goroutine updates shared data and unlocks, another goroutine that later locks the same mutex can safely observe that update.

That is the part many developers forget.

A mutex is not just about “blocking other goroutines.” It is also about creating a safe visibility boundary between goroutines.

Without that boundary, different goroutines may read and write the same memory at the same time, and now you have a data race. Once you have a data race, your program is no longer something you can reason about confidently.
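
A small sketch of that visibility boundary, assuming nothing beyond the standard library: many goroutines increment one counter, and the mutex makes each write visible to the next goroutine that takes the lock. countWithMutex is a hypothetical helper name.

```go
package main

import (
    "fmt"
    "sync"
)

// countWithMutex starts n goroutines that each increment a shared counter once.
func countWithMutex(n int) int {
    var (
        mu      sync.Mutex
        wg      sync.WaitGroup
        counter int
    )

    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()

            mu.Lock()
            counter++ // the unlock below publishes this write to the next locker
            mu.Unlock()
        }()
    }

    wg.Wait() // all Done calls happen before Wait returns, so the read is safe
    return counter
}

func main() {
    fmt.Println(countWithMutex(1000)) // prints 1000
}
```

Remove the Lock and Unlock calls and the result is no longer guaranteed, and go test -race would report the unsynchronized writes.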

This is why I do not like “clever” lock-free code unless there is a very strong reason for it.

Most backend services do not need clever concurrency.

They need clear concurrency.

Semaphore: Controlling Capacity, Not Ownership

A mutex is usually about ownership of shared memory.

A semaphore is about capacity.

For example, suppose you want to process 10,000 users, but you do not want to send 10,000 emails at the same time.

A naive version might do this:

for _, user := range users {
    go sendEmail(user)
}

This is dangerous because it creates unbounded concurrency.

If users has 10,000 items, you create 10,000 goroutines. If each goroutine performs network I/O, opens connections, allocates memory, and waits on an external provider, you can overload your own service before you overload the email provider.

A simple semaphore pattern fixes this:

sem := make(chan struct{}, 20) // allow only 20 concurrent email sends
var wg sync.WaitGroup

for _, user := range users {
    user := user

    sem <- struct{}{}
    wg.Add(1)

    go func() {
        defer wg.Done()
        defer func() { <-sem }()

        sendEmail(user)
    }()
}

wg.Wait()

Now the code still uses concurrency, but concurrency is bounded.

That one detail is huge in production.

Unbounded concurrency is not scalability.

It is delayed failure.
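
To check that the bound really holds, the same semaphore pattern can be wrapped in a helper that records the peak number of calls in flight. This is a sketch for experimentation, not production code; runBounded is a hypothetical name and the sleep stands in for a network call.

```go
package main

import (
    "fmt"
    "sync"
    "time"
)

// runBounded calls fn once per item with at most limit calls in flight.
// It returns the highest number of simultaneous calls it observed.
func runBounded(items []int, limit int, fn func(int)) int {
    sem := make(chan struct{}, limit)
    var (
        wg       sync.WaitGroup
        mu       sync.Mutex
        inFlight int
        peak     int
    )

    for _, item := range items {
        sem <- struct{}{} // blocks while limit calls are already running
        wg.Add(1)

        go func(item int) {
            defer wg.Done()
            defer func() { <-sem }()

            mu.Lock()
            inFlight++
            if inFlight > peak {
                peak = inFlight
            }
            mu.Unlock()

            fn(item) // the slow work runs outside the bookkeeping lock

            mu.Lock()
            inFlight--
            mu.Unlock()
        }(item)
    }

    wg.Wait()
    return peak
}

func main() {
    items := make([]int, 100)
    peak := runBounded(items, 5, func(int) {
        time.Sleep(time.Millisecond) // stand-in for a network call
    })
    fmt.Println("peak concurrency:", peak) // never more than 5
}
```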

A Better Worker Pool for Production Code

The semaphore pattern is useful, but for services that run continuously, I often prefer a worker pool.

type EmailJob struct {
    UserID int
    Email  string
}

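// startEmailWorkers starts workerCount workers that consume jobs until the
// channel is closed or ctx is cancelled. The returned channel is closed once
// every worker has exited, so the caller can wait for a clean shutdown.
func startEmailWorkers(ctx context.Context, workerCount int, jobs <-chan EmailJob) <-chan struct{} {
    var wg sync.WaitGroup

    for i := 0; i < workerCount; i++ {
        wg.Add(1)

        go func(workerID int) {
            defer wg.Done()

            for {
                select {
                case <-ctx.Done():
                    return

                case job, ok := <-jobs:
                    if !ok {
                        return
                    }

                    if err := sendEmailJob(ctx, job); err != nil {
                        // In real systems: log, retry, dead-letter, or expose metrics.
                        fmt.Printf("worker=%d failed to send email user_id=%d err=%v\n", workerID, job.UserID, err)
                    }
                }
            }
        }(i)
    }

    // Signal completion instead of waiting in a goroutine that nobody observes.
    done := make(chan struct{})
    go func() {
        wg.Wait()
        close(done)
    }()
    return done
}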

This gives you much better operational control:

fixed concurrency
easier metrics
easier shutdown
easier retry strategy
easier backpressure
easier rate limiting

This is the difference between “I used goroutines” and “I designed a concurrent system.”
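
The full lifecycle of a pool (start workers, feed jobs, close the channel, wait for the drain) can be sketched in a few lines. The example below is a simplified stand-alone variant, not the function shown earlier: runPool is a hypothetical name, it has no context handling, and handle stands in for the real send function.

```go
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

// EmailJob mirrors the job type used in this article.
type EmailJob struct {
    UserID int
    Email  string
}

// runPool starts workerCount workers, feeds them every job, closes the
// channel, and waits until the workers have drained it.
func runPool(workerCount int, jobList []EmailJob, handle func(EmailJob)) {
    jobs := make(chan EmailJob, workerCount)
    var wg sync.WaitGroup

    for i := 0; i < workerCount; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobs { // the loop ends when jobs is closed and drained
                handle(job)
            }
        }()
    }

    for _, job := range jobList {
        jobs <- job // blocks when all workers are busy and the buffer is full
    }
    close(jobs) // the producer owns the channel, so the producer closes it
    wg.Wait()
}

func main() {
    jobList := make([]EmailJob, 10)
    for i := range jobList {
        jobList[i] = EmailJob{UserID: i + 1, Email: "user@example.com"}
    }

    var sent int64
    runPool(3, jobList, func(job EmailJob) {
        atomic.AddInt64(&sent, 1) // stand-in for the real send
    })
    fmt.Println("sent:", sent) // prints "sent: 10"
}
```

Closing the channel from the producer side gives every worker a natural exit path: the range loop simply ends.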

Goroutine Leak: The Bug That Does Not Explode Immediately

Goroutine leaks are one of the most common production problems in Go.

They are dangerous because the service may not crash immediately. It may slowly become worse over hours or days.

Here is a classic example:

func process() error {
    ch := make(chan result)

    go func() {
        ch <- heavyComputation()
    }()

    select {
    case res := <-ch:
        return handle(res)

    case <-time.After(1 * time.Second):
        return errors.New("timeout")
    }
}

The problem is subtle.

ch is unbuffered.

If the timeout happens first, process returns. After that, there is no receiver waiting on ch.

When heavyComputation() finishes, the goroutine tries to send into ch and blocks forever.

That goroutine is now leaked.

One leaked goroutine may not matter.

Thousands of leaked goroutines matter.

A safer version uses a buffered channel:

func process() error {
    ch := make(chan result, 1)

    go func() {
        ch <- heavyComputation()
    }()

    select {
    case res := <-ch:
        return handle(res)

    case <-time.After(1 * time.Second):
        return errors.New("timeout")
    }
}

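With a buffer of one, the send succeeds even when nobody is receiving, so the goroutine can exit after the timeout. The computation itself still runs to completion; only the leak is gone.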

But in real services, I prefer context-based cancellation:

func process(ctx context.Context) error {
    ctx, cancel := context.WithTimeout(ctx, 1*time.Second)
    defer cancel()

    ch := make(chan result, 1)

    go func() {
        res := heavyComputation(ctx)

        select {
        case ch <- res:
        case <-ctx.Done():
        }
    }()

    select {
    case res := <-ch:
        return handle(res)

    case <-ctx.Done():
        return ctx.Err()
    }
}

The important lesson:

Every goroutine needs an exit path.

If you cannot explain how a goroutine stops, you probably have a leak waiting to happen.
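
For long-running background goroutines, the exit path can be made explicit and observable. The sketch below is one possible shape, assuming a periodic task: startPoller and runDemo are hypothetical names, and the returned channel exists only so the caller can confirm that the goroutine really stopped.

```go
package main

import (
    "context"
    "fmt"
    "time"
)

// startPoller runs tick on every interval until ctx is cancelled.
// The returned channel is closed once the goroutine has exited.
func startPoller(ctx context.Context, interval time.Duration, tick func()) <-chan struct{} {
    done := make(chan struct{})

    go func() {
        defer close(done)

        ticker := time.NewTicker(interval)
        defer ticker.Stop()

        for {
            select {
            case <-ctx.Done():
                return // the exit path
            case <-ticker.C:
                tick()
            }
        }
    }()

    return done
}

// runDemo starts a poller, waits for the first tick, cancels the context,
// and waits for the goroutine to exit. It returns the number of ticks seen.
func runDemo() int {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    ticks := 0
    first := make(chan struct{}, 1)
    done := startPoller(ctx, 5*time.Millisecond, func() {
        ticks++ // only the poller goroutine touches ticks while it runs
        select {
        case first <- struct{}{}: // non-blocking, so tick can never get stuck
        default:
        }
    })

    <-first  // at least one tick has happened
    cancel() // ask the goroutine to stop
    <-done   // wait until it actually has
    return ticks
}

func main() {
    fmt.Println("ticks before shutdown:", runDemo())
}
```

Note that tick itself must not block forever; a goroutine stuck inside tick never gets back to the select that checks ctx.Done().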

WaitGroup by Value: A Small Mistake With a Big Impact

This mistake is very easy to miss in code review:

func worker(wg sync.WaitGroup) { // wrong: copied by value
    defer wg.Done()

    // do work
}

sync.WaitGroup must not be copied after first use.

When you pass it by value, you copy its internal state. The worker calls Done() on the copy, not on the original WaitGroup that the main goroutine is waiting on.

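That can cause a deadlock: the original counter never reaches zero, so wg.Wait() blocks forever. go vet catches this mistake through its copylocks check.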

Correct version:

func worker(wg *sync.WaitGroup) {
    defer wg.Done()

    // do work
}

And usage:

var wg sync.WaitGroup

for i := 0; i < 10; i++ {
    wg.Add(1)
    go worker(&wg)
}

wg.Wait()

This rule also applies to other synchronization primitives like sync.Mutex.

Do not copy them after first use.
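
The same rule shapes how types that contain a mutex are written: pointer receivers and pointer sharing, so every caller locks the same mutex. A minimal sketch with a hypothetical Counter type:

```go
package main

import (
    "fmt"
    "sync"
)

// Counter holds its mutex by value, so the struct itself must not be copied.
type Counter struct {
    mu sync.Mutex
    n  int
}

// NewCounter returns a pointer so callers share one Counter instead of copying it.
func NewCounter() *Counter {
    return &Counter{}
}

// Inc uses a pointer receiver: every caller locks the same mutex.
// A value receiver would lock a private copy and protect nothing.
func (c *Counter) Inc() {
    c.mu.Lock()
    c.n++
    c.mu.Unlock()
}

// Value reads the counter under the same lock.
func (c *Counter) Value() int {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.n
}

func main() {
    c := NewCounter()
    var wg sync.WaitGroup

    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            c.Inc()
        }()
    }

    wg.Wait()
    fmt.Println(c.Value()) // prints 100
}
```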

The Loop Variable Trap

This used to be one of the most famous Go concurrency bugs:

for _, user := range users {
    go func() {
        sendEmail(user)
    }()
}

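Before Go 1.22, the loop variable was shared across all iterations, so every goroutine could end up reading whatever value it held when the goroutine finally ran, often the last element. Since Go 1.22, each iteration gets its own copy of the variable in modules that declare go 1.22 or later in go.mod.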

The defensive pattern is still simple and clear:

for _, user := range users {
    user := user

    go func() {
        sendEmail(user)
    }()
}

Even with improvements in newer Go versions, I still like this style in production code because it makes the ownership of the variable obvious to the reader.

Readable concurrency is maintainable concurrency.
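
An alternative that is just as explicit is passing the values as arguments to the goroutine. The sketch below assumes a trivial workload (computing string lengths) purely for illustration; nameLengths is a hypothetical helper.

```go
package main

import (
    "fmt"
    "sync"
)

// nameLengths computes len(name) for every name in its own goroutine.
// Index and value are passed as arguments, so each goroutine works on
// its own copies regardless of the loop semantics of the Go version.
func nameLengths(names []string) []int {
    out := make([]int, len(names))
    var wg sync.WaitGroup

    for i, name := range names {
        wg.Add(1)
        go func(i int, name string) {
            defer wg.Done()
            out[i] = len(name) // each goroutine writes a distinct slot
        }(i, name)
    }

    wg.Wait()
    return out
}

func main() {
    fmt.Println(nameLengths([]string{"ann", "bruno", "charlie"})) // prints [3 5 7]
}
```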

How I Debug Lock Contention in Go

When I suspect a concurrency bottleneck, I do not start by guessing.

I start by measuring.

1. Enable pprof

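import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func main() {
    go func() {
        // Bind to localhost so the profiling endpoints are not exposed publicly.
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // start application
}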

Then collect profiles:

go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

For mutex contention, enable mutex profiling:

runtime.SetMutexProfileFraction(1)

Then inspect:

go tool pprof http://localhost:6060/debug/pprof/mutex
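
A rate of 1 records every contention event, which adds overhead. On average 1/rate events are reported, so a larger value samples more lightly. One way to keep the setting under control is to enable it for a limited window and restore the previous value afterwards. This is a sketch: enableMutexProfiling and currentMutexRate are hypothetical helper names around the real runtime.SetMutexProfileFraction call.

```go
package main

import (
    "fmt"
    "runtime"
)

// enableMutexProfiling sets the mutex profile sampling rate and returns a
// function that restores the previous rate.
func enableMutexProfiling(rate int) (restore func()) {
    prev := runtime.SetMutexProfileFraction(rate)
    return func() {
        runtime.SetMutexProfileFraction(prev)
    }
}

// currentMutexRate reads the rate without changing it: a negative argument
// only queries the current value.
func currentMutexRate() int {
    return runtime.SetMutexProfileFraction(-1)
}

func main() {
    restore := enableMutexProfiling(100) // sample roughly 1 in 100 contention events
    fmt.Println("rate while profiling:", currentMutexRate())

    // ... run the workload and collect /debug/pprof/mutex here ...

    restore()
    fmt.Println("rate after restore:", currentMutexRate())
}
```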

2. Check goroutine count

A rising goroutine count is often a signal of blocked goroutines or leaks.

fmt.Println("goroutines:", runtime.NumGoroutine())

For production, expose it as a metric:

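// A collector must be registered before it shows up on /metrics.
// MustRegister panics if a metric with the same name is already registered.
prometheus.MustRegister(prometheus.NewGaugeFunc(
    prometheus.GaugeOpts{
        Name: "go_goroutines_current",
        Help: "Current number of goroutines.",
    },
    func() float64 {
        return float64(runtime.NumGoroutine())
    },
))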

3. Dump goroutine stacks

When the service is stuck, goroutine dumps are gold.

curl "http://localhost:6060/debug/pprof/goroutine?debug=2"

Look for many goroutines parked in the same wait state or inside the same call:

sync.(*Mutex).Lock
chan send
chan receive
net/http.(*Transport).RoundTrip

If 5,000 goroutines are blocked on the same lock or channel, you found your bottleneck.
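
When the HTTP endpoint is not reachable, the same kind of dump can be taken in-process through runtime/pprof and scanned for a wait state. The sketch below assumes the debug=2 text format, where stacks are separated by blank lines; dumpGoroutines and countStacksContaining are hypothetical helper names.

```go
package main

import (
    "bytes"
    "fmt"
    "runtime/pprof"
    "strings"
    "time"
)

// dumpGoroutines returns the stacks of all goroutines in the debug=2 text form.
func dumpGoroutines() (string, error) {
    var buf bytes.Buffer
    if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
        return "", err
    }
    return buf.String(), nil
}

// countStacksContaining counts the goroutine stacks in dump that mention needle.
func countStacksContaining(dump, needle string) int {
    count := 0
    for _, stack := range strings.Split(dump, "\n\n") {
        if strings.Contains(stack, needle) {
            count++
        }
    }
    return count
}

func main() {
    block := make(chan struct{})
    for i := 0; i < 3; i++ {
        go func() { <-block }() // parked on a channel receive
    }
    time.Sleep(50 * time.Millisecond) // give the goroutines time to park

    dump, err := dumpGoroutines()
    if err != nil {
        fmt.Println("dump failed:", err)
        return
    }
    fmt.Println("stacks waiting on chan receive:", countStacksContaining(dump, "chan receive"))

    close(block) // release the parked goroutines
}
```

The counting is deliberately crude. For a real incident, reading the full stacks still matters, because the frames below the wait state tell you which code path is stuck.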

4. Use the race detector in tests

go test -race ./...

The race detector is not free, and you usually do not run it in production, but it is extremely useful in CI and local debugging.

My Practical Rules for Production Go Concurrency

These are the rules I try to follow when writing or reviewing concurrent Go code:

1. Keep locks small

Lock only the data that needs protection.

Do not lock the whole request lifecycle.

2. Never put slow I/O inside a mutex

Avoid database calls, HTTP calls, email sending, file uploads, and third-party API calls inside critical sections.

3. Bound concurrency

Do not create unlimited goroutines.

Use worker pools, semaphores, queues, or rate limiters.

4. Every goroutine needs a shutdown path

Use context.Context, channel close, or explicit cancellation.

5. Do not copy synchronization primitives

Pass *sync.WaitGroup, *sync.Mutex, and similar primitives by pointer when sharing them.

6. Measure before optimizing

Use pprof, runtime metrics, traces, logs, and goroutine dumps.

Guessing is not debugging.

7. Prefer boring concurrency

The best concurrent code is usually not clever.

It is clear, measurable, and easy to shut down.

Final Thoughts

Go gives us powerful concurrency tools, but it does not automatically give us good concurrent design.

A goroutine is cheap, but it is not free.

A mutex is fast, but it can destroy throughput if you hold it around slow work.

A channel is elegant, but it can leak goroutines if nobody is receiving.

A WaitGroup is simple, but copying it can break your entire flow.

For me, senior Go engineering is not about using every concurrency primitive. It is about knowing when not to use them, where the real boundary is, and how the system behaves under load.

The next time you write this:

mu.Lock()

Ask one question before moving on:

What exactly am I protecting, and how fast can I release this lock?

That one question can save your service from a silent production bottleneck.

References

Go Memory Model: https://go.dev/ref/mem
Go sync package documentation: https://pkg.go.dev/sync
Go diagnostics and profiling tools: https://go.dev/doc/diagnostics
Go blog (scheduler and runtime notes): https://go.dev/blog/
