Making a Go Log Viewer 12x Faster (and the Benchmark Bug That Fooled Me)

#go #pprof #performance #profiling

I built Peacock, a terminal JSON log viewer in Go, and it could not keep up with a busy log stream. So I profiled it with go tool pprof: read the profile, fix the hottest line, re-profile, repeat. On a real 70x240 terminal, throughput went from 52 lines/sec to 651 lines/sec, about 12x.

The most useful lesson, though, came from an evening I lost to a bug in my own benchmark.

Cleanup: a pointless join/split

The baseline profile flagged a viewport setter eating 8% of CPU. SetContent takes a string and splits it on \n:

0   29.66s   227:   m.SetContentLines(strings.Split(s, "\n"))

But my code already had a []string. It was joining the lines into one giant string with lipgloss.JoinVertical, just so SetContent could split them again. Calling SetContentLines directly removed the round trip.
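A minimal sketch of the two call paths, to make the round trip concrete. The viewport type here is a hypothetical stand-in rather than the real viewport model, and strings.Join stands in for lipgloss.JoinVertical; only the SetContent and SetContentLines names come from the profile above.

```go
package main

import "strings"

// viewport is a hypothetical stand-in for the real viewport model.
type viewport struct {
	lines []string
}

// SetContent mirrors the string-based setter from the profile:
// it has to split the string to recover the individual lines.
func (v *viewport) SetContent(s string) {
	v.SetContentLines(strings.Split(s, "\n"))
}

// SetContentLines stores lines that are already split.
func (v *viewport) SetContentLines(lines []string) {
	v.lines = lines
}

// setViaJoin is the old path: join into one big string,
// only for SetContent to split it again.
func setViaJoin(v *viewport, lines []string) {
	v.SetContent(strings.Join(lines, "\n"))
}

// setDirect is the new path: hand over the slice as-is.
func setDirect(v *viewport, lines []string) {
	v.SetContentLines(lines)
}
```

Both functions leave the viewport holding the same lines (as long as no line contains an embedded newline), but setDirect skips one full allocation and one full scan of the backlog per frame.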

The real win: render only what is visible

The hottest function rendered every buffered entry on every frame:

70ms   8.06s   130:   rendered, _ := m.styles.renderEntry(m.visibleEntries[i], width)

The terminal shows ~70 lines, yet Peacock was word-wrapping the entire backlog on every frame. I capped rendering to the viewport height, and the cumulative time of contentLines dropped from 66.86% to 14.11%. This single algorithmic change accounts for most of the practical win.
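Here is a rough sketch of what capping to the viewport height can look like. It assumes the view is pinned to the newest entries, and it is not Peacock's actual code: visibleLines, wrapFixed, and the render callback are hypothetical names, and wrapFixed is a byte-based wrap that is only adequate for ASCII.

```go
package main

// wrapFixed is a hypothetical stand-in for the expensive
// wrap-and-style step. It cuts s into chunks of at most width bytes.
func wrapFixed(s string, width int) []string {
	if width <= 0 || len(s) <= width {
		return []string{s}
	}
	var out []string
	for len(s) > width {
		out = append(out, s[:width])
		s = s[width:]
	}
	return append(out, s)
}

// visibleLines walks backward from the newest entry and stops
// rendering as soon as it has enough lines to fill the viewport.
func visibleLines(entries []string, width, height int,
	render func(string, int) []string) []string {
	if height <= 0 {
		return nil
	}
	var blocks [][]string
	total := 0
	for i := len(entries) - 1; i >= 0 && total < height; i-- {
		b := render(entries[i], width)
		blocks = append(blocks, b)
		total += len(b)
	}
	// blocks are newest-first; flatten them oldest-first.
	lines := make([]string, 0, total)
	for i := len(blocks) - 1; i >= 0; i-- {
		lines = append(lines, blocks[i]...)
	}
	// A wrapped entry may overshoot; keep only the last height lines.
	if len(lines) > height {
		lines = lines[len(lines)-height:]
	}
	return lines
}
```

The cost per frame now depends on the viewport height rather than the backlog size: with a 70-row viewport, at most 70 entries are rendered no matter how many are buffered.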

A ring buffer instead of slice trimming

Appending entries and re-slicing the backlog caused constant memmove calls and GC work. A fixed-size circular buffer overwrites the oldest entry in place:

func (r *entryRing) Append(entry logs.Entry) {
    if r.size == len(r.entries) {
        r.entries[r.start] = entry
        r.start = (r.start + 1) % len(r.entries)
        return
    }
    r.entries[(r.start+r.size)%len(r.entries)] = entry
    r.size++
}

After this, appendEntry disappeared from the profile.
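For context, a self-contained version of the ring might look like the sketch below. Append is the method shown above; the Entry type, the newEntryRing constructor, and the Len and At accessors are assumptions added for illustration, not Peacock's actual API.

```go
package main

// Entry is a hypothetical stand-in for logs.Entry.
type Entry struct {
	Line string
}

// entryRing is a fixed-capacity circular buffer of entries.
type entryRing struct {
	entries []Entry
	start   int // index of the oldest entry
	size    int // number of entries currently stored
}

// newEntryRing allocates the backing slice once. The capacity is
// clamped to at least 1 so Append never indexes an empty slice.
func newEntryRing(capacity int) *entryRing {
	if capacity < 1 {
		capacity = 1
	}
	return &entryRing{entries: make([]Entry, capacity)}
}

// Append stores entry, overwriting the oldest one when full.
func (r *entryRing) Append(entry Entry) {
	if r.size == len(r.entries) {
		r.entries[r.start] = entry
		r.start = (r.start + 1) % len(r.entries)
		return
	}
	r.entries[(r.start+r.size)%len(r.entries)] = entry
	r.size++
}

// Len reports how many entries are stored.
func (r *entryRing) Len() int { return r.size }

// At returns the i-th entry counting from the oldest (0 <= i < Len()).
func (r *entryRing) At(i int) Entry {
	return r.entries[(r.start+i)%len(r.entries)]
}
```

With a capacity of 3, appending five entries leaves the last three in the ring, and At(0) returns the third one appended. No element is ever moved and the backing slice is never reallocated.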

The cache that did nothing, and the 0x0 terminal

I cached each rendered entry by viewport width. Throughput did not move at all:

Ring buffer:    6,095 lines/sec
Cache rendered: 6,095 lines/sec

I re-read the cache logic three times. The bug was not in my code:

$ script -q -c 'tput lines; tput cols' /dev/null
0
0

The benchmark's pseudo-terminal had no dimensions. With width 0, the wrap function returned early, so there was almost no rendering work for the cache to skip. I set explicit stty rows/cols plus LINES/COLUMNS, and the cache finally showed a 3x jump.
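Two pieces of this can be sketched in Go: a per-entry cache keyed by width, and a guard that refuses to benchmark against a terminal with no dimensions. Both are illustrative assumptions rather than Peacock's code; cachedEntry, requireSize, and the wrap callback are hypothetical names.

```go
package main

import "fmt"

// cachedEntry remembers the rendered lines for one entry together
// with the width they were rendered for.
type cachedEntry struct {
	raw   string
	width int
	lines []string
	valid bool
}

// render returns the cached lines when the width is unchanged and
// re-wraps only when the viewport width differs.
func (c *cachedEntry) render(width int, wrap func(string, int) []string) []string {
	if c.valid && c.width == width {
		return c.lines
	}
	c.lines, c.width, c.valid = wrap(c.raw, width), width, true
	return c.lines
}

// requireSize is a guard a benchmark harness could call before
// measuring, so a 0x0 pseudo-terminal fails loudly instead of
// silently skipping the rendering work.
func requireSize(rows, cols int) error {
	if rows <= 0 || cols <= 0 {
		return fmt.Errorf("terminal reports %dx%d; set rows and cols before benchmarking", rows, cols)
	}
	return nil
}
```

A check like requireSize would have turned the lost evening into an immediate error: the cache can only save time when the wrap step does real work, and at width 0 it did almost none.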

Lessons

The biggest wins are algorithmic. Visible-only rendering beat every string-allocation trick.
Your benchmark is part of the system. When an optimization shows zero improvement, suspect the measurement before the code.

For every pprof command, every profile output, and the full corrected throughput ladder: Making a Log Viewer 12x Faster.