When I started optimizing gomarklint, I had no benchmarks. I had unit tests. I had coverage. But I had no idea what the linter actually cost to run on a real document.
Here's what I found, what I changed, and what I'd do differently next time.
What is gomarklint?
gomarklint is a Go-based CLI Markdown linter I've been building as an open-source project. The pitch: catch broken links before your readers do, keep your Markdown clean, single binary, no Node.js required.
The main alternative most teams reach for is markdownlint, which works well but requires a Node.js runtime. For Go projects running in a lean CI environment, pulling in Node.js just to lint Markdown felt like the wrong tradeoff. gomarklint ships as a standalone binary installable via Homebrew, npm, or go install, and integrates with GitHub Actions and pre-commit out of the box.
The rule set covers around 25 checks: structural rules like heading-level (no H4 appearing under H2 without an H3 in between), content rules like no-bare-urls and fenced-code-language, and link validation including internal anchor checking. Each rule emits diagnostics with file path, line number, and severity — error causes a non-zero exit, making it safe as a CI gate.
Internally, each rule is a function that receives the full file content as a slice of lines and returns a slice of violations. No shared parse tree, no AST, just functions over strings. That made it easy to add rules quickly.
The current release processes 100,000+ lines in under 170 ms. A typical 200-line README lints in under 0.2 ms across all rules, and a large documentation site with hundreds of files fits in a CI step nobody notices. Getting there required 20 PRs over three weeks, each measured before it merged.
The Starting Point: No Visibility
Each of gomarklint's roughly 25 rules (heading-level, no-bare-urls, fenced-code-language, and the rest) receives the full file content split into lines and returns a slice of violations. Rules run independently: no shared parse tree, no AST, just functions over strings.
That architecture is easy to extend. But every rule doing anything non-trivial had to solve the same problem on its own: "is this line inside a code block?" Headings inside fenced blocks aren't real headings. URLs inside fenced blocks aren't real URLs. Every rule that cared had to figure it out independently.
The solution that grew organically was a shared utility called GetCodeBlockLineRanges. It scanned the entire document, built a list of [start, end] line ranges for every fenced block, and returned them. Any rule could then call isInCodeBlock(lineNumber, ranges) to check membership via a linear search.
I didn't notice this was a problem because I never measured it.
Step 1: Build a Benchmark You Can Trust
Before optimizing anything, I needed a benchmark I could actually rely on. The existing per-rule _bench_test.go files were isolated and not included in CI comparisons, so they gave no signal about end-to-end cost.
I rewrote the benchmark around a single generateComplexMarkdown(n int) function that produces a realistic n-section document (headings, paragraphs, lists, fenced code blocks, images, links) exercising every rule's scan path without producing any violations:
func writeIntro(sb *strings.Builder) {
sb.WriteString("# Main Title\n\n")
sb.WriteString("This is the introduction to the document.\n\n")
}
func writeHeading(sb *strings.Builder, i int) {
fmt.Fprintf(sb, "## Section %d\n\n", i)
}
func writeParagraph(sb *strings.Builder) {
sb.WriteString("This section contains *important* information. ")
sb.WriteString("Here are some **key** details that you should know about.\n\n")
}
func writeList(sb *strings.Builder) {
sb.WriteString("Key points:\n\n")
sb.WriteString("- First important point\n")
sb.WriteString("- Second critical detail\n")
sb.WriteString("- Third consideration\n\n")
}
func writeCodeBlock(sb *strings.Builder) {
sb.WriteString("```go\n")
sb.WriteString("func example() error {\n")
sb.WriteString(" return nil\n")
sb.WriteString("}\n")
sb.WriteString("```\n\n")
}
func writeLinks(sb *strings.Builder, i int) {
sb.WriteString("Useful resources:\n\n")
fmt.Fprintf(sb, "- [Documentation](https://example.com/docs/%d)\n", i)
fmt.Fprintf(sb, "- [GitHub](https://github.com/project/%d)\n", i)
sb.WriteString("\n")
}
func writeImage(sb *strings.Builder, i int) {
fmt.Fprintf(sb, "\n\n", i, i)
}
func writeSubsection(sb *strings.Builder, i int) {
fmt.Fprintf(sb, "### Subsection %d.1\n\n", i)
sb.WriteString("More detailed information goes here.\n\n")
}
func generateComplexMarkdown(sections int) string {
var sb strings.Builder
writeIntro(&sb)
for i := 1; i <= sections; i++ {
writeHeading(&sb, i)
writeParagraph(&sb)
writeList(&sb)
if i%2 == 0 {
writeCodeBlock(&sb)
}
if i%3 == 0 {
writeLinks(&sb, i)
}
if i%4 == 0 {
writeImage(&sb, i)
}
writeSubsection(&sb, i)
}
result := sb.String()
return result[:len(result)-1]
}
That last constraint — zero violations — matters. A benchmark that triggers violations measures error-reporting cost, not scanning cost. I added a guard test to enforce it:
func TestBenchmarkContentIsViolationFree(t *testing.T) {
content := generateComplexMarkdown(1000)
results, err := lint.LintContent(content, benchmarkConfig())
require.NoError(t, err)
assert.Empty(t, results)
}
This test runs in CI on every PR in the series. If anyone accidentally adds a violation to the benchmark content, the build breaks before the numbers become meaningless.
The benchmark also pays off going forward. Now that it runs on every PR, any new rule added to gomarklint is automatically checked for performance regressions before it merges. If a new no-trailing-spaces rule adds 15% to the geomean, benchstat surfaces that number on the PR before it ships, not after. Without a benchmark in CI, performance regressions from new features stay invisible until someone notices the linter "feels slower" on a large repo.
Step 2: Profile Before Touching Anything
With the benchmark in place, I ran a CPU and allocation profile against the 1000-section document:
go test -bench=BenchmarkFullLinting -benchtime=5s \
-cpuprofile=cpu.prof -memprofile=mem.prof ./cmd/
go tool pprof -top cpu.prof
The top result:
Showing top 5 nodes out of 42
flat flat% sum% cum cum%
6.32s 63.72% 63.72% 6.32s 63.72% isInCodeBlock
0.89s 8.97% 72.69% 0.89s 8.97% strings.TrimSpace
0.54s 5.44% 78.14% 0.54s 5.44% regexp.(*Regexp).FindStringSubmatch
...
isInCodeBlock was a shared helper called from multiple rules. Each rule that used it first called GetCodeBlockLineRanges, which scanned the entire document and allocated a [][2]int slice of fence boundaries, and then called isInCodeBlock on every line, running a linear search through that slice to answer one question: "is line N inside a code block?"
That's O(k) per line and O(n × k) per rule, where n is the number of lines and k is the number of code blocks. In a document with 50 code blocks and 6 rules calling it, every line triggered 6 linear searches over 50 ranges. The cost multiplied silently across every rule that wanted to skip code block content.
The allocation profile added a second finding: CheckHeadingLevels was calling regexp.MustCompile(atxHeadingPattern) inside the function body on every invocation — not once at package init. That single oversight was allocating ~16 MB of heap per benchmark run and showed up clearly in -memprofile output.
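To make the regex finding concrete, here is a small sketch of the two shapes. The function and variable names are hypothetical; only the pattern itself comes from the heading-level rule shown later in this post. In gomarklint the compile happened once per CheckHeadingLevels call. The sketch compiles per line only to keep the example short.

```go
package main

import (
	"fmt"
	"regexp"
)

// atxHeadingRe is compiled once, when the package is initialised,
// and then reused by every call.
var atxHeadingRe = regexp.MustCompile(`^(#{1,6})\s+`)

// headingLevelSlow recompiles the pattern on every call, so each call
// parses the pattern again and allocates a fresh *regexp.Regexp.
func headingLevelSlow(line string) int {
	re := regexp.MustCompile(`^(#{1,6})\s+`)
	if m := re.FindStringSubmatch(line); m != nil {
		return len(m[1])
	}
	return 0
}

// headingLevelFast uses the package-level pattern instead.
func headingLevelFast(line string) int {
	if m := atxHeadingRe.FindStringSubmatch(line); m != nil {
		return len(m[1])
	}
	return 0
}

func main() {
	for _, line := range []string{"# Title", "### Deep", "plain text", "#hashtag"} {
		fmt.Printf("%-14q slow=%d fast=%d\n", line, headingLevelSlow(line), headingLevelFast(line))
	}
}
```

Both functions return the same answers. The difference only shows up in the allocation profile.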
Step 3: Fix the Biggest Problem First
Both issues were in the heading-level rule. I fixed them together in PR #182:
Before:
func CheckHeadingLevels(filename string, lines []string, offset int, minLevel int) []LintError {
var errs []LintError
prevLevel := 0
headingRegex := regexp.MustCompile(`^(#{1,6})\s+`) // compiled on every call
codeBlockRanges, _ := GetCodeBlockLineRanges(lines) // O(n) alloc
for i, line := range lines {
if isInCodeBlock(i+1, codeBlockRanges) { // O(k) linear search per line
continue
}
matches := headingRegex.FindStringSubmatch(line)
if matches != nil {
currentLevel := len(matches[1])
// ... violation checks
prevLevel = currentLevel
}
}
return errs
}
After:
// atxHeadingLevel returns the heading level (1–6) or 0.
// Pure byte scan — no regex, no allocation.
func atxHeadingLevel(line string) int {
level := 0
for level < len(line) && line[level] == '#' {
level++
}
if level == 0 || level > 6 {
return 0
}
if level == len(line) || line[level] == ' ' || line[level] == '\t' {
return level
}
return 0
}
func CheckHeadingLevels(filename string, lines []string, offset int, minLevel int) []LintError {
var errs []LintError
prevLevel := 0
inCodeBlock := false
var fenceMarker string
for i, line := range lines {
if len(line) == 0 {
continue
}
first := firstNonSpaceByte(line)
// Inline fence tracking — no allocation, no O(n×k) lookup.
if inCodeBlock {
if first == fenceMarker[0] {
trimmed := strings.TrimSpace(line)
if IsClosingFence(trimmed, fenceMarker) {
inCodeBlock = false
fenceMarker = ""
}
}
continue
}
if first != '#' && first != '`' && first != '~' {
continue
}
trimmed := strings.TrimSpace(line)
if marker := openingFenceMarker(trimmed); marker != "" {
inCodeBlock = true
fenceMarker = marker
continue
}
if first != '#' {
continue
}
currentLevel := atxHeadingLevel(trimmed)
if currentLevel == 0 {
continue
}
// ... violation checks
prevLevel = currentLevel
}
return errs
}
The result on CI (AMD EPYC, 1000-section document):
| metric | before | after | delta |
|---|---|---|---|
| time/op (geomean) | 52.55 ms | 14.78 ms | −72% |
| memory/op | 2.108 MiB | 1.912 MiB | −9% |
| allocs/op | 9,225 | 4,677 | −49% |
One PR. 72% of the time gone.
Step 4: Turn the Pattern Into a System
The heading-level fix revealed a reusable pattern: the firstNonSpaceByte prefilter. Most rules only care about lines starting with a specific byte. A heading starts with #. A fenced code block starts with ` or ~. A hard tab starts with \t.
Reading the first non-whitespace byte costs almost nothing — one loop over leading spaces, then a byte read. Calling strings.TrimSpace on every line is not free, especially when 95% of lines would be skipped anyway.
I applied this prefilter across 14 more rules over 14 subsequent PRs:
func firstNonSpaceByte(s string) byte {
for i := 0; i < len(s); i++ {
c := s[i]
if c != ' ' && c != '\t' && c != '\r' && c != '\n' {
return c
}
}
return 0
}
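If you want to check the prefilter on your own machine, a rough micro-benchmark like the one below is enough. It is a sketch under assumptions: sampleLines, countWithTrim, and countWithPrefilter are hypothetical, and the input is synthetic, with one heading per twenty lines. Only firstNonSpaceByte is taken from gomarklint. No expected numbers are given because they depend on hardware and Go version.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// firstNonSpaceByte is the helper shown above.
func firstNonSpaceByte(s string) byte {
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c != ' ' && c != '\t' && c != '\r' && c != '\n' {
			return c
		}
	}
	return 0
}

// sampleLines builds synthetic input where 1 line in 20 is a heading.
func sampleLines(n int) []string {
	lines := make([]string, n)
	for i := range lines {
		if i%20 == 0 {
			lines[i] = "  ## Section heading"
		} else {
			lines[i] = "  Plain paragraph text that no heading rule cares about.  "
		}
	}
	return lines
}

// countWithTrim trims every line before looking at it.
func countWithTrim(lines []string) int {
	n := 0
	for _, line := range lines {
		if strings.HasPrefix(strings.TrimSpace(line), "#") {
			n++
		}
	}
	return n
}

// countWithPrefilter rejects most lines after finding one byte.
func countWithPrefilter(lines []string) int {
	n := 0
	for _, line := range lines {
		if firstNonSpaceByte(line) != '#' {
			continue
		}
		n++
	}
	return n
}

var sink int // keeps the compiler from discarding the results

func main() {
	lines := sampleLines(10000)
	trim := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = countWithTrim(lines)
		}
	})
	pre := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = countWithPrefilter(lines)
		}
	})
	fmt.Println("TrimSpace on every line:", trim)
	fmt.Println("first-byte prefilter:   ", pre)
}
```

The gap on a toy loop like this will not match the per-rule deltas from CI, where each rule does more work per surviving line. Treat it as a sanity check, not a prediction.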
Each individual PR produced a modest gain — 3% to 15% per rule. But they compounded:
| PR | Rule | time delta |
|---|---|---|
| #193 | no-bare-urls | −15.5% |
| #195 | no-emphasis-as-heading | −8.8% |
| #190 | empty-alt-text | −6.4% |
| #194 | duplicate-heading | −3.0% (memory: −12.6%) |
| #192 | no-empty-links | −2.8% |
The Final Numbers
Starting from the benchmark baseline after the scaffolding work, to the last PR in the series:
| metric | start | end | improvement |
|---|---|---|---|
| time/op (geomean) | 74.37 ms | 13.88 ms | −81% |
| memory/op | 2.221 MiB | 1.654 MiB | −25% |
| allocs/op | 9,533 | 4,510 | −53% |
Twenty PRs over three weeks. Each one measured, each one merged only after the CI benchmark comparison confirmed no regression.
What I'd Do Differently
Profile before writing a single optimization. I got lucky that the biggest problem (isInCodeBlock at 63%) was immediately obvious. In a more complex codebase, guessing the hotspot first would have sent me optimizing rules that contributed 0.5% of total CPU while the real bottleneck sat untouched. The 20-minute setup cost of a proper benchmark pays back on the first targeted fix.
Treat the benchmark content as a first-class artifact. A benchmark that produces violations is measuring error-reporting cost, not scanning cost. A benchmark whose content drifts rule-by-rule becomes impossible to interpret. The TestBenchmarkContentIsViolationFree guard test caught three cases mid-series where a content addition accidentally triggered a violation. Without it, I would have been optimizing against corrupted numbers and wondering why the geomean kept moving.
Two PRs in this series landed within measurement noise. One was closed without merging. Having numbers in each PR body made that obvious: the CI benchstat output showed ~ (p=0.485) and there was nothing to argue about. A single "performance" PR with twenty changes mixed together would have buried those non-results.
Writing before/after benchmark numbers in every PR description did something I didn't expect: it made the series reviewable after the fact. Looking back at 20 PRs, I can tell exactly which optimization accounted for which fraction of the total gain. The firstNonSpaceByte prefilter was worth applying to 14 rules because I could see, rule by rule, that it kept working. That's how I found the pattern in the first place.
Try It
gomarklint is open source: github.com/shinagawa-web/gomarklint. The entire optimization series is tracked under issue #146, with each PR linking back to it and carrying its own before/after benchmark numbers.
If you're running a Go linter or any line-by-line text processor, these two patterns are worth trying before anything fancier: inline fence tracking instead of pre-computed ranges, and a first-byte prefilter before any string work. Zero dependencies, easy to test, and between them they accounted for almost all of the 81%.