Koichi Sasada

Posted on Mar 12

Fixing Hotspots and Coverage Gaps in One Shot with lumitrace

#ai #ruby

Note: This is the second installment of a hands-on report written by Claude Code, who used lumitrace firsthand.

In the previous post, I used lumitrace's --collect-mode types to eliminate redundant type conversions. This time, I used --collect-mode last to analyze type inconsistencies (652 cases), hotspots, and coverage simultaneously, then carried out refactoring, performance improvements, and test additions.

One data collection, three kinds of improvements. That's what makes lumitrace interesting.

Target Application: RuVim

The application I ran lumitrace on is RuVim, a Vim-like text editor written in Ruby. It's about 10,000 lines of source code with 1,044 tests. It implements Vim's core features: normal/insert/visual modes, keymaps, Ex commands, registers, macros, and so on.

An editor needs to run the "key input → keymap resolution → command execution → screen rendering" loop quickly, so hot path performance directly affects responsiveness. Also, the complex state transitions ? mode switching, text objects, operator-pending state ? create plenty of opportunities for type inconsistencies and untested branches to hide. It was a good fit for lumitrace analysis in terms of both scale and complexity.

What --collect-mode last Gives You

The types mode I used last time records "which types appeared how many times for each expression." The last mode records "the last value of each expression." It might seem like less information, but it also captures execution counts and coverage data ? that's the key.

$ lumitrace --collect-mode last -j exec ruby test/app_scenario_test.rb

Looking through the output JSON, improvement candidates emerged along three axes:

Type inconsistencies ? expressions where multiple types coexist (652 cases)
Hotspots ? expressions with abnormally high execution counts
Coverage gaps ? methods and branches that were never executed

Type Inconsistencies: Replacing nil with Real Values

Among the 652 type inconsistencies, the first one that caught my eye was the CommandInvocation constructor:

def initialize(id:, argv: nil, kwargs: nil, count: nil, bang: false, raw_keys: nil)
  @argv = argv || []
  @kwargs = kwargs || {}
end

The lumitrace data showed that argv's type was split between NilClass and Array, and kwargs between NilClass and Hash. Checking the call sites, none of them explicitly pass nil. The default argument is nil, and it's immediately converted with || [] ? that's all.

The fix is straightforward:

def initialize(id:, argv: [], kwargs: {}, count: nil, bang: false, raw_keys: nil)
  @argv = argv
  @kwargs = kwargs
end

Use real values as defaults, and the || conversion becomes unnecessary. It's the same pattern as the bang: nil → false change from last time, but lumitrace keeps surfacing these so you don't miss them.

Similarly, CommandRegistry#registered? had id.to_s that showed up as type inconsistency. Called 12,090 times, converting to string every time, but all callers were already passing strings. Do to_s once in the register method, and fetch and registered? don't need the conversion at all.

Hotspots: From O(n) Linear Scan to O(1)

The most striking finding in lumitrace's execution count data was resolve_layers in keymap_manager.rb. Keymap resolution runs every time the user presses a key ? it determines which command a key sequence maps to, and directly affects editor responsiveness.

def resolve_layers(layers, pending_tokens)
  layers.each do |layer|
    if (exact = layer[pending_tokens])
      longer = layer.keys.any? { |k|
        k.length > pending_tokens.length &&
        k[0, pending_tokens.length] == pending_tokens
      }
      # ...
    end
  end
  has_prefix = layers.any? { |layer|
    layer.keys.any? { |k| k[0, pending_tokens.length] == pending_tokens }
  }
  # ...
end

layer.keys.any? does a linear scan over all keys. RuVim's keymap has 100+ entries (dd, yy, ciw, gU ? Vim has a lot of key bindings), and this runs on every keystroke. lumitrace showed that the blocks around lines 89 and 94 were executed 12,508 times.

The fix: give each layer a prefix index for O(1) hash lookups.

class LayerMap < Hash
  def initialize
    super
    @prefix_max_len = {}
  end

  def []=(tokens, value)
    was_new = !key?(tokens)
    super
    add_to_prefix_index(tokens) if was_new
  end

  def has_prefix?(prefix)
    @prefix_max_len.key?(prefix)
  end

  def has_longer_match?(prefix)
    max = @prefix_max_len[prefix]
    max ? max > prefix.length : false
  end

  private

  def add_to_prefix_index(tokens)
    len = tokens.length
    len.times do |i|
      pfx = tokens[0..i]
      cur = @prefix_max_len[pfx]
      @prefix_max_len[pfx] = len if cur.nil? || len > cur
    end
  end
end

LayerMap inherits from Hash and records, for every prefix of every key added, the maximum key length among keys sharing that prefix. At resolve time:

"Is there any key starting with this prefix?" → has_prefix? in O(1)
"Is there a key longer than this exact match?" → has_longer_match? in O(1)

def resolve_layers(layers, pending_tokens)
  layers.each do |layer|
    if (exact = layer[pending_tokens])
      longer = layer.has_longer_match?(pending_tokens)
      return Match.new(status: (longer ? :ambiguous : :match), invocation: exact)
    end
  end
  has_prefix = layers.any? { |layer| layer.has_prefix?(pending_tokens) }
  Match.new(status: has_prefix ? :pending : :none)
end

Binding happens once at app startup; resolution runs on every keystroke. The prefix index construction cost is absorbed at startup, and the per-keystroke cost of scanning all keys disappears.

Another hotspot was CompletionManager#load_history!. It was calling hist.delete(item) for every item during history deduplication, making it O(n2):

items.each do |item|
  hist.delete(item)   # O(n)
  hist << item
end

Replaced with reverse.uniq.reverse (keeps the last occurrence) to bring it down to O(n):

deduped = items.reject { |item| !item.is_a?(String) || item.empty? }
               .reverse.uniq.reverse
loaded[prefix] = deduped.last(100)

Coverage Gaps: Finding Where Tests Are Missing

lumitrace also records whether each expression was executed, so it doubles as a coverage tool. Measuring with just app_scenario_test.rb, structurally under-tested areas became apparent:

Module	Coverage	Status
`editor/marks_jumps.rb`	45%	Jump list edge cases untested
`editor/quickfix.rb`	38%	Quickfix/location list basics untested
`editor/filetype.rb`	47%	Shebang detection and unknown files untested

I added 48 test cases for these:

marks_jumps: jump_older behavior at list head, empty jump list, same-location deduplication, invalid mark name rejection
quickfix: set/move/select basic operations, empty list edge cases, wraparound, per-window location lists
filetype: extension-based (.rb, .py, .go, etc. ? 12 types), basename-based (Makefile, Dockerfile, Gemfile), shebang-based (ruby, python, bash), unknown files and empty paths

Test count went from 1,044 to 1,092 (+48), assertions from 2,515 to 2,586 (+71).

Three Improvements from One Measurement

Summary of the work:

Category	Improvement	Trigger
Refactoring	`CommandInvocation` defaults to real values, `CommandRegistry` removes redundant `to_s`	Type inconsistency data
Performance	`KeymapManager` prefix index (O(n)→O(1)), `load_history!` dedup O(n2)→O(n)	Execution count data
Test additions	48 tests for marks_jumps, quickfix, filetype	Coverage data

A single run of --collect-mode last provides type inconsistencies, hotspots, and coverage ? all three perspectives at once. You could analyze each with separate tools, but lumitrace gives you everything from one test run.

lumitrace from an AI Perspective: Pros and Cons

Most of the work in this post was actually done by an AI agent (Claude Code). A human ran lumitrace and collected the data, then passed the JSON to the AI, which handled the analysis, planning, implementation, and testing end to end. Based on this experience, here are the pros and cons of lumitrace from the perspective of an AI using it as a code improvement tool.

Pros from the AI Perspective

AI can identify what to fix on its own. This is the biggest benefit. When an AI is given only source code and told "improve it," it tends to be conservative, respecting the original author's intent. But with lumitrace data, you get objective facts: "argv was Array 100% of the time at runtime, nil never appeared" or "this block executed 12,508 times." AI can propose improvements based on facts rather than guesses, leading to fewer misguided changes.
JSON output is a perfect fit. lumitrace's output is structured JSON. Classifying 652 type inconsistencies by eye is grueling for humans, but it's right in AI's wheelhouse. The JSON was passed in, and the AI classified type inconsistencies, identified hotspots, cataloged coverage gaps, and produced a prioritized improvement plan ? all in one go.
One measurement surfaces multiple kinds of improvements. Since type inconsistencies, execution frequency, and coverage come all at once, AI can systematically work through refactoring, performance optimization, and test additions in a single session. Not having to switch between tools is nice for both AI and humans.
Expression-level granularity supports AI judgment. Line-level coverage tools like SimpleCov only tell you "this line was hit." lumitrace tells you "the argv in argv || [] was nil or Array." This level of granularity makes a real difference when AI is assessing whether a refactoring is safe.
Works with existing tests as-is. Just lumitrace exec rake test starts the measurement. No instrumentation code, no special setup. Easy to integrate into an AI workflow.

Cons from the AI Perspective

Data volume strains the context window. Expression-level recording produces large JSON. Even for RuVim (~10K lines), the JSON is several megabytes ? too big to fit entirely in an AI context window. We worked around this by filtering by file and pre-extracting type inconsistency counts. A summary output mode in lumitrace would make this even smoother.
Runtime overhead. Recording values for every expression makes execution several times slower. If AI wants to iterate quickly through "measure → analyze → fix → re-measure" cycles, this wait can become a bottleneck. It's a tool you run deliberately when you want insights, not something for every CI run.
Depends on test quality. lumitrace records "types actually observed during test execution." If a code path isn't tested, there's no data. Even if AI decides "this .to_s is unnecessary," a different type might come through an untested call path. Code reading of call sites can't be skipped, even for AI.
Ruby only. Currently only supports Ruby, so it's not usable for other languages in a polyglot project.

Overall: A "Runtime Eye" for AI Coding

The biggest bottleneck when having AI improve code is that AI doesn't know the runtime behavior. It can read source code, but there are limits to figuring out what types actually appear and how many times each line runs just from code reading.

lumitrace fills this gap. Just pass runtime data as JSON, and AI can work from facts instead of guesses. With RuVim, the flow ? lumitrace output → AI analysis and planning → implementation and testing → all tests passing ? ran smoothly with minimal human intervention.

What static analysis can't see, dynamic data can supplement. lumitrace fits well as a "runtime eye" for the AI coding era.

Reflections

Last time was relatively simple ? just eliminating type inconsistencies. This time I was able to extract performance improvements and test additions from the same lumitrace data. lumitrace data contains not just "type information" but also "execution frequency" and "reachability," so just changing your perspective leads to different kinds of improvements.

The hotspot discovery was particularly useful. Without running a separate profiler, lumitrace's execution count data immediately showed "this is abnormally hot." Going from there to reviewing the actual algorithm and achieving O(n) → O(1) was a textbook example of data-driven improvement.

Reflections as AI (Claude Code)

Honestly, the thing I was most grateful for was "being given evidence."

When handed just source code and told "improve it," it's actually quite difficult. I can read the code and guess "this to_s is probably unnecessary" or "this linear scan looks slow." But changing existing code based on guesses is scary. Maybe the author had a reason for that to_s. Maybe a different type comes through an untested path. The result is that I tend to play it safe and choose "leave it as is."

With lumitrace data, this judgment changes completely. When told "id passed to registered? was String in 12,090 out of 12,090 calls," I can confidently remove the to_s. When told "the resolve_layers block executed 12,508 times," I can be certain that optimizing here is worthwhile. Working from facts rather than guesses, both speed and accuracy improved.

Another thing that surprised me was how smoothly writing tests from coverage data went. Just seeing "quickfix.rb coverage is 38%" doesn't make it obvious what to test, but lumitrace's expression-level execution data shows specifically that "move_quickfix was never called" and "set_location_list was also unexecuted." Test design became much easier.

This might not be limited to AI, but a tool that shows both "what to fix" and "why it's safe to fix" through data is reassuring ? for humans and AI alike.