I recorded my agent auditing a 36k-file Rails app: the play-by-play

#ai #rails #ruby #opensource

I gave a coding agent a real maintainer chore on the largest open-source Rails monolith there is, and recorded the whole session. Then I gave it the same chore with a map of the codebase and recorded that one too.

What follows is the play-by-play. The agent's moves are real, lightly trimmed for length; my commentary is the indented notes between them. Watch where the first run stops, and why it doesn't know it stopped.

The repo is GitLab, indexed at a pinned commit:

gitlabhq @ 1f9c256f0
tracked files / ruby       68,289 / 29,784
files indexed (coverage)   36,829 (100%)
symbols / graph edges      177,929 / 1,121,147

A million edges in one Rails app. Nobody holds that in their head. The question is whether the agent can rebuild enough of it from files to do the job.

The task, identical on both runs:

You are about to rework how a MergeRequest, and everything attached to it, behaves when it changes or is torn down. Before touching the model, audit every place that depends on the MergeRequest contract.

The gold set was built by hand against the source first: 16 scattered dependents, planted across app/services, app/workers, app/models, app/graphql, app/serializers, and lib/, so no single grep rounds them all up. The agent never sees the list. It gets graded against it.

Same model both runs (Claude Code, Opus 4.8). Same prompt. Same commit. The only variable is the map.

Run 1, cold. The grep grind.

[00:00]  me → audit every dependent of MergeRequest before the rework
[00:06]  agent → grep -rin "merge_request" app/ lib/ ee/ | wc -l
         => 41,800+

First move, the only move it has without structure: search the token. Forty-one thousand hits. That's not a list of dependents, it's a second codebase. The agent has a token budget, not a weekend.

[00:31]  agent → grep -rinE "belongs_to :merge_request|has_many :merge_requests"
         => 380+ named associations

This part it does well. The named associations come back fast, and the Rails conventions earn their reputation here: a service lives in app/services, a worker in app/workers, each named for what it is. The agent reads these correctly. The easy third of the answer.

[01:10]  agent → grep -rinE "merge_request_id|\.merge_request\b|MergeRequest\." | wc -l
         => 9,200+

Now the real work, and the dead end. Nine thousand hits, and the agent has to decide which are dependents and which are an incidental local variable or a log line. It samples. It can't read all of them.

It walks straight past things like this:

# Issuable is mixed into both Issue and MergeRequest.
# Code that resolves notes on "the issuable" depends on MergeRequest
# without ever typing the string a grep could match.
module Issuable
  def resolvable_discussions
    @resolvable_discussions ||= discussion_notes.resolvable
  end
end

The notes-resolution service depends on MergeRequest through the Issuable concern. It never types the class name. No grep for MergeRequest reaches it. The dependents that matter are exactly the ones with no shared token to match on.
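To make that concrete, here is a minimal, self-contained sketch of the same shape. Everything in it is a hypothetical stand-in (the toy Issuable, the toy ResolveService, the discussion hashes), not GitLab's real code. The service source is kept as a string so the same text can be both "grepped" and run.

```ruby
# Toy stand-ins only; none of this is GitLab's actual code.
module Issuable
  def resolvable_discussions
    discussions.select { |d| d[:resolvable] }
  end
end

class Issue
  include Issuable

  def discussions
    []
  end
end

class MergeRequest
  include Issuable

  def discussions
    [{ id: 1, resolvable: true }, { id: 2, resolvable: false }]
  end
end

# The service is written against "an issuable" and never names a class.
SERVICE_SOURCE = <<~SRC
  class ResolveService
    def initialize(issuable)
      @issuable = issuable
    end

    def execute
      @issuable.resolvable_discussions.map { |d| d[:id] }
    end
  end
SRC

# Define ResolveService from exactly the text we are about to search.
eval(SERVICE_SOURCE, TOPLEVEL_BINDING)

# What a token search sees: no line mentions the model at all.
grep_hits = SERVICE_SOURCE.lines.grep(/merge_request|MergeRequest/i)

# What actually happens at runtime: the service works on a MergeRequest.
resolved_ids = ResolveService.new(MergeRequest.new).execute

puts "grep hits in service source: #{grep_hits.size}"
puts "ids resolved on a MergeRequest: #{resolved_ids.inspect}"
```

The search comes back empty while the runtime dependency is real, which is the gap the cold run falls into.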

[04:50]  agent → [writes the audit]
         "Audited the dependents of MergeRequest. Found: the
          named associations, the diff serializer, the pipeline
          link. The contract is depended on primarily through..."
[05:12]  agent → done.

Five minutes, near the harness ceiling. It ground honestly the whole way and produced a calm, finished-looking report. It never invented a citation; every dependent it named is real. It just found 2 of the 16 that matter. On a rerun, 1. It does not know it stopped at two. Nothing in the output could tell it.

Cold result: 2 then 1 of 16. Scattered recall 0.09. The audit looked complete.

That last sentence is the whole danger. On a small repo you'd notice two-sixteenths. On a monolith you ship it, and the fourteen you missed are the ones that page someone at 3am.

Run 2, same task, with the map.

[00:00]  me → [identical prompt]
[00:03]  agent → sense_blast MergeRequest

         MergeRequest  (app/models/merge_request.rb)
           932 symbols in blast radius
           app/workers/merge_requests/process_auto_merge_worker.rb
           app/services/discussions/resolve_service.rb
           app/serializers/merge_request_diff_entity.rb
           app/graphql/types/issuable_type.rb
           ee/app/services/ee/milestones/promote_service.rb
           lib/gitlab/url_builder.rb
           ... 926 more

One call. 932 symbols, the resolved set, returned in a single shot instead of reconstructed from 41,800 grep hits. The notes-resolution service that the Issuable concern hid? It's right there in the list. The map walked the edge the grep couldn't see.
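For intuition only, a blast radius can be pictured as transitive reachability over "who calls into this" edges, with mixed-in modules counted as part of the model's surface. The sketch below is an assumption about the general idea, not how sense_blast is implemented, and the tiny graph is hand-made for illustration.

```ruby
require "set"

# Hand-made toy graph; the real index has over a million edges.
# INCLUDES: class => modules mixed into it.
# CALLERS:  symbol => symbols that call into it.
INCLUDES = { "MergeRequest" => ["Issuable"] }.freeze
CALLERS = {
  "MergeRequest" => ["MergeRequestDiffEntity"],
  "Issuable" => ["Discussions::ResolveService"],
  "Discussions::ResolveService" => ["API::Discussions"]
}.freeze

# Breadth-first walk from the model and its mixins through caller edges.
def blast_radius(root)
  seeds = [root] + INCLUDES.fetch(root, [])
  seen = Set.new(seeds)
  queue = seeds.dup

  until queue.empty?
    current = queue.shift
    CALLERS.fetch(current, []).each do |caller_name|
      # Set#add? returns nil when the element was already present.
      queue << caller_name if seen.add?(caller_name)
    end
  end

  (seen - seeds).to_a.sort
end

puts blast_radius("MergeRequest")
```

In this toy graph the resolve service is reached through the Issuable edge even though nothing connects it to the MergeRequest name directly.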

[00:18]  agent → [reads each candidate, pins to file:line]
[03:40]  agent → [writes the audit]
         "16-item dependency audit of MergeRequest, each pinned
          to file:line: auto-merge worker, notes-resolution
          service, cycle-analytics builder, API discussions,
          GraphQL issuable, Jira integration, milestone
          promotion, ghost-user handler, timelog..."

Same model. Same budget, spent differently, on reading and pinning instead of hunting. It caught 13 of 16 on its best run, a floor of 10. Twelve of the dependents it found are ones the cold runs reached in no attempt:

auto-merge worker · notes-resolution service · cycle-analytics builder · API discussions · GraphQL issuable · Jira integration · milestone promotion · ghost-user handler · timelog · timeline event · URL builder · enterprise discussion

Mapped result: 10 then 13 of 16. Scattered recall 0.72, full audit 0.26 → 0.67.
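The recall figures can be checked by hand. The sketch below assumes scattered recall is the mean, over runs, of dependents found divided by the 16-item gold set. That definition is my reading of the numbers rather than something the harness output states, but it reproduces both 0.09 and 0.72.

```ruby
GOLD_SET_SIZE = 16

# Mean of per-run recall: hits / gold set size, averaged over the runs.
def mean_recall(hits_per_run, gold_size = GOLD_SET_SIZE)
  per_run = hits_per_run.map { |hits| hits.to_f / gold_size }
  per_run.sum / per_run.size
end

cold = mean_recall([2, 1])      # the two cold runs
mapped = mean_recall([10, 13])  # the two mapped runs

puts "cold:   #{cold.round(2)}"
puts "mapped: #{mapped.round(2)}"
```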

Two things the recording taught me that the scores don't

The map cost more here, and I'm not hiding it. On this run the map billed about 9% more tokens, not fewer (27,604 → 30,128). On other repos it came in cheaper. Token cost is task-dependent and I'd never compare it across agents. What didn't move is reach: 2.6x more of the real dependent set, plus the twelve silent breaks. Nine percent more tokens to go from two of sixteen to thirteen is a rounding error against the incident you didn't have.
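The arithmetic behind those two figures, as a quick check. The token counts come from the run above. Pairing the 2.6x with the full-audit scores (0.26 and 0.67) is my assumption, chosen because that pair reproduces the number.

```ruby
COLD_TOKENS = 27_604
MAPPED_TOKENS = 30_128

# Relative token overhead of the mapped run over the cold run.
TOKEN_OVERHEAD = (MAPPED_TOKENS - COLD_TOKENS).to_f / COLD_TOKENS

# Assumed source of the 2.6x: full-audit score, mapped over cold.
FULL_AUDIT_RATIO = 0.67 / 0.26

puts format("token overhead: %.1f%%", TOKEN_OVERHEAD * 100)
puts format("full-audit ratio: %.1fx", FULL_AUDIT_RATIO)
```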

The first time I ran this, the map lost, and that's why I trust it now. Early runs scored 12, then 8, then 1, all over the place. The lazy read was to blame the scenario. The transcripts said otherwise: sense_blast was returning a different set of callers on each call. On a hub this size almost every dependency is a plain method call sharing one confidence score, and the index capped that tied list with an unstable sort, even evicting direct callers for distant ones. A non-reproducible impact analysis, handed to any large-repo user, silently. The fix made the cap deterministic: callers are ordered by confidence, then direct over indirect, and it ships for everyone now. The benchmark was supposed to score the tool. It kept fixing it instead.
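A minimal sketch of what a deterministic cap looks like, under stated assumptions. The Caller struct, the sample names, and the final tiebreak on name are all hypothetical. The source of truth above only names two keys, confidence and direct over indirect. The name key is added here so the order is total and no tie is left for the sort to resolve arbitrarily.

```ruby
# Hypothetical record shape; the real index's fields are not shown in this post.
Caller = Struct.new(:name, :confidence, :direct, keyword_init: true)

# Sort on a total order, then cap. Same input set, same output, every time.
def cap_callers(callers, limit)
  callers
    .sort_by { |c| [-c.confidence, c.direct ? 0 : 1, c.name] }
    .first(limit)
end

# Every caller shares one confidence score, as on a large hub.
SAMPLE_CALLERS = [
  Caller.new(name: "UrlBuilder",        confidence: 0.9, direct: false),
  Caller.new(name: "ResolveService",    confidence: 0.9, direct: true),
  Caller.new(name: "AaaIndirectHelper", confidence: 0.9, direct: false),
  Caller.new(name: "AutoMergeWorker",   confidence: 0.9, direct: true)
].freeze

# Input order no longer matters: shuffle it and the capped set is unchanged.
5.times do |seed|
  shuffled = SAMPLE_CALLERS.shuffle(random: Random.new(seed))
  puts cap_callers(shuffled, 2).map(&:name).inspect
end
```

With the cap at two, both direct callers survive on every shuffle, and the indirect one that sorts first by name never evicts them.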

That determinism is also the quiet argument for why this is structure, not a model trick. The map computes the same 932 every time. A model infers a different answer every run; you watched it go 2 then 1. A better model infers more confidently, not more reproducibly. The map reads this repo at this commit, not a training snapshot, and any agent can call it over MCP. None of that rides on which model you run next quarter. It's the part that doesn't change when the model does.

Record your own

This is worth watching on your own code, because your monolith has a MergeRequest too.

Pick the model that half your services reach into and nobody fully tracks. Ask your agent cold: "before I change how this model is torn down, find every place that depends on it." Watch the grep grind. Count the answer.

Then give it the map.

→ curl -fsSL https://luuuc.github.io/sense/install.sh | sh
→ sense scan in the root of the app that pages your team at night
→ sense setup to connect your agent

Ask again and diff the two transcripts. On a tree this size, the dependents you couldn't find cold are exactly the ones the change would have broken.

The full session logs, the answer key, every transcript for thirteen repos.

I build the map in this recording. Everything you'd need to call me wrong is public: the transcripts, the harness, the pinned commit, and the judge. Check the session instead of taking my read of it.

PS. The scariest frame in that whole recording is the cold run writing a composed, confident audit of two dependents out of sixteen and signing off. No flailing. That composure is the thing to be afraid of on a big repo.