The Five-Thousand-Line File

#webdev #softwareengineering #programming #agents

Every team has one. Sometimes it is called utils.ts or helpers.py. Sometimes it has the name of a domain concept that originally meant something specific and has since absorbed everything tangentially related. The file is large enough that nobody opens it casually. It has multiple maintainers, each of whom understands a different third of it. New additions go into it because that is where similar things already live, and the file gets larger.

This is the god file: a single file that has grown to do too much, and that resists the refactor that would split it because the refactor is large and the file mostly works.

The god file is one of the most agent-hostile shapes a codebase can take.

How files get this big

No file is born five thousand lines long. The growth is incremental.

The file starts as something reasonable: a module that handles one thing, three hundred lines, well-organized. A developer adds a function related to the thing. Another developer adds a function that is related to one of the existing functions, but also touches a new concept. The new concept does not justify its own file, so it lives in this one. Over time, the file accumulates concepts that share an author, a directory, or nothing in particular except convenience.

The reason nobody splits it is that splitting it is a project. The file is imported from many places. Each import has to be updated. The functions inside have shared private helpers that have to be sorted out. Tests for the file have the same problem in miniature. The estimate for the refactor is "a sprint," and a sprint is always too expensive when the file is "fine."

So the file stays. New developers add to it because the existing functions are there. A coding agent asked to add a feature does the same. The file grows.

Why agents struggle here

The god file is expensive for an agent in three specific ways.

The first is context budget. The agent loads files into its context window, which is its working memory, to understand them. A five-thousand-line file consumes a large fraction of that budget for a small change. The agent has less room left for the rest of the codebase: the calling files, the tests, the conventions. Quality drops, not because the agent is dumber, but because it is operating with less situational awareness.

The second is pattern dilution. The agent pattern-matches against the file it is editing. A file with five hundred coherent lines teaches the agent one strong pattern. A file with five thousand lines teaches the agent ten weak patterns, often contradictory. The agent picks one, often the wrong one for the specific change.

The third is the path-of-least-resistance problem. When asked to add new functionality, the agent looks for where similar functionality lives. It finds the god file. It adds to the god file. The file grows by one more function, in the same shape as the previous additions. The agent, like every previous contributor, has chosen the cheap path. The file is now slightly more god-like.

A small coherent file is a force multiplier for an agent. A god file is a tax.

When size is actually the problem

It is worth being careful about the diagnosis. Not every large file is a god file. A file that defines a complex but coherent thing (a state machine, a parser, a single algorithm) may legitimately be large. The size is not the smell. The smell is unrelated things sharing a file.

The diagnostic question is: if you had to give this file a name that described what it does, in a single concept, could you? parser.ts is a coherent file even at three thousand lines, because everything in it is part of the parser. helpers.ts is incoherent at five hundred lines, because nothing about the name tells you what is in it. Whether the size is a problem is downstream of the coherence.

A useful test: pick five functions from the file at random. Do they belong together? If yes, the file is big but legitimate. If no, the file is a junk drawer with a misleading name.
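The five-function test is easy to script. Here is a minimal sketch, assuming a TypeScript file whose functions are declared at the start of a line as `function foo(` or `const foo = (`. The names listFunctionNames and sample, and the example source, are made up for illustration. A regex will miss methods and unusual declarations, which is acceptable for a spot check.

```typescript
// Hypothetical helper: collect top-level function names from source text.
// Only matches `function foo(` and `const foo = (` style declarations.
function listFunctionNames(sourceText: string): string[] {
  const pattern =
    /^(?:export\s+)?(?:async\s+)?function\s+([A-Za-z_$][\w$]*)|^(?:export\s+)?const\s+([A-Za-z_$][\w$]*)\s*=\s*(?:async\s*)?\(/gm;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(sourceText)) !== null) {
    found.push(match[1] || match[2]);
  }
  return found;
}

// Pick up to `k` items without replacement (partial Fisher-Yates shuffle).
// `random` is injectable so the choice can be made deterministic in tests.
function sample<T>(items: T[], k: number, random: () => number = Math.random): T[] {
  const pool = items.slice();
  const count = Math.min(k, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

// Stand-in for the contents of a real file; in practice you would read it from disk.
const exampleSource = [
  "export function formatPrice(cents: number) { return cents / 100; }",
  "export const slugify = (title: string) => title.toLowerCase();",
  "async function retryRequest(url: string) { return url; }",
].join("\n");

console.log(sample(listFunctionNames(exampleSource), 5));
```

Read the printed names out loud. If you cannot say what they have in common without using the word "utility," that is your answer.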

Splitting by concern

The right way to split a god file is not by line count. It is by concern.

Look at the functions in the file. Group them by what they are for, not by what they touch. Two functions that both manipulate strings are not necessarily related; two functions that both implement steps of the same workflow are.

For each group, ask: would this group, alone, make sense as a file? Does it have a name that describes what it does? Do the functions in the group mostly call each other, or do they mostly reach into the rest of the file?

Groups that score well on this become candidate files. Move them. The imports update mechanically. The tests follow. The original god file shrinks by one concept; the codebase gains a coherent module.
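The dependency question can be made concrete by counting calls. The sketch below assumes you already have a rough call graph for the file, written by hand or pulled from a tool. The CallGraph shape, the scoreGroup function, and the function names in the example are all hypothetical.

```typescript
// Each key is a function in the god file; each value lists the other
// functions in the same file that it calls. Example data is invented.
type CallGraph = Record<string, string[]>;

interface GroupScore {
  internalCalls: number; // caller and callee are both in the candidate group
  crossingCalls: number; // exactly one side of the call is in the group
}

function scoreGroup(graph: CallGraph, group: string[]): GroupScore {
  const members = new Set(group);
  let internalCalls = 0;
  let crossingCalls = 0;
  for (const caller of Object.keys(graph)) {
    for (const callee of graph[caller]) {
      const callerIn = members.has(caller);
      const calleeIn = members.has(callee);
      if (callerIn && calleeIn) {
        internalCalls++;
      } else if (callerIn !== calleeIn) {
        crossingCalls++;
      }
    }
  }
  return { internalCalls, crossingCalls };
}

const helpersGraph: CallGraph = {
  applyDiscount: ["roundCents"],
  priceWithTax: ["applyDiscount", "roundCents"],
  roundCents: [],
  slugify: ["trimWhitespace"],
  trimWhitespace: [],
  renderInvoiceTitle: ["slugify", "priceWithTax"],
};

const pricingScore = scoreGroup(helpersGraph, ["applyDiscount", "priceWithTax", "roundCents"]);
console.log(pricingScore); // { internalCalls: 3, crossingCalls: 1 }
```

A group with many internal calls and few crossing calls is a good first extraction. The crossing calls are the imports you will have to add after the move.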

This is the kind of refactor an agent is good at, given a clear scope. "Move these eight functions to a new file called pricing.ts, update all callers, and split the corresponding test file." A concrete instruction. The agent does the mechanical work. A human reviews the result.

Limit the size mechanically

Once you have done the initial split, the way to keep the file from re-growing is the same as with every other limit: make it mechanical.

Most linters can enforce a maximum file length: ESLint has max-lines, Pylint has too-many-lines. Set the limit slightly above your current largest legitimate file. With the linter running in CI, the build fails when a file exceeds it. New code cannot grow a file past the limit; it has to go somewhere else.

The limit is a forcing function, not a precise number. The point is not that 500 is correct and 501 is wrong. The point is that the team is forced to make an active decision when a file approaches the limit, instead of letting it drift past 1,000, 2,000, 5,000 without noticing.

The agent will respect the limit as long as the agent runs the linter, because a failed lint is feedback it has to resolve. It will tend to put new functions in new files when the existing file is near the threshold. The default direction shifts from "grow the god file" to "split the god file," which is what you wanted.
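For a TypeScript project using ESLint, the limit might look like the flat-config fragment below. Treat it as a sketch: the 400-line figure, the 20% headroom, and the src/legacy/helpers.ts path are placeholders, and the fragment assumes the rest of your config already sets up a TypeScript parser. The max-lines rule and its max, skipBlankLines, and skipComments options are part of ESLint core.

```typescript
// Fragment for an ESLint flat config file such as eslint.config.mjs.
// Assumption: 400 is the line count of the largest file you consider legitimate.
const LARGEST_KEPT_FILE_LINES = 400;
const MAX_LINES = Math.ceil(LARGEST_KEPT_FILE_LINES * 1.2); // 480

export default [
  {
    files: ["**/*.ts"],
    rules: {
      // Fail the lint when a file grows past the limit.
      "max-lines": ["error", { max: MAX_LINES, skipBlankLines: true, skipComments: true }],
    },
  },
  {
    // Hypothetical path: a god file that is still being split.
    // Remove this exemption once the file is under the limit.
    files: ["src/legacy/helpers.ts"],
    rules: { "max-lines": "off" },
  },
];
```

Keeping the exemption list in the config makes the remaining god files visible. Every entry in it is a refactor you still owe.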

First steps

If your codebase has god files and you want to start fixing them:

Find your largest source file. Count its lines. Note the number. Open it and look at the function list. Are they coherent? Or is the file a junk drawer?
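If you would rather not count by hand, a short Node script can rank files by length. This is a sketch, not a polished tool: the function names, the skipped directories, and the extension list are assumptions you should adjust for your repository.

```typescript
import * as fs from "fs";
import * as path from "path";

// Count lines in a string; a trailing newline does not add an extra line.
function countLines(text: string): number {
  if (text.length === 0) return 0;
  const newlines = text.split("\n").length - 1;
  return text.charAt(text.length - 1) === "\n" ? newlines : newlines + 1;
}

// Recursively collect files with one of the given extensions,
// skipping node_modules and dot-directories.
function listSourceFiles(dir: string, extensions: string[]): string[] {
  const found: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      if (entry.name === "node_modules" || entry.name.charAt(0) === ".") continue;
      found.push(...listSourceFiles(fullPath, extensions));
    } else if (extensions.indexOf(path.extname(entry.name)) !== -1) {
      found.push(fullPath);
    }
  }
  return found;
}

// Return the `limit` longest files under `dir`, longest first.
function largestFiles(
  dir: string,
  extensions: string[],
  limit: number
): Array<{ file: string; lines: number }> {
  return listSourceFiles(dir, extensions)
    .map((file) => ({ file, lines: countLines(fs.readFileSync(file, "utf8")) }))
    .sort((a, b) => b.lines - a.lines)
    .slice(0, limit);
}

// Example call, assuming your code lives under ./src:
// console.table(largestFiles("src", [".ts", ".tsx"], 10));
```

The same script works for the quarterly check: run it, look at the top of the list, and pick the next file to split.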

If it is a junk drawer, pick the most distinct group of functions — the smallest set you can extract without untangling shared dependencies. Move that group to its own file. Update imports. Run tests. Ship the PR.

Add a max-lines rule to your linter, set 20% above the largest file you have decided to keep. Exempt any god file you have not split yet, so the rule can go in today. The build now prevents every other file from growing past the limit.

Quarterly, look at the file-length distribution. Pick the largest file. Split one group. Repeat.

Add a rule to AGENTS.md: "When adding new functions, prefer creating a new file in a domain-appropriate directory over extending a large existing file. Files larger than [N] lines are a smell; do not extend them without splitting at the same time."

The god file did not arrive overnight. It will not leave overnight. But the trajectory matters. A team that splits one group per quarter is on a path toward a codebase made of coherent modules. A team that does not is on a path toward one file that contains everything, and an agent that gets worse the more it touches the codebase.

The cost of any one addition is small. The cost of letting them all pile up is not.