The node_modules That Wouldn't Die
TL;DR - An internal app of mine refused to deploy because the build kept importing the wrong version of a Vite plugin. The lockfile said one thing, the build did another. I blamed the codegen. Then I blamed git. Both times I was wrong. The actual culprit was a node_modules directory sitting on the deploy host from a previous era of the project, surviving every git reset --hard because it was never tracked in the first place. Once I cleared that out, the build broke a second time for almost the same reason. Here is the story.
The error that started it
The deploy fails at the build step with this beauty:
SyntaxError: The requested module './chunk-XYZ.js' does not provide an export named 'tanstackRouter'
I knew this one. @tanstack/router-plugin renamed its main export from TanStackRouterVite to tanstackRouter at some point. The lockfile on main was pinned to a version where the new name was correct. The Vite config was importing the new name. Everything on my machine was happy.
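For context, here is roughly what the relevant bit of vite.config.ts looked like. A sketch, not my exact config, and it assumes a React app:

```ts
// vite.config.ts (sketch)
import { defineConfig } from 'vite'
// the renamed export; older versions of the plugin only ship TanStackRouterVite
import { tanstackRouter } from '@tanstack/router-plugin/vite'

export default defineConfig({
  plugins: [tanstackRouter({ target: 'react' })],
})
```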
So why was the live host trying to call the new name on an older module that did not export it?
Suspect one, the codegen
The app uses Orval to generate its API client off a Swagger spec. My first thought was that one of those generated files was importing the plugin somehow, and that the codegen had drifted on the host. I went hunting through the generated output. Nothing there even touched Vite plugins.
Dead end. Time wasted. Moving on.
Suspect two, git not really resetting
The deploy script does git fetch && git reset --hard origin/main before building. So I started suspecting the reset was not really happening. Maybe the script was running in the wrong directory. Maybe HEAD was somehow detached and the reset was a no-op. I sshed in, ran the commands by hand, watched them tell me everything was clean.
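By hand meaning something like this, with a stand-in path for the real one:

```sh
cd /srv/app                     # stand-in for the real deploy directory
git fetch origin
git reset --hard origin/main    # HEAD is now at origin/main
git status                      # nothing to commit, working tree clean
```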
Tell me I am not the only one who has stared at a "nothing to commit, working tree clean" and refused to believe it.
The tree was clean. The lockfile was right. So what was I building from?
The actual culprit
Here is the line in the Dockerfile that I had not been thinking hard enough about:
COPY . .
That copies everything in the build context into the image. Including node_modules if one happens to be sitting in the build context.
And here is what I had completely forgotten about git reset --hard. It does not delete untracked files. Neither does git checkout -f. Both will happily clobber tracked files back to their committed state. But anything that was never committed in the first place is invisible to them. It just sits there. Forever. Quietly.
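A throwaway repo makes the blind spot easy to see:

```sh
mkdir scratch && cd scratch && git init -q
echo tracked > a.txt && git add a.txt && git commit -qm init

echo changed > a.txt                # modify a tracked file
mkdir node_modules
echo stale > node_modules/old.js    # never tracked, never committed

git reset --hard -q                 # a.txt snaps back to "tracked"...
cat node_modules/old.js             # ...and this still prints "stale"
```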
Sitting on the deploy host, undisturbed across who knows how many deploys, was a node_modules directory from a much older incarnation of the project. The pnpm install step inside the Dockerfile was running, sure. But COPY . . ran first and dropped a years-old node_modules into the image, and whatever pnpm did on top of that was not enough to overwrite the bits that mattered. The version of @tanstack/router-plugin that ended up in the final image was the one that had been sitting on the host since the previous era, where the export was still called TanStackRouterVite.
A folder older than the bug. Quietly winning every deploy.
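For the record, the shape of the old Dockerfile, reconstructed as a sketch rather than copied from the real file:

```dockerfile
FROM node:20-alpine        # stand-in base image
WORKDIR /app
COPY . .                   # drops the host's stale node_modules into the image
RUN corepack enable && pnpm install   # installs on top of whatever COPY brought in
RUN pnpm build             # assumes a "build" script in package.json
```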
The cleanup that broke things again
Easy fix, right? rm -rf node_modules on the host, redeploy, done.
The build broke again. A missing API client file this time. And then I noticed it. The same blind spot was hiding two more freeloaders. The Orval output directory and a generated swagger.json, both gitignored, both supposed to be regenerated by the build, were also surviving across deploys. They had been sitting on the host so long that nobody had noticed that the build itself never actually ran the generators properly. The host filesystem was the only reason the app had a working API client at all.
So I cleaned those out too, and then fixed the actual generation step in the Dockerfile. Because if a fresh checkout of the repo into a clean container could not produce a working build, that was the real problem all along.
What I changed
Three small things, none of them clever.
A proper .dockerignore in the repo. node_modules, dist, and the generated client directories all listed. The build context never sees the host's leftovers again.
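Something like this, with stand-in names for the generated directories:

```
# .dockerignore (sketch)
.git
node_modules
dist
swagger.json
src/api/generated
```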
The Dockerfile now runs the generators itself. The API client is produced inside the build, off a swagger.json that is also generated inside the build. No host artifact is load-bearing.
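The fixed ordering looks roughly like this. The generate:swagger script name is an assumption, standing in for whatever produces the spec on your project; the orval --config invocation is the real CLI:

```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile
COPY . .
# produce the spec and the API client inside the build, not on the host
RUN pnpm run generate:swagger && pnpm exec orval --config orval.config.ts
RUN pnpm build
```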
One full cleanup of the deploy host, by hand, of every gitignored thing. Then a redeploy from scratch. It worked on the first try, which felt suspicious until I remembered that is what builds are supposed to do.
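If you would rather not do it by hand, git clean can do the same sweep. The -x flag is the one that reaches the gitignored stuff that pull and reset never touch:

```sh
git clean -xdn    # dry run: list every untracked and ignored file and directory
git clean -xdf    # same thing, but actually delete it
```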
The lesson
A long-lived deploy host is a museum. Every gitignored thing you have ever built on it is still there unless you actively remove it. git pull, git reset, git clean without the right flags, none of them touch the museum. Your Dockerfile does not know it is being lied to. Your lockfile does not know it is being overruled. The build just shrugs and ships you whatever the host happens to be wearing that day.
Two rules from now on.
Anything gitignored is regenerated, never inherited. If your build relies on a file the repo does not track, that file must be produced inside the build. Period. If you are shrugging at this rule because "it has been working fine", that is exactly what I was doing.
.dockerignore is not optional. Without it, your build context is a snapshot of whatever weird state the host has accumulated, and COPY . . is a great way to ship that weirdness into your image.
The whole fiasco was three cleanups, an embarrassing number of wrong guesses, and a lesson I should have learned the first time I saw git reset --hard and assumed it meant what it sounds like. It does not. Untracked is invisible.
Not going to pretend this was a perfect writeup. But if even one part of it helped someone avoid the headache I went through, then it was worth putting down. See you in the next one.
