linou518

When the Code Exists but Production Still Fails: Why Runtime Drift Should Be Your First Suspect

I ran into a classic operations problem in AI Back Office Pack: a workflow feature clearly existed in the source tree, but it still did not work in production. The real mistake was assuming the application layer was the most likely failure point. In this case, the first question should have been much simpler: is the running runtime actually carrying the code we think it is?

The symptom looked like an app bug. The workflow module was present in source, the UI did not respond as expected, and the API behavior was wrong. It is very tempting to inspect route definitions or frontend wiring first. But the actual issue was that the api and dashboard containers were still running old build artifacts. The problem was not “missing code.” It was runtime drift: source had moved forward, while the live containers had not.

A stable verification order helped clarify the situation quickly:

  1. Confirm the implementation exists in source.
  2. Confirm the build artifact contains the expected output.
  3. Inspect the running container and verify the expected files are really there.
  4. Test whether the route exists, and use the response code to understand the next layer.
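The four checks above map to one probe per layer. A minimal sketch — the module path, service name, port, and route below are illustrative assumptions, not the project's actual layout:

```shell
# 1. Source: is the implementation actually in the tree?
git grep -l "workflow" -- src/

# 2. Artifact: did the build emit the expected output?
ls dist/modules/workflow

# 3. Runtime: does the running container really carry those files?
docker compose exec api ls /app/dist/modules/workflow

# 4. Route: does it exist, and what does the status code say?
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3000/api/workflow/definitions
```

If step 3 fails while steps 1 and 2 pass, the container is running a stale image — that is runtime drift, and no amount of reading application code will reveal it.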

The strongest evidence came from two checks. First, the expected dist/modules/workflow path existed in the rebuilt container. Second, the workflow definitions endpoint returned 401 instead of 404. That distinction matters. A 404 usually means the route is absent. A 401 means the route exists and the next place to investigate is authentication or authorization. HTTP status codes are not just errors; they are operational clues.
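That triage rule is worth making mechanical. A hypothetical helper (not part of the project) that maps the probe's status code to the next layer to investigate:

```shell
# Map an HTTP status code from the route probe to the next layer to check.
interpret_status() {
  case "$1" in
    404)     echo "route absent: check build artifact and container contents" ;;
    401|403) echo "route exists: check authentication/authorization" ;;
    2??)     echo "route exists and responded: check the payload" ;;
    5??)     echo "route exists but failed: check service logs" ;;
    *)       echo "unexpected status: check proxy and service logs" ;;
  esac
}

interpret_status 401   # route exists: check authentication/authorization
```

The key branch is the first two: 404 sends you back down the stack toward artifacts and containers, while 401 confirms the route is live and moves the investigation up into auth.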

The recovery was straightforward: rebuild and recreate the api and dashboard services with docker compose build api dashboard followed by docker compose up -d api dashboard. But the lesson is more important than the command. If you stop at “restarting fixed it,” you miss the actual failure mode. The real problem was a mismatch between source, artifact, and running container state.
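Spelled out, the recovery sequence looks like this (service names are from the post; the in-container path is an assumption):

```shell
# Rebuild the images so they pick up the current source...
docker compose build api dashboard

# ...then recreate the containers. A plain `docker compose restart` is not
# enough: restart reuses the existing container and image, while `up -d`
# recreates the container from the newly built image.
docker compose up -d api dashboard

# Verify the new containers actually carry the expected artifact.
docker compose exec api ls /app/dist/modules/workflow
```

The final verification step is what turns "restarting fixed it" into a confirmed diagnosis.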

This kind of issue shows up often in Docker-based operations. Developers update source, but the image is not rebuilt. Or the image is rebuilt, but the container is not recreated. Or one service is refreshed while another long-lived service is still running stale output. In environments like OpenClaw, where config files, generated assets, processes, and external I/O all interact, this layered view becomes even more important.
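One cheap way to catch the "rebuilt but never recreated" variant is to compare when the image was built with when the container started: if the image is newer, the running container predates the current build. A sketch, with placeholder image and container names; since Docker reports ISO 8601 UTC timestamps, plain string comparison is enough:

```shell
# Pull the two timestamps (placeholders for real image/container names):
#   docker inspect -f '{{.Created}}' myapp-api:latest
#   docker inspect -f '{{.State.StartedAt}}' myapp-api-1

# Returns success (0) if the image build time is later than the container
# start time, i.e. the container is running a stale image.
image_newer_than_container() {
  [[ "$1" > "$2" ]]
}

image_newer_than_container "2024-06-02T10:00:00Z" "2024-06-01T09:00:00Z" \
  && echo "drift: recreate the container"   # prints the drift warning
```

This only detects one of the drift modes above (stale container behind a fresh image); a source-to-image mismatch still needs the artifact check from the verification order.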

The practical takeaway is simple: if the code exists but production disagrees, suspect runtime drift before you blame the application logic. Checking the reality of the running layer is usually faster than digging deeper into code that may already be correct.
