Repo Truth ≠ Production Truth: A Container-First Troubleshooting Pattern for Runtime Drift
We ran into another operations problem that wastes a lot of time precisely because it looks deceptively simple: the implementation exists in the repository, but the live UI and API behave as if the feature was never deployed. In that situation it is easy to keep staring at source code, or to blame frontend logic, routes, or permissions too early. The first thing to verify is not repo truth but live runtime truth, and in Docker environments the shortest entry point to that is often container truth.
A Git repository can prove that somebody wrote the code. It cannot prove that the process currently serving requests is actually running that code. In Docker-based systems, those are often two different realities.
What the problem really was
The workflow page in AI Back Office Pack was behaving incorrectly. The workflow implementation was visible in source, yet the page did not work and the API behavior did not match expectations. From there, it is tempting to start digging through application logic. That is usually where time gets burned.
The more effective order was much simpler:
- confirm the live endpoint mapping: which proxy receives this domain/path right now, and which service/container it actually forwards to
- confirm the implementation exists in source
- confirm the build artifact contains the expected output
- confirm the running container actually includes that artifact
- then inspect route and reverse-proxy details
- finally inspect authentication responses and API semantics
The final conclusion was not "the code is missing." It was "the code is not what the container is running." The workflow module existed in the repository, but the live api and dashboard containers were still running old images with old artifacts. In other words, code truth and container truth had drifted apart: a textbook runtime drift incident.
Why I now prioritize container truth
In local development, source is often close enough to reality. In Docker / Compose / multi-service operations, that assumption becomes dangerous.
Users do not hit your Git repository. They hit:
- a specific image
- a specific container
- a specific running process
- a route that is actually active
That is why source truth is only one piece of evidence in production debugging. The final authority is the live runtime currently serving requests, and in Docker environments container truth is often the fastest route to verifying that runtime truth.
A debugging order that wastes less time
The next time I see symptoms like "the code exists but the page does nothing," "the repo has it but the API returns 404," or "we changed it but production did not move," I will use this order first.
0. Live endpoint mapping
Confirm which LB or reverse proxy currently receives the request, and which service/container it really lands on. If you are looking at the wrong container, everything after that is wasted effort.
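Step 0 can be sketched with a few commands. The service name (api), port (3000), and URL here are hypothetical placeholders, not from the actual incident; each command is guarded with a fallback message so the sketch degrades gracefully on hosts without Docker.

```shell
# Step 0 sketch: find out which container actually receives the request.
check_endpoint_mapping() {
  # Which containers are up, and which ports do they publish?
  docker compose ps 2>/dev/null || echo "(docker compose unavailable here)"
  # Which host port does the api service map to? (service name and port are assumptions)
  docker compose port api 3000 2>/dev/null || echo "(port mapping not resolvable)"
  # What does the first hop of the live path return right now? (URL is hypothetical)
  curl -sI https://backoffice.example.internal/workflow 2>/dev/null | head -n 1 \
    || echo "(endpoint unreachable from this host)"
}
check_endpoint_mapping
```

If this step points at a different container than the one you have been inspecting, stop and redirect the investigation before touching any code.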
1. Source
Verify the implementation really exists.
2. Artifact
Verify the built output, bundle, or dist files contain the feature. Source existing is not enough.
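A minimal sketch of the artifact check, assuming a dist/ output directory and a modules/workflow path modeled on this incident; adapt both to your build layout.

```shell
# Step 2 sketch: prove the build artifact, not just the source, contains the feature.
check_artifact() {
  # Does the build output even contain the module directory? (path is an assumption)
  [ -d dist/modules/workflow ] && echo "dist contains the workflow module" \
    || echo "dist is missing the workflow module: rebuild before blaming runtime"
  # Is the feature's code actually in the bundle, not only under src/?
  grep -rl "workflow" dist/ 2>/dev/null | head -n 3 || true
}
check_artifact
```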
3. Container
Enter the running container and inspect the deployed files directly. In this case, the key question was whether /app/dist/modules/workflow actually existed inside the container.
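The container check above can be sketched as follows. The container name (backoffice-api) is hypothetical; the path mirrors the key question from this incident. Commands are guarded so the sketch runs even where Docker is absent.

```shell
# Step 3 sketch: ask the running container, not the repo, what it contains.
check_container_files() {
  # Is the deployed artifact actually inside the live container?
  docker exec backoffice-api ls /app/dist/modules/workflow 2>/dev/null \
    && echo "container has the workflow artifact" \
    || echo "artifact missing (or docker unavailable here): suspect a stale image"
  # Which image is this container really running? Compare it to your latest build.
  docker inspect --format '{{.Image}}' backoffice-api 2>/dev/null \
    || echo "(container not inspectable from here)"
}
check_container_files
```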
4. Route / Proxy details
If the files are present, then verify the route is mounted and the reverse proxy is pointing at the correct upstream.
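For the proxy layer, dump the config the proxy is actually running rather than the config file in the repo. This sketch assumes an nginx-based proxy in a container named edge-proxy; both are placeholders.

```shell
# Step 4 sketch: inspect the proxy's live config, not the one checked into git.
check_proxy_upstream() {
  # nginx -T dumps the full active configuration; grep for the route in question.
  docker exec edge-proxy nginx -T 2>/dev/null | grep -A 2 "location /workflow" \
    || echo "(proxy config not dumpable from here; verify which upstream /workflow hits)"
}
check_proxy_upstream
```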
5. Auth / API semantics
Only after those layers are verified does it make sense to spend time interpreting 401, 403, or 500 responses.
The value of this order is simple: it answers whether all the evidence you are looking at refers to the same deployed reality. A lot of troubleshooting time is lost trying to explain a layer-B failure with layer-A facts.
404 versus 401 is not just a different error code
One especially useful signal in this case was the endpoint transition:
- before: 404
- after rebuilding and recreating containers: 401
That does not mean "it is still broken, just with another number." It means something structurally changed.
- 404 strongly suggests something is still wrong at the route, artifact, mount, or proxy layer
- 401 means the endpoint is likely reachable now, and the next layer to inspect is authentication or permissions
- 403 suggests authentication may have succeeded but policy or authorization is still blocking access
- 5xx points more toward the app, dependencies, config, or upstream failures
So even when the error is not gone yet, a shift in error semantics can prove that troubleshooting has advanced one layer forward.
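The heuristic above is small enough to write down directly. This sketch encodes it as a lookup; the mapping tells you which layer to inspect first, not where the bug definitely is.

```shell
# Map an HTTP status to the layer most worth inspecting next (heuristic, not a rule).
next_layer_to_inspect() {
  case "$1" in
    404) echo "route / artifact / mount / proxy" ;;
    401) echo "authentication" ;;
    403) echo "authorization / policy" ;;
    5??) echo "application / dependencies / config / upstream" ;;
    *)   echo "endpoint reachable; inspect API semantics" ;;
  esac
}

# The transition seen in this incident: 404 before the rebuild, 401 after.
next_layer_to_inspect 404   # -> route / artifact / mount / proxy
next_layer_to_inspect 401   # -> authentication
```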
The illusions Docker creates
Docker environments make several false assumptions feel natural:
- we did git pull, so production must be current
- the file changed, so the image must include it
- the image was rebuilt, so the running container must be new
- the container restarted, so the service must be running the latest code
None of those is guaranteed. A mismatch at any layer can leave you with new code in theory and old behavior in production.
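Each of those illusions can be broken explicitly instead of assumed away. A sketch, with a hypothetical service name (api) and graceful fallbacks where Docker is unavailable:

```shell
# Break the illusions one by one: rebuild, force-recreate, then verify.
rebuild_and_verify() {
  # "The file changed" does not mean the image includes it: rebuild explicitly.
  docker compose build api 2>/dev/null || echo "(build skipped: docker unavailable)"
  # "The image was rebuilt" does not mean the container is new: force-recreate it.
  docker compose up -d --force-recreate api 2>/dev/null || echo "(recreate skipped)"
  # "The container restarted" does not mean it runs the latest code: confirm
  # which image the running service actually uses before trusting its behavior.
  docker compose images api 2>/dev/null || echo "(image listing unavailable)"
}
rebuild_and_verify
```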
For operators, the more important question is not merely "is the repository correct?" It is:
which live runtime is actually receiving this request path right now, and what exactly is inside that container?
That is the answer worth establishing first.
Takeaway
My default rule for this class of incident is now much clearer:
When source and production behavior disagree, suspect runtime drift. In Docker environments, container truth is often the fastest place to start.
Do not start by judging the code. Do not jump straight into application-layer explanations. First separate the layers:
- is source correct?
- is the artifact correct?
- is the container correct?
- is the route correct?
- what layer is the auth or API response actually describing?
If the order is right, these incidents are usually manageable. What makes them expensive is usually not the bug itself, but looking at the wrong layer for too long.