I was reviewing production logs at 11pm when I got a notification from Railway — my infra provider, not the transportation mode — telling me a service had gone down for the third time that week. I restarted the container, made a mental note to "look at this tomorrow," and moved on. Two days later I read that the Shinkansen had a 49-second delay and the company issued a formal public apology. Forty-nine seconds. I stared at the screen for a while.
It wasn't that the fact surprised me. I'd heard it before. What surprised me was the contrast with my own normalization: I had restarted that service three times in a week and filed it mentally under "stuff that happens." They had 49 seconds of delay and treated it as an event requiring formal analysis and a public apology. That's when I started to understand the problem wasn't technical.
Reliable Systems, Institutional Design, and Infrastructure: The Trap of Looking for Better Tools
The easy explanation for Japanese trains is technological: maglev, precision engineering, massive budget. It's a comfortable explanation because if the problem is technological, the solution is buying better technology. But it doesn't hold up.
Switzerland also has extraordinarily punctual trains. With considerably more modest technology. Germany has the ICE, high technology, federal budget, and chronic delays that are a national running joke. India is building high-tech metros in cities where signaling fails every day. Technology doesn't explain the variance.
What explains the variance is more uncomfortable: the structure of consequences.
At JR (Japan Railways), the institutional cost of a failure is brutally high. I'm not just talking about fines or metrics. I'm talking about something deeper: the organizational identity is built around reliability. An operator who reports a problem on time is treated as part of the safety system. An operator who hides a problem to avoid generating friction is betraying the institution. That inversion of incentives isn't cultural in the vague sense — it's designed, reinforced, and actively maintained.
The practical result: preventive maintenance isn't a cost. It's the only rational way to operate. Because the cost of failing — in reputation, in internal consequences, in the postmortem analysis that follows — is systematically higher than the cost of maintaining.
Compare that to most of the software systems I know, including my own.
What This Means for Software Architecture
When I was studying Computer Science at UBA while working full time, I'd sometimes show up straight from work in my suit. There was constant pressure to make things work now, not to make them work well and forever. I passed Calculus II on my fourth attempt. I learned to survive in environments where the cost of not delivering today was more visible than the cost of delivering badly.
That shapes how you think about systems. And it's exactly the institutional problem that the Japanese train solved and we haven't.
In most software teams, the structure of consequences favors failing silently:
- A service that goes down and recovers on its own doesn't generate conversation
- A service that never goes down but required 3 hours of preventive work doesn't generate visible conversation either
- A service that crashes spectacularly at 3pm in production generates a meeting, a postmortem, and sometimes an RCA
The visible consequence is in the big failure, not in the silent degradation. That's what makes restarting three times in a week "stuff that happens" instead of an institutional warning signal.
```typescript
// What we typically do:
const handleError = async (error: Error) => {
  // Restart and keep going — uptime recovers on its own
  logger.error('Service crashed', { error: error.message });
  await restartService();
  // ✗ No root cause analysis
  // ✗ No frequency tracking
  // ✗ No visible cost to the team
};
```
```typescript
// What an institution with a real consequence structure would do:
const handleErrorWithCost = async (error: Error, context: OperationContext) => {
  // 1. Log with enough detail for later analysis
  await recordIncident({
    timestamp: new Date(),
    error: error.message,
    stack: error.stack,
    context,
    // Frequency in the last 24h — this is what matters
    recentFrequency: await countSimilarIncidents('24h'),
  });

  // 2. If this is the third similar incident in a week:
  //    DON'T restart silently — escalate with context
  const history = await getIncidentHistory(7);
  if (history.similar >= 3) {
    await escalateWithContext({
      message: 'Third similar incident this week — this is not noise',
      history,
      estimatedCostOfIgnoring: calculateCostOfContinuousDegradation(history),
    });
  }

  // 3. Recover, but leave a visible trace
  await restartService();
  await updateReliabilityDashboard(context.service);
};
```
The difference isn't technical. It's what the system makes visible, and who cares about it.
When I designed the architecture of my AI agent and measured the real costs, the problem wasn't the technology. It was that I had no structure to make the cost of bad decisions visible. Failures got absorbed silently and I kept thinking the system was running "more or less fine."
AI Agents Are Going in Exactly the Opposite Direction
This is where I get more uncomfortable, because it's the territory where I'm actively working.
The AI agent ecosystem in 2025 is building, quite systematically, systems where the cost of failing is artificially low. And it presents this as a virtue.
"The agent retries automatically." "If there's an error, the LLM detects and corrects it." "Resilience is built in." All of that sounds good. And in certain contexts it is. But in institutional terms, you're building a system that makes failures invisible. The agent fails, retries, eventually arrives at some result, and you never know the path was tortuous.
I measured it in tokens and the discomfort was concrete: there are design decisions that cost 3x more tokens without anyone knowing, because the final result arrives anyway. The cost gets absorbed silently. The system "works."
It's the equivalent of the train arriving 49 seconds late but nobody logs it because it arrived anyway.
The difference with JR is that JR built the institutional capacity to make those 49 seconds visible, analyzed, and costly. We're building agents that optimize to make the 49 seconds permanently invisible.
When I analyzed the real costs of my agent's design decisions, I found exactly that: the automatic retry architecture was, in terms of institutional visibility, a system for hiding failures. It worked. But it was building comprehension debt.
The Mistakes I Made (And That You're Probably Making)
Mistake 1: Confusing availability with reliability.
My service had 99.2% uptime last month. It also had 47 automatic restarts. Those numbers don't contradict each other, and that's the problem. JR doesn't measure "the train arrived" — it measures how long it took, why, and what conditions allowed that. I was only measuring whether it arrived.
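The gap between those two numbers is easy to reproduce. Here's a minimal sketch; the `Incident` shape, the five minutes of downtime per restart, and the 30-day window are all assumptions for illustration, not my real data:

```typescript
// Hypothetical incident record: one entry per automatic restart.
interface Incident {
  timestamp: Date;
  downtimeSeconds: number;
}

// Availability only sees total downtime.
function availabilityPercent(incidents: Incident[], periodSeconds: number): number {
  const downtime = incidents.reduce((sum, i) => sum + i.downtimeSeconds, 0);
  return ((periodSeconds - downtime) / periodSeconds) * 100;
}

// Reliability also has to count how often we failed.
function restartCount(incidents: Incident[]): number {
  return incidents.length;
}

// 47 restarts of ~5 minutes each over a 30-day month still rounds to "99%+ uptime".
const monthSeconds = 30 * 24 * 3600;
const incidents: Incident[] = Array.from({ length: 47 }, (_, i) => ({
  timestamp: new Date(2025, 0, 1 + (i % 30)),
  downtimeSeconds: 300,
}));

const availability = availabilityPercent(incidents, monthSeconds);
// availability ≈ 99.46 while restartCount(incidents) is 47: both are "true" at once
```

Both metrics come from the same incident list; the dashboard just chooses which one to show.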
Mistake 2: Treating retries as a solution, not a signal.
A successful retry isn't a success. It's a failure that resolved itself. The difference matters because if you don't log it as a failure, you have no data to prevent the next one.
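One way to make that concrete is a retry wrapper that records every failed attempt as an incident before trying again. This is a sketch, not a library API; `withVisibleRetries` and the `Attempt` shape are names I'm inventing here:

```typescript
// One record per failed attempt, kept even when the call eventually succeeds.
type Attempt = { attempt: number; error: string };

async function withVisibleRetries<T>(
  op: () => Promise<T>,
  maxAttempts: number,
  incidentLog: Attempt[],
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      // The crucial line: a failed attempt is logged as a failure
      // even if the very next attempt succeeds.
      incidentLog.push({ attempt, error: String(err) });
      if (attempt === maxAttempts) throw err;
    }
  }
  throw new Error('unreachable');
}
```

A call that succeeds on the third attempt leaves two entries in `incidentLog`. The final success doesn't erase them, which is exactly the point.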
Mistake 3: Building observability without consequences.
I had dashboards. I had logs. I had alerts. But I had no structure where that data cost anything if it showed deterioration. Information without consequences is decoration.
Mistake 4: Assuming "it works" is the goal.
This is the deepest one. The Japanese system doesn't optimize for the train arriving. It optimizes for making the process that gets the train there sustainably reliable. Those are different goals, and they require different institutional architectures.
When I wrote a Python interpreter in Python to understand compilers, the biggest lesson wasn't technical — it was that formal languages force explicitness. You can't have vague behavior. Either the grammar allows it or it doesn't. Real reliability systems work the same way: you need to make explicit what counts as a failure.
FAQ: Reliable Systems, Institutional Design, and Infrastructure
Is the Japanese model replicable in software without a massive budget?
Yes, because the most important component isn't economic. It's structural. What JR does that costs little but changes everything: logging retries as incidents, not as noise. That doesn't require budget. It requires changing what your system makes visible and agreeing that it matters.
Isn't this what SLOs and SLAs already do?
Partially. SLOs are a good first step because they make the goal visible. But the institutional problem is what happens when they're not met. If the cost of missing an SLO is a meeting and a "we need to improve," you haven't changed the consequence structure. The relevant question is: what does it cost a specific person when the SLO fails repeatedly?
How would you apply this to an AI agent system?
I'd start by making retries visible. Not as a success metric ("the agent completed the task") but as a quality-of-path metric ("the agent needed X retries, cost Y tokens, took Z seconds longer than expected"). Then I'd build a threshold where that number triggers something: not necessarily an alarm, but a review. The system needs to know that failing silently has a cost.
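As a sketch of what that quality-of-path metric could look like — every field name and the 2.0 threshold are assumptions I'm making for illustration, not an established standard:

```typescript
// Hypothetical per-run telemetry for an agent task: the task "completed",
// but these fields describe how rough the path was.
interface AgentRun {
  retries: number;
  tokensUsed: number;
  tokensExpected: number;
  durationMs: number;
  durationExpectedMs: number;
}

// Path cost: 1.0 means the run matched expectations; higher means the
// result arrived, but the path was tortuous and is silently costing you.
function pathCost(run: AgentRun): number {
  const tokenRatio = run.tokensUsed / run.tokensExpected;
  const timeRatio = run.durationMs / run.durationExpectedMs;
  return Math.max(tokenRatio, timeRatio) + run.retries * 0.25;
}

// Above the threshold: not an alarm, just a flagged review.
function needsReview(run: AgentRun, threshold = 2.0): boolean {
  return pathCost(run) >= threshold;
}
```

A run with three retries and 3x the expected tokens scores 3.75 and gets flagged; a clean run scores near 1.0 and stays quiet. The exact formula matters less than the fact that "completed" and "completed cleanly" become different numbers.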
Why is "move fast and break things" the exact opposite of this?
Because it optimizes for iteration speed above everything else. That's not bad in contexts where the cost of failure is low and learning speed is most valuable — like an experiment, or an MVP. The problem is when that culture persists after the system has real users, real data, and real consequences. At that point, iteration speed without a consequence structure is institutional debt.
Isn't there a real trade-off between reliability and development speed?
Yes, and I'm not going to pretend there isn't. But the trade-off is usually framed badly. It's not "speed vs reliability." It's "visible cost today vs invisible cost that accumulates." JR's preventive maintenance costs more per train-kilometer than reactive maintenance. But the total cost of the system — including failures, disruptions, emergency repairs, and reputational damage — is much lower. The problem is that today's cost is visible and the future cost isn't.
What specific tool would you recommend to start?
No tool. That's exactly the point. Before choosing tools, you need to agree on what counts as a failure in your system. Write it down. Literally in a document: "A failure is X. A retry is a failure. Three similar failures in seven days triggers Y." When you're clear on that, any observability stack works. Without it, you have pretty dashboards and zero institutional change.
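To make that literal, the written agreement can even live next to the code as a versioned config. Everything below is an example policy with made-up values, not a recommendation of specific numbers:

```typescript
// The written-down agreement as a literal, reviewable object.
// The values are examples; the point is that they are explicit and versioned.
const failurePolicy = {
  // A retry is a failure, full stop.
  retryCountsAsFailure: true,
  // What "similar" means when grouping incidents (illustrative criterion).
  similarIf: ['same service', 'same error class'],
  // Three similar failures in seven days triggers a review.
  escalation: { similarFailures: 3, windowDays: 7, action: 'schedule review' },
} as const;

// A trivial check any observability stack could run against its incident store.
function shouldEscalate(similarInWindow: number): boolean {
  return similarInWindow >= failurePolicy.escalation.similarFailures;
}
```

Once the policy is an artifact, changing it requires a diff and a reviewer, which is itself a small consequence structure.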
Real Reliability Is Not a Technical Problem
I've had this topic rattling around in my head for weeks. It started with that 49-second data point and ended up making me rethink how I design systems.
The most uncomfortable conclusion is this: in the current software ecosystem — and especially in the AI agent ecosystem — we are actively building systems that make failures invisible. And we present them as resilient. Real resilience is when the system wants failures to be visible, because the institution built the right consequences.
When Anthropic designs the developer experience for Claude, there's a tension exactly here: the API makes retries easy, errors manageable, everything flows. That lowers development friction. It also lowers failure visibility. I don't know if that's right or wrong — it probably depends on context. But I know it's an institutional decision with consequences, and it should be made consciously.
Even when I thought about Brunost, the programming language in Nynorsk, there was something of this: who decides what's readable, what counts as correct, what structure makes error visible or invisible. Institutional design is everywhere, even in languages.
I don't have a packaged solution. I have a practice I started two months ago: every time I restart a service, I log it as an incident with timestamp, context, and accumulated frequency. I'm not doing anything with that yet. But when I hit 47 restarts in a month, the number was uncomfortable enough that I couldn't keep calling it "stuff that happens."
That's where institutional change starts. In making visible what you used to absorb silently.
If any of this landed for you, tell me in the comments how you measure silent failures in your system. Or if you've got the consequence structure figured out in a way that actually works — I want to learn from that.