DEV Community: Hideki Mori

The model corrected reality

Hideki Mori — Tue, 21 Jul 2026 13:00:00 +0000

Here is the bank-transfer block from a Japanese invoice, rendered at 300 dpi. The fine print is 7.5 pt and every character is crisp. The bank is みずなら銀行 — a fictional institution I invented for a benchmark. It exists nowhere except in this document.

gemini-3.5-flash@high read this block five times. It answered みずほ銀行, one of Japan's three megabanks, five times out of five.

It did not fail to read the document. It read it, and overruled it.

Where this came from

This is the strangest cell in the legibility map I published recently. That project degraded one Japanese invoice through seven simulated scan resolutions and ran 27 vision model variants down the ladder, to find where each one stops reading — and what it does after. An earlier article established the pattern everyone now expects: when a model can't read, some models invent.

This is different. This happened at the very top of the ladder, on a fully legible original. The trigger wasn't degradation.

An accidental controlled experiment

The benchmark's fine print held two fictional financial institutions, and — more by instinct than by plan — they differed in exactly one way.

The first, みずなら銀行, sits right next to みずほ銀行, a real megabank: the two names differ only in the middle, なら versus ほ. The second, ほしかげ信用金庫, is a small credit union with no real-world neighbor: nothing in the space of Japanese financial institutions sounds like it.

Same font size, same resolution, same prompt. The results:

At 300 dpi, @high read ほしかげ信用金庫 correctly in every run — while turning みずなら into みずほ in every run.
Across the three Gemini variants and the full ladder, the substitution happened 48 times: 25 on @low, 18 on @high, 5 on @medium.
When ほしかげ finally did break, deep in the blur, it drifted to はしかぜ信用金庫 and はしかわ信用金庫 — plausible-sounding institutions that do not exist. Across the entire run it was never once pulled to a real one.

The trigger is not legibility. It is the existence of a nearby real entity. Where a real neighbor exists, there is an attraction. Where none exists, the model reads what is on the page — or, at worst, invents something exactly as fictional as the truth.
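
A minimal sketch of how that distinction could be scored, assuming fictional ground truth and a reference list of real institutions. The function name and the one-entry list are hypothetical, and this is not the published scorer:

```python
def classify_read(truth: str, answer: str, real_entities: set[str]) -> str:
    """Label one extracted value against fictional ground truth."""
    if answer == truth:
        return "correct"
    if answer in real_entities:
        # Wrong, and the wrong value is a real institution: pulled toward the prior.
        return "prior_capture"
    # Wrong, but the wrong value is as fictional as the truth.
    return "other_misread"


# Hypothetical reference list; a real one would come from a bank registry.
REAL_ENTITIES = {"みずほ銀行"}

print(classify_read("みずなら銀行", "みずほ銀行", REAL_ENTITIES))
print(classify_read("ほしかげ信用金庫", "はしかぜ信用金庫", REAL_ENTITIES))
print(classify_read("ほしかげ信用金庫", "ほしかげ信用金庫", REAL_ENTITIES))
```

The second class only exists because the ground truth is fictional. With a real bank on the page, a substitution and a correct read would produce the same string.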

Vision proposes, language disposes

I can't see inside these pipelines, so take this as the simplest explanation rather than a claim about architecture. These systems read with their language model as much as with their eyes. The visual evidence says みずなら; the language prior says みずほ is overwhelmingly more probable; and somewhere in decoding, probability wins — silently, with no flag, at settings you don't control. One character of visual evidence is not enough to outvote a name the model has seen millions of times.

Which is also why the credit union survived. There was no gravity well next to it.

It's a trait, not a law

claude-fable-5 read the same field on the same ladder and never substituted the real bank — not once, at any resolution, while reading the 7.5 pt tier correctly down to 50 dpi. Same input, same prompt, opposite disposition.

Prior capture is not a law of vision models. It is a measurable individual trait — which means you can select against it.

Why this is the scary one

The earlier fabrication article showed invented values that reconcile: totals that add up around a counterparty that was never there. This failure is worse in one specific way: the fabricated value is more plausible than the truth.

Run the human review in your head. A reviewer checking "does this bank look real?" passes みずほ銀行 without blinking — and would actually flag the true value as a typo. Every plausibility check you have, human or automated, is aligned with the error.

And notice what made it visible at all: the ground truth was fictional. Real invoices carry real banks, so in production this substitution produces output indistinguishable from a correct read. A benchmark built on real documents is structurally blind to prior capture. Unguessable, fictional ground truth isn't a convenience for scoring — it is the only instrument that detects this failure mode.

Concretely: never validate payment fields by plausibility. Validate against your counterparty master — the extracted bank either matches the registered account or a human looks at the page. And if fine print matters in your pipeline, benchmark for disposition, not just accuracy: a model's willingness to overrule the page matters as much as its eyesight.
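
As a rough sketch of what "validate against your counterparty master" can look like, assuming a master table keyed by a counterparty ID. The record shape, the ID, and the sample values are all hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BankAccount:
    bank_name: str
    branch: str
    account_number: str


# Hypothetical master data keyed by counterparty ID; values are illustrative.
COUNTERPARTY_MASTER = {
    "V-001": BankAccount("みずなら銀行", "本店", "1234567"),
}


def check_payment_fields(counterparty_id: str, extracted: BankAccount) -> str:
    """Return 'match' or 'needs_review'; never accept a value because it looks real."""
    registered = COUNTERPARTY_MASTER.get(counterparty_id)
    if registered is None or extracted != registered:
        return "needs_review"
    return "match"


# A plausible but substituted bank name is routed to a human, not accepted.
substituted = BankAccount("みずほ銀行", "本店", "1234567")
print(check_payment_fields("V-001", substituted))
print(check_payment_fields("V-001", COUNTERPARTY_MASTER["V-001"]))
```

The check is an exact comparison on purpose. Any fuzzy matching here would reintroduce the plausibility judgment this section argues against.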

Receipts

Everything is published in the benchmark repo — github.com/ldxhub-io/examples → analyzedoc/legibility-benchmark/: the deterministic material generator (the crops above are the actual benchmark materials), the raw outputs of all 4,158 jobs, and the scorer. The free-tier reproduction subset described in the map article includes @high, so you can watch the correction happen on your own API key without paying anything.

Disclosure, as before: I run LDX hub, the harness used here. It builds no models, and nothing in this post gets better or worse for me depending on which model wins.

The errors to fear are not the implausible ones — those get caught. The ones to fear are the corrections: answers improved in the direction of the world's expectations, away from what the page says. A model that corrects reality will pass every review that checks for plausibility. The only defense is knowing what the document really said — which is exactly the thing you were asking the model to tell you.

I survived 24 years because I'm lazy

Hideki Mori — Mon, 20 Jul 2026 13:00:00 +0000

I've shipped code for 24 years. Same job, mostly the same stack, mostly alone.

People assume that takes discipline. It doesn't.

The truth is simpler and less flattering: I'm lazy.

Not Larry Wall's "automate the boring stuff" lazy. Regular lazy. I avoid hassle. I skip work I don't want to do. I work around things that feel like chores.

Twenty-four years happened because I built around that. The first thing I built around was deadlines.

Don't set my deadlines

I don't want to accept deadlines from other people.

Not because I'm undisciplined. The opposite. Once I'm inside a problem, I stay there until it ships. Interrupting the flow costs me more energy than continuing. That part isn't the issue.

The issue is that nobody outside my head can predict when "done" is. They don't know what's easy for me and what's hard. They guess. Then they tell me their guess as if it were a fact.

It grinds on me. Every project where someone tried to schedule me, I ended up resenting the scheduler more than the work itself.

The answer is always: "I'll ship it when I can. And yes, I'm working flat out."

Most of the time, "when I can" is the next business day. There's not much point setting a deadline for that.

What "when I can" actually looks like is the rest of this article.

What a lazy survivor actually does

Here's the daily shape of it. Some of these will sound like discipline. They aren't. They're what laziness produces when you've been at it long enough.

I don't stop until it's done. This is where "lazy" gets confusing. When I'm in a problem, stopping is more work than continuing. I sleep and eat enough to not fall over. The work isn't the chore. Pulling myself out of it and back into it is the chore.
I write the spec by being my own user. Nobody knows what's easy to use and what's consistent better than the person stress-testing it. That's me.
The 3-line discipline. (See 010.) By the time the code is written, it's already been tested.
After release, I watch the logs. Access logs, batch logs, error logs — I keep them tailed. A weird line scrolls past and something catches. The alerts haven't fired. I already know.
I redeploy as many times a day as I need to. The first release barely matters. What matters is the 10, 15, 20 years of changes that come after. The earlier you ship, the longer that window is.
I throw real data at my own software, hard. Big data, malformed data, weird-shaped data. The only confidence I trust is the kind that survives that.
When an internal user overloads my system, I thank them. Live data is a gift. I have never had the opposite feeling about a real-world failure.
A bug found before anyone else sees it isn't a bug. It's just an edit.
If there's an update worth making, the next day is too late. GPT-5.5 ships → I verify it on my app today, ship it as a selectable option today. I don't want to be the one lagging behind.
Batch processing is what I love most. Shaving milliseconds. Cutting load. Watching batch logs scroll. Every part of it is a reward. I'm always hoping more batch jobs come in from users.

And one more — the most important one, the one 24 years actually paid for:

Discomfort means something is wrong. Always. When something feels off, I hunt it down and crush it. The hunch isn't mystical — it's 24 years of pattern recognition without a vocabulary, and it hasn't been wrong yet. The part of me that wants to look past it has always paid for it later.

Where being lazy actually hurts

Not all laziness is helpful. Some of it is just being a person who avoids reading.

I don't read manuals. My wife has a car with cruise control. I've driven it for years. I still don't know how to turn the cruise control on. Every time, I'm too lazy to figure it out, and I drive without it.

That instinct does real damage at work. I don't read API docs unless I have to. I don't read change logs carefully. I skim. I assume. I run code instead of finishing a manual.

English is the other one. I'm Japanese — my technical reading is fine, but the gap between "I can read this" and "I want to read this carefully" is wide, and laziness lives in that gap. Long English documentation is exactly the thing I will not voluntarily face.

For most of my 24 years, this cost me real time. I'd build something that worked, then discover three weeks later that the API I'd wrapped had a flag I'd missed because I never read past the example block.

The honest answer is that Claude reads for me now. I describe the problem, Claude reads the docs, Claude points me at the part I need to verify. It doesn't replace judgment. It removes the friction between me and information I was always going to avoid anyway.

I don't say that to praise AI. I say it because it's true for me, and it would be dishonest to write a piece about how I work in 2026 without saying it out loud.

The engine

If you re-read the list above, two things are doing the work. Both are unglamorous.

The first is the laziness. Each new component means ten years of maintenance I'll have to carry. Each new methodology means bugs I haven't learned yet. Each methodology argument is time not spent writing code. Each promised date is someone else's deadline I'll have to keep. I avoid all of it by default.

The second is profit. What I build has to make money. Not in some abstract sense — actual revenue from actual users. That part of me has no patience. If a service isn't paying for itself, something is wrong with it, and I want to know what. So I ship early, watch logs, redeploy whenever there's something to fix. Not out of discipline. Out of refusal to leave money on the table.

These two forces don't agree on much. Laziness says "don't bother." Profit says "if it makes money, bother." What's left after they negotiate is everything I actually do.

That's why "shipping continuously for 24 years" looks like discipline from the outside. It isn't. It's laziness pushed in a useful direction by the only thing that ever moved me — the need for the result to actually pay.

A CEO at one of my earlier companies once asked me: "That service running on your desktop — can you make it public right now?" I half-dismissed him at the time. Years later I understood he was the salesperson version of the same engine. That's a story for another time.

Other solos

There are other people doing what I do. Solo. Operating something they built years ago that's still running.

I assume — based on no evidence except how this kind of work shapes a person — that no two of us work the same way. Each of us has wrapped a different cocoon around a different temperament. What I do would be unworkable for them. What they do would be unworkable for me.

I respect that more than I can put into words. Solo developers who keep their own systems running for years are doing a job that doesn't show up in any career framework I've seen. Most of them never write about it because writing is also work.

Whatever way you do it — keep going. I see you.

What survived

People who watch me work sometimes call it persistence. Conviction. The shape of a long arc.

It isn't.

I survived because every alternative was more work than continuing. I stayed lazy. I got paid. Together, those two facts ran the clock for 24 years and counting.

What looks like a disciplined career from the outside has always been, on the inside, the path of least resistance — provided someone was paying me at the end of it.

The lazy way is the only way I know.

Built with Claude (Opus).


Where vision models stop reading — and start inventing

Hideki Mori — Wed, 15 Jul 2026 13:00:00 +0000

Earlier this week I published a strange finding: GPT's low-detail image mode doesn't misread documents it can't see — it invents them, fluently, with reconciling totals. That was one failure mode, in one model family, at one legibility level.

It left an uncomfortable question: where exactly does each model stop reading — and what does it do after that? Leave the field blank, or fabricate something plausible?

One result to hold onto while you scroll: a model that can no longer read a document can usually still tell what kind of document it is. That held across almost the entire board.

So I built a ladder.

Then I ran 27 vision model variants down it: 4,158 jobs, about $141 at list price, one afternoon. This post is the map.

The setup, in one paragraph

One Japanese invoice, rendered on a fixed 2480×3508 canvas (A4 at 300 dpi), then degraded through seven simulated scan resolutions: 300 → 150 → 100 → 70 → 50 → 35 → 25 dpi (L0–L6). Degradation is resampling only (no noise, no blur, no rotation), so legibility is the only variable. The invoice carries twelve fields across four font tiers: a 28 pt title, 14–16 pt "large" fields (total, invoice number), 10.5 pt body fields (counterparty, dates, amounts), and 7.5 pt fine print (bank details). Every value is fictional and unguessable, and subtotal + tax = total reconciles, so a wrong answer stays detectable even when it is plausible. Each variant reads each ladder step five times. The extraction prompt is deliberately neutral: it never says what to do with unreadable text, because that choice is the thing being measured.
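
To make the ladder concrete, here is a small sketch of the pixel size the same A4 page would have at each step. It assumes the size scales linearly with dpi and rounds to whole pixels. How the published generator resamples, and whether it scales back up to the fixed canvas, is not shown here:

```python
CANVAS_PX = (2480, 3508)   # A4 at 300 dpi, from the setup paragraph
BASE_DPI = 300
LADDER_DPI = [300, 150, 100, 70, 50, 35, 25]  # L0..L6


def scan_size(dpi: int) -> tuple[int, int]:
    """Pixel size of the same A4 page scanned at `dpi`, rounded to whole pixels."""
    width, height = CANVAS_PX
    return round(width * dpi / BASE_DPI), round(height * dpi / BASE_DPI)


for level, dpi in enumerate(LADDER_DPI):
    width, height = scan_size(dpi)
    print(f"L{level}: {dpi:>3} dpi -> {width}x{height} px")
```

At the bottom of the ladder the whole page is roughly 207×292 pixels under these assumptions, which gives a feel for how little is left of 7.5 pt text.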

Scoring is deterministic, four classes per field: correct / near (edit distance 1, strings only) / blank / fabricated.
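
A minimal sketch of a four-class scorer along those lines. It is an illustration of the rule as stated, not the published scorer, and treating `None` and the empty string as blank is an assumption:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def score_field(truth, answer) -> str:
    """Classify one field as correct / near / blank / fabricated."""
    if answer is None or answer == "":
        return "blank"
    if answer == truth:
        return "correct"
    # "near" applies to strings only; numbers are either right or fabricated.
    if isinstance(truth, str) and isinstance(answer, str) and edit_distance(truth, answer) == 1:
        return "near"
    return "fabricated"


print(score_field("ABC-123", "ABC-128"))  # one substituted character
print(score_field(833.25, 833))           # numeric mismatch
print(score_field("ABC-123", ""))         # model stayed silent
```

Because the rule is this strict, a two-character misread of a name lands in "fabricated" rather than "near".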

The map

Rows are model variants, columns are ladder steps, color is the fabrication rate on the 10.5 pt body tier — the tier where invoice counterparties and amounts live. White means the model either read correctly or stayed silent. Red means it filled unreadable fields with invented values.

The companion table below gives each variant's frontier: the deepest ladder step where it still keeps ≥90% field accuracy, per tier (× = below 90% already at the crisp 300 dpi original).

model	title	large	body	fine	body fabrication rate @ 25 dpi	classified correctly (of 84)
`openai/gpt-5.6-sol@high`	L6	L5	L5	L2	60%	84/84
`openai/gpt-5.6-sol@low`	L6	×	×	×	64%	84/84
`openai/gpt-5.6-terra@high`	L6	L5	L4	L3	4%	84/84
`openai/gpt-5.6-terra@low`	L6	×	×	×	42%	84/84
`openai/gpt-5.6-luna@high`	L6	L5	L4	L2	96%	84/84
`openai/gpt-5.6-luna@low`	L6	×	×	×	80%	81/84
`openai/gpt-5.5@high`	L6	L5	L4	L2	86%	84/84
`openai/gpt-5.5@low`	L6	×	×	×	62%	84/84
`openai/gpt-5.4@high`	L6	L5	L4	L2	46%	84/84
`openai/gpt-5.4-mini@high`	L6	L5	L4	L0	78%	84/84
`azure/gpt-5.6-sol@high`	L6	L5	L5	L0	90%	84/84
`azure/gpt-5.6-sol@low`	L6	×	×	×	76%	84/84
`azure/gpt-5.6-terra@high`	L6	L5	L4	×	56%	84/84
`azure/gpt-5.6-terra@low`	L6	×	×	×	50%	84/84
`azure/gpt-5.6-luna@high`	L6	L5	L4	×	98%	84/84
`azure/gpt-5.6-luna@low`	L6	×	×	×	80%	84/84
`azure/gpt-5.4@high`	L6	L5	L4	×	98%	84/84
`azure/gpt-5.4@low`	L6	L5	×	×	58%	84/84
`azure/gpt-5.4-mini@high`	L6	L5	L4	×	88%	84/84
`azure/gpt-5.4-mini@low`	L6	L5	×	×	54%	84/84
`google/gemini-3.5-flash@high`	L6	L6	L6	×	10%	84/84
`google/gemini-3.5-flash@medium`	L6	L6	L6	L5	10%	84/84
`google/gemini-3.5-flash@low`	L6	L6	L6	L0	0%	84/84
`anthropic/claude-fable-5`	L6	L6	L6	L4	10%	84/84
`anthropic/claude-sonnet-5`	L6	L6	L5	L4	26%	84/84
`anthropic/claude-opus-4-8`	L6	L6	L5	L4	26%	84/84
`bedrock/global.amazon.nova-2-lite-v1:0`	×	L5	×	×	80%	63/84
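
For reference, a frontier like the ones in the table can be computed from a per-step accuracy list. This sketch assumes the walk stops at the first step that drops below the threshold, which is one reading of "still keeps ≥90%". The accuracy values in the usage lines are made up:

```python
def frontier(accuracy_by_level: list[float], threshold: float = 0.90) -> str:
    """Deepest ladder step still at or above the threshold, walking down from L0."""
    deepest = None
    for level, accuracy in enumerate(accuracy_by_level):
        if accuracy < threshold:
            break
        deepest = level
    return "×" if deepest is None else f"L{deepest}"


# Illustrative accuracy curves, L0..L6 (not benchmark data).
print(frontier([1.0, 1.0, 0.95, 0.60, 0.20, 0.0, 0.0]))
print(frontier([0.40, 0.55, 0.70, 0.80, 0.75, 0.30, 0.0]))
```

Under this reading, a curve that starts below the threshold and recovers later still scores ×.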

The biggest surprise in this table isn't where the frontiers sit. It's what happens past them — some models go silent, and some keep talking.

Six observations fell out of the map.

1. "@low" means different things per provider

google/gemini-3.5-flash@low — the second-cheapest variant on the board — read the body tier correctly at every step down to 25 dpi, with zero fabrications. Under exactly the same conditions, every OpenAI and Azure @low variant collapsed at L0, on the pristine original. Same suffix, opposite behavior. The difference isn't the models' eyesight; it's what each provider's low-detail pipeline does to the image before the model ever sees it.

2. For GPT `@low`, a worse scan is a better scan

The @low accuracy curves are not monotonic. Most GPT @low variants read a 70 dpi scan better than the 300 dpi original: body accuracy climbs from ~40% at L0 to 70–80% at L3–L4 before falling again. My resampling appears to act as an anti-alias filter for the provider's own aggressive downscale. The practical corollary is genuinely odd: if you are stuck with a @low pipeline, downsampling your documents yourself, in effect pre-blurring them, can improve extraction.

3. After collapse, models split into fabricators and blankers

What a model does past its frontier is a personality trait, and it's measurable. At 25 dpi, most GPT @high variants fill 75–100% of the body fields they can no longer read with invented values. openai/gpt-5.6-terra@high is the outlier of the entire board: 96% of its failures are blanks. Anthropic and Google models fail less to begin with and fabricate less when they do (0–26%). If your pipeline feeds payment systems, a blanker that admits defeat is worth more than a stronger reader that bluffs.
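
One way to put a number on that trait, sketched under the assumption that per-field classes are already available and that "near" reads are left out of the failure count. The sample counts are invented to reproduce a 96% blank share:

```python
from collections import Counter


def failure_style(classes: list[str]) -> dict[str, float]:
    """Share of blanks vs fabrications among fields that were not read."""
    counts = Counter(classes)
    failures = counts["blank"] + counts["fabricated"]
    if failures == 0:
        return {"blank_share": 0.0, "fabricated_share": 0.0}
    return {
        "blank_share": counts["blank"] / failures,
        "fabricated_share": counts["fabricated"] / failures,
    }


# Illustrative sample: 25 failed fields, 24 of them blank.
sample = ["correct"] * 5 + ["blank"] * 24 + ["fabricated"] * 1
print(failure_style(sample))
```

Reporting the share of failures, rather than the share of all fields, keeps a strong reader and a weak reader comparable on disposition alone.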

4. Same model, different gateway, different eyes

gpt-5.6-sol@high reads the 7.5 pt fine tier at 100% down to 100 dpi when called via OpenAI — and starts at 92% and degrades immediately when the same model is called via Azure. The failure style shifts too: terra's blank rate drops from 96% (OpenAI) to 39% (Azure). This matches an earlier measurement suggesting the Azure pipeline applies a lower effective-resolution ceiling before the model ever sees the document. Your gateway choice is silently part of your model choice.

5. Fabrication doesn't need degradation (teaser)

One fine-print field held a fictional bank whose name differs from a real megabank's by only a character or two. At 300 dpi, fully legible, some models "corrected" it to the real one in five out of five runs. Across the Gemini variants there were 48 such substitutions, while a fictional credit union with no real-world neighbor was read perfectly under the same conditions. The trigger isn't legibility; it's the existence of a nearby real entity. This one deserves its own write-up, with the receipts. Coming separately.

6. Classification survives reading loss

25 of 27 variants classified all 84 documents (invoice / receipt / business card / meeting minutes) correctly at every degradation step — including variants whose extraction had collapsed completely. A model that cannot read a document can still tell what kind of document it is. The two exceptions are instructive: the cheapest model on the board confuses receipts with invoices (21 out of 21 times — consistently, not randomly), and one @low variant dropped three classifications at the bottom of the ladder.

What I'd take into production

Route by tier, not by document. Titles survive almost anything; fine print dies first. If a field matters, measure the frontier of the tier it lives in.
Pick blankers for payment fields. A fabricated bank name passes every visual plausibility check. Prefer models that return "" over models that return something convincing.
Don't assume @low is one thing. Benchmark the variant you'll actually call, on the gateway you'll actually use.

Reproduce it (a free key is enough)

Everything — the deterministic material generator, the runner, the scorer, the raw outputs of all 4,158 jobs — is published:

github.com/ldxhub-io/examples → analyzedoc/legibility-benchmark/

The materials are byte-identical on any platform (the generator downloads a pinned, checksum-verified font). A three-variant reproduction subset runs in 147 jobs ≈ 17,600 credits, which fits inside LDX hub's free tier (25,000 credits/month, no card):

python3 gen_materials.py
export LDXHUB_API_KEY=...   # free key: gw.portal.ldxhub.io
python3 run_benchmark.py --models ume --t1-instances A --t1-reps 3 --t2-reps 1 --yes
python3 score_results.py && python3 report.py

Because raw model outputs ship with the results, you can disagree with my scoring rules and re-score everything without re-running a single job.

Full disclosure: I run LDX hub. It builds no models — it's the harness here, not a subject. One API key across OpenAI, Azure, Google, Anthropic and AWS is the only reason a 27-variant matrix fits in one afternoon, and that convenience is exactly what I'm selling. The measurements stand on the published raw data either way.

Caveats

Degradation is synthetic resampling, not real scanner noise — claims are limited to simulated legibility. One document type, one language (Japanese; if anything, a harder test than Latin script). The strict scorer counts character-level misreadings as fabrications, which flatters nobody. Results are a July 2026 snapshot; the ladder re-runs on every model addition, so the map will stay current.

The next time a provider ships a new vision model, it gets a row within a day. That's the point of building a ladder instead of writing a review.

When AI can't read, it invents — but it still sees the shape

Hideki Mori — Tue, 14 Jul 2026 13:00:00 +0000

I ran 110 vision-extraction jobs against a synthetic invoice. The low-resolution modes of the newest models never once returned the correct document — and the way they failed is worse than random noise.

All measurements in this post are as of July 10, 2026. Vision pipelines change; if you're reading this later, re-run the test before trusting the numbers.

The invoice that would have passed review

Here is a fragment of what GPT-5.6 (Sol, low-detail image mode) returned when I asked it to extract a synthetic invoice from a PNG:

{
  "vendor": "K Northwind Trading Ltd",
  "bill_to": "Accora Manufacturing Inc.",
  "invoice_number": "INV-2025-0731",
  "date": "2025-07-31",
  "subtotal": 757.5,
  "tax": 75.75,
  "total": 833.25
}

Every number is correct. Subtotal, tax, total, all four line-item amounts, all quantities, all unit prices — perfect, down to the cents. The vendor is right too.

The bill-to company does not exist. The real document says Aozora Manufacturing Inc.; the model wrote Accora (and, on other runs, Alcora). The invoice number is wrong in one systematic way: the year. The document says INV-2026-0731; the model wrote INV-2025-0731 — and then, consistently, dated the invoice 2025-07-31 to match. The two invented values agree with each other.

That last detail is the one that bothers me. A misreading scatters; over five runs you'd expect 2020, 2028, a garbled digit. This didn't scatter. Twenty out of twenty runs — five repeats, two image formats, two providers — said 2025, and the date field followed along. The model didn't fail to read the year — it composed a coherent document in which the year is 2025.

An extraction where the totals reconcile but the counterparty is fictional is precisely the kind of error that sails through an accounts-payable check. Nobody re-verifies the customer name when the arithmetic is clean.
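
The gap is easy to show with the values above. In this sketch the arithmetic check passes on the fabricated output while an identity check against the known bill-to fails. The function names are hypothetical, and `Decimal` is used so the comparison is exact:

```python
from decimal import Decimal

# Values returned by the model in the fragment above.
extracted = {
    "bill_to": "Accora Manufacturing Inc.",
    "subtotal": Decimal("757.50"),
    "tax": Decimal("75.75"),
    "total": Decimal("833.25"),
}
TRUE_BILL_TO = "Aozora Manufacturing Inc."  # what the document actually says


def totals_reconcile(doc: dict) -> bool:
    """Arithmetic check: subtotal + tax must equal total."""
    return doc["subtotal"] + doc["tax"] == doc["total"]


def identity_matches(doc: dict, expected_bill_to: str) -> bool:
    """Identity check: the counterparty must match a known value exactly."""
    return doc["bill_to"] == expected_bill_to


print("totals reconcile:", totals_reconcile(extracted))
print("bill-to matches: ", identity_matches(extracted, TRUE_BILL_TO))
```

A pipeline that only runs the first check accepts this document.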

The false positive that came first

Before I trust a finding like this, I have to tell you about the bug I almost blamed on the model — because it changed how I ran everything after.

My platform derives the output schema from an example_output the caller provides. My first test used example values like 10.0 and 22.0. Somewhere between my MCP client and the Java layer that infers the schema, JSON serialization collapsed 10.0 into 10 — an integer. The inferred schema said integer, and every model dutifully returned integer totals. 833.25 came back as 833, and in one configuration as 83325 — the decimal point simply gone.

For about an hour I had a tidy, wrong conclusion: "the new flagship's low mode corrupts decimals." Then I asked the question I should have asked first — did the example even survive the JSON round-trip? — and the whole finding evaporated. With examples like 12.34, every model produced clean decimals.

Lesson one, before any lesson about models: when output looks corrupted, suspect your test harness before the model. X.0 is not a safe way to say "this field is a float" in any pipeline that round-trips JSON.
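
Here is a small Python sketch of the trap, assuming a naive type inference over parsed JSON. Python's own `json.dumps(10.0)` keeps the `.0`, so the collapsed text is written by hand to stand in for whatever serializer dropped it. The helper names are hypothetical:

```python
import json


def infer_json_type(value) -> str:
    """Naive schema inference from a parsed JSON value (illustrative only)."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    return "string"


# What the example looks like after a serializer has dropped the trailing ".0".
collapsed = json.loads('{"total": 10}')
intact = json.loads('{"total": 12.34}')
print(infer_json_type(collapsed["total"]))
print(infer_json_type(intact["total"]))


def risky_float_examples(example: dict) -> list[str]:
    """Keys whose float value has no fractional part and may collapse to an integer."""
    return [key for key, value in example.items()
            if isinstance(value, float) and value.is_integer()]


print(risky_float_examples({"subtotal": 10.0, "tax": 22.0, "total": 12.34}))
```

Running a guard like `risky_float_examples` over the example before it leaves the client is one cheap way to catch this before any model is involved.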

That embarrassment is why everything below is n=5, scripted, with the pass/fail criteria frozen in code before the runs started.

The numbers

The setup: one synthetic invoice (all names fictional), rendered to JPEG and PNG. Ground truth: vendor K Northwind Trading Ltd., bill-to Aozora Manufacturing Inc., invoice number INV-2026-0731, total 833.25. A result counts as OK only if all four fields match. Eleven model configurations, two formats, five runs each — 110 jobs through my document-analysis API, which maps an @low / @high variant onto each provider's image-detail setting.

The low-detail modes of the GPT-5.5 and GPT-5.6 generations (Sol, Terra, Luna; direct and Azure-hosted): zero correct extractions in 70 attempts. Not "low accuracy" — zero. The failures were the kind shown above: arithmetic intact, identities invented.

The controls, same images, same five repeats:

Configuration	Result
GPT-5.5 / GPT-5.6 family, low detail (7 configs)	0 / 70 correct
GPT-5.4 (Azure), low detail	18 / 20 correct; 2 near-misses (a dropped space: "KNorthwind")
GPT-5.4 mini (Azure), low detail	10 / 10 correct
Gemini 3.5 Flash, low detail	10 / 10 correct
GPT-5.6 Sol, high detail	10 / 10 correct

An earlier run of the same protocol used a version of the invoice with smaller, lighter text. Same models, same 0-for-70 — but there the failures were total: complete fictional invoices, different every time. A vendor called Kramerwick Ltd. with a Brussels address. KittenPaws, LLC on Meowth Street. A Japanese company name the document never contained. Line items for services that don't appear anywhere in the image.

So the failure mode is not binary; it slides with legibility. Illegible source → the model invents the whole document. Partially legible source → the model reads what it can (the big bold totals) and invents the rest (the small print), stitching both into one internally consistent answer. The second mode is the dangerous one. A wholesale fabrication looks wrong at a glance. A half-real document does not.

The generational irony

Look at that table again. The token budgets are comparable: measured directly against the provider APIs (my gateway meters pages, not tokens), the OpenAI-family low mode spends about 315 tokens per image of this size; Gemini's low setting spends about 258. Gemini reads the invoice perfectly on the smaller budget. And GPT-5.4 (the older generation, same 315 tokens, same hosting path) gets it almost entirely right, with failures that look like classic OCR noise: a dropped space, a mangled character. It degrades the way you'd expect a reader to degrade.

The 5.5 and 5.6 generations do something different with the same pixels. Where 5.4 returns less, they return other. My best reading — and I'll flag it as interpretation, not measurement — is that the newer generations are stronger generators, and when perception runs out, generation fills the gap with whatever is most plausible. "Aozora" becomes "Alcora": right silhouette, right length, wrong word. 2026 becomes 2025, and the date agrees, because a coherent story beats a faithful blank.

Newer model, better prose, worse witness.

The reversal

At this point the obvious move is to kill the low-detail image modes entirely. I almost did. Then I ran the opposite experiment: instead of asking the same models to read documents, I asked them to sort them.

Round one: five synthetic documents with distinct layouts — invoice, receipt, business card, contract, blank page. Nine configurations (the seven "guilty" low modes plus two controls), three runs each: 135 / 135 correct, including refusing to force the blank page into a category.

Round two was designed to be unfair. Three Japanese business documents — 請求書 (invoice), 御見積書 (quotation), 注文書 (purchase order) — with identical layouts, identical tables, identical amounts, identical document numbers. The only difference is the title and one label line. You cannot sort these by shape; you must read the title. I then degraded them: photocopier noise, a 2° tilt, JPEG quality 25. Nine configurations, seven materials, three runs: 189 / 189 correct.

Same models. Same low-detail budget. 0-for-70 at reading the fine print; 324-for-324 at reading the headline and the shape.

The capability boundary is suddenly crisp: at ~300 tokens per page, these models see the title tier of a document reliably and the body tier not at all — and where the body tier fails, the 5.5+ generations fill it with fiction.

What I did about it

Delisting was the wrong answer — the classification result is real, and a sorting gate that costs a tenth of a high-detail read is genuinely useful (mixed scan folder → cheap low-detail triage → route each type to the right extraction pipeline). Silence was also the wrong answer: my catalog said the low modes were for "clean, large-text documents," and my own test — a clean, large-text document — had just proven that description wrong.

So the fix was one sentence of honesty. Every affected variant in my catalog now reads:

Low-resolution mode for fast, economical PDF extraction and document classification; **text read from JPEG/PNG images is unreliable at this resolution.**

(PDFs are unaffected in my measurements — the OpenAI-family models read PDF input through its text layer, so image downscaling never touches it. That's also why this failure hid so well: every PDF test passed.)

The formats stay listed. The capability stays available. The sentence tells you what 110 jobs taught me: what it's for, and what it will quietly get wrong.

The rule I'm keeping

My pipeline runs on a rule I've kept for a long time: absorb failures with deliberate retries, and always return a result. But a retry only helps when the next attempt can go differently — a network hiccup, a rate limit. This failure is worse than the structural kind. A malformed schema at least announces itself; you can validate before you send. A fabricated bill-to announces nothing. The request succeeds. The JSON validates. The totals reconcile.

You cannot retry your way out of this, and you cannot fully predict it either. What you can do is measure where the boundary sits — n=5, criteria frozen, controls included — and then write the boundary down where your users choose models. Not in a postmortem. In the catalog, in the sentence they read before they click.
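
A minimal sketch of the retry rule, assuming failures are sorted into a retryable class and everything else. `TransientError` and `call_with_retry` are hypothetical names, not my pipeline's API:

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures a retry can fix (timeouts, rate limits)."""


def call_with_retry(fn, attempts: int = 3, base_delay: float = 0.5):
    """Retry only TransientError; any other exception propagates immediately."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


def fabricating_model():
    # Succeeds on the first call: nothing is raised, so nothing is retried.
    return {"bill_to": "Accora Manufacturing Inc.", "total": 833.25}


print(call_with_retry(fabricating_model))
```

The fabricated result passes straight through. From the wrapper's point of view it is a success, which is why the boundary has to be measured and documented rather than caught at call time.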

The models will keep getting better at writing. That is exactly why "I couldn't read this" increasingly comes back as fluent, internally consistent, confidently formatted text. The shape is real. The details may be fiction. Design — and document — accordingly.

Method notes: 110 extraction jobs + 324 classification jobs against synthetic documents (all names fictional), run July 10, 2026 via my document-processing gateway with per-provider low/high image-detail variants; token figures measured directly against provider APIs. Pass/fail criteria were fixed in the harness before execution. The generation and test scripts, plus the recorded runs, are public: ldxhub-io/examples › analyzedoc/low-detail-study.

The graph nobody is watching

Hideki Mori — Mon, 13 Jul 2026 13:00:00 +0000

If you ask me what part of the system I protect the most, the answer is the database.

I've been writing software alone for twenty-four years, and across every platform I've built, the rule has stayed the same: the web servers can take whatever you throw at them, the batches can be rebuilt, but the database has to stay idle on purpose. Not because I love idle databases, but because the day a database actually starts to struggle is a day with very few good options.

This article is about what "keep the database idle on purpose" actually means in practice, and about one particular kind of graph that, in my experience, almost nobody is watching.

The three layers and what each of them gets

I think of a production system as having three tiers, and each tier gets a different rule.

The web server tier can be horizontally scaled. If load grows, you add machines. If something is wrong, you take a machine out of the pool, and the others handle it. Failures here are visible immediately, and they're cheap to recover from.

The batch server tier can be scaled up or out depending on the work. A batch that's too slow can be split. A batch that crashes can be retried. End users don't see batch servers, so a stuck batch is a problem for me and not for them. Some headroom up here is fine.

The database tier is the one I treat completely differently. The database is not where you absorb load. The database is what you protect from load. The reason is simple: the other tiers can be rebuilt or re-scaled. The database is the irreplaceable record. If it slows down, everything slows down. If it falls over, you don't have many minutes before the rest of the stack notices.

So my rule for the database is: keep it idle. Not idle in the sense of "doing nothing." Idle in the sense of "running well below its capacity, at all times, so that any extra load it picks up has somewhere to go."

For more than a decade I ran a large appliance-grade database where I kept the load average below 1 at all times. Not as a target. As a fact. If the load average went up, that was the signal that something had changed in the application and I needed to find it before the database told me about it.

How I keep it idle

A few habits, repeated for decades.

Aggregate periodically, not on demand. When an application needs a daily total, a monthly summary, a yearly count, the wrong thing is to compute it at the moment of the request. The right thing is to compute it ahead of time, on a schedule, into a summary table the application can read from cheaply. If the summary needs to be refreshed every minute, that's fine — a per-minute aggregation against a well-indexed working set is a small, predictable cost. An on-demand aggregation against the full source table is a large, unpredictable cost, and it scales with data growth in a way you don't want.
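A minimal sketch of this habit, using Python's built-in sqlite3 so it runs anywhere. The table names (orders, daily_totals) and the two function names are hypothetical, chosen only for illustration; a scheduler such as cron would call the refresh function, and the request path would only ever call the read function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        day TEXT NOT NULL,
        amount INTEGER NOT NULL
    );
    CREATE INDEX idx_orders_day ON orders(day);
    CREATE TABLE daily_totals (
        day TEXT PRIMARY KEY,
        order_count INTEGER NOT NULL,
        total_amount INTEGER NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO orders (day, amount) VALUES (?, ?)",
    [("2026-07-01", 1200), ("2026-07-01", 800), ("2026-07-02", 500)],
)


def refresh_daily_totals(conn, day):
    """Scheduled job: recompute one day's summary row from the indexed source rows."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT OR REPLACE INTO daily_totals (day, order_count, total_amount)
            SELECT ?, COUNT(*), COALESCE(SUM(amount), 0)
            FROM orders
            WHERE day = ?
            """,
            (day, day),
        )


def read_daily_total(conn, day):
    """Request path: one primary-key lookup, never a scan of the source table."""
    return conn.execute(
        "SELECT order_count, total_amount FROM daily_totals WHERE day = ?",
        (day,),
    ).fetchone()


refresh_daily_totals(conn, "2026-07-01")
print(read_daily_total(conn, "2026-07-01"))  # (2, 2000)
```

The summary can be slightly stale between refreshes. That is the trade: a bounded, predictable cost on a schedule instead of an unbounded one at request time.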

Cut joins dynamically. When a query joins many tables but a particular filter condition makes some of those tables redundant, the query construction layer can skip them. The fewer tables in the join, the less work for the planner and the executor. This kind of work is invisible to the application engineer — it lives in the layer that builds the SQL — but it pays for itself many times over the lifetime of the system.
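One way to picture this is a query builder that adds a join only when a requested filter actually needs the joined table. This is a sketch under assumptions, not the construction layer described above: the table names, column names, and filter keys are all hypothetical.

```python
def build_order_query(filters):
    """Build SQL that contains only the joins the requested filters need."""
    joins, where, params = [], [], []

    if "customer_region" in filters:
        # customers is joined only when something filters on it
        joins.append("JOIN customers c ON c.id = o.customer_id")
        where.append("c.region = ?")
        params.append(filters["customer_region"])

    if "product_category" in filters:
        # this filter needs two extra tables; without it, both are skipped
        joins.append("JOIN order_items i ON i.order_id = o.id")
        joins.append("JOIN products p ON p.id = i.product_id")
        where.append("p.category = ?")
        params.append(filters["product_category"])

    if "status" in filters:
        # lives on the base table, so no join at all
        where.append("o.status = ?")
        params.append(filters["status"])

    sql = "SELECT DISTINCT o.id FROM orders o"
    if joins:
        sql += " " + " ".join(joins)
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


print(build_order_query({"status": "paid"}))
print(build_order_query({"customer_region": "Kanto", "status": "paid"}))
```

The first call produces a single-table query; the second adds exactly one join. Values always travel as bound parameters, so the builder only ever concatenates fixed SQL fragments.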

Refuse to optimize the optimization. Once, an infrastructure engineer suggested running an enterprise database optimization dashboard against my database, to surface query-level improvement candidates. The pitch was that with more compute and storage capacity, we could push queries to run much faster. I declined.

The headline reason was that the database wasn't there only to serve analytical queries. Its primary job was to keep the user-facing OLTP layer responsive, which meant the spare capacity I was carrying was a buffer for user load, not a budget to be spent on faster reports. The secondary reason was that the optimization process itself would have consumed CPU on the database, and the database is the one place where extra processes are not free. The database I had built was already running below a load average of 1. There was nothing to optimize at that level. Adding optimization itself would only have added load.

These three habits are not clever. They don't require special tools. They require the willingness to put the database first in design decisions, every time, even when the application engineer's instinct is to do otherwise.

The graph nobody is watching

Here's the part of this that I think is genuinely under-discussed.

I've been looking lately at two production database graphs, from two different services I'm involved with. Both graphs cover the last several months. Both are showing the kind of metric — read-row-count from random access — that tracks how much physical work the database is being asked to do.

Service A's graph is bumpy. Most of the time it sits near zero. Several times a week, there's a spike — 100,000 reads, sometimes more, in a short burst. The spikes are predictable in shape: they happen when a dashboard somewhere runs an on-demand aggregation. The fix would be to move that aggregation off the live database and onto a summary table refreshed periodically, or onto a separate analytics store. The shape is alarming on a single graph, but the architecture explains it.

Service B's graph is the opposite, and that's what makes it the more dangerous of the two. It's not bumpy. It's a slow, steady upward slope. Several months ago the line sat around 12,000. Today it sits around 18,000, roughly 50% more physical reads. Usage is not growing; in fact, the user count for this service has been declining over the same window.

There's no spike to point at. There's no incident to investigate. There's no alert that has fired, because no threshold has been crossed. There's only a slope.

This is what I mean by "the graph nobody is watching." Spikes get attention. Sudden failures get attention. A gradual upward slope, on a metric most teams don't even look at, while usage is flat or declining — this gets no attention at all. And yet it is, in my reading, the more serious signal. Something inside the system is doing more physical work to serve a smaller number of users. The application has degraded silently, in a way the dashboard isn't designed to detect.

The hard part is that the only way to spot this is to look at the graph carefully, over a long enough window that the slope can become visible. A glance at last week's numbers tells you nothing. A glance at last month's numbers tells you very little. The kind of degradation I'm describing only resolves into a recognizable shape when you've been watching the same metric over an extended period — long enough that small monthly differences become a slope.
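The eyeball check can be backed by a small calculation: fit a straight line to the monthly values and compare the direction of the work metric with the direction of usage. The sketch below is an assumption-laden illustration. Only the endpoints (around 12,000 and around 18,000) come from the graph described above; the intermediate monthly values and the user counts are invented for the example, and the function names are hypothetical.

```python
def slope_per_step(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def unexplained_growth(work, users):
    """True when physical work trends up while usage is flat or falling."""
    return slope_per_step(work) > 0 and slope_per_step(users) <= 0


# Hypothetical monthly samples; only the first and last read values
# echo the figures in the text.
monthly_reads = [12000, 13000, 14000, 15000, 16000, 17000, 18000]
monthly_users = [1000, 980, 960, 940, 920, 900, 880]

print(slope_per_step(monthly_reads))   # 1000.0 more reads per month
print(unexplained_growth(monthly_reads, monthly_users))  # True
```

A check like this has no threshold to cross, which is the point: it compares two directions over a long window rather than one value against a limit.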

The asymmetry of upward slopes

Not every upward slope is dangerous. Disk usage climbs because logs accumulate — explainable, dismissable. Connection counts climb because a new client integration came online — explainable, dismissable. Some upward slopes have a story behind them, and the story is fine.

The dangerous upward slopes are the ones without a story. A database doing more work for fewer users has no story that's good. Either the data model has grown in a way the queries weren't designed for, or some piece of code is doing many more operations per request than it used to, or some background job has multiplied without anyone noticing. None of these have an alert attached. All of them are visible only as a slope on a graph that someone has to be looking at.

Twenty-four years of running databases has taught me that the alert thresholds are not the boundary between "fine" and "in trouble." The alert thresholds are the boundary between "I can keep ignoring this" and "I have to act now." There's a whole region below the threshold that contains all the early warnings, and you only see that region if you go looking for it.

Closing

The database is the one component in my stack that I won't let degrade. Everything else exists, in part, to keep the database from being asked to do too much. The web tier absorbs the user. The batch tier absorbs the work that doesn't have to be live. The aggregation layers absorb the queries that would otherwise hit the source tables. The dynamic-join construction absorbs the cost of joins that don't need to happen. All of this exists so that the database can stay below a load average of 1, all day, every day, for years at a time.

When I look at a system someone else built and the database is the part that surprises me, I read it as a sign that the surrounding layers haven't been doing their job. The database telling you it's tired is a late signal. The graph nobody is watching is an earlier one.

This is not what you should do. This is what twenty-four years has taught one specific person to do.

Built with Claude (Opus).


The example is the schema: extracting Japanese qualified invoices to JSON

Hideki Mori — Tue, 07 Jul 2026 13:35:40 +0000

Japan's qualified invoice system requires every invoice to carry a registration number (a "T" followed by 13 digits) and a per-rate tax breakdown — 8% reduced rate for food, 10% standard, frequently mixed on the same document. That makes Japanese invoices a nice stress test for structured extraction: non-Latin text, full-width characters, honorific suffixes, kanji-formatted dates, and two tax rates whose arithmetic has to reconcile to the yen.

This post runs one through AnalyzeDoc (LDX hub) — PDF, JPEG, or PNG in, structured JSON out — and looks closely at what came back. The part worth your time isn't that it works. It's how the schema is defined, and what that definition quietly controls.

There is no JSON Schema. You hand the API an example of the output you want, and the example is compiled into the schema.

The document

A fictional qualified invoice, one page (sample PDF in the repo). Four line items: two food items at the 8% reduced rate (marked ※, as the law requires), two at 10%. Registration number, per-rate tax summary, bank details, a payment deadline. Total: ¥42,210.

The example is the schema

Instead of a schema file, you send example_output:

{
  "invoice_number": "INV-2025-0001",
  "issue_date": "2025-01-15",
  "due_date": "2025-02-28",
  "issuer_name": "株式会社サンプル",
  "registration_number": "T9876543210987",
  "issuer_phone": "03-9876-5432",
  "customer_name": "株式会社テスト商会",
  "line_items": [
    { "description": "サンプル品目A", "quantity": 5, "unit_price": 2000, "amount": 10000, "tax_rate": 10 }
  ],
  "tax_summary": [
    { "tax_rate": 10, "taxable_amount": 10000, "tax_amount": 1000 }
  ],
  "subtotal": 10000,
  "total_tax": 1000,
  "total": 11000,
  "bank_details": "サンプル銀行 本店 普通 0000000"
}

Three rules make this work, and the first one is the whole article:

Your example values are type declarations. Write 10000 and the field is inferred as an integer; write 1234.56 and it's a number. Japanese yen has no decimals, so integers are the semantically correct choice here — every amount in the output will be a clean integer your downstream systems can trust. For a USD invoice you'd do the opposite: always write the example amounts with decimals, or a $250.00 line item risks coming back as an integer type. The literals you type are the contract you get.

Arrays receive variability. tax_summary is an array because mixed rates are the entire point of a qualified invoice. One entry per rate — the schema absorbs the document's core complexity instead of fighting it.

Example values must differ from the document. If the example matches the invoice, you can't tell whether the model read the page or copied the example. Different numbers, different names, different dates.
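To make the first rule concrete, here is a rough sketch of how example literals could be turned into type declarations. It is not AnalyzeDoc's actual compiler, whose internals are not shown here; infer_schema is a hypothetical name, and the output only loosely follows JSON Schema vocabulary.

```python
import json


def infer_schema(example):
    """Derive a JSON-Schema-like type description from an example value."""
    if isinstance(example, bool):  # checked before int: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(example, int):
        return {"type": "integer"}
    if isinstance(example, float):
        return {"type": "number"}
    if isinstance(example, str):
        return {"type": "string"}
    if example is None:
        return {"type": "null"}
    if isinstance(example, list):
        # The first element stands in for every element of the array.
        items = infer_schema(example[0]) if example else {}
        return {"type": "array", "items": items}
    if isinstance(example, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(value) for key, value in example.items()},
        }
    raise TypeError(f"unsupported example value: {example!r}")


example = json.loads('{"jpy_total": 11000, "usd_total": 250.00, "tax_summary": [{"tax_rate": 10}]}')
schema = infer_schema(example)
print(schema["properties"]["jpy_total"])  # {'type': 'integer'}
print(schema["properties"]["usd_total"])  # {'type': 'number'}
```

The literal 250.00 survives parsing as a float, so the decimal point in the example is what keeps the field from being declared an integer.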

The prompt carries rules, not fields

Field descriptions belong to the example. The prompt is only for rules the example can't express:

Extract the invoice data from this Japanese qualified invoice (適格請求書).
Rules:
- Dates in YYYY-MM-DD format.
- All monetary amounts as integers in JPY (no separators, no currency symbols).
- registration_number in "T + 13 digits" format.
- tax_rate as an integer percentage (8 or 10).
- tax_summary must contain one entry per tax rate on the invoice.
- Phone numbers must not start with "+".

The last rule is scar tissue: a +81-prefixed string dropped into Google Sheets gets parsed as a formula. Cheaper to kill it at the extraction boundary than to escape it everywhere downstream.
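The same rules can be re-checked on the way out, so a violation is caught at the extraction boundary rather than in a spreadsheet. The sketch below is a hypothetical guard, not part of the AnalyzeDoc API; check_rules is an invented name, and it assumes the field names from the example output above.

```python
import re
from datetime import date

# [0-9] rather than \d: in Python, \d also matches full-width digits,
# which is exactly what a Japanese document might contain.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
REGISTRATION_PATTERN = re.compile(r"T[0-9]{13}")


def check_rules(doc):
    """Return a list of rule violations; an empty list means the output conforms."""
    problems = []

    for key in ("issue_date", "due_date"):
        value = str(doc.get(key, ""))
        if not DATE_PATTERN.fullmatch(value):
            problems.append(f"{key}: not YYYY-MM-DD")
            continue
        try:
            date.fromisoformat(value)  # rejects impossible dates such as 2026-02-30
        except ValueError:
            problems.append(f"{key}: not a real calendar date")

    if not REGISTRATION_PATTERN.fullmatch(str(doc.get("registration_number", ""))):
        problems.append("registration_number: expected T + 13 digits")

    if str(doc.get("issuer_phone", "")).startswith("+"):
        problems.append("issuer_phone: must not start with '+'")

    for index, item in enumerate(doc.get("line_items", [])):
        if item.get("tax_rate") not in (8, 10):
            problems.append(f"line_items[{index}].tax_rate: expected 8 or 10")

    return problems


sample = {
    "issue_date": "2026-06-30",
    "due_date": "2026-07-31",
    "registration_number": "T1234567890123",
    "issuer_phone": "03-1234-5678",
    "line_items": [{"tax_rate": 8}, {"tax_rate": 10}],
}
print(check_rules(sample))  # []
```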

Running it

Four calls. No polling loop — ?wait parks the request server-side.

# 1. Upload → file_id
curl -s -X POST https://gw.ldxhub.io/files \
  -H "Authorization: Bearer $LDXHUB_API_KEY" \
  -F "file=@invoice-sample-ja.pdf"

# 2. Create the job
curl -s -X POST https://gw.ldxhub.io/analyzedoc/jobs \
  -H "Authorization: Bearer $LDXHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d @job.json

# 3. Wait for completion (server-side)
curl -s "https://gw.ldxhub.io/analyzedoc/jobs/$JOB_ID?wait=30" \
  -H "Authorization: Bearer $LDXHUB_API_KEY"

# 4. Fetch the result
curl -s "https://gw.ldxhub.io/files/$OUTPUT_FILE_ID/content" \
  -H "Authorization: Bearer $LDXHUB_API_KEY"

job.json is five fields: model, file_id, output_format, system_prompt, example_output. The model here is google/gemini-3.5-flash@high.

The result

Completed in 26 seconds:

{"invoice_number":"INV-2026-0157","issue_date":"2026-06-30","due_date":"2026-07-31","issuer_name":"株式会社グリーンリーフ食品","registration_number":"T1234567890123","issuer_phone":"03-1234-5678","customer_name":"サンプルマート株式会社","line_items":[{"description":"有機緑茶ギフトセット ※","quantity":2,"unit_price":3000,"amount":6000,"tax_rate":8},{"description":"国産純粋はちみつ 500g ※","quantity":10,"unit_price":1200,"amount":12000,"tax_rate":8},{"description":"陶器マグカップ（箱入）","quantity":24,"unit_price":800,"amount":19200,"tax_rate":10},{"description":"配送料","quantity":1,"unit_price":1500,"amount":1500,"tax_rate":10}],"tax_summary":[{"tax_rate":8,"taxable_amount":18000,"tax_amount":1440},{"tax_rate":10,"taxable_amount":20700,"tax_amount":2070}],"subtotal":38700,"total_tax":3510,"total":42210,"bank_details":"サンプル銀行 本店 普通 1234567 カ）グリーンリーフショクヒン"}

Verification: the four line amounts sum to 38,700 = subtotal. The 8% base (18,000) yields 1,440; the 10% base (20,700) yields 2,070; together 3,510 = total tax; 42,210 = total. All fifteen fields match the source. No transposed digits, no invented fields.
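The same verification can be scripted so it runs on every result rather than once by hand. This is a sketch, with reconcile as a hypothetical helper name. The numbers are copied from the result above with the text fields left out. The one-yen tolerance on tax is an assumption standing in for whatever rounding rule the issuer uses; on this invoice the tax amounts are exact.

```python
def reconcile(doc):
    """Return a list of arithmetic mismatches; an empty list means it all adds up."""
    errors = []

    for index, item in enumerate(doc["line_items"]):
        if item["quantity"] * item["unit_price"] != item["amount"]:
            errors.append(f"line {index}: quantity x unit_price != amount")

    if sum(item["amount"] for item in doc["line_items"]) != doc["subtotal"]:
        errors.append("line amounts do not sum to subtotal")

    for row in doc["tax_summary"]:
        rate = row["tax_rate"]
        base = sum(i["amount"] for i in doc["line_items"] if i["tax_rate"] == rate)
        if base != row["taxable_amount"]:
            errors.append(f"{rate}%: taxable_amount does not match line items")
        # Integer arithmetic: tax must be within one yen of base * rate / 100.
        if abs(row["tax_amount"] * 100 - base * rate) >= 100:
            errors.append(f"{rate}%: tax_amount does not match rate")

    if sum(row["tax_amount"] for row in doc["tax_summary"]) != doc["total_tax"]:
        errors.append("tax rows do not sum to total_tax")

    if doc["subtotal"] + doc["total_tax"] != doc["total"]:
        errors.append("subtotal + total_tax != total")

    return errors


result = {
    "line_items": [
        {"quantity": 2, "unit_price": 3000, "amount": 6000, "tax_rate": 8},
        {"quantity": 10, "unit_price": 1200, "amount": 12000, "tax_rate": 8},
        {"quantity": 24, "unit_price": 800, "amount": 19200, "tax_rate": 10},
        {"quantity": 1, "unit_price": 1500, "amount": 1500, "tax_rate": 10},
    ],
    "tax_summary": [
        {"tax_rate": 8, "taxable_amount": 18000, "tax_amount": 1440},
        {"tax_rate": 10, "taxable_amount": 20700, "tax_amount": 2070},
    ],
    "subtotal": 38700,
    "total_tax": 3510,
    "total": 42210,
}
print(reconcile(result))  # []
```

A check like this catches transposed digits. It cannot catch a wrong name or a wrong bank, because those fields have no arithmetic to violate.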

What the model understood

The arithmetic is table stakes. The details are where it gets interesting.

The honorific is gone. The invoice addresses the customer as 「サンプルマート株式会社　御中」. 御中 (onchū) is an honorific suffix appended to company names in correspondence — roughly "To the esteemed...". It is not part of the name, and customer_name came back without it. Nothing in the prompt asked for that.

The ※ marks stayed. Line descriptions preserve the reduced-rate marker verbatim (「有機緑茶ギフトセット ※」) while the meaning lives in tax_rate: 8. Text fidelity in one field, semantics in another — a separation of concerns that exists because the schema was designed to hold it.

Kanji-formatted dates normalized. 「2026年6月30日」 became 2026-06-30 off a single prompt line.

Every amount is an integer. Because the example said so. 6000, not 6000.0.

The footer disclaimer went nowhere. The sample PDF carries a "this is a sample" notice at the bottom. No schema field fits it, so it correctly appears in none.

Swap the model, keep the code

Changing one string — the model ID — switches the same request across OpenAI, Azure, Google, Anthropic, and Amazon models (15 at the time of writing). Clean, printed layouts run fine on fast, cheap models; degraded scans or handwriting can go to the heavyweight tier. The JSON shape doesn't change, because the schema compatibility is handled below the model line.

Cost

This page cost 285 credits (~$0.029). The free plan includes 25,000 credits — about 85 pages' worth of finding out whether your own invoices survive contact.

Try it

The quickstart is built to go from signup to a completed job in about 60 seconds.

The fastest evaluation is the one you run on your own documents.

How I removed the middleman, one phone call at a time

Hideki Mori — Mon, 06 Jul 2026 13:00:00 +0000

In the mid-2000s I worked on a content distribution platform that served over a hundred storefronts. Books, music, comics — different shops, the same underlying content, each store with its own branding.

Among those hundred-plus storefronts, several dozen were technically resold through a third party — a viewer-side company that operated its own delivery infrastructure on top of ours. Their architecture required us to push every file into per-storefront slots on their servers, before the end user could download anything. Hundreds of slots. Per file.

This article is about how those slots disappeared, and what I learned along the way.

How the layers got there

When the platform launched, we were not the dominant player. The market for digital content on Japanese mobile phones was already shaped by a few established companies, and one of them sat between us and several dozen of our storefronts. They had built a viewer client that the end user installed on their phone. They had built a delivery server that the viewer pulled from. They had a working business.

We came in as the catalog provider. The deal was simple in principle: their viewer, our content. To make their viewer work, we had to put files into the directory structure their delivery server expected — one directory per storefront, the same file copied into each.

For a small catalog, this was tolerable. For a growing catalog, it was wasteful. A single new title meant dozens of identical writes — same bytes, different paths — every time we published.

We were the latecomer. We didn't push back. We did the work their architecture required.

The phone call

One Friday afternoon I got a phone call from the lead engineer on their side.

"Could you reduce the frequency of your pushes? Our servers are having trouble keeping up."

The phrasing was polite, but the request was unusual. We had recently parallelized our delivery batches — moved from sequential pushes to concurrent ones — and the volume of writes had multiplied. We were doing what we should have been doing for our growing catalog. They were absorbing it on infrastructure that hadn't been designed for that rate.

I could have agreed. Reducing batch concurrency was a one-line config change on our side. It would have made their afternoon easier and our publishers' release schedules a little slower.

I didn't agree. I made a counter-proposal.

"What if we stopped pushing per-storefront entirely? Could your viewer fetch directly from us, with the storefront as a parameter, instead of from a pre-placed copy?"

There was a pause on the line. Not a hostile pause. The kind of pause an engineer makes when an idea reorganizes itself in their head.

"Let me think about that and get back to you," they said.

The two steps that followed

It would be neat to say they called back the next day and we shipped it. That's not what happened. The conversation that started on that phone call took roughly two years to finish, in two discrete steps. Neither of the steps was urgent. Each of them was treated, on both sides, as a quiet refactor.

Step one: stop the per-storefront duplication.

The first thing we agreed was that, even if the viewer still fetched through their infrastructure, the file did not need to be duplicated per storefront on disk. They modified their delivery server to look up a single canonical copy, with the storefront determined by a parameter on the URL. We changed our push so that each file was uploaded once, not duplicated per storefront.

Our delivery batch shrank by an order of magnitude. Their disk usage dropped accordingly. Nothing changed for the end user.

Step two: remove the dependency on their server entirely.

The second thing was larger. Since the file was already canonical on our side, there was no architectural reason for the data path to go through their infrastructure at all. They updated their viewer to fetch directly from our delivery infrastructure, using the parameters they had previously injected on their server side. They kept the catalog metadata and the user-account layer; everything else went away.

When the last storefront cut over, the middleman delivery server was no longer in the data path. They turned off the disks they had been running for us. We turned off the placement processing we had been running for them.

What I learned

It's tempting to read this as a story about how a smaller company outmaneuvered a larger one. That's not what happened. What happened was that I made a proposal in the middle of a routine operational complaint, and over two years both sides incrementally chose to remove a layer that had stopped earning its keep.

There are a few things I took from it.

A symptom request is an opportunity to ask about the cause. "Please reduce your push frequency" is a symptom request. The honest underlying question is "why are we doing this many writes at all?" Most operational complaints I've received over the years, in the moment, sounded like requests for symptom relief. Some of them turned out to be requests for an architectural conversation that nobody had explicitly opened yet.

When I get a symptom request now, I always try to surface the cause version of the same request, at least once, to see what happens. Sometimes nothing. Sometimes a two-year refactor.

Conviction has to come with the other party's interest. I was able to propose removing their delivery infrastructure because, by that point, our volume was material to them. A meaningful share of their viewer's traffic came from our catalog. Removing a layer that had stopped paying for itself was, for them, a saving — not a loss. Without that alignment, the same proposal would have been a polite no.

The conviction wasn't only mine. It was conviction plus a real shape of interests. Without the second, no amount of conviction moves a vendor's roadmap.

Latecomers can rearrange the layer cake. When we started, we accepted their architecture because they were the established party and we were not. Two years later, we redefined the architecture together. The thing that changed in between was not their willingness to change. It was our weight in the system. The latecomer can become the party who proposes the new shape, once the latecomer is no longer small.

I think a lot of the operational shapes that look fixed in a given industry are, in fact, shapes that nobody has been in a position to renegotiate. The middleman in our case wasn't there because anyone defended its existence. It was there because nobody yet had reason to ask whether it should be.

Twenty-four years later

I'm telling this story now because I keep finding myself doing variations of it. A vendor surfaces an operational complaint. I look at the complaint, and I find — sometimes — that the structure underneath the complaint is the actual subject of the conversation. The complaint is the symptom. The conversation is the door.

Not every complaint opens a door. Most of them are just complaints. But some of them are an invitation to ask whether the current shape is the right one, and a year or two later you find that the shape is no longer there.

This is not what you should do. This is what twenty-four years has taught one specific person to do.

Built with Claude (Opus).


The 3-line discipline

Hideki Mori — Mon, 29 Jun 2026 13:00:00 +0000

When I write code in unfamiliar territory, I write three lines, then I run it.

Then I write three more lines, and I run it again.

I've been doing this for twenty-four years. It's the most specific habit I have. I almost didn't write this article, because the habit feels too small to be worth describing — but then I noticed that it's the part of my way of working that I can never seem to explain to someone in real time. It needs writing down.

Three principles

The discipline rests on three things I believe about writing code. They're not deep. They've just stayed with me.

1. Trust nothing but your own code.

If you can't trust the code you wrote yourself, what can you trust? Not a library, not a vendor's documentation, not your own assumption from yesterday. The only thing in the system whose behavior you can fully verify is the code you just typed, by running it.

2. Write in code, not in language.

If you're describing what the code should do in Japanese or English, you're spending the same time you could have spent writing the code itself. By the time the code runs, the description is already done — by the code, in a more precise form than any language could give it.

3. Make three lines complete.

The three lines you just wrote should be complete. Error handling included. Validation included. Logging included. Not "I'll add validation later." Not "I'll wrap it in a try-catch later." Three lines, complete, then run.

(There's a small exception to this. Sometimes you do want to ignore every error and move on — for instance, when you're trying to understand whether the happy path works at all before you care about anything else. That's a different mode, used deliberately. It's not the same as "I'll handle errors later.")

Why three lines

Three lines is roughly the unit of thought I can hold completely. Five lines, and I start guessing what the third line did. Ten lines, and I'm reading the code as if it were someone else's. Three lines is the size that stays mine.

When three lines run and produce what I expected, I keep them. When they don't, I either fix them or delete them. The cost of deleting three lines is small enough that I have no attachment to keeping them.

What I'm protecting, by writing this small, is the alignment between the code in my head and the code that's actually running. When that alignment is intact, debugging is fast: I know which three lines just changed. When that alignment slips — because I wrote thirty lines without running them — debugging becomes archaeology. I'd rather spend the time in three-line increments and avoid the archaeology.

Components: write the caller, run, write again, run

When I introduce a new component into a system — a new library, a new vendor's API, a new framework — I don't start by writing the code that needs the component. I start by writing the code that calls the component.

The caller is small at first. Three lines. I call the component, I print what comes back, I run it. Then three more lines. I call it with different arguments. I print what comes back. I run it. Three more lines. I call it in a way that should fail. I see what failure looks like.
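As an illustration of that rhythm, here are three small probes against a component. Python's standard json module stands in for the unfamiliar library purely because it is available everywhere; the inputs are arbitrary. Each probe would normally be written and run before the next one exists.

```python
import json

# Probe 1: the call expected to work. Print exactly what comes back.
ok = json.loads('{"total": 42210}')
print(repr(ok), type(ok["total"]).__name__)

# Probe 2: a different argument. Does a decimal literal change the type?
decimal = json.loads('{"total": 42210.0}')
print(repr(decimal), type(decimal["total"]).__name__)

# Probe 3: a call that should fail. Learn what failure looks like
# before real code depends on it.
failure = None
try:
    json.loads('{"total": }')
except json.JSONDecodeError as exc:
    failure = exc
    print(type(exc).__name__, "at position", exc.pos)
```

After three probes the return type, one type edge case, and the exception class are all known from observation rather than from the documentation.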

By the time I've spent an hour doing this, I know how the component behaves on the inputs I care about, how it behaves on edge cases, how it fails, what it returns when it fails. I have a small body of code that has tested the component from the outside, written in my own hand.

After that, the component almost never surprises me when I integrate it into the real work. It surprised me already, during the hour I was poking at it, and I noted what I learned.

This part of the discipline is what twenty-four years has changed about me the most. When I started, I would try to use a new library inside the real code right away, and then I'd be debugging both the library's behavior and my own use of it at the same time. I don't do that anymore. The library has to pass a small private interview first.

Even then, live data still surprises you

Here's the part that keeps the discipline honest: even when you've done all of the above, live data will still surprise you. The vendor's API will return something the documentation never mentioned. The status code will say success while the payload says something else. The same call you've made a thousand times will, on the thousand-and-first try, return data that belongs to a different question entirely.

I worked with a fairly mainstream translation API for many years. It's the kind of API most people in the industry have heard of. In day-to-day operation, the integration was stable. But in the operational record of how my code calls it, there are several places where I had to write defenses that aren't suggested by the documentation:

The languages endpoint, when asked for target languages, sometimes returned the response shaped like a source language listing. The HTTP status was 200. The JSON parsed. But a field that should be present on target languages was missing. The fix wasn't to escalate or to file a bug — it was to detect the mismatch in my code, log it as a retry-worthy condition, and call the endpoint again. The next call usually returned correctly.
Certain error messages from the API turned out to be retry-worthy, even though the HTTP status code didn't say so. A "Temporary Error" in the response body, or a "Tag handling parsing failed", both warranted retrying. I learned this not from the documentation but from watching the production logs over many months.
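The two defenses above can be sketched as one classifier plus a small retry loop. This is an illustration under assumptions, not the production code: the response shape (a status code plus a parsed body), the "message" key, and the field name target_only_field are all hypothetical, since the text does not name the vendor's fields. Only the two message strings are taken from the text.

```python
import time

# Message fragments observed to be transient, regardless of HTTP status.
RETRYABLE_MESSAGES = ("Temporary Error", "Tag handling parsing failed")


def is_retry_worthy(status, body, required_field=None):
    """Decide from the payload, not only the status code, whether to call again."""
    if isinstance(body, dict):
        message = str(body.get("message", ""))
        if any(marker in message for marker in RETRYABLE_MESSAGES):
            return True
    if status == 200 and required_field is not None and isinstance(body, list):
        # Success status and valid JSON, but entries shaped like the wrong listing.
        return any(required_field not in entry for entry in body)
    return False


def call_with_retry(call, required_field=None, attempts=3, delay_seconds=0.0):
    """Call until the response is not retry-worthy, logging each mismatch."""
    for attempt in range(1, attempts + 1):
        status, body = call()
        if not is_retry_worthy(status, body, required_field):
            return status, body
        print(f"retry-worthy response on attempt {attempt}: status={status}")
        time.sleep(delay_seconds)
    raise RuntimeError(f"still retry-worthy after {attempts} attempts")


# Simulated endpoint: the first answer is mis-shaped, the second is correct.
fake_responses = iter([
    (200, [{"language": "JA"}]),
    (200, [{"language": "JA", "target_only_field": True}]),
])
print(call_with_retry(lambda: next(fake_responses), required_field="target_only_field"))
```

Responses that are not retry-worthy, including ordinary errors, are handed back to the caller untouched; the loop only absorbs the conditions it has been taught to recognize.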

None of this is a complaint about the vendor. It's a mainstream API. The point isn't that it's flawed; the point is that any API run at scale, against real data, will produce these moments. The documentation is a description of intended behavior, not observed behavior. Observed behavior, in production, is always wider.

So even after the three-line discipline, even after the private interview with the component, the system goes into production and surprises me. Not catastrophically. Quietly. A condition I hadn't tested, behaving in a way I hadn't predicted.

I don't experience this as a failure of the discipline. I experience it as the part of the work that the discipline doesn't cover — and was never going to cover. The discipline brings me to the doorstep with my code in good shape. Live data is what's on the other side of the door, and it's not something I get to fully prepare for. It's something I respond to.

This is, I think, the part of the work I find most enjoyable.

Twenty-four years of this

I didn't set out to develop a discipline. I started writing software in 2002, at a company that was almost out of money, on a product that needed to ship in thirty days. There was no time to write thirty lines and then debug them. I wrote three lines, I ran them, I wrote three more. The shape of how I work formed itself in that situation, and it never went away.

What's changed over twenty-four years is what I do with the result. The three-line increments are the same. The "write the caller first, run, run again" is the same. The willingness to be surprised by live data is the same. What's deeper now is just the cumulative trust in my own code, and the cumulative humility about everything outside it.

If you write the code you can trust, you can carry the weight of everything you can't.

This is not what you should do. This is what twenty-four years has taught one specific person to do.

Built with Claude (Opus).


Billing asynchronous work exactly once

Hideki Mori — Wed, 24 Jun 2026 13:00:00 +0000

Synchronous billing is easy, and that's the problem — it makes you think all billing is easy.

When a request does its work inline, the billable number is in the response by the time you send it. The gateway meters from there — the meter write, retries and all, is its problem, not yours. From your side, synchronous billing is one number in the response.

Asynchronous work breaks that. The request submits a job; the work happens later, in a worker; the result comes back through a poll or a callback. And the thing you bill for — characters processed, pages converted — isn't known when the request arrives. It's known when the job finishes.

So you can't meter at the edge. The meter has to fire from the completion path. And the real difficulty is firing it exactly once per unit of completed work — because requests, polls, and retries all conspire to make that zero times or many times.

This is platform-agnostic. Every submit-process-poll API has it. I'll use the system I run as the example, but the shape is the same anywhere.

Three ways metering goes wrong

On arrival. Carry the synchronous habit over and you meter when the job is submitted. But you don't know the size yet, so you're forced into a crude flat fee — or you bill for work that hasn't happened and might fail. Wrong unit, wrong time.

On retrieval. The subtle one. You wire the meter to fire when the client fetches the result. Now a client who submits a job, lets it run — costing you real money downstream — and never bothers to poll is never billed. You did the work for free. "Completion" is not "the client picked up the result." It's the worker finishing.

Without a fixed quantity. Input characters or output characters? Pages before OCR or after? If you haven't decided exactly what you measure and where, invoices drift and customers argue. Decide once; measure there.

All three point the same way: meter on measured work-completion, with a fixed definition of the unit. Not on arrival. Not on retrieval.

The mechanism: a durable outbox

In synchronous billing the gateway took the numbers off the response and metered them for you. Async takes that away: the numbers exist only in the worker, after the request has returned. So completion itself has to become a durable event.

The completion path writes a metering task — the job's measured quantities — into a durable outbox: a table that is the source of truth for what still needs sending. Something drains it, sends each task to the meter, records the outcome; a failed send stays in the table and is retried until it lands. (In my system a once-a-minute batch does the draining. The interval doesn't matter; the durability does.)

It has a name — the transactional outbox pattern — though it's the sort of thing you'd build without the name. And it is, exactly, the one rule the rest of the system already runs on: when the job finishes, report it reliably — retry as much as possible, return the result. Metering is just one more result that has to be reported reliably. I didn't build a billing system. I pointed the engine's own discipline at billing.
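A minimal sketch of that outbox, with SQLite standing in for the durable table. The table name, the column names, the `job-<id>` meter ID format, and the `send` callback are all assumptions for illustration, not the system described here. In a real deployment the outbox row would be written in the same transaction that marks the job complete.

```python
import sqlite3


def open_outbox(path=":memory:"):
    """Create the outbox table: one row per metering task that still needs sending."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS meter_outbox ("
        " meter_id TEXT PRIMARY KEY,"  # stable ID, reused unchanged on every retry
        " quantity INTEGER NOT NULL,"  # measured when the job finishes
        " sent INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def record_completion(db, job_id, quantity):
    """Called from the worker's success path: completion becomes a durable row."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR IGNORE INTO meter_outbox (meter_id, quantity) VALUES (?, ?)",
            (f"job-{job_id}", quantity),
        )


def drain(db, send):
    """Try to send every unsent task; mark a task sent only after send() returns."""
    rows = db.execute(
        "SELECT meter_id, quantity FROM meter_outbox WHERE sent = 0"
    ).fetchall()
    for meter_id, quantity in rows:
        try:
            send(meter_id, quantity)
        except Exception:
            continue  # the row stays in the table, so the next drain retries it
        with db:
            db.execute("UPDATE meter_outbox SET sent = 1 WHERE meter_id = ?", (meter_id,))


# Usage: a meter that is unreachable on the first attempt.
attempts, received = [], []


def flaky_send(meter_id, quantity):
    attempts.append(meter_id)
    if len(attempts) == 1:
        raise ConnectionError("meter unreachable")
    received.append((meter_id, quantity))


db = open_outbox()
record_completion(db, 42, 1800)
drain(db, flaky_send)  # fails; the task is still pending
drain(db, flaky_send)  # the retry lands
drain(db, flaky_send)  # nothing left to send
pending = db.execute("SELECT COUNT(*) FROM meter_outbox WHERE sent = 0").fetchone()[0]
```

The drain marks a row as sent only after the send returns, so a crash between the two leaves the row pending and it is sent again. That is the at-least-once half, and nothing more.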

Exactly once = at-least-once × at-most-once

The outbox gives me at-least-once. A meter event is never silently dropped, because a failed send leaves the task in place to retry.

But at-least-once, on its own, double-charges. The classic failure: the send succeeds, the acknowledgement is lost on the way back, the task looks failed, the next run resends — and now it is counted twice.

So at-least-once needs a partner: an idempotent sink. Send the same meter ID twice, it counts once. That is at-most-once.

exactly-once = outbox (at-least-once) × idempotent sink (at-most-once)

Neither half is enough alone. I learned the second one the hard way — the same outbox-and-retry code, pointed at two different metering backends. One deduplicated on the ID and the numbers stayed clean. The other didn't, and the retries double-charged. Same mechanism, different sink, different bill.

So the thing worth writing down isn't "this platform guarantees idempotency." Platforms change. The durable statement is: this pattern requires an idempotent sink. If yours doesn't deduplicate, your retries are a liability, not a safety net.
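The lost-acknowledgement failure is easy to reproduce in a few lines. This is a sketch with two hypothetical in-memory sinks, not any real metering backend: one deduplicates on the meter ID, the other appends every event it receives.

```python
class IdempotentMeter:
    """A sink that counts each meter ID at most once."""

    def __init__(self):
        self.seen = {}

    def record(self, meter_id, quantity):
        if meter_id not in self.seen:  # a repeated ID is accepted but not recounted
            self.seen[meter_id] = quantity

    @property
    def total(self):
        return sum(self.seen.values())


class NaiveMeter:
    """A sink that counts every event it receives."""

    def __init__(self):
        self.events = []

    def record(self, meter_id, quantity):
        self.events.append((meter_id, quantity))

    @property
    def total(self):
        return sum(quantity for _, quantity in self.events)


def send_with_lost_ack(meter, meter_id, quantity):
    """The send lands, but the acknowledgement is lost, so the caller sees a failure."""
    meter.record(meter_id, quantity)
    raise TimeoutError("acknowledgement lost")


def deliver(meter, meter_id, quantity):
    """Same retry logic for both sinks: on failure, send again with the same ID."""
    try:
        send_with_lost_ack(meter, meter_id, quantity)
    except TimeoutError:
        meter.record(meter_id, quantity)  # the retry, which succeeds


good, bad = IdempotentMeter(), NaiveMeter()
deliver(good, "job-42", 1800)
deliver(bad, "job-42", 1800)
```

After one job of 1800 units, `good.total` is 1800 and `bad.total` is 3600. The retry code is identical in both cases; only the sink differs.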

Bill on success, and survive retries

Two more places it bites.

Success, not completion. Fire the meter on successful completion — not on "the job reached a terminal state." A failed job must not emit a billable event. Wire it to the wrong terminal state and you charge people for errors, then spend your week on refunds.

Partial failure. What you bill on a half-finished job depends on whether half a result is worth anything. A text extraction fans out into many independent calls; if nine of ten succeed and one fails for good, you bill the nine — the successful work has standalone value. Document conversion is the opposite: a file that converts eight of ten pages and then dies isn't eighty percent of a document, it's a corrupted one. No charge, nothing returned. Bill at the granularity where partial output has value.

Retries. The engine retries aggressively — that is the point of it. Meter per attempt and every retry inflates the bill. So the meter is per successful job, fired once — which is exactly what the outbox and the idempotent sink already guarantee. It is not extra work; it falls out of the same design.

It all reduces to one sentence: the billable event is one successfully-completed unit, counted once.
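The success and partial-failure rules can be written as one small function. This is a sketch under assumed names: `Part` and `partial_has_value` are hypothetical, and the unit counts are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Part:
    units: int  # measured quantity for this piece of work
    ok: bool    # did this piece finish successfully?


def billable_units(parts, partial_has_value):
    """Bill successful parts only, and only if a partial result is worth anything."""
    succeeded = [p for p in parts if p.ok]
    if len(succeeded) == len(parts) or partial_has_value:
        return sum(p.units for p in succeeded)
    return 0  # all-or-nothing work that failed part-way: no charge


# Text extraction: ten independent calls, one fails for good. Bill the nine.
extraction = [Part(100, True)] * 9 + [Part(100, False)]
extraction_bill = billable_units(extraction, partial_has_value=True)

# Document conversion: eight of ten pages convert, then the job dies. Bill nothing.
conversion = [Part(1, True)] * 8 + [Part(1, False)] * 2
conversion_bill = billable_units(conversion, partial_has_value=False)

# The same conversion when every page succeeds.
clean_bill = billable_units([Part(1, True)] * 10, partial_has_value=False)
```

Retries do not appear here at all. The function sees the final outcome of each part, not the attempts it took to get there.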

The shape

In synchronous billing the meter is a property of a request arriving. In asynchronous billing it is a property of work finishing — and the discipline is firing it exactly once per successful unit.

It is worth separating what is hard from what is free. The completion wiring — the outbox, the retries — is yours to build. The at-most-once half is the sink's job, if you chose a sink that does it. Get both, and a client polling ten times, a worker retrying five, and a job that half-failed all resolve to the right number of credits.

That is the whole thing. It isn't much once it's drawn — but every line of it is a place I have watched a bill come out wrong.

Built with Claude (Opus).

Write the code well once, the spec stops bothering you

Hideki Mori — Mon, 22 Jun 2026 13:00:00 +0000

I wrote a Java class twenty years ago that assembled tar archives on the fly. It ran for fifteen years. In those fifteen years, nobody touched it. Not me, not anyone else.

This is a story about why.

The hundred storefronts and the one carrier spec

In the mid-2000s, I was running a content distribution platform. Over a hundred storefronts plugged into it. Bookstores, label-branded stores, carrier-branded stores. Each one resold the same underlying content — books, comics, music, video, and more — but with their own branding wrapped around it.

Among those hundred-plus storefronts, a few dozen shared a particular delivery spec: one of the major Japanese carriers had pinned down a specific shape for downloadable content on their old mobile phones. The content had to arrive as a tar archive. The phone would fetch it using HTTP Range Requests, one byte range at a time, often resuming after a dropped connection.

The format was the same for all storefronts on that spec: a tar archive containing a known set of files, in a known structure. The files inside were not the same. Each storefront wanted its own branding image, its own store name in the metadata, its own thumbnail. The wrapper format was fixed. The contents were per-storefront, per-product.

A few dozen storefronts asking for the same shape with different contents, multiplied by the catalog. It was a combinatorial problem.

What I didn't do: pre-generate

The obvious approach was to pre-generate the tar archives. For each storefront, for each product, produce the archive once, write it to disk, serve it from there.

I rejected this almost immediately.

Storage isn't free. Number of storefronts times number of products times the size of each archive. The math wasn't terrible at the time, but the moment any storefront changed its branding — a new logo, a new store name, a new image — every archive associated with that storefront would be invalidated. Every product, every catalog entry. I'd be re-running the bake of tens of thousands of archives because someone tweaked a string.

I thought about this for a while and then stopped thinking about it. The shape that came out the other side was different.

What I did: deterministic generation at request time

I generated the archive at the moment of the request.

For each incoming download:

Look up the product. Find the raw content.
Look up the storefront. Find the branding config: the store name, the image to embed, the thumbnail.
Assemble a tar archive in memory, with that store's content wrapped around that product's data.
Stream the result to the client.

The archive didn't exist before the request arrived. It didn't exist after.

The CPU cost was real but small. Application servers were cheap and easy to scale horizontally. Storage was scarce and combinatorially expensive. The trade was obvious to me.

But there was one detail that made the whole shape work — and without it, the rest would have fallen apart.

The mtime had to be fixed

HTTP Range Requests assume the resource is stable. The client says give me bytes 0 through 8191, you give them those bytes. Later the client says give me bytes 8192 through 16383, and the bytes you give now have to be the next 8 KiB of the same file the client started downloading. If they're not, because the file changed between the two range requests, the client ends up with a corrupted archive.

A tar header has a field for modification time. Every file inside the archive has an mtime — twelve bytes of octal-encoded Unix timestamp.

If I generated those mtimes from the current clock at the moment of header creation, every regeneration would produce a different archive. Even though the content was identical. Even though the structure was identical. Just the timestamps would shift, and Range Requests would break.

So the mtime had to be deterministic. The same archive had to come out every time, regardless of when the request arrived.

I fixed it to the product's update timestamp:

mtime = product.updateDate.getTime() / 1000L

That timestamp changed only when the underlying product was updated by an operations colleague. Between updates, every tar archive for that product was byte-identical, regardless of how many times it was assembled.

The natural follow-up question is: what about an update that lands mid-stream, while an archive is being assembled? The source files for each product were versioned. A product update produced a new version of the source set, not an in-place rewrite. An archive being assembled never saw a half-updated set of source files; it saw a consistent version, from start to finish.

Range Requests worked. Resumes worked. The phone could disconnect at byte 5000 and reconnect for bytes 5001 onwards from a different application instance entirely. The bytes would line up.

The archive existed only at the moment of the request, but it was deterministic to the last source update.
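The original class was Java; what follows is only a sketch of the same idea using Python's standard tarfile module. The file names, the contents, and the update date are invented for the example. Besides the mtime, the sketch also keeps the file order fixed and leaves owner and mode at tarfile's constant defaults, since any of those would change the bytes too.

```python
import io
import tarfile
from datetime import datetime, timezone


def build_archive(files, update_date):
    """Assemble a tar archive in memory with every mtime pinned to update_date."""
    mtime = int(update_date.timestamp())  # whole seconds, like getTime() / 1000L
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files:  # fixed order keeps the byte layout stable
            info = tarfile.TarInfo(name)  # uid, gid, mode stay at constant defaults
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def serve_range(archive, first, last):
    """Return bytes first..last inclusive, as an HTTP Range request would."""
    return archive[first:last + 1]


files = [("meta.txt", b"store=Example Books\n"), ("content.bin", b"\x00" * 3000)]
updated = datetime(2006, 4, 1, tzinfo=timezone.utc)

first_build = build_archive(files, updated)   # one instance, first request
second_build = build_archive(files, updated)  # another instance, after a reconnect

# The client got bytes 0-4999 from the first build and resumes on the second.
resumed = serve_range(first_build, 0, 4999) + serve_range(
    second_build, 5000, len(second_build) - 1
)

# The first header's mtime field: 12 bytes at offset 136, octal digits plus a NUL.
mtime_field = first_build[136:148]
```

Replace `info.mtime = mtime` with a value read from the clock and two builds made a second apart no longer match, which is exactly the failure a resumed download would hit.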

Fifteen years of nobody touching it

I wrote that class in 2006. The platform ran on it for the next fifteen years.

In those fifteen years:

New storefronts were added. Config rows in a database. No code change.
New products were added. New files on disk in a known structure. No code change.
New storefronts wanted different branding images. They uploaded different images. No code change.
The product update flow was tweaked many times. The mtime convention held. No code change.

Nobody re-wrote it. Nobody refactored it. Nobody patched it. It just kept assembling tar archives.

The on-call queue never had an alert about it. The maintenance docs never had a runbook section for it. New engineers never asked about it, because there was nothing to ask.

The spec stopped bothering anybody.

What I was actually choosing

When I picked dynamic generation over pre-generation, I wasn't really picking CPU over storage. I was picking where the complexity would live.

If you pre-generate, the complexity lives in the data. Every change in input — branding, content, metadata — has to propagate into a regeneration of the materialized output. The complexity is distributed across the storage layer, and someone has to maintain the regeneration pipeline.

If you generate at request time, the complexity lives in one piece of code. That code is hard to write well the first time. There's a tar format to understand. There's a deterministic mtime requirement that's easy to miss. There's the Range Request semantics. But once that code is written well, the system has no other place where the complexity is stored.

And then nobody has to touch it.

This is the trade I keep coming back to, twenty-four years into writing software alone. If you write the code well once, the spec stops bothering you. The work you did at the start absorbs all the work you didn't have to do later — by you, or by anyone else.

Twenty-four years of the same choice

I'm still doing it. The platform I run now is built on the same pattern at a different scale: a hundred-plus vendors of OCR, translation, and other document services, all behind a small set of public APIs that compose them dynamically per request. There's no pre-generated catalog of vendor combinations. There's one piece of code that knows how to wrap one vendor in one shape, and a configuration system that lets new vendors plug in.

I expect the same fifteen-year quiet that the tar archive code got. It's not that I'm confident. It's that I've made the choice often enough to know what it usually leads to.

This is not what you should do. This is what twenty-four years has taught one specific person to do.

If you write the code well once, the spec stops bothering you. Most of what I do, I do for that.

Built with Claude (Opus).


Two patterns, five services, one n8n workflow

Hideki Mori — Wed, 17 Jun 2026 13:00:00 +0000

The first two articles in this series each showed one technique. Implementation notes #001 was a dynamic dropdown — a form field that fills itself from an API. Implementation notes #002 was a dynamic credential — an API key that arrives from the form and threads through to the HTTP nodes.

This article is the capstone. It walks through all-services-demo, the example workflow that ships with n8n-nodes-ldxhub, where those two techniques combine with a Switch node to host five different AI document-processing services inside one workflow — structured extraction, translation refinement, OCR, PDF conversion, and text extraction.

The screenshots and the workflow JSON below come from the n8n-nodes-ldxhub package. The patterns themselves are generic — they work for any set of services you want to consolidate into a single template.

This is not a "follow these steps" article. It's a parts catalog. No two readers are solving the same problem, and templates rarely fit anyone's situation as-is. Take what fits. Drop the rest. You don't need to understand all 46 nodes to reuse the patterns.

The shape

The workflow has 46 nodes — large enough to look intimidating in the editor, but structurally it's just five repeated paths plus a small routing section.

The entry section is two nodes:

On form submission — the trigger. Asks the user which service they want and collects an API key.
Route by Service — a Switch node with five outputs, one per service.

Everything to the right of the Switch is service-specific. Five paths fan out: StructFlow, RefineLoop, RenderOCR, CastDoc, ExtractDoc. Each path ends in two Form Ending nodes — one for success (auto-downloads the result), one for error.

That's the spine: form → switch → service path → ending. The complexity is pushed into the service paths.

The spine: routing by static comparison

The Switch node ("Route by Service") uses Rules mode. Each rule reads the same expression from the form — {{ $json.service }} — and compares it to a static service name.

Rule 1:  {{ $json.service }}  is equal to  structflow   → output: structflow
Rule 2:  {{ $json.service }}  is equal to  refineloop   → output: refineloop
Rule 3:  {{ $json.service }}  is equal to  renderocr    → output: renderocr
Rule 4:  {{ $json.service }}  is equal to  castdoc      → output: castdoc
Rule 5:  {{ $json.service }}  is equal to  extractdoc   → output: extractdoc

The read side is dynamic (the expression resolves to whatever the user picked). The match side is static (fixed strings). That asymmetry is intentional. Static rules mean adding a new service is a manual edit — open the Switch node, add a row, save. No regeneration, no template hooks, no auto-discovery. Boring and unsurprising.

This is the part you can lift cleanly: a Switch with N static rules driven by one expression from upstream. It works for service routing, document type routing, user tier routing, anything that fans into discrete branches.

Two patterns inside

Once you start reading the service paths, you notice something: they are not all the same shape. There are two distinct patterns.

Pattern A: single-step form (StructFlow, RefineLoop — 7 nodes)

Get Models (HTTP)
  → Derive Options (Set)
    → Run Form (Form, next page)
      → Inject Binary (Code)
        → Run (LDX hub)
          → Download / Error (Form Endings)

One form page collects everything the user needs to choose. The model selection, the input, the parameters — all in one screen. There's only one form page after the Switch.

Pattern B: cascading multi-step form (RenderOCR, CastDoc, ExtractDoc — 10 nodes)

Get Engines (HTTP)
  → Select Engine (Form, next page)
    → Derive Options (Set)
      → Upload File (Form, next page)
        → Filter by File (Set)
          → Select Output (Form, next page)
            → Inject Binary (Code)
              → Run (LDX hub)
                → Download / Error (Form Endings)

Three form pages, each gated on the previous one. First the engine is chosen. Then the file is uploaded. Then the output format is selected — and the available outputs are filtered based on what the chosen engine supports for the uploaded file type. The Filter by File Set node sits in the middle of that dependency.

Why two patterns, not one

The shape of the path follows the shape of the user's decisions. When the choices are independent — pick a model, pass some data — one form page is enough. When the choices cascade — engine restricts file types, file restricts output formats — the form has to be split, and intermediate Set nodes have to filter the options between pages.
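The cascade can be sketched outside n8n as plain data filtering. Everything named here is hypothetical: the capability table, the engine names, and the file types are invented, and in the workflow this logic lives in Set nodes, not Python. The `{name, value}` shape is the dropdown format described later in this article.

```python
# Hypothetical capability table: engine -> input file type -> output formats.
CAPABILITIES = {
    "engine-a": {"pdf": ["docx", "txt"], "png": ["txt"]},
    "engine-b": {"pdf": ["md"]},
}


def to_options(values):
    """Shape a list of strings into [{name, value}, ...] dropdown options."""
    return [{"name": v, "value": v} for v in values]


def engine_options():
    """Page one: nothing upstream to filter on."""
    return to_options(sorted(CAPABILITIES))


def accepted_file_types(engine):
    """Page two: the chosen engine restricts what can be uploaded."""
    return sorted(CAPABILITIES[engine])


def output_options(engine, filename):
    """Page three: depends on both earlier answers, the engine and the file."""
    extension = filename.rsplit(".", 1)[-1].lower()
    return to_options(CAPABILITIES[engine].get(extension, []))


page_one = engine_options()
page_two = accepted_file_types("engine-a")
page_three = output_options("engine-a", "scan.PNG")
unsupported = output_options("engine-b", "scan.png")
```

Page three cannot be computed until pages one and two have been answered, which is why a single form page cannot hold all three questions.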

I tried to force a single pattern across all five services. It made the simpler services more complicated than they needed to be. The honest design was to let the cascading services be longer, accept the asymmetry, and document it.

Two patterns isn't a sign of incompleteness. It's the consolidation accepting that two shapes were genuinely warranted.

Anatomy of one path

Walking through StructFlow gives you the vocabulary for all five paths. The other four are variations.

Get Models — an HTTP node that hits /structflow/models and returns the available LLMs (gpt-5.5, claude-sonnet-4-6, gemini-3-flash, etc.). This is the data source for the dropdown.

Derive Options — a Set node that reshapes the model list into the format n8n's Form trigger wants for a dropdown: [{name, value}, ...]. Same trick as in #001 — derive the dropdown from data, not from a hardcoded list.

Run Form — a single form page that asks for everything: which model to use, the input data, any parameters. The "model" dropdown reads its options from the upstream Derive Options node.

Inject Binary — a Code node that does one thing. In n8n, uploaded files travel through the workflow as binary data, separate from the JSON fields, and some intermediate nodes (like Set) only preserve the JSON side. By the time data reaches the LDX hub node, the binary part has been dropped.

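// Upstream Set nodes kept only the JSON side, so re-attach the uploaded file
// by reading the binary back from the form node that received it.
return $input.all().map(item => ({
  json: item.json,
  binary: $('StructFlow: Run Form').item.binary
}));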

The Code node reaches back to the form node and re-attaches the binary. One line of glue, but without it the file disappears.

Run — the LDX hub custom node, with runJob: structFlow. This is where the API call actually happens. The credential is set in expression mode to read from the form input — the dynamic credential pattern from #002.

Download / Error — two Form Ending nodes. The Run node has two output ports: Success goes to Download (which serves the result file), Error goes to Error (which shows the error message).

The other four paths follow the same idea. The names change, the number of form pages changes, but the role of each node is the same.

The convergence: five paths, one node

All five service paths end at the same LDX hub custom node. Same node type, same credential, same shape — only the runJob parameter differs:

StructFlow: Run    →  runJob: structFlow
RefineLoop: Run    →  runJob: refineLoop
RenderOCR: Run     →  runJob: renderOcr
CastDoc: Run       →  runJob: castDoc
ExtractDoc: Run    →  runJob: extractDoc

This is the abstraction the custom node provides. From the workflow's perspective, "running a service" looks identical across the five paths. The differences are buried inside the node's implementation, where they belong.

This generalizes the idea from #002's sidebar: a custom node that hides its variations behind a uniform interface lets you compose it freely. Five services. One credential type. One node. Five jobs. The workflow author doesn't have to know how StructFlow differs from RenderOCR — the node knows.

If you're building your own custom node, this is the shape worth aiming for. One node, parameterized by job kind. The workflow stays clean. The variations stay encapsulated.
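As a language-neutral illustration of that shape, here is a sketch of one entry point that dispatches on a job kind. It is not the n8n-nodes-ldxhub implementation: the registry, the handler bodies, and the parameter names are all hypothetical, and only the `runJob` values come from the list above.

```python
JOB_HANDLERS = {}


def job(kind):
    """Register a handler for one job kind."""
    def register(fn):
        JOB_HANDLERS[kind] = fn
        return fn
    return register


@job("structFlow")
def _struct_flow(params):
    # Hypothetical: a real handler would call the service here.
    return {"job": "structFlow", "model": params["model"]}


@job("renderOcr")
def _render_ocr(params):
    return {"job": "renderOcr", "engine": params["engine"]}


def run(run_job, api_key, params):
    """The single entry point every path calls; only run_job differs."""
    if not api_key:
        raise ValueError("missing API key")
    handler = JOB_HANDLERS.get(run_job)
    if handler is None:
        raise ValueError(f"unknown runJob: {run_job}")
    return handler(params)


structured = run("structFlow", "key-from-form", {"model": "some-model"})
ocr = run("renderOcr", "key-from-form", {"engine": "some-engine"})
```

Adding a sixth job means registering one more handler. The callers, and the credential handling in `run`, do not change.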

The if/else pair: error endings

Every service path has two endpoints: Download (success) and Error. Both are Form Ending nodes. Both are visible to the user. This isn't decorative.

A distributable template has one minimum obligation: once the user clicks Submit, they need to see what happened. If the call succeeded, they get the result. If it failed, they get the error. There's no third state where the form just ends silently.

Earlier failures — the Get Models call returning empty, the Inject Binary node crashing — are not handled. Those are skipped here because the template is meant to be minimal, and because adding error endings everywhere makes the canvas unreadable. But the final Run node's error branch is mandatory. That's the one place where the user's expectation ("I started a job, what happened?") has to be answered.

Where there's an if, you need an else. The else doesn't have to be elegant. It just has to exist.

This is the part you can lift on its own: any time a workflow has a user-visible "run" step, pair its success with an error ending. Other branches can be skipped or logged, but the user-facing one is non-negotiable.
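In code terms the pairing is a try with both outcomes wired to something the user sees. A tiny sketch, with hypothetical ending functions standing in for the two Form Ending nodes:

```python
def run_with_endings(run, on_success, on_error):
    """Every run ends in exactly one user-visible outcome."""
    try:
        result = run()
    except Exception as exc:
        return on_error(str(exc))  # the else branch: it only has to exist
    return on_success(result)


def download(result):
    return {"ending": "download", "file": result}


def show_error(message):
    return {"ending": "error", "message": message}


def failing_job():
    raise RuntimeError("engine rejected the file")


ok = run_with_endings(lambda: "result.json", download, show_error)
failed = run_with_endings(failing_job, download, show_error)
```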

What consolidation looks like

The technique parts above — Switch routing, two form patterns, Binary injection, Run convergence, if/else error pairing — are each portable. You can lift them one at a time. But the article would be missing something if it stopped there.

Bringing five services into one workflow is itself the work. Not a tutorial-friendly kind of work, because nothing in this section is a discrete technique. It's a series of small decisions:

Choosing which five services to include (not all of them — five is enough to demonstrate, more would clutter).
Standardizing the node naming across paths (<Service>: <Role> everywhere — every reader knows where they are).
Accepting that the two patterns are real, and not forcing every path into the same shape.
Putting the differences inside the form pages and inside the Run node's runJob parameter — the spine stays uniform.
Designing the convergence: every path ends at the same node type, so adding a sixth service later is a copy-paste-edit, not a redesign.

None of these are clever. Each is the obvious decision once you've seen the alternatives. But the obvious decisions are what hold the template together.

There's a particular kind of reader this article is also for: the one who doesn't need the technique, who just needs a working template they can drop into n8n and run. The consolidation work is the deliverable for them. The fact that the article also explains how it works is a side benefit.

This contrasts with a different design philosophy — the one that spreads concerns across many separate sub-workflows, each handling a narrow responsibility. In larger n8n installations with several maintainers, you might split these responsibilities into reusable sub-workflows, with each service called via the Execute Workflow node. For a solo-maintained distributable template intended to be lifted and adapted, the consolidated shape was easier to understand and ship. Fewer moving parts, fewer integration points, one place to read.

Closing

This is parts.

If you read this article and take the whole workflow, run it as-is, that's fine. If you take only the Switch routing and rebuild every service path from scratch, that's better in many cases. If you take only the Inject Binary trick because that's what bit you yesterday, that's the best use of this article.

No two requirements are identical. Every reader is solving a different problem. A template that pretends to be a one-size answer would be lying. A template that is honest about being a parts catalog — here are the pieces, here is how they fit together, take what fits — is a different kind of useful thing.

That's what all-services-demo is meant to be. That's what this article is meant to be.

The n8n-nodes-ldxhub package ships examples/all-services-demo.json alongside the node code. Import it into your n8n instance, add an LDX hub API key (free tier: 25,000 credits/month, no card), and the workflow runs. Open the JSON and lift parts into your own templates.

This closes Phase 1 of Implementation notes — three articles, three angles on the same theme: how an n8n workflow becomes a small, distributable thing. The next phase will pick up other corners worth writing down.

Twenty four years, ten DB migrations, zero downtime

Hideki Mori — Mon, 15 Jun 2026 13:00:00 +0000

Twenty-four years. Ten DB migrations. Zero downtime.

Except the first one, where I lost seven minutes I couldn't accept.

That seven minutes is why this article exists.

The seven minutes

It was in the mid-2000s. I was running a content distribution system on my own, with a small open-source database underneath. The platform was growing, and at some point we decided to move to a much larger commercial database — an appliance-grade one, the kind you specify by line of business and not by hostname.

I planned the switchover for a thirty-minute maintenance window. I did the work. End-to-end, it took seven minutes.

Seven minutes during which the service was down. End users couldn't reach the catalog. Bookstores couldn't sync. Publishers couldn't see their numbers.

It bothered me more than it should have. The migration was a success. The system came back up. Nobody complained.

But it had been down. Seven minutes that, on paper, the agreement said was acceptable. Seven minutes that, in my head, I never wanted to repeat.

That feeling didn't go away. I started designing every database operation from that day as if seven minutes was the wrong answer.

The contract no one made me write

That platform delivered books, comics, and music — content from publishers and labels through the bookstores that resold it. End users, bookstores, publishers, label owners: all of them sat on top of a single piece of plumbing I was responsible for.

There was no formal SLA written anywhere that said "this never goes down." But there was something stronger than an SLA: a sales-side expectation. Inside the company, it was assumed the service was always reachable. Customers were sold on the assumption that downloads worked at any time of day. Bookstores integrated against that assumption. Publishers settled royalties on top of it.

If I took the service down for an hour, none of those agreements would have technically been broken. But the silent contract — you never notice me running maintenance — would have been.

Once, before a particularly large migration, I had to brief one of the major carriers (they were on the bookstore side, technically a B2B customer). We met around a whiteboard. I drew the sequence. After about five minutes, the lead engineer on their side just nodded and said something like, "okay, if you're doing it, we're fine." We didn't need a recovery plan from them. We didn't need a coordinated test window. They didn't even update their monitoring.

That trust didn't come from documentation. It came from the fact that none of the previous migrations had touched their integration.

So the shape of how I do these migrations started with seven minutes I couldn't accept, and was kept alive by twenty-three years of not breaking that quiet contract again.

The shape

Here is the shape, stripped of vendor names. It is not new. It is not clever. It has just survived.

Step zero: separate the data into two kinds.

Data A: anything the end user reads. This is what the service actually serves. If this is wrong or unavailable, the service is wrong.
Data B: aggregates, summaries, derived tables, everything else. Batches write to it. End users never read from it directly.

The rule for Data A is: real-time synchronization, both old and new databases, no exceptions. The rule for Data B is: batches can stop for a while, you'll catch up later.

Step one: well before the migration window, set up two batches that incrementally copy Data A and Data B from the old database to the new one. Use the last-modified timestamp on each row. Take physical deletes into account by occasionally diffing the row sets and removing what's no longer there. Run these for days or weeks until the new database is essentially a copy of the old, plus or minus the most recent few minutes.

Step two: at migration time, stop the batches that write Data B. Run the Data B copy one last time. The aggregate tables on the new database are now identical to the old.

Step three: switch the application logic to a maintenance mode where Data A is written to both databases, but read only from the old one. Every user-facing update now produces two writes: one to the old database (the authoritative one), one to the new database (best effort, the eventually-authoritative one). If the write to the new database fails, it's swallowed — step four catches the drift.

Step four: once all instances of the application are in this dual-write mode, run the Data A copy one final time. This catches anything that was written to the old database between the last sync and the dual-write switchover. After this, both databases agree.

Step five: switch the application logic to read from the new database, while still writing to both. This is the moment of truth. If anything is wrong with the new database, this is when the user feels it.

Step six: switch the application logic to read and write to the new database only. The old database is detached.

For a migration (the new database stays), the batches that write Data B can now resume against the new database, and you're done. For a maintenance bypass (the old database is coming back), you do the same sequence in reverse, and the old database returns to service.
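Step one's copy batch can be sketched with two dictionaries standing in for the two databases. The row layout and the integer `modified` column are assumptions for the example. A real batch also has to decide how to treat rows whose timestamp equals the last high-water mark, which this sketch ignores.

```python
def incremental_copy(old, new, since):
    """Copy rows modified after `since`; return the new high-water mark."""
    high_water = since
    for key, row in old.items():
        if row["modified"] > since:
            new[key] = dict(row)
            high_water = max(high_water, row["modified"])
    return high_water


def remove_deleted(old, new):
    """Physical deletes leave no timestamp behind, so diff the key sets instead."""
    for key in set(new) - set(old):
        del new[key]


old = {
    1: {"title": "A", "modified": 10},
    2: {"title": "B", "modified": 20},
}
new = {}

mark = incremental_copy(old, new, since=0)  # first pass copies everything

# Meanwhile, on the old database: one update, one physical delete, one insert.
old[1] = {"title": "A2", "modified": 30}
del old[2]
old[3] = {"title": "C", "modified": 35}

mark = incremental_copy(old, new, since=mark)  # picks up rows 1 and 3
stale_keys = sorted(set(new) - set(old))       # row 2 is still in the copy
remove_deleted(old, new)                       # the occasional diff removes it
```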

For reference, here's how the application behaves at each step:

Step	App writes	App reads	What happens
0	—	—	Classify rows into Data A and Data B
1	old	old	Incremental copy batches running
2	old	old	Stop Data B batches, final Data B copy
3	old + new	old	Dual-write switchover
4	old + new	old	Final Data A sync
5	old + new	new	Read switchover
6	new	new	Old database detached
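The table translates almost directly into code. This is a sketch, not the application logic described here: dictionaries stand in for the databases, `Unavailable` is a hypothetical failing database, and treating the database being read as the one whose write must succeed is my assumption for step five (the text only states it for step three).

```python
# Per step: (databases the app writes, database the app reads), from the table above.
PHASES = {
    1: (("old",), "old"),
    2: (("old",), "old"),
    3: (("old", "new"), "old"),
    4: (("old", "new"), "old"),
    5: (("old", "new"), "new"),
    6: (("new",), "new"),
}


class Unavailable(dict):
    """Stand-in for a database whose writes fail."""

    def __setitem__(self, key, value):
        raise ConnectionError("database unavailable")


class App:
    def __init__(self, old, new, step=1):
        self.dbs = {"old": old, "new": new}
        self.step = step

    def write(self, key, value):
        targets, read_from = PHASES[self.step]
        # Assumption: the database being read is the one whose write must succeed.
        self.dbs[read_from][key] = value
        for name in targets:
            if name != read_from:
                try:
                    self.dbs[name][key] = value
                except ConnectionError:
                    pass  # best effort; a later sync repairs the drift

    def read(self, key):
        return self.dbs[PHASES[self.step][1]].get(key)


old, new = {"a": 1}, {"a": 1}
app = App(old, new, step=3)
app.write("b", 2)          # dual write
read_at_3 = app.read("b")  # served by the old database
app.step = 5
read_at_5 = app.read("b")  # served by the new database
app.step = 6
app.write("c", 3)          # the old database no longer sees writes

degraded = App({}, Unavailable(), step=3)
degraded.write("x", 1)     # the failed write to the new database is swallowed
```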

The thing nobody talks about: mixed states

This shape works because each step is resilient to mixed states.

I deploy applications manually. I have for twenty-four years. There is no coordinated rolling restart, no atomic feature flag flip across all instances. When I switch the application logic to dual-write mode in step three, some instances are still in single-write mode (against the old database only) while others are already in dual-write mode. That mixed state can last as long as it takes me to walk through each server.

The shape is designed so that mixed states are correct.

When step three is rolling out: some instances write to old only, some write to both. All instances read from the old. The old database stays authoritative. Reads are consistent.

When step five is rolling out: some instances read from old, some read from new. By this point both databases agree (step four just synced them). Either read is correct.

When step six is rolling out: some instances still dual-write, some write to new only. All instances read from new. The new database is authoritative. The few writes that still hit the old database are harmless — it'll be detached momentarily.
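
The three cases above can be checked mechanically. The sketch below is one way to phrase the property, under an assumption of mine: a mix of two steps is safe when every database that some instance reads from receives the writes of every instance. The names `STEPS` and `mix_is_safe` are hypothetical. The check covers write routing only; the agreement between the two databases at step five still depends on the final sync in step four.

```python
from itertools import combinations

OLD, NEW = "old", "new"

# (databases written, database read) per step, copied from the table above.
STEPS = {
    1: ({OLD}, OLD),
    2: ({OLD}, OLD),
    3: ({OLD, NEW}, OLD),
    4: ({OLD, NEW}, OLD),
    5: ({OLD, NEW}, NEW),
    6: ({NEW}, NEW),
}


def mix_is_safe(step_a, step_b):
    """True if instances at step_a and step_b can run side by side.

    Assumed invariant: any database that some instance reads from
    must be written to by every instance.
    """
    configs = [STEPS[step_a], STEPS[step_b]]
    read_targets = {read for _, read in configs}
    return all(
        target in writes
        for target in read_targets
        for writes, _ in configs
    )


# Every rollout moves instances from one step to the next one.
adjacent = {(s, s + 1): mix_is_safe(s, s + 1) for s in range(1, 6)}

# Pairs of steps that are more than one apart, for comparison.
skipped = {
    (a, b): mix_is_safe(a, b)
    for a, b in combinations(STEPS, 2)
    if b - a > 1
}
unsafe = sorted(pair for pair, ok in skipped.items() if not ok)
```

Under this invariant every adjacent pair passes, while a pair such as steps 2 and 5 fails: an instance still writing to old only would be invisible to an instance already reading from new. That is the reason each step has to finish rolling out before the next one starts, even though no step needs to be atomic.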

I don't have to wait for a deployment to finish. I don't need a feature flag system to coordinate it. I don't need a service mesh to make it safe. I need the property that the shape stays correct while it's transitioning.

This is the part of the design that is older than every operational tool I see today. And it's the part I haven't found a reason to replace.

Rollback was never a special case

If you'd asked me five years ago whether this design has rollback, I would have said yes, of course. The reverse sequence is the rollback.

A maintenance bypass already runs forward, then backward. That backward leg is rollback. It's executed every single time. Rollback isn't an emergency path. It's a normal part of the workflow.

I've done this kind of migration over ten times in twenty-four years. I have never had to use the reverse sequence as an emergency. Not because nothing ever went wrong. But because the preparation phase — the days and weeks of incremental copying, of double-checking the deletes, of staring at row counts — catches the things that would have gone wrong, before they can.

The boring part of the work is what makes the dramatic part of the work disappear.

Twenty-four years later

I'm going to write something now that should probably embarrass me but doesn't: I enjoy this work.

A new database to move into — especially a serious appliance-grade one — is one of the most enjoyable things I get to do. The preparation is meditative. The cutover itself is short and quiet. The week after, when the system is running on the new hardware and the end users have noticed nothing, is satisfying in a way I have not gotten from any other kind of engineering.

Twenty-four years. Ten migrations. Zero downtime, after the seven minutes I couldn't accept.

That's the entire story. There are other databases out there now, other appliances, other ways of doing this. There are managed services that do most of the dance automatically. There are tools that take a lot of the carefulness off your hands.

I don't have an argument against any of those. I just know what survives in my hands: a separation of data into two kinds, a dual-write window in the middle, a resilience to mixed states, and a reverse sequence I always treat as ordinary.

This is not what you should do. This is what twenty-four years has taught one specific person to do.

Built with Claude (Opus).

Earlier in this series:

The model corrected reality

Where this came from

An accidental controlled experiment

Vision proposes, language disposes

It's a trait, not a law

Why this is the scary one

Receipts

I survived 24 years because I'm lazy

Don't set my deadlines

What a lazy survivor actually does

Where being lazy actually hurts

The engine

Other solos

What survived

Where vision models stop reading — and start inventing

The setup, in one paragraph

The map

1. "@low" means different things per provider

2. For GPT @low, a worse scan is a better scan

3. After collapse, models split into fabricators and blankers

4. Same model, different gateway, different eyes

5. Fabrication doesn't need degradation (teaser)

6. Classification survives reading loss

What I'd take into production

Reproduce it (a free key is enough)

Caveats

When AI can't read, it invents — but it still sees the shape

The invoice that would have passed review

The false positive that came first

The numbers

The generational irony

The reversal

What I did about it

The rule I'm keeping

The graph nobody is watching

The three layers and what each of them gets

How I keep it idle

The graph nobody is watching

The asymmetry of upward slopes

Closing

The example is the schema: extracting Japanese qualified invoices to JSON

The document

The example is the schema

The prompt carries rules, not fields

Running it

The result

What the model understood

Swap the model, keep the code

Cost

Try it

How I removed the middleman, one phone call at a time

How the layers got there

The phone call

The two steps that followed

What I learned

Twenty-four years later

The 3-line discipline

Three principles

Why three lines

Components: write the caller, run, write again, run

Even then, live data still surprises you

Twenty-four years of this

Billing asynchronous work exactly once

Three ways metering goes wrong

The mechanism: a durable outbox

Exactly once = at-least-once × at-most-once

Bill on success, and survive retries

The shape

Write the code well once, the spec stops bothering you

The hundred storefronts and the one carrier spec

What I didn't do: pre-generate

What I did: deterministic generation at request time

The mtime had to be fixed

Fifteen years of nobody touching it

What I was actually choosing

Twenty-four years of the same choice

Two patterns, five services, one n8n workflow

The shape

The spine: routing by static comparison

2. For GPT `@low`, a worse scan is a better scan