Hideki Mori

Posted on Jun 22

Write the code well once, the spec stops bothering you

#softwareengineering #architecture #design #backend

I wrote a Java class twenty years ago that assembled tar archives on the fly. It ran for fifteen years. In that fifteen years, nobody touched it. Not me, not anyone else.

This is a story about why.

The hundred storefronts and the one carrier spec

In the mid-2000s, I was running a content distribution platform. Over a hundred storefronts plugged into it. Bookstores, label-branded stores, carrier-branded stores. Each one resold the same underlying content — books, comics, music, video, and more — but with their own branding wrapped around it.

Among those hundred-plus storefronts, a few dozen of them shared a particular delivery spec — one of the major Japanese carriers had pinned down a specific shape for downloadable content on their old mobile phones. The content had to arrive as a tar archive. The phone would fetch it using HTTP Range Requests, byte ranges at a time, often resuming after a dropped connection.

The format was the same for all storefronts on that spec: a tar archive containing a known set of files, in a known structure. The files inside were not the same. Each storefront wanted its own branding image, its own store name in the metadata, its own thumbnail. The wrapper format was fixed. The contents were per-storefront, per-product.

A few dozen storefronts asking for the same shape with different contents, multiplied by the catalog. It was a combinatorial problem.

What I didn't do: pre-generate

The obvious approach was to pre-generate the tar archives. For each storefront, for each product, produce the archive once, write it to disk, serve it from there.

I rejected this almost immediately.

Storage isn't free. Number of storefronts times number of products times the size of each archive. The math wasn't terrible at the time, but the moment any storefront changed its branding — a new logo, a new store name, a new image — every archive associated with that storefront would be invalidated. Every product, every catalog entry. I'd be re-running the bake of tens of thousands of archives because someone tweaked a string.

I thought about this for a while and then stopped thinking about it. The shape that came out the other side was different.

What I did: deterministic generation at request time

I generated the archive at the moment of the request.

For each incoming download:

Look up the product. Find the raw content.
Look up the storefront. Find the branding config: the store name, the image to embed, the thumbnail.
Assemble a tar archive in memory, with that store's content wrapped around that product's data.
Stream the result to the client.

The archive didn't exist before the request arrived. It didn't exist after.

The CPU cost was real but small. Application servers were cheap and easy to scale horizontally. Storage was scarce and combinatorially expensive. The trade was obvious to me.

But there was one detail that made the whole shape work — and without it, the rest would have fallen apart.

The mtime had to be fixed

HTTP Range Requests assume the resource is stable. The client says give me bytes 0 through 8191, you give them those bytes. Later the client says give me bytes 8192 through 16383, and the bytes you give now have to be the second half of the same file the client started downloading. If they're not — if the file changed between the two range requests — the client ends up with a corrupted archive.

A tar header has a field for modification time. Every file inside the archive has an mtime — twelve bytes of octal-encoded Unix timestamp.

If I generated those mtimes from the current clock at the moment of header creation, every regeneration would produce a different archive. Even though the content was identical. Even though the structure was identical. Just the timestamps would shift, and Range Requests would break.

So the mtime had to be deterministic. The same archive had to come out every time, regardless of when the request arrived.

I fixed it to the product's update timestamp:

mtime = product.updateDate.getTime() / 1000L

That timestamp changed only when the underlying product was updated by an operations colleague. Between updates, every tar archive for that product was byte-identical, regardless of how many times it was assembled.

The natural follow-up question is: what about an update that lands mid-stream, while an archive is being assembled? The source files for each product were versioned. A product update produced a new version of the source set, not an in-place rewrite. An archive being assembled never saw a half-updated set of source files; it saw a consistent version, from start to finish.

Range Requests worked. Resumes worked. The phone could disconnect at byte 5000 and reconnect for bytes 5001 onwards from a different application instance entirely. The bytes would line up.

The archive existed only at the moment of the request, but it was deterministic to the last source update.

Fifteen years of nobody touching it

I wrote that class in 2006. The platform ran on it for the next fifteen years.

In those fifteen years:

New storefronts were added. Config rows in a database. No code change.
New products were added. New files on disk in a known structure. No code change.
New storefronts wanted different branding images. They uploaded different images. No code change.
The product update flow was tweaked many times. The mtime convention held. No code change.

Nobody re-wrote it. Nobody refactored it. Nobody patched it. It just kept assembling tar archives.

The on-call queue never had an alert about it. The maintenance docs never had a runbook section for it. New engineers never asked about it, because there was nothing to ask.

The spec stopped bothering anybody.

What I was actually choosing

When I picked dynamic generation over pre-generation, I wasn't really picking CPU over storage. I was picking where the complexity would live.

If you pre-generate, the complexity lives in the data. Every change in input — branding, content, metadata — has to propagate into a regeneration of the materialized output. The complexity is distributed across the storage layer, and someone has to maintain the regeneration pipeline.

If you generate at request time, the complexity lives in one piece of code. That code is hard to write well the first time. There's a tar format to understand. There's a deterministic mtime requirement that's easy to miss. There's the Range Request semantics. But once that code is written well, the system has no other place where the complexity is stored.

And then nobody has to touch it.

This is the trade I keep coming back to, twenty-four years into writing software alone. If you write the code well once, the spec stops bothering you. The work you did at the start absorbs all the work you didn't have to do later — by you, or by anyone else.

Twenty-four years of the same choice

I'm still doing it. The platform I run now is built on the same pattern at a different scale: a hundred-plus vendors of OCR, translation, and other document services, all behind a small set of public APIs that compose them dynamically per request. There's no pre-generated catalog of vendor combinations. There's one piece of code that knows how to wrap one vendor in one shape, and a configuration system that lets new vendors plug in.

I expect the same fifteen-year quiet that the tar archive code got. It's not that I'm confident. It's that I've made the choice often enough to know what it usually leads to.

This is not what you should do. This is what twenty-four years has taught one specific person to do.

If you write the code well once, the spec stops bothering you. Most of what I do, I do for that.

Built with Claude (Opus).

Earlier in this series:

DEV Community