DEV Community: Vivian Voss

One Clock, One Tool, Three Distros

Vivian Voss — Mon, 04 May 2026 08:51:37 +0000

The Unix Way — Episode 15

Ask a Linux admin which time daemon their server runs. Pause for the silence. Then watch them check three different places to find out. On FreeBSD the question does not arise.

This is not a story about clocks. It is a story about how a basic system service ended up with three implementations, three default behaviours, three configuration files, and zero agreement, and what it costs the people who have to administer the result.

A Short History of NTP

The Network Time Protocol was designed by David L. Mills at the University of Delaware in 1985. RFC 958 standardised the original protocol; RFC 1305 codified version 3 in 1992; RFC 5905 finalised version 4 in 2010. The reference implementation, called ntpd, was written by Mills and his students alongside the protocol itself. For most of the next thirty years, ntpd was the only serious option for synchronising the clock on a Unix machine. The University of Delaware NTP code became the reference, the documentation, and the default.

The protocol matters because it is harder than it looks. A client cannot simply ask a server "what time is it" and accept the answer; the round-trip introduces latency, the network jitters, the local clock drifts at a rate that depends on temperature and load, and the only way to compute a useful offset is to query several servers, weigh their replies against each other, and apply a clock-discipline algorithm that gradually steers the local oscillator without abrupt jumps. The full NTP protocol embeds this into the daemon. The Simple Network Time Protocol, SNTP, is the same wire format used naively: query one server, accept one answer, set the clock.
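
To make the round-trip problem concrete, here is a minimal sketch of the arithmetic RFC 5905 defines for a single client-server exchange. The function name ntpSample and the sample timestamps are illustrative assumptions, not taken from any daemon's source; a full NTP implementation feeds many such samples into its filtering and clock-discipline stages rather than trusting one.

```javascript
// Sketch of the on-wire calculation from RFC 5905 for one exchange.
// t1: client send time (client clock)    t2: server receive time (server clock)
// t3: server send time (server clock)    t4: client receive time (client clock)
// All values are milliseconds; ntpSample is a hypothetical helper name.
function ntpSample(t1, t2, t3, t4) {
  // Round-trip delay: total elapsed time minus the server's processing time.
  const delay = (t4 - t1) - (t3 - t2);
  // Offset: how far the server clock is ahead of the client clock, assuming
  // the outbound and return paths take equally long.
  const offset = ((t2 - t1) + (t3 - t4)) / 2;
  return { offset, delay };
}

// Illustrative numbers: the server is 100 ms ahead, each network leg takes
// 5 ms, and the server spends 1 ms answering.
const sample = ntpSample(0, 105, 106, 11);
console.log(sample); // { offset: 100, delay: 10 }
```

The offset is only exact when both network legs take equally long, which is one reason a single sample is a weak basis for setting a clock.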

FreeBSD: One Tool, In Base

ntpd(8) has been part of the FreeBSD base system since 4.x in 2000. To enable it, an administrator adds a single line to /etc/rc.conf:

ntpd_enable="YES"

The configuration file /etc/ntp.conf is shipped with sensible defaults pointing at the FreeBSD pool servers; non-default servers can be added by editing it. The service is started with service ntpd start. The diagnostic command is ntpq -p, which prints the list of configured peers, their measured offsets, and the daemon's current opinion about which it is using.

For the rare case where the in-base ntpd does not fit, OpenNTPD is available in ports under net/openntpd. The most common reason to reach for it is jails: the OpenBSD-derived OpenNTPD is more comfortable binding to specific addresses, which matters when a jail wants its own NTP daemon listening on its own IP. For ninety-nine servers out of a hundred, the in-base ntpd is the answer, and the answer has not changed in over twenty years.

The reason this is unremarkable on FreeBSD is the reason most things on FreeBSD are unremarkable: a single team makes the decisions and maintains the result. The base system includes ntpd. The base system documents ntpd. The base system upgrades ntpd. There is no second opinion, because there is no second team.

Linux: ntpd, the Original

The same ntpd from David L. Mills was, for most of Linux's existence, the standard time daemon on Linux too. Debian, Ubuntu, RHEL, and Fedora all packaged it under the name ntp. The configuration file was /etc/ntp.conf, the service was ntpd, the diagnostic was ntpq -p. The world worked.

By the late 2010s, the consensus had begun to shift. The ntpd codebase had a complicated history of maintenance, several serious CVEs, and a configuration syntax that newer admins found rather opaque. The Linux Foundation funded a separate "Network Time Foundation" maintenance effort which kept the lights on but did not modernise the daemon in ways that distributions wanted. By 2026, every major distribution has moved on. ntpd is still installable, still works, and is still a perfectly reasonable choice for an administrator who knows it. It is no longer what new servers ship with.

Linux: chrony, the Modern Replacement

Richard Curnow began work on chrony in 1997, originally as a personal project for synchronising laptops with intermittent network connectivity. After Curnow stepped back from active development, Miroslav Lichvar at Red Hat picked it up, and chrony is now developed under Red Hat sponsorship. By 2026 it is the default time daemon on Fedora, RHEL, CentOS Stream, Rocky, Alma, and openSUSE.

chrony is a complete NTP implementation. It can act as a client, a server, or both. It supports NTS (Network Time Security, RFC 8915) for authenticated time without the operational burden of NTP autokey. It converges faster than ntpd on first start, handles long network outages without drifting badly, and behaves better on virtual machines whose hardware clock cannot be trusted. The configuration file is /etc/chrony.conf on RHEL-family systems and /etc/chrony/chrony.conf on Debian-family systems, because packagers disagreed on the right path. The daemon is chronyd. The diagnostic is chronyc tracking (current sync state) and chronyc sources (list of upstream servers). The service unit is sometimes chronyd, sometimes chrony, depending on which packaging tradition the distribution inherited.

If a Linux admin is asked to set up time synchronisation on a server in 2026 and given a free hand, the answer is almost always chrony. The accuracy is better, the codebase is healthier, the maintainer is responsive, and the documentation is current.

Linux: systemd-timesyncd, the Minimalist

The third answer was added by systemd in version 213 (May 2014). systemd-timesyncd is not a full NTP implementation; it is an SNTP client. It queries one server at a time, accepts the answer, and sets the clock. It does not triangulate from multiple sources, it does not weight peers against each other, and it does not detect a single misconfigured upstream lying about the time. It is also small, simple, and adequate for the case it was built for: a desktop or a container that just needs the clock to be roughly right.

Ubuntu adopted systemd-timesyncd as the default in 18.04 (2018) and has kept it as the desktop default ever since; Ubuntu Server installs chrony by default in newer releases but systemd-timesyncd was the documented standard for several LTS cycles. Many container images use systemd-timesyncd because it has fewer dependencies than chrony. The configuration file is /etc/systemd/timesyncd.conf, the status command is timedatectl status, and the daemon is part of systemd itself rather than a separate package.

The trap with systemd-timesyncd is that it ignores /etc/ntp.conf. An administrator who inherits a server, sees /etc/ntp.conf with carefully tuned upstream servers, and assumes those servers are being used, may well be wrong: if systemd-timesyncd is the active daemon, it is reading /etc/systemd/timesyncd.conf, and /etc/ntp.conf is decoration. The diagnostic for "which daemon is actually running" on a Linux box is to enumerate the candidates: systemctl status chronyd, systemctl status ntpd, systemctl status systemd-timesyncd, until one of them returns "active". Two of them sometimes do. That conversation is its own kind of fun.
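
The enumeration described above can be sketched as a small Node.js script. This is a rough illustration under stated assumptions: the unit names and configuration paths are the ones named in this article, the names CANDIDATES, isActiveViaSystemctl and activeTimeDaemons are hypothetical, and a given distribution may use units or paths not listed here.

```javascript
// Sketch: ask systemd which time daemons are active and report which
// configuration file each one reads. Unit names and paths are the ones
// mentioned in this article, not an exhaustive list.
const { execFileSync } = require("node:child_process");

const CANDIDATES = [
  {
    daemon: "chrony",
    units: ["chronyd", "chrony"], // the unit name varies by distribution
    configs: ["/etc/chrony.conf", "/etc/chrony/chrony.conf"],
  },
  { daemon: "ntpd", units: ["ntpd"], configs: ["/etc/ntp.conf"] },
  {
    daemon: "systemd-timesyncd",
    units: ["systemd-timesyncd"],
    configs: ["/etc/systemd/timesyncd.conf"],
  },
];

// `systemctl is-active --quiet UNIT` exits with status 0 only when the unit
// is active. execFileSync throws on any other status, and also when
// systemctl is not installed; both cases count as "not active" here.
function isActiveViaSystemctl(unit) {
  try {
    execFileSync("systemctl", ["is-active", "--quiet", unit], { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

// Returns every candidate with at least one active unit. The check is
// injectable so the logic can be tried without a running systemd.
function activeTimeDaemons(isActive = isActiveViaSystemctl) {
  return CANDIDATES.filter((c) => c.units.some((unit) => isActive(unit)));
}

const active = activeTimeDaemons();
if (active.length !== 1) {
  console.log(`Expected exactly one active time daemon, found ${active.length}.`);
}
for (const { daemon, configs } of active) {
  console.log(`${daemon} is active; its configuration is in ${configs.join(" or ")}`);
}
```

A result of two is the case worth investigating first, since both daemons will be trying to steer the same clock.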

OpenNTPD, the Quiet One

A footnote, but worth one. OpenNTPD was written by Henning Brauer for OpenBSD in 2003, motivated by the same complaints about the reference ntpd that eventually drove Linux distributions toward chrony: complicated configuration, a hard-to-audit codebase, and licensing that did not fit OpenBSD's standards. OpenNTPD is included in OpenBSD base, and is available on FreeBSD via the net/openntpd port. It is a deliberately small full-NTP implementation, with a configuration file the size of a postcard. For administrators who want a simple time daemon without the size of ntpd and without depending on chrony, OpenNTPD is the quiet third option that has worked steadily for over twenty years.

The Point

The same problem (the clock drifts, set it from a network source) has three answers on Linux because three different communities, working on overlapping but not identical use cases, arrived at three different daemons, and the distributions that integrate them could not converge on a single recommendation. Fedora chose chrony. Ubuntu chose systemd-timesyncd for the desktop and chrony for the server. Debian sat on ntpd for years before transitioning. Arch let the user decide.

FreeBSD did not have that argument. The base team picked ntpd, kept it for two decades, and the rest of the system was built around the choice. When OpenNTPD became available, it went into ports as a clearly-labelled alternative; it did not displace the in-base default.

The cost of three answers is paid every time a new admin inherits a Linux server and has to discover which daemon is currently running, which config file it is reading, and which it ought to be reading. The cost of one answer is a config file you have already seen, on a system you have already learned.

This is not a question of which daemon is best. chrony is, by most measures, the best of the four. The question is what it costs to need to know that, every time, on every system, in a way that varies by distribution and by year. On FreeBSD the answer to "what time daemon does this server run" is in /etc/rc.conf, has been since 2000, and is the same on the next server too.

Read the full article on vivianvoss.net →

By Vivian Voss — System Architect & Software Developer. Follow me on LinkedIn for daily technical writing.

Why We Reach for the Layer

Vivian Voss — Sun, 03 May 2026 07:37:36 +0000

On Second Thought — Episode 06

The ORM hides the SQL. The cache hides the ORM. The service mesh hides the services. The operator hides the YAML, which already hid the kubelet, which already hid the container, which already hid the process. By Tuesday, nobody quite remembers what the original problem was. They are too busy configuring its sixth wrapper.

This is the post about that wrapper.

The Axiom

When something does not work as wished, one adds a layer on top. The pattern is invisible because it is universal. We do it in code, in infrastructure, in process, in organisation. We wrap APIs in clients, clients in adapters, adapters in service objects, service objects in factories. We wrap deploys in pipelines, pipelines in operators, operators in platforms, platforms in portals. We wrap teams in tribes, tribes in chapters, chapters in centres of excellence. The plaster is the one tool that fits every wound, and the wound, rather conveniently, is never the one that gets examined.

The reflex is so deeply trained that the alternative does not occur as an option. The question "could we remove the underlying thing instead of wrapping it?" is rarely asked because the team that built the underlying thing is in the next room, the project that delivered it is in the previous quarter's review, and the engineer who would have to do the removal has thirteen tickets that close more cleanly with a wrapper. So the wrapper goes in, and a year later, the wrapper has its own wrapper.

The Origin

Layered architecture has a perfectly respectable origin. Edsger Dijkstra published "The Structure of the THE-Multiprogramming System" in CACM in May 1968, introducing disciplined layering as a means of bounding complexity. Each layer presented a strictly defined interface to the layer above it; an engineer could reason about one layer without holding the entire stack in their head. It was a brilliant move and remains one.

David Parnas, four years later, gave the underlying principle its enduring name. His 1972 paper, "On the Criteria To Be Used in Decomposing Systems into Modules", introduced Information Hiding: a module should hide what is likely to change behind a stable interface, so that change in one place does not propagate to all the others. Layers were one application of the principle. The intent was to contain complexity, not to defer it.

Somewhere between Parnas's paper and the third generation of cloud abstractions, the verb shifted. Containing became postponing. The layer that once prevented the lower one from leaking now exists primarily to defer the moment in which one would have to look at it. The Kubernetes operator does not hide a stable abstraction; it hides a YAML format that nobody wishes to read. The retry decorator does not bound a clean interface; it papers over an upstream service that has never been made reliable. The ORM does not abstract the database; it postpones the conversation about what the queries should actually be.

The vocabulary survived. The discipline did not. One does notice the inversion.

The Cost

Manny Lehman, working at IBM in the early 1970s, formulated what came to be called the laws of software evolution. The second law, in its mature form: the complexity of an evolving system increases unless explicit work is done to maintain or reduce it. Few sentences in computer science have aged this well. Lehman compared it, half-seriously, to the second law of thermodynamics: entropy is the default; order requires energy. The work to maintain or reduce, in practice, is the work that nobody is funded to do, because it produces no new feature, ships no new ticket, and leaves no diagram for the architecture review.

Defensive code-paths multiply as a consequence. Every API call gets wrapped in retries. Every value gets wrapped in null-checks. Every cache gets wrapped in invalidation logic. Phil Karlton, working at Carnegie Mellon and later at Netscape, is widely credited with the observation, somewhere around 1970, that there are two hard things in computer science: cache invalidation and naming things. The line was popularised by Tim Bray around 1996. We have, with rather industrious enthusiasm, made the first one our default architectural pattern, and we still cannot agree on the name of the variable that holds the result.

The cost is not only the cache. The cost is what happens to the people inside the system. The Senior Engineer's day shifts from building to understanding. She spends the morning tracing why a request that should take six milliseconds is taking eight hundred, walks through three retry decorators, two adapter classes, a service mesh sidecar, and a fallback strategy that has not been triggered since 2023, and finds at the bottom of all of it a database query that wants for an index. The index goes in; the eight hundred milliseconds become six. The retry decorator stays. The adapter stays. The sidecar stays. Removing them would be another quarter of work, and the quarter has other plans.

The Junior Engineer never gets to building, because the layers between her and the system have grown taller than the system itself. She is taught the operator before the syscall, the framework before the language, the platform before the protocol. When the abstraction breaks (and it always does), there is no layer beneath to fall back to. The foundation was never taught. It was skipped.

This was the substance of Episode 02. It is also the substance of this one. The two are linked because the layer-reaching reflex and the foundation-skipping curriculum are two halves of the same economy: an industry that compounds abstractions because compounding abstractions can be hired for, certified for, conferenced about, and sold. Reduction cannot be hired for. There is no certification.

The Question

Reduction is the hardest discipline in software. It looks easy from the outside because the result is, by definition, small. The result is small because someone has spent twenty years making it small.

SQLite, the most widely deployed database engine on earth, carries roughly 156,000 lines of mature C code (the canonical figure published by the project for version 3.42, May 2023). It has stayed one library because, every time a feature was proposed, the maintainers asked whether the existing surface could be made to do the work instead. The test suite is 92 million lines. The library is 156,000. That ratio, roughly 590 to 1, is not an accident. It is the operational definition of reduction as a discipline.

awk has run essentially unchanged since Aho, Weinberger, and Kernighan published it at Bell Labs in 1977. Forty-nine years on, the language is the same language. Engineers who learned it in the 1980s can read code written this morning. The design was small enough to be finished, and the maintainers had the discipline to recognise that "finished" is a category that exists.

pf, the OpenBSD packet filter, has been one configuration file with one syntax since OpenBSD 3.0 in December 2001. Daniel Hartmeier began writing it in June 2001, after IPFilter was removed for licensing reasons. The syntax has been refined; the model has not been replaced. Twenty-five years later, an administrator who learned pf in 2003 can read a pf.conf written this week. There is no v2. There is no successor. There is no parallel implementation that one is encouraged to migrate to. There is one tool that does the work it was built to do.

None of these were elegant by accident. They were elegant by patience, which is the one resource the sprint cycle does not allocate. Each of them required a maintainer or a small team to refuse, repeatedly, the temptation to add. Refusing is not a quarterly metric. It is not a ticket category. It is not a Slack reaction. It is the silent work that holds the small body of software that the rest of the industry quietly stands on without thanking.

The deeper question is not whether we should layer less. The deeper question is what kind of organisation, what kind of contract, what kind of incentive structure could allow reduction to be a fundable activity rather than a private virtue practised by the few. Today it is funded only by accident: by maintainers who are paid for something else, by retired engineers donating evenings, by small institutions that never grew into the structures that would have stopped them from doing it.

What would happen if a team were given one sprint, just one, not to add a layer but to remove one? Who has the authority to ask the question? Who would carry the cost of the answer being yes?

The plaster is cheap. The wound is not.

The Browser That Brought Its Own AI

Vivian Voss — Sat, 02 May 2026 07:05:51 +0000

Not in the Brief, Episode 01

Open chrome://on-device-internals in a new tab. If your machine qualifies, you will see a multi-gigabyte language model that Chrome has downloaded onto your disk, listed with a version number and a file size. Any website you visit can call this model through an API in JavaScript. There is no permission prompt. There never was. This is the first episode of Not in the Brief: a series on the documented mechanics of software that has been added to your machine without you being asked. We start with the largest target available, because the change is hidden in plain sight, and because almost everyone is affected.

What Is Built In

Chrome ships seven on-device AI APIs, all backed by a foundation model called Gemini Nano. Gemini Nano runs locally; the inference does not leave the machine. The APIs, in order of relevance, are:

LanguageModel: the Prompt API. Free-form text in, text out.
Summarizer: text summarisation.
Translator: language translation between supported pairs.
Writer: short-form generative writing.
Rewriter: text rewriting and tone change.
Proofreader: grammar and style correction.
LanguageDetector: language identification.

The APIs have been generally available for extensions since Chrome 138 (May 2025). For ordinary web pages, the APIs are in Origin Trial and behind a feature flag for end users; in practice, this means a website needs an Origin Trial token, and the browser must have the model loaded. The trial route does not require user consent at the page level either.

How It Got On Your Machine

Gemini Nano in Chrome was announced at Google I/O in May 2024. The decision logic Chrome applies today, as documented by Google's own developer pages, is the following.

When the user starts Chrome on a qualifying device, the browser checks four things in the background: more than 4 GB of VRAM available, at least 16 GB of system RAM, at least 22 GB of free space on the volume holding the Chrome profile, and a supported operating system (Windows 10 or 11, macOS 13 or later, Linux, or ChromeOS on a Plus device). On mobile Chrome the entire feature is unavailable, so phones are out. Desktops, laptops and Chromebook Plus devices are in.

When all four conditions hold and any local activity (a relevant page, an extension, an Origin Trial token) triggers the API, Chrome downloads Gemini Nano in the background, on an unmetered connection. There is no installation prompt. No "Chrome would like to download a 2 GB model" dialog. The model arrives, lives on disk, and updates itself.

If the free disk space later drops below 10 GB, Chrome removes the model. If the eligibility criteria are not met for 30 days, the model is purged. Both removals happen without user interaction; both reinstall when the conditions return.
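
The download and removal rules described above can be written down as two small predicates. This is a sketch of the article's description only, not Chrome's code: the function names, the device object shape and the OS labels are hypothetical, and operating-system version checks are left out for brevity.

```javascript
// Sketch of the eligibility and retention rules as described in this
// article. Not Chrome's implementation; every name here is hypothetical.
const SUPPORTED_OS = ["windows", "macos", "linux", "chromeos-plus"];

// device: { vramGB, ramGB, freeDiskGB, os }
function qualifiesForDownload(device) {
  return (
    device.vramGB > 4 &&        // more than 4 GB of VRAM
    device.ramGB >= 16 &&       // at least 16 GB of system RAM
    device.freeDiskGB >= 22 &&  // at least 22 GB free on the profile volume
    SUPPORTED_OS.includes(device.os) // version checks omitted in this sketch
  );
}

// Once downloaded, the model is removed when free space drops below 10 GB
// or when the device has not met the criteria for 30 days.
function shouldRemoveModel(freeDiskGB, daysIneligible) {
  return freeDiskGB < 10 || daysIneligible >= 30;
}

const laptop = { vramGB: 8, ramGB: 32, freeDiskGB: 120, os: "linux" };
console.log(qualifiesForDownload(laptop)); // true
console.log(shouldRemoveModel(8, 0));      // true
```

Note the gap between the two disk thresholds: the model needs 22 GB free to arrive but is only evicted below 10 GB, so a machine can hold the model while no longer qualifying for a fresh download.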

How A Website Talks To The Model

The API surface is straightforward. A web page in JavaScript writes:

const session = await LanguageModel.create();

If the model is available, a session is returned and the page can call session.prompt("...") to get text back. There is no permission dialog at any point in this exchange. In 2018, a comparable request for a device capability, such as a navigator.mediaDevices.getUserMedia() call for the microphone or camera, would have triggered a browser-level prompt asking the user. The Prompt API does not.

The cross-origin story is partial. A top-level page on example.com can call the API. A same-origin iframe can call it. A cross-origin iframe (an embedded ad, an embedded widget) needs the parent page to set the allow="language-model" attribute on the iframe. This is the only permission boundary in the architecture, and it lives between iframes, not between site and user.

The capability probe LanguageModel.availability() returns one of four values: 'unavailable', 'downloadable', 'downloading', or 'available'. Any page that calls this method learns whether the visitor is on a model-capable device. This is a hardware-class probe without a permission prompt, in a browser whose Privacy Sandbox was discontinued in April 2025 and whose own advertising policy has explicitly permitted digital fingerprinting since February 2025. It is one further entry in a list of more than thirty fingerprinting vectors active in production Chrome today.
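
A minimal sketch of what such a probe looks like from a page script follows. It uses only the LanguageModel.availability() call named above; the function name probeOnDeviceModel, the scope parameter and the "no-api" label are assumptions added for illustration, the latter two so the function can also be tried outside Chrome.

```javascript
// Sketch: what any page script can learn without a permission prompt.
// `scope` defaults to the page's global object; passing it in is only a
// convenience for exercising the function outside Chrome.
async function probeOnDeviceModel(scope = globalThis) {
  // Browsers without the Prompt API do not define LanguageModel at all.
  if (typeof scope.LanguageModel === "undefined") {
    return "no-api"; // hypothetical label, not a value the API returns
  }
  // Resolves to a status string such as 'available' or 'unavailable'.
  return scope.LanguageModel.availability();
}

probeOnDeviceModel().then((status) => {
  console.log(`Model status visible to this page: ${status}`);
});
```

Either branch tells the page something about the visitor: the absence of the API narrows the browser and version, and its presence narrows the hardware class.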

There is no rate limit on the website's calls. There is no per-origin token quota. The compute cost is paid by the user's CPU, GPU and battery; the website pays nothing.

Why The Permission Prompt Is The Story

Browsers express their security policy through their dialogs. Geolocation needs one. Microphone, camera, screen capture, notifications, clipboard, USB, MIDI, Bluetooth: all need them. Local language-model inference does not. That is not a small detail. It is an architectural statement: this feature is classified, by Chrome, as belonging to the standard platform, not to the privileged surface. It sits next to the JavaScript engine, not next to the camera.

A reasonable user reading the brief would expect to be asked about a feature that uses several gigabytes of disk, runs on the user's GPU, and consumes battery whenever a website calls it. The user is not asked. The brief did not specify this; the user accepted it by accepting Chrome.

What This Means For Risk

Three risk lines are real, and they should be named without panic.

Fingerprinting and tracking. LanguageModel.availability() is a fingerprinting input. Combined with the canvas, font, audio, language, GPU and timing vectors that are already in production use across roughly thirteen percent of the top 20,000 websites (per a 2025 ACM study), the addition of a model-availability probe contributes to a higher-entropy fingerprint. In a browser without Privacy Sandbox and with explicit policy permission for fingerprinting, this is a measurable degradation.

Indirect prompt injection. Web content goes into the model. Model output goes back into web UIs. A page that includes a hidden instruction in user-readable text can attempt to coerce the model into producing output that influences subsequent actions. OWASP found indirect prompt injection in 73 percent of production AI deployments it audited in 2024. Google has responded with a five-layer defence and a separate "User Alignment Critic" model that watches the agentic Gemini sidebar; that response is itself a recognition that the threat class is severe. The on-device Prompt API does not face the agentic-action surface, but a website that uses it to summarise web content for the user is one prompt injection away from placing whatever the attacker has planted into the displayed result.

Hijack and privilege boundaries. CVE-2026-0628, disclosed in early January 2026 and fixed by Google in mid-January, allowed Chrome extensions with basic permissions to hijack the Gemini Live panel and inherit camera, microphone and file-access privileges through the panel's surface. The panel is not the same component as the Prompt API, but the disclosure shows that the boundary between the AI surfaces in Chrome and the rest of the browser's privilege system has been crossed at least once.

These are not theoretical risks. They are documented in Chrome's own security advisories, in OWASP's annual report and in disclosures by Palo Alto's Unit 42, Malwarebytes and SecurityWeek.

How To See It On Your Own Machine

Three tabs, one policy, one DevTools check. Five minutes.

chrome://on-device-internals. The model status, version, file size and update history. If the page shows a Gemini Nano entry with a version number and a size, the model is on your disk and ready to be called.

chrome://flags/#optimization-guide-on-device-model. Set this to Disabled. Relaunch Chrome. The model will not be downloaded; if it has already been downloaded, it will be removed.

chrome://flags/#prompt-api-for-gemini-nano. Set this to Disabled. Relaunch. The web-page API surface is gone; pages calling LanguageModel.create() will fail with 'unavailable'.

Enterprise policy. On Windows, set the registry value HKLM\SOFTWARE\Policies\Google\Chrome\GenAILocalFoundationalModelSettings = 1 (DWORD). On macOS and Linux there are equivalent profile keys. Chrome will respect this and the model will not return through future updates.

DevTools. Open a tab on any site, open DevTools, switch to the Network panel, and filter for optimizationguide-pa.googleapis.com. This is the model and configuration update server. You will see traffic when Chrome checks for or pulls model updates.

The Pattern Is Not Chrome-Only

This is a series, and Chrome is the first episode because it is the largest. The same pattern is in other browsers, each in its own form.

Microsoft Edge ships a Copilot Sidebar that is enabled by default for many users; the toggle lives at edge://settings/copilot.
Brave Browser ships Leo AI built in, with a cloud-mode toggle in Settings → Leo.
Firefox added an AI Chat sidebar with a configurable provider in about:preferences#general → AI Chatbot.
Arc ships AI features in Settings → AI; the project is now under The Browser Company.
Apple Safari integrates Apple Intelligence on supported macOS and iOS versions, configured under System Settings → Apple Intelligence.

The architecture varies. The pattern is the same: an AI feature added by default, exposed to web pages or to the user, with the awareness path tucked into a Settings page that few users will visit unprompted. We will take each of these in turn over the coming episodes, and the awareness path for each.

What To Take Home

If your browser ships an API without a permission prompt, the browser has stated where that API stands. It is treating the feature like part of the standard platform: not a privileged surface that needs explicit user consent, but a default capability of the runtime.

That is not an argument against Chrome's built-in AI. It is an argument for knowing it is there. The feature is real. The performance is genuinely good. The local-model architecture is, in some respects, more privacy-respecting than a cloud round-trip would have been. The quarrel is not with the existence of the feature. It is with the silence of its arrival, and with the dialog that the browser used to show and now does not.

The browser used to ask about the camera. It does not ask about the model. The line moved. The dialog did not.

If you have not looked, you do not know what is on. The looking is not difficult. It just has to start.

The Subscription You Did Not Ask For

Vivian Voss — Fri, 01 May 2026 09:05:18 +0000

In the Net, Episode 01

In 2012, Adobe shipped Creative Suite 6. A studio could buy the Master Collection once, for around two thousand five hundred euros per seat (US list price was 2,599 dollars), and run it for years. Fourteen years later, the same studio leases Adobe Creative Cloud All Apps for around 743 euros per seat per year (Adobe Germany list price, May 2026). The tools have not changed dramatically. The architecture under the licence has.

This is the first episode of In the Net: a series on the documented mechanics of vendor lock-in. The premise is simple. Every platform tells you how to come in. The architecture tells you whether you can leave. We will look at how each major platform's promise was built, what the lock-in mechanisms are, why the documented exit does not work as advertised, and what the realistic escape routes look like.

We start with Adobe because the pattern is at its clearest. The promise was good. The promise was kept for thirty years. The architecture under the subscription, however, is a separate story.

The Promise, Honestly

Adobe Creative Suite was, for three decades, the most defensibly chosen tool stack in design and publishing. Illustrator appeared in 1987. Photoshop shipped in 1990. Premiere in 1991. InDesign in 1999. The applications were excellent, kept in active development, and produced the file formats that the entire commercial design industry settled on.

A studio buying CS6 in 2012 made a rational decision. Pay once, run forever. Major upgrade cycle every three years. Predictable cost, predictable workflow, the same key bindings their staff had learned over a decade. Nothing about that was a trap.

This matters. Lock-in stories are most useful when they begin with the promise that was real. Adobe's promise was real. The mechanism that came after did not undo the original promise. It changed the architecture under it.

The Day the Architecture Changed

In May 2013, Adobe announced that CS6 was the last perpetual-licence release of Creative Suite. Going forward, the only way to use Photoshop, Illustrator, InDesign and the rest would be Creative Cloud, a monthly subscription. The applications continued to ship. The licence did not.

The transition was managed elegantly. Existing CS6 owners could keep using their software. New customers and upgrades, however, were on the subscription path. Within two years, the perpetual market was effectively closed.

Three lock-in mechanisms were wired into the new architecture.

The First Hook: File-Format Binding

Adobe's file formats are proprietary and rich. PSD carries Smart Objects, Layer Effects, Adjustment Layers, and a depth of state that no open format mirrors completely. AI carries Illustrator's specific path semantics. INDD, more than the others, has effectively no portable equivalent: IDML exists as an interchange format, but loses live links, master-page state, and complex layout work.

The judgement is not that Adobe should publish their file formats. The judgement is that the file format is the lock. A studio that has spent a decade building PSDs is not just buying software; it is paying rent on its own back catalogue.

The Second Hook: Cloud-Library Binding

Creative Cloud syncs fonts, brushes, asset libraries, colour palettes, and shared resources between the applications and across teams. Cancel the subscription, and the synchronised assets become unavailable. Some assets remain available locally; others depend on the licence. The working environment is not portable.

This is a more recent hook than the file format. It came in over the years between 2014 and 2020, with Creative Cloud Libraries, Adobe Fonts (formerly Typekit), and the deeper integration between Photoshop and Lightroom Cloud. Each individual feature was a real productivity gain. The cumulative effect was that the user's working environment lived inside Adobe's account, not on their own machines.

The Third Hook: Cancellation Architecture

Adobe's Annual Plan, billed monthly, carries an Early Termination Fee equal to 50 percent of the remaining contract obligation. A user three months into a twelve-month plan who cancels owes half of the nine remaining monthly payments: four and a half months of fees.
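
The fee rule can be put into numbers with a minimal shell sketch. The function name and the monthly price of 60 euros are illustrative assumptions, not Adobe's published figures; only the 50 percent rule and the twelve-month term come from the paragraph above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: early termination fee under a
# "50 percent of the remaining months" rule.
# Amounts are integer cents so that no floating-point maths is needed.

# etf_cents MONTHLY_CENTS TERM_MONTHS MONTHS_ELAPSED
etf_cents() {
  local monthly="$1" term="$2" elapsed="$3"
  local remaining=$(( term - elapsed ))
  [ "$remaining" -gt 0 ] || remaining=0   # nothing is owed once the term is over
  echo $(( remaining * monthly / 2 ))
}

# Assumed price of 60.00 euros a month, cancelling three months into twelve:
# etf_cents 6000 12 3   # prints 27000, i.e. 270.00 euros
```

Nine remaining months at half rate is four and a half monthly payments, whatever the actual price turns out to be.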

In June 2024, the US Department of Justice, on behalf of the Federal Trade Commission, filed suit against Adobe alleging that this fee was hidden behind layered cancellation flows and that the company had violated the Restore Online Shoppers' Confidence Act. The case (FTC v. Adobe) is, as of this writing in 2026 and to my knowledge, still in motion. Whatever the eventual outcome, the architecture itself is documented in the agreements.

A subscription that costs more to leave than to keep is not, by any honest reading, a subscription one can leave at will. The architecture says what the marketing copy does not.

The Standing: Market Power and How the User Is Treated

Two further dimensions matter, and neither is reducible to price. The first is market power. The second is how the user is treated in the contracts the user has signed.

Market position. Adobe holds, by industry-aggregator estimates, more than eighty percent of the creative-software market. Photoshop alone accounts for around forty-two percent of the graphic-design segment, InDesign for around twenty-six percent, Illustrator for around twelve percent (multiple market-research aggregators citing 2024 figures, e.g. ElectroIQ, Bayelsawatch). The consequence is that Adobe's three core file formats, PSD, AI and INDD, are the de-facto industry standard. A studio that has migrated to Affinity, Krita or any other alternative still receives PSDs from clients, suppliers and partners. The lock-in is not only at the studio level. It is at the level of the entire industry's file-exchange protocol. An agency that holds out alone has not actually escaped; it has only insulated its own machines, while still negotiating in someone else's format.

This is the difference between a Lock-in and a Quasi-Monopoly. The Lock-in keeps the individual customer. The Quasi-Monopoly keeps the customer's customers, and through them, the customer.

How the user is treated. In February 2024, Adobe quietly updated its Terms of Service to include language that, in many readings, granted Adobe broad rights to access user content, including to train generative AI models. The change went largely unnoticed for months. In June 2024, Adobe pushed a routine re-acceptance pop-up to Creative Cloud and Document Cloud users, and many users encountered the new language for the first time as a forced confirmation: accept, or lose access to your tools. The artist community responded with a sustained backlash. Adobe issued clarifications on its own blog (10 June 2024), and on 24 June 2024 published revised Terms that explicitly state Adobe does not train generative AI on customer content (with the exception of submissions to the Adobe Stock marketplace), and that the company never has.

The clarification is, as far as one can verify, accurate. The clarified Terms are clearer than the prior wording. The question that remains is what was being granted before the backlash, and why a company with this market position needed to be told by its own users that this is not the language a creative tool's licence ought to contain. The original wording was not a slip. It was the architecture of the contract a corporation believed it could propose to its captive user base. That belief, and not the words themselves, is the part of this story that does not get corrected.

These two dimensions, market position and contractual stance, change the reading of the price. The price is not only what the studio pays in money. The price is also what it accepts, by signing the contract, about the rights it retains over its own work, and about the position from which the contract is offered. A subscription is the lower-cost end of the price. The other end is harder to put a number on.

The Exit That Isn't

Adobe will tell you the exits exist. They are correct, technically. They are wrong, operationally.

Export your PSD to a portable format. Smart Objects flatten. Layer Effects sometimes lose their parameters. Adjustment Layers may bake into the underlying pixels. The file opens elsewhere; it is no longer the file you built. Anyone who has tried to recover an old PSD in another tool has met this in person.

Cancel your subscription. The Early Termination Fee triggers. The flow is layered. Customers report that the cancellation requires multiple confirmations, optional retention offers, and live-chat negotiation. The exit is documented. It is also a fence.

This is a Lock-in by design. Not in the sense that someone in a boardroom said "let us trap our users". In the sense that the architecture, taken as a whole, produces the outcome that users do not leave even when they would prefer to. That is what an architecture is: the outcome the structure makes likely, regardless of intent.

The Price

For a studio of ten seats on Creative Cloud All Apps, the recurring cost is in the order of seven thousand four hundred euros a year, every year, with no terminating event. Compare to the perpetual model: ten CS6 Master Collection licences in 2012 would have cost around twenty-five thousand euros and remained usable for the years the team chose to run them. The subscription model has been more profitable for Adobe and more expensive for studios across any horizon longer than a few years.
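
A rough break-even can be computed from the two figures above. This is a sketch under simplifying assumptions: it ignores paid CS upgrades, price changes and inflation, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# breakeven_months UPFRONT YEARLY
# Smallest whole number of months after which cumulative subscription
# spend reaches the one-off price (ceiling division on integers).
breakeven_months() {
  local upfront="$1" yearly="$2"
  echo $(( (upfront * 12 + yearly - 1) / yearly ))
}

# Figures from the paragraph above, in euros:
# breakeven_months 25000 7400   # prints 41, about three years and five months
```

Past that point every further year on the subscription is net cost relative to the perpetual purchase, under the assumptions stated.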

For a studio that wants to leave, the cost is migration. Months of file conversion, weeks of retraining, and a tail of legacy work that will, rather inevitably, not transfer cleanly. This is not a fee Adobe charges; it is a cost the architecture imposes.

The Escape Route

The alternatives are real and professional, but the landscape has shifted in 2024 and 2025. Two of the strongest names have changed owners. The migration paths are still good. They simply require attention.

Affinity, by Serif (now Canva). In March 2024, Canva acquired Serif for approximately three hundred million pounds (around three hundred and fifty million euros at the time, as reported by Canva's own newsroom). In October 2025, the three Affinity applications (Photo, Designer, Publisher) were consolidated into a single application called Affinity by Canva, version 3.0. The new model is free with a Canva account. AI features are locked behind a Canva Pro subscription, around one hundred and ten euros a year. The original Pledge given at acquisition included a commitment to perpetual licences; that commitment was, by the publisher's own communication at the version 3.0 launch, replaced when the new model launched. Affinity remains a strong one-to-one replacement for the Adobe core, with direct PSD and AI import. It is also itself an example of the pattern this series tracks.

Pixelmator Pro, by Pixelmator (now Apple). Apple announced its acquisition of the Pixelmator team in November 2024 and completed it in February 2025 (MacRumors, TechCrunch). The standalone perpetual licence is still listed on the Mac App Store at around fifty euros (US list 49.99 dollars). Newer features, such as the Warp tool added with the iPad release in January 2026, are exclusive to Apple Creator Studio: a subscription bundle priced at 12.99 dollars per month or 129 dollars per year (US list, around twelve euros a month or one hundred and twenty euros a year), which ships Pixelmator Pro alongside Final Cut Pro, Logic Pro, and the iWork suite. The standalone path remains, but the rate of new feature development sits inside the bundle.

DaVinci Resolve, by Blackmagic Design. Free for video editing, with a Studio version available as a one-time purchase. The free version includes colour grading that has, as far as I know, been used on theatrical productions; Blackmagic Design lists notable feature-film credits on its own product pages. The most stable of the alternatives in this list, with a perpetual non-subscription model that has held since 2009.

Krita. Free and open source, professional digital painting application maintained by the KDE community. Strong brush engine, deep tablet support, scriptable. Stable.

GIMP and Inkscape. Free, photo and vector. Robust and feature-complete for most workflows. The user experience is, honestly, dated, and migration cost from Adobe is real work because the muscle memory does not transfer cleanly. They are the right choice for budget-constrained teams or FOSS-aligned organisations who can absorb the UX cost.

Capture One. A direct Lightroom alternative. Individual perpetual licences are still offered (US list 299 dollars, around two hundred and seventy-five euros). Multi-user studios, however, were moved to subscription-only in 2024 (PetaPixel reported a 344 percent multi-user pricing change). This is the same pattern as Adobe: perpetual quietly retained for individuals, removed for the larger commercial users.

The maths still works. Adobe Photoshop and Lightroom Photo plan in Germany is around 17.99 euros per month (annual commitment, billed monthly) or about 142 euros per year prepaid; Photoshop Single App and Creative Cloud All Apps cost more (Adobe Germany list, May 2026, with VAT). Affinity is free with an account. Pixelmator Pro standalone is around fifty euros, perpetual. DaVinci Resolve is free. The break-even against any Adobe subscription happens within months, not years.

The harder question is which of the alternatives stays an alternative. Two of the largest names have changed owners in eighteen months, and the change has, in both cases, moved the architecture in directions worth watching. This is not an argument against the alternatives. It is an argument for the only sustainable strategy in this landscape: attention.

What to Take Home

If your tools come with a recurring bill, your back catalogue lives in someone else's licence. If your tools come from a vendor that holds eighty percent of the market, the industry's standard formats are theirs as well, and your migration is incomplete even when your studio has finished it. If your tools come with Terms of Service that granted broad rights over your content until your community noticed, the architecture of the contract is a separate question from the price tag.

The pattern repeats. The market holds the lock-in. The contract holds the user. Even the alternatives need watching, because two of the largest names have changed owners in eighteen months. The subscription you did not ask for is the architecture you signed up for: the rent has gone up, the formats have stayed standard, and the wording in the agreement has, for at least one entire spring, said something the publisher later had to take back.

The tools have not got better. The rent has gone up, the standing has not improved, and attention is the only price that scales.

Read the full article on vivianvoss.net →

By Vivian Voss, System Architect & Software Developer. Follow me on LinkedIn for daily technical writing.

The Backup That Wasn't

Vivian Voss — Thu, 30 Apr 2026 10:18:25 +0000

Tales from the Bare Metal, Episode 01

« Thou shalt not trust a backup thou hast not restored! »

At half past eleven on the night of Tuesday, 31 January 2017, an engineer at GitLab.com typed rm -rf on what they believed was the secondary PostgreSQL database. The terminals on their screen were visually identical, save for the hostname in the prompt. Two seconds later, when they realised the prompt did not say what they thought it said, they killed the command. By that point, three hundred gigabytes of production data had been removed.

That was the easy part. The hard part came over the next eighteen hours, as the team discovered, in a sequence that has since become teaching material, that none of their five backup mechanisms had been working.

This is a long-documented incident. GitLab's response, by industry standards, was extraordinary. They live-streamed the recovery on YouTube. They published their internal chat logs. They wrote a postmortem so detailed and so honest that it remains, nearly a decade later, one of the most widely cited operational documents in software engineering. The point of revisiting it now is not the story; the story is well-known. The point is the pattern.

What Happened, in Sequence

The day had not been routine. From around 19:00 UTC, GitLab's primary database had been under unusual load, suspected at the time to be coordinated spam-account creation. The on-call engineers had been working through the load issue for hours. By 23:00 UTC, replication between the primary and the secondary had stalled: the secondary's WAL receiver could not keep up, and the secondary fell sufficiently far behind that recovery would require re-seeding it.

The engineer in question had earlier that day created an LVM snapshot of the production database, intending to use it to set up a staging instance for testing pgpool-II. That snapshot, taken around 17:20 UTC, was a side effect of unrelated work; it had nothing to do with backups, and was scoped for staging.

At around 23:30 UTC, while attempting to clean up the secondary's data directory in preparation for re-seeding, the engineer ran the cleanup command on the wrong host. The terminals were visually identical; the only difference was the hostname in the prompt, on machines they had been working in for several hours. They killed the command within seconds. Approximately 300 GB had been removed from the primary. Affected: roughly 5,000 projects, 5,000 comments, and 700 user accounts created between 17:20 and 23:30 UTC.

Service was taken offline at 23:30. Recovery began immediately. The team turned to their backups in sequence. Each one, in turn, did not work.

The Five Mechanisms, in Order

pg_dump backups to S3. GitLab's primary off-site backup mechanism was a wrapper script that took daily pg_dump exports and uploaded them to Amazon S3. The script had been working for a long time. It stopped working silently after a PostgreSQL upgrade: the wrapper script invoked an older pg_dump binary which produced an error against the upgraded server, but the script swallowed the error and produced empty output files. The S3 bucket was full of zero-byte files going back several months.
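
The failure mode here, a wrapper that swallows the error and leaves an empty file behind, can be guarded against in a few lines. What follows is a minimal sketch assuming bash; safe_dump is a hypothetical helper, not GitLab's actual script, and the pg_dump invocation in the closing comment uses a placeholder database name.

```shell
#!/usr/bin/env bash
# safe_dump OUT CMD [ARGS...]
# Runs CMD with stdout captured to a temporary file next to OUT.
# OUT is only created if CMD exited 0 AND wrote at least one byte.
safe_dump() {
  local out="$1" tmp
  shift
  tmp=$(mktemp "${out}.XXXXXX") || return 1
  if ! "$@" > "$tmp"; then
    echo "dump command failed: $*" >&2
    rm -f "$tmp"
    return 1
  fi
  if [ ! -s "$tmp" ]; then
    echo "dump command produced no output: $*" >&2
    rm -f "$tmp"
    return 1
  fi
  mv "$tmp" "$out"
}

# Usage (the database name is a placeholder):
# safe_dump /backups/db.sql pg_dump --dbname="$DB_NAME"
```

The design choice is that a failed or empty dump never reaches the final path, so nothing zero-byte can be uploaded as if it were a backup.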

Email alerts for backup failures. The pg_dump script sent failure-notification emails when something went wrong. Those emails would have caught the upgrade-mismatch problem on day one. They did not arrive. A change to the email infrastructure (DMARC) elsewhere in the organisation, made for unrelated reasons, caused the alert emails to be silently rejected at the receiving end. They had been rejected for months. No one noticed because the absence of failure emails was the same signal as success.

LVM snapshots. GitLab had no scheduled LVM snapshot strategy for production databases. The single snapshot that existed on 31 January was the one the engineer had taken six hours earlier for the unrelated staging-pgpool work. By coincidence, this snapshot was the most recent operational backup of the database that existed anywhere in the organisation.

Azure disk snapshots. The cloud platform on which GitLab.com was hosted at the time offered automated disk snapshots. They had not been enabled for the database servers. The decision was deliberate: cost considerations, plus a stated intention to rely on PostgreSQL replication and pg_dump. Restoring from Azure snapshot, when investigated during recovery, was estimated to take days rather than hours.

WAL archiving. PostgreSQL supports continuous archiving of write-ahead log segments, which would have allowed point-in-time recovery to any moment within the archive window. WAL archiving had never been configured.

Recovery used the LVM snapshot. Copying the data from the staging host back to production took roughly eighteen hours, bottlenecked by the network storage's 60 Mbps throughput. By 17:00 UTC on 1 February, GitLab.com was back online, missing the data that had changed between 17:20 and 23:30 UTC the previous day.

The Context, in Fairness to the Build

Each of these five mechanisms had been a reasonable design at the time it was built. pg_dump-to-S3 worked correctly the day it was deployed. LVM snapshots had a clear scope (staging) that was honoured. Azure snapshots were a deliberate cost-trade decision with documented reasoning. WAL archiving was on the long roadmap. The DMARC change, which silently severed the alerting chain, was made by another team, in another part of the organisation, for a reason that had nothing to do with PostgreSQL.

Three systemic conditions made the outcome likely.

First: redundancy of mechanism is not redundancy of recovery. Five backup mechanisms feel like resilience. In practice, none of them had ever been exercised end-to-end against the specific scenario "the primary is gone; restore from the most recent good source within an hour". The drills that would have surfaced each mechanism's silent failure were not part of routine operations.

Second: the absence of a failure signal is not the same as success. The team's confidence in pg_dump rested on the steady absence of failure emails. That signal had been broken for months. Monitoring that depends on negative signals (silence as good news) is a class of monitoring that fails open. The fix is positive monitoring: a periodic heartbeat that says "this thing ran today, here is its output, here is its size, here is its checksum".
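
A positive heartbeat of that kind might look like the sketch below. The function name and the line format are assumptions for illustration; cksum is used only because it is available on virtually every Unix, and a stronger hash could be substituted.

```shell
#!/usr/bin/env bash
# heartbeat_line FILE
# Prints one positive status line (name, size in bytes, CRC checksum)
# and refuses to report anything for a missing or empty file.
heartbeat_line() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "heartbeat refused: $file is missing or empty" >&2
    return 1
  fi
  # cksum on stdin prints "<crc> <bytes>"; split it into $1 and $2.
  set -- $(cksum < "$file")
  printf 'backup=%s bytes=%s cksum=%s\n' "$(basename "$file")" "$2" "$1"
}

# Usage: heartbeat_line /backups/db.sql   # then send the line to chat or email
```

The line carries evidence (a size and a checksum) rather than a bare "ok", so a reader can see at a glance when last night's dump is suspiciously small.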

Third: the path between backup and restore had no single owner. Backup configuration sat with one team; restore procedures sat with another; the email infrastructure sat with a third. No one team owned the integration test that would have walked through the entire path on a regular cadence. The handoffs were the gap.

These are not excuses. The wrong directory was deleted, by an engineer who knew which directory they intended to delete, on a host whose hostname they could read. The error happened. But the consequences of the error (six hours of unrecoverable data, eighteen hours of downtime, public restoration on YouTube) were architectural, not behavioural. A different architecture would have absorbed the same error in minutes.

The Principle

Backups are not backups until they have been restored.

This is older than every database. The unixoid expression of it is small enough to fit in a few lines of shell:

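#!/usr/bin/env bash
# /etc/cron.weekly/restore-test
# latest_backup, restore_to_sandbox, verify_checksum and report_success are
# site-specific helpers, not standard commands; define or source them first.
set -euo pipefail

backup=$(latest_backup)                       # find the most recent
restore_to_sandbox "$backup"                  # restore to a throwaway env
verify_checksum "$backup" sandbox_env         # compare row counts or checksums
report_success "restore-test passed: $backup" # broadcast success on slack/email/wiki

# anything failing aborts via set -e before report_success runs;
# the missing success message is what pages on-call.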

The shape is what matters. The signal is positive: the cron job is required to broadcast success, every week. If a week passes with no success message, on-call is paged. Silence is treated as failure, not as the absence of failure.
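
The "silence is failure" half needs a second, independent check that looks at the age of the last success. Below is a minimal sketch, assuming the restore test records its success as an epoch timestamp in a marker file; both function names and the marker path in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# record_success MARKER
# Called by the restore test on success: stores the current epoch time.
record_success() {
  date +%s > "$1"
}

# heartbeat_is_fresh MARKER MAX_AGE_SECONDS [NOW]
# Succeeds only if MARKER holds a timestamp no older than MAX_AGE_SECONDS.
# A missing, empty or garbled marker counts as failure, never as success.
heartbeat_is_fresh() {
  local marker="$1" max_age="$2" now="${3:-$(date +%s)}" last
  [ -s "$marker" ] || return 1
  last=$(cat "$marker")
  case "$last" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ $(( now - last )) -le "$max_age" ]
}

# Usage from a separate daily job (8 days = 691200 seconds of grace):
# heartbeat_is_fresh /var/lib/restore-test/last-success 691200 || page_oncall
```

Here page_oncall stands in for whatever paging hook the site uses. The check should run from a different host or scheduler than the restore test itself, so that one broken cron daemon cannot silence both.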

The further structural change is to put backup configuration and restore verification under one team's ownership. The path from "backup runs nightly" to "we just restored last week's backup and it worked" has to belong to one group of humans who own the whole sequence.

Where the Pattern Travels

The principle does not depend on PostgreSQL. It does not depend on Unix. It applies anywhere data is being protected against loss.

Cloud-managed databases. AWS RDS, Google Cloud SQL, Azure Database for PostgreSQL all offer automated snapshots. The snapshot is taken; the question is whether you have ever restored one to a sandbox database and confirmed your application can connect to it and read every table. If you have not, you are guessing.

Kubernetes StatefulSets with PVC snapshots. PVC snapshots are convenient. They are also untested by default. The drill is identical: weekly, restore the snapshot to a sandbox cluster, run the application's startup health checks against it.

Object storage. S3 versioning, Backblaze B2, Cloudflare R2. These are good systems. They protect against accidental deletion if and only if you have, at some point, recovered an object from them by version, with the production-side application code, and verified it was correct. Otherwise it is faith.

Git-LFS, large media stores, vendored binary archives. The same logic. The cost of weekly verification is small. The cost of discovering, mid-incident, that the chain has been broken for six months is everything.

Filesystem and host backups. The classic case. Backup tapes that have been rotated for years, never read. The drill is to read one back, mount it, diff against a recent snapshot, and confirm the bytes are present.

In every case, the same shape: the backup is provisional until proven by restore. The cost of the proof is hours. The cost of relying on faith is the whole company.

What to Take Home

If your operational situation reminds you of any of the following, treat it as a thing to investigate this week:

You have multiple backup mechanisms and trust their combination.
Your backup-status monitoring relies on the absence of failure messages.
Backup configuration and restore responsibility live with different teams.
You have not actually restored a production-grade backup to a clean environment in the last quarter.

Each of these was true at GitLab on 30 January 2017. Each of them is true at many organisations now.

The fix is not a new tool. The fix is a one-shell-script weekly drill, a positive heartbeat, and a single person whose job description includes "the backup chain works end-to-end". The cost is small. The alternative is broadcasting your recovery on YouTube to a peak audience of five thousand strangers, which GitLab handled with grace, and which most organisations would not.

Do not push this into a maintenance ticket; the ticket will be deferred each sprint until the next outage promotes it for you. Listen to the critics in your own ranks before you listen to the velocity-celebrants.


The Width You Never Had to Measure

Vivian Voss — Wed, 29 Apr 2026 12:09:19 +0000

Stack Patterns — Episode 14

Every web developer has written this hack. A card component lives in a sidebar at 280 pixels on one page and on a dashboard at 1100 on another. You want the badges to disappear in the narrow case and the layout to switch from horizontal to vertical. You reach for JavaScript:

const observer = new ResizeObserver(entries => {
  for (const entry of entries) {
    entry.target.classList.toggle('narrow', entry.contentRect.width < 400);
  }
});

observer.observe(card);

You then write the CSS twice, once for the base class and once again for the .narrow case. You hope the observer fires before the first paint; most of the time it does, but sometimes only after a flash of mis-styled content. You move on, because the alternative is to wrap the component in a width-aware HOC, install a layout library, or accept the inevitable hydration mismatch in your SSR pipeline.

Container queries make all of that go away.

A Brief, Slightly Embarrassing History

The idea is older than most reading this article. The early 2010s saw a steady stream of element queries discussions in the front-end community, with polyfills (EQCSS and others) and a long mailing-list debate about what the missing tool should look like. The proposals all stumbled on the same architectural problem: querying an element's own size while inside the layout pass that produces that size invites circular layout, which is roughly the equivalent of asking a function for its own return value before it has finished returning.

The fix took the better part of a decade and required a quiet redefinition of the problem. The element under styling cannot query its own size; that path leads to circularity. But the element's containing context can. The work that became Container Queries, originally drafted in CSS Containment Module Level 3 and now living in CSS Conditional Rules Module Level 5, was championed by Miriam Suzanne (Invited Expert at the CSS Working Group) with contributions from many others. An element marks itself as a container and provides a stable size context to its descendants. The descendants then query that context. No circularity, no observer, no JavaScript.

The implementations followed quickly: Chrome 106 and Safari 16 in September 2022, Firefox 110 on 14 February 2023. Global usage is just over 94 per cent as of March 2026. The feature has been cross-browser stable for more than three years.

The Pattern

The simplest case takes one declaration and one at-rule:

.card {
  container-type: inline-size;
}

@container (max-width: 400px) {
  .card h3   { font-size: 1rem; }
  .card .badge { display: none; }
}

container-type: inline-size tells the browser to watch the inline dimension (the writing-mode-aware horizontal axis in Latin scripts). The @container rule then matches whenever that dimension is 400 pixels or less. Rules inside the block apply to descendants of .card, never to .card itself.

There is also container-type: size, which watches both axes. Use it sparingly: size containment stops the container's block size from depending on its content, so an auto-height container collapses to zero height unless you give it an explicit one. The inline-size form is the safe default for component-level responsiveness, and the form most CSS examples reach for.

When components nest, the closest matching ancestor wins. A query without a container-name matches the nearest ancestor with container-type set. To target a specific level, name your containers:

.sidebar { container-name: side; container-type: inline-size; }
.card    { container-name: card; container-type: inline-size; }

@container card (max-width: 400px) {
  /* only the card matters here, not the sidebar */
}

The Length Units That Travel With the Container

Container queries also bring six new length units that resolve against the container rather than the viewport:

cqw: 1% of the container's width
cqh: 1% of the container's height
cqi: 1% of the container's inline size (writing-mode-aware width)
cqb: 1% of the container's block size (writing-mode-aware height)
cqmin: the smaller of cqi and cqb
cqmax: the larger of cqi and cqb

Combined with clamp(), these allow typography that scales with the component, not the viewport:

h3 { font-size: clamp(1rem, 4cqi, 1.5rem); }

Drop the same component into a 280-pixel sidebar and a 1100-pixel main column, and its heading scales appropriately in both, without writing any breakpoints. The vocabulary that responsive design has wanted for fifteen years has finally arrived on the right axis.

Why It Works

The cleverness, as with @scope in the previous episode, is what container queries deliberately do not do. They do not query the element being styled; they query the size of an ancestor. The cycle that broke every previous attempt at element queries simply does not arise.

A second piece of cleverness: the container-type declaration triggers CSS containment for the chosen axes. The browser knows that nothing outside the container can affect what is inside it, and vice versa, for the purpose of size calculations. That makes the cost of container queries predictable: each container is a self-contained layout unit, evaluated once.

The performance question, which used to dominate discussions of element queries, has therefore become a much smaller question. Each container has a cost (a containment scope), but it is a known cost. You do not pay for queries you do not write.

Combined With :has()

Container queries become properly powerful in combination with :has(), the parent-aware selector covered in episode 5 of this series. A component now reacts to two independent axes at once: its own size, via @container, and its own content, via :has().

A common pattern: cards switch to a vertical layout when narrow, but only if they actually contain an image. A text-only card at the same width keeps its inline form.

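.card {
  container-type: inline-size;
}

/* .card-body is an assumed inner wrapper. A container cannot be restyled
   by a query against itself, so the grid lives one level down. */
.card-body {
  display: grid;
  grid-template-columns: auto 1fr;
}

@container (max-width: 400px) {
  .card-body:has(img) {
    grid-template-columns: 1fr;
  }
}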

The same approach scales: badges that disappear only when the container is narrow AND there are more than two of them; a sidebar widget that switches layout only when it contains a form; a section heading that changes weight when it is followed by a long article. None of these require JavaScript, a class toggle, or a render hook.

The composability is the broader point. Each modern CSS feature (container queries, :has(), @scope, @layer, view transitions) is independently useful, but combinations multiply their value. The platform has spent a decade quietly assembling a vocabulary in which most former JavaScript responsibilities for layout and conditional styling become CSS again.

Honest Limitations

Three things to know before you ship container queries to production.

First, container-type: size disables auto-height. If you want a container to react to changes in its own height (rare, but possible), the container must have a defined height rather than letting its content determine it. For most components, you only care about width, and inline-size is the correct choice. Reach for size only when you genuinely need both axes.

Second, style queries (the form @container style(--theme: dark) { ... }) are a separate, newer feature. They are supported in Chrome 111+ and Edge 111+, and Safari added support for custom-property style queries in version 18; Firefox is still developing support. They allow a component to query the value of a custom property on its container, rather than its size, and are powerful for theming and design tokens. Until cross-browser parity arrives, treat them as progressive enhancement rather than a default tool.

Third, container queries do not propagate across iframe boundaries or shadow root boundaries. A widget embedded in an iframe queries its own document's containers, not the parent page's. This is generally what you want; it is worth knowing if you build embedded components that span those boundaries.

When to Use

Anywhere a component lives at more than one width. Cards in a sidebar and a main grid. Article previews in a related-articles strip and a featured slot. Dashboard widgets that the user can resize. Embedded widgets where you do not control the host layout. Anywhere your team currently maintains two or three classes (.card, .card--narrow, .card--wide) coordinated by JavaScript.

For a new design system, build it in containers from the start. Each component sets a container-type on its outer element and queries it from within. Page-level media queries become a layer above, handling viewport-scoped concerns: navigation collapse, hero sizing, things that genuinely depend on the device. Component-level concerns drop a level and stay there.

The Layout, Civilised

Episode 11 of this series gave native page transitions; episode 12 gave a built-in deep clone; episode 13 gave native CSS scoping. Each replaced a stack of build-time tooling and runtime libraries with a few lines of standard CSS or JavaScript. Container queries belong to the same lineage: a feature the platform has been quietly building for a decade, while the framework ecosystem invented and reinvented increasingly elaborate workarounds.

Three years on from cross-browser support, the workaround code is still in production codebases everywhere, and the platform feature still has under-used potential. If your team is still measuring component widths in JavaScript, the cascade has been waiting.

Read the full article on vivianvoss.net →

By Vivian Voss — System Architect & Software Developer. Follow me on LinkedIn for daily technical writing.

pf

Vivian Voss — Tue, 28 Apr 2026 08:25:47 +0000

Technical Beauty — Episode 33

You have written this rule, or you have written something near enough to it: iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT. You have written it because the alternative is dropping every packet that is the response to a packet you let through, which is, on reflection, nearly all of them. You have also, at some point, lost twenty minutes to the question of which chain a packet enters first when it is being forwarded but also locally addressed, and twenty more to whether INPUT runs before or after FORWARD for a particular interface bridge.

You are not a bad systems administrator. You are a systems administrator reading a tool that does not want to be read.

The Mid-Release Crisis

In May 2001, OpenBSD did something that most operating systems would not have the nerve to do mid-release. It pulled its firewall out of the source tree.

The firewall was IPFilter, written and maintained by Darren Reed since the mid-1990s. It was the de-facto BSD-world packet filter, deeply integrated into OpenBSD, and a fixture of the OpenBSD security story. Reed informed the OpenBSD project that IPFilter's licence did not in fact permit the kinds of modifications OpenBSD was making. After some unproductive correspondence, Theo de Raadt removed IPFilter from the OpenBSD CVS tree on 30 May 2001. The release was six months away.

This left OpenBSD without a packet filter. Nobody, internally, was lined up to write one.

The Wrong Person for the Job

Daniel Hartmeier was a Swiss developer who had been contributing peripherally to OpenBSD. He had never written kernel code. He read about Drawbridge, an Ethernet-layer filter from Texas A&M, and noticed that the filter itself was essentially a single C module with a small, comprehensible kernel interface. That, he thought, looked like something he could learn.

In a later interview he was characteristically dry about the experience: "If I had known in advance how many nights I would spend, I might have given up. But the progress kept me motivated."

He committed the first version of pf to the OpenBSD CVS tree on 24 June 2001, twenty-five days after IPFilter was removed. By the end of that month, his code was filtering packets and performing network address translation. Five months later, on 1 December 2001, pf shipped as the default firewall in OpenBSD 3.0.

A man who had never touched the kernel wrote, in four weeks of summer evenings and a few months of polish, a firewall that would still be running, twenty-five years later, on roughly a billion devices.

The Design

The point of pf is that the rules read like English. Here is a working firewall:

block in all
pass in on em0 proto tcp to port 22 keep state
pass out keep state

Three lines. The first denies all inbound traffic by default. The second allows inbound TCP to port 22 (SSH) on the external interface, and tells pf to remember the connection in its state table so the responses get back. The third allows all outbound traffic, again with state.

That is the whole firewall. Not the introduction. Not the simplified example. The whole firewall.

There are no chains. There are no separate tables for filter, NAT, mangle, and raw. There is no conntrack module to load and configure. There is no question about which hook a packet enters first under what bridging conditions. Rules are read top to bottom. The last matching rule wins, unless a rule is marked quick, in which case it short-circuits. That is the entire evaluation model.
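The evaluation model fits in a few lines of Python. What follows is a toy sketch, not pf's implementation: the Rule class and evaluate function are hypothetical names, interfaces and the state table are ignored, and the only behaviour modelled is "last matching rule wins, quick short-circuits", with pf's implicit pass when nothing matches.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    action: str                  # "pass" or "block"
    direction: str               # "in" or "out"
    proto: Optional[str] = None  # None means "any protocol"
    port: Optional[int] = None   # None means "any destination port"
    quick: bool = False          # a matching quick rule ends evaluation

    def matches(self, pkt: dict) -> bool:
        if self.direction != pkt["direction"]:
            return False
        if self.proto is not None and self.proto != pkt.get("proto"):
            return False
        if self.port is not None and self.port != pkt.get("port"):
            return False
        return True


def evaluate(rules: list, pkt: dict) -> str:
    verdict = "pass"  # implicit default when no rule matches
    for rule in rules:  # top to bottom
        if rule.matches(pkt):
            verdict = rule.action  # a later match overrides an earlier one
            if rule.quick:
                break  # quick: stop here, this verdict is final
    return verdict


# The three-line firewall from above, minus interface and state handling.
RULES = [
    Rule("block", "in"),
    Rule("pass", "in", proto="tcp", port=22),
    Rule("pass", "out"),
]

ssh = {"direction": "in", "proto": "tcp", "port": 22}
http = {"direction": "in", "proto": "tcp", "port": 80}

print(evaluate(RULES, ssh))   # pass
print(evaluate(RULES, http))  # block
```

Marking the first rule quick would block the SSH packet as well, because evaluation would stop before the pass rule is ever read.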

The grammar is small enough to fit on the back of a postcard, but expressive enough that the same vocabulary handles filtering, network address translation, traffic queueing, packet scrubbing, anchors (named subgroups for hierarchical configuration), tables (named lists of addresses, updatable at runtime), and policy routing. New features extend the grammar. They do not invent new tools and new flags.

The Contrast

iptables was, as it happens, also released in 2001, also as a successor to an older system (ipchains). Rusty Russell wrote it. It is, by any reasonable engineering standard, a competent piece of work. It is also, by any reasonable readability standard, the wrong shape.

In iptables, a rule is a tuple of flags applied imperatively to a table inside a chain inside a hook point. The user has to know about chains (INPUT, OUTPUT, FORWARD, PREROUTING, POSTROUTING), tables (filter, nat, mangle, raw, security), targets (ACCEPT, DROP, REJECT, LOG, MASQUERADE, SNAT, DNAT), and the order in which all of these interact for a packet that is, say, locally generated and forwarded over a bridge to a destination behind NAT. Connection tracking is a separate kernel module with its own configuration. Quality of service is a separate tool entirely (tc). The whole apparatus rewards memorisation rather than reasoning.

This is, as I understand it, why nftables exists. The Linux netfilter team, in roughly the late 2000s, took a long look at iptables, took a long look at pf, and rather quietly concluded that the second one had the right shape. nftables, released in January 2014, borrows pf's design philosophy without quite borrowing pf itself: one consistent grammar, one tool, one configuration model. One does admire the gesture, even as one notes that the conversion is still ongoing more than a decade later.

The Proof

Twenty-five years on, pf is the firewall in OpenBSD, FreeBSD, NetBSD, DragonFly BSD, and macOS. It is what your iPhone and iPad use to filter network traffic. It is what runs underneath pfSense, OPNsense, and a substantial fraction of the commercial firewall appliances sold to enterprises that have no idea their security perimeter is configured in a Swiss developer's grammar.

The current OpenBSD pf is not the same code as Hartmeier's 2001 commit. Henning Brauer and Ryan McBride, in particular, have extended it considerably: redesigned the rule evaluation engine, added sophisticated state-tracking options, integrated with CARP for high availability, and rebuilt the queueing system. The grammar, however, has remained backward-compatible to a remarkable degree. A pf.conf written in 2003 still mostly parses on a current OpenBSD machine. The dialect has grown, but the language has not changed.

This is one of the markers of a well-designed system: the original creators no longer need to be there for the work to continue, and the work that follows feels of a piece with the work that began.

The Point

pf was not engineered to scale to a billion devices. It was engineered to be readable. Hartmeier wrote it because the alternative was unreadable, and because OpenBSD needed a firewall in less time than was reasonable, and because he wanted to learn the kernel. Twenty-five years later it is the firewall language of the BSD world and a substantial portion of the Apple device estate, and Linux has, in nftables, paid it the considerable compliment of imitation.

Sometimes a system holds up because the original design knew what was missing. pf is one of those systems.

What the Bootloader Knows

Vivian Voss — Mon, 27 Apr 2026 08:24:54 +0000

The Unix Way — Episode 14

Between the firmware that knows almost nothing and the kernel that must know everything, there is a small program with a rather strange job. The bootloader is the one piece of software whose design quietly encodes what the operating system above believes itself to be.

Three answers to the same question. Each tells a different story.

FreeBSD's Loader

Three stages. boot0 is a 446-byte master boot record. boot1 reads the partition table and locates loader. loader(8) is a Forth interpreter, around 600 KB on amd64.

Forth, because in the late 1990s someone decided the bootloader should be programmable without becoming an operating system. Forth gives you a small, deterministic, stack-based language with no runtime assumptions about memory layout or operating system services. It is enough to express "look at this dataset, decide which kernel to load, present a menu, hand off."

The loader understands UFS and ZFS natively. It can mount a ZFS dataset as root. It knows what a Boot Environment is. It presents the list of available Boot Environments as a menu before the kernel starts. None of this is bolted on. It is the design.

The pre-install bootloader for ZFS-on-root systems writes the loader into a small partition outside the ZFS pool, but the loader itself reads the pool metadata, walks the dataset hierarchy, and offers the user the choice of which root dataset to boot. This is the enabling primitive on which the entire FreeBSD Boot Environment workflow stands.

What the Userland Built

The loader exposed Boot Environments as a first-class concept. The userland question was: how does an admin create, list, activate, destroy, and switch between them?

The first answer was a shell script called manageBE, written in the early days of ZFS-on-FreeBSD. Functional, but not pleasant to use.

In 2012, a Polish FreeBSD admin called vermaden wrote beadm(8) in POSIX sh(1) and awk(1), deliberately mimicking the Solaris and Illumos beadm interface so that anyone coming from those systems would feel at home. The original announcement and discussion lives on the FreeBSD Forums in the thread "HOWTO ZFS Madness" (number 31662, still readable today). The script was, and remains, around fifteen hundred lines of POSIX shell and awk: small, auditable, dependency-free, runnable on any FreeBSD system since 9.x.

For six years, vermaden's beadm was the working standard for managing Boot Environments on FreeBSD. It was packaged in ports, recommended by the documentation, and built into countless admin workflows.

In 2018, FreeBSD 12.0 shipped bectl(8), a reimplementation in C that landed in base. The motivation was straightforward: a tool used as universally as beadm had become deserved to be in base, with all the testing and consistency guarantees that base implies. The C rewrite gave it tighter integration with libbe, the FreeBSD library that already encoded the Boot Environment data model, and direct ZFS access without the cost of subprocess fan-out.

The transition was not entirely smooth. Early bectl versions had bugs that beadm did not, in part because beadm had been shaken out by six years of production use across hundreds of admins. There are documented cases from the 12.x era in which running beadm against a bectl-confused dataset state cleanly resolved the inconsistency. The bugs have largely been fixed, and bectl is now the recommended tool.

beadm continues to be developed and continues to ship features that bectl has not adopted. The most interesting is the REROOT option: after creating a Boot Environment, upgrading the running system, and discovering that something is not right, REROOT swaps userspace into the pre-upgrade Boot Environment without a full reboot. The kernel stays up; the userspace is reloaded from the chosen dataset. The whole operation takes seconds rather than the half-minute of a reboot. It is the kind of feature that exists because someone was solving an actual operational problem and writing the code that fixed it, rather than waiting for the perfect abstraction.

That trajectory, from manageBE to beadm to bectl, is a model of how Unix tools mature. A clever script meets an unfilled need. A wider community adopts it and shapes it. A reimplementation in base preserves what worked and fixes what did not, while the original keeps innovating in places where base cannot move as quickly. Both tools coexist, and the choice between them is technical rather than political.

Linux: LILO (1992 to 2015)

Werner Almesberger wrote LILO at ETH Zürich in 1992 and maintained it until 1998. John Coffman took over until 2007, when development passed to Joachim Wiedorn. Active development ended in December 2015 with version 24.2, and was never resumed.

The Linux Loader's design choice was a blocklist. The position of the kernel image on disk was recorded as a list of physical sectors at install time. Move the kernel, the bootloader points at gibberish. Update the kernel, run /sbin/lilo before rebooting or the machine refuses to come back.

This was reasonable in 1992. Disks were small, kernels were rebuilt rarely, and the BIOS understood little beyond INT 13h. The assumption was a slow, static world.

By 2010 the assumption had become a trap. GPT and UEFI made blocklists structurally awkward. RAID and LVM moved blocks behind the back of the bootloader. The Linux kernel started shipping new versions every couple of months. The ritual that LILO required after each change had become the leading cause of unbootable systems for users who forgot it.
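The blocklist failure mode is easy to reproduce in miniature. The sketch below is a toy model under loud assumptions: ToyDisk, record_blocklist, and boot are hypothetical names, the "filesystem" always writes new data to fresh sectors and zeroes the old ones, and record_blocklist stands in for what running /sbin/lilo does at install time.

```python
class ToyDisk:
    """Numbered sectors plus a tiny path-to-sectors map (a stand-in filesystem)."""

    def __init__(self):
        self.sectors = {}
        self.files = {}
        self.next_free = 0

    def write_file(self, path: str, chunks: list) -> None:
        # Sectors of any previous version are released (zeroed here for clarity).
        for sector in self.files.get(path, []):
            self.sectors[sector] = b"\x00"
        locations = []
        for chunk in chunks:
            self.sectors[self.next_free] = chunk
            locations.append(self.next_free)
            self.next_free += 1
        self.files[path] = locations


def record_blocklist(disk: ToyDisk, kernel_path: str) -> list:
    # Install time: remember WHERE the kernel is, not what it is called.
    return list(disk.files[kernel_path])


def boot(disk: ToyDisk, blocklist: list) -> bytes:
    # Boot time: no filesystem knowledge, just read the recorded sectors.
    return b"".join(disk.sectors[sector] for sector in blocklist)


disk = ToyDisk()
disk.write_file("/vmlinuz", [b"kernel-", b"v1"])
blocklist = record_blocklist(disk, "/vmlinuz")
print(boot(disk, blocklist))  # the v1 kernel image

disk.write_file("/vmlinuz", [b"kernel-", b"v2"])  # kernel update, new sectors
print(boot(disk, blocklist))  # stale blocklist: whatever now sits in the old sectors

blocklist = record_blocklist(disk, "/vmlinuz")  # the ritual, re-run
print(boot(disk, blocklist))  # the v2 kernel image
```

A bootloader that reads the filesystem instead looks the path up at boot time, which is why the ritual disappears.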

Linux: GRUB (2005 to present)

The other extreme. GRUB began in 1995 as a research bootloader by Erich Boleyn, was adopted by the GNU project in 1999, and was rewritten from scratch as GRUB 2 in 2005. The result is a small operating system that runs before the operating system.

GRUB 2 ships filesystem drivers for ext2, ext3, ext4, btrfs, XFS, F2FS, JFS, ReiserFS, FAT, NTFS, ISO9660, AFFS, HFS, UDF, ZFS (read-only, partial), and several others. It can decompress kernels in zlib, lzma, lz4, and zstd. It runs scripts in its own POSIX-shell-flavoured language. It supports themes, fonts, gfxmode, network boot, UEFI Secure Boot, multiboot, and chainloading. Its core is around 200 KB; its loadable modules add several megabytes.

The assumption is a fragmented world. Linux distributions disagree about filesystem choice, kernel layout, root-on-everything, and boot configuration. GRUB has to understand every variant because the world above it does not converge.

The cost of that assumption is everything complexity costs. The benefit is that GRUB will boot almost anything you put in front of it.

The omission is interesting. GRUB has read-only ZFS support, sufficient to load a kernel from a ZFS dataset. It does not have Boot Environment awareness, because the conversation about ZFS-as-an-OS-management-layer never happened on Linux. ZFSBootMenu exists precisely to fill that gap, by replacing GRUB entirely with a small Linux kernel and initramfs whose only job is to present a Boot Environment menu and kexec into the chosen one.

The Point

The question is not which bootloader is best. It is what the bootloader assumes about the system above it.

FreeBSD's loader assumes the system above it is coherent enough to be worth talking to: that the bootloader, the filesystem, the kernel, and the userland are designed by people who can sit in the same room. From that assumption, things like Boot Environment selection at the loader become possible, and tools like beadm and bectl become natural.

LILO assumed a slow, static world in which the admin would re-run the installer after each change. The world was no longer that world by 2005, and the bootloader could not move with it.

GRUB assumes a fragmented world in which the bootloader must understand every variant of every filesystem because the distributions above it do not converge. The cost is permanent complexity, the benefit is that GRUB boots anything.

Three loaders. Three theories of the operating system. Each design is correct for the world it assumes. The interesting work is choosing which world to live in.

Why We Measure Tickets, Not Problems Prevented

Vivian Voss — Sun, 26 Apr 2026 08:50:56 +0000

On Second Thought — Episode 05

The dashboard is green. Velocity is up. Burndown is on track. The demo on Friday will be smooth. Production has been quietly fragile for eleven weeks, and nobody notices, because fragility does not have a column.

This is the post about that column.

The Axiom

Productivity, for the purposes of any reporting line above the work itself, is what one can count. Tickets closed, story points completed, lines shipped, deploys per week, sprints reported as "successful", incident-mean-time-to-resolution charted in a quarterly review.

The work that does not produce a number does not exist. The thinking that prevented the incident in the first place does not appear. The decision not to deploy on Friday afternoon does not log. The conversation in which a senior engineer talked the team out of a doomed approach is not in any system of record.

This is not because the people running the dashboards are foolish. It is because the dashboard is the only thing they were given to look at, and over time, the thing one looks at becomes the thing one believes is real.

The Origin

Frederick Winslow Taylor published The Principles of Scientific Management in 1911 with rather industrious enthusiasm. The stopwatch, the time study, the one-best-way. Taylor's intent was to bring rigour to factory work; his unintended legacy was to make the act of being measured the new floor of working life.

Workers struck. The Watertown Arsenal foundry walked out in the summer of 1911 over the introduction of the stopwatch. The US Congress investigated, and banned time studies and pay premiums tied to them on US Government work. Taylor's specific instrument was, briefly, defeated.

The instinct survived under a procession of new names: scientific management, then efficiency, then management-by-objectives, then Six Sigma, then Lean, then agile, now velocity. The vocabulary moves on every fifteen years; the underlying premise (that what cannot be counted does not count) moves not at all.

In 1975, the British economist Charles Goodhart wrote, in a footnote of a paper on UK monetary policy: "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." Twenty-two years later, the anthropologist Marilyn Strathern, observing British university assessment regimes, condensed it into the version everyone now quotes: when a measure becomes a target, it ceases to be a good measure.

We were warned by name, twice in one century, by people whose entire professional life had been spent watching the phenomenon. The industry that calls itself data-driven did not, in this case, read the data.

In 2019, Ron Jeffries, one of the original signatories of the Agile Manifesto and the man widely credited with promoting the story point, published a public reconsideration:

"I may have made the name-changing suggestion. If I did, I'm sorry now."

He went on to recommend abandoning story-point estimation entirely. The industry, having found story points terribly useful for promotion-decisions, performance-reviews, and quarterly board reports, kept them. The dashboard, rather firmly, demands them.

The Cost

Consider two engineers in the same team for the same quarter.

The first engineer prevents three outages. She does this by refusing to deploy on Friday afternoon when the staging environment is showing intermittent failures; by patiently explaining to a junior why the proposed cache invalidation strategy will produce a thundering herd; by spotting, in a routine code review, the off-by-one in the rate limiter that would have melted production under the next traffic spike. None of this work produces a ticket. None of it closes a backlog item. None of it is visible to her manager's manager.

The second engineer cheerfully closes forty-seven tickets that quarter. He is praised in the sprint review. He ships the architecture that produces the outages the first engineer prevented. The outages are then opened as new tickets, which the team will close in subsequent sprints, generating velocity and a sense of forward motion.

The first engineer is invisible to every metric in the building. The second is promoted.

This is not a hypothetical. It is the architecture of the modern software organisation, applied with the consistency of a religious practice. The dashboard goes up. The system goes down. The dashboard goes up again, because the new outages are recorded as feature requests in next quarter's backlog, and closing them counts as work.

The cost is not only the outages. The cost is that the first engineer, finding her judgement systematically unrewarded, eventually leaves. The second engineer, finding his ticket-throughput systematically rewarded, eventually becomes director of engineering. The system optimises the people the way one would optimise a queue, and the queue knows nothing about the building it is keeping standing.

The Made-in-Germany Inversion

A short historical detour, because the contrast is precise and the contrast is the point.

"Made in Germany" was an insult before it was a compliment. The British Merchandise Marks Act of 1887 was passed by Parliament after British manufacturers (Sheffield cutlery, in particular) complained that German imitations were entering the country with British-style markings. The Act required all foreign goods to be plainly marked with their country of origin, the practical purpose being to allow British consumers to recognise and refuse the inferior German imports.

The plan worked exactly as designed for about a decade. Then it inverted.

Within thirty years, "Made in Germany" had become a guarantee of value. German firms had used the label not to perform productivity but to ship goods that did not need replacing. Solingen blades that lasted decades. Carl Zeiss optics that astronomers in Britain quietly preferred to anything domestic. Steinway pianos. Leica cameras. The label became a guarantee because the work was a guarantee.

The inversion was not a marketing campaign. It was a consequence of a culture that measured the chair, not the hours.

We have, rather industriously, built the inverse industry. The label is impeccable. The dashboards are green. The retros are constructive. The substance, increasingly, is not. A modern enterprise software product carries a thousand certifications, three SOC-2 audits, an ISO-27001 stamp, an SBOM, an OpenSSF Scorecard, and falls over when a region drains. The label does the work the work used to do.

This is not specifically a German point. It is a craftsmanship point that happens to have a useful German example. The craftsmen of Sheffield, who triggered the 1887 Act, would have understood it perfectly well had the inversion gone the other way.

The Question

If a craftsman built a chair to last fifty years, the metric was the chair, not the hours. If a Bell Labs engineer in 1955 designed a Number 5 Crossbar switch that ran in the field for thirty years, the metric was the switch, not the patches. If the engineers at Volvo who designed the three-point seatbelt in 1959 had been measured on patents-filed-per-quarter, they would not have given it away to every other manufacturer in the world, and the next sixty years of road safety would have run rather differently.

What would happen to a software organisation that measured problems prevented rather than tickets closed? That measured quality held rather than features shipped? That measured judgement applied rather than activity logged?

The honest answer is that nobody knows, because nobody has been allowed to try long enough for the chair to last fifty years. Software organisations rarely outlive their last quarterly review by more than two or three of them. The promotion cycle is faster than the chair.

There is a quieter question underneath, which is the one this episode is really about. It is not "what metric should we use instead?" It is: why have we agreed, as an industry of nominally clever people, to organise our working lives around an instrument we know to be wrong, in a way that has been documented for fifty years and apologised for by its inventors?

One does suspect the answer is uncomfortable. It is much easier to ship a number than to ship a thing that lasts. It is much easier to manage a number than to manage a person who is doing something difficult. It is much easier to write a quarterly review of a number than of a judgement.

The dashboard is green. The chair, somewhere, is not being built.

The Dependency Avalanche: 644 Strangers in Your package.json

Vivian Voss — Sat, 25 Apr 2026 07:12:08 +0000

Beta Stories — Episode 09

The promise of the modern package ecosystem was a kind one: you do not have to write everything yourself. Stand on the shoulders of giants. Reuse, do not reinvent. Someone else maintains it.

The reality, measured this morning on a fresh laptop, is the avalanche this episode is named for.

The Reality, in Two Numbers

npm install express on a clean directory pulls Express 5.2.1, declares 28 direct dependencies, resolves a total of 65 packages across the dependency tree, and produces a node_modules of 3.6 MB. That is the smaller end of the modern web.

npx create-next-app@latest with the recommended defaults (TypeScript, Tailwind CSS, ESLint, App Router) creates a project whose package.json declares 11 packages, resolves a transitive tree of 644 packages, and writes a node_modules of 463 MB. The application, at this stage, renders a single page that says "Welcome to Next.js".

Six hundred and forty-four pieces of someone else's work to render eighteen characters of text. Each package authored, in principle, by a stranger; in practice, sometimes by a small group of strangers; in a handful of unpleasant cases, by an account that has been waiting to be useful.
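One can take the same measurement on any project without reinstalling anything. The sketch below is a minimal one, assuming an npm lockfile in the version 2 or 3 format, where the top-level "packages" map has one entry per installed package and an empty-string key for the project itself. The function names and the sample data are hypothetical, and npm's own "added N packages" line may count slightly differently.

```python
import json


def count_declared(manifest_text: str) -> int:
    """Packages the team chose: dependencies plus devDependencies in package.json."""
    manifest = json.loads(manifest_text)
    return len(manifest.get("dependencies", {})) + len(manifest.get("devDependencies", {}))


def count_resolved(lockfile_text: str) -> int:
    """Packages that actually arrive: every lockfile entry except the root project."""
    lock = json.loads(lockfile_text)
    return sum(1 for path in lock.get("packages", {}) if path != "")


# Hypothetical miniature project: one declared dependency, three resolved packages.
SAMPLE_MANIFEST = '{"name": "demo", "dependencies": {"a": "^1.0.0"}}'
SAMPLE_LOCK = """
{
  "lockfileVersion": 3,
  "packages": {
    "": {"name": "demo"},
    "node_modules/a": {"version": "1.0.0"},
    "node_modules/b": {"version": "2.0.0"},
    "node_modules/a/node_modules/b": {"version": "1.5.0"}
  }
}
"""

declared = count_declared(SAMPLE_MANIFEST)
resolved = count_resolved(SAMPLE_LOCK)
print(f"{declared} declared, {resolved} resolved")
```

Pointed at a real package.json and package-lock.json, the gap between the two numbers is the part of the tree nobody on the team chose by name.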

For comparison: the same minimal HTTP responder in Rust uses hyper (the lower-level HTTP foundation) or axum on top of it, and keeps its dependency tree small and explicit in Cargo.toml, where every crate is signed off in writing before it lands. The Go equivalent uses the standard library's net/http, pulls zero external dependencies, and builds, statically linked, to a single 7 MB binary. FreeBSD ships a base userland audited as one source tree, and a ports collection where every port has a named maintainer and a fully declared dependency graph (nginx and friends arrive that way, complete with provenance). All three have existed for years. None asks the team to import a stranger's recursion.

The Mechanism

The phrase that does the damage, repeated almost reflexively, is "we do not maintain it; the upstream maintainer does."

Which means, when one reads it carefully: nobody on the team has read the code. Nobody has reviewed the commit history. Nobody has audited the build script. Nobody has checked who has commit access. Nobody has asked what the package was last week, or what it will be next week. The entire audit duty has been outsourced, in writing, to a name on an npm registry page.

A 644-package install is 644 outsourced audits. The team that ships it has, by default and in good faith, agreed that some other unnamed group will do the reading on their behalf. The other unnamed group, when one goes to look, is mostly volunteers who themselves have not been paid to read the code below them either. The audit duty has been outsourced so many times that nobody is left holding it.

This is the Beta Stories mechanism: software gets worse not by mistakes, but by accumulation. The decay is in the count. Every package added is a piece of code that the team has decided not to read.

The Audit That (Barely) Happened

In January 2021, an account named Jia Tan was created on GitHub. It submitted small, useful patches to xz-utils, the compression library that ships in essentially every major Linux distribution and underlies a great many compressed-archive operations across the Unix world. The patches were good. The maintainer at the time, Lasse Collin, accepted them, as one accepts good patches.

By the summer of 2022, three accounts (Jia Tan, Dennis Ens, Jigar Kumar) were active in the xz-utils mailing lists and issue tracker, applying coordinated pressure on Lasse Collin to add an additional maintainer with commit rights. Lasse, by his own admission, was burnt out, unpaid, and had been carrying the project alone for years. Jia Tan was eventually granted those rights.

In February and March 2024, Jia Tan committed the backdoor. The payload was inserted into xz-utils 5.6.0 (released 24 February) and 5.6.1 (released 9 March), hidden inside test fixtures and triggered through the autotools build system in a way that left the source tarball pristine to a casual reader. The build, when invoked in a particular way that distribution maintainers happened to invoke it, linked malicious object code into liblzma. liblzma was, in turn, linked into sshd on systemd-based distributions via the systemd-notify integration. The result was a remote-code-execution backdoor in OpenSSH, triggerable by any client with the right key. CVSS 10.0.

By 29 March 2024, the backdoored versions had already reached Debian unstable, Fedora 40 rawhide, openSUSE Tumbleweed, Ubuntu testing, and Kali Linux. They were days, in some cases hours, away from rolling into stable releases that would have been on hundreds of millions of servers.

On 28 March 2024 (the public posting was on the 29th), Andres Freund, a PostgreSQL maintainer at Microsoft, noticed that his SSH login on a Debian unstable system was taking approximately 500 milliseconds longer than usual. He had recently been benchmarking Postgres builds and was attentive to small latency drifts. He ran the login under valgrind, found memory-access errors pointing at liblzma, traced the chain back through the autotools build, found the obfuscated payload in the test fixtures, and posted the disclosure to the oss-security mailing list that night.

The internet was saved by one engineer noticing half a second of latency he did not expect.

The Signal

Two and a half years of patient social engineering against an unpaid maintainer. A chain of distribution maintainers signing off on routine version bumps. CI pipelines, test suites, code-review tools, all returning green. SBOM generators producing clean reports. The audit duty had been outsourced so many times that nobody was holding it.

The signal one watches for, after XZ, is not "is there malicious code in your dependencies." That signal is impossible to read directly. The signal is who is doing the reading. If the answer is "the upstream maintainer," ask: who is the maintainer, what is their funding, what is their burnout level, who has commit access, what has changed in the past six months. If those questions cannot be answered for a package one ships to production, one is in the avalanche zone.

The Boring Counter-Move

The counter to dependency-avalanche is not "audit every package", which is impossible at 644 packages. It is fewer packages. Each one read, each one chosen, each one with a known maintainer model.

The Go standard library is one approach: a curated, audited, batteries-included foundation that removes 80 percent of the reasons one would reach for a third-party package in the first place. Rust takes the opposite end of the same idea: a deliberately small standard library plus a Cargo.toml that any reviewer can read in a single sitting, with every dependency declared in writing and every transitive crate visible in Cargo.lock. The FreeBSD base system is a third: a coherent userland built and audited as a single project, where what ships in /usr/sbin/sshd is the same source tree as what ships in /usr/bin/find and /usr/bin/awk, maintained by the same project, released together. The Python standard library, before the wheel-and-PyPI culture took over, was something similar. So was the venerable Unix toolkit.

None of this scales to "build any product, in any language, in any week". It scales to "build a sustainable system, with a known maintenance posture, over decades". The Beta Stories question is whether one has been paying for the first while pretending it is the second.

The Closer

XZ is the post-mortem the industry will keep referencing because it nearly worked. A more patient, more careful, better-funded version of the same attack will work, and one will not notice in time. The reason is not that the security tools have failed. The reason is that nobody is reading the code, and nobody has been for some time.

Next time it might not be 500 milliseconds. Next time it might be 50. Next time it might be no perceptible drift at all, until the breach notification arrives in the inbox.

Read the full article on vivianvoss.net →

By Vivian Voss — System Architect & Software Developer. Follow me on LinkedIn for daily technical writing.

Service Mesh: The Sidecar Tax

Vivian Voss — Fri, 24 Apr 2026 07:09:38 +0000

The Invoice — Episode 19

"mTLS, observability, traffic management, zero-code retries. You need a service mesh."

Splendid. Let us examine what one is actually paying for.

A service mesh moves cross-cutting concerns (mTLS, retries, timeouts, traffic shifting, observability) out of application code and into a proxy that sits beside each pod. Istio, the archetype, launched in 2017 as a joint project of Google, IBM, and Lyft. It graduated within the CNCF in July 2023. In the 2024 CNCF Annual Survey, service-mesh adoption across respondents fell to 42 percent, down from 50 percent the year before. That is not a catastrophe. It is, however, the first full-year decline the category has ever posted.

The Complexity Invoice

Istio ships over a dozen primary custom resource definitions across three categories (traffic management, security, telemetry) and dozens more through its operator, telemetry plugins, wasm extensions, and gateway APIs. A minimally useful installation comprises:

A control plane (istiod) responsible for configuration distribution, certificate issuance, and serving the xDS API to every sidecar
A per-pod sidecar (Envoy) injected into every workload, running a second container alongside the application
An ingress gateway at the cluster edge, usually another Envoy in a standalone pod
mTLS certificates rotated by istiod, distributed via SDS to each sidecar
Policy resources (PeerAuthentication, RequestAuthentication, AuthorizationPolicy)
Telemetry bindings (Telemetry CRDs) to send traces and metrics to external collectors
A platform team that knows what each of those does, how they interact, and how to debug any given failure mode

The CNCF's own reports describe Istio as mature, powerful, and "operationally demanding". The third adjective is the one to watch. Installing Istio in a fresh cluster takes a senior SRE about two days. Operating it for six months takes roughly 0.5 to 1.0 FTE, scaling upwards with cluster size. Debugging it at 3 a.m. is a skill one acquires by losing two nights of sleep and one customer.

The Latency Invoice

Every inter-service HTTP or gRPC call now traverses two Envoy proxies: the caller's sidecar, then the callee's sidecar. Adding two proxies to every request path means adding latency. How much is now well-measured.

A 2025 peer-reviewed performance comparison (published by the DeepNess Lab, Performance Comparison of Service Mesh Frameworks: the mTLS Test Case) measured the overhead with mTLS enforced on otherwise identical workloads. The results:

Mesh	mTLS overhead vs. baseline
Istio (sidecar mode)	+166%
Cilium	+99%
Linkerd	+33%
Istio (ambient mode)	+8%

The headline number (+166 percent for Istio sidecar with mTLS) is surprising only to people who have never read the benchmark. Envoy is fast; two Envoys in the path plus TLS handshakes and certificate validation are not free. Linkerd's Rust-based linkerd2-proxy is measurably lighter because it was built for the job, not adapted to it. Ambient mode, announced in September 2022 and generally available since Istio 1.24 (November 2024), replaces per-pod sidecars with a shared node-level ztunnel and produces dramatically less overhead. Ambient is, in effect, Istio's own public admission that the sidecar model had a problem it could not solve by optimisation alone.
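To see what those percentages mean for a single call, here is a small worked sketch. It reads the table as added latency relative to the no-mesh baseline, which is how this section presents it; the 2.0 ms baseline is an assumption chosen for illustration, not a figure from the benchmark.

```python
# Overhead figures copied from the table above (percent over the no-mesh baseline).
MTLS_OVERHEAD_PCT = {
    "Istio (sidecar mode)": 166,
    "Cilium": 99,
    "Linkerd": 33,
    "Istio (ambient mode)": 8,
}


def latency_with_mesh(baseline_ms: float, overhead_pct: float) -> float:
    """Apply a relative overhead (in percent) to a baseline latency."""
    return baseline_ms * (1 + overhead_pct / 100)


BASELINE_MS = 2.0  # assumption: a hypothetical in-cluster call without any mesh

for mesh, pct in MTLS_OVERHEAD_PCT.items():
    print(f"{mesh:22s} {latency_with_mesh(BASELINE_MS, pct):5.2f} ms")
```

The overhead compounds along a call chain: a request that fans through five internal hops pays the multiplier five times.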

A sidecar also costs memory. The Istio 1.24 documentation reports approximately 60 MB of RAM and 0.20 vCPU per Envoy sidecar at 1,000 HTTP RPS with 1 KB payloads. A cluster with 1,000 pods is therefore paying roughly 60 GB of RAM and 200 vCPU for the mesh before a single byte of application code has executed. Ambient ztunnels are smaller (approximately 12 MB RAM, 0.06 vCPU each) but you now also pay for waypoint proxies where L7 features are enabled. Either way, the total is non-zero.
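The arithmetic behind that paragraph, as a sketch one can rerun with one's own numbers. The per-proxy figures are the ones quoted above; the 50-node cluster is a hypothetical, and the vCPU totals assume every proxy carries the documented 1,000 RPS load, which a real cluster will not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyCost:
    ram_mb: float
    vcpu: float


# Per-proxy figures as quoted in the text above.
SIDECAR = ProxyCost(ram_mb=60, vcpu=0.20)  # one per pod
ZTUNNEL = ProxyCost(ram_mb=12, vcpu=0.06)  # one per node in ambient mode


def fleet_cost(per_proxy: ProxyCost, count: int) -> tuple[float, float]:
    """Return (RAM in GB, vCPU) for `count` proxies, using 1 GB = 1000 MB."""
    return per_proxy.ram_mb * count / 1000, per_proxy.vcpu * count


sidecar_ram_gb, sidecar_vcpu = fleet_cost(SIDECAR, 1000)  # 1,000 pods
ztunnel_ram_gb, ztunnel_vcpu = fleet_cost(ZTUNNEL, 50)    # assumption: 50 nodes

print(f"sidecars: {sidecar_ram_gb:.1f} GB RAM, {sidecar_vcpu:.1f} vCPU")
print(f"ztunnels: {ztunnel_ram_gb:.1f} GB RAM, {ztunnel_vcpu:.1f} vCPU (waypoints not included)")
```

The ambient line deliberately leaves out waypoint proxies, since their count depends on how many namespaces enable L7 features.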

The Debugging Invoice

When the mesh works, it is invisible. When it does not, the request path has doubled and so has the surface area for bugs. A 500 that arrives at the client might originate in:

The application code itself
The caller's Envoy (wrong upstream cluster, circuit breaker tripped)
The destination's Envoy (connection limits, bad cert rotation)
A mis-parsed VirtualService or DestinationRule
The mTLS trust chain (expired intermediate, wrong trust domain)
istiod failing to push updated configuration within the retry window
A wasm plugin throwing an exception
A Kubernetes NetworkPolicy quietly dropping the packet

The distributed tracing one installed to understand the application is now required to understand the mesh. Troubleshooting skills become mesh-specific skills, which means they do not transfer and do not scale with engineer headcount in the obvious way.

The Honest Case For

Service meshes solve a real problem for a real set of operators. If you:

Run more than roughly 100 microservices with cross-team ownership
Have strict compliance that mandates mTLS between every internal service
Operate across multiple clusters or multiple clouds with incompatible primitives
Need uniform observability across polyglot services that cannot ship an OTel library

then the tax starts to pay for itself. Everyone else: you are paying for Google's architecture to solve problems you do not, in fact, have.

The Alternative

Direct HTTP or gRPC calls between services, over a network one already trusts. This is how the internet worked for three decades before sidecars existed.

TLS terminated at a single ingress gateway (HAProxy, NGINX, Envoy itself, or whatever one's load balancer of choice is), because the VPC was a trust boundary before sidecars were a marketing category. Internal traffic over plaintext inside the VPC is fine for the vast majority of workloads, and mTLS between services is a compliance requirement for a minority of them, not an architectural necessity for all of them.

Tracing and metrics via an OpenTelemetry library linked into each service. OTel is language-agnostic, vendor-neutral, and five lines of initialisation in most runtimes. It sends traces and metrics via OTLP to any collector. No proxy required.

Retries and timeouts in the client library (Go's http.Client, Rust's reqwest, Java's RestTemplate or OkHttp, Python's httpx, Node's undici). All of these ship configurable timeouts and connection pools; retries are either built in or a short wrapper away, and circuit breakers are a small, well-known library in each of these ecosystems. The retry logic that a service mesh claims to provide "without code changes" is a handful of lines in a mature client.
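For a sense of scale, here is a minimal sketch of client-side retries with exponential backoff, using only the Python standard library. call_with_retries and its parameters are hypothetical names for illustration, not the API of any of the clients listed; the flaky function stands in for a real HTTP call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Simulated upstream that fails twice, then answers.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


delays: list[float] = []
result = call_with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Only errors one has decided are transient get retried; everything else propagates immediately. That decision lives in the application, which is rather the point.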

Authorisation at the application layer, because only the application knows what "this user may read this document" means. Delegating authorisation to a proxy is delegating it to a component that does not understand the data.
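A minimal sketch of what that looks like in practice. The Document model and the may_read rule are hypothetical; the point is only that the decision needs data (owner, reader list) that a proxy looking at headers and paths does not have.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    doc_id: str
    owner: str
    readers: frozenset = field(default_factory=frozenset)


def may_read(user: str, doc: Document) -> bool:
    """Application-level rule: owners and explicitly listed readers may read."""
    return user == doc.owner or user in doc.readers


report = Document(doc_id="q3-report", owner="alice", readers=frozenset({"bob"}))

for user in ("alice", "bob", "mallory"):
    print(user, "may read" if may_read(user, report) else "may not read", report.doc_id)
```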

The Pattern

Service mesh is sold as "zero code changes". You get that by paying:

Two proxies of latency on every internal call, measurably more under mTLS
A platform team of overhead to run istiod, gateways, policies, and upgrades
A debugger's worth of new moving parts: VirtualService, DestinationRule, PeerAuthentication, Envoy configuration, trust chains, wasm plugins

All to avoid writing retry logic that any mature HTTP client already provides in three lines of configuration.

The mesh was always, architecturally, a political solution to a technical problem. It existed because microservice teams did not trust each other's code, and a proxy in the middle was a way of enforcing cross-cutting concerns without convincing any one team to adopt them. The proxy became the architecture. The architecture became the operational cost centre. The cost centre produced Ambient Mode, which is the industry's second try at making sidecars not cost what sidecars cost.

Meanwhile, the original alternative (a library in each service, a trusted network below, and a single ingress gateway at the edge) remained exactly what it has been since approximately 1995.

The side call was always there. One simply decided it wasn't enterprise enough.

Integrated by Design: Out Today, With a Few Rather Educational Caveats

Vivian Voss — Thu, 23 Apr 2026 06:30:44 +0000

Today, Integrated by Design goes on sale. 371 pages on FreeBSD, from philosophy to practice, with a subtitle one has had rather a lot of time to consider ironically: Why the Best Systems Are the Ones You Don't Notice.

A book about invisible systems, launched by some rather visible problems.

Five months of writing. Three weeks of final proofs. Then the last 72 hours, dedicated to problems one does not anticipate before one's first book. What follows is, in the interest of honesty and possibly the education of the next first-time author who stumbles on the same rakes, a field report.

The Font

The final proof arrived from the printer with a rather small complaint. One of the glyphs in the book's monospace font was quietly broken. Not all glyphs. Just the number 8.

JetBrains Mono ships in sixteen styles (eight weights, each in upright and italic). The bug: the lower counter of the numeral 8, the small enclosed oval at the bottom of the figure, had an unclosed path in the outline. On every device one could put the font in front of (Retina screen, non-Retina screen, one's own laser printer) it rendered as a clean hollow counter. Printed at full size on matte coated paper, it was dutifully filled in by a professional offset press. Every 8 in the book became a smudge.

The fix required opening the font in a glyph editor, locating the offending Bézier in the lower counter of the 8, closing the path by hand, and re-exporting. The re-export produced a variant I labelled "JetBrains Mono Fixed". It ships with four styles, not sixteen, for the practical reason that the book only uses four: regular, bold, italic, bold italic. The fix was re-applied to each. The source files of every chapter were recompiled, the code listings re-typeset, the PDF regenerated.

The alternative would have been to re-order a proof print from the printer. A proof print, it turns out, does not arrive the next day. It arrives in seven. The book shipped on its planned launch date only because the surgery happened in-house.

The Cover

The cover came from a small layered Photoshop composition. Between the first proof and the second, somewhere in the layer stack, an adjustment layer survived that was not meant to.

On screen the adjustment layer produced nothing visible. On paper, printed at CMYK on the press, it produced large patches where the dark navy of the cover shifted by a few degrees towards grey-black. Subtle individually, plainly wrong once spotted. The patches did not correspond to any artwork one had intended; they were the ghost of a layer one had meant to delete.

The fix was unglamorous: flatten all adjustment layers before export, audit the PDF with a preflight tool, and repeat until the printed proof and the design intent agreed. A handful of clean exports later, the patches were gone. Lesson, pinned to the corkboard: flatten before export, preflight before upload, or be prepared to explain grey patches to your future self.

The Price

This one cannot be fixed as quickly. Amazon's KDP form invites one to enter the list price, and quietly means the net price. Local book VAT is then added on top at checkout: 7% in Germany, 9% in the Netherlands, 5.5% in France, 0% in Ireland and the UK (books are zero-rated), and so on across the marketplaces.

The form, one suspects, was designed from a US perspective where the sticker price is the final price and sales tax appears later at the till. For European marketplaces the same form silently switches meaning, and does not see fit to display the gross total next to the input. For additional fun, the form itself is rearranged for each book type (hardcover, paperback, Kindle), so the lesson learnt from one edition does not quite transfer to the next.

The round figures I had announced on the book page, I entered as though they were gross. At the moment of writing, a book announced at €90 sells for €96.30 on Amazon.de. Corrections to bring every marketplace back to round gross figures are already submitted. The prices will come down, not up. KDP advertises up to 72 hours for propagation; in practice 12 to 24 is typical. While the change is in flight, the dashboard is rather firmly locked: one cannot pause the launch, one cannot postpone by a day, one cannot edit the offending page.
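The net-versus-gross confusion is easy to reproduce. This is a small sketch using the VAT rates listed above; the helper names are hypothetical, and rounding to the cent with ROUND_HALF_UP is an assumption about how the marketplace rounds, not something KDP documents here.

```python
from decimal import Decimal, ROUND_HALF_UP

# Book VAT rates as listed in the text above.
BOOK_VAT = {
    "DE": Decimal("0.07"),
    "NL": Decimal("0.09"),
    "FR": Decimal("0.055"),
    "IE": Decimal("0"),
    "UK": Decimal("0"),
}
CENT = Decimal("0.01")


def gross_from_net(net: Decimal, rate: Decimal) -> Decimal:
    """What the customer pays when `net` is entered in the form."""
    return (net * (1 + rate)).quantize(CENT, rounding=ROUND_HALF_UP)


def net_for_gross(gross: Decimal, rate: Decimal) -> Decimal:
    """What to enter in the form so the customer pays roughly `gross`."""
    return (gross / (1 + rate)).quantize(CENT, rounding=ROUND_HALF_UP)


announced = Decimal("90")
for market, rate in BOOK_VAT.items():
    entered_as_net = gross_from_net(announced, rate)
    corrected_net = net_for_gross(announced, rate)
    print(f"{market}: entering {announced} charges {entered_as_net}; "
          f"enter {corrected_net} to charge {gross_from_net(corrected_net, rate)}")
```

Because both directions round to the cent, the corrected net does not always land on the round gross figure exactly; for a given marketplace it is worth checking the last column before submitting.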

A helpful table of the announced round prices (the target) is on the book page. If a euro or two matters, wait until tomorrow. If it does not, the book is on Amazon now, and I am genuinely grateful for every early reader.

The Kindle

The Kindle edition has been sitting in Amazon's review queue since yesterday evening. What Amazon reviews for, exactly, is not entirely documented: format compliance, metadata validity, some degree of content review. The queue is famously unhurried and famously in Amazon's hands. It will go live when it goes live; today if one is lucky, tomorrow if not.

Nothing I can do to accelerate it, which is itself rather a lesson in how much of a modern launch one does not, in fact, control.

The Direct-Order Option

A direct-order option from Voss'scher Verlag (the imprint) goes live on the book page later today. It offers:

Secured EPUB (with watermarking tied to the buyer's invoice, non-DRM but traceable)
Secured PDF (same approach, full-colour, readable on any PDF reader)
Direct-printed hardcover / paperback (shipped from a local print partner, the same content, without Amazon's 40% cut and with the author paying the EU VAT rather than the customer)

For readers who would rather not route through Amazon at all, this is the path. The pricing is competitive with Amazon's round figures, and the author's margin is measurably better. Both parties win: the reader gets a clean file with no DRM lockdown, the author keeps a living wage's worth of royalty per book.

What I Learned

Three days of fixing three problems at launch is not catastrophic. It is normal for a first book, and somewhat less normal for a fifth. The three rakes one steps on, in order, are:

Your tools are not tested at your scale. JetBrains Mono is used by millions of developers on screens. Nobody I have been able to find ever printed a long stretch of it in monospace on a 100% dot-gain matte press. The bug was there for years; it only surfaces when one subjects the font to the particular combination of ink, paper, and size that printing a book imposes. The lesson: if you are the first person to do something at a particular scale, expect to be the first person to find the bugs.
Exports accumulate history. A Photoshop file used across months of revisions carries every mistake one has ever made as an orphan layer. An export tool that does not actively flatten is an export tool that helpfully preserves those mistakes for the printer. The lesson: flatten is not optional, preflight is not paranoid.
Large marketplaces are designed for their dominant audience. Amazon's KDP is optimised for a US English-speaking self-publisher selling ebooks to a US audience. For a European hardcover with thirteen marketplaces and thirteen VAT regimes, the interface is a minor archaeology project. The lesson: when you are the edge case of a global platform, budget extra time for the platform not noticing.

None of this is meant as complaint. It is the cost of shipping. The book is, for all of that, out. The systems that carried it here were, as the subtitle argues, mostly invisible: the open-source font with its six-year public development history, the reproducible build pipeline for the PDF, the print-on-demand global logistics chain, the pricing-and-tax engine, the review-queue-and-propagation infrastructure of the world's largest bookshop. Each of these is someone's life work, built on somebody else's earlier life work, and they all very nearly got this book to the shelf on time without my noticing.

Almost.

By Vivian Voss — System Architect & Software Developer. Integrated by Design is out today from Voss'scher Verlag. Follow me on LinkedIn for daily technical writing.