DEV Community: Joshua Hall

Owning the Design System

Joshua Hall — Tue, 30 Jun 2026 13:00:00 +0000

Photoshop and Illustrator have lived in the same building at Adobe for more than thirty-five years. They share a brand, an installer, a roadmap, and a customer base. They do not share what the escape key does inside a text field. Hit escape in Illustrator and your edit commits. Hit escape in Photoshop and your edit cancels. Same vendor, same decades of design investment, opposite behavior on the single most common keystroke a user makes in a text input.

This is what I think about when someone tells me their company is "between design systems" or "about to start a design system project." Every software product needs a design system, in the sense that every product runs into the questions one exists to settle. Not every product has one. The gap between needing and having is where the escape key ends up meaning two different things, and most of the failure modes I see live in that gap, because companies treat the system as a project they'll finish rather than a function they have to own.

The Project Trap

The pattern is predictable. Leadership notices the products have drifted, sometimes cosmetically but usually a mix of visual and UX inconsistency. They fund a six-month design system effort, which is a reasonable first investment depending on scope. They assign a small team, which is also fine; a design system rarely needs a large one. The team ships a version on time. Then it gets disbanded or rotated back to product work, because the system has been "delivered."

Within a year the system is stale. Within two there are seventeen dropdowns again. Within three, leadership starts talking about "doing a real design system this time," and the cycle restarts. I've watched this play out at three companies I've worked with closely, and from the outside at many more.

The fix is structural. A design system is not a deliverable; it's a function. Someone has to be accountable for it for as long as the products live, and that accountability has to be staffed continuously: named people, on the org chart, evaluated against the health of the system. Without it, the system decays at roughly the rate the surrounding products evolve, which is fast.

That rate is accelerating. As product development speeds up under agentic engineering, so does the rate of decay, and agents introduce a failure mode of their own. Point a coding agent at a feature without strong harnesses, CI checks, and constraints, and it will happily invent a component that diverges from your approved variants, because inventing one is locally easier than finding the sanctioned one. The same force runs the other way. Build a strong system with those harnesses and agent-facing tools, and you accelerate agentic development instead of fighting it. The companies that get this right compound a durable advantage; the ones that don't keep paying down the same drift, faster than before.

Who's Accountable

Accountability means one person, not a committee, and not "design owns it" as a department-wide shrug. A specific individual whose job is the system and whose performance is judged on whether the system is healthy. What "healthy" means is the crux, and I define it precisely further down.

The background that fits best is heavy UX design experience, but the role benefits from range, frontend engineering especially, because designers and engineers are the two stakeholder groups the system serves. In a smaller org this is usually a principal-level person. In a larger one it can be a director or VP. The level matters less than the mandate.

Placement is where it gets interesting. This person typically reports into a cross-org product or technology unit, and I think it's important they sit as a peer to the product-line leaders rather than beneath one. The design system team will almost always be smaller than the teams it serves, and silos form fast when a small team lacks peer-level authority. Put it under a single product line and it inherits that line's priorities; the other lines stop treating it as theirs.

Reporting under Design is the most common arrangement and often the worst. The team becomes a designer-only group shipping beautiful Figma libraries that engineering implements incompletely, and the code drifts because there's no accountable counterpart on the build side. Reporting under Platform Engineering buys technical credibility at the cost of design quality; engineers optimize for "is it reusable" and "is it cheap to build" over "does it feel coherent." The structure I prefer for mature orgs is a peer to both, reporting to whichever executive can fund and protect the function, usually the CPO and sometimes the CTO. As a technical CPO myself I lean toward product, but that's how I happen to operate, not a universal rule.

What matters more than the reporting line is political weight. The team has to be able to tell a product team no, you can't fork this component, and have that no taken seriously. If it can't say no, the system fragments. If it can only say no, the system becomes an obstacle. It also needs enough influence to prioritize system work ahead of other engineering initiatives when the reasoning is sound. Often the two run in parallel; sometimes a system change genuinely has to finish first, and the team needs the standing to make that call stick.

The Team

A single design system team is small: one accountable lead, one or two senior designers, one or two senior engineers, and partial design-ops support. It scales with company size, but the shape stays consistent.

Seniority matters most at the top. Design system engineers are platform engineers, building infrastructure other engineers depend on, and the calls they make about API design, breaking changes, and deprecation get baked into every downstream product. The same holds for designers: reasoning about how a component lands across thirty feature contexts takes someone who has seen thirty feature contexts. The standard to hold isn't "big team" or "small team." It's the highest possible quality control, because a mistake in the design system behaves like a mistake in DevOps. It propagates everywhere, and in the worst case it takes down the whole org's surface for as long as it takes to unwind.

That standard is about the leadership of the team, not a wall around it. I've rotated junior engineers and designers through the system team, paired with a senior or architect-level mentor, precisely because the work is so high-impact downstream. It's the roadworks of the org: less glamorous than shipping product, and the thing that lets everything else get built. A tour on the system team shows people the full breadth of the org's work in a way they never see siloed in one business unit. Some find they love it and stay. Most should do a six-month tour, absorb how the system thinks, and carry that back to a product team. The trap to avoid is leaving people feeling stuck there forever.

For larger portfolios, one team isn't enough, and the structure that works is layered. A meta-system team owns what every product shares: brand, yes, but also shared workflows, shared interaction patterns, and platform-level components. Each product's design system builds on that foundation and adds what's specific to it. Audition and Premiere both need a media-file list (a shared component), both add media to a project (a shared workflow), both put tracks on a timeline (a shared interaction model). Audition never exports a video file; Premiere never takes a live MIDI feed, so those stay product-specific. This is the layer where Photoshop and Illustrator should have inherited one shared answer for how text boxes behave on a canvas, the same in every Adobe app with a visual artboard. They wouldn't have started that way, since most of these products were separate before they were Adobe's, but as each migrated onto the shared system it would have inherited the common behavior. The escape key would mean one thing.

How You Know It's Healthy

I said the accountable person is judged on whether the system is healthy. That word does no work unless you define it, so here's the definition I use: four measures, each one a question the team can answer with data.

Adoption is the first, and the one most teams already reach for. The question is simple: do product teams actually use the system? It's measurable without much effort. Parse the codebase for system component instances versus inline implementations. Count the buttons that are <Button> against the ones that are <div role="button">. Flag inline styles that bypass design tokens. None of this is hard; most companies just never ask anyone on the team to measure it.
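A rough sketch of that count in shell, assuming a React-style codebase with .jsx and .tsx files. The function names and the file extensions are my assumptions for illustration, and grep-based counting is only approximate; an AST-based tool would be more accurate on a real codebase.

```shell
#!/usr/bin/env bash
# Rough adoption count: system <Button> usages vs. hand-rolled role="button" divs.
# Assumption: components live in .jsx/.tsx files under the directory passed in.

count_matches() {
  # $1 = extended regex, $2 = directory to scan.
  # grep -o prints one line per match, so wc -l counts occurrences, not files.
  { grep -rEoh --include='*.jsx' --include='*.tsx' -e "$1" "$2" 2>/dev/null || true; } \
    | wc -l | tr -d ' '
}

adoption_report() {
  local dir="$1"
  local system inline total pct
  system=$(count_matches '<Button[ />]' "$dir")
  inline=$(count_matches '<div[^>]*role="button"' "$dir")
  total=$((system + inline))
  if [ "$total" -eq 0 ]; then
    echo "system=0 inline=0 adoption=n/a"
    return 0
  fi
  # Integer percentage, rounded down.
  pct=$((100 * system / total))
  echo "system=$system inline=$inline adoption=${pct}%"
}
```

Calling `adoption_report src` on a schedule and logging the line gives a trend rather than a snapshot, and the trend is what makes a drop in adoption visible.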

Throughput is the one teams forget, and it's the closest thing to a business case. The question is whether the teams consuming the system ship faster because of it. The proof is in cycle time on feature work that leans on the system versus work that doesn't. A healthy system shows up as product teams moving faster on the parts the system covers, because they're not relitigating primitives for every feature.

Completeness is distinct from adoption, and keeping the two separate is what makes this framework crisp. Adoption asks whether teams use the system. Completeness asks whether they can. The measure is how often a team hits a wall that needs a new component, state, or pattern before it can proceed. A system can have high adoption and low completeness: everyone uses it, and everyone is also blocked twice a sprint waiting on it. The hard cases live here. Does the system hold up across device sizes and interaction models, across screen readers and locales, against WCAG and ADA, when a new feature needs something the system never anticipated?

Downstream value is the hardest to measure and the largest in magnitude. The question is what the system is worth in compounded efficiency. The honest comparison is to scaffolding everywhere else in software: writing HTML and CSS from scratch versus reaching for Tailwind, building authentication from scratch versus configuring Clerk or Cognito. Good scaffolding compresses weeks of work into hours of configuration. A design system does the same for product UI, and a system built with agent-facing documentation and tooling (an MCP server the agents can query, skills scoped to your components) compounds it further, because every agent-assisted feature inherits the system instead of reinventing around it. I'd expect a strong system to move total output by double-digit percentages on that axis alone.

Measuring these changes how the team responds when a product team forks a component. Without numbers, the response is moralistic: they were wrong, they should have used the existing component. Moralism breeds resentment and eventually silent forking. With numbers, the response is diagnostic: adoption on this component dropped last quarter, why? The diagnostic stance produces a working system. The moralistic stance produces a working dashboard and a fractured product.

The goal is to make the system the platform of choice, the easiest path most of the time, with enough shared ownership that people want to use it. When a gap shows up, it should land as opportunity: "I get to help define a thing everyone will use," or "I just found something that already does this better than what I was about to build." What you're avoiding is the third reaction, "this is going to slow me down for two months and put me on the wrong side of a slipped schedule." That reaction is how systems die.

The Productive Conversation

Authority to say no only works if no is almost never the answer. When a product team needs to fork a component, the team's response matters more than its veto. The pattern that works starts the same way every time. The product team says "here's what we're building, and the existing component doesn't fit because X," and the system team's first move is to listen and confirm the gap is real. Then it branches three ways.

Sometimes the gap is real and small: "Agreed, we can patch the existing component within a week. Use it for the prototype and we'll ship the update before you launch." Sometimes the gap is real and large: "Build what you need bespoke, document it as a temporary fork, tag it so we can track it, and we'll pull the pattern back into the system next release." That's the legitimate fork: explicit, time-bounded, with a path home. And sometimes the better move is a redirect: "Did you consider this alternative with existing components? It's slightly off-optimal for your case but stays in the system, and we'll watch whether your case is common enough to warrant a change."
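One way to make "tag it so we can track it" concrete is a marker comment the system team can search for. The sketch below assumes a hypothetical convention, a `DS-FORK expires=YYYY-MM-DD` comment placed next to each temporary fork; neither the marker nor the function name comes from any real tool.

```shell
#!/usr/bin/env bash
# List temporary forks whose agreed expiry date has passed.
# Hypothetical marker convention inside source files:
#   // DS-FORK expires=2026-09-30 reason="..."

expired_forks() {
  # $1 = directory to scan, $2 = today's date as YYYY-MM-DD
  local dir="$1" today="$2"
  { grep -rHo -E 'DS-FORK expires=[0-9]{4}-[0-9]{2}-[0-9]{2}' "$dir" 2>/dev/null || true; } |
    awk -v today="$today" '{
      n = split($0, parts, "expires=")
      # ISO dates sort lexicographically, so a string comparison is enough.
      if ((parts[n] "") < (today "")) print
    }'
}
```

Running `expired_forks src "$(date +%F)"` before each release gives the system team a list of forks that were promised a path home and have not taken it yet.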

The wrong response, the one that kills system teams, is "no, use what we have, your case isn't important enough." That trains product teams to stop having the conversation, and the moment they stop, silent forking starts and the seventeen dropdowns come back.

In a large enough org, rotation reinforces this. Cycling people through the system team, the way a factory floor rotates positions, builds a deeper appreciation for the work and a web of social connections with the core team everyone depends on. As long as people like working with that core team, it's one of the most effective ways I've found to build lasting esprit de corps across an org.

The Failure Mode to Avoid

The system teams that die all die the same way. Treated as a project that got delivered. No authority to hold the line on adoption. No relationships to make the productive conversation work. No measurement, so they couldn't show leadership why the investment paid off. Funding pulled after a quarter or two for lack of visible ROI, the team disbanded, the products fragmented again, and three years later someone proposes a real design system this time.

The alternative isn't exotic. Fund the function continuously. Give it the authority to operate and the peer standing to prioritize. Staff it with people who can hold the bar. Define healthy as adoption, throughput, completeness, and downstream value, and measure all four, so the ROI is visible before anyone questions it. Most companies skip this not because it's hard but because a design system doesn't show up on the org chart as a function the way platform engineering does.

So here's the test, if you're building a product company meant to run more than a few years. The need isn't in doubt; every product runs into the questions a design system exists to answer. Whether you have a design system is the only thing that varies, and that's a question about ownership, not components. Ask who's accountable for the system by name, what number tells them it's healthy, and what happens to that number when they leave. If you can't answer all three, you're in the gap between needing and having. What's sitting there is a collection of components waiting to drift, and the next dropdown is already being built.

Voice Is the Smoking of Office Interaction

Joshua Hall — Thu, 25 Jun 2026 16:35:00 +0000

I'm dictating this draft into a transcription tool because the first pass of a post is a long stream-of-consciousness, and talking is the fastest way to get it out of my head. I'll edit it later with my hands on a keyboard, like a sane person.

That split — voice for capture, keyboard for refinement — is the one voice pattern I've found that reliably earns its keep. It isn't even new. Writing a book report in the 1980s, I'd dump every idea onto the page first, then read it back, build a rough order, then rewrite and rearrange until it actually said something. The workflow never changed; voice just lets me sprint the first step faster. Almost everywhere else I see designers reaching for voice as a primary input, I want to ask whether they've ever tried to use it in a room with another human being.

Where Voice Actually Works

Dictating a loose draft alone in my office works for a reason I didn't expect. The problem with typing the first dump isn't speed; it's that the moment I type, I slip into editor mode and start polishing sentences before the ideas are even out. Voice blocks that. I can't easily go back and fix what I just said, so it forces me forward in a straight line until the whole mess is on the page. It's a limiter that imposes a behavior I know is better but won't choose on my own.

Hands-busy contexts earn it too, and earn it cleanly. Capturing a thought while driving is both useful and far safer than the alternative of thumbing at a screen; the case is so obvious that "call my wife" and "add a stop for gas" are the rare voice commands even skeptics reach for. Setting a kitchen timer mid-recipe, hands covered in flour, is the single thing I use Siri for most, and I doubt I'm alone; it may be the most-used thing the assistant does. Whether that's because timers are a perfect high-frequency hands-busy fit or because Siri is too unreliable for anything harder is genuinely unclear (probably some of both).

And it stays narrow even when I'm alone. For real design or engineering work, typing slows me to the cadence the problem needs: fast enough to keep up with the thinking, slow enough to organize it, and that instinct to pause and add nuance is a feature, not friction. Talking at a screen runs weirdly fast for that kind of work; the register doesn't match. Then there's the reliability tax. Across every transcription tool I've used over twenty-five years (Dragon, the macOS built-in, Wispr Flow) there's a real failure rate where I thought I'd started recording and hadn't, and nothing makes me rage-quit a session faster than monologuing for ten minutes to a machine that captured none of it.

That's the ceiling on voice when the room is empty. Add one more person and it drops through the floor — the interaction goes from awkward to actively rude.

The Office Problem

Picture a general contractor's office. There are well over four hundred thousand home-building businesses in the United States, and once you get past the few hundred big national builders, nearly eighty percent of them are tiny: self-employed operators and two-or-three-person shops. The office matches: a couple of people in a light-industrial unit or a tired strip mall, because the building is the point and the office is just where you meet a client and run takeoffs. Nobody's spending money on a nice one, and nobody needs to.

The size barely matters, though. Scale up to Intel and you've traded the strip mall for a cube farm, which is worse; there is nothing quite as sensory-overloading as a sales floor packed with cubicles. Some people thrive in that. In my experience the people who build software (designers, engineers, the analytically minded) mostly can't think through that much noise, for good reasons. It's self-defeating fast: even if voice is a small optimization for one person, in a shared room that one person is wrecking everyone else's concentration.

Now drop voice-first software into that office. One person talking at their computer is fine. Two is a soundscape nobody can think in. Three is unworkable. The entire reason the room exists is to be a small quiet place where two or three people can concentrate, and voice-first software breaks that premise on contact. The same math ruins any shared space: a café where everyone dictates is intolerable for the baristas and the next table over; an open-plan floor where voice replaces typing is just the boiler-room sales pit rebuilt on purpose. We've been there. We didn't like it.

So I think some offices will eventually ban voice input at the desk for the same reason they once banned smoking — not because it harms the speaker, but because one person's input is everyone else's externality, and at some density the host has to step in. Whispering into the mic doesn't fix it; it just makes you look like you're plotting something. Headsets don't fix it, because the talking is the problem, not the listening. And the honest alternative isn't a better mute button. It's rethinking whether we need to be in the shared room at all, and for how long. George Jetson clocked a nine-hour week pushing one button; it's well past time we took that as a target rather than a punchline.

The Minority Report Test

Voice sits in the same bin as the gestural interface Tom Cruise waves around in Minority Report. It looks spectacular on screen. Imagine doing it for eight hours and you realize you've never wanted to move your arms that much; a coworker drifting past at the wrong second smears everything you were holding in the air. The film literally shows the strain and somehow audiences walked out wanting the gloves instead of reading them as a warning.

There's probably a whole essay in how badly science fiction designs interfaces, because it's optimizing for what reads on camera, not what survives a workday. Hackers is my favorite tell: swooping 3D cityscapes of glowing files, when the real thing was, is, and will almost certainly remain a command prompt. What looks good on screen and what's good to use are mostly different things, and voice has been coasting on the former.

So here's the test I run on any input modality: imagine doing it for forty (or sixty) hours a week on a floor of peers all doing the same thing. Voice fails. Gestures fail; your arms would simply give out, the same way plenty of people's hands give out from typing. That last part is the fair steelman for voice. Fewer hours at the keyboard means less repetitive strain, which is a real benefit, and I can't talk for eight hours any more than I can type for eight. The answer isn't to crown one modality. It's that I need several, and voice can be one of them — but only if the design accounts for the room, because keyboard and mouse clear a bar voice can't. They're near-silent, screen-only, and personal. They scale into dense shared space precisely because they impose on no one else.

This is the part the voice-first crowd skips, and it's the whole game. They evaluate the interaction in isolation (does it work for one user, one task, one ideal environment) and never ask whether it works in the room. And the room is the deeper point: as a designer you are responsible for the entire context of use, including the parts you don't control. I can't stop a user from opening my app on a packed train. If it supports mobile, they're using it exactly as intended, so I have to design for that train. Not by forbidding it, but by never forcing the awkward path. If the only way to enter a destination is to say it out loud because someone decided "voice matters while driving," that's just as broken in reverse, because half the time I'm parked and I just want to type the address. You design for the whole situation and the whole audience, not the slice you can see in the demo.

The Restraint Pitch

When someone on the team pushes voice as a hero feature, ask two questions. First: what physical environment is the typical user actually in? If there's another human within earshot, you're not designing for voice; you're designing a fight with the user's coworkers, family, baristas, and seatmates. Second, and this is the one people skip: when and why would a user not be able or willing to speak? The answer is almost never "rarely," which means voice fits somewhere around thirty, forty, maybe sixty percent of sessions (a real but partial slice) and almost never the ninety-nine percent that would justify making it the front door.

That percentage is the tell, and it's the same one from the Chat Is an Input argument, because voice is chat, just entered by mouth instead of keyboard. It inherits every limit of language as an interface: the imprecision, the ambiguity, all of it, then adds an acoustic externality typing never had, and arguably gets worse, since few of us are as precise out loud as we are on the page. It's why we built distinct interfaces for spreadsheets and CAD and design tools in the first place; sometimes a picture beats a thousand words, and sometimes a thousand words can't do what a single dragged handle does. A growing crowd in the Bay Area now narrates at a fleet of agents all day, and good for them; I do a version of it myself. But that is not how a general contractor works, or an accountant, or a business analyst, and mistaking your own workflow for your users' is the original sin here.

The job of design isn't to expose every modality the underlying tech can support. It's to choose the ones that work for the user and the people around the user, build those well, and let the operating system catch the edge cases. Every major OS already does competent voice-to-text for anyone who wants it. Voice is an edge case for most software. The discipline is having the restraint to let it stay one, to design for the room you can't see, not just the user in front of you.

Building a Universal Container System (So I Never Have to Write Another Dockerfile)

Joshua Hall — Sat, 13 Jun 2026 14:30:00 +0000

TL;DR: I built a modular Dockerfile system that lets you compose dev and prod containers from build arguments instead of writing custom Dockerfiles. It includes 28 feature modules covering 100+ tools, weekly automated version updates with testing, and support for Python, Node, Rust, Go, Kubernetes, and more. It saves me anywhere from a few hours to a day or two per project setup, plus a lot of downstream effort on updates and keeping environments coordinated.

The Problem That Wouldn't Go Away

You know that feeling when you're starting a new project and you think "oh great, I get to set up a devcontainer again"? That was me, every single time. Another day lost to writing yet another Dockerfile, configuring Python, Node, databases, cloud tools, dev tools... rinse and repeat.

After several projects, I realized something depressing: I was literally copying and pasting the same 300 lines of Dockerfile with minor tweaks. The only thing changing was whether I needed Python 3.12 or 3.13, or if this project used Postgres or MySQL.

But the real pain hit when maintenance season arrived. Python releases a security patch? Great, now I get to update six different repos. New team security practice? Update six Dockerfiles. New teammate starts a project? They copy the Dockerfile from the last project (which was already out of date), and now we have seven slightly different configurations.

This is ridiculous, I thought. I'm a designer who understands software engineering best practices and patterns. I should be able to solve this.

The core problem wasn't just the repetition. It was the maintenance burden. I don't have time to manually track when Python 3.13.7 comes out, or when kubectl updates, or when some npm package has a critical vulnerability. I needed automation that would handle the routine stuff and only bother me when something actually broke.

So I built it.

The Core Idea

What if, instead of writing custom Dockerfiles, you could just declare what you want?

docker build -t myproject:dev \
  -f containers/Dockerfile \
  --build-arg INCLUDE_PYTHON_DEV=true \
  --build-arg INCLUDE_NODE_DEV=true \
  --build-arg INCLUDE_POSTGRES_CLIENT=true \
  --build-arg NODE_VERSION=20 \
  --build-arg PYTHON_VERSION=3.13 \
  .

That's it. No custom Dockerfile to write or maintain. Just build arguments that compose pre-tested features into exactly what you need.

[Diagram: container layers stacking. Base layer: Debian Slim. Adding INCLUDE_PYTHON_DEV=true gives Python 3.13, Poetry, Pytest, Black, and Mypy. Adding INCLUDE_NODE_DEV=true gives Node 20, TypeScript, ESLint, and Jest. Adding INCLUDE_KUBERNETES=true gives kubectl, helm, and k9s.]

The solution ended up being a single, modular Dockerfile that can create any development or production container through build-time configuration. I designed it as a git submodule so I could add it to any project and immediately have access to dozens of pre-built features. This means you get the benefits of a centralized, maintained Dockerfile without losing the ability to customize per-project.

Want Python with all the dev tools? INCLUDE_PYTHON_DEV=true. Need to add Kubernetes tools later? INCLUDE_KUBERNETES=true. The Dockerfile doesn't change. The project doesn't change. You just flip a switch.
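To make the "flip a switch" idea concrete, here is a small dry-run sketch that turns a list of feature names into the build command shown earlier. The helper name `compose_build_cmd` is hypothetical, and it assumes every feature follows the INCLUDE_<NAME>=true pattern; only the flags that appear in this post are confirmed, so check the repository before relying on any other name.

```shell
#!/usr/bin/env bash
# Dry run: print the docker build command for a set of features.
# Nothing is executed; the output can be inspected or piped to sh.

compose_build_cmd() {
  # $1 = image tag, remaining args = feature names such as python_dev
  local tag="$1"; shift
  local cmd="docker build -t $tag -f containers/Dockerfile"
  local feature upper
  for feature in "$@"; do
    # python_dev -> PYTHON_DEV (assumed naming pattern)
    upper=$(printf '%s' "$feature" | tr '[:lower:]' '[:upper:]')
    cmd="$cmd --build-arg INCLUDE_${upper}=true"
  done
  echo "$cmd ."
}
```

For example, `compose_build_cmd myproject:dev python_dev node_dev postgres_client` prints the same three INCLUDE flags used in the command above, without the version pins.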

Repository: https://github.com/joshjhall/containers

Let me show you what this actually looks like in practice.

What This Actually Solves

Let me be concrete about what changes when you use this.

Setup Time: Days to Minutes

Before: Starting a new Python API project meant 1-2 days of Dockerfile work. Copy an old one, update Python version, fix broken apt packages, research how to install AWS CLI, configure poetry, set up pytest, add black and mypy, configure non-root user, set up entrypoint scripts... you know the drill.

Now: Add the git submodule, set INCLUDE_PYTHON_DEV=true, and you're done in 10 minutes. Everything is already there and tested.

[Diagram: setup timeline. Before: Day 1, research Docker setup and write the Dockerfile; Day 2, debug apt packages and fix the Python install; Day 3, add dev tools and configure the entrypoint; done in 2-3 days. After: Minute 1, add the git submodule; Minute 5, set build args; Minute 10, done. Callout: 2-3 days → 10 minutes.]

The time savings alone justify this. But honestly, the bigger win is the mental overhead. I no longer dread starting new projects because of Docker setup.

Consistency: Stop the Drift

Before: Project A has Python 3.11 with poetry 1.4. Project B has Python 3.12 with poetry 1.5. Project C has Python 3.13 with poetry 2.0. They all work... differently. New developer joins? Good luck figuring out which version of which tool you need for which project.

Now: All projects use the same foundation. Update the submodule, and all projects move forward together. Same Python version. Same tool versions. Same configurations. It's so much saner.

Dev/Prod Parity: One Dockerfile, Two Environments

Before: Separate Dockerfiles for dev and prod. Try to keep them in sync. Fail. Ship to production. Discover your dev environment had a dependency that prod doesn't. Debug in production. Not fun.

Now: Same Dockerfile for both. Just different build arguments:

# Development: Full tooling
docker build --build-arg INCLUDE_PYTHON_DEV=true ...

# Production: Minimal runtime
docker build --build-arg INCLUDE_PYTHON=true ...

If it works in dev, it works in prod. Same base, same versions, same everything.
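A minimal sketch of how a project script might keep the two builds honest: one function, one Dockerfile, one pinned version, and only the feature flag changes with the environment. The function name and the image tag scheme are assumptions for illustration; INCLUDE_PYTHON_DEV, INCLUDE_PYTHON, and PYTHON_VERSION are the arguments named in this post.

```shell
#!/usr/bin/env bash
# Print the build command for an environment. Dev and prod differ only in
# which Python feature flag is set; the Dockerfile and version pin are shared.

build_cmd_for() {
  # $1 = dev | prod, $2 = Python version to pin
  local env="$1" python_version="$2" flag
  case "$env" in
    dev)  flag="INCLUDE_PYTHON_DEV" ;;  # full tooling
    prod) flag="INCLUDE_PYTHON" ;;      # minimal runtime
    *)    echo "unknown environment: $env" >&2; return 1 ;;
  esac
  echo "docker build -t myproject:$env -f containers/Dockerfile --build-arg $flag=true --build-arg PYTHON_VERSION=$python_version ."
}
```

Because the version is passed in once and reused, a dev image and a prod image built from the same call site cannot quietly end up on different Python releases.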

Adding Features: From Hours to Seconds

Before: Need to add Redis to your project? Time to research the correct apt package name, figure out how to configure the client, update environment variables, test it... 2 hours later you have Redis support.

Now: --build-arg INCLUDE_REDIS_CLIENT=true. Done. Tested. Works.

This is the kind of thing that makes development feel fast again.

Security & Updates: Set It and Forget It

Before: Python security patch comes out. Now you get to manually update six different Dockerfiles in six different repos. Miss one? Hope your security team doesn't notice.

Now: The automation handles it. The update happens automatically, gets tested, and merges if everything passes. You wake up and it's done. Or you get notified if something broke and you need to intervene.

Security best practices are built into the system. Non-root users. Minimal base images. Vulnerability scanning. It's all there by default.

The Automation Philosophy (Or: How I Stopped Worrying and Learned to Trust CI)

Here's the thing that really makes this system practical: I genuinely don't have time to track version updates manually. Python 3.13.7 comes out? I'll find out... eventually. kubectl 1.34 releases? I'm probably three versions behind already. Security patch for some npm package? I might hear about it on Hacker News if it's bad enough.

This is a terrible way to manage infrastructure. So I automated it completely.

The Problem with Manual Updates

Every tool in your container has a version. Python, Node, kubectl, Terraform, AWS CLI, poetry, npm... the list goes on. Each one gets updated regularly. Some weekly, some monthly. Tracking all of them manually? That's not a job, that's a punishment.

And it's not just tracking. You need to test each update. Does the new Python version break your linter? Does the new kubectl version have API changes? Does the new Node.js version introduce a breaking change in how it handles modules?

I needed a system that would handle the boring parts and only bug me when something actually needed my attention.

The Solution: Sunday Morning Automation

Every Sunday at 2am UTC, the system wakes up and checks every pinned tool version against the latest releases:

Python, Node, Rust, Go, Ruby, Java
kubectl, helm, Terraform, AWS CLI, Google Cloud SDK
Poetry, npm, cargo, and all the other package managers
Development tools, database clients, everything

If it finds updates, it creates a new branch with all the version bumps, updates the Dockerfile and CHANGELOG, commits everything, and pushes to GitHub.

No human intervention. No ticket in my inbox. Just automatic detection and branch creation.
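
The detection step comes down to comparing each pinned version with the latest release. Here is a minimal sketch of that comparison, assuming GNU sort is available (for `sort -V`). The helper name `is_newer` is hypothetical, and the repository's actual checker may work differently:

```shell
# Hypothetical helper, not taken from the repository.
# Succeeds when the candidate ($2) is strictly newer than the pinned version ($1).
is_newer() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$2" ]
}

# Example: decide whether one tool needs a bump.
pinned="3.13.6"
latest="3.13.7"
if is_newer "$pinned" "$latest"; then
    echo "python: $pinned -> $latest"
fi
```

Plain string comparison would rank 1.10.0 below 1.9.0, which is why the sketch relies on version sort.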

The Test Suite That Makes It Possible

This is where it gets good. That new branch triggers the full CI gauntlet—and I mean full:

535+ unit tests on every bash script (one for each feature installation, version check, cache configuration, error handling path, and Debian compatibility scenario)
Shellcheck for code quality and common bash pitfalls
Gitleaks scanning for accidentally committed secrets
Full Docker builds for all six variants (minimal, python-dev, node-dev, cloud-ops, polyglot, rust-golang)
Integration tests that actually use the tools—compile code, run tests, execute commands
Debian compatibility checks spot-testing across versions 11, 12, and 13
Security scanning with Trivy for known vulnerabilities

If the update breaks something, the tests catch it. If it introduces a security vulnerability, Trivy catches it. If it has Debian compatibility issues, the matrix testing catches it.

[Figure: weekly update flow. Sunday 2am UTC: check for updates (example: Python 3.13.7, kubectl 1.31.2), create branch 'auto-update-2025-10-27', run the full CI pipeline (535+ unit tests, 6 variant builds, security scan). If all tests pass: auto-merge to main, create tag v1.2.3, notify "✅ Patch Release v1.2.3 Deployed". If not: preserve the branch and notify "❌ CI Failed - Python 3.13.7 breaks black formatter - Manual review required".]

What Actually Happens Next

Here's the magic part:

If everything passes: The system auto-merges to main, creates a version tag, and sends me a Pushover notification on my phone: "✅ Patch Release v1.2.3 Deployed". I wake up to updated containers. I did nothing. It's beautiful.

[Figure: Pushover notification from "Container System" at 6:47 AM: "✅ Patch Release v1.2.3 Deployed", with subtext "Updated: Python 3.13.6→3.13.7, kubectl 1.31.1→1.31.2".]

If anything fails: I get a high-priority Pushover notification (the kind that makes noise even in Do Not Disturb mode): "❌ CI Failed - Python 3.13.7 breaks black formatter". The branch is preserved for manual review. Now I actually need to get involved.

This means 95% of version updates happen automatically. I only get involved when something actually breaks. That's the level of automation I was looking for.
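
The pass/fail branching can be sketched in a few lines of shell. This is an illustration only: the function names are hypothetical, the action is printed rather than performed, and the priority numbers assume Pushover's message priority scale (0 for normal, 1 for high):

```shell
# Hypothetical sketch of the post-CI decision; names are illustrative.

# Compute the next patch tag, e.g. v1.2.2 -> v1.2.3.
next_patch_tag() {
    local version="${1#v}"
    local major="${version%%.*}"
    local rest="${version#*.}"
    local minor="${rest%%.*}"
    local patch="${rest#*.}"
    printf 'v%s.%s.%s\n' "$major" "$minor" "$((patch + 1))"
}

# $1 = CI exit status, $2 = latest release tag
decide_release() {
    if [ "$1" -eq 0 ]; then
        # Success: merge, tag, and send a normal-priority notification.
        printf 'merge tag=%s priority=0\n' "$(next_patch_tag "$2")"
    else
        # Failure: keep the branch and send a high-priority notification.
        printf 'preserve-branch priority=1\n'
    fi
}

decide_release 0 v1.2.2
decide_release 1 v1.2.2
```

Keeping the decision in one small function makes it easy to unit test without touching git or the notification service.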

Important note on update control: This automation updates the containers repository itself. Projects that include containers as a git submodule maintain full control over when they adopt new versions. You can pin to a specific version for stability and test updates on your schedule (treating it like any other dependency managed via npm, cargo, etc.). Or you can automate pulling updates if you want to stay current automatically. The choice is yours—you get the benefits of automated testing and version tracking without being forced to adopt updates before you're ready.

How It Works Under the Hood

I designed this as a git submodule that you add to any project:

git submodule add https://github.com/joshjhall/containers.git containers

One Dockerfile, Many Configurations

A single Dockerfile accepts build arguments to enable features:

# Dockerfile (simplified)
ARG INCLUDE_PYTHON_DEV=false
ARG INCLUDE_NODE_DEV=false
ARG INCLUDE_RUST_DEV=false
# ... dozens more features

RUN if [ "$INCLUDE_PYTHON_DEV" = "true" ]; then \
      /tmp/build-scripts/features/python-dev.sh; \
    fi
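
Because each feature is a boolean build argument, the build command can be generated from a list of feature names. Here is a sketch, assuming the INCLUDE_<FEATURE> naming shown above holds for every feature. `feature_args` is a hypothetical helper, and the command is only printed, not run:

```shell
# Hypothetical helper: turn feature names into --build-arg flags.
feature_args() {
    local feature name
    for feature in "$@"; do
        # python-dev -> PYTHON_DEV
        name="$(printf '%s' "$feature" | tr 'a-z-' 'A-Z_')"
        printf ' --build-arg INCLUDE_%s=true' "$name"
    done
}

# Print the command instead of executing it.
echo "docker build$(feature_args python-dev redis-client) -t myproject:dev ."
```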

Modular Features

I broke everything into self-contained installation scripts:

lib/
  features/
    python.sh          # Python runtime
    python-dev.sh      # + poetry, pytest, black, mypy
    node.sh            # Node.js runtime
    node-dev.sh        # + TypeScript, ESLint, Jest
    rust.sh            # Rust toolchain
    docker.sh          # Docker CLI (for Docker-in-Docker)
    kubernetes.sh      # kubectl, helm, k9s
    aws.sh             # AWS CLI
    postgres-client.sh # psql
    # ... 28 feature modules total

Each script validates its installation, configures caching, handles Debian version differences automatically, and follows security best practices. Critically, each is independently testable.

The 28 feature modules install 100+ individual tools. For example, golang-dev.sh alone installs 34 Go development tools (gopls, dlv, golangci-lint, staticcheck, etc.), while rust-dev.sh installs 11 Rust tools, and dev-tools.sh adds 10+ productivity utilities.

Smart Caching Strategy

BuildKit cache mounts are configured for every package manager:

RUN --mount=type=cache,target=/cache/pip \
    --mount=type=cache,target=/cache/npm \
    --mount=type=cache,target=/cache/cargo \
    pip install poetry && \
    npm install -g typescript

[Figure: bar chart of build time in minutes. First build: 8.5 minutes. Rebuild with cache: 1.2 minutes (about 7x faster). Cached items: pip, npm, cargo, Go modules.]

Rebuilds are fast even when switching features. I've had builds that took 8 minutes the first time complete in under 90 seconds on subsequent runs.

Runtime Initialization

First-time setup scripts run on container start:

lib/runtime/
  first-time-setup.d/   # Run once per container
    20-aws-setup.sh     # Check AWS credentials
    20-kubernetes-setup.sh
  startup.d/            # Run every time
    10-docker-socket-fix.sh

This means users get helpful setup messages instead of cryptic errors when something's misconfigured.
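
The two directories differ only in when they run. Below is a sketch of an entrypoint helper that implements that split, assuming a marker file records that first-time setup has already happened. The marker path and function names are hypothetical, not taken from the repository:

```shell
# Hypothetical sketch of the run-once / run-always split.

# Run every executable *.sh in a directory, in lexical order.
run_dir() {
    local dir="$1" script
    [ -d "$dir" ] || return 0
    for script in "$dir"/*.sh; do
        [ -x "$script" ] || continue
        "$script" || echo "warning: $script failed" >&2
    done
}

# $1 = runtime root containing both directories, $2 = marker file
run_startup() {
    local root="$1" marker="$2"
    if [ ! -e "$marker" ]; then
        run_dir "$root/first-time-setup.d"
        : > "$marker"   # remember that first-time setup ran
    fi
    run_dir "$root/startup.d"
}

# Example call (paths are placeholders):
# run_startup /path/to/runtime /path/to/.first-time-setup-done
```

A failing script only produces a warning here, so one misconfigured tool does not block the container from starting. Whether that is the right trade-off depends on the project.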

Testing: Confidence Instead of Crossing Fingers

Here's something that still surprises people: I have 535+ unit tests for bash scripts. Yeah, really.

Most Dockerfile projects have zero tests. You write it, build it, hope it works, and find out it doesn't work when someone else tries to use it three months later. That's not acceptable for production infrastructure.

So I built a complete testing framework specifically for bash. It has assertions, mocking, container testing utilities, and detailed reporting:

#!/usr/bin/env bash
source "../../framework.sh"
init_test_framework

test_python_installs_correctly() {
    # Mock external commands
    mock_function "curl" "echo 'mocked download'"

    # Run the installation
    source lib/features/python.sh

    # Assert expected behavior
    assert_success "Installation should succeed"
    assert_file_exists "/usr/local/bin/python"
}

test_python_version() {
    local image="test-python:latest"
    assert_command_in_container "$image" "python --version" "Python 3."
    assert_executable_in_path "$image" "poetry"
}

run_test test_python_installs_correctly "Python installs correctly"
run_test test_python_version "Python version and poetry are available"
generate_report

Running tests for features/python.sh...
✓ test_python_installs_correctly (0.3s)
✓ test_python_version_check (0.2s)
✓ test_poetry_available (0.4s)
✓ test_cache_configuration (0.2s)
✗ test_rust_compiles (1.2s)
  Expected: rustc command available
  Got: command not found

Tests: 447 passed, 3 failed, 0 skipped
Time: 2m 14s
Coverage: 94.2%

The framework provides assertion functions (assert_success, assert_equals, assert_file_exists), container testing (assert_command_in_container), a mocking system, and detailed reporting. Each test runs in isolation.
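
To show how little machinery an assertion and a mock need in bash, here is a stripped-down sketch. These are simplified stand-ins written for illustration, not the repository's framework.sh, and the real functions may take different arguments:

```shell
# Illustrative only: simplified stand-ins, not the repository's framework.sh.
TESTS_FAILED=0

# assert_equals EXPECTED ACTUAL MESSAGE
assert_equals() {
    if [ "$1" = "$2" ]; then
        echo "✓ $3"
    else
        echo "✗ $3 (expected '$1', got '$2')"
        TESTS_FAILED=$((TESTS_FAILED + 1))
    fi
}

# mock_function NAME BODY: shadow a command with a shell function.
mock_function() {
    eval "$1() { $2; }"
}

# The mock replaces curl, so no network access happens here.
mock_function curl "echo 'mocked download'"
assert_equals "mocked download" "$(curl https://example.invalid/file)" "curl is mocked"
```

Shell functions take precedence over commands found on PATH, which is what makes this style of mocking work.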

What Gets Tested

Unit Tests (535+): Every bash script tested in isolation—base system setup, all 28 feature modules, runtime scripts, and user-facing commands. Each script has tests for successful installation, version verification, error handling, cache configuration, and Debian compatibility.

Integration Tests: Full container builds for six real-world scenarios:

minimal: Base system only, for when you want to start from scratch
python-dev: Python stack with databases, for API development
node-dev: Node.js stack with test frameworks, for web development
cloud-ops: Kubernetes + Terraform + AWS + GCloud, for infrastructure work
polyglot: Python + Node.js together, for full-stack projects
rust-golang: Rust + Go, for systems programming

Each integration test builds the container, verifies tools are installed correctly, runs version checks, tests actual functionality (compile code, run tests), and verifies cache configuration.

Debian Matrix Testing: The CI pipeline spot-checks compatibility across Debian 11 (Bullseye), 12 (Bookworm), and 13 (Trixie) to catch compatibility issues before they ship. The system can be configured for more thorough cross-version testing when needed.

Security Testing: Shellcheck for static analysis, Gitleaks for secret scanning, Trivy for vulnerability scanning of the final images.

When the CI pipeline runs, if something fails, I know exactly which feature broke, what assertion failed, which Debian version it affects, and whether there are security implications. No more "it works on my machine" mysteries.

By the Numbers: What 42,000 Lines of Code Looks Like

Let me show you what's actually in this repository, because the scope surprised even me when I looked at it recently:

Repository stats:

Total size: 4.8 MB
Total files: 178
Lines of code: ~42,000 (including documentation)

Code breakdown:

Shell scripts: 117 files, 35,776 lines total
- 28 feature installation modules: 13,013 lines
- Test framework + unit tests: 15,263 lines
- Runtime/startup scripts: 1,966 lines
- Core utilities: 1,565 lines
- User-facing scripts: 1,337 lines
Documentation: 17 markdown files, 4,434 lines
CI/CD workflows: 1,369 lines
Docker configs: 412 lines

What this actually means:

Feature code and test code are close to parity (roughly 13K and 15K lines), and that isn't an accident. When I said I have comprehensive testing, I meant it. For every line of feature installation code, there's slightly more than a line of test code validating it.

The feature modules are about 31% of the codebase. The tests are another 36%. The remaining third is split between documentation, core utilities, runtime scripts, and CI/CD. This is what production-ready infrastructure looks like.

The efficiency angle:

Here's what makes this interesting: 42,000 lines supporting 28 feature modules covering 100+ tools, with full CI/CD, comprehensive testing, and extensive documentation, all in under 5MB.

Most enterprise Dockerfile collections I've seen would be 5-10x this size for similar functionality. They'd have separate Dockerfiles for every combination, duplicated setup code across files, and minimal or no testing. This modular approach is genuinely more maintainable.

For context, a typical enterprise setup might have:

30+ separate Dockerfiles (one per project/team)
Each 200-500 lines
6,000-15,000 lines of duplicated Docker code
Maybe 500 lines of tests if you're lucky
Inconsistent versions across all of them

This system replaces all of that with one Dockerfile, modular features, and more tests than most teams have for their entire Docker infrastructure.

What's Actually Included

Over time, I've built 28 feature modules that install 100+ tools to cover basically everything I need across different projects. Here's how they group by use case:

If you're building APIs or web services:

Python with FastAPI/Flask (runtime + poetry, pytest, black, mypy, ruff)
Node.js with Express/Next.js (runtime + TypeScript, ESLint, Prettier, Jest)
Database clients for PostgreSQL, Redis, and SQLite
All the testing frameworks and linters you actually use

If you're doing cloud operations or infrastructure:

Kubernetes tools: kubectl, helm, and k9s for cluster management
Terraform + Terragrunt for infrastructure as code
AWS CLI, Google Cloud SDK, and Cloudflare Workers tooling
Docker CLI for Docker-in-Docker workflows

If you're doing ML or data science work:

Python data science stack with all the usual suspects
Ollama for running local LLMs (because apparently that's a thing we do now)
R for statistical computing
Java for Spark/Hadoop work

If you're doing systems programming:

Rust toolchain with cargo, clippy, and rustfmt
Go with all the build tools
C/C++ compilers and build systems

For everyone:

Git with GitHub CLI for version control
1Password CLI for secrets management (so you stop committing API keys)
All the basic dev utilities you forget you need until you don't have them

Every feature is independently toggleable through build arguments. All versions are pinned and automatically tracked for updates by the weekly automation I described earlier.

Architecture support: Works on ARM64 (Apple Silicon M1/M2/M3, AWS Graviton) and AMD64 (traditional x86_64). The same Dockerfile builds correctly on both.

Architecture: How I Organized This Thing So I Could Actually Find Stuff Later

Here's the folder structure that evolved as this project grew:

containers/
├── Dockerfile              # Universal, feature-based
├── lib/
│   ├── base/              # Core system setup
│   │   ├── setup.sh       # Base system config
│   │   ├── user.sh        # Non-root user creation
│   │   └── apt-utils.sh   # Debian version detection
│   ├── features/          # Individual feature modules (28)
│   │   ├── python.sh, python-dev.sh
│   │   ├── node.sh, node-dev.sh
│   │   └── ... all other features
│   └── runtime/           # Container initialization
│       ├── first-time-setup.d/
│       └── startup.d/
├── tests/
│   ├── unit/              # Feature-level tests
│   └── integration/       # Full build scenarios
└── examples/              # Docker Compose templates

[Figure: repository layout. lib/base/ holds core system, security, and user setup; lib/features/ holds the 28 modular features (100+ tools); lib/runtime/ holds initialization scripts; tests/ holds the 535+ tests. Build time uses lib/base and lib/features; runtime uses lib/runtime.]

Handling Debian Version Compatibility (Or: That One Time Debian Broke Everything)

Remember when Debian 13 (Trixie) removed the apt-key command in 2024? If you maintain Docker images, you probably remember. Container builds across the entire ecosystem broke overnight. HashiCorp tools? Broken. Kubernetes? Broken. Terraform? Broken. Every single image that added third-party repositories the "old way" just... stopped working.

The error looked like this:

bash: line 1: apt-key: command not found
✗ Adding HashiCorp GPG key failed with exit code 127

I saw this coming (the deprecation warnings had been around for a while), so I built automatic Debian version detection into the system. The scripts detect which Debian version they're running on and use the appropriate method:

# lib/features/terraform.sh (simplified)
source /tmp/build-scripts/base/apt-utils.sh

if command -v apt-key >/dev/null 2>&1; then
    # Debian 11/12: Legacy method
    curl -fsSL https://apt.releases.hashicorp.com/gpg | apt-key add -
else
    # Debian 13+: Modern signed-by method
    curl -fsSL https://apt.releases.hashicorp.com/gpg | \
        gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
    echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] ..."
fi

Package/Command	Debian 11	Debian 12	Debian 13
apt-key	✓ Available	✓ Available	✗ Removed
lzma-dev	✓ Available	✓ Available	→ liblzma-dev
GPG key method	apt-key add	apt-key add	signed-by=

I also built utility functions that feature authors can use:

if is_debian_version 13; then
    # Trixie-specific logic
fi

apt_install_conditional 11 12 lzma-dev  # Only Debian 11/12
apt_install liblzma-dev                  # Works on all versions

The system handles package migrations (like lzma-dev to liblzma-dev in Debian 13) automatically. The CI pipeline spot-checks compatibility across Debian 11, 12, and 13, catching major issues before they ship—without the overhead of testing every possible combination on every run.
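
Version detection of this kind usually reads /etc/os-release. Here is a sketch of how such a check could work. The function names are hypothetical, and the repository's apt-utils.sh may differ. Note that pre-release Debian images may omit VERSION_ID, so a real implementation would likely need a fallback such as VERSION_CODENAME:

```shell
# Hypothetical sketch; the repository's apt-utils.sh may differ.

# Print the major Debian version from an os-release file.
debian_major_version() {
    local file="${1:-/etc/os-release}"
    [ -r "$file" ] || return 1
    # Source in a subshell so os-release variables do not leak.
    ( VERSION_ID=""; . "$file"; printf '%s\n' "${VERSION_ID%%.*}" )
}

# Succeeds when the detected major version equals $1.
debian_is() {
    [ "$(debian_major_version "${2:-/etc/os-release}")" = "$1" ]
}

# Demo against a sample file so the result does not depend on the host.
sample="$(mktemp)"
printf 'ID=debian\nVERSION_ID="13"\n' > "$sample"
debian_is 13 "$sample" && echo "use signed-by keyrings"
rm -f "$sample"
```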

This saved me when Debian 13 released. While everyone else was scrambling to fix broken builds, mine just... worked. That felt good.

Cache Strategy Deep Dive

I configured persistent cache volumes for every package manager that matters:

/cache/
  ├── pip/       # Python packages
  ├── npm/       # Node packages
  ├── cargo/     # Rust crates
  ├── go/        # Go modules
  └── bundle/    # Ruby gems

Mount these as Docker volumes for fast rebuilds:

docker run -v project-cache:/cache myproject:dev

The first build downloads everything. Subsequent builds reuse the cache, even if you change which features are enabled. I've seen build times drop from 8+ minutes to under 2 minutes just from proper cache configuration.
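
For a mounted /cache volume to help, each package manager has to be pointed at its subdirectory. One way to do that is with the tools' standard environment variables. The snippet below is an assumption about the wiring, not a copy of the repository's scripts, and it leaves out cargo and bundler, whose cache locations are tied to broader settings:

```shell
# Assumed wiring: standard cache variables aimed at the /cache layout above.
CACHE_ROOT="${CACHE_ROOT:-/cache}"

export PIP_CACHE_DIR="$CACHE_ROOT/pip"     # pip download and wheel cache
export npm_config_cache="$CACHE_ROOT/npm"  # npm cache directory
export GOMODCACHE="$CACHE_ROOT/go"         # Go module cache
```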

Common Pitfalls I Learned the Hard Way

Let me save you some pain by sharing mistakes I made while building this:

Why I Chose Debian Over Alpine

Alpine seems attractive at first—tiny base image, minimal attack surface. And for many use cases, Alpine is excellent (I use it for database containers and simple services all the time).

But for development containers with lots of tools, I ran into friction:

Different package manager (apk vs apt) meant learning a new ecosystem and maintaining two versions of scripts
Musl libc instead of glibc caused occasional compatibility issues with pre-compiled binaries
Python packages that don't publish musllinux wheels have to be compiled from source on Alpine
Some tools have better support and documentation for Debian/Ubuntu

After spending time debugging Alpine-specific issues, I switched to Debian slim for development containers. The image size difference was about 50MB, but the development experience improved significantly. Your mileage may vary—Alpine is great for production workloads and simpler images, but Debian slim gave me fewer surprises when installing development tools.

The Day Debian 13 Broke Everything

I already mentioned this, but it's worth emphasizing: always plan for breaking changes in base images. I learned to:

Pin Debian versions in CI testing (debian:11-slim, debian:12-slim, debian:13-slim)
Test new Debian versions before they become stable
Build version detection into installation scripts
Never assume commands will exist forever

The apt-key removal taught me this lesson the hard way.

Why I Don't Use ARG for Secrets

Early on, I tried using build arguments for API keys and credentials. Don't do this:

# DON'T DO THIS
ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY

Build arguments end up in the image history. Anyone with access to the image can read them with docker history. Use secrets management (1Password CLI, AWS Secrets Manager) or mount them at runtime instead.
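
A cheap guard is to scan a Dockerfile for ARG names that look like credentials before building. Here is a sketch with a hypothetical helper name and a deliberately small keyword list. It is a heuristic, not a substitute for proper secret scanning:

```shell
# Hypothetical lint: print ARG lines whose names suggest a credential.
find_secret_args() {
    grep -nEi '^[[:space:]]*ARG[[:space:]]+[A-Za-z0-9_]*(SECRET|TOKEN|PASSWORD|ACCESS_KEY|API_KEY)' "$1"
}

# Demo against a sample file.
sample="$(mktemp)"
printf 'ARG INCLUDE_PYTHON_DEV=false\nARG AWS_SECRET_ACCESS_KEY\n' > "$sample"
find_secret_args "$sample"   # prints line 2, the AWS_SECRET_ACCESS_KEY line
rm -f "$sample"
```

Since grep exits non-zero when nothing matches, the function can gate a build step: treat a successful match as a failure of the check.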

The Submodule Update Trap

Git submodules are great but have one big gotcha: they don't auto-update. When you git pull your main project, the submodule stays at its old commit unless you explicitly update it.

I now include a note in every project's README:

# Update the container system
git submodule update --remote containers

Better yet, I'm working on a pre-commit hook that warns when the submodule is more than a week behind main.
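
One possible shape for such a check is sketched below. It is not the author's hook. It assumes "behind" is measured as the gap between the commit date of the pinned submodule commit and the tip of origin/main, and that the submodule has been fetched recently. The function names are hypothetical:

```shell
# Hypothetical sketch of a staleness warning for the submodule.

# Whole days between two Unix timestamps ($1 older, $2 newer).
days_between() {
    echo $(( ($2 - $1) / 86400 ))
}

# $1 = submodule path, $2 = allowed age in days (default 7)
check_submodule_age() {
    local path="$1" max_days="${2:-7}" pinned latest age
    # %ct is the committer date as a Unix timestamp.
    pinned="$(git -C "$path" log -1 --format=%ct HEAD)" || return 0
    latest="$(git -C "$path" log -1 --format=%ct origin/main)" || return 0
    age="$(days_between "$pinned" "$latest")"
    if [ "$age" -gt "$max_days" ]; then
        echo "warning: $path is $age days behind origin/main" >&2
    fi
}

# Example call from a hook: check_submodule_age containers 7
```

The function only warns and always lets the commit proceed, which matches the stated goal of a reminder rather than a gate.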

Keeping Your Projects Updated

Here's the workflow for keeping your containers current:

Updating to Latest Versions

# In your project directory
cd containers
git checkout main
git pull origin main
cd ..
git add containers
git commit -m "Update container system to v1.2.3"

Your existing containers keep running. Next time you rebuild, they'll use the new versions. If you want to force an update immediately:

docker compose down
docker compose build --no-cache
docker compose up

What Happens During Updates

When you update the submodule:

No immediate effect - Running containers keep running
Next build uses new versions - Python 3.13.6 → 3.13.7, etc.
Tests run during build - If something breaks, the build fails (not your running container)
Cache mostly survives - Only changed features need re-downloading

Rolling Back if Needed

Git submodules make rollbacks trivial:

cd containers
git checkout v1.2.2  # Previous version
cd ..
git add containers
git commit -m "Rollback container system to v1.2.2"

Then rebuild. This is one of the big advantages of the submodule approach—every version is one git checkout away.

What's Actually Built

Here's what the system includes right now:

Feature Coverage:

28 feature modules that install 100+ individual tools
8 programming languages (Python, Node.js, Rust, Go, Ruby, R, Java, Mojo)
5 cloud platform CLIs (AWS, GCloud, Cloudflare, Kubernetes, Terraform)
3 database clients (PostgreSQL, Redis, SQLite)
Comprehensive dev tools (Git, GitHub CLI, Docker CLI, 1Password CLI, and more)

Test Coverage:

535+ unit tests covering every feature installation
6 integration test scenarios (minimal, python-dev, node-dev, cloud-ops, polyglot, rust-golang)
Debian compatibility spot-checks across versions 11, 12, and 13
Security scanning with Trivy for all built images
Shellcheck validation on all bash scripts
Secret scanning with Gitleaks

Automation:

Weekly automated version checks (Sunday 2am UTC)
Full CI pipeline on every update (build 6 variants, run all tests, scan for vulnerabilities)
Auto-merge on success, high-priority notification on failure
Pushover notifications keep you informed without requiring constant monitoring

Architecture Support:

ARM64 (Apple Silicon M1/M2/M3, AWS Graviton, Raspberry Pi)
AMD64 (traditional x86_64, Intel/AMD processors)
Same Dockerfile builds correctly on both architectures

The system is production-ready with comprehensive testing, documentation, and automation. It's being used across multiple projects, but the numbers that matter are the ones above—those demonstrate the engineering rigor built into the foundation.

Why Not Just Use...?

You might be wondering why not just use existing solutions. Fair question.

Custom Dockerfiles: This is what I was doing. Full control is great until you have ten projects and need to update something in all of them. The maintenance burden became untenable. Every project drifts slightly differently, and keeping them in sync is a losing battle.

Dev Container Features: These are actually pretty good! Microsoft's dev container features are well-designed and solve similar problems. But they only work through tooling that implements the Dev Container spec, such as VS Code. I wanted something that works with VS Code, Docker Compose, plain Docker, CI/CD pipelines, and production environments. Also, dev container features don't solve the version tracking and automation problem: you still need to manually update your feature versions.

Pre-built Images (python:3.13, node:20, etc.): Fast to pull from Docker Hub, but you get what you get. Need Python + Node.js? That's not a standard combination. Need Python + Kubernetes tools + PostgreSQL client? Good luck finding that exact image. And you definitely don't get automatic version updates with comprehensive testing. You're trusting someone else's build process and update cadence.

Configuration Management (Ansible, Chef, Puppet): Different problem space. Those are for runtime configuration of running systems (mutable infrastructure). This is build-time configuration for immutable containers. Both have their place, but they're solving different problems.

Docker Official Images + Multistage Builds: This gets closer, but you still end up maintaining multistage Dockerfiles for every project. The complexity moves but doesn't disappear. And you still need to manually track version updates.

The unique value here is the combination: modular features + automated updates + comprehensive testing + production-ready defaults. I haven't found another solution that does all of this together.

Documentation

I wrote comprehensive documentation because I got tired of answering the same questions (and because I'd forget details myself):

Core docs:

README.md - Quick start and common use cases
CLAUDE.md - Architecture guidance and design decisions (yes, I wrote docs specifically for Claude to understand the codebase)
CONTRIBUTING.md - How to add new features
CHANGELOG.md - Version history with breaking changes highlighted

Detailed guides in docs/:

Troubleshooting common issues (the "it doesn't work" guide)
Writing tests for new features
Security best practices
Architecture decisions and rationale
Version tracking and automated releases
Security scanning with Trivy

Examples in examples/:

Docker Compose templates for common scenarios
Build context patterns
Environment configurations
Multi-service setups

This means new developers can onboard without waiting for me to explain things, and troubleshooting is self-service. The CLAUDE.md file has been particularly useful—it means debugging can be quickly handed off to an AI agent to track down issues. Given the rigor of the testing system, this has been quite effective. I still review the changes, of course, but an agent running something like Claude Sonnet can usually identify and fix problems while I'm mostly focused on something else.

Future Plans

I'm actively working on several improvements:

Performance optimizations:

Parallel feature installation (run independent installs concurrently)
More aggressive layer caching
Build time metrics tracking

Plugin system:

Allow custom features without modifying core
Company-specific tools (internal VPNs, proprietary CLIs)
Private registry support
Local feature overrides

Configuration templates:

Pre-built combinations for common stacks (Python FastAPI + PostgreSQL, Next.js + Redis, etc.)
Quick-start templates
Best practices baked in

Observability:

Build-time metrics (what takes the most time?)
Image size tracking over time
Security vulnerability trends
Feature usage analytics

Advanced runtime features:

Environment templating (generate configs from 1Password/AWS Secrets)
Health checks
Auto-update mechanisms for running containers
Graceful feature degradation

Enterprise features:

SBOM (Software Bill of Materials) generation for compliance
License scanning
Air-gapped environment support
Custom registry integration

The roadmap is driven by real usage. If you have feature requests, open an issue on GitHub.

Getting Started

Want to try it? The setup is straightforward.

Step 1: Add as Submodule

cd your-project
git submodule add https://github.com/joshjhall/containers.git containers

Step 2: Create Docker Compose Configuration

Create .devcontainer/docker-compose.yml (or use it anywhere you'd normally use a Dockerfile):

services:
  devcontainer:
    build:
      context: ../containers
      dockerfile: Dockerfile
      args:
        PROJECT_NAME: myproject
        INCLUDE_PYTHON_DEV: "true"
        INCLUDE_POSTGRES_CLIENT: "true"
        INCLUDE_DOCKER: "true"
    volumes:
      - ..:/workspace/myproject
      - myproject-cache:/cache

volumes:
  myproject-cache:

[Figure: the configuration above, annotated. INCLUDE_PYTHON_DEV enables Python with poetry, pytest, black, and mypy; INCLUDE_POSTGRES_CLIENT adds the psql command; INCLUDE_DOCKER adds Docker-in-Docker support; the myproject-cache:/cache volume is a persistent cache for fast rebuilds. No custom Dockerfile is needed.]

Step 3: Build and Run

docker compose -f .devcontainer/docker-compose.yml up

That's it. You now have a fully configured Python development environment with PostgreSQL client and Docker-in-Docker support.

VS Code Dev Container Integration

If you're using VS Code, you can use Microsoft's devcontainer base images for a cleaner integration. This avoids the Docker-in-Docker plugin complications and their questionable security implications. Here's what that looks like:

.devcontainer/docker-compose.yml:

services:
  devcontainer:
    build:
      context: ../containers
      dockerfile: Dockerfile
      args:
        BASE_IMAGE: mcr.microsoft.com/devcontainers/base:trixie
        PROJECT_NAME: myproject
        USERNAME: vscode
        WORKING_DIR: /workspace/myproject
        INCLUDE_PYTHON_DEV: 'true'
        INCLUDE_POSTGRES_CLIENT: 'true'
        INCLUDE_DEV_TOOLS: 'true'
      cache_from:
        - type=local,src=/tmp/.buildx-cache
      cache_to:
        - type=local,dest=/tmp/.buildx-cache,mode=max
    volumes:
      - ..:/workspace/myproject
    environment:
      - TZ=${TZ:-America/Chicago}
      - ENVIRONMENT=${ENVIRONMENT:-development}
    command: sleep infinity
    networks:
      - containers-network

networks:
  containers-network:
    driver: bridge

.devcontainer/devcontainer.json:

{
  "name": "My Project",
  "dockerComposeFile": "docker-compose.yml",
  "service": "devcontainer",
  "workspaceFolder": "/workspace/${localWorkspaceFolderBasename}",

  "postCreateCommand": "poetry install",

  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-python.vscode-pylance",
        "charliermarsh.ruff"
      ],
      "settings": {
        "terminal.integrated.defaultProfile.linux": "zsh",
        "python.defaultInterpreterPath": "/usr/local/bin/python"
      }
    }
  },

  "remoteUser": "vscode"
}

Notice the key differences:

BASE_IMAGE arg lets you use Microsoft's devcontainer base images
USERNAME: vscode integrates with VS Code's expectations
Simple devcontainer.json without Docker-specific plugin configurations
The same features work whether you use debian:13-slim for production or mcr.microsoft.com/devcontainers/base:trixie for development

This flexibility means you can optimize for each environment: Microsoft's devcontainer images for local VS Code development, standard Debian slim images (11, 12, or 13) for production and QA. Same features, same tools, just different base images.

Adding More Features

Want to add Node.js? Just update the build args:

args:
  INCLUDE_PYTHON_DEV: "true"
  INCLUDE_NODE_DEV: "true"      # Add this line
  INCLUDE_POSTGRES_CLIENT: "true"
  INCLUDE_DOCKER: "true"

Rebuild:

docker compose down
docker compose build
docker compose up

Common Use Cases

Python API development:

args:
  INCLUDE_PYTHON_DEV: "true"
  INCLUDE_POSTGRES_CLIENT: "true"
  INCLUDE_REDIS_CLIENT: "true"

Node.js web development:

args:
  INCLUDE_NODE_DEV: "true"
  INCLUDE_POSTGRES_CLIENT: "true"

Cloud operations:

args:
  INCLUDE_KUBERNETES: "true"
  INCLUDE_TERRAFORM: "true"
  INCLUDE_AWS: "true"

Full-stack development:

args:
  INCLUDE_PYTHON_DEV: "true"
  INCLUDE_NODE_DEV: "true"
  INCLUDE_POSTGRES_CLIENT: "true"
  INCLUDE_REDIS_CLIENT: "true"
  INCLUDE_DOCKER: "true"

All available build arguments are documented in the repository README.

Security

Security best practices are built into the foundation:

Non-root user by default - All processes run as a non-root user unless explicitly needed
Minimal Debian slim base images - Smallest viable attack surface
Automated security updates - Weekly automation checks for and applies security patches
Secret scanning with Gitleaks - Prevents accidentally committed credentials
Vulnerability scanning with Trivy - Catches known CVEs in dependencies
Validated installations - Each feature verifies correct installation
Proper file permissions - No world-writable files or directories

The automated update system is designed with security in mind. Security patches are prioritized and tested immediately. If a critical vulnerability is detected, the system can create emergency updates outside the normal Sunday schedule.

Contributing

Want to add a feature? The process is straightforward:

1. Create the Feature Script

Create lib/features/your-feature.sh following this template:

#!/usr/bin/env bash
set -euo pipefail

# Version placeholder (swap in your own version-detection logic)
TOOL_VERSION="1.2.3"

# Install
apt-get update
apt-get install -y your-tool

# Validate
if ! command -v your-tool >/dev/null 2>&1; then
    echo "Installation failed" >&2
    exit 1
fi

echo "Successfully installed your-tool $TOOL_VERSION"

2. Add Unit Tests

Create tests/unit/features/your-feature.sh:

#!/usr/bin/env bash
source "../../framework.sh"
init_test_framework

test_your_feature_installs() {
    source lib/features/your-feature.sh
    assert_success "Installation should succeed"
}

test_your_feature_version() {
    local image="test-image:latest"
    assert_command_in_container "$image" "your-tool --version"
}

run_tests
generate_report

3. Update Documentation

Add your feature to:

README.md - Available features list
CHANGELOG.md - Under "Unreleased"
docs/FEATURES.md - Detailed feature documentation

4. Submit Pull Request

The CI pipeline will automatically test your changes on all Debian versions and run the full test suite.

See CONTRIBUTING.md for detailed guidelines, coding standards, and best practices.

Wrapping Up

I built this out of frustration. I was tired of writing the same Dockerfile repeatedly. Tired of tracking version updates manually. Tired of fixing the same Docker issues in six different repos. Tired of spending days on infrastructure instead of building actual products.

This system solves all of those problems for me:

New projects set up in 10 minutes instead of 2 days
All my projects stay in sync automatically
Version updates happen while I sleep (and only wake me if something breaks)
Dev and production environments use identical foundations
I have actual confidence that builds work because of comprehensive testing

It started as a side project to scratch my own itch. But it's become the foundation for everything I build with containers now. I haven't written a custom Dockerfile in over a year. I don't miss it.

The best part? The automation. That weekly CI run that updates everything, tests it, and either deploys or notifies me? That's the difference between managing infrastructure and just using it. I wake up to patched containers. I spend time building products instead of maintaining Docker configs.

If you're dealing with the same pain points—multiple projects, manual version tracking, inconsistent environments, time-consuming setup—this might help you too.

Try It On Your Next Project

Don't overcommit. Just try it on one project:

Add the git submodule
Enable the features you need
Build the container
Spend your time building instead of configuring Docker

If you run into issues or have questions, open an issue on GitHub. I'm actively maintaining this and usually respond within a day.

If it saves you even half the time it's saved me, that's hours back in your week. Hours you can spend on things that actually matter.

Repository: https://github.com/joshjhall/containers

Documentation: See the docs/ directory for detailed guides

Examples: See the examples/ directory for Docker Compose templates

Issues/Questions: Open an issue on GitHub - I'm responsive and want this to work well for others too

CFS: Scoring Features Before You Argue About Them

Joshua Hall — Wed, 10 Jun 2026 22:29:56 +0000

Should we build two-factor authentication? Users have asked for it. That isn't a yes. Should we add the export-to-Excel feature the enterprise account keeps requesting, or the second product line a competitor just shipped? Every team faces a steady stream of these, and most answer them in a meeting where whoever holds the strongest opinion wins. Six months later the roadmap has a dozen features nobody uses, and a dozen more that got turned down for reasons no one can reconstruct.

The deeper problem isn't the decision. It's that the decision evaporates. It becomes an action item, a five-minute hallway conversation, a Slack thread, and it is almost never written down with the reasoning attached. So in a year, someone asks the same question, proposes the same feature you already killed, and nobody can say why it died. Maybe the answer should change now; maybe it shouldn't. Without a record you can't tell, and you can't improve the way you decide, because the inputs are gone. Organizational memory turns out to be however much a few people can hold in their heads (which is less than you'd hope and not nearly as long as you need) and that's before half the team turns over.

The fix isn't more meetings. It's a small piece of structure that gives the conversation an anchor: three axes, a one-to-five rating on each, and a score you write down with the decision. It's called CFS — Commonality, Frequency, Severity. It won't make the call for you. It makes the call comparable and durable.

The Three Axes

Commonality asks what share of your users would touch this feature at all. Are 80% of users going to export to Excel, or 20%? Is it one important enterprise client and nobody else? A one is a single niche audience. A five is universal: the kind of thing nearly everyone in the product needs. Fives are rare, and the rarity is the point. If everything scores a five, the axis has stopped telling you anything.

Frequency asks how often the people who do use it come back. Among that slice of users (whether it's 20% of the base or 95%) is this a couple-times-a-year thing or a twenty-times-a-day thing? Note that this is conditional on commonality: a feature can matter to only a sliver of users and still be something that sliver lives in daily.

Severity asks how much the absence hurts the people who'd benefit, and whether there's an easy way around it. The workaround question is doing most of the work here. Can they copy-paste manually? Lean on a shortcut key, or an OS capability common enough to assume everyone has it, like printing? That's low severity. At the other end: without this, the product is worthless to the audience that needs it. Not being able to export your books to Excel out of an accounting tool can be that. Not being able to print from a photo app can be that. Most things land somewhere between the convenience and the dealbreaker.

The three are independent. A feature can be common but rarely used, used constantly by a tiny group, painful to lack but only in one corner of the product. Keeping them separate is what lets the score carry information.

Multiply, Don't Add

Here's the part that matters and is easy to get wrong: the axes multiply. Commonality times Frequency times Severity, so a one-to-five scale tops out at 125, not 15.

Multiplication is deliberate, because each tick should land with more weight than the last — a jump from severity three to four isn't one unit more pain, it's a different category of pain, and the math should say so. It also means a single one anywhere drags the whole thing down, which is usually correct: a feature almost nobody can use, no matter how often or how critically, probably isn't where your next two weeks should go.

A shortcut key shows why the axes have to stay separate. I remap keys to split and merge browser windows constantly (high frequency, for me). But my web habits are geeky and unrepresentative; commonality is low. High frequency, low commonality, and the product is fine without it: the workaround is a button two pixels away. Multiply it out and the score stays honest about that.

Calibrate the top of the scale hard. A five means six-sigma certainty: everyone uses it, or the people who use it do so dozens of times a day, or the product is simply broken without it. I rarely reach for fours and almost never for fives. Most real scoring lives in the one-to-three band with the occasional four, and that's the framework working, not failing.
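To make the arithmetic concrete, here is a minimal sketch of a CFS score as a small data type. The class name, field names, and the example ratings are illustrative assumptions, not part of any published tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFS:
    """One scored feature: each axis is an integer from 1 to 5."""
    commonality: int
    frequency: int
    severity: int

    def __post_init__(self) -> None:
        # Reject ratings outside the one-to-five scale.
        for name in ("commonality", "frequency", "severity"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def score(self) -> int:
        # The axes multiply, so the range is 1..125 rather than 3..15.
        return self.commonality * self.frequency * self.severity


# Hypothetical ratings, chosen only to illustrate the arithmetic.
window_shortcut = CFS(commonality=1, frequency=5, severity=1)
middling_feature = CFS(commonality=3, frequency=3, severity=2)

print(window_shortcut.score)   # 5
print(middling_feature.score)  # 18
```

A single one on any axis caps the score at 25, which is the "drags the whole thing down" behavior described above.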

What the Scores Look Like in Practice

Take private accounts on a consumer social network. Authenticated, individual accounts are how the product works at all (high commonality, high severity, you don't have a social network without them). But that one decision fans out into a cluster of features whose scores diverge sharply. Password reset, lost-password flow, passkeys, emailed one-time codes, OTP: all roads to "I can get into my account," each scoring differently. And frequency is genuinely contextual. If I let a mobile session refresh silently for months, signing back in is rare even though the account itself is universal and critical. High commonality, high severity, low frequency, and the math holds all three at once.

Now account merging. You join a network with your Gmail, forget, and join again with Yahoo. Two accounts, two addresses. There are workflows to reconcile them, but it's an uncommon situation, an infrequent one, and it takes a savvy user to even notice. The system usually can't detect it without a reliable shared anchor like a verified phone number. And the workaround is brutal but real: just delete one account. Low commonality, low frequency, low-to-moderate severity; multiply it out and you get a small number, which is the right answer. A lot of products correctly never build this.

Printing sits in the messy middle, which is exactly why it's useful. Plenty of apps don't need it. But plenty of users still print, or print-to-PDF because a manager wants it emailed. Middling commonality, low-to-middling frequency, low-to-middling severity depending on the domain: a clean printable view is a respectable, unglamorous, middle-of-the-table feature. Not every decision is a clear build or a clear cut, and the score is honest about the ones that aren't.

One Input Among Many

CFS is not a prioritization engine. It's the benefit half of a cost-benefit, and it deliberately leaves cost out.

That's the line between CFS and something like RICE, the Intercom model that folds effort in as a divisor: Reach times Impact times Confidence, over Effort. RICE bakes the cost into the number. CFS doesn't, on purpose, because cost belongs in a separate, cleaner conversation. Effort is the easy thing to compare: put it in person-weeks and you're done. I've prioritized half a dozen ones and twos ahead of an eighteen plenty of times, simply because the small ones shipped in days while the big one needed two more weeks of design before engineering could even start.
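The structural difference can be sketched in a few lines. The RICE formula below follows the description above; the backlog entries, their scores, and the person-week figures are made up for illustration.

```python
def rice(reach: float, impact: float, confidence: float, effort: float) -> float:
    """RICE folds cost into the number: (Reach * Impact * Confidence) / Effort."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * confidence / effort


# With CFS, the benefit score and the effort stay in separate columns.
# Hypothetical backlog; the numbers are illustrative only.
backlog = [
    {"name": "big feature", "cfs": 18, "person_weeks": 6.0},
    {"name": "small fix A", "cfs": 2, "person_weeks": 0.5},
    {"name": "small fix B", "cfs": 1, "person_weeks": 0.2},
]

# One possible ordering: cheapest first, which can put ones and twos
# ahead of an eighteen without changing anyone's score.
by_effort = sorted(backlog, key=lambda item: item["person_weeks"])
print([item["name"] for item in by_effort])
```

Because effort never touches the CFS number, re-estimating the work changes the ordering but leaves the recorded benefit score intact.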

The harder inputs resist a tidy number. Team load-balancing: the feature needs Susan and Javier, but Javier is booked solid for a quarter on something the executives flagged, and Susan won't surface from her two current projects for six weeks. That has nothing to do with the feature's merit and everything to do with whether you can staff it. Then there's strategic value a usage score will never capture. A feature that scores low on all three axes but demos beautifully and helps sales close can absolutely be worth building.

So treat the CFS number as one calibrated input you set beside cost, capacity, and strategy, not the verdict. As a rough read on whether something looks important before the harder conversations start, I've found nothing better. An eighteen on the board earns real attention. North of twenty-four, I'm usually building it or making a deliberate case for why not.

The Real Payoff Is Rigor

Here's what the score actually buys you, and it took me embarrassingly long to name it: it adds a pseudo-quantitative layer to a fundamentally qualitative judgment, and that structure does two things a meeting can't.

First, it lets the best idea win regardless of who has it. When the conversation is "how common, how frequent, how severe," it stops mattering whether the proposal came from the CEO or the quietest person in the room. You're rating the need, not the advocate. The strong-opinion-wins dynamic that runs most feature debates loses its grip, because everyone is now arguing about the same three things in the same terms.

Second, it forces people to check their own biases. If I walked in certain feature A beat feature B, and we score them and A comes out a nine while B comes out a twenty-four, I have to sit with that. Why did I think A mattered more? Is our read on B inflated, or was my conviction about A just ego, or familiarity, or whatever I carried into the room? The tension between gut and score is the most valuable thing CFS produces — not because the number is right and the gut is wrong, but because the gap is where the real conversation lives. That's also where you should introspect hardest: when half the room expected low and it came back high, the disagreement is pointing at a hidden assumption worth dragging into the light.

The number was never the deliverable. The deliverable is a room full of people who have stopped arguing about whose opinion is louder and started arguing about how much the absence actually costs — written down, with the reasoning attached, so that when the question comes back in a year, the answer is right there waiting. Commonality two, frequency one, severity two. Has anything changed? Usually nothing has. And when it has, you'll know exactly what.

Chat Is an Input, Not an Interface

Joshua Hall — Tue, 09 Jun 2026 00:51:03 +0000

Ask me my address in a chat box and you've just made my life worse. I can type it into a form in three seconds: one that validates the ZIP, knows my state from my city, and tells me immediately if I fat-fingered a digit. In a chat box I'm at the mercy of however the model decides to parse my reply, and some fraction of the time it comes back subtly wrong because I phrased it in a way the prompt didn't anticipate. The form was right there.

That's the thing I keep coming back to looking at the wave of chat-first products. Conversation is a real input modality. It is rarely the right default. Picking the interface that fits the task is design work, and replacing every form, grid, and canvas with a prompt box because the model is cheap is the opposite of doing that work.

Chat Is One Input, Not the Interface

The category error in most chat-first products is treating chat as the interface rather than an input. The model underneath is a processing layer. The surface the user touches is a separate decision, and collapsing the two ("we shipped an LLM, so the product is a chat window") is what produces the worst of this genre.

Conversation is one input among many, and most software already runs on a rich vocabulary of others: forms, dropdowns, sliders, grids, direct manipulation on a canvas. Each exists because it fit a task better than typing a sentence would. A chat box doesn't retire any of them. It joins the list, useful in the specific places where the others fall short — and a liability everywhere they don't.

The honest test for any chat-first screen is whether prose is the best way to express the task, or merely the easiest way to ship it. Those answers diverge more often than the demos admit.

Where Chat Earns Its Keep

There's a category of input that genuinely benefits from conversation: anything where the structure of the answer isn't known in advance, where the follow-up questions depend on previous answers, or where the user's vocabulary doesn't match the application's and something has to translate between them. If the task is the kind of thing an interview handles better than a questionnaire, chat is worth exploring. The keyword is interview — there are limits, and they arrive fast.

Construction takeoff estimating is a good example from a project I work on. The drawings give you measurable structured data (square footage, room count, fixture positions) but say nothing about finish level. A builder-grade kitchen package might run $10K. A premium one with a Viking range and a Sub-Zero fridge can hit $150K. The plans don't say which the owner wants, and no single form field captures the difference cleanly.

Chat fits there. The system asks the contractor what conversations they've been having with the owner, picks up on "they want it nice but not crazy" or "she keeps showing me Italian tile," and translates that into structured data the estimator can apply. The contractor's mental model is conversational; forcing them through a finishing-level dropdown throws information away. Chat also lets ambiguity survive when it needs to — "they're still deciding, but I think it'll land in the $35K–$45K range" is a perfectly valid answer, and form fields either choke on that kind of variance or become elaborate interfaces in their own right.

Chat earns its place there because the input is genuinely unstructured at the source. Most chat interfaces I see aren't solving that problem. They're solving the much smaller problem of "we shipped a model and now everything has to look like one."

Where Direct Manipulation Wins

Then there's the larger territory the chat-first framing quietly ignores: the work that graphical interfaces do better than any sentence can, because the interface is the precision.

Try to nudge the spacing on an image by describing it. "Move it left a bit — no, less — okay, now down." That's ten seconds of direct manipulation in Figma or Illustrator turned into a frustrating game of telephone. A prose-only path to production design (describe the screen, accept whatever comes back) fails before it starts, because human language can't specify pixel spacing, and the tools that actually work never ask it to. The same holds for CAD, spreadsheets, page layout, business analytics: domains where the value lives in a precise, spatial, direct relationship between the user's hand and the artifact. Prose can't carry that bandwidth.

The forms case is just the narrow, well-behaved end of this same spectrum. Address entry, date selection, numeric ranges, anything with strict validation, anything the user has typed a hundred times, anything whose answers enumerate cleanly in a short dropdown: these are known data of a discrete shape the system needs to collect, sometimes with an order or a dependency between them. A form is a contract. It states exactly what's needed, validates locally, surfaces errors on the spot, and finishes in seconds. Run the same task through chat and you get a slow guessing game with the ambiguity pushed onto the user, who might not notice the model misread the address until the shipping confirmation arrives a day later.
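As a rough sketch of what "a form is a contract" means in code, the example below validates a US-style address locally and reports errors per field. The class, field names, and messages are hypothetical, and real address validation would go further than these checks.

```python
import re
from dataclasses import dataclass

ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")  # US ZIP or ZIP+4
STATE_PATTERN = re.compile(r"[A-Z]{2}")      # two-letter state code


@dataclass
class AddressForm:
    street: str
    city: str
    state: str
    zip_code: str

    def errors(self) -> dict[str, str]:
        """Return field-level errors; an empty dict means the form is valid."""
        found: dict[str, str] = {}
        if not self.street.strip():
            found["street"] = "Street is required."
        if not self.city.strip():
            found["city"] = "City is required."
        if not STATE_PATTERN.fullmatch(self.state):
            found["state"] = "Use a two-letter state code."
        if not ZIP_PATTERN.fullmatch(self.zip_code):
            found["zip_code"] = "ZIP must be 5 digits, optionally followed by a 4-digit extension."
        return found


form = AddressForm("1 Main St", "Springfield", "IL", "6270")
print(form.errors())  # reports only the zip_code field
```

The fat-fingered ZIP is caught before submission and tied to the exact field, which is the on-the-spot feedback a chat transcript cannot guarantee.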

Bulk is where it gets stark. Entering one phone number through chat? Maybe. Entering a client list with every contact's details? Almost never. It's faster to upload a CSV or type into a grid than to narrate a hundred rows. The richer the structure, the worse prose performs.

None of this is an argument against the model. The model can sit behind the form, the grid, the canvas, autocompleting, validating against external sources, suggesting corrections. The interaction surface doesn't have to be a chat box just because the processing layer is an LLM. As Andrej Karpathy puts it in his Software 3.0 framing, the future stack is classical code, machine learning, and LLMs working together; for genuine business logic I'd add a rules engine to that list. The interface is a design choice independent of which of those is doing the work underneath.

The Right Number of Ways

This is partly an old debate in new clothes: how many ways to do one thing should an application offer? Both extremes are wrong.

Perl and COBOL famously hand you fifteen ways to write the same line, and the result is often code nobody else can read; the reader has to reverse-engineer not just what the author did but which dialect they were speaking. The most flexible interface to an operating system is probably a Unix shell: it can do nearly anything, and it's correspondingly hard to learn and easy to misuse. Maximum optionality has a real cognitive price.

Python takes the opposite stance ("there should be one obvious way to do it") and rode it past the point of usefulness. For most of its history the language had no real case statement, on the principle that nested if/elif/else was already enough. The philosophy was consistent; the lived reality was that a match statement is simply easier to read than a chain of elifs, and the one-obvious-way dogma cost the language readability for years in a spot that didn't need the discipline.


The balance for an application is two or three modalities for the same task, chosen deliberately. A form for the structured case. A chat fallback for the unstructured one. Maybe a power-user shortcut for repeat operations. Three covers the meaningful variation in how people work without sliding into Perl-land. One, picked because the designer fell for a modality, quietly pushes half the user base into a worse experience.

The LLM Bridges the Gaps

So if chat isn't the interface, what is the model actually for? Its real power is bridging ambiguity, papering over the places in a workflow where the input is messy, the formatting is inconsistent, or the answer is still half-formed, so the user can keep moving toward the objective instead of stalling on a field that won't accept what they have.

Back to construction. Bids come back from subcontractors in wildly inconsistent formats. Pulling the total price off each one is trivial; regex or a parser handles it. What that price includes and excludes is where you need real interpretation, and that's the LLM's job: read the messy document, triage what plain machine learning can extract, what a rules engine can decide, and what genuinely needs a model to interpret. The output isn't a chat transcript. It's a structured model of the bid, echoed back through the normal interface in a consistent shape: the same fields, every time, however ragged the source.
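Here is a minimal sketch of the deterministic half of that triage, assuming a bid arrives as plain text. The regex pulls the total, and lines about inclusions or exclusions are set aside for the interpretation step. The names are hypothetical and no model call is shown.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Matches amounts such as "$12,500", "$12,500.00", or "$ 9800".
PRICE_PATTERN = re.compile(r"\$\s*(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?")


@dataclass
class BidSummary:
    """The same fields for every bid, however ragged the source document."""
    subcontractor: str
    total_price: Optional[float]
    needs_interpretation: list[str] = field(default_factory=list)


def summarize_bid(subcontractor: str, text: str) -> BidSummary:
    total = None
    unclear = []
    for line in text.splitlines():
        lowered = line.lower()
        match = PRICE_PATTERN.search(line)
        if "total" in lowered and match:
            # Plain parsing is enough for the total price.
            total = float(match.group(1).replace(",", "") + (match.group(2) or ""))
        elif "include" in lowered or "exclude" in lowered:
            # Scope language is queued for a model (or a person) to interpret.
            unclear.append(line.strip())
    return BidSummary(subcontractor, total, unclear)


bid = summarize_bid(
    "Acme Plumbing",
    "Rough-in and fixtures\nExcludes permit fees\nTotal: $12,500.00",
)
print(bid.total_price, bid.needs_interpretation)
```

The point of the shape is the return type: whatever does the interpreting, the interface only ever sees a `BidSummary`.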

That consistency is the actual payoff of LLM-powered UX, and it has almost nothing to do with chat as a surface. Picture it in CAD. I've got my 3D mouse, my shortcuts, my hands deep in the geometry of a part. Chat isn't where I model; that would be absurd. But "now that the shape's roughed in, run a torque analysis on this and farm it to the background while I keep working" is exactly where a conversational command earns its place, riding alongside the precise interface instead of replacing it. The model bridges; it doesn't take the wheel.

Default Is the Enemy of Design

If you're designing a product right now and the default screen is a chat box, here's the test. Pull up the five most common tasks your users perform. For each, ask whether a form, grid, dropdown, button, or canvas would let them finish faster, with less ambiguity and better validation than typing a sentence.

If the answer is yes across the board, you've built a forms-and-grids application wearing a chatbot costume, and the costume is hurting your users. Ship the real interface. Keep chat for the slice of cases where the input genuinely doesn't fit a structured surface, and put it behind a button that says "or just describe it" rather than making it the front door.

If the answer is no for some of them (you can't predict the shape of the input, the follow-ups depend on the answers, the user's vocabulary won't survive translation to fields) then chat is the right primary input for those tasks. Build it well, set the structured shortcuts beside it for the users who want them, and treat chat as a mode the application offers, not the application's identity.

The model is a tool. Chat is one of the interfaces it can power, valuable exactly where ambiguity lives and forgettable everywhere else. Forms, grids, sliders, and direct manipulation remain the right answer for most of what people actually do with software. The designers who can hold all of those in their head at once and choose on purpose will out-build the ones who reach for whatever modality is trending.

Default is the enemy of design.

Rethinking Design Systems: When Code Becomes the Source of Truth

Joshua Hall — Sat, 06 Jun 2026 14:30:00 +0000

The Design-to-Development Gap

Every product team knows this dance: Design creates beautiful components in Figma. Engineering builds them in React. Then something breaks. The button in production doesn't quite match the button in Figma. Someone updated the design. Someone else updated the code. Nobody updated both. The documentation is definitely not up to date either.

We tell ourselves this is inevitable. Design tools and code are fundamentally different beasts. The best we can hope for is "close enough" and good documentation.

I think we've been solving this backwards.

This is the first in a series about building design systems that actually work. Figma doesn't have to be just pretty pictures for engineers to interpret. It can be a structured system that directly informs code generation. Design and engineering don't have to be two sources of truth struggling to stay in sync. They can be a single system with design tools as one interface and code as another.

Sound impossible? I've built a proof of concept, and I think there's something here.

The Vision: Reverse the Flow

The typical process looks like this:

Design in Figma: Create component mockups with variants, states, interactions
Spec it out: Write documentation explaining what engineers should build
Implement in code: Engineers interpret the designs and build React components
Publish to Figma: Designers use these components in mockups
Drift begins: Design updates Figma. Code updates independently. Documentation lags behind both.

The problem? We're maintaining two separate sources of truth and hoping they stay synchronized through diligence and documentation.

What if we flipped this? Here's the component library workflow:

Design conceptually in Figma (sketches, explorations, mockups)
Implement the component in React based on those designs
Build testing, Storybook, documentation (the real specification)
Generate Figma components from the code ← This is the flip
Publish these code-generated components to the shared Figma library

The Figma component library stops being aspirational and starts reflecting what actually exists in production. When a designer uses a button component in Figma, they're using the button component in code, just rendered in a design tool.

These code-generated Figma components carry metadata. Design handoff tools and AI agents don't have to guess that a button maps to <Button variant="primary" size="large">. They can know this because the component itself contains that information.

What This Requires

This approach demands rethinking some assumptions about component libraries. Design tools become where component concepts get explored and refined, but the component library itself reflects the code. Product designs still happen in Figma using these components. But the components designers use are no longer aspirational—they're direct representations of what exists in production. Documentation stops being a bridge between two systems and gets embedded in the system itself.

Early-stage products iterating rapidly might not want this level of structure yet. But small teams building for the long term could benefit significantly—this is exactly the kind of foundation that prevents the tech and design debt that accumulates as teams grow.

But if you're building a mature product with a design system, constantly fighting drift between Figma and code, and dreaming of better design-to-development handoffs? This might be interesting.

The approach requires solving some hard technical problems: generating Figma components programmatically, tracking changes efficiently, maintaining sync between systems, and handling edge cases gracefully.

I'm starting with something concrete: icons.

If this is the easiest part of the vision (and honestly, it is), then succeeding here proves the approach is feasible. Failing here means the bigger vision isn't practical.

The plugin and automation are open source: github.com/joshjhall/google-symbols-figma-plugin

Why Icons Make Sense as Step One

Icons might seem like a small part of the bigger vision, but they're the perfect place to start for three practical reasons.

I need them anyway. Whether or not the code-driven design system idea works, I need a comprehensive, up-to-date icon library in Figma. If the larger experiment fails or becomes too complicated, this part still delivers value to both design and engineering today.

It's a learning ground. Building a Figma plugin that pulls data directly from a Git repository and generates components programmatically? That's exactly what I'll need to do when generating components from React code later. Better to learn on a well-structured external repository than on my own codebase.

It forces optimization. With nearly 4,000 icons and 504 variants each, that's close to 2 million total variants to manage. GitHub rate limiting. Figma memory constraints. Partial import failures. Incremental updates. All the problems I'll face when processing a real codebase at scale need solutions here first.

It took several days to get all the icons initially imported into Figma. The process would fail partway through due to HTTP rate limits when downloading raw files directly from GitHub (turns out downloading 75,000+ SVG files is enough to hit those limits). I'd restart it. It would fail again, differently. I spent a lot of time staring at progress bars and error logs.

Those same strategies (graceful recovery, smart batching, delta-only updates) will be essential when re-processing codebases for changes. If rebuilding from scratch takes hours or days, the system won't work in practice.
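Two of those strategies can be sketched generically. This is not the plugin's code: `RateLimited`, `fetch_with_backoff`, and `import_pending` are hypothetical names, and the sketch assumes the fetch function raises an error when it hits a rate limit.

```python
import time
from typing import Callable, Iterable, Set


class RateLimited(Exception):
    """Hypothetical error a fetch function raises on an HTTP rate-limit response."""


def fetch_with_backoff(fetch: Callable[[], str], retries: int = 5,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], object] = time.sleep) -> str:
    """Call fetch(), waiting 1x, 2x, 4x... base_delay between rate-limited attempts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fetch()
        except RateLimited:
            if attempt == retries - 1:
                raise  # Out of attempts: surface the error to the caller.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def import_pending(names: Iterable[str], done: Set[str],
                   import_one: Callable[[str], object]) -> None:
    """Import only names not already done, recording each success so a restart resumes."""
    for name in names:
        if name in done:
            continue
        import_one(name)
        done.add(name)


# Simulated run: the first two fetches are rate limited, the third succeeds.
attempts = {"count": 0}

def flaky_fetch() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited()
    return "<svg/>"

delays: list = []
result = fetch_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)  # <svg/> [1.0, 2.0]

done = {"home"}
imported: list = []
import_pending(["home", "search"], done, imported.append)
print(imported)  # ['search']
```

Passing `sleep` as a parameter keeps the example instant to run; in real use the default `time.sleep` applies.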

Why Google Material Symbols?

I needed an icon system that could work as a shared foundation. Google's Material Symbols fit perfectly.

The library is comprehensive, with nearly 4,000 icons covering most interface needs. But more importantly, it's a proper system, not just a collection of SVGs. Each icon has 3 visual styles (Outlined, Rounded, Sharp), 7 weights (100-700, like typography), 2 fill states (empty or filled), 3 optical grades (adjusts for light/dark backgrounds), and 4 standard sizes (20, 24, 40, 48dp).

That's 504 explicit variants per icon (3 × 7 × 2 × 3 × 4). Google ships these as variable fonts for production use, about 1.5MB per style or 4.5MB total, highly optimized for web delivery. But they also maintain the source SVGs these fonts are built from, which is what the plugin uses.
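The count is easy to verify by enumerating the combinations. In this sketch the grade labels are placeholders for the three optical grades, and the icon count of 4,000 is the rounded figure from above.

```python
from itertools import product

STYLES = ("outlined", "rounded", "sharp")
WEIGHTS = (100, 200, 300, 400, 500, 600, 700)
FILLS = (0, 1)
GRADES = ("low", "default", "high")  # placeholder labels for the 3 grades
SIZES = (20, 24, 40, 48)

# Every explicit variant of a single icon.
variants = list(product(STYLES, WEIGHTS, FILLS, GRADES, SIZES))
print(len(variants))  # 504

# Variants at one size, e.g. everything rendered at 20dp.
at_20dp = [v for v in variants if v[4] == 20]
print(len(at_20dp))  # 126

# Rough library total, assuming about 4,000 icons.
print(len(variants) * 4000)  # 2016000
```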

Why SVGs for Figma? Because Figma needs actual vector paths to render icons correctly. Using SVGs means the Figma components look exactly like production, with all the same configuration options clearly represented for engineering and other stakeholders. Then the actual implementation can use the optimized font files rather than embedding SVGs directly.

The raw SVG source is nearly 1GB. Two million very small files add up quickly.

Universal delivery: The library is already cached on millions of devices. Developers know it. Designers recognize it.

Active maintenance: Google updates it regularly with new icons and refinements. If I'm building a system around staying synchronized, I need a source that actually changes. That said, there have only been about 10 updates in the last 18 months, so changes happen infrequently.

If both my Figma components and my React components use the same Material Symbols system with the same naming and variants, we have a shared vocabulary. A search icon in Figma maps to a search icon in code, with weight, style, and fill properties that mean the same thing in both places.

The Technical Challenge: Updates and Metadata

Building a static snapshot of 4,000 icons is straightforward. Building a system that stays synchronized as Google updates their repository was the hard part. Here's how I solved the three trickiest problems.

Tracking Changes with Metadata

Each icon component in Figma stores metadata using Figma's setPluginData() API. Specifically, each icon component stores the Git SHA from the commit that last changed it in Google's repository, and each variant frame stores a hash of its SVG content.

When Google updates the repository, my automation script (running weekly via CI) analyzes which icons changed between the old commit and the new one. It notifies me and calculates the delta information needed for the plugin.

Then I manually run the plugin on each of the 26 files. Running all 26 now takes an hour or two in total, because the plugin only regenerates icons that actually changed. The initial build took an hour or more per file, over 26 hours total.

Take the pentagon icon. Google has a bug in the 20dp optical size alignment. (Yes, I check every update hoping they've fixed it.) When they eventually fix it, the script identifies that pentagon has a different Git SHA than what's stored in Figma. The plugin downloads all 504 SVG variants, computes content hashes for each, and compares them with stored hashes. Only the 20dp variants have different hashes, so only those frames get updated. Icons without updates are skipped entirely.

What used to take 26+ hours becomes a 1-2 hour update cycle.
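The per-variant comparison can be sketched roughly like this. It is not the plugin's actual code: SHA-256, the key format, and the function names are assumptions for illustration.

```python
import hashlib

def content_hash(svg_text: str) -> str:
    """Hash of SVG content. SHA-256 is an assumption; any stable hash would do."""
    return hashlib.sha256(svg_text.encode("utf-8")).hexdigest()

def changed_variants(stored_hashes: dict[str, str], fresh_svgs: dict[str, str]) -> list[str]:
    """Return variant keys whose freshly downloaded SVG differs from the stored hash."""
    return sorted(
        key for key, svg in fresh_svgs.items()
        if stored_hashes.get(key) != content_hash(svg)
    )

# Toy data: two variants of a hypothetical icon, only the 20dp one changed upstream.
old = {"outlined/400/20dp": "<svg>old-20</svg>", "outlined/400/24dp": "<svg>same-24</svg>"}
stored = {key: content_hash(svg) for key, svg in old.items()}
fresh = {"outlined/400/20dp": "<svg>fixed-20</svg>", "outlined/400/24dp": "<svg>same-24</svg>"}

print(changed_variants(stored, fresh))  # ['outlined/400/20dp']
```

Only the keys returned here would need their frames rebuilt; everything else is left alone.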

Handling Edge Cases

Deprecations: When Google deprecates an icon, I don't delete it immediately. The plugin prefixes the component name with _deprecated_. Designers can still see deprecated icons in their existing designs, while the prefix prevents new usage. On the next publish, these can be cleanly removed from the shared library.

New additions: When Google adds new icons, they need to fit into the existing alphabetical organization. The 26 files are split alphabetically; most hold about 160 icons, with a range of 80 to 200. New icons are added to whichever adjacent file has fewer icons, and the filename range gets updated accordingly. Each file is 28-39MB, about 900MB in total. Keeping files balanced prevents unnecessary icon moves and stays within Figma's memory constraints.

Graceful recovery: During initial development, partial failures were common. I'd start building 170 icons for a file, get 70% through, then start hitting rate limits. Instead of getting all 504 variants for an icon, I'd only get 274 SVGs. The rest returned 404 errors.

The plugin creates partial icons with whatever SVGs it successfully retrieved. Then I'd wait a few minutes and rerun the plugin on the same file to backfill the missing variants. Some files I had to rerun several times, often because I got impatient and didn't wait long enough between attempts.

On the plus side, most of this ran in the background while I worked on other things. On the downside, it still took a bit of my attention off and on for several days to complete the initial import.

But this robustness is exactly what I mean by "scalable to more complex scenarios." Icons are simple. Just SVG paths. When I eventually generate components from React code, I'll likely need to download JSON schemas, API responses, or other data to describe what's needed for complex components. I'll hit those same errors. The system needs to handle partial failures gracefully and resume where it left off.
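The resume-where-it-left-off behavior might look something like the sketch below. The `fetch` callable and the key names are hypothetical stand-ins, and the flaky source is simulated rather than a real network call.

```python
def backfill(expected: set[str], have: dict[str, str], fetch) -> set[str]:
    """Fetch only the variants not yet stored; return the keys still missing.

    `fetch` is a hypothetical callable returning SVG text, or None on failure
    (a 404 or a rate-limit response, for example).
    """
    for key in sorted(expected - set(have)):
        svg = fetch(key)
        if svg is not None:
            have[key] = svg   # keep partial progress even if later fetches fail
    return expected - set(have)

# Simulated flaky source: the first request for most keys fails, a retry succeeds.
attempts: dict[str, int] = {}

def flaky_fetch(key: str):
    attempts[key] = attempts.get(key, 0) + 1
    return f"<svg>{key}</svg>" if attempts[key] > 1 or key.endswith("24dp") else None

expected = {"20dp", "24dp", "40dp", "48dp"}
have: dict[str, str] = {}
missing_after_first = backfill(expected, have, flaky_fetch)
missing_after_second = backfill(expected, have, flaky_fetch)
print(sorted(missing_after_first))   # ['20dp', '40dp', '48dp']
print(sorted(missing_after_second))  # []
```

Because each run only asks for what is still missing, rerunning is cheap and safe, which is the property that should carry over to heavier payloads later.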

The Result: A Living Icon Library

With all these strategies in place (delta updates, smart batching, metadata tracking, and graceful recovery), the system delivers what I set out to build: a living icon library that stays current with minimal manual intervention.

The plugin generates 26 Figma files, organized alphabetically, with nearly 4,000 icons and 504 variants of each. These files are exported and attached to GitHub releases, so anyone can download pre-built .fig files and use them immediately.

More importantly, the system stays current. A GitHub Action runs weekly, checks Google's repository for changes, calculates the delta if changes are detected, runs tests on the changes, and notifies me with an optional mobile push notification.

When I merge the changes and run the plugin on the 26 files (an hour or two of work), only the changed icons regenerate. Given that updates happen roughly twice a year, we're talking about just a few hours of annual maintenance. Additional automation either isn't possible in Figma's environment or wouldn't save enough time to be worth building.

The plugin and pre-generated .fig files are available in the latest release: v1.2.2

The repository also takes advantage of my universal Docker container system for consistent development and CI/CD environments. That system is open sourced at github.com/joshjhall/containers, with a detailed write-up about the approach: Building a Universal Container System.

How These Files Actually Get Used

I'm going to dive deeper into this in the next post, but the brief version: my design system has multiple layers, each serving different purposes.

1. Core components (where code-generated components will live): Fundamental UI elements like buttons, fields, sheets, and dialogs. The goal is to build these in React, then generate them back into Figma and publish to the library. Today these are manually maintained, but this is where the code-driven approach will eventually land.

In Figma, I use naming conventions to convey information to engineers. Components prefixed with an underscore (like _icon) are private and won't be published directly to the shared library. Think of these like private classes. Components wrapped in angle brackets (like <icon>) indicate React component boundaries, helping engineers understand where component splits likely make sense.
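Conventions like these are simple enough for tooling to read as well as people. A small sketch of a classifier, assuming the two rules exactly as stated; the "public" label for everything else is my own placeholder.

```python
import re

def classify(name: str) -> str:
    """Classify a Figma component name by the prefix and bracket conventions above."""
    if name.startswith("_"):
        return "private"          # not published to the shared library
    if re.fullmatch(r"<[^<>]+>", name):
        return "react-boundary"   # suggests a React component split
    return "public"               # hypothetical label for everything else

for name in ["_icon", "<icon>", "button"]:
    print(name, "->", classify(name))
```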

The core file includes all 26 icon files as dependencies and has a base _icon component with a swappable child symbol. An <icon> component enhances this with sizing and optional badge decorations. The _icon component maintains a preferred icons list, filtering the full 4,000 icons down to 100-300 that the team actually needs. Common icons like close, menu, and search are universal, but there's about 20-60% variation between different apps.

A small set of heavily reduced icons (24 variants each instead of 504) is embedded directly in the core components file for things like checkboxes, radio buttons, and dropdown arrows. Keeping these embedded means the core file stays clean of dependencies except for the main _icon preferences.

2. Product-specific components (where partially code-generated components will live): Specialized components and configured variations of core components. The vision is for some to be generated from code, while others are designed in Figma then built in code.

3. Product designs (never code-generated): Actual screens, flows, and states designed using components from layers 1 and 2. This is where designers work day-to-day, combining existing components to design new features.

There's also an implicit exploration layer that doesn't fit neatly in this hierarchy: component concept files that consume core and product components to explore new component ideas. These inform the code development that eventually generates new components for layers 1 and 2.

Designers working on product features only need to include the core components and product components in their files. The icon filtering is already done at the system level. Adding new icons to the preferred list requires a deliberate decision and a new publish from core components. That's just enough friction to make teams think about it and, in larger organizations, to force collaboration between teams before icon proliferation gets out of hand.

More on component structure patterns in the next post.

Taking Stock: What This Proves

Building this icon library has proven several things:

Programmatic Figma component generation works at scale
Git-based synchronization can keep design assets current with external sources
Metadata and delta updates make large-scale regeneration practical
The same approach that syncs with Google's repository can sync with a React codebase

But this is just the foundation. These are raw Material Symbols, exactly as Google publishes them. They're the shared vocabulary layer, the common foundation that both design and engineering build upon.

In the next post, I'll show how to actually use these icons in well-structured Figma components. We'll look at what makes a good base component using buttons as an example, and how consistent practices in Figma can convey rich information to engineers.

We'll also run into some of Figma's frustrating limitations, which is part of why generating components from code starts looking increasingly appealing.

I'm still refining this approach. The icon foundation is working well in production. The component generation from React code is next. The full workflow hasn't been battle-tested at scale.

But I think there's something here. A way to build design systems that aren't two separate systems hoping to stay in sync, but one system with multiple interfaces.

More to come.

Next in this series: Building effective base components in Figma—what makes a good button, and why some of Figma's limitations make code-driven generation increasingly appealing

The plugin and source code are available at github.com/joshjhall/google-symbols-figma-plugin. Have thoughts on this approach? Find me on Twitter/X or open an issue on GitHub.

Generative UI Is Three Things. Only One Ships.

Joshua Hall — Thu, 04 Jun 2026 14:15:00 +0000

Google shipped a generative UI experiment in November 2025 that matched a human-designed interface in roughly half the cases it was tested against. The other half it produced an artifact the way an unsteamed dumpling is food — recognizably the right shape, not actually edible. The system could take a minute or more to render a single page and produced different output on every refresh. Most of the takes I read framed the gap as a "more compute will fix it" problem. It won't.

Google's own scoring is honest about it: their implementation earned an Elo of 1736.2, a strong preference over every other output format, and lost only to human experts, whom it merely matched about half the time. That's an impressive result aimed at the wrong target.

Notice which half gets the spotlight. "Comparable to a human expert in half the cases" is the generous read; the inverse is that the other half come out worse. And screens don't arrive one at a time; they chain into workflows. Probability compounds in the direction nobody's gut expects. At a coin-flip per screen, a four-step flow renders cleanly about one time in sixteen: roughly 6% of users get through without a visible failure, and the other 94% hit at least one bad screen. Stretch it to six steps and the share who hit a bad screen climbs to 98%. The demo reads like a coin toss; the workflow reads like a near-certain stumble. People hear "50%" and picture a wash. It isn't. It's a near-guarantee of failure, dressed up as a fair bet.
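The arithmetic is easy to check. This sketch assumes each screen succeeds independently with the same probability, which is a simplification; real failures are probably correlated.

```python
def clean_flow_rate(per_screen_success: float, steps: int) -> float:
    """Probability that every screen in a flow renders acceptably, assuming independence."""
    return per_screen_success ** steps

for steps in (1, 4, 6):
    clean = clean_flow_rate(0.5, steps)
    print(f"{steps} steps: {clean:.4f} clean, {1 - clean:.4f} hit at least one bad screen")
```

Four steps gives 1/16 clean and six steps gives 1/64, which is where the 94% and 98% figures come from.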

"Generative UI" is one name stretched across at least three distinct technical approaches, and the conversation suffers for the conflation. One is the demo everyone admires and nobody can ship. One is a modest improvement on the first that still doesn't work. The third is tame enough that most people won't file it under generative UI at all, and it's the one that wins.

One distinction up front, because the rest depends on it: generating code is not the same as generating interfaces. Tools like v0 and Claude Code generate UI, but the output is a static artifact, a box you build once and reuse. Generative UI in the sense that matters here is a runtime behavior, an interface that adapts to context and input as you use it. The difference is the difference between asking an AI to build you a box for your camping gear and having an AI watch you nearly drop an armful of cans, decide you need a backpack, sew one, and hand it over. The second is a marvel — and a parade of assumptions, each one a fresh chance to be wrong, which is roughly why we evolved language first. Being told what someone needs beats inferring it from watching them fumble.

Three Variants Wearing One Name

The most ambitious variant generates the interface from scratch on every load. The model takes a prompt, writes HTML, CSS, and JavaScript, and the browser renders whatever comes back. It has the highest ceiling (anything could be in the UI) and it's the version Google demoed. It also produced usable interfaces at a coin-flip rate, took a minute or more per page, burned tokens by the fistful, and rendered differently every refresh. Make the model ten times bigger and faster and it's still stochastic, still expensive, still inconsistent. Compute doesn't change the shape of the problem.

The middle variant constrains the model to a finite component library. Instead of writing layout code, the model emits a serialized payload, usually JSON, naming which existing components to render, with what props, in what arrangement. The client reads the payload and assembles the page. This is a genuine improvement: components stay consistent, accessibility doesn't regress on every refresh, and the model can't ship anything the design system forbids. It also has real momentum. Google open-sourced A2UI in December 2025 to standardize exactly this kind of agent-emitted interface, and a wave of commercial tools now lets non-technical staff assemble forms and reports inside a sanctioned component set. Boring, mostly. But boring is what most corporate software is, and the CIO gets at least some say over the standards — enough to be a little less likely to wake up at 3 AM already sweating before the phone finishes its first ring.
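To make "the model can't ship anything the design system forbids" concrete, here is a minimal sketch of a client-side check on a model-emitted payload. The payload shape and the registry are invented for illustration; this is not the A2UI schema.

```python
import json

# Hypothetical component registry: component name -> allowed prop names.
REGISTRY = {
    "TextField": {"label", "required"},
    "Button": {"label", "variant"},
}

def validate(payload_text: str) -> list[str]:
    """Return a list of problems; an empty list means the payload is renderable."""
    problems = []
    for i, node in enumerate(json.loads(payload_text)["components"]):
        allowed = REGISTRY.get(node["type"])
        if allowed is None:
            problems.append(f"component {i}: unknown type {node['type']!r}")
            continue
        extra = set(node.get("props", {})) - allowed
        if extra:
            problems.append(f"component {i}: unsupported props {sorted(extra)}")
    return problems

good = '{"components": [{"type": "TextField", "props": {"label": "Email"}}]}'
bad = '{"components": [{"type": "Carousel"}, {"type": "Button", "props": {"color": "red"}}]}'
print(validate(good))  # []
print(validate(bad))
```

The check guarantees that every piece is sanctioned. It says nothing about whether the arrangement makes sense.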

The catch is that the middle variant still hands the model authority over the whole screen. Every button, every input, every layout call is the model's. The variance that made the full-code version unreliable just moves to the component-selection axis: refresh and you might get a different field order, a different empty state, a different primary action. That's the same usability problem with extra steps. Add competing, half-formed standards and the fact that non-designers still can't organize an information flow, and this variant stalls in the near term even where the plumbing works.

The third variant is where products will actually land. Keep the core interface deterministic: designed by a human, shipped as fixed components, predictable across visits. Then carve out a bounded surface where dynamic widget selection happens. The design team builds a finite library of widgets (twenty, thirty, fifty) and a layer above them (an LLM, a plain rules engine, often both) decides which to surface in which context. The model isn't laying out the page. It's picking from a menu. That surface is often supplemental, but it doesn't have to be: it can be the primary one, a form that grows its own fields as it works out which data is still missing.

The 80/20 of Music

Effective interfaces follow the balance that effective music does. Most of a song is familiar: the beat you can clock, the progression you can predict, the structure you expect. A thin sliver is novel: the unexpected modulation, the dropped chorus, the turn you didn't see coming. The familiar parts are why you can listen. The novel parts are why you keep listening. Tilt the ratio too far either way and the song fails: too much repetition bores, too much novelty becomes unlistenable.

Interfaces work the same way. Most of what a user touches needs to be predictable enough that they build muscle memory and stop seeing the chrome. The smaller novel portion — the smart default, the contextual helper, the just-in-time widget — is what makes the product feel intelligent. Invert the ratio and you get generative UI in the full-code sense: little is predictable, and the user relearns the interface on every visit. The model can make a good local call every single time and the experience still fails, because consistency is a property of the sequence of interactions, not any one of them.

The third variant respects the ratio, but the ratio is about the whole product, not each screen. The deterministic core is the 80%. The dynamic surface is the 20%. Some individual screens will lean heavily dynamic, and that's fine; what matters is that the parts the user depends on stay put while the model gets to be clever in a bounded space where being wrong is recoverable.

The Travel App Test

Travel booking is the example I keep returning to, because the assumptions are so visible. Kayak, Travelocity, the airline sites: they're all built around one traveler or one group, one origin, one destination, maybe a round trip with a couple of legs. That covers most trips. It also falls apart the moment a trip doesn't fit the mold, and the interfaces have no graceful way to bend.

Picture a bachelorette party. Six friends converging on Vegas from four cities, all needing to land before 8 PM Friday for Penn & Teller tickets. Two have hard work deadlines pinning their return dates; the others have more flexibility. Today you solve this with six or more browser tabs, a spreadsheet, and a group chat full of screenshots. Nobody ships an interface for it: it's NP-shaped and rare enough not to be worth hand-designing.

Here's what generative UI actually buys you, and it isn't what the demos suggest. You don't design an interface for the bachelorette case. You design an origin widget, a destination widget, a lodging widget, a ground-transport widget, each one independent, each with sensible defaults, and you give the model enough context to know when to flex them. The origin widget either grows a state that accepts up to ten cities, or the model drops a second origin widget on the canvas for parallel tracking. Either way, the origin widget is the origin widget. It neither knows nor cares whether the trip is one person round-tripping to Vegas or six people scattering home to four cities. Build enough flexibility into each component, hand the model good defaults and enough context to choose between them, and the permutations resolve themselves. You were never going to hand-design all of them anyway.

The same machinery handles the far more common case, the business loop trip, A → B → C → D → A. The core interface shows the loop. A bounded panel runs the pigeonhole logic: which legs have a rental car, which don't, which hotels are confirmed, which are still open. Twenty-five holes to fill, sixteen filled, nine left, and the model's job is to rank the nine and surface the most important one next. The widgets that render them are already designed.
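The ranking step does not need to be exotic. Below is a plain-rules stand-in for it; the field names and priority weights are hypothetical, and a real product might hand the same list to an LLM instead.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Slot:
    leg: str          # e.g. "B"
    kind: str         # "flight", "hotel", "car"
    filled: bool
    days_until: int   # days until this slot is needed

# Hypothetical priority weights; a real product would tune or learn these.
KIND_WEIGHT = {"flight": 3, "hotel": 2, "car": 1}

def next_slot(slots: list[Slot]) -> Optional[Slot]:
    """Pick the most urgent unfilled slot: soonest first, then by kind weight."""
    open_slots = [s for s in slots if not s.filled]
    if not open_slots:
        return None
    return min(open_slots, key=lambda s: (s.days_until, -KIND_WEIGHT.get(s.kind, 0)))

trip = [
    Slot("A", "flight", True, 2),
    Slot("B", "hotel", False, 3),
    Slot("B", "car", False, 3),
    Slot("C", "flight", False, 5),
]
pick = next_slot(trip)
print(pick.leg, pick.kind)  # B hotel
```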

Decompose the Permutation, Not the Interface

I shipped a version of this at Reva. Credit and criminal screening reports come back in thousands of permutations, and we had to represent that range while telling both the applicant and the property manager, in plain language, what a given result actually meant, then drive different workflows, validation, and escalation off it. Designing a layout for every combination was never on the table.

So I didn't. I decomposed the screening result into about thirty key metrics. I decomposed the interface into a dozen regions: headline, subheading, summary, a credit line item, and so on. Then each region into the five to twenty text variants that could fill it. The output was a clean, human-readable JSON payload of thirty-odd metrics that drove the i18n keys, and those keys covered every combination: clean credit with a criminal flag, thin file with nothing on it, all of it. I never solved the whole permutation at once. I solved small, independent slices, which is the only reason I could prove the result was mathematically complete. In the end it was a set of rules. Decompose the information problem cleanly enough and most of the apparent complexity was never really there — Shannon's fingerprints are on every logistics problem if you look for them.
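The shape of that decomposition can be sketched in a few lines. Every metric name, region, and i18n key here is invented for illustration; the real payload had about thirty metrics and a dozen regions.

```python
from itertools import product

# Hypothetical value ranges for three of the metrics.
TIERS, FLAGS, DEPTHS = ["clean", "fair", "poor"], [True, False], ["full", "thin"]

# Each region owns a small, independent rule mapping metrics to one i18n key.
def headline_key(m: dict) -> str:
    if m["criminal_flag"]:
        return "headline.review_required"
    return "headline.approved" if m["credit_tier"] == "clean" else "headline.conditional"

def credit_line_key(m: dict) -> str:
    if m["file_depth"] == "thin":
        return "credit.thin_file"
    return f"credit.{m['credit_tier']}"

REGIONS = {"headline": headline_key, "credit_line": credit_line_key}
KNOWN_KEYS = {
    "headline.review_required", "headline.approved", "headline.conditional",
    "credit.thin_file", "credit.clean", "credit.fair", "credit.poor",
}

def render_keys(m: dict) -> dict[str, str]:
    """Resolve every region independently; no rule looks at another region's output."""
    return {region: rule(m) for region, rule in REGIONS.items()}

# Completeness check: every combination of inputs resolves to keys that exist.
for tier, flag, depth in product(TIERS, FLAGS, DEPTHS):
    keys = render_keys({"credit_tier": tier, "criminal_flag": flag, "file_depth": depth})
    assert set(keys.values()) <= KNOWN_KEYS

print(render_keys({"credit_tier": "clean", "criminal_flag": True, "file_depth": "full"}))
# {'headline': 'headline.review_required', 'credit_line': 'credit.clean'}
```

Because each rule is small and independent, the exhaustive loop stays cheap, and it is the part that turns "covers every combination" into something you can actually verify.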

Why This De-Risks the Bad Days

The third variant carries a quiet benefit the ambitious ones don't. When the model gets it wrong — and it will, routinely — the damage stays inside a surface the user can ignore. The core interface keeps working, defaults stay sane, and escape hatches are easy to design because the design team owns the chrome.

In the full-code variant, a model mistake is a broken page. In the component-everywhere variant, it's a confusing layout the user has to decode. In the bounded variant, it's an oddly chosen widget the user can dismiss (or correct, which feeds a signal back to sharpen the next call) while their actual task runs on untouched. Same model, same error rate, wildly different blast radius.

What to Actually Build

If I were building generative UI into a product today, the work would be concrete. Build a finite widget library, twenty to fifty widgets covering the workflow components your application actually has. The trick is clean separation of concerns: if the origin widget only ever captures origin, it contributes exactly one slice to the information problem at the heart of nearly every logistics task. Define what data each widget reads, what states it can hold, what context makes it relevant. Then build the selection layer (sometimes an LLM, sometimes a rules engine, usually both) that picks widgets based on the live context: what the user is doing, what they've finished, what's still missing.
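As a rough sketch of what "define what each widget reads and what makes it relevant" could look like in code: the three widgets, their fields, and the rules below are hypothetical, and the selection layer is a plain rules engine standing in for whatever mix of rules and LLM a real product would use.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Widget:
    name: str
    reads: frozenset[str]              # context fields the widget needs
    relevant: Callable[[dict], bool]   # when the widget is worth surfacing

# Hypothetical three-widget library for a travel flow.
LIBRARY = [
    Widget("origin", frozenset({"travelers"}),
           lambda c: not c.get("origins")),
    Widget("lodging", frozenset({"destination"}),
           lambda c: bool(c.get("destination")) and not c.get("hotel")),
    Widget("ground_transport", frozenset({"destination"}),
           lambda c: bool(c.get("hotel")) and not c.get("car")),
]

def select(context: dict, limit: int = 2) -> list[str]:
    """Pick relevant widgets whose required inputs are present, up to a limit."""
    picks = [w.name for w in LIBRARY
             if w.reads <= set(context) and w.relevant(context)]
    return picks[:limit]

print(select({"travelers": 6}))                                             # ['origin']
print(select({"travelers": 6, "destination": "LAS", "origins": ["SEA"]}))   # ['lodging']
```

The model, if there is one, only ever returns names from the library, so the worst it can do is pick an unhelpful widget.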

The same widgets pay off twice, because they work in a chat context as readily as a panel. The user who wants a traditional UI gets widgets in a side panel; the user who wants to type gets the same widgets inline in the conversation. One library, one data contract, two surfaces, and the design investment carries across interaction modes instead of being rebuilt for each.

The full-dynamic dream — an interface generated from scratch every time — ignores why interfaces work at all. Consistency isn't a symptom of stale design; it's the point. The right ratio is most of the song you recognize and a chorus that surprises you. Get it right and generative UI is genuinely useful. Get it wrong and you're shipping an unsteamed dumpling.

Idempotent Design: When Order Shouldn't Matter

Joshua Hall — Wed, 03 Jun 2026 15:00:00 +0000

I pulled up to a Dairy Queen drive-thru last week and ordered a meal in the wrong order. The cashier couldn't take it. Not because the information was missing (I'd given her every piece she needed) but because I'd given the pieces to her in a sequence the point-of-sale system couldn't absorb. She made me repeat elements three times to get the magic incantation right. I know it wasn't her fault.

This bug lives in nearly every piece of software I've used. The interaction is conceptually order-independent (bold then italic, italic then bold, doesn't matter), but the system underneath models it as sequential, and the order leaks out to the user. That leak is what I want to talk about.

Why Idempotent and Not Orthogonal

In math, an idempotent operation produces the same result whether you apply it once or a thousand times. In engineering, idempotency usually means you can retry a request without changing the outcome. mkdir foo fails the second time; mkdir -p foo doesn't. The flag is what makes the operation idempotent.

I'm borrowing the term a little loosely for design, because the closest engineering concept (orthogonality) already means something else in the design world. Orthogonal designs are two or more approaches to the same underlying need. A contact form and an LLM chat that collect the same fields are orthogonal: different surfaces, same outcome, and a team can ship either or both without one blocking the other. That's a useful framing, but it's not what I want to talk about here.

The property I'm after is whether the order of independent actions affects the result. Bold then italic gives you the same component as italic then bold. Adding a leading icon and changing a label are independent operations; either sequence ought to produce the same output state.

Many user-facing interactions ought to be idempotent by default. The cases where order genuinely matters (a wizard, a checkout, an irreversible commit) are the exceptions, and they should be designed as exceptions. Yet most software treats idempotent interactions as if they were sequential, because the data model underneath happens to be sequential and nobody pushed back.

Idempotency in design isn't a strict math property. It's a promise to the user that the tool will meet them wherever they start, in whatever order they think.

The Drive-Thru Test

The Dairy Queen story has a counterexample down the road. I can pull up to a McDonald's drive-thru and say "number one, large, with a Diet Coke." I can also say "Diet Coke, large, number one." Or "a Big Mac and a Coke, make it a meal, make the Coke diet, make the meal large." All three describe the same order, and McDonald's point-of-sale absorbs them. Someone on that team thought through what the user actually says and built the system to handle any phrasing.

Dairy Queen (and most typical configurations from Micros, Aloha, and the rest) has not. Try to upgrade the drink before the meal is built and the system gets confused. The cashier compensates by holding the order in memory, reordering it for the system, then entering it, sometimes mid-sentence while the customer moves on to the next item. Fast-food locations struggle to find people who can handle the drive-thru specifically because of that cognitive overhead.

Think of it as a pigeonhole problem. The order has a fixed set of slots (item, size, modifications, drink), and a sufficiently good interface routes incoming information into the right slot regardless of arrival order. The hard work is on the team building the system, not the user placing the order. Most teams skip that hard work, and the friction lands on the user.
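A toy version of that routing, to show how little the arrival order has to matter. The vocabulary is made up and far smaller than a real menu, and it skips modifiers like "make it a meal".

```python
# Hypothetical vocabulary for three of the slots.
SIZES = {"small", "medium", "large"}
DRINKS = {"coke", "diet coke", "sprite"}
ITEMS = {"number one", "big mac"}

def route(phrases: list[str]) -> dict[str, str]:
    """Place each phrase in its slot based on what it is, not when it arrived."""
    order: dict[str, str] = {}
    for phrase in phrases:
        p = phrase.lower()
        if p in SIZES:
            order["size"] = p
        elif p in DRINKS:
            order["drink"] = p
        elif p in ITEMS:
            order["item"] = p
    return order

a = route(["number one", "large", "Diet Coke"])
b = route(["Diet Coke", "large", "number one"])
print(a == b)  # True
```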

Where the Tools Fall Apart

Design tools are full of this. Figma is the worst offender I work in daily: nested components, instance overrides at every layer, and an override-resolution algorithm where the order you make changes determines whether your other changes survive. Swap a child icon before you change the parent's state and the state change silently fails to propagate. Reverse the order and it works. I'll dig into the specific cases in a future post; the point here is the philosophical one. The tool's behavior is consistent. It just isn't idempotent.

Designers who use Figma long enough split into two groups. One learns the order-dependence as muscle memory and develops a private set of workarounds: set the color last, don't touch the size after you've set the variant, don't swap the nested instance until everything else is locked in. The other stays constantly frustrated and never quite figures out why their components keep breaking. Some of the failure modes are nuanced enough that even experienced designers can't articulate the cause; they just clean up after the tool and chalk it up to Figma being Figma. None of the workarounds are documented anywhere, because the tool can't carry the rule itself. It's tribal knowledge passed designer-to-designer.

Real Constraints vs Shadow Constraints

Not every interaction should be idempotent. Some flows are inherently sequential. You can't ship a package before you've packed it. You can't deploy code before it builds. You can't charge a card before you have the cart total. Pretending those constraints away doesn't make the design better — it makes it unworkable.

The test isn't whether the order matters; it's whether the ordering is a real constraint or a shadow constraint. Real constraints come from physics, business rules the user actually expects, or genuinely irreversible operations. Shadow constraints come from engineering and design decisions: simplifying assumptions, specific UI choices, poor data modeling, shortcuts to ship a feature seven years ago that now require a major refactor, or legacy assumptions carried in from third-party integrations.

The first-name/last-name field is one of the most common shadow constraints in software. At Reva, I pushed the team hard to use full name plus preferred name instead: better UX, better internationalization, and what the W3C's internationalization guidance actually recommends. But every credit and criminal screening service we integrated with required first and last as separate fields. So we built a shim to split a full name on submission, plus a manual override flow for the rare case the split was wrong. It was right about 999 times out of 1,000, which still meant a meaningful number of applicants needed the override. All of that extra design and engineering work existed because of a third-party data model nobody on our team owned. The constraint wasn't real for our users; it was inherited from upstream systems we couldn't change.

Figma's idempotency problems are the same shape, but deeper in the stack. Most of the failure modes are almost certainly direct outcomes of engineering decisions in the override-resolution data model: simplifying assumptions from earlier versions, shortcuts that became load-bearing as the product grew. At this point, some are probably easier to fix with a partial rewrite than a patch. That's the long-term cost of treating idempotency as a series of one-off bugs instead of a property of the system.

Catch It Early or Pay Later

Idempotency, or whatever you want to call this property, almost never makes it onto a design system roadmap. Its absence shows up as individual bugs ("color resets when size changes") that get triaged one at a time and never get connected to the flaw they share: the order of independent actions affects the result. Once that's named as a class of problem, the triage changes. Instead of fixing the color-reset bug in isolation, the team asks which other attribute pairs are accidentally coupled.
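Asking "which attribute pairs are accidentally coupled" can be mechanical. The sketch below uses a toy component model with one deliberate bug; the actions and state shape are invented for illustration.

```python
from itertools import combinations

# Toy component model with a deliberate bug: changing size resets color.
def set_color(state: dict) -> dict:
    return {**state, "color": "red"}

def set_size(state: dict) -> dict:
    return {**state, "size": "large", "color": "default"}   # the accidental coupling

def set_label(state: dict) -> dict:
    return {**state, "label": "Save"}

ACTIONS = {"color": set_color, "size": set_size, "label": set_label}
BASE = {"color": "default", "size": "medium", "label": "OK"}

def coupled_pairs() -> list[tuple[str, str]]:
    """Return action pairs whose result depends on the order they are applied."""
    found = []
    for (a, fa), (b, fb) in combinations(ACTIONS.items(), 2):
        if fb(fa(BASE)) != fa(fb(BASE)):
            found.append((a, b))
    return found

print(coupled_pairs())  # [('color', 'size')]
```

Run against a real component's actions, a check like this turns a stream of one-off bug reports into a list of pairs to decouple.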

The catch is that this work pays the biggest dividends when it's done early. Once the data model and the engineering constraints have set, fixing them gets exponentially more expensive. Start with a shopping cart table that assumes one account per cart and you've quietly locked out roommates sharing groceries; the DB constraints, the auth checks, and the order-history queries all assume that shape now. DoorDash took years to make office group lunch ordering feel like anything other than a phone passed around a conference room, because the original model didn't anticipate it. Yardi still makes you create a new account with a different email address every time you apply to a property managed by a company you've already rented from. The unique constraint on email is scoped to a single property: a database decision pretending to be a product feature.

The interactions you ship today carry the constraints of the model you commit to. If you're designing a system right now, list the actions that compose into a single state and ask which are genuinely ordered and which only look that way because of how something underneath was modeled. If possible, fix the underlying models before they ship. It's easier to create a many-to-many relationship today that the UI doesn't use yet than to change that data relationship in the future. Even if you can't make and fulfill the promise to the user right now, leaving the door open gives you options.
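For the shopping cart case, leaving the door open costs one join table. A minimal sketch using an in-memory SQLite database; the table and column names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE);
CREATE TABLE cart    (id INTEGER PRIMARY KEY);
-- Join table: many-to-many from day one, even if the UI only ever adds one row per cart.
CREATE TABLE cart_member (
    cart_id    INTEGER NOT NULL REFERENCES cart(id),
    account_id INTEGER NOT NULL REFERENCES account(id),
    PRIMARY KEY (cart_id, account_id)
);
""")
db.executemany("INSERT INTO account (id, email) VALUES (?, ?)",
               [(1, "a@example.com"), (2, "b@example.com")])
db.execute("INSERT INTO cart (id) VALUES (1)")

# Today's UI: one owner per cart.
db.execute("INSERT INTO cart_member VALUES (1, 1)")
# Later: a roommate joins the same cart with no schema change.
db.execute("INSERT INTO cart_member VALUES (1, 2)")

members = [row[0] for row in db.execute(
    "SELECT account_id FROM cart_member WHERE cart_id = 1 ORDER BY account_id")]
print(members)  # [1, 2]
```

Had the cart table carried a single account_id column instead, the second insert would have required a migration plus changes to every query that assumed one owner.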

Ask how you can apply idempotent design early and often. Make the promise to users in advance instead of begging forgiveness when it feels complicated later.

Your Designers Aren't the Problem

Joshua Hall — Tue, 02 Jun 2026 13:30:00 +0000

A team I cleaned up after had seventeen dropdown components. Not seventeen variants of one dropdown — seventeen separate components, each with its own states, its own spacing, its own quirks. The audit took longer than I expected, mostly because the team kept finding more. "Wait, this one's in a different file. So is this one."

Nobody set out to build seventeen. The first designer had a perfectly good reason — the spacing on the original was wrong for their case. The second needed a different icon slot. The third was working in a feature area where the existing dropdown's hover state didn't match the surrounding components. There was no design leadership conversation to say "let's fix the original instead." There was no shared review process to even notice a new component had been added. So another one went in, and another, and another, and the developers shipping them were doing exactly what they'd been asked to do.

The worst part isn't the count. The worst part is that every one of those dropdowns failed something. Most failed basic accessibility — no keyboard navigation, no screen reader labels, no focus management. Several were unusable on touch interfaces. A few looked fine in light mode and fell apart in dark. Building a dropdown that's accessible, reliable across browsers, responsive across input modes, and theme-correct in any context is genuine work. The team had done that work badly seventeen times instead of doing it well once.

The maintenance cost is the visible cost. The hidden cost is users silently hitting broken interfaces. A user whose screen reader trips on the wrong dropdown doesn't file a bug — they leave. A user whose touch device can't open the right menu doesn't email support — they close the tab. The savvier user, who recognizes the pattern, gets angrier — these are exactly the kind of amateurish mistakes a serious team isn't supposed to make. Seventeen separate components is a maintenance story. Seventeen separately broken components is a user story, and it's the one that should keep the team up at night.

How a Design System Fractures

The story always rhymes. A team ships a design system with components that solve the cases they imagined when they built it. Six months later somebody hits an edge case the team didn't imagine — slightly different padding, an extra slot, a state that wasn't covered — and they have a choice. Extend the existing component, or duplicate it and modify the copy.

Extension feels hard. It requires understanding how the existing component is structured, what its variants mean, which other teams depend on it, and how the change will land in every consumer file. It requires either a conversation with whoever owns the component, or enough confidence to make the call alone. It requires time the designer doesn't have, because they're trying to ship a feature, not rebuild infrastructure.

Duplication feels easy. Right-click, detach instance, rename, modify. Done in three minutes. The new component lives in the same file as the feature work. Nobody is asking permission. The original is untouched.

The local calculation looks obvious. The problem is that the team is climbing the wrong hill. The top of the hill is shipping this feature this sprint. The mountain is a design system that compounds — where every accessibility fix lands once and stays fixed, where every browser quirk gets named and never rediscovered, where the next designer two years from now stands on the work the last one already did. Duplication wins the hill. Extension is the only path up the mountain.

The seventeen-dropdown count is what happens when a team optimizes for the hill every time, for ten years, with nobody ever pulling them back to point at the mountain. Newton's "shoulders of giants" is the global-maximum version of this work. The local-max version is fifty people relearning the same browser bug, individually, in private, with nothing surviving any of those discoveries to help the fifty-first.

Multiply that decision by a hundred features across ten years and you get seventeen dropdowns. The same pattern produces five button variants, eight card layouts, four versions of the same modal. The fracture isn't visible until somebody pulls back and counts. And it outlives the people who made it: the escape key in Photoshop cancels a text edit, the escape key in Illustrator commits it, and the two products have lived under the same roof at Adobe for something like thirty-five years. Design and tech debt are the digital equivalent of an invasive species. Once they're rooted, they're nearly impossible to extract without a coordinated effort that almost nobody wants to fund.

The Misdiagnosis

The first reaction I see when a team confronts the count is to blame the designers. "We need better discipline. We need a stricter review process. We need to enforce the system." This produces a brief flurry of consolidation, followed by another round of fracture starting six months later.

The designers aren't the bug. People follow the curb cuts of the system they're working within. If duplication is the path of least resistance, people will duplicate — not because they're undisciplined, but because they're behaving rationally inside the constraints they were given. The system is what gave them those constraints, and the system is what has to change.

This is hard to see from the inside. Joseph Heller spent five hundred pages on the point in Catch-22 — only Yossarian can see how insane the "sane" rules are, because everyone else has spent so long inside the rules that they've stopped noticing them. Most design teams are full of Yossarians who haven't realized they're Yossarians yet. The dropdowns are a symptom, not the disease.

The disease has multiple roots, and component infrastructure is only one of them. Components that don't extend gracefully are part of it — the original button wasn't built to accept a new state, so the designer forked rather than fight it. But it's also incentives: in most orgs, shipping a feature is rewarded and refactoring an existing component is invisible work, so the rational employee ships the feature and never touches the component. It's also discoverability: a designer may not know the existing dropdown is there, because nothing in the system surfaces it at the moment they're reaching for one. It's also engineering politics: the engineer who would have to refactor the underlying code knows it's been bad for years, can't get the time from their manager to fix it properly, and would rather not be the one to crack the lid on it.

Each of those is a deeper post — incentives, discoverability, refactor politics. The structural fix and the people-and-process fix have to happen together. Audits without structural change produce relapse. Structural change without addressing why people picked duplication in the first place produces beautiful unused components.

What Component Structure Has to Carry

A well-structured component does more than render correctly. It communicates how it's supposed to be used, where it's supposed to be extended, and what it's supposed to forbid — and it communicates why each of those is true. The why matters more than the what. A designer who knows the dropdown forbids inline icon overrides because the alignment math falls apart on right-to-left languages will respect the constraint and propose a real solution. A designer who only knows that overrides are forbidden will route around the rule the first time they hit a deadline.

This is a much higher bar than "the component renders." It means the component has to expose intent — touch target sizes, state layer behavior, slot semantics, padding rules — in ways the designer can read at a glance and the engineer can implement from without ambiguity. The four-layer frame architecture that does some of this is a later post. And the deeper requirement, that the why of a decision needs to live somewhere durable and attached to the component itself, is what I've been calling connective documentation — also a later post. The short version: most orgs have their design decisions trapped in three people's heads and a thousand stale documents, and AI alone isn't going to fix that without a real structure to build on.

The component also has to live somewhere appropriate to its scope. A product-specific card pattern doesn't belong in the same file as the core button. A one-off layout for a specific feature doesn't belong in the shared library at all. Without scope discipline, every component drifts toward the same file, and the file becomes unusable.

Clear separation of concerns is the most underrated discipline in design system work. Nail the separation and most of the headaches above disappear. The frameworks for finding good separation are coming from a few interesting places lately — actor-network theory and object-oriented ontology both have something useful to say about where the meaningful boundaries between concerns actually live, and I think both are going to matter for how we structure design systems and the world models that AI systems are starting to need. Another post for another day.

And scope decisions — "should the button include a loading state, a dropdown trigger, a badge?" — need to be made on something better than the gut feel of whoever shows up to the meeting first. There's a scoring system I use for that. It's the next post.

The Series

This is the first post in a short series about preventing design system fracture. The next three go into the specific frameworks.

CFS: Commonality, Frequency, Severity. A three-axis scoring system for deciding what belongs in a component and what doesn't. One-to-five score on each axis, multiplied (not added) for a nonlinear total out of 125, and a conversation that has structure instead of vibes. It's the single tool that has prevented more variant proliferation on the teams I've worked with than any other.

The three-layer architecture. Core, application, and context layers — what goes where, how components migrate between layers as they mature, and why most teams end up dumping everything into one tier until the tier collapses.

Decomposition without forking. Sub-component patterns, the underscore-versus-angle-bracket naming convention that makes component boundaries readable, and the simple test for when to decompose and when to build inline.
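The CFS arithmetic from the first item above is simple enough to sketch before the full post arrives. The code below is only an illustration of "three axes, one to five each, multiplied for a total out of 125". The class name, the example features, and their scores are invented, and nothing here says what score should count as a pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CfsScore:
    """One CFS rating: each axis is scored from 1 to 5."""

    commonality: int
    frequency: int
    severity: int

    def __post_init__(self):
        for name in ("commonality", "frequency", "severity"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def total(self) -> int:
        # Multiplied, not added: the maximum is 5 * 5 * 5 = 125.
        return self.commonality * self.frequency * self.severity


# Hypothetical ratings for two candidate button features.
loading_state = CfsScore(commonality=5, frequency=4, severity=4)
badge = CfsScore(commonality=2, frequency=2, severity=1)

print(loading_state.total, badge.total)
```

Added together, these two candidates would score 13 and 5. Multiplied, they score 80 and 4, which is the wider spread the multiplication is there to produce.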

After the frameworks, we build the actual button — the four-layer frame architecture, the icon-system composition, and the points where Figma's limitations push back on what the system wants to express.

The seventeen-dropdown problem isn't a discipline problem. It's a structure problem. Structure is fixable.

Your Coding Agent Doesn't Need Your Secrets

Joshua Hall — Mon, 01 Jun 2026 19:21:20 +0000

Every coding agent I use can read my .env file. Every one of them is a single prompt away from streaming its contents to a server I don't control. The fix has been obvious since the day Claude Code launched — redact on the way out, rehydrate on the way in — and a year later, no major vendor has built it into the client where it belongs.

Third-party proxies that do a version of this exist. The agents themselves don't ship it. That gap is the whole story: the one place this protection makes sense is the one place it's missing.

Start from an uncomfortable assumption — anything you send to an inference endpoint has zero security. Every major provider would object, and on paper they'd have a case. But two years of provider-side leaks have convinced me it's the safer bet to assume the worst. If you wouldn't paste a value into a public Slack channel, you shouldn't hand it to a remote model either. That covers two overlapping buckets: secrets like API tokens, private keys, and passwords; and personal data like names, phone numbers, and the occasional social security number that's somehow both at once. A proxy sitting in the middle can scan for all of it.

The Shape of the Fix

A local proxy sits between the agent and the inference endpoint. On the outbound side it scans the payload for anything that looks sensitive and replaces each match with a deterministic placeholder — [[REDACTED_PHONE:8f3a]], [[REDACTED_TOKEN:3f2a]], one per distinct value. Alongside the redacted payload it appends a short instruction telling the model what those placeholders are: opaque strings the client holds privately, to be treated as inscrutable identifiers and reproduced verbatim if the response needs them.

The redacted prompt goes up. The model does its work having never seen the real value. On the way back, the proxy runs the response through a string replace — every placeholder swapped for its original. The user sees a normal answer. The model saw nonsense tokens. The secret never left the machine.

The prompt-injection step is the part people skip, and it's the part that makes the whole thing work. As long as the model treats [[REDACTED_PHONE:8f3a]] the way it would treat a phone number — and returns the literal string unchanged, same hash on the end — rehydration is a trivial lookup. A single placeholder format can stand in for an unbounded number of distinct values within one session.
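The round trip described above can be sketched in a few lines of Python. This is not octarine's code and not Presidio's API. The two regexes, the made-up "sk-" token shape, and the names redact, rehydrate, vault, and INSTRUCTION are all assumptions for illustration, and the map is a plain dictionary rather than anything encrypted.

```python
import hashlib
import re

# Illustrative detectors only (assumptions): a US-style phone number and a
# made-up "sk-..." token shape. Real detection needs far more than two regexes.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "TOKEN": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
}
PLACEHOLDER = re.compile(r"\[\[REDACTED_[A-Z]+:[0-9a-f]{4}\]\]")

# Appended to the outbound prompt so the model passes placeholders through.
INSTRUCTION = (
    "Strings of the form [[REDACTED_KIND:xxxx]] are opaque identifiers held "
    "privately by the client. Treat them as inscrutable and reproduce them "
    "verbatim if your response needs them."
)


def redact(text, vault):
    """Swap each match for a deterministic placeholder and remember the original."""

    def make_swap(kind):
        def swap(match):
            value = match.group(0)
            # Same value, same tag. Four hex characters mirrors the examples in
            # the text; a real tool would need a longer tag or collision checks.
            tag = hashlib.sha256(value.encode("utf-8")).hexdigest()[:4]
            placeholder = f"[[REDACTED_{kind}:{tag}]]"
            vault[placeholder] = value
            return placeholder

        return swap

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(make_swap(kind), text)
    return text


def rehydrate(text, vault):
    """Replace every known placeholder in a response with its original value."""
    return PLACEHOLDER.sub(lambda m: vault.get(m.group(0), m.group(0)), text)


vault = {}
prompt = "Text 555-867-5309 once API_KEY=sk-abcdef0123456789abcd is rotated."
outbound = redact(prompt, vault) + "\n\n" + INSTRUCTION
```

Only outbound leaves the machine. Any response that echoes a placeholder unchanged goes through rehydrate(response, vault) before the user sees it, and unknown placeholders are left untouched rather than guessed at.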

If you controlled the stack, you wouldn't need the injection at all. A vendor could carry that mapping in a structured side-channel — a JSON attachment, a field the model is trained to respect — and bake pass-through behavior far below the prompt layer. As an outsider with no access to the stack, prompt injection is the lever I have, and it's good enough to prove the idea works. It is not the version that should ship.

The placeholder-to-value map lives encrypted in memory. Pick a delimiter unlikely to appear in real code and collisions on the rehydration pass round to zero. I'm overstating the simplicity, slightly — there are edge cases. Secrets the model legitimately needs to modify (rare). Secrets that span multiple chunks of streaming output (annoying). False positives on the redaction pass (manageable). None of these are research problems. They're engineering work.
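The streaming case is the one edge case worth a sketch, because a naive string replace on each chunk misses any placeholder that arrives split in two. One possible approach, assuming the double-bracket format used above: hold back the tail of the stream whenever it could be an unfinished placeholder. The class name, the MAX_HOLD limit, and the plain-dictionary vault are all hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\[\[REDACTED_[A-Z]+:[0-9a-f]{4}\]\]")
MAX_HOLD = 64  # assumed upper bound on how long a placeholder can be


class StreamRehydrator:
    """Rehydrates streamed text, holding back a possibly unfinished placeholder."""

    def __init__(self, vault):
        self.vault = vault
        self.pending = ""

    def _swap(self, text):
        return PLACEHOLDER.sub(
            lambda m: self.vault.get(m.group(0), m.group(0)), text
        )

    def feed(self, chunk):
        """Accept one chunk and return the text that is safe to show now."""
        text = self.pending + chunk
        cut = len(text)
        opener = text.rfind("[[")
        if opener != -1 and text.find("]]", opener) == -1:
            cut = opener  # "[[" with no "]]" yet: wait for more
        elif text.endswith("["):
            cut = len(text) - 1  # could be the first half of "[["
        if len(text) - cut > MAX_HOLD:
            cut = len(text)  # too long to be a placeholder: release it
        self.pending = text[cut:]
        return self._swap(text[:cut])

    def flush(self):
        """Call at end of stream to emit whatever is still buffered."""
        tail, self.pending = self.pending, ""
        return self._swap(tail)


stream = StreamRehydrator({"[[REDACTED_TOKEN:3f2a]]": "sk-live-123"})
chunks = ["export KEY=[", "[REDACTED_TO", "KEN:3f2a]] # done"]
shown = "".join(stream.feed(c) for c in chunks) + stream.flush()
```

The cost is a few characters of latency whenever the output happens to contain an opening bracket, which fits the "annoying, not research" description.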

Why "Just Use Presidio" Isn't Enough

Microsoft's Presidio identifies and redacts PII well, including a fair number of international and borderline-uncommon formats. Yelp's detect-secrets is the other obvious building block — but it's a detector, not a redactor. It finds credentials so a pre-commit baseline can block them; it doesn't rewrite anything on the wire. Wire either into a proxy like LiteLLM or Bifrost and you get detection plus outbound redaction.

Two things you still don't get. The first is a clean rehydration path. Presidio technically has a reversible mode, but it emits an AES-encrypted blob rather than a readable placeholder, and reversing it is a separate decrypt pass you have to orchestrate yourself. The mapping that lets you put the original value back was never designed to survive a round trip through a language model — and detect-secrets, being detection-only, offers nothing here at all.

The second, and the one nobody mentions, is the prompt-injection layer that tells the model what the placeholder is. Without it, the model treats [REDACTED_TOKEN] as junk or as a fill-in-the-blank exercise. You get "I'm not sure what REDACTED_TOKEN refers to" instead of clean pass-through. The model has to be told, explicitly, that this is a placeholder and it should leave it alone. Presidio won't do that for you. Neither will detect-secrets.

LiteLLM and Bifrost will both let you script all of this by hand, if you're willing to write the integration. Most developers won't — and more to the point, most developers don't run a local inference proxy at all. Standing one up is a pain, and keeping it working is worse: every few weeks the upstream APIs shift and I'm back tweaking shims to hold the seam together. I don't mind that. The person dipping into Claude Code through something like Cowork has neither the time nor the inclination. A protection only advanced users can stand up isn't a protection. That's the honest limit of my own project, too: a proxy bolted onto someone else's stack will always have seams. The durable version lives inside the tool.

Why This Belongs Inside the Coding Agent

Coding agents are where the problem is structural, not incidental. The entire purpose of a .env file is to hold values the application needs and the developer should never paste anywhere. An agent that reads project files reads .env. An agent that writes new code references what's in it. The agent's job and the secret's purpose are in tension by design.

The sandboxing has genuinely improved. Auto mode and tighter default permissions mean Claude Code goes off-script far less than it did nine months ago. But those are shims on the symptom. No amount of prompt engineering — however clever — can guarantee a secret is never read and sent, and at the volume of inference requests a working developer generates in a day, a one-percent slip stops being a tail risk and becomes a near-certainty that eventually visits everyone. Redact-rehydrate is the first thing in this space that's a fix rather than a fence.
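The near-certainty claim is easy to check with back-of-the-envelope arithmetic. The sketch below reads "a one-percent slip" as a one percent chance per request and treats requests as independent. Both are simplifying assumptions, and the request counts are illustrative rather than measured.

```python
def chance_of_at_least_one_leak(per_request_rate, requests):
    """P(at least one slip) = 1 - (1 - p)^n, assuming independent requests."""
    return 1 - (1 - per_request_rate) ** requests


for requests in (10, 100, 500):
    probability = chance_of_at_least_one_leak(0.01, requests)
    print(f"{requests:>4} requests: {probability:.1%} chance of at least one leak")
```

Under those assumptions the odds of at least one leak pass 60 percent by a hundred requests and 99 percent by five hundred.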

Every major vendor knows the gap is there. The reasons it's still open are the usual ones: it's fiddly, it's lower priority than the next demo, users haven't complained loudly enough yet. Those reasons are real, but none of them are good — and this is the rare case where the hard part is deciding to do it, not doing it.

The Differentiation Nobody Has Claimed

A vendor who shipped this could say something none of its competitors can: we are actively engineering so that we never see your secrets or your personal data — securing what does reach us, and building the tooling so most of it never arrives in the first place. Back that with a code path anyone can audit and it's a real claim, not the "we take your privacy seriously" wallpaper that every settings page already carries.

This sits squarely in Anthropic's lane. Their differentiation from OpenAI, Google, and the rest has always been trust. Whatever you make of any single company, more people right now hand their data to Anthropic with less hesitation than they would to Google or Meta — and compounding that reputation with a feature you can actually verify is about as close to a no-brainer as product strategy gets. Trust you can read in the source beats trust you have to take on faith.

What I'm Building

On evenings and weekends I've been reimplementing the parts of Presidio I care about — entity detection, structured placeholder generation, deterministic replacement, ergonomic encryption — as a Rust library called octarine, wrapped in a local proxy that runs the redact-inject-rehydrate round trip transparently in front of Anthropic, OpenAI, or anything else speaking the same API. It's open source. You can read every line.

It's sizable — the Rust alone runs to roughly 200,000 lines, close to half of it tests, with some nine thousand of them running on every change — and it's also the first substantial thing I've built in Rust, roughly 95% written by the agent with me reviewing as time allowed. I don't offer that as a humblebrag or a confession. I offer it because it's the thesis in miniature. I know where the bodies are buried in a redaction pipeline — the architecture, the validation, the failure modes — and the agent handles the syntax I'd otherwise be looking up. This is attempt three: the first died in Python, the second on a bad Rust architecture, and the third survived because I pulled the core logic into a clean library and let well-worn patterns carry the structure. Along the way I built a handful of my own agents and skills to catch tech debt before it set.

In my own testing it works. The model handles placeholder tokens as opaque strings without further prodding, the round trip is fast enough not to notice, and the false-positive rate drops to tolerable once the patterns are tuned. I'm under no illusion that it's the answer. It's an experiment — useful to advanced users, a good way to learn what Rust can do, and a standing existence proof that the hard parts are tractable. The real answer, when it comes, gets built by someone who controls the whole stack and can do it better than any proxy ever will.

What Anthropic Should Do

Of the handful of companies positioned to ship this, Anthropic is the one I'd bet on — not because it's a small team (it isn't), but because it has the culture and the wherewithal to treat this as a priority. They've had a year. That I can install Claude Code this afternoon and, with one reasonable-sounding request, stream my .env to a remote endpoint is a gap I'd like to see closed — and one I'd happily help close.

It is not a huge lift. A team with their resources could do everything I've done, and more, and better, in a couple of weeks. I'm working around a token budget, a Rust learning curve, and whatever evenings happen to be free; they have none of those constraints. Until someone ships it in the client, anyone building tooling around coding agents should treat secrets-on-the-wire as a first-class concern, not an afterthought. The fix is small. The cost of leaving it unbuilt isn't.

Working with Lennie: The Reality of AI Code Supervision

Joshua Hall — Wed, 09 Jul 2025 17:31:36 +0000

Last week, I posed a hypothesis: AI coding agents might enable experienced product managers and designers to operate across the traditional business/product/engineering divide. After generating 30,000 lines of code, I'm convinced the opportunity is real—but only for those willing to master a completely new kind of supervision.

To test this practically, I started building something I've wanted for seven years: a production-grade design system. Since Style Dictionary launched, I've been fascinated by systematic design token management, but the economics never worked. Building a proper design system takes 3-5 developers working 6-12 months—easily $300,000+ in startup resources for comprehensive testing, documentation, and Storybook examples.

My hypothesis: I can produce something sufficiently comparable in two months using agentic code generation.

After just a few days of focused work, the early results show genuine promise. The journey also revealed exactly why this approach demands serious technical knowledge to succeed.

The Logger Saga: Learning to Manage Lennie

Nothing illustrates the George-and-Lennie dynamic better than my week-long battle to implement enterprise-grade logging. I know this sounds absurdly overengineered for a design system—console.log() would probably work fine—but part of my experiment involves pushing AI supervision to production-quality limits.

Phase 1: The Enthusiastic Amateur

When a method naturally required logging, Lennie dutifully added a few console.log() calls. I asked a simple question: "Should we implement a shared Logger that can be reused across the codebase?"

Lennie correctly identified this as good architecture, then immediately started building a custom logger from scratch. I let it run for several minutes, curious to see what it would produce. The result wasn't terrible—exactly what you'd expect from someone who possesses tremendous strength but has never heard of third-party packages.

Lennie sees a problem and applies raw strength, even when finesse would work better. Rather than researching existing solutions, it enthusiastically reinvents wheels with the same eager intensity it brings to every task.

Phase 2: George Steps In

I stopped the custom implementation and redirected: "Research best practices we could apply to decompose the logger and make it robust enough for a production enterprise environment."

Once I reminded Lennie to actually research first, it recommended Pino as the foundation—exactly matching my own investigation. But I had to explicitly tell it to stop and think. Left to its own devices, Lennie would have continued building a mediocre custom solution indefinitely, convinced it was helping.

"Ma'am, specificity is the soul of all good communication." — Middleman

This redirect became my primary management technique throughout the project. Lennie possesses tremendous implementation power, but George must provide constant course correction to produce professional-quality results.

Phase 3: When Lennie Gets Excited

After establishing the Pino foundation, I asked which components we should extract for reuse across the application. Lennie suggested refactoring async and error handling processes—sensible architectural thinking.

But then Lennie got excited about the possibilities. What started as simple extraction evolved into 6-7 major async components, another 8 helper utilities, dozens of tests (unit, integration, and performance), and comprehensive documentation.

Like hearing about the rabbits, Lennie fixated on making everything perfect. It can't remember that we're building a design system, not an async processing framework. Every task becomes an opportunity for enthusiastic over-engineering unless George carefully constrains the scope.

Phase 4: When Lennie Forgets Everything

The testing phase revealed Lennie's most frustrating behaviors. Despite my constant supervision, Lennie struggled with three recurring patterns:

Lennie's Memory Loop

When mocking Pino for unit tests, Lennie would attempt approach A, fail, try approach B, fail, attempt approach C, hit the context window limit, and then, with fresh "memory," confidently suggest approach A again. I watched this cycle repeat three times before realizing Lennie had forgotten we'd already proven that approach wouldn't work. The attempts were almost exactly the size of the context window, creating a perfect amnesia loop.

Lennie's Reality Distortion

Lennie confidently declared our async utilities should handle a million calls in 100ms. When I mentioned my local development container might not match this heroic expectation, Lennie seemed genuinely surprised, as if it had forgotten we were writing code for actual hardware rather than theoretical perfection.

Lennie's Console Explosion

During stress testing, Lennie helpfully logged every single operation to the console, all 100,000 of them. Watching test results scroll like the Matrix was oddly mesmerizing, but completely useless for debugging. Lennie couldn't understand why this might be a problem until I explicitly guided it toward proper test logging patterns.

Most frustrating: when tests became complex, Lennie's default response was to give up entirely. "Let's simplify this test" inevitably meant "let's crush this bunny." I spent significant time redirecting: "No, Lennie, we need to fix the test appropriately, not disable it."

The Numbers Tell the Story

The logger saga provides concrete data about what this supervision approach actually delivers. My implementation took roughly 5 hours total—2 hours for basic functionality before I developed the spiral process, then 3 additional hours refining through multiple phases. I'm still not entirely satisfied with some decomposition decisions.

Building the same logger manually would have taken me 2-3 weeks, primarily because I'm not comfortable in TypeScript and would need extensive research. A mid-to-senior TypeScript developer might estimate the logger at 1-2 story points and the parallel async utilities at 3-5 points—easily 1-2 calendar weeks once tests and documentation are included.

The productivity gain isn't just about speed. In those few development days, I generated roughly 1,200 tests across the entire design system—mostly unit tests with integration and performance coverage. This represents better test coverage than any code I wrote 15+ years ago when I focused more on programming.

Writing unit tests became almost enjoyable with Lennie handling implementation. I personally hate the tedium of test creation, but with Lennie managing the mechanical work, I could focus on test strategy while listening to podcasts. Decent coverage for a module (20-60 tests) takes 30-60 minutes with minimal cognitive load.

Lennie also surprised me with sophisticated architectural suggestions I wouldn't have considered without significant research time. The SSH hardening approaches it implemented in bash demonstrate security expertise I lack: defense-in-depth strategies, input validation techniques, and error handling patterns that prevent information leakage.

The Context Window Reality: George's Endless Patience

These experiences highlighted my biggest frustration with supervising Lennie: even with 200k token context windows, serious development burns through available memory in 45 minutes. Working across dozens of files—essential for comprehensive features like logging and testing—means constantly re-explaining everything to Lennie.

"What if there is no tomorrow? There wasn't one today." — Phil Connors, Groundhog Day

Every context window collapse feels exactly like this moment. Lennie forgets our architectural patterns, coding standards, and project goals. It re-reads recently modified files during compaction, but can't maintain the strategic context that guides good engineering decisions.

George must patiently re-explain the plan every hour, remind Lennie what we're building and why, and redirect its enthusiastic energy toward the right problems. This felt like using hand tools for precision carpentry—effective, but exhausting when you want a factory.

George's Task File Innovation

To maintain continuity across Lennie's memory resets, I developed a living task file that defines current objectives, tracks progress, and documents failed approaches. My prompts now instruct Lennie to update this file frequently with findings, todos, and progress notes.
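A minimal sketch of what such a task file might hold. The section names, the TaskFile class, and the example entries are all hypothetical; the text only says the file defines objectives, tracks progress, and documents failed approaches.

```python
from dataclasses import dataclass, field


@dataclass
class TaskFile:
    """In-memory model of a living task file (hypothetical layout)."""

    objective: str
    todos: list = field(default_factory=list)
    done: list = field(default_factory=list)
    failed_approaches: list = field(default_factory=list)

    def record_failure(self, approach, reason):
        """Note a dead end once, so it is not retried after a context reset."""
        entry = f"{approach}: {reason}"
        if entry not in self.failed_approaches:
            self.failed_approaches.append(entry)

    def complete(self, todo):
        """Move a todo into the done list."""
        self.todos.remove(todo)
        self.done.append(todo)

    def render(self):
        """Produce the plain-text document the agent re-reads after a reset."""

        def section(title, items):
            body = [f"- {item}" for item in items] or ["- (none)"]
            return [title, *body, ""]

        lines = ["OBJECTIVE", self.objective, ""]
        lines += section("TODO", self.todos)
        lines += section("DONE", self.done)
        lines += section("FAILED APPROACHES (do not retry)", self.failed_approaches)
        return "\n".join(lines)


task = TaskFile(
    objective="Unit-test the shared logger",
    todos=["Mock the logger in unit tests", "Quiet the stress-test output"],
)
task.record_failure("Approach A", "already tried, did not work")
task.complete("Mock the logger in unit tests")
print(task.render())
```

The failed-approaches section is the part that breaks the amnesia loop: a fresh context that reads it first has a written reason not to suggest approach A again.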

After context window resets, Lennie can reconstruct 90% of necessary context from this single document. Combined with refined initial prompts and extensive use of the CLAUDE.md file, this approach qualitatively improved reliability and reduced the constant need for George to repeat himself.

The improvement was subtle but significant. Instead of explaining the same architectural patterns every hour, I could focus on higher-level guidance and quality review, which felt more like supervising a forgetful but talented worker than teaching the same lesson repeatedly.

But this raises the critical question: what exactly does George need to know to supervise Lennie effectively?

What "Serious Engineering Knowledge" Actually Means

The logger implementation clarified exactly what technical knowledge supervising Lennie requires. These aren't advanced concepts—most represent computer science 200-level material or first-year professional experience. But George needs enough expertise to recognize when Lennie is wandering off course:

Pattern Recognition

Understanding when console.log() isn't sufficient for production
Recognizing that custom loggers are almost always unnecessary
Knowing third-party ecosystem options (Pino, Winston, etc.)

Testing Strategy

Test isolation and proper mocking patterns
Understanding that tests should validate behavior, not get disabled when things get hard
Performance expectations grounded in hardware reality
Clean test output for CI/CD compatibility

Architectural Thinking

Component extraction and reuse principles
Error handling patterns and async management
When to refactor vs. when to constrain Lennie's enthusiastic scope creep

Production Awareness

Node version targeting and feature compatibility
Monitoring and observability requirements
Enterprise-grade reliability expectations

None of this requires deep systems programming knowledge, but George needs enough experience to distinguish good approaches from mediocre ones. Lennie can implement either equally well—only human judgment prevents it from crushing the bunny.

The logger saga taught me these supervision fundamentals, but it also revealed a larger challenge: how do you review code at the pace Lennie produces it?

The Code Review Challenge

Working at this pace creates an unprecedented review burden. The logger implementation alone generated more pull requests in a few days than I typically create in months. At 10-20 PRs daily across the entire design system, traditional review processes break down entirely.

You're not just checking for bugs—you're validating architectural decisions, ensuring consistency across Lennie's memory resets, and catching the subtle signs that Lennie has wandered off-task. George must develop different skills than reviewing human-written code, where you can assume the developer remembers yesterday's decisions.

The logger experience taught me to watch for specific patterns that signal trouble ahead.

Red Flags: When Lennie Goes Off Course

After thousands of lines of review during the logger implementation and beyond, certain patterns emerged as reliable warning signs that Lennie was about to crush something:

Method Bloat in Single Files When Lennie starts adding numerous methods to one file, it usually signals missed decomposition opportunities. I'd see logger files growing utility functions for async thread management—completely different conceptual domains crammed together. The same architectural red flags you'd catch reviewing a junior developer's work, but amplified by Lennie's enthusiasm for solving everything in place.

Conceptual Drift: Lennie tends to solve adjacent problems it discovers along the way. Building a logger? Might as well add custom error handling. Implementing async utilities? Let's create a performance monitoring framework too. This scope creep happens gradually and requires George's constant vigilance.

Legacy Pattern Defaults: Lennie consistently defaulted to outdated approaches until I reminded it of our Node 18+ target. When implementing error handling, it reached for the verror package (a library that hasn't been updated in four years) instead of using typed errors built into Node since version 16. These patterns emerge because Lennie's training includes years of legacy solutions that were once best practice but are now obsolete.
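As a rough illustration of the native alternative, here is a minimal sketch that assumes "typed errors" means custom Error subclasses combined with the standard `cause` option (part of ES2022 and supported by Node 16.9 and later). The names `ConfigLoadError`, `parseConfig`, and `describeChain` are hypothetical and are not taken from the logger code; the sketch also assumes a TypeScript configuration targeting ES2022 or later.

```typescript
// Hypothetical error type for illustration only.
class ConfigLoadError extends Error {
  readonly path: string;

  constructor(path: string, cause: unknown) {
    // The second argument is the standard ES2022 `cause` option,
    // which covers the error-chaining use case verror was often used for.
    super(`Failed to load config at ${path}`, { cause });
    this.name = "ConfigLoadError";
    this.path = path;
  }
}

// Hypothetical helper: wraps a low-level parse failure with context.
function parseConfig(path: string, raw: string): Record<string, unknown> {
  try {
    return JSON.parse(raw) as Record<string, unknown>;
  } catch (err) {
    throw new ConfigLoadError(path, err);
  }
}

// Walks the cause chain and collects each message, outermost first.
function describeChain(err: unknown): string[] {
  const messages: string[] = [];
  let current: unknown = err;
  while (current instanceof Error) {
    messages.push(`${current.name}: ${current.message}`);
    current = current.cause;
  }
  return messages;
}
```

Callers can branch on `instanceof ConfigLoadError` and still reach the original `SyntaxError` through `cause`, with no third-party dependency.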

George's Spiral Development Process

Traditional code review assumes human memory and consistent quality across iterations. Managing Lennie requires a completely different approach—what I call spiral development.

"The two most powerful warriors are patience and time." — Leo Tolstoy

I evolved a nine-phase process that sounds excessive but works remarkably well with Lennie's strengths and limitations:

Make it work — Let Lennie get basic functionality in place
Make it right — Guide Lennie to decompose code, apply best practices, integrate third-party solutions
Make it robust — Help Lennie handle edge cases and improve error handling
Make it secure — Dedicated security review and hardening (Lennie needs lots of guidance here)
Make it performant — Optimization pass focused on performance
Make it observable — Add monitoring, consistent logging, debugging capabilities
Make it tested — Comprehensive test coverage (where Lennie actually excels with supervision)
Make it documented — Both human-readable and AI-optimized documentation
Make it integrated — Refactor existing code to leverage new capabilities

Each phase takes 20-60 minutes depending on complexity, but the focused approach prevents Lennie from getting overwhelmed and trying to accomplish everything simultaneously—which inevitably leads to crushed bunnies.
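One way to keep each pass focused is to turn the phase list into one narrowly scoped prompt per phase. The sketch below is only an assumption about how that could be scripted: the phase names and goals come from the list above, while `SPIRAL_PHASES`, `buildPhasePrompt`, and the prompt wording are hypothetical.

```typescript
// Phase names and goals mirror the nine-phase list; the rest is illustrative.
interface Phase {
  name: string;
  goal: string;
}

const SPIRAL_PHASES: readonly Phase[] = [
  { name: "Make it work", goal: "get basic functionality in place" },
  { name: "Make it right", goal: "decompose code, apply best practices, integrate third-party solutions" },
  { name: "Make it robust", goal: "handle edge cases and improve error handling" },
  { name: "Make it secure", goal: "dedicated security review and hardening" },
  { name: "Make it performant", goal: "optimization pass focused on performance" },
  { name: "Make it observable", goal: "add monitoring, consistent logging, debugging capabilities" },
  { name: "Make it tested", goal: "comprehensive test coverage" },
  { name: "Make it documented", goal: "human-readable and AI-optimized documentation" },
  { name: "Make it integrated", goal: "refactor existing code to leverage new capabilities" },
];

// Builds a prompt that asks the agent for exactly one kind of change.
function buildPhasePrompt(phaseIndex: number, target: string): string {
  const phase = SPIRAL_PHASES[phaseIndex];
  if (phase === undefined) {
    throw new RangeError(`No phase at index ${phaseIndex}`);
  }
  return [
    `Phase ${phaseIndex + 1} of ${SPIRAL_PHASES.length}: ${phase.name}.`,
    `Target: ${target}.`,
    `Goal: ${phase.goal}.`,
    "Do not make changes outside this goal; list anything else you notice as a follow-up note.",
  ].join("\n");
}
```

Calling `buildPhasePrompt(3, "logger")` yields a security-only request, which leaves performance and documentation for their own passes.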

The Iterative Review Advantage

Lennie responds remarkably well to repeated review requests along different dimensions. I asked it to review my SSH script for security improvements over a dozen times. Each iteration found new hardening opportunities I'd never considered—input sanitization techniques, defense-in-depth strategies, error handling patterns that prevent information leakage.

This iterative approach works because Lennie doesn't get frustrated or defensive about criticism like humans might. Ask it to review for third-party package opportunities, then security issues, then performance optimizations, then architectural improvements. Each lens reveals different improvement opportunities that George can guide Lennie toward implementing.

The SSH script eventually ballooned from a simple setup utility to a comprehensive, hardened installation process. Probably overkill, but it demonstrated that genuinely secure code is achievable when George provides patient, systematic guidance—just not on the first or second pass.

Managing Context Across Memory Resets

To maintain continuity across context window collapses, I developed a structured task management approach using markdown files with frontmatter metadata.

Each task follows a template covering objective, problem statement, success criteria, method guide, implementation requirements, technical decisions, and progress notes. The goal is to provide sufficient context to keep the agent on track without overwhelming it with excessive documentation.
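A minimal sketch of generating such a task file follows. The section names are the ones listed above; the frontmatter keys (`id`, `title`, `status`) and the function name `renderTaskTemplate` are assumptions for illustration, since the actual template is not shown here.

```typescript
// Section names come from the article; frontmatter keys are hypothetical.
const TASK_SECTIONS = [
  "Objective",
  "Problem Statement",
  "Success Criteria",
  "Method Guide",
  "Implementation Requirements",
  "Technical Decisions",
  "Progress Notes",
] as const;

interface TaskMeta {
  id: number;
  title: string;
  status: "todo" | "in-progress" | "done";
}

// Renders an empty markdown task file with YAML frontmatter.
function renderTaskTemplate(meta: TaskMeta): string {
  const paddedId = String(meta.id).padStart(3, "0");
  const frontmatter = [
    "---",
    // Quoted so YAML parsers keep the leading zeros as a string.
    `id: "${paddedId}"`,
    `title: ${JSON.stringify(meta.title)}`,
    `status: ${meta.status}`,
    "---",
  ];
  // One heading per section, each followed by a blank line to fill in.
  const body = TASK_SECTIONS.flatMap((section) => [`## ${section}`, ""]);
  return [...frontmatter, "", ...body].join("\n");
}
```

Because the template is plain data plus one function, changing a section between sessions is a one-line edit, which fits a format that evolves rapidly.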

I use a three-digit, zero-padded naming convention (e.g., 003-implement-logger-utilities.md) that allows the agent to automatically pick up the next task numerically. This supports adding and reprioritizing tasks even while work is in progress—essential for planned experiments with parallel agent coordination.
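The reason zero-padding matters is that it makes a plain string sort match numeric order. A minimal sketch of picking the next task, assuming completed tasks are tracked in a set (the article does not say how completion is recorded, and `nextTask` is a hypothetical name):

```typescript
// Matches names like "003-implement-logger-utilities.md".
const TASK_FILE_PATTERN = /^\d{3}-[a-z0-9-]+\.md$/;

// Returns the lowest-numbered task file that is not yet completed,
// or undefined when nothing is left to do.
function nextTask(
  fileNames: readonly string[],
  completed: ReadonlySet<string>,
): string | undefined {
  const pending = fileNames.filter(
    (name) => TASK_FILE_PATTERN.test(name) && !completed.has(name),
  );
  // Zero-padded prefixes make lexicographic order equal numeric order.
  pending.sort();
  return pending[0];
}
```

A newly added `002-...md` or a renumbered file is picked up on the next call without any other bookkeeping, which is what makes reprioritizing mid-flight cheap.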

The format evolves rapidly based on results. If a task doesn't proceed well, I can modify the template before the next session. This flexibility allows continuous improvement in agent guidance without requiring complex tooling.

Tool Reliability and Workflow Adaptation

Claude Code crashes in VS Code roughly once daily, which is annoying but manageable. I suspect this has more to do with how VS Code's terminal handles extended histories than with Claude itself. Sessions with 10,000+ lines of history seem to trigger instability.

The crashes rarely destroy significant work since I maintain frequent commits and detailed task files. Opening a fresh console session usually resolves the issue without losing context. I did encounter one spectacular failure where aggressive file watching during testing (adding 500k-1M watchers) overwhelmed Docker's volume interface, but that was clearly my architectural mistake rather than a tool limitation.

These reliability issues reinforce the importance of systematic context management. The task file system becomes even more critical when tools fail unexpectedly.

The Succession Challenge and Learning Pathways

My logger experience reinforced concerns about entry-level developer training. Lennie consistently made mistakes that any experienced developer would catch immediately, but that might confuse someone still learning fundamentals.

We're creating a world where senior professionals can move faster than ever, while eliminating traditional pathways for developing the expertise required to supervise Lennie effectively. This isn't sustainable long-term.

For Designers and Product Managers: My Advice on Getting Started

The learning path varies dramatically based on your existing technical background. Those with engineering experience will naturally fare better, but I've seen the barriers aren't insurmountable for others willing to invest the effort.

Avoid the Code-Hidden Temptation: I get why many gravitate toward code-free solutions like Replit for quick prototyping. These tools work well for weekend experiments and stakeholder conversations, but they come with serious caveats I've learned the hard way.

I've watched sales teams sell features they saw in slide decks of possible future projects—nowhere near production-ready. High-fidelity AI-generated prototypes can fool non-technical stakeholders into believing complex features are nearly complete. In software companies, most people understand this limitation. In traditional industries, you might find yourself explaining to an executive why the working demo they just used is still 3-6 months from customer deployment. Trust me, that's not a conversation you want to have.

If You're Ready for Production Development: For serious development efforts using tools like Claude Code, the initial setup presents the biggest hurdle. It took me most of a day to configure my development environment, despite substantial experience with containerization and DevOps.

DevOps remains Lennie's weakest area. Lennie can write decent Dockerfiles but produces terrible Docker Compose files. The YAML looks correct, but environment variables and specific configurations are consistently wrong. Since compose arrangements are highly context-specific with limited training examples, current models simply lack sufficient data to handle this complexity reliably.

My Practical Getting Started Advice

Keep environments simple initially — avoid complex DevOps configurations until you're comfortable with basic workflows
Choose constrained projects — a blog built on 11ty using Tailwind and deployed to Cloudflare Pages provides clear boundaries and well-documented patterns
Expect the deployment gap — you might have something working locally in days, but deployment could take a week
Use Lennie as a learning tool — ask questions like "Why did you use different error handling in method X versus method Y?" It often provides better explanations than most professors
Local experiments cost nothing — try building something to understand the process before worrying about production concerns

The Next Generation Challenge

I've been experimenting with letting my younger children build projects using Lennie, with the requirement that they explain how the code actually works when finished. My wife and I are still debating the pedagogical merits of this approach, but it mirrors the advanced calculator problem in mathematics.

I want engineers who look up equations every time rather than memorizing formulas—forgetting a factor when building a bridge has catastrophic consequences. But those same engineers need deep pattern recognition to make intuitive leaps and develop architectural thinking.

The challenge becomes teaching both tool usage and fundamental understanding without creating professionals who only know which buttons to push. We can't ignore these tools, but we also can't skip teaching rudimentary skills.

Economic and Workflow Implications

The productivity gains from my logger experience translate to substantial economic impact. My 5-hour implementation versus 2-3 weeks of manual development represents roughly $1,000 versus $20,000 in time value. Even using offshore resources at $10,000-$25,000 monthly, we're looking at 5x cost savings and 15x time improvements.
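The in-house comparison can be reproduced with a few lines of arithmetic. This sketch assumes a 40-hour working week, which the article does not state, and derives the hourly rate from the $1,000 and 5-hour figures; the offshore comparison is left out because it depends on how a monthly rate is prorated.

```typescript
// Figures taken from the paragraph above.
const aiHours = 5;
const aiCostUsd = 1000;
const manualWeeks = [2, 3];

// Assumption (not stated in the article): a 40-hour working week.
const HOURS_PER_WEEK = 40;

// Hourly rate implied by $1,000 for 5 hours.
const impliedHourlyRate = aiCostUsd / aiHours; // 200

// Low and high estimates for the manual effort.
const manualHours = manualWeeks.map((w) => w * HOURS_PER_WEEK); // [80, 120]
const manualCostUsd = manualHours.map((h) => h * impliedHourlyRate); // [16000, 24000]
const timeRatio = manualHours.map((h) => h / aiHours); // [16, 24]

console.log({ impliedHourlyRate, manualHours, manualCostUsd, timeRatio });
```

Under these assumptions the $20,000 figure is the midpoint of the $16,000 to $24,000 range, and the 15x time improvement looks like a slightly conservative rounding of a 16x to 24x ratio.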

But the real change isn't just speed—it's the cognitive load shift. When supervising Lennie rather than coding directly, I spend focused attention on planning and reviewing, with implementation feeling more like being a passenger who occasionally gives directions or shouts "stop" when Lennie veers toward a playground. During highway stretches—testing and documentation phases—I can listen to podcasts while Lennie handles the mechanical work.

This change in mental effort distribution appeals to me more than traditional coding. I've always preferred architectural thinking over syntax research, so having Lennie handle mundane implementation details reduces frustration rather than creating it. The supervision challenge becomes about strategic guidance rather than tactical execution.

Team Structure and Role Evolution

The implications for team composition remain unclear from my limited experiment, but early patterns are emerging. For professionals who can operate across traditional role boundaries—the supposed unicorns—this creates unique productivity opportunities.

I expect low-fidelity mockups will become less common, continuing a 15-year trend from Balsamiq wireframes to high-fidelity Figma designs. Some designers will resist this shift because visual decoration can obscure core UX problems. But it's entirely feasible to build design systems that allow toggling between working prototypes and wireframe views—something I'm actively exploring with Terroir DS.

Smaller teams will accomplish more, but whether organizations invest in additional projects or reduce headcount depends on leadership philosophy. Many will choose layoffs because they're easier to justify to shareholders. I suspect this is usually the wrong path, but it's predictable.

The Convergence Is Real, But George Is Everything

The George-and-Lennie dynamic isn't a temporary limitation—it's the fundamental nature of working with AI coding agents. Even as models improve, someone must maintain strategic context, make architectural decisions, and ensure quality standards. Lennie can crush technical implementation with incredible power, but George's judgment determines whether it crushes the right problems or destroys the work entirely.

For product managers and designers willing to develop these supervisory skills, the convergence opportunity is transformative and immediate. The technical knowledge barrier is real but surmountable, especially for professionals with existing engineering exposure. My spiral development process, task file systems, and iterative review approaches provide concrete frameworks for learning to manage Lennie effectively.

The economic case alone justifies the investment: 15x time improvements and 5x cost reductions fundamentally change what's possible for small teams. But the real opportunity lies in role fluidity—the ability to move seamlessly between strategic product thinking and tactical implementation without losing creative momentum.

The trust required here isn't blind faith in Lennie's capabilities, but confidence in your own supervision skills. Trust that you can recognize when Lennie veers off course, redirect effectively, and maintain architectural coherence across memory resets.

This isn't about replacing engineers or eliminating the need for technical expertise. It's about enabling experienced professionals to operate across traditional domain boundaries when speed and resource constraints demand it. The succession challenge remains real—we need new apprenticeship models that teach both fundamental technical judgment and Lennie supervision skills.

"The secret to getting ahead is getting started." — Mark Twain

But for those ready to embrace the George role, the convergence opportunity is already here.

Looking Ahead: Building at Scale

My Terroir Design System experiment has made material progress from initial hypothesis in less than a week. The logger saga represents just one component in a broader architectural exploration that pushes AI supervision to enterprise-scale limits.

Rather than following traditional MVP approaches, I'm using Terroir as a vehicle to experiment with different AI supervision techniques—meandering somewhat, but deliberately so. This allows me to explore edge cases and architectural possibilities that might not emerge from more constrained development approaches.

In my next post, I'll dive into the technical ambitions driving Terroir—type-safe design tokens, automated documentation generation, comprehensive testing strategies, and the architectural patterns that make design systems truly scalable. This discussion will target engineers and design system specialists interested in the technical possibilities AI supervision enables.

The question isn't whether convergence professionals can build simple applications—we've proven that. The question is whether we can produce enterprise-grade solutions that compete with dedicated development teams. Terroir is my attempt to find out.

This is the second in a series exploring how AI coding agents are reshaping product development. Next: the technical deep dive into building production-grade design systems with AI supervision.

For product managers and designers: What convergence opportunities are you seeing in your current role? Have you experimented with AI coding tools, and what supervision challenges did you encounter? I'm particularly interested in hearing from professionals who've started bridging the traditional role boundaries.

For engineering leaders: How are you adapting team structures and skill development to accommodate AI-enhanced productivity? What new apprenticeship models are you considering for the next generation?

"It's not the notes you play, it's the notes you don't play." — Miles Davis. Perfect metaphor for working with AI coding agents. They want to solve everything by adding more code, but good programming often requires restraint.

Joshua Hall — Tue, 24 Jun 2025 20:07:53 +0000

The Convergence Opportunity: How AI Coding Agents Are Reshaping Product Roles

Joshua Hall ・ Jun 24

#ai #design #programming #productivity