DEV Community: Scott Mallinson

On console warnings and the things we don't remove

Scott Mallinson — Mon, 20 Jul 2026 11:40:00 +0000

Not all worthwhile work shows up in a changelog a user would read. Removing unused tooling and quieting a noisy console are two of those jobs. Neither is user-visible. Both are worth doing.

Removing tooling nobody was using

One service in the booking flow depended on a tool for generating API documentation from source comments. The pattern is familiar. Someone set it up, it worked, and at some point the docs stopped being generated or published. The tooling stayed.

This kind of residue is common in long-running codebases. It isn't broken and doesn't throw errors. It just accumulates: in package.json, in install time, in dependency-scan output, in the head of anyone who looks at the project setup and wonders what that config is for. Clearing it out took a small change. Drop the dependency, the config, and the npm scripts, then check nothing in CI was actually using it.

Is it worth a review cycle? I think so. It sets a norm: we remove what we don't use instead of letting it pile up. The review catches the case where someone was relying on it but hadn't said so. And the commit history records why it went, which helps if anyone ever wants to bring it back on purpose.

Defensive checks for prop types

The more interesting change was in a fare search interface. Some components were receiving props that could be undefined or null in edge cases they weren't built to handle. Nothing was breaking and the app kept working, but React was logging warnings to the console about the unexpected prop values. The fix was to handle the missing-or-malformed cases explicitly before passing props in, rather than letting them through and hoping the component survived.
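As a minimal sketch of what "handle the missing-or-malformed cases explicitly before passing props in" can look like, assume a parent that receives loosely-shaped fare results and a child that only supports fully-populated props. The names `RawFareResult`, `FareCardProps` and `toFareCardProps` are hypothetical, not taken from the codebase described here.

```typescript
// Hypothetical shape of what the API may hand back: any field can be missing or null.
interface RawFareResult {
  fareId?: string | null;
  price?: number | null;
  currency?: string | null;
  carrierName?: string | null;
}

// The contract the (hypothetical) child component actually supports: nothing optional.
interface FareCardProps {
  fareId: string;
  price: number;
  currency: string;
  carrierName: string;
}

// Returns props the child can render, or null when the fare is not renderable.
function toFareCardProps(raw: RawFareResult | null | undefined): FareCardProps | null {
  if (raw == null) return null;
  const { fareId, price, currency } = raw;
  if (typeof fareId !== "string" || fareId.length === 0) return null;
  if (typeof price !== "number" || !Number.isFinite(price)) return null;
  if (typeof currency !== "string" || currency.length !== 3) return null;
  return {
    fareId,
    price,
    currency,
    // A missing display name gets an explicit fallback instead of undefined.
    carrierName: raw.carrierName ?? "Unknown carrier",
  };
}

// In the parent, map results through toFareCardProps and skip the nulls before rendering.
```

The decision about what counts as renderable then lives in one function in the parent, instead of being spread across whichever child happens to receive the bad value.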

Simple enough on the surface. The more interesting question is why console warnings pile up in the first place.

Console warnings as signal degradation

A clean console is a working signal. When a new warning appears, developers notice it, look into it, and decide whether it's real. When the console is already full of warnings, that signal degrades. New warnings blend in, and the bar for "something is wrong" creeps upward. It happens gradually. One or two warnings in a rarely-hit edge case seem fine, then a few more turn up elsewhere, and before long people are filtering console output by habit while a genuine error sits unnoticed in the noise.

So the real fix isn't "add null checks". It's treating the console as a meaningful output surface and keeping it clean on purpose: defensive prop handling so components don't get values they can't deal with, treating existing warnings as debt worth paying down, and refusing to merge new code that adds warnings. React's prop warnings are telling you something about the contract between a parent component and its children, namely that the parent is sending something the child didn't expect. Ignore them and you're throwing away information about your component interfaces that might matter during a later refactor or upgrade.

The value of maintenance

Neither change shipped a user-visible feature, and that's fine. A codebase that never gets this kind of attention fills up with clutter until it slows you down in measurable ways: noisy CI output, confusing structure, prop mismatches that graduate from warnings to real errors during an upgrade. Small maintenance compounds. So does neglect. You're choosing which.

Two kinds of correctness: currency bugs and ghost feature flags

Scott Mallinson — Mon, 13 Jul 2026 07:05:00 +0000

Some bugs are loud and obvious. Others sit quietly in a codebase doing exactly what the code says, which isn't quite what anyone intended. The quiet ones are the more interesting category, because finding them is closer to archaeology than debugging. A UI banner that wouldn't clear is the same shape of bug on the front end. These are its data-side cousins.

The currency display bug

In travel booking systems, taxes are complicated. There are base fares, carrier-imposed fees, and government taxes, each with its own rules about which currency it should display in. Domestic US travel adds specific airport facility charges and segment taxes with their own currency handling.

The bug was that these taxes showed up with the wrong currency symbol. The amounts were right and the calculation was fine. What wasn't being carried through to the display layer was the currency context. In the exchange flow, where an agent is repricing a ticket change and quoting the tax breakdown to a customer, that's exactly the kind of error that erodes confidence in the tool. The agent can't be sure which values to trust.

Fixing it meant tracing how those tax line items were assembled for rendering and making sure the right currency followed them through the pipeline. The root cause was a missing currency association at the point where the values were gathered into the display model. The fix itself was small once I found it. Tracing it was the bulk of the work, as usual.
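A rough illustration of the idea, not the actual pricing model: if every tax line carries its own currency code into the display model, the renderer never has to fall back on an ambient default. The types, field names and tax codes below are assumptions made up for the sketch.

```typescript
// Hypothetical source shape: each tax arrives with the currency it was quoted in.
interface TaxLineSource {
  code: string;      // a tax or fee code
  amount: number;
  currency: string;  // ISO 4217 code supplied with the tax itself
}

// Hypothetical display shape handed to the rendering layer.
interface TaxLineDisplay {
  label: string;
  formatted: string;
}

// Each display line is built from the tax's own currency, never from a page-level default.
function toTaxDisplay(lines: TaxLineSource[]): TaxLineDisplay[] {
  return lines.map((line) => ({
    label: line.code,
    formatted: `${line.currency} ${line.amount.toFixed(2)}`,
  }));
}

const display = toTaxDisplay([
  { code: "TAX_A", amount: 4.5, currency: "USD" },
  { code: "TAX_B", amount: 13, currency: "GBP" },
]);
// display[0].formatted is "USD 4.50", display[1].formatted is "GBP 13.00"
```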

The feature flag that didn't exist

The subtler one corrected a feature flag reference in the pricing details service: a flag name that had been sitting in the code but had never actually been created in the feature flagging service.

Every time the code evaluated that flag, it got a "not found" response, which the SDK treated as "off". Whatever the flag was meant to enable was never reachable. It wasn't failing loudly or throwing errors. It was silently defaulting to disabled, no matter what anyone had configured.

Flag names are strings. Nothing checks a flag name at compile time against what's registered in your flagging service. Create a flag under a slightly different name than the one in the code, and the mismatch is invisible, at build time and at runtime, unless you go looking for it. Removing flags that have outlived their purpose is its own discipline. This is the opposite failure: a flag that was never really there.
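One way to go looking automatically is a check that compares the flag names the code references against the names the flag service reports. This is only a sketch: the flag names are invented, and fetching the registered names from a real flagging SDK is left out because it depends on the vendor.

```typescript
// Flag names the code evaluates, kept in one constant so there is a single list to audit.
// Both names are hypothetical.
const REFERENCED_FLAGS = ["pricing-details-new-breakdown", "exchange-upsell"] as const;

// Returns every referenced flag that the flag service does not know about.
function findUnregisteredFlags(
  referenced: readonly string[],
  registered: ReadonlySet<string>,
): string[] {
  return referenced.filter((flag) => !registered.has(flag));
}

// Intended for startup or CI: fail fast instead of silently evaluating to "off".
function assertFlagsExist(
  referenced: readonly string[],
  registered: ReadonlySet<string>,
): void {
  const missing = findUnregisteredFlags(referenced, registered);
  if (missing.length > 0) {
    throw new Error(`Feature flags referenced in code but not registered: ${missing.join(", ")}`);
  }
}
```

Whether a missing flag should stop the service or only raise an alert is a judgment call. The point is that the mismatch becomes visible somewhere.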

The passenger type that was never resolved

A third bug of the same quiet kind. The exchange fare search service generates fare options when a traveller changes a booked ticket, and that calculation depends on the passenger type (adult, child, infant, various discount categories) because each attracts different fares and tax treatment.

The fare generation helper was receiving a passenger type code, a short string like ADT or CHD, but it wasn't performing the lookup that translates that code into the full passenger type definition from the reference data service. For standard adult passengers it probably held up, because the code was making assumptions that happened to be true for the common case. For less common categories, the missing lookup produced wrong results without failing loudly. The fix was to resolve every code through the reference data before fare generation runs.
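A small sketch of the "resolve before use" step, assuming the reference data has already been loaded into a map. The definition shape and its two entries are illustrative only. The real definitions come from the reference data service and carry more than a description.

```typescript
// Hypothetical, heavily simplified passenger type definition.
interface PassengerTypeDefinition {
  code: string;
  description: string;
}

// Stand-in for data loaded from the reference data service.
const passengerTypes = new Map<string, PassengerTypeDefinition>([
  ["ADT", { code: "ADT", description: "Adult" }],
  ["CHD", { code: "CHD", description: "Child" }],
]);

// Turns a code into its full definition, and fails loudly when the code is unknown
// rather than letting fare generation carry on with assumptions.
function resolvePassengerType(
  code: string,
  referenceData: ReadonlyMap<string, PassengerTypeDefinition>,
): PassengerTypeDefinition {
  const definition = referenceData.get(code);
  if (definition === undefined) {
    throw new Error(`Unknown passenger type code: ${code}`);
  }
  return definition;
}
```

Having the fare generation helper accept a `PassengerTypeDefinition` rather than a `string` also lets the compiler reject a caller that passes the bare code.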

This is the kind of bug that's easy to wave through in review. The code looks plausible, the variable names suggest the right thing is happening, and the tests probably only cover the common cases. The tell is usually spotting that a code path receives something that looks like an identifier and treats it as if it were the thing itself.

Config drift and why it stays hidden

The flag and the passenger type are the same broader problem: drift. The code and the thing it references, a flag registry or a reference dataset, fall out of sync, and because neither side fails in a detectable way, the drift sticks around. The only symptom is behaviour that's quietly wrong, which is easy to blame on something else or miss entirely.

You usually find these by poking around a nearby area and noticing the mismatch. Validating that references resolve at startup is one mitigation. A naming convention and a clear "create it before you reference it" habit is usually more practical. The loud bugs get fixed because they announce themselves. The quiet ones only get fixed if someone goes looking.

The shape of shared libraries

Scott Mallinson — Mon, 06 Jul 2026 09:03:00 +0000

A shared library is defined less by what it does than by the shape it presents to everything that depends on it: its public surface, the contracts it implies, and the cost of changing either. Two bits of work from recently make that concrete. One was an export that was never declared. The other was a small change that quietly broke a numeric assumption.

Making implicit exports explicit

A couple of icon components from a shared plugin library were being used by a downstream service without being part of the library's declared public surface. The consumer reached in by path, something like import X from 'library/internal/path' rather than import X from 'library'.

That works, right up until it doesn't. Importing by internal path creates a hidden dependency on an implementation detail. Reorganise the library, move the internal path, and the consumer breaks. The library's authors have no way of knowing anyone relied on that path, because nothing declared the relationship.

The fix was to add the components to the library's declared exports. It's close to a one-line change, but it earns its keep: it makes the dependency visible to the maintainers, it brings the components under the same deprecation and versioning treatment as the rest of the public API, and it lets static analysis trace the import graph correctly. Libraries with undeclared consumers tend to become hard to refactor, because changes that look safe from the library's side turn out to break things elsewhere. Declaring the export brings the relationship into the open.

A breaking change disguised as a small one

The shared Node logging library is built on Winston, which ships a default set of severity levels (error, warn, info, and so on), each with a numeric priority where lower means more severe and error is level 0. The gap is that nothing ranks as more severe than error. There's no way to say "this is worse than an error, this is on fire and needs a human now". critical plays that role in syslog and most structured logging conventions, but Winston doesn't include it by default.

Adding a custom critical level at priority 0 and shifting the existing levels up by one fixes that. The change is tiny in lines of code and deceptively large in blast radius. Levels are usually referenced by name, like logger.error(...), which is fine. But any code comparing levels numerically ("only process events with level <= 1") is now pointing at a different level than before, because error has moved from 0 to 1. Every numeric comparison that wasn't explicitly about critical is suddenly off by one.
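The off-by-one can be shown without Winston at all, using plain objects that mirror its default npm-style level map. The `levelsAtOrBelow` helper is a hypothetical stand-in for the kind of numeric filter a consumer might have written.

```typescript
// Winston's default npm-style levels: a lower number means more severe.
const defaultLevels: Record<string, number> = {
  error: 0, warn: 1, info: 2, http: 3, verbose: 4, debug: 5, silly: 6,
};

// Custom levels after inserting `critical` at 0 and shifting everything else up by one.
const shiftedLevels: Record<string, number> = { critical: 0 };
for (const [levelName, priority] of Object.entries(defaultLevels)) {
  shiftedLevels[levelName] = priority + 1;
}

// A consumer-side numeric filter of the form "level <= threshold".
function levelsAtOrBelow(levels: Record<string, number>, threshold: number): string[] {
  return Object.entries(levels)
    .filter(([, priority]) => priority <= threshold)
    .sort(([, a], [, b]) => a - b)
    .map(([levelName]) => levelName);
}

levelsAtOrBelow(defaultLevels, 1); // ["error", "warn"]
levelsAtOrBelow(shiftedLevels, 1); // ["critical", "error"], so warn has silently dropped out
```

The same threshold now selects a different set of levels, which is exactly the comparison downstream services need to audit.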

That's why a change like this ships as a major version bump and asks consumers to update deliberately. The library exposes the new level and documents the shift; downstream services audit their numeric comparisons before upgrading. Writing the level was the easy part. The cost is rolling it through everything that depends on the library.

The shape is the contract

Both of these are the same lesson from different angles. A shared library's public surface is a contract whether or not you've written it down. An undeclared export honours that contract by accident. A numeric level shift changes it without telling anyone. Maintaining a library that lots of things depend on is mostly the work of keeping the contract explicit: declaring what's public, versioning what changes, and making the coordination visible instead of leaving consumers to find out when something breaks.

Three quiet bugs hiding in a cross-service feature

Scott Mallinson — Sun, 28 Jun 2026 08:05:00 +0000

I shipped a feature that looked simple from the outside: you could save a flight quote from an AI-powered assistant into a separate trip planning view. The user-facing part is simple enough. Getting there meant coordinating changes across a fare search service, an AI assistant backend, and two independent frontend components, each with a slightly different idea of what a "saved quote" was.

That kind of work is where a lot of the interesting coordination lives. It isn't algorithmically hard. It's demanding in a different way: you hold a complete picture of how the system behaves in your head while making changes in repositories that share no context with each other.

The silent configuration bug

Before any of the integration work could start, there was a bug to fix. AI assistant queries were returning fewer flight options than expected. No upsell options were coming back at all. Nothing in the system flagged it. Requests completed successfully and responses looked valid. You just got a narrower result set than you should have.

Tracing the fare search payload for AI assistant requests, I found a field that capped the maximum number of upsell results at zero. Zero maximum upsells means return none. It had probably been set when the AI assistant integration was first wired up, and since upsell results aren't always prominent in early testing, nobody had caught it.

The fix was a one-liner. Finding it was the work: tracing the full request chain to understand why AI assistant queries behaved differently from other consumers of the same API. That's how silent configuration drift goes. The value is technically valid, the system doesn't complain, and it quietly shapes what comes back.

When two components share an event

The core integration challenge was making sure that when a quote was saved from the AI assistant, two separate frontend components updated correctly: the assistant itself, to show the saved status, and the trip planning view, to refresh its basket with the new quote.

When I looked at how the trip planning component handled its refresh, it turned out to have its own local action for the job, a separate event doing the same thing as a shared action the AI assistant was already using. The duplication had grown gradually. The trip planning component came first, the AI assistant integration came later, and the shared event either didn't exist yet or wasn't visible when the local version was written.

The fix was to drop the local duplicate and have the trip planning component respond to the shared event directly. Small change, but it compounds. Both components now respond to the same contract. If the event shape changes, it changes once. If you want to know what triggers a basket refresh, there's one place to look instead of two.
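In outline, the end state looks something like the sketch below: one action definition, two subscribers. The tiny in-memory bus stands in for whatever store or event system the frontend really uses, and the action name and payload are assumptions.

```typescript
// One shared action definition. The name and payload are hypothetical.
const QUOTE_SAVED = "quote/saved";

interface QuoteSavedAction {
  type: typeof QUOTE_SAVED;
  quoteId: string;
}

type QuoteSavedListener = (action: QuoteSavedAction) => void;

// Minimal stand-in for the app's real event or state layer.
class ActionBus {
  private listeners: QuoteSavedListener[] = [];

  subscribe(listener: QuoteSavedListener): void {
    this.listeners.push(listener);
  }

  dispatch(action: QuoteSavedAction): void {
    for (const listener of this.listeners) listener(action);
  }
}

const bus = new ActionBus();
const assistantSaved: string[] = [];
const basketRefreshes: string[] = [];

// The assistant marks the quote as saved...
bus.subscribe((action) => { assistantSaved.push(action.quoteId); });
// ...and the trip planning view refreshes its basket from the same action,
// instead of defining a local action of its own.
bus.subscribe((action) => { basketRefreshes.push(action.quoteId); });

bus.dispatch({ type: QUOTE_SAVED, quoteId: "Q-1" });
```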

I've written before about what adding an AI layer taught me about type ownership, and this was the same principle wearing different clothes. Shared contracts go beyond type definitions. They're about the system having one authoritative source for each concept. Separate copies that start identical will drift apart eventually, and by the time they do, it's rarely obvious which one is right.

Generating HTML carefully

One edge case in the rendering work. The saved-quote feature lets users copy a formatted version of a quote to the clipboard, with PDF export to follow. The copy content is rendered as HTML, so any string values drawn from API responses or user input need sanitising before they're embedded in the template.

It's easy to miss. When you're building a formatter that turns structured data into an HTML string, interpolating values directly feels natural. But if any of those values come from external sources, even indirectly through several layers of typed objects, you've got an injection path. A quote description field containing a <script> tag shouldn't end up executable in a clipboard payload.

The fix was simple: escape HTML entities in any user-visible string before it goes into the template. Not complicated, but it doesn't surface in happy-path tests or feature demos. You have to think about it on purpose, or you find out about it later in a less pleasant way.
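For illustration, an escaping helper of this kind can be very small. The `renderQuoteHtml` formatter and its parameters are hypothetical. The escaping function itself covers the five characters that matter in HTML text and quoted attribute values.

```typescript
// Characters that must not reach an HTML template unescaped.
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Replaces each special character with its entity; everything else passes through.
function escapeHtml(value: string): string {
  return value.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

// Hypothetical formatter: every interpolated value goes through escapeHtml first.
function renderQuoteHtml(description: string, total: string): string {
  return `<p>${escapeHtml(description)}</p><p>Total: ${escapeHtml(total)}</p>`;
}

escapeHtml('<script>alert("x")</script>');
// "&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;"
```

Escaping at the point of interpolation, rather than trusting upstream layers to have done it, keeps the guarantee local to the formatter.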

What cross-service delivery actually involves

Cross-service feature work is its own skill. The mechanics of any single change are usually straightforward: a field added here, a schema extended there, an event handler updated in a third repository. The hard part is keeping a clear model of how the pieces connect while you move between codebases that share no context.

None of the things that had gone wrong here were individually complex. A suppressed search result, a duplicate event handler, an unsanitised string in a template. Each was a simple thing that had been allowed to exist because nobody had traced the full flow end to end. That tracing is most of what cross-service delivery actually is. You hold the whole picture, notice the gaps, and fix what you find before the feature ships with them baked in.

The scaffolding tax: getting a new service properly bootstrapped

Scott Mallinson — Wed, 24 Jun 2026 15:32:00 +0000

A lot of engineering work doesn't count as "building features" but has to happen before you can build features at any speed. Scaffolding a new service is squarely that: getting it from its initial template state into something actually deployable and maintainable.

From template to real service

We use an internal template to bootstrap new services. It gives you a working skeleton with the right structure, dependencies, and config patterns already in place. The catch comes afterwards. Once you've created a service from the template, there's a round of housekeeping to strip out the template's own identity and replace it with the new service's: config references, package names, internal identifiers, anything still pointing at the template rather than the service.

It's an hour's work and it matters. Leave a stale reference in the wrong place and you get subtle failures downstream. The wrong image gets pulled, metrics report under the wrong service name, alerts go nowhere because the routing rules don't recognise the identifier. Getting it right upfront is cheaper than tracking it down later.

Wiring into the deployment pipeline

A service isn't real until it's in the deployment pipeline. We use GitOps-managed configuration to define what runs in each environment, so adding a service means updating those definitions: registering it, updating the exclusion lists that gate which services deploy where, and making sure the non-production infrastructure knows it exists.

This looks like configuration editing, but it's closer to integration work. You're establishing the contracts between the service and the infrastructure that runs it. Get it in place and deployments become routine. Skip it and every deployment needs manual intervention.

Getting dependency management right from the start

We also set up the service's automated dependency updates properly from day one: finer-grained grouping aligned to upstream release cadences instead of one coarse batch, plus a first pass to clear the initial upgrade backlog before it builds up. The reasoning behind that grouping is worth a post of its own. The point here is that it's far cheaper to establish on a new service than to retrofit onto an old one.

Why scaffolding quality compounds

A service that's wired into the pipeline, has sensible dependency automation, and starts with clean configuration is one where future changes land quickly. One that's bootstrapped in a hurry, with stale references and manual deployment steps, picks up friction with every change. It's dull work. The alternative is treating it as someone else's problem to fix later, which usually means it never gets fixed at all.

Shipping a conversational search flow across services

Scott Mallinson — Sat, 20 Jun 2026 17:43:00 +0000

Some features are self-contained. Others cross enough systems that the changes have to land together, and the coordination becomes the hard part rather than any single change. Shipping an end-to-end conversational flight search was one of those. A user explores options through an AI assistant, selects one, and that selection saves into their booking basket. That single action spans three services and a frontend, and getting it working meant changing all of them roughly in parallel.

Building the flight options flow

The fare search service got improvements to how flight option data is structured in the conversational path. The previous structure had grown organically and was starting to show: handling and transforming it took more case-by-case logic than it should have. This pass cleaned up the data model, with consistent naming, clearer relationships between entities, and less special-casing at the edges, so downstream services and the frontend can work with it predictably.

The trip quote service got a new PATCH /quotes route for saving a selected option and merging it into an existing basket. It's a partial update. You're not replacing the basket, you're folding a selection into it, which is subtler than it sounds: the merge has to cope with a prior selection already existing, with new options conflicting with something already there, and with keeping the basket consistent throughout. Deciding what the route accepts, what it returns, and how it reports errors was most of the interesting work.
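The actual merge rules aren't spelled out here, so the following is only one possible shape for them, with invented types and an invented policy: a selection for an empty slot is added, a repeat of the same selection is a no-op, and a different option for an occupied slot is reported as a conflict instead of being silently overwritten.

```typescript
// Hypothetical basket model: at most one selected option per slot.
interface BasketItem {
  slot: string;
  optionId: string;
}

interface Basket {
  items: BasketItem[];
}

// The three outcomes a caller (for example a PATCH handler) would need to distinguish.
type MergeResult =
  | { kind: "merged"; basket: Basket }
  | { kind: "unchanged"; basket: Basket }
  | { kind: "conflict"; basket: Basket; existing: BasketItem };

// Pure function: never mutates the input basket, so a failed merge leaves it consistent.
function mergeSelection(basket: Basket, selection: BasketItem): MergeResult {
  const existing = basket.items.find((item) => item.slot === selection.slot);
  if (existing === undefined) {
    return { kind: "merged", basket: { items: [...basket.items, selection] } };
  }
  if (existing.optionId === selection.optionId) {
    return { kind: "unchanged", basket };
  }
  return { kind: "conflict", basket, existing };
}
```

Returning a tagged result rather than throwing makes it straightforward to map each outcome to a distinct response, though which status codes fit is a separate decision.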

The frontend got updated type definitions to match. That sounds mechanical, but if the types don't reflect the actual shape of the data, you lose the compiler's ability to catch mismatches before they reach production.

Instrumenting the MCP server

A separate strand was adding observability to an MCP server, the component that sits between AI tooling and the backend services it calls, translating tool invocations into API calls and structuring the responses on the way back. The instrumentation covers APM tracing, metrics, and structured logging, so you can trace a tool call end to end: how long it took, whether it succeeded, which backend it hit, and where it failed.

The constraint worth flagging is what you don't log. Tool-call requests can carry user-provided context and identifying information, so the instrumentation records the shape and outcome of each call — trace IDs, durations, status codes, error types — without persisting the content. That boundary is a design problem in its own right, and one I've written about separately.
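One way to enforce that boundary in code is to build the log record from an explicit allow-list of fields, so request and response content is excluded by construction. This is a sketch with assumed field names, not the MCP server's real types.

```typescript
// Hypothetical shape of a completed tool call as the server might see it.
interface ToolCallOutcome {
  traceId: string;
  backend: string;
  durationMs: number;
  statusCode: number;
  errorType?: string;
  requestBody: unknown;   // may contain user-provided context: never logged
  responseBody: unknown;  // likewise
}

// What is allowed to reach logs and metrics: shape and outcome only.
interface ToolCallLogRecord {
  traceId: string;
  backend: string;
  durationMs: number;
  statusCode: number;
  errorType: string | null;
}

// Fields are copied one by one, so anything added to ToolCallOutcome later
// stays out of the logs unless someone deliberately adds it here.
function toLogRecord(call: ToolCallOutcome): ToolCallLogRecord {
  return {
    traceId: call.traceId,
    backend: call.backend,
    durationMs: call.durationMs,
    statusCode: call.statusCode,
    errorType: call.errorType ?? null,
  };
}
```

Copying fields explicitly is more verbose than spreading the object and deleting the sensitive keys, but it fails safe when the call object grows.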

Context injection for AI coding assistants

A different thread again: a set of hooks for our GitHub Copilot configuration that inject context at different points in a development workflow, things like analytics context, test state, feature-flag configuration, and workspace information. Assistants are most useful when they understand the context they're working in. Without it they give generic answers that are technically correct and no use in your actual codebase.

The catch is that injecting too much backfires. Larger context costs tokens, and past a certain point the assistant spends its attention on the context instead of the problem. So the hooks are built around specificity. Each one fires at the moment its context is relevant: pre-chat hooks set up the initial picture, pre-tool-use hooks add context for the operation about to happen, post-tool-use hooks handle the follow-up. Getting it right is empirical. You find where the assistant gives unhelpful answers, work out what context would have helped, and add a hook there.

Naming as a form of maintenance

The feature had picked up two naming conventions as it evolved, one term in some places and another elsewhere. Neither was wrong, but having both meant reading the code required a constant mental translation. A codebase-wide rename pulled everything onto one vocabulary: service names, endpoint paths, function names, test descriptions.

It's the kind of change that's easy to defer, because it doesn't fix a bug or add a feature. The cost of deferring just compounds quietly. Every new developer has to learn the mapping, every review is a little harder, every search has to account for both terms. Doing it once, properly, is cheaper than living with the split.

Release pipelines should be boring

Scott Mallinson — Tue, 16 Jun 2026 13:26:00 +0000

Release automation tends to get written in a hopeful mood. One pull request merges, one job runs, one tag gets created, everything lands cleanly. You assume only one thing happens at a time. The assumption isn't deliberate. It's just how you think when you're writing the happy path.

The trouble is that the happy path is a special case. As soon as a repository has busy automated dependency updates, several pull requests can merge within minutes of each other, and each merge fires its own release job. The jobs start from roughly the same point and then race each other to write back. We had a shared GitHub Actions template running releases across a set of repositories, and it turned out to be hiding three separate race conditions, each at a different stage of the same job.

Two jobs creating the same tag

The first race is at tag creation. Two jobs start at almost the same moment, both read the current version, both compute the next one, and both try to create the same version tag. One wins. The other fails with a tag conflict.

The fix is a concurrency group on the workflow. GitHub Actions supports this natively: you name a group, and a run that would join it while another is in flight either waits or gets cancelled. For releases you want it to wait. You're serialising, not skipping.

concurrency:
  group: release-${{ github.ref }}
  cancel-in-progress: false

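With that in place, two release jobs in the same repo queue instead of colliding. The second runs cleanly once the first is done. One caveat: by default GitHub Actions keeps only one pending run per concurrency group, so if a third run is queued while the second is still waiting, the waiting run is cancelled and the newest takes its place.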

Two jobs pushing to the same branch

Serialising tag creation doesn't cover the last step: pushing the version-bump commit back to the branch. A job checks out the repo, bumps the version, commits, and pushes. If another job pushed in the gap between checkout and push, the working copy is now a commit behind, and git correctly refuses the non-fast-forward push.

There are two halves to fixing this. The first is to sync with the remote at the last possible moment before writing, so the bump lands on a current view of the branch:

- name: Sync with remote before versioning
  run: git pull --rebase origin ${{ github.ref_name }}

- name: Bump version
  run: npm version patch --no-git-tag-version

The ordering matters. Pull and rebase before the bump, so you're rebasing onto current remote state rather than dragging your version commit over changes that might conflict with it.

The second half is to make the push itself resilient. Even with a pre-push sync, two jobs can reach the push inside the same narrow window. So the push step retries: on a non-fast-forward rejection it pulls, rebases the bump onto the new tip, and tries again, bounded so it fails loudly instead of looping forever.

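- name: Push Changes
  run: |
    # Try the push up to three times, recovering only after a rejection.
    for attempt in 1 2 3; do
      if git push origin HEAD:${{ github.ref_name }}; then
        break
      fi
      if [ "$attempt" -lt 3 ]; then
        echo "Push failed, pulling and retrying..."
        # Rebase the version bump onto the new remote tip before the next attempt.
        git pull --rebase origin ${{ github.ref_name }}
      else
        echo "Push failed after $attempt attempts"
        exit 1
      fi
    done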

Note that you recover after a failed attempt rather than pulling before every one. Failing first and then recovering avoids unnecessary work, and it avoids a window where a pre-emptive pull could rebase onto a conflicting state.

Cleaning up after a race that already happened

By the time the fixes land, a race can already have left a mess. A job writes a version-bump commit locally but fails to push it, so the version recorded in the package manifest is a step ahead of the tags actually published. Recovering means going through the CI logs to find the run that partially completed, working out what it left behind, and replaying that specific bump cleanly against the current branch. The archaeology takes longer than the fix.

That's the real cost of partial failures in automation. The failure itself is cheap. Reconstructing the state it leaves behind is what costs you. So you make release steps idempotent where you can: a step that's safe to re-run without doubling its effects turns a tense recovery into a re-run.

Why a flaky pipeline is worse than it looks

A release job that fails about half the time sits in an awkward blind spot. It's not bad enough to block anyone. You retry, it passes, you move on. The workaround is cheaper than the fix, so the failure gets normalised, and people stop reading it as a signal and start treating it as weather.

The cumulative cost is real, and it isn't only the wasted retries. An unreliable pipeline changes how people work. If every release risks a babysitting session, the rational move is to batch changes up, which is the opposite of the small, frequent pull requests that make review easier and rollbacks cheaper. A pipeline that runs quietly on every merge takes that disincentive away, and goes back to being something nobody has to think about.

Doing this in a shared template multiplies the payoff. The fixes land once, and every repository that inherits the template gets them without anyone rediscovering the problem repo by repo, one retry button at a time.

Granular Dependabot groups and getting error attribution right

Scott Mallinson — Sat, 13 Jun 2026 20:58:37 +0000

Dependabot has a default behaviour that doesn't scale well: one pull request per outdated dependency. For a repository with a hundred dependencies, a run generates dozens of individual PRs. Most are low-risk patch bumps that nobody wants to read closely, so either they pile up unreviewed or people start rubber-stamping them, which defeats the point of the review.

Why grouping matters

Dependency grouping addresses this. Instead of one PR per package, you define groups — "all AWS SDK packages", "all testing libraries", "everything from this upstream source" — and Dependabot combines the relevant updates into a single PR. The result is far fewer PRs, each giving a complete picture of a set of related changes.

The work was moving several repositories from ungrouped or coarsely grouped configs to ones with more granular, purposeful groups. The granularity matters: a group defined as "all dependencies" is barely better than no grouping — you still get one huge PR — while groups built around logical cohesion ("packages released together from the same upstream source") give you something you can actually review with confidence.

Applying it across different stacks

The interesting part of rolling this out is that each repo has a different dependency structure. A Node logging library has nothing in common with a .NET error-handling library or a React frontend. For the Node libraries, grouping around major upstream sources makes sense; for the .NET library, the groups follow NuGet namespaces; for the frontend, it's a mix of framework packages, tooling, and application libraries, each with their own natural groupings. You can copy the structure of the YAML across repos, but you still have to think about what actually belongs together in each — and that thinking is the part you can't automate away.

Error source attribution in a shared library

The other change was about correctness in error reporting. A shared .NET library classifies errors and warnings across several services — capturing not just the message but its origin: which part of the system produced it. The origin is an enum, and one category was missing: a group of trip-related services whose errors weren't covered by any existing value. When those services produced an error, it either failed to classify or fell into a catch-all, making it harder to route the alert and slower to find the responsible team.

Adding the value is straightforward. The more interesting question is why it was missing. This is a common pattern with shared classification types — the enum gets defined early, before all the consumers are known, and then isn't kept in sync as new services adopt the library. The fix doesn't just make reporting more accurate; it removes the ambiguity that wastes on-call time. "The origin is unknown" means more digging before anyone can act; "the origin is the trip services layer" means the right team gets paged immediately.

Platform hardening as a steady-state activity

This is worth naming for what it is: platform hardening. Not new features, not architecture changes, but the continuous work of making the infrastructure more reliable, maintainable, and legible to the people who depend on it. A Dependabot config sits in a file almost nobody reads until dependency updates go wrong; an error-source enum is invisible to end users. Both help a system that dozens of engineers work in every day stay manageable over time. The return is long-tailed and largely invisible, which is exactly why the work tends to get deprioritised. Doing it consistently, even when nothing more exciting is on the board, is how a platform stays healthy.

What adding an AI layer taught me about type ownership

Scott Mallinson — Tue, 09 Jun 2026 20:12:33 +0000

I've been working on an AI-powered trip planning assistant that sits on top of an existing set of booking microservices. The AI layer is genuinely interesting work — natural language input, iterative trip refinement, the whole thing. But the most valuable engineering work had nothing to do with the AI itself.

It was about types. Specifically, who owns them and what happens when they drift.

The problem with local schemas

The AI assistant service had its own validation schema for incoming flight search requests. It wasn't wrong, exactly — it reflected the shape of requests the service expected at the time it was written. But the canonical definition of what a valid flight search request looks like lives in the fare search service, and over time, subtle differences had crept in.

This is a fairly standard distributed systems problem. When two services independently define what they think the same thing looks like, they'll stay in sync right up until they don't. A field gets added somewhere. A constraint gets tightened. Someone updates one schema and not the other. Nothing breaks immediately — the tests pass, the service starts — but you've created a time bomb.

The fix was straightforward: import the shared type from the fare search package and use it directly in the AI assistant, removing the local definition entirely. One source of truth. When the API evolves, every consumer of the shared type stays aligned automatically.

Simple to describe. Surprisingly easy to defer.
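
A minimal sketch of what that looks like, assuming the fare search package exports both the type and a parser for it. The field names and validation rules here are hypothetical; in a real setup the first half would live in the fare search package and the assistant would only import it.

```typescript
// --- Owned by the fare search package (hypothetical shape and names) ---
interface FlightSearchRequest {
  origin: string;        // three-letter airport code
  destination: string;   // three-letter airport code
  departureDate: string; // YYYY-MM-DD
  passengers: number;    // positive integer
}

const AIRPORT_CODE = /^[A-Z]{3}$/;
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

function parseFlightSearchRequest(input: unknown): FlightSearchRequest {
  if (typeof input !== "object" || input === null) {
    throw new Error("request must be an object");
  }
  const { origin, destination, departureDate, passengers } =
    input as Record<string, unknown>;
  if (typeof origin !== "string" || !AIRPORT_CODE.test(origin)) {
    throw new Error("origin must be a three-letter airport code");
  }
  if (typeof destination !== "string" || !AIRPORT_CODE.test(destination)) {
    throw new Error("destination must be a three-letter airport code");
  }
  if (typeof departureDate !== "string" || !ISO_DATE.test(departureDate)) {
    throw new Error("departureDate must be YYYY-MM-DD");
  }
  if (
    typeof passengers !== "number" ||
    !isFinite(passengers) ||
    passengers < 1 ||
    Math.floor(passengers) !== passengers
  ) {
    throw new Error("passengers must be a positive integer");
  }
  return { origin, destination, departureDate, passengers };
}

// --- In the AI assistant: no local schema, only the shared parser ---
function toSearchRequest(modelOutput: unknown): FlightSearchRequest {
  return parseFlightSearchRequest(modelOutput);
}
```

The assistant has nothing of its own to keep in sync. If the fare search service tightens a constraint, the assistant rejects the same requests the service would.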

Why AI features make this urgent

Here's the thing about building an AI layer on top of existing services: it doesn't introduce new complexity so much as it surfaces the existing ambiguity.

A traditional integration between two services fails fast. Service A sends a request to Service B, B returns an error, A logs it, someone gets paged. The feedback loop is tight enough that schema drift tends to get caught reasonably early.

An AI assistant is different. The user is expressing intent in natural language. The assistant is interpreting that intent, deciding what to query, constructing requests, handling the responses. There are more layers of abstraction between the user's words and the actual API call. When something goes wrong, it might manifest as a confusing or unhelpful response rather than a clear error — which means it can go unnoticed for longer.

More importantly, the AI layer is making decisions about how to construct requests. If its understanding of what a valid request looks like is subtly out of sync with reality, those decisions will be subtly wrong. Not catastrophically wrong — just wrong enough to be annoying and hard to diagnose.

This is why schema consolidation that might have felt like a nice-to-have became genuinely urgent once an AI layer was involved. The model compounds every ambiguity downstream.

Session state is harder than it looks

The other significant change this week was enforcing a required session-tracking header throughout the conversation service. This one is less about types and more about correctness guarantees.

The header was already being passed in some code paths. The problem was "some" — in a service where every request needs to carry session context to maintain coherent conversation state, optional isn't good enough. A user iterating on a trip in natural language needs the system to remember where they are in the conversation. If that header gets dropped mid-flow, the session context is gone and the experience breaks in a way that's confusing rather than obvious.

Making it a validated requirement, something the service explicitly checks for and without which it rejects the request, is the kind of change that feels bureaucratic until the alternative happens in production.

The lesson isn't "validate everything". It's more specific: identify the invariants your system actually depends on and make them impossible to violate rather than relying on every caller to get it right.
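
As a rough illustration of "impossible to violate", here is a sketch that validates the header once at the edge and hands downstream code a branded type it cannot obtain any other way. The header name `x-session-id` and the function names are assumptions for the example; the real header is not named in this post.

```typescript
// Hypothetical header name, assumed to be lower-cased already
// (Node's HTTP server lower-cases incoming header names).
const SESSION_HEADER = "x-session-id";

type RequestHeaders = Record<string, string | undefined>;

// Branded type: the only way to get a SessionId is through requireSessionId.
type SessionId = string & { readonly __brand: "SessionId" };

// Reject the request outright if the header is missing or blank.
function requireSessionId(headers: RequestHeaders): SessionId {
  const value = headers[SESSION_HEADER]?.trim();
  if (!value) {
    throw new Error(`Missing required header: ${SESSION_HEADER}`);
  }
  return value as SessionId;
}

// Downstream code asks for a SessionId rather than a string, so passing
// an unvalidated value is a compile error, not a production surprise.
function conversationKey(sessionId: SessionId): string {
  return `conversation:${sessionId}`;
}
```

A handler would call `requireSessionId(request.headers)` first and turn the thrown error into a 4xx response, so no code path can reach conversation state without a session.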

If you're planning to add an AI layer on top of an existing set of services, the time to audit your type ownership is before you do it, not after.

It's not that the AI makes the type problems worse, exactly. It's that it makes them more consequential and harder to spot. A messy schema definition that was fine when the integration was service-to-service becomes a genuine liability when a model is making decisions based on it.

The boring foundational work — shared types, enforced invariants, single sources of truth — isn't glamorous. But it's what determines whether the interesting work on top of it actually holds up.

The notification that wouldn't leave

Scott Mallinson — Mon, 08 Jun 2026 07:45:00 +0000

There's a category of bug that's almost invisible until you're the one staring at it — the kind where something just doesn't go away when it should.

The problem

In a booking management interface used by travel agents, there's a notification banner that warns agents when they're looking at a booking retrieved from a queue. The intent is clear: flag that this record came from a queue, not from a direct lookup. Once the agent has retrieved the record and the context changes, the banner should disappear.

It wasn't disappearing.

After retrieving a booking from the queue and moving on, the warning banner stayed visible. It persisted across view transitions, sitting silently in the terminal and describing a state that no longer applied. Nothing was broken in the functional sense. Fares could still be searched, bookings could still be modified. The banner was just wrong.

Why these bugs are easy to ship

A lingering notification doesn't fail a unit test. It doesn't throw an error or cause a traceable exception. It simply stays on screen past its welcome, and the only way to catch it is to manually walk through the interaction flow and notice that something feels off.

Tests tend to verify that things happen — components render, events fire, state updates. They're less good at verifying that things stop happening at the right moment. A banner appearing is testable. A banner not appearing after a specific sequence of interactions requires a test that explicitly asserts absence after a lifecycle transition, which is the kind of test that often gets skipped because "it's just UI state".

Except that, for agents using this tool all day, it's not just UI state. A notification that outlives its context creates a small but persistent cognitive load: is the queue state still relevant? Did something go wrong? Should I be worried? The interface is lying, quietly, in a way that erodes trust.

The fix

The root cause was that the notification state wasn't being cleared on the correct lifecycle event. When the booking was retrieved from the queue, the flag that triggered the banner was set. When the user transitioned away from that context, the flag wasn't being reset — it just carried forward into the next view.

The fix was to wire the dismissal to the right transition point: when the queue retrieval context is left behind, clear the notification state. Once that connection was made, the banner behaved correctly — appearing when relevant, disappearing when not.
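
The shape of that fix, and of the test that would have caught the bug, can be sketched with a small state reducer. This is not the application's actual code: the view names, action names, and state fields are all hypothetical.

```typescript
type View = "queuedBooking" | "bookingDetail" | "fareSearch";

interface UiState {
  view: View;
  showQueueBanner: boolean;
}

type Action =
  | { type: "bookingRetrievedFromQueue" }
  | { type: "navigate"; to: View };

function reduce(state: UiState, action: Action): UiState {
  switch (action.type) {
    case "bookingRetrievedFromQueue":
      // Entry condition: the banner is set when a queued booking is opened.
      return { view: "queuedBooking", showQueueBanner: true };
    case "navigate":
      // Exit condition: the banner survives only while we stay in the
      // queue retrieval context. Leaving that context clears it.
      return {
        view: action.to,
        showQueueBanner: state.showQueueBanner && action.to === "queuedBooking",
      };
  }
}

// Walk the interaction: open a queued booking, then move on.
const initial: UiState = { view: "fareSearch", showQueueBanner: false };
const afterRetrieve = reduce(initial, { type: "bookingRetrievedFromQueue" });
const afterLeaving = reduce(afterRetrieve, { type: "navigate", to: "bookingDetail" });
// afterRetrieve.showQueueBanner is true; afterLeaving.showQueueBanner is false.
```

The last three lines are the absence assertion in miniature: a test that checks the banner is gone after the transition, not just that it appeared before it.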

The broader pattern

This is a recurring shape in UI work on long-lived, stateful applications. State accumulates. Things that were true in one context bleed into the next if you're not deliberate about cleanup. It's especially common in applications built incrementally over years, where the component that sets a flag and the component that should clear it aren't obviously connected, and the original author of each may not have anticipated how they'd interact.

The fix for any individual instance is usually small. The harder problem is building the habit of thinking about state exit conditions as carefully as state entry conditions — asking not just "when should this appear?" but "when should this definitively stop appearing, and am I certain that's covered?"

Most of the time the answer is yes, and nothing bad happens. Occasionally it isn't, and you end up with a warning banner haunting a terminal long after the thing it was warning about has resolved.

Designing an API endpoint for an AI consumer

Scott Mallinson — Thu, 04 Jun 2026 14:40:29 +0000

Most search APIs follow a familiar pattern: you send a query, you get back a structured list of results. Each result is a record — fields, values, maybe some nested objects. It's designed to be consumed by code that knows what it's looking for and knows how to render it.

Recently we built a new search endpoint with a different consumer in mind: an AI assistant. The response it needs isn't a result set. It's something the assistant can reason about and articulate in natural language.

Why the response shape matters

When a UI component consumes a search API, it knows exactly which fields to read. The price goes in this column. The departure time goes there. Any ambiguity in the response structure is a bug to be fixed in the mapping layer.

When an AI assistant consumes a search API, the situation is more interesting. The assistant has to understand the results, decide what's relevant to say about them, and express that in a coherent response to the user. If you hand it a dense structured result set, it can technically work with it — but you're asking it to do a lot of implicit inference about what the data means and how the fields relate to each other.

A response designed for conversational consumption makes that easier. You surface the information the assistant actually needs, organised in a way that corresponds to how a human would think about it, rather than in a schema optimised for data storage or rendering. The distinction is between a response that answers "what are these records?" and one that answers "what should I know about these options?"

Building it without duplicating the core logic

The practical challenge is that the new endpoint still needs to run the same underlying search — the same pricing logic, the same availability checks, the same filtering. You don't want two separate implementations of that. What changes is the shape of the output, not the logic that produces it.

The implementation threads a new output path through the existing search pipeline: the core logic runs as before, but there's a step at the end that transforms the results into the conversational response format before returning them. This keeps the business logic in one place while allowing the presentation layer to vary by consumer type.

It's a pattern that shows up a lot in API design — the same underlying operation exposed through different interfaces for different consumers. The tricky part is getting the boundaries right so that the shared logic stays coherent and the per-consumer transformation doesn't start leaking back into it.
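
A minimal sketch of that boundary, assuming a single core search function and a formatter chosen per consumer. The record shape, the conversational shape, and the sample inventory are all made up for illustration; the real pipeline is not shown in this post.

```typescript
// Hypothetical record shape returned by the existing search pipeline.
interface FareResult {
  carrier: string;
  departs: string; // local departure time, HH:MM
  stops: number;
  priceGbp: number;
}

interface SearchQuery {
  maxStops: number;
}

// Stand-in for the existing core search (pricing, availability, filtering).
function runCoreSearch(query: SearchQuery): FareResult[] {
  const inventory: FareResult[] = [
    { carrier: "AA", departs: "09:15", stops: 1, priceGbp: 420 },
    { carrier: "BB", departs: "11:40", stops: 0, priceGbp: 510 },
    { carrier: "CC", departs: "06:05", stops: 2, priceGbp: 610 },
  ];
  return inventory.filter((fare) => fare.stops <= query.maxStops);
}

// Hypothetical conversational shape: a count plus a few ready-to-say highlights.
interface ConversationalResponse {
  optionCount: number;
  highlights: string[];
}

function describeFare(fare: FareResult): string {
  const stops = fare.stops === 0 ? "direct" : `${fare.stops} stop(s)`;
  return `${fare.carrier} departing ${fare.departs}, ${stops}, GBP ${fare.priceGbp}`;
}

// Only called with a non-empty list.
function cheapestOf(fares: FareResult[]): FareResult {
  return fares.reduce((best, fare) => (fare.priceGbp < best.priceGbp ? fare : best));
}

// Final pipeline step for the AI consumer: reshape the output, never recompute it.
function toConversational(results: FareResult[]): ConversationalResponse {
  if (results.length === 0) return { optionCount: 0, highlights: [] };
  const cheapest = cheapestOf(results);
  const highlights = [`Cheapest: ${describeFare(cheapest)}`];
  const direct = results.filter((fare) => fare.stops === 0);
  if (cheapest.stops > 0 && direct.length > 0) {
    highlights.push(`Cheapest direct: ${describeFare(cheapestOf(direct))}`);
  }
  return { optionCount: results.length, highlights };
}

// One search, two presentations: the formatter is the only thing that varies.
function search<T>(query: SearchQuery, format: (results: FareResult[]) => T): T {
  return format(runCoreSearch(query));
}

const forUi = search({ maxStops: 1 }, (results) => results);
const forAssistant = search({ maxStops: 1 }, toConversational);
```

Because the formatter only ever sees finished results, it has no way to influence pricing or filtering, which is the leak the boundary is meant to prevent.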

Conversational APIs as a distinct design problem

There's a broader point worth making here. As AI assistants become a first-class consumer of backend APIs, the assumptions baked into standard API design start to matter in different ways.

Traditional API design treats the consumer as code: deterministic, explicit, able to handle any well-formed response. AI consumers are more like people: they interpret, infer, and sometimes get confused by responses that are technically complete but not easy to reason about. Designing for them means thinking about meaning and context, not just schema validity.

That's a different skill than standard API design, and I suspect it's going to become more important as more systems start exposing AI-facing interfaces alongside their traditional ones.

Keeping orchestration in your code, not your prompt

Scott Mallinson — Sun, 31 May 2026 12:58:03 +0000

We spent two days this week in Bedrock architecture workshops with AWS engineers — tool use, guardrails, embeddings, memory, multi-agent patterns. The biggest practical insight wasn't about a specific service. It was about where to put the coordination logic.

The easy version of tool use

When you first wire up an AI assistant with tool access, there's a compelling path of least resistance: give the model a set of tools, describe what they do, and let it figure out the sequencing. Call the search tool to find options, call the booking tool to select one, call the session tool if you need to track state. The model is genuinely good at this in isolation.

The problem is "in isolation." On a real booking platform, each service has its own validation rules, its own error modes, its own expectations about what came before. A fare search result has an expiry window. A quote operation needs specific metadata from the preceding search. Session identifiers need to flow through in a particular header. These aren't things you want the model deciding about on the fly — they're invariants the system depends on.

Letting the model handle the sequencing works most of the time. But "most of the time" has a different operational meaning when you're coordinating services that touch financial transactions.

Orchestration belongs in your code

The position we landed on: orchestration logic lives in the application layer. The model's job is to understand user intent and produce structured output. Deciding which services to call, in what order, with what preconditions — that's engineering work that belongs in code, not in a system prompt hoping the model reasons its way to the right sequence.

There are a few reasons this matters beyond the obvious.

Testability. Application-layer orchestration is just code. You can unit test it, integration test it, reason about edge cases before they hit production. Prompt-driven orchestration is probabilistic — you can observe it, you can improve it, but you can't prove it.

Debuggability. When something goes wrong with application-layer orchestration, you have a stack trace. When the model takes a wrong turn through your tools, you have a conversation log and a hypothesis. These are not equivalent debugging experiences.

Operational confidence. The people who run these systems — and who get paged when they break — need to be able to reason about what's happening. Keeping coordination logic in code means it's reviewable, auditable, and consistent. Engineers who didn't write it can still follow it.

None of this means the model is reduced to a dumb text processor. It's still doing the genuinely hard work: understanding intent, handling ambiguity, dealing with the mess of natural language, producing structured output in the format downstream services expect. That's where the intelligence should be applied. The sequencing is deterministic by design.
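
To make the split concrete, here is a minimal sketch of sequencing that lives in application code. Everything in it is hypothetical: the intent shape, the state fields, and the step names are illustrative, not the platform's actual contracts. The model's only contribution is the `QuoteIntent`; the order of calls and the expiry precondition are plain code.

```typescript
// Structured output the model is asked to produce (hypothetical shape).
interface QuoteIntent {
  kind: "quote";
  origin: string;
  destination: string;
}

// What the application has learned so far in this flow.
interface FlowState {
  search?: { searchId: string; fareId: string; expiresAtMs: number };
  quoteId?: string;
}

type Step =
  | { call: "search"; origin: string; destination: string }
  | { call: "quote"; searchId: string; fareId: string }
  | { call: "done"; quoteId: string };

// Deterministic sequencing: given the intent and current state, decide the
// next service call. The model never chooses the order.
function nextStep(intent: QuoteIntent, state: FlowState, nowMs: number): Step {
  if (state.quoteId !== undefined) {
    return { call: "done", quoteId: state.quoteId };
  }
  // Invariant: a quote needs metadata from a preceding search that has not
  // expired. A missing or stale search means searching again first.
  if (state.search === undefined || state.search.expiresAtMs <= nowMs) {
    return { call: "search", origin: intent.origin, destination: intent.destination };
  }
  return { call: "quote", searchId: state.search.searchId, fareId: state.search.fareId };
}

const intent: QuoteIntent = { kind: "quote", origin: "LHR", destination: "JFK" };
const freshSearch = { searchId: "s1", fareId: "f1", expiresAtMs: 2000 };

const first = nextStep(intent, {}, 1000);                           // search
const withSearch = nextStep(intent, { search: freshSearch }, 1000); // quote
const afterExpiry = nextStep(intent, { search: freshSearch }, 3000); // search again
```

Because `nextStep` is a pure function, the sequencing rules can be unit tested without calling a model or a service, which is the testability argument in practice.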

Model abstraction

There's a related decision that becomes expensive to get wrong: how you reference the model in your code.

If you name a specific model version in twelve places — config files, service constructors, test fixtures — you're setting yourself up for a refactoring exercise every time a better or cheaper model becomes available. On Bedrock, inference profiles let you abstract this properly: instead of hardcoding a specific model identifier, you reference a profile that maps to "the current model for this use case," and update the mapping in one place when things change.

This sounds like obvious engineering hygiene. It usually is. But it's the kind of thing that slips during a prototype phase and doesn't get fixed until someone is midway through a migration counting how many files they need to touch.

The deeper point is that model selection should be a configuration concern, not a code concern. The code should say what kind of inference it needs; what model actually handles that inference should be externally configurable.
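
In code, that separation might look something like the sketch below. It is deliberately generic and does not call any Bedrock API: the use-case names, the `modelRef` strings, and the override mechanism are all placeholders I've invented for the example. On Bedrock, the reference would presumably be whatever identifier the inference profile exposes.

```typescript
// The kinds of inference the code is allowed to ask for (hypothetical).
type UseCase = "intentParsing" | "summarisation";

interface ModelConfig {
  modelRef: string; // placeholder reference, not a real model identifier
  maxTokens: number;
}

// The single place where use cases map to models.
const defaultModels: Record<UseCase, ModelConfig> = {
  intentParsing: { modelRef: "placeholder-profile-intent", maxTokens: 512 },
  summarisation: { modelRef: "placeholder-profile-summary", maxTokens: 1024 },
};

// Overrides would come from environment or deployment config, so swapping a
// model is a configuration change rather than a code change.
function resolveModel(
  useCase: UseCase,
  overrides: Partial<Record<UseCase, string>> = {},
): ModelConfig {
  const base = defaultModels[useCase];
  const override = overrides[useCase];
  return override ? { ...base, modelRef: override } : base;
}

// Call sites say what kind of inference they need, never which model.
const parsingModel = resolveModel("intentParsing");
const swappedModel = resolveModel("summarisation", {
  summarisation: "placeholder-profile-summary-v2",
});
```

Test fixtures and service constructors then depend on `UseCase`, and the count of files touched during a model migration drops to one.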

Skills and memory over monolithic prompts

The other architectural shift worth recording: as agent capability grows, a single enormous system prompt becomes a liability.

It's hard to update — a change to one capability risks affecting another, and you can't always predict the interactions. It's hard to test in isolation. And it tends to accumulate over time, with each new requirement getting bolted on, until you have something that works but nobody fully understands.

The better direction is modular: discrete Skills for different capabilities, explicit memory management instead of hoping the model retains relevant context from earlier in a long conversation. Each skill can be reasoned about independently, tested in isolation, updated without worrying about unintended side-effects from capabilities that share the same prompt.

This is where AWS's Strands SDK and AgentCore are pointing: an architecture where agent capabilities are composable rather than monolithic. It's closer to how you'd design a conventional service — small, cohesive pieces with clear interfaces — and further from the "give the model everything and see what happens" approach.

What changes

A week spent on architecture rather than shipping features doesn't always feel productive. The useful test is whether it changes how future work gets done.

In this case: when we extend the AI assistant — new capabilities, new service integrations — there's now a clearer pattern to follow. Orchestration in code. Model as intent-resolver and structured-output producer. Capabilities as discrete, testable Skills. Model selection behind an abstraction layer that makes swaps cheap.

None of that is especially dramatic. But getting these things right before the system is large is significantly easier than sorting them out after. Architecture decisions made at the beginning tend to have long half-lives, for better or worse.