Ceci Olivera

Posted on Jun 29 • Edited on Jul 8

An MCP for the most common failure of the Web

#a11y #ai #webdev #mcp

There's a deterministic algorithm at the center of this story and a probabilistic model at the edge of it, and most of what I learned lives in the space between them.

Here's the short version: years ago, a designer friend of mine, Marta Herrera Hollingsworth, built an algorithm that generates monochromatic color palettes whose contrast relationships are mathematically guaranteed. I turned that algorithm into a library, and then wrapped the library in an MCP (Model Context Protocol) server so that any AI agent building a UI could call it and get colors that can't fail WCAG 2.2 contrast. This is the story of that library, why it exists, the surprisingly stubborn problem of getting models to actually use its output as written, and where it all landed: a better place than where it started, though not the place I expected at the outset.

Why color, of all things

WebAIM scans the homepages of the top million most-visited sites in the world every year, and for seven years straight, low-contrast text has been the single most common accessibility failure they can detect. The February 2026 report found it on 83.9% of homepages, up from 79.1% the year before, averaging 34 separate instances per page. This isn't an edge case. It's the default condition of the web.

And the web is increasingly being written by agents. Microsoft's A11y LLM Eval harness (built by Michael Fairchild and collaborators) measures how accessible model-generated code actually is, and the baseline is bleak: with no accessibility guidance, eight frontier models passed the harness's automated checks only about 12% of the time, with contrast errors featuring heavily. The hopeful part of that same report is that guidance works: a short "please be accessible" instruction lifted pass rates by roughly 24 points, and a structured skill that made the model review its own output did even better, resolving on average 86% of the issues automated tools can measure — with color contrast still the single most persistent problem, even then.

So: a perfect place to put a tool.

What MCP is, and why I used it here

Before getting into the library, it's worth explaining the scaffolding, because the choice of protocol matters.

MCP (Model Context Protocol) exposes three distinct primitives a model can draw on during generation:

Tools: functions that execute logic (calculations, validations) and return structured data.
Resources: data the model reads as a persistent, URI-addressable source of truth, rather than the one-off result of a single call.
Prompts: reusable instruction flows that the server itself exposes, instead of leaving every integration to reinvent them.

A REST API with function calling also gives you tools — that part isn't new. What it doesn't give you, without extra convention, is a standard way to expose a URI-readable source of truth, or a forced sequence of steps that any MCP client understands the same way. That's why I chose MCP: because palette://{hex}/{theme} is a resource the agent can return to at any point during generation, and because I thought a prompt could be the flow that forced the right order of steps. I was wrong about that second part, and I'll come back to it later.
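
To make the resource idea concrete, here is a minimal TypeScript sketch of how a palette://{hex}/{theme} URI could be parsed into its two parts. It is an illustration, not the server's actual code: the function name parsePaletteUri, the theme names "light" and "dark", and the choice to write the hex without a leading # are all assumptions.

```typescript
// Hypothetical helper: not part of the library's published API.
// Assumption: themes are named "light" (on white) and "dark" (on black),
// and the hex segment is written without a leading "#".
type Theme = "light" | "dark";

interface PaletteRef {
  hex: string; // normalized to lowercase "#rrggbb"
  theme: Theme;
}

function parsePaletteUri(uri: string): PaletteRef {
  // Six hex digits, a slash, then one of the assumed theme names.
  const match = /^palette:\/\/([0-9a-fA-F]{6})\/(light|dark)$/.exec(uri);
  if (match === null) {
    throw new Error(`Not a valid palette URI: ${uri}`);
  }
  const [, hex, theme] = match;
  if (hex === undefined || theme === undefined) {
    throw new Error(`Not a valid palette URI: ${uri}`);
  }
  return { hex: `#${hex.toLowerCase()}`, theme: theme as Theme };
}
```

Because the URI is stable for a given seed and theme, an agent can re-read the same address at any point in a session and get the same matrix back, which is the property a one-off tool result doesn't offer.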

What the library actually does

Give it one seed color and a theme — whether it'll sit on white or black — and it produces a small monochromatic scale, six shades across a 100–900 range, where every contrast relationship has been computed. But the colors are the easy part. The valuable part is what the library ships alongside them: a compatibility matrix. For each shade, it tells you exactly which other shades (plus pure white and black) are safe to use as text on top of it at WCAG 2.2 AA, and which pairings hold only for large text.
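
One way to picture the compatibility matrix is as a lookup table keyed by shade. The sketch below is a guess at a reasonable shape, not the library's real data model: the type names, field names, and the gradePair function are hypothetical. The values are copied from the example palette that appears in the inline-comment format later in this post.

```typescript
// Hypothetical shape for the compatibility matrix; the library's real
// field names may differ. Values come from this post's example palette.
type Grade = "AA" | "AA-large" | "fail";

interface ShadeRules {
  hex: string;
  text: string[]; // safe for text of any size at WCAG 2.2 AA
  largeText: string[]; // safe for large text only
}

const matrix: Record<string, ShadeRules> = {
  "100": { hex: "#faf2f5", text: ["900", "800", "700"], largeText: ["600"] },
  "300": { hex: "#e9bfcc", text: ["900", "800"], largeText: ["700"] },
  "600": { hex: "#c86f90", text: ["900"], largeText: ["100", "800", "white"] },
  "700": { hex: "#bb4268", text: ["white", "100"], largeText: ["300", "900"] },
  "800": { hex: "#67273f", text: ["white", "100", "300"], largeText: ["600"] },
  "900": { hex: "#3b1521", text: ["white", "100", "300", "600"], largeText: ["700"] },
};

// Look up how a foreground behaves on a given background shade.
function gradePair(background: string, foreground: string): Grade {
  const rules = matrix[background];
  if (rules === undefined) {
    throw new Error(`Unknown shade: ${background}`);
  }
  if (rules.text.includes(foreground)) return "AA";
  if (rules.largeText.includes(foreground)) return "AA-large";
  return "fail";
}
```

The point of shipping the table rather than only the colors is that the answer to "can I put this on that?" becomes a lookup, with nothing left to estimate.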

Over MCP it offers several ways in:

generate_palette: returns the palette and its full compatibility matrix as structured data.
validate_pairings: takes a list of intended foreground/background pairs and grades each one against that matrix before any CSS gets written, answering with a blunt proceed: true or proceed: false.
generate_css_tokens: returns a ready-to-paste :root {} block. It has an entry condition I'll get to later, because it changes something important about what follows.
check_contrast: a free-form contrast check between any two hex values, independent of a generated palette. The base scale is monochromatic by design — one hue, varying only in lightness — and that's exactly what makes it mathematically predictable. But almost no real interface lives on that alone: an error state, a sale badge, a brand accent all need a color the monochromatic scale doesn't have. check_contrast exists so that decision also runs through arithmetic instead of the model's eye. I'll come back to a problem this doesn't fully solve.
A resource, palette://{hex}/{theme}, that lets an agent pull the whole matrix by URI before making a single color decision.

This is exactly the kind of problem you want to hand off to a tool instead of a model, and the reason is precise: contrast is arithmetic. Working out the luminance ratio between two hex values is a calculation, and calculations are the kind of work you should keep out of a system that predicts text. The whole idea is to take that math out of the model's probabilistic "head" and put it into a tool that returns a deterministic answer.
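
For reference, this is what that arithmetic looks like. The sketch follows the WCAG 2.x definitions of relative luminance and contrast ratio, with the AA thresholds of 4.5:1 for normal text and 3:1 for large text. The function names are illustrative and this is not the library's own implementation.

```typescript
// WCAG 2.x relative luminance and contrast ratio.
// Function names are illustrative, not the library's API.
type Grade = "AA" | "AA-large" | "fail";

function hexToRgb(hex: string): [number, number, number] {
  const match = /^#([0-9a-fA-F]{2})([0-9a-fA-F]{2})([0-9a-fA-F]{2})$/.exec(hex);
  if (match === null) {
    throw new Error(`Invalid hex color: ${hex}`);
  }
  const [, r = "", g = "", b = ""] = match;
  return [parseInt(r, 16), parseInt(g, 16), parseInt(b, 16)];
}

// Convert one 8-bit sRGB channel to linear light.
function linearize(channel: number): number {
  const c = channel / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function relativeLuminance(hex: string): number {
  const [r, g, b] = hexToRgb(hex);
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

// The ratio is symmetric: the lighter color always goes in the numerator.
function contrastRatio(a: string, b: string): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

function gradeContrast(foreground: string, background: string): Grade {
  const ratio = contrastRatio(foreground, background);
  if (ratio >= 4.5) return "AA"; // normal text
  if (ratio >= 3) return "AA-large"; // large text only
  return "fail";
}
```

Nothing here needs judgment, which is the argument for keeping it in a tool: the same two hex values always produce the same ratio.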

That was the bet. Expose the library over MCP, and contrast failures should become close to impossible.

The diagnosis: why the model ignores the rules

They didn't become impossible — though the model never ignored the palette. It called the tool, received the output with every color and every rule intact, and did something more persistent: it rewrote the CSS from scratch, in its own conventions, its own variable names, its own structure. And, crucially, without the compatibility rules — the entire point of the tool — which got lost somewhere between reading the output and writing the page.

This wasn't constant or catastrophic. The failures were a minority, but persistent, and they always had the same shape: the safe-pairing information, the whole reason the tool exists, evaporated on the trip from the tool's output to the final stylesheet.

Why does this happen? It isn't an instruction-following failure. It's something simpler and harder to fight: a model trained on an enormous amount of CSS has a very strong sense of what well-written CSS looks like, and when it receives a piece of CSS-shaped data, it doesn't copy it; it regenerates it in that learned shape. The accessibility annotations were never part of that learned shape. They were just comments above the :root {} block, so they didn't survive the regeneration.

Generally, the system prompt carries more hierarchical weight, because it defines the model's "character" and global operating rules. Even so, a very strong prompt can still lose out to exposure bias (the model leaning on the examples it has just generated itself) or to the training bias described above.

The pattern is recognizable: the model reliably keeps the parts of the output it can't reconstruct from memory (a specific hex value) and rewrites the parts it can — the structure, the variable names, the comment formatting. There's no disobedience in that. It's the path of least resistance for a system that generates text token by token, and the path of least resistance is always the CSS it's already seen millions of times. No amount of "IMPORTANT: do not rewrite this" reliably beats that, because the instruction is just text, competing against the model's entire sense of the medium.

This reframes the problem. If the instruction loses against the training prior almost every time the two conflict, asking the model to behave differently isn't a lever worth pulling. The lever that's left is changing the format of what it receives, so that the correct option is cheaper to copy than to rewrite.

Three ways to investigate this without being an AI researcher

One part of the process I don't want to hide: I'm not an AI researcher, and I don't understand a transformer's internal mechanics at the level of the research papers. But that doesn't stop me from reasoning well about how one behaves, and I did it by drawing on three different sources, each with a blind spot the other two didn't share.

The first was a NotebookLM loaded with a dozen papers on LLMs, used as an oracle to ask plain-language questions: why does a model ignore explicit instructions but copy certain formats, what makes a piece of text easier to regenerate than to copy. From there came the hypothesis I eventually tested: that the difference between a block comment and an inline one isn't cosmetic — it changes how hard it is for the model to separate the data from the structure around it.
That same back-and-forth with the oracle also produced something more structural: a priority hierarchy that explains why compliance with logical rules can consistently lose against the statistical biases of training. That mental model is what let me turn my failed attempts into the architecture the project rests on today:

| Level | Component | Weight / Nature | Model Interaction |
| --- | --- | --- | --- |
| 1 | Training Prior | Highest. The statistical "gravitational pull." | It wins by default. Models prefer to regenerate familiar shapes learned from billions of tokens rather than copying bespoke rules that feel "unnatural" to their training. |
| 2 | System Prompt | High (Structural). Defines persona and operational boundaries. | It sets the stage, but it is text-based. It can be overridden by the Prior or forgotten in long context windows ("Lost in the Middle"). |
| 3 | Tools / MCP | Medium (Logical). External functions for deterministic logic. | The model treats tool output as "more text" to improvise on. The oracle's own framing for beating the Prior here is Read-Plan-Generate: forcing the model to plan and validate before it generates. It works, but only when something actually forces the sequence; left as a suggestion, the model skips straight to generating. |
| 4 | Skills | Variable. Learned capabilities (e.g., coding, math). | Are they system prompts? Usually not. Skills are capabilities baked into the weights through SFT/RLHF. You invoke a skill via a prompt, but the skill itself is part of the model's internal repertoire. |

The second was the implementing model, in actual working sessions: it proposed concrete formats, and by what it preserved versus what it stripped out when generating code, it revealed which of those formats actually held up under regeneration.

The third was the most direct: asking the models themselves, after they'd generated something, why they'd done what they did. I don't take that source at face value. A model explaining its own behavior is, at best, a plausible reconstruction, not real introspection — but more than once those answers pointed, in their own words, at exactly the spot where the tool was failing. One of them ended up proposing two reasonable architectural fixes. None of the three sources alone would have gotten me here.

First solution: inline comments

Once the diagnosis was clear, the first fix was a matter of format, not instruction.

A block comment at the top of the file is, to the model, separable documentation: it gets removed at no cost, because dropping a whole block doesn't break anything around it. An inline comment, pinned to the end of a declaration, is different: removing it means editing that specific line, one at a time, instead of discarding an entire block in one move. That extra friction is enough that, most of the time, the model doesn't bother.

The winning format was this:

```css
:root {
    --color-100: #faf2f5; /* ✅ text→900·800·700  ⚠️ lg→600 */
    --color-300: #e9bfcc; /* ✅ text→900·800  ⚠️ lg→700 */
    --color-600: #c86f90; /* ✅ text→900  ⚠️ lg→100·800·white */
    --color-700: #bb4268; /* ✅ text→white·100  ⚠️ lg→300·900 */
    --color-800: #67273f; /* ✅ text→white·100·300  ⚠️ lg→600 */
    --color-900: #3b1521; /* ✅ text→white·100·300·600  ⚠️ lg→700 */
}
```

Each variable carries, soldered on as an inline comment, the exact list of backgrounds it's safe on. The model copies the whole block, :root {} included, because separating the rule from the variable costs more than copying both together.
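
Producing that format from structured data is straightforward. The following is a minimal sketch of a serializer that emits one declaration per shade with its rules attached on the same line. The ShadeToken type and the toCssTokens function are hypothetical names, and the real generate_css_tokens tool may build its output differently.

```typescript
// Hypothetical serializer: one declaration per shade, with the pairing
// rules attached as an inline comment on the same line.
interface ShadeToken {
  shade: string; // e.g. "100"
  hex: string; // e.g. "#faf2f5"
  text: string[]; // safe for text of any size
  largeText: string[]; // safe for large text only
}

function toCssTokens(shades: ShadeToken[]): string {
  const declarations = shades.map((s) => {
    const rules = `✅ text→${s.text.join("·")}  ⚠️ lg→${s.largeText.join("·")}`;
    // The comment sits after the semicolon, so dropping it means
    // editing this exact line rather than deleting a separate block.
    return `    --color-${s.shade}: ${s.hex}; /* ${rules} */`;
  });
  return [":root {", ...declarations, "}"].join("\n");
}

const exampleCss = toCssTokens([
  { shade: "100", hex: "#faf2f5", text: ["900", "800", "700"], largeText: ["600"] },
  { shade: "900", hex: "#3b1521", text: ["white", "100", "300", "600"], largeText: ["700"] },
]);
```

The design choice worth noticing is that there is no standalone documentation block at all: every rule travels with the value it describes.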

Putting the comment inline at the end of the declaration creates a syntactic bond (contextual binding). As I understand it, to the model the CSS variable token and the accessibility rule token sit so close together in attention space that pulling them apart would mean swimming against its own training. The comment gets processed as part of the declaration's logical unit.

Getting here took several failed attempts, and each one taught something. An all-caps warning inside a comment ("COPY AS-IS") changed nothing — a comment that orders you to copy it is, to the model, still just a comment from a tool that doesn't really have enforcement capabilities. A plain-text table placed above the CSS got read and ignored just the same, because the model wrote its own CSS regardless. The format that worked wasn't the most human-readable one — it was the one that most closely resembled what the model already wanted to write.

Second attempt: the forced planning flow

Inline comments solve what survives once the model has the CSS in hand. They don't solve whether the model arrived at that CSS having checked the pairs before writing anything. For that I built plan-palette-usage, an MCP prompt that structures the whole sequence into four steps: read the matrix, plan each pair in plain text before writing any CSS, validate that list with validate_pairings, and only generate tokens if the whole validation came back clean.
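
The last two steps of that flow can be sketched as plain functions. This is an illustration under assumptions, not the prompt or the tool itself: the PlannedPair shape, the "use" field, and the rule that a large-text-only pairing passes only when the pair is declared as large text are all hypothetical. The two matrix rows are copied from this post's example palette.

```typescript
// Hypothetical types; the real tool's request and response may differ.
type Use = "text" | "large-text";

interface PlannedPair {
  background: string; // shade name, e.g. "100"
  foreground: string; // shade name, "white" or "black"
  use: Use;
}

interface ShadeRules {
  text: string[];
  largeText: string[];
}

// Two rows copied from the example palette in this post.
const matrix: Record<string, ShadeRules> = {
  "100": { text: ["900", "800", "700"], largeText: ["600"] },
  "700": { text: ["white", "100"], largeText: ["300", "900"] },
};

function pairIsSafe(pair: PlannedPair): boolean {
  const rules = matrix[pair.background];
  if (rules === undefined) return false; // unknown shade: never safe
  if (rules.text.includes(pair.foreground)) return true; // AA at any size
  // Assumption: large-only pairings pass only for declared large text.
  return pair.use === "large-text" && rules.largeText.includes(pair.foreground);
}

// Step 3: grade the whole plan and answer with a single verdict.
function validatePairings(pairs: PlannedPair[]): { proceed: boolean; failures: PlannedPair[] } {
  const failures = pairs.filter((p) => !pairIsSafe(p));
  return { proceed: failures.length === 0, failures };
}

// Step 4: only generate tokens when the whole plan came back clean.
function planThenGenerate(pairs: PlannedPair[], generate: () => string): string | null {
  return validatePairings(pairs).proceed ? generate() : null;
}
```

Written this way, the model's job shrinks to listing the pairs it intends to use; the verdict on each pair comes from the lookup.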

The idea was to turn a judgment problem, which models handle inconsistently, into a step-execution problem, which they handle better. It worked, when invoked. The problem, which took me a while to see clearly, was in that condition.

The nuance that changed everything: prompts need a human

I assumed the model could trigger plan-palette-usage on its own, the same way it triggers any other tool when it judges that doing so helps. That assumption is false, and not because of some particular client's limitation: the specification itself rules it out.

Of MCP's three primitives, tools are model-controlled: the model decides when to call them. Prompts are user-controlled: they only fire if a person invokes them explicitly. What changes between clients is the shape of that invocation, not who triggers it. In Claude Code, "explicitly" means literally: you need to type the exact command, /mcp:accessible-palette:plan-palette-usage, autocomplete included; asking the model in natural language to "use prompt X" isn't enough. In Cline, it is enough — if I ask it in my own message to run plan-palette-usage, it invokes the prompt without me typing the command. But the invocation still starts from a person requesting it by name or intent, not from the model deciding on its own that it was worth using. The barrier that matters for this story isn't the command's exact format: it's that in no client does the model trigger it by itself, unprompted.

In practice, this means the four-step flow almost never runs. Nobody is going to teach a non-technical person the exact command or the exact phrase needed to trigger a guided prompt when they just ask for a landing page with something like "make me a page about X using the palette mcp." And this library's accessibility guarantee couldn't depend on that.

Third solution: the gate on the server

This is what actually pushed me to change layers: if the model can't invoke the prompt on its own, then the whole guarantee rested on a mechanism that, in real-world use, almost never fires. I needed something that didn't depend on someone typing a command.

validate_pairings always computed contrast deterministically — that was never the weak point. The weak point was that generate_css_tokens, the tool that actually delivers the CSS, knew nothing about what had happened before it. You could skip validation entirely and ask for the tokens anyway, and the server would hand them over without asking. proceed: false was a good-faith convention between the prompt and the model; the server had no memory of whether that convention had been honored.

The fix, in code, was small: the server now keeps a record of which hex-and-theme combinations have already passed validate_pairings successfully, and generate_css_tokens checks that record before generating anything. If the combination hasn't been validated, the tool throws an error and refuses to respond — the same way it fails on an invalid hex, not as a warning the model can choose to read or ignore.

Technically, that record is an in-memory Set, instantiated once when the Node process starts. The transport is StdioServerTransport, so every client that launches the server spins up a fresh process, and "session" means nothing more sophisticated than that process's lifetime. The record doesn't persist to disk and has no TTL. It's deliberately the simplest unit that could work and, crucially, it doesn't depend on anyone writing anything by hand: it lives inside a tool the model was already going to call anyway.
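
Stripped of the MCP wiring, the gate amounts to a few lines. The sketch below keeps only the mechanism described above: a process-wide Set, written when validation passes and read before tokens are generated. The function names, the key format, and the placeholder return value are assumptions made for illustration, not the server's actual code.

```typescript
// Process-lifetime record of hex-and-theme combinations that passed
// validation. It is never written to disk and never expires.
const validated = new Set<string>();

// Hypothetical key format: normalized hex plus theme.
function keyFor(hex: string, theme: string): string {
  return `${hex.toLowerCase()}/${theme}`;
}

// Called at the end of validation; only a clean result is recorded.
function recordValidation(hex: string, theme: string, proceed: boolean): void {
  if (proceed) {
    validated.add(keyFor(hex, theme));
  }
}

// The gate: refuse with an error, not a warning, when the combination
// has not been validated in this process.
function generateCssTokens(hex: string, theme: string): string {
  if (!validated.has(keyFor(hex, theme))) {
    throw new Error(
      `No successful validate_pairings call recorded for ${hex} (${theme}).`
    );
  }
  // Placeholder output; the real tool returns the full :root {} block.
  return `:root { /* tokens for ${hex.toLowerCase()} on ${theme} */ }`;
}
```

Calling generateCssTokens before a successful validation throws, and a failed validation leaves the Set untouched, so the only path to the tokens runs through a clean result for that exact hex and theme.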

What this changes is where the guarantee lives. Before, it depended on a person knowing how to invoke a prompt, and on the model honoring a text-based convention. Now there's no decision to make: either validation already passed, in this same process, for this hex and this theme, or the tool produces no output.

I want to be honest about the real limit: this doesn't stop an agent from writing CSS by hand, bypassing the tool entirely, with whatever colors it likes. No MCP server can prevent that; it's outside its reach. What the gate guarantees is narrower, and at the same time sturdier: if the agent uses this tool to get the tokens, there's no way to get them without having validated first.

How it all landed

What holds the system up today is, in essence, a single piece: the server-side gate, backed by the inline comments so that what's been validated doesn't get lost on the way out. The prompt still exists, still invokable by anyone who knows the command, but I'm no longer sure what real advantage it has over the gate. In theory it forces a re-read of the matrix and explicit reasoning before validating; in practice, I have no evidence that changes the final outcome in any way the gate doesn't already achieve on its own. I'm documenting it for what it is: an optional add-on, not the mechanism anything important depends on.

I want to be concrete about what "works" means here, rather than settling for "I've used it and it's gone well." I generated 45 demo pages across several different agents and models, running on Open Code, Claude Code, GitHub Copilot, and Cline, using only minimal natural-language prompts like "build a landing page about X using the palette mcp," never invoking the guided prompt, and combining this with the accessibility skill from github/awesome-copilot (the same one its author, Michael Fairchild, tuned against his own evaluation harness before publishing it). I ran axe-core against all 45: 35 had zero violations of any kind. Of the 10 that did, 7 had color-contrast violations specifically: the exact problem this library exists to solve, still slipping through, but in a small fraction of the total. This sample is very small, so treat these numbers as indicative rather than conclusive, and I encourage you to try it yourself in whatever way suits you. Screenshots and full details are in the repository, under /demo.

It doesn't always come out perfect: every so often a failure still slips through. Not in the tokens the server delivers anymore — those are mathematically closed — but in the final component CSS, where the model can still ignore the manifest if it decides to write by hand outside the tool, or in accent colors, where I don't yet have the same kind of guarantee. Language model behavior is probabilistic, and no patch — not even one that lives on the server — is airtight against a formatting prior operating outside the protocol.

I'm not going to put my own benchmark number on this beyond what I've already shared: accessibility is context-dependent, and your model, your prompts, your task mix all move the result. So here's a direct invitation: install the server, try it against your own workflows, run your own axe-core or Lighthouse against what it generates, and tell me what you get. If you find a case where the model jumps the rails, or a format that works better than mine, I want to know — that's the only way to keep refining this.

What I do stand behind, without reservation, is the direction. A handful of occasional misses, now confined to the part of the process the protocol can't reach, lives in a completely different world from "low-contrast text on 84% of pages" or a 12% unguided pass rate. You're not installing a hard constraint on the whole process; you're installing a hard constraint on the arithmetic part, and leaning hard, with text and with structure, on the part that's still probabilistic.

What comes next

Two things remain open, and I'd rather name them with the same honesty as the rest of this piece than give the impression there's nothing left to fix here.

The first is mine: finding an equivalent, for accent colors, of what inline comments achieve for the base palette — so that an accent validated with check_contrast ends up somehow soldered to the CSS, instead of losing that validation on the first edit.

The second is Marta's: she's working on adapting the original algorithm to APCA (Accessible Perceptual Contrast Algorithm), the perceptual contrast method being defined for WCAG 3.0.

The palette algorithm was created by Marta Herrera Hollingsworth; the library and the MCP server are mine. This piece is a practical companion to my earlier essay, "As We May Code: Why Software Is a Human Problem Dressed in Logic."

Motivation data: WebAIM Million 2026 and the A11y LLM Eval report by Michael Fairchild. I'm part of the GitHub Accessibility Advisory Panel; that's where I came across Michael Fairchild's work, which motivated this project.

DEV Community