Last month I pushed a landing page that scored 97 on Lighthouse accessibility. Green across the board. I shipped it, felt good about it, and moved on.
Two days later a blind user on Mastodon posted a thread about the page. They couldn't figure out what the primary action was. The "Get Started" button was buried inside a div with an aria-label that said "hero section" -- screen readers announced it as a landmark, not a call to action. Lighthouse didn't care. The heading hierarchy jumped from h1 to h4 because I'd styled them to look right visually. The contrast ratios passed because they were technically 4.52:1, but the text was 14px on a busy background image. Compliant. Unusable.
That thread cost me nothing except pride. But it made me build something.
## The setup
I have a CI pipeline that runs on every deploy. Adding Lighthouse and axe-core to it was trivial -- most teams already do this. The interesting part was what I added on top: an AI agent that takes the axe-core results, the rendered HTML, and a screenshot of every key page, then evaluates them together.
The agent isn't a naive "is this accessible?" prompt bolted onto an API call. It runs as a Claude-backed script inside the pipeline, driven by a structured prompt that positions it as an accessibility specialist. It receives three inputs: the raw axe-core JSON output, the full DOM snapshot, and a set of screenshots captured by Playwright. Its job is to find the problems that rule-based tools cannot.
Here's the core of it:
```python
import base64
import json
import os
import subprocess
from pathlib import Path

import httpx

API_KEY = os.environ["ANTHROPIC_API_KEY"]
target_url = os.environ["TARGET_URL"]

# 1. Run axe-core via Playwright
result = subprocess.run(
    ["node", "scripts/axe-audit.js", target_url],
    capture_output=True, check=True,
)
axe_results = json.loads(result.stdout)

# 2. Capture screenshots (written to snapshots/ by the Node script)
subprocess.run(["node", "scripts/capture-screens.js", target_url], check=True)
screenshots = sorted(Path("snapshots").glob("*.png"))
screenshot_data = [base64.b64encode(p.read_bytes()).decode() for p in screenshots]

# 3. Feed everything to the AI agent
prompt_template = Path("audit-prompt.md").read_text()
prompt = prompt_template.format(
    axe_json=json.dumps(axe_results, indent=2),
    dom_snapshot=Path("snapshots/dom.html").read_text(),
    page_count=len(screenshots),
)

response = httpx.post(
    "https://api.anthropic.com/v1/messages",
    headers={"x-api-key": API_KEY, "anthropic-version": "2023-06-01"},
    json={
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 4096,
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": prompt},
                *[
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": img,
                    }}
                    for img in screenshot_data
                ],
            ]}
        ],
    },
    timeout=120,
)
```
The prompt itself is about 800 words. It tells the agent to evaluate five dimensions that axe-core structurally cannot: logical heading hierarchy, link text clarity in context, focus order coherence, color reliance for meaning (beyond contrast ratios), and whether interactive elements are visually distinguishable from decorative ones.
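The full 800-word prompt isn't reproduced here, but a skeleton showing how those five dimensions might be encoded in the template could look like this (the wording of each dimension is my paraphrase, not the author's actual prompt):

```python
# The five dimensions the agent evaluates, paraphrased as prompt bullets.
DIMENSIONS = [
    "Logical heading hierarchy: do heading levels descend without skips?",
    "Link text clarity: is each link's purpose clear out of visual context?",
    "Focus order coherence: does tab order follow the visual reading order?",
    "Color reliance: is any meaning conveyed by color alone?",
    "Interactive affordance: do clickable elements look clickable, and only those?",
]

# Plain string with .format() placeholders, matching how the pipeline
# loads audit-prompt.md and fills it in.
PROMPT_SKELETON = (
    "You are an accessibility specialist reviewing a web page.\n"
    "Inputs: axe-core JSON, a DOM snapshot, and {page_count} screenshots.\n\n"
    "Evaluate the page on these dimensions, which rule-based tools miss:\n"
    + "\n".join(f"{i}. {d}" for i, d in enumerate(DIMENSIONS, start=1))
    + "\n\nAxe-core results:\n{axe_json}\n\nDOM snapshot:\n{dom_snapshot}\n"
)
```

Keeping the placeholders in a separate markdown file, as the pipeline does, makes it easy to tune the prompt per project without touching code.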
## What it actually finds
I ran this on 14 production pages across three projects over the past four weeks. Here are the aggregate numbers.
Axe-core found 168 violations total. The AI agent confirmed all 168 and flagged an additional 147 issues -- 87% more. Of those 147, about 40% were contextual problems that no rule-based scanner can detect.
Some examples from real audits:
Misleading link text. A navigation bar had three links all labeled "Learn More." Axe-core doesn't flag this because each link is technically valid. The agent flagged it immediately: "Screen reader users navigating by links will hear 'Learn More' three times with no way to distinguish destinations."
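You don't need a model to catch the simplest version of this particular problem. A minimal sketch (a hypothetical helper, not part of the author's pipeline) flags link texts that point at more than one destination:

```python
from collections import defaultdict


def ambiguous_link_text(links: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Given (visible text, href) pairs, return texts that lead to
    multiple different destinations."""
    destinations = defaultdict(set)
    for text, href in links:
        destinations[text.strip().lower()].add(href)
    # Ambiguous when the same text leads to more than one place
    return {t: hrefs for t, hrefs in destinations.items() if len(hrefs) > 1}


nav = [
    ("Learn More", "/pricing"),
    ("Learn More", "/features"),
    ("Learn More", "/docs"),
    ("Contact", "/contact"),
]
print(ambiguous_link_text(nav))  # flags "learn more" with three hrefs
```

What the rule can't do is judge whether surrounding context (an `aria-describedby`, a heading directly above the link) already disambiguates the text; that contextual call is where the agent adds value.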
Heading hierarchy that makes visual sense but is structural nonsense. A pricing page used h2 for plan names and h4 for feature lists, skipping h3 entirely. Sighted users don't notice. The agent caught it because it evaluates the DOM structure, not just individual elements.
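The mechanical part of that check, detecting skipped levels, is easy to sketch. Here's a regex-based toy version (a real pipeline would walk the DOM instead); judging whether a skip is actually wrong for the content is the part that needs the agent:

```python
import re


def heading_level_skips(html: str) -> list[tuple[int, int]]:
    """Return (previous_level, current_level) pairs where the heading
    hierarchy jumps down by more than one step, e.g. h2 -> h4."""
    levels = [int(m) for m in re.findall(r"<h([1-6])\b", html, re.IGNORECASE)]
    return [
        (prev, cur)
        for prev, cur in zip(levels, levels[1:])
        if cur > prev + 1
    ]


page = "<h1>Pricing</h1><h2>Pro plan</h2><h4>Features</h4><h4>Limits</h4>"
print(heading_level_skips(page))  # [(2, 4)]
```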
Color as the sole differentiator. A form used green and red borders to indicate valid and invalid fields. It passed contrast checks. The agent looked at the screenshot and noted: "Field validation states are communicated only through border color. Users with color vision deficiency will not be able to distinguish valid from invalid fields. Add an icon or text label."
Fake buttons. A div styled to look like a button, with an onClick handler, but no role="button", no tabindex, no keyboard handler. Axe-core catches some of these, but this one had an aria-label that made it pass the automated check. The agent flagged it because it could see the visual appearance in the screenshot didn't match the semantic markup.
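A purely structural version of the fake-button check can be sketched with the standard library's HTML parser. This heuristic is my illustration, not the author's method; their agent infers the mismatch from the screenshot plus the DOM rather than from rules like these:

```python
from html.parser import HTMLParser


class FakeButtonScanner(HTMLParser):
    """Flag non-interactive elements that carry a click handler but lack
    button semantics (role and keyboard focusability)."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("div", "span") and "onclick" in a:
            missing = [need for need in ("role", "tabindex") if need not in a]
            if missing:
                self.findings.append((tag, a.get("aria-label"), missing))


scanner = FakeButtonScanner()
scanner.feed('<div onclick="go()" aria-label="Get Started">Get Started</div>')
print(scanner.findings)  # [('div', 'Get Started', ['role', 'tabindex'])]
```

Note the element above would sail past a label-presence check precisely because of its `aria-label`, which mirrors the case in the audit.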
Inconsistent focus indicators. Some buttons had visible focus rings, others didn't. Each individual element passed its own check. The agent evaluated the page as a whole and noted the inconsistency -- something no element-by-element scanner can do.
## What it gets wrong
Honesty matters here. The agent produces false positives. About 15% of its additional findings are noise -- things like flagging decorative images that intentionally have empty alt text, or questioning heading levels that are actually correct for the content structure. You need a human reviewing the output.
It also occasionally contradicts axe-core on severity. Axe might rate something as "moderate" while the agent calls it critical based on its understanding of the user flow. Usually the agent is right, but not always.
The pipeline outputs both reports side by side. The team reviews the AI agent's findings during the same PR review where they look at the axe-core output. It adds maybe 10 minutes to the review process. For what it catches, that's cheap.
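The article doesn't show the report format, but a minimal sketch of the side-by-side output, rendered as a markdown PR comment, might look like this (the field names on the findings are assumptions):

```python
def combined_report(axe_violations: list[dict], agent_findings: list[dict]) -> str:
    """Render both audit reports as one markdown block for PR review.
    Finding shapes ('id'/'help', 'severity'/'summary') are assumed."""
    lines = ["## Accessibility audit", ""]
    lines.append(f"### axe-core ({len(axe_violations)} violations)")
    lines += [f"- **{v['id']}**: {v['help']}" for v in axe_violations]
    lines.append("")
    lines.append(f"### AI agent ({len(agent_findings)} additional findings)")
    lines += [f"- [{f['severity']}] {f['summary']}" for f in agent_findings]
    return "\n".join(lines)


report = combined_report(
    [{"id": "color-contrast", "help": "Elements must have sufficient contrast"}],
    [{"severity": "critical", "summary": "Three nav links all read 'Learn More'"}],
)
print(report)
```

Posting both lists in one comment keeps the human review in a single pass instead of two tools with two dashboards.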
## Why this matters more than you think
The European Accessibility Act enforcement started in June 2025. WCAG compliance is no longer optional for companies serving EU customers. But WCAG compliance measured by automated tools covers roughly 30% of the actual success criteria. The other 70% requires human judgment.
An AI agent doesn't replace human judgment. But it gets you from 30% automated coverage to something closer to 55-60%. That middle ground -- the contextual, visual, structural problems that are too nuanced for rules but too tedious for humans to catch on every deploy -- is exactly where AI agents earn their keep.
I've been running this for a month now. It's caught 11 issues that would have shipped to production and been invisible to our automated checks. Three of those would have been WCAG AA failures that no scanner would have flagged.
The tooling isn't polished. The prompt needs tuning for each project. But the approach works, and it works today, with models that already exist.
## Getting started
If you want to try this pattern: start with axe-core in your CI (if you haven't already), add Playwright for screenshots, and feed both outputs to an AI model with a structured prompt focused on the five dimensions I mentioned above. The whole thing is maybe 200 lines of code and a well-crafted prompt.
The hard part isn't the code. It's writing a prompt that evaluates accessibility the way an experienced auditor does -- holistically, in context, with an understanding of how real users actually navigate. That took me three weeks of iteration to get right.
If you're building with AI tools and wondering how much technical debt you're shipping without realizing it, it's worth stepping back and assessing your workflow. I built a free 2-minute quiz that scores how safely you're using AI in your development process -- not just for accessibility, but across the board: Vibe Code Risk Assessment. No signup, no email capture, just a score and some things to think about.
If you've tried something similar or have a different approach to catching the stuff scanners miss, I'd genuinely like to hear about it.