<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Fairchild</title>
    <description>The latest articles on DEV Community by Michael Fairchild (@mfairchild365).</description>
    <link>https://dev.to/mfairchild365</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F152506%2F86319f69-42bd-469e-8c09-996986a2d551.jpeg</url>
      <title>DEV Community: Michael Fairchild</title>
      <link>https://dev.to/mfairchild365</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mfairchild365"/>
    <language>en</language>
    <item>
      <title>AI-generated accessibility, an update — frontier models still fail, but skills change the game</title>
      <dc:creator>Michael Fairchild</dc:creator>
      <pubDate>Thu, 21 May 2026 14:40:00 +0000</pubDate>
      <link>https://dev.to/mfairchild365/ai-generated-accessibility-an-update-frontier-models-still-fail-but-skills-change-the-game-5629</link>
      <guid>https://dev.to/mfairchild365/ai-generated-accessibility-an-update-frontier-models-still-fail-but-skills-change-the-game-5629</guid>
      <description>&lt;p&gt;A few months ago I shared early results from the &lt;a href="https://github.com/microsoft/a11y-llm-eval" rel="noopener noreferrer"&gt;A11y LLM Eval&lt;/a&gt; project, a benchmark that measures how accessibly LLMs generate UI code. The &lt;a href="https://dev.to/mfairchild365/embedding-accessibility-into-ai-based-software-development-282k"&gt;previous post&lt;/a&gt; showed that LLMs default to inaccessible code, explicit accessibility instructions can dramatically change that, and manual testing is still essential.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://microsoft.github.io/a11y-llm-eval-report/" rel="noopener noreferrer"&gt;latest report&lt;/a&gt; is out, with new models, a redesigned test scope, and a brand new mechanic: &lt;strong&gt;skills&lt;/strong&gt;. Two things stand out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The newest frontier models (GPT‑5.5, Claude Opus 4.7, Gemini 3.1 Pro Preview, Claude Haiku 4.5, and others) still fail accessibility checks by default.&lt;/li&gt;
&lt;li&gt;A well-written skill can produce the highest pass rates we've measured. Skills can even let a weak-baseline model outperform the leaders, though they can cost more tokens to run.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsn59056zro2lsfy5t48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsn59056zro2lsfy5t48.png" alt="Screenshot of the overview section of the report" width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the pass rate reflects only this harness's automated checks (a curated set of axe-core WCAG rules plus hand-written assertions per test case). Automated testing can detect only a subset of accessibility issues: 100% here means the sample passed every check that was run, not that the page is WCAG conformant or fully accessible.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Default ("control") accessibility is still bad: average pass rate is &lt;strong&gt;12%&lt;/strong&gt;, with &lt;strong&gt;GPT‑5.4 Mini&lt;/strong&gt; leading at &lt;strong&gt;25%&lt;/strong&gt;. Newer does not mean more accessible.&lt;/li&gt;
&lt;li&gt;Custom instructions still pay off. The Basic instruction set lifts pass rates by &lt;strong&gt;+48.5pp&lt;/strong&gt; to 60%.&lt;/li&gt;
&lt;li&gt;Skills go further. The &lt;code&gt;Building Accessible UI&lt;/code&gt; skill, run as a two-turn Generate then Review workflow, hits &lt;strong&gt;86% pass rate (+74.6pp)&lt;/strong&gt;. The top performer is &lt;strong&gt;Gemini 3.1 Pro Preview&lt;/strong&gt;, a model that scored only &lt;strong&gt;8%&lt;/strong&gt; on control.&lt;/li&gt;
&lt;li&gt;The skill's review turn costs about &lt;strong&gt;5.5× the input tokens&lt;/strong&gt; of control. Quality is not free.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's new in this report
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Results are now fully agentic. The previous report called LLM APIs directly with a single prompt and response. This report drives each evaluation through the &lt;a href="https://github.com/github/copilot-sdk" rel="noopener noreferrer"&gt;GitHub Copilot SDK&lt;/a&gt; as an actual agent, with tool use, multi-turn reasoning, and the same instruction and skill loading mechanics that Copilot agents use in production. The numbers below reflect how these models behave when wrapped in an agent loop, not in a one-shot API call. It also means the new "skills" variant is only possible because we're running through an agent runtime.&lt;/li&gt;
&lt;li&gt;8 models evaluated across 32 prompt cases (1,280 control samples). The model lineup is mostly fresh: GPT‑5.4, GPT‑5.4 Mini, GPT‑5.5, Claude Opus 4.7, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 3.1 Pro Preview, and Gemini 3 Flash Preview.&lt;/li&gt;
&lt;li&gt;Instruction sets tracked are now Basic and Minimal. The Detailed expert-level instruction set from the previous run isn't in this report.&lt;/li&gt;
&lt;li&gt;Skills, reusable task-specific guidance packages, were added as a new mechanic. The first skill evaluated is &lt;code&gt;Building Accessible UI&lt;/code&gt;. More on this below.&lt;/li&gt;
&lt;li&gt;A new variant token and pass-rate snapshot lets you compare quality against token cost across control, instructions, and skill turns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Control: frontier models, same accessibility problem
&lt;/h2&gt;

&lt;p&gt;The headline finding from the previous post was that LLMs default to producing inaccessible code. With the latest models in hand, that hasn't changed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdgkw39ac2nx6l7qb1x0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdgkw39ac2nx6l7qb1x0.png" alt="Screenshot of the control overview" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT still leads on control, but the top score is well below the 41% GPT‑5.2 posted previously. The prompt set has changed and the assertions are stricter, so the numbers aren't directly comparable, but the conclusion is the same: nobody is shipping accessible code by default.&lt;/li&gt;
&lt;li&gt;Claude Haiku 4.5 sits at 3%, with an average of nearly 6 WCAG failures per sample. Sonnet 4.6 and Gemini 3 Flash Preview aren't far behind.&lt;/li&gt;
&lt;li&gt;The hardest test case, Shopping Home Page (React, Dark theme), produced a 0% pass rate with 15.55 average WCAG failures across all models. Component density compounds the problem fast.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The training-data hypothesis from the previous post still seems to fit. The open web is overwhelmingly inaccessible, so models trained on it inherit those patterns regardless of how capable they are at code in general.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instruction sets: still the cheapest win
&lt;/h2&gt;

&lt;p&gt;Custom instructions are still the fastest thing a team can ship to improve accessibility.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1x7mxvrvidtrqx9onee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1x7mxvrvidtrqx9onee.png" alt="Screenshot of the instruction set table" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Basic instruction file produces nearly 5× the control pass rate while only increasing average input tokens by roughly 50%. Even the one-line Minimal instruction ("All output MUST be accessible.") more than triples it.&lt;/p&gt;

&lt;p&gt;If you do nothing else, ship an instruction file. The &lt;a href="https://aka.ms/a11y-custom-instructions" rel="noopener noreferrer"&gt;basic instructions&lt;/a&gt; are a good starting point to customize for your team's stack and design system.&lt;/p&gt;
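
&lt;p&gt;To make the two kinds of comparison above concrete, here is a minimal sketch of how a percentage-point delta differs from a relative multiplier. It uses the rounded pass rates quoted in this post (12% control, 60% Basic) as assumptions, so it lands on +48.0pp rather than the report's +48.5pp, which presumably comes from unrounded values.&lt;/p&gt;

```python
# Rounded pass rates quoted in this post (fractions of samples passing).
# Assumption: the report computes its deltas from unrounded values.
control = 0.12
basic = 0.60

# Percentage-point delta: the absolute difference between two rates.
delta_pp = (basic - control) * 100

# Relative multiplier: how many times the control rate the new rate is.
ratio = basic / control

print(f"delta: +{delta_pp:.1f}pp, ratio: {ratio:.1f}x")
```

&lt;p&gt;Both numbers describe the same change. The multiplier looks dramatic because the control rate is so low, while the percentage-point delta shows how many more samples out of 100 pass.&lt;/p&gt;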

&lt;h2&gt;
  
  
  Skills: the new mechanic
&lt;/h2&gt;

&lt;p&gt;The biggest change in this report is the introduction of &lt;strong&gt;skills&lt;/strong&gt;. Where instruction sets are always-on guidance loaded into the agent's context for every task, a skill is a reusable, task-specific package that bundles guidance, examples, supporting files, scripts, and a tool-use workflow. The agent loads the skill only when it's relevant, and within a skill, only loads the slice it needs for the task at hand.&lt;/p&gt;

&lt;p&gt;This matters because it changes two things at once: &lt;em&gt;what&lt;/em&gt; guidance the model sees and &lt;em&gt;when&lt;/em&gt; it sees it. Skills can carry far more detail than an instruction file without flooding the context window, and the two-turn Generate then Review pattern gives the model a structured second look at its own output before it's done. Together that's why skills outperform instructions in this report.&lt;/p&gt;

&lt;p&gt;The first skill evaluated is &lt;a href="https://github.com/microsoft/a11y-llm-eval/tree/main/config/skills/building-accessible-ui" rel="noopener noreferrer"&gt;Building Accessible UI&lt;/a&gt;. It is purpose-built to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Activate when the agent is creating UI, so generated code is more accessible by default.&lt;/li&gt;
&lt;li&gt;Contain expert-level guidance and checklists for many components and patterns.&lt;/li&gt;
&lt;li&gt;Only pull the guidance for the specific component or pattern being built, to limit token impact on the context window.&lt;/li&gt;
&lt;li&gt;Run a two-turn workflow: Generate the UI, then Review it against the skill's checklist and fix issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Pass rate&lt;/th&gt;
&lt;th&gt;Delta vs. control&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Building Accessible UI&lt;/code&gt;, Generate (turn 1)&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;td&gt;+70.4pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Building Accessible UI&lt;/code&gt;, Review (turn 2)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+74.6pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8frq20xene8brkgq6a4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8frq20xene8brkgq6a4.png" alt="Screenshot of skill table" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The best model under the skill is Gemini 3.1 Pro Preview, the same model that scored 8% on control. With the right scaffolding, a weak baseline can outperform the leaders.&lt;/li&gt;
&lt;li&gt;The review turn pulls its weight. Asking the agent to self-check against the skill's checklist adds 5.6pp on top of an already strong first turn, which is closer to how a human accessibility reviewer actually works.&lt;/li&gt;
&lt;li&gt;Skills don't dominate the prompt. Only 14 to 18% of input tokens come from the skill itself, compared to 100% for instruction sets. Most of the context window is still free for the actual task.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The cost question
&lt;/h2&gt;

&lt;p&gt;Skills win on quality, but they aren't free.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1ezwzh3lk9r2infuxwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1ezwzh3lk9r2infuxwc.png" alt="Screenshot of the token impact table" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The skill's review turn averages roughly 5.5× the input tokens and 2.7× the API calls of control. At scale, that's a meaningful budget impact.&lt;/p&gt;

&lt;p&gt;A practical split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instruction sets are broad, always-on guardrails. Cheap, simple to ship, and the best ratio of accessibility improvement to tokens spent. The downside is that the token impact of custom instructions can add up quickly if your project has instructions for multiple domains, such as accessibility, security, and content. Use these as the default for any team, but keep them short.&lt;/li&gt;
&lt;li&gt;Skills are focused, procedural guidance for higher-stakes work, or when you have the budget.&lt;/li&gt;
&lt;/ul&gt;
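
&lt;p&gt;One rough way to see the "best ratio" claim is to divide each variant's pass-rate gain by the extra input tokens it spends. The sketch below uses only figures quoted in this post and treats "roughly 50%" more tokens as 1.5× control. The metric itself is an illustrative assumption, not something the report publishes.&lt;/p&gt;

```python
# Figures quoted in this post, relative to control (control = 1.0x input tokens).
# Assumption: "roughly 50%" more input tokens is treated as exactly 1.5x.
variants = {
    "Basic instructions": {"delta_pp": 48.5, "input_tokens_x": 1.5},
    "Skill, Review turn": {"delta_pp": 74.6, "input_tokens_x": 5.5},
}


def pp_per_extra_token_unit(delta_pp, input_tokens_x):
    """Percentage points gained per 1x of control-sized extra input tokens."""
    extra = input_tokens_x - 1.0
    return delta_pp / extra


for name, v in variants.items():
    score = pp_per_extra_token_unit(v["delta_pp"], v["input_tokens_x"])
    print(f"{name}: {score:.1f}pp per extra 1x of input tokens")
```

&lt;p&gt;Under these assumptions the instruction file buys far more improvement per token, while the skill buys the highest absolute pass rate. That is the trade-off behind the split above.&lt;/p&gt;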

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;The advice from the previous post still holds, with one new lever:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ship a project-tailored instruction file today. Start from the &lt;a href="https://aka.ms/a11y-custom-instructions" rel="noopener noreferrer"&gt;basic instructions&lt;/a&gt; and customize for your stack, design system, and component library.&lt;/li&gt;
&lt;li&gt;Add a skill for high-stakes UI work if your token budget allows. The two-turn Generate then Review pattern materially improves outcomes.&lt;/li&gt;
&lt;li&gt;Bake automated accessibility checks into CI/CD and block PRs on regressions.&lt;/li&gt;
&lt;li&gt;Keep manual testing by humans, including people with disabilities. None of these tools (automated checks, instructions, or skills) can cover all accessibility requirements.&lt;/li&gt;
&lt;/ol&gt;
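
&lt;p&gt;As a sketch of what a CI gate could look like, the snippet below filters an axe-core style result object and decides whether a build should fail. The &lt;code&gt;violations&lt;/code&gt; list and its &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;impact&lt;/code&gt; fields follow axe-core's result format. The sample data, the function name, and the choice to block only on serious and critical issues are assumptions for illustration.&lt;/p&gt;

```python
def blocking_violations(axe_results, blocking_impacts=("serious", "critical")):
    """Return ids of axe-core violations severe enough to fail a build."""
    return [
        violation["id"]
        for violation in axe_results.get("violations", [])
        if violation.get("impact") in blocking_impacts
    ]


# Hypothetical, hand-written sample shaped like an axe-core result object.
sample_results = {
    "violations": [
        {"id": "image-alt", "impact": "critical"},
        {"id": "color-contrast", "impact": "serious"},
        {"id": "region", "impact": "moderate"},
    ]
}

failed = blocking_violations(sample_results)
# In CI, a non-zero exit code (for example via sys.exit) would block the PR.
exit_code = 1 if failed else 0
print(failed, exit_code)
```

&lt;p&gt;A real pipeline would produce the result object by running axe-core against rendered pages, and a passing gate still only covers what automated checks can detect.&lt;/p&gt;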

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The trajectory hasn't changed. AI keeps scaling how much UI we ship, and the open-web training data keeps scaling inaccessible patterns alongside it. Frontier models alone aren't going to fix this; the latest results from Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT‑5.5 make that clear.&lt;/p&gt;

&lt;p&gt;The toolbox has changed, though. Instructions still pay off in minutes. Skills are a new lever, and a powerful one: capable of taking an 8% baseline model to 86% on these checks. The work now is to pick the right tool for the right task, enforce it in CI, and keep humans in the loop where it matters most.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://microsoft.github.io/a11y-llm-eval-report/" rel="noopener noreferrer"&gt;See the full report&lt;/a&gt; and the &lt;a href="https://github.com/microsoft/a11y-llm-eval" rel="noopener noreferrer"&gt;a11y-llm-eval repository on GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>a11y</category>
      <category>llm</category>
      <category>ai</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>Embedding Accessibility into AI based software development</title>
      <dc:creator>Michael Fairchild</dc:creator>
      <pubDate>Thu, 12 Mar 2026 16:07:21 +0000</pubDate>
      <link>https://dev.to/mfairchild365/embedding-accessibility-into-ai-based-software-development-282k</link>
      <guid>https://dev.to/mfairchild365/embedding-accessibility-into-ai-based-software-development-282k</guid>
      <description>&lt;p&gt;At &lt;a href="https://conference.csun.at/event/2026/summary" rel="noopener noreferrer"&gt;CSUN-AT 2026&lt;/a&gt;, I spoke with my colleague Mallika Meiyappan on Embedding Accessibility into AI based software development. Here are some key take aways.&lt;/p&gt;

&lt;p&gt;AI is causing rapid transformation across the entire development lifecycle. It's embedded in design tools, developer workflows, content creation, and user experiences. This speed and scale increase both productivity and the risk of scaling accessibility issues.&lt;/p&gt;

&lt;p&gt;Unless accessibility is intentionally built into AI-powered workflows, we risk scaling accessibility barriers as fast as we scale productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AI is scaling development speed, and it will scale accessibility problems too if accessibility isn’t intentionally built into AI workflows.&lt;/li&gt;
&lt;li&gt;LLMs generate poorly accessible code by default, largely because they’re trained on web code where most sites already have accessibility issues.&lt;/li&gt;
&lt;li&gt;Explicit accessibility instructions dramatically improve results, with structured guidance pushing some models from near-zero to over 90% pass rates.&lt;/li&gt;
&lt;li&gt;Teams should embed accessibility into AI tooling and pipelines, using custom instructions, CI/CD checks, and continued manual testing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  LLMs don't do a great job of generating accessible code
&lt;/h2&gt;

&lt;p&gt;At Microsoft, I built an evaluation tool to benchmark how well LLMs produce accessible code. That tool is available at &lt;a href="https://github.com/microsoft/a11y-llm-eval" rel="noopener noreferrer"&gt;github.com/microsoft/a11y-llm-eval&lt;/a&gt;. The tool contains a test suite of prompts to generate pages and common components, then evaluates the resulting code against the &lt;a href="https://github.com/dequelabs/axe-core/" rel="noopener noreferrer"&gt;axe-core&lt;/a&gt; automated scanner via &lt;a href="https://playwright.dev/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;. Axe-core is great, but it's a generic testing tool and can't test keyboard behaviors or know to expect certain semantics or other behaviors. Because of this, each prompt has an additional suite of custom tests that go beyond what axe-core can test.&lt;/p&gt;

&lt;p&gt;That being said, it's important to note that these tests &lt;strong&gt;do not fully evaluate &lt;a href="https://www.w3.org/WAI/WCAG22/quickref/" rel="noopener noreferrer"&gt;WCAG&lt;/a&gt; or guarantee fully accessible results&lt;/strong&gt;. Manual testing is still essential.&lt;/p&gt;

&lt;p&gt;The prompts do not contain anything about accessibility. This is done to establish a baseline/control for how well the LLMs produce accessible code by default, without explicit prompts for accessible code.&lt;/p&gt;

&lt;p&gt;The results paint a pretty bleak picture. &lt;a href="https://microsoft.github.io/a11y-llm-eval-report/" rel="noopener noreferrer"&gt;View the most recent report&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsutenp61jz6w71xadii.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsutenp61jz6w71xadii.jpeg" alt="Screenshot of the report as of Feb 2026" width="800" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT‑5.2 takes the lead with 41% passing.&lt;/li&gt;
&lt;li&gt;The top 3 models are all GPT models.&lt;/li&gt;
&lt;li&gt;The rest of the models score zero or near zero, including Gemini 3 Pro, Grok 4 Fast Non-Reasoning, Gemini 3 Flash Preview, DeepSeek V3.2, Claude Haiku 4.5, Claude Sonnet 4.5, and Claude Opus 4.6.&lt;/li&gt;
&lt;li&gt;This results in an average score of about 10% across all models.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why are results so bad?
&lt;/h3&gt;

&lt;p&gt;It's difficult to know for sure, but what we do know is that about &lt;a href="https://webaim.org/projects/million/" rel="noopener noreferrer"&gt;95% of websites have accessibility issues&lt;/a&gt;. So it's safe to assume that these models are being trained on code that is inaccessible, and thus producing results that are inaccessible.&lt;/p&gt;

&lt;p&gt;So why is GPT so much better? I'm not sure, but my guess is that they are training on a higher-quality dataset that has more accessible code than what you would generally find in the wild.&lt;/p&gt;

&lt;h2&gt;
  
  
  What can devs do to improve results?
&lt;/h2&gt;

&lt;p&gt;This is where custom instructions for accessibility come into play. Custom instructions are files (usually .md files) that enable you to define common guidelines and rules that automatically influence how AI generates code. &lt;a href="https://code.visualstudio.com/docs/copilot/customization/custom-instructions" rel="noopener noreferrer"&gt;Visual Studio Code has some great documentation on this&lt;/a&gt;. If set up correctly, the agent will automatically use these instructions for all prompts.&lt;/p&gt;

&lt;p&gt;As part of the A11y LLM Eval project, I've benchmarked three different custom instruction files for accessibility.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzsocis5d8jl77huxg8m.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzsocis5d8jl77huxg8m.jpeg" alt="Screenshot of the report summary for instruction sets" width="800" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Minimal: just says "All output MUST be accessible." This alone resulted in an &lt;strong&gt;18 percentage point jump.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Basic: says "All output MUST be accessible.
Use semantic HTML first; only use ARIA when necessary, and ensure full keyboard support.
Conform to &lt;a href="https://www.w3.org/TR/WCAG22/" rel="noopener noreferrer"&gt;WCAG 2.2 Level AA&lt;/a&gt;." This resulted in a &lt;strong&gt;37 percentage point jump&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Detailed: full-on, expert-level guidance. This resulted in a &lt;strong&gt;48 percentage point jump&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
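
&lt;p&gt;As a rough sketch, the Basic text quoted above can be dropped into a repository-wide instruction file. The &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; location is an assumption based on the Visual Studio Code custom instructions documentation, and the helper name is hypothetical. Check your own tool's documentation for where it expects instruction files.&lt;/p&gt;

```python
from pathlib import Path
import tempfile

# The Basic instruction text as quoted in this post.
BASIC_INSTRUCTIONS = (
    "All output MUST be accessible.\n"
    "Use semantic HTML first; only use ARIA when necessary, "
    "and ensure full keyboard support.\n"
    "Conform to WCAG 2.2 Level AA.\n"
)


def write_instructions(repo_root):
    """Write the Basic instructions to an assumed repository-wide location."""
    target = Path(repo_root) / ".github" / "copilot-instructions.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(BASIC_INSTRUCTIONS, encoding="utf-8")
    return target


# Demonstrate against a throwaway directory instead of a real repository.
with tempfile.TemporaryDirectory() as repo_root:
    path = write_instructions(repo_root)
    print(path.read_text(encoding="utf-8"))
```

&lt;p&gt;In practice you would commit the file by hand. The point is only that the whole Basic set fits in three short lines.&lt;/p&gt;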

&lt;p&gt;So just mentioning the word "accessibility" has a huge impact on results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvhv1vxri22btekwzlhw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvhv1vxri22btekwzlhw.jpeg" alt="Screenshot of detailed results of instruction sets" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But if you look closer, with the Detailed instruction set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some models, like GPT, scored over 90%.&lt;/li&gt;
&lt;li&gt;Other models only saw marginal improvements.&lt;/li&gt;
&lt;li&gt;Still other models kept scoring zero (looking at you, Grok).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  So what instructions should I use?
&lt;/h3&gt;

&lt;p&gt;I've published the detailed instructions at the &lt;a href="https://github.com/github/awesome-copilot/blob/main/instructions/a11y.instructions.md" rel="noopener noreferrer"&gt;Awesome Copilot&lt;/a&gt; project. This is a great place to start.&lt;/p&gt;

&lt;p&gt;But these instructions are still generic. It's best to customize your instructions to fit your specific project. &lt;a href="https://code.visualstudio.com/docs/copilot/customization/custom-instructions" rel="noopener noreferrer"&gt;Visual Studio Code has great guidance on this&lt;/a&gt;. Here are some tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define team- and project-specific workflows, tools, standards, design systems, and component libraries.&lt;/li&gt;
&lt;li&gt;Use precise language, like MUST, MUST NOT, SHOULD, and SHOULD NOT.&lt;/li&gt;
&lt;li&gt;Use lists to format your instructions when possible. LLMs love structure.&lt;/li&gt;
&lt;li&gt;Ask an agent to optimize your instructions. This can be very helpful.&lt;/li&gt;
&lt;li&gt;DO NOT paste entire standards or guidelines like WCAG or ARIA in your instructions. This will often result in worse code.&lt;/li&gt;
&lt;li&gt;DO NOT put critical resources behind links; agents will not follow them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What about other aspects of software development?
&lt;/h2&gt;

&lt;p&gt;AI is having a huge impact on all aspects of software development. Here are some insights and opportunities:&lt;/p&gt;

&lt;h3&gt;
  
  
  Research
&lt;/h3&gt;

&lt;p&gt;Change: AI is being leveraged to speed up product and UX research. "Synthetic users" are AI bots that pretend to be users and give feedback and insights on ideas and designs. Additionally, AI is analyzing more data than ever and identifying trends that result in new features or changes.&lt;/p&gt;

&lt;p&gt;Opportunity: "Synthetic users" can be used to help provide quick feedback on accessibility too, but they cannot replace lived experiences and insights from people with disabilities. It may also be possible to leverage AI to help detect accessibility issues from customer feedback and data insights - but we need to be careful and mindful of privacy. &lt;/p&gt;

&lt;h3&gt;
  
  
  Design
&lt;/h3&gt;

&lt;p&gt;Change: Designers are being asked to use AI now more than ever. AI is being used for rapid prototyping, and some designers are moving away from static designs to vibe-coded prototypes. Speed is a huge pressure, and it's common for designers and developers to work in parallel, rather than following a classic handoff from design to engineering. We are even seeing a desire for designers to contribute directly to production code, but this has yet to become a reality.&lt;/p&gt;

&lt;p&gt;Opportunity: AI can be leveraged to help designers annotate for accessibility quickly and accurately, as well as review their designs and annotations. Additionally, designers can leverage custom instructions for accessibility to improve their vibe-coded prototypes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;p&gt;Change: Development is happening at a much larger scale than ever before, and testing is struggling to keep up.&lt;/p&gt;

&lt;p&gt;Opportunity: Leverage AI to facilitate and assist in testing. Clear policy and quality gates are now more important than ever and need to be consistently enforced. Ensuring that accessibility is baked into the CI/CD pipeline and blocks pull requests is essential. &lt;strong&gt;Manual testing by humans remains essential.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>a11y</category>
      <category>llm</category>
      <category>ai</category>
      <category>benchmark</category>
    </item>
  </channel>
</rss>
