TL;DR A great agent skill is not a pile of documentation. It is a tightly scoped SKILL.md with a description engineered for discovery, ruthless conciseness, anti-patterns stated up front, a checklist workflow, and a feedback loop. The format is an open standard that works across Claude Code, OpenAI Codex, Google Antigravity, Gemini CLI, and Cursor. This post synthesizes the official authoring guidance from Flutter, Anthropic, Google, and OpenAI into one recipe, hands you a complete copy-pasteable Flutter skill, and shows you how to actually evaluate it instead of guessing.
In my last article, I wrote about the official Dart and Flutter Agent Skills and why they stop your AI from writing 2022 Flutter. The most common reply I got was some version of the same question:
"Cool. How do I write my own?"
So I went and read the actual playbooks. Not the hot takes, the primary sources: Flutter's skill docs and eval framework, Anthropic's skill authoring best practices, Google's Antigravity skill docs, and OpenAI's Codex skill guide. The good news is they agree on almost everything. The better news is that the gap between a skill that works and a skill that gets silently ignored comes down to a handful of decisions, and most people get them wrong.
Here is the recipe, Flutter-flavored.
Table of Contents
- Why a bad skill is worse than no skill
- The anatomy you need before the recipe
- One format, every agent
- The recipe: 9 ingredients of a skill that works
- A complete Flutter skill you can steal
- How to actually evaluate your skill
- The security caveat nobody mentions
- The honest take from the community
- The ship-it checklist
- FAQ
- Wrapping up
Why a bad skill is worse than no skill
AI agents are generalists. They average across years of Flutter code, much of it deprecated, and hand you the most statistically common answer instead of the currently correct one. The Flutter team named this the knowledge gap: the framework ships features faster than language models can update their training data. Skills exist to close that gap by handing the agent a task-specific, expert workflow.
But here is what nobody tells you. A poorly written skill does not just fail to help. It actively costs you. Every skill's metadata sits in the agent's context budget at all times. A vague skill that never triggers is dead weight. A skill with a fuzzy description that triggers on the wrong tasks is worse, because now your agent is following the wrong playbook with full confidence.
The bar is not "wrote some Markdown." The bar is "the agent reliably finds it, trusts it, and follows it." Everything below is in service of that bar.
The anatomy you need before the recipe
A skill is the simplest possible thing: a folder with one required file.
building-riverpod-async-screens/
├── SKILL.md # Required: metadata + instructions
├── references/ # Optional: deep-dive docs loaded on demand
├── examples/ # Optional: reference implementations
├── scripts/ # Optional: scripts the agent runs, not reads
└── assets/ # Optional: templates, images
The SKILL.md itself is YAML frontmatter plus a Markdown body:
---
name: building-riverpod-async-screens
description: "Builds a Flutter screen that loads async data with Riverpod..."
---
# Building Riverpod Async Screens
[instructions go here]
The magic that makes this scale is progressive disclosure. At startup the agent loads only the lightweight metadata (name, description, path) of every skill. It reads the full SKILL.md only when a task matches, and it reads anything in references/ or examples/ only when the body points it there. If you write Flutter, you already know this pattern: it is deferred loading for the context window. OpenAI, Anthropic, and Google all describe the exact same mechanism.
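To make the two-stage loading concrete, here is a minimal Python sketch of how a tool could do it. This is not how any particular agent implements discovery; `SkillMeta`, `read_frontmatter`, `load_metadata`, and `load_body` are hypothetical names, and the frontmatter parser only handles the simple `key: value` shape shown above.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class SkillMeta:
    name: str
    description: str
    path: Path


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse simple `key: value` pairs between the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    key = None
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if line[:1].isspace() and key:
            # Indented line: continuation of the previous value.
            meta[key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            meta[key] = value.strip().strip('"')
    return meta


def load_metadata(skills_dir: Path) -> list[SkillMeta]:
    """Startup step: read only name and description from each SKILL.md."""
    metas = []
    for skill_file in sorted(skills_dir.glob("*/SKILL.md")):
        fm = read_frontmatter(skill_file.read_text(encoding="utf-8"))
        metas.append(SkillMeta(fm.get("name", ""), fm.get("description", ""), skill_file))
    return metas


def load_body(meta: SkillMeta) -> str:
    """On-demand step: read the instructions only once a task matches."""
    text = meta.path.read_text(encoding="utf-8")
    parts = text.split("---", 2)
    return parts[2].strip() if len(parts) == 3 else text


# Demo against a throwaway skill folder.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "building-riverpod-async-screens"
    skill.mkdir()
    (skill / "SKILL.md").write_text(
        "---\nname: building-riverpod-async-screens\n"
        "description: Builds a Flutter screen that loads async data.\n"
        "---\n# Building Riverpod Async Screens\n[instructions go here]\n",
        encoding="utf-8",
    )
    metas = load_metadata(Path(tmp))
    demo_name = metas[0].name
    demo_first_line = load_body(metas[0]).splitlines()[0]
    print(demo_name, "->", demo_first_line)
```

The point of the split is cost: `load_metadata` touches every skill but keeps only two short strings per skill, while `load_body` runs for the one skill that matched.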
One format, every agent
This is the part that makes writing a skill worth your time. SKILL.md is an open standard (published at agentskills.io, originated at Anthropic, since adopted across the ecosystem). One skill works almost everywhere:
| Tool | Vendor | Where skills live |
|---|---|---|
| Claude Code | Anthropic | .claude/skills/ (project), ~/.claude/skills/ (personal) |
| OpenAI Codex | OpenAI | .codex/skills/ (project), ~/.codex/skills/ or ~/.agents/skills/ |
| Antigravity | Google | .agents/skills/ (workspace), ~/.gemini/antigravity/skills/ (global) |
| Gemini CLI | Google | SKILL.md standard locations |
| Cursor / Copilot | Various | supported with manual placement |
The Flutter team's installer targets the cross-tool location directly:
npx skills add flutter/skills --skill '*' --agent universal
The --agent universal flag drops everything into .agents/skills, the folder compatible agents auto-discover. Write a skill once, and your whole team gets the same expertise regardless of which agent they prefer. Codex adds a distribution layer on top (it calls the authoring format a "skill" and the installable package a "plugin"), but the core file is identical.
The recipe: 9 ingredients of a skill that works
Every official source converges on these. I have ordered them by how much they matter in practice.
1. The description is 80% of the battle
If your skill does not trigger, it is almost never the instructions. It is the description. This is the single most important line in the entire file, because it is the only part the agent reads when deciding whether to load your skill at all, often choosing from 100+ candidates.
Three rules from the official guidance:
- Write in third person. The description is injected into the system prompt. "I can help you build screens" and "You can use this to..." both cause discovery problems. Write "Builds Flutter screens that...".
- State what it does AND when to use it. Include concrete trigger words a developer would actually type.
- Front-load the key use case. Codex's guide is explicit: put the trigger terms first so matching still works if the description gets truncated. Antigravity recommends adding a "Do not use" clause to stop over-activation.
Compare:
# Weak: vague, no triggers, will rarely fire correctly
description: Helps with Flutter screens.
# Strong: third person, what + when + triggers + boundary
description: Builds a Flutter screen that loads async data with Riverpod,
  handling loading, error, and data states with AsyncValue. Use when
  fetching from a repository or API and rendering spinners, retry UI, and
  lists. Do not use for purely static screens with no async data.
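The three rules are mechanical enough to lint. Below is a rough Python sketch of such a check; `lint_description` is a hypothetical helper, and the exact phrases it looks for ("Use when", "Do not use") plus the eight-word minimum are my assumptions about reasonable heuristics, not anything the official guides prescribe.

```python
import re

# Standalone first- or second-person pronouns hint at the wrong voice.
FIRST_OR_SECOND_PERSON = re.compile(r"\b(I|we|you|your)\b", re.IGNORECASE)


def lint_description(description: str) -> list[str]:
    """Return a list of problems; an empty list means the heuristics found nothing."""
    problems = []
    text = " ".join(description.split())  # collapse line breaks and extra spaces
    lowered = text.lower()
    if FIRST_OR_SECOND_PERSON.search(text):
        problems.append("uses first or second person")
    if "use when" not in lowered:
        problems.append("no 'Use when' trigger clause")
    if "do not use" not in lowered:
        problems.append("no 'Do not use' boundary")
    if len(text.split()) < 8:
        problems.append("too short to carry trigger words")
    return problems


weak = "Helps with Flutter screens."
strong = (
    "Builds a Flutter screen that loads async data with Riverpod, "
    "handling loading, error, and data states with AsyncValue. Use when "
    "fetching from a repository or API and rendering spinners, retry UI, and "
    "lists. Do not use for purely static screens with no async data."
)

print(lint_description(weak))
print(lint_description(strong))
```

A clean result only means the description has the right shape. Whether it actually triggers on the right tasks is something only an evaluation can tell you.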
2. Be ruthlessly concise
Anthropic puts it perfectly: the context window is a public good. Your skill shares it with the system prompt, the conversation, every other skill's metadata, and the user's actual request. The default assumption must be that the agent is already very smart.
Do not explain what Flutter is. Do not explain what a widget is. Do not define JSON. Challenge every sentence: does the agent really not know this? Keep the SKILL.md body under 500 lines. If it grows past that, split it into references/ files.
<!-- Bad: wastes tokens on what the model already knows -->
Flutter is Google's UI toolkit. A widget is a building block of the UI.
To make a network call, you first need an HTTP client, which is a piece
of software that...
<!-- Good: assumes competence, gets to the point -->
Use the `http` package for REST calls. Wrap responses in a typed model.
3. Match the degrees of freedom to the task
This framing from Anthropic is the one most people miss. Think of the agent as a robot walking a path:
- Narrow bridge with cliffs (low freedom): one correct sequence, high cost of failure. Give exact, rigid instructions. Example: "Run exactly dart run build_runner build --delete-conflicting-outputs. Do not modify the flags."
- Open field (high freedom): many valid routes, context decides. Give direction and trust the agent. Example: "Structure the feature using the layered approach; choose folder names that fit the existing project."
Fragile, deterministic Flutter operations (code generation, migrations, platform config) want low freedom. Architectural and design decisions want high freedom. Most skills need a mix.
4. Lead with anti-patterns, not just patterns
This is what makes the official Flutter skills so effective, and it is the ingredient that separates a senior skill from a junior one. Do not only say what to do. Ban the wrong instinct explicitly.
The official flutter-build-responsive-layout skill does exactly this. It does not just say "be responsive." It says: do NOT switch layouts on MediaQuery.orientationOf, do NOT check for "phone" vs "tablet", do NOT lock orientation. Those negative rules are what stop the model from reaching for the plausible-but-wrong pattern it learned from a thousand old tutorials.
## Rules
- Use `AsyncValue.when` to render data/loading/error. Never assume data is present.
- Do NOT use `FutureBuilder` for server state. A future created inside
  `build` re-runs on every rebuild and causes duplicate network calls.
- Do NOT swallow exceptions or show an infinite spinner on failure.
5. Turn the task into a checklist workflow
For any multi-step task, give the agent a checklist it can copy into its response and tick off. This prevents skipped steps, which is the most common failure mode on complex work. Both Anthropic and Flutter's own skills use this pattern.
## Workflow
Copy this checklist and track progress:
- [ ] Define the immutable data model.
- [ ] Add the repository method returning `Future<Model>`.
- [ ] Create the provider that calls the repository.
- [ ] Build the screen with `ref.watch` + `AsyncValue.when`.
- [ ] Implement the error branch with a retry action.
- [ ] Run `dart analyze` and fix everything. Repeat until clean.
6. Add a feedback loop
The highest-leverage pattern in the entire playbook: run validator, fix errors, repeat. Give the agent an objective check it can run and a rule to keep going until it passes. In Flutter, you have world-class validators for free.
After generating code, run `dart analyze`. If it reports issues, fix them
and run it again. Only present the result when analysis is clean and
`flutter test` passes.
This single habit improves output quality more than almost anything else, because it converts "looks right" into "provably compiles and lints clean."
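The loop itself is tiny. Here is a Python sketch of the shape, with the validator and the fixer passed in as functions so the logic is visible on its own. `run_check` and `validate_until_clean` are hypothetical names, the cap of five rounds is an arbitrary assumption, and the sketch assumes the validator exits non-zero when it finds problems.

```python
import subprocess
from typing import Callable


def run_check(command: list[str]) -> tuple[bool, str]:
    """Run one validator command; return (passed, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def validate_until_clean(
    check: Callable[[], tuple[bool, str]],
    fix: Callable[[str], None],
    max_rounds: int = 5,
) -> bool:
    """Run `check`, hand failures to `fix`, repeat until clean or out of rounds."""
    for _ in range(max_rounds):
        passed, output = check()
        if passed:
            return True
        fix(output)
    return False


# Stand-in validator for the demo: each "fix" resolves one remaining issue.
issues = ["unused_import", "missing_await"]


def fake_check() -> tuple[bool, str]:
    return (not issues, "\n".join(issues))


def fake_fix(output: str) -> None:
    issues.pop()


clean = validate_until_clean(fake_check, fake_fix)
print(clean)  # True
```

In a real project the check would be something like `lambda: run_check(["dart", "analyze"])` and the fix step is the agent editing code. The cap matters: without it, a validator the agent cannot satisfy turns into an endless loop.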
7. Use progressive disclosure deliberately
Keep the main SKILL.md as a lean overview and push depth into linked files. Three patterns, named by the Antigravity docs:
- Router pattern: SKILL.md only. For focused, single-purpose skills.
- Reference pattern: SKILL.md + references/. For skills with deep API detail.
- Few-shot pattern: SKILL.md + examples/. For skills where output quality depends on seeing worked examples.
Two rules when you split: keep references one level deep from SKILL.md (the agent may only partially read nested files), and add a table of contents to any reference file longer than 100 lines so the agent can see the full scope even on a partial read.
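These layout rules, plus the 500-line budget from earlier, can be checked by a script before you commit. A minimal Python sketch follows; `check_layout` is a hypothetical helper, it counts the whole SKILL.md (frontmatter included) as a conservative stand-in for the body, and it treats any line containing "contents" near the top of a file as a table of contents, which is only a heuristic.

```python
from pathlib import Path
import tempfile

MAX_SKILL_LINES = 500   # budget for SKILL.md
TOC_THRESHOLD = 100     # reference files longer than this need a table of contents


def check_layout(skill_dir: Path) -> list[str]:
    """Return layout problems for one skill folder; empty list means none found."""
    problems = []
    skill_md = skill_dir / "SKILL.md"
    if len(skill_md.read_text(encoding="utf-8").splitlines()) > MAX_SKILL_LINES:
        problems.append("SKILL.md exceeds 500 lines")
    refs = skill_dir / "references"
    if refs.is_dir():
        for path in sorted(refs.rglob("*.md")):
            rel = path.relative_to(refs)
            if len(rel.parts) > 1:
                problems.append(f"{rel.as_posix()} is nested more than one level deep")
            lines = path.read_text(encoding="utf-8").splitlines()
            has_toc = any("contents" in line.lower() for line in lines[:20])
            if len(lines) > TOC_THRESHOLD and not has_toc:
                problems.append(f"{rel.as_posix()} is long but has no table of contents")
    return problems


# Demo: one long reference without a table of contents, one nested reference.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "SKILL.md").write_text("---\nname: demo\n---\n# Demo\n", encoding="utf-8")
    (root / "references" / "deep").mkdir(parents=True)
    long_text = "\n".join(f"line {i}" for i in range(150))
    (root / "references" / "api.md").write_text(long_text, encoding="utf-8")
    (root / "references" / "deep" / "more.md").write_text("short", encoding="utf-8")
    demo_problems = check_layout(root)
    for problem in demo_problems:
        print(problem)
```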
8. Kill time-sensitive information
Never write "before August 2025, use the old API." It rots. Instead, put deprecated guidance in a collapsed "old patterns" section so the current path stays clean while history stays available.
## Old patterns
<details>
<summary>Why not FutureBuilder? (legacy)</summary>
`FutureBuilder` re-runs its future on every rebuild unless cached, causing
duplicate calls. Providers cache and dedupe by default. Prefer providers.
</details>
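If you want a safety net for this rule, a small scan can flag dated phrasing that sits outside a collapsed section. The Python sketch below is one way to do it; `find_dated_lines` is a hypothetical helper and the regular expression only catches a handful of phrasings such as "before August 2025" or "since 2023", so treat it as a starting point.

```python
import re

# Matches phrases like "before August 2025", "as of 2024", "since 2023".
DATE_HINT = re.compile(
    r"\b(before|after|since|until|as of)\s+"
    r"(january|february|march|april|may|june|july|august|"
    r"september|october|november|december)?\s*\d{4}\b",
    re.IGNORECASE,
)


def find_dated_lines(markdown: str) -> list[tuple[int, str]]:
    """Return (line number, text) for dated phrasing outside <details> blocks."""
    hits = []
    in_details = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        lowered = line.strip().lower()
        if lowered.startswith("<details"):
            in_details = True
        if not in_details and DATE_HINT.search(line):
            hits.append((number, line.strip()))
        if "</details>" in lowered:
            in_details = False
    return hits


sample = """Use providers for server state.
Before August 2025, use the old API.
<details>
<summary>Old patterns</summary>
Until 2023 FutureBuilder was the common choice.
</details>
"""

print(find_dated_lines(sample))
# [(2, 'Before August 2025, use the old API.')]
```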
9. One canonical example beats ten adjectives
If output quality depends on style, show a complete input/output example rather than describing it. The model matches patterns far better than it follows prose. One correct, runnable Dart snippet anchors the entire skill.
A complete Flutter skill you can steal
Here is a full, working skill that bundles every ingredient above. It targets a spot where AI agents reliably write outdated Flutter: loading async data. Drop this into .agents/skills/building-riverpod-async-screens/SKILL.md and it works in Claude Code, Codex, and Antigravity.
---
name: building-riverpod-async-screens
description: Builds a Flutter screen that loads async data with Riverpod,
  handling loading, error, and data states with AsyncValue. Use when
  fetching from a repository, API, or database and rendering spinners,
  retry UI, and lists. Do not use for static screens with no async data.
---
# Building Riverpod Async Screens
Wire an async data screen the way a senior Flutter dev would: a typed
provider, `AsyncValue` state handling, and explicit loading/error/data
branches. No raw `FutureBuilder`, no manual `setState` for server state,
no swallowed errors.
## Rules
- Use a `FutureProvider` for read-only data, or an `AsyncNotifier` when the
screen also mutates state. Do NOT use `StatefulWidget` + `setState` for
server state.
- Watch with `ref.watch` inside `build`. Use `ref.read` only inside callbacks.
- Render all three states with `AsyncValue.when`. Never assume data exists.
- Always give the error branch a retry path. Do NOT swallow exceptions or
show an infinite spinner on failure.
- Keep shared providers in their own file: one feature, one providers file.
## Workflow
Copy this checklist and track progress:
- [ ] Define the immutable data model.
- [ ] Add the repository method returning `Future<Model>`.
- [ ] Create a `FutureProvider` (or `AsyncNotifier`) that calls the repository.
- [ ] Build the screen: `ref.watch` the provider, render with `AsyncValue.when`.
- [ ] Implement the error branch with a retry that invalidates the provider.
- [ ] Run `dart analyze` and fix all issues. Repeat until clean.
## Example
// product_providers.dart
import 'package:flutter_riverpod/flutter_riverpod.dart';
// Product and productRepositoryProvider are defined elsewhere in the app.
final productProvider = FutureProvider.autoDispose<List<Product>>((ref) async {
  final repo = ref.watch(productRepositoryProvider);
  return repo.fetchProducts();
});
// product_screen.dart
import 'package:flutter/material.dart';
import 'package:flutter_riverpod/flutter_riverpod.dart';
class ProductScreen extends ConsumerWidget {
  const ProductScreen({super.key});
  @override
  Widget build(BuildContext context, WidgetRef ref) {
    final products = ref.watch(productProvider);
    return Scaffold(
      appBar: AppBar(title: const Text('Products')),
      body: products.when(
        data: (items) => ListView.builder(
          itemCount: items.length,
          itemBuilder: (_, i) => ListTile(title: Text(items[i].name)),
        ),
        loading: () => const Center(child: CircularProgressIndicator()),
        // ErrorRetry is an app-defined widget, not part of Flutter.
        error: (err, _) => ErrorRetry(
          message: 'Could not load products',
          // Invalidating the provider re-runs the fetch.
          onRetry: () => ref.invalidate(productProvider),
        ),
      ),
    );
  }
}
## Old patterns
<details>
<summary>Why not FutureBuilder? (legacy)</summary>
`FutureBuilder` re-runs its future on every rebuild unless cached, causing
duplicate network calls. Providers cache and dedupe by default. Prefer
providers for any server state.
</details>
Notice how much work the description does, how the rules ban wrong instincts before listing right ones, how the workflow ends in a validator loop, and how the example is complete enough to copy. That is the whole recipe in one file.
How to actually evaluate your skill
This is the step almost everyone skips, and it is the one that separates a skill that feels good from one that is good. Both Anthropic and the Flutter team are emphatic: do not trust vibes. Measure.
Build the evaluation first
Anthropic calls this evaluation-driven development, and the order matters:
1. Find the gap. Run the agent on a real task with no skill. Document exactly where it fails or writes outdated code.
2. Write three eval scenarios that target those failures.
3. Establish a baseline. Measure performance without the skill.
4. Write the minimum instructions needed to pass.
5. Iterate. Re-run, compare to baseline, refine.
This guarantees you are solving a real problem instead of documenting an imaginary one. A simple eval is just structured expectations:
{
  "skills": ["building-riverpod-async-screens"],
  "query": "Build a screen that loads the user's order history from OrderRepository and shows it in a list",
  "expected_behavior": [
    "Creates a FutureProvider or AsyncNotifier that calls OrderRepository, not a StatefulWidget with setState",
    "Renders loading, error, and data states using AsyncValue.when",
    "Includes a retry action in the error branch that invalidates the provider",
    "Generated code passes `dart analyze` with no errors"
  ]
}
Grade it the way Flutter does
The Dart and Flutter teams run an experimental evals framework (open-sourced at the flutter/evals repository) built around critical user journeys: realistic developer tasks rather than toy prompts. They score on two axes, which is a great rubric to copy for your own skills:
- Deterministic correctness: does it compile, pass dart analyze, and pass the tests? Objective, machine-checkable.
- Qualitative performance: is the reasoning sound, the output concise, the approach safe? Graded by an automated model judge and by expert humans.
For your own skill, that translates to a dead-simple loop: run the task with and without the skill, then ask "did the deterministic checks pass, and is the code meaningfully better?" If the skill does not move either axis, it is not earning its context budget.
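Here is a small Python sketch of that with-and-without comparison. It is not the Flutter evals framework; `EvalRun`, `regressions`, and `earns_its_budget` are hypothetical names, and the pass/fail values in the demo are made-up placeholders rather than measured results. It only covers the machine-checkable axis; the qualitative axis still needs a judge.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    label: str
    checks: dict[str, bool]  # expected behavior -> was it observed?

    @property
    def pass_rate(self) -> float:
        """Fraction of expected behaviors that were observed."""
        if not self.checks:
            return 0.0
        return sum(self.checks.values()) / len(self.checks)


def regressions(baseline: EvalRun, with_skill: EvalRun) -> list[str]:
    """Behaviors the baseline got right but the skill run got wrong."""
    return [
        behavior
        for behavior, ok in baseline.checks.items()
        if ok and not with_skill.checks.get(behavior, False)
    ]


def earns_its_budget(baseline: EvalRun, with_skill: EvalRun) -> bool:
    """True only if the skill improves the pass rate without breaking anything."""
    improved = with_skill.pass_rate > baseline.pass_rate
    return improved and not regressions(baseline, with_skill)


# Placeholder outcomes for illustration only (not real measurements).
baseline = EvalRun("no skill", {
    "provider instead of setState": False,
    "AsyncValue.when for all states": False,
    "retry invalidates provider": False,
    "dart analyze clean": True,
})
with_skill = EvalRun("with skill", {
    "provider instead of setState": True,
    "AsyncValue.when for all states": True,
    "retry invalidates provider": True,
    "dart analyze clean": True,
})

print(baseline.pass_rate, with_skill.pass_rate)
print(earns_its_budget(baseline, with_skill))
```

Checking for regressions separately is a deliberate choice: a skill that fixes three behaviors but breaks a fourth has a better pass rate and is still a problem.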
Use two agents: author with one, test with the other
Anthropic's most practical tip: develop the skill with one instance (call it the author) and test it with a fresh instance that has no memory of the conversation (the tester). The author helps you write and tighten the SKILL.md. The tester reveals what the instructions actually communicate to a cold agent. When the tester stumbles, bring the specific failure back to the author and refine. Repeat. This observe-refine-test loop is how the official skills were hardened, and it works because the model understands both how to write agent instructions and what an agent needs to receive.
The security caveat nobody mentions
Skills can include scripts and reference external resources. That means an untrusted skill can introduce vulnerabilities or quietly exfiltrate data. Before you install a community skill, read it, the same way you would read a dependency before adding it to pubspec.yaml. For any skill that runs terminal commands or touches infrastructure, add an explicit "Safety" section documenting exactly what it does. Treat skills as code, because they are.
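A quick scan can tell you where to look first when you read a skill's scripts. The Python sketch below flags a few patterns I would personally want to inspect; `audit_script` and the `RISKY` table are hypothetical, the pattern list is far from complete, and a clean result is not evidence that a script is safe. It is a reading aid, not a replacement for reading.

```python
import re

# A few patterns worth a closer look; deliberately incomplete.
RISKY = {
    "network download": re.compile(r"\b(curl|wget)\b"),
    "pipe to shell": re.compile(r"\|\s*(sh|bash)\b"),
    "recursive delete": re.compile(r"\brm\s+-\w*r"),
    "reads secrets": re.compile(r"(\.env\b|id_rsa|\bprintenv\b)"),
}


def audit_script(source: str) -> list[tuple[int, str]]:
    """Return (line number, label) for every risky pattern found."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for label, pattern in RISKY.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


script = (
    "#!/bin/sh\n"
    "dart analyze\n"
    "curl -s https://example.com/install.sh | sh\n"
    "rm -rf build/\n"
)

for number, label in audit_script(script):
    print(f"line {number}: {label}")
```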
The honest take from the community
When the official skills dropped, the Flutter corner of X and Reddit reacted the way it always does: screenshots, threads, and declarations that AI coding just changed again. I want to be straight, because the skeptics have a point worth hearing.
More than one experienced Flutter dev read the actual skill files and came away underwhelmed, noting the initial set is fairly thin and covers ground a competent dev already knows. That is fair. And it is also the wrong frame.
A skill is not a magic file that makes your agent brilliant. It is a discipline. The value is not in any single skill the Flutter team shipped. It is in the workflow the format unlocks: codify a pattern once, evaluate it, refine it on a loop, and every future session inherits it. The teams that win with AI in 2026 are not the ones with the best model. They are the ones who got good at writing down what they already know, then testing that the agent actually follows it.
That is the real reason to learn this recipe. Not to consume the official skills, but to write the ones your team actually needs.
The ship-it checklist
Before you commit a Flutter skill, verify:
- [ ] Description is third person, states what AND when, and front-loads trigger words.
- [ ] Description has a "Do not use" boundary to prevent over-activation.
- [ ] SKILL.md body is under 500 lines; depth is in references/ or examples/.
- [ ] No content explaining things the model already knows.
- [ ] Anti-patterns are stated explicitly, not just the happy path.
- [ ] Multi-step work is a copyable checklist.
- [ ] There is a validator loop (dart analyze / flutter test).
- [ ] No time-sensitive info outside a collapsed "old patterns" section.
- [ ] At least one complete, runnable example.
- [ ] References are one level deep; long ones have a table of contents.
- [ ] At least three eval scenarios exist and the skill beats the no-skill baseline.
- [ ] Any scripts are audited and documented for safety.
FAQ
What is a Flutter agent skill?
A folder containing a SKILL.md file that gives an AI coding agent task-specific, expert instructions for a Flutter or Dart workflow. It loads on demand via progressive disclosure, so it adds expertise without permanently bloating the context window.
What makes an agent skill good?
A precise, trigger-rich description (the single biggest factor), ruthless conciseness, explicitly stated anti-patterns, a checklist workflow, a validator feedback loop, and at least one complete example, all verified against evaluations rather than vibes.
How do I write the description so the skill actually triggers?
Third person, state both what the skill does and when to use it, front-load the trigger words a developer would type, and add a "Do not use" clause to prevent it firing on the wrong tasks.
How do I evaluate a skill?
Build evals before writing docs. Run the task without the skill to establish a baseline, write three scenarios with expected behaviors, then measure deterministic correctness (compiles, passes dart analyze and tests) and qualitative quality against that baseline.
Does a skill I write for Claude Code work in Codex and Antigravity?
Yes. SKILL.md is an open standard. Skills that stick to the core format (frontmatter plus Markdown instructions) work across Claude Code, Codex, Antigravity, Gemini CLI, and Cursor. Only advanced, tool-specific features need adjustment.
How is a skill different from a rules file or AGENTS.md?
Rules and AGENTS.md are always-on, repository-wide instructions (setup commands, standards). A skill is loaded only when its description matches the current task. Use always-on files for global rules and short if/then triggers, and skills for specific, repeatable workflows.
How long should a SKILL.md be?
Keep the body under 500 lines. If it grows past that, move depth into one-level-deep references/ files and keep the main file as a lean overview.
Wrapping up
The official Dart and Flutter skills are a starting point, not the destination. The real unlock is the recipe behind them: a discovery-optimized description, concise expert instructions, anti-patterns stated out loud, a checklist, a validator loop, and an evaluation that proves it works. Get those right and you can encode your team's hardest-won Flutter patterns into something every agent on the team follows automatically.
Write one skill this week. Pick the task where your AI agent annoys you most, encode the correct pattern, and evaluate it against the no-skill baseline. Then tell me what you built. I read every comment, and I want to know which Flutter pattern you taught your agent first. 🥊
Sources: Flutter docs: Agent skills and AI Evaluations, flutter/skills, Anthropic: Skill authoring best practices, Google Antigravity skills docs, and OpenAI Codex: Agent skills.