DEV Community: Debbie O'Brien

How I Documented an Entire Product in 4 Days with an AI Agent

Debbie O'Brien — Wed, 13 May 2026 20:18:51 +0000

I had 55 pages of documentation to write, 59 screenshots to capture, and a product that was still shipping features and being rebranded weeks before release. I did it in four days with Goose, an open-source AI agent by Block, part of the Linux Foundation, and I want to walk you through exactly how. Not the polished version. The real one: how I built it, how it works, everything that broke along the way, and what I learned from it.

The Problem

The AI Platform by Zephyr Cloud is a desktop app where teams collaborate with AI specialists in channels. Think Slack meets AI agents. The product had been moving fast for months. Features were shipping, the UI was evolving, and the documentation was... not keeping up. What existed was a handful of developer-focused reference pages. Markdown files describing CRDT schemas and workflow adapter formats. Useful if you were building the product. Useless if you were trying to use it.

We needed end-user documentation. The kind where someone installs the app, opens the docs, and understands how to create a channel, mention a specialist, and get work done. And we needed it before the official release, which was a few weeks away.

Why an AI Agent

I have written plenty of documentation by hand. It is one of the most time-consuming parts of shipping a product. Not because the writing itself is hard, but because of everything around it. You need to understand the feature by reading source code. You need to take screenshots. You need to crop and optimize them. You need to keep the screenshots updated when the UI changes. You need to maintain consistent voice and structure across dozens of pages. And you need to do all of this while the product is still changing underneath you.

I had been using the agent for other tasks in the codebase and thought: what if I could create a way to write all the documentation from source code, capture screenshots that could be recaptured any time the app changes, and also improve the documentation based on those screenshots?

For those unfamiliar, Goose is an open-source AI agent that runs on your machine. It can read and write files, run shell commands, interact with APIs, and use extensions and skills to specialize in different tasks. Skills are markdown files that encode instructions, conventions, and tooling for a specific task. When you load a skill, the agent follows those instructions. When you improve the skill, every future session benefits. It is the difference between telling an agent what to do every time and teaching it once.

The Plan

Before writing a single page, I sat down and created a phased plan. This turned out to be the most important decision of the whole project. You have an idea in your head but no real structure, and you need to think it through before throwing an agent at it. We created a tracer bullet format with sub-tasks so the agent could work phase by phase and tick off what it had done. One night I even went to bed and left it working on a task. The next morning I reviewed everything it had done and iterated over the parts that needed adjusting. I deliberately avoided using a loop where the agent just runs through everything unattended. I wanted to stay in charge and monitor how things were going, because I was also refining the skills as I went along.

Phase 0: Restructure. Delete developer-focused content from the user guide. Move reference docs to a separate section. Set up the directory structure.
Phase 1: Getting Started. Installation, account creation, platform tour, first channel. The first five minutes of the product.
Phase 2: Daily Use. Chat, messaging, threads, specialists. The features people use every day.
Phase 3: Power Features. Projects, tasks, workflows, knowledge garden. Features that experienced users reach for.
Phase 4: Settings. Connections, sandbox, MCP servers, billing, permissions, browser extensions. Every settings page documented.
Phase 5: Polish. Screenshots for all pages. Cross-linking. Consistent voice. Image optimization.
Phase 6: Undocumented Features. Go through the app screen by screen and find anything I missed. This phase caught the embedded browser, the code editor panel, and several settings pages that had no documentation at all.

The phased approach mattered because it gave me clear stopping points. After each phase, I could commit, review, and course-correct.

The Three Skills I Built

Here is where it gets interesting. I did not just use the agent to write documentation. I built three skills that taught it how the documentation works, and those skills evolved throughout the project as I hit problems and found better approaches.

1. write-docs: The Style Guide in Code

This skill is 513 lines of instructions that define how every documentation page should be written. It covers:

Voice and tone. Casual and friendly. Direct. Confident. "Click Settings" not "You may want to consider clicking Settings."

Formatting rules. Bold for UI elements the user needs to find. Italics for text the user will see but not interact with. Code backticks for anything the user types. No emojis. No em dashes.

Page structure. Start with what the user sees, not how it works internally. One idea per paragraph. Lead with the action. A full page template with frontmatter, headings, screenshots, callouts, and cross-links.

What not to document. Internal implementation details, developer workflows, API references, features behind feature flags. This is user documentation, not a code tour.

The skill also includes a verification checklist that the agent walks through before committing. Content checks (no emojis, no em dashes, UI elements bolded), screenshot checks (optimized, cropped, registered in the manifest), and a build check (pnpm build must pass with no dead links). It is not an automated gate. It is instructions baked into the skill that the agent follows every time.

Why does this matter? Because without it, every documentation session would start with me re-explaining the same conventions. With the skill loaded, the agent writes in the right voice from the first sentence. And when I noticed a pattern I did not like (too many callouts per page, screenshots that were too large), I updated the skill once and every future page followed the new rule.

2. doc-screenshots: Automated Screenshot Capture

This is the most technically interesting skill and the one that saved the most time. It is 478 lines of instructions backed by 1,722 lines of tooling code across four scripts: a bash CLI, a Python manifest runner, a Swift OCR text finder, and a Python highlight overlay renderer.

Why Not Playwright?

The first question people ask: why not use Playwright? I use Playwright every day. I love it. But it would not have worked here.

The AI Platform is a Tauri desktop app. The UI runs in a native webview, not a browser tab. Playwright automates browsers. It cannot connect to a Tauri webview. Even if you could somehow attach to the webview's DevTools protocol, you would be fighting against the native window chrome, the system title bar, and the fact that the app's routing and state management are wired through Tauri's IPC bridge, not standard browser navigation.

I needed something that works at the OS level: find the window, click things on screen, capture what the user actually sees. That led me to Peekaboo, a macOS automation tool that interacts with apps through accessibility APIs and screen coordinates.

The Pipeline

The pipeline works like this:

Peekaboo finds the app window and focuses it. If you need to navigate somewhere first, it clicks UI elements by their visible text.
peekaboo image --retina captures the window at 2x retina resolution without the drop shadow.
A Swift script using the Vision framework runs OCR on the captured image. It finds every piece of text and returns pixel-accurate bounding boxes.
A Python script using Pillow draws highlight overlays, borders, and spotlight effects on the image based on the OCR results.
pngquant and optipng compress the final image. This typically reduces file size by 50 to 60 percent with no visible quality loss.

No hardcoded coordinates for content elements. No browser automation. No authentication tokens. The agent looks at the actual app window, reads the text on screen, and figures out where things are.

The pipeline originally used three separate native macOS tools stitched together. I filed an issue on the Peekaboo repo requesting retina capture support, and the maintainer shipped it within days. That simplified the pipeline to a single peekaboo image --retina call plus the Swift OCR script.

The Screen Takeover Problem

There is a real trade-off with this approach. Peekaboo needs the app window visible and in focus. While the audit or batch capture is running, it is clicking through your app, opening dialogs, navigating between pages, pressing Escape to close things. Your screen is not yours for the duration.

A full audit takes about 10 minutes. A full recapture takes 15 to 20. During that time, you cannot touch the mouse or keyboard without breaking the run. The capture itself can technically run in the background, but navigation clicks need the window in focus and control of the mouse, so even with a second monitor, moving the mouse interferes with the run. In practice, you kick off the batch, go make coffee, and come back to 59 freshly captured, cropped, and optimized screenshots. It is also not ready for CI yet, since macOS CI runners do not have a logged-in GUI session with the Accessibility and Screen Recording permissions that Peekaboo needs.

The key insight was the screenshot manifest. Instead of capturing screenshots one at a time, I defined all 59 of them in a YAML file:

screenshots:
  - id: getting-started/app-overview
    output: docs/public/images/getting-started/app-overview.png
    crop: window
    description: >
      Full app window showing the icon rail, channel list,
      and a chat conversation.
    validate:
      - Channels

  - id: getting-started/create-channel-dialog
    output: docs/public/images/getting-started/create-channel-dialog.png
    crop: main
    steps:
      - click: '+'
        near: 'Channels'
      - wait: 1.5
    cleanup:
      - press: 'Escape'

Each entry declares what to capture, how to navigate there, what to crop, and what text should appear in the final image (the validate field). The manifest runner executes them in sequence, resetting the app state between each one.
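The loop a manifest runner along these lines performs can be sketched in a few lines of Python. This is not the actual runner. The `FakeDriver` class and its `click`, `press`, `wait`, `capture`, and `visible_text` methods are hypothetical stand-ins for the Peekaboo and OCR calls, the reset sequence is the one described later in this post (Escape twice, then the Chat icon), and the manifest is written as a Python literal so the sketch needs no YAML parser.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDriver:
    """Hypothetical stand-in for the real window automation and OCR layer."""
    screen_text: str = "Channels  General  Create channel"
    log: list = field(default_factory=list)

    def click(self, text, near=None):
        self.log.append(("click", text, near))

    def press(self, key):
        self.log.append(("press", key))

    def wait(self, seconds):
        self.log.append(("wait", seconds))

    def capture(self, output, crop):
        self.log.append(("capture", output, crop))

    def visible_text(self):
        # A real runner would OCR the captured image instead.
        return self.screen_text


def reset(driver):
    # Return to a known state before every entry.
    driver.press("Escape")
    driver.press("Escape")
    driver.click("Chat")


def run_step(driver, step):
    # Each step is a one-action mapping, mirroring the YAML above.
    if "click" in step:
        driver.click(step["click"], near=step.get("near"))
    elif "press" in step:
        driver.press(step["press"])
    elif "wait" in step:
        driver.wait(step["wait"])
    else:
        raise ValueError(f"unknown step: {step}")


def run_entry(driver, entry):
    """Run one entry; return the validate strings that were not found."""
    reset(driver)
    try:
        for step in entry.get("steps", []):
            run_step(driver, step)
        driver.capture(entry["output"], entry.get("crop", "window"))
        seen = driver.visible_text()
        return [text for text in entry.get("validate", []) if text not in seen]
    finally:
        # Cleanup always runs so a failed entry does not poison the next one.
        for step in entry.get("cleanup", []):
            run_step(driver, step)


MANIFEST = [
    {
        "id": "getting-started/app-overview",
        "output": "docs/public/images/getting-started/app-overview.png",
        "crop": "window",
        "validate": ["Channels"],
    },
    {
        "id": "getting-started/create-channel-dialog",
        "output": "docs/public/images/getting-started/create-channel-dialog.png",
        "crop": "main",
        "steps": [{"click": "+", "near": "Channels"}, {"wait": 1.5}],
        "cleanup": [{"press": "Escape"}],
    },
]

driver = FakeDriver()
results = {entry["id"]: run_entry(driver, entry) for entry in MANIFEST}
```

An empty list for an entry means every `validate` string was found. Putting cleanup in a `finally` block is one way to keep a failed capture from leaving a dialog open for the next entry.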

The manifest means that when the UI changes, you do not retake screenshots by hand. You run the manifest and get all 59 back in one batch. An --audit mode walks every navigation step and reports which targets are broken. A --compare mode recaptures everything and saves new versions alongside the originals for side-by-side review.

I ran the audit while writing this blog post. 50 of 59 passed. Every failure was about test data that had changed (renamed channels, deleted workflows), not broken navigation. The core paths all still worked. The lesson: treat screenshot test data like E2E fixtures. Navigation screenshots are stable. Content-dependent ones need a dedicated docs workspace with controlled data.
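To make the audit idea concrete, here is a simplified sketch of what such a report could look like. It is an illustration under assumptions, not the real `--audit` implementation: the entry ids, the channel names, and the `can_find` callback are all hypothetical, and the check is static, whereas a real audit has to click through the app so that each step sees the screen the previous step produced.

```python
def audit(manifest, can_find):
    """Return {entry id: [click targets that could not be found]}.

    `can_find(text)` is a hypothetical callback; in a real run it would
    ask the OCR layer whether `text` is currently on screen.
    """
    broken = {}
    for entry in manifest:
        missing = [
            step["click"]
            for step in entry.get("steps", [])
            if "click" in step and not can_find(step["click"])
        ]
        if missing:
            broken[entry["id"]] = missing
    return broken


# Illustrative data only: the second entry points at a channel that was renamed.
manifest = [
    {"id": "chat/overview", "steps": [{"click": "Chat"}]},
    {"id": "chat/thread", "steps": [{"click": "Chat"}, {"click": "design-review"}]},
]
on_screen = {"Chat", "Channels", "design-critique"}

report = audit(manifest, lambda text: text in on_screen)
passed = len(manifest) - len(report)
print(f"{passed} of {len(manifest)} passed")
for entry_id, targets in report.items():
    print(f"  {entry_id}: cannot find {targets}")
```

The failure in this toy data is the same kind described above: the navigation path is intact, but a piece of test data the step relied on has changed.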

3. docs-preview: Deploy and Verify

The simplest skill, at 155 lines, but it solved two problems at once.

Why not localhost? The documentation site builds with Rspress. You can run pnpm dev and preview on localhost:3000, but that only works for you. You cannot share a localhost URL in a PR review, paste it into a Slack thread, or hand it to a teammate to check your work. I needed shareable URLs.

The docs build uses the withZephyr() Rspress plugin, which uploads the built site to Zephyr Cloud's edge network on every pnpm build. The whole cycle takes under 2 seconds. Build, upload, deploy, live URL. I timed it while writing this post: 1.8 seconds for 55 pages and 59 images to go from source files to a production-ready URL on a global CDN.

That means every time the agent finishes writing or updating a page, it can build and hand me a live URL to check in the browser. No local server to start, no port conflicts, no "works on my machine." Just a URL that anyone on the team can open.

The URL problem. Every build produces a unique URL with a hash suffix that changes each time. AI agents are bad at this. The URL has a fixed project number (like 213) and a per-build hash (like 4a62f09db). Before the skill existed, the agent would sometimes "increment" the project number thinking it was a build counter, or type a URL from memory with a fabricated hash. Both produce links that have never existed and always 404.

The skill stamps that out. It pipes the build output to a log file and re-greps the log whenever the URL is needed. It includes explicit warnings about not reusing stale URLs and not typing URLs from memory. Simple, but it eliminated a genuinely annoying class of failures.
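The "read it from the log, never from memory" rule might look like this in Python. The URL shape below is an assumption made up for the example (the real preview URL format is not shown in this post), and the log is simulated so the sketch runs without a build. In a real session the log would come from something like `pnpm build 2>&1 | tee build.log`.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: matches any https URL, since the real format is not shown here.
URL_PATTERN = re.compile(r"https://\S+")


def latest_preview_url(log_path):
    """Return the last URL printed in the build log, or raise if there is none."""
    urls = URL_PATTERN.findall(Path(log_path).read_text())
    if not urls:
        raise LookupError(f"no URL found in {log_path}; rebuild before sharing a link")
    # Drop trailing punctuation that is not part of the URL.
    return urls[-1].rstrip(".,)")


# Simulated build output with an older and a newer deploy.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "build.log"
    log.write_text(
        "build complete\n"
        "deployed: https://docs-213-0f3e1a2b4.example.invalid\n"
        "rebuild complete\n"
        "deployed: https://docs-213-4a62f09db.example.invalid\n"
    )
    url = latest_preview_url(log)
```

Taking the last match rather than the first keeps a stale URL from an earlier build in the same log from being reused, and raising on an empty log is safer than letting the agent guess.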

Verifying the Docs With Playwright CLI

There is an important distinction in this workflow. Peekaboo automates the desktop app to capture screenshots. But who verifies that the documentation pages themselves render correctly?

That is where Playwright CLI comes in. It is a command-line tool that wraps Playwright's browser automation into simple terminal commands. The agent uses it to open the built documentation site in a real browser, take a DOM snapshot, and verify that headings and images rendered correctly.

The verification flow looks like this. After the agent writes a page, it runs playwright-cli snapshot to get the full DOM tree and checks that the H1 matches, all images loaded, the sidebar navigation includes the new page, and the table of contents lists the right H2 headings. If something is missing or broken, it fixes the page and rebuilds.

This matters because a build passing does not mean the page looks right. Rspress generates static HTML that hydrates with React, so a page can exist but render incorrectly if something is off in the markdown or frontmatter. Playwright actually loads the page in a browser engine and lets the agent inspect what a user would see. It catches dead images, broken navigation links, callouts that rendered as raw markdown instead of styled containers, and layout issues that only show up in the browser.
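The heading and image checks can be illustrated with a small standard-library sketch. To be clear about the swap: this parses a static HTML string with Python's `html.parser`, not the live DOM that Playwright CLI inspects, so it cannot tell whether an image actually loaded or a callout rendered correctly. The page title, headings, and image path are hypothetical.

```python
from html.parser import HTMLParser


class PageOutline(HTMLParser):
    """Collect h1/h2 text and image sources from an HTML string."""

    def __init__(self):
        super().__init__()
        self.headings = {"h1": [], "h2": []}
        self.images = []
        self._open = None  # heading tag currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.headings:
            self._open = tag
            self.headings[tag].append("")
        elif tag == "img":
            self.images.append(dict(attrs).get("src"))

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open:
            self.headings[self._open][-1] += data


def check_page(html, expected_h1, expected_h2):
    """Return a list of problems; an empty list means the outline matches."""
    outline = PageOutline()
    outline.feed(html)
    h1 = [text.strip() for text in outline.headings["h1"]]
    h2 = [text.strip() for text in outline.headings["h2"]]
    problems = []
    if h1 != [expected_h1]:
        problems.append(f"h1 mismatch: {h1}")
    missing = [text for text in expected_h2 if text not in h2]
    if missing:
        problems.append(f"missing h2: {missing}")
    if any(not src for src in outline.images):
        problems.append("image without src")
    return problems


# Hypothetical page content for illustration.
html = (
    "<h1>Create a channel</h1>"
    '<img src="/images/getting-started/create-channel-dialog.png">'
    "<h2>Open the dialog</h2><h2>Name the channel</h2>"
)
ok = check_page(html, "Create a channel", ["Open the dialog", "Name the channel"])
bad = check_page(html, "Create a channel", ["Invite people"])
```

Returning a list of problems instead of a boolean gives the agent something specific to fix before it rebuilds.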

Two tools, two targets. Peekaboo captures the app. Playwright CLI verifies the docs about the app.

Working With the Agent, Not Watching It

I want to be clear about something: this was not me kicking off an agent and walking away. It was a constant back-and-forth, like working with a colleague sitting right next to you.

Every page went through iteration. I would review what the agent wrote, point out what was wrong, ask for restructuring, and push back on phrasing. The getting started guide in particular went through several rounds of reworking. What is the right order to introduce features? Should installation come before the platform tour or after? How do you title a page so someone scanning the sidebar instantly knows what it covers? These are editorial decisions that an agent cannot make alone.

One technique that worked well was passing screenshots directly to the agent and saying "check all the clickable items on this and document anything I missed." This shifted the process from documenting based on source code to documenting based on what a user actually sees. The agent could look at a screenshot, identify buttons, tabs, and menu items through OCR, cross-reference them with the existing docs, and flag the gaps. That is how I caught undocumented features like the embedded browser and the code editor panel in Phase 6.

The quality of what the agent produced was good first-draft material that needed editorial direction, not a rewrite. The voice was right because the skill defined it. The structure was right because the template enforced it. What I spent my time on was the higher-level decisions: how to organize the getting started flow, what to emphasize, what to cut, and making sure the documentation told a coherent story rather than just listing features.

You can see the output at docs.theaiplatform.app. The Platform Tour shows the structure I landed on for the getting started flow. The Chat section shows how a feature area breaks down into overview, channels, and messaging pages. The Settings section shows the most straightforward pages where the structure was consistent enough that the agent could produce near-final drafts with minimal editing.

A Day-by-Day Walkthrough

Day 1: The Kickoff

Day 1 was about the plan. I sat down and mapped out the phased approach: what to tackle in what order, how to break 55 pages into manageable batches, and what the agent would need to know before writing the first page. This was the most important work of the entire sprint. The product was also being rebranded, so I ran a rename pass across the existing documentation. Four commits. No new content yet, but the groundwork was laid.

Day 2: The Evening Sprint

Phases 0 through 4 in a single evening. This sounds aggressive, and it was. But the phased plan made it possible. Each phase had a clear scope, and the agent could read the source code to understand each feature before writing about it.

The first commit kicked off Phase 0, which restructured everything, moving 6,769 lines of developer-focused content out of the user-facing docs. Then Phases 1 through 4 each produced a batch of pages with screenshots.

Twelve commits in about ninety minutes. All the scaffolding, all the content, all the initial screenshots. The quality was rough in places (I would fix that in later phases), but the coverage was there. Every major section of the product had at least a first-draft page.

Day 3: The Real Work

Day 3 had 43 commits. This is where the polish happened and where most of the problems surfaced.

Phase 5 started with adding missing screenshots and cross-links. Then the big disruption: the app's sidebar got redesigned mid-sprint. Text labels were replaced with an icon rail. Every screenshot showing the sidebar was wrong. Every navigation step clicking a text label was broken. The manifest paid for itself here. I updated the navigation steps, re-ran the batch, and had all 59 screenshots regenerated in minutes instead of retaking them by hand.

I also added reset steps to the manifest on day 3. Before each screenshot, the runner presses Escape twice and clicks the Chat icon to return to a known state. Without this, a failed screenshot left the app in a broken state that cascaded into every subsequent capture.

Day 4: Finish and Ship

Day 4 was Phase 6 (undocumented features) plus a thorough review pass. The embedded browser and code editor panels had no documentation at all. The agent read the source components, I opened the app to verify what the UI actually looked like, and we wrote the pages together.

The review pass caught real issues: contradictory text on the account creation page, screenshots that were cropped too loosely, duplicate content between the workflows overview and the build-and-run page.

The final commit merged the PR: 55 documentation pages, 59 screenshots, and the three skills.

What Broke Along the Way

The Rebrand

The product was rebranded from Zephyr Agency to The AI Platform during the documentation sprint. The rename itself is mechanically simple (find and replace), but the follow-on work is not. Alt text on 59 screenshots. Config files. Every page referencing the product name. Sentences that started with the product name suddenly reading awkwardly with the article "The" prepended. This is not an agent problem. It is just the reality of documenting a product that is still evolving. But it added real friction to a sprint that was already moving fast.

OCR Is Not Perfect

The Vision framework's OCR is very good, but not flawless. It occasionally misreads text. "Get update" becomes "Get undate." The letter "I" gets confused with "l" in certain fonts. When the agent tries to click "Get update" and OCR returns "Get undate," the navigation step fails.

The workaround I built into the skill: search for a substring instead of the full text, use nearby anchor text to disambiguate, or fall back to coordinate-based clicking. The continue_on_failure flag on manifest steps lets non-critical navigation steps fail without aborting the entire screenshot.
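Those three fallbacks can be sketched as a single lookup function. This is an illustration, not the code in the skill: the `TextBox` type, the function name, and the sample coordinates are all hypothetical, and the OCR results are hard-coded rather than produced by the Vision framework.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class TextBox:
    """One OCR result: recognized text plus its bounding box in pixels."""
    text: str
    x: float
    y: float
    width: float
    height: float

    @property
    def center(self):
        return (self.x + self.width / 2, self.y + self.height / 2)


def find_click_point(boxes, target, near=None, fallback=None):
    """Pick a click point for `target`.

    1. Match by substring, so a partial label can still hit a misread one.
    2. If several boxes match and `near` is given, take the match closest
       to the anchor text.
    3. If nothing matches, return the `fallback` coordinates (may be None).
    """
    matches = [b for b in boxes if target.lower() in b.text.lower()]
    if not matches:
        return fallback
    anchors = [b for b in boxes if near and near.lower() in b.text.lower()]
    if anchors:
        ax, ay = anchors[0].center
        matches.sort(key=lambda b: hypot(b.center[0] - ax, b.center[1] - ay))
    return matches[0].center


# Hard-coded OCR output, including the "Get undate" misread.
boxes = [
    TextBox("Channels", 20, 100, 80, 16),
    TextBox("+", 120, 100, 12, 16),
    TextBox("Projects", 20, 300, 80, 16),
    TextBox("+", 120, 300, 12, 16),
    TextBox("Get undate", 400, 500, 90, 20),
]

plus_by_projects = find_click_point(boxes, "+", near="Projects")
by_substring = find_click_point(boxes, "Get u")
by_fallback = find_click_point(boxes, "Get update", fallback=(445, 510))
```

The second call shows why a short substring is more robust than the full label: "Get u" still matches the misread text, while the full "Get update" in the third call finds nothing and drops to the fallback coordinates.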

Tooltips and Hover States

Moving the mouse to click an element sometimes triggers a tooltip that appears in the screenshot. The fix was straightforward once I understood it: move the cursor away from interactive elements before capturing. The script now does this automatically, but it cost me a round of retakes before I figured out what was happening.

What Worked Surprisingly Well

Skills as Accumulated Knowledge

The three skills started small and grew with every problem I hit. The doc-screenshots skill started as a wrapper around screencapture and Pillow. By the end, it had manifest batch processing, audit mode, validation, reset steps, coordinate-based fallbacks, card-level pixel scanning, and anti-tooltip cursor management.

Each improvement was triggered by a real problem. And because skills persist across sessions, the fix was permanent. The next time anyone on the team works on documentation, all of those fixes are already loaded.

The Manifest as a Screenshot Database

Defining all 59 screenshots declaratively in YAML turned out to be the single most valuable technical decision. Not because batch capture is faster than individual capture (it is), but because it made screenshots a reproducible artifact. The sidebar redesign on day 3 proved it: update a width constant and a few navigation steps, run one command, and all 59 screenshots are regenerated. No manual retakes.

Reading Source Code for Accuracy

The agent reads the actual source code before writing documentation. When the docs said "click the + button next to Channels," it was because the agent had found that button in the component tree, not because it was guessing. That said, source code is not always the final truth. The running app sometimes differs from what the code suggests. The skill instructs the agent to verify text against screenshots using OCR and update the docs when they do not match.

By the Numbers

What I Would Do Differently

Start the skills earlier. The skills were created during the documentation sprint itself. If I had written even a rough version of the write-docs and doc-screenshots skills before starting, the first day would have gone smoother. The early pages needed more revision because the conventions were not yet codified.

Find a way to run screenshot audits in CI. As mentioned above, the navigation clicks need a real display, so CI is not an option yet. But even running --audit locally before merging a PR that touches the UI would catch most stale screenshots early.

Write the manifest first, content second. I wrote pages and captured screenshots as I went. It would have been faster to define the full manifest up front (just the navigation steps, no content), run it once to see what the app actually looks like everywhere, and then write the pages based on real screenshots instead of source code alone.

What You Can Take Away

If you are thinking about using an AI agent for documentation, here is what I think matters most.

Teach the agent, do not just instruct it. A prompt that says "write documentation for this feature" produces generic content. A skill that defines your voice, your formatting rules, your page structure, and your verification checklist produces documentation that sounds like your team wrote it. The upfront investment in the skill pays off on every subsequent page.

Make screenshots reproducible. Manual screenshots are the first thing that goes stale. A declarative manifest that can regenerate every screenshot in one command is worth the engineering effort. It changes screenshots from a one-time cost to a maintained artifact.

Phase your work. Even if you are using an agent, "write all the docs" is not a plan. Break it into phases with clear scope and clear deliverables. This gives you stopping points, review points, and the ability to course-correct.

Expect things to break. OCR will misread text. The UI will change mid-sprint. Preview URLs will go stale. The difference between a frustrating experience and a productive one is whether you encode the fix into a skill so it never happens again.

Review everything. The agent does not replace your judgment. It replaces the mechanical work. You still need to read every page, check every screenshot, and verify that the documentation matches what the user actually sees. The agent writes the first draft. You make it right.

Making Docs Agent-Ready

Writing 55 pages for humans was only half the problem. Agents need to read documentation too.

I added llms.txt and llms-full.txt to the documentation site using the Rspress @rspress/plugin-llms plugin. The llms.txt file is a structured index of every page with one-line descriptions. The llms-full.txt file is the entire documentation site as a single 3,000-line markdown file that an agent can ingest in one request. Every page also has "Copy as Markdown" and "Open in Claude" buttons so users can feed specific pages to an LLM directly.

This is live now. Any agent that can fetch a URL can read the entire documentation in seconds.

Automated Video Walkthroughs (Work in Progress)

Screenshots document a single state. But some features are easier to understand when you see them in motion. Creating a channel, mentioning a specialist, watching the response stream in. These are flows, not static screens.

I have a proof of concept for automated video walkthroughs using Peekaboo. The same manifest that defines screenshot navigation steps can drive a screen recording session: navigate to the starting point, start recording, walk through the steps, stop recording. The tooling exists in early form and produces usable results, but it is not production-ready yet. I am still working on consistent timing, smooth scrolling, and keeping the recordings tight enough to be useful without being rushed.

The goal is to embed these videos directly in the documentation pages so that when the UI changes, both screenshots and videos can be regenerated from the same manifest. That is not done yet, but the foundation is there.

The Future: Documentation in an Agent-First World

Here is what I keep thinking about. I just spent four days writing 55 pages of documentation. It is good documentation. People will use it. But the way people use software is changing.

If you have a product with AI specialists built in, the product itself can guide you. Instead of leaving the app to read a documentation page about how to create a workflow, you ask the specialist in the app and it walks you through it. Instead of searching the docs for how to configure a setting, you describe what you want and the agent does it for you.

That does not mean documentation is dead. It means its role is shifting. Documentation becomes the knowledge layer that agents draw from. The llms.txt work is a step in that direction. But the bigger shift is making the product itself so intuitive, with specialists that genuinely help, that fewer people need to leave the app to figure things out.

We are not there yet. Right now, the documentation is essential. But the future we are building toward is one where the product teaches you how to use it, and documentation exists as a reference layer for agents and for the edge cases that in-app guidance does not cover.

The documentation is live at docs.theaiplatform.app. If you want to try The AI Platform, it is available for macOS, Windows, and Linux.

And yes, this blog post was also created using Goose. It took about five hours of back-and-forth: pulling git history, running the audit and compare, timing preview builds, drafting sections, and then iterating step by step, redrafting, re-checking, and fixing everything until it was right. Agent-driven, not agent-written. Same process as the docs.

How I Used AI to Fix Our E2E Test Architecture

Debbie O'Brien — Wed, 29 Apr 2026 18:28:37 +0000

I joined a project with an existing Playwright E2E test suite: 38 spec files, ~165 tests, and around 14,000 lines of test infrastructure. My first step was simple: run the tests locally.

8 out of 130 non-skipped tests passed. A 6% pass rate.

The confusing part? CI was green. It turned out CI ran everything with workers: 1. Locally, multiple workers plus the dev environment meant running the tests just wasn't possible.

Step 1: Analysis — asking questions I didn't know the answers to

I had zero domain knowledge of this codebase. No context on why tests were written a certain way, what the custom wrappers did, or where the real problems were. So I started asking AI to analyze everything: the Playwright configs, the page objects, the spec files, the CI workflows. I asked questions to help me understand the codebase and to figure out what we could do to get tests running locally.

Over a few days, this produced 18 analysis documents covering architecture, root causes, anti-patterns, silent bugs, and test isolation.

The analysis phase was about building a map of a codebase I didn't understand. Every document was a question answered.

Step 2: The tracer bullet plan

With the analysis done, I had a clear picture of what needed to change. But the question was: in what order, and how do you avoid a big refactor that breaks everything?

The answer was tracer bullets, a concept from The Pragmatic Programmer. The idea is to build a thin end-to-end slice through all the layers to prove the architecture works, then expand from there.

I created 8 tracer bullets, each targeting a specific slice:

1. UI fixture chain: Use worker-scoped and test-scoped fixtures. Prove: fixtures work, teardown works, tests pass in CI.
2. API fixture chain: Same pattern for API tests. Prove: composable fixtures work for API scenarios.
3. Expand UI migrations: Apply the proven UI pattern to more files.
4. MFE-scoped projects: Split one Playwright project into 7 projects by MFE folder (Applications, Organizations, Projects, etc.), each with dependencies: ['Setup'].
5. Teardown project: Add a cleanup project using Playwright's project dependencies.
6. API fixture expansion: Composable API fixtures (ownerOrg → ownerProject).
7. UI migration at scale: Remaining UI spec files.
8. API setup project: Replace the no-op globalSetup with a proper setup project.

The key insight: the dependency graph told me which bullets could run in parallel. Bullets 1 and 2 were independent. Bullet 4 was independent. Bullet 3 depended on 1. This became important later when running multiple AI sessions.
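Working out which bullets can run side by side is a topological sort, which Python's standard `graphlib` module handles directly. The sketch below encodes only the dependencies stated above. Bullets 5 to 8 are left out because their dependencies are not spelled out in this post.

```python
from graphlib import TopologicalSorter

# Map each bullet to the bullets it depends on.
depends_on = {
    1: set(),   # UI fixture chain
    2: set(),   # API fixture chain
    3: {1},     # Expand UI migrations, builds on bullet 1
    4: set(),   # MFE-scoped projects
}

sorter = TopologicalSorter(depends_on)
sorter.prepare()

waves = []
while sorter.is_active():
    # Everything returned by get_ready() has no unfinished dependencies,
    # so the whole wave can be handed to parallel sessions.
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

print(waves)  # → [[1, 2, 4], [3]]
```

Each inner list is a wave of work that can be split across separate AI sessions. Bullet 3 only becomes available once bullet 1 is marked done.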

What a tracer bullet looked like in practice

Bullet 1 targeted a single file with 5 tests. The steps:

Add the fixture infrastructure (currentUser → sharedOrg → project)
Migrate projects-settings-general.spec.ts to use the fixtures
Run locally, verify tests pass
Push, verify CI is green

Step 3: I created a skill to do the work

Once I had a plan with all 33 tasks organized into phases, I needed something to work through them consistently: same process every time, same quality bar, same benchmarking. So I built a skill: pw-test-improvement.

What the skill does

A strict 7-step process for every change:

1. Identify: Pick one item from the implementation tracker
2. Baseline: Run the affected tests 3× before changes, record pass rate and timing
3. Fix: Apply the change following embedded Playwright best practices
4. Test: Run 3× after changes, all must pass
5. Compare: Document before/after benchmarks
6. Update: Mark the tracker item done
7. Commit: Only when asked, with a structured PR description

The skill had built-in knowledge: Playwright's locator priority (getByRole > getByLabel > getByText > ...), a list of anti-patterns to avoid (waitForTimeout, no-op assertions, CSS class selectors, forced clicks without justification), and migration patterns for replacing the Actions wrapper with direct Playwright calls.

It used the Playwright CLI to run tests directly and capture results.
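The Baseline, Test, and Compare steps boil down to summarizing a few runs before and after a change. The sketch below shows one way to do that, assuming a hypothetical `RunResult` record per run; the numbers are made up for illustration and are not results from the project.

```typescript
// One test run as a skill might record it (field names are hypothetical).
interface RunResult {
  passed: number;
  failed: number;
  durationMs: number;
}

interface RunSummary {
  passRate: number;       // fraction of tests that passed across all runs
  meanDurationMs: number; // average wall-clock time per run
}

function summarize(runs: RunResult[]): RunSummary {
  let passed = 0;
  let total = 0;
  let duration = 0;
  for (const run of runs) {
    passed += run.passed;
    total += run.passed + run.failed;
    duration += run.durationMs;
  }
  return {
    passRate: total === 0 ? 0 : passed / total,
    meanDurationMs: runs.length === 0 ? 0 : duration / runs.length,
  };
}

// The "Test" step requires every post-change run to be fully green.
function allGreen(runs: RunResult[]): boolean {
  return runs.length > 0 && runs.every((run) => run.failed === 0);
}

// Made-up numbers for a file with 5 tests, run 3 times before and after.
const baselineRuns: RunResult[] = [
  { passed: 4, failed: 1, durationMs: 42000 },
  { passed: 5, failed: 0, durationMs: 40000 },
  { passed: 4, failed: 1, durationMs: 44000 },
];
const afterRuns: RunResult[] = [
  { passed: 5, failed: 0, durationMs: 30000 },
  { passed: 5, failed: 0, durationMs: 31000 },
  { passed: 5, failed: 0, durationMs: 29000 },
];

console.log("before", summarize(baselineRuns), "after", summarize(afterRuns));
console.log("safe to mark done:", allGreen(afterRuns));
```

Running three times rather than once is what lets a flaky test show up as a pass rate below 100% instead of a lucky green.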

The architecture changes

Fixtures replaced boilerplate

The biggest change was moving from repeated beforeAll/afterAll blocks to Playwright fixtures. Before: each of 5 test files independently called getUser(), createOrg(), createProject() — 15 API calls total. After: worker-scoped fixtures shared across files — 7 calls total (53% reduction).

The key distinction was worker-scoped vs test-scoped:

Worker-scoped ({ scope: 'worker' }) — created once, shared across all tests in that worker. Good for expensive setup like orgs and projects.
Test-scoped (default) — created fresh for each test. Good for data that tests mutate.
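The difference in cost between the two scopes is easiest to see by counting how often an expensive setup function would run. The following is a toy model, not Playwright's implementation or API: `countSetupCalls` is a hypothetical helper, and it does not try to reproduce the 15 and 7 figures above, which depend on the real suite's layout.

```typescript
type FixtureScope = "worker" | "test";

// Toy model of fixture caching. It only illustrates how often an
// expensive setup step (for example, creating an org) would run.
function countSetupCalls(scope: FixtureScope, workers: number, testsPerWorker: number): number {
  let calls = 0;
  for (let w = 0; w < workers; w++) {
    let cached = false; // a worker-scoped value lives as long as its worker
    for (let t = 0; t < testsPerWorker; t++) {
      if (scope === "test" || !cached) {
        calls++; // run the expensive setup
        cached = true;
      }
    }
  }
  return calls;
}

console.log(countSetupCalls("test", 2, 10));   // 20
console.log(countSetupCalls("worker", 2, 10)); // 2
```

The flip side is that the two cached values are shared by all 20 tests, which is why data that tests mutate should stay test-scoped.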

Project structure

The Playwright config went from one project running all 38 spec files to 7 projects, each pointing to its MFE folder:

{ name: 'Applications',  testDir: 'apps/ui/applications/e2e', dependencies: ['Setup'] },
{ name: 'Organizations', testDir: 'apps/ui/organizations/e2e', dependencies: ['Setup'] },
{ name: 'Projects',      testDir: 'apps/ui/projects/e2e',      dependencies: ['Setup'] },
// ... Subscriptions, Host, User Profile

This meant you could run --project=Applications to test just what you need, HTML reports were grouped by area, and heavy specs got their own parallelism settings.

The serial cascade fix

4 actual test failures looked like 57. Application tests used serial mode, so when the first test failed, all subsequent tests in that describe block were marked "did not run." The fix: split heavy specs into a dedicated project, increase timeouts (30s → 60s for beforeAll), cap workers to prevent API overload, and use worker-scoped fixtures to share expensive setup.
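The cascade is easy to reproduce on paper. This sketch is a simplified model of serial mode, assuming a hypothetical describe block of six tests; it is not taken from the real suite.

```typescript
type Outcome = "passed" | "failed" | "did not run";

// Toy model of serial mode: once one test in the block fails,
// every later test in that block is skipped.
function runSerial(results: boolean[]): Outcome[] {
  const outcomes: Outcome[] = [];
  let broken = false;
  for (const ok of results) {
    if (broken) {
      outcomes.push("did not run");
    } else if (ok) {
      outcomes.push("passed");
    } else {
      outcomes.push("failed");
      broken = true;
    }
  }
  return outcomes;
}

// Hypothetical block of 6 tests where only the 2nd would really fail.
const serialOutcomes = runSerial([true, false, true, true, true, true]);
const realFailures = serialOutcomes.filter((o) => o === "failed").length;
const notGreen = serialOutcomes.filter((o) => o !== "passed").length;

console.log(serialOutcomes);
console.log(realFailures + " real failure, " + notGreen + " tests not green");
```

One real failure shows up as five non-green tests, which is the same effect that made 4 failures look like 57.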

What went wrong

Not everything worked first time.

The cleanup project broke CI. We added a teardown project with Playwright's project dependencies to clean up test data after runs. It worked locally. In CI, it caused failures — the cleanup ran against a shared environment and interfered with other pipelines. Had to revert it.

Not everything should be a fixture. We set out to convert everything to fixtures. After reviewing the Playwright docs, we rejected one planned fixture before implementing it: worker-scoped fixtures are shared across files, which would pollute serial tests that need per-file isolation with different options.

How I worked with AI

This wasn't "tell AI to fix it." It was a collaboration process:

Ask questions relentlessly: "What does this method do?" "Why is this test flaky?" "According to the Playwright docs we can do X. Can you verify your suggestion against the docs?" I asked hundreds of questions during the analysis phase, which lasted a few days.
Challenge every suggestion: "Are you sure? What about edge case X?" If the AI suggested a pattern, I'd ask it to explain why and whether it was sure that was a good way of doing it.
Use docs as ground truth: I'd link to the Playwright docs and ask "Does this align with what's in the docs?" The AI's training data can be outdated; the docs are current.
Validate with multiple tools: I used Goose, Claude Code, and GitHub Copilot. Different tools catch different blind spots and have different opinions, just like when you work with different teammates.
Check confidence explicitly: "What's your confidence level on this? Why only a 7? How can we get to a 10?" This surfaces uncertainty the AI might not volunteer, and it also digs into what we haven't thought about and how we can improve things.

Running it in practice

I ran up to 4 AI sessions in parallel — based on which tracer bullets were independent of each other. The dependency graph from the implementation plan told me what could safely run at the same time.

I'd switch between sessions to check progress, read through what was being changed, and step in when something needed verifying. The AI did the mechanical work: applying patterns, running tests, capturing benchmarks. I did the oversight: deciding what to fix next, catching when a suggestion didn't look right, and verifying against the actual Playwright docs.

Never more than 4 at a time. I wanted to read and understand everything that was happening.

What we measured

Metric	Before	After	Change
API setup calls (5 files)	15	7	53% reduction
UI test setup lines	8	3	62% reduction
API setup/cleanup lines	15	3	80% reduction
Files with manual try/finally	15	0	Fixtures handle it
Boilerplate removed	—	—	~1,000 lines
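The percentages in the table follow from the before and after counts. A small helper makes the arithmetic explicit; `percentReduction` and `metricRows` are names introduced here for illustration.

```typescript
// Percentage drop from `before` to `after`.
function percentReduction(before: number, after: number): number {
  if (before <= 0) throw new Error("before must be positive");
  return ((before - after) / before) * 100;
}

// Before and after values copied from the table above.
const metricRows: Array<[string, number, number]> = [
  ["API calls", 15, 7],
  ["UI test setup lines", 8, 3],
  ["API setup/cleanup lines", 15, 3],
];

for (const row of metricRows) {
  console.log(row[0] + ": " + percentReduction(row[1], row[2]).toFixed(1) + "%");
}
// API calls: 53.3%
// UI test setup lines: 62.5%
// API setup/cleanup lines: 80.0%
```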

What we created along the way

18 analysis documents
5 implementation guides
33 tasks with verification commands
1 skill (test improvement)

Lessons learned

About testing:

Green CI doesn't mean tests work locally
One real failure can cascade into dozens of phantom failures in serial mode
Web-first assertions (expect(locator)) catch timing issues that manual checks miss
Fixtures aren't always the answer; some setup belongs in beforeAll

About working with AI:

AI is better at applying known patterns than inventing new ones, so give it a clear process
The analysis phase was the highest-leverage use of AI; it found things I'd have missed for weeks
Multiple tools > one tool: cross-checking catches hallucinations and increases confidence in the approach
The skill made it scalable; without it, every fix would need the same instructions repeated
Keep the human in the loop: 4 parallel sessions, never unattended
Find the time to do these kinds of tasks. They take time at first, but then you achieve so much more.
Treat AI like a new colleague you don't know very well, one who never turns on their camera, so it's hard to get to know them and you can't fully trust them. You know they have good opinions and are good at their job, but you still need to be sure they have thought things through and are not just being lazy and making bad decisions.

Getting Started with Claude Code: A Guide to Slash Commands and Tips

Debbie O'Brien — Tue, 31 Mar 2026 21:06:39 +0000

When you first open Claude Code, it's not immediately obvious what commands are available to you. I spent some time today exploring the slash commands and keyboard shortcuts thanks to Matt Pocock's Claude Code for Real Engineers course, and found them genuinely useful for day-to-day work. Here's a quick rundown of what each one does and when you might reach for it.

Slash Commands

`/init` - Setting Up Your Project Instructions

/init creates a CLAUDE.md file where you can define instructions for how Claude should behave in your project. If you're working in a team or want consistent responses across sessions, this is a good place to start.

`/terminal-setup` - Fixing Multi-Line Input

By default, hitting Enter sends your message immediately, which can be frustrating when you're trying to write something longer. /terminal-setup configures your terminal so that Option + Enter (or Alt + Enter on Windows) gives you a new line instead.

One thing to note: you'll need to restart your terminal app after running this for the changes to take effect.

`/model` - Changing the Default Model

If you want to switch which model Claude Code uses, /model lets you do that. Straightforward, but easy to miss if you don't know it's there.

`/usage` - Checking Your Subscription

/usage shows your current usage for your subscription plan. Handy for keeping track of where you are without having to leave the terminal.

`/context` - Understanding What's in Your Context Window

This one I found particularly useful. /context gives you a breakdown of what's currently loaded in your conversation, with estimated usage by category:

System prompts
System tools
Skills
Messages
Free space

┌─────────────────────────────────────────────────────────────┐
│  /context                                                   │
│  └─ Context Usage                                           │
│                                                             │
│  claude-opus-4-6 · 15k/1000k tokens (1%)                   │
│                                                             │
│  Estimated usage by category                                │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   │
│  ● System prompt:      5.6k tokens  (0.6%)                  │
│  ● System tools:       8.3k tokens  (0.8%)                  │
│  ○ Skills:              715 tokens  (0.1%)                   │
│  ○ Messages:             58 tokens  (0.0%)                   │
│  □ Free space:         952k tokens  (95.2%)                  │
│  ■ Autocompact buffer:  33k tokens  (3.3%)                   │
│                                                             │
│  [████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 1%  │
│   ^^^^                                                      │
│   used                              free space              │
└─────────────────────────────────────────────────────────────┘

It also tells you when autocompaction will happen. That's when Claude automatically summarizes older parts of the conversation because the context window is running low on space. If you've ever wondered why Claude seems to "forget" something from earlier in a long session, this command helps explain what's going on.
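The percentages in that breakdown are each category's share of the full context window. As a quick check on the sample output above, this sketch recomputes them from the token counts; the variable and function names are made up for illustration.

```typescript
const contextWindowTokens = 1000000;

// Token counts copied from the sample /context output above.
const usageByCategory: Record<string, number> = {
  "System prompt": 5600,
  "System tools": 8300,
  "Skills": 715,
  "Messages": 58,
};

// Share of the whole window, rounded to one decimal place.
function percentOfWindow(tokens: number, windowSize: number): string {
  return ((tokens / windowSize) * 100).toFixed(1) + "%";
}

let usedTokens = 0;
for (const category of Object.keys(usageByCategory)) {
  const tokens = usageByCategory[category];
  usedTokens += tokens;
  console.log(category + ": " + percentOfWindow(tokens, contextWindowTokens));
}
console.log("Used: " + Math.round(usedTokens / 1000) + "k of 1000k");
```

The four categories add up to roughly 15k tokens, which matches the 15k/1000k figure in the header line.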

`/clear` - Starting Fresh

/clear wipes your chat history and context window. It's essentially the same as closing and starting a new Claude session. Useful when you're switching to a completely different task and don't need the previous context hanging around.

`/ide` - Connect to Your IDE

There's a Claude Code extension for VS Code, and you can connect to it by running /ide. Once connected, things like git diffs will open in VS Code instead of displaying in the terminal. If you're reviewing changes regularly, this is a much better experience: you get proper syntax highlighting and the familiar side-by-side diff view rather than trying to read through diffs in the terminal.

`/resume` - Browse Previous Sessions

Type /resume and use the up and down arrow keys to browse through your previous sessions. There's also a search box so you can find a specific session across all sessions in the repo.

Tips

Interrupting Claude with Escape

Press Escape at any time to interrupt Claude while it's generating a response. If you want it to continue from where it left off, just type "go." Press Escape again if you want to stop it for good.

This is helpful when you realise partway through that you need to rephrase your question or Claude is heading in the wrong direction.

Rewind with Escape + Escape

Press Escape twice to enter rewind mode. This lets you scroll back through your conversation using the up arrow key. When you land on the point you want to go back to, press Enter and you'll get a few options:

Restore code and conversation - rolls back both your files and the chat
Restore conversation - rewinds the chat but keeps your code as-is
Restore code - reverts your files but keeps the conversation
Summarize from here - condenses everything from that point forward
Never mind - cancels and takes you back to where you were

This is really useful when Claude has gone down the wrong path and you want to undo a series of changes without manually reverting files yourself.

Stash Your Prompt with Ctrl + S

This one would have saved me a lot of time if I'd known about it sooner. If you're mid-way through typing a prompt and realise you need to ask something else first, press Ctrl + S to stash it. Your current prompt gets set aside, you can type and submit something else, and then the stashed prompt automatically restores in the input field, ready for you to send or stash again.

If you decide you no longer need the stashed prompt, just press Ctrl + C to get rid of it.

Before I knew this existed, I was copying my prompt to the clipboard, typing the other thing, and then pasting it back in. Not the end of the world, but once you've done that a few times in a session it gets old fast.

Paste Images Directly into Claude Code

Something I didn't expect from a terminal-based tool: you can copy and paste images right into Claude Code. Just copy an image and paste it into the input field, then ask questions about it. Useful for things like sharing a screenshot of an error and asking what's wrong, pasting a design mockup and asking Claude to build it, or getting help interpreting a diagram or chart.

Bash Mode with `!`

Prefix any input with ! to run it as a bash command directly from Claude Code. For example, !npm run typecheck will run your typecheck and show the output. The useful part here is that any error messages from those commands are now in Claude's context, so you can immediately ask it to help fix whatever went wrong.

You can also run long-running processes like !npm run dev and then press Ctrl + B to send it to the background. You'll see a message like:

Command was manually backgrounded by user with ID: be96u9i91. Output is being written...

A background task indicator will appear, and you can use the arrow keys to navigate to it and press Enter to view the shell details:

Shell details

Status:  running
Runtime: 2m 15s
Command: npm run dev

Output:
> dev
> react-router dev
  ➜  Local:   http://localhost:5173/
  ➜  Network: use --host to expose

From the shell details view, you can press X to stop the background process, or press the left arrow key to go back to your conversation.

This means you can keep your dev server running in the background while continuing to work with Claude in the foreground. Because the output is being captured, Claude can see what's happening with the process so if something crashes or throws an error, it already has that context and can help you debug it.

Suspend Claude with Ctrl + Z

If you need to run a bash command outside of Claude, something you don't want in its context, press Ctrl + Z to suspend the process. Run whatever you need to in your terminal, then type fg to bring Claude back. Handy for things like checking credentials, running unrelated scripts, or anything you'd rather keep out of the conversation.

Ending and Resuming Sessions

Press Ctrl + C twice to end your current session. Claude persists sessions locally, so when you exit it gives you a command to resume that session, something like claude --resume <session-id>. Just copy and paste it to pick up where you left off.

If you've already closed the session and didn't save the command, no problem. Open Claude Code and use the /resume slash command to browse your history.

If you just want to jump straight back into your most recent session, claude --continue does exactly that.

Managing Permissions

When Claude needs to run something, it will ask for permission with a few options:

Yes - allow it this once
Yes, and don't ask again for... - allow it going forward
No - block it, with the option to give a reason or suggest a different command

These choices are saved to a file called settings.local.json inside the .claude folder in your project. Inside that file you'll find a permissions property with an allow array listing everything you've approved. You can edit this manually to add commands, for example:

{
  "permissions": {
    "allow": [
      "Bash(pnpm typecheck)",
      "Bash(pnpm *)"
    ],
    "deny": [
      "Bash(git push *)"
    ]
  }
}

Use wildcards to allow a range of commands—Bash(pnpm *) will permit any pnpm command. Use deny to explicitly block things you never want Claude to run, like Bash(git push *).
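To build intuition for how allow and deny rules interact, here is a toy matcher for the Bash(...) form. It is a simplified model and not Claude Code's actual matching logic: the `decide` helper is hypothetical, it treats * as "any characters", and it assumes deny rules win over allow rules.

```typescript
interface PermissionRules {
  allow: string[];
  deny: string[];
}

// Turn a rule such as "Bash(pnpm *)" into a RegExp over the raw command.
// Simplified: only the Bash(...) form and the * wildcard are handled.
function ruleToRegExp(rule: string): RegExp | null {
  const match = /^Bash\((.*)\)$/.exec(rule);
  if (!match) return null;
  const escaped = match[1]
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*/g, ".*");                 // * matches any characters
  return new RegExp("^" + escaped + "$");
}

function matchesAny(rules: string[], command: string): boolean {
  return rules.some((rule) => {
    const re = ruleToRegExp(rule);
    return re !== null && re.test(command);
  });
}

type Decision = "allow" | "deny" | "ask";

// Assumption: deny wins over allow; anything unmatched falls back to a prompt.
function decide(rules: PermissionRules, command: string): Decision {
  if (matchesAny(rules.deny, command)) return "deny";
  if (matchesAny(rules.allow, command)) return "allow";
  return "ask";
}

// Same rules as the settings.local.json example above.
const sampleRules: PermissionRules = {
  allow: ["Bash(pnpm typecheck)", "Bash(pnpm *)"],
  deny: ["Bash(git push *)"],
};

console.log(decide(sampleRules, "pnpm build"));           // allow
console.log(decide(sampleRules, "git push origin main")); // deny
console.log(decide(sampleRules, "git status"));           // ask
```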

Permissions aren't limited to bash commands either, they also cover things like web search and other tools.

By default, settings.local.json is ignored via .gitignore so your permissions stay local to your machine. If you want to share them with your team, rename the file to settings.json.

Hope this helps you move faster with Claude. Have fun.

How I built a practical agent skill that turns rough READMEs into polished project docs

Debbie O'Brien — Tue, 24 Mar 2026 21:44:06 +0000

If you're new to agent skills, start with my beginner guide first:

What Are Agent Skills? Beginners Guide

That post covers what skills are, how they get loaded, and how to build a tiny one from scratch.

This post picks up where that one stops.

Instead of another tiny example, I want to show you what a practical skill looks like when it solves a real problem.

We are going to take the idea of a skill and use it to turn rough project READMEs into polished docs that are consistent, accurate, and reusable across repos. I picked README generation because the output is easy to judge, it comes up again and again, and once you get it right for one project you want the same quality bar everywhere.

The problem with one-off README prompts

You can absolutely ask an agent to improve your README and get something decent back.

Sometimes it will even be very good.

But if you do that across multiple projects, the cracks show up quickly:

badge styles are inconsistent
section order changes from repo to repo
install commands drift away from the actual package manager
social links get guessed
simple projects end up with bloated READMEs
the agent repeats the same repo-scanning work every time

That is exactly the kind of problem skills are good at solving.

Not because they magically make the model smarter, but because they turn a vague prompt into a reusable workflow.

The first version was just one file

I did not start with a big architecture.

The first version of readme-wizard was just a single SKILL.md with instructions telling the agent to:

detect the project name, description, license, git remote, package manager, and CI setup
add a better structure to the README
use shields.io badges
include a Quick Start section with real commands
show a project structure tree
add contributor avatars, documentation links, and optional social badges

That first version worked.

And that matters.

One of the easiest mistakes to make with agent workflows is over-engineering too early. A single file is often enough to prove whether the workflow is useful before you invest more time into it.

Here is the important part: start with the smallest thing that can produce a useful result on a real project.

What broke in practice

Once I started testing the skill on real repos, the limitations showed up quickly.

The main issue was not that the agent could not write a README. It could.

The issue was consistency.

The single-file version was asking the SKILL.md to do too many jobs at once:

writing guidance
badge formats
project-type adaptation rules
README structure templates
Mermaid diagram templates
instructions for how to detect project metadata

That creates a few problems.

First, the file gets bloated fast. By the time I had all those rules and templates inline, it was over 150 lines and hard to maintain.

Second, the agent had to figure out how to inspect the repo on every single run. There was no scanning script yet — just instructions saying "detect the package manager, find the license, parse the git remote." The agent would improvise that detection work each time. Sometimes it got it right. Sometimes it missed a CI workflow file, guessed at the wrong package manager, or invented social links that did not exist.

Third, all of that detection reasoning burned tokens and produced inconsistent results. The kind of work that should be boring and repeatable was instead fuzzy and error-prone.

The turning point: treat the skill like a workflow, not a prompt

That was the point where the skill stopped being just a better prompt and started becoming a real workflow.

The structure ended up looking like this:

.agents/skills/readme-wizard/
├── SKILL.md
├── scripts/
│   └── scan_project.sh
├── references/
│   └── readme-best-practices.md
├── assets/
│   ├── badges.json
│   ├── diagrams.md
│   └── readme-template.md
└── evals/
    └── evals.json

Every part has a different job. And that is the point.

`SKILL.md` became the orchestrator

Instead of being one giant wall of instructions, SKILL.md became the thin coordinator.

Its job is to define the workflow:

run the scan script
read the README best-practices guide
build from the template
pull badge formats from the badge catalog
validate against the eval assertions
only load diagram templates if the project actually needs them

That is a much better use of the main skill file.

It keeps the top-level instructions focused on sequence and judgment instead of burying everything in one place.

Here is what the workflow section of the final SKILL.md looks like:

## Workflow

### 1. Scan the project
Run `scripts/scan_project.sh <project-directory>` to collect structured JSON metadata.

### 2. Read the best practices guide
Read `references/readme-best-practices.md` before writing.

### 3. Build the README
Use `assets/readme-template.md` as the base structure.
Replace {{PLACEHOLDER}} markers with actual project data from the scan.

### 4. Add badges
Read `assets/badges.json` for the full badge catalog.
Only include badges for things that actually exist.

### 5. Validate the output
Review the generated README against the assertions in `evals/evals.json`.

### 6. Optionally add a diagram
Only read `assets/diagrams.md` if the project has multiple components.

Short, focused, and easy to follow. Each step points to another file instead of trying to carry everything inline.
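Step 3 is the most mechanical part of that workflow. In the skill the agent does the substitution itself, but the idea can be sketched as a small function. The placeholder names (`PROJECT_NAME`, `DESCRIPTION`, `LICENSE`) and the `fillTemplate` helper are assumptions for illustration, not the contents of the real template.

```typescript
// Replace {{KEY}} markers with scan data. Unknown keys are left in place
// so that a later validation step can flag them as leftover placeholders.
function fillTemplate(template: string, data: Record<string, string>): string {
  return template.replace(/\{\{([A-Z_]+)\}\}/g, (marker: string, key: string) =>
    Object.prototype.hasOwnProperty.call(data, key) ? data[key] : marker
  );
}

// Hypothetical template and scan data.
const readmeTemplate = "# {{PROJECT_NAME}}\n\n{{DESCRIPTION}}\n\nLicense: {{LICENSE}}";
const filledReadme = fillTemplate(readmeTemplate, {
  PROJECT_NAME: "readme-wizard",
  DESCRIPTION: "Generates polished READMEs.",
});

console.log(filledReadme);
```

Because `LICENSE` was not in the data, the marker survives in the output, which is exactly what a "no placeholder text" check is there to catch.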

The script handled the mechanical work

The biggest improvement was moving repo scanning into a script.

The skill now runs scripts/scan_project.sh <project-directory> and gets structured JSON back with things like:

project name
description
license
owner and repo
package manager
CI provider and workflows
social links
directory structure

Instead of the agent improvising that detection work every time, it runs one script and gets clean, structured data back. Boring and repeatable. Exactly what you want for metadata gathering.
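Structured output pays off because later steps can rely on typed fields instead of guesses. The sketch below assumes a hypothetical JSON shape for the scan result; the real script's field names may differ.

```typescript
type PackageManager = "npm" | "pnpm" | "yarn" | "bun";

// Hypothetical shape of the scan output.
interface ScanResult {
  projectName: string;
  description: string;
  license: string | null;
  owner: string | null;
  repo: string | null;
  packageManager: PackageManager | null;
  ciWorkflows: string[];
  socialLinks: string[];
}

// The install command comes from the detected package manager, never a guess.
function installCommand(scan: ScanResult): string | null {
  return scan.packageManager === null ? null : scan.packageManager + " install";
}

// Only offer badges for things the scan actually found.
function availableBadges(scan: ScanResult): string[] {
  const badges: string[] = [];
  if (scan.license !== null) badges.push("license");
  if (scan.ciWorkflows.length > 0) badges.push("ci");
  return badges;
}

// Made-up scan output for a small pnpm project without CI.
const rawScan =
  '{"projectName":"demo","description":"A demo app","license":"MIT",' +
  '"owner":null,"repo":null,"packageManager":"pnpm","ciWorkflows":[],"socialLinks":[]}';
const scanned = JSON.parse(rawScan) as ScanResult;

console.log(installCommand(scanned));  // pnpm install
console.log(availableBadges(scanned)); // [ 'license' ]
```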

The current reference version also goes a bit further. It checks local files first, then uses the GitHub API to look up the repo homepage and crawls it for additional social links. That is a good example of how a skill can evolve — start with the reliable local-file path, then add enrichment once the core workflow is stable.

References and assets gave everything a home

The remaining pieces fell into two folders.

references/readme-best-practices.md holds the writing guidance: section order, tone, project-type adaptation, badge rules, and common pitfalls. The agent only reads it when it is about to write, not every time the skill loads.

assets/ holds reusable inputs: badges.json for badge formats, readme-template.md for the base README structure, and diagrams.md for Mermaid templates when a project is complex enough to justify one.

This is where the skill becomes easy to customize. Want to change badge styles? Edit the badge catalog. Want a different README structure? Edit the template. Want to skip diagrams for simpler repos? The skill just avoids loading that asset entirely.

Keeping domain knowledge and data out of the main instructions makes the whole thing much easier to maintain.

Evals made the quality bar explicit

Once the skill was doing real work, I wanted a way to define what good actually meant.

That is what the evals are for.

The evals/evals.json file includes prompts for different cases:

a straightforward README improvement request
a casual "make this look professional" request
a minimal project that should not get bloated
a badge-focused request that should only generate real badges

I like this part because it forces the standards out into the open.

Instead of vaguely feeling that the README is better, you can check for specific things:

no placeholder text
badges only for real metadata
Quick Start commands that match the detected package manager
section depth proportional to the project
no fabricated social links

That makes the skill easier to improve without drifting.

The larger lesson

The interesting thing about this project is not really README generation.

The larger lesson is that a useful skill usually stops looking like a prompt pretty quickly.

It becomes a small system.

Some parts should stay flexible and language-driven.

Some parts should be deterministic.

Some parts should be reusable data.

Some parts should act as tests.

Once you see that pattern, it applies to a lot more than READMEs:

commit message workflows
code review checklists
release note generation
internal documentation standards
repo audits
team-specific engineering conventions

That is the shift I find most useful when working with agents.

You stop asking the model to improvise the whole workflow every time.

Instead, you give it a structure that makes good behavior easier.

If you want to build your own skill

If you want to build your own skill, this is the path I would recommend:

Start with one SKILL.md.
Test it on a real project as early as possible.
Watch for repeated logic and consistency failures.
Move mechanical work into scripts.
Move domain knowledge into references.
Move templates and data into assets.
Add evals once the skill matters enough to maintain.

That sequence keeps the architecture earned.

You are not building a folder structure for its own sake. You are extracting parts only when they prove they deserve to exist.

Try it yourself

If you want to explore the full tutorial series or inspect the finished reference implementation, the repo is here:

debs-obrien/learn-agent-skills

And if you just want to try the skill without building it yourself:

npx skills add debs-obrien/learn-agent-skills

Then open any project and tell your agent:

Improve the README for this project using the readme-wizard skill.

The point is not just that a skill can write a better README.

The point is how you get from a useful first draft to something reusable.

I Used Skill Creator v2 to Improve One of My Agent Skills in VS Code

Debbie O'Brien — Sat, 21 Mar 2026 08:32:36 +0000

I just published a video showing how I used Skill Creator v2 to improve an existing AI skill inside VS Code, and honestly, I was seriously surprised at how much this thing does.

What impressed me most is that it does much more than just rewrite instructions.

It can:

review an existing skill
suggest targeted improvements
run evals against a baseline
compare outputs side by side
generate benchmark summaries
help optimize descriptions for better triggering

In the video, I ran it against a skill I had already created to see whether the updated version actually performed better, or if there was anything I was missing.

And it was a ton of fun.

The Skill I Tested

The skill I used for the demo was a skill I had already created called README Wizard.

It basically generates a polished, professional README for any project and is meant to kick in whenever someone mentions:

improving a README
project documentation
badges
first impressions for a repo
making a GitHub repo look more professional

It also checks project metadata, reads best practices, uses badges, Mermaid diagrams, and works from a README template. (need to create a video for this too.. on it)

So rather than creating a skill from scratch, I wanted to see if Skill Creator v2 could improve something real that I had already built.

Finding Skill Creator v2

The first thing I did was go to skills.sh and search for Anthropic.

From there, I found Skill Creator, and it now shows a summary of the skill which is nice.

The skill covers test case creation and evaluation, and it also runs parallel test cases with and without the skill to measure impact, capturing timing and token usage for comparison.

And on top of that, it generates an interactive browser-based reviewer showing outputs, qualitative feedback, and benchmark metrics.

It also includes description optimization, which is really important for improving skill triggering accuracy by testing realistic trigger and non-trigger queries.

Installing the Skill

Installing it was pretty straightforward.

I copied the install command from the page, pasted it into the terminal, and then selected where I wanted the skill installed.

Looking Inside the Skill

Before running it, I wanted to see what was actually inside the Skill Creator skill.

Skills are written for agents, not really for you to sit there and read through line by line, but I always like to have a look.

And this one is pretty complex.

It includes:

the SKILL.md file
its own agents
references and schemas
Python scripts
an eval viewer
review tooling

What I found really cool is that it comes with its own agents:

an analyzer agent
a comparator
a grader

The analyzer looks at comparison results and tries to understand why the winner won and generate better suggestions.

The comparator compares two outputs without knowing which skill produced them.

The grader evaluates expectations against the execution transcript and outputs.

Running It Against a Real Skill

I then used Skill Creator against my README Wizard skill.

I’ve done this a couple of times now, and I found that in VS Code I sometimes need to be a little more explicit if I want the full benefit of the sub-agents.

Claude seems to pick that up more naturally because the skill was built for it, but in VS Code I wanted to make sure it really used everything available.

So I’d definitely encourage being explicit there.

What It Found

Very quickly, it started identifying issues with my skill.

Things like:

the workflow being under-specified
missing guidance for handling existing or missing READMEs
README best practices being too thin
sections that should only appear if relevant links exist
eval coverage being too small
missing edge cases
limited project detection

For example, it pointed out that a personal learning repo probably doesn’t need the same sections as every other project.

It also spotted that I only had two evals and suggested adding more realistic test cases, including edge cases like minimal projects and badge-focused README requests.

Applying Improvements

Once it had reviewed everything, it started applying targeted improvements across the skill files.

This part was honestly kind of exciting to watch because it moved fast.

It updated the skill instructions and made the guidance more explicit.

It improved the best practices.

It tightened up the logic around when certain sections should or shouldn’t be included.

And it expanded the eval coverage.

I could go through the changed files while it was working and see that it wasn’t just randomly changing things. It was making focused improvements that actually made sense.

That part gave me a lot of confidence.

Adding More Evals

One thing I really liked was how it expanded the eval set.

I had two evals.

It added more.

For example, it created cases around:

minimal project README generation
badge-focused requests

And these evals work like tests.

They include assertions such as whether the output has:

a project description
a quick start or usage section
appropriate badges
the right structure for the kind of project being documented

This was super useful because it meant I wasn’t just guessing whether the skill was better. I could actually measure it.

Running Sub-Agents in Parallel

Then came one of the coolest parts.

It launched sub-agent runs in parallel.

It ran:

the improved skill
the old skill baseline

side by side across multiple test cases.

That meant it could directly compare the version with the new changes against the original version.

This is where the workflow really stood out to me. It wasn’t just making edits and calling it a day. It was actually testing whether the changes improved results.

The Results

After the runs completed, it graded all the outputs against their assertions and generated benchmark results.

The improved skill outperformed the baseline on two out of four evals and tied on the other two.

The overall result improved from 81 to 97.5.

That’s a gain of 16.5 points, or roughly a 20% relative improvement.

Some of the biggest wins came from improving the skill’s ability to:

generate good content even when metadata is sparse
adapt README length and sections to different project types instead of always forcing the full template

The Workspace It Creates

Another thing I wanted to show in the video was the workspace it creates while doing all this.

It creates a workspace folder where it stores things like:

skill snapshots
old skill outputs
grading results
benchmark data
iteration files

You don’t necessarily need to go through all of that manually, but it’s very cool that you can.

If you want to inspect exactly what happened at each stage, it’s all there.

That level of visibility is really nice.

The HTML Eval Viewer

Then I asked whether there was a way to see the benchmarks in HTML.

And yes, there is.

Skill Creator has an eval viewer for that.

This was another really nice surprise.

It launched an HTML review page showing:

old skill vs improved skill
formal grades
pass/fail results
benchmark comparisons
review flows for feedback submission

It’s made for a human to read.

You can actually review what happened and decide whether you agree with the results.

I really liked that.

Description Optimization

And then, because apparently this skill wasn’t done showing off yet, I ran the description optimization flow as well.

This generates trigger and non-trigger queries to see whether your skill description is actually good enough to fire when it should, and stay out of the way when it shouldn’t.

That workflow lets you:

review trigger queries
review non-trigger queries
edit them
export the eval set
run the optimization loop

That is super valuable.

A lot of the time, the problem with a skill is not the logic inside it. It’s that the description is not specific enough, or not clear enough, for the agent to trigger it properly.

So I really liked that this was built in too.
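Here is a rough sketch of what a trigger and non-trigger eval set could look like. The would_trigger function is a deliberately naive keyword matcher that I am using as a stand-in for the agent's real decision, and the description and queries are invented examples.

```python
STOP_WORDS = {"a", "the", "to", "my", "for", "and", "of", "that", "with", "in", "is"}

def keywords(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation and stop words removed."""
    return {w.strip(".,?!").lower() for w in text.split()} - STOP_WORDS

def would_trigger(description: str, query: str, min_overlap: int = 2) -> bool:
    # Toy stand-in for the agent: fire when enough keywords are shared.
    return len(keywords(description) & keywords(query)) >= min_overlap

description = "Generate or improve a project README with badges and a quick start section"

eval_set = [
    {"query": "Can you improve the README for this project?", "should_trigger": True},
    {"query": "Add badges to my README", "should_trigger": True},
    {"query": "Fix the failing unit test in utils.py", "should_trigger": False},
    {"query": "What is the weather today?", "should_trigger": False},
]

failures = [
    case["query"]
    for case in eval_set
    if would_trigger(description, case["query"]) != case["should_trigger"]
]
print(failures)
```

If a query lands in failures, that is a hint the description is either too vague to fire or too broad to stay quiet.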

Final Thoughts

If you’re already building custom skills for:

GitHub Copilot
Claude Code
or other coding agents

this is absolutely worth checking out.

And yes, the video is long.

But the skill does a lot, and I found it hard to cut it down because I kept finding more things it could do.

So I pretty much left it as is.

Have fun.

Watch the Video

If you’re creating your own skills already, or even just experimenting with prompts and instructions, I’d be really curious to know how you’re approaching it.

Build Websites, Games, and Teaching Resources With Google Gemini for Free (No Coding Required)

Debbie O'Brien — Sun, 15 Mar 2026 17:14:32 +0000

I literally came home from a podcast interview and my husband said, "Debbie, I've built three websites." I said, "What?" And he said, "Yes, I've built three websites." And I said, "I heard you, but what?"

My husband is not in tech. He has never cared about anything I do in tech. He does not know how to build anything. He works in the public sector and is rarely on the computer. And yet, he was able to build three websites in 10 minutes. He watched a video for 10 minutes and just went for it. I knew I had to tell the world about this.

How he did it

He went to gemini.google.com. That's it. This is free. He paid absolutely zero money. I do have a pro account, but he has a free account and can just build websites. Seriously, you've got to check it out.

When you open Gemini, you'll see a bunch of options like create image, create music, create video, write anything, help me learn. There's a lot you can play around with. But there are a couple of other things that I find are a little bit hidden. If you click into the options, you'll see Canvas, deep research, guided learning, and more.

Canvas is the one we want. Open a canvas, pick the fast model (totally fine for this), and you're ready to go.

Building a Batman game for toddlers

After my husband built his three websites, the next day he started building games for our kids. I wanted to show you what that looks like so I typed in "build a Batman game for toddlers." That's it. A simple prompt.

The first thing that pops up is all this code. Don't be scared by that. As my husband said, "I watched it spit out all the code that you normally write by hand." I normally write this code by hand. That's the insane thing.

Once the code finishes generating, it jumps into preview mode. And there it was. A Batman city helper game where you move Batman to collect stars. I was just moving with the trackpad and it worked. It was actually really simple for toddlers to play.

Just use your imagination

The cool thing is you just have to have an imagination and think about how to make it better. I typed "can you add sound?" and Gemini added a cheerful ping whenever Batman catches a star or a balloon. I could hear the little sound effects and honestly it was amazing.

You can even select a specific area of the preview, drag a box over it, and ask Gemini to make changes right there. I dragged over the top area and said "can you add the person's name here?" You don't even have to specify "the right hand corner" or anything. Just drag and ask. It figured it out.

Building multiple things at the same time

Here's something people don't even know about. While one thing is generating, you can open a new canvas and start building something else. I had the Batman game going and at the same time asked it to create a website for a circus act.

One line. That's all I typed. And out came a full circus website. "A night of pure magic." It looked incredible. And I could iterate over it, add the actual address, YouTube links, whatever I wanted.

Then I created some games for learning numbers for toddlers. I used to be a school teacher. I used to have to prep and create all this stuff in my free time. And now I can just create resources on the fly.

The number learning game blew my mind

The toddler number game it created was called "Number Fun" with a bubble pop game. It actually spoke out loud and said "pop the bubbles in order, start with one." You go one, two, three. I tried doing it wrong on purpose and it said "find number four." It had a home button where you could click on count items and count fruits. One, two, three, four. It had sound you could turn on and off.

This is what kids want. This is the kind of interactive learning material that used to take ages to build. And I made it with one prompt.

Sharing your creations

You've got access to the code if you know how to code and want to tweak things yourself. But the really cool thing is the share button. You can copy the content, share it with someone, or just copy a link. My husband was sharing things with me saying "here's a website I've started, can you help me improve it?"

You can also go to previous versions and see changes saved. And there's an option to add Gemini features, which adds AI stuff to your creation. Great for writing stories. We actually created a book that reads the story to you. I can't even remember all the things we created in just a couple of minutes.

The one thing that is missing is there's no quick and easy deploy button. You can share a link and people can see it, which is kind of like it's deployed but not really deployed. It's a bit weird. But for getting a design together and having someone help you with the deployment part later? It works.

The only limit is your imagination

Whether you're creating games, building your own book, making material for teaching, or putting together a website and having someone help you with the deploy part, you're in control.

This is free. You might hit some limits if you keep going nonstop, and then you can pay for more. But to try it out, it costs nothing. Just go to gemini.google.com and start building.

I encourage you all to play around with this. Think about the possibilities. The technical barrier is gone. The only problem now is your imagination. So start being creative and just start imagining things.

Have fun building.

What Are Agent Skills? Beginners Guide

Debbie O'Brien — Wed, 04 Mar 2026 18:27:36 +0000

AI agents are smart. But they're generic. Your agent is trained on a ton of general knowledge, but it doesn't have your specific domain knowledge. It doesn't know your preferences, your team's conventions, or how you personally want things done.

When we learn a new skill — playing basketball, riding a bike — we're adding knowledge we didn't have before. Skills work the same way for your agent. You give it the domain knowledge it's missing, personalized to how you want things done.

What is a skill?

A skill is a reusable set of instructions that teaches an AI agent how to do a specific task well. Think of it like a recipe card you hand to a talented chef. The chef knows how to cook, but they don't know your family's secret sauce. The recipe card tells them exactly what to do.

Without a skill → the agent produces generic output
With a skill → the agent follows your instructions and produces exactly what you want, every time

At its simplest, a skill is just one file: a SKILL.md with a name, description, and instructions. That's it. You can add extras like scripts, references, assets, and evals — but you don't have to. All you need right now is the SKILL.md file.

Let's build one.

Build your first skill

Open VS Code in your project directory. We're going to create a good-morning skill step by step.

Step 1: Create the folder structure

Create a new folder in your project root. You can use .agents/, .github/, or .claude/ — they all work. The .agents/skills/ path is the cross-agent convention that works with Copilot. Inside that, create a skills folder, and inside that, create a folder called good-morning. This folder name is your skill's name.

your-project/
└── .agents/
    └── skills/
        └── good-morning/

Step 2: Create the SKILL.md file

Inside the good-morning folder, create a file called SKILL.md. It must be in capital letters — that's how agents find it.

your-project/
└── .agents/
    └── skills/
        └── good-morning/
            └── SKILL.md

Step 3: Add the frontmatter

Open SKILL.md and add the YAML frontmatter at the top:

---
name: good-morning
description: "A skill that responds to good morning with a cheerful greeting"
---

Two important things here:

The name must match the folder name. If the folder is called good-morning, the name must be good-morning. If they don't match, your editor will flag it.
The name and description are always in context. Every time you're working in this project, the agent sees the name and description so it knows what skills are available. Keep the description short and specific, because this is how the agent knows when to use the skill.
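If you want to check the name rule yourself, here is a small Python sketch. The helper names read_frontmatter and name_matches_folder are my own, and the parser only handles simple key: value lines, so treat it as an illustration rather than a real YAML parser.

```python
from pathlib import Path
import tempfile

def read_frontmatter(text: str) -> dict[str, str]:
    """Parse simple `key: value` pairs between the leading --- markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return fields

def name_matches_folder(skill_md: Path) -> bool:
    """True when the frontmatter name equals the skill's folder name."""
    fields = read_frontmatter(skill_md.read_text(encoding="utf-8"))
    return fields.get("name") == skill_md.parent.name

with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / ".agents" / "skills" / "good-morning"
    skill_dir.mkdir(parents=True)
    skill_md = skill_dir / "SKILL.md"

    skill_md.write_text(
        '---\nname: good-morning\ndescription: "A cheerful greeting"\n---\n',
        encoding="utf-8",
    )
    matches = name_matches_folder(skill_md)

    # Rename the skill in the frontmatter only: the check should now fail.
    skill_md.write_text("---\nname: good-evening\ndescription: x\n---\n", encoding="utf-8")
    mismatch_detected = not name_matches_folder(skill_md)

print(matches, mismatch_detected)
```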

Step 4: Write the instructions

Everything below the frontmatter is the skill body. This only gets added to context when the skill is called, not all the time. The agent only loads these instructions when it decides to use the skill.

Add the body below the frontmatter:

---
name: good-morning
description: A skill that responds to good morning with a cheerful greeting
---

# Good Morning Skill

When the user says good morning, respond with:

- "Hi Debbie, hope you have a great day!"
- Ask if they have done any sport today
- Include a funny joke about sports

## Example

**User:** Good morning

**Agent:** Hi Debbie, have you done any sport today? Here's a funny joke about sports: Why did the soccer player bring string to the game? Because he wanted to tie the score!

That's the complete skill. One file. A few lines of instructions. Make it as personal as you like, put your own name in there, change the topic from sports to whatever you want.

Test it

Start a new session from the same directory (skills are discovered at session start) and type:

Good morning

The agent finds the skill, reads the SKILL.md file, and responds.

In GitHub Copilot: "Hi Debbie, have you done any sport today? Here's a funny joke about sports: Why did the bicycle fall over? Because it was too tired from all that cycling!"

In Claude Code: Open Claude Code from the same project directory, say "good morning", and you get the same thing: "Hi Debbie, have you done any sport today? Here's a funny joke for you: Why do basketball players love donuts? Because they can always dunk them!"

Skills work across agents. The same SKILL.md file works in Copilot, Claude Code, and others. Each agent discovers the skill, reads the instructions, and follows them.

That's a skill in action. Now imagine instead of "good morning", the instructions told the agent how to generate a polished README, write commit messages in your team's format, or review code against your standards. Same idea, bigger impact.

How skills get loaded

Skills are designed to be efficient with context windows. They use a three-level loading system. The agent only loads what it needs, when it needs it.

Level 1 is always in the agent's context. It's just the name and description (~100 words). This is how the agent decides whether to use the skill. If someone says "improve my README", the agent scans its available skills and picks the one whose description matches.

Level 2 loads when the skill triggers. The full SKILL.md body with all the instructions, steps, and examples. This is ideally under 500 lines.

Level 3 loads on demand. Scripts, references, and assets that the agent pulls in only when it needs them. Scripts can even run without being loaded into context at all, saving tokens. And some resources might not load at all for certain projects. For example, a diagram template file only needs to be read if the project is complex enough to need an architecture diagram. Simple projects skip it entirely.

This matters because context windows are limited. A well-designed skill is lean at the top and detailed at the bottom.
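The three levels can be modeled in a few lines of Python. This is only a sketch of the idea: the Skill class, build_context function and the readme-wizard example are hypothetical names of mine, and real agents decide what to load in their own way.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    description: str   # Level 1: always in context
    body: str          # Level 2: loaded only when the skill triggers
    resources: dict[str, str] = field(default_factory=dict)  # Level 3: on demand

def build_context(skills, triggered=None, needed_resources=()):
    """Assemble the text an agent would hold in context at a given moment."""
    # Level 1: every skill contributes just its name and description.
    parts = [f"{s.name}: {s.description}" for s in skills]
    for s in skills:
        if s.name == triggered:
            parts.append(s.body)  # Level 2
            parts.extend(s.resources[name] for name in needed_resources)  # Level 3
    return "\n".join(parts)

readme = Skill(
    name="readme-wizard",
    description="Generate or improve a project README",
    body="Step 1: inspect the project. Step 2: pick sections.",
    resources={"diagram-template.md": "ARCHITECTURE DIAGRAM TEMPLATE"},
)

idle = build_context([readme])
active = build_context([readme], triggered="readme-wizard")
complex_project = build_context(
    [readme], triggered="readme-wizard", needed_resources=("diagram-template.md",)
)
print(len(idle), len(active), len(complex_project))
```

The context grows only as far as the task requires: a simple project stops at the second call and never pays for the diagram template.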

Where skills live

Skills can be installed at two levels:

Project-level: in your project directory, available only when you're in that directory
Global: in your home directory, available from anywhere

Each agent checks slightly different locations:

GitHub Copilot (VS Code):

# Project-level (any of these work)
your-project/.github/skills/
your-project/.claude/skills/
your-project/.agents/skills/

# Personal (works from any directory)
~/.copilot/skills/
~/.claude/skills/
~/.agents/skills/

Claude Code:

# Project-level
your-project/.claude/skills/

# Personal (works from any directory)
~/.claude/skills/

The .agents/skills/ path is part of the Agent Skills open standard, a cross-tool convention, but Claude Code uses its own .claude/ directory structure, not .agents/.
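To see what that difference means in practice, here is a Python sketch that looks for skills using only the project-level locations listed above. The discover_skills function and the PROJECT_SKILL_DIRS table are my own illustration; the real agents may check more places or apply extra rules.

```python
from pathlib import Path
import tempfile

# Project-level locations as listed in this post (not an exhaustive or official list).
PROJECT_SKILL_DIRS = {
    "copilot": [".github/skills", ".claude/skills", ".agents/skills"],
    "claude-code": [".claude/skills"],
}

def discover_skills(root: Path, agent: str) -> list[str]:
    """Return the names of skill folders containing a SKILL.md for one agent."""
    found = []
    for rel in PROJECT_SKILL_DIRS[agent]:
        base = root / rel
        if base.is_dir():
            found.extend(sorted(p.parent.name for p in base.glob("*/SKILL.md")))
    return found

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    skill_dir = project / ".agents" / "skills" / "good-morning"
    skill_dir.mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text("---\nname: good-morning\n---\n", encoding="utf-8")

    copilot_sees = discover_skills(project, "copilot")
    claude_sees = discover_skills(project, "claude-code")

print(copilot_sees, claude_sees)
```

Under these assumptions a skill that lives only in .agents/skills/ shows up for Copilot but not for Claude Code, so the folder you pick decides which agents can find it.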

The skills ecosystem

There's a whole directory of skills at skills.sh where you can browse and discover skills built by the community.

To install a skill, use the skills CLI:

npx skills add anthropics/skills --skill skill-creator

This installs the skill-creator skill from Anthropic, a skill that helps you create other skills. One command and it's ready to use.

You can see what you have installed:

npx skills list

And search for skills:

npx skills find

Skills work across multiple AI agents — Copilot, Claude Code, Cursor, Goose, and many more. The skills CLI handles installing to the right location for each agent.

I Needed an APP to Track My Learning Journey, AI built it in less than half an hour with a single prompt

Debbie O'Brien — Sat, 17 Jan 2026 20:31:26 +0000

I have been trying to build a Learning Hub app for a good few months using various tools. I went through many iterations trying to get it to work and ended up going round in circles, fixing one thing and breaking another.

Today I gave it a shot using the goose coding agent with Claude Opus 4.5 as my model. I used a simple prompt. No .prd file. Just clear instructions of what I wanted, and I asked it to come up with a plan.

And it did. The plan was great. It started off with: What a great idea! Let me put together a comprehensive plan for this app. I'll call it "AI Learning Hub" - your personal knowledge management system for AI learning resources.

I got a Core Features Overview, Tech Stack Recommendation, Data Structure, UI Layout Concept, Build phases and some extra nice to have stuff for future improvements.

I asked it to change two things and then off it went. In less than half an hour it was not just built but tested before my eyes using the Playwright MCP. I watched as it added links, tested out the filters, and used the site like a user would, ensuring everything worked.

AI Learning Hub - Complete!

My agent told me: I've built your AI Learning Hub app! Here's a summary of what's been created:

Features Implemented
Add URLs with metadata fetching: ✅ Auto-fetches title, description, detects category
Categories: ✅ Blog, Video, Podcast, Course, Paper, Other
Learning workflow: ✅ Queue → Learning → Completed
Search: ✅ Full-text search across resources
Category filter: ✅ Filter dropdown
Notes: ✅ Rich notes modal per resource
Drag & Drop: ✅ dnd-kit installed, drag handles ready
Content Ideas: ✅ Sidebar with types (Blog, Video, Tutorial, Thread)
Password protection: ✅ Environment variable based
Netlify ready: ✅ Config file created

And how do I add the URLs of the sites I want to track? Easy. I just paste them into goose and it uses the Playwright MCP, which I had already configured, to populate them all for me so I don't have to.

Conclusion

And that's it: finished and working in less than half an hour while I drank a beer on a Saturday night, watching in amazement as something that should have taken me weeks to build was built before my very own eyes in minutes.

Now here is the thing. I didn't open an editor once. I haven't looked at the code. It is working as it should and that is really all I care about for this particular project. I studied coding. I care about the quality of code but right now I am ok with not caring. I am ok with trusting the agent and LLM to ensure the code is good and meets the standards it should.

I will add tests and check performance, and out of curiosity I might just look at the code when doing the PR. But I am seriously blown away by how easy it is to do this.

Try it yourself

Want to give it a try yourself? Here is the prompt I used:

I would like you to build me an app so that I can easily manage urls for blog posts, podcasts, videos and other things that I would like to learn when it comes to AI. It would be great to be able to easily add the URL and then have a title and description field which can be populated when adding it. search by category would be great. I would be cool to have some sort of system like a todo list so when it is done it goes to a different place but is still findable should i want to share it with someone. maybe even notes so i could add some notes on it for later findings or note taking. should be able to prioritize things so that i learn things based on a particualar order maybe drag or drop so i can change it. it should be a fun app that i can easily deploy, nice and easy on the eye. it would also great to have a section where i can put ideas on content creation based off of the stuff I have learnt. these could be create blog posts, videos etc. just an idea and not sure if this will look great but we could try it out. can you come up with some sort of a plan for this.

How I Use AI Agents + MCP to Fully Automate My Website’s Content

Debbie O'Brien — Wed, 14 Jan 2026 19:29:47 +0000

Recently I have been playing with a lot of tools to help automate simple tasks so I can keep my website up to date. I create a lot of content, from videos to blog posts, and appear as a guest on many podcasts, and I want this reflected on my site. It's good to have all this info in one place to easily share with others, and it's also great to look back on. But it is tedious and it takes time, time that I have very little of. So this is a perfect use case for AI to take over. So where do I start?

Before AI

First of all, let me tell you what it is like to add a new podcast episode to my site. I use Nuxt Content, so each podcast is just a markdown file with some YAML. This YAML contains things like the date, the name of the podcast, the image URL, and the host. Last year I was simply taking an old podcast episode, clicking duplicate in VSCode, and then renaming everything with the new podcast information. So basically I was manually clicking the link to the new episode and copying and pasting the information from the site where it was hosted into my markdown file. Then I had to download the image and upload it to Cloudinary, get the image name from there, and paste that into my file. Cloudinary is great for managing images and keeping my site performant, but the extra work of downloading and uploading the images was tedious and meant that sometimes it took me ages to add new episodes to my site because I simply couldn't be bothered doing it.
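To give a feel for the kind of file involved, here is a Python sketch that renders a markdown file with YAML frontmatter. The field names (title, date, podcast, host, image, url) and all the values are placeholders I invented for illustration; they are not the actual schema of my site.

```python
import json

def podcast_markdown(fields: dict[str, str], body: str = "") -> str:
    """Render a markdown file with a YAML frontmatter block."""
    # json.dumps produces double-quoted strings, which are also valid YAML scalars.
    front = "\n".join(f"{key}: {json.dumps(value)}" for key, value in fields.items())
    return f"---\n{front}\n---\n\n{body}"

episode = podcast_markdown({
    "title": "Example episode title",
    "date": "2026-01-14",
    "podcast": "Example Podcast",
    "host": "Example Host",
    "image": "podcasts/example-episode",  # hypothetical Cloudinary image name
    "url": "https://example.com/episode",
})
print(episode)
```

Every one of those fields had to be copied by hand from the podcast's page, which is exactly the part worth automating.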

Automating with Prompts and the Playwright MCP

I started to automate some of the process by using reusable prompts in VSCode. I created instructions for Copilot of what it needed to do and then all I had to do was press the play button to run the prompt in a new chat and then give it the URL for the podcast. I had the Playwright MCP installed so Copilot would use it to navigate to the URL of where the podcast was hosted and find the relevant information it needed to complete the metadata. It was pretty good and saved a lot of time and I could even bulk update new episodes by giving it more than one URL.

However, the images were still an issue and I was tempted to just stop using Cloudinary because it was quicker and easier to use images stored in the public folder of my site. But then I would lose out on the benefits of Cloudinary and its image optimization.

Automating with Goose, Playwright MCP, Cloudinary MCP & GitHub MCP

I then started playing around with Goose, a coding agent from Block. Goose is a desktop app, although there is also a CLI available. I decided to give it a go and see if I could improve the way I automated this process. I probably could have continued playing around in VSCode and achieved something similar, but lately I have been trying to code, or should I say get tasks done, without using an editor. As in, just review the pull request later on CI and let the agent do its thing, because I really believe this is the way we are heading, so I want to keep experimenting with how it feels to code this way.

So I copied my prompt into Goose and saved it as a recipe. Recipes seem pretty similar to prompts, but you can use parameters, so I could add the podcast URL as a parameter and it will automatically get detected. There are also a lot of other options which I haven't got round to properly checking out, but this seemed enough for my use case. Now I have the Playwright MCP navigating to the site and getting all the info I need for the podcast page. I then just asked Goose to download the image for me and add it locally, and it did. This was great, but did I really want to stop using Cloudinary just because I was lazy?

So I thought, what if Cloudinary had an MCP server? Then Goose could use it to upload the image and update the image metadata with the correct image ID. If it could do that, all my problems would be solved. So I looked in Goose's extensions, searched for Cloudinary, and would you believe it, there was an MCP server for Cloudinary. Not only that, but it actually worked. Goose used the Playwright MCP to navigate to the site and get all the content it needed, including the image, and then the Cloudinary MCP was used to add the image to my Cloudinary account using my API key stored in the extension's settings. It even figured out which folder to save it to without me asking.

And that was it. It all just worked. I checked my Cloudinary account and the images were there. I then asked Goose to run the dev server and verify its work using the Playwright MCP by navigating to the podcasts page to ensure everything looked as it should. Not only could I see the browser being opened and see the new podcast episodes with images, but I could also ask for a screenshot of the page.

Then one more thing, of course. We had come so far, so we may as well finish it all off. I then asked Goose to create a pull request, which it did using the GitHub MCP that I had previously configured. I then reviewed the code just in case anything looked wrong, especially with regard to the Cloudinary URL, even though I had visually reviewed it, and as you can imagine, it was good to go. I merged it and the new podcast episodes were added to my site.

Conclusion

So here's the thing. It took me time to figure all this out and set up the process and ensure it was all working. Yes, it would have been quicker to copy and paste myself. But now it's done, and the next time I want to add a podcast episode I just have to run my recipe in Goose and pass in the podcast URL. I am a guest on one tonight, so when that is out I will be able to add it easily to the site. In fact, if I had a team of people working on my site I could even share the recipe with them and they could simply run it.

I am using Nuxt Content for my site, which means I have no CMS. My content lives in markdown files, which makes it very easy for a developer to add content but perhaps not so easy for non-developers. But now, now even my mother could add a new podcast episode to my site. That is just amazing. This is just my personal site, but think about the possibilities of this use case for many other businesses.

I am very impressed with what Goose can do. The more I use it, the more it blows my mind. I am now going to go ahead and add other recipes for the rest of the content I add, or perhaps just modify this recipe with parameters so I can have one recipe. I shall keep playing around. This is fun.

Let me know if you found this interesting, are doing something similar, or have used any of the MCPs mentioned above. We are living in exciting times, so if you haven't started to experiment yet, what are you waiting for? Just play around and have fun.

Debugging My Zsh Config With Goose (and Why Agentic AI Actually Helped)

Debbie O'Brien — Mon, 15 Dec 2025 15:37:30 +0000

I’ve been playing around with a few things recently and wanted to share a real experience that genuinely surprised me.

You might have seen the news from the Linux Foundation announcing the formation of the Agentic AI Foundation. As part of that, a few projects were donated into the foundation, including the Model Context Protocol (MCP), AGENTS.md, and Goose.

I’m guessing a lot of people haven’t heard of Goose yet. I hadn’t either until recently, so I figured I’d dig into it and see what it’s actually about.

If you want the full announcement, you can read it on the Linux Foundation website. This post isn’t about the announcement though — it’s about what happened when I tried to use Goose for real.

What is Goose?

Goose was released in early 2025, so it’s still very new. It’s open source (which I love), and it’s a local-first AI agent framework. It combines language models with extensible tools like MCP so it can actually do things, not just talk about them.

There’s both a desktop app and a CLI. I’m much more of a desktop app person than a CLI person, but everything seems to be going CLI these days, so I decided to give the Goose CLI a try.

The docs live at block.github.io, and I did start reading them. Like most people, I got through the first bit and then thought, “I’ll figure it out as I go.”

That decision led me directly into a very familiar kind of developer pain.

The Problem: Goose Wouldn’t Run

I installed the desktop app first and then followed the docs to install the CLI. The instructions said that after updating my .zshrc, I should be able to run the goose command.

I couldn’t. No matter what I did, I kept getting:

zsh: command not found: goose

I sourced my .zshrc. I ran goose --help. Nothing worked. The agent kept telling me everything was done correctly, and I kept replying with some version of, “No, it’s not working.”

This is where it got interesting.

Goose Didn’t Argue — It Investigated

Instead of looping on generic advice, Goose pointed out something important: terminal config changes are only picked up when a new session starts. That’s something I always forget.

So I closed the terminal, opened a new one, and tried again.

Still broken.

At this point, Goose acknowledged that something wasn’t right. It explained that if sourcing the file and restarting the terminal didn’t work, then one of two things was probably happening:

The change didn’t save correctly
Another startup config was overriding the PATH

Honestly, that explanation alone would normally make me sigh and prepare to lose an hour.

Instead, Goose suggested checking the file directly.

Letting an AI Read My .zshrc

Goose used a tool to open my .zshrc and read it. I didn't need to install this tool. Reading .zshrc files is not something I do often and I definitely don’t enjoy debugging them.

Goose scanned through all the usual stuff (pnpm, bun, nvm, PATH exports) and then immediately spotted the problem.

When it had added the Goose PATH export earlier, it didn’t include a newline. That meant the new export was stuck onto the end of the previous line, creating a syntax error.

I wouldn’t have noticed that quickly. I was looking at the file and just seeing noise.

Goose explained exactly what went wrong, showed me the broken line, and explained that it should really be two separate lines.

Then it fixed it. It replaced the bad line with two clean, correctly formatted lines and even added a comment to make it clearer for the future.
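Here is a tiny Python sketch of that kind of bug and a safer way to append. The two export lines are made-up examples, not the actual contents of my .zshrc, and append_line is a hypothetical helper.

```python
def append_line(config: str, new_line: str) -> str:
    """Append a line, first making sure the existing text ends with a newline."""
    if config and not config.endswith("\n"):
        config += "\n"
    return config + new_line + "\n"

# Hypothetical file contents: the last line has no trailing newline.
zshrc = 'export NVM_DIR="$HOME/.nvm"'
goose_export = 'export PATH="$HOME/.local/bin:$PATH"'

broken = zshrc + goose_export           # naive append: both exports end up on one line
fixed = append_line(zshrc, goose_export)

print(broken)
print(fixed)
```

The naive version produces a single fused line, while the safe version keeps the two exports on separate lines, which is the same repair Goose made by hand.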

After that, Goose asked me to restart my terminal one more time. This time, when I typed goose, the command worked. I could see all the available commands. Sessions, MCP servers, bundled tools, everything was there.

At that point, I just sat back for a second. Not because this was some massive, complex bug, but because this is exactly the kind of small, annoying issue that can completely derail your flow.

Why This Actually Matters

If I had debugged this myself, I would have figured it out eventually. But it would have taken time, frustration, and a lot of trial and error. Instead, the agent noticed something was off. It inspected a real config file, identified a subtle syntax error, fixed it safely, and then explained what happened.

That’s the difference between AI that answers questions and AI that actually helps you get unstuck.

I also didn’t know Goose could edit files like this. Seeing it work through the problem step by step, without pretending everything was fine when it wasn’t, made a big difference.

We don’t have the answers to everything as developers. That’s normal. What is changing is how quickly we can get unblocked.

If something doesn’t feel right, push back. Say it’s not working. Let the agent iterate. Let it re-check assumptions. That’s where this starts to become genuinely useful.

If you’re curious about Goose, it’s worth a look. And even if you’re not, this kind of experience is a good reminder that using AI well isn’t about shortcuts, it’s about reducing unnecessary friction.

That’s it. Have fun experimenting with AI.


How I vibe code: Improving my site design with Goose and Gemini 3

Debbie O'Brien — Thu, 20 Nov 2025 12:32:20 +0000

OK, this was so much fun. Google's Gemini 3 is amazing. I just got it to redesign my home page. I was having fun with this one, so I had no real idea what I wanted and was just vibing along. It gave me a matrix-style hero component which blew me away. This is so cool, and the fact that I can spend less than an hour to improve my personal site is insane.

I used the goose coding agent for this one, which is open source and free. I just put in my Gemini API key, which is still on a free trial, so my total cost for having fun was zero.

I was quite impressed that when I gave Goose the link to an image, it just downloaded it and added it to the public folder of my site. One less tedious task for me to do.

Towards the end I had the crazy idea of creating 7 hero component designs that change when you refresh the page. Why? Cause it's cool. This is maybe not how you build production apps but it sure is great for prototyping and getting to learn how new tools work and improving your communication with AI Agents and LLMs.

I encourage you all to take time out of your day and play around. Build a personal site even if you never deploy it. Improve your personal site and modify the design just for fun. Have fun, cause Gemini 3 is pretty amazing and the tools we have available to us right now are insane.

And of course don't forget to run the Playwright healer agent after you have changed your design so your tests are updated. All it takes is a prompt. I didn't show it in this video but check out my other videos on Playwright Agents.

Have fun and happy vibe coding.

Links:
Goose: AI Coding Agent: https://block.github.io/goose/
Gemini 3: https://blog.google/products/gemini/gemini-3/

Playwright MCP Servers Explained: Automation and Testing

Debbie O'Brien — Mon, 17 Nov 2025 13:10:33 +0000

Did you know Playwright has two MCP servers? Yes, it's kind of confusing, so let me explain. The Playwright MCP server is great for browser automation, such as filling out forms. LLMs can also use it to verify their work by opening the browser and taking a page snapshot to check that they actually implemented what they said they did. It is built into the GitHub Copilot coding agent, so if you assign a PR to Copilot it will use Playwright, which you can see in the session logs. It is very cool indeed.

Then we have another Playwright MCP server called Playwright Test MCP, which is built into Playwright Test and is for, yes you guessed it, testing. It has some of the same tools as the Playwright MCP server, but it also has others that you only need when testing. It starts running when you use the Playwright Agents: Planner, Generator, and Healer. However, this MCP server only supports TypeScript/JavaScript for now.

So depending on your needs, you can use one MCP server or the other. You need to install the Playwright MCP server yourself, while the Playwright Test MCP server is set up when you run the following npx command with the latest version of Playwright:

npx playwright init-agents --loop=vscode

The MCP server is installed for you, and it doesn't matter what other MCP servers you have, as the agent will only use the tools that have been assigned to it.
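To picture why other MCP servers don't get in the way, here is a rough mental model in TypeScript. This is not Playwright's implementation, and the types, the `server/tool` naming scheme, and the tool lists are all assumptions made for illustration. The idea is simply that an agent carries an explicit list of tools, and anything not on that list is ignored.

```typescript
// Illustrative model only: these types, server names, and the "server/tool"
// naming scheme are hypothetical, not Playwright's actual data structures.
interface Tool {
  server: string;
  name: string;
}

interface AgentDefinition {
  name: string;
  tools: string[]; // Tools assigned to this agent, written as "server/tool".
}

// Return only the tools that the agent has been explicitly assigned.
function toolsForAgent(agent: AgentDefinition, available: Tool[]): Tool[] {
  const allowed = new Set(agent.tools);
  return available.filter((tool) => allowed.has(`${tool.server}/${tool.name}`));
}

// Everything exposed by every MCP server that happens to be configured.
const available: Tool[] = [
  { server: "playwright", name: "browser_navigate" },
  { server: "playwright-test", name: "test_run" },
  { server: "some-other-server", name: "search" },
];

const healer: AgentDefinition = {
  name: "healer",
  tools: ["playwright-test/test_run"],
};

// Only the assigned tool survives the filter, whatever else is installed.
console.log(toolsForAgent(healer, available).map((tool) => tool.name));
```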

Check out the docs for more info on how to get started. Have fun and happy testing with Playwright MCPs.

https://playwright.dev/docs/test-agents
https://github.com/microsoft/playwright-mcp
