AI agents write code fast. But "fast" doesn't mean "correct."
I build skillshare — a CLI that manages AI skills across 50+ tools like Claude Code, Cursor, and OpenCode. Over the past few months, I've been letting AI agents handle more of the development. The biggest lesson? The bottleneck isn't code generation. It's verification.
I kept manually re-running commands, checking files, eyeballing output to verify what the agent said was "done." So I built infrastructure to let the agent verify its own work.
Three layers, all agent-runnable
E2E (Docker Sandbox)
─────────────────────────
Integration (testutil.Sandbox)
─────────────────────────────────
Unit Tests (go test)
Nothing revolutionary. The key: every layer is something the AI agent can run on its own.
Integration tests create an isolated fake HOME per test — with .claude/, .cursor/, and the rest of the 50+ target directories. The agent runs a test and gets pass/fail in seconds. No side effects, no cleanup.
E2E tests run in a Docker devcontainer: a clean Debian image that simulates a real user's machine. make devc to start, make devc-reset if anything breaks. The agent can try anything without risk.
Runbooks: the thing I didn't expect
I started writing step-by-step test scripts mostly for my own documentation:
1. Run: skillshare install runkids/skillshare
2. Verify: skill directory exists in source
3. Run: skillshare list
4. Verify: output shows [tracked] badge
5. Run: skillshare sync
6. Verify: symlinks exist in all targets
Turns out this is exactly what AI agents need. Explicit steps + expected results = no ambiguity. The agent stops guessing and follows the script. The runbook became the spec.
Eating our own dog food
The part I'm most happy about — skillshare uses its own skill system to teach AI agents the workflow:
.skillshare/skills/
devcontainer/SKILL.md # How to use the container
cli-e2e-test/SKILL.md # How to run E2E tests
The devcontainer skill teaches the basics: always execute inside the container, use ssenv for HOME isolation, and edit on the host; docker exec picks up the changes instantly.
The E2E skill orchestrates the full flow: check container is running → detect relevant runbooks → execute with isolation → report results.
A skill management tool using its own skills to enable AI-driven development. When I update the workflow, I update the skill. Next time the agent picks up a task, it gets the latest instructions automatically.
The loop
Spec → Write code → Unit test → Integration test → Devcontainer E2E → Open PR
My job: write spec, write runbook, review PR.
Five things I'd tell someone starting out
- Invest in test isolation early — agents need a sandbox they can't break
- Write runbooks — explicit verification beats vague expectations
- Teach agents your workflow — via skills or context files, not just your API docs
- Use containers — cheaper than debugging "works on my machine"
- Add --json flags — structured output lets agents verify programmatically
Honest takeaway
AI agents are fast, confident, and wrong in new ways. I don't assume they write perfect code — I assume they can fix their own mistakes, as long as the feedback loop is fast enough.
Runbooks, sandboxes, and built-in skills are that feedback loop.
I want to hear from you
I've shared my approach, but I know there are many ways to tackle this. Some questions I'm still thinking about:
- How do you let AI agents verify their own work? Runbooks? Automated test suites? Something else entirely?
- What's your experience with AI agents in Docker/containers? Any gotchas I should know about?
- Do you write context files (CLAUDE.md, CONTEXT.md, skills) for your projects? What do you put in them?
If you're building CLI tools, developer tools, or anything where AI agents are part of the development workflow — drop a reply, quote post, or DM. I genuinely want to learn from what others are doing.
Still early in this journey. Let's figure it out together.
→ GitHub: https://github.com/runkids/skillshare
→ Full write-up: https://skillshare.runkids.cc/blog/e2e-testing-for-ai-agents