A GitHub repository called design.md has been trending recently, accumulating over 1,400 stars. The concept is straightforward: provide AI agents with a persistent design document they can reference throughout their work.
This approach addresses a practical challenge in agent development that many teams encounter.
The Context Challenge
When working on complex tasks, AI agents need to understand the broader picture. What's the architecture? What constraints exist? What approaches have been tried before?
Typically, agents get context from:
Current conversation (limited window)
Code comments (often outdated)
Documentation (if it exists)
The issue is that this context is fragmented and temporary. As the conversation moves forward, earlier context falls out of the window. When documentation is outdated, agents make incorrect assumptions.
A design.md provides a single source of truth that persists across sessions.
What Belongs in design.md
An effective design.md answers these questions:
- What are we building?
Beyond feature lists, document the core purpose. Why does this project exist? What problem does it solve?
- What are the key architectural decisions?
Document major choices and their rationale:
"PostgreSQL was chosen over MongoDB because ACID guarantees are required for financial transactions"
"Microservices architecture was adopted because components have different scaling requirements"
- What constraints exist?
List technical constraints (performance requirements, browser support), business constraints (budget, timeline), and regulatory constraints (GDPR, HIPAA).
- What has been tried before?
Document failed approaches to prevent agents from suggesting rejected solutions.
- What are the current challenges?
Listing known issues, technical debt, and areas needing improvement helps agents prioritize work.
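One way to keep these five questions answered is to check the document for a heading per question. The sketch below is only an illustration: the section titles in `REQUIRED_SECTIONS` are hypothetical names chosen for this example, not a format prescribed by the design.md repository, and it assumes the file uses Markdown `#` headings.

```python
import re

# Hypothetical section titles mirroring the five questions above;
# rename them to match the headings your own design.md uses.
REQUIRED_SECTIONS = [
    "Purpose",
    "Architectural Decisions",
    "Constraints",
    "Rejected Approaches",
    "Current Challenges",
]


def missing_sections(design_text: str) -> list[str]:
    """Return the required section titles that have no Markdown heading."""
    # Collect the text of every ATX heading (a line starting with 1-6 '#').
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*$", design_text, re.MULTILINE
        )
    }
    return [t for t in REQUIRED_SECTIONS if t.lower() not in headings]


example = """# design.md
## Purpose
Ledger service for small-business invoicing.
## Architectural Decisions
PostgreSQL over MongoDB: ACID guarantees are required.
## Constraints
GDPR applies to all customer records.
"""

# Reports the two sections the example document has not written yet.
print(missing_sections(example))
```

A check like this could run in CI so that a missing section is noticed before an agent relies on the document.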
How Agents Use design.md
When starting a task, agents can:
Read design.md to understand context
Make decisions aligned with documented architecture
Avoid solutions violating constraints
Reference design.md in reasoning
This leads to more coherent and consistent work. Agents work within a broader framework rather than just reacting to immediate tasks.
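The steps above can be sketched as a small prompt builder that puts the design document in front of every task. This is a minimal illustration, not the API of any particular agent framework: `load_design`, `build_task_prompt`, and the prompt wording are all assumptions made for this example.

```python
from pathlib import Path

# Hypothetical fallback text, so a missing file is visible to the agent
# instead of silently producing a prompt with no context.
MISSING_NOTICE = "(design.md not found; ask before making architectural decisions)"


def load_design(path: str = "design.md") -> str:
    """Read the design document, or return a visible notice if it is absent."""
    file = Path(path)
    return file.read_text(encoding="utf-8") if file.is_file() else MISSING_NOTICE


def build_task_prompt(task: str, design_text: str) -> str:
    """Place the design document ahead of the task so it frames every decision."""
    return (
        "Project design document (treat as the source of truth):\n"
        f"{design_text.strip()}\n\n"
        "Task:\n"
        f"{task.strip()}\n\n"
        "Name the design.md section that supports each architectural choice."
    )


prompt = build_task_prompt(
    "Add an endpoint for refunding an invoice.",
    "## Constraints\nAll money movements must run inside a PostgreSQL transaction.",
)
print(prompt)
```

In a real session the second argument would come from `load_design()`, so the same document is re-read at the start of every task rather than carried in conversation history.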
Keeping design.md Updated
The main risk with design.md is that it becomes outdated. Effective practices include:
Make it part of the workflow
Update design.md immediately when making significant architectural decisions. Waiting until "later" means it never happens.
Version control it
Keep design.md in the repository. During PR reviews, check if design.md needs updating.
Review it regularly
Schedule periodic reviews (monthly or quarterly) to ensure the document reflects current reality.
Let agents help
Agents can assist in maintaining design.md by:
Suggesting updates when noticing inconsistencies
Summarizing changes from recent commits
Flagging outdated information
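Flagging outdated information can start with something much simpler than an agent: a check that a change touching architecture-sensitive files also touches design.md. The sketch below assumes you already have the list of changed paths for a PR; the glob patterns in `ARCHITECTURE_PATTERNS` are hypothetical and would need to match your own repository layout.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for files whose changes usually reflect an
# architectural decision; adjust them for your own repository.
ARCHITECTURE_PATTERNS = [
    "migrations/*",
    "docker-compose.yml",
    "services/*/api/*",
]


def needs_design_review(
    changed_files: list[str],
    design_file: str = "design.md",
    patterns: list[str] = ARCHITECTURE_PATTERNS,
) -> list[str]:
    """Return the architecture-sensitive files changed without a design.md update."""
    if design_file in changed_files:
        return []  # the author already touched the design document
    return [
        path
        for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


flagged = needs_design_review(
    ["migrations/0042_add_ledger.sql", "README.md", "services/billing/api/routes.py"]
)
print(flagged)
```

A non-empty result is a prompt for a human or an agent to decide whether design.md needs a new entry; it does not prove the document is stale.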
Observability in Agent Workflows
Even with a good design.md, observing what agents actually do is important. This is particularly relevant for GUI agents interacting with complex interfaces.
Consider a GUI agent tasked with "fill out this form and submit it". The agent needs to:
Locate form fields
Enter correct data
Handle validation errors
Submit the form
Verify success
Each step can fail in different ways. Without observability, only the final result is visible: success or failure. The reason for failure remains unknown.
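To make the failing step visible, each step can be run through a small wrapper that records its outcome. The sketch below uses stub actions in place of real browser automation; `StepResult`, `run_steps`, and the simulated validation failure are all hypothetical names and behavior introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    name: str
    ok: bool
    detail: str = ""


def run_steps(steps: list[tuple[str, Callable[[], None]]]) -> list[StepResult]:
    """Run steps in order, stopping at the first failure and recording why."""
    results: list[StepResult] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Keep the exception type and message so the cause is not lost.
            results.append(StepResult(name, False, f"{type(exc).__name__}: {exc}"))
            break
        results.append(StepResult(name, True))
    return results


def fail_validation() -> None:
    """Stub that simulates the form rejecting the entered data."""
    raise ValueError("email field rejected")


form_steps = [
    ("locate form fields", lambda: None),
    ("enter data", lambda: None),
    ("handle validation errors", fail_validation),
    ("submit form", lambda: None),
    ("verify success", lambda: None),
]

results = run_steps(form_steps)
for result in results:
    status = "ok" if result.ok else f"FAILED ({result.detail})"
    print(f"{result.name}: {status}")
```

Instead of a bare "failure", the output shows that the first two steps succeeded and that the run stopped at validation, along with the reason.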
Building Observable Workflows
Good observability includes:
- Step-by-step logging
Record each action:
What was observed (screenshots, DOM state)
What decision was made
What actually happened
Whether it matched expectations
- Performance metrics
Track:
Success rate per task type
Average steps to completion
Time per step
Failure modes
- Error categorization
When things go wrong, categorize errors:
Perception errors (agent didn't see the right element)
Decision errors (agent chose wrong action)
Execution errors (action failed due to external factors)
This data helps identify where improvements are needed.
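The metrics and error categories above can be combined into one record per task run and aggregated by task type. This is a minimal sketch under stated assumptions: the `TaskRun` fields, the `ErrorKind` enum, and the sample data are made up for illustration and do not come from any specific logging library.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    PERCEPTION = "perception"  # agent did not see the right element
    DECISION = "decision"      # agent chose the wrong action
    EXECUTION = "execution"    # action failed due to external factors


@dataclass
class TaskRun:
    task_type: str
    steps: int       # number of steps taken; assumed to be at least 1
    seconds: float   # total wall-clock time for the run
    error: Optional[ErrorKind] = None  # None means the run succeeded


def summarize(runs: list[TaskRun]) -> dict[str, dict]:
    """Aggregate success rate, step counts, timing, and failure modes per task type."""
    by_type: dict[str, list[TaskRun]] = defaultdict(list)
    for run in runs:
        by_type[run.task_type].append(run)

    summary = {}
    for task_type, group in by_type.items():
        total_steps = sum(r.steps for r in group)
        summary[task_type] = {
            "success_rate": sum(r.error is None for r in group) / len(group),
            "avg_steps": total_steps / len(group),
            "avg_seconds_per_step": sum(r.seconds for r in group) / total_steps,
            "failure_modes": dict(
                Counter(r.error.value for r in group if r.error is not None)
            ),
        }
    return summary


# Made-up sample data for illustration only.
runs = [
    TaskRun("form_fill", steps=6, seconds=12.0),
    TaskRun("form_fill", steps=8, seconds=20.0, error=ErrorKind.PERCEPTION),
    TaskRun("form_fill", steps=10, seconds=24.0, error=ErrorKind.PERCEPTION),
    TaskRun("search", steps=4, seconds=6.0),
]
summary = summarize(runs)
print(summary["form_fill"]["failure_modes"])
```

In this sample, both form-filling failures are perception errors, which points at element detection rather than decision logic as the place to look first.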
Systematic Benchmarking
CUA Benchmark provides systematic observability through:
100 test cases across 5 different web applications
Standardized task definitions
Automated result verification
Detailed performance metrics
Running agents against CUA Benchmark provides quantitative data:
Overall success rate
Success rate by task type
Average steps per task
Common failure points
This data is valuable for iterative improvement. Instead of guessing what to optimize, specific areas where agents struggle can be identified and addressed.
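The general shape of such a harness (standardized task definitions plus automated verification) can be sketched in a few lines. To be clear, this is not CUA Benchmark's actual interface: `BenchmarkTask`, `run_benchmark`, and the toy agent are hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchmarkTask:
    task_id: str
    task_type: str
    instruction: str
    verify: Callable[[str], bool]  # automated check on the agent's final output


def run_benchmark(agent: Callable[[str], str], tasks: list[BenchmarkTask]) -> dict:
    """Run every task through the agent and verify each result automatically."""
    failed: list[str] = []
    for task in tasks:
        try:
            ok = task.verify(agent(task.instruction))
        except Exception:
            ok = False  # a crash counts as a failed task, not a failed benchmark
        if not ok:
            failed.append(task.task_id)
    total = len(tasks)
    return {
        "success_rate": (total - len(failed)) / total if total else 0.0,
        "failed": failed,
    }


def toy_agent(instruction: str) -> str:
    """Stand-in for a real agent: it just upper-cases the instruction."""
    return instruction.upper()


tasks = [
    BenchmarkTask("t1", "echo", "hello", verify=lambda out: out == "HELLO"),
    BenchmarkTask("t2", "math", "add 40 and 2", verify=lambda out: "42" in out),
]

report = run_benchmark(toy_agent, tasks)
print(report)
```

Because every task carries its own verifier, the same task list can be replayed against each new agent version and the failed task IDs compared directly.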
A Practical Example: Mano-AFK
Mano-AFK is an open-source autonomous application builder that demonstrates these principles. The workflow includes:
Receiving a natural language description of what to build
Generating a PRD (Product Requirements Document)
Writing the code
Deploying to a test environment
Running tests (lint, API, E2E)
Auto-fixing any issues
Delivering the final application
Throughout this process, the agent references rules.md and preferences.md files to maintain consistency across projects. These files provide persistent context that guides decisions.
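The "run tests, then auto-fix" part of a workflow like this is essentially a bounded retry loop. The sketch below is a generic illustration of that pattern and is not taken from Mano-AFK's source: `check_and_fix`, `max_fixes`, and the demo callbacks are hypothetical.

```python
from typing import Callable


def check_and_fix(
    run_checks: Callable[[], list[str]],
    apply_fix: Callable[[list[str]], None],
    max_fixes: int = 3,
) -> tuple[bool, int]:
    """Re-run checks after each fix attempt; return (passed, fixes_applied)."""
    fixes = 0
    failures = run_checks()
    # The cap keeps an agent from looping forever on a problem it cannot solve.
    while failures and fixes < max_fixes:
        apply_fix(failures)
        fixes += 1
        failures = run_checks()
    return (not failures, fixes)


def make_demo() -> tuple[Callable[[], list[str]], Callable[[list[str]], None]]:
    """Build stub callbacks: two known problems, one resolved per fix attempt."""
    problems = ["lint: unused import", "e2e: login button not found"]

    def run_checks() -> list[str]:
        return list(problems)

    def apply_fix(failures: list[str]) -> None:
        problems.remove(failures[0])

    return run_checks, apply_fix


print(check_and_fix(*make_demo()))               # both problems fixed within the cap
print(check_and_fix(*make_demo(), max_fixes=1))  # cap reached with one problem left
```

Returning the number of fixes applied alongside the pass/fail flag gives the observability layer one more signal: tasks that pass only after several fix rounds are worth inspecting.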
CUA Benchmark results for Mano-AFK:
W8A16 quantization: 58.0% accuracy
W8A8 quantization (Cider): 54.0% accuracy, but faster inference (~1,453 tok/s prefill)
These numbers show that the W8A8 version is 4.0 percentage points less accurate (54.0% vs. 58.0%) in exchange for faster inference. Depending on the use case, one might be preferred over the other.
Without systematic benchmarking, this data wouldn't exist. Only vague impressions like "it works sometimes" or "it's kind of slow" would remain.
Practical Recommendations
When building agent workflows, these practices have proven effective:
- Start with design.md
Before writing agent code, document architecture, constraints, and key decisions. This document guides both human developers and AI agents.
- Build observability from day one
Don't add logging later. Design agent workflows to be observable from the start. Every step should produce some form of output that can be inspected.
- Use benchmarks
Establish a benchmark suite early. Run it regularly. Track metrics over time. This provides objective data on whether changes are improvements or regressions.
- Iterate based on data
When low success rates are observed, examine failure modes. Instead of making broad changes, identify specific failure patterns and address them directly.
- Keep context persistent
Whether through design.md, rules.md, or other mechanisms, ensure agents have access to persistent context. Conversation history is too ephemeral for complex projects.
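The "Use benchmarks" and "Iterate based on data" recommendations above come down to comparing runs over time. A minimal sketch of that comparison follows; the `compare_runs` function, the 2-point tolerance, and the sample success rates are assumptions chosen for illustration, not measured results.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, str]:
    """Label each shared task type as an improvement, a regression, or unchanged."""
    verdicts = {}
    for task_type in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[task_type] - baseline[task_type]
        if delta > tolerance:
            verdicts[task_type] = "improvement"
        elif delta < -tolerance:
            verdicts[task_type] = "regression"
        else:
            # Small differences are treated as noise rather than a real change.
            verdicts[task_type] = "unchanged"
    return verdicts


# Made-up success rates per task type for two benchmark runs.
baseline = {"form_fill": 0.60, "search": 0.80, "checkout": 0.50}
candidate = {"form_fill": 0.70, "search": 0.79, "checkout": 0.40}

verdicts = compare_runs(baseline, candidate)
print(verdicts)
```

Breaking the comparison down by task type matters here: the overall average barely moves in this sample, while the per-type view shows one clear gain and one clear regression.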
Moving Forward
AI agent engineering is still in early stages. Best practices are still being figured out, but two things are becoming clear:
Agents need persistent context to do good work (design.md)
Agent workflows need systematic observability to improve (benchmarks and logging)
These aren't advanced techniques. They're foundational practices that make everything else work better.
If you're interested in seeing these principles in action, Mano-AFK (https://github.com/Mininglamp-AI/Mano-AFK) is an open-source autonomous application builder that uses persistent context files and systematic benchmarking to improve agent reliability.
For those working on GUI agents, Mano-P (https://github.com/Mininglamp-AI/Mano-P) implements think-act-verify loops and online reinforcement learning, achieving a 58.2% success rate on the OSWorld benchmark (specialized models category).
Both projects are Apache 2.0 licensed. Stars and contributions are welcome.