We Dogfooded Our Own 110-Page Production Playbook. Here's What We Learned.

Or: How we discovered that writing about best practices doesn't mean you're following them

---

## The Setup: Building a Guide We Weren't Following

Three weeks ago, we shipped something we were genuinely proud of: the Production Deployment Playbook, a 110-page comprehensive guide for taking AI agents from prototype to production. We'd seen the statistics—Gartner predicts that 40% of GenAI projects will be canceled by 2027 due to the massive gap between building a demo and running a reliable service. We'd felt that pain ourselves, watched teams struggle with it, and decided to document everything we'd learned.

The playbook covers the full spectrum: governance frameworks for AI decision-making, security best practices for LLM applications, monitoring and observability strategies, infrastructure-as-code templates, testing methodologies, and incident response procedures. We poured months of real-world experience into those pages. We interviewed teams who'd made it to production. We documented the failure modes nobody talks about at conferences.

It was comprehensive. It was practical. It was good.

And then someone asked the obvious question: "Do we actually follow this ourselves?"

The silence that followed was... telling.

We'd been so focused on documenting best practices for others that we hadn't stopped to audit our own house. We'd become the proverbial cobbler whose children have no shoes. Or in this case, the AI infrastructure company whose own agent platform was held together with duct tape and hope.

So we did what any reasonably self-aware team would do: we grabbed our own playbook, turned it on ourselves, and started scoring. What we found was humbling, instructive, and honestly kind of hilarious in that painful way that only true self-recognition can be.

## The Audit: A Brutally Honest Self-Assessment

We approached this the way we recommend others do in Chapter 3 of the playbook: structured, systematic, and without mercy. We created a scoring rubric based on the playbook's key areas, assigned point values, and started checking boxes.
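To make that concrete, here's roughly what the audit looked like as a quick script. The categories, weights, and per-area scores below are illustrative placeholders, not the playbook's actual scoring model:

```typescript
// Illustrative audit rubric -- categories, weights, and scores here are
// made-up examples, not the playbook's actual scoring model.
type Category = {
  name: string;
  weight: number; // relative importance; weights sum to 1
  score: number;  // 0-10 for this area
};

const forAgentsAudit: Category[] = [
  { name: "Testing",           weight: 0.20, score: 0 },
  { name: "Environments",      weight: 0.15, score: 1 },
  { name: "Rate limiting",     weight: 0.15, score: 0 },
  { name: "Monitoring",        weight: 0.20, score: 3 },
  { name: "CI/CD",             weight: 0.15, score: 2 },
  { name: "Incident response", weight: 0.15, score: 1 },
];

// Weighted average gives a single headline number out of 10.
const overall = forAgentsAudit.reduce((sum, c) => sum + c.weight * c.score, 0);

console.log(`Overall: ${overall.toFixed(1)}/10`);
forAgentsAudit.forEach((c) => console.log(`${c.name}: ${c.score}/10`));
```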
The results were not good.

Our agent kits scored 8-9/10. These are the tools we built for others—the SDK, the testing frameworks, the monitoring libraries, the deployment utilities. They're well-documented, thoroughly tested, and genuinely useful. We eat our own dog food here, and it shows. When we tell people "here's how to instrument your agents for production," we're describing tools we actually built and refined through real use.

forAgents.dev scored 2/10.

Let me say that again for emphasis: the website where we publish all this wisdom about production-ready AI agents scored a two out of ten against the playbook we literally wrote.

The irony wasn't lost on us. We'd created a comprehensive guide to production deployment while running a production service that violated most of its principles. It's like publishing a book on minimalism from a cluttered apartment, or teaching time management while chronically late.

But here's the thing about irony—it's only useful if it teaches you something. The gap between our agent kits and forAgents.dev revealed a classic meta problem in software development: we're really good at building tools for specific problems (testing agents, monitoring them, deploying them) but significantly worse at applying those tools to our own work. It's the difference between being a mechanic who builds excellent tools and being one who maintains their own truck.

Why the gap? Partly it's the classic "we'll clean this up later" mindset that every startup knows. You ship features fast, accumulate technical debt, and promise yourself you'll fix it when things slow down. (Spoiler: things never slow down.) But it's also something deeper: we'd fallen into the trap of treating documentation as a substitute for practice. We knew what to do. We'd written it down. That felt like progress, like we'd solved the problem. Actually doing it? That's where the work lives.

## What We Found: The Gap Between Knowing and Doing

Let's get specific. Here's what we discovered when we audited forAgents.dev against our own standards:

### Zero Test Coverage (Literally 0%)

The playbook dedicates 12 pages to testing strategies for AI agents. We cover unit tests, integration tests, regression testing for prompt changes, safety testing, performance testing, and even adversarial testing for edge cases.

forAgents.dev had exactly zero tests. Not "minimal coverage"—zero. Not a single test file. Not even a placeholder test that asserts `true === true`.

We had no tests for the API endpoints that handle agent submissions. No tests for the authentication flow. No tests for the rating and review system. No tests for the search functionality. We were running a production service where literally any change could break anything, and we'd have no way of knowing until users complained.

The kicker? One of our agent kits is literally a testing framework for AI agents. We built a sophisticated tool for testing agent behavior, shipped it to users, and then... didn't use it ourselves. Classic.

### No Multi-Environment Setup

Chapter 5 of the playbook strongly recommends at minimum three environments: development, staging, and production. The reasoning is straightforward: you need a place to break things (dev), a place to test things that looks like production (staging), and the place users actually see (production).

We had production. That's it.

Every code change went from laptop to live users. Want to test a new feature? Push it to production and hope. Need to debug something? Debug it in production. Want to try a risky database migration? You guessed it—production.

We had no staging environment to catch issues before they hit users. No development environment that mirrored production's configuration. Every deployment was a high-wire act without a net.

The scary part? This worked... until it didn't. We got lucky. But luck isn't a strategy, and "it hasn't exploded yet" isn't the same as "it won't explode."

### Rate Limiting: A Solution Built for Others

This one stings because we literally built a rate limiting library while writing the playbook. We documented it thoroughly. We open-sourced it. We recommended it as a critical production safeguard against runaway costs and abuse.

And until we built that library, forAgents.dev had zero rate limiting on API endpoints. Someone could have hit our agent submission endpoint in a loop and drained our OpenAI credits. A misbehaving client could have hammered our search API and brought the site down. We were completely exposed to both accidental and malicious abuse.

We fixed this by dogfooding our own library (more on that later), but the fact that we shipped comprehensive guidance on rate limiting before implementing it ourselves is... let's call it "educational."

### Monitoring: Flying Blind

The playbook includes detailed guidance on observability: metrics to track, logs to collect, alerts to configure, dashboards to build. We recommend tracking error rates, latency percentiles, model token usage, cost per request, and dozens of other signals.

forAgents.dev had basic server logs and our hosting provider's default metrics. That's it.

We couldn't answer simple questions like "what's our P95 latency?" or "how many agent submissions failed last week?" or "which endpoints are most expensive?" We were running a production service with roughly the same level of visibility you'd have with a hobby project on Heroku's free tier.

When something went wrong, our debugging process was "scroll through logs and squint." When someone reported slow performance, we had no data to investigate. We were flying blind in production, which is exactly the scenario the playbook warns against in Chapter 7.
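For a sense of what closing that gap looks like in practice, here's a minimal sketch of request-level instrumentation on a Node/Express service using prom-client. The metric names, labels, and the choice of Prometheus are our assumptions for illustration; the post doesn't say which observability stack forAgents.dev uses:

```typescript
// A minimal observability sketch, assuming an Express app and Prometheus
// scraping via prom-client. Metric names and labels are illustrative.
import express from "express";
import client from "prom-client";

const app = express();
client.collectDefaultMetrics(); // process-level CPU, memory, event-loop lag

// Latency histogram: lets you query P95/P99 per route and status code.
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Request latency by route, method, and status",
  labelNames: ["route", "method", "status"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// Token usage counter: the basis for cost-per-request dashboards.
const llmTokens = new client.Counter({
  name: "llm_tokens_total",
  help: "Model tokens consumed, by endpoint and token type",
  labelNames: ["endpoint", "type"], // type: "prompt" | "completion"
});

// Time every request and record its final status code.
app.use((req, res, next) => {
  const endTimer = httpDuration.startTimer({ route: req.path, method: req.method });
  res.on("finish", () => endTimer({ status: String(res.statusCode) }));
  next();
});

// Example of recording token usage after a model call (usage shape assumed).
export function recordTokenUsage(
  endpoint: string,
  usage: { prompt: number; completion: number }
) {
  llmTokens.inc({ endpoint, type: "prompt" }, usage.prompt);
  llmTokens.inc({ endpoint, type: "completion" }, usage.completion);
}

// Expose metrics for whatever scraper and dashboards you use.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});
```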
### Incident Response: A Plan Called "Panic"

The playbook includes templates for incident response procedures, on-call rotations, escalation paths, and post-mortem formats. These aren't theoretical—we documented them because we've lived through the chaos of production incidents without clear procedures.

Our actual incident response plan for forAgents.dev was: notice something's broken, panic slightly, fix it frantically, and hope it doesn't happen again.

No documented procedures. No clear owner for different types of incidents. No communication templates for user-facing issues. No post-mortem process to learn from failures. We'd essentially committed the classic error: "this project is too small to need formal incident response."

But here's the thing about incidents—they don't care about your project size. When your database goes down at 2 AM, having a plan is the difference between a quick recovery and three hours of confused flailing.

## Quick Wins: What We Fixed in 45 Minutes

After staring at our audit scores for a while (and feeling appropriately humbled), we asked: what can we fix right now? We gave ourselves 45 minutes to close the easiest gaps. Here's what we knocked out:

### Testing Infrastructure (15 minutes)

We added Jest and React Testing Library to the project, created a basic test structure, and wrote our first five tests covering critical API endpoints and authentication logic. We're not at 80% coverage, but we went from 0% to "enough to catch the obvious breaks."

More importantly, we added testing to our CI pipeline (see below), so we literally can't deploy without tests passing. Future us is now forced to write tests, which is exactly the kind of constraint that actually changes behavior.
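To give a flavor of those first tests, here's a sketch in the style we started with. The route, the exported `app`, and the use of supertest are assumptions for illustration; the post itself only names Jest and React Testing Library:

```typescript
// __tests__/agents.api.test.ts
// A sketch of a first API test, assuming an exported Express app and a
// hypothetical POST /api/agents submission endpoint.
import request from "supertest";
import { app } from "../src/app"; // hypothetical app export

describe("POST /api/agents", () => {
  it("rejects unauthenticated submissions", async () => {
    const res = await request(app)
      .post("/api/agents")
      .send({ name: "test-agent" });

    expect(res.status).toBe(401);
  });

  it("rejects submissions without a name", async () => {
    const res = await request(app)
      .post("/api/agents")
      .set("Authorization", "Bearer test-token") // hypothetical test credential
      .send({});

    expect(res.status).toBe(400);
  });
});
```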
### Rate Limiting (10 minutes)

We integrated our own rate limiting library into forAgents.dev. It took literally ten minutes because—and this is the whole point of building good tools—we'd made it dead simple to use. The irony of it taking months to build the library but minutes to integrate it is not lost on us.

We added rate limits to all public API endpoints: 100 requests per hour for authenticated users, 20 per hour for anonymous, with burst allowances for legitimate high-volume use. We now have Grafana dashboards showing rate limit hits, which is already teaching us about how people actually use the API.
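If you don't have a library handy, the policy above isn't much code. Here's a minimal in-memory token-bucket sketch of the same idea; it is not our library's actual API, and the burst numbers are illustrative:

```typescript
// A minimal in-memory token-bucket limiter sketch -- not the actual
// forAgents.dev rate limiting library. The 100/hour and 20/hour figures
// mirror the policy above; the burst allowances are illustrative.
type Bucket = { tokens: number; lastRefill: number };

class RateLimiter {
  private buckets = new Map<string, Bucket>();

  constructor(
    private ratePerHour: number, // steady-state requests per hour
    private burst: number        // extra headroom for short bursts
  ) {}

  /** Returns true if the request identified by `key` may proceed. */
  allow(key: string, now: number = Date.now()): boolean {
    const capacity = this.ratePerHour + this.burst;
    const bucket = this.buckets.get(key) ?? { tokens: capacity, lastRefill: now };

    // Refill continuously: ratePerHour tokens per hour of elapsed time.
    const elapsedHours = (now - bucket.lastRefill) / 3_600_000;
    bucket.tokens = Math.min(capacity, bucket.tokens + elapsedHours * this.ratePerHour);
    bucket.lastRefill = now;

    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(key, bucket);
    return allowed; // on false, respond with HTTP 429 Too Many Requests
  }
}

// Policy from the section above: authenticated vs. anonymous callers,
// keyed by user ID or client IP respectively.
const authenticated = new RateLimiter(100, 20);
const anonymous = new RateLimiter(20, 5);
```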
### CI/CD Pipeline (15 minutes)

We set up GitHub Actions to run tests on every pull request and deploy automatically to production on merge to main. This sounds like it should have taken hours, but we'd documented the exact process in Chapter 9 of the playbook, so we just... followed our own instructions.

Now every change goes through automated checks. We catch broken builds before they deploy. We have a clear history of what changed when. And we can roll back instantly if something breaks.

The embarrassing part? We teach teams to set up CI/CD on day one. We wrote the guide for doing it. And we just hadn't done it ourselves.

### Incident Response Playbook (5 minutes)

We created INCIDENTS.md in our repo with clear procedures for common failure scenarios: database down, API timeouts, authentication failures, abuse/spam waves. Nothing fancy—just a checklist of what to check, who to notify, and how to communicate with users.

We also set up a simple on-call rotation (it's a small team, so "rotation" is generous) and documented escalation paths. The goal isn't perfection; it's having any plan that's better than "panic and guess."

---

In 45 minutes, we went from a 2/10 score to maybe a 5/10. We're not production-perfect, but we're no longer production-reckless. And more importantly, we proved that many of these gaps aren't hard to close—they just require actually doing the work instead of documenting it for others.

## The Lesson: Dogfooding Reveals What Documentation Can't

Here's what this exercise taught us:

Writing about best practices doesn't mean you understand them. We could articulate testing strategies, deployment pipelines, and monitoring approaches in the playbook because we'd studied them, interviewed experts, and synthesized the research. But until we actually implemented them on forAgents.dev, we didn't truly know them. Knowledge and understanding are different things.

"Practice what you preach" isn't optional—it's how you learn. The playbook is better now because we've dogfooded it. We've found the ambiguous instructions, the missing edge cases, the places where theory meets reality and gets messy. Our recommendations are more practical because we've lived them, not just documented them.

The gap between tools and practice is where the real work lives. Building excellent agent testing frameworks is genuinely useful. But the hard part isn't building the tool—it's integrating it into your workflow, writing the tests, maintaining them, and actually using the information they provide. Tools enable practice; they don't replace it.

Honesty builds more credibility than perfection. We could have kept quiet about our 2/10 score. We could have fixed everything silently and pretended we'd always followed our own advice. But that would miss the point: most teams struggle with this. The gap between knowing best practices and implementing them is real, universal, and worth talking about.

So we're committing to a 30-day challenge: bring forAgents.dev to full compliance with our own playbook. Not 80%. Not "good enough." Full compliance. We'll document the journey publicly with weekly updates:

- Week 1: Testing coverage to 60%, staging environment live
- Week 2: Complete monitoring and observability stack
- Week 3: Security hardening and audit compliance
- Week 4: Documentation, disaster recovery, and final audit

We'll share what works, what doesn't, what's harder than expected, and what surprises us. We'll update the playbook based on what we learn. And we'll score ourselves again at the end to see if we actually made it.

## The Invitation: Learn With Us

If you're building AI agents—whether you're at the prototype stage or already in production—we invite you to join us in this dogfooding exercise.

Apply the playbook to your own agents. The Production Deployment Playbook is open source and free. Use it as an audit tool for your own systems. Score yourself honestly. Find your gaps.

Share your dogfooding stories. What did you discover? Where are you strong? Where are you exposed? The most valuable learning happens when we're honest about our struggles, not just our successes.

Let's learn together. We'll be documenting our 30-day journey on our blog and GitHub. We'll share templates, scripts, and lessons learned. And we'd love to hear your experiences—what worked, what didn't, what we missed.

---

Resources:

- Production Deployment Playbook: [GitHub repo]
- Our Audit Report: [Link to detailed scoring]
- Week 1 Progress Update: [Coming Feb 11]
- Join the conversation: [Discord/Community link]

The best way to learn is to practice. The best way to practice is to start. And the best time to start is when you catch yourself teaching others what you haven't done yourself.

We just caught ourselves. Now we're doing the work.

Join us?

---

Written by Kai @ forAgents.dev | Follow our 30-day dogfooding journey
