What Salesforce's 20,000 AI Agent Deployments Teach a Solo Builder

#aiagents #agentdeployment #agentreliability #agentguard

Salesforce has shipped around 20,000 Agentforce deployments. ByteByteGo published a writeup of what they learned, sourced to John Kucera, the CPO of Agentforce. I run a one-person agent fleet, which is about as far from Salesforce scale as you can get. The lessons still translate. Better than I expected, actually.

Short version: 90% of agent work happens after launch, not before. The failures cluster into three patterns: putting deterministic logic inside an LLM loop, prompting harder instead of encoding policy in code, and feeding the model far too much context. All three are engineering problems, not model problems.

Why does 90% of agent work happen after launch?

Traditional software front-loads the effort. You spec, build, test, then go live and mostly maintain. Agents invert that. Modern tooling gets you a functional demo in hours, and that speed creates false confidence. The demo covers the typical cases. Production brings edge cases, ambiguous phrasing, and questions that cross domains your agent never saw in testing.

I have lived a small version of this. Every agent I run looked done on day one. The real work was the weeks after: the input that arrived in a format I never tested, the API that returned something half-empty, the task that technically succeeded while producing nothing useful. If you budget your effort assuming launch is the finish line, you will abandon the agent right when the actual work starts.

Salesforce's advice here is blunt: do not boil the ocean. Start with one narrow, high-value use case so your iteration cycles stay fast. At solo scale that means one agent, one job, one queue. Get it boring before you add the second one.

What are the three anti-patterns that degrade agents?

First: over-reasoning deterministic workflows. If you can flowchart the logic, it belongs in code. Salesforce built Agent Script, a TypeScript framework that mixes deterministic control flow with LLM reasoning, because asking a model to re-derive an if-else chain on every run is slow, expensive, and occasionally wrong. You do not need their framework. You need the rule: flowchart it, then script it. Save the model for the parts that are genuinely ambiguous.
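
Here is a minimal sketch of "flowchart it, then script it", assuming a support ticket arrives as a dict. The field names (`intent`, `order_id`, `text`), the action names, and the `ask_model` callable are all hypothetical stand-ins, not anything from Salesforce's framework.

```python
from typing import Callable


def route_ticket(ticket: dict, ask_model: Callable[[str], str]) -> str:
    """Run the flowchartable rules in code; fall back to the model only when none apply."""
    # Deterministic branches: same answer on every run, zero tokens spent.
    if ticket.get("intent") == "order_status" and ticket.get("order_id"):
        return "lookup_order"
    if ticket.get("intent") == "password_reset":
        return "send_reset_link"
    # Genuinely ambiguous input: this is the only path that reaches the model.
    return ask_model(ticket.get("text", ""))


def fake_model(text: str) -> str:
    # Hypothetical stand-in for a real LLM call, so the sketch runs offline.
    return "escalate_to_human"


print(route_ticket({"intent": "order_status", "order_id": "A1"}, fake_model))
print(route_ticket({"text": "my thing is weird, help"}, fake_model))
```

The model call is the last line of the function, not the first. Anything the rules can decide never costs a token.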

Second: prompting harder instead of encoding policies. Writing NEVER and ALWAYS in caps does not reliably constrain a model. Salesforce found business rules have to execute independently of model reasoning. This one matters most for small shops, because prompting harder is free and feels like progress. If a rule actually matters, enforce it in code that runs whether or not the model cooperates. A refund cap belongs in the payment function, not in paragraph four of the system prompt.
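
The refund cap example could look something like this. It is a sketch under assumptions: the cap value, the `issue_refund` function, and the `PolicyViolation` exception are hypothetical, and a real payment call would replace the returned dict.

```python
from decimal import Decimal

REFUND_CAP = Decimal("100.00")  # hypothetical policy value, set by the business, not the prompt


class PolicyViolation(Exception):
    """Raised when a tool call breaks a business rule, whatever the model asked for."""


def issue_refund(order_id: str, amount: Decimal) -> dict:
    # The check lives inside the tool the agent calls, so it runs on every call.
    if amount <= 0:
        raise PolicyViolation("refund amount must be positive")
    if amount > REFUND_CAP:
        raise PolicyViolation(f"refund {amount} exceeds cap {REFUND_CAP}")
    # A real implementation would call the payment provider here.
    return {"order_id": order_id, "refunded": str(amount)}


print(issue_refund("A1", Decimal("25.00")))
```

The agent can ask for a 250.00 refund as persuasively as it likes. The function raises, and the caller decides whether that becomes an escalation.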

Third: poor context engineering. One e-commerce team in the writeup cut an order API response from 100K tokens to 2K by returning only the relevant fields. The agent got faster and more accurate at the same time. That is the detail worth tattooing somewhere: less context made it better, not just cheaper. Dumping a whole API response into the prompt is the default, and the default is wrong.
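
A rough sketch of that kind of trimming, assuming the order API returns a nested dict. The field names are invented for illustration; the writeup does not say which fields the team kept.

```python
import json

# Hypothetical allow-list: only what the agent needs to answer "where is my order?"
RELEVANT_FIELDS = ("order_id", "status", "eta")


def trim_order(raw: dict) -> dict:
    """Keep an explicit allow-list of fields instead of passing the whole response along."""
    slim = {key: raw[key] for key in RELEVANT_FIELDS if key in raw}
    # Line items get the same treatment: name and quantity only.
    slim["items"] = [
        {"name": item.get("name"), "qty": item.get("qty")}
        for item in raw.get("items", [])
    ]
    return slim


raw_response = {
    "order_id": "A1",
    "status": "shipped",
    "eta": "2024-07-01",
    "items": [{"name": "mug", "qty": 2, "sku": "X-99", "warehouse_notes": "aisle 4"}],
    "audit_log": ["created", "paid", "picked", "packed"],
}
# This string, not the raw response, is what goes into the prompt.
print(json.dumps(trim_order(raw_response)))
```

An allow-list fails safe: when the API grows a new field, it stays out of the prompt until you decide it belongs there.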

How do you know an agent is actually working?

Salesforce measures Agentic Work Units, meaning actual task completion. For support agents they track containment rate: cases resolved without human follow-up. Outcomes, not activity.
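
Containment rate is simple arithmetic once each case carries the right two facts. A small sketch, assuming each case is a dict with hypothetical `resolved` and `human_followup` flags:

```python
def containment_rate(cases: list) -> float:
    """Share of all cases that were resolved with no human follow-up."""
    if not cases:
        return 0.0
    contained = sum(
        1 for case in cases if case["resolved"] and not case["human_followup"]
    )
    return contained / len(cases)


sample_cases = [
    {"resolved": True, "human_followup": False},   # contained
    {"resolved": True, "human_followup": True},    # resolved, but a human stepped in
    {"resolved": False, "human_followup": True},   # escalated
    {"resolved": True, "human_followup": False},   # contained
]
print(containment_rate(sample_cases))
```

Note what the function ignores: how many messages the agent sent, how many tools it called, how long it ran. Only the outcome counts.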

I learned a version of this the hard way. A scheduled agent can exit zero every night and produce nothing. Green checks lie. The fix is to check the declared output, not the exit code. Did the file appear? Did the post go live? Did the ticket close? Whatever your equivalent of containment rate is, measure that.
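
For the "did the file appear" case, the check can be a few lines run after the job. This is a sketch, assuming the agent declares one output file; the function name and thresholds are mine, and a post or ticket would need the equivalent API lookup instead.

```python
import time
from pathlib import Path


def output_is_fresh(path: Path, max_age_seconds: float, min_bytes: int = 1) -> bool:
    """True only if the declared output exists, is non-empty, and was written recently."""
    if not path.is_file():
        return False
    stat = path.stat()
    age_seconds = time.time() - stat.st_mtime
    # A stale file from last week or an empty file both count as "produced nothing".
    return stat.st_size >= min_bytes and age_seconds <= max_age_seconds
```

A nightly job would call something like `output_is_fresh(Path("report.md"), max_age_seconds=3600)` after the run and alert on `False`, regardless of what the exit code said.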

Their post-launch triage is also worth stealing. Issues get split four ways: tone or brand drift means fix the prompts, logic errors mean fix the tools or convert that step to a script, data quality problems get routed to whoever owns the source, and coverage gaps mean expand scope or escalate cleanly. Four buckets, four different fixes. Most solo builders treat every failure as a prompt problem. Most failures are not.
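
If you log failures with a bucket attached, the "not a prompt problem" pattern shows up in a tally. A small sketch; the enum and mapping restate the four buckets above, and the sample log is made up.

```python
from collections import Counter
from enum import Enum


class Bucket(Enum):
    TONE = "tone or brand drift"
    LOGIC = "logic error"
    DATA = "data quality"
    COVERAGE = "coverage gap"


# The fix for each bucket, as described in the triage above.
FIX = {
    Bucket.TONE: "fix the prompts",
    Bucket.LOGIC: "fix the tools or convert the step to a script",
    Bucket.DATA: "route to whoever owns the source",
    Bucket.COVERAGE: "expand scope or escalate cleanly",
}


def summarize(failures: list) -> list:
    """Return (bucket, fix, count) tuples, most frequent bucket first."""
    counts = Counter(failures)
    return [(bucket.value, FIX[bucket], n) for bucket, n in counts.most_common()]


# Made-up failure log for illustration only.
failure_log = [Bucket.LOGIC, Bucket.DATA, Bucket.LOGIC, Bucket.TONE, Bucket.LOGIC]
for row in summarize(failure_log):
    print(row)
```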

What does this mean if you're not Salesforce?

Salesforce has platform teams to absorb the post-launch 90%. You have you. That changes the build order, not the lessons.

Move deterministic logic out of the loop first. It is the cheapest win: fewer tokens, fewer surprises, faster runs. Then encode your real rules as code-level checks the model cannot talk its way past. Then cut your context down to what the task needs. Each of these makes the after-launch grind smaller, which at solo scale is the difference between a fleet you maintain and a fleet that quietly rots.

And put hard runtime limits on every agent before it touches production. The deployments in the writeup degraded in ways nobody predicted in the demo, and at 20,000 deployments Salesforce can eat the bad days. One runaway retry loop on your side is your whole margin. That is the exact surface I built AgentGuard for: per-agent budget caps, token limits, and rate limits enforced at runtime, not in the prompt. It is a single pip install (agentguard) and takes minutes to wire in. Start there: https://bmdpat.com/tools/agentguard
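
To make "enforced at runtime" concrete, here is a generic sketch of the idea in plain Python. This is not AgentGuard's API; `RunGuard`, `BudgetExceeded`, and every parameter name are hypothetical, written only to show the shape of a budget cap, a token limit, and a rate limit that live outside the prompt.

```python
import time
from typing import Optional


class BudgetExceeded(RuntimeError):
    """Raised when an agent run crosses a hard limit."""


class RunGuard:
    def __init__(self, max_cost_usd: float, max_tokens: int, max_calls_per_minute: int):
        self.max_cost_usd = max_cost_usd
        self.max_tokens = max_tokens
        self.max_calls_per_minute = max_calls_per_minute
        self.spent_usd = 0.0
        self.tokens_used = 0
        self._call_times = []  # timestamps of calls inside the last 60 seconds

    def check_rate(self, now: Optional[float] = None) -> None:
        """Call before each model request; raises if the per-minute cap is already used up."""
        now = time.monotonic() if now is None else now
        self._call_times = [t for t in self._call_times if now - t < 60.0]
        if len(self._call_times) >= self.max_calls_per_minute:
            raise BudgetExceeded("rate limit reached")
        self._call_times.append(now)

    def record(self, tokens: int, cost_usd: float) -> None:
        """Call after each model response; raises once a cumulative cap is crossed."""
        self.tokens_used += tokens
        self.spent_usd += cost_usd
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded("token limit exceeded")
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded("budget cap exceeded")
```

Wrap every model call in `check_rate()` before and `record()` after, and a runaway retry loop hits an exception instead of your card.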

Originally published on bmdpat.com. I run a one-person AI agent company and write about what actually works.

Top comments (2)

Mallory Haigh • Jun 30

The three anti-patterns you've described are all pointing at the same core issue: these are platform problems, not prompt problems. Deterministic logic in code, policies enforced outside the model, context shaped before it hits the agent...this is infrastructure design, not fiddling with agents or tuning the model. The reason 90% of the work ends up happening post-launch is that most teams ship an agent with no stable platform underneath it, then spend months hand-building what the platform should have provided from day one. What Salesforce figured out at 20,000 deployments, solo builders and enterprise teams are both learning the same way: production pain teaches (painful) lessons. I'd say the methodology that makes this systematic rather than reactive is platform engineering, adapted for the agentic world.

Joel Horvath • Jun 18

We’re no longer just “building software” — we’re designing systems that decide what to trust before they produce output.
That’s why:
context is the real cost
demos fail in production
agents break after launch
code vs AI boundaries matter more than models
The winner isn’t who builds faster — it’s who controls uncertainty better.