When your AI agent fails, it can be a trap to think you just need to fix the prompt. It's reflexive, and a bit like working with a new intern: if they don't understand your instructions the first time, it's normal to go back and try to explain what you want them to do in a better way before putting them through a full-on new training plan. Refining the way you're 'prompting' also gives you lots of quick wins early on, so EVERYONE involved feels more confident, but you hit a point of diminishing returns after a while.
This is the moment where re-framing the task instructions stops being 'enough'. With prompting for AI, you've probably used a prompt generator by this point, added some examples (and counter-examples), added some rules and constraints, and maybe even threatened to fire the agent in the prompt if you're really feeling spicy. But even after all this, it's totally possible that the agent will still get stuck on a little calendar widget, then lose its context and sanity halfway through a booking flow. Once there's a bit of momentum at play, it's so easy to just take the front-end dropdowns at face value when they look like they've been updated, even though the backend state hasn't caught up. This is when it's time for some practice drills.
The Gap Between Proof of Concept and Production
I was a mathematics major, so one of my little joys is reading the Amazon Science blog and keeping up with the research coming out of Amazon's research teams. Today, when I was absolutely not procrastinating on a reporting task, I was reading the blog and a post from Amazon's AGI Lab called "The unseen work of building reliable AI agents" caught my eye. In the post, the researchers describe "normcore agents": systems that excel at the monotonous interactions that are very boring but very crucial for reliable software.
Reading this blog changed from being procrastination to 'research' when I got to this line:
"Before an AI can plan a vacation, it must learn to scroll. Literally. It must learn how to scroll … and click … and tab … and select a date that's hidden behind a pop-up … and recover when a form silently resets."
This line feels so relevant to what I'm seeing developers struggle with when they're building their own agentic systems.
Asking an agent to "book my summer vacation" is a simple request that leads to a wild workflow with hundreds of itty bitty steps. The agent needs to wrangle airline reservation systems that were built when I was still in primary school, then it needs to deal with hotel systems which are consistently inconsistent, then there's payments (with currency conversions and regional taxes), and loyalty programs, and all sorts of compliance checks. A lot has to go right for you to be able to book, and it needs to be right every time.
"She'll be right, mate" is one of my favourite Australian-isms, and I say it to myself every time I'm going from dev to prod. Every time, I think it's going to be easy peasy, and everything SHOULD be ok, but then life happens. When things go wrong, it always feels like getting an acne breakout before a big event, even though you think you've done everything right. A big misconception (which the Amazon Science blog did a great job of unpacking for me) is that a new or updated prompt is always enough to fix the whole thing. It isn't.
The Gym Metaphor
Amazon's AGI Lab builds what they call "RL gyms", which are reinforcement learning environments where agents practice atomic behaviours:
"Just as an athlete builds core stability by repeating fundamental movements under controlled conditions, an agent develops reliability by practicing the smallest units of interaction in repeatable, instrumented scenarios."
This is the difference between bodybuilders with big 'show muscles' and people who are functionally fit. (I am in neither group, congrats if you're in one of them.) You can get puffed up with impressive muscles that look great in a demo on stage, but when it's time to actually help your friend move their fridge, it comes down to core strength and flexibility with movement that you can only really get by practicing, or walking the walk.
So a gym, in this context, is where you can 'isolate a skill, vary it, stress it, and measure it.' All of this results in what the Amazon Science researchers call an "agentic substrate": a foundation layer of basic skills and capabilities that agents can build on for domain-specific tasks.
The Three Workouts
The Amazon Science blog describes 3 workouts that show what agents actually need to practice, so they have the foundational skills they need.
Workout 1: Calendars
Calendars and booking systems are hard because they seem so simple in theory - we use them all the time. But once you get into recurring bookings every third Monday of the month, and then that Monday happens to be a public holiday, things get really annoying really quickly. There are also time zone changes, daylight savings start/end dates (which always trip me up, because I didn't ever have daylight savings growing up!), and different holidays observed in different locations. Automating meeting booking for a team spread across time zones can turn into a nightmare super easily.
On top of all that, you're dealing with widgets that can go haywire under zoom or hide behind other UI layers. Elements re-render mid-click. And don't get me started on how differently calendar components behave across browsers.
So what does the agent really need to learn? It needs to recognize a widget's current state, recover when it drifts, commit the correct date exactly once, and verify that the backend registered the change.
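To make that last bit concrete, here's a rough sketch of the 'commit once, then verify the backend' loop using Playwright. The selectors and the /api/bookings endpoint are made up for illustration - swap in whatever your target app and backend actually expose.

```python
# A minimal sketch of "commit once, then verify" using Playwright.
# The selectors and the example.com API endpoint are hypothetical.
import time
import requests
from playwright.sync_api import Page

def commit_date_and_verify(page: Page, booking_id: str, target_date: str) -> bool:
    # 1. Read the widget's current state before touching it.
    current = page.locator("#date-picker input").input_value()
    if current == target_date:
        return verify_backend(booking_id, target_date)  # already set, just confirm

    # 2. Commit the date exactly once.
    page.fill("#date-picker input", target_date)
    page.keyboard.press("Enter")

    # 3. Don't trust the UI: confirm the backend registered the change.
    return verify_backend(booking_id, target_date)

def verify_backend(booking_id: str, target_date: str, timeout_s: float = 10.0) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        record = requests.get(f"https://example.com/api/bookings/{booking_id}").json()
        if record.get("date") == target_date:
            return True   # backend agrees with what the UI showed
        time.sleep(0.5)   # give async processing a moment, then re-check
    return False          # the UI may look right, but the system state never caught up
```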
Workout 2: Dropdowns
"A dropdown menu might appear to have been updated before the backend has actually processed the change."
Once you start looking for this mismatch, you'll never unsee it. I've spotted it in enterprise apps, government systems, even slick consumer sites that should know better based on how lush their UI branding is. The agent can see a dropdown update and think "my work here is done" but the backend might still be processing, and the UI just lied to them.
Things can look fine on the surface - social media has taught us all this! - but the actual system state under the shiny UI tells us what's REALLY happening. Before you take the slick interface at face value, it's time to get a healthy amount of trust issues and dig in to check that the action was registered in the backend properly.
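Here's what those trust issues can look like in practice - a hedged sketch using Playwright's network waiting, where the selector, option value, and /api/preferences endpoint are all hypothetical stand-ins for your own app:

```python
# A sketch of "don't trust the dropdown": wait for the request that actually
# persists the change, then read the value back. URLs and selectors are made up.
from playwright.sync_api import Page

def select_and_confirm(page: Page, value: str) -> None:
    # Wait for the call that persists the selection, not just the visual
    # update of the <select> element.
    with page.expect_response(
        lambda r: "/api/preferences" in r.url and r.request.method == "PUT"
    ) as response_info:
        page.select_option("#currency-dropdown", value)

    response = response_info.value
    if not response.ok:
        # The UI already shows the new value, but the backend rejected it.
        raise RuntimeError(f"Backend refused the change: HTTP {response.status}")

    # Belt and braces: read the value back from the API before moving on.
    # (Assumes the page's session auth carries over to this request.)
    saved = page.request.get("https://example.com/api/preferences").json()
    assert saved.get("currency") == value, "UI and backend state have diverged"
```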
Workout 3: Async Endurance
Long workflows are brutal. When I'm doing a big spec task with Kiro, I get tired of watching the agent work and clicking 'next task', but then I always remember how much more tired I'd be if I was actually routing the requests or doing the work myself. Once you've got some async steps chained together - things like search, filter, validate, maybe refresh a few times - each one can have its own timing quirks. Text fields start to fight with autosuggest dropdowns that haven't finished loading. Sometimes the backend just... fails. And then the page looks loaded, but half the data is missing.
This is where agents "hit the wall." I know I would hit a wall too. They hit the context window limit - which for AI agents can happen when they have to dig through a lot of large code files across an enormous repo, or research lots of entries in a sprawling knowledge base. The agent just runs out of room to remember what it was doing in the first place, let alone figure out how to do that well.
The hard part is staying aligned with the true state of the system across dozens or hundreds of steps.
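Here's a framework-agnostic sketch of what 'staying aligned' can look like: re-check the backend after every step, retry with backoff, and only carry forward a compact summary of confirmed progress. The step objects and fetch_backend_state are placeholders for whatever your system exposes.

```python
# A sketch of keeping an agent aligned with real system state over a long
# async workflow. The step objects and fetch_backend_state are placeholders.
import time

def run_workflow(steps, fetch_backend_state, max_retries: int = 3):
    # Carry a compact summary of confirmed progress instead of the full
    # transcript, so long workflows don't blow out the context window.
    confirmed = {}

    for step in steps:
        for attempt in range(max_retries):
            step.execute(confirmed)                 # take the action
            time.sleep(2 ** attempt)                # give async work time to land
            actual = fetch_backend_state(step.key)  # then ask the backend, not the UI
            if step.expected(confirmed) == actual:
                confirmed[step.key] = actual        # record only the verified outcome
                break
        else:
            raise RuntimeError(f"Step {step.name} never reached its expected state")

    return confirmed
```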
When Near Enough Isn't Good Enough
The Amazon Science blog mentions that some of their engineers come from autonomous vehicles. In that world, "almost right" is the same as "unsafe." You don't get points for being close, and the stakes are high.
"Agents don't just produce outputs; they take actions inside live systems. They touch databases, initiate transactions, and modify system states. And when the output of a model is a real change in the world, reliability becomes non-negotiable."
If your agent is booking flights or modifying customer records, "works most of the time" isn't good enough. You wouldn't accept that from a human employee. Why accept it from an agent?
What Does "Success" Actually Mean?
The research talks about "formal verifiers" - basically, specifications that define exactly what successful completion looks like. The button got clicked? Cool, but did the thing actually happen?
"A workflow like 'send an e-mail,' for example, isn't declared successful just because a button appears to have been clicked; it's declared successful because exactly one new e-mail record exists in the database, and no unrelated records have been created, modified, or deleted."
That's the bar, and agents have to clear it "not once but thousands of times, under shifting timing, network, and UI conditions."
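In code, a verifier in that spirit might look something like this sketch: snapshot the data store before the agent acts, then diff it afterwards. fetch_all_emails and run_agent_task are placeholders for your own system.

```python
# A sketch of a formal verifier for the e-mail example: exactly one new record,
# and nothing else created, modified, or deleted.

def verify_send_email(fetch_all_emails, run_agent_task) -> bool:
    before = {rec["id"]: rec for rec in fetch_all_emails()}

    run_agent_task()  # the agent clicks, types, submits...

    after = {rec["id"]: rec for rec in fetch_all_emails()}

    new_ids = set(after) - set(before)
    deleted_ids = set(before) - set(after)
    modified_ids = {i for i in before if i in after and before[i] != after[i]}

    # Success means exactly one new e-mail record and no collateral damage.
    return len(new_ids) == 1 and not deleted_ids and not modified_ids
```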
What Can You Actually Build Today?
The Amazon Science blog describes internal research. In this case, it's RL gyms that Amazon's AGI Lab uses to train agents. So what can you use right now?
Amazon Bedrock AgentCore Browser solves the "I need my agent to browse the web but I don't want to become a browser infrastructure company" part. If you try to do this yourself, it will probably look like:
- Week 1: "I'll just use Selenium!" (seems easy)
- Week 2: Fighting ChromeDriver version mismatches
- Week 4: Debugging memory leaks and zombie processes
- Week 6: You're now managing browser pools, IP rotation, security hardening, and scaling infrastructure, and you have a job you never wanted
AgentCore Browser handles all the gross parts, so you can manage your own context window (in addition to the model's!). You are in charge of writing the agent logic, then AWS runs the browser for you. Session recording and replay let you debug exactly the kinds of calendar/widget failures the research describes.
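If you're curious what the hand-off looks like, here's a hedged sketch of driving a remote managed browser session with plain Playwright over CDP. How you obtain the session's WebSocket endpoint and auth headers from the AgentCore SDK is deliberately left out - ws_url and headers below are assumed inputs, so check the AgentCore Browser docs for the real calls.

```python
# A hedged sketch: attach ordinary Playwright to a remote managed browser
# session over CDP. Getting ws_url and headers from AgentCore is assumed.
from playwright.sync_api import sync_playwright

def run_in_remote_browser(ws_url: str, headers: dict) -> None:
    with sync_playwright() as p:
        # Attach to the managed browser instead of launching Chrome (and its
        # driver mismatches, memory leaks, and zombie processes) yourself.
        browser = p.chromium.connect_over_cdp(ws_url, headers=headers)
        context = browser.contexts[0]  # CDP connections expose the default context
        page = context.pages[0] if context.pages else context.new_page()

        page.goto("https://example.com/booking")
        # ...your agent logic drives the page here...

        browser.close()  # the session itself is managed (and recorded) remotely
```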
Amazon Bedrock AgentCore Evaluations is the "formal verifiers" part of this, but for production. Remember how the research said success isn't "the UI looked right" but "the system state matches the specification"? You can go in and define what success actually looks like for your workflows, and then keep checking that your agent passes those tests. There are 13 built-in evaluators for things like tool selection accuracy and goal success rate, and you can build your own.
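As a rough illustration of 'define success and keep checking it' - this is not the AgentCore Evaluations API, just the shape of the idea; run_agent and the checks below are placeholders for your own setup:

```python
# A sketch of custom evaluators run over a suite of test cases, producing a
# pass rate per check so regressions show up run over run.

def goal_success(result) -> bool:
    # Did the workflow reach its end state (e.g. a confirmed booking id)?
    return result.get("booking_id") is not None

def tool_selection(result) -> bool:
    # Did the agent stick to the tools we expect for this workflow?
    return set(result.get("tools_used", [])) <= {"search_flights", "book_flight"}

def evaluate(run_agent, test_cases, evaluators):
    scores = {name: 0 for name in evaluators}
    for case in test_cases:
        result = run_agent(case)  # invoke your agent however you normally do
        for name, check in evaluators.items():
            scores[name] += check(result)
    return {name: total / len(test_cases) for name, total in scores.items()}

# Example:
# evaluate(run_agent, cases, {"goal_success": goal_success,
#                             "tool_selection": tool_selection})
```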
Amazon Bedrock AgentCore Memory helps with the "running out of room to remember" problem. Short-term memory is "what did we just talk about 3 messages ago?" and it keeps track within a single session so users don't have to repeat themselves. Long-term memory is "this user prefers morning meetings and hates Mondays", where it extracts insights across sessions, not just raw logs. This is how agents stay coherent across long workflows without hitting context window limits. I also dig into this concept in a recent blog post called "Why AI Agents Need Context Graphs (And How to Build One with AWS)".
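Here's the shape of that short-term/long-term split as a toy sketch, rather than the AgentCore Memory API - extract_preferences stands in for whatever summarisation step you run between sessions:

```python
# A toy two-tier memory: raw turns for the current session, distilled facts
# across sessions. Not the AgentCore Memory API, just the idea.

class TwoTierMemory:
    def __init__(self, extract_preferences):
        self.session_turns = []   # short-term: raw turns for this session only
        self.long_term = {}       # long-term: distilled facts across sessions
        self._extract = extract_preferences

    def remember_turn(self, role: str, text: str) -> None:
        self.session_turns.append({"role": role, "text": text})

    def end_session(self) -> None:
        # Distill "this user prefers morning meetings" style insights, then
        # drop the raw transcript so the context window stays small.
        self.long_term.update(self._extract(self.session_turns))
        self.session_turns.clear()

    def build_context(self, last_n: int = 10) -> dict:
        # What actually goes into the prompt: long-term facts plus the tail
        # of the current conversation, not the entire history.
        return {"facts": self.long_term, "recent": self.session_turns[-last_n:]}
```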
Amazon Bedrock AgentCore Observability shows you what REALLY happened. It gives you sessions, traces, and spans that let you see exactly what the agent attempted vs what the backend actually did. Instead of guessing based on what the UI showed, this is where you can see the real story.
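If you want a feel for what 'attempted vs actually happened' looks like in a trace, here's a sketch using the standard OpenTelemetry Python API. commit_date_in_ui and fetch_backend_date are placeholder callables for your own UI and backend helpers, and wiring the exporter up to your observability backend is left to the docs.

```python
# A sketch of wrapping an agent action in a span so both "what the agent
# attempted" and "what the backend actually did" land in the trace.
from opentelemetry import trace

tracer = trace.get_tracer("vacation-booking-agent")

def pick_date(commit_date_in_ui, fetch_backend_date, booking_id: str, target_date: str):
    # commit_date_in_ui and fetch_backend_date are your own helpers, passed in
    # so this sketch stays self-contained.
    with tracer.start_as_current_span("pick_date") as span:
        span.set_attribute("booking.id", booking_id)
        span.set_attribute("date.requested", target_date)

        commit_date_in_ui(target_date)              # what the agent attempted
        confirmed = fetch_backend_date(booking_id)  # what the backend actually did

        span.set_attribute("date.confirmed", confirmed)
        span.set_attribute("backend.matches_ui", confirmed == target_date)
        return confirmed
```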
Your Workout Plan
The take-home tip: your agent needs practice.
Prompts are instructions, but practice is repetition under varied conditions until the behaviour becomes reliable. The Amazon Science research shows us that agents need to satisfy verifiers "thousands of times" before they're production-ready.
The gym metaphor is a good memory device - so use it next time you need to explain this to the rest of your dev team. Your agents need:
- isolated practice environments where failure is safe
- varied conditions that stress-test edge cases
- formal verification that defines what success actually means
- repetition until reliability becomes automatic
This works for any agent framework, and also for humans. So, before your agent books a vacation, teach it to scroll.
The research referenced in this article comes from "The unseen work of building reliable AI agents" by Jason Laster, published on Amazon Science in January 2026. The AWS service descriptions are based on Amazon Bedrock AgentCore documentation.