Everyone has seen that version of AI agents where everything just works. The reasoning is clean, every tool call lands, every output is exactly what you wanted. And then you try to build one yourself for production, and honestly? It's a pretty different experience.
Last week in London, we got engineers, tech leads, and builders into a room for Agents in Production, a meetup hosted by Orkes. The whole evening was basically one long honest conversation about that gap between demo agents and the ones you actually have to keep running in production.
The evening ✨
The format was simple: two talks, and then drinks and questions.
What made the room really awesome was the mix. Half the people there were already building agents in production and running into real problems. Things like state, retries, observability, and all the stuff that doesn't show up in any demo. The other half were earlier in the journey, trying to figure out where to even start without repeating everyone else's mistakes. Honestly, both groups had a lot to share with each other.
Talk 1: When Agents Meet Reality, and Why Execution Is the Hard Part
I kicked things off with a talk I've been sitting on for months.
The short version: stop only asking whether your agent is smart. Start asking if it's actually operable. Because the second you try to run a clever agent in production, a pretty different set of problems comes up:
- State has to persist across steps that might span minutes, hours, or even days.
- Failures are partial and messy. Not the kind of clean crash you can just catch and retry. More like silent degradations mid-task, the kind you only notice when someone else tells you.
- Humans need visibility into what's happening at each stage, and the ability to step in without breaking the whole workflow.
- Long-running coordination between agents, tools, and humans needs infrastructure most teams just aren't thinking about enough.
This is where orchestration actually earns its keep. Not as a buzzword, but as the actual difference between an agent that demos well and one you'd put in front of real users. Can you observe it? Can you recover when it fails? Can a human step in without everything falling over?
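To make that concrete, here's a minimal sketch of the durable-execution pattern in Python. Everything in it is illustrative: `run_workflow`, the checkpoint helpers, and the file-based store are hypothetical stand-ins for whatever your orchestrator actually provides, not anyone's real API. The shape is what matters: persist state after every step, so a crash mid-workflow resumes where it left off instead of starting over.

```python
import json
from pathlib import Path

# Hypothetical checkpoint store. In a real system this would be a
# database or an orchestrator's state store; a local JSON file keeps
# the sketch self-contained and runnable.
CHECKPOINT = Path("workflow_state.json")

def load_checkpoint() -> dict:
    """Resume from the last persisted state, or start fresh."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"completed": [], "outputs": {}}

def save_checkpoint(state: dict) -> None:
    """Persist after every step, not just at the end of the run."""
    CHECKPOINT.write_text(json.dumps(state))

def run_workflow(steps: dict) -> dict:
    state = load_checkpoint()
    for name, step in steps.items():
        if name in state["completed"]:
            continue  # finished in a previous run; skip on resume
        state["outputs"][name] = step(state["outputs"])
        state["completed"].append(name)
        save_checkpoint(state)  # a crash after this line loses nothing
    return state["outputs"]

# Toy steps; in a real agent these are LLM calls, tool calls, waits.
steps = {
    "fetch": lambda out: "raw data",
    "analyze": lambda out: f"analysis of {out['fetch']}",
    "report": lambda out: f"report based on {out['analyze']}",
}

if __name__ == "__main__":
    print(run_workflow(steps))
```

Kill the process halfway through and run it again: completed steps are skipped and the workflow picks up from the checkpoint. That one property is most of what "durable execution" buys you.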
And judging by the questions afterwards, the room was feeling this too.
Talk 2: From Prototype to Production, and How First Databank UK Did It
Where talk one was the argument, talk two was the evidence.
Dan Miller from First Databank UK walked us through how his team actually orchestrates three production AI agents using Orkes Conductor:
- Noisy cloud alerts. Triaging and surfacing only what actually matters.
- Time-consuming SPIKE investigations. Automating the research and synthesis work.
- Manual clinical guidance monitoring. Keeping a continuous eye on changing medical guidelines.
What made Dan's talk so good was how honest it was. He didn't skip the hard parts: the retries, the human checkpoints, the observability work that doesn't get talked about enough. Orkes Conductor gave his team durable execution, full observability, and human-in-the-loop checkpoints. Basically, all the boring stuff that turns a clever prototype into something a team can actually rely on.
And the clinical angle made it hit even harder. When your agent is working somewhere that patient safety matters, the bar for observability and control just jumps way up.
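To show what a human checkpoint can look like in code, here's a rough sketch of the pattern Dan described, with hypothetical names (`HumanCheckpoint`, `TriageAgent`) rather than Conductor's actual task types: the agent proposes an action, the workflow parks it as pending, and nothing executes until a human approves.

```python
import enum
from dataclasses import dataclass, field

class Status(enum.Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class HumanCheckpoint:
    """A pause point: the agent proposes, a human disposes."""
    proposal: str
    status: Status = Status.PENDING

@dataclass
class TriageAgent:
    """Sketch of an alert-triage flow with a human gate before action."""
    pending: list[HumanCheckpoint] = field(default_factory=list)

    def triage(self, alert: str) -> HumanCheckpoint:
        # The agent's judgment (an LLM call in a real system) produces
        # a proposed action, but nothing runs until a human signs off.
        checkpoint = HumanCheckpoint(proposal=f"escalate: {alert}")
        self.pending.append(checkpoint)
        return checkpoint

    def resolve(self, checkpoint: HumanCheckpoint, approved: bool) -> None:
        checkpoint.status = Status.APPROVED if approved else Status.REJECTED
        if checkpoint.status is Status.APPROVED:
            print(f"executing: {checkpoint.proposal}")
        else:
            print(f"dropped: {checkpoint.proposal}")

agent = TriageAgent()
cp = agent.triage("disk usage > 95% on prod-db-3")
agent.resolve(cp, approved=True)  # the human steps in without breaking the flow
```

The design point is that the human handoff is a first-class state in the workflow, not an afterthought, which is exactly what keeps it from turning into either a bottleneck or a blind spot.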
The conversation that followed
Once the talks wrapped, I honestly expected the room to slide into small talk or people to start leaving. It didn't though. People stayed locked in and continued to ask questions until we had to leave because the venue was closing for the night.
A few themes kept coming up:
Safety and trust. When do you actually trust an agent's decision? Where do humans need to stay in the loop, and how do you design those handoffs so they don't turn into bottlenecks? And nobody was speaking in the abstract either. People were wrestling with this in stuff they'd shipped that week.
The "how do we even start" question. The gap between "we've seen the demos" and "we've actually shipped something real" is way wider than it looks from the outside. There was real hunger for patterns, reference architectures, and honest stories about what didn't work.
Cross-industry patterns. Engineers from fintech, healthcare, dev tools, and retail kept comparing notes and landing on the same core problem: how to actually ship these agents, and how to build them in a way we can trust.
One more thing: Agentspan
We also got to drop something new at the event: Agentspan, a framework for building agents in a durable, production-ready way. It's basically our direct answer to everything the evening's talks were circling around.
The reaction in the room made it pretty clear this is something people have been looking for, and plenty of folks were excited to get started.
Next stop: Amsterdam
London confirmed something I'd been suspecting for a while. There's a real, growing community of people who want to stop talking about agents in theory and start sharing what actually works (and what doesn't) when you're running them in production. So yeah, we're doing it again.
If you're:
- Building agents right now and want to compare notes with people hitting the same walls,
- Thinking about building and want to skip a few expensive mistakes,
- Or just trying to make sense of where all of this is actually heading,
...this is the room for it.
If you're in Amsterdam and want in, drop a comment or shoot me a message on LinkedIn. I'm collecting names now, and I'll reach out as soon as we've locked in a date and venue.
See you in Amsterdam!