Not long ago, the craft of working with large language models was all about the prompt. You'd carefully compose a single, curated invocation, pre-filling it with everything the model could possibly need: data from databases, context retrieved through RAG, instructions refined through dozens of iterations. We called it prompt engineering, and it felt like a real discipline. In a way, it was.
But the models were already trained to follow instructions as part of a conversation. The prompt was never meant to be a one-shot affair. It was always the beginning of a dialogue, and that dialogue soon began to extend to the outside world.
The internal monologue
The shift began when models gained access to tools. Some call it function calling. What it really meant was that the conversation with the model became richer, including a sort of "internal monologue" where the model could reach out, touch external systems, retrieve information, take actions, or preserve memories across sessions. The model wasn't just responding anymore. It was reasoning its way through a problem, step by step, deciding what to do next.
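That internal monologue can be made concrete with a minimal sketch of the tool-call round trip. Everything here is illustrative: the tool registry, the message format, and the stubbed model (which stands in for a real API call) are all assumptions, not any particular vendor's interface.

```python
import json

# Hypothetical tool registry; a real harness would wire these to live systems.
TOOLS = {
    "get_weather": lambda city: json.dumps({"city": city, "temp_c": 18}),
}

def fake_model(messages):
    """Stand-in for a model API call. A real model decides when to
    request a tool; this stub always asks for the weather once."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "args": {"city": "Lisbon"}}
    return {"answer": "It's 18 degrees in Lisbon."}

def run_turn(user_prompt):
    # The conversation accumulates the model's "internal monologue":
    # tool requests and their results, interleaved with the dialogue.
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        reply = fake_model(messages)
        if "tool" in reply:  # the model decided to act, not to answer
            result = TOOLS[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": result})
        else:
            return reply["answer"]
```

The key point is the loop: the model isn't called once, it's called repeatedly, and each tool result becomes part of the context for the next step of reasoning.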
MCP (Model Context Protocol) accelerated this by standardizing how models access tools. For any organization with a landscape of existing services, MCP is a way to open those services to agents without rebuilding them.
But the real transformation happened when the harness around these models turned them into agents: systems that can autonomously pursue a goal, not just answer a question. How much space you give the model depends on the problem. If the goal can be achieved procedurally, a script or some traditional ML would be enough. If the process can be drafted at a high level, a graph-based orchestration of agents works well. You define the flow, the model fills in the gaps. But the more creativity the problem requires, the more unpredictable the path, the more you want to step back and let the model drive: give it a goal and access to what it needs, and let it figure out the route.
This is a useful reframing. Instead of building elaborate graph-based agent architectures for every use case, you can just give the model access to the right tools and let it work. The complexity moves from the harness to the model's reasoning.
The terminal as a gateway
Models are now evaluated on benchmarks like Terminal-Bench, which tests their ability to perform real-world tasks in command-line environments: compiling code, configuring servers, managing infrastructure. They can operate CLIs, write code that calls SDKs, interact with systems the way a developer would. In theory, you could give an agent a single gateway tool — a terminal — and let it do everything from there.
It's an appealing simplification. But it's also difficult to control and secure. An agent with unrestricted terminal access is powerful, but the blast radius of a mistake is significant because not everything can be run inside a sandbox. The art is in finding the right level of abstraction: enough freedom for the agent to be useful, enough constraints to keep things safe. And the larger the organization, the more that balance matters.
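One common way to find that level of abstraction is to keep the single gateway tool but gate what it can run. The sketch below uses an executable allowlist as the constraint; the allowlist contents and the `run_terminal` name are assumptions for illustration, and a production deployment would add sandboxing on top.

```python
import shlex
import subprocess

# Hypothetical allowlist: wide enough to be useful, narrow enough to bound
# the blast radius. A real deployment would also sandbox the filesystem.
ALLOWED_COMMANDS = {"echo", "ls", "cat", "grep", "git"}

def run_terminal(command: str, timeout: int = 10) -> str:
    """Gateway tool: execute a shell command on the agent's behalf,
    but only if the executable is on the allowlist."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {argv[0] if argv else ''}")
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip()
```

Parsing with `shlex` and executing without a shell avoids injection through pipes or subshells; the trade-off is that the agent loses exactly the compositional power that makes a raw terminal so appealing.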
Who's making the decisions?
When given enough room, agents don't just execute a plan. They evaluate options and pick a path. They choose which library to use, which API to call, which architectural pattern to follow. This is happening now, in code generation, in infrastructure provisioning, in data pipeline design. The agent makes decisions based on what's in its context and training, not necessarily what you would have chosen.
A human can provide suggestions, of course. But as these systems scale and handle more complex tasks, how much will humans realistically be able to guide each decision? This question doesn't have a clear answer yet, and it points to something important: the primary way you influence the agent's choices is by shaping its environment. What tools are available, what documentation it can access, what constraints it operates under. You're not scripting behavior anymore. You're designing the space in which behavior emerges.
Beyond the prompt
Designing that space turns out to have two dimensions, and I think they're two faces of the same challenge.
The first is a runtime problem. When an agent runs for many turns, calling tools, accumulating information, making decisions, the context window fills up. Managing that context is no longer about crafting a good initial prompt. It's about curating the full, evolving content of the model's working memory throughout a long-running process: deciding what stays, what gets summarized, what gets dropped. This is context engineering, and it's a fundamentally different discipline from prompt engineering.
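A minimal sketch of that curation step might look like the following. The budget, the character-based token approximation, and the truncation-based summarizer are all stand-ins; a real harness would count tokens properly and ask the model itself to write the summary.

```python
def curate_context(messages, budget=2000, keep_recent=6):
    """Keep the first (system) message and the most recent turns verbatim;
    compress everything in between into a single summary message.
    Token counting is approximated by character length in this sketch."""
    def size(msgs):
        return sum(len(m["content"]) for m in msgs)

    if size(messages) <= budget:
        return messages  # still fits: nothing to curate

    system, rest = messages[:1], messages[1:]
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    # Stand-in summarizer: truncate each old turn. A real harness would
    # have the model summarize, preserving decisions and open questions.
    summary = " / ".join(m["content"][:40] for m in old)
    summarized = {"role": "system", "content": f"Summary of earlier turns: {summary}"}
    return system + [summarized] + recent
```

The policy itself (what stays verbatim, what gets summarized, what gets dropped) is the interesting design decision; the mechanics are simple.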
There's an awkward tension here with prompt caching. Most caching implementations assume the context prefix is stable: rewrite or reorder earlier turns and the cached prefix is invalidated, so you pay the full processing cost again. Some model providers are starting to offer managed ways to handle this, but it's still early days.
The second problem is a design-time one. With prompt engineering, the focus was on giving the model the right input upfront. You knew what the model needed because you were defining the task precisely. With agents, you often don't know in advance what information they'll need. The task is open-ended, the path is emergent. So instead of perfecting the initial prompt, the job becomes providing clear guidance for what the agent might need and, critically, where and how to find it.
This is the agent experience — the counterpart to user experience, but for AI agents. How easy is it for an agent to discover the tools it needs? How well-structured is the information it has to work with?
Self-improving agents
A concrete example of agent experience done right is Andrew Ng's Context Hub, a CLI tool (chub) designed specifically for agents to search and fetch curated API documentation. But it's not a passive reference. It's built around two feedback loops that let agents get smarter over time.
The first loop is local. When an agent uses a doc and discovers something that isn't in it, like an environment-specific gotcha, a version quirk, or a workaround, it can annotate the doc with chub annotate. That annotation is stored locally and appears automatically the next time any agent on that machine fetches the same doc. The agent learns from its own experience, and that knowledge persists across sessions.
The second loop is shared. After using a doc, agents can also rate it with chub feedback — up or down, with optional labels like "outdated" or "wrong-examples" — and that feedback goes back to the doc authors through the registry. The authors update the content based on what's working and what isn't. The docs get better for everyone, not just locally.
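The local loop boils down to a simple idea: a persistent store keyed by document, merged in at fetch time. The sketch below illustrates that idea only; it is a hypothetical store, not chub's actual implementation, and the file path and function names are assumptions.

```python
import json
from pathlib import Path

# Hypothetical local store illustrating the first feedback loop: notes left
# by one agent persist and surface for every later fetch on the same machine.
STORE = Path("annotations.json")

def annotate(doc_id: str, note: str) -> None:
    """Record something the doc didn't cover (a gotcha, a version quirk)."""
    data = json.loads(STORE.read_text()) if STORE.exists() else {}
    data.setdefault(doc_id, []).append(note)
    STORE.write_text(json.dumps(data))

def fetch(doc_id: str, doc_body: str) -> str:
    """Return the doc with any locally stored annotations appended."""
    data = json.loads(STORE.read_text()) if STORE.exists() else {}
    notes = data.get(doc_id, [])
    if not notes:
        return doc_body
    return doc_body + "\n\nLocal annotations:\n" + "\n".join(f"- {n}" for n in notes)
```

The shared loop works the same way at a different scope: instead of a local file, the signal flows back through a registry to the humans maintaining the docs.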
We spent years optimizing interfaces for humans. Agents resemble human users in some respects and differ sharply in others. Now we need to think just as carefully about the interfaces we build for both.
Are you reading me, agent? What can I do for you?
P.S. — If you're not an agent and you've made it this far, I'd love to know what you think. Tell me in the comments below.