Anisha Malde

Posted on Jun 29 • Edited on Jul 8

Agents & Agent Orchestration, MCP, Skills, Context, Prompt & Harness Engineering 🤯: What, When, Where, How?

#ai #learning #agents

If you are like me, you are overwhelmed 🤯.

The way we build software has always shifted, and as engineers we have always had to keep up with new trends and ways of building. But historically it was one framework or tool at a time. I remember when React moved from class components to functional components and hooks back in 2019: I had time to learn it, blog about it, and read Twitter fights over it.

Now it feels like we are on an exponential learning curve. New ways of building show up faster than I can finish reading the announcement post for the last one. So if you have not been living under a rock 😉 and you are still employed 😅 in tech, you have probably lived through the same 'vocabulary' creep as I have.

This is my attempt to lay out my learning journey in one place: what each of these terms means, when they came about, how they all fit together, and where to find them in the tools you already use.

The What & When

Prompt engineering: refining intent

Most of our early interactions with LLMs started with "Hi, could you maybe review this for me, preeettty please?". But soon we realised we could get a better output by tweaking the words we prompted in, e.g. "You are a senior engineer reviewing this code, think critically and write tests", and so the practice of prompt engineering came about.

AWS calls it "the process where you guide generative AI solutions to generate desired outputs. In prompt engineering, you choose the most appropriate formats, phrases, words, and symbols that guide the AI to interact with your users more meaningfully." It includes techniques like prompt chaining and chain-of-thought tricks ("let's think step by step"). And it worked fine because models were small and tasks were short.

Context engineering: managing information

Eventually we realised the prompt alone wasn't cutting it: the model also needed 'context' to answer well. So the question stopped being "what is the perfect prompt?" and became "what is the perfect set of things in the context window right now?", and the practice of context engineering came about. The prompt is still in there, but now you're managing everything the model knows at the moment of the call, not just the words you type. And with a limited number of tokens in a context window, it is also about making sure you don't overload it (avoiding context bloat). So really, context engineering is about managing a few categories of inputs, and each one gave us a new bit of vocabulary:

Prompt: which we already know - the instructions, the role, the rules of how the model should behave, and the actual question being asked.
Knowledge (RAG, Retrieval-Augmented Generation): pulling relevant context from external sources (documents, databases, APIs, search indexes) into the window so the model can ground its answer in real data.
Memory: state that persists across conversations or sessions, both short-term (this conversation) and long-term (across sessions), plus context compression so older messages get summarised when things start to get long.
Tools: how the model talks to the outside world. MCP (Model Context Protocol) is the open standard for connecting models to external data, tools, and workflows.
Agent Skills: As context engineering matured, people realised they were re-loading the same combinations of the above (instructions, knowledge, and tools) over and over. Agent Skills appeared as an open standard to bundle them, originally developed by Anthropic, now adopted across Claude, Codex, Cursor, Gemini CLI, GitHub Copilot and many more. A Skill is a folder of instructions, knowledge, and tool wiring that the agent loads only when relevant, so the context window doesn't get bloated.
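
To make the "don't overload the window" idea concrete, here's a minimal sketch of context assembly under a token budget. Everything in it is an assumption for illustration: `ContextItem`, `estimate_tokens`, and `build_context` are hypothetical names, and counting words is only a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    kind: str       # "prompt", "knowledge", "memory", "tool" or "skill"
    text: str
    priority: int   # lower number = more important


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly one token per word.
    return len(text.split())


def build_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Keep the most important items that still fit inside the token budget."""
    chosen: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority):
        cost = estimate_tokens(item.text)
        if used + cost > budget:
            continue  # skip anything that would overflow the window
        chosen.append(item)
        used += cost
    return chosen


items = [
    ContextItem("prompt", "You are a senior engineer reviewing this code", 0),
    ContextItem("knowledge", "API spec: POST /checkout returns 201", 1),
    ContextItem("memory", "Summary of earlier conversation about retries " * 5, 2),
    ContextItem("tool", "run_tests(path) runs the test suite", 1),
]
window = build_context(items, budget=25)
print([item.kind for item in window])  # ['prompt', 'knowledge', 'tool']
```

The long memory summary is the item that gets dropped here, which is roughly the job that context compression does in a real setup: shrink it until it fits instead of losing it entirely.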

Now, agents as a concept aren't new. AI agents have existed for decades: chess engines, rule-based systems, reinforcement learning models. What changed recently is that LLMs gave agents a reasoning engine, and context engineering gave them the domain knowledge and tools to actually do specific things well. That's when agents became 'mainstream'.

Agents & Agent Orchestration

You can think of an agent as having an identity (its role and instructions), domain knowledge (the context, skills, and memory it carries), tools (what it can call), and a loop that ties it all together. The loop is the cycle the agent runs through on every step: read the current context, reason about what to do next, call a tool, observe the result, and decide whether to keep going or stop. Without that loop, you've got a model that answers a question. With it, you've got something that can take multiple steps to get a job done.

Now you might have already taken it a step further and have one agent that plans, another that implements, and a third that reviews. This is agent orchestration: the choreography between agents. Orchestration also answers who runs first, who runs in parallel, how the planner hands off, whether sub-agents share context, and what happens when one fails. Common patterns include fan-out to specialists, judge-and-vote, pipeline stages, and planner-worker.
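
To make the single-agent loop from this section concrete, here's a minimal sketch. It is not how any particular product implements it: `fake_model` is a hard-coded stand-in for an LLM call, and `TOOLS`, `run_agent`, and the `word_count` tool are hypothetical names I made up for the example.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
}


def fake_model(context: list[str]) -> dict:
    """Stand-in for an LLM call: picks the next action from the context."""
    if not any(line.startswith("observation:") for line in context):
        return {"action": "call_tool", "tool": "word_count", "input": context[0]}
    return {"action": "finish", "answer": context[-1].removeprefix("observation: ")}


def run_agent(task: str, max_steps: int = 5) -> str:
    context = [task]
    for _ in range(max_steps):
        decision = fake_model(context)                        # read context + reason
        if decision["action"] == "finish":                    # decide whether to stop
            return decision["answer"]
        result = TOOLS[decision["tool"]](decision["input"])   # call a tool
        context.append(f"observation: {result}")              # observe the result
    raise RuntimeError("agent did not finish within max_steps")


print(run_agent("count the words in this sentence"))  # 6
```

The `max_steps` cap matters more than it looks: it is the simplest possible guardrail against a loop that never decides to stop.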

You could orchestrate agents before harness engineering had a name, but orchestration on its own is just word-based instructions: you tell the agents what to do and let them go. Agents would route dynamically, hand off silently, and hallucinate. So even though orchestration 'works', there's a deeper discipline that determines whether your agents actually produce the intended output, and that's where the term harness engineering came about.

Harness engineering: controlling execution

Harness engineering is the scaffolding around your agents: how context gets delivered, how outputs get verified, how plans get captured, and where and when humans can intervene. Orchestration decides what the agents do; the harness decides whether they can actually do it. As OpenAI puts it, the harness is the systems, scaffolding, and leverage that turn a model into something that can do real work.

Think of agents on their own like an engineering team told to "just build a checkout page". You've got a frontend dev, a backend dev, QA, a UX designer, all capable, all specialised. But without the scaffolding around them, here's what actually happens: nobody agrees on what "done" looks like (no task lifecycle), so the frontend builds against last week's API spec while the backend ships a new one (no shared context delivery). QA tests the happy path but never sees the designer's edge cases (no feedback loop). The backend dev pushes straight to prod on a Friday because nobody set up a staging gate (no guardrails). When it breaks, there's no rollback and no logs to tell you why (no state persistence, no observability). It's the same capable people doing the work, just with a completely different outcome.

Swap the engineering team for AI agents and you get the same failure modes, plus one more: the model itself is non-deterministic, so the same prompt can take five different paths on five different runs. The harness is what holds it all together: the difference between an agent that might do the right thing and a system that reliably does. Concretely, it looks like:

Context delivery: assembling and delivering the prompt, knowledge, memory, tools, skills on every loop
Agent Orchestration: routing between specialised agents (each with its own role, tools, and skills), hand-offs, fan-out, judge-and-vote, planner-worker patterns, plus spawning sub-agents with their own isolation, lifecycle, and context
Feedback loops: act → observe → react → repeat. Plus the type checkers, linters, tests, and tool outputs that tell the agent when it has gone wrong, so it can self-correct
Task lifecycle: the macro-loop a run goes through, e.g. Triage → Clarify → Plan → Execute (TDD) → Evaluate → Done. The harness decides when each step starts and ends, and when to ask a human
Tool execution: retries, timeouts, fallbacks when something hangs or fails
Guardrails and sandboxing: read-only credentials, approval gates for destructive actions, content/output filters, filesystem isolation, the rules that keep the agent inside the lines
State persistence and recovery: memory, plans, transcripts, git checkpoints so the agent can be paused, resumed, or rolled back
Observability: traces, costs, evals, replay, the audit trail that makes the system legible to a human
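
Two of the items above, tool execution and guardrails, are small enough to sketch in a few lines. This is a minimal illustration under my own assumptions: `call_with_retries`, `guarded`, and the action names in `DESTRUCTIVE` are hypothetical, and a real harness would also enforce timeouts around the tool call itself.

```python
import time
from typing import Callable


def call_with_retries(tool: Callable[[], str], attempts: int = 3,
                      backoff_s: float = 0.0, fallback: str = "unavailable") -> str:
    """Retry a flaky tool, then return a fallback instead of crashing the run."""
    for attempt in range(1, attempts + 1):
        try:
            return tool()
        except Exception:
            if attempt < attempts:
                time.sleep(backoff_s * attempt)  # linear backoff between tries
    return fallback


DESTRUCTIVE = {"delete_branch", "deploy_prod"}  # hypothetical action names


def guarded(action: str, approved_by_human: bool) -> bool:
    """Approval gate: destructive actions need an explicit human yes."""
    return action not in DESTRUCTIVE or approved_by_human


calls = {"n": 0}


def flaky_search() -> str:
    # Fails on the first two calls, succeeds on the third.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("tool hung")
    return "3 results"


print(call_with_retries(flaky_search))                  # 3 results
print(guarded("deploy_prod", approved_by_human=False))  # False
```

Returning a fallback instead of raising is a design choice, not a rule: it lets the agent see "unavailable" as an observation and react to it, rather than killing the whole run.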

How: it all fits together

So we understand harness engineering conceptually, but what does a real run look like end-to-end? The harness drives the agent through a specific task lifecycle, and it owns the arrows between every step. Here's one example, in which each step is owned by a different specialised agent and the harness is what hands work from one to the next:

Router triages the incoming request: what kind of task is this, which lifecycle should it run, does it even have enough information to start?
Planner breaks the task into steps, surfaces assumptions, and writes a plan the next agent can execute against.
Generator does the actual work, ideally test-first, so the feedback loop has something to push back on.
Evaluator grades the output against the plan, the tests, and any acceptance criteria, and sends it back if it doesn't pass.
Verifier does the final end-to-end check, often by actually exercising the running app (Playwright, a real API call, a real build).
Human can intervene at any step, approving risky actions, correcting course, or kicking the run back to an earlier phase.
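
The hand-offs in that list can be sketched as a tiny state machine where each agent returns the name of the next step and the harness does the routing. All of this is assumed for illustration: the agents are plain functions with hard-coded behaviour (the generator deliberately fails its first attempt), and `run_lifecycle` is a hypothetical name.

```python
from typing import Callable

State = dict  # shared run state passed from agent to agent


def router(state: State) -> str:
    state["task_type"] = "bugfix" if "bug" in state["request"] else "feature"
    return "plan"


def planner(state: State) -> str:
    state["plan"] = ["write failing test", "make it pass"]
    return "generate"


def generator(state: State) -> str:
    state["attempts"] = state.get("attempts", 0) + 1
    # Pretend the first attempt is wrong and the second one is right.
    state["output"] = "ok" if state["attempts"] >= 2 else "broken"
    return "evaluate"


def evaluator(state: State) -> str:
    # Send the work back to the generator if it does not pass.
    return "verify" if state["output"] == "ok" else "generate"


def verifier(state: State) -> str:
    state["verified"] = True
    return "done"


AGENTS: dict[str, Callable[[State], str]] = {
    "route": router, "plan": planner, "generate": generator,
    "evaluate": evaluator, "verify": verifier,
}


def run_lifecycle(request: str, max_transitions: int = 20) -> State:
    state: State = {"request": request, "history": []}
    step = "route"
    for _ in range(max_transitions):
        if step == "done":
            return state
        state["history"].append(step)
        step = AGENTS[step](state)  # the harness owns every hand-off
    raise RuntimeError("escalate to a human: lifecycle did not converge")


run = run_lifecycle("fix the login bug")
print(run["history"])
# ['route', 'plan', 'generate', 'evaluate', 'generate', 'evaluate', 'verify']
```

The `RuntimeError` at the end is where a human would step in: if the evaluator keeps bouncing work back, the harness stops the run instead of looping forever.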

The harness owns the connective tissue between agents: what context gets passed along, when to retry, when to escalate, and when to call it done. It also owns the trace, a recording of every prompt and tool call that ran, which matters more than it sounds. With normal software you can read the code and predict what it does, but with an agent you can't: the same loop will take different paths on different runs, and the only way to understand what actually happened (and improve next time) is to read back the trace. That's the bit most people skip, and it's the reason "human in the loop" ends up being less of a checkbox and more of an ongoing job: you're watching runs in real time, stepping in when something goes off, approving the risky calls, auditing weird traces afterwards, and feeding the worst ones back as evals.
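
A trace doesn't have to be fancy to be useful. Here's a minimal sketch of one as an append-only list of events that can be dumped as JSON Lines. The `Trace` and `TraceEvent` classes and their fields are my own assumptions, not the format of any real observability tool.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    step: int
    kind: str          # "prompt" or "tool_call"
    payload: dict
    timestamp: float = field(default_factory=time.time)


class Trace:
    def __init__(self) -> None:
        self.events: list[TraceEvent] = []

    def record(self, kind: str, **payload) -> None:
        # Step numbers follow the order in which events were recorded.
        self.events.append(TraceEvent(len(self.events), kind, payload))

    def to_jsonl(self) -> str:
        # One JSON object per line, easy to grep and replay later.
        return "\n".join(json.dumps(asdict(e)) for e in self.events)

    def tool_calls(self) -> list[str]:
        return [e.payload["name"] for e in self.events if e.kind == "tool_call"]


trace = Trace()
trace.record("prompt", text="Fix the failing test")
trace.record("tool_call", name="run_tests", args={"path": "tests/"})
trace.record("tool_call", name="edit_file", args={"path": "app.py"})
print(trace.tool_calls())  # ['run_tests', 'edit_file']
```

Once runs are stored like this, "feeding the worst ones back as evals" becomes a matter of picking a saved trace and asserting on what the agent should have done differently.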

Tactics for making it work

The next question is how you actually make it work in practice, and here the field is converging on a handful of patterns:

Keep the harness thin, push the smarts into skills. A good harness is mostly plumbing, roughly 200 lines that run the loop, manage files, handle context, and keep things safe. The actual knowledge lives in skills. That way you can swap the model out without rewriting your skills, and you avoid the trap of building a brittle wrapper that has to know everything itself. (Garry Tan, "Thin Harness, Fat Skills")

Separate the agent doing the work from the agent grading it. Agents are reliably bad at marking their own homework: they will write code, declare it correct, and move on without ever questioning the result. A standalone evaluator with a sceptical streak is far easier to tune than a generator trying to critique itself, which is why this pattern shows up in nearly every harness writeup, often under the name evaluator-optimizer. (Anthropic, "Building Effective Agents")

One task per session, and check the baseline first. Each run should start with fresh context and a quick verification that the build, tests, and setup are still green before the agent touches anything. Without that baseline check, the agent will spend an hour debugging a problem that was already there when it sat down, and those compounding bugs across sessions are one of the easiest ways for a run to go sideways. (Simon Willison, "Designing Agentic Loops")
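
The baseline check is easy to automate. Below is a minimal sketch that refuses to start a session unless every check command exits with code 0. `baseline_is_green` and `start_session` are hypothetical names, and the two commands are stand-ins that just exit with a fixed code; in practice you would pass your own test, build, or lint commands.

```python
import subprocess
import sys


def baseline_is_green(commands: list[list[str]]) -> bool:
    """Run each check command; the baseline is green only if all exit with 0."""
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"baseline failed: {' '.join(cmd)}")
            return False
    return True


def start_session(task: str, checks: list[list[str]]) -> str:
    if not baseline_is_green(checks):
        return "blocked: fix the baseline before starting"
    return f"started: {task}"


# Stand-ins for real commands such as a test runner or a build step.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(start_session("add checkout page", [passing]))
print(start_session("add checkout page", [passing, failing]))
```

The first call starts the session; the second is blocked because one check is already red before the agent has touched anything.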

Stop prompting, start designing the loop. Once the lifecycle is set up, you stop typing one-off prompts and start shaping the loop that prompts on your behalf, with each agent getting its prompt assembled from the goal, the current context, and whatever the previous agent left behind. As Boris Cherny, head of Claude Code at Anthropic, put it: "I don't prompt Claude anymore. I have loops running that prompt Claude." (Addy Osmani, "Loop Engineering")

Wire in fast feedback. Type checkers, linters, tests, and tools like Puppeteer or Playwright for anything UI-driven are what stop an agent from marking features done without ever running the thing. As Simon Willison puts it, automated tests "hugely amplify what these agents can do", because they give the loop something objective to check against. The feedback loop has to turn fast as well, because slow checks just mean fewer iterations before the context window runs out. (Simon Willison, "Designing Agentic Loops")
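
Here's a minimal sketch of what "something objective to check against" can look like. The checker compiles a candidate, runs one tiny test, and returns a list of problems that a loop could feed back to the agent. The `check_code` function and the three hard-coded candidates are assumptions for illustration, and calling `exec` on generated code is only acceptable here because the strings are fixed; a real harness would run this inside a sandbox.

```python
def check_code(source: str) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems: list[str] = []
    try:
        compile(source, "<agent-output>", "exec")  # fast syntax check
    except SyntaxError as err:
        return [f"syntax error on line {err.lineno}: {err.msg}"]
    namespace: dict = {}
    exec(source, namespace)                        # load the candidate code
    add = namespace.get("add")
    if add is None:
        problems.append("function add is missing")
    elif add(2, 3) != 5:
        problems.append(f"add(2, 3) returned {add(2, 3)}, expected 5")
    return problems


# Three attempts an agent might make, from worst to best.
candidates = [
    "def add(a, b) return a + b",         # syntax error
    "def add(a, b):\n    return a - b",   # runs, but wrong behaviour
    "def add(a, b):\n    return a + b",   # correct
]

for attempt, source in enumerate(candidates, start=1):
    problems = check_code(source)
    if not problems:
        print(f"attempt {attempt} passed")
        break
    print(f"attempt {attempt} feedback: {problems[0]}")
```

Each failed attempt produces a specific message rather than a vague "try again", which is what lets the next iteration actually improve.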

Strip the harness as the models get better. Every component in your harness is encoding an assumption about what the model can't do on its own, and those assumptions go stale faster than you'd think. With each model upgrade, pull one piece out and see whether anything actually breaks, because the best harness is the smallest one that still gets you to a reliable result. (OpenAI on harness engineering)

Where: tools for building your harness

So we've got the concepts. Where do you actually start? It depends on how much you want to build yourself, and there are roughly three layers to choose from:

Use a ready-made harness. Someone else has already wired up the loop, the tools, the lifecycle, and the guardrails. You bring your model and configure it for your use case.
Build your own with an SDK. The SDK gives you the loop, tools, memory, and orchestration. You build the lifecycle, feedback loops, and guardrails on top.
Host it somewhere. Once it works locally, you need a runtime with memory, auth, observability, and scaling.

Most people start at layer 1 to see what good looks like, then drop to layer 2 when their use case stops fitting.

Layer 1: ready-made harnesses (mostly for coding tasks)

The most mature ready-made harnesses today are all coding agents. The use case is well-understood (read code, edit code, run tests, commit), and several of them are open enough that you can study how they're put together:

Claude Code: the closed-source benchmark. Automatic context engineering and compaction. Built on top of the Claude Agent SDK, which Anthropic publishes separately if you want to build your own version of it.
Cline: open-source agent runtime that runs as a CLI, IDE extension, or via SDK. Its harness is exposed as user-facing toggles: explicit Plan and Act modes for the lifecycle, step-level approval as the guardrail, .clinerules files for skills, MCP for tools, and checkpoints with one-click undo for state.
Pi: deliberately minimal. Six tools by default, everything else as configurable extensions. The cleanest one to read end-to-end.
Goose: the most general-purpose of the bunch, useful beyond coding. YAML recipes as portable lifecycles, MCP for tools, subagents for orchestration, and an adversary reviewer as a guardrail.

If your use case is also coding, you may not need to leave this layer at all: fork one, drop your skills and rules in, and you're done.

Layer 2: build your own with an SDK

Once you step outside coding (customer support, document processing, a TV-app generator, anything domain-specific), the ready-made harnesses stop fitting. That's when you drop down to an SDK.

An SDK is basically a half-built harness. It ships you the engine (the agent loop, tool execution via MCP, memory, orchestration primitives, observability hooks), and you finish the rest: the lifecycle phases, the skills, the verify rules, the domain logic. Claude Code is what Anthropic built on top of the Claude Agent SDK. You'd be doing the same thing, just for your problem.

A real example: tv-build-harness, by @giolaq, generates multi-platform TV apps from a JSON manifest. Under the hood it runs the Claude CLI (or Strands SDK) as the engine, and on top the author built a deterministic pipeline of phases (plan → scaffold → branding → content → screens → ... → visual_qa_loop), each loading targeted skills, each verified before moving on, each git-committed so the run can be paused or resumed. "Thin harness, fat skills" in practice. That's what Layer 2 looks like: the SDK gives you the loop, you give it the structure your problem needs.
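
To show the general shape of a "verified, checkpointed, resumable" pipeline, here's a minimal sketch. It is not the code from tv-build-harness: `run_pipeline` is a hypothetical function, the phase list only borrows the first five phase names from the example above, and an in-memory list stands in for the git commits that the real tool uses as checkpoints.

```python
from typing import Callable

# First five phase names from the example above; the rest are left out.
PHASES = ["plan", "scaffold", "branding", "content", "screens"]


def run_pipeline(run_phase: Callable[[str], str],
                 verify: Callable[[str, str], bool],
                 completed: list[str]) -> list[str]:
    """Run phases in order, skipping the ones that are already checkpointed."""
    for phase in PHASES:
        if phase in completed:
            continue                 # resume: this phase is already done
        artifact = run_phase(phase)
        if not verify(phase, artifact):
            return completed         # pause here; a later call resumes
        completed.append(phase)      # checkpoint the verified phase
    return completed


failures = {"branding"}  # make one phase fail verification once


def run_phase(phase: str) -> str:
    return f"{phase}-output"


def verify(phase: str, artifact: str) -> bool:
    if phase in failures:
        failures.discard(phase)      # fail once, pass on the retry
        return False
    return artifact.endswith("-output")


checkpoint: list[str] = []
run_pipeline(run_phase, verify, checkpoint)
print(checkpoint)  # ['plan', 'scaffold']
run_pipeline(run_phase, verify, checkpoint)
print(checkpoint)  # ['plan', 'scaffold', 'branding', 'content', 'screens']
```

The second call picks up at the failed phase without redoing the first two, which is the whole point of checkpointing after every verified step.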

The SDKs themselves split two ways:

Model-aligned (tighter integration, usually tuned for one vendor's models first): Claude Agent SDK, OpenAI Agents SDK (the production successor to Swarm), Google ADK (multi-language, Gemini-first but model-agnostic via LiteLLM), and Microsoft Agent Framework (the direct successor to AutoGen and Semantic Kernel).

Provider-agnostic (swap models freely, more setup): LangGraph for graph-based workflows, Strands Agents (AWS, open source) for a model-driven thin loop, Vercel AI SDK as the TypeScript default (v7 even ships a HarnessAgent primitive), Eve (also Vercel, in beta) as a filesystem-first framework where you author the agent by dropping files into tools/, skills/, and channels/ folders, Mastra for TypeScript with workflows, memory, evals, and RAG built in, and CrewAI for role-based agent 'crews'.

Whichever SDK you pick, what your harness ends up looking like depends largely on the task. A coding agent leans on sandboxed execution, git, and a test runner. A customer support agent leans on memory, RAG, and approval gates. A TV-app generator (like the one above) leans on visual QA loops and platform-specific verify rules. Same building blocks, very different assemblies, because the failure modes you're guarding against are different.

Layer 3: host it somewhere

Once your harness works locally, you need somewhere to run it with memory, auth, observability, and scaling.

Amazon Bedrock AgentCore: managed runtime, memory, identity, gateway, observability. Framework-agnostic: it supports Strands, LangChain, OpenAI Agents SDK, and the Claude Agent SDK.
LangSmith Deployment (renamed from LangGraph Platform): the default if you're on LangGraph.
OpenAI AgentKit: Agent Builder, ChatKit, and hosted tools via the Responses API.
Self-hosted on Fargate (long loops), Modal (sandboxed compute), or Lambda (short loops only).

The cross-cutting bits

Whichever layers you pick, you'll also need:

Tools: MCP servers and the new official registry. MCP was donated to the Linux Foundation in December 2025, so it's the safest bet for portable tool integration.
State and rollback: Git. Every coding harness uses it for a reason.
Feedback loops: your test suite, type checker, and linter. Without these, the agent has nothing objective to push back against.
Observability: Langfuse, LangSmith, Arize Phoenix, or Braintrust for traces and evals. The OpenTelemetry GenAI conventions are still in development, so don't assume your tracing is portable yet.

Conclusion

Hopefully you're feeling slightly less overwhelmed. At the end of the day, this is all just tactics for getting the model to do what we actually want: reliably, repeatedly, and without burning down the building. The prompt was the first lever, and everything since (context, skills, agents, orchestration, the whole harness) is the same instinct: shape what the model does so the outcome is something you can ship.

So while the vocabulary will keep changing, the underlying problem we are trying to solve won't. It just depends on how you want to get there. Start with what the model gives you out of the box, and add layers only when you hit its limits.

DEV Community