In the previous post, we explored the key patterns behind AI agents — the ReAct loop, planning, and how a few tools turn a language model into an autonomous problem-solver.
With this post, we’re starting a series exploring popular AI agents by disassembling them — extracting their system prompts, tools, and session traces to show you exactly what’s happening under the hood.
We’re starting with GitHub Copilot.
You’re in VS Code. You type: “Implement minimal agentic loop in Python using Anthropic API with a run bash tool and human confirmation before executing scripts.”
Copilot reads your project structure. Creates agent.py. Checks for syntax errors. Finds an issue with import handling. Patches the file. Checks errors again. Delivers the final result.
Ninety seconds. Eleven API requests. 123,783 tokens. You saw a smooth sequence of actions. Under the hood, three different models were working together, five of those eleven requests were invisible overhead, and 97% of those tokens were input — the model reading, not writing.
Here’s what’s actually going on.
All prompts, tools, and session traces referenced in this post are extracted and available in the agenticloops/agentic-apps-internals GitHub repo.
Three Modes Under the Hood
GitHub Copilot isn’t one agent. It’s three distinct configurations with very different footprints. (Copilot also has an Edit mode, but we didn’t capture it in this analysis — we focused on Ask, Plan, and Agent.)
Ask mode is the lightest. Zero tools, a compact system prompt, and pure text completion — the model works only from context provided in the conversation. Code suggestions use a clever trick: a special 4-backtick format with filepath comments that the IDE parses and applies. The model never touches a file directly. First request cost: 654 input tokens.
Plan and Agent modes share the same large base prompt — personality, coding guidelines, formatting rules, planning examples — weighing in at ~6,000 tokens. What separates them is tools and permissions.
Plan mode adds 22 read-only tools — file search, grep, semantic search, directory listing, error checking, web fetching. It can explore your entire codebase, but it cannot change a single line. Those 22 tool schemas add ~2,300 tokens on top of the prompt. First request cost: 8,484 input tokens.
Agent mode gets 65 tools with full read/write/execute access. The 65 tool schemas add ~10,500 tokens — almost double the system prompt itself. First request cost: 16,738 input tokens. That’s 25x more than Ask mode, and most of the gap is tools, not prompt.
The mechanism that lets Plan and Agent share a base prompt is a <modeInstructions> block appended at the very end — a final-authority override that flips behavior without changing anything above it. Same prompt, different tail, different tools.
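A hypothetical sketch of that shared-base-plus-tail assembly, with the section names taken from the post but the assembly function itself assumed:

```python
# Hypothetical sketch of the shared-base-plus-tail prompt assembly.
# Section names mirror the post; the assembly logic is an assumption.

BASE_PROMPT = "\n".join([
    "<personality>...</personality>",
    "<coding_agent_instructions>...</coding_agent_instructions>",
    "<task_execution>...</task_execution>",
])

MODE_TAILS = {
    "plan": "<modeInstructions>You are a PLANNING AGENT. "
            "NEVER start implementation.</modeInstructions>",
    "agent": "<modeInstructions>Persist until the task is fully "
             "handled end-to-end.</modeInstructions>",
}

def build_system_prompt(mode: str) -> str:
    # The mode override goes last so it carries final authority:
    # same prompt body, different tail, different behavior.
    return BASE_PROMPT + "\n" + MODE_TAILS[mode]
```

One prompt to maintain, one append to fork behavior per mode.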
Your Request Never Goes Straight to the Model
The first surprise: when you type a message in Copilot, the main model isn’t even the first thing that sees it.
In Ask mode, gpt-4o-mini categorizes your question into one of 16 predefined categories — generate_code_sample, workspace_project_questions, create_tests, web_questions, and so on — before the main model is invoked. This routing shapes how VS Code handles the response downstream.
In every mode, another gpt-4o-mini call generates a conversation title (“Minimal Agentic Loop with Anthropic API”).
In Agent mode, gpt-4o-mini also runs alongside the main model, generating activity summaries after each significant action — those short status messages you see updating in the Copilot panel while the agent works.
The actual model doing the heavy lifting — gpt-5.3-codex (user-selectable) — only handles the real work. Everything else is delegated to cheaper, faster models. In our captured agent session, 5 of 11 requests were this kind of overhead. You never see them, but they’re there in every session.
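The delegation pattern reduces to a simple routing decision. A sketch, with the task labels assumed for illustration (they are not Copilot's internal names):

```python
# Route cheap housekeeping to a small model; reserve the big model for coding.
# Task labels here are illustrative, not Copilot's internal names.

CHEAP_MODEL = "gpt-4o-mini"   # categorization, titles, activity summaries
MAIN_MODEL = "gpt-5.3-codex"  # the actual coding work

OVERHEAD_TASKS = {"categorize", "title", "activity_summary"}

def pick_model(task: str) -> str:
    return CHEAP_MODEL if task in OVERHEAD_TASKS else MAIN_MODEL
```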
How Plan Mode Prevents Itself from Coding
This is one of the most interesting patterns in the system. Plan mode shares the same base prompt as Agent mode — including instructions about writing code, applying patches, and running commands. But the mode override completely flips the behavior:
```
<modeInstructions>
You are a PLANNING AGENT, pairing with the user to create a detailed,
actionable plan.
Your SOLE responsibility is planning. NEVER start implementation.
<rules>
- STOP if you consider running file editing tools —
  plans are for others to execute
</rules>
```
The override defines a four-phase workflow:
1. Discovery — Spawn a sub-agent to autonomously research the codebase
2. Alignment — Use ask_questions to clarify ambiguities with the user
3. Design — Draft a comprehensive implementation plan
4. Refinement — Iterate until the user approves
And it includes a quantified stopping condition for research: “Stop research when you reach 80% confidence you have enough context to draft a plan.”
Not “when you have enough” — “when you hit 80%.” This kind of specificity shows up everywhere in Copilot’s prompts.
The plan style guide is equally deliberate: “NO code blocks — describe changes, link to files/symbols.” Plans describe what to do, not how. Code belongs in Agent mode.
Agent Mode: “It’s Bad to Just Show Code”
Agent mode’s system prompt doesn’t just allow autonomy — it demands it. Two XML sections work in tandem.
The <autonomy_and_persistence> section says:
Unless the user explicitly asks for a plan, asks a question about the code, is brainstorming potential solutions, or some other intent that makes it clear that code should not be written, assume the user wants you to make code changes or run tools to solve the user’s problem. In these cases, it’s bad to output your proposed solution in a message, you should go ahead and actually implement the change.
And <task_execution> reinforces it:
You are a coding agent. You must keep going until the query or task is completely resolved, before ending your turn and yielding back to the user. Persist until the task is fully handled end-to-end within the current turn whenever feasible and persevere even when function calls fail.
The wording is deliberate. “It’s bad” — not “avoid” or “prefer not to.” The model is told that showing code is worse than writing it. And “persevere even when function calls fail” — don’t give up on first error, figure it out.
The prompt also includes an <ambition_vs_precision> section that acts as a context-dependent behavior slider:
For tasks that have no prior context (brand new), you should feel free to be ambitious and demonstrate creativity. If you’re operating in an existing codebase, you should make sure you do exactly what the user asks with surgical precision.
New project? Be creative. Existing codebase? Be surgical. Most agents don’t make this distinction.
65 Tools, But Most Are Invisible
Agent mode’s 65 tools span file operations (read, write, search, create), shell execution, VS Code integration (errors, tests, extensions, git diffs), Jupyter notebooks, Python environment management, Mermaid diagram rendering, web fetching, and MCP extensions.
The tool summary breaks them down by category: 7 file read, 4 file write, 7 shell, 2 web, 10 VS Code, 9 notebook, 4 Python env, 13 MCP, 4 Mermaid, and a handful of planning, questions, multi-agent, GitHub, and container tools.
The MCP tools are worth calling out. In our capture, all 13 MCP-provided tools come from the Pylance extension — docstring generation, import management, syntax error checking, code refactoring, environment detection. These aren’t built into Copilot; they’re injected by VS Code extensions through the Model Context Protocol. Install a Docker extension with MCP support, and the agent suddenly gains container management capabilities. Install GitKraken, and it can create PRs and manage branches.
This means the tool surface area isn’t fixed — it grows with your VS Code setup.
But 65 tool schemas aren’t free. The session data tells the story: Agent mode’s first main request consumes 16,738 input tokens, while Plan mode’s equivalent request uses 8,484 — both with nearly identical ~24K character system prompts. The ~8,250 token gap is almost entirely tool schemas. That puts the cost at roughly **190 tokens per tool definition**, meaning Agent mode’s 65 tools add ~10,500 tokens of overhead to every single request, while Plan mode’s 22 tools add ~2,300. Ask mode, with zero tools and a smaller prompt, runs at just 654 tokens — over 25x cheaper per request.
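You can reproduce that estimate from the first-request numbers alone:

```python
# Reproduce the per-tool estimate from the first-request figures in the post.
agent_first = 16_738   # input tokens, Agent mode (65 tools)
plan_first = 8_484     # input tokens, Plan mode (22 tools)

gap = agent_first - plan_first   # 8,254 tokens of extra schema
per_tool = gap / (65 - 22)       # ≈ 192 tokens per tool definition
```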
The V4A Patch Format
Agent mode doesn’t edit files by overwriting them or running sed. It uses a custom diff format called V4A, defined in the <applyPatchInstructions> section:
Three lines of context above and below each change. The @@ operator to disambiguate when context lines aren’t unique — pointing to a class or function scope. Multiple @@ for deeply nested code.
The system prompt enforces this strictly: “NEVER print this out to the user, instead call the tool and the edits will be applied and shown to the user.”
Why not unified diff? Unambiguous parsing. The IDE knows exactly where to apply every change without guessing.
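Based on the description above, and on OpenAI's published apply_patch examples (which the V4A format appears to match), a patch might look like this; treat the exact syntax as illustrative rather than an extracted sample:

```
*** Begin Patch
*** Update File: agent.py
@@ def main():
     client = Anthropic()
     messages = []
     tools = load_tools()
-    run_loop(client, messages)
+    run_loop(client, messages, tools)
     print("done")
     return 0
*** End Patch
```

The @@ line pins the change to a function scope, and the surrounding context lines anchor it to an exact location, so the IDE never has to guess.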
The Rules That Prevent Infinite Loops
Every agent builder learns this the hard way: without explicit guardrails, agents get stuck retrying the same failing action forever.
Copilot handles this with a simple “3-strike rule”: “Do not loop more than 3 times attempting to fix errors in the same file. If the third try fails, you should stop and ask the user what to do next.”
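A minimal sketch of that guardrail, with the fix and error-check callables assumed:

```python
MAX_FIX_ATTEMPTS = 3   # the "3-strike rule" from the system prompt

def fix_errors(path, attempt_fix, get_errors):
    """Try to fix errors in one file; give up and defer to the user after 3 tries."""
    for _ in range(MAX_FIX_ATTEMPTS):
        attempt_fix(path)
        if not get_errors(path):
            return True        # file is clean
    return False               # third try failed: ask the user what to do next
```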
And for tool failures, the prompt at the start of <applyPatchInstructions> says: “If you have issues with it, you should first try to fix your patch and continue using apply_patch.” Try to recover before escalating.
Sub-Agents: Fresh Context on Demand
Both Plan and Agent modes can spawn sub-agents via runSubagent. Each invocation gets a fresh context — stateless, synchronous, autonomous. The caller sends a detailed prompt and waits for a single response.
The tool description is explicit about the constraint: “Each agent invocation is stateless. You will not be able to send additional messages to the agent, nor will the agent be able to communicate with you outside of its final report. Therefore, your prompt should contain a highly detailed task description for the agent to perform autonomously.”
In our plan mode session, you can see this in action: Turn 2 immediately fires a runSubagent call for the Discovery phase. The sub-agent explored the workspace, found it was empty, and returned a structured report covering: key files found, conventions detected, technical unknowns, and safe defaults to assume.
The planning agent then adapted — instead of drafting a plan full of assumptions, it used the ask_questions tool to present the user with four structured multiple-choice questions: project base (empty folder vs existing repo), runtime (Python 3.11 + pyproject.toml vs requirements.txt vs uv), confirmation gate style (every command vs session approval vs allowlist), and scope (single-turn MVP vs multi-step). Each question included a recommended option. Only after getting answers did it draft the implementation plan.
This is the Discovery → Alignment → Design workflow playing out exactly as designed. The sub-agent does the research, surfaces unknowns, and the main agent uses those unknowns to ask precise questions rather than guessing.
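The stateless contract is easy to sketch: the only channel in is the prompt, the only channel out is the final report. Here model_call is an assumed stand-in for the underlying API call:

```python
def run_subagent(prompt: str, model_call) -> str:
    # Fresh context on every invocation: no shared history with the caller,
    # and no way to send follow-up messages once the call is made.
    messages = [{"role": "user", "content": prompt}]
    return model_call(messages)   # one detailed prompt in, one report out
```

The constraint is also why the tool description insists on "a highly detailed task description": anything the caller forgets to include is simply unavailable to the sub-agent.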
The Commentary Channel
Agent mode has a streaming architecture that creates the “watching the agent think” experience in VS Code.
The <Intermediary_updates> section tells the model to send progress updates through a commentary channel every 20 seconds. Before starting work, acknowledge the request. While exploring, explain what you’re finding. Before editing, describe what you’re about to change. And if the model is thinking for more than 100 words without acting, it must interrupt itself to send an update.
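Both triggers, the 20-second cadence and the 100-word thinking budget, reduce to a simple check. A toy model, not Copilot's implementation:

```python
UPDATE_INTERVAL_S = 20   # send a progress update at least this often
WORD_BUDGET = 100        # thinking longer than this forces an update

def should_send_update(seconds_since_update: float, pending_words: int) -> bool:
    # Either trigger is enough: too much time elapsed, or too much
    # unspoken reasoning buffered up without an action.
    return seconds_since_update >= UPDATE_INTERVAL_S or pending_words > WORD_BUDGET
```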
On top of this, gpt-4o-mini generates activity summaries alongside the main model’s work — the compact status messages in the Copilot panel. In our agent session, 4 of 11 requests were these background summaries.
This dual-channel approach (model-generated commentary + overhead model summaries) is what makes Copilot feel responsive even during long operations.
Redundancy as a Prompt Engineering Strategy
Here’s something you notice when you read both the system prompt and the user prompt side by side.
The system prompt says: “Persist until the task is fully handled end-to-end within the current turn.”
The user prompt — injected by VS Code as <reminderInstructions> — says: “You are an agent—keep going until the user’s query is completely resolved. ONLY stop if solved or genuinely blocked.”
Same instruction, stated twice, in different contexts. This isn’t accidental. The reminder instructions also add specifics the system prompt doesn’t cover:
Tool batches: You MUST preface each batch with a one-sentence why/what/outcome preamble.
Progress cadence: After 3 to 5 tool calls, or when you create/edit > ~3 files in a burst, report progress.
Requirements coverage: Read the user’s ask in full and think carefully. Do not omit a requirement.
Important behaviors are reinforced across multiple injection points to reduce drift over long conversations.
Prompt Engineering Patterns Worth Stealing
Across all modes, consistent patterns emerge.
XML tags for behavioral boundaries. The prompt uses XML tags — <autonomy_and_persistence>, <task_execution>, <ambition_vs_precision>, <modeInstructions> — instead of markdown headers for its behavioral sections. This isn’t a style choice.
XML tags are a proven prompt engineering technique across major model providers. Anthropic’s Claude was specifically fine-tuned to recognize XML tags as a prompt organizing mechanism — during pre-training, they wrapped data in XML tags, teaching the model to treat tagged content with different weight. Anthropic recommends XML for “complex prompts that mix instructions, context, examples, and variable inputs.”
OpenAI’s models aren’t trained the same way, but XML works well there too. The GPT-5 prompting guide explicitly recommends XML for instruction organization, noting that Cursor found “structured XML specs like <[instruction]_spec> improved instruction adherence.” The GPT-4.1 guide says XML is “convenient to precisely wrap a section including start and end, add metadata to the tags for additional context, and enable nesting” and “performed well in long context testing” — though it suggests starting with markdown for general formatting.
Copilot’s approach matches the emerging best practice: XML for behavioral sections that need hard boundaries and override semantics (<modeInstructions> overriding <task_execution>), markdown for formatting guidance and examples. XML costs roughly 15% more tokens than equivalent markdown — but for behavioral instructions where adherence matters more than token economy, it’s worth it.
Capitalized NEVER for hard constraints. The prompt uses NEVER for its strictest rules: NEVER try applypatch or apply-patch, NEVER add copyright or license headers, NEVER output inline citations, NEVER print this out to the user. Capitalized NEVER creates strong constraints that the model rarely violates.
Quantified constraints everywhere. “3 lines of context before and after.” “80% confidence.” “3 retries max.” “Every 20 seconds.” “More than 100 words of thinking triggers an update.” Specifics outperform vague guidance like “a few” or “brief.”
Layered prompt architecture. Identity and policies at the top (shared across all modes). Then <coding_agent_instructions>, <personality>, <task_execution>, formatting rules. And at the bottom, <modeInstructions> as the final authority that can override everything above. This lets them maintain one prompt and fork behavior per mode.
Plan quality examples. The <planning> section doesn’t just tell the model to plan — it shows three examples of high-quality plans and three examples of low-quality plans, making the expected output unambiguous.
What a Real Session Looks Like
We captured the same task across all three modes. Here’s what happened.
Agent mode completed the task in 1 minute 29 seconds across 11 requests. The main model made 6 calls: read the directory (which was empty), created agent.py in a single 35-second generation, checked for errors with get_errors, found an import issue with the anthropic package, applied a patch to add runtime import handling, verified errors were gone, and delivered the final response. Five overhead calls ran alongside for titling and activity summarization. Total: 123,783 tokens — 97% input, 3% output.
Why 97% input? Because the full system prompt, tool schemas, environment info, and conversation history are resent with every single request. No incremental deltas. Each of the 6 main model calls includes the complete ~24,000 character prompt plus all 65 tool schemas plus the growing conversation. The model reads far more than it writes.
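A toy cost model makes the 97% figure intuitive: a fixed prompt-plus-schemas payload is resent each turn, and every turn's output joins the history, so input grows while output stays small. The fixed size below is the first-request figure from this session:

```python
FIXED = 16_738   # system prompt + 65 tool schemas, resent on every request

def total_input_tokens(turn_outputs):
    total, history = 0, 0
    for out in turn_outputs:
        total += FIXED + history   # everything so far goes back in as input
        history += out             # this turn's output becomes history
    return total
```

Even with only two turns of 100 and 200 output tokens, the model has already read over 33,000 input tokens.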
The transcript also reveals that gpt-5.3-codex uses explicit thinking blocks before acting — short internal reasoning like “Planning empty workspace handling” and “Using get_errors for syntax check.” These appear as structured thinking markers in the API response, separate from the user-visible commentary.
Plan mode took 3 minutes 54 seconds across 7 requests. It spawned a sub-agent for discovery, explored directories with list_dir and file_search, used ask_questions to clarify with the user, then delivered a detailed implementation plan. Total: 52,990 tokens — less than half of Agent mode, but nearly 3x the wall time. Planning is cheaper but slower.
Ask mode finished in 1 minute 4 seconds with just 3 requests: categorize, title, answer. Total: 2,330 tokens. The categorization model classified the question (about ReAct vs plan-and-execute patterns) as unknown — it didn’t fit neatly into any of the 16 routing categories, which are tuned for VS Code-specific tasks like create_tests, vscode_configuration_questions, or terminal_state_questions. No tools, no context gathering. Over 50x cheaper than Agent mode.
What Engineers Can Take Away
Use multiple models. Don’t run everything through your most capable model. Copilot uses gpt-4o-mini for routing, titling, and summarization — tasks that don’t need the big model but make the UX significantly better.
Progressive capability saves money. Zero tools for simple Q&A. Read-only tools for research. Full access for execution. Match the tool surface to the task, not the other way around.
Mode overrides beat separate prompts. Share one base prompt across multiple modes. Override behavior at the end with a final-authority section. Easier to maintain, fewer inconsistencies.
Read-only modes are underrated. Plan mode explores extensively without risk. When your agent only needs to understand code — not change it — strip the write tools.
Assign responsibility with strong language. “It’s YOUR RESPONSIBILITY” prevents the model from deferring to the user. “It’s bad to just show code” prevents laziness. Weak language produces weak behavior.
The overhead is real. 5 of 11 agent mode requests are invisible overhead. Factor this into your cost estimates, latency budgets, and rate limit planning.
How We Captured This
All data was captured using AgentLens, an open-source MITM proxy that intercepts LLM API traffic. It sits between the agent and the API, recording complete untruncated requests and responses — system prompts, tool schemas, model responses (including thinking blocks), token counts, and timing. Nothing is inferred; it’s the raw wire protocol.
Explore the Raw Data
Everything referenced in this post is available in the agenticloops/agentic-apps-internals repo:
System prompts for all modes — agent, ask, and plan
Complete tool catalog (65 tools) — schemas, descriptions, mode deltas
Prompt engineering analysis — stats and patterns across modes
Session traces — turn-by-turn data for each mode
Raw session data — complete API payloads for independent analysis
What’s Next
Next in the series: we’ll disassemble Claude Code — Anthropic’s CLI agent — and see how a terminal-native agent approaches the same problems differently.
Thanks for reading! Subscribe for free to receive new posts and support my work.
Which AI agent should we disassemble? Drop it in the comments below.