mkxultra
Does How You Feed Context to an LLM Agent Change What It Remembers? I Tested With Canary Strings.

I work with three LLM agents daily — Claude Code, Codex CLI, and Gemini CLI. Before each task, I load project context (design docs, data models, implementation guides) into the agent so it understands what it's working on.

But there are multiple ways to load that context. You can have the agent run a shell command and read the output. You can point it at a file. You can split the context across several files.

Does the delivery method affect how much the agent actually retains?

I ran a small experiment using canary strings — unique, unpredictable markers embedded throughout the context — to measure retention objectively. Here's what I found.

Background: What's a "Base Session"?

A base session is a pattern for multi-agent development: you load project context into an agent once, record the session_id, and resume that session for every subsequent task. Instead of re-explaining your project from scratch each time, the agent picks up where it left off — already understanding your codebase.

This article isn't about the base session pattern itself (I wrote about that separately). But the experiment below tests how best to construct one — specifically, whether the method you use to inject context changes what the agent retains.

Experiment Design

The Three Delivery Patterns

| Pattern | Method | Description |
| --- | --- | --- |
| A | Shell command | Agent executes a command; reads the stdout as context |
| B | Single file | Context written to one file; agent reads it |
| C | Split files | Context split across 5 files; agent reads them in order |

The Context

I used a real development context from my own project (konkondb, an AI materialized-view system): 2,664 lines, roughly 50,000 tokens. It includes design docs, data models, CLI specs, and implementation guides — the kind of thing you'd actually feed an agent before asking it to write code.

Canary Strings: Measuring Retention Objectively

I inserted 19 unique canary strings at section boundaries throughout the context — one before each of the 18 sections, plus one trailing marker after the last section:

```text
@@CANARY_001_240ba3af@@   <- before section 1
[section 1 content]

---

@@CANARY_002_dfdf025b@@   <- before section 2
[section 2 content]

---

...

@@CANARY_019_ee18680a@@   <- after the last section
```

Each canary is generated from a SHA-256 hash of its section's content — deterministic but impossible for a model to guess. The agent is never told how many canaries exist or what they look like.

After building the base session, I swap out the canary-injected files for the originals (or delete them entirely), resume the session, and ask: "How many canary strings were there? List all of them."

If the agent recalls all 19 with exact values, the full context at least reached the model and was processed well enough to reproduce those markers. Combined with the separate comprehension and detail-accuracy tests, this gives a reasonable picture of retention. Any gaps in canary recall reveal which sections may have been lost or degraded.
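The article doesn't show the scoring step, but it is a straightforward set comparison against the ground truth. A minimal sketch (function and field names are my own, not from the experiment's actual tooling):

```python
def score_canaries(ground_truth: list[str], reported: list[str]) -> dict:
    """Compare the agent's reported canaries against the known ground truth."""
    truth = set(ground_truth)
    listed = set(reported)
    correct = truth & listed
    return {
        "recall": f"{len(correct)}/{len(truth)}",      # how many real canaries were recalled
        "precision": f"{len(correct)}/{len(listed)}",  # how many listed canaries were real
        "missing": sorted(truth - listed),             # sections that may have been lost
        "hallucinated": sorted(listed - truth),        # canaries the model invented
    }
```

The `missing` list is what points you at the lost sections, and a non-empty `hallucinated` list would be a much worse sign than a low recall.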

Additional Tests

Beyond canary recall, I also ran:

  • Comprehension test (5 questions) — drawn from the beginning, middle, and end of the context
  • Detail accuracy test (2 questions) — exact SQL constraints, function signatures

Preventing Re-reads

To ensure I was measuring memory, not the agent's ability to re-fetch:

  1. The canary-injected generation source was swapped back to the original, so rerunning the same command would no longer produce canaries
  2. For Patterns B/C, context files were deleted
  3. The agent was explicitly instructed to answer from memory only — no file reads, no command execution

Caveats Up Front

This was exploratory:

  • N=1 per condition. LLM output is non-deterministic; I'm not claiming statistical significance.
  • Single corpus: one project's context (2,664 lines, ~50K tokens). Different domains or scales may behave differently.
  • Single point in time: specific agent/model versions as of March 2026. Updates may change results.

Read this as directional signal, not proof.

Agents Tested

| Label | Agent / Model |
| --- | --- |
| claude-ultra | Claude Opus 4.6 via Claude Code |
| codex-ultra | OpenAI via Codex CLI |
| gemini-ultra | Gemini 3.1 Pro Preview via Gemini CLI |

Results

Canary Recall

```text
                     Pattern A          Pattern B          Pattern C
                    (shell command)    (single file)      (5-file split)
Claude  |################### 19/19 |################### 19/19 |################### 19/19
Codex   |################### 19/19 |################### 19/19 |################### 19/19
Gemini  |##############_____ 14/19 |################### 19/19 |################### 19/19
```

Claude and Codex recalled all 19 canaries across every pattern. Gemini scored 19/19 on Patterns B and C, but dropped to 14/19 on Pattern A (shell command execution).

Comprehension and detail-accuracy tests: all agents passed all questions across all patterns — including Gemini on Pattern A.

Detailed Scores

| Agent | Pattern | Count | Recall | Precision |
| --- | --- | --- | --- | --- |
| claude-ultra | A | 19/19 | 19/19 (100%) | 19/19 (100%) |
| claude-ultra | B | 19/19 | 19/19 (100%) | 19/19 (100%) |
| claude-ultra | C | 19/19 | 19/19 (100%) | 19/19 (100%) |
| codex-ultra | A | 19/19 | 19/19 (100%) | 19/19 (100%) |
| codex-ultra | B | 19/19 | 19/19 (100%) | 19/19 (100%) |
| codex-ultra | C | 19/19 | 19/19 (100%) | 19/19 (100%) |
| gemini-ultra | A | 19/19 | 14/19 (74%) | 14/14 (100%) |
| gemini-ultra | B | 19/19 | 19/19 (100%) | 19/19 (100%) |
| gemini-ultra | C | 19/19 | 19/19 (100%) | 19/19 (100%) |

Note the precision column for Gemini Pattern A: 14/14 (100%). Every canary it listed was correct. It didn't hallucinate any — it simply never received the missing five.

What Happened in Gemini Pattern A

Tool-Level Truncation, Not Memory Failure

When Gemini executed the shell command, the output exceeded the CLI tool's size limit. The middle section was truncated:

```text
... [56,330 characters omitted] ...
```

This physically removed canaries 004 through 008. They never reached the model.

Gemini's own API stats confirm it:

| Pattern | API calls | Input tokens |
| --- | --- | --- |
| A (shell command) | 2 | 22,438 |
| B (single file) | 3 | 58,352 |
| C (5-file split) | 2 | 51,778 |

Pattern A delivered less than half the input tokens of the other patterns. The context simply didn't arrive.

Gemini was honest about it, too — it reported that the output had been truncated and that it could only list what it actually read. No hallucinated canaries, no guessing.

How Each Agent Handled Large Output

The most interesting secondary finding was how each agent behaved when the shell command produced oversized output:

| Agent | Behavior on Pattern A | Tool calls |
| --- | --- | --- |
| Claude | Detected oversized output → autonomously saved it to a file → read it in 500-line chunks | 10 |
| Codex | Successfully captured the full output directly | (not measured) |
| Gemini | Accepted truncated output → answered honestly from what it received | 2 |

Claude's self-recovery is notable: nobody instructed it to save the output to a file and re-read it. The agent chose that strategy on its own when it detected the output was too large for a single tool response.

For reference, here are Claude's tool-call counts across all patterns:

| Pattern | Tool calls | Read strategy |
| --- | --- | --- |
| A | 10 | Bash → save to file → 6 chunked reads (500 lines each) |
| B | 8 | 7 chunked reads of ctx_full.md (500 lines each, due to a 2,000-line read limit) |
| C | 6 | 5 files, one read each (~530 lines/file, within the limit) |

For Claude, Pattern C required the fewest tool calls of the three patterns. At this context size the efficiency gain was minor, but when you're pushing closer to tool-read limits, pre-split files reduce round-trips and the risk of hitting per-call size caps.

Key Takeaways

1. At this scale, delivery method doesn't matter — if the context arrives intact

With 2,664 lines (~50K tokens) against context windows of 128K–1M tokens, there was plenty of headroom. Claude and Codex retained everything regardless of how context was delivered.

When context window capacity isn't the bottleneck, the delivery method doesn't measurably affect retention. Note that this experiment measured recall of canary markers and answers to comprehension questions; it did not verify whether split-file versus single-file loading affects the model's internal processing in other ways.

2. The only failure was in the toolchain, not the model

Gemini's 14/19 wasn't a memory problem. It was a plumbing problem. The context was truncated before it ever reached the model.

Context retention failures can happen in the toolchain's data path, not in the model's attention mechanism.

In this experiment, this was the single most actionable finding. When you're debugging why an agent seems to "forget" part of your context, check whether the context actually made it through your tool pipeline before blaming the model.

3. File reads are safer than shell command output

Shell command output is unpredictable in size. It can exceed tool limits, get truncated, or behave differently across CLI implementations. File reads let you verify the content and size beforehand, and splitting is trivial.

Practical Recommendations

| Priority | Recommendation |
| --- | --- |
| Do this | Use file reads, not shell commands, to load context into agent sessions — avoids truncation risk |
| Should do | Split files to 500–1,000 lines — stays within typical tool read limits |
| Nice to have | Embed canary strings in your context — gives you an objective QA check on session quality |
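The "should do" row is easy to automate. A minimal sketch that splits a context file into numbered parts of at most 800 lines (the chunk size and the `ctx_part` naming are my own choices, not from the experiment):

```python
from pathlib import Path


def split_context(src: str, max_lines: int = 800, prefix: str = "ctx_part") -> list[Path]:
    """Split a large context file into numbered parts that stay under tool read limits."""
    lines = Path(src).read_text().splitlines(keepends=True)
    parts = []
    for i in range(0, len(lines), max_lines):
        part = Path(f"{prefix}_{i // max_lines + 1:02d}.md")
        part.write_text("".join(lines[i : i + max_lines]))
        parts.append(part)
    return parts
```

A smarter version would cut at section boundaries rather than arbitrary line counts, but even this naive split keeps each read well under a 2,000-line per-call limit.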

How Canary Strings Work

If you want to try this yourself, the core idea is simple:

```python
import hashlib

def make_canary(index: int, content: str) -> str:
    """Generate a unique canary from section content hash."""
    hex_part = hashlib.sha256(content.encode()).hexdigest()[:8]
    return f"@@CANARY_{index:03d}_{hex_part}@@"
```

Insert one at each section boundary in your context file. After the agent loads the context, ask it to list every canary it found. Compare against your ground truth.
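The insertion step can be scripted too. A minimal sketch, assuming sections are separated by `---` lines as in the example earlier (that splitting convention, and the `inject_canaries` helper, are my assumptions — the `make_canary` helper is repeated here so the snippet is self-contained):

```python
import hashlib


def make_canary(index: int, content: str) -> str:
    # Same helper as above: deterministic canary from a section-content hash.
    hex_part = hashlib.sha256(content.encode()).hexdigest()[:8]
    return f"@@CANARY_{index:03d}_{hex_part}@@"


def inject_canaries(text: str) -> tuple[str, list[str]]:
    """Insert a canary before each section plus one trailing marker.

    Returns the injected text and the ground-truth canary list to score against.
    """
    sections = text.split("\n---\n")
    canaries = [make_canary(i + 1, s) for i, s in enumerate(sections)]
    out = "\n---\n".join(f"{c}\n{s}" for c, s in zip(canaries, sections))
    # One trailing canary after the last section, hashed from the full text.
    tail = make_canary(len(sections) + 1, text)
    canaries.append(tail)
    return out + "\n" + tail, canaries
```

For 18 sections this yields 19 canaries, matching the setup in the experiment: one before each section and one trailing marker.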

What's Next

  • Larger contexts (100K+ tokens): At the edge of context windows, delivery method differences might actually surface.
  • Multiple trials (N=5+): Statistical confidence instead of directional signal.
  • Systematic study of agent recovery behavior: Claude's autonomous fallback to file-based reading was unexpected and worth exploring across more conditions.

This experiment was run in March 2026 using Claude Code (Opus 4.6), Codex CLI (OpenAI), and Gemini CLI (3.1 Pro Preview). Results reflect those specific versions and may change with updates.
