The "Long-Term Memory" Agent Is a Fantasy on 8GB
2026's LLMs are expected to run as agents by default. Call tools, receive results, decide next action, call again. Claude Code, Cursor, Devin — all built on "long-running loop" strategies.
This strategy physically cannot work on 8GB local VRAM.
I tested a llama.cpp-based tool-calling agent with RTX 4060 Laptop (8GB) + Qwen2.5-7B Q4_K_M. The result is simple: beyond ~5 tool calls, response quality visibly degrades. Past 10 calls, the model starts ignoring results from tools it just called.
This article breaks down why this happens from KV cache and Context Rot perspectives, then examines 3 viable workarounds for 8GB.
How Much KV Cache Does Each Tool Call Eat?
Consider the token cost of one tool call cycle:
```
One tool call cycle:
  System prompt              : ~500 tok        (fixed)
  User instruction           : ~200 tok        (fixed)
  Conversation history       : variable        (accumulates)
  Tool definitions (schemas) : ~300 tok × number of tools
  LLM response (tool_call)   : ~100 tok
  Tool execution result      : ~500–2000 tok   (variable)
```
With 5 tools defined and average 800 tokens per result, KV cache accumulation per step:
| Step | Cumulative Tokens | KV Cache (fp16) | VRAM Remaining (7B Q4_K_M) |
|---|---|---|---|
| 0 (initial) | ~2,200 | 0.12 GB | 2.60 GB |
| 3 | ~4,900 | 0.26 GB | 2.46 GB |
| 5 | ~6,700 | 0.36 GB | 2.36 GB |
| 10 | ~11,200 | 0.60 GB | 2.12 GB |
| 20 | ~20,200 | 1.08 GB | 1.64 GB |
| 30 | ~29,200 | 1.56 GB | 1.16 GB |
```python
def agent_vram_estimate(steps, tokens_per_step=900, base_tokens=2200,
                        model_gb=4.68, overhead_gb=0.6,
                        n_layers=28, n_kv_heads=4, head_dim=128, dtype_bytes=2):
    """Estimate VRAM consumption for agent loop"""
    total_tokens = base_tokens + steps * tokens_per_step
    kv_gb = (2 * n_layers * total_tokens * n_kv_heads * head_dim * dtype_bytes) / (1024**3)
    total_gb = model_gb + kv_gb + overhead_gb
    return {
        "steps": steps,
        "tokens": total_tokens,
        "kv_cache_gb": round(kv_gb, 2),
        "total_vram_gb": round(total_gb, 2),
        "remaining_gb": round(8.0 - total_gb, 2),
    }

for s in [0, 5, 10, 20, 30, 50]:
    r = agent_vram_estimate(s)
    print(f"Step {s:2d}: {r['tokens']:,} tok, KV={r['kv_cache_gb']}GB, remaining={r['remaining_gb']}GB")
```
At ~30 steps, remaining VRAM drops below 1.2 GB; by ~50 steps it is nearly exhausted (~0.2 GB left), and OOM is within sight. Q4 KV cache quantization (--cache-type-k q4_0 --cache-type-v q4_0) compresses the cache ~3.5x, but even then, 100+ step loops are unrealistic.
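The effect of Q4 KV quantization on the numbers above can be sketched directly. This assumes q4_0 stores roughly 4.5 bits per value (4-bit values plus per-block scales) — an approximation, not llama.cpp's exact accounting:

```python
# Sketch: KV cache size at fp16 vs q4_0 for Qwen2.5-7B-like dimensions
# (28 layers, 4 KV heads, head_dim 128). The ~4.5 bits/value figure for
# q4_0 is an approximation (4-bit values plus per-block fp16 scales).

N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 4, 128

def kv_cache_gb(tokens: int, bits_per_value: float) -> float:
    """KV cache size in GB: 2 tensors (K and V) per layer."""
    values = 2 * N_LAYERS * tokens * N_KV_HEADS * HEAD_DIM
    return values * bits_per_value / 8 / 1024**3

for steps in (10, 30, 100):
    tokens = 2200 + steps * 900            # same budget as the table above
    fp16 = kv_cache_gb(tokens, 16)
    q4 = kv_cache_gb(tokens, 4.5)
    print(f"{steps:3d} steps ({tokens:,} tok): fp16={fp16:.2f} GB, q4_0={q4:.2f} GB")
```

Even at Q4, a 100-step loop costs ~1.4 GB of KV cache on top of the ~5.3 GB for model weights and overhead — which is why quantization delays, but does not remove, the ceiling.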
But a more serious problem hits before OOM.
Context Rot — Long Context Kills Quality
Even when everything fits in VRAM, response quality collapses as context grows. This is known as "Context Rot."
Chroma Research reports that LLM recall accuracy degrades as input token count grows, well before the context window is full. Degradation is especially pronounced in "intermediate result accumulation" patterns — exactly what agents do.
Microsoft and Salesforce's joint research "LLMs Get Lost In Multi-Turn Conversation" (arXiv:2505.06120) provides specific numbers. Converting benchmark prompts into multi-turn conversations (agent-workflow-like), they report average 39% performance drop across 6 generative tasks. Even reasoning-specialized models like o3 and DeepSeek-R1 weren't immune.
With 7B models, degradation starts earlier. What I observed with Qwen2.5-7B:
- Steps 3-5: Normal operation. Accurately references tool results, selects appropriate next action
- Steps 5-8: Begins forgetting initial instructions. Redundantly re-calls the same tools
- Steps 8-10: Ignores recent tool results. Hallucination rate climbs
- Steps 10+: Loses conversational direction. Tool calls become unrelated to the objective
This is the same structure as "Lost in the Middle" (Liu et al., TACL 2024). In agent scenarios, tool results from steps 3-4 get pushed to the "middle," and only the system prompt (beginning) and latest results (end) get referenced.
Do Larger Models Solve This?
Important counter-evidence:
GPT-4.1 showed no degradation in tool-heavy conversations. Parloa's testing confirms large models maintain stable performance in long conversations.
MemAgent extrapolates from 8K context to 3.5M token tasks with under 10% performance loss (OpenReview). RLM (Recursive Language Model) maintains 91.33% accuracy across 1000 documents and 10M+ tokens.
However, these all involve large models with tens to hundreds of GB of memory, or cloud inference.
For 7B models running on 8GB VRAM:
- The context window itself is physically limited (as shown above)
- Fewer Attention heads means weaker long-range dependency retention
- GQA (Grouped Query Attention) saves KV cache, but doesn't improve the model's actual "memory capacity"
"The problem is mitigated with sufficient model size" is true. "On 8GB, you must engineer around it" is equally true.
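The GQA point can be made concrete. Qwen2.5-7B's published config uses 28 query heads but only 4 KV heads, so the KV cache is 7x smaller than full multi-head attention would require — a sketch under those assumed dimensions:

```python
# Sketch: how GQA shrinks KV cache without changing model "memory".
# Dimensions are Qwen2.5-7B's published config (28 layers, 28 query
# heads, 4 KV heads, head_dim 128); fp16 cache assumed.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of K+V cached per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(28, 28, 128)   # full multi-head attention
gqa = kv_bytes_per_token(28, 4, 128)    # grouped-query attention
print(f"MHA: {mha/1024:.0f} KiB/token, GQA: {gqa/1024:.0f} KiB/token "
      f"({mha // gqa}x smaller)")
# → MHA: 392 KiB/token, GQA: 56 KiB/token (7x smaller)
```

The saving is in cache bytes only: attention still flows through the same 4 KV heads, which is why GQA does not buy the model any extra effective memory.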
Workaround 1: Short Loops × Context Reset
The simplest and most effective approach. Cut the agent loop short and reset context at each loop boundary.
```python
def short_loop_agent(task: str, tools: list, max_steps_per_loop: int = 5):
    """Short-loop × reset strategy agent.

    Helpers (is_task_complete, build_context, summarize, execute_tool,
    summarize_result, llm, SYSTEM_PROMPT) are elided -- this shows the
    shape of the loop, not a drop-in implementation.
    """
    memory = []  # Only carry summaries between loops

    while not is_task_complete(memory):
        # Rebuild context with minimum necessary info
        context = build_context(
            system_prompt=SYSTEM_PROMPT,
            task=task,
            memory_summary=summarize(memory[-3:]),  # Only last 3 summaries
            tools=tools,
        )

        # Execute short loop
        for step in range(max_steps_per_loop):
            response = llm.generate(context)
            if response.tool_call:
                result = execute_tool(response.tool_call)
                context.append(result)
                memory.append({
                    "step": step,
                    "tool": response.tool_call,
                    "result_summary": summarize_result(result),
                })
            else:
                break

        # End of loop: context is discarded and rebuilt next iteration,
        # carrying only summaries -- the KV cache for this loop is freed
```
The key is memory_summary. What passes between loops isn't raw tool results — it's summaries. This prevents KV cache accumulation while retaining necessary information.
5 steps × 6 loops = 30-step equivalent task, processed at ~6,700 tokens per loop (0.36GB KV cache). Compared to 1.56GB for running 30 steps straight, VRAM consumption is less than a quarter.
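What `summarize_result` does is up to the agent. A minimal stand-in — hypothetical, assuming tool results arrive as plain text — is a hard truncation that keeps the head and tail of each result within a fixed token budget (a real agent might instead ask the LLM for an abstractive summary):

```python
# Hypothetical summarize_result: cap each tool result at a fixed token
# budget before it enters cross-loop memory. Uses a crude ~4 chars/token
# heuristic; keeping head and tail mirrors the Lost-in-the-Middle
# observation that the edges of context survive best.

def summarize_result(result: str, max_tokens: int = 100) -> str:
    """Keep head and tail of a tool result within ~max_tokens."""
    budget_chars = max_tokens * 4          # rough chars-per-token estimate
    if len(result) <= budget_chars:
        return result
    head = result[: budget_chars * 2 // 3]
    tail = result[-budget_chars // 3 :]
    return f"{head} …[truncated]… {tail}"

raw = "x" * 10_000                         # e.g. a verbose tool output
print(len(summarize_result(raw)))          # bounded, vs 10,000 raw chars
```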
Workaround 2: Persistent Q4 KV Cache
arXiv:2603.04428 "Agent Memory Below the Prompt" (2026) proposes persisting agent KV cache to disk with Q4 quantization, loading directly into Attention layers when needed.
Validated on Apple M4 Pro:
- FP16 KV cache budget of 10.2GB holds only 3 agent contexts
- Q4 quantization fits 4x more agent contexts in the same memory
- TTFT improvement from cache restoration: up to 136x (22–136x for Gemma, 11–76x for DeepSeek)
The core insight: "avoid recomputation." Normally, restoring context requires recalculating prefill for all tokens. Persistent KV Cache skips this entirely by loading pre-saved KV states directly.
The paper validated on M4 Pro, but the principle applies equally to an RTX 4060. llama.cpp already exposes state save/restore (`llama_state_save_file` / `llama_state_load_file` in the C API; `--prompt-cache` in llama-cli). Saving per-agent KV snapshots to NVMe SSD and loading them on task switch avoids prefill recomputation. On 8GB, where you can only hold one agent context in VRAM at a time, this "swap" strategy's benefit is even larger than on M4 Pro.
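Whether the swap pays off is simple arithmetic: loading a Q4 KV snapshot from NVMe must beat re-running prefill. A sketch with assumed numbers — ~3 GB/s effective NVMe read and ~800 tok/s prefill for a 7B Q4 model on a 4060 Laptop are illustrative guesses, not measurements:

```python
# Sketch: restore-from-disk vs recompute-prefill for a saved agent context.
# SSD bandwidth and prefill speed below are assumptions for illustration;
# measure your own hardware before relying on the ratio.

N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 4, 128
Q4_BITS_PER_VALUE = 4.5                    # q4_0: 4-bit values + block scales

def restore_vs_prefill(tokens: int, ssd_gbps: float = 3.0,
                       prefill_tok_s: float = 800.0):
    """Return (seconds to load a Q4 KV snapshot, seconds to recompute prefill)."""
    kv_bytes = 2 * N_LAYERS * tokens * N_KV_HEADS * HEAD_DIM * Q4_BITS_PER_VALUE / 8
    return kv_bytes / (ssd_gbps * 1e9), tokens / prefill_tok_s

for tok in (6_700, 11_200, 29_200):
    load_s, prefill_s = restore_vs_prefill(tok)
    print(f"{tok:,} tok: load {load_s:.2f}s vs prefill {prefill_s:.1f}s "
          f"({prefill_s / load_s:.0f}x faster)")
```

The gap widens with context length, since snapshot size grows linearly while prefill cost grows at least linearly — which is the paper's core "avoid recomputation" insight in miniature.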
Workaround 3: Dynamic Tool Selection (Tool Loadout)
More tool definitions means worse selection accuracy. Berkeley's function-calling leaderboard confirms that as tools increase, description overlap makes correct selection harder. Empirically, 5–10 tools is the practical ceiling for 7B models. Tool definitions themselves consume context and pressure KV cache.
Solution: "don't define all tools at all times."
```python
def dynamic_tool_selection(query: str, all_tools: list, max_tools: int = 5):
    """Dynamically select tools based on query"""
    # Lightweight classifier determines query category
    category = classify_query(query)  # "search", "code", "data", etc.

    # Select tool subset based on category
    tool_groups = {
        "search": ["web_search", "file_search", "grep"],
        "code": ["run_python", "read_file", "write_file"],
        "data": ["sql_query", "csv_parse", "chart_generate"],
    }
    selected = tool_groups.get(category, all_tools[:max_tools])
    return selected
```
Loading all 20 tool definitions costs ~6,000 tokens; narrowing to 5 costs ~1,500. That 4,500-token difference is ~0.24 GB of fp16 KV cache (~0.07 GB at Q4) freed from every active context — and 4,500 fewer tokens to prefill at every loop reset, which compounds across a multi-loop task.
8GB Agent Design Principles
Combining all three workarounds:
Principle 1: Loops Stay Under 5 Steps
7B models maintain context quality up to ~6,000–8,000 tokens. At ~900 tokens per tool call, 5 steps is the limit.
Principle 2: Memory Carries as "Summaries"
Never leave raw tool results in context. Summarize at each loop boundary. Next loop only sees summaries.
Principle 3: Maximum 5 Tool Definitions
Dynamic tool selection loads only what's needed per step. "Universal agents" don't work on 8GB.
Principle 4: Monitor "Context Quality"
Track tool call "hit rate" (whether called tools matched the objective). When it drops, reset the loop. Use as automatic reset trigger.
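Principle 4 can be a few lines of bookkeeping. A sketch — "hit rate" here is simply the share of recent tool calls judged on-objective, with the judging itself left abstract (it could be a cheap LLM self-check or a heuristic match against the task):

```python
# Sketch: sliding-window tool-call hit rate as an automatic reset trigger.
# How a call is judged "on objective" (the booleans fed in) is up to the
# agent; this class only does the windowed accounting.

from collections import deque

class ContextQualityMonitor:
    def __init__(self, window: int = 5, reset_below: float = 0.6):
        self.hits = deque(maxlen=window)   # last N on-objective judgments
        self.reset_below = reset_below

    def record(self, on_objective: bool) -> None:
        self.hits.append(on_objective)

    def should_reset(self) -> bool:
        """Trigger a loop reset once the window fills and hit rate drops."""
        if len(self.hits) < self.hits.maxlen:
            return False
        return sum(self.hits) / len(self.hits) < self.reset_below

mon = ContextQualityMonitor()
for ok in [True, True, False, False, False]:
    mon.record(ok)
print(mon.should_reset())                  # → True (2/5 = 0.4 < 0.6)
```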
The 8GB Constraint Improves Agent Design
As I wrote in the 128K context article — the 8GB constraint isn't a handicap. It's a design forcing function.
Cloud-scale models can brute-force 100-step agent loops. But as Microsoft and Salesforce's research shows, being able to run the loop and maintaining quality along it are separate problems: even reasoning models like o3 and DeepSeek-R1 weren't immune to the multi-turn degradation.
The 8GB constraint doesn't hide the fact that "quality drops at 5 steps." That's precisely why it leads to "fundamentally correct design" — short loops, summary carry-over, dynamic tool selection. These design principles apply directly to cloud environments too — and arguably should be applied there.
What determines agent performance isn't context length. It's context quality.
References
- "Context Rot: How Increasing Input Tokens Impacts LLM Performance" (Chroma Research): https://research.trychroma.com/context-rot
- "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., TACL 2024): https://arxiv.org/abs/2307.03172
- arXiv:2603.04428 — Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices
- "LLMs Get Lost In Multi-Turn Conversation" (Microsoft Research & Salesforce, arXiv:2505.06120): https://arxiv.org/abs/2505.06120
- Parloa Labs — Long Conversations and LLM Performance: https://www.parloa.com/labs/insights/long-calls-LLM-performance/
- Berkeley Function-Calling Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html
- llama.cpp: https://github.com/ggerganov/llama.cpp