plasmon
I Ran an LLM Agent on 8GB VRAM — It Broke After 5 Tool Calls

The "Long-Term Memory" Agent Is a Fantasy on 8GB

2026's LLMs are expected to run as agents by default. Call tools, receive results, decide next action, call again. Claude Code, Cursor, Devin — all built on "long-running loop" strategies.

This strategy physically cannot work on 8GB local VRAM.

I tested a llama.cpp-based tool-calling agent with RTX 4060 Laptop (8GB) + Qwen2.5-7B Q4_K_M. The result is simple: beyond ~5 tool calls, response quality visibly degrades. Past 10 calls, the model starts ignoring results from tools it just called.

This article breaks down why this happens from KV cache and Context Rot perspectives, then examines 3 viable workarounds for 8GB.


How Much KV Cache Does Each Tool Call Eat?

Consider the token cost of one tool call cycle:

```
One tool call cycle:
  System prompt              : ~500 tok (fixed)
  User instruction           : ~200 tok (fixed)
  Conversation history       : variable (accumulates)
  Tool definitions (schemas) : ~300 tok × number of tools
  LLM response (tool_call)   : ~100 tok
  Tool execution result      : ~500-2000 tok (variable)
```

With 5 tools defined and average 800 tokens per result, KV cache accumulation per step:

| Step | Cumulative tokens | KV cache (fp16) | VRAM remaining (7B Q4_K_M) |
|------|-------------------|-----------------|----------------------------|
| 0 (initial) | ~2,200 | 0.12 GB | 2.60 GB |
| 3 | ~4,900 | 0.26 GB | 2.46 GB |
| 5 | ~6,700 | 0.36 GB | 2.36 GB |
| 10 | ~11,200 | 0.60 GB | 2.12 GB |
| 20 | ~20,200 | 1.08 GB | 1.64 GB |
| 30 | ~29,200 | 1.56 GB | 1.16 GB |

These numbers come from a simple estimator (Qwen2.5-7B: 28 layers, 4 KV heads, head_dim 128, fp16 cache):
```python
def agent_vram_estimate(steps, tokens_per_step=900, base_tokens=2200,
                        model_gb=4.68, overhead_gb=0.6,
                        n_layers=28, n_kv_heads=4, head_dim=128, dtype_bytes=2):
    """Estimate VRAM consumption for an agent loop."""
    total_tokens = base_tokens + steps * tokens_per_step
    # K and V, across all layers: 2 x layers x kv_heads x head_dim x bytes per token
    kv_gb = (2 * n_layers * total_tokens * n_kv_heads * head_dim * dtype_bytes) / (1024**3)
    total_gb = model_gb + kv_gb + overhead_gb
    return {
        "steps": steps,
        "tokens": total_tokens,
        "kv_cache_gb": round(kv_gb, 2),
        "total_vram_gb": round(total_gb, 2),
        "remaining_gb": round(8.0 - total_gb, 2)
    }

for s in [0, 5, 10, 20, 30, 50]:
    r = agent_vram_estimate(s)
    print(f"Step {s:2d}: {r['tokens']:,} tok, KV={r['kv_cache_gb']}GB, remaining={r['remaining_gb']}GB")
```

By 30 steps, remaining VRAM is down to ~1.2GB; at 50 steps only ~0.2GB is left, and OOM becomes a realistic risk once inference overhead spikes. Q4 KV cache quantization (`--cache-type-k q4_0 --cache-type-v q4_0` in llama.cpp) compresses the cache by roughly 3.5x, but even then, 100+ step loops are unrealistic.
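The effect of Q4 KV quantization can be folded into the estimator above. A minimal sketch, assuming the same Qwen2.5-7B dimensions and treating the ~3.5x compression factor cited above as a given:

```python
# Compare fp16 vs q4_0 KV cache footprint for the agent loop.
# Assumes Qwen2.5-7B dims (28 layers, 4 KV heads, head_dim 128)
# and a ~3.5x compression factor for q4_0 (figure from the text above).

def kv_cache_gb(total_tokens, n_layers=28, n_kv_heads=4, head_dim=128,
                dtype_bytes=2, compression=1.0):
    raw = 2 * n_layers * total_tokens * n_kv_heads * head_dim * dtype_bytes
    return raw / compression / (1024**3)

for steps in (10, 30, 50):
    tokens = 2200 + steps * 900
    fp16 = kv_cache_gb(tokens)
    q4 = kv_cache_gb(tokens, compression=3.5)
    print(f"step {steps:2d}: fp16={fp16:.2f} GB, q4_0={q4:.2f} GB")
```

Even at 50 steps, the q4_0 cache stays well under 1GB — the binding constraint at that point is quality, not memory.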

But a more serious problem hits before OOM.


Context Rot — Long Context Kills Quality

Even when everything fits in VRAM, response quality collapses as context grows. This is known as "Context Rot."

Chroma Research reports that LLMs' ability to reliably reproduce information degrades as input token count grows. Degradation is especially pronounced in "intermediate result accumulation" patterns — exactly what agent loops produce.

Microsoft and Salesforce's joint research "LLMs Get Lost In Multi-Turn Conversation" (arXiv:2505.06120) provides specific numbers. Converting benchmark prompts into multi-turn conversations (agent-workflow-like), they report average 39% performance drop across 6 generative tasks. Even reasoning-specialized models like o3 and DeepSeek-R1 weren't immune.

With 7B models, degradation starts earlier. What I observed with Qwen2.5-7B:

  • Steps 3-5: Normal operation. Accurately references tool results, selects appropriate next action
  • Steps 5-8: Begins forgetting initial instructions. Redundantly re-calls the same tools
  • Steps 8-10: Ignores recent tool results. Hallucination rate climbs
  • Steps 10+: Loses conversational direction. Tool calls become unrelated to the objective

This is the same structure as "Lost in the Middle" (Liu et al., TACL 2024). In agent scenarios, tool results from steps 3-4 get pushed to the "middle," and only the system prompt (beginning) and latest results (end) get referenced.


Do Larger Models Solve This?

Important counter-evidence:

GPT-4.1 showed no degradation in tool-heavy conversations. Parloa's testing confirms large models maintain stable performance in long conversations.

MemAgent extrapolates from 8K context to 3.5M token tasks with under 10% performance loss (OpenReview). RLM (Recursive Language Model) maintains 91.33% accuracy across 1000 documents and 10M+ tokens.

However, these all involve large models with tens to hundreds of GB of memory, or cloud inference.

For 7B models running on 8GB VRAM:

  1. The context window itself is physically limited (as shown above)
  2. Fewer attention heads mean weaker long-range dependency retention
  3. GQA (Grouped Query Attention) saves KV cache, but doesn't improve the model's actual "memory capacity"
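The GQA point is easy to quantify. Qwen2.5-7B uses 28 query heads but only 4 KV heads, so its KV cache is 7x smaller than full multi-head attention would need — without storing any more information per token. A sketch using Qwen2.5-7B's published config:

```python
# KV cache per token: 2 (K and V) x layers x kv_heads x head_dim x bytes.
# Qwen2.5-7B: 28 layers, 28 query heads, 4 KV heads (GQA), head_dim 128.

def kv_bytes_per_token(n_layers=28, n_kv_heads=4, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

gqa = kv_bytes_per_token(n_kv_heads=4)    # what Qwen2.5-7B actually stores
mha = kv_bytes_per_token(n_kv_heads=28)   # hypothetical full multi-head attention
print(f"GQA: {gqa/1024:.0f} KiB/token, MHA: {mha/1024:.0f} KiB/token, saving {mha//gqa}x")
```

The saving is purely in storage — the model's effective "memory capacity" over long contexts is unchanged, which is the point.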

"The problem is mitigated with sufficient model size" is true. "On 8GB, you must engineer around it" is equally true.


Workaround 1: Short Loops × Context Reset

The simplest and most effective approach. Cut the agent loop short and reset context at each loop boundary.

```python
def short_loop_agent(task: str, tools: list, max_steps_per_loop: int = 5):
    """Short-loop x reset strategy agent.

    is_task_complete, build_context, summarize, summarize_result,
    execute_tool, and llm are application-specific placeholders.
    """
    memory = []  # only summaries are carried between loops

    while not is_task_complete(memory):
        # Rebuild context from scratch with the minimum necessary info
        context = build_context(
            system_prompt=SYSTEM_PROMPT,
            task=task,
            memory_summary=summarize(memory[-3:]),  # only last 3 summaries
            tools=tools,
        )

        # Execute one short loop
        for step in range(max_steps_per_loop):
            response = llm.generate(context)
            if not response.tool_call:
                break
            result = execute_tool(response.tool_call)
            context.append(result)  # raw result lives only inside this loop
            memory.append({
                "step": step,
                "tool": response.tool_call,
                "result_summary": summarize_result(result),
            })

        # Loop boundary: context is discarded and the KV cache is freed;
        # the next iteration rebuilds it from summaries alone
```

The key is memory_summary. What passes between loops isn't raw tool results — it's summaries. This prevents KV cache accumulation while retaining necessary information.

5 steps × 6 loops = 30-step equivalent task, processed at ~6,700 tokens per loop (0.36GB KV cache). Compared to 1.56GB for running 30 steps straight, VRAM consumption is less than a quarter.
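`summarize_result` and `summarize` are left abstract above. They don't need an LLM call — on 8GB, where every extra generation costs time, a cheap extractive version is often enough. A minimal sketch (the truncation limits are arbitrary choices, not tuned values):

```python
def summarize_result(result: str, max_chars: int = 400) -> str:
    """Cheap extractive summary: keep the head and tail of a tool result.

    Tool outputs (logs, file contents, API responses) usually carry their
    signal at the start (status, headers) and end (totals, errors).
    """
    if len(result) <= max_chars:
        return result
    head, tail = result[: max_chars // 2], result[-(max_chars // 2):]
    return f"{head}\n...[{len(result) - max_chars} chars omitted]...\n{tail}"

def summarize(entries: list) -> str:
    """Join per-step summaries into one block for the next loop's context."""
    return "\n".join(
        f"[step {e['step']}] {e['tool']}: {e['result_summary']}" for e in entries
    )
```

An LLM-generated summary is higher quality but costs a full generation pass per loop boundary; the extractive version is free and deterministic.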


Workaround 2: Persistent Q4 KV Cache

arXiv:2603.04428 "Agent Memory Below the Prompt" (2026) proposes persisting agent KV cache to disk with Q4 quantization, loading directly into Attention layers when needed.

Validated on Apple M4 Pro:

  • FP16 KV cache budget of 10.2GB holds only 3 agent contexts
  • Q4 quantization fits 4x more agent contexts in the same memory
  • TTFT improvement from cache restoration: up to 136x (22–136x for Gemma, 11–76x for DeepSeek)

The core insight: "avoid recomputation." Normally, restoring context requires recalculating prefill for all tokens. Persistent KV Cache skips this entirely by loading pre-saved KV states directly.

The paper validated on M4 Pro, but the principle applies equally to an RTX 4060. llama.cpp exposes state save/restore (`llama_state_save_file` / `llama_state_load_file` in the C API; `--prompt-cache` in llama-cli). Saving per-agent KV snapshots to NVMe SSD and loading them on task switch avoids prefill recomputation. On 8GB — where you can hold only one agent context at a time — this "swap" strategy pays off even more than on an M4 Pro.
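Why the swap beats recomputation is simple arithmetic: restoring a context by prefill means running the model over every token, while restoring from a saved KV snapshot is a sequential NVMe read. A rough estimate — the ~700 tok/s prefill rate and ~3 GB/s NVMe throughput are assumptions for an RTX 4060 Laptop class setup, not measurements:

```python
# Compare context restoration: recompute prefill vs load a Q4 KV snapshot.
# Assumed rates (not measured): ~700 tok/s prefill for a 7B Q4 model,
# ~3 GB/s sequential NVMe read. KV bytes/token uses Qwen2.5-7B dims.

def restore_times(tokens, prefill_tok_s=700, nvme_gb_s=3.0,
                  kv_bytes_per_tok=57344, q4_compression=3.5):
    prefill_s = tokens / prefill_tok_s
    snapshot_gb = tokens * kv_bytes_per_tok / q4_compression / (1024**3)
    load_s = snapshot_gb / nvme_gb_s
    return prefill_s, load_s

for toks in (5_000, 10_000, 20_000):
    p, l = restore_times(toks)
    print(f"{toks:6d} tok: prefill {p:5.1f}s vs snapshot load {l:4.2f}s ({p/l:.0f}x)")
```

Under these assumptions the load is two orders of magnitude faster, which is consistent in spirit with the paper's reported TTFT improvements.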


Workaround 3: Dynamic Tool Selection (Tool Loadout)

More tool definitions means worse selection accuracy. Berkeley's function-calling leaderboard confirms that as tools increase, description overlap makes correct selection harder. Empirically, 5–10 tools is the practical ceiling for 7B models. Tool definitions themselves consume context and pressure KV cache.

Solution: "don't define all tools at all times."

```python
def dynamic_tool_selection(query: str, all_tools: list, max_tools: int = 5):
    """Dynamically select a tool subset based on the query."""
    # Lightweight classifier (placeholder) determines the query category
    category = classify_query(query)  # "search", "code", "data", etc.

    # Tool subsets per category
    tool_groups = {
        "search": ["web_search", "file_search", "grep"],
        "code": ["run_python", "read_file", "write_file"],
        "data": ["sql_query", "csv_parse", "chart_generate"],
    }

    # Fall back to the first max_tools when the category is unknown,
    # and never hand the model more than max_tools definitions
    selected = tool_groups.get(category, all_tools[:max_tools])
    return selected[:max_tools]
```

Loading all 20 tool definitions costs ~6,000 tokens. Narrowing to 5 tools: ~1,500 tokens. On Qwen2.5-7B's dimensions, that 4,500-token difference is ~0.24 GB of fp16 KV cache (~0.07 GB with Q4 quantization). The per-context saving looks small, but every loop reset re-prefills the tool definitions, so over a session it compounds in both VRAM and prefill time — and shorter contexts also delay Context Rot.
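The arithmetic behind that saving, using the same Qwen2.5-7B dimensions, the ~300 tok/tool schema cost from earlier, and the ~3.5x Q4 factor:

```python
# KV cache cost of tool definitions (Qwen2.5-7B: 28 layers, 4 KV heads,
# head_dim 128, fp16 = 2 bytes; ~3.5x compression assumed for q4_0).
BYTES_PER_TOKEN = 2 * 28 * 4 * 128 * 2  # K and V across all layers

def tool_def_cost_gb(n_tools, tokens_per_tool=300, compression=1.0):
    return n_tools * tokens_per_tool * BYTES_PER_TOKEN / compression / (1024**3)

saved_fp16 = tool_def_cost_gb(20) - tool_def_cost_gb(5)
saved_q4 = tool_def_cost_gb(20, compression=3.5) - tool_def_cost_gb(5, compression=3.5)
print(f"20 -> 5 tools saves {saved_fp16:.2f} GB fp16 / {saved_q4:.3f} GB q4_0 per context")
```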


8GB Agent Design Principles

Combining all three workarounds:

Principle 1: Loops Stay Under 5 Steps

7B models maintain context quality up to ~6,000–8,000 tokens. At ~900 tokens per tool call, 5 steps is the limit.

Principle 2: Memory Carries as "Summaries"

Never leave raw tool results in context. Summarize at each loop boundary. Next loop only sees summaries.

Principle 3: Maximum 5 Tool Definitions

Dynamic tool selection loads only what's needed per step. "Universal agents" don't work on 8GB.

Principle 4: Monitor "Context Quality"

Track tool call "hit rate" (whether called tools matched the objective). When it drops, reset the loop. Use as automatic reset trigger.
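A hit-rate monitor can be as simple as a rolling window over a per-objective allow-list. A sketch — the allow-list, window size, and 0.6 threshold are illustrative, not tuned values:

```python
from collections import deque

class ContextQualityMonitor:
    """Track whether recent tool calls match the objective; trigger a
    loop reset when the rolling hit rate falls below a threshold."""

    def __init__(self, relevant_tools: set, window: int = 5, threshold: float = 0.6):
        self.relevant = relevant_tools
        self.calls = deque(maxlen=window)
        self.threshold = threshold

    def record(self, tool_name: str) -> None:
        self.calls.append(tool_name in self.relevant)

    def should_reset(self) -> bool:
        if len(self.calls) < self.calls.maxlen:
            return False  # not enough signal yet
        return sum(self.calls) / len(self.calls) < self.threshold

# Usage: a search task where the agent drifts into unrelated tools
mon = ContextQualityMonitor({"web_search", "file_search", "grep"})
for tool in ["web_search", "grep", "run_python", "write_file", "chart_generate"]:
    mon.record(tool)
print(mon.should_reset())  # hit rate 2/5 = 0.4 < 0.6 -> reset
```

Richer signals (did the model reference the last tool result? did it repeat a call?) can feed the same trigger; the point is that the reset decision is automatic, not manual.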


The 8GB Constraint Improves Agent Design

As I wrote in the 128K context article — the 8GB constraint isn't a handicap. It's a design forcing function.

Cloud-scale models can brute-force 100-step agent loops. But as Microsoft and Salesforce's research shows, being able to run it and maintaining quality are separate problems — even reasoning models like o3 weren't immune to the multi-turn degradation behind that 39% average drop.

The 8GB constraint doesn't hide the fact that "quality drops at 5 steps." That's precisely why it leads to "fundamentally correct design" — short loops, summary carry-over, dynamic tool selection. These design principles apply directly to cloud environments too — and arguably should be applied there.

What determines agent performance isn't context length. It's context quality.

