If you're running an AI agent on Amazon Bedrock and injecting persistent memory into every conversation, where you put that memory in the request matters a lot — both for how well the agent uses it and for what it costs you.
I learned this the direct way while connecting agent-memory-daemon to OpenClaw running on Amazon Bedrock AgentCore Runtime. The setup works beautifully. My agent now remembers my preferences, my projects, and the weird Bedrock timeout I debugged three weeks ago. But along the way I hit a subtle interaction between memory injection and prompt caching that's worth documenting.
This post walks through the architecture, the Bedrock prompt caching rule that tripped me up, and the one-line fix that cut my cache-related costs dramatically.
The setup: persistent memory for a serverless agent
OpenClaw lives in a container on AgentCore Runtime. AgentCore freezes the container when idle, which is great for cost (zero idle spend) but hostile to long-term memory (every wake is a blank slate). agent-memory-daemon solves this by running as a background process in the same container, doing two things:
Extraction — watches the session transcript directory and pulls out facts, decisions, and preferences worth remembering. Writes them as individual markdown files with YAML frontmatter.
Consolidation — periodically reorganizes the memory directory: merges duplicates, resolves contradictions, prunes stale content, and maintains a concise MEMORY.md index under a strict size budget.
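Concretely, an extracted memory is just a small markdown file with YAML frontmatter, and consolidation later folds these into MEMORY.md. Something along these lines (illustrative only, not an exact dump of my memory directory or the daemon's documented schema):

```markdown
---
type: preference
created: 2026-02-10
tags: [user, workflow]
---

Prefers short answers with code examples; main current project is the
AgentCore deployment of OpenClaw.
```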
Memory is synced to S3 between invocations. When a new conversation starts, the container restores the memory directory and reads MEMORY.md to bring the agent up to speed.
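The sync itself doesn't need anything fancy. A minimal sketch, assuming an aws s3 sync round-trip and made-up bucket and path names:

```python
import subprocess

# Hypothetical names: adjust to your bucket and container layout.
MEMORY_DIR = "/app/memory"
S3_PREFIX = "s3://my-agent-memory-bucket/memory"

def restore_memory():
    """On container wake: pull the memory directory down from S3."""
    subprocess.run(["aws", "s3", "sync", S3_PREFIX, MEMORY_DIR], check=True)

def persist_memory():
    """After the daemon writes: push local changes back up to S3."""
    subprocess.run(["aws", "s3", "sync", MEMORY_DIR, S3_PREFIX], check=True)
```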
The daemon itself is cheap. It makes a few Haiku calls per day — my config targets about $0.25/month for the daemon's own LLM usage. The magic happens in what it produces: a curated, size-budgeted MEMORY.md that's always ~18KB regardless of how many sessions the agent has had.
Discord → EC2 bot → AgentCore Runtime → container
├── openclaw (the agent)
├── agent-memory-daemon (curator)
└── server.py (HTTP glue + S3 sync)
The daemon writes files. The agent reads files. The filesystem is the interface. No SDK, no API, no coupling.
Injecting the memory
On every invocation, I load MEMORY.md from S3 and pass it to OpenClaw as context. My first version looked like this:
memory_context = load_memory_from_s3()  # ~18KB of curated memory
effective_message = message
if memory_context:
    effective_message = (
        f"[LONG-TERM MEMORY - persisted memory from previous sessions]\n\n"
        f"{memory_context}\n\n"
        f"[END OF MEMORY]\n\n"
        f"User message: {message}"
    )
messages = [{"role": "user", "content": effective_message}]
I stuffed the memory into the user message. The agent saw it. It remembered my preferences. Everything worked.
I also had Bedrock prompt caching enabled through OpenClaw's config:
{
  "agents": {
    "defaults": {
      "models": {
        "amazon-bedrock/...claude-haiku-4-5...": {
          "params": { "cacheRetention": "short" }
        }
      }
    }
  }
}
Claude Haiku 4.5 supports prompt caching with a 5-minute TTL on the "short" retention mode. Cache reads are billed at ~10% of the regular input rate. On paper, my 18KB memory (~4,500 tokens) should have been getting served from cache at roughly a tenth of the price on every turn after the first.
Then I looked at Cost Explorer.
What Bedrock actually caches
Three days of usage, broken down by token type:
| Line item | Tokens (millions) | Cost |
|---|---|---|
| Cache Read | 12.69 | $1.40 |
| Cache Write | 7.09 | $9.75 |
| Input (uncached) | 31.91 | $35.10 |
| Output | 4.72 | $25.96 |
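Dividing each line's cost by its token count gives the implied per-million rates, which is a useful sanity check on the ~10% claim above:

```python
# Implied per-million-token rates from the three-day breakdown above
items = {
    "Cache Read":       (12.69, 1.40),
    "Cache Write":      (7.09,  9.75),
    "Input (uncached)": (31.91, 35.10),
    "Output":           (4.72,  25.96),
}
for name, (millions_of_tokens, cost_usd) in items.items():
    print(f"{name}: ${cost_usd / millions_of_tokens:.2f} per million tokens")
# Cache Read       ~ $0.11/M
# Cache Write      ~ $1.38/M
# Input (uncached) ~ $1.10/M   <- cache reads really are ~10% of uncached input
# Output           ~ $5.50/M
```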
The "Input (uncached)" line is the one that doesn't make sense if caching is working. I had 12.69M cache reads, which meant something was being cached — OpenClaw's internal system prompt was getting cached fine. But 31.91M tokens were paying full input price. Where were they coming from?
Here's the rule that trips people up: Bedrock prompt caching caches a stable prefix. It looks at the beginning of the request, finds the longest chunk that's identical to a previously-cached request, and serves that from cache. Everything after the divergence point is recomputed and billed as regular input.
Now look at my code again:
messages = [{"role": "user", "content": effective_message}]
effective_message is "[LONG-TERM MEMORY]...18KB of memory...User message: {message}". The user's actual question is appended at the end.
What Bedrock sees:
- Turn 1: messages[0].content = "[MEMORY]...same 18KB...User message: what time is it?"
- Turn 2: messages[0].content = "[MEMORY]...same 18KB...User message: tell me a joke"
Those two strings share a stable 18KB prefix of memory content, but they're both in messages[0].content. The cacheable prefix is actually the system prompt that OpenClaw builds on top — OpenClaw's own system content, its tool definitions, its skill metadata. Once the request stream reaches the user message, Bedrock sees variance (the actual user question) and stops caching.
So the memory was sitting in a position where it couldn't be cached. Every turn paid full price for those 4,500 tokens.
The fix
The change is small. Move the memory to a system message, before the user message:
messages = []
if memory_context:
    messages.append({
        "role": "system",
        "content": (
            "You have access to long-term memory from previous sessions. "
            "Use this to answer questions about the user's preferences and history.\n\n"
            f"{memory_context}"
        ),
    })
messages.append({"role": "user", "content": message})
Now the memory is part of the stable system prefix. It sits alongside OpenClaw's own system prompt, tool definitions, and skills — the stuff that genuinely doesn't change between turns. Bedrock sees the same system block on every request and serves it from cache at 10% of the regular rate.
A one-line architectural change. A 90% discount on the biggest line item in the bill.
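For reference, the same shape expressed directly against the Bedrock Converse API looks roughly like this. It's a sketch rather than what OpenClaw does internally: the model ID is a placeholder, and the explicit cachePoint marker assumes you're managing prompt caching yourself instead of through cacheRetention. It reuses memory_context and message from the earlier snippets.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="<your-haiku-4-5-model-id>",  # placeholder
    system=[
        {
            "text": (
                "You have access to long-term memory from previous sessions.\n\n"
                + memory_context
            )
        },
        # Everything above this marker becomes the cacheable prefix.
        {"cachePoint": {"type": "default"}},
    ],
    messages=[
        {"role": "user", "content": [{"text": message}]},
    ],
)

# usage reports cacheReadInputTokens / cacheWriteInputTokens when caching is active
print(response["usage"])
```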
Verifying it worked
After deploying, I asked OpenClaw for its usage stats via the /usage full chat command:
🦞 OpenClaw 2026.2.26
🧮 Tokens: 9 in / 516 out
🗄️ Cache: 99% hit · 67k cached, 715 new
📚 Context: 34k/200k (17%)
67K tokens served from cache, only 715 new tokens computed. Before the fix, the 4,500-token memory injection was in the "new" bucket every turn. Now it's in the 67K cached bucket.
Cost Explorer followed suit: the "Input (uncached)" line dropped, and the "Cache Read" line absorbed that traffic at a tenth of the price.
Three takeaways
1. Prompt caching only caches a stable prefix. Everything up to the first point of variance between requests is cacheable. Everything after is not. If you're injecting repeated context, put it early in the request — system prompt, tool definitions, or the first message of a consistent message sequence.
2. User content is almost always the wrong place for stable context. The user's actual question varies every turn. Anything you concatenate with it inherits that variance and becomes uncacheable. Pull it out into a system message.
3. Watch cache writes in your bill. Cache writes cost more than regular input (1.25x on Haiku 4.5). If you see high cache writes, it means your TTL is expiring between requests and the cache is being rewritten each time. Keep the cache warm — for cacheRetention: "short" (5-min TTL), a heartbeat every ~4 minutes avoids cold-cache rewrites.
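The keep-warm part can be a loop that sends a throwaway turn before the TTL lapses. A minimal sketch, assuming a hypothetical invoke_agent helper that wraps however you call your agent:

```python
import time

HEARTBEAT_SECONDS = 4 * 60  # comfortably inside the 5-minute "short" TTL

def keep_cache_warm(invoke_agent):
    """Periodically send a trivial turn so the cached prefix never expires."""
    while True:
        invoke_agent("cache heartbeat - reply with 'ok'")  # reply is discarded
        time.sleep(HEARTBEAT_SECONDS)
```

Each heartbeat wakes the container and spends a handful of tokens, so the interval is a balance between daemon chatter and cache expiry.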
The daemon, revisited
None of this is a critique of agent-memory-daemon — the daemon did exactly what it was supposed to do. It produced a stable, size-budgeted 18KB memory file. The integration code I wrote around it was putting that output in the wrong place.
In fact, the daemon's design (stable output size, consistent content, regular regeneration rhythm) is ideal for prompt caching. As long as you feed it into a system message, Bedrock can cache the whole thing for the TTL window, and the daemon's periodic consolidation doesn't bust the cache more often than necessary.
If you're running OpenClaw or any agent on Bedrock and want persistent memory without a managed memory service, the pattern works well:
- Run agent-memory-daemon alongside your agent
- Sync the memory directory to S3 between sessions (or use a mounted filesystem if available)
- Load the curated MEMORY.md at the start of each conversation
- Inject it as a system message, not user content
- Enable cacheRetention on your model config
The daemon handles the hard part (curating memories without bloat). Bedrock handles the cheap part (caching the stable prefix). You just have to put the memory in the right place.
Code
tverney / agent-memory-daemon — Open-source memory consolidation and extraction daemon for AI agents. Filesystem-native, LLM-pluggable, framework-agnostic; works with OpenClaw, Strands, LangChain, AgentCore Runtime, or any agent that can write a file. Agents feed it raw observations as markdown files, and the daemon runs the consolidation and extraction passes described above. The filesystem is the interface (no SDK, no API, no MCP required), and the LLM backend is pluggable: OpenAI, Amazon Bedrock, or anything with a chat API.
tverney / openclaw-agentcore-personal — Deploy your own personal OpenClaw on AWS Bedrock AgentCore: serverless, ~$9-15/month infrastructure, one-click CloudFormation, with Discord, WhatsApp, Telegram, or Slack connectivity. Instead of running an EC2 instance 24/7, the agent runs on-demand via AgentCore Runtime: a t4g.nano Discord bot invokes the OpenClaw container, which calls Bedrock (Haiku/Sonnet/Nova), and the container freezes between invocations so you only pay when you use it. This repo contains the full AgentCore deployment, including the system-message fix and the S3 sync layer.

Part of the OpenClaw Challenge.
Top comments (1)

Follow-up on the heartbeat detail: if you're on cacheRetention: "short" (5-min TTL), the heartbeat interval matters more than people realize. I landed on 4 min for safety margin. Might try 4.5 min to reduce daemon chatter without risking cache expiry.