Three months ago, I had a problem: six AI agents were generating reports every day, and none of them could remember what happened yesterday.
Each agent would wake up fresh, execute its task, write a report, and then vanish. The next day, a new instance would spawn with zero context. It was like working with a team of goldfish. Technically competent, but with a three-second memory span.
I tried the obvious solutions first. I dumped everything into a single MEMORY.md file. That became unwieldy at 500 lines. I tried vector embeddings and semantic search. Overkill for "what did agent01 do yesterday?" I tried having agents read each other's reports. That created a combinatorial explosion of context tokens.
What finally worked was a three-tier memory system that mirrors how humans actually remember things: short-term working notes, medium-term daily logs, and long-term curated memory. Here's how I built it, where I messed up, and what the agents can now do that they couldn't before.
## The Problem with Agent Amnesia
Let me paint the picture. I had six agents:
- hunter-content - writes articles and content
- bounty-hunter - finds and completes crypto bounties
- airdrop-hunter - tracks airdrop opportunities
- data-collector - gathers market data
- strategy-audit - audits trading strategies
- polymarket-research - researches prediction markets
Each agent ran daily via cron. Each produced useful output. But none of them learned from previous runs.
On day 17, bounty-hunter would try the same bounty that failed on day 3. On day 42, airdrop-hunter would ask me for the same wallet address it had used 40 times before. It was exhausting.
The fundamental issue: AI agents are stateless by default. They don't persist between sessions unless you explicitly build that persistence. And most "agent memory" solutions I found were either:
- Too simple - a single text file that becomes unreadable at scale
- Too complex - vector databases, embeddings, semantic search for problems that don't need them
- Too abstract - papers and frameworks that sound great but don't ship
I needed something that worked tomorrow, not something that required a PhD in information retrieval.
## The Three-Tier Architecture
The solution I landed on borrows from how human memory works:
```
┌─────────────────────────────────────────────────────────┐
│ Long-term Memory (MEMORY.md)                            │
│ - Curated, high-signal lessons                          │
│ - Updated weekly by main agent                          │
│ - ~500 lines max, manually pruned                       │
└─────────────────────────────────────────────────────────┘
                            ^
                            | Weekly review
                            v
┌─────────────────────────────────────────────────────────┐
│ Medium-term Memory (memory/YYYY-MM-DD.md)               │
│ - Raw daily logs from all agents                        │
│ - Auto-created, auto-expiring (30 days)                 │
│ - ~2000 lines/day, searchable                           │
└─────────────────────────────────────────────────────────┘
                            ^
                            | End-of-day flush
                            v
┌─────────────────────────────────────────────────────────┐
│ Short-term Memory (HEARTBEAT.md + runtime context)      │
│ - Current task state                                    │
│ - In-progress work                                      │
│ - Cleared after task completion                         │
└─────────────────────────────────────────────────────────┘
```
### Tier 1: Short-term (Working Memory)
This is the agent's "RAM" - what it's working on right now. For my setup, this lives in two places:
HEARTBEAT.md - A small file (kept under 50 lines) that tracks what the agent is currently doing. Think of it as a sticky note on the monitor.
Runtime context - The conversation history and tool outputs from the current session. This disappears when the session ends.
Example HEARTBEAT.md:
```markdown
## Active Tasks
- [x] Check email inbox (checked 08:00)
- [x] Review calendar (2 meetings today)
- [ ] Update bounty tracker (in progress)
- [ ] Write dev.to article (scheduled 14:00)

## Current State
- Last bounty submitted: 2026-04-07 (Galxe #4421)
- Pending approvals: 2
- Token budget remaining: 45,000
```
The key insight: keep this small and ephemeral. If it grows beyond 50 lines, it's no longer working memory - it's becoming a log, and should move to tier 2.
### Tier 2: Medium-term (Daily Logs)
This is the agent's "journal" — raw, unfiltered logs of what happened each day. Each day gets its own file:
```
memory/
├── 2026-04-01.md
├── 2026-04-02.md
├── 2026-04-03.md
...
```
Each file follows a simple template:
```markdown
# 2026-04-08

## Agent Runs

### bounty-hunter (09:15)
- Checked 12 new bounties
- Submitted 3 (Galxe #4421, #4422, Layer3 #881)
- Rejected 9 (blacklist hits: 2, low reward: 4, expired: 3)
- Earnings: $47 estimated

### airdrop-hunter (10:30)
- Tracked 5 new airdrops
- Registered wallets for: Zora, LayerZero
- Skipped: (gas too high, requirements unclear)

### hunter-content (14:00)
- Wrote article: "Building an AI Agent Memory System"
- Word count: 2,100
- Submitted to dev.to

## Decisions Made
- Added "Galxe" to preferred platforms (good payout, fast approval)
- Blacklisted "BounceBit" (scam signals: see lessons-learned.md)

## Open Questions
- Should we increase token budget for content agents?
- Need to verify LayerZero airdrop eligibility
```
Why daily files instead of one big log?
Natural boundaries - Days are meaningful units. You can say "what happened last Tuesday?" without parsing a 10,000-line monolith.
Easy expiration - I auto-delete files older than 30 days. With a single file, you'd need to manually prune. With daily files, it's `find memory/ -mtime +30 -delete`.
Parallel writes - Multiple agents can write to the same day's file without locking issues (they append, not overwrite).
Grep-friendly - Need to find when something happened? `grep -r "Galxe" memory/` gives you instant results with dates.
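The same lookup works from inside an agent. Here's a rough TypeScript sketch using the same file layout; `searchMemory` and `matchLines` are names I made up for illustration, not part of any framework:

```typescript
// Sketch: search all daily logs for a keyword, tagging each match with
// the date it came from. Paths and return shape are illustrative.
import { promises as fs } from 'fs';
import * as path from 'path';

// Pure helper: find lines in one day's log that mention the keyword
function matchLines(logText: string, keyword: string): string[] {
  return logText
    .split('\n')
    .filter((line) => line.toLowerCase().includes(keyword.toLowerCase()));
}

async function searchMemory(
  memoryDir: string,
  keyword: string
): Promise<{ date: string; line: string }[]> {
  const results: { date: string; line: string }[] = [];
  for (const file of await fs.readdir(memoryDir)) {
    if (!file.endsWith('.md')) continue;
    const text = await fs.readFile(path.join(memoryDir, file), 'utf-8');
    for (const line of matchLines(text, keyword)) {
      // Filenames are YYYY-MM-DD.md, so the date comes for free
      results.push({ date: file.replace('.md', ''), line });
    }
  }
  return results;
}
```

Because the filenames carry the dates, the result set is already chronological context, no index required.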
### Tier 3: Long-term (Curated Memory)
This is the agent's "wisdom" — distilled lessons that persist indefinitely. It lives in MEMORY.md at the workspace root.
Unlike daily logs, this file is curated. Not everything makes it here. Only information that matters:
```markdown
# MEMORY.md - Long-term Memory

## Agent Identities
- hunter-content = 吴用 (Wu Yong), content expert, uses xiaomi/mimo-v2-pro
- bounty-hunter = 时迁 (Shi Qian), stealth operations
- airdrop-hunter = 戴宗 (Dai Zong), speed and reconnaissance

## Critical Workflows
- Bounty submission → wait 24-48h for approval → check status → claim
- Article writing → humanizer pass → dev.to submission → track engagement
- Airdrop registration → document tx hash → monitor eligibility

## Known Pitfalls
- Galxe bounties expire quickly (check deadline before starting)
- LayerZero airdrop requires mainnet tx (testnet doesn't count)
- dev.to rate limits: max 3 posts/day, space them out

## Preferences
- Preferred chains: Ethereum, Arbitrum, Optimism (low gas)
- Avoid: BSC (scam-heavy), unknown L2s
- Content tone: technical but conversational, first-person

## Relationships
- Steve = human user, makes final decisions
- main agent = coordinator, assigns tasks
- agentX = executor, completes tasks
```
The weekly review ritual:
Every Sunday (or whenever I remember), I read through the past week's daily logs and extract lessons into MEMORY.md. This is manual — I don't trust agents to curate their own long-term memory yet.
The process:
- Read `memory/2026-04-01.md` through `memory/2026-04-07.md`
- Look for patterns: repeated mistakes, successful strategies, new discoveries
- Add to MEMORY.md if it's broadly applicable
- Remove outdated info (MEMORY.md has a soft 500-line limit)
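The date bookkeeping for that review is easy to script. A small sketch (the `lastNDates` helper is hypothetical; it's just date arithmetic over the `memory/YYYY-MM-DD.md` naming convention):

```typescript
// Sketch: list the past week's log files for the weekly review.
function lastNDates(n: number, from: Date = new Date()): string[] {
  const dates: string[] = [];
  for (let i = 1; i <= n; i++) {
    // Walk backwards one day at a time, skipping today itself
    const d = new Date(from.getTime() - i * 24 * 60 * 60 * 1000);
    dates.push(d.toISOString().split('T')[0]); // YYYY-MM-DD
  }
  return dates;
}

// e.g. reviewing on 2026-04-08 covers 2026-04-07 back to 2026-04-01
const reviewFiles = lastNDates(7, new Date('2026-04-08T12:00:00Z')).map(
  (d) => `memory/${d}.md`
);
```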
This is the most important part of the system. Without curation, you just have data hoarding. With curation, you actually learn something.
## Implementation Details

### File Structure
```
/root/.openclaw/workspace/
├── MEMORY.md        # Long-term memory
├── HEARTBEAT.md     # Short-term state
├── memory/          # Daily logs
│   ├── 2026-04-01.md
│   ├── 2026-04-02.md
│   └── ...
├── reports/         # Agent outputs
│   ├── bounty-hunter.md
│   ├── airdrop-hunter.md
│   └── ...
└── articles/        # Content outputs
    ├── devto-article-1.md
    └── ...
```
### Writing to Memory
Agents write to memory using simple file operations. Here's the pattern:
```typescript
// Example: Agent writing to daily log
import { promises as fs } from 'fs';

async function writeDailyLog(agent: string, entry: string) {
  const today = new Date().toISOString().split('T')[0];
  const logPath = `/root/.openclaw/workspace/memory/${today}.md`;
  const timestamp = new Date().toLocaleTimeString('zh-CN', {
    timeZone: 'Asia/Shanghai',
    hour: '2-digit',
    minute: '2-digit',
  });
  const logEntry = `### ${agent} (${timestamp})\n${entry}\n\n`;
  await fs.appendFile(logPath, logEntry, 'utf-8');
}
```
For long-term memory, the main agent does the writing during weekly review:
```typescript
// Example: Updating MEMORY.md
import { promises as fs } from 'fs';

async function updateLongTermMemory(lesson: string, category: string) {
  const memoryPath = '/root/.openclaw/workspace/MEMORY.md';
  let content = await fs.readFile(memoryPath, 'utf-8');
  const categoryHeader = `## ${category}`;
  const lessonEntry = `- ${lesson}`;

  if (!content.includes(categoryHeader)) {
    // Category doesn't exist yet: create it at the end of the file
    content += `\n${categoryHeader}\n${lessonEntry}\n`;
  } else {
    // Insert the lesson directly under its category header
    // (appending to the end of the file would orphan it)
    content = content.replace(
      categoryHeader,
      `${categoryHeader}\n${lessonEntry}`
    );
  }
  await fs.writeFile(memoryPath, content, 'utf-8');
}
```
### Reading from Memory
Agents read memory at the start of each session:
```typescript
// Agent startup routine
import { promises as fs } from 'fs';

async function initializeAgent() {
  // 1. Read long-term memory (always)
  const longTermMemory = await fs.readFile('MEMORY.md', 'utf-8');

  // 2. Read recent daily logs (last 3 days)
  const recentLogs: string[] = [];
  for (let i = 0; i < 3; i++) {
    const date = new Date(Date.now() - i * 24 * 60 * 60 * 1000);
    const dateStr = date.toISOString().split('T')[0];
    try {
      const log = await fs.readFile(`memory/${dateStr}.md`, 'utf-8');
      recentLogs.push(log);
    } catch (e) {
      // File doesn't exist yet, skip
    }
  }

  // 3. Read heartbeat state (if it exists)
  let heartbeatState = '';
  try {
    heartbeatState = await fs.readFile('HEARTBEAT.md', 'utf-8');
  } catch (e) {
    // No heartbeat file; start with a fresh, empty state
  }

  // 4. Combine into context
  return {
    longTermMemory,
    recentLogs: recentLogs.join('\n'),
    heartbeatState,
  };
}
```
## What Changed After Adding Memory
The difference was immediate and dramatic:
Before memory system:
- Agents repeated mistakes daily
- I had to manually provide context every session
- No institutional knowledge - each agent lived in isolation
- Token waste: re-explaining the same things over and over
After memory system:
- bounty-hunter remembers which platforms pay out reliably
- airdrop-hunter tracks registration dates and eligibility windows
- hunter-content knows which topics performed well
- Agents can reference "last week I tried X and it failed because Y"
Concrete examples:
Bounty blacklist — After bounty-hunter got scammed by "BounceBit" (fake bounty, stolen work), the incident was logged in daily memory, then curated into MEMORY.md. Now all agents check the blacklist before starting work.
Content optimization — hunter-content noticed that articles with "Lessons from X Trades" in the title got 2x more engagement. This was added to MEMORY.md under "Content Preferences", and now all articles follow this pattern.
Gas optimization — airdrop-hunter learned that registering for airdrops during UTC 02:00-06:00 saves 30-50% on gas fees. This is now a standard practice encoded in long-term memory.
## Where I Messed Up

### Mistake 1: Trying to Automate Curation
Initially, I had agents automatically extract "lessons learned" and write to MEMORY.md. This was a disaster. The agents would write things like:
"Lesson: Completing tasks successfully is important for productivity."
That's not a lesson - that's a fortune cookie. I realized curation requires human judgment about what's actually worth remembering. Now I do the weekly review manually.
### Mistake 2: Keeping Daily Logs Forever
I thought "more data is better" and kept daily logs indefinitely. After three months, I had 90+ files totaling 200,000+ lines. Searching became slow. Context became bloated.
Now I auto-delete logs older than 30 days. If something is truly important, it gets promoted to long-term memory during the weekly review. Everything else expires.
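The same expiry policy, sketched in TypeScript for agents that can't shell out to `find` (function names are illustrative; the only real assumption is the `memory/YYYY-MM-DD.md` naming convention):

```typescript
// Sketch: delete daily logs older than 30 days.
import { promises as fs } from 'fs';
import * as path from 'path';

// Pure helper: is a YYYY-MM-DD.md log older than maxAgeDays?
function isExpired(filename: string, now: Date, maxAgeDays = 30): boolean {
  const date = new Date(filename.replace('.md', '') + 'T00:00:00Z');
  const ageMs = now.getTime() - date.getTime();
  return ageMs > maxAgeDays * 24 * 60 * 60 * 1000;
}

async function pruneDailyLogs(memoryDir: string, now = new Date()) {
  for (const file of await fs.readdir(memoryDir)) {
    if (file.endsWith('.md') && isExpired(file, now)) {
      await fs.unlink(path.join(memoryDir, file));
    }
  }
}
```

Keeping the age check pure makes the policy easy to test without touching the filesystem.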
### Mistake 3: Not Enforcing Size Limits
HEARTBEAT.md grew to 300 lines because agents kept appending tasks without clearing completed ones. It became unreadable.
Now there's a hard 50-line limit. If an agent wants to write something, it must first check if there's room. If not, it needs to clean up old entries. This forces discipline.
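One way to enforce that discipline in code. This is a sketch, not the exact mechanism I run; it assumes the `- [x]` checkbox format shown earlier:

```typescript
// Sketch: keep HEARTBEAT.md under the line limit by dropping
// completed tasks first, then oldest lines if still over.
const MAX_HEARTBEAT_LINES = 50;

function trimHeartbeat(
  content: string,
  maxLines = MAX_HEARTBEAT_LINES
): string {
  let lines = content.split('\n');
  if (lines.length <= maxLines) return content;
  // First pass: completed tasks have already served their purpose
  lines = lines.filter((line) => !line.trim().startsWith('- [x]'));
  // Still too long? Keep only the most recent lines
  if (lines.length > maxLines) {
    lines = lines.slice(lines.length - maxLines);
  }
  return lines.join('\n');
}
```

An agent would call this before every append, so the file can never quietly drift back to 300 lines.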
## The Token Economics
Let's talk cost, because memory isn't free - it consumes context tokens.
Daily memory read cost:
- MEMORY.md: ~500 lines ≈ 8,000 tokens
- Last 3 daily logs: ~6,000 lines ≈ 100,000 tokens
- HEARTBEAT.md: ~50 lines ≈ 800 tokens
- Total: ~109,000 tokens per agent startup
At $0.50 per million input tokens (roughly Claude Haiku-class pricing), that's about $0.05 per agent run. With 6 agents running daily, that's roughly $0.33/day, or about $10/month.
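Spelled out as arithmetic (the token counts and price are the estimates above, not measured values):

```typescript
// Back-of-envelope cost check using the numbers above
const tokensPerStartup = 8_000 + 100_000 + 800; // MEMORY + 3 logs + HEARTBEAT
const pricePerMillion = 0.5; // USD, assumed Haiku-class input pricing
const agents = 6;

const costPerRun = (tokensPerStartup / 1_000_000) * pricePerMillion;
const costPerDay = costPerRun * agents;
const costPerMonth = costPerDay * 30;
// costPerRun ≈ $0.054, costPerDay ≈ $0.33, costPerMonth ≈ $9.79
```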
Is it worth it?
Absolutely. The alternative is me spending 30 minutes per agent session providing context manually. At my effective hourly rate, that's $15-20/day. The memory system pays for itself 50x over.
## When NOT to Use This System
This isn't a one-size-fits-all solution. Don't use it if:
You have one agent doing one simple task - A cron job that backs up files doesn't need memory.
Your agents are truly stateless by design - Some workflows benefit from fresh starts every time.
You can't commit to weekly curation - Without the weekly review, daily logs become a data graveyard.
You're working with sensitive data - Don't store API keys, passwords, or personal info in plain text memory files. Use a secrets manager.
## The Verdict
Building a memory system for AI agents is less about the technology and more about the discipline. The file structure is trivial. The hard part is:
- Actually doing the weekly review
- Resisting the urge to automate curation prematurely
- Deleting old data instead of hoarding it
- Enforcing size limits even when it feels uncomfortable
If you're willing to do the work, the payoff is real. Your agents will feel less like amnesiac contractors and more like a team that actually learns from experience.
And honestly? That's the whole point of building agents in the first place - not to replace human judgment, but to amplify it by handling the boring stuff consistently while you focus on what matters.
The memory system described here is part of the OpenClaw workspace structure. If you want to see the actual files, they're open source at github.com/openclaw/workspace. I'm not affiliated with OpenClaw — I just use it because it gives me the flexibility to build systems like this.