<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Qiushi</title>
    <description>The latest articles on DEV Community by Qiushi (@qiushiwu).</description>
    <link>https://dev.to/qiushiwu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812196%2Fde926bc0-f990-406d-a420-1ff18c81f717.png</url>
      <title>DEV Community: Qiushi</title>
      <link>https://dev.to/qiushiwu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/qiushiwu"/>
    <language>en</language>
    <item>
      <title>Teaching an AI to Attend Your Teams Meeting: A Real-Time Voice Pipeline</title>
      <dc:creator>Qiushi</dc:creator>
      <pubDate>Sat, 14 Mar 2026 22:25:48 +0000</pubDate>
      <link>https://dev.to/qiushiwu/teaching-an-ai-to-attend-your-teams-meeting-a-real-time-voice-pipeline-130b</link>
      <guid>https://dev.to/qiushiwu/teaching-an-ai-to-attend-your-teams-meeting-a-real-time-voice-pipeline-130b</guid>
      <description>&lt;p&gt;Microsoft Teams has a lot of AI features now. Copilot can summarize meetings after the fact. It can generate action items. It can transcribe. What it cannot do — at least not in any form that worked for us — is let you summon an AI agent mid-meeting with a wake word and have a real conversation, with the AI hearing everyone, speaking back, and remembering context from the whole call.&lt;/p&gt;

&lt;p&gt;We wanted that. So we built it.&lt;/p&gt;

&lt;p&gt;The result is an open-source voice pipeline that integrates a Claude-backed AI agent into live Teams meetings. You say a wake word, the AI activates, hears your question in context, and responds in natural speech — all in real time, with echo cancellation so it doesn't confuse its own voice for input.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/QiushiWu95/teams-meeting-agent-public" rel="noopener noreferrer"&gt;teams-meeting-agent-public&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just use the Teams API?
&lt;/h2&gt;

&lt;p&gt;Teams does have a calling API. You can add bots to meetings. But the meeting bot API is designed for structured integrations — transcription services, recording bots, meeting note-takers. The audio pipeline it exposes is not well-suited for a low-latency conversational agent that needs to hear all participants, do speaker identification, respond in under two seconds, and handle interruption.&lt;/p&gt;

&lt;p&gt;We also have specific constraints: we work across a Mac (where the Teams client runs) and a Linux GPU server (where we run inference). The GPU server is where we want the speech recognition — running faster-whisper on CUDA is significantly faster and more accurate than anything you can do on a Mac in real time. That means we need a bridge between two machines, with audio flowing across an SSH tunnel.&lt;/p&gt;

&lt;p&gt;The path of least resistance turned out to be: use PulseAudio virtual devices on the Linux side to intercept Teams audio, do all processing there, and build a custom WebSocket relay to coordinate everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mac (Teams client)                    Linux GPU Server
┌─────────────────────┐               ┌──────────────────────────────────┐
│                     │               │                                  │
│  Microsoft Teams    │◄──────────────│  teams_speaker (null-sink)       │
│  (speaker output)   │               │  teams_virtual_mic (null-sink)   │
│  (mic input)        │               │  teams_mic_input (virtual-source)│
│                     │               │                                  │
│  bridge.py          │◄─WebSocket────│  ws_relay.py (port 8765)         │
│  (wake word,        │───speak cmd──►│                                  │
│   transcript buf)   │               │  stt_pipeline.py                 │
│                     │               │  (faster-whisper + VAD)          │
│  OpenClaw Agent     │               │                                  │
│  (Claude + memory)  │               │  tts_pipeline.py                 │
│                     │               │  (Edge-TTS → PulseAudio)         │
└─────────────────────┘               │                                  │
         │                            │  speaker_id.py                   │
         └────────SSH tunnel──────────│  (ECAPA-TDNN voiceprints)        │
              (port 8765)             └──────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flow for a single exchange:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Teams audio plays through &lt;code&gt;teams_speaker&lt;/code&gt; (a PulseAudio null-sink)&lt;/li&gt;
&lt;li&gt;STT pipeline captures &lt;code&gt;.monitor&lt;/code&gt; stream, runs VAD + speaker ID + whisper&lt;/li&gt;
&lt;li&gt;Transcript is sent via WebSocket to the Mac bridge&lt;/li&gt;
&lt;li&gt;Bridge detects wake word → buffers context → sends to OpenClaw agent over HTTP&lt;/li&gt;
&lt;li&gt;Agent generates reply → bridge sends &lt;code&gt;speak&lt;/code&gt; command back via WebSocket&lt;/li&gt;
&lt;li&gt;TTS pipeline synthesizes with Edge-TTS → streams to &lt;code&gt;teams_virtual_mic&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Teams hears the AI speaking through &lt;code&gt;teams_mic_input&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything except the agent LLM call and Edge-TTS synthesis runs on our own machines. STT is on-device CUDA, TTS audio starts playing within roughly 200ms of the first Edge-TTS chunk arriving, and the whole round-trip from wake word to first spoken word is typically under two seconds.&lt;/p&gt;
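&lt;p&gt;The relay's routing decisions are small enough to sketch. A minimal dispatcher in Python (the message shapes here are illustrative assumptions based on the flow above, not the project's exact wire format):&lt;/p&gt;

```python
import json

def route(message: str) -> tuple:
    """Decide where an incoming relay message goes.

    Returns (destination, payload). Message shapes are illustrative
    assumptions, not the project's exact wire format.
    """
    msg = json.loads(message)
    if msg.get("type") == "transcript":
        # STT output flows to the Mac bridge for wake-word checks
        return ("bridge", msg)
    if msg.get("cmd") in ("speak", "interrupt"):
        # Bridge commands flow to the TTS pipeline
        return ("tts", msg)
    return ("drop", msg)
```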

&lt;h2&gt;
  
  
  The PulseAudio trick
&lt;/h2&gt;

&lt;p&gt;The whole system depends on a PulseAudio setup that most people haven't seen before. On the Linux server, we create three virtual audio devices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pactl load-module module-null-sink &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;sink_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;teams_speaker &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;sink_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;device.description&lt;span class="o"&gt;=&lt;/span&gt;Teams_Speaker

pactl load-module module-null-sink &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;sink_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;teams_virtual_mic &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;sink_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;device.description&lt;span class="o"&gt;=&lt;/span&gt;Teams_Virtual_Mic

pactl load-module module-virtual-source &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;source_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;teams_mic_input &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;master&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;teams_virtual_mic.monitor &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;source_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;device.description&lt;span class="o"&gt;=&lt;/span&gt;Teams_Mic_Input
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;teams_speaker&lt;/code&gt; is a null-sink — audio goes in and plays to nothing. But null-sinks in PulseAudio automatically create a &lt;code&gt;.monitor&lt;/code&gt; source that exposes the audio as a readable stream. So by setting Teams' speaker output to &lt;code&gt;Teams_Speaker&lt;/code&gt;, we get &lt;code&gt;teams_speaker.monitor&lt;/code&gt; — a real-time PCM stream of everything Teams is playing, including all meeting participants. The STT pipeline reads from this.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;teams_virtual_mic&lt;/code&gt; and &lt;code&gt;teams_mic_input&lt;/code&gt; work the same way in reverse. The TTS pipeline writes synthesized speech to &lt;code&gt;teams_virtual_mic&lt;/code&gt;, and &lt;code&gt;teams_mic_input&lt;/code&gt; (a virtual source layered on &lt;code&gt;teams_virtual_mic.monitor&lt;/code&gt;) re-exposes that audio as a proper microphone device, which we set as Teams' microphone input. So when the AI "speaks", Teams hears it as a microphone signal.&lt;/p&gt;

&lt;p&gt;This is entirely transparent to Teams. It doesn't know it's talking to virtual devices. No API access required. No bot registration. The Linux server just appears to be a meeting participant with a very smart microphone.&lt;/p&gt;

&lt;p&gt;There's one complication: the virtual devices need to be created inside the Chrome Remote Desktop (CRD) PulseAudio session, not the system PulseAudio. CRD runs its own isolated PulseAudio daemon with a non-standard socket path. The startup script detects this automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PULSE_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ssh &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SSH_ALIAS&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"cat /proc/&lt;/span&gt;&lt;span class="se"&gt;\$&lt;/span&gt;&lt;span class="s2"&gt;(pgrep -u &lt;/span&gt;&lt;span class="se"&gt;\$&lt;/span&gt;&lt;span class="s2"&gt;USER pulseaudio | tail -1)/environ 2&amp;gt;/dev/null &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
   | tr '&lt;/span&gt;&lt;span class="se"&gt;\0&lt;/span&gt;&lt;span class="s2"&gt;' '&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;' | grep PULSE_RUNTIME_PATH | cut -d= -f2"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It reads the environment of the running PulseAudio process to find the socket path, then exports &lt;code&gt;PULSE_SERVER=unix:${PULSE_PATH}/native&lt;/code&gt; before every &lt;code&gt;pactl&lt;/code&gt; call. Everything else just works.&lt;/p&gt;
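&lt;p&gt;The same lookup can be done from Python by parsing the process environment directly. A sketch of the parsing step, mirroring what the shell pipeline above does:&lt;/p&gt;

```python
def find_pulse_runtime_path(environ_bytes: bytes) -> str:
    """Extract PULSE_RUNTIME_PATH from a /proc/PID/environ blob.

    /proc/PID/environ is a NUL-separated list of KEY=VALUE entries,
    so this mirrors the tr/grep/cut pipeline in the startup script.
    """
    for entry in environ_bytes.split(b"\x00"):
        if entry.startswith(b"PULSE_RUNTIME_PATH="):
            return entry.split(b"=", 1)[1].decode()
    return ""

def pulse_server_address(runtime_path: str) -> str:
    # The value exported as PULSE_SERVER before each pactl call
    return "unix:" + runtime_path + "/native"
```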

&lt;h2&gt;
  
  
  STT pipeline: faster-whisper + VAD + echo cancellation
&lt;/h2&gt;

&lt;p&gt;The STT pipeline is where most of the interesting signal processing happens. It runs on the Linux server and has four responsibilities: capture audio, detect speech boundaries, transcribe, and suppress its own voice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audio capture&lt;/strong&gt; uses PulseAudio's &lt;code&gt;parec&lt;/code&gt; to stream raw PCM from &lt;code&gt;teams_speaker.monitor&lt;/code&gt; at 16kHz mono (the format faster-whisper expects). Audio comes in as continuous 30ms chunks.&lt;/p&gt;
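&lt;p&gt;Driving that capture from Python comes down to spawning &lt;code&gt;parec&lt;/code&gt; and reading its stdout. A sketch that just builds the command (the flags are parec's standard options; the device name comes from the setup above):&lt;/p&gt;

```python
def parec_command(source: str = "teams_speaker.monitor") -> list:
    """Build the parec argv for 16 kHz mono s16le capture.

    parec writes raw PCM to stdout; the STT loop would read it in
    fixed-size chunks via subprocess.Popen(..., stdout=PIPE).
    """
    return [
        "parec",
        "--device=" + source,
        "--format=s16le",    # 16-bit little-endian PCM
        "--rate=16000",      # the sample rate faster-whisper expects
        "--channels=1",      # mono
    ]
```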

&lt;p&gt;&lt;strong&gt;Voice activity detection&lt;/strong&gt; uses Silero VAD. The raw audio stream is chunked into 512-sample frames and fed to the model. The VAD runs fast enough that it doesn't add perceptible latency. Speech segments are accumulated until a silence gap triggers a transcription.&lt;/p&gt;
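&lt;p&gt;The accumulate-until-silence logic is a small state machine, independent of which VAD produces the per-frame speech probabilities. A sketch with illustrative thresholds (not the project's tuned values):&lt;/p&gt;

```python
class Segmenter:
    """Accumulate speech frames; emit a segment after a silence gap."""

    def __init__(self, speech_prob=0.5, max_silence=15):
        self.speech_prob = speech_prob  # frame counts as speech above this
        self.max_silence = max_silence  # ~0.5 s worth of 32 ms frames
        self.frames = []
        self.silence = 0

    def feed(self, frame, prob):
        """Return a finished segment (list of frames) or None."""
        if prob > self.speech_prob:
            self.frames.append(frame)
            self.silence = 0
            return None
        if self.frames:
            self.silence += 1
            if self.silence > self.max_silence:
                segment, self.frames = self.frames, []
                self.silence = 0
                return segment
        return None
```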

&lt;p&gt;&lt;strong&gt;Transcription&lt;/strong&gt; uses faster-whisper with &lt;code&gt;distil-large-v3&lt;/code&gt; on CUDA. Distil-large-v3 is a distilled version of Whisper large-v3 — comparable accuracy, roughly 5x faster. For Chinese-English mixed meetings (our primary use case), it handles code-switching without needing language hints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Echo cancellation&lt;/strong&gt; is the part that took the most tuning. Without it, the AI would transcribe its own TTS output, which creates feedback loops where it hears itself speaking and tries to respond. The solution is a &lt;code&gt;SpeakingState&lt;/code&gt; object that coordinates between the TTS and STT pipelines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SpeakingState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_speaking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tail_suppress_until&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_speaking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_speaking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tail_suppress_until&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ECHO_TAIL_SUPPRESS_SEC&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_suppressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_speaking&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tail_suppress_until&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When TTS starts playing, &lt;code&gt;set_speaking(True)&lt;/code&gt; is called. STT checks &lt;code&gt;is_suppressed()&lt;/code&gt; before processing any VAD-triggered segment. After TTS finishes, suppression continues for a configurable tail window (we use 0.8 seconds) to catch audio still draining through the PulseAudio buffer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Barge-in detection&lt;/strong&gt; is the other half of the interruption story. When a human speaks while the AI is talking, we want to stop the AI mid-sentence. This uses energy-based detection rather than the VAD, because the VAD's segment-level decisions arrive too late to serve as a real-time interrupt trigger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fast 200ms window vs slow 3.2s baseline
&lt;/span&gt;&lt;span class="n"&gt;fast_energy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
&lt;span class="n"&gt;slow_energy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3200&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fast_energy&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;slow_energy&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;BARGE_IN_RATIO&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;trigger_interrupt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the fast window's energy spikes above a ratio of the slow baseline, it signals a barge-in. The TTS pipeline receives an interrupt command and stops playback immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speaker identification: voiceprint matching
&lt;/h2&gt;

&lt;p&gt;Not everything said in a meeting should reach the AI. You might want only the meeting host to be able to invoke the agent, or only people on your team. Speaker identification solves this.&lt;/p&gt;

&lt;p&gt;The implementation uses speechbrain's ECAPA-TDNN model — a speaker verification model that produces 192-dimensional speaker embeddings. For each audio segment, we extract an embedding and match it against a set of registered voiceprints using cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voiceprint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SPEAKER_MATCH_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# 0.30 by default
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;matched_speaker_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The threshold of 0.30 is intentionally low — we'd rather have a false positive (recognizing an unknown speaker as known) than miss a legitimate user. For a trust boundary where only certain people can invoke the agent, you'd raise this.&lt;/p&gt;
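&lt;p&gt;The matching step itself is only a few lines. A pure-Python sketch of cosine matching against registered voiceprints (the real pipeline operates on tensors, but the math is identical):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_speaker(embedding, voiceprints, threshold=0.30):
    """Return the best-matching enrolled name, or None if below threshold.

    voiceprints maps name to enrolled embedding; 0.30 mirrors the
    deliberately permissive default discussed above.
    """
    best_name, best_score = None, threshold
    for name, enrolled in voiceprints.items():
        score = cosine_similarity(embedding, enrolled)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```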

&lt;p&gt;The model is language-agnostic. It works equally well for Chinese and English speakers, which matters for our meetings. Inference runs in under 10ms on CUDA, so it adds negligible latency to the transcription pipeline.&lt;/p&gt;

&lt;p&gt;Speaker identity flows into the WebSocket relay as a trust layer. Transcripts are tagged with either &lt;code&gt;verified&lt;/code&gt; (matched a known voiceprint) or &lt;code&gt;untrusted&lt;/code&gt; (unknown speaker). The bridge uses this to filter who can invoke the wake word.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wake word detection and the bridge
&lt;/h2&gt;

&lt;p&gt;The bridge runs on the Mac. Its job is to sit between the Linux server and the OpenClaw agent, making routing decisions about what gets sent where.&lt;/p&gt;

&lt;p&gt;When a transcript arrives from the Linux server, the bridge checks for wake words using regex pattern matching. The wake word list is configurable — in our setup it includes "hey claude", "hey agent", and a few Chinese equivalents. The bridge also supports a "presentation mode" where the wake word requirement is relaxed and all transcripts flow through.&lt;/p&gt;
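&lt;p&gt;A sketch of that matching step (the patterns here are examples, not the project's actual list, which is configurable and includes Chinese phrases):&lt;/p&gt;

```python
import re

# Example wake phrases; a hypothetical subset of the configurable list.
WAKE_PATTERNS = [
    re.compile(r"\bhey,?\s+claude\b", re.IGNORECASE),
    re.compile(r"\bhey,?\s+agent\b", re.IGNORECASE),
]

def has_wake_word(transcript: str, presentation_mode: bool = False) -> bool:
    """In presentation mode, every transcript flows through."""
    if presentation_mode:
        return True
    return any(p.search(transcript) for p in WAKE_PATTERNS)
```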

&lt;p&gt;When a wake word is detected, the bridge transitions to "engaged" mode: it buffers subsequent transcripts, accumulates context from multiple speakers, and flushes the buffer to the OpenClaw agent's HTTP API when there's a natural pause. This means the AI doesn't just hear one sentence — it gets a multi-turn context window of what was being discussed when it was summoned.&lt;/p&gt;
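&lt;p&gt;The engaged-mode buffering can be sketched as a timestamped accumulator that flushes on a pause. The pause length and flush-on-next-add design here are illustrative; a real implementation would more likely flush from a timer:&lt;/p&gt;

```python
class ContextBuffer:
    """Accumulate speaker-tagged transcripts; flush on a natural pause."""

    def __init__(self, pause_sec=2.0):
        self.pause_sec = pause_sec  # illustrative, not the tuned value
        self.lines = []
        self.last_time = None

    def add(self, speaker, text, now):
        """Return the flushed multi-speaker context on a pause, else None."""
        flushed = None
        if self.last_time is not None and now - self.last_time > self.pause_sec:
            flushed = self.flush()
        self.lines.append(speaker + ": " + text)
        self.last_time = now
        return flushed

    def flush(self):
        context, self.lines = "\n".join(self.lines), []
        return context
```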

&lt;p&gt;The bridge also handles the response path. When OpenClaw generates a reply, the bridge sends a &lt;code&gt;speak&lt;/code&gt; command back to the Linux server's WebSocket relay:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"cmd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"speak"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Here's the answer to your question..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The TTS pipeline picks this up, synthesizes it, and plays it through the virtual microphone.&lt;/p&gt;

&lt;p&gt;Connection resilience is important here — meetings can last hours. The bridge implements automatic reconnection with exponential backoff and heartbeat pings to detect silent disconnections before they cause dropped transcripts.&lt;/p&gt;
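&lt;p&gt;The backoff schedule is worth stating precisely. A sketch with assumed base and cap values:&lt;/p&gt;

```python
def backoff_delays(base=1.0, cap=30.0, factor=2.0):
    """Yield reconnect delays: base, base*factor, ..., capped at cap.

    base/cap/factor are assumptions, not the project's values; real
    implementations usually add jitter to avoid reconnect stampedes.
    """
    delay = base
    while True:
        yield min(delay, cap)
        delay = delay * factor
```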

&lt;h2&gt;
  
  
  TTS pipeline: streaming synthesis
&lt;/h2&gt;

&lt;p&gt;Edge-TTS is a free, high-quality TTS service that produces natural-sounding speech. The catch is latency: if you wait for the full reply to be synthesized before playing anything, the delay grows with the length of the response.&lt;/p&gt;

&lt;p&gt;We work around this with chunked generation. Edge-TTS internally streams MP3 data, and the &lt;code&gt;edge-tts&lt;/code&gt; Python library exposes this. We pipe the MP3 stream through ffmpeg to convert it to PCM on the fly, then write to PulseAudio as chunks arrive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;communicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# ffmpeg stdin
&lt;/span&gt;        &lt;span class="c1"&gt;# ffmpeg is already decoding and writing to PulseAudio
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is that the first audible speech plays within 200-300ms of the speak command arriving. In a meeting context where people are already talking and waiting for a response, this latency is barely perceptible.&lt;/p&gt;

&lt;p&gt;The TTS pipeline integrates with the same &lt;code&gt;SpeakingState&lt;/code&gt; used by STT. Before writing each chunk, it checks for an interrupt signal. If barge-in was detected while audio was being generated, playback stops mid-sentence and the pipeline sends an acknowledgment back to the bridge so the agent knows the response was cut short.&lt;/p&gt;
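&lt;p&gt;The per-chunk interrupt check can be sketched with a shared flag (&lt;code&gt;threading.Event&lt;/code&gt; here stands in for whatever signal the real pipeline uses):&lt;/p&gt;

```python
import threading

def play_chunks(chunks, write, interrupted: threading.Event) -> int:
    """Write audio chunks until finished or interrupted.

    Returns how many chunks were written, so the caller can tell the
    bridge whether the response was cut short.
    """
    written = 0
    for chunk in chunks:
        if interrupted.is_set():
            break  # barge-in detected: stop mid-sentence
        write(chunk)
        written += 1
    return written
```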

&lt;h2&gt;
  
  
  One-command startup
&lt;/h2&gt;

&lt;p&gt;The entire system — SSH tunnel, remote PulseAudio setup, both processes in tmux sessions — starts with a single script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./start_meeting.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script handles the full orchestration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check if SSH tunnel is already running on port 8765; create it if not&lt;/li&gt;
&lt;li&gt;SSH into the Linux server, detect the CRD PulseAudio socket path&lt;/li&gt;
&lt;li&gt;Load the three virtual audio devices (idempotent — skips if already loaded)&lt;/li&gt;
&lt;li&gt;Write a launcher script on the remote to avoid shell-escaping issues with the socket path&lt;/li&gt;
&lt;li&gt;Start &lt;code&gt;main_linux.py&lt;/code&gt; in a remote tmux session (&lt;code&gt;teams-voice&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Poll until the WebSocket is accepting connections&lt;/li&gt;
&lt;li&gt;Start &lt;code&gt;bridge.py&lt;/code&gt; in a local tmux session (&lt;code&gt;teams-bridge&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;
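&lt;p&gt;Step 6 is a plain TCP poll. A minimal sketch:&lt;/p&gt;

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 15.0) -> bool:
    """Poll until something accepts TCP connections on host:port."""
    deadline = time.monotonic() + timeout
    while deadline > time.monotonic():
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)
    return False
```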

&lt;p&gt;After startup, the output shows exactly what's running and what to configure in Teams:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;══════════════════════════════════════════════════
  Teams Voice Agent — RUNNING
══════════════════════════════════════════════════
  Session key  : voice-meeting-20260313-1430
  Bridge tmux  : teams-bridge  (local)
  Remote tmux  : teams-voice   (on gpu-server)
  Tunnel       : localhost:8765 → gpu-server:8765

  ⚠️  Set Teams audio: Speaker=Teams_Speaker, Mic=Teams_Mic_Input
══════════════════════════════════════════════════
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only manual step is setting Teams' audio devices to the virtual ones. After that, everything is hands-free.&lt;/p&gt;

&lt;h2&gt;
  
  
  In practice
&lt;/h2&gt;

&lt;p&gt;In a typical meeting, the pipeline is completely silent until activated. STT is running continuously, speaker ID is tagging everyone, but nothing flows to the agent. The AI is listening but not present.&lt;/p&gt;

&lt;p&gt;When someone says "hey claude, can you look something up", the bridge catches the wake word, buffers the next few sentences of context, and sends the whole thing to the agent. The agent responds in under two seconds, the voice comes through everyone's speakers (since the AI is speaking through the virtual microphone), and then the pipeline goes quiet again.&lt;/p&gt;

&lt;p&gt;We use it primarily for knowledge retrieval during technical discussions — pulling up documentation, cross-referencing notes, summarizing what was decided earlier in the call. The agent has access to the OpenClaw memory system, so it can retrieve context from past meetings about the same project. This is the part that still surprises people in meetings: the AI not only answers the question but references a decision from last week's call.&lt;/p&gt;

&lt;p&gt;Barge-in works better than expected. If someone starts talking while the AI is mid-response, it stops within about half a second. There's a brief artifact from the audio that was already in the PulseAudio buffer, but it's not disruptive.&lt;/p&gt;

&lt;p&gt;The main limitation is that it requires Chrome Remote Desktop to be running on the Linux server. CRD creates the PulseAudio environment that the virtual devices live in. Without it, you'd need to adapt the startup script to work with whatever PulseAudio setup you have. The core pipeline code doesn't care — it just needs the right device names to exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we'd do differently
&lt;/h2&gt;

&lt;p&gt;The current architecture puts all the signal processing on the Linux server, which makes sense for us but isn't universal. If you have a GPU-capable Mac or just want a simpler setup, the STT pipeline could run locally using MLX-Whisper — roughly the same accuracy on Apple Silicon, no remote server needed.&lt;/p&gt;

&lt;p&gt;Speaker identification is currently based on pre-registered voiceprints. A more robust approach would be to do diarization-style "who spoke when" identification without requiring enrollment, using something like pyannote. This would let the agent attribute meeting contributions even for people who haven't registered.&lt;/p&gt;

&lt;p&gt;The wake word system is regex-based, which works but is brittle to accents and speech recognition errors. A proper wake word model (like openWakeWord) would be more reliable, especially for non-English activations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;The interesting engineering here isn't the AI part — Claude handles that. It's the audio plumbing: getting the right audio to the right place at the right time, without the AI confusing its own voice for input, without adding enough latency to make conversation awkward.&lt;/p&gt;

&lt;p&gt;PulseAudio virtual devices are genuinely powerful for this kind of audio routing. The null-sink + monitor pattern is a clean way to intercept audio streams without patching into any application's internal pipeline. More people should know it exists.&lt;/p&gt;

&lt;p&gt;The full source is at &lt;a href="https://github.com/QiushiWu95/teams-meeting-agent-public" rel="noopener noreferrer"&gt;github.com/QiushiWu95/teams-meeting-agent-public&lt;/a&gt;. The README has setup instructions for the hardware configuration we use (GPU server + Mac over Tailscale), but the pipeline itself should adapt to other setups with moderate effort.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://claw-stack.com/en/blog/teams-voice-agent" rel="noopener noreferrer"&gt;claw-stack.com&lt;/a&gt;. We're building an open-source AI agent runtime — check out the &lt;a href="https://claw-stack.com/en/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; or &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>voice</category>
      <category>agents</category>
      <category>openclaw</category>
      <category>teams</category>
    </item>
    <item>
      <title>From Code Completion to Code Team: How We Turned Claude Code into an Engineering Department</title>
      <dc:creator>Qiushi</dc:creator>
      <pubDate>Tue, 10 Mar 2026 16:40:05 +0000</pubDate>
      <link>https://dev.to/qiushiwu/from-code-completion-to-code-team-how-we-turned-claude-code-into-an-engineering-department-1p62</link>
      <guid>https://dev.to/qiushiwu/from-code-completion-to-code-team-how-we-turned-claude-code-into-an-engineering-department-1p62</guid>
      <description>&lt;p&gt;We use Claude Code every day. It's excellent. It handles complex refactors, writes tests, navigates large codebases, and catches bugs we'd miss. But after months of running it as part of an autonomous multi-agent system, we noticed something: Claude Code is a powerful tool, but it's fundamentally passive. It waits for instructions, executes them, and stops. It doesn't monitor itself, plan ahead, review its own output, or remember what went wrong last time.&lt;/p&gt;

&lt;p&gt;That's not a criticism — it's a design choice. Claude Code is built to be a coding assistant, not an autonomous engineering agent. The question we kept asking was: what would it take to turn it into one?&lt;/p&gt;

&lt;p&gt;The answer became what we call the V2 architecture: a three-layer system that wraps Claude Code with monitoring, planning, and review. This post describes what we built and why each layer exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  The passive tool problem
&lt;/h2&gt;

&lt;p&gt;When you run Claude Code directly, the interaction model is: you give it a task, it works on it, it finishes (or gets stuck). There's no process watching whether it's still making progress, no structured plan it's executing against, and no second opinion on whether the output is actually correct.&lt;/p&gt;

&lt;p&gt;In practice, this creates three failure modes we hit repeatedly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stuck loops.&lt;/strong&gt; Claude Code sometimes gets into states where it's retrying the same failing approach. Without external monitoring, the session just keeps running until you notice something is wrong — which, if you're running it autonomously overnight, might be hours later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No upfront plan.&lt;/strong&gt; For tasks with multiple steps or dependencies, jumping straight into code before having a clear implementation plan often leads to mid-task pivots that are expensive to recover from. The natural thing for a human engineer is to sketch the approach first. Claude Code doesn't do this by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No cross-review.&lt;/strong&gt; A model reviewing its own output has blind spots — the same reasoning that produced a bug often produces a rationale for why the bug is fine. A second model with a different training distribution catches different things.&lt;/p&gt;

&lt;p&gt;Each of these is solvable. Together they become the V2 architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  V2 architecture overview
&lt;/h2&gt;

&lt;p&gt;The system has three layers that operate around every Claude Code session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: Heartbeat (watchdog)
  └─ tmux session monitor
  └─ detects stuck/crashed → auto-recover

Layer 2: Skill-Driven Dev (planning)
  └─ SKILL.md written before code
  └─ implementation blueprint

Layer 3: Dual Review (verification)
  └─ Claude Code self-review
  └─ Gemini CLI cross-review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code still does the actual coding. The layers don't replace it — they wrap it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Heartbeat
&lt;/h2&gt;

&lt;p&gt;Claude Code runs in a tmux session. The Heartbeat is a watchdog process that polls that session every 30 seconds and inspects the terminal output. It's looking for one thing: whether Claude Code's prompt is visible, which indicates it has finished and is waiting for input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;SOCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TMPDIR&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="p"&gt;/tmp&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/openclaw-tmux-sockets/openclaw.sock"&lt;/span&gt;
&lt;span class="nv"&gt;LAST5&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;tmux &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SOCKET&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; capture-pane &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;-J&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; claude-code:0.0 &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LAST5&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"❯"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"DONE"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"RUNNING"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;❯&lt;/code&gt; is Claude Code's input prompt. If it's visible, the session is idle and waiting for input. If it stays hidden for longer than the timeout threshold, something is wrong.&lt;/p&gt;

&lt;p&gt;When the Heartbeat detects a stuck session, it has three recovery strategies in order of escalation: send a gentle interrupt, close and restart the session with the same task context, or page the orchestrator (Orange) for human-in-the-loop intervention. Most stuck sessions resolve at step one.&lt;/p&gt;

&lt;p&gt;This sounds simple, and it is. But without it, autonomous coding sessions are brittle. Claude Code gets stuck on network errors, permission issues, or loops where it convinces itself it's making progress when it isn't. The Heartbeat converts these from silent failures into handled exceptions.&lt;/p&gt;
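&lt;p&gt;A sketch of that escalation ladder — only the first two recovery steps are shown, and the thresholds and the &lt;code&gt;restart_claude_code_with_context&lt;/code&gt; helper are illustrative, not the production watchdog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical sketch: poll every 30s, escalate the longer the prompt stays hidden.
SOCKET="${TMPDIR:-/tmp}/openclaw-tmux-sockets/openclaw.sock"
STUCK=0
while sleep 30; do
  PANE=$(tmux -S "$SOCKET" capture-pane -p -t claude-code:0.0 -S -5)
  if echo "$PANE" | grep -q "❯"; then STUCK=0; continue; fi
  STUCK=$((STUCK + 30))
  if [ "$STUCK" -ge 600 ]; then
    tmux -S "$SOCKET" kill-session -t claude-code   # step 2: restart with the same task context
    restart_claude_code_with_context                # assumed helper
    STUCK=0
  elif [ "$STUCK" -ge 300 ]; then
    tmux -S "$SOCKET" send-keys -t claude-code:0.0 Escape   # step 1: gentle interrupt
  fi
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
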

&lt;h2&gt;
  
  
  Layer 2: Skill-driven development
&lt;/h2&gt;

&lt;p&gt;Before any non-trivial task goes to Claude Code, the orchestrator writes a SKILL.md file. This is a structured implementation plan — the equivalent of a design doc — that Claude Code then executes against.&lt;/p&gt;

&lt;p&gt;The skill file structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;skills/&lt;/span&gt;
  &lt;span class="s"&gt;feature-name/&lt;/span&gt;
    &lt;span class="s"&gt;SKILL.md&lt;/span&gt;    &lt;span class="c1"&gt;# implementation plan + steps&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A typical SKILL.md has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt; — what the task is and what done looks like&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference style&lt;/strong&gt; — which existing code to model after&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outline&lt;/strong&gt; — the specific sections or steps, with the key technical details filled in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation steps&lt;/strong&gt; — ordered list of what to do and in what sequence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy/safety checklist&lt;/strong&gt; — things to verify before commit&lt;/li&gt;
&lt;/ul&gt;
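
&lt;p&gt;A condensed, hypothetical example — the task, paths, and numbers are invented for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;# SKILL: add-retry-to-gateway

## Goal
Outbound webhook calls retry on 5xx with backoff. Done = tests pass; 4xx is never retried.

## Reference style
Follow the error handling in src/gateway/send.ts.

## Outline
Retry wrapper in src/gateway/retry.ts; the send path calls it; backoff 1s/2s/4s.

## Implementation steps
1. Add the retry wrapper with exponential backoff (3 attempts).
2. Unit tests: 5xx retried, 4xx not retried.
3. Update the changelog.

## Privacy/safety checklist
- [ ] No secrets in test fixtures
- [ ] No internal hostnames
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
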

&lt;p&gt;This post is itself an example: it was written against its own SKILL.md.&lt;/p&gt;

&lt;p&gt;The planning step forces precision before any code is written. Ambiguous tasks get clarified at planning time, not mid-implementation. It also gives Claude Code a success criterion to check against rather than having to infer when it's done.&lt;/p&gt;

&lt;p&gt;The other benefit is reuse. Skills accumulate over time. When a similar task comes up again, the orchestrator can search the skill library for relevant patterns and adapt an existing plan rather than starting from scratch. Over time this is how the system builds institutional knowledge about how certain types of tasks should be approached.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3: Dual review
&lt;/h2&gt;

&lt;p&gt;After Claude Code finishes and stages its changes, two reviews run before commit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code self-review.&lt;/strong&gt; The first review uses Claude Code itself — but in a separate session, reviewing the diff rather than the code it just wrote. This catches straightforward issues: leftover debug output, incomplete implementations, test files that test the wrong thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI cross-review.&lt;/strong&gt; The second review pipes the staged diff to Gemini:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git diff &lt;span class="nt"&gt;--cached&lt;/span&gt; | gemini &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Review this diff for: security issues, privacy leaks (IPs, emails, API keys), code quality. Output: PASSED or list of issues."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cross-review is the more important one. A different model with a different training distribution reliably catches things Claude Code's self-review misses — particularly security issues and privacy leaks. We've had Gemini catch hardcoded test credentials, internal hostnames that shouldn't be in public code, and logic errors that Claude Code's self-review described as intentional design decisions.&lt;/p&gt;

&lt;p&gt;The output format is strict: either &lt;code&gt;PASSED&lt;/code&gt; or a list of issues. If there are issues, the commit is blocked and the problems are sent back to Claude Code for remediation. The loop continues until Gemini passes it.&lt;/p&gt;
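
&lt;p&gt;The gate can be sketched as a loop — the retry bound and the &lt;code&gt;send_back_to_claude_code&lt;/code&gt; helper are illustrative, not the actual orchestration code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical sketch of the review gate.
for attempt in 1 2 3; do
  RESULT=$(git diff --cached | gemini -p "Review this diff for: security issues, privacy leaks (IPs, emails, API keys), code quality. Output: PASSED or list of issues.")
  [ "$RESULT" = "PASSED" ] &amp;&amp; break
  echo "$RESULT" | send_back_to_claude_code   # assumed helper: feeds issues back for remediation
done
[ "$RESULT" = "PASSED" ] &amp;&amp; git commit       # commit only after a pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
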

&lt;h2&gt;
  
  
  The full V2 workflow
&lt;/h2&gt;

&lt;p&gt;Putting it together, the orchestrator's AGENTS.md describes a fixed sequence for every coding task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. memory_search → find relevant lessons and patterns from past work
2. Write SKILL.md (plan before code)
3. Launch Claude Code via tmux (with Heartbeat active)
4. Wait for Heartbeat signal: DONE
5. Gemini Review on staged diff
6. If issues: send back to Claude Code, loop
7. Commit
8. Update lessons/MEMORY.md with what was learned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Steps 1 and 8 are what give the system memory. Before starting, the orchestrator searches its vector memory for lessons from similar past tasks — prior decisions, failure modes, patterns that worked. After finishing, it writes what it learned back to memory. Over time this creates a feedback loop where the system gets measurably better at certain types of tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this isn't
&lt;/h2&gt;

&lt;p&gt;This isn't a fully autonomous engineering team. The orchestrator (Orange) still needs a human (Qiushi) to approve anything that touches production, involves financial operations, or represents a significant architectural decision. The V2 architecture automates the routine coding work; it doesn't automate judgment.&lt;/p&gt;

&lt;p&gt;It's also not a replacement for Claude Code — it's a harness for it. The coding quality still comes from Claude Code. The architecture just ensures that quality gets checked, that sessions don't fail silently, and that the system accumulates knowledge rather than starting fresh every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle
&lt;/h2&gt;

&lt;p&gt;The pattern here is one we've found ourselves returning to: AI tools are most powerful when they're not standalone, but when they're embedded in systems that monitor them, direct them, and check their output. Claude Code alone is a strong coder. Claude Code with a Heartbeat, a planning layer, and a cross-review step is closer to a reliable engineering workflow.&lt;/p&gt;

&lt;p&gt;The same principle applies to any capable but passive AI tool. The tool does the work. The system ensures the work is worth keeping.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://claw-stack.com/en/blog/claude-code-agent-swarm-v2" rel="noopener noreferrer"&gt;claw-stack.com&lt;/a&gt;. We're building an open-source AI agent runtime — check out the &lt;a href="https://claw-stack.com/en/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; or &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>coding</category>
      <category>openclaw</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>Memory System v2: Solving the Context Bloat Problem</title>
      <dc:creator>Qiushi</dc:creator>
      <pubDate>Mon, 09 Mar 2026 21:16:01 +0000</pubDate>
      <link>https://dev.to/qiushiwu/memory-system-v2-solving-the-context-bloat-problem-pg9</link>
      <guid>https://dev.to/qiushiwu/memory-system-v2-solving-the-context-bloat-problem-pg9</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/blog/persistent-memory-system"&gt;In our last post on building a persistent memory system&lt;/a&gt;, we described the MEMORY.md bloat problem: after six weeks, the file had grown to over 700 lines, and we fixed it by switching from inline content to pointer-based entries. The fix worked. MEMORY.md got compact, session startup improved, everything was fine.&lt;/p&gt;

&lt;p&gt;Then it bloated again.&lt;/p&gt;

&lt;p&gt;Four weeks later, MEMORY.md was back to 92,000 characters and 790 lines. The organizer pipeline kept writing new facts inline rather than deferring to per-topic files. Our byte-size limit wasn't being enforced consistently. The original fix had patched the symptom, not the cause.&lt;/p&gt;

&lt;p&gt;More troubling, we had started noticing that sessions were hitting context limits mid-task even when MEMORY.md was under control. The agent would read a few files, run a search, and then stall — not because it had run out of &lt;em&gt;memory&lt;/em&gt;, but because its context window was full of tool output from earlier in the same session.&lt;/p&gt;

&lt;p&gt;And there was a third problem we'd been tolerating: every time we ran &lt;code&gt;/new&lt;/code&gt; to start a fresh session, the agent lost all awareness of what it had just been doing. Our long-term memory system (v1) handled facts, preferences, and project knowledge well. But the &lt;em&gt;short-term&lt;/em&gt; working state — what task was in progress, what decisions were just made, what the next step was — vanished completely. The user had to manually remind the agent to update its memory files before resetting, or accept losing the context.&lt;/p&gt;

&lt;p&gt;Three problems, one theme: no systematic lifecycle for context at any timescale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring before fixing
&lt;/h2&gt;

&lt;p&gt;Before changing anything, we wrote &lt;code&gt;session-stats.py&lt;/code&gt; to analyze the last 15 sessions and understand where context was actually going. The output was clarifying.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session context breakdown (15 sessions, chars):
┌──────────────────┬───────────┬───────────┬────────────┐
│ Category         │ Total     │ % of ctx  │ Avg/session│
├──────────────────┼───────────┼───────────┼────────────┤
│ Tool results     │ 1,842,300 │   82.5%   │   122,820  │
│ System prompt    │   268,100 │   12.0%   │    17,873  │
│ Assistant text   │    64,700 │    2.9%   │     4,313  │
│ User input       │    55,900 │    2.5%   │     3,727  │
└──────────────────┴───────────┴───────────┴────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most extreme session: 159,000 characters of tool results, 1,500 characters of user input and assistant text combined. The actual conversation was almost invisible in its own context window.&lt;/p&gt;

&lt;p&gt;The system prompt averaged 17K chars per session. We knew MEMORY.md was loaded at startup, but seeing it account for 12% of total context across all sessions, including sessions where nothing memory-related happened, made the number concrete. The agent was paying a 17K-char context tax on every session, regardless of what it was doing.&lt;/p&gt;

&lt;p&gt;The two problems were now measurable: tool results bloating within a session, and MEMORY.md bloating across sessions. Both were solvable, and we had numbers to evaluate solutions against.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution 1: Context pruning
&lt;/h2&gt;

&lt;p&gt;The within-session problem is that tool outputs accumulate. The agent reads a file — that's 8K chars of context. Runs a search — another 4K. Edits a file, sees the diff — 2K. Reads the test output — 6K. After a moderately complex task, the context is mostly tool output from earlier steps that the agent no longer needs to reference.&lt;/p&gt;

&lt;p&gt;OpenClaw's &lt;code&gt;contextPruning&lt;/code&gt; feature handles this with a TTL-based approach: after a configurable time window, tool outputs beyond the most recent turn are replaced with a placeholder. The content is gone from the active context, but the agent can see that something happened.&lt;/p&gt;

&lt;p&gt;Our configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;contextPruning&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cache-ttl&lt;/span&gt;
  &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;minPrunableToolChars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
  &lt;span class="na"&gt;hardClearRatio&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;ttl: 30&lt;/code&gt;, any tool result older than 30 seconds is eligible for pruning on the next turn. &lt;code&gt;minPrunableToolChars: 100&lt;/code&gt; prevents replacing tiny tool outputs that cost almost nothing. &lt;code&gt;hardClearRatio: 0&lt;/code&gt; means we never do a full wipe — we keep the most recent turn intact.&lt;/p&gt;

&lt;p&gt;The effect is that the agent operates with a sliding window of recent tool context rather than the full accumulated history. For tasks involving repeated file reads or search-iterate loops, this is the difference between hitting context limits at step 8 and finishing the task.&lt;/p&gt;

&lt;p&gt;One concern we had: would pruning break the agent's ability to reference earlier work? In practice, no. For most tasks, the agent either needs the output of the most recent tool call, or it needs a general fact that should be in memory rather than in an ephemeral tool result. If the agent needs to re-read a file it already processed, that's usually a sign the fact should have been written to memory, not cached in context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution 2: MEMORY.md structural compression
&lt;/h2&gt;

&lt;p&gt;The 92K → compact migration required confronting a design question we'd avoided the first time: what exactly should MEMORY.md contain?&lt;/p&gt;

&lt;p&gt;Our v1 answer had been "recent activity, active projects, key contacts, and infrastructure notes," with a byte-size cap to keep it manageable. This was wrong. A byte-size cap is an incentive to compress content, but it doesn't prevent accumulation — it just makes each entry shorter before you run out of room and start bending the rules.&lt;/p&gt;

&lt;p&gt;The right answer is that MEMORY.md should contain &lt;em&gt;pointers&lt;/em&gt;, not &lt;em&gt;content&lt;/em&gt;. If you can answer the question "what is this file for?" with "it contains X," then MEMORY.md should not contain X — it should contain "see &lt;code&gt;memory/X.md&lt;/code&gt; for X." MEMORY.md is an index that tells the agent where to look, not a document that contains what the agent knows.&lt;/p&gt;

&lt;p&gt;With that definition, the target structure became obvious:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Users&lt;/span&gt;
| handle | role | notes |
| --- | --- | --- |
| @orange | owner | ... |

&lt;span class="gu"&gt;## Projects&lt;/span&gt;
| name | status | detail file |
| --- | --- | --- |
| claw-stack | active | memory/entities/project-claw-stack.md |
| info-pipeline | active | memory/entities/project-info-pipeline.md |

&lt;span class="gu"&gt;## Infrastructure&lt;/span&gt;
| service | notes | detail file |
| --- | --- | --- |
| CF Workers | edge compute | memory/infra/cloudflare.md |

&lt;span class="gu"&gt;## Behavior rules&lt;/span&gt;
See AGENTS.md for current rules.

&lt;span class="gu"&gt;## Recent (last 5)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; 2026-03-09: ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tables for structured facts (users, projects, infra). Pointers for everything else. Recent activity capped at five entries, rolling. Total target: under 5,000 characters.&lt;/p&gt;

&lt;p&gt;After the migration, MEMORY.md went from 92,000 characters to 2,900 characters — a 97% reduction. Session startup went from ~23K tokens of MEMORY.md context to ~700 tokens. Everything that was in MEMORY.md before is still searchable through QMD vector search; it's just in per-topic files now rather than inline.&lt;/p&gt;

&lt;p&gt;The migration script itself was about 150 lines of Python: read the current MEMORY.md, extract facts by category using Claude Haiku, write facts to appropriate per-topic files, generate the new pointer-based MEMORY.md. Running it took 20 seconds.&lt;/p&gt;
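
&lt;p&gt;A skeleton of that script — the &lt;code&gt;claude_haiku&lt;/code&gt; helper and the exact file layout are assumptions based on the description above, not the actual 150 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical skeleton of the migration.
from pathlib import Path

def migrate(claude_haiku):
    old = Path("MEMORY.md").read_text()
    # 1. Extract facts with a small model (assumed to return (category, topic, text) triples)
    facts = claude_haiku(f"Extract facts as (category, topic, text) triples:\n\n{old}")
    # 2. Append each fact to its per-topic file
    index = {}
    for category, topic, text in facts:
        path = Path(f"memory/{category}/{topic}.md")
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a") as fh:
            fh.write(text + "\n")
        index.setdefault(category, set()).add(str(path))
    # 3. Regenerate MEMORY.md as a pointer index
    sections = [f"## {cat}\n" + "\n".join(f"- see {p}" for p in sorted(paths))
                for cat, paths in index.items()]
    Path("MEMORY.md").write_text("\n\n".join(sections) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
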

&lt;h2&gt;
  
  
  Solution 3: Session handoff hooks
&lt;/h2&gt;

&lt;p&gt;The context pruning and MEMORY.md compression addressed the technical bloat problems. There was a third problem we'd been tolerating: when you run &lt;code&gt;/new&lt;/code&gt; to start a fresh session, you lose all the working context from the current session. What file were you editing? What was the next step? What did you just figure out about the bug you were debugging?&lt;/p&gt;

&lt;p&gt;The conventional response is "write better notes." We wanted to automate it.&lt;/p&gt;

&lt;p&gt;OpenClaw supports hooks that fire on specific commands. We wrote a &lt;code&gt;command:new&lt;/code&gt; hook that runs a session summarization pipeline before the new session starts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Triggered on /new
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;session_handoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;claude_haiku&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MANIFEST.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  &lt;span class="c1"&gt;# file map for the memory system
&lt;/span&gt;        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this session. Extract: current work state, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
               &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decisions made, lessons learned, entities updated. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
               &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Format as structured updates for memory files.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;apply_memory_updates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# updates MEMORY.md, TODO.md, entities, etc.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hook runs synchronously with a 20-second timeout, then falls back to async if the transcript is too long to process quickly. In practice, most sessions process in 8–12 seconds.&lt;/p&gt;

&lt;p&gt;The key piece is &lt;code&gt;MANIFEST.md&lt;/code&gt;, a file that describes the memory system's structure: which files exist, what each one contains, and what kinds of updates go where. Without it, Haiku doesn't know that a project update should go to &lt;code&gt;memory/entities/project-X.md&lt;/code&gt; rather than into MEMORY.md directly. The MANIFEST is the schema documentation for the agent that maintains memory.&lt;/p&gt;

&lt;p&gt;After the handoff hook, &lt;code&gt;/new&lt;/code&gt; still starts a fresh context, but MEMORY.md now reflects the current session's outcomes. The next session starts knowing where you left off.&lt;/p&gt;
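
&lt;p&gt;A hypothetical excerpt of such a manifest — the entries are illustrative, not our actual file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;# Memory system map

- MEMORY.md: pointer index only; max 5,000 chars; never inline content
- memory/entities/: one file per project or person; project status updates go here
- memory/infra/: infrastructure notes, one file per service
- TODO.md: in-progress work state and next steps; survives /new
- lessons/MEMORY.md: lessons learned from past tasks; append, don't rewrite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
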

&lt;h2&gt;
  
  
  Decay prevention rules
&lt;/h2&gt;

&lt;p&gt;After rebuilding the system twice, we wrote explicit rules into &lt;code&gt;AGENTS.md&lt;/code&gt; to prevent the same problems from recurring:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard limits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MEMORY.md must stay under 5,000 characters. If an update would push it over, write to a per-topic file and add a pointer instead.&lt;/li&gt;
&lt;li&gt;Never write commit hashes, code snippets, or raw error messages to MEMORY.md. These are either ephemeral (commit hashes, errors) or belong in per-topic files (code).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prohibited content:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lists of more than 5 items (use a per-topic file)&lt;/li&gt;
&lt;li&gt;Facts already present in another memory file (no duplication)&lt;/li&gt;
&lt;li&gt;"Temporary" notes (write to a TODO file, not to MEMORY.md)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Regular maintenance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After any session that touched more than 3 files, check whether per-topic files need updating&lt;/li&gt;
&lt;li&gt;When a project status changes, update the entity file, not the MEMORY.md table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rules written into AGENTS.md become part of the system prompt, which means the organizer pipeline and the handoff hook both see them. They're not enforced by code, but explicit rules in the context are meaningfully better than informal conventions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measured outcomes
&lt;/h2&gt;

&lt;p&gt;The immediate results after deploying the v2 changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MEMORY.md size&lt;/td&gt;
&lt;td&gt;~92K chars (~23K tokens)&lt;/td&gt;
&lt;td&gt;~2.9K chars (~700 tokens)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session startup context tax&lt;/td&gt;
&lt;td&gt;~23K tokens&lt;/td&gt;
&lt;td&gt;~700 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool result share of context&lt;/td&gt;
&lt;td&gt;82.5%&lt;/td&gt;
&lt;td&gt;Pruned after 30s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Working state preserved across /new&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (automated)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The MEMORY.md reduction is a 97% cut. Every new session now starts with 22K fewer tokens of overhead, which means more room for the actual task. The context pruning configuration means tool results older than 30 seconds are replaced with placeholders, preventing the within-session accumulation that was causing stalls on multi-step tasks.&lt;/p&gt;

&lt;p&gt;Whether the handoff hook produces the right memory updates consistently is something we'll know after a few weeks of use. The architecture is right — the question is whether Haiku's judgment about what to update holds up at scale. We'll report back.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we learned about memory
&lt;/h2&gt;

&lt;p&gt;The v1 blog post framed the bloat problem as a technical issue with a technical fix: enforce a byte-size limit, use pointers instead of inline content. That framing was correct but incomplete.&lt;/p&gt;

&lt;p&gt;The real problem is that memory management is an information architecture problem, not a storage problem. Every time we said "this fact might be relevant later, so put it in MEMORY.md," we were making a bad indexing decision. MEMORY.md was being used as a catch-all rather than as a specific layer in the architecture.&lt;/p&gt;

&lt;p&gt;The v2 system works not because we have better enforcement mechanisms (though the TTL pruning and size limits help) but because we're clearer about what each layer is for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Active context&lt;/strong&gt;: the current session's working state. Ephemeral. Pruned aggressively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MEMORY.md&lt;/strong&gt;: session orientation. The minimum context needed to start a session. Pointers only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-topic files&lt;/strong&gt;: depth on specific subjects. Loaded on demand. Where content lives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector search&lt;/strong&gt;: fallback retrieval across all memory. For queries that don't know where to look.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a new fact arrives, the question isn't "should I remember this?" It's "which layer does this belong in?" Most facts don't belong in MEMORY.md. Getting that architecture right is what prevents bloat.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical takeaways for agent developers
&lt;/h2&gt;

&lt;p&gt;If you're building something similar, the mistakes we made twice are worth knowing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enforce the index/content separation at write time, not retroactively.&lt;/strong&gt; A byte-size limit on MEMORY.md doesn't prevent bloat — it just makes bloat smaller before you exceed it. The real constraint is: no content in the index, only pointers. Check this on every write.&lt;/p&gt;
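
&lt;p&gt;One way to do that is a guard in the memory-write path — a sketch, with limits mirroring the rules described earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical write-time guard for the index file: pointers in, content out.
MAX_CHARS = 5000

def check_memory_write(current: str, entry: str) -&gt; None:
    if len(current) + len(entry) &gt; MAX_CHARS:
        raise ValueError("over 5,000 chars: write a per-topic file and add a pointer")
    if "```" in entry:
        raise ValueError("code snippets belong in per-topic files, not the index")
    if entry.count("\n- ") &gt; 5:
        raise ValueError("lists over 5 items belong in a per-topic file")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
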

&lt;p&gt;&lt;strong&gt;Measure context distribution before you optimize.&lt;/strong&gt; We assumed MEMORY.md was the main problem. It was a problem; tool results were a bigger one. The session-stats script took a day to write and immediately surfaced the bigger issue. Measurement first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTL-based context pruning is low-risk and high-reward.&lt;/strong&gt; We were worried it would break agent behavior. It didn't. For most tasks, old tool results are noise, not signal. Prune them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A handoff hook is worth more than perfect note-taking.&lt;/strong&gt; Asking humans (or agents) to write end-of-session notes reliably is a losing strategy. Automate it. Even a rough extraction that takes 10 seconds is better than manual notes that don't get written.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document the memory system's schema for the agents that use it.&lt;/strong&gt; The MANIFEST.md pattern — a file that explains where things go — is what makes automated memory updates actually put things in the right place. Without it, every update becomes an ad-hoc decision about file placement.&lt;/p&gt;

&lt;p&gt;Memory systems for AI agents are still young enough that there's no established practice. These are the patterns that worked for us at our scale. Your scale, your access patterns, and your agent's task distribution will produce different constraints. But the underlying principle holds: agent memory is information architecture. Get the architecture right before you build the infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://claw-stack.com/en/blog/memory-system-v2" rel="noopener noreferrer"&gt;claw-stack.com&lt;/a&gt;. We're building an open-source AI agent runtime — check out the &lt;a href="https://claw-stack.com/en/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; or &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>memory</category>
      <category>architecture</category>
      <category>contextwindow</category>
      <category>aiagents</category>
    </item>
    <item>
      <title>Making iMessage Reliable with OpenClaw: 3 Problems and How We Fixed Them</title>
      <dc:creator>Qiushi</dc:creator>
      <pubDate>Mon, 09 Mar 2026 00:54:03 +0000</pubDate>
      <link>https://dev.to/qiushiwu/making-imessage-reliable-with-openclaw-3-problems-and-how-we-fixed-them-o8m</link>
      <guid>https://dev.to/qiushiwu/making-imessage-reliable-with-openclaw-3-problems-and-how-we-fixed-them-o8m</guid>
      <description>&lt;p&gt;OpenClaw can use iMessage as a communication channel — you text your AI agent, it texts you back. Sounds simple, but running it 24/7 on a Mac mini revealed three reliability issues that took weeks to fully diagnose. Here's what went wrong and how we fixed each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;OpenClaw's iMessage plugin works by watching &lt;code&gt;~/Library/Messages/chat.db&lt;/code&gt; via filesystem events (FSEvents). When a new message arrives, macOS writes to &lt;code&gt;chat.db&lt;/code&gt;, the watcher detects the change, and the gateway processes the message.&lt;/p&gt;

&lt;p&gt;In theory, this is instant. In practice, it breaks in three distinct ways.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 1: Messages Delayed Up to 5 Minutes When Idle
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: You send a message, it shows "Delivered" on your phone, but the agent doesn't respond for 3-5 minutes. Then suddenly it processes everything at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root Cause&lt;/strong&gt;: macOS power management coalesces FSEvents for background processes. Even with &lt;code&gt;ProcessType=Interactive&lt;/code&gt; in the LaunchAgent plist and &lt;code&gt;caffeinate&lt;/code&gt; running, the kernel still batches vnode events on &lt;code&gt;chat.db&lt;/code&gt; during low-activity periods. The &lt;code&gt;imsg rpc&lt;/code&gt; subprocess watches the file, but macOS decides "this process hasn't been active, let's batch up those file notifications."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It's Tricky&lt;/strong&gt;: The message is already in &lt;code&gt;chat.db&lt;/code&gt; — it's the &lt;em&gt;notification&lt;/em&gt; that's delayed, not the message itself. So everything works perfectly during active use, but fails silently when the machine is idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: A polling script that checks &lt;code&gt;chat.db&lt;/code&gt; every 15 seconds and &lt;code&gt;touch&lt;/code&gt;es the file when new rows appear, generating a fresh FSEvent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="cp"&gt;#!/usr/bin/env node
&lt;/span&gt;&lt;span class="c1"&gt;// imsg-poller.mjs — Polls chat.db for new messages and wakes FSEvents watcher&lt;/span&gt;


&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CHATDB&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;homedir&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Library/Messages/chat.db&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;INTERVAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 15 seconds&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getMaxRowid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;execSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="s2"&gt;`/usr/bin/sqlite3 "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;CHATDB&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" "SELECT MAX(ROWID) FROM message;"`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;encoding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;lastRowid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getMaxRowid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;lastRowid&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ERROR: Cannot read chat.db — check Full Disk Access&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`imsg-poller started. ROWID: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;lastRowid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, interval: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;INTERVAL&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nf"&gt;setInterval&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getMaxRowid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;lastRowid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`New message (ROWID &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;lastRowid&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;), touching chat.db`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="nf"&gt;utimesSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;CHATDB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`touch failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;lastRowid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;INTERVAL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Node.js instead of bash?&lt;/strong&gt; We tried a bash version first, but launchd-spawned &lt;code&gt;/bin/bash&lt;/code&gt; processes don't inherit Full Disk Access (TCC). The &lt;code&gt;stat&lt;/code&gt; command works, but &lt;code&gt;sqlite3&lt;/code&gt; gets "authorization denied". Using &lt;code&gt;/opt/homebrew/bin/node&lt;/code&gt; works because it inherits FDA from the same TCC grant as the gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment&lt;/strong&gt;: Run as a LaunchAgent with &lt;code&gt;KeepAlive: true&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd"&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;plist&lt;/span&gt; &lt;span class="na"&gt;version=&lt;/span&gt;&lt;span class="s"&gt;"1.0"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;dict&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;Label&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;ai.openclaw.imsg-poller&lt;span class="nt"&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;ProgramArguments&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;array&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;/opt/homebrew/bin/node&lt;span class="nt"&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;/path/to/imsg-poller.mjs&lt;span class="nt"&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/array&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;RunAtLoad&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&amp;lt;true/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;KeepAlive&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&amp;lt;true/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;EnvironmentVariables&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;dict&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;HOME&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&amp;lt;string&amp;gt;&lt;/span&gt;/Users/youruser&lt;span class="nt"&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/dict&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;ThrottleInterval&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&amp;lt;integer&amp;gt;&lt;/span&gt;10&lt;span class="nt"&gt;&amp;lt;/integer&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dict&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/plist&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Survives OpenClaw updates?&lt;/strong&gt; Yes — it's a standalone launchd job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 2: Images Sent via iMessage Fail with "Path Not Allowed"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: The agent tries to send an image that was received via iMessage, but gets "Local media path is not under an allowed directory." The image exists at &lt;code&gt;~/Library/Messages/Attachments/...&lt;/code&gt; but OpenClaw's media sandboxing blocks it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root Cause&lt;/strong&gt;: OpenClaw's &lt;code&gt;buildMediaLocalRoots()&lt;/code&gt; function defines which directories are allowed for media file access. It includes the workspace, temp directories, and sandboxes — but not &lt;code&gt;~/Library/Messages/Attachments/&lt;/code&gt;. When the agent tries to forward or process an image received via iMessage, the path is rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: A patch script that adds the Messages attachment directory to the allowed roots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="c"&gt;# patch-imessage-attachments.sh&lt;/span&gt;
&lt;span class="c"&gt;# Adds ~/Library/Messages/Attachments to allowed media roots&lt;/span&gt;
&lt;span class="c"&gt;# Re-run after every `npm update -g openclaw`&lt;/span&gt;

&lt;span class="nv"&gt;DIST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/opt/homebrew/lib/node_modules/openclaw/dist"&lt;/span&gt;

&lt;span class="nv"&gt;patched&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="k"&gt;for &lt;/span&gt;f &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DIST&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;/ir-&lt;span class="k"&gt;*&lt;/span&gt;.js&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;continue
  if &lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"buildMediaLocalRoots"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"Messages/Attachments"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="s1"&gt;'s|path.join(resolvedStateDir, "sandboxes")|path.join(resolvedStateDir, "sandboxes"),\n\t\tpath.join(os.homedir(), "Library/Messages/Attachments")|'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Patched: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="nv"&gt;$f&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nv"&gt;patched&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;patched &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;fi
done

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Done. Patched: &lt;/span&gt;&lt;span class="nv"&gt;$patched&lt;/span&gt;&lt;span class="s2"&gt; files"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Run: openclaw gateway restart"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Survives OpenClaw updates?&lt;/strong&gt; No — the compiled JS files are overwritten. &lt;strong&gt;You must re-run this after every update.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 3: macOS Updates Silently Revoke Full Disk Access
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: iMessage stops working entirely. No messages are received, and the gateway log shows no intelligible errors. The agent appears online but is deaf.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root Cause&lt;/strong&gt;: macOS system updates (and sometimes minor security patches) can reset TCC (Transparency, Consent, and Control) permissions. When this happens, the &lt;code&gt;imsg&lt;/code&gt; binary loses Full Disk Access, which means it can't read &lt;code&gt;~/Library/Messages/chat.db&lt;/code&gt;. The gateway logs show:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;permissionDenied(path: "~/Library/Messages/chat.db",
  underlying: authorization denied (code: 23))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our logs, this happened on &lt;strong&gt;Feb 13&lt;/strong&gt; and &lt;strong&gt;Feb 24, 2026&lt;/strong&gt; — both times correlating with macOS updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Manual, unfortunately.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check the gateway error log:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"permissionDenied"&lt;/span&gt; ~/.openclaw/logs/gateway.err.log | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;If you see &lt;code&gt;code: 23&lt;/code&gt;, go to:
&lt;strong&gt;System Settings → Privacy &amp;amp; Security → Full Disk Access&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Make sure &lt;code&gt;imsg&lt;/code&gt; (or Terminal / iTerm, whichever runs your gateway) has FDA enabled. Toggle it off and on if it looks correct but isn't working.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Verify:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   /opt/homebrew/bin/imsg chats &lt;span class="nt"&gt;--limit&lt;/span&gt; 1
   &lt;span class="c"&gt;# Should return your most recent chat, not an error&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Restart:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   openclaw gateway restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Survives OpenClaw updates?&lt;/strong&gt; Yes — TCC permissions are system-level. But macOS updates can reset them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Post-Update Checklist
&lt;/h2&gt;

&lt;p&gt;Every time you run &lt;code&gt;npm update -g openclaw&lt;/code&gt;, do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Re-apply patches (overwritten by update)&lt;/span&gt;
bash ~/.openclaw/autopatch/patch-imessage-attachments.sh

&lt;span class="c"&gt;# 2. Restart gateway&lt;/span&gt;
openclaw gateway restart

&lt;span class="c"&gt;# 3. Verify iMessage works&lt;/span&gt;
/opt/homebrew/bin/imsg chats &lt;span class="nt"&gt;--limit&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After macOS updates, also check Full Disk Access permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should OpenClaw Fix These Upstream?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem 1&lt;/strong&gt; (FSEvents coalescing) is a macOS kernel behavior — hard to fix in OpenClaw itself. The poller is the right workaround. OpenClaw could ship it as an optional component.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2&lt;/strong&gt; (attachment path) is a clear bug/oversight. &lt;code&gt;~/Library/Messages/Attachments/&lt;/code&gt; should be in the default allowed roots when the iMessage plugin is enabled. This is a one-line fix upstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 3&lt;/strong&gt; (TCC reset) is Apple's problem. Nothing OpenClaw can do except maybe detect it and log a clearer error message.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"Works on my machine" isn't enough for always-on agents.&lt;/strong&gt; These bugs only appear after days of continuous operation or after system updates. You need to run your agent 24/7 for weeks to find them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;macOS is not designed for headless servers.&lt;/strong&gt; Power management, TCC, FSEvents coalescing — they all assume a human is sitting in front of the screen. Running an AI agent on a Mac mini requires fighting the OS at every level.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep a patch directory.&lt;/strong&gt; We maintain &lt;code&gt;~/.openclaw/autopatch/&lt;/code&gt; with scripts and a README documenting every patch. When an update lands, we run them all. It's not elegant, but it's reliable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Log everything.&lt;/strong&gt; The poller logs every &lt;code&gt;touch&lt;/code&gt; it performs. The gateway logs every permission error. Without these, we'd still be debugging "why didn't my message go through?"&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://claw-stack.com/en/blog/imessage-reliability" rel="noopener noreferrer"&gt;claw-stack.com&lt;/a&gt;. We're building an open-source AI agent runtime — check out the &lt;a href="https://claw-stack.com/en/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; or &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>imessage</category>
      <category>macos</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Building a Tri-Modal Knowledge Engine for CTF Agents</title>
      <dc:creator>Qiushi</dc:creator>
      <pubDate>Sat, 07 Mar 2026 23:57:46 +0000</pubDate>
      <link>https://dev.to/qiushiwu/building-a-tri-modal-knowledge-engine-for-ctf-agents-49eb</link>
      <guid>https://dev.to/qiushiwu/building-a-tri-modal-knowledge-engine-for-ctf-agents-49eb</guid>
      <description>&lt;p&gt;When Librarian gets asked about tcache stashing, it needs to return something more useful than what a base Claude model knows. The model has a general understanding of heap exploitation — it can describe what tcache is, explain the concept of a stash unlink attack, gesture at the shape of an exploit. But it doesn't know the specific pwntools idiom your teammates used last week, or the exact GDB command that reveals the free list state in the version of libc pinned to the challenge binary. That gap — between general knowledge and actionable specifics — is what the knowledge engine exists to close.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "just ask the LLM" isn't enough
&lt;/h2&gt;

&lt;p&gt;A base model's knowledge has two failure modes in CTF contexts.&lt;/p&gt;

&lt;p&gt;The first is staleness. CTF challenges often involve recent CVEs, updated tool versions, or techniques documented only in writeups from the past year. A model with a training cutoff doesn't know these. The second is precision. Knowing &lt;em&gt;that&lt;/em&gt; GTFOBins documents &lt;code&gt;nmap&lt;/code&gt; privilege escalation techniques is not the same as having the exact &lt;code&gt;--script=exec&lt;/code&gt; incantation ready to paste. In a time-limited competition, the difference between "the agent knows the theory" and "the agent has the exact command" can be the difference between a solve and a dead end.&lt;/p&gt;

&lt;p&gt;There's also a context budget problem. Librarian (Claude Haiku) is called once per challenge and has a fixed context window. You can't embed all of HackTricks in the prompt. You need targeted retrieval: the three most relevant things for this specific challenge, delivered quickly, in a format the agent can act on immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tri-modal architecture
&lt;/h2&gt;

&lt;p&gt;The knowledge base separates three fundamentally different kinds of retrieval into three separate stores.&lt;/p&gt;

&lt;h3&gt;
  
  
  Type A — Muscle Memory (SQLite + FTS5)
&lt;/h3&gt;

&lt;p&gt;Type A is for commands you want to copy and paste. The database (&lt;code&gt;ctf_knowledge.db&lt;/code&gt;) contains two tables.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;binaries&lt;/code&gt; holds ~2,739 structured records from GTFOBins and LOLBAS — one row per binary per exploitation method, indexed by name, platform (linux/windows), and function (shell, sudo, suid, download, etc.). These come from the GTFOBins YAML files and a LOLBAS JSON export. The schema is intentionally rigid: &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;platform&lt;/code&gt;, &lt;code&gt;function&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;. A query for &lt;code&gt;nmap&lt;/code&gt; + &lt;code&gt;sudo&lt;/code&gt; returns the exact command, not a description of what nmap can do.&lt;/p&gt;
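&lt;p&gt;A Type A lookup against that schema is a single indexed query. Here's a minimal sketch, assuming the table and column names described above (the real schema may differ in detail):&lt;/p&gt;

```python
import sqlite3

# Sketch of a Type A "muscle memory" lookup. Table and column names
# (binaries: name, platform, function, code, description) follow the
# article's description and are assumptions, not the verified schema.
def lookup_binary(conn, name, platform, function):
    return conn.execute(
        "SELECT code, description FROM binaries "
        "WHERE name = ? AND platform = ? AND function = ?",
        (name, platform, function),
    ).fetchall()
```

&lt;p&gt;A call like &lt;code&gt;lookup_binary(conn, "nmap", "linux", "sudo")&lt;/code&gt; returns the stored command text directly, with no generation step in between.&lt;/p&gt;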

&lt;p&gt;&lt;code&gt;tricks&lt;/code&gt; is a SQLite FTS5 full-text search table with ~4,155 records from HackTricks and PayloadsAllTheThings. The build pipeline walks all markdown files, extracts code blocks using regex, and records the surrounding header as context. PayloadsAllTheThings gets filtered: the Intruder, Wordlists, Files, and Images directories are skipped (those assets go to Type C instead), and code blocks longer than 20 lines are dropped — a deliberate choice to keep tricks copy-paste ready rather than turning the table into a script archive.&lt;/p&gt;

&lt;p&gt;FTS5 queries work by AND-ing the search terms together (&lt;code&gt;nmap AND sudo AND shell&lt;/code&gt;). When FTS5 fails — which happens with punctuation-heavy queries — the gateway falls back to a LIKE search on the first word.&lt;/p&gt;
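&lt;p&gt;The query logic is small enough to sketch. This is an illustrative reconstruction of the behavior described above, assuming an FTS5 table shaped like &lt;code&gt;tricks&lt;/code&gt;; it is not the gateway's exact code:&lt;/p&gt;

```python
import sqlite3

# Sketch of the FTS5 query path: AND the terms together, and fall back
# to a LIKE scan on the first word when FTS5 rejects the query string.
# Table and column names are assumptions based on the article.
def search_tricks(conn, query):
    terms = query.split()
    try:
        match = " AND ".join(terms)
        return conn.execute(
            "SELECT context, code FROM tricks WHERE tricks MATCH ?", (match,)
        ).fetchall()
    except sqlite3.OperationalError:
        # Punctuation-heavy input breaks FTS5 query syntax; a LIKE scan
        # on the first word is the cheap fallback.
        return conn.execute(
            "SELECT context, code FROM tricks WHERE code LIKE ?",
            ("%" + terms[0] + "%",),
        ).fetchall()
```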

&lt;h3&gt;
  
  
  Type B — Cortex (ChromaDB + BGE-M3)
&lt;/h3&gt;

&lt;p&gt;Type B is for methodology, concepts, and writeups. It uses ChromaDB with &lt;code&gt;BAAI/bge-m3&lt;/code&gt; embeddings: 1024-dimensional vectors, normalized, with Metal (MPS) acceleration on Apple Silicon.&lt;/p&gt;

&lt;p&gt;The build pipeline (&lt;code&gt;crawl_type_b.py&lt;/code&gt;) is a web crawler that reads target URLs from markdown configuration files, spiders up to 500 pages per site, and ingests text into the &lt;code&gt;methodology&lt;/code&gt; collection. Pages are chunked by double newline, capped at 1,500 characters per chunk, with chunks under 100 characters discarded. Raw HTML is cached locally so rebuilding the index after changing the embedding model doesn't require re-crawling.&lt;/p&gt;
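&lt;p&gt;The chunking rule fits in a few lines. A sketch of the behavior described above (the real &lt;code&gt;crawl_type_b.py&lt;/code&gt; may differ in details):&lt;/p&gt;

```python
# Sketch of the chunking described above: split on blank lines, cap
# chunks at 1,500 characters, and discard fragments under 100 characters.
def chunk_page(text, max_len=1500, min_len=100):
    chunks = []
    for para in text.split("\n\n"):
        para = para.strip()
        # slice oversized paragraphs into max_len pieces
        while len(para) > max_len:
            chunks.append(para[:max_len])
            para = para[max_len:]
        if len(para) >= min_len:
            chunks.append(para)
    return chunks
```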

&lt;p&gt;Currently indexed: 0xdf's blog (machine writeups, practical exploitation techniques) and the pwntools documentation. The ctf-wiki repository is cloned locally and available for ingestion but requires separate processing. One wrinkle: trafilatura — the primary extraction library — gets blocked by docs.pwntools.com, so the crawler falls back to urllib for that domain.&lt;/p&gt;

&lt;p&gt;A note on the embedding model: we upgraded from a 384-dimension model to BGE-M3 (1024 dimensions) mid-build. ChromaDB doesn't support mixed-dimension collections, so the upgrade required dropping and rebuilding the entire database. The build script handles this automatically, but it means every embedding model upgrade is a full rebuild.&lt;/p&gt;

&lt;h3&gt;
  
  
  Type C — Arsenal (JSON index)
&lt;/h3&gt;

&lt;p&gt;Type C is for local files: wordlists, web shells, and privilege escalation scripts. Rather than a database, it's a flat JSON index (&lt;code&gt;asset_index.json&lt;/code&gt;) mapping names and tags to absolute file paths.&lt;/p&gt;

&lt;p&gt;What's in the index: SecLists (password lists, directory wordlists, username lists, fuzzing payloads), PayloadsAllTheThings web shells (&lt;code&gt;.php&lt;/code&gt;, &lt;code&gt;.jsp&lt;/code&gt;, and others), and PEASS-ng pre-compiled binaries (&lt;code&gt;linpeas.sh&lt;/code&gt;, &lt;code&gt;winpeas.bat&lt;/code&gt;, &lt;code&gt;winPEASany.exe&lt;/code&gt;). The rockyou.txt wordlist is pre-decompressed and ready to use. Each entry carries a category, a tag list, and an absolute path — so when Operator needs to run &lt;code&gt;ffuf -w &amp;lt;path&amp;gt;&lt;/code&gt;, Librarian hands back the path, not an instruction to find the path.&lt;/p&gt;

&lt;p&gt;Tools like pwntools and ROPgadget are explicitly excluded. These are environment tools that Operator invokes directly; they're Type D — present in the execution environment but not indexed here. Type C is for files you transfer or reference, not binaries you run.&lt;/p&gt;
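&lt;p&gt;Because Type C is a flat JSON map rather than a database, the lookup is a substring filter, not a search engine. A minimal sketch (the field names &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;tags&lt;/code&gt;, and &lt;code&gt;path&lt;/code&gt; follow the article's description and are assumptions):&lt;/p&gt;

```python
# Sketch of a Type C lookup over a flat asset index: match the query
# against name, category, and tags, and return up to five absolute paths.
# The index shape here is an assumption based on the article.
def find_assets(index, query, limit=5):
    q = query.lower()
    hits = []
    for name, entry in index.items():
        haystack = " ".join([name, entry.get("category", "")] + entry.get("tags", []))
        if q in haystack.lower():
            hits.append(entry["path"])
            if len(hits) == limit:
                break
    return hits
```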

&lt;h2&gt;
  
  
  The LibrarianGateway
&lt;/h2&gt;

&lt;p&gt;The gateway (&lt;code&gt;librarian_gateway.py&lt;/code&gt;) is the single interface to all three types. Its job is to route queries and apply automatic fallback and enhancement logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TypeA query:
  FTS5 search binaries + tricks
  ├─ hit  → return payloads
  └─ miss → fallback: run TypeB semantic search, label as theory (no ready payload)

TypeB query:
  ChromaDB semantic search
  ├─ hit  → extract keywords (words &amp;gt;4 chars, first 3) → run TypeA lookup
  │          return theory + concrete examples
  └─ miss → nothing

TypeC query:
  JSON substring/tag filter (name, tags, category)
  → return up to 5 matches with absolute paths
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The TypeA→TypeB fallback is the most useful path in practice. When Operator asks Librarian for a precise command that doesn't exist in the SQLite database, the gateway doesn't just return nothing — it says "no payload found, but here's the methodology," giving Operator enough theory to reconstruct the approach from scratch.&lt;/p&gt;

&lt;p&gt;The TypeB→TypeA enhancement works in the opposite direction. After a semantic search returns methodology results, the gateway extracts keywords from the returned text and runs an FTS5 lookup to find concrete commands that illustrate the theory. This avoids the pattern where the agent understands the concept but has to guess the syntax.&lt;/p&gt;

&lt;p&gt;The keyword extraction is crude: take words longer than four characters, pick the first three, run FTS5. It works often enough to be useful but misses short, domain-specific terms like "ROP", "XSS", "SQL", or binary names like &lt;code&gt;nc&lt;/code&gt;. This is the part of the system most in need of improvement.&lt;/p&gt;
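&lt;p&gt;A minimal sketch of that heuristic, with hypothetical function names (the gateway's actual code isn't shown in this post):&lt;/p&gt;

```python
import re

def extract_keywords(text: str, min_len: int = 5, limit: int = 3) -> list[str]:
    """Crude keyword picker: words longer than four characters, first three.

    Mirrors the heuristic described above; the deduplication and the
    tokenizer are our own illustrative choices.
    """
    words = re.findall(r"[A-Za-z0-9_]+", text)
    seen, keywords = set(), []
    for w in words:
        lw = w.lower()
        if len(lw) >= min_len and lw not in seen:
            seen.add(lw)
            keywords.append(lw)
        if len(keywords) == limit:
            break
    return keywords

print(extract_keywords("Use a heap overflow to overwrite the tcache freelist pointer"))
# → ['overflow', 'overwrite', 'tcache']
```

&lt;p&gt;Run on a heap-exploitation snippet it keeps "overflow", "overwrite", and "tcache" but drops "heap" and would drop "ROP" entirely, exactly the failure mode described above.&lt;/p&gt;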

&lt;h2&gt;
  
  
  The build pipeline
&lt;/h2&gt;

&lt;p&gt;Building from scratch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Type A: parse GTFOBins YAML + LOLBAS JSON + HackTricks MD + PayloadsAllTheThings MD&lt;/span&gt;
python3 TypeA/build_db.py

&lt;span class="c"&gt;# Type B: crawl configured sites, embed with BGE-M3, store in ChromaDB&lt;/span&gt;
python3 TypeB/crawl_type_b.py

&lt;span class="c"&gt;# Type C: walk SecLists / PayloadsAllTheThings / PEASS-ng, emit JSON index&lt;/span&gt;
python3 TypeC/build_asset_index.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Type A builds in seconds. Type B is the slow one: BGE-M3 runs inference for every chunk, and crawling a 500-page site with 0.5s politeness delays takes a while. The raw HTML cache means that once crawled, rebuilding the vector index from cache is much faster than re-crawling.&lt;/p&gt;

&lt;p&gt;One dependency constraint worth noting: Python 3.14 breaks ChromaDB 1.5.1 due to a Pydantic compatibility issue. The project requires Python 3.10–3.13.&lt;/p&gt;

&lt;h2&gt;
  
  
  What worked at BearcatCTF
&lt;/h2&gt;

&lt;p&gt;The clearest signal was category accumulation. By the time Operator reached the eighth cryptography challenge, Librarian had enough indexed context — from prior solves and its own sources — that its briefing was materially better calibrated than what it gave for the first challenge. Forensics showed the same pattern: binwalk and foremost appeared in early Librarian responses, and by challenge three, Operator was starting with the right tools rather than discovering them mid-attempt.&lt;/p&gt;

&lt;p&gt;Type C was effective for web challenges. When Operator needed to upload a reverse shell or fuzz a directory, Librarian returned absolute paths rather than instructions to find the files. The friction reduction there is small but real in a timed context.&lt;/p&gt;

&lt;p&gt;The architecture's weak point was pwn. Type B's coverage of heap exploitation methodology is reasonable — 0xdf's writeups cover it well — but Type A's coverage of specific pwntools invocations is thin. Most GTFOBins entries are for privilege escalation, not binary exploitation. Operator had to reconstruct pwntools boilerplate from the docs rather than retrieving it from an indexed source.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we would change
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Improve Type B coverage for pwn.&lt;/strong&gt; The 0xdf blog and pwntools docs are the current sources. CTF-wiki is cloned locally but not yet ingested. Adding it, along with targeted crawls of well-known pwn writeup archives, would improve coverage for the challenge categories where theory-to-payload translation matters most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix keyword extraction.&lt;/strong&gt; The current heuristic (words &amp;gt;4 chars, first 3) was a placeholder that never got replaced. A minimal improvement would be to extract known CTF keywords — CVE numbers, binary names, technique names — before falling back to length heuristics.&lt;/p&gt;
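&lt;p&gt;One possible shape for that fix, as a sketch (the allowlist contents and function names are illustrative, not the project's):&lt;/p&gt;

```python
import re

# Illustrative allowlist; the real system would load a curated term list
# (CVE numbers, binary names, technique names).
KNOWN_TERMS = {"rop", "xss", "sql", "sqli", "jwt", "nc", "lfi", "ssrf", "ssti"}

def extract_keywords_v2(text: str, limit: int = 3) -> list[str]:
    """Prefer known CTF terms of any length, then fall back to the
    length heuristic for the remaining slots."""
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9_-]+", text)]
    picked = []
    for w in words:                      # pass 1: domain terms, however short
        if w in KNOWN_TERMS and w not in picked:
            picked.append(w)
    for w in words:                      # pass 2: generic length heuristic
        if len(picked) == limit:
            break
        if len(w) >= 5 and w not in picked:
            picked.append(w)
    return picked[:limit]

print(extract_keywords_v2("pivot with nc after the XSS gives SSRF on the admin panel"))
# → ['nc', 'xss', 'ssrf']
```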

&lt;p&gt;&lt;strong&gt;Add Type D integration hints.&lt;/strong&gt; When Librarian returns a methodology result that implies a specific tool invocation (ROPgadget, pwntools, gdb-peda), it should note the tool and suggest the invocation pattern even if it's not in the index. Currently there's no connection between Type B theory results and the Type D tools in the execution environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache invalidation for Type B.&lt;/strong&gt; The raw HTML cache has no expiration. 0xdf's blog gets new writeups; pwntools docs update with new releases. The current approach requires manually deleting cached files to pick up changes. A TTL or content-hash check would fix this.&lt;/p&gt;
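&lt;p&gt;A sketch of the TTL-plus-content-hash check proposed above (the one-week TTL and the names are illustrative):&lt;/p&gt;

```python
import hashlib
import pathlib
import time

CACHE_TTL_SECS = 7 * 24 * 3600  # illustrative one-week TTL

def cache_is_stale(cache_file: pathlib.Path, fresh_html: bytes) -> bool:
    """Within the TTL, trust the cache unconditionally. After the TTL,
    a refetch has happened; re-embed only if the content actually changed."""
    age = time.time() - cache_file.stat().st_mtime
    if CACHE_TTL_SECS > age:
        return False                  # within TTL: keep cached embeddings
    old = hashlib.sha256(cache_file.read_bytes()).hexdigest()
    new = hashlib.sha256(fresh_html).hexdigest()
    return old != new                 # only rebuild the index on a real change
```

&lt;p&gt;The content-hash step matters because re-embedding is the expensive part: a TTL alone would re-run BGE-M3 over pages that haven't changed.&lt;/p&gt;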

&lt;p&gt;The engine in its current form is functional and was net-positive at BearcatCTF. It's also clearly a first version. The architecture is right — the three-way split between immediate payloads, methodology, and local assets maps cleanly onto how a human CTF player actually uses different reference materials. The rough edges are in the population and retrieval quality within each layer, not in the design.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://claw-stack.com/en/blog/ctf-knowledge-engine" rel="noopener noreferrer"&gt;claw-stack.com&lt;/a&gt;. We're building an open-source AI agent runtime — check out the &lt;a href="https://claw-stack.com/en/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; or &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ctf</category>
      <category>knowledgeretrieval</category>
      <category>rag</category>
    </item>
    <item>
      <title>Building a Persistent Memory System for AI Agents</title>
      <dc:creator>Qiushi</dc:creator>
      <pubDate>Sat, 07 Mar 2026 23:05:28 +0000</pubDate>
      <link>https://dev.to/qiushiwu/building-a-persistent-memory-system-for-ai-agents-1h52</link>
      <guid>https://dev.to/qiushiwu/building-a-persistent-memory-system-for-ai-agents-1h52</guid>
      <description>&lt;p&gt;The canonical advice for giving an AI agent memory is: use a vector database. Store embeddings, do similarity search, retrieve relevant chunks. This is good advice for retrieval-augmented generation systems where the query pattern is "find documents similar to this question." It's not necessarily the right answer for an agent that needs to remember &lt;em&gt;what it did last Tuesday&lt;/em&gt; and &lt;em&gt;what decisions it made about project X six weeks ago&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here's how we built the Claw-Stack memory system, why it looks the way it does, and what we learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with stateless agents
&lt;/h2&gt;

&lt;p&gt;Every Claude session starts fresh. The model has no memory of previous sessions unless you explicitly inject that context at the start. For a research assistant that you talk to once, this is fine. For an autonomous agent that runs every day, accumulates knowledge about your projects, and needs to maintain consistent behavior over weeks, it's a fundamental problem.&lt;/p&gt;

&lt;p&gt;The naive solution is to dump everything into the system prompt. This works until you've accumulated a few hundred KB of context, at which point two things happen: you start hitting context limits, and the model's ability to use the early parts of a very long context degrades. The agent starts ignoring things you told it three months ago because they're too far from the current interaction.&lt;/p&gt;

&lt;p&gt;We needed a memory system with two properties: it had to be &lt;em&gt;selective&lt;/em&gt; (only inject what's relevant to the current session), and it had to be &lt;em&gt;human-readable&lt;/em&gt; (we needed to be able to audit, edit, and correct what the agent believed).&lt;/p&gt;

&lt;h2&gt;
  
  
  The three-layer architecture
&lt;/h2&gt;

&lt;p&gt;The memory system has three layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: MEMORY.md&lt;/strong&gt; — a compact index loaded at the start of every session. This is a structured Markdown file with sections for recent activity, active projects, key contacts, and infrastructure notes. It's intentionally kept short — the system enforces a byte-size cap — so it doesn't consume the context budget on sessions with large task descriptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Per-topic files&lt;/strong&gt; — longer Markdown files in &lt;code&gt;memory/&lt;/code&gt; that go into depth on specific subjects. &lt;code&gt;projects/claw-stack.md&lt;/code&gt;, &lt;code&gt;contacts/key-people.md&lt;/code&gt;, &lt;code&gt;infrastructure/servers.md&lt;/code&gt;. These aren't loaded automatically. The agent has a &lt;code&gt;read_memory&lt;/code&gt; tool that fetches a specific file when it needs depth on a topic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: SQLite + QMD vector search&lt;/strong&gt; — a SQLite database with FTS5 full-text search and a QMD (a vector embedding tool built on top of SQLite) index for semantic search. When the agent gets a query it can't answer from MEMORY.md and the per-topic files, it runs a vector search across all memory content to find relevant fragments.&lt;/p&gt;
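&lt;p&gt;As an illustration of the keyword side of Layer 3, a self-contained FTS5 sketch (the schema and content are invented for the example; the project's actual tables differ):&lt;/p&gt;

```python
import sqlite3

# In-memory stand-in for the Layer 3 full-text index.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memory_fts USING fts5(path, content)")
db.executemany(
    "INSERT INTO memory_fts VALUES (?, ?)",
    [
        ("projects/claw-stack.md", "Claw-Stack runtime uses OpenClaw sessions"),
        ("infrastructure/servers.md", "backup cron runs nightly on the primary box"),
    ],
)

# FTS5 MATCH returns ranked full-text hits across all memory files.
rows = db.execute(
    "SELECT path FROM memory_fts WHERE memory_fts MATCH ? ORDER BY rank",
    ("backup",),
).fetchall()
print(rows)  # → [('infrastructure/servers.md',)]
```

&lt;p&gt;The vector side (QMD) layers embeddings on top of the same database, so keyword and semantic lookups share one file.&lt;/p&gt;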

&lt;h2&gt;
  
  
  Why not a vector database
&lt;/h2&gt;

&lt;p&gt;The short answer: for our scale and access patterns, the operational overhead of a standalone vector database isn't worth it.&lt;/p&gt;

&lt;p&gt;The main reasons we chose SQLite + FTS5 over a dedicated vector database:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inspectability.&lt;/strong&gt; With a dedicated vector database, you can't easily tell whether a retrieval was correct without purpose-built tooling to query it. A Markdown file you can open in any text editor. Our SQLite database opens with any SQLite tool, and the schema is tables we wrote ourselves.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operational simplicity.&lt;/strong&gt; The entire memory store is a single &lt;code&gt;.db&lt;/code&gt; file plus a directory of Markdown files. No separate process to manage, no format migrations, no version compatibility issues between the database binary and your data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sufficient for our scale.&lt;/strong&gt; We have around 50,000 words of memory content across all files. SQLite FTS5 can do full-text search across that in milliseconds. The cases where vector similarity is meaningfully better than keyword search are real but rare enough that the operational overhead isn't worth it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;QMD (the vector search layer) sits on top of SQLite. Embeddings are computed locally using a small quantized model and stored in a SQLite table alongside the text. Re-indexing takes a few seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The organizer pipeline
&lt;/h2&gt;

&lt;p&gt;Memory doesn't manage itself. After every session, an organizer pipeline runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;raw session files
  → scan memory/*.md (MD5 hash check, skip unchanged)
  → extract facts per category (project updates, decisions, contacts)
  → deduplicate against existing memory
  → write updated per-topic files
  → rebuild SQLite FTS5 index
  → update MEMORY.md index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extraction step uses an LLM (Gemini as primary, with Claude Haiku as fallback): it reads a session transcript and produces structured notes in a specific format. The deduplication step is rule-based: if the new fact is a substring of an existing entry, skip it; if it contradicts an existing entry, flag it for human review.&lt;/p&gt;
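&lt;p&gt;The two dedup rules can be sketched as follows; the subject heuristic (text before the first colon) and the function name are ours, not the organizer's actual code:&lt;/p&gt;

```python
def dedup(new_fact: str, existing: list[str]) -> str:
    """Rule-based dedup: substring of an existing entry → skip,
    conflicting claim about the same subject → flag for human review,
    otherwise write it."""
    norm = new_fact.strip().lower()
    for fact in existing:
        if norm in fact.lower():
            return "skip"            # already covered by an existing entry
    subject = norm.split(":")[0]
    for fact in existing:
        if fact.lower().startswith(subject) and fact.lower() != norm:
            return "flag"            # same subject, different claim
    return "write"

memory = ["server: backups run nightly", "claw-stack: uses openclaw runtime"]
print(dedup("server: backups run weekly", memory))        # → flag
print(dedup("backups run nightly", memory))               # → skip
print(dedup("qmd: embeddings stored in sqlite", memory))  # → write
```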

&lt;p&gt;The pipeline runs on a cron schedule (every few hours during active work periods) rather than immediately after every session. This batches the processing cost and avoids writing memory files that will be immediately overwritten by a subsequent session.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MEMORY.md bloat problem
&lt;/h2&gt;

&lt;p&gt;The most painful lesson was about MEMORY.md growth.&lt;/p&gt;

&lt;p&gt;We started with no limit on MEMORY.md length. The organizer kept adding to it. After six weeks, MEMORY.md was over 700 lines long. This had a predictable effect: session startup consumed most of the context budget before any actual task content was loaded, and the model was visibly struggling to synthesize a several-hundred-line brief while also doing useful work.&lt;/p&gt;

&lt;p&gt;The fix was to change the organizer's behavior and enforce a size cap. Instead of appending new facts to MEMORY.md directly, the organizer writes them to per-topic files and updates MEMORY.md with &lt;em&gt;pointers&lt;/em&gt; — one line that says "see &lt;code&gt;projects/claw-stack.md&lt;/code&gt; for current status" rather than embedding the full status in MEMORY.md. The system now enforces a byte-size limit on MEMORY.md to prevent runaway growth.&lt;/p&gt;
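&lt;p&gt;A minimal sketch of cap enforcement (the 8 KB figure and the bottom-up trimming strategy are illustrative; the post doesn't state the real cap):&lt;/p&gt;

```python
def enforce_cap(memory_md: str, cap_bytes: int = 8192) -> str:
    """Trim MEMORY.md to a byte budget by dropping whole lines from the
    bottom, on the assumption that the oldest pointers sit lowest.
    Bytes, not lines: one dense line can cost more context than ten
    pointer lines, which is why the cap is byte-sized."""
    lines = memory_md.splitlines()
    while lines and len("\n".join(lines).encode("utf-8")) > cap_bytes:
        lines.pop()
    return "\n".join(lines)
```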

&lt;p&gt;This required us to rethink what MEMORY.md is for. It's not a summary of everything the agent knows. It's a &lt;em&gt;session briefing&lt;/em&gt; — the minimum context needed to orient the agent at the start of a session. Anything beyond that is fetched on demand.&lt;/p&gt;

&lt;p&gt;After the refactor, session startup is noticeably faster, and the model makes better use of the context it has. Keeping MEMORY.md truly compact is an ongoing discipline — we found that a strict line count is less useful than a byte-size limit, and even that requires the organizer to be aggressive about using pointers rather than inline content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory as human-readable state
&lt;/h2&gt;

&lt;p&gt;The design philosophy behind the system is that &lt;em&gt;agent memory should be human-readable and human-editable&lt;/em&gt;. This is a constraint we imposed deliberately.&lt;/p&gt;

&lt;p&gt;When the agent develops incorrect beliefs — and it does, occasionally — we can find the wrong entry in a Markdown file, edit it, and the fix takes effect in the next session. With a vector database, correcting a wrong belief requires knowing which embedding to update, deleting it, writing a new one, and potentially invalidating cached retrievals. With Markdown files, you open the file and change the text.&lt;/p&gt;

&lt;p&gt;This also makes auditing straightforward. Before trusting an autonomous agent to make decisions on your behalf, you need to be able to read its beliefs and verify they're correct. The entire memory system is a directory of Markdown files. Any text editor works.&lt;/p&gt;

&lt;p&gt;The tradeoff is that the format is fixed. Our memory files follow a specific schema that the organizer knows how to parse and update. If you want to add a new category of memory, you need to update both the file schema and the organizer. For a research project with one operator, that's acceptable. For a production system with many agents and many types of memory, you'd want something more flexible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we'd do differently
&lt;/h2&gt;

&lt;p&gt;If we were starting over:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use a smaller MEMORY.md from day one.&lt;/strong&gt; We wasted weeks cleaning up the bloat that could have been avoided with an initial size cap. A byte-size limit with pointer-based entries is a better target than a fixed line count for a daily-use assistant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separate episodic from semantic memory earlier.&lt;/strong&gt; "What happened in Tuesday's session" (episodic) and "what is the Claw-Stack architecture" (semantic) are different types of memory that benefit from different retrieval strategies. We mixed them initially and spent time later separating them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the audit tooling first.&lt;/strong&gt; The hardest part of maintaining an agent memory system isn't the indexing or retrieval — it's knowing when the memory is wrong. We built the audit view (a script that shows you what the agent believes about a given topic) too late. It should have been the first tool we wrote.&lt;/p&gt;

&lt;p&gt;The memory system is one of the parts of Claw-Stack we're most satisfied with. It's boring infrastructure that works reliably, which is exactly what memory should be.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://claw-stack.com/en/blog/persistent-memory-system" rel="noopener noreferrer"&gt;claw-stack.com&lt;/a&gt;. We're building an open-source AI agent runtime — check out the &lt;a href="https://claw-stack.com/en/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; or &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>memory</category>
      <category>architecture</category>
      <category>aiagents</category>
    </item>
    <item>
      <title>24 Hours, 40 Challenges: How an AI Team Placed Top 6% at BearcatCTF 2026</title>
      <dc:creator>Qiushi</dc:creator>
      <pubDate>Sat, 07 Mar 2026 22:52:10 +0000</pubDate>
      <link>https://dev.to/qiushiwu/24-hours-40-challenges-how-an-ai-team-placed-top-6-at-bearcatctf-2026-4d7</link>
      <guid>https://dev.to/qiushiwu/24-hours-40-challenges-how-an-ai-team-placed-top-6-at-bearcatctf-2026-4d7</guid>
      <description>&lt;p&gt;Final result: rank #20 out of 362 teams. 40 of 44 challenges solved. 24 hours of unattended autonomous operation. These numbers revealed something we didn't expect — not about the AI, but about what structured agent coordination makes possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trinity architecture
&lt;/h2&gt;

&lt;p&gt;BearcatCTF was the first real-world deployment of what we call the &lt;strong&gt;Trinity&lt;/strong&gt;: three specialized agents with distinct roles, operating on a shared knowledge base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commander&lt;/strong&gt; (Claude Opus) — the strategic layer. Read the challenge list, estimated difficulty, assigned work, tracked progress, decided when to abandon dead-ends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operator&lt;/strong&gt; (Claude Sonnet) — the solver. Received assignments plus briefings from Librarian, then worked the problem: writing scripts, testing payloads, reading source code, running tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Librarian&lt;/strong&gt; (Claude Haiku) — the knowledge manager. After each solve, extracted key techniques and stored them in a shared blackboard. When Operator hit a new challenge, Librarian pulled relevant entries — "here's what we learned about JWT forgery two hours ago."&lt;/p&gt;

&lt;p&gt;Communication happened through OpenClaw's &lt;code&gt;sessions_spawn&lt;/code&gt; and auto-announce mechanism. A persistent &lt;code&gt;blackboard.json&lt;/code&gt; served as the durable state layer, tracking findings and the current attack plan across spawns.&lt;/p&gt;
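&lt;p&gt;As a sketch of the durable-state side, assuming a simple findings-list schema (OpenClaw's actual &lt;code&gt;blackboard.json&lt;/code&gt; layout isn't shown here):&lt;/p&gt;

```python
import json
import os
import tempfile

def post_finding(path: str, agent: str, finding: dict) -> None:
    """Append a finding to the shared blackboard with a write-then-rename,
    so a crashed writer never leaves a torn file for the next spawn.
    Field names are illustrative."""
    board = {"findings": [], "plan": None}
    if os.path.exists(path):
        with open(path) as f:
            board = json.load(f)
    board["findings"].append({"agent": agent, **finding})
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(board, f, indent=2)
    os.replace(tmp, path)  # atomic on POSIX: readers see old or new, never half
```

&lt;p&gt;The atomic replace matters because agents spawn and die across the 24 hours; the blackboard must survive any of them crashing mid-write.&lt;/p&gt;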

&lt;h2&gt;
  
  
  The first few hours
&lt;/h2&gt;

&lt;p&gt;44 challenges across 7 categories — reverse engineering (7), OSINT (5), forensics (7), cryptography (8), web (4), misc (8), and pwn (5). Commander sorted by estimated solve time and started dispatching.&lt;/p&gt;

&lt;p&gt;The first hours were fast. Web challenges fell quickly: SQL injection, insecure cookies, JWT &lt;code&gt;alg: none&lt;/code&gt;. Crypto opened with encoding challenges that Operator dispatched in minutes. Librarian was cataloguing throughout.&lt;/p&gt;

&lt;p&gt;By hour four, the solve rate slowed. Commander was choosing more carefully, deprioritizing brute-force computation and flagging image challenges as low-probability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The anti-cheating mechanism
&lt;/h2&gt;

&lt;p&gt;We built a rule early: if a challenge was solved in under three minutes, an automatic audit ran before submitting the flag. The auditor reviewed session history and checked whether the agent had actually worked the problem.&lt;/p&gt;

&lt;p&gt;This caught a real case: on one pwn challenge, Operator read a &lt;code&gt;README.md&lt;/code&gt; containing the flag rather than exploiting the service. The session was marked &lt;code&gt;CHEATED&lt;/code&gt; and Commander was told to redo it through legitimate exploitation.&lt;/p&gt;

&lt;p&gt;The audit also made our logs more trustworthy. Every fast solve had been verified.&lt;/p&gt;
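&lt;p&gt;The gate can be sketched as follows; the transcript check shown is deliberately the crudest possible proxy, standing in for the real auditor's review of the full session history:&lt;/p&gt;

```python
AUDIT_THRESHOLD_SECS = 180  # "solved in under three minutes"

def review_solve(elapsed_secs: float, transcript: str) -> str:
    """Fast solves are audited before the flag is submitted; the README
    heuristic here mirrors the real case the auditor caught."""
    if elapsed_secs >= AUDIT_THRESHOLD_SECS:
        return "SUBMIT"
    if "cat README.md" in transcript and "exploit" not in transcript:
        return "CHEATED"             # flag found without working the problem
    return "SUBMIT_AFTER_AUDIT"

print(review_solve(90, "cat README.md\nfound flag{...}"))  # → CHEATED
print(review_solve(600, "wrote sqlmap payload"))           # → SUBMIT
```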

&lt;h2&gt;
  
  
  The middle game: Librarian's value
&lt;/h2&gt;

&lt;p&gt;Hours six through twenty were where Librarian integration showed its value most clearly.&lt;/p&gt;

&lt;p&gt;Forensics challenges often share techniques — steganography, file carving, metadata extraction. As Librarian accumulated knowledge from solved forensics challenges, Operator's first attempts on new ones were better-calibrated. Instead of starting from first principles, Operator received briefings: "previous forensics used binwalk and foremost; JPEG steganography appeared twice."&lt;/p&gt;

&lt;p&gt;The eighth crypto challenge was solved significantly faster than the first — similar difficulty, but by then Librarian had extracted approaches to substitution ciphers, padding oracles, and XOR key recovery.&lt;/p&gt;

&lt;p&gt;Commander also made calls we wouldn't have made manually. Around hour sixteen, it deprioritized two shellcode challenges and redirected Operator to unstarted OSINT challenges. The OSINT batch went quickly. Good call.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four unsolved challenges
&lt;/h2&gt;

&lt;p&gt;We finished 40/44. The four unsolved were all &lt;strong&gt;visual/image analysis tasks&lt;/strong&gt;: a degraded QR code, object identification in photographs, and low-resolution character reading.&lt;/p&gt;

&lt;p&gt;Not surprising in retrospect. Claude's vision capabilities aren't optimized for pixel-level analysis. Commander recognized this pattern around hour fifteen and stopped assigning image-heavy tasks, flagging them as "pending human review." No human was available.&lt;/p&gt;

&lt;p&gt;The right fix: integrate a dedicated image analysis tool — a custom MCP server wrapping specialized vision models.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The blackboard pattern works.&lt;/strong&gt; A persistent JSON file as durable state, with spawn/announce for communication, is simple and effective coordination without tight coupling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model selection by role matters.&lt;/strong&gt; Haiku for Librarian (high-volume, latency-sensitive). Opus for Commander (judgment calls). Sonnet for Operator (balanced depth/cost).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision is the ceiling.&lt;/strong&gt; Four of four failures required precision image analysis. This gap can't be closed by prompt engineering alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unattended operation is achievable, but fragile in specific ways.&lt;/strong&gt; 24 hours, no crashes, no loops, no obviously wrong flags. But the system didn't ask for help when it hit something it couldn't handle. When should an autonomous agent stop vs. move on? For CTF, moving on is usually right. For other domains, it might not be.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Trinity architecture is part of the &lt;a href="https://claw-stack.com" rel="noopener noreferrer"&gt;Claw-Stack&lt;/a&gt; research project. Full documentation: &lt;a href="https://claw-stack.com/en/docs" rel="noopener noreferrer"&gt;claw-stack.com/en/docs&lt;/a&gt;. See also our post on &lt;a href="https://claw-stack.com/en/blog/persistent-memory-system" rel="noopener noreferrer"&gt;building persistent memory for AI agents&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>ctf</category>
      <category>agents</category>
    </item>
    <item>
      <title>OpenClaw vs LangChain: Why We Don't Use Frameworks</title>
      <dc:creator>Qiushi</dc:creator>
      <pubDate>Sat, 07 Mar 2026 22:52:06 +0000</pubDate>
      <link>https://dev.to/qiushiwu/openclaw-vs-langchain-why-we-dont-use-frameworks-1aee</link>
      <guid>https://dev.to/qiushiwu/openclaw-vs-langchain-why-we-dont-use-frameworks-1aee</guid>
      <description>&lt;p&gt;The first question people ask when they see our setup is: why not just use LangChain? It's the dominant Python framework for AI agents, it has a huge ecosystem, and it handles a lot of plumbing. The answer has to do with what "framework" means and what we actually needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What OpenClaw is (and what it isn't)
&lt;/h2&gt;

&lt;p&gt;OpenClaw is an npm package. You install it, configure it, and it runs as a local process that gives Claude access to tools — file system, shell, MCP servers, memory. It's a &lt;strong&gt;runtime&lt;/strong&gt;, not a framework. It doesn't tell you how to organize your agent logic.&lt;/p&gt;

&lt;p&gt;This is a meaningful distinction. OpenClaw has opinions about &lt;em&gt;how tools get called&lt;/em&gt;, but no opinion about &lt;em&gt;what your agent does&lt;/em&gt;. There's no base class to extend, no chain to compose, no graph to define. You write a &lt;code&gt;CLAUDE.md&lt;/code&gt; file that describes how your agent should behave, and OpenClaw runs a Claude session with that context and the tools you've registered.&lt;/p&gt;

&lt;p&gt;LangChain is the opposite — it provides the skeleton, you fill in the details. Useful when the skeleton matches your use case. A problem when it doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The abstraction problem
&lt;/h2&gt;

&lt;p&gt;LangChain's abstractions are designed around composing LLM calls in a pipeline: input → retrieval → LLM → output → next call. This works well for RAG systems. It starts to fight you when you need something that doesn't fit the pipeline model.&lt;/p&gt;

&lt;p&gt;Our multi-agent meeting protocol, for example, runs multiple Claude instances as "participants" in a structured discussion. Each participant reads the conversation history, produces a response, and optionally signals consensus. The coordinator decides whether to continue. None of this fits neatly into LangChain's agent/tool model.&lt;/p&gt;

&lt;p&gt;With OpenClaw, we just write the coordination logic ourselves. The coordinator is an OpenClaw session that reads shared state, spawns participant agents as subprocesses, collects responses, and decides what to do next. Every line is doing something we understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging experience
&lt;/h2&gt;

&lt;p&gt;When something goes wrong with a LangChain agent, the error is often several layers deep. You're debugging a runnable that calls a chain that calls an LLM that returns output parsed by an output parser...&lt;/p&gt;

&lt;p&gt;With OpenClaw, there are two places to look: your tool implementation and the Claude session log. We've run sessions lasting hours with dozens of tool calls. When something goes wrong in hour two, you want to read the session log and understand exactly what happened. With a thin runtime, the session log &lt;em&gt;is&lt;/em&gt; the complete record.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lock-in and integrations
&lt;/h2&gt;

&lt;p&gt;LangChain has 500+ integration packages, many community-maintained. OpenClaw's integration model is different: integrations happen through &lt;strong&gt;MCP&lt;/strong&gt; (Model Context Protocol). An MCP server is just a process that exposes tools. Writing one is about 50 lines of code. When a third-party integration breaks, the fix is isolated — it doesn't cascade through your agent logic.&lt;/p&gt;

&lt;p&gt;This is why we could build our web automation layer (26 Chrome DevTools Protocol tools), content aggregator, and backup integration without any framework code. Each is a standalone MCP server.&lt;/p&gt;
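&lt;p&gt;To make the "about 50 lines" claim concrete, here is a stdlib-only sketch of the &lt;em&gt;shape&lt;/em&gt; of such a server. The real protocol adds an initialization handshake and tool-schema discovery on top of JSON-RPC, and in practice you'd use the official MCP SDK rather than hand-rolling the wire format:&lt;/p&gt;

```python
import json
import sys

# One registered tool; roughly a line per tool is what keeps servers small.
TOOLS = {
    "word_count": lambda args: str(len(args["text"].split())),
}

def handle(request: dict) -> dict:
    """Dispatch a single tools/call-style request to a registered tool."""
    name = request["params"]["name"]
    args = request["params"].get("arguments", {})
    return {"jsonrpc": "2.0", "id": request["id"], "result": TOOLS[name](args)}

def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Event loop: one JSON request per line in, one response per line out."""
    for line in stdin:
        stdout.write(json.dumps(handle(json.loads(line))) + "\n")
        stdout.flush()
```

&lt;p&gt;Because the server is just a process speaking a line protocol, a broken integration is debugged by replaying a single request against it, with no framework in between.&lt;/p&gt;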

&lt;h2&gt;
  
  
  When LangChain makes sense
&lt;/h2&gt;

&lt;p&gt;This isn't a blanket argument against LangChain. If you're building a RAG system — retrieve documents, pass to LLM, return answer — LangChain maps well. It also has strong vector database and document loader integrations.&lt;/p&gt;

&lt;p&gt;Our use case is different: an autonomous agent system that runs for days, accumulates state over weeks, coordinates multiple agents, and needs to be debugged when things go wrong. For that, we wanted the most transparent runtime we could find.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle: thin runtime, rich skills
&lt;/h2&gt;

&lt;p&gt;Our architecture follows "thin runtime, rich skills." OpenClaw handles tool dispatch, session management, and the Claude interface. Everything else — memory, security, multi-agent coordination, browser automation — lives in separate, independently-deployable modules.&lt;/p&gt;

&lt;p&gt;Each skill can be tested in isolation, replaced without touching the others, and reasoned about independently. The downside is more wiring to write. The upside is that when something breaks, it's almost always in the wiring — the part you wrote and understand.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We're building this as a &lt;a href="https://claw-stack.com" rel="noopener noreferrer"&gt;personal research project&lt;/a&gt;. If you're interested in agent architecture, memory systems, or multi-agent coordination, check out our &lt;a href="https://claw-stack.com/en/docs" rel="noopener noreferrer"&gt;full documentation&lt;/a&gt; or the &lt;a href="https://claw-stack.com/llms-full.txt" rel="noopener noreferrer"&gt;llms-full.txt&lt;/a&gt; for AI-readable context.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>langchain</category>
      <category>agents</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
