shinji shimizu

Posted on Jun 10 • Originally published at kotonia.ai

Implementing Claude Code's Memory Model as a Dreaming Layer on 58 Articles

#llm #agents #tfidf #codex

In a single session I built a pipeline that consolidates the 58 tech-blog articles of my service Kotonia (ja/en/zh) into a semantic index, then uses that index to detect duplicates when mining new article ideas. The full path (raw articles → semantic index → TF-IDF dedup → chunked draft generation) runs on a local Gemma 4 26B driven by Codex CLI. Design and implementation notes follow.

The motivation, and the framing of how a solo developer's accumulated assets compound, are in the companion piece: The Day a Solo Developer's Accumulated Assets Finally Started to Compound

This piece keeps the technical notes.

1. The Problem — When Title-Only Dedup Broke

Mining v1 produced a draft and I (the user) noticed "this overlaps with an existing article." The overlap target was voice-first-local-llm (importance=9 flagship).

New draft thesis: "tokens per chunk is a hidden voice-chat latency driver"
Existing article §3.3: "★ Streaming granularity — the structural difference that decides voice experience"

Same numbers (Local Gemma 1.0 tok/chunk, Haiku 10-16, Gemini 8-24). A perfect duplicate.

The mining agent had called art-done-list (title + description) for the dedup check. But the existing article's title is "Cutting short-form LLM latency from 600ms to 22ms," with TTFB as the headline sales pitch; §3.3 streaming granularity is buried in an H2 subsection. At title level, nothing overlapped, so the check came back clean.

That's the starting point for this article.

2. The Design — Three Layers: episodic ↔ semantic ↔ procedural

Breaking down why Claude Code's memory system works:

Entries are small (1-3KB, one topic each) → subtopics don't get buried
Hooks are retrieval-tuned and curated → search terms re-appear in the hook
A smart model writes hooks semi-autonomously → past me distills for future me

Articles have the opposite shape. Each runs 5-15KB, important subtopics are buried in subsection bodies, and descriptions are SEO summaries rather than retrieval-tuned hooks. That makes them too heavy for an agent to load directly.

I bridged them with an intermediate layer that I named the Dreaming layer. The name borrows the biological metaphor of memory consolidation during sleep, from hippocampus to cortex.

episodic (raw articles + memory files)
    ↓ Dreaming agent (periodic digestion)
semantic (concepts_covered_ja[] / importance / data_points / sections)
    ↓ agent reverse-lookup (art-concepts-find / TF-IDF cosine)
procedural (mining / drafting / publishing)

A semantic entry for an article looks like:

{
  "slug": "voice-first-local-llm",
  "locale": "ja",
  "thesis_ja": "Ditching API, building voice-first with self-hosted local 26B",
  "importance": {
    "score": 9,
    "factors": {
      "pv_count_30d": 6,
      "avg_scroll": 67.0,
      "avg_dwell_sec": 170,
      "has_bench_data": true,
      "novelty_high": true
    }
  },
  "concepts_covered_ja": [
    "TTFB (time-to-first-byte): local vs API",
    "Streaming granularity (tokens per chunk)",
    "Gemma 4 26B model selection rationale",
    "Ditto + LLM co-residency GPU design"
  ],
  "data_points": [
    {"name": "TTFB Local", "value": "17-25ms"},
    {"name": "Streaming granularity Local", "value": "1.0 tok/chunk"}
  ],
  "sections": [
    {"id": "3.3", "title": "Streaming granularity — the structural difference that decides voice experience"}
  ]
}

The key point: concepts_covered_ja[] must be normalized to Japanese canonical names. Translated EN/ZH articles use the same JP concept strings. That single normalization becomes the dedup primitive downstream.
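To make the "dedup primitive" concrete, here is a minimal sketch of a concept → article reverse lookup over semantic entries. The function names and the two sample entries are hypothetical, and this is not the actual art-concepts-find code. It only shows why one canonical string per concept lets a single lookup hit every locale.

```python
from collections import defaultdict

def build_concept_index(entries):
    """Map each canonical concept string to the (slug, locale) pairs that cover it."""
    index = defaultdict(set)
    for entry in entries:
        for concept in entry["concepts_covered_ja"]:
            index[concept].add((entry["slug"], entry["locale"]))
    return index

def find_concepts(index, pattern):
    """Return {concept: sorted hits} for concepts containing `pattern` (case-insensitive)."""
    needle = pattern.lower()
    return {c: sorted(hits) for c, hits in index.items() if needle in c.lower()}

# Hypothetical entries: the ja and en versions share one canonical concept string.
entries = [
    {"slug": "voice-first-local-llm", "locale": "ja",
     "concepts_covered_ja": ["Streaming granularity (tokens per chunk)"]},
    {"slug": "voice-first-local-llm", "locale": "en",
     "concepts_covered_ja": ["Streaming granularity (tokens per chunk)"]},
]
index = build_concept_index(entries)
hits = find_concepts(index, "streaming granularity")
```

If the en entry used its own translated concept string, the same lookup would see only one locale, and the downstream similarity check would treat the two as unrelated.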

3. Tools — Thin CLIs the Agent Calls

Codex CLI drives Gemma 4 26B locally. Tool calling via --enable-auto-tool-choice --tool-call-parser gemma4 gives an OpenAI-compatible surface. Each tool is ~50-100 lines of Python (stdlib only) and carries the art- prefix:

| tool | role |
| --- | --- |
| `art-articles-list --needs-dreaming` | DB ∪ FS articles + dreaming state |
| `art-pv-count --slug X` | analytics_events → PV / scroll / dwell |
| `art-source-pull <slug> [--section N]` | pull just one H2/H3 section of an article |
| `art-dream-write` | upsert a semantic entry into articles_index.jsonl |
| `art-concepts-find <pattern>` | concept → article reverse-lookup (the mining dedup primitive) |
| `art-ideas-check` | evaluate a candidate idea via TF-IDF (the core of this article) |
| `art-ideas-add` | push an idea to the pool (calls art-ideas-check internally) |
| `art-draft-append` | append a chunk of draft body to a buffer |
| `art-draft-commit` | finalize buffer → `articles/_drafts/<slug>.md` |

The Dreaming agent semantically encodes one article at a time using these. Importance scoring uses this rubric:

+2: PV >= 100 (sigmoid log-scale)
+1: avg_scroll >= 0.7 AND avg_dwell_sec >= 60
+2: bench numbers / failure root cause / named decision
+2: novel concept not yet in index
+1: evergreen value (not time-sensitive)
-2: redundant with an already-indexed flagship

PV comes from a homegrown analytics_events table (cookie-less first-party tracker). The fact that the article platform and analytics co-reside in one DB you can hit directly is a solo-dev win.
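A rough sketch of the rubric as code, not the actual scoring tool. Assumptions: the PV rule is treated as a hard threshold (the sigmoid log-scale shaping is not reproduced), avg_scroll is read as a 0-1 fraction as the rubric writes it, has_bench_data stands in for the whole "bench numbers / failure root cause / named decision" row, and the `base` argument plus the `evergreen` and `redundant_with_flagship` keys are hypothetical names.

```python
def importance_score(factors, base=0):
    """Sum the rubric's bonuses and penalty for one article's factor dict."""
    score = base  # hypothetical starting value; the rubric above does not state one
    if factors.get("pv_count_30d", 0) >= 100:
        score += 2  # simplified: hard threshold instead of a sigmoid on log(PV)
    if factors.get("avg_scroll", 0.0) >= 0.7 and factors.get("avg_dwell_sec", 0) >= 60:
        score += 1
    if factors.get("has_bench_data"):
        score += 2
    if factors.get("novelty_high"):
        score += 2
    if factors.get("evergreen"):  # hypothetical key
        score += 1
    if factors.get("redundant_with_flagship"):  # hypothetical key
        score -= 2
    return score

example_score = importance_score({
    "pv_count_30d": 120,
    "avg_scroll": 0.8,
    "avg_dwell_sec": 90,
    "has_bench_data": True,
    "novelty_high": True,
    "evergreen": True,
})
```

Keeping the rule in a plain function means the Dreaming agent only has to supply the factor values, and the arithmetic stays deterministic.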

4. TF-IDF Dedup — Substituting Tool Structure for Agent Self-Discipline

At mining v1 the prompt instructed the agent to call art-concepts-find for dedup. The agent still let three duplicates slip through (details: Don't Trust an Agent's Self-Discipline).

The fix: embed a dedup gate directly inside art-ideas-add. The guts of evaluate_idea():

def evaluate_idea(title, angle, sources, ...):
    articles, ideas = load_corpus()
    # infer the candidate's concepts from the canonical vocab
    pseudo = {"concepts": _infer_concepts(title, angle, sources, articles)}

    # IDF (rare concepts weighted more)
    idf = build_idf(articles + ideas)
    new_vec = vectorize(pseudo["concepts"], idf)

    conflicts = []
    for a in articles:
        sim = cosine(new_vec, vectorize(a["concepts"], idf))
        if a["importance_score"] >= 7 and sim >= 0.25:
            conflicts.append({"kind": "flagship_concept", ...})
    for i in ideas:
        sim = cosine(new_vec, vectorize(i["concepts"], idf))
        if sim >= 0.35:
            conflicts.append({"kind": "pool_dup", ...})

    return {"allow": not conflicts, "conflicts": conflicts}
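The helpers build_idf(), vectorize(), and cosine() are not shown in the excerpt. The following is one plausible stdlib-only sketch, assuming each corpus item carries a "concepts" list as in the loop above. The smoothed IDF formula and the default weight for unseen concepts are my assumptions, not necessarily what the real tool does.

```python
import math
from collections import Counter

def build_idf(docs):
    """Smoothed IDF per concept: concepts found in fewer docs get larger weights."""
    n = len(docs)
    df = Counter(c for d in docs for c in set(d["concepts"]))
    return {c: math.log((n + 1) / (count + 1)) + 1.0 for c, count in df.items()}

def vectorize(concepts, idf, default=1.0):
    """Sparse vector {concept: weight}; concepts missing from idf fall back to `default`."""
    return {c: idf.get(c, default) for c in set(concepts)}

def cosine(u, v):
    """Cosine similarity of two sparse dict vectors (0.0 if either is empty)."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus: "ttfb" appears everywhere, so it carries the lowest weight.
articles = [
    {"concepts": ["ttfb", "streaming granularity"]},
    {"concepts": ["ttfb", "gpu co-residency"]},
    {"concepts": ["ttfb"]},
]
idf = build_idf(articles)
new_vec = vectorize(["streaming granularity", "chunk size"], idf)
sims = [cosine(new_vec, vectorize(a["concepts"], idf)) for a in articles]
```

In this toy corpus only the first article shares a concept with the candidate, so it is the only one with a nonzero similarity.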

Three traps along the way in _infer_concepts():

Trap 1: substring-match false positives

The ASCII term "check" matches inside "checkout"; "PRO" inside "prod_". The Stripe idea was falsely matched into "品質チェック (quality check/retry)" or "Blackwell Max-Q (RTX PRO 6000)" and rejected.

Fix: ASCII terms require word boundary; JP terms can stay substring.

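import re

_ASCII_RE = re.compile(r"^[A-Za-z0-9_]+$")  # assumed definition: pure-ASCII word terms

def _term_matches(term: str, text: str) -> bool:
    text = text.lower()  # match case-insensitively on both sides
    if _ASCII_RE.match(term):
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(term.lower()) + r"(?![A-Za-z0-9_])"
        return re.search(pattern, text) is not None
    return term.lower() in text  # JP substring is fine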

Trap 2: generic JP noun noise

"モデル" "システム" "アーキテクチャ" "サービス" appear in many concept names, so they get picked up from arbitrary idea titles. Fix: I registered ~30 such generic words in _NOISE_TERMS so the matcher ignores them.

Trap 3: threshold tuning

I started with a flagship threshold of sim >= 0.30, but two binary vectors with 4 concepts each and 1 shared concept have a cosine of exactly 1/4 = 0.25. Even with IDF weighting, real one-concept overlaps sat on the 0.27-0.30 borderline. I dropped the threshold to 0.25 and instead tightened the precision of the substring matcher (the false-positive engine).
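The 0.25 figure is easy to check by hand. For binary vectors, cosine reduces to |A ∩ B| / sqrt(|A| · |B|). The sketch below uses made-up concept names and assumes both sides have exactly 4 concepts.

```python
import math

def binary_cosine(a, b):
    """Cosine similarity of two concept sets treated as 0/1 vectors."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Placeholder concept names: 4 concepts on each side, exactly 1 shared.
candidate = {"c1", "c2", "c3", "shared"}
flagship = {"f1", "f2", "f3", "shared"}
sim = binary_cosine(candidate, flagship)  # 1 / sqrt(4 * 4) = 0.25
```

A 0.30 gate can never fire on that pair unless the weighting lifts it, which is why the threshold had to come down.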

Regression test: 4/4 across the known 4 cases (OpenWeight NSFW / streaming-granularity / CodeFormer / Stripe).
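Traps 1 and 2 combined can be sketched as one concept-inference pass. This is my reconstruction, not the real _infer_concepts(): the tokenizer regex, the min_len cutoff, the four-word noise set (the real list has ~30), and the third vocab entry are all assumptions. The first two vocab entries are the concept names from the Stripe false positive.

```python
import re

_ASCII_RE = re.compile(r"[A-Za-z0-9_]+")
# Assumed tokenizer: ASCII word runs, or runs of kana/kanji.
_TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[\u3040-\u30ff\u4e00-\u9fff]+")
_NOISE_TERMS = {"モデル", "システム", "アーキテクチャ", "サービス"}  # abbreviated

def term_matches(term, text):
    """ASCII terms need word boundaries; JP terms match as plain substrings."""
    term, text = term.lower(), text.lower()
    if _ASCII_RE.fullmatch(term):
        pattern = r"(?<![a-z0-9_])" + re.escape(term) + r"(?![a-z0-9_])"
        return re.search(pattern, text) is not None
    return term in text

def infer_concepts(text, vocab, min_len=2):
    """Return canonical concepts with at least one non-noise term found in text."""
    inferred = []
    for concept in vocab:
        terms = [t for t in _TOKEN_RE.findall(concept)
                 if t not in _NOISE_TERMS and len(t) >= min_len]
        if any(term_matches(t, text) for t in terms):
            inferred.append(concept)
    return inferred

vocab = [
    "品質チェック (quality check/retry)",
    "Blackwell Max-Q (RTX PRO 6000)",
    "音声モデル選定",  # hypothetical concept name
]
stripe_hits = infer_concepts("Stripe checkout with prod_ price IDs", vocab)
quality_hits = infer_concepts("add a quality check step", vocab)
noise_hits = infer_concepts("モデル比較の記事", ["TTS モデル"])
```

With the boundary rule, "check" no longer fires inside "checkout" and "PRO" no longer fires inside "prod_", so the Stripe idea maps to no existing concept. A genuine "quality check" mention still resolves to its canonical name.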

5. Small-Model Specific Traps — Codex CLI + 26B Uncensored

Driving local 26B (Gemma 4 26B A4B Uncensored MAX) through Codex CLI, I observed 4 failure modes and their fixes:

Trap 4: descriptive prompt → "I will begin by surveying..." then exit

The first mining run had the agent summarize "what I'll do next" and exit with zero tool calls. Fix:

**Critical: do not narrate, plan, or describe what you will do. Just call tools.**
The first action **must** be `shell({"command": "art-..."})` — start there.

With an imperative prompt and an explicit first action, the agent starts calling tools.

Trap 5: huge tool output triggers a generation loop

art-commits-recent --since "60 days ago" --include-files returned ~1300 lines of JSON including bodies; the agent then emitted ~25K tokens of output continuously, never stopping. Fix: art-commits-recent defaults to subject-only; body via --include-body opt-in.
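A sketch of the "small by default" idea, assuming commits arrive as dicts with hash, subject, and body keys. The render_commits name, the max_chars budget, and the truncation marker are hypothetical. Only the subject-only default and the --include-body opt-in come from the fix above.

```python
import json

def render_commits(commits, include_body=False, max_chars=4000):
    """Emit one JSON object per line, subject-only unless include_body is set.

    Stops before exceeding max_chars and reports how many commits were dropped.
    """
    lines, used = [], 0
    for i, commit in enumerate(commits):
        row = {"hash": commit["hash"][:8], "subject": commit["subject"]}
        if include_body:
            row["body"] = commit.get("body", "")
        line = json.dumps(row, ensure_ascii=False)
        if used + len(line) > max_chars:
            lines.append(json.dumps({"truncated": len(commits) - i}))
            break
        lines.append(line)
        used += len(line) + 1  # +1 for the newline separator
    return "\n".join(lines)

commits = [{"hash": "a" * 40, "subject": "fix escape", "body": "x" * 5000}
           for _ in range(3)]
brief = render_commits(commits)
full = render_commits(commits, include_body=True)
```

A hard character budget is a second line of defense: even when the agent opts into bodies, one call cannot flood the context.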

Trap 6: 5KB+ heredoc in tool_call.arguments JSON breaks the escape

Sending art-draft-save <slug> <<'EOF' ... 5KB body ... EOF as a single shell tool_call reliably breaks 26B's string escaping inside the arguments JSON (Unterminated string at column 5083).

Fix: split into chunked append + commit. ~200-800 chars per chunk, 4-8 appends, final commit:

art-draft-append my-slug <<'KOTONIA_EOF'
---
title: "..."
---
KOTONIA_EOF

art-draft-append my-slug <<'KOTONIA_EOF'
## 1. First section
...
KOTONIA_EOF

# ...repeat per section...

art-draft-commit my-slug

Each tool_call's arguments JSON stays small, and the escape breakage disappears.
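On the tool side, append + commit can be very small. The sketch below is an assumed implementation, not the actual art-draft-append / art-draft-commit code: the buffer directory and the `.buf` suffix are hypothetical, and only the `articles/_drafts/<slug>.md` target comes from the tool table.

```python
import tempfile
from pathlib import Path

def draft_append(buffer_dir, slug, chunk):
    """Append one chunk to the slug's buffer file and return the buffer size in bytes."""
    buffer_dir = Path(buffer_dir)
    buffer_dir.mkdir(parents=True, exist_ok=True)
    buf = buffer_dir / f"{slug}.buf"
    with buf.open("a", encoding="utf-8") as fh:
        fh.write(chunk if chunk.endswith("\n") else chunk + "\n")
    return buf.stat().st_size

def draft_commit(buffer_dir, drafts_dir, slug):
    """Move the buffered draft to <drafts_dir>/<slug>.md and return the final path."""
    buf = Path(buffer_dir) / f"{slug}.buf"
    if not buf.exists() or buf.stat().st_size == 0:
        raise FileNotFoundError(f"no buffered chunks for {slug!r}")
    drafts_dir = Path(drafts_dir)
    drafts_dir.mkdir(parents=True, exist_ok=True)
    target = drafts_dir / f"{slug}.md"
    buf.replace(target)  # the buffer is consumed by the commit
    return target

with tempfile.TemporaryDirectory() as root:
    draft_append(f"{root}/buf", "my-slug", '---\ntitle: "..."\n---')
    draft_append(f"{root}/buf", "my-slug", "## 1. First section\n...")
    final_path = draft_commit(f"{root}/buf", f"{root}/articles/_drafts", "my-slug")
    final_text = final_path.read_text(encoding="utf-8")
    buffer_left = (Path(root) / "buf" / "my-slug.buf").exists()
```

Returning the running buffer size from each append gives the agent a cheap confirmation that the chunk landed, without echoing the body back into its context.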

Trap 7: Codex exec self-terminates after ~4 articles

There seems to be an implicit constraint where one codex exec invocation finishes with a summary message after ~25K tokens / ~4 articles. Codex's Goals feature (thread_goals.objective) could prevent that, but you can't set it via exec (only the interactive TUI as of v0.133).

Fix: wrap dispatcher.sh in an external loop. Restart codex exec until pending == 0.

max_cycles=30
cycle=0
while (( cycle < max_cycles )); do
  pending=$(art-articles-list --needs-dreaming --count-only)
  if (( pending == 0 )); then break; fi
  run_codex dream
  cycle=$((cycle + 1))
done

That gets 58 articles digested in 2-3 cycles.

6. What Landed

The working pipeline:

58 articles → semantic index, importance bell-shaped (median 5-6), flagship recognition correct (voice-first-local-llm at score 9 across all locales)
70 memory files mined for unexplored concepts, 4 ideas land in the pool as survivors
4 drafts generated, ~3.6-4.6KB each, publish-ready after 10-20 minutes of human polish
TF-IDF dedup gate at the tool layer blocks any agent self-discipline violation

Repo: [github coming soon]

7. Generalization

The structure — raw assets → semantic compression → agent reverse-lookup — generalizes beyond articles:

Test generation: semantically compress existing tests, mine uncovered branches, draft new tests
PR descriptions: semantically compress the codebase delta, dedupe against unrelated PRs, draft a description
Support FAQs: semantically compress past support tickets, surface uncovered topics, draft new FAQs
Personal knowledge base: Scrapbox / Notion accumulation → semantic compression → mechanically discover unexplored concepts

Common design principles:

Raw assets are heavy. Don't load them directly — insert a consolidation layer.
The canonical vocabulary is the semantic-layer primitive. Without normalization, dedup doesn't work.
Enforcement belongs at the tool layer. Agent self-discipline is unstable; bake the rule into the structure.

Recognizing this pattern opened up applications in other Kotonia domains (persona generation in character chat, TTS prompt accumulation, etc.).

Aside: Development Time

One session (~6h). Dreaming layer design → 5 new tools → Codex prompts → first-time consolidation → TF-IDF dedup → chunked draft → 4 article drafts generated, all in one stretch.

Local 26B as the "runs on electricity only" agent absorbed the grinding labor; the human only had to make judgment calls and steering corrections. Doing this on frontier APIs would have cost $50-100.

Kotonia is a voice-first AI character chat platform. The drafts revived by this pipeline live on the same blog if you're curious.

DEV Community