I built a pipeline in a single session that consolidates the 58 tech-blog articles of my service Kotonia (ja/en/zh) into a semantic index, then uses that index to detect duplicates when mining new article ideas. The full path (raw articles → semantic index → TF-IDF dedup → chunked draft generation) runs on a local Gemma 4 26B driven by Codex CLI. Design and implementation notes follow.
The motivation, and the framing of how a solo developer's accumulated assets compound, are in the companion piece: The Day a Solo Developer's Accumulated Assets Finally Started to Compound
This piece keeps the technical notes.
1. The Problem — When Title-Only Dedup Broke
Mining v1 produced a draft and I (the user) noticed "this overlaps with an existing article." The overlap target was voice-first-local-llm (importance=9 flagship).
- New draft thesis: "tokens per chunk is a hidden voice-chat latency driver"
- Existing article §3.3: "★ Streaming granularity — the structural difference that decides voice experience"
Same numbers (Local Gemma 1.0 tok/chunk, Haiku 10-16, Gemini 8-24). A perfect duplicate.
The mining agent had called art-done-list (title + description) for the dedup check. But the existing article's title is "Cutting short-form LLM latency from 600ms to 22ms," with TTFB as the headline sales pitch; §3.3 streaming granularity is buried in an H2 subsection. At title level, nothing overlapped, so the check came back clean.
That's the starting point for this article.
2. The Design — Three Layers: episodic ↔ semantic ↔ procedural
Breaking down why Claude Code's memory system works:
- Entries are small (1-3KB, one topic each) → subtopics don't get buried
- Hooks are retrieval-tuned and curated → search terms re-appear in the hook
- A smart model writes hooks semi-autonomously → past me distills for future me
Articles have the opposite shape: each is 5-15KB, important subtopics are buried in subsection bodies, descriptions are SEO summaries rather than retrieval-tuned hooks, and the whole thing is too heavy for an agent to load.
I bridged them with an intermediate layer named the Dreaming layer. The name is a literal borrowing of the biological metaphor: memory consolidation during sleep, from hippocampus to cortex.
episodic (raw articles + memory files)
↓ Dreaming agent (periodic digestion)
semantic (concepts_covered_ja[] / importance / data_points / sections)
↓ agent reverse-lookup (art-concepts-find / TF-IDF cosine)
procedural (mining / drafting / publishing)
A semantic entry for an article looks like:
{
  "slug": "voice-first-local-llm",
  "locale": "ja",
  "thesis_ja": "Ditching API, building voice-first with self-hosted local 26B",
  "importance": {
    "score": 9,
    "factors": {
      "pv_count_30d": 6,
      "avg_scroll": 67.0,
      "avg_dwell_sec": 170,
      "has_bench_data": true,
      "novelty_high": true
    }
  },
  "concepts_covered_ja": [
    "TTFB (time-to-first-byte): local vs API",
    "Streaming granularity (tokens per chunk)",
    "Gemma 4 26B model selection rationale",
    "Ditto + LLM co-residency GPU design"
  ],
  "data_points": [
    {"name": "TTFB Local", "value": "17-25ms"},
    {"name": "Streaming granularity Local", "value": "1.0 tok/chunk"}
  ],
  "sections": [
    {"id": "3.3", "title": "Streaming granularity — the structural difference that decides voice experience"}
  ]
}
The key point: concepts_covered_ja[] must be normalized to Japanese canonical names. Translated EN/ZH articles use the same JP concept strings. That single normalization becomes the dedup primitive downstream.
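A minimal sketch of how a concept-to-article reverse lookup could sit on top of these entries. The helper names `build_concept_index` and `find_concepts` are hypothetical and this is not the actual art-concepts-find code; the only thing taken from the text is that every locale reuses the same canonical strings in concepts_covered_ja.

```python
import json
from collections import defaultdict


def build_concept_index(jsonl_lines):
    """Map each canonical concept string to the set of slugs that cover it.

    Hypothetical helper: assumes one JSON entry per line, shaped like the
    semantic entry above (slug / locale / concepts_covered_ja).
    """
    index = defaultdict(set)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        for concept in entry.get("concepts_covered_ja", []):
            # Locales share the same canonical strings, so the ja/en/zh
            # versions of one article collapse onto the same slug.
            index[concept].add(entry["slug"])
    return index


def find_concepts(index, pattern):
    """Return {concept: sorted slugs} for concepts containing the pattern."""
    needle = pattern.lower()
    return {
        concept: sorted(slugs)
        for concept, slugs in index.items()
        if needle in concept.lower()
    }


lines = [
    json.dumps({"slug": "voice-first-local-llm", "locale": "ja",
                "concepts_covered_ja": ["Streaming granularity (tokens per chunk)"]}),
    json.dumps({"slug": "voice-first-local-llm", "locale": "en",
                "concepts_covered_ja": ["Streaming granularity (tokens per chunk)"]}),
]
index = build_concept_index(lines)
hits = find_concepts(index, "streaming granularity")
```

Because both locale entries carry the identical concept string, the lookup returns one slug rather than two near-duplicates, which is what makes the normalization usable as a dedup key.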
3. Tools — Thin CLIs the Agent Calls
Codex CLI drives Gemma 4 26B locally. Tool calling via --enable-auto-tool-choice --tool-call-parser gemma4 gives an OpenAI-compatible surface. Each tool is ~50-100 lines of Python (stdlib only), art- prefix:
| tool | role |
|---|---|
| art-articles-list --needs-dreaming | DB ∪ FS articles + dreaming state |
| art-pv-count --slug X | analytics_events → PV / scroll / dwell |
| art-source-pull <slug> [--section N] | pull just one H2/H3 section of an article |
| art-dream-write | upsert a semantic entry into articles_index.jsonl |
| art-concepts-find <pattern> | concept → article reverse-lookup (the mining dedup primitive) |
| art-ideas-check | evaluate a candidate idea via TF-IDF (the core of this article) |
| art-ideas-add | push an idea to the pool (calls art-ideas-check internally) |
| art-draft-append | append a chunk of draft body to a buffer |
| art-draft-commit | finalize buffer → articles/_drafts/<slug>.md |
The Dreaming agent semantically encodes one article at a time using these. Importance scoring uses this rubric:
+2: PV >= 100 (sigmoid log-scale)
+1: avg_scroll >= 0.7 AND avg_dwell_sec >= 60
+2: bench numbers / failure root cause / named decision
+2: novel concept not yet in index
+1: evergreen value (not time-sensitive)
-2: redundant with an already-indexed flagship
PV comes from a homegrown analytics_events table (cookie-less first-party tracker). The fact that the article platform and analytics co-reside in one DB you can hit directly is a solo-dev win.
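The rubric above can be read as a plain additive function. The sketch below is only an illustration under several assumptions that are not in the source: the base score, the clamp range, a hard PV threshold in place of the sigmoid log-scale, and the key names `evergreen` and `redundant_with_flagship`. It also treats avg_scroll as a fraction in [0, 1], whereas the sample entry earlier stores 67.0, so a real implementation would have to settle on one unit.

```python
def importance_score(factors, base=4, lo=1, hi=10):
    """Apply the additive rubric to one article's factors.

    Assumptions (not from the source): base, lo, hi, the hard PV cutoff,
    and the keys 'evergreen' / 'redundant_with_flagship'.
    """
    score = base
    if factors.get("pv_count_30d", 0) >= 100:
        score += 2  # traffic
    if factors.get("avg_scroll", 0.0) >= 0.7 and factors.get("avg_dwell_sec", 0) >= 60:
        score += 1  # readers actually read it
    if factors.get("has_bench_data"):
        score += 2  # bench numbers / failure root cause / named decision
    if factors.get("novelty_high"):
        score += 2  # concept not yet in the index
    if factors.get("evergreen"):
        score += 1  # not time-sensitive
    if factors.get("redundant_with_flagship"):
        score -= 2  # overlaps an already-indexed flagship
    return max(lo, min(hi, score))


baseline = importance_score({})
penalized = importance_score({"redundant_with_flagship": True})
capped = importance_score({
    "pv_count_30d": 100, "avg_scroll": 0.8, "avg_dwell_sec": 60,
    "has_bench_data": True, "novelty_high": True, "evergreen": True,
})
```

Keeping the rubric as explicit branches makes each point of the score traceable to one factor, which is useful when the scoring is delegated to an agent and later audited by a human.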
4. TF-IDF Dedup — Substituting Tool Structure for Agent Self-Discipline
At mining v1 the prompt instructed the agent to call art-concepts-find for dedup. The agent slipped through three duplicates anyway (details: Don't Trust an Agent's Self-Discipline).
The fix: embed a dedup gate directly inside art-ideas-add. The guts of evaluate_idea():
def evaluate_idea(title, angle, sources, ...):
    articles, ideas = load_corpus()
    # infer the candidate's concepts from the canonical vocab
    pseudo = {"concepts": _infer_concepts(title, angle, sources, articles)}
    # IDF (rare concepts weighted more)
    idf = build_idf(articles + ideas)
    new_vec = vectorize(pseudo["concepts"], idf)
    conflicts = []
    for a in articles:
        sim = cosine(new_vec, vectorize(a["concepts"], idf))
        if a["importance_score"] >= 7 and sim >= 0.25:
            conflicts.append({"kind": "flagship_concept", ...})
    for i in ideas:
        sim = cosine(new_vec, vectorize(i["concepts"], idf))
        if sim >= 0.35:
            conflicts.append({"kind": "pool_dup", ...})
    return {"allow": not conflicts, "conflicts": conflicts}
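The helpers `build_idf`, `vectorize`, and `cosine` are not shown in the article. One plausible stdlib-only shape for them is sketched below, assuming each document is a dict with a "concepts" list and each concept counts once per document. The smoothed IDF formula (log((1 + n) / (1 + df)) + 1) and the `default_idf` parameter are my choices for the sketch, not something the source specifies.

```python
import math
from collections import Counter


def build_idf(docs):
    """Smoothed IDF over concept lists: rare concepts get larger weights."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc["concepts"]))  # document frequency, not term count
    return {c: math.log((1 + n) / (1 + count)) + 1.0 for c, count in df.items()}


def vectorize(concepts, idf, default_idf=1.0):
    """Sparse vector {concept: weight}; unseen concepts get default_idf."""
    return {c: idf.get(c, default_idf) for c in set(concepts)}


def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    if not u or not v:
        return 0.0
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


docs = [{"concepts": ["a", "b"]}, {"concepts": ["a", "c"]}, {"concepts": ["d"]}]
idf = build_idf(docs)
same = cosine(vectorize(["a", "b"], idf), vectorize(["a", "b"], idf))
partial = cosine(vectorize(["a", "b"], idf), vectorize(["a", "c"], idf))
disjoint = cosine(vectorize(["a", "b"], idf), vectorize(["d"], idf))
```

With sparse dict vectors the whole gate stays dependency-free, which fits the stdlib-only constraint on the art- tools.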
Three traps along the way in _infer_concepts():
Trap 1: substring-match false positives
The ASCII term "check" matches inside "checkout", and "PRO" matches inside "prod_". As a result the Stripe idea was falsely matched to the concepts "品質チェック (quality check/retry)" and "Blackwell Max-Q (RTX PRO 6000)" and rejected.
Fix: ASCII terms require word boundary; JP terms can stay substring.
def _term_matches(term: str, text: str) -> bool:
    if _ASCII_RE.match(term):
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(term.lower()) + r"(?![A-Za-z0-9_])"
        return re.search(pattern, text) is not None
    return term.lower() in text  # JP substring is fine
Trap 2: generic JP noun noise
"モデル", "システム", "アーキテクチャ", and "サービス" (model, system, architecture, service) appear in many concept names, so they get picked up from arbitrary idea titles. Fix: register ~30 such generic words in _NOISE_TERMS.
Trap 3: threshold tuning
I started with flagship sim >= 0.30, but two binary vectors of 4 concepts each that share 1 concept have a cosine of exactly 0.25 (1 / (√4 · √4)). Even with IDF weighting, 0.27-0.30 was the borderline. I dropped the threshold to 0.25 and instead tightened the precision of the substring matcher (the false-positive engine).
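The 0.25 figure is easy to check by hand. The sketch below treats concept sets as 0/1 vectors, so the cosine reduces to the shared count divided by the geometric mean of the set sizes. The concept names are placeholders, and `binary_cosine` is a helper written for this illustration.

```python
import math


def binary_cosine(a, b):
    """Cosine similarity of two concept sets treated as 0/1 vectors."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


candidate = {"c1", "c2", "c3", "shared"}
flagship = {"f1", "f2", "f3", "shared"}
sim = binary_cosine(candidate, flagship)  # 1 / sqrt(4 * 4)
```

A threshold of 0.30 therefore can never fire on a single shared concept between two 4-concept entries without weighting. IDF raises the similarity only when the shared concept is rarer than the unshared ones.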
Regression test: 4/4 across the known 4 cases (OpenWeight NSFW / streaming-granularity / CodeFormer / Stripe).
5. Small-Model Specific Traps — Codex CLI + 26B Uncensored
Driving local 26B (Gemma 4 26B A4B Uncensored MAX) through Codex CLI, I observed 4 failure modes and their fixes:
Trap 4: descriptive prompt → "I will begin by surveying..." then exit
The first mining run had the agent summarize "what I'll do next" and exit with zero tool calls. Fix:
**Critical: do not narrate, plan, or describe what you will do. Just call tools.**
The first action **must** be `shell({"command": "art-..."})` — start there.
With an imperative instruction and an explicit first action, the agent starts moving.
Trap 5: huge tool output triggers a generation loop
art-commits-recent --since "60 days ago" --include-files returned ~1300 lines of JSON including bodies; the agent then emitted ~25K tokens of output continuously, never stopping. Fix: art-commits-recent defaults to subject-only; body via --include-body opt-in.
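The subject-only default can be sketched as a small serializer. This is a hypothetical illustration: the function name `render_commits` and the field names (hash / subject / body) are assumptions, not the real art-commits-recent schema.

```python
import json


def render_commits(commits, include_body=False):
    """Serialize commit records for the agent, subject-only by default.

    Hypothetical sketch: bodies are emitted only when the caller opts in,
    which keeps the default tool output small.
    """
    keys = ("hash", "subject", "body") if include_body else ("hash", "subject")
    slim = [{k: c[k] for k in keys if k in c} for c in commits]
    return json.dumps(slim, ensure_ascii=False, indent=2)


commits = [{"hash": "abc123", "subject": "Add dedup gate", "body": "long text\n" * 200}]
short = render_commits(commits)
full = render_commits(commits, include_body=True)
```

Making the small output the default means the agent has to ask for the heavy payload explicitly, so a careless call cannot flood the context.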
Trap 6: 5KB+ heredoc in tool_call.arguments JSON breaks the escape
Sending art-draft-save <slug> <<'EOF' ... 5KB body ... EOF as a single shell tool_call reliably breaks 26B's string escaping inside the arguments JSON (Unterminated string at column 5083).
Fix: split into chunked append + commit. ~200-800 chars per chunk, 4-8 appends, final commit:
art-draft-append my-slug <<'KOTONIA_EOF'
---
title: "..."
---
KOTONIA_EOF
art-draft-append my-slug <<'KOTONIA_EOF'
## 1. First section
...
KOTONIA_EOF
# ...repeat per section...
art-draft-commit my-slug
Each tool_call's arguments JSON stays small, and the escape breakage vanishes.
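In the pipeline the agent itself emits the chunks, but the same split can be done mechanically. The sketch below is a hypothetical helper that cuts a draft at blank lines into pieces of at most 800 characters and renders them as the append/commit commands shown above. `chunk_draft` and `to_tool_calls` are names invented for this example.

```python
def chunk_draft(body, max_chars=800):
    """Split a markdown draft into chunks of at most max_chars characters.

    Splits on blank lines so paragraphs stay whole where possible; a single
    paragraph longer than max_chars is hard-split.
    """
    chunks, current = [], ""
    for para in body.split("\n\n"):
        while len(para) > max_chars:  # oversized paragraph: hard split
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def to_tool_calls(slug, chunks):
    """Render one small shell command per chunk, then the final commit."""
    calls = [f"art-draft-append {slug} <<'KOTONIA_EOF'\n{c}\nKOTONIA_EOF" for c in chunks]
    calls.append(f"art-draft-commit {slug}")
    return calls


body = "\n\n".join(f"## Section {i}\n" + "x" * 300 for i in range(5))
chunks = chunk_draft(body)
calls = to_tool_calls("my-slug", chunks)
```

Whether the buffer rejoins chunks with a blank line depends on how art-draft-append is implemented, which the article does not specify.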
Trap 7: Codex exec self-terminates after ~4 articles
There seems to be an implicit constraint where one codex exec invocation finishes with a summary message after ~25K tokens / ~4 articles. Codex's Goals feature (thread_goals.objective) could prevent that, but you can't set it via exec (only the interactive TUI as of v0.133).
Fix: wrap dispatcher.sh in an external loop. Restart codex exec until pending == 0.
max_cycles=30
cycle=0
while (( cycle < max_cycles )); do
  pending=$(art-articles-list --needs-dreaming --count-only)
  if (( pending == 0 )); then break; fi
  run_codex dream
  cycle=$((cycle + 1))
done
That gets 58 articles digested in 2-3 cycles.
6. What Landed
The working pipeline:
- 58 articles → semantic index, importance bell-shaped (median 5-6), flagship recognition correct (voice-first-local-llm at score 9 across all locales)
- 70 memory files mined for unexplored concepts, 4 ideas land in the pool as survivors
- 4 drafts generated, ~3.6-4.6KB each, publish-ready after 10-20 minutes of human polish
- TF-IDF dedup gate at the tool layer blocks any agent self-discipline violation
Repo: [github coming soon]
7. Generalization
The structure — raw assets → semantic compression → agent reverse-lookup — generalizes beyond articles:
- Test generation: semantically compress existing tests, mine uncovered branches, draft new tests
- PR descriptions: semantically compress the codebase delta, dedupe against unrelated PRs, draft a description
- Support FAQs: semantically compress past support tickets, surface uncovered topics, draft new FAQs
- Personal knowledge base: Scrapbox / Notion accumulation → semantic compression → mechanically discover unexplored concepts
Common design principles:
- Raw assets are heavy. Don't load them directly — insert a consolidation layer.
- The canonical vocabulary is the semantic-layer primitive. Without normalization, dedup doesn't work.
- Enforcement belongs at the tool layer. Agent self-discipline is unstable; bake the rule into the structure.
Knowing this opened up applications in other Kotonia domains (persona generation in character chat, TTS prompt accumulation, etc.).
Aside: Development Time
One session (~6h). Dreaming layer design → 5 new tools → Codex prompts → first-time consolidation → TF-IDF dedup → chunked draft → 4 article drafts generated, all in one stretch.
Local 26B as the "runs on electricity only" agent absorbed the grinding labor; the human only had to make judgment calls and steering corrections. Doing this on frontier APIs would have cost $50-100.
Kotonia is a voice-first AI character chat platform. The drafts revived by this pipeline live on the same blog if you're curious.