TL;DR
| Mode | Exact Keywords | Paraphrased | Contextual | Overall |
|---|---|---|---|---|
| FTS-only | 85% | 30% | 30% | 48% |
| Hybrid | 85% | 55% | 55% | 65% |
FTS-only is surprisingly capable for direct lookups. But the moment users rephrase or ask abstract questions, hybrid search pulls ahead by +25 percentage points.
Why This Matters
SoulClaw's 4-Tier Memory system stores everything an AI agent learns — from daily conversation logs to long-term decisions and project context. When your agent needs to recall something, the retrieval method determines whether it finds the right memory or returns noise.
Most hosted AI agents use one of three approaches:
- Full-Text Search (FTS): Keyword matching. Fast, free, requires no ML model.
- Semantic Search: Vector embeddings (e.g., bge-m3 via Ollama). Finds conceptually similar content even without exact keyword overlap.
- Hybrid: Combines both, re-ranks results. SoulClaw's default when an embedding provider is available.
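SoulClaw's exact fusion step isn't shown in this post, but reciprocal rank fusion itself is small enough to sketch. The function name and `k=60` below are illustrative conventions, not SoulClaw's actual API:

```python
# Minimal reciprocal rank fusion (RRF) sketch.
# Inputs: two ranked lists of document IDs (best first), one from FTS,
# one from vector search. k=60 is the conventional smoothing constant.
def rrf_merge(fts_ranked, vector_ranked, k=60):
    scores = {}
    for ranked in (fts_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            # High rank in either list contributes a large share of score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document both searches surface (even moderately ranked) beats one
# that only FTS found.
merged = rrf_merge(["a", "b", "c"], ["b", "d"])  # → ['b', 'a', 'd', 'c']
```

The key property for this benchmark: RRF only needs ranks, not raw scores, so the incomparable FTS and cosine-similarity scores never have to be normalized against each other.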
We wanted hard numbers on the actual difference — not synthetic benchmarks, but real agent memory with real questions.
Methodology
Corpus
- 303 files, ~14,000 lines of agent memory
- Accumulated over 6+ weeks of daily AI agent operation
- Includes: daily logs, project topic files, compaction summaries, business decisions, technical troubleshooting
Questions
30 human-written questions in three categories:
- Exact (10): Query uses terms that appear verbatim in the source ("What is the Zenodo DOI for Paper 1?")
- Paraphrase (10): Query uses synonyms or indirect references ("Which AI memory startup did a Google chief scientist invest in?" → answer mentions "Jeff Dean" and "SuperMemory")
- Contextual (10): Query is abstract, requiring understanding of context ("What's our real competitive advantage in hosting?" → answer spans multiple files about Soul ecosystem differentiation)
Evaluation
- Human-scored (not LLM-scored) — the question author evaluated each result
- Scoring: 0 = irrelevant, 1 = partially relevant, 2 = correct answer retrievable
- Each question scored for top-5 retrieved results
- Total possible score per mode: 60 points (30 questions × 2 max)
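The scoring reduces to simple arithmetic; a quick sanity check of the overall percentages, using the per-category scores from the results table:

```python
# Per-category scores from the benchmark (max 2 points x 10 questions = 20).
fts = {"exact": 17, "paraphrase": 6, "contextual": 6}
hybrid = {"exact": 17, "paraphrase": 11, "contextual": 11}

def overall_pct(scores, max_per_category=20):
    # Sum raw points, divide by the 60-point maximum, round to whole percent.
    total = sum(scores.values())
    return round(100 * total / (max_per_category * len(scores)))

print(overall_pct(fts))     # 48
print(overall_pct(hybrid))  # 65
```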
Search Configuration
- FTS: SoulClaw's built-in SQLite FTS5 (`chunks_fts` table), keyword extraction from query
- Hybrid: FTS + Ollama bge-m3 embeddings (768-dim, multilingual), reciprocal rank fusion
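The FTS side needs nothing beyond SQLite's bundled FTS5 extension. A minimal sketch — the single-column schema and sample rows are our illustration, not SoulClaw's actual `chunks_fts` schema:

```python
import sqlite3

# In-memory DB; SoulClaw stores memory chunks in an FTS5 virtual table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(content)")
conn.executemany(
    "INSERT INTO chunks_fts(content) VALUES (?)",
    [
        ("Zenodo DOI: 10.5281/... for Paper 1",),
        ("Jeff Dean invested in SuperMemory, an AI memory startup",),
    ],
)

# Exact-term query: space-separated terms are implicitly ANDed,
# and ORDER BY rank sorts by FTS5's built-in BM25 relevance.
rows = conn.execute(
    "SELECT content FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY rank",
    ("Zenodo DOI",),
).fetchall()
```

This is exactly why the "Exact" category scores 85%: when the query terms appear verbatim in a chunk, a plain `MATCH` retrieves it with no ML model involved.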
Results
By Category
| Category | Questions | FTS Score | Hybrid Score | FTS % | Hybrid % | Delta |
|---|---|---|---|---|---|---|
| Exact | 10 | 17/20 | 17/20 | 85% | 85% | 0% |
| Paraphrase | 10 | 6/20 | 11/20 | 30% | 55% | +25% |
| Contextual | 10 | 6/20 | 11/20 | 30% | 55% | +25% |
| Total | 30 | 29/60 | 39/60 | 48% | 65% | +17% |
Key Observations
1. FTS is strong on exact queries.
When users ask "What's the Zenodo DOI?" and the memory file contains "Zenodo DOI: 10.5281/...", FTS finds it immediately. 85% accuracy on exact-term queries is production-viable.
2. Hybrid's advantage is entirely in recall on indirect references.
"The teenager who built an AI service" → FTS can't match this to "Dravya Shah, 19, SuperMemory." Hybrid's semantic component bridges the vocabulary gap.
3. Contextual queries are hard for both.
"Why did we choose papers over patents?" requires understanding a decision process spread across multiple files and conversations. 55% isn't great — this is arguably a reasoning task, not a retrieval task.
4. Hybrid never hurts.
In no case did hybrid score lower than FTS. The semantic component adds signal without adding noise, thanks to reciprocal rank fusion.
Notable Examples
| Question | FTS | Hybrid | Why |
|---|---|---|---|
| "Which startup did a Google chief scientist invest in?" | ❌ 0 | ✅ 2 | FTS can't connect "Google chief scientist" → "Jeff Dean" |
| "What caused the chat authentication error?" | ✅ 2 | ✅ 2 | Both find "401 bug" and "machineToken" keywords |
| "Why are we careful in the open-source community?" | ❌ 0 | ⚠️ 1 | Abstract concept, hybrid gets close but not definitive |
| "Primer Stealth deadline?" | ✅ 2 | ❌ 0 | FTS wins — exact keywords present, hybrid ranked wrong file higher |
What This Means for SoulClaw Users
Self-hosted (free)
Install Ollama + bge-m3 for hybrid mode. Without it, you still get FTS — perfectly usable for daily operation. Setup guide →
ClawSouls Hosting
| Plan | Memory Mode | Why |
|---|---|---|
| Starter ($7/mo) | FTS-only | Handles 85% of exact queries. Most day-to-day agent interactions use exact terms. |
| Pro ($29/mo) | FTS-only | Same retrieval, more compute for other tasks. |
| Premium ($149/mo) | Hybrid | Full semantic search. Best for power users with large memory corpora and indirect queries. |
The pragmatic view
If 70-80% of your questions to an agent use exact terms ("show me yesterday's meeting notes", "what's the API key"), FTS-only is fine. The gap matters for long-term users who develop shorthand and indirect references with their agent over weeks and months.
Reproducing This Benchmark
The benchmark framework is available at our memory-bench repository. To run against your own agent's memory:
- Write questions with ground truth (which files contain answers)
- Run `benchmark.sh` against your SoulClaw memory database
- Score results manually
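Ground truth can be as simple as a JSON list of questions. The field names and file path below are hypothetical, for illustration only — check the memory-bench repository for the actual schema:

```python
import json

# Hypothetical ground-truth record: the question, its category, and
# which memory files contain the answer. Path is a made-up example.
questions = json.loads("""
[
  {
    "question": "What is the Zenodo DOI for Paper 1?",
    "category": "exact",
    "answer_files": ["memory/daily/2026-02-14.md"]
  }
]
""")

# Manual scoring scale: 0 = irrelevant, 1 = partially relevant,
# 2 = correct answer retrievable.
def category_pct(records, scores, max_score=2):
    total = sum(scores[r["question"]] for r in records)
    return 100 * total / (max_score * len(records))
```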
We deliberately chose human evaluation over LLM-as-judge to avoid circular reasoning — using an LLM to evaluate an LLM's memory retrieval introduces confounding variables.
Limitations
- Single corpus: Results reflect one agent's 6-week memory. Different content distributions may yield different ratios.
- 30 questions: Statistically small sample. We chose depth of evaluation over breadth.
- Bilingual corpus: Memory is Korean + English mixed. bge-m3 handles multilingual well, but FTS keyword extraction may behave differently in monolingual corpora.
- No semantic-only mode tested: We compared FTS vs Hybrid. Pure semantic (no FTS component) was not benchmarked separately.
Conclusion
Full-text search is not dead. For agent memory retrieval with exact-term queries, it performs at 85% — good enough for most daily use. But hybrid search earns its keep on the 20-30% of queries where users don't use the exact words stored in memory.
The takeaway for AI agent builders: don't skip FTS as a baseline. It's free, fast, and surprisingly effective. Add semantic search as an enhancement, not a replacement.
This research was conducted using SoulClaw v2026.3.34's memory system. Raw benchmark data and methodology are available in our memory-bench repository.
Originally published at blog.clawsouls.ai