Everyone's Launching Wrappers. Nobody's Going Deep.

#startup #webdev #programming #ai

You know that feeling when you open someone's "AI-powered" product, view source, and realize the entire intelligence layer is a single API call with a system prompt? I get that feeling about four times a week now.

I'm building memory for AI agents. Not the kind where you shove conversation logs into a vector database and run similarity search, which is what every tutorial teaches and every wrapper product ships. The kind where you actually measure whether retrieval works, find out it doesn't, and spend three months fixing it instead of launching.

Here's what "going deep" looks like day to day, because I think people romanticize it and the reality is mostly spreadsheets and DoorDash at 2am.

The boring part nobody shows you

The first thing I did was run a benchmark against my own retrieval pipeline. Ground-truth questions, known correct answers. The results were bad. Not "needs tuning" bad. The system was confidently retrieving wrong memories and missing obvious temporal references: confusing things said last week with things said six months ago, and mixing up which person said what in multi-party conversations.

I categorized 357 failures by hand. Two weeks of reading each failed retrieval and classifying why. The finding: 92% of failures were retrieval failures, not reasoning failures. The data was in the database. The search couldn't find it.

I confirmed this with an oracle test: bypass retrieval entirely and give the model the full conversation as context. Accuracy jumped to 93.8%. The information was always there. The search layer was broken. The entire field was focused on improving the reasoning layer while the retrieval layer underneath was silently failing, and nobody had checked because the failures are invisible. The system returns results. They just happen to be the wrong results.
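If you want to run the same kind of split on your own system, here's a minimal sketch. It is not the TrueMemory code: the `answer` and `retrieve` callables, the dict keys, and the exact-match scoring are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def attribute_failures(
    questions: Sequence[dict],
    answer: Callable[[str, str], str],
    retrieve: Callable[[str], str],
) -> dict:
    """Split failures into retrieval vs. reasoning using an oracle run.

    Each question dict uses hypothetical keys: 'q', 'gold', 'full_context'.
    `answer(question, context)` and `retrieve(question)` are stand-ins for
    your own model call and search layer.
    """
    counts = {"correct": 0, "retrieval_failure": 0, "reasoning_failure": 0}
    for item in questions:
        gold = item["gold"].strip().lower()
        with_retrieval = answer(item["q"], retrieve(item["q"])).strip().lower()
        if with_retrieval == gold:
            counts["correct"] += 1
            continue
        # Oracle run: skip retrieval and hand the model the full conversation.
        with_oracle = answer(item["q"], item["full_context"]).strip().lower()
        if with_oracle == gold:
            # The data was available; the search layer failed to surface it.
            counts["retrieval_failure"] += 1
        else:
            # The model got it wrong even with everything in context.
            counts["reasoning_failure"] += 1
    return counts
```

Exact match is the crudest possible scorer. Swap in whatever grading your benchmark uses; the point is only that the oracle run tells you which layer to blame.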

So then I needed to understand how much the embedding model and reranker choice mattered. I built a test rig: 7 embedding models crossed with 8 rerankers, 56 combinations, each evaluated against 1,540 ground-truth questions. About 26,000 total evaluations.
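The rig itself is conceptually simple, which is part of why it's tedious rather than hard. A rough sketch of the shape, with placeholder model names and a hypothetical `evaluate` function standing in for the real pipeline:

```python
from itertools import product

# Placeholder names: the actual 7 embedding models and 8 rerankers
# are not listed in this post.
EMBEDDERS = [f"embedder_{i}" for i in range(7)]
RERANKERS = [f"reranker_{j}" for j in range(8)]


def run_grid(evaluate, questions):
    """Evaluate every embedder/reranker pair on the same question set.

    `evaluate(embedder, reranker, questions)` is a hypothetical callable
    that returns accuracy as a float in [0, 1].
    """
    results = {}
    for embedder, reranker in product(EMBEDDERS, RERANKERS):
        results[(embedder, reranker)] = evaluate(embedder, reranker, questions)
    return results


def spread(results):
    """Return (worst, best, gap in percentage points) across all pairs."""
    worst, best = min(results.values()), max(results.values())
    return worst, best, round((best - worst) * 100, 1)
```

Holding the question set fixed across every pair is the part that matters. Without that, differences between combinations can't be attributed to the models.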

Nobody had published this comparison before. The reason is simple: it's tedious work with no shortcut. You configure, run, wait, record, repeat. For weeks.

What the data showed

The spread across all 56 combinations was 3.2 percentage points (89.9% to 93.1%). Most products never test even one combination. They use whatever the tutorial picked.

The finding that broke my brain: a $0.40-per-million-token model with 100 retrieved memories beat a $15-per-million-token model with 15 retrieved memories. The cheap model with better retrieval recovered 82% of errors. The expensive model with worse retrieval recovered 54%. Retrieval quality dominated model quality completely. Optimizing your search pipeline was worth more than a model upgrade costing roughly 37 times as much.

I also found a silent bug in my own code during this process. A script was loading MiniLM instead of the GTE ModernBERT reranker I'd configured. No error, no warning. Just quietly wrong. If I hadn't been running ground-truth benchmarks I never would have caught it. This exact type of misconfiguration is sitting in production systems that have never been tested against known correct answers.
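One cheap guard against this class of bug is to check what actually got loaded against what you asked for, and fail loudly on a mismatch. A sketch, assuming your loader returns an object that exposes a name; `LoadedModel` and the model-name strings in the usage example are hypothetical placeholders, not real identifiers:

```python
from dataclasses import dataclass


@dataclass
class LoadedModel:
    """Stand-in for whatever object your model loader returns (hypothetical)."""
    name: str


def assert_expected_model(loaded: LoadedModel, configured_name: str) -> LoadedModel:
    """Raise instead of continuing if the loader fell back to another model."""
    if loaded.name != configured_name:
        raise RuntimeError(
            f"configured reranker {configured_name!r} but loaded {loaded.name!r}"
        )
    return loaded
```

Usage looks like `reranker = assert_expected_model(load(), "gte-modernbert-reranker")` at startup. It doesn't replace ground-truth benchmarks, but it turns one kind of quiet failure into a crash.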

What "going deep" actually means in practice

It means choosing SQLite over Pinecone and everyone thinking you're not serious. But the constraint forced a hybrid search pipeline (sparse FTS5 plus dense vector search, reciprocal rank fusion, cross-encoder reranking) that runs on a Raspberry Pi for $12/month. The whole system scores within 3 points of setups requiring $150 to $400/month in GPU infrastructure. One file, no cluster, no excuses. If retrieval breaks, the architecture broke, and you fix the actual problem instead of blaming infrastructure.
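The fusion step in that pipeline is the easiest piece to show. Reciprocal rank fusion takes the ranked lists from the sparse and dense searches and scores each document by summing 1 / (k + rank) over the lists it appears in. A minimal sketch; `k=60` is the constant commonly used for RRF, and whether TrueMemory uses that value is an assumption here:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    `rankings` is a list of lists, best result first in each list.
    A document's fused score is the sum of 1 / (k + rank) across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, fusing a sparse ranking `["a", "b", "c"]` with a dense ranking `["b", "d", "a"]` puts `b` first, because it ranks near the top of both lists. The fused list is what would then go to the cross-encoder reranker.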

It means reading neuroscience papers about how the hippocampus filters incoming memories and building a three-signal encoding gate (novelty, salience, prediction error) instead of just storing everything and hoping retrieval sorts it out. Your brain doesn't record everything. It runs a filter first, and that's not a limitation. It's the mechanism that makes retrieval work. Less noise going in means better results coming out. The benchmarks supported this approach.
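To make the idea of a gate concrete, here's a toy version. Only the three signal names come from the design described above. The equal weights, the 0.5 threshold, and the weighted-sum rule are illustrative assumptions, and how each signal gets computed is left out entirely:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateSignals:
    novelty: float           # 0..1, how unlike existing memories this is
    salience: float          # 0..1, how important the content seems
    prediction_error: float  # 0..1, how surprising it is given context


def should_store(
    signals: GateSignals,
    threshold: float = 0.5,                # assumed value, for illustration
    weights: tuple = (1 / 3, 1 / 3, 1 / 3),  # assumed equal weighting
) -> bool:
    """Store a memory only if the weighted signal score clears the threshold."""
    values = (signals.novelty, signals.salience, signals.prediction_error)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("signals must be in [0, 1]")
    score = sum(w * v for w, v in zip(weights, values))
    return score >= threshold
```

Everything that fails the gate never reaches the index, so it can never show up as a plausible-looking wrong result later.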

It means writing a research paper and getting it on arXiv instead of shipping the next feature. The paper (arXiv:2605.04897) has methodology, controlled benchmarks, and reproducible results. If the claims were going to hold up, the data had to be public.

That's the foundation TrueMemory is built on. Research first, product second.

Why this matters for builders

Anthropic is shipping native memory for Claude. OpenAI is building memory into ChatGPT. Google's Gemini remembers conversations. Every platform is adding memory as a checkbox feature.

When the platform ships a native version of your wrapper, you die. Not because their version is better but because it's already installed and free. Meeting summarizers learned this last year when Zoom, Meet, and Teams all shipped native summarization within months of each other. The platform doesn't need a good version, just a good enough version with better distribution than you'll ever have. The bar for survival is high.

If you're building an AI product, here's my suggestion: run a benchmark. A real one, with ground-truth answers. Measure whether your retrieval actually returns correct results or just plausible-looking ones. You might not like what you find, but at least you'll know what you're shipping.
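The smallest useful version of that benchmark is a hit rate: for each question, does the memory that contains the answer show up in the top k results? A sketch, assuming you have (question, gold memory id) pairs and a hypothetical `retrieve(question, k)` function that returns a ranked list of ids:

```python
def hit_rate_at_k(cases, retrieve, k=10):
    """Fraction of questions whose gold memory id appears in the top-k results.

    `cases` is a list of (question, gold_memory_id) pairs.
    `retrieve(question, k)` is a stand-in for your own search function and
    should return memory ids, best match first.
    """
    if not cases:
        raise ValueError("need at least one ground-truth case")
    hits = sum(
        1 for question, gold_id in cases
        if gold_id in retrieve(question, k)[:k]
    )
    return hits / len(cases)
```

A few dozen hand-labeled pairs is enough to start. If the number is low, no amount of prompt work downstream will fix it.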

Everyone's launching wrappers. The ones that survive will be the ones that went deep enough to own something real.

Josh Adler is a researcher at TrueMemory, a Sauron company. Research: arXiv:2605.04897. More at joshadler.com.