Every time you send a prompt to Claude, it considers its entire
knowledge space. A question about a React bug still costs tokens
on geography, cooking, history, and every other domain Claude was
trained on. Nobody talks about this because the context window is
large enough that it "works." But it's wasteful by design — and
it produces padded, unfocused responses as a side effect.
I spent two weeks building a fix. The result is
Prism — a local
proxy that intercepts your Claude API calls and routes only
relevant knowledge to each query. Zero extra API calls. Zero
embeddings. Zero vector database. Pure deterministic math.
## The problem with every other approach
Before I explain what Prism does, I want to explain what it
deliberately does not do — because the distinction matters.
Every existing context optimizer I found uses a second, smaller
LLM to compress the input to the main LLM. LLMlingua, Selective
Context, LLM-DCP — they all work the same way: call a model to
decide what to keep, then call the main model with the compressed
input.
That's two inference calls instead of one. You're burning tokens
to save tokens. The abstraction is broken at the foundation.
I kept asking: do you actually need a model to decide what's
relevant? For most prompts, the answer is no. The relevant
domain is structurally detectable from the words in the query
itself. You don't need intelligence to know that "fix this
TypeError" is a code/debug question. You need pattern matching.
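To make that concrete, here is a minimal sketch of keyword-based intent detection. The trigger lists and intent names are illustrative assumptions, not Prism's actual tables, but they show why no model call is needed:

```javascript
// Hypothetical trigger tables — a tiny sample for illustration,
// not Prism's real keyword graph.
const INTENT_TRIGGERS = {
  DEBUG:   [/\bfix\b/i, /\berror\b/i, /TypeError|ReferenceError/, /\bcrash/i],
  CODE:    [/\bimplement\b/i, /\bwrite a function\b/i, /\brefactor\b/i],
  FACTUAL: [/\bwhat is\b/i, /\bcapital of\b/i, /\bwhen did\b/i],
};

function detectIntent(prompt) {
  // Count pattern hits per intent; most hits wins, with a fallback.
  let best = { intent: 'CONCEPTUAL', hits: 0 };
  for (const [intent, patterns] of Object.entries(INTENT_TRIGGERS)) {
    const hits = patterns.filter((p) => p.test(prompt)).length;
    if (hits > best.hits) best = { intent, hits };
  }
  return best.intent;
}

console.log(detectIntent('fix this TypeError in my auth middleware')); // DEBUG
```

Two regex hits ("fix", "TypeError") are enough to route the query, with no inference call anywhere.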
So I reached for BM25.
## What BM25 is and why it works here
BM25 (Best Match 25) is an information retrieval algorithm from
1994, built on TF-IDF principles from the 1970s. It's the
ranking function that powered search engines before neural
networks existed. It scores documents against a query using
term frequency, inverse document frequency, and document length
normalization. No model. No training. Pure math.
Here's the key insight: I pre-built a corpus of 40 knowledge
domain nodes — javascript, security, databases, geography,
history, medicine, etc. Each node has a keyword set describing
that domain. At query time, BM25 scores every domain node
against the incoming prompt in microseconds and returns the
top 5 by relevance.
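The scoring step can be sketched in a few lines. The domain keyword lists below are invented stand-ins (Prism's real corpus has 40 richer nodes), but the BM25 arithmetic is the standard formulation:

```javascript
// Toy domain corpus — three nodes instead of Prism's 40.
const DOMAINS = {
  javascript: 'typeerror undefined promise async function node npm callback',
  security:   'auth token middleware jwt session csrf vulnerability',
  geography:  'capital country river mountain border continent',
};

const k1 = 1.2, b = 0.75; // standard BM25 free parameters

function bm25Scores(query) {
  const docs = Object.entries(DOMAINS).map(([name, text]) => [name, text.split(/\s+/)]);
  const N = docs.length;
  const avgdl = docs.reduce((sum, [, toks]) => sum + toks.length, 0) / N;
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  const scores = {};
  for (const [name, tokens] of docs) {
    let score = 0;
    for (const term of terms) {
      const tf = tokens.filter((t) => t === term).length;
      if (tf === 0) continue;
      // Inverse document frequency: rarer terms weigh more.
      const df = docs.filter(([, toks]) => toks.includes(term)).length;
      const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);
      // Term frequency with document-length normalization.
      score += idf * (tf * (k1 + 1)) /
               (tf + k1 * (1 - b + b * tokens.length / avgdl));
    }
    scores[name] = score;
  }
  return Object.entries(scores).sort((x, y) => y[1] - x[1]);
}

const ranked = bm25Scores('fix the TypeError in my auth middleware');
console.log(ranked[0][0]); // security — two keyword matches beat one
```

With these toy keyword sets, "auth" and "middleware" give security two matches against javascript's one ("typeerror"), so it ranks first.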
The Knowledge Graph then walks relationship edges — if
"javascript" scores highest, its related domains (node, react,
typescript) get included at a discounted score. The result is
a focused set of 3-5 domains that actually matter for this
specific query.
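A sketch of that edge walk, assuming a flat discount factor (the edge map and the 0.6 value are my assumptions for illustration):

```javascript
// Hypothetical relationship edges between domain nodes.
const EDGES = {
  javascript: ['node', 'react', 'typescript'],
  security: ['node'],
};
const DISCOUNT = 0.6; // neighbors inherit a fraction of the source score

function expandDomains(scored, topN = 5) {
  const out = new Map(scored); // name -> score
  for (const [name, score] of scored) {
    for (const neighbor of EDGES[name] ?? []) {
      const discounted = score * DISCOUNT;
      // Keep the best score a neighbor earns from any edge.
      if (discounted > (out.get(neighbor) ?? 0)) out.set(neighbor, discounted);
    }
  }
  return [...out.entries()].sort((a, b) => b[1] - a[1]).slice(0, topN);
}

console.log(expandDomains([['security', 0.91], ['javascript', 0.87]]));
// security and javascript stay on top; node, react, typescript
// join at discounted scores
```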
Building the index takes ~2ms at startup. Each query takes
under 1ms. The entire operation costs zero dollars and works
completely offline.
## The four pipeline stages
1. Intent Classifier
Reads the prompt and assigns one of six intent types: CODE,
DEBUG, FACTUAL, CONCEPTUAL, DECISION, or CREATIVE. Uses a
deterministic keyword graph — trigger words, regex patterns,
confidence thresholds. No model call. Under 1ms.
2. Knowledge Graph
BM25-scores 40 domain nodes against the prompt + intent.
Applies intent affinity boosts (DEBUG queries get a 1.4x
multiplier on security and language domains). Walks relationship
edges for related domains. Returns top 5 nodes with scores.
3. Context Injector
Builds a focused system prompt fragment under 300 tokens.
Format varies by intent:
- DEBUG: "Diagnose from [security] perspective. State cause first. Then fix. Then why."
- FACTUAL: "Answer from [geography] knowledge. One direct answer. No padding. Facts only."
Always appends: "Be dense. Replace meta-commentary with labels:
[reason] [context] [caveat]. Skip preamble and sign-off."
4. Response Enforcer
Post-processes Claude's raw response before returning it to
you. Runs 111 filler phrase patterns — "Here's the thing
about", "Let me walk you through", "I hope this helps",
"Great question" and 108 others. Prefix and suffix patterns
are deleted entirely. Inline patterns are replaced with
compact semantic labels. Result: 30-50% shorter responses
that are actually denser with information.
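The enforcer stage can be sketched as three pattern classes. The phrase lists here are a tiny sample standing in for Prism's 111 patterns:

```javascript
// Prefix/suffix patterns are deleted; inline patterns become labels.
// These lists are illustrative — a handful of the 111 real patterns.
const PREFIX = [/^I'd be happy to help[^.!]*[.!]\s*/i, /^Great question[.!]\s*/i];
const SUFFIX = [/\s*I hope this helps[.!]?\s*$/i];
const INLINE = [[/Let me walk you through /gi, '[context] ']];

function enforce(text) {
  for (const p of PREFIX) text = text.replace(p, '');
  for (const s of SUFFIX) text = text.replace(s, '');
  for (const [pattern, label] of INLINE) text = text.replace(pattern, label);
  return text.trim();
}

console.log(enforce('Great question! The bug is a null token. I hope this helps.'));
// → "The bug is a null token."
```

Because the whole stage is regex substitution, it adds effectively no latency and never touches the substantive content of the answer.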
## Before and after
Here's a real example. Prompt: "fix the TypeError in my
auth middleware"
Without Prism:
- Full knowledge space considered
- Response opens with: "I'd be happy to help you fix that TypeError! Let me walk you through what's likely happening here. TypeErrors in Express middleware are quite common and usually fall into a few categories..."
- Tokens in: ~800 (with any system context)
With Prism:
- Intent: DEBUG (0.94 confidence)
- Domains activated: security (0.91), javascript (0.87), node (0.72)
- System fragment injected: 52 tokens
- Response opens directly with the diagnosis
- Filler removed: 6 phrases
- Tokens saved: ~140
The response isn't just shorter — it's structured differently.
It leads with cause, then fix, then explanation. That's the
intent-specific formatting doing its job.
## How to use it
Prism runs as a local proxy on port 3179. You change one
thing in your code:
```javascript
// Before
const client = new Anthropic({
  baseURL: 'https://api.anthropic.com'
});

// After — that's it
const client = new Anthropic({
  baseURL: 'http://localhost:3179'
});
```
Or use the SDK directly:
```javascript
import { prism } from 'prism-ai';

const response = await prism.send({
  prompt: "fix the TypeError in my auth middleware",
  apiKey: process.env.ANTHROPIC_API_KEY
});

console.log(response.intent);        // DEBUG
console.log(response.domains);       // ['security', 'javascript', 'node']
console.log(response.tokensIn);      // 312
console.log(response.fillerRemoved); // 6
```
Install:

```shell
npx prism-ai start
```
That's the entire setup.
## Prism Agent
I also built Prism Agent
on top of the SDK — a Claude Code alternative with a live
knowledge graph pane in the terminal. Every turn shows you
which domains activated, their BM25 scores, tokens saved,
and filler removed. You can pin domains (always include this)
or suppress them (never use this). It's a coding agent that
isn't a black box: you can see exactly why each response was
shaped the way it was.
```shell
npm install -g prism-agent
prism-agent
```
## Why this matters beyond token costs
The token savings are real but they're not the main point.
The main point is that focused context produces better
responses. When Claude is directed at a specific domain
with a specific response format for a specific intent type,
the answer gets better, not just shorter. The Response
Enforcer compounds this by removing the preamble padding
that dilutes the actual answer.
BM25 and the TF-IDF family it builds on have been solving
information retrieval problems for 50 years. It doesn't
hallucinate. It doesn't drift. It doesn't
need a GPU. It runs in a for loop. For the specific problem
of "which knowledge domain is this query about," it's more
than sufficient — and it's the right tool precisely because
it's so much simpler than the alternatives.
Both projects are fully open source and MIT licensed.
If you're building anything on top of Claude or other LLM
APIs and want to talk through the BM25 implementation,
drop a comment — happy to go deeper on any part of this.