Build Once, Sell Twice: caching LLM analysis with pgvector

A few months ago I was staring at a spreadsheet trying to figure out why our unit economics looked terrible. Every new AI feature was increasing revenue — and quietly destroying our margins.

The product details don't matter, but the shape will be familiar to anyone building AI-powered features. Imagine you're building an Instagram DM CRM that qualifies leads, scores intent, drafts follow-ups, and alerts sales reps when hot leads land. Customers pay per seat, plus add-ons like pipeline reports or lead reactivation audits. Each DM costs ~$0.02 to analyze with AI.

The product looks simple on paper and the unit economics look fine.
Then you notice something. The same DM gets analyzed when it first comes in. Then again on Monday when the customer pulls the "Weekly Pipeline Report". Then again on Friday when someone runs the "Hot Leads Audit". Different angles, different prompts, but the underlying message is the same, and you're paying the LLM to re-read the exact same "hey is this still available?" three times in one week. The more your customers engage, the more your margin bleeds.

This is the story of how pgvector becomes the cheapest line item on your balance sheet.

The setup

Every DM that comes in, regardless of which features the customer eventually pays for, gets a free embedding:

INSERT INTO dm_analysis (dm_id, embedding)
VALUES ($1, $2);
-- everything else (intent, qualification, summary...) stays NULL

That's it. One row, one 1536-dim vector from text-embedding-3-small, ~$0.000003 per DM, generated synchronously on intake. By the time a customer queries, every DM already has its row. You absorb the embedding cost as infrastructure. Why? Because if the customer ever runs a semantic search ("show me leads like the one that just closed") or asks a chatbot a question, you want 100% of their DMs discoverable — not just the slice they've paid to deeply analyze.
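Concretely, the intake path can be as small as the sketch below. Treat it as illustrative, not gospel: the openai and pg clients are one reasonable choice, and ingestDm is a made-up name.

import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI();
const pool = new Pool();

// Synchronous on intake: every DM gets its embedding row immediately,
// whether or not any paid feature ever touches it.
async function ingestDm(dmId: string, text: string): Promise<void> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small", // 1536 dims, ~$0.000003 per DM
    input: text,
  });

  // pgvector accepts the '[0.1,0.2,...]' literal form, which
  // JSON.stringify happens to produce for a number array.
  await pool.query(
    "INSERT INTO dm_analysis (dm_id, embedding) VALUES ($1, $2)",
    [dmId, JSON.stringify(res.data[0].embedding)]
  );
}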

The schema looks like this:

CREATE TABLE dm_analysis (
    analysis_id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    dm_id                UUID NOT NULL UNIQUE REFERENCES dms(dm_id),
    -- expensive fields, filled when the LLM runs:
    intent               VARCHAR(50),
    qualification_score  DECIMAL(3,2),
    fit_label            VARCHAR(20),
    urgency_level        VARCHAR(20),
    summary              TEXT,
    -- always filled, free:
    embedding            VECTOR(1536),
    created_at           TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_analysis_embedding
    ON dm_analysis
    USING hnsw (embedding vector_cosine_ops);

The trick: a dm_analysis row with a filled summary means you paid an LLM for it. A row with only embedding means you haven't yet. The schema is the cache.
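A nice side effect: you can watch the cache fill with one query. A sketch, reusing the pool from the intake snippet and assuming a tenantId in scope:

// Fraction of a tenant's embedded DMs that already carry paid analysis.
// count(da.summary) counts only non-NULL summaries, i.e. cache hits.
const { rows } = await pool.query(
  `SELECT count(da.summary)::float / greatest(count(*), 1) AS hit_rate
     FROM dm_analysis da
     JOIN dms d ON d.dm_id = da.dm_id
    WHERE d.tenant_id = $1`,
  [tenantId]
);
console.log(`cache hit rate: ${rows[0].hit_rate}`);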

The "target vector" trick

When a customer triggers the "Hot Leads This Week" report, you don't dump every DM of the week into an LLM. That'd be wasteful, slow, and noisy. Instead, you build a target vector for the report type:

const reportVector = await embed(
  "I'm ready to buy this week. We have budget approved. Need to make a decision soon."
);

Match the shape of your target text to the shape of your data. DMs are prose, so the target vector should be prose too; embedding a keyword bag against natural messages produces fuzzier matches.
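To make that concrete (the keyword-bag text is an invented counterexample, reusing the embed helper from above):

// Fuzzy: no real lead types like this, so distances to actual DMs blur.
const keywordBag = await embed("buy purchase budget urgent decision ready");

// Sharp: reads like a message a high-intent lead would actually send.
const prose = await embed(
  "I'm ready to buy this week. We have budget approved. Need to make a decision soon."
);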

Then you let Postgres find the 50 DMs most relevant to that intent:

SELECT d.dm_id, da.summary, da.qualification_score
FROM dms d
JOIN dm_analysis da ON da.dm_id = d.dm_id
WHERE d.tenant_id = $1
  AND d.received_at >= $2  -- this week
ORDER BY da.embedding <=> $3::vector  -- cosine distance to target
LIMIT 50;

Cost of this query: $0. Time: ~50ms with the HNSW index, even at hundreds of thousands of DMs.

Now here's the part that turns the business around.

Build once, sell twice

When those 50 DMs come back, you split them into two piles:

const enriched = dms.filter(d => d.summary !== null);  // already paid for
const raw      = dms.filter(d => d.summary === null);  // need LLM

You only pay the LLM for raw. The enriched DMs — analyzed by some other feature, days or weeks ago — come back for free. Their intent, qualification score, and summary are already in dm_analysis, sitting in Postgres, waiting.
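The enrichment step for the raw pile is then just a loop and an UPDATE. A sketch, not the real pipeline: analyzeDm stands in for whatever ~$0.02 LLM call produces the structured fields, and it's passed in so the snippet stays self-contained.

interface Analysis {
  intent: string;
  qualificationScore: number; // 0.00–1.00, matches DECIMAL(3,2)
  fitLabel: string;
  urgencyLevel: string;
  summary: string;
}

async function enrich(
  raw: { dm_id: string; text: string }[],
  analyzeDm: (text: string) => Promise<Analysis>
): Promise<void> {
  for (const dm of raw) {
    const a = await analyzeDm(dm.text);

    // The entire cache write: once these columns are filled, every
    // future feature that retrieves this DM gets the analysis for free.
    await pool.query(
      `UPDATE dm_analysis
          SET intent = $2, qualification_score = $3, fit_label = $4,
              urgency_level = $5, summary = $6
        WHERE dm_id = $1`,
      [dm.dm_id, a.intent, a.qualificationScore, a.fitLabel,
       a.urgencyLevel, a.summary]
    );
  }
}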

The compounding works because target vectors for related features overlap heavily — the "Hot Leads" report and the "Pipeline Deep Dive" both surface high-intent DMs, even though the prompts ask different questions. Different framings, same source material.

The first time a customer runs a report, all 50 DMs might be raw. Cost: $1.00.

By the third report, maybe 30 of the 50 retrieved are already enriched (because they were relevant to some previous feature run and got paid for then). Cost drops to $0.40.

By month three, on a "Monthly Pipeline Deep Dive" pulling 200 DMs, 150 of them are typically already enriched. Cost: $1.00 instead of $4.00.

And the trend only moves in one direction: the more a customer engages, the cheaper each subsequent run becomes for you. The LLM bill grows sublinearly while revenue grows linearly. That's the whole game.
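As a toy cost model (the hit counts below are this article's illustrative numbers, not measurements):

const COST_PER_RAW_DM = 0.02; // LLM analysis price from above

function reportCost(retrieved: number, alreadyEnriched: number): number {
  return (retrieved - alreadyEnriched) * COST_PER_RAW_DM;
}

reportCost(50, 0);    // first report:  $1.00, everything raw
reportCost(50, 30);   // third report:  $0.40, 30 cache hits
reportCost(200, 150); // month three:   $1.00 instead of a naive $4.00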

Why pgvector specifically

Three reasons not to reach for Pinecone or Weaviate here:

One database. The embedding lives in the same row as the structured analysis. The same JOIN that filters by tenant_id (for row-level tenancy) also filters the vector search. No two-phase commits, no "find IDs in Pinecone, fetch rows from Postgres" dance, no consistency window where a DM exists in one system and not the other.

The cache is a column, not a service. WHERE summary IS NOT NULL distinguishes an enriched DM from a raw one. There's no separate metadata store to keep in sync. When the LLM finishes, an UPDATE happens, and that's the entire state transition.

HNSW is good enough. At reasonable scale — say a million DMs per tenant in the worst case — pgvector's HNSW index returns a top-50 in tens of milliseconds. You will not be bottlenecked by vector search. You'll be bottlenecked by your LLM rate limits, by Stripe webhooks, by your own code. Never by the vectors.

The takeaway

If you're building anything LLM-shaped where the same source data gets re-analyzed for different purposes — reports, dashboards, chatbots, alerts, auto-categorization, lead scoring — the cheapest dollar you'll ever spend is on a VECTOR column and an HNSW index. Pay the LLM once per piece of content, ever. Let semantic search decide which pieces are relevant to the next question. Let your margin compound.

The fancy moat isn't the AI. It's the cache.
