DEV Community: Vahid Aghajani

SQLite FTS5: How Full-Text Search Actually Works (Inverted Index + BM25)

Vahid Aghajani — Sat, 25 Jul 2026 08:24:17 +0000

Originally published on software-engineer-blog.com.

You have a folder of 10,000 markdown notes and you search for postgres backup. grep takes 400 ms and hands back 40 files in the order they happen to sit on disk. SQLite FTS5 takes 3 ms and puts the right note first.

Same files. Same query. The entire difference is two ideas: an inverted index and BM25 ranking. Both fit in your head, and both fit in one SQLite file — no Elasticsearch cluster, no search service to babysit.

Start honest: grep is not wrong

grep -r "postgres backup" notes/ opens every note and reads every byte. At 300 notes that is genuinely the correct answer — the simplest thing that works. At 10,000 notes it re-reads all of them on every keystroke, and your 400 ms search-as-you-type feels like typing through mud.

But speed is only half the problem, and it's the half everyone notices. The half that quietly hurts more: grep cannot rank. A tight 200-word note that is about postgres backups and a standup note that mentions "postgres backup" once in paragraph nine come back identical — two matching file paths, in folder order. grep knows containment. It has no concept of relevance.

The flip: store "word → files", not "file → words"

Every file on disk is already a mapping of file → the words in it. Full-text search flips that around and stores word → the files it appears in. That's the whole trick — the inverted index — and in SQLite it's one line:

CREATE VIRTUAL TABLE notes USING fts5(path UNINDEXED, title, body);

Under the hood, FTS5 keeps a postings list per term: for postgres, a list of (docid, column, position) entries — which note, which column, and where in the text the word sits.

Now run the query:

SELECT path FROM notes WHERE notes MATCH 'postgres backup';

FTS5 reads the postings list for postgres and the postings list for backup, intersects them, and touches nothing else. Files scanned: 0. That's the 400 ms → 3 ms jump: the work scales with the length of those two postings lists, not with the size of the corpus.
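To make the flip concrete, here is a minimal sketch of an inverted index with positions in plain Python. It is a toy under stated assumptions, not how FTS5 stores data on disk: the three notes, the `tokenize` helper, and the function names are all invented for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumerics (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def build_index(docs):
    """Map each term to {doc_id: [positions]}, i.e. a postings list with positions."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search_all_terms(index, query):
    """Return doc ids containing every query term, reading only those terms' postings."""
    postings = [set(index.get(term, {})) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical three-note corpus.
notes = {
    "backup.md": "Postgres backup with pg_dump: schedule the backup nightly",
    "standup.md": "Standup notes. Discussed lunch, then postgres backup briefly",
    "redis.md": "Redis persistence and backup options",
}
index = build_index(notes)
print(search_all_terms(index, "postgres backup"))
print(index["backup"]["backup.md"])  # token positions of "backup" in that note
```

The search never opens a note: it intersects two sets of document ids. The stored positions are what a phrase or proximity check would consult next.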

And because the index stores positions, not just membership, you get two things grep can only fake with regex:

-- exact phrase: the words must be adjacent, in order
SELECT path FROM notes WHERE notes MATCH '"postgres backup"';

-- proximity: within 5 tokens of each other
SELECT path FROM notes WHERE notes MATCH 'NEAR(postgres backup, 5)';

The part that trips everyone: the ranking is NOT stored

Here's the correction. It's tempting to imagine the index stores a rank next to each posting, something like "this note is a 9/10 for postgres". It doesn't, and it can't.

BM25 — the ranking function FTS5 uses — is computed at query time, from three inputs:

Term frequency (TF): how often the term appears in this note. More mentions → more likely the note is actually about it.
Inverse document frequency (IDF): how rare the term is across the corpus. If postgres appears in 12 notes and backup in 3,000, a postgres hit carries far more signal — the rare term dominates the score.
Length normalization: a 5,000-word brain-dump can't out-score a tight 200-word note by sheer bulk. Frequency is judged relative to document length.

The reason it can't be precomputed: BM25 scores a pair — this query against this document. The same note scores differently for postgres than for postgres backup. A stored per-note rank isn't even well defined.
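The three inputs can be sketched as a small scoring function. This uses a common textbook BM25 variant with the usual defaults k1 = 1.2 and b = 0.75; the exact IDF formula and sign convention inside FTS5 may differ, and the tiny corpus is made up.

```python
import math

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one (query, document) pair; corpus is a list of token lists."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)                      # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)         # documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        length_norm = 1.0 - b + b * len(doc_tokens) / avg_len
        score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
    return score

# Hypothetical tokenized corpus: a focused note, a long standup note, an unrelated note.
focused = "postgres backup guide use pg_dump for postgres backup".split()
standup = ("standup notes lunch roadmap hiring budget offsite " * 5 + "postgres backup").split()
other = "redis backup notes".split()
corpus = [focused, standup, other]

for name, doc in [("focused", focused), ("standup", standup), ("other", other)]:
    print(name, round(bm25_score(["postgres", "backup"], doc, corpus), 3))
print("postgres only:", round(bm25_score(["postgres"], focused, corpus), 3))
```

The same document gets a different score for each query, which is why nothing useful could be stored ahead of time.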

The real query — and the negative-number gotcha

Production search almost always wants a title hit to outrank a body hit. BM25's column weights do exactly that:

SELECT path, bm25(notes, 0.0, 10.0, 1.0) AS rank
FROM notes
WHERE notes MATCH 'postgres backup'
ORDER BY rank
LIMIT 10;

Two details bite here:

The weights are positional — one per declared column, including the UNINDEXED one. 0.0 for path, 10.0 for title, 1.0 for body: a title hit counts 10× a body hit.
bm25() returns a negative number. More relevant = more negative. So plain ORDER BY rank ascending is already best-first. Everyone writes ORDER BY rank DESC exactly once, stares at the worst results on top, and never does it again.
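A quick way to see both details at once is to run the query from Python's built-in sqlite3 module. This sketch assumes your SQLite build includes FTS5 (most do, but it is a compile-time option); the notes are invented, and the alias is `score` rather than `rank` only to keep it visually distinct from FTS5's own hidden rank column.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE notes USING fts5(path UNINDEXED, title, body)")
con.executemany(
    "INSERT INTO notes (path, title, body) VALUES (?, ?, ?)",
    [
        ("backup.md", "Postgres backup runbook", "Run pg_dump nightly and test the restore."),
        ("standup.md", "Standup notes", "Lunch plans, hiring, and one postgres backup question."),
        ("redis.md", "Redis notes", "Persistence options and snapshot settings."),
        ("deploy.md", "Deploy checklist", "Tag the release and watch the dashboards."),
        ("oncall.md", "On-call handbook", "Escalation paths and paging rules."),
        ("vim.md", "Vim tips", "Macros, registers, and text objects."),
    ],
)
rows = con.execute(
    """
    SELECT path, bm25(notes, 0.0, 10.0, 1.0) AS score
    FROM notes
    WHERE notes MATCH 'postgres backup'
    ORDER BY score          -- ascending: most negative (most relevant) first
    LIMIT 10
    """
).fetchall()
for path, score in rows:
    print(f"{score:.4f}  {path}")
```

Both matching notes come back with negative scores, and the one whose title contains the terms sorts first.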

grep vs FTS5, honestly

	grep	SQLite FTS5
Question it answers	Which files contain this word?	Which file is this word most about?
Work per query	Reads every byte of every file (~400 ms at 10k notes)	Reads only the query terms' postings lists (~3 ms)
Ranking	None — results in folder order	BM25 (TF × IDF × length norm), computed per query
Phrase / proximity	Regex gymnastics	Native — `"exact phrase"`, `NEAR()` from stored positions
Freshness	Always current — reads the real files	Derived copy — goes stale if you write outside the ingestion path
Setup	Zero	One `CREATE VIRTUAL TABLE` + an ingestion step you now own

The two costs nobody puts in the demo

1. The index is a derived copy — you now own a sync problem. Edit a note in your editor, outside whatever code inserts into the FTS table, and the index goes quietly stale. It will keep answering, confidently, from old text. Stale-but-confident is strictly worse than grep's honest "no match" — grep at least never lies about the current state of disk.

2. BM25 is purely lexical. It counts tokens. It has no idea car and automobile are related, that pg_dump is about postgres backups, or that "restore my database" and "postgres backup" are the same intent. That's not a bug in FTS5 — it's the ceiling of lexical search, and exactly where semantic/vector search begins.

The AI angle: why RAG pipelines still run BM25

If you're building retrieval for an LLM — a RAG system over docs, tickets, or a codebase — this isn't retro plumbing. It's half of the current best practice.

Embedding search has the opposite failure mode to BM25: it knows car ≈ automobile, but it's mushy on exact tokens. Ask a pure vector index for ERR_CONN_RESET_1042 or bm25(notes, 0.0, 10.0, 1.0) and it happily returns text that's semantically nearby while missing the one chunk containing the literal string. BM25 nails exact identifiers, function names, error codes, version strings — precisely the queries developers actually make.

That's why production RAG stacks run hybrid search: BM25 and vector similarity in parallel, merged with reciprocal rank fusion or a reranker. The inverted index + BM25 you just learned isn't the "old way" — it's the lexical leg of that hybrid, and with FTS5 it costs you one file and zero services. A SQLite database with an FTS5 table and an embedding table is a legitimate, boringly reliable retrieval layer for a small-to-mid RAG system.
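Reciprocal rank fusion is small enough to sketch in full. Each ranked list contributes 1 / (k + rank) to a document's fused score; k = 60 is the conventional constant, and the two hit lists below are hypothetical outputs from a BM25 query and a vector query.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse best-first ranked lists: each doc earns 1 / (k + rank) per list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_hits = ["errors.md", "network.md", "faq.md"]          # hypothetical lexical results
vector_hits = ["network.md", "timeouts.md", "errors.md"]   # hypothetical semantic results
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

Because only ranks are used, the fusion never has to reconcile BM25 scores with cosine similarities, which live on unrelated scales.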

And the mental model transfers one-to-one: IDF's "rare terms carry the signal" is the same instinct behind why a good chunking strategy keeps identifiers intact, and BM25's query-document pair scoring is the same shape as a cross-encoder reranker — score the pair at query time, because relevance isn't a property of the document alone.

Verdict

Under ~300 documents, grep (or LIKE '%…%') is genuinely fine — don't build an ingestion pipeline you don't need. The moment you need search-as-you-type or ranked results, FTS5 gives you real search-engine machinery — inverted index, postings lists, BM25, phrase and proximity queries — for one CREATE VIRTUAL TABLE, inside a database you're probably already shipping. Reach for a dedicated search service (or the vector leg) only when you hit FTS5's honest ceilings: cross-machine scale, typo tolerance, or synonym/semantic matching.

grep asks which files contain this word. FTS5 asks which file is this word most about. That move — from containment to relevance — is the whole idea.

🎥 Watch the 3-minute version: SQLite FTS5 — inverted index + BM25, animated

Next-Token Prediction: How an AI Actually Writes Text (Not Magic — Just Probability)

Vahid Aghajani — Thu, 23 Jul 2026 17:48:21 +0000

Originally published on software-engineer-blog.com.

Ask an AI the same question twice. Get two different answers. That's not a glitch you tolerate — it's the entire mechanism working exactly as designed.

Start below the buzzword: a language model never sees a finished sentence. It only ever answers one tiny question, over and over — given everything written so far, what's the next chunk of text?

One-line mental model: the model outputs a probability distribution over every possible next token → it samples from that distribution instead of always grabbing the top score → the winning token gets glued onto the text → the exact same question runs again from scratch, one token at a time.

The concrete example: finishing one sentence

Say DraftPal, a writing assistant, is finishing: "The cat sat on the ___." It doesn't know the ending. It computes one probability for every possible next token it knows about:

mat    → 41%
chair  → 19%
floor  → 12%
...    → (thousands more, trailing to ~0%)

That's it. That's the entire "intelligence" at this step — a ranked list over the whole vocabulary, built fresh from the text so far.

The part almost everyone skips: it samples, it doesn't grab the top score

Here's the detail that explains half the "weird" behavior people notice about LLMs: the model does not deterministically pick mat because it's the highest score. It samples — a weighted die roll across that entire distribution. 41% wins most of the time. Sometimes chair wins instead. Same model, same prompt, different word — because the die was rolled, not read off a table.

Whatever wins gets glued onto the text, and the whole question — "given everything so far, what's next?" — runs again from scratch, now one token longer. One token, one roll, repeat. That loop, run a few hundred times, is what writes an entire reply.

# pseudocode — the entire generation loop
tokens = tokenize(prompt)
while not done:
    distribution = model(tokens)      # probability over every next token
    next_token = sample(distribution) # NOT always argmax
    tokens.append(next_token)
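The difference between reading off the top score and rolling the weighted die can be tried without any model at all. In this sketch the distribution is the toy one from the example, truncated to four tokens (`windowsill` and its weight are an added assumption); `random.choices` treats the numbers as relative weights.

```python
import random
from collections import Counter

# Toy next-token distribution for "The cat sat on the ___", truncated to four tokens.
next_token_probs = {"mat": 0.41, "chair": 0.19, "floor": 0.12, "windowsill": 0.05}

def greedy(probs):
    """Always return the single highest-probability token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Weighted die roll: each token wins in proportion to its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so the demo is repeatable
counts = Counter(sample(next_token_probs, rng) for _ in range(10_000))
print(greedy(next_token_probs))
print(counts.most_common())
```

Greedy returns the same token every time; the sampler returns the top token most often, but every other token shows up too.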

The chart isn't fixed — it's rebuilt from context every time

Add six words of context before the same question, "write this like a horror story", and the exact same probability computation comes back totally different: mat collapses under 1%, coffin jumps to 99%. Nothing about the model changed. The input context changed, so the distribution it computes changed.

This one mechanism quietly explains two things developers run into constantly:

Why the same prompt gives two different replies on two runs. No hidden state, no bug — it's sampling from a distribution, and the die comes up differently.
What "personalization" actually is. A model doesn't know you. Your prior messages get stuffed back into the context window on every call, which reshapes the same probability chart toward tokens that fit what you've said before. It's context, not memory.

	Greedy (always top score)	Sampling (the real default)
Determinism	Same input → same output, always	Same input → can vary run to run
Variety	Low — often repetitive/boring	Higher — natural-sounding variation
Reproducibility	Perfect	Traded away for the variety
Typical use	Structured/deterministic tasks (code, JSON)	Open-ended writing, chat, brainstorming

What it costs, and where it fails

The determinism/variety trade-off. Force the model to always take the top slot (greedy decoding, or "temperature 0") and answers get boringly identical every run — useful when you need reproducibility, e.g. structured extraction. Leave sampling on and you get natural variety, at the cost of never getting the exact same output twice.
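Temperature is the dial between those two extremes. A minimal sketch, assuming made-up raw scores (logits) for three tokens: dividing the scores by the temperature before the softmax sharpens or flattens the resulting distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}

logits = {"mat": 2.0, "chair": 1.2, "floor": 0.8}  # hypothetical raw scores
for t in (1.0, 0.5, 0.05):
    probs = softmax_with_temperature(logits, t)
    print(t, {token: round(p, 3) for token, p in probs.items()})
```

As the temperature approaches zero, almost all of the probability lands on the top token, so sampling behaves like greedy decoding. An actual temperature of 0 would divide by zero here, which is why it has to be special-cased as "take the argmax".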

The compute cost is per token, not per reply. Every single token — not the whole response — costs one full forward pass through the model. A 500-token answer is roughly 500 times more expensive than a 1-token answer, not "a bit more." This is also why streaming feels slow on long outputs: you're watching the loop happen in real time.

No going back: the seed of a hallucination. Once a token is glued onto the context, it is never revised. The model doesn't get to reconsider token 40 after generating token 41. So one confident wrong guess early on doesn't get corrected. The next question is now "given everything so far, including that wrong guess, what's next?", and the model builds forward on its own mistake. That's a large part of the mechanics behind a hallucination: not "the model lied," but "the model committed to a token and the loop only moves forward."

Reframe: this is also the whole story behind LLM-serving latency

If you've ever looked at an inference dashboard, two metrics show up everywhere: TTFT (time-to-first-token) and TPOT (time-per-output-token). This loop is exactly what they're measuring.

TTFT is the cost of that first forward pass — reading the whole prompt and producing the first probability distribution.
TPOT is the cost of every subsequent iteration of the loop above — one more forward pass per token, forever, until the model samples a stop token.
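Put together, the two metrics give a back-of-the-envelope latency model. The sketch below assumes a constant TPOT and uses invented serving numbers; real systems vary per token and per batch.

```python
def response_latency_ms(output_tokens, ttft_ms, tpot_ms):
    """Total latency: one prefill (TTFT) plus one decode step (TPOT) per remaining token."""
    if output_tokens < 1:
        raise ValueError("need at least one output token")
    return ttft_ms + (output_tokens - 1) * tpot_ms

# Hypothetical serving numbers, for illustration only.
print(response_latency_ms(1, ttft_ms=300, tpot_ms=30))
print(response_latency_ms(500, ttft_ms=300, tpot_ms=30))
```

With these assumed numbers the long reply is dominated by the decode loop, not by the first token.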

That's also why batching exists as a serving technique: since each loop iteration is bottlenecked on loading the model's weights into the GPU's compute units rather than on the arithmetic itself, serving frameworks pack multiple users' next-token requests into the same forward pass so one expensive weight-load produces many tokens at once. And it's why response length is the single biggest lever on cost and latency in any LLM product — you are quite literally paying for the number of times the loop above has to run.

The takeaway

Not a sentence writer. A next-token predictor, running in a loop — one probability chart, one weighted roll, one token glued on, repeated until it samples a stop.

Two devs run the same prompt through the same model. One gets "…sat on the mat," the other gets "…the windowsill." Neither is wrong. That's not inconsistency — that's the mechanism.

Speculative Decoding, Explained: Free LLM Speed With Zero Quality Loss

Vahid Aghajani — Thu, 23 Jul 2026 06:33:26 +0000

Originally published on software-engineer-blog.com.

Autoregressive LLM decoding produces one token per forward pass. A 100-token answer means 100 sequential passes through the entire weight matrix. The arithmetic is trivial; the GPU memory trip is where all the time goes. You are bandwidth bound, not compute bound — the hardware sits idle waiting for data.

One-line mental model: Rent a small model to guess K tokens cheaply; verify all K+1 against the big model in one pass (same weight load); accept matches, correct the first mismatch, discard the rest.

Why Decoding Is Bottlenecked on Memory, Not Compute

When you generate the next token, the transformer reads every weight in the model to produce a single logit vector. On a GPU:

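Compute work: roughly 2 floating-point operations per parameter per token, so about 140 billion for a 70B model. That is well under a millisecond of arithmetic for a GPU that sustains hundreds of teraFLOP/s.
Data movement: 70 billion parameters × (1 to 2 bytes per parameter, depending on precision), all read from GPU memory on every pass.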

On modern hardware, moving that much data takes orders of magnitude longer than computing with it. Each forward pass is a round trip to memory for almost no local work. The GPU's arithmetic units are starved. Increasing batch size helps when you have many prompts to process in parallel, but for a single sequence of generations — the most common case in real-time chat — you cannot hide that latency.
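The imbalance is easy to estimate. The sketch below uses a rule of thumb of about 2 floating-point operations per parameter per token, and round hardware numbers that are assumptions, not any specific GPU: 2 TB/s of memory bandwidth and 400 TFLOP/s of peak compute.

```python
def per_token_times_ms(params, bytes_per_param, mem_bandwidth_bytes_s, peak_flops):
    """Lower bounds for one decode step: time to stream the weights vs time to do the math."""
    weight_bytes = params * bytes_per_param
    flops = 2 * params  # roughly one multiply and one add per weight per token
    memory_ms = weight_bytes / mem_bandwidth_bytes_s * 1e3
    compute_ms = flops / peak_flops * 1e3
    return memory_ms, compute_ms

# Assumed: 70e9 parameters at 2 bytes each, 2e12 bytes/s bandwidth, 400e12 FLOP/s peak.
memory_ms, compute_ms = per_token_times_ms(70e9, 2, 2e12, 400e12)
print(f"memory: {memory_ms:.1f} ms, compute: {compute_ms:.2f} ms")
```

Under these assumptions, streaming the weights takes about two hundred times longer than the arithmetic, which is the gap speculative decoding tries to put to use.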

This is the setup that speculative decoding exploits.

The Core Idea: Draft Model Proposes, Big Model Verifies

Instead of the big model generating one token at a time, use a smaller, cheaper draft model to propose K candidate tokens in a row:

Input: "def quicksort(arr):"
Draft model (1B params) proposes:
  Token 1: "if"
  Token 2: "len"
  Token 3: "("
  Token 4: "arr"

Now, instead of running the big model (70B) four times — once per proposed token — you run it once and ask it to score all five positions: the original input plus the four draft proposals. The big model's forward pass loads the entire weight matrix whether it generates one token or evaluates four. Verification is nearly free relative to generation.

After that single pass, you compare the big model's chosen token at each position against what the draft model proposed:

Positions match: accept the draft token and keep going.
First position diverges: replace it with the big model's token, discard all subsequent draft proposals, and stop (or re-run the draft model from that point).

In the best case, you accept all K proposed tokens and advance K steps with one big-model pass. In the worst case, you accept zero and advance one. Either way, you spent one weight-load trip on the big model.
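The accept step for greedy decoding can be sketched in a few lines. Here the big model's choices are passed in as a precomputed list (a hypothetical stand-in for one real forward pass), and the sketch also keeps the big model's prediction for the position after the last draft token when every draft token matches. That bonus token is a common refinement and an assumption on top of the description above.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of one draft.

    target_tokens[i] is the big model's own choice at draft position i, and the
    final entry is its choice for the position after the last draft token, so
    len(target_tokens) == len(draft_tokens) + 1.
    """
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            accepted.append(wanted)  # correct the first mismatch, drop the rest
            return accepted
        accepted.append(drafted)
    accepted.append(target_tokens[-1])  # every draft token matched: keep the bonus token
    return accepted

draft = ["if", "len", "(", "arr"]
print(verify_draft(draft, ["if", "len", "(", "arr", ")"]))       # full agreement
print(verify_draft(draft, ["if", "not", "arr", ":", "return"]))  # diverges at position 2
```

Either way the function returns at least one token, so a big-model pass is never wasted, and every returned token is one the big model itself chose.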

Running Example: CodeCue, an In-Editor Code Assistant

Imagine CodeCue, an IDE plugin that auto-completes code as you type. You're writing Python:

def fibonacci(n):
    if n <= 1:
        return n
    |  # cursor here

Without speculative decoding:

Big model (Llama 70B) generates token 1: return
Reload weights, generate token 2: fib
Reload weights, generate token 3: (
Reload weights, generate token 4: n
Reload weights, generate token 5: -

Five memory round trips for something as simple and predictable as a function call.

With speculative decoding:

Draft model (CodeQwen 1B) quickly proposes: return, fib, (, n, -
Big model loads weights once, scores all six positions (original + five proposals).
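Big model agrees with all five draft tokens, and its own prediction at the sixth position supplies one extra token: 1.
Accept: return fib(n - 1. Had the big model disagreed at any position, its token would replace the draft's there and everything after it would be discarded.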

One big-model weight load instead of five. The output is identical to what Llama 70B would have chosen on its own — the accept rule enforces it mathematically.

Mathematical Identity: Why Output Quality Is Preserved

The accept rule is designed so that the joint distribution of accepted tokens matches what the big model would have produced alone. Here's why:

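At each position, when the big model disagrees with the draft, you take the big model's token. When they agree, you keep the draft token, but only because the big model would have generated it at that position anyway. The accepted prefix is therefore a valid execution path of the big model's own decoding procedure. With greedy decoding the check is a plain token comparison. With sampling, the draft token is accepted with probability min(1, p/q), where p and q are the probabilities the big model and the draft model assign to it, and a rejected position is resampled from the leftover probability mass.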

Unlike quantization (which discards precision to trade quality for speed) or distillation (which accepts training-time accuracy loss), speculative decoding does not trade quality. The output distribution is preserved exactly. You can verify this by running the same sequence through both approaches: they will produce identical token probabilities.
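For sampled (non-greedy) decoding, the standard accept rule is a form of rejection sampling, and its exactness can be checked with arithmetic rather than taken on faith. The sketch below computes the exact distribution of the emitted token for one position; the two three-token distributions are made up.

```python
def speculative_output_distribution(p, q):
    """Exact distribution of the emitted token for one position.

    p: target (big model) probabilities, q: draft probabilities, same keys.
    Rule: draw x from q, accept with probability min(1, p[x] / q[x]);
    on rejection, draw from the residual max(0, p - q), renormalized.
    """
    accept_mass = {t: min(p[t], q[t]) for t in p}  # equals q[t] * min(1, p[t] / q[t])
    reject_prob = 1.0 - sum(accept_mass.values())
    residual = {t: max(0.0, p[t] - q[t]) for t in p}
    residual_total = sum(residual.values())
    out = {}
    for t in p:
        resampled = reject_prob * residual[t] / residual_total if residual_total > 0 else 0.0
        out[t] = accept_mass[t] + resampled
    return out

p = {"mat": 0.5, "chair": 0.3, "floor": 0.2}  # hypothetical big-model distribution
q = {"mat": 0.7, "chair": 0.1, "floor": 0.2}  # hypothetical draft distribution
result = speculative_output_distribution(p, q)
print({t: round(v, 6) for t, v in result.items()})
```

The result equals p token for token, even though every proposal came from q. A worse draft only lowers the acceptance rate; it does not shift the output distribution.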

What Costs You: Acceptance Rate, VRAM, and Workload Shape

Cost	Impact	Mitigation
Bad draft-model accuracy	Mismatches force rejections; you spend compute on draft tokens you discard. End-to-end latency can be worse than single-model decoding.	Match draft model capacity to the workload. Use a model trained on your domain (e.g., CodeQwen for code).
Two models in VRAM	You need memory for both the draft (1–7B) and big model (70B+). On a single GPU, this is tight or infeasible.	Use smaller draft models (300M–1B). Offload draft to CPU in some setups (not recommended for low-latency).
Workload-dependent gains	Boilerplate code, JSON, schema, structured output: high acceptance. Creative writing, reasoning, brainstorming: low acceptance.	Profile your use case. Disable speculative decoding for open-ended tasks; enable for code, API responses, structured data.
Acceptance rate collapse	If draft and big model disagree frequently (e.g., different tokenizers, training data), you get no speedup or slowdown.	Use a draft model derived from or aligned with the big model.

Real-World Performance

Typical wall-clock speedups in production:

Structured output (JSON, code, boilerplate): 2–4×
Mixed or conversational: 1.5–2×
Creative or long-reasoning chains: close to 1× (most drafts rejected)

These gains assume a well-tuned draft model and acceptance rates above 60–70%.
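Why the acceptance rate matters so much can be seen from a simplified model. Assume, as an idealization, that each draft token is accepted independently with the same probability a, that the draft length is K, and that every pass yields one extra token (the correction or the bonus). The expected tokens per big-model pass is then 1 + a + a² + ... + a^K. The model ignores the draft model's own cost.

```python
def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens emitted per big-model pass: 1 + a + a^2 + ... + a^K."""
    return sum(acceptance_rate ** i for i in range(draft_len + 1))

for rate in (0.3, 0.6, 0.8):
    print(rate, round(expected_tokens_per_pass(rate, draft_len=4), 2))
```

Under this model a low acceptance rate buys well under two tokens per pass, which the draft model's overhead can easily eat, while a high rate yields three or more.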

LLM Inference Context: TTFT vs. TPOT

In LLM serving, speculative decoding affects time-per-output-token (TPOT): the latency to generate each successive token after the first. Time-to-first-token (TTFT) does not improve, because it is dominated by the big model's prefill pass over the prompt, and that pass still has to run before any draft token can be verified.

If you use KV caching and continuous batching in your serving infrastructure, speculative decoding stacks orthogonally: caching reduces redundant computation within a sequence, batching amortizes weight loads across multiple prompts, and speculative decoding removes idle GPU cycles within a single sequence. None of these three optimizations makes the others redundant.

When to Reach for Speculative Decoding

Reach for speculative decoding when your workload is structured, predictable, and code-heavy (code completion, API response generation, schema filling) and you have enough VRAM for two models. Use single-model decoding when your task is creative, open-ended, or reasoning-heavy (storytelling, math steps, brainstorming), VRAM is tight, or acceptance rates are empirically below 40%.

Watch the 90-second reel on software-engineer-blog.com to see the full visual breakdown.

Data Modeling Explained: From One Messy Table to a Real Schema

Vahid Aghajani — Wed, 22 Jul 2026 16:03:29 +0000

Originally published on software-engineer-blog.com.

Your schema is the contract every pipeline, query, and dashboard depends on. Get the modeling wrong, and you'll spend 3 a.m. fixing silent duplicates and mismatched joins. Get it right, and the whole stack hums.

Mental model: Data modeling is deciding which real-world nouns become tables, which facts live where, and how they reference each other—so you have one source of truth and every query knows where to look.

The Problem: One Fat Table

Imagine you're building the backend for BeanBox, a coffee store that takes online orders. You start simple: one orders table. Every row is an order.

order_id | customer_name | customer_email | product_name | product_price | order_date
1        | Ada           | ada@b.co       | Espresso     | 3.50          | 2024-01-15
2        | Ada           | ada@b.co       | Latte        | 4.50          | 2024-01-16
3        | Bob           | bob@b.co       | Espresso     | 3.50          | 2024-01-16
4        | Ada           | ada@b.co       | Cappuccino   | 4.75          | 2024-01-17

This works until Ada changes her email to ada.new@b.co. Now you have two choices: update all three of her rows, or leave the old ones and live with inconsistency. Miss one row and you have two emails for one person. Your aggregations break. Your dashboard and your pipeline disagree on how many unique customers you have.

This is called an update anomaly—a sign that data is scattered where it should be singular.

Layer One: The Conceptual Model

Start by naming the real-world things your system cares about:

Customer — a person who buys coffee
Order — a purchase event
Product — something we sell

Draw lines between them:

A Customer places many Orders (one-to-many)
An Order contains one or more Products (many-to-many)
A Product appears in many Orders

This is the conceptual model: just nouns and relationships. No keys yet, no databases. Just: "What are the real things, and how do they connect?"

Layer Two: The Logical Model

Now turn those nouns into tables and add structure.

Customer becomes a table with:

customer_id (primary key — the unique identifier for this row)
name
email

Product becomes a table with:

product_id (primary key)
name
price

Order becomes a table with:

order_id (primary key)
customer_id (foreign key — references the Customer table)
order_date

OrderItem (or OrderProduct) handles the many-to-many:

order_item_id (primary key)
order_id (foreign key)
product_id (foreign key)
quantity

Now an order does not copy the customer's email. It just points to the customer with customer_id. If Ada changes her email, you update the customers table once, and every order automatically references the new email.

customers:
id | name | email
1  | Ada  | ada.new@b.co
2  | Bob  | bob@b.co

products:
id | name       | price
1  | Espresso   | 3.50
2  | Latte      | 4.50
3  | Cappuccino | 4.75

orders:
id | customer_id | order_date
1  | 1           | 2024-01-15
2  | 1           | 2024-01-16
3  | 2           | 2024-01-16
4  | 1           | 2024-01-17

order_items:
id | order_id | product_id | quantity
1  | 1        | 1          | 1
2  | 2        | 2          | 1
3  | 3        | 1          | 1
4  | 4        | 3          | 1

This is normalization: every fact lives in exactly one place.
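The "update once" claim is easy to check with Python's built-in sqlite3 module. This is a sketch of two of the four tables, using the sample rows from above; the NOT NULL and UNIQUE constraints are added assumptions, and SQLite only enforces foreign keys once the pragma is switched on.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign-key enforcement off by default
con.executescript("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        order_date TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada', 'ada@b.co'), (2, 'Bob', 'bob@b.co');
    INSERT INTO orders VALUES (1, 1, '2024-01-15'), (2, 1, '2024-01-16'),
                              (3, 2, '2024-01-16'), (4, 1, '2024-01-17');
""")

# One UPDATE, one row touched; no order rows change.
changed = con.execute("UPDATE customers SET email = 'ada.new@b.co' WHERE id = 1").rowcount

emails = con.execute("""
    SELECT DISTINCT c.email
    FROM orders o JOIN customers c ON c.id = o.customer_id
    WHERE c.name = 'Ada'
""").fetchall()
print(changed, emails)
```

All three of Ada's orders resolve to the new address, and an order pointing at a customer that does not exist would be rejected by the foreign key.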

The Three Relationships

Relationship	Pattern	Implementation	Example
One-to-Many	One A owns many Bs	B has a foreign key to A	One Customer, many Orders. `orders.customer_id` → `customers.id`
One-to-One	One A is exactly one B	Either table has a foreign key to the other, with a UNIQUE constraint on it; often separate for security or audit	One User, one EncryptedPassword. `passwords.user_id` → `users.id`
Many-to-Many	Many As relate to many Bs	Join table with two foreign keys	Many Orders, many Products. `order_items.order_id` → `orders.id`, `order_items.product_id` → `products.id`

Normalization: One Source of Truth

Normalization is a set of rules (Normal Forms: 1NF, 2NF, 3NF, BCNF, and beyond) that ensure:

Every fact lives in exactly one place: no redundant copies
Each update touches one place: change an email once, and every query sees the new value
No insertion anomalies: you can add a new customer without creating a fake order
No deletion anomalies: you can delete an order without losing customer information

For BeanBox, your normalized schema means:

Change Ada's email one time in the customers table
Every query that joins orders to customers sees the new email
No silent mismatches, no data rot

Layer Three: The Physical Model

Once you've normalized, the database gets real:

You pick data types: customer_id is a BIGINT, email is a VARCHAR(255), price is a DECIMAL(10,2)
You add indexes: put an index on orders.customer_id so joins are fast
You decide on denormalization on purpose

When to Denormalize

Normalization is for correctness: one source of truth, no anomalies, correct writes. But analytics has different demands: you want wide, flat tables so queries avoid expensive joins.

This is denormalization, and it's not cheating—it's deliberate and separate.

-- Normalized: correct, but three joins
SELECT
  c.name,
  COUNT(DISTINCT o.id) AS order_count,  -- DISTINCT: an order with several items spans several joined rows
  SUM(oi.quantity * p.price) AS revenue
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
LEFT JOIN order_items oi ON oi.order_id = o.id
LEFT JOIN products p ON p.id = oi.product_id
GROUP BY c.id, c.name;

-- Denormalized: wide, one table, fast read
SELECT customer_name, order_count, revenue
FROM customer_revenue_summary;

For an operational database (OLTP), normalize. For a data warehouse or analytics layer (OLAP), denormalize into a star schema or wide fact tables. Build the wide table from the normalized source on a schedule—Airflow, dbt, whatever.

-- dbt: build the wide table from normalized sources
SELECT
  c.id AS customer_id,
  c.name AS customer_name,
  c.email,
  o.id AS order_id,
  o.order_date,
  p.name AS product_name,
  p.price,
  oi.quantity,
  (p.price * oi.quantity) AS line_total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
LEFT JOIN order_items oi ON oi.order_id = o.id
LEFT JOIN products p ON p.id = oi.product_id

Then your dashboard query is a single table scan. Fast, explicit, and the denormalization is version-controlled.

Why the Schema Matters for Data Engineering

Your schema is a contract. It says:

"Here are the tables and their columns"
"Here are the primary and foreign keys"
"Here are the data types"
"Here is the one source of truth for each fact"

Every pipeline, query, and dashboard bets on that contract. When you change it without planning:

A pipeline that expected orders.customer_id breaks if you rename it
A dashboard that counts distinct customers breaks if you suddenly allow NULL in customer_id
Two teams build different logic to join orders to customers because the relationship was never documented

A clean schema prevents silent bugs. It's the difference between 3 a.m. on-call and sleeping through the night.

For LLM-Serving Systems

If you're building a system where an LLM needs to query a database (e.g., a retrieval-augmented generation pipeline), data modeling becomes a latency surface:

TTFT (time to first token) includes the query latency to fetch context. A schema that forces N+1 query patterns makes retrieval slow, and generation cannot start until the context has arrived.
TPOT (time per output token) is largely independent of the schema, but prefill time grows with the amount of retrieved context you put into the prompt.
Denormalization (wide tables, materialized views, embedding caches) reduces query latency and gets you context faster, letting the LLM start generating tokens sooner.

Your schema design directly affects how fast a user sees a response from an LLM-powered application.

Verdict

Normalize for correct writes and one source of truth. Use normalized schemas in operational databases (OLTP) and data sources of record. Denormalize deliberately for analytics. Build wide, flat tables in your warehouse using dbt or similar tools; denormalization is a choice, not a side effect.

If you're unsure whether a fact should be split across tables, ask: "If this value changes, how many places would I have to update it?" If the answer is more than one, normalize it.

Watch the 90-second reel to see this unfold on one live example.

GeoParquet Explained: Your Geodata Has Two Shapes (One You Edit, One You Scan)

Vahid Aghajani — Tue, 21 Jul 2026 18:03:46 +0000

Originally published on software-engineer-blog.com.

Take one dataset — call it LandGrid, 200 million land parcels, each with a polygon and about forty attributes — and ask it two questions.

Question one, from a surveyor: "Parcel 4,182,930 got resurveyed. Move its eastern boundary two metres and save it."

Question two, from an analyst: "Total assessed value of every residential parcel in the country, grouped by county."

Same data. Same disk. One question finishes in a millisecond, the other takes six minutes and reads 400 GB. That is not a tuning problem, and no index will fix it. It's the storage layout telling you the truth: your geodata has two shapes, and you are only storing one of them.

First principles: what a row actually is on disk

Forget databases for a second and think about bytes.

In PostGIS, a parcel is a row, and a row is stored contiguously. Parcel ID, owner, county, zoning code, assessed value, thirty-five more fields, and then the polygon geometry as WKB — all glued together, one after another, in the same disk page.

That layout has one enormous virtue: everything about one feature is in one place. The surveyor's query hits an index, the index points at a page, one read pulls the whole parcel, you edit it in place, you commit. Row-oriented storage plus an index (a B-tree on the parcel ID, a GiST R-tree on the geometry) is the correct answer to "fetch this one thing and change it." This is OLTP, and PostGIS is excellent at it.

Now run the analyst's query against that same layout. It needs exactly three columns: county, zoning and assessed_value. But the columns are glued into the rows. To read three fields from 200 million rows you must walk 200 million rows, which means dragging all forty fields off disk, geometry included, because the polygon sits physically between the value you want on this row and the value you want on the next one.

You needed 3 of 40 columns. You read 40. And the largest of them, the geometry, was never even referenced by the query.

An index cannot rescue this. Indexes exist to avoid looking at rows. The analyst's query looks at every row on purpose. Nothing is being looked up; everything is being read. So the only thing left to change is the physical layout itself.

Flip the layout

Store the file column by column instead of row by row. All 200 million zoning codes contiguously, then all 200 million assessed values, then all the geometries in their own block at the end.

Now the analyst's query reads three contiguous runs of bytes (county, zoning, assessed value) and stops. That is column pruning, and it is not a clever optimisation: it falls straight out of the layout. The geometry column is never touched because the reader never has to step over it.

You get a second win for free. A column holds one kind of value, so a run of bytes is 200 million zoning codes rather than an alternating mess of ints, strings and blobs. Dictionary encoding, run-length encoding and general-purpose compression all work far better on homogeneous data than on interleaved records. Columnar geospatial files routinely land several times smaller than the row-oriented equivalent.
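
To see why homogeneous data compresses better, here is a minimal run-length encoder. It is a sketch of the idea only, not Parquet's actual encoder, and the sample values are invented.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# A column holds one kind of value, so repeats sit next to each other.
zoning_column = ["RES"] * 5 + ["COM"] * 2 + ["RES"] * 3

# A row-oriented stream alternates types, so nothing repeats.
interleaved = ["RES", 250_000, "COM", 900_000, "RES", 310_000]

print(run_length_encode(zoning_column))     # [('RES', 5), ('COM', 2), ('RES', 3)]
print(len(run_length_encode(interleaved)))  # 6
```

Ten column values shrink to three pairs, while the interleaved stream does not shrink at all.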

This layout is Apache Parquet. It is a file format, not a database, and it has been the standard analytical file in the data world for a decade.

GeoParquet is a convention, not a database

Here is the part people get wrong. GeoParquet is not a new format, not a fork of Parquet, not a query engine, and not "an OLAP system." It is a thin metadata convention written on top of ordinary Parquet.

A GeoParquet file is a Parquet file with a geo key in the file-level metadata. That key declares:

the primary geometry column (and any secondary ones),
the encoding — WKB by default, or native GeoArrow since GeoParquet 1.1,
the CRS, as PROJJSON (not a bare EPSG integer),
the geometry types present,
the bbox of the data,
and the edge semantics — planar or spherical.

That's essentially it. The consequence is the good part: any Parquet reader can still open the file. Pandas, Spark, BigQuery, a plain parquet-tools dump — none of them break. They just see a binary column. A geo-aware reader looks at the geo metadata and knows those bytes are polygons in a specific CRS. It's the same trick as WKB inside a database column, moved up to the file level.
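
For a feel of what a geo-aware reader does with that key, here is a sketch that parses a hand-written `geo` value. The sample JSON is hypothetical. In a real file the string would come from the Parquet file-level key/value metadata, and `describe_geo` is an illustrative helper, not part of any library.

```python
import json

# Hypothetical value of the "geo" metadata key for a parcels file.
SAMPLE_GEO = json.dumps({
    "version": "1.1.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Polygon", "MultiPolygon"],
            "bbox": [5.9, 45.8, 10.5, 47.8],
            "edges": "planar",
        }
    },
})


def describe_geo(geo_json):
    """Summarise what the geo metadata says about the primary geometry column."""
    geo = json.loads(geo_json)
    name = geo["primary_column"]
    column = geo["columns"][name]
    return {
        "geometry_column": name,
        "encoding": column["encoding"],
        # The spec defaults to OGC:CRS84 when no crs key is present.
        "crs": column.get("crs", "OGC:CRS84"),
        "edges": column.get("edges", "planar"),
        "bbox": column.get("bbox"),
    }


print(describe_geo(SAMPLE_GEO))
```

A reader that ignores the key loses nothing but this interpretation; the column itself is still ordinary binary data.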

import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("INSTALL httpfs; LOAD httpfs;")

con.sql("""
    SELECT county, SUM(assessed_value) AS total
    FROM 'landgrid.parquet'
    WHERE zoning = 'RES'
    GROUP BY county
    ORDER BY total DESC
""").show()

No server, no import step, no CREATE TABLE. The engine opens the file, reads the footer, and touches only the three columns the query names: county, zoning and assessed_value.

Row groups, statistics, and the 1.1 bbox column

Column pruning gets you from 40 columns to 2. The next win is skipping rows — and this is where Parquet's internal structure matters.

A Parquet file is cut into row groups: horizontal slices, typically tens to hundreds of megabytes. Each row group stores each column as a separate chunk, and — crucially — each chunk carries statistics: min, max, null count.

So a reader handling WHERE assessed_value > 1000000 opens the footer, walks the row-group statistics, and discards whole row groups whose max is below the threshold without decompressing a single byte of them. This is predicate pushdown, and it happens before any real I/O.
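
The skipping logic is small enough to sketch in plain Python. The row groups and values below are invented, and a real reader would take the min/max from the footer instead of computing them.

```python
# Hypothetical row groups of assessed values; real ones hold far more rows.
row_groups = [
    [120_000, 340_000, 560_000],
    [700_000, 950_000, 880_000],
    [1_200_000, 2_500_000, 990_000],
]

# What the footer keeps per column chunk: min and max (null count omitted).
stats = [(min(group), max(group)) for group in row_groups]


def scan_greater_than(threshold):
    """Return matching values and how many row groups were skipped unread."""
    matches, skipped = [], 0
    for (_low, high), group in zip(stats, row_groups):
        if high <= threshold:  # no value in this group can pass the filter
            skipped += 1
            continue
        matches.extend(v for v in group if v > threshold)
    return matches, skipped


print(scan_greater_than(1_000_000))  # ([1200000, 2500000], 2)
```

Two of the three groups are ruled out from their statistics alone; only the third is ever opened.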

For geometry, min/max on a WKB blob is meaningless — which is why GeoParquet 1.1 added the bbox covering column: a struct column of xmin, ymin, xmax, ymax per row, whose own Parquet statistics per row group give you the spatial extent of that row group. A bbox query can now drop row groups the same way a numeric filter does.

-- with a 1.1 bbox covering column, this prunes row groups
-- instead of decoding 200 million polygons
SELECT * FROM 'landgrid.parquet'
WHERE bbox.xmin < 8.6 AND bbox.xmax > 8.4
  AND bbox.ymin < 47.5 AND bbox.ymax > 47.3;

The Hilbert trap: skipping is only as good as your sort order

Now the part nobody warns you about, and the single biggest reason a GeoParquet file underperforms in practice.

Row-group skipping only works if the rows were physically sorted before the file was written.

Write LandGrid in insertion order — the order the parcels happened to arrive, county by county over twenty years, or worse, arbitrary — and every row group ends up holding a random scattering of parcels from all over the country. Every row group's bbox is therefore roughly the whole country. Every row group overlaps your query. Nothing gets skipped. You scan the entire file and wonder why the format everyone praised is slow.

The fix is to sort on a space-filling curve — usually Hilbert — at write time, so that rows near each other in 2D end up near each other in the file:

COPY (
    SELECT *, ST_Extent(geom) AS bbox
    FROM parcels
    ORDER BY ST_Hilbert(
        geom,
        ST_Extent(ST_MakeEnvelope(5.9, 45.8, 10.5, 47.8))
    )
) TO 'landgrid.parquet'
  (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);

Now each row group covers a compact patch of ground, its bbox is tight, and a query over one city touches a handful of row groups instead of all of them.
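
The effect is easy to reproduce without any spatial library. The sketch below uses the textbook Hilbert index on a 64 x 64 grid of points (not DuckDB's ST_Hilbert implementation), cuts the points into row groups of 64, and counts how many group bounding boxes overlap a small query box. Grid size, group size and query box are all arbitrary choices.

```python
import random


def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate the quadrant so the curve stays continuous
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def groups_touched(points, query, group_size=64):
    """Count row groups whose bounding box overlaps the query box."""
    qxmin, qymin, qxmax, qymax = query
    touched = 0
    for start in range(0, len(points), group_size):
        group = points[start:start + group_size]
        xs = [p[0] for p in group]
        ys = [p[1] for p in group]
        if min(xs) <= qxmax and max(xs) >= qxmin and min(ys) <= qymax and max(ys) >= qymin:
            touched += 1
    return touched


N = 64
points = [(x, y) for x in range(N) for y in range(N)]
random.Random(0).shuffle(points)  # stand-in for arrival order
sorted_points = sorted(points, key=lambda p: hilbert_index(N, *p))

query = (10, 10, 14, 14)
print(groups_touched(points, query))         # arrival order: most groups overlap
print(groups_touched(sorted_points, query))  # 1
```

In Hilbert order each group of 64 is a compact 8 x 8 block, so the query lands in exactly one of them. In arrival order almost every group's box spans the whole grid.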

The important property to internalise: this is a write-time decision. In a database, you can add an index to an existing table whenever you like. Here, the sort order is the index, and it is baked into the byte layout. You cannot bolt it on afterwards — you rewrite the file. Anyone handing you a GeoParquet file has already decided how fast your spatial filters will be.

Cloud-native: the index lives inside the file

Because Parquet's footer sits at a known place and the footer records the byte offset of every column chunk in every row group, a reader can work over HTTP range requests. Fetch the footer. Read the statistics. Decide which chunks you need. Fetch exactly those byte ranges from object storage.
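
The first two requests can be sketched without a network. A Parquet file ends with the footer, a 4-byte little-endian footer length, and the magic bytes PAR1. Below, `fetch_range` is a hypothetical stand-in for an HTTP range request, and the "file" is a byte string built in memory with a placeholder footer.

```python
import struct

MAGIC = b"PAR1"


def read_footer(fetch_range, file_size):
    """Fetch the footer with two ranged reads.

    fetch_range(start, end) stands in for an HTTP GET carrying a
    Range header for bytes start..end-1.
    """
    tail = fetch_range(file_size - 8, file_size)
    footer_len = struct.unpack("<I", tail[:4])[0]
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file")
    start = file_size - 8 - footer_len
    return fetch_range(start, start + footer_len)


# In-memory stand-in for an object in a bucket: magic, data, footer, length, magic.
footer = b"<thrift-encoded FileMetaData would go here>"
blob = MAGIC + b"\x00" * 1000 + footer + struct.pack("<I", len(footer)) + MAGIC

requests = []


def fetch_range(start, end):
    requests.append((start, end))
    return blob[start:end]


assert read_footer(fetch_range, len(blob)) == footer
fetched = sum(end - start for start, end in requests)
print(len(requests), "range requests,", fetched, "of", len(blob), "bytes")
```

Everything after this point is more of the same: the footer says where each column chunk lives, and the reader asks for those byte ranges only.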

No server. No database process. No API in front. A file on S3, R2 or Azure Blob is a queryable dataset, and the "index" is not an external structure — it is inside the file.

That's why Overture Maps distributes the entire planet as GeoParquet on object storage and lets you query a city from a laptop without downloading the world. It is the same philosophy that COPC applies to point clouds: keep the spatial organisation inside a single self-describing file, and let range requests do the rest.

The honest limits

GeoParquet is the scan shape. It is genuinely bad at the other job, and pretending otherwise leads to painful architectures.

No in-place UPDATE. Parquet files are immutable. Changing one parcel means rewriting a file (or at best a partition). Correcting one boundary by rewriting a 40 GB file is absurd.
No transactions, no concurrent writers, no table semantics — unless you layer Iceberg or Delta Lake on top, which add a metadata layer giving you snapshots, schema evolution and row-level updates over the same Parquet files.
Weak at single-feature lookups. Fetching one parcel by ID means scanning row-group statistics for an ID that may be anywhere. A database index does this in microseconds.
Not the best live bbox format. For "give me the features in this viewport, right now, over HTTP", FlatGeobuf usually wins: it carries a packed Hilbert R-tree, so a client does a couple of range requests straight to the matching features. GeoParquet's granularity stops at the row group.

	PostGIS (row-oriented)	GeoParquet (columnar)
Physical layout	Row contiguous on disk	Column contiguous on disk
Best at	Fetch/edit one feature	Scan/aggregate everything
Index	B-tree + R-tree, added anytime	Sort order, fixed at write time
Update one feature	`UPDATE` in place	Rewrite the file
Transactions	Yes (ACID)	No (unless Iceberg/Delta)
Reads 2 of 40 columns	Reads all 40	Reads 2
Serving model	Server process + connection	Plain file + HTTP range requests
Live viewport fetch	Good	Weak (FlatGeobuf better)
Typical home	Editing systems, APIs, OLTP	Analytics, distribution, OLAP

The verdict

If you edit it, it belongs in a database. If you only ever scan it, it belongs in a file.

That is the whole rule, and it is a statement about access pattern, not about technology preference. The surveyor's workload and the analyst's workload are different physical problems, and one byte layout cannot be optimal for both.

Mature geospatial stacks stop trying and keep both shapes: PostGIS as the system of record where features are created and edited, and a GeoParquet export — Hilbert-sorted, bbox-covered, sitting on object storage — as the analytical and distribution copy, refreshed nightly. The database stays small and fast at the thing it is good at. The analyst stops running six-minute queries against production. Nobody argues about which one "wins", because they were never doing the same job.

▶ Watch the reel: your geodata has two shapes

Connection Pooling: Why Your API Dies at 200 Users (But the DB Is at 4% CPU)

Vahid Aghajani — Sun, 19 Jul 2026 07:19:04 +0000

Originally published on software-engineer-blog.com.

FitLog is a small workout-tracking app: one API, one Postgres database behind it. At 20 users it's instant — every request feels like a local function call. Then a launch happens, 200 people sign up at once, and the API falls over. Requests time out, error rates spike, pagers go off.

So you open the database dashboard, bracing for a pegged CPU and a disk on fire — and Postgres is at 4% CPU. Barely awake. The database was never the bottleneck. It was never even reached.

That paradox is the whole reason connection pooling exists.

A connection is not a variable — it's a small server

The instinct is to think of "connecting to the database" as cheap: assign a variable, you're connected. It isn't. Opening a fresh Postgres connection is a small negotiation:

A TCP handshake (round trips).
A TLS handshake (more round trips, crypto).
Password authentication.
Then Postgres forks a whole backend process dedicated to that one connection — its own memory, its own everything.

Add it up and you're looking at roughly 40ms and megabytes of RAM, all to set up a connection that will then run a 3ms query. You're paying 90%+ of the cost on setup, before any real work happens.

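import psycopg2

# The trap: a brand-new connection per request
# (DATABASE_URL is assumed to be defined elsewhere, e.g. read from the environment)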
def handle_request(query):
    conn = psycopg2.connect(DATABASE_URL)  # ~40ms: TCP + TLS + auth + forked backend
    cur = conn.cursor()
    cur.execute(query)                     # 3ms of actual work
    result = cur.fetchall()
    conn.close()                           # ...and throw the expensive thing away
    return result

Do that once per request and every request drags a 40ms anchor behind a 3ms task.

The wall: `max_connections = 100`

It gets worse than "slow." Because every connection is a forked backend process, Postgres refuses to spawn an unbounded number of them. It ships with:

max_connections = 100

Connection #101 does not queue, and it does not slow down. It is refused, immediately:

FATAL: sorry, too many clients already

So when 200 users arrive at once and your app opens a connection per request, roughly 100 are turned away at the door, and the survivors each burn 40ms of setup for their 3ms of work. From the app it looks like the database is melting. From the database's point of view, it did almost nothing — it spent its time forking and rejecting, not querying. Hence 4% CPU while everything is on fire.

The fix: stop opening connections

A connection pool flips the model. Instead of creating a connection per request, you open a small, fixed set once, at boot, and keep them open for the life of the process. A request no longer creates a connection — it borrows one from the pool, runs its query, and hands it straight back, still open, still authenticated.

from psycopg2.pool import ThreadedConnectionPool

# Opened ONCE at startup — the expensive setup is paid a single time
pool = ThreadedConnectionPool(minconn=5, maxconn=20, dsn=DATABASE_URL)

def handle_request(query):
    conn = pool.getconn()          # borrow a warm, authenticated slot (microseconds)
    try:
        cur = conn.cursor()
        cur.execute(query)         # 3ms of actual work
        return cur.fetchall()
    finally:
        pool.putconn(conn)         # return the slot, still open, for the next request

The 40ms setup happens maxconn times over the whole life of the app, not once per request. That's the entire trick — and it's what PgBouncer, HikariCP, and SQLAlchemy's pool are all doing under the hood. One job.

Three things people get wrong

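1. The pool is a queue, not a multiplier. A pool of 20 does not let you serve infinite users. When all 20 slots are busy, request #21 waits for a slot to free up instead of failing with "too many clients." You've traded hard refusals for bounded waiting, which is almost always the trade you want. (Check your library's behaviour, though: psycopg2's ThreadedConnectionPool raises PoolError when it is exhausted rather than waiting, so it needs a wrapper or a pool that blocks.)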

2. A smaller pool is often faster. It's tempting to crank the pool to 100 "to be safe." But 100 connections fighting over the same CPU cores, locks, and memory bandwidth run slower than 20 connections running at full speed. Past the point where the pool size matches what the database can actually do in parallel, more connections is pure contention. 20 focused workers beat 100 elbowing each other.

3. Size it to the database, not to your users. The right pool size tracks the database's cores and disk, not how many users you have. And the ceiling is shared: the pool is per process. Run 10 app instances with a pool of 20 each and you've opened 200 connections to a database that allows 100. Do the math across your whole fleet, not per box.
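
The queue behaviour and the fleet arithmetic fit in a few lines. This is a minimal sketch, not a production pool: `BlockingPool` and `fleet_connections` are hypothetical names, and `sqlite3` stands in for Postgres so the example runs anywhere.

```python
import queue
import sqlite3


class BlockingPool:
    """A fixed set of connections handed out through a queue."""

    def __init__(self, size, connect):
        self._slots = queue.Queue(maxsize=size)
        for _ in range(size):
            self._slots.put(connect())  # pay the setup cost once, up front

    def acquire(self, timeout=None):
        # Blocks until a slot is free; raises queue.Empty after `timeout` seconds.
        return self._slots.get(timeout=timeout)

    def release(self, conn):
        self._slots.put(conn)


def fleet_connections(instances, pool_size):
    """Connections the whole fleet can open against one database."""
    return instances * pool_size


pool = BlockingPool(size=2, connect=lambda: sqlite3.connect(":memory:"))
a, b = pool.acquire(), pool.acquire()
try:
    pool.acquire(timeout=0.05)  # request #3 waits, then gives up
except queue.Empty:
    print("all slots busy")
pool.release(a)
c = pool.acquire(timeout=0.05)  # a freed slot is reused, not reopened
print(c is a, fleet_connections(10, 20))  # True 200
```

The timeout is what keeps the waiting bounded. The last line is the fleet check from point 3: ten instances with a pool of 20 each add up to 200.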

	Connection per request	With a pool
Setup cost	~40ms on every request	Paid once, at boot
Under a burst	Connection #101 refused	Request #21 waits in a queue
DB processes	One fork per request	Fixed, reused
Failure mode	"too many clients already"	Bounded latency
Right size	—	DB cores, shared across all processes

The one that bites in production

A pool hides a burst, not a bug. When you see pool exhausted / QueuePool limit reached, the reflex is to raise the pool size. Resist it. Nine times out of ten the pool isn't too small — a slow query is holding slots. One query that used to take 3ms now takes 3 seconds (a missing index, a lock, a table scan), so each slot is occupied 1000× longer, and 20 slots drain in an instant. Bumping the pool to 40 just means you exhaust 40 slots a moment later while the real culprit — the slow query — sails on. Fix the query, not the number.
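
The arithmetic behind that is Little's law: average slots in use equals arrival rate times the time each request holds a slot. A quick check, assuming a hypothetical 100 requests per second:

```python
def slots_in_use(requests_per_second, hold_ms):
    """Average number of pool slots occupied (Little's law)."""
    return requests_per_second * hold_ms / 1000


RATE = 100  # hypothetical traffic, requests per second

print(slots_in_use(RATE, 3))     # 0.3   -> a 3 ms query barely uses one slot
print(slots_in_use(RATE, 3000))  # 300.0 -> a 3 s query wants 300 slots
```

At that rate neither 20 nor 40 slots is close to 300, which is why raising the pool size only delays the exhaustion.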

The same idea, one layer up: serving LLMs

If you build AI features, you've already met this exact shape twice — and pooling is the answer both times.

RAG retrieval hammers your database. A retrieval-augmented app runs a vector similarity search (pgvector, or a dedicated store) on every question, often several per user turn. That's a connection-per-request firehose pointed straight at Postgres. Without a pool you rebuild the 40ms handshake on every retrieval; with one, the embedding lookups borrow warm slots. The RAG path is one of the most connection-hungry workloads you can ship, and it's the first place a missing pool shows up as mysterious latency.

LLM inference servers are connection pools for the GPU. Look at how vLLM or TGI serve a model and you'll see the identical mental model. The GPU can only decode so many sequences at once — that's max_num_seqs, the exact analogue of max_connections. Incoming requests don't each spin up their own model; they borrow a decode slot via continuous batching, stream their tokens, and release it. Request #N over the limit waits in a queue — it isn't refused. And the ceiling is set by KV-cache memory (GPU cores and VRAM), not by your user count — you size the batch to the hardware, exactly like sizing a pool to the database's cores. Even "pool exhausted = slow query" carries over: when a serving queue backs up, the cause is usually a few very long generations holding slots, not a batch size that's too small. Same disease, same cure, one layer up the stack.

The takeaway

Connection pooling isn't an optimization you bolt on later — for anything with real concurrency it's the difference between "works in the demo" and "survives the launch." Open your connections once, keep them warm, borrow and return instead of create and destroy. Treat the pool as a queue you size to the database (and count across every process), and when it exhausts, go hunt the slow query before you touch the dial.

No more 3 a.m. too many clients already pages.

Zero-Shot vs Few-Shot Prompting: Why Your LLM Output Keeps Breaking (and the 1-Minute Fix)

Vahid Aghajani — Sat, 18 Jul 2026 09:49:21 +0000

Originally published on software-engineer-blog.com.

Your model returns clean JSON in the notebook and a rambling paragraph in production. Same model, same question. You changed nothing about the weights—you changed the prompt.

That difference is zero-shot versus few-shot prompting, and it's the cheapest reliability upgrade in your AI stack.

Mental model: A language model is a next-token predictor with no memory of your intent except what's in the current prompt. Your prompt is the only spec it gets. Examples in that prompt teach it what shape you want—without retraining a single weight.

The Running Example: A HelpDesk Ticket Classifier

You're building a support system. Every ticket comes in as prose—messy, varied, human. Your backend needs strict JSON:

{
  "category": "billing",
  "priority": "high",
  "action": "refund"
}

You have two ways to ask the model. Both use the exact same trained weights. The output shape differs because the prompt changes.

Zero-Shot: Ask Without Examples

You pass the model a ticket and a question:

Ticket: "My invoice shows $500 but I only used the service for 2 days. This is wrong."

Convert this support ticket to JSON with keys: category, priority, action.

The model predicts the next token. Then the next. It has no worked example to anchor its output shape. What you get back might be:

A rambling paragraph starting with "This ticket is clearly about billing..."
Keys spelled differently: "Category" instead of "category"
Extra keys you never asked for
Values that don't match your enum ("medium-high" instead of "high")
Sometimes valid JSON, sometimes not

Each call drifts. The model is guessing the shape because your prompt gave it no pattern to copy.
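
One way to make that drift visible is to validate every response before the backend touches it. The sketch below checks the failure modes listed above. The allowed values are assumptions pieced together from this article's examples, not a real schema.

```python
import json

# Assumed enums for the HelpDesk example; adjust to your own schema.
ALLOWED = {
    "category": {"billing", "technical", "support"},
    "priority": {"low", "medium", "high"},
    "action": {"refund", "investigate", "guide"},
}


def validate_ticket_json(text):
    """Return the parsed dict if `text` matches the expected shape, else None."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # rambling prose or broken JSON
    if not isinstance(data, dict) or set(data) != set(ALLOWED):
        return None  # missing, extra, or differently spelled keys
    for key, allowed in ALLOWED.items():
        if not isinstance(data[key], str) or data[key] not in allowed:
            return None  # value outside the enum
    return data


print(validate_ticket_json('{"category": "billing", "priority": "high", "action": "refund"}'))
print(validate_ticket_json("This ticket is clearly about billing..."))  # None
print(validate_ticket_json('{"Category": "billing", "priority": "high", "action": "refund"}'))  # None
```

Counting how often this returns None for the zero-shot prompt gives you a drift rate you can track.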

Few-Shot: Paste Worked Examples First

Now you paste 2–3 examples of (ticket → correct JSON) before the real question:

Example 1:
Ticket: "App crashed when I tried to export data. Lost 30 minutes of work."
JSON:
{
  "category": "technical",
  "priority": "high",
  "action": "investigate"
}

Example 2:
Ticket: "Can you walk me through the API docs? I'm confused about authentication."
JSON:
{
  "category": "support",
  "priority": "low",
  "action": "guide"
}

Now classify this ticket:
Ticket: "My invoice shows $500 but I only used the service for 2 days. This is wrong."
JSON:

The model sees the pattern: key names, value choices, JSON shape. When it predicts the next token after JSON:, it copies that pattern:

{
  "category": "billing",
  "priority": "high",
  "action": "refund"
}

Same model. Same weights. The prompt now carries the specification.
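
In code, a few-shot prompt is string assembly. A minimal sketch, assuming the two examples above live in a list; `build_few_shot_prompt` is a hypothetical helper and no model is called here.

```python
import json

EXAMPLES = [
    ("App crashed when I tried to export data. Lost 30 minutes of work.",
     {"category": "technical", "priority": "high", "action": "investigate"}),
    ("Can you walk me through the API docs? I'm confused about authentication.",
     {"category": "support", "priority": "low", "action": "guide"}),
]


def build_few_shot_prompt(ticket, examples=EXAMPLES):
    """Put the worked examples first, then the real ticket, ending on 'JSON:'."""
    parts = []
    for number, (text, label) in enumerate(examples, start=1):
        parts.append(
            f"Example {number}:\n"
            f"Ticket: {json.dumps(text)}\n"
            f"JSON:\n{json.dumps(label, indent=2)}\n"
        )
    parts.append(f"Now classify this ticket:\nTicket: {json.dumps(ticket)}\nJSON:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "My invoice shows $500 but I only used the service for 2 days. This is wrong."
)
print(prompt)
```

Serialising the labels with `json.dumps` keeps the key names and quoting identical across examples, which is exactly the pattern the model is meant to copy.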

Why This Isn't Fine-Tuning

A crucial distinction: few-shot changes no weights. Fine-tuning would:

Take your examples
Recompute gradients through the entire model
Update weights so the model "remembers" your task
Leave a new checkpoint on disk

Few-shot does none of that. The examples live only in the context window—the token buffer—for that single request. Once the request ends, the model is unchanged. This is in-context learning: the model learns a task by reading examples in the same conversation, not by retraining.

That's why it's fast to deploy (no training loop) and why it costs tokens (examples ride in every prompt).

The Honest Tradeoff

Dimension	Zero-Shot	Few-Shot
Setup cost	None—just ask	Spend time writing 2–3 good examples
Tokens per request	Minimal (prompt + question)	Higher (examples + prompt + question)
Cost per request	Cheaper	More expensive (more tokens)
Latency	Slightly faster	Slightly slower (more tokens to process)
Output consistency	Drifts; shape varies	Locked; model copies your pattern
When it breaks	Tricky tasks; hard to infer intent from wording alone	When examples don't cover edge cases

Every few-shot example you add burns tokens on every request. But if drifting output breaks your downstream code, you have no choice.

When to Reach for Each

Zero-shot: The task is obvious from the instruction alone. You're asking for a summary, translation, or simple classification where the expected output format is standard (prose, a sentence). Cost matters more than perfect consistency.

Few-shot: You need a locked output shape (JSON, XML, a specific enum), a tricky judgment call where the edge cases aren't obvious, or a task where the phrasing in your examples changes the model's behavior (e.g., whether it calls something "priority: high" vs "severity: critical"). Consistency matters more than saving a few tokens.

In the HelpDesk example, you need few-shot because:

The output must be valid JSON your backend can parse
Keys and values must match an enum
The model can't infer that from wording alone; it needs examples to copy

Zero-shot would save tokens but cost you runtime errors. Few-shot costs tokens but keeps your pipeline running.

The LLM Serving Angle

In production, token count drives latency and cost. Few-shot examples increase the prompt length, which means:

Longer prefill time (processing the entire prompt, including examples, before generating the first answer token)
Lower request throughput (every request spends more compute on prefill before it emits a token)
Bigger memory footprint (a longer context means a larger KV cache per request)

If you're serving thousands of concurrent requests, those extra tokens compound. Some teams cache the prompt prefix (the examples) so only the new question and answer are computed per request—reducing redundant work. Others batch zero-shot requests to keep latency flat, and reserve few-shot for high-stakes tasks where consistency justifies the cost.

The Bottom Line

Zero-shot asks; few-shot shows. Same model, same weights—in-context learning does the rest. Reach for zero-shot when the task is obvious and cost matters; reach for few-shot when you need a locked format or your task has edge cases examples can clarify.

Watch the 90-second reel on YouTube or Instagram (@software-engineer-blog) for the visual walkthrough.

The N+1 Query Problem: Why 100 Products Cost 101 Queries (and Why an Index Won't Save You)

Vahid Aghajani — Fri, 17 Jul 2026 17:03:32 +0000

Originally published on software-engineer-blog.com.

You wrote one query. The database ran 101. And nothing in the code looks wrong.

This is the N+1 problem — the single most common reason a page that was instant with 10 rows falls over at 1,000. There is no slow query in the log, nothing flagged, nothing to blame. Just a loop that quietly turns one request into a hundred. Let's build it up from first principles, fix it in the query you already wrote, and then — the part most explanations skip — watch the exact same shape appear far away from any database.

The innocent loop

Your page shows 100 products. So you write one query:

SELECT * FROM products LIMIT 100;

One query. Then you render the list, and for each product you show its category:

products = db.query(Product).limit(100).all()   # 1 query

for product in products:
    print(product.name, product.category.name)   # ← one more query, every loop

That product.category looks like a field access. It isn't. The category lives in another table, and it wasn't loaded, so the ORM goes and fetches it — once per product. One hundred products, one hundred extra queries. Plus the original list query, that's 1 + N: the N+1 problem, named after exactly this shape.

The trap is that the code reads perfectly. There is no visible loop over the database, no obvious mistake. The extra hundred round-trips are hidden inside an attribute access.

The floor nobody starts with: a query's cost is the round-trip

To see why 101 queries is a disaster when each one is "fast," you have to know what one query actually costs.

Ask Postgres to find a single category by its primary key and it does that in about 0.18 ms. The lookup is genuinely fast. But getting the question to the database and the answer back (the network round-trip) costs around 5 ms. Serialize the query, cross the socket, wait, deserialize the result.

So each of those 100 category fetches is a perfectly indexed, sub-millisecond lookup wrapped in a 5 ms round-trip. Multiply out: ~100 trips × ~5 ms is roughly 512 ms on that page, versus ~12 ms if you'd asked once. Same data. 40× slower, purely in trips.

Why an index won't save you

The instinct, when a page is slow, is to reach for an index. Run EXPLAIN and you'll be disappointed: the database is already using the primary-key index for every one of those category lookups. They are as fast as a single query can be.

That's the whole point, and it's worth stating plainly: an index changes how fast one query finds its rows, while N+1 is about how many queries you issue.

Those are two different problems. An index cannot change a number it doesn't control. 101 indexed queries are still 101 round-trips. You can index every column in the schema and this page stays slow, because the cost was never in the finding. It was in the trips.

The fix goes into the query you already wrote

You don't need caching, a queue, or a rewrite. The fix goes right into the query you already have: add a JOIN, and the category rides back in the same result, on one trip.

SELECT products.*, categories.name AS category_name
FROM products
JOIN categories ON categories.id = products.category_id
LIMIT 100;

One query. The category is already attached to each row — no per-item lookup, no loop firing behind your back. 101 queries collapse to 1. ~512 ms becomes ~12 ms.
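
You can watch the collapse without Postgres or an ORM. The sketch below uses an in-memory SQLite database as a stand-in and counts every statement as one trip. Table contents and the `run` helper are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
""")
conn.executemany("INSERT INTO categories VALUES (?, ?)",
                 [(i, f"cat-{i}") for i in range(1, 11)])
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(i, f"product-{i}", (i % 10) + 1) for i in range(1, 101)])

counter = {"trips": 0}


def run(sql, params=()):
    """Execute one statement and count it as one round-trip."""
    counter["trips"] += 1
    return conn.execute(sql, params).fetchall()


def load_n_plus_one():
    rows = run("SELECT id, name, category_id FROM products ORDER BY id LIMIT 100")
    return [(name, run("SELECT name FROM categories WHERE id = ?", (cat_id,))[0][0])
            for _id, name, cat_id in rows]


def load_joined():
    return run("""
        SELECT products.name, categories.name
        FROM products
        JOIN categories ON categories.id = products.category_id
        ORDER BY products.id
        LIMIT 100
    """)


counter["trips"] = 0
slow = load_n_plus_one()
n_plus_one_trips = counter["trips"]

counter["trips"] = 0
fast = load_joined()
joined_trips = counter["trips"]

print(n_plus_one_trips, joined_trips, slow == fast)  # 101 1 True
```

Identical results, one trip instead of 101. With an in-process database the trips are nearly free, so the timing gap only shows up once a network sits in between.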

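And you rarely write that SQL by hand. Tell your ORM to eager-load the relationship and it writes the join for you (or, with loaders such as selectinload and prefetch_related, one extra batched query instead of one per row). It's one line: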

ORM	Eager-load in one line
SQLAlchemy	`joinedload(...)` / `selectinload(...)`
Django	`select_related(...)` / `prefetch_related(...)`
Rails	`includes(...)`
Prisma	`include: {...}`

from sqlalchemy.orm import joinedload

# SQLAlchemy: one line turns 101 queries into 1
products = (
    db.query(Product)
      .options(joinedload(Product.category))
      .limit(100)
      .all()
)

It was never about databases

Here is the part worth carrying out of this article, because it's bigger than SQL: N+1 is not an ORM quirk. It's any loop that makes one round-trip per item.

The database is just where most people notice it first. The same shape shows up everywhere a loop talks to something across a boundary:

One REST call per row in a list.
One GraphQL resolver firing per node in a result.
One object-store GET per key.
One LLM call awaited at a time.

That last one is where this bites hardest in AI work today. Say you need to classify, embed, or summarize 500 items and you write the obvious loop:

# The N+1 shape, wearing an AI hat
results = []
for item in items:                     # 500 items
    results.append(await llm.complete(item))   # one round-trip each, awaited in turn

Every await waits for a full network round-trip to the model provider before the next one starts. At ~2 seconds each, 500 sequential calls is ~17 minutes of near-pure waiting — for work that could have gone out concurrently. The fix is the same idea as the JOIN: stop issuing one trip per item. Batch the inputs into a single request where the API supports it, or fan the calls out with asyncio.gather / a bounded worker pool and let them fly in parallel. Same disease, same cure — collapse N trips toward 1.

# Fan out instead of awaiting one at a time
results = await asyncio.gather(*(llm.complete(item) for item in items))
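
An unbounded gather over 500 items can trip provider rate limits, so a bounded version is often the safer default. A minimal sketch with a semaphore follows; `fake_complete` is a stand-in for the model call, and the limit of 5 is arbitrary.

```python
import asyncio


async def fake_complete(item, tracker):
    """Stand-in for a model call; a real client would do network I/O here."""
    tracker["now"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["now"])
    await asyncio.sleep(0.01)
    tracker["now"] -= 1
    return f"label-for-{item}"


async def classify_all(items, limit=5):
    """Run every call concurrently, but never more than `limit` at once."""
    semaphore = asyncio.Semaphore(limit)
    tracker = {"now": 0, "peak": 0}

    async def bounded(item):
        async with semaphore:
            return await fake_complete(item, tracker)

    results = await asyncio.gather(*(bounded(item) for item in items))
    return results, tracker["peak"]


results, peak = asyncio.run(classify_all(range(20)))
print(len(results), peak)  # 20 5
```

`gather` returns results in input order, so the output lines up with `items` even though the calls finish out of order.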

Whether the "trip" is a SQL round-trip, an HTTP call, or a token-generation request to a model, the lesson holds: it's the number of trips, not the speed of each, that's killing you.

Not a toy problem: Shopify

If this sounds like a beginner mistake, it isn't. Shopify hit exactly this shape in their GraphQL API. In their own words:

That's N+1, verbatim, at one of the largest commerce platforms on the internet. Their answer was to build graphql-batch — a library whose entire job is to coalesce those per-item trips into batched ones. When a company at that scale ships a library just to stop N+1, that tells you how common and how invisible it is.

The verdict

	Add an index	Add a JOIN / eager-load
What it changes	How fast ONE query finds rows	How MANY queries you issue
Round-trips for 100 products	Still 101	1
Fixes N+1?	No	Yes
Where it lives	Schema / migration	The query you already wrote

The real takeaway is a habit, not a keyword: log the round-trip count per request and assert on it in a test. If that number grows when your data grows, you have an N+1 — and it is completely invisible on a laptop seeded with 10 rows, which is precisely why it survives all the way to production.
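
Here is one way such a guard could look, sketched against SQLite's statement trace hook so it runs without a server. `assert_max_queries` is a hypothetical helper; most ORMs and test frameworks offer an equivalent hook you would use instead.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def assert_max_queries(conn, limit):
    """Fail if more than `limit` SQL statements run inside the block."""
    seen = []
    conn.set_trace_callback(seen.append)
    try:
        yield seen
    finally:
        conn.set_trace_callback(None)
    assert len(seen) <= limit, f"{len(seen)} queries, expected at most {limit}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, category_id INTEGER)")
conn.executemany("INSERT INTO categories VALUES (?, ?)", [(1, "beans"), (2, "gear")])
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(i, 1 + i % 2) for i in range(50)])
conn.commit()


def page_with_join():
    return conn.execute(
        "SELECT p.id, c.name FROM products p JOIN categories c ON c.id = p.category_id"
    ).fetchall()


def page_with_loop():
    rows = conn.execute("SELECT id, category_id FROM products").fetchall()
    return [
        (pid, conn.execute("SELECT name FROM categories WHERE id = ?", (cid,)).fetchone()[0])
        for pid, cid in rows
    ]


with assert_max_queries(conn, limit=2):
    page_with_join()  # one statement: passes
```

Wrapping `page_with_loop()` in the same guard raises an AssertionError, and it does so with 50 seeded rows instead of waiting for production traffic.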

Count your round trips, not your milliseconds.

How a Database Index Actually Works: B-Trees, Seq Scans, and the Cost Nobody Mentions

Vahid Aghajani — Thu, 16 Jul 2026 21:24:09 +0000

Originally published on software-engineer-blog.com.

The same query. 4.2 seconds, then 3 milliseconds. Same data, same machine, same SQL. The only thing that changed was one line:

CREATE INDEX idx_customers_email ON customers(email);

Most explanations of indexing start at "it's like the index in a book" — and then stop. That analogy is fine as far as it goes, but it skips the part that actually matters: why the database is slow without one, and what it costs you to add one. So let's start a level below the index.

Throughout, one running example: BrewBox, a coffee shop with a customers table of 5,000,000 rows, and one query their login page runs on every single page load:

SELECT * FROM customers WHERE email = 'sara@mail.com';

First principles: what a table actually is

Here's the thing nobody tells you up front. A table is a pile of rows on disk, written one after another in the order they arrived. That's it. Nothing about it is sorted.

Sara signed up 3 million rows ago, sandwiched between whoever signed up just before and just after her. Her row isn't in an alphabetical slot. There is no alphabetical slot. There's just insert order.

So when you ask for WHERE email = 'sara@mail.com', the database genuinely does not know where that row is. And it has no clever way to guess, because nothing about the layout of the file correlates with email addresses.

The full table scan

With no idea where Sara is, the database has exactly one option: look everywhere.

Row 1 — not Sara. Row 2 — not Sara. Row 3 — not Sara. All the way down. Five million reads, to return one row. That's a full table scan (Postgres calls it a Seq Scan), and it's not the database being dumb — it's the database having no alternative.

And it's worse than a one-time 4.2-second hit, because this is a login page. Every page load does the whole thing again. Ten users online means ten full scans of five million rows, concurrently, competing for the same disk.

Enter the index

Now the one line of SQL:

CREATE INDEX idx_customers_email ON customers(email);

Notice what this doesn't do. It doesn't change your table. It doesn't change your query — you still write the same SELECT, and the planner just quietly starts using the index. Nothing in your application code changes.
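
You can watch the planner change its mind. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in for Postgres's EXPLAIN, with a made-up 10,000-row table. The exact plan wording varies by SQLite version, so the comments only describe it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO customers (email, name) VALUES (?, ?)",
    [(f"user{i}@mail.com", f"User {i}") for i in range(10_000)],
)

QUERY = "SELECT * FROM customers WHERE email = 'user4242@mail.com'"


def plan(sql):
    """Return the planner's description of how it will run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


before = plan(QUERY)
conn.execute("CREATE INDEX idx_customers_email ON customers(email)")
after = plan(QUERY)

print(before)  # a full scan of customers
print(after)   # a search using idx_customers_email
```

The query string never changes between the two calls; only the plan does.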

What it builds is a second structure, on the side. And that structure holds only two things per row:

the indexed column (the email), and
a pointer to where that full row physically lives on disk.

Not the whole row. Just the value and the address. And critically — it is kept sorted.

That's the entire trick, and it's worth saying plainly: sorted means you can skip. Once the values are in order, you no longer have to look at something to rule it out. You can rule out half the remaining data with a single comparison.
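
A short sketch of "sorted means you can skip", using a plain sorted list as a stand-in for the index. The emails and row numbers are generated, and a real index is a B-tree, not a flat list.

```python
def find_sorted(entries, target):
    """Binary search over sorted (email, row_id) pairs; returns (row_id, comparisons)."""
    lo, hi, comparisons = 0, len(entries) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        email, row_id = entries[mid]
        if email == target:
            return row_id, comparisons
        if email < target:
            lo = mid + 1  # everything at or below mid is ruled out unseen
        else:
            hi = mid - 1  # everything at or above mid is ruled out unseen
    return None, comparisons


# Hypothetical index: sorted emails, each with a pointer (here a row number).
index = sorted((f"user{i:06d}@mail.com", i) for i in range(100_000))

row_id, comparisons = find_sorted(index, "user042000@mail.com")
print(row_id, comparisons <= 17)  # 42000 True
```

At most 17 comparisons find one entry among 100,000, because each comparison discards half of what is left.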

The B-tree: three hops, not five million reads

Concretely, that sorted structure is a B-tree. Searching it is the game of "higher or lower," and every guess throws away most of what's left:

Is sara@mail.com before or after "m"? After → go right. (Half the index just disappeared.)
Before or after "sar"? Take that branch.
Three hops down, and you're on the exact entry — holding a pointer straight to Sara's row.

The reason this scales so absurdly well is that a B-tree is wide, not deep. Each node holds hundreds of keys, so the tree fans out fast. Five million rows is only about three levels deep. Ten times the data doesn't cost ten times the work — it costs roughly one more hop.
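
The depth claim is quick to check. The fanout of 300 is an assumption for illustration (the text only says "hundreds"), and real B-trees have partly filled nodes, so treat this as a back-of-envelope sketch.

```python
def btree_levels(rows, fanout):
    """Smallest number of levels such that fanout ** levels covers `rows` entries."""
    levels, reachable = 0, 1
    while reachable < rows:
        reachable *= fanout
        levels += 1
    return levels


FANOUT = 300  # assumed keys per node

print(btree_levels(5_000_000, FANOUT))   # 3
print(btree_levels(50_000_000, FANOUT))  # 4
```

Three levels reach 27 million entries at this fanout, so ten times the data needs just one more level.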

The payoff

3 rows read instead of 5,000,000. 4.2 seconds → 3 milliseconds. Roughly 1,400× faster, with the same query, the same data, and the same machine.

The cost nobody mentions

Here's the part that gets left out of every "just add an index" answer.

An index is a copy. It's a real structure that lives on disk, so it costs disk. That's the small cost.

The real cost is this: every write has to keep it true. Insert one customer and the database doesn't do one write — it writes the row, and then it also updates every index on that table, because an index that doesn't know about Sara is an index that lies.

Six indexes on customers? One INSERT becomes seven writes. Same for updates and deletes. Index every column "just in case," and your reads get fast while your writes quietly crawl.

So an index is not free speed. It's a trade: faster reads, slower writes.

	No index	With an index
Lookup by that column	Full scan — reads every row	~3 hops down a B-tree
Rows read to find one	5,000,000	3
Query plan	`Seq Scan`	`Index Scan`
Scaling	Linear — 10× rows, 10× work	Logarithmic — 10× rows, ~1 more hop
Cost of one INSERT	1 write	1 write + 1 per index
Disk	Just the table	Table + a copy of the column
Best for	Write-heavy tables, tiny tables	Columns you filter, join, or sort on

The verdict: what to index, and what not to

Index the columns you actually filter, join, or sort on — the ones in your WHERE, your JOIN ... ON, your ORDER BY. Those are the columns where a sorted side-structure buys you something real.

Don't index everything else. An index nobody queries is pure cost: disk you pay for and writes you slow down, forever, in exchange for nothing. A few good indexes beat a dozen speculative ones.

Two practical notes:

Tiny tables don't need indexes. If the whole table fits in a page or two, scanning it is the fast path — the planner will often ignore your index anyway, and it's right to.
Low-cardinality columns are usually a bad fit. An index on a boolean is_active can't skip much — half the table matches either way. Indexes pay off when the value is selective enough to eliminate most rows.

The same idea, one layer up: indexes in AI systems

If you work on LLM or RAG systems, you've already met this exact trade — just wearing a different hat.

When a RAG pipeline retrieves context, it's answering "which of my 5,000,000 chunks are closest to this query embedding?" The naive answer is the same as BrewBox's: compare against all of them — a full scan, just with cosine similarity instead of =. It works fine at ten thousand vectors and falls over at ten million, for precisely the reason above: linear scaling.

So vector databases do the same thing a relational database does: build a sorted-ish side-structure that lets you skip. A B-tree can't do it (embeddings have no meaningful "before m"), so the structure is different — typically HNSW, a navigable graph you greedily hop through, which I break down in Vector Search — how HNSW finds nearest neighbours. But the shape of the deal is identical:

Same payoff: don't touch most of the data. Hops, not a scan.
Same cost: the index is a copy that costs memory, and every insert has to update it — which is exactly why re-embedding a large corpus is slow, and why some vector stores make you rebuild rather than insert cheaply.
Same discipline: index the thing you actually query on.

The one extra wrinkle: a B-tree lookup is exact, while a vector index is approximate — HNSW may miss a true nearest neighbor, and you tune that recall/latency trade knowingly. Beyond that, if you understand why CREATE INDEX makes reads fast and writes slow, you already understand why your vector store behaves the way it does. It's the same bargain: pay on write, save on read.

Where your 4.2 seconds went

Here's the one thing to actually do with this. Take your slowest endpoint, grab the query it runs, and put EXPLAIN in front of it:

EXPLAIN ANALYZE SELECT * FROM customers WHERE email = 'sara@mail.com';

Read the plan. If you see Seq Scan on a big table, that's not a mystery anymore — that's the database telling you, in plain language, that it's reading every row because you never gave it another option. You just found your 4.2 seconds.

And if you see Index Scan, the index is doing its job. The remaining question is the other half of the trade: are you paying for any indexes nobody reads?

Vector Search — how HNSW finds nearest neighbours

Vahid Aghajani — Thu, 16 Jul 2026 13:00:10 +0000

Originally published on software-engineer-blog.com.

Your documentation site has 2,000,000 help articles. A user types a question into the search box and gets the best match in 2 milliseconds—without the system ever comparing that question to almost any of them. This is not magic. It's HNSW: Hierarchical Navigable Small World, the graph structure that powers vector search in FAISS, pgvector, Qdrant, Weaviate, and Milvus.

Mental model: HNSW is a multi-layer graph where every vector points to its nearest neighbors, and search is a greedy walk that descends from sparse long-range links at the top to dense fine-grained links at the bottom—like taking highways down to local streets to find an address.

The Problem: Why Brute Force Is Correct but Unusable

You already have the pieces:

Each of your 2,000,000 help documents is embedded—a point in a 768-dimensional space.
A user's question becomes a point in that same space.
"Best answer" means the nearest point in that space (highest cosine similarity).

The obvious solution is also the correct one: compare the query point against all 2,000,000 document points, measure distance to each, return the closest. With 768 dimensions per vector, that's roughly 1,540 arithmetic operations per document (768 multiplies and 768 adds), about 3 billion in total, which comes to about 1,240 milliseconds on a single core.
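As a baseline, brute force is only a few lines. This is a toy sketch in pure Python with random 16-dimensional vectors (an assumption to keep it fast; the setup above uses 768 dimensions and 2,000,000 documents), and the function names are made up for illustration.

```python
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_nearest(query, docs):
    """Compare the query against every document; return (best_index, comparisons)."""
    best_index, best_score = -1, -2.0  # cosine similarity is never below -1
    for i, doc in enumerate(docs):
        score = cosine_similarity(query, doc)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, len(docs)

random.seed(0)
DIM = 16  # toy size
docs = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1_000)]
query = [x + random.gauss(0, 0.01) for x in docs[123]]  # a slightly noisy copy of doc 123

best, comparisons = brute_force_nearest(query, docs)
print(best, comparisons)
```

It always finds the right document, and it always pays one comparison per document to do so.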

Worse: it scales linearly. Double your documents, double your wait. Triple them, triple the wait. At 10 million documents, you are looking at 6+ seconds. At 100 million, a minute or more. And every new search pays the full price.

Why You Cannot Index Your Way Out

Your first instinct: Can't we just use a B-tree? B-trees are brilliant for one-dimensional sorting. They let you find the number 42 in a billion-element list in log(n) comparisons.

But "nearest neighbor" in 768 dimensions is not sorting on one axis. Closeness is simultaneous proximity across all 768 axes. A B-tree sorts on one—your longitude, your latitude, your price. The second you introduce a second dimension, tree structures collapse because there is no left-right order that preserves nearness in 2D, let alone 768D.

Some databases (PostgreSQL with pgvector) do offer cluster-based indexes like IVFFlat (an inverted file over flat, uncompressed vectors), but these still scan multiple clusters and fall back to brute force within each. They are faster than pure brute force, but still O(n) in the worst case.

You need a different shape: a graph.

Enter HNSW: The Navigable Small World

HNSW solves this by building a graph where:

Every vector is a node.
Every node is wired to M of its nearest neighbors (M is typically 5–48; higher M = more accuracy, more memory).
Search is a greedy walk: starting from a random node (or a designated entry point), hop to whichever neighbor is closer to the query. Repeat until no neighbor is closer—you have arrived.

This is approximate search: you may not find the true global nearest neighbor. Instead, you find a local nearest neighbor—the best one reachable by greedily following edges. But because every node is connected to its nearby neighbors, and those neighbors connect to their neighbors, you can usually reach the true global nearest in a few hops.
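The greedy walk can be sketched on a single flat layer. Everything here is a toy assumption: 100 two-dimensional points on a grid, a brute-force neighbor graph with M = 4, and made-up function names. A grid is deliberately friendly, so the walk reaches the true nearest point; on real embeddings it can stop early, which is the approximation described above.

```python
import math

def build_knn_graph(points, m):
    """Link every point to its m nearest neighbours (brute force; fine for a toy)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in others[:m]]
    return graph

def greedy_search(points, graph, query, entry):
    """Hop to the neighbour closest to the query until no neighbour improves."""
    current, hops = entry, 0
    while True:
        best = min(graph[current], key=lambda j: math.dist(points[j], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current, hops  # no neighbour is closer: we have arrived
        current, hops = best, hops + 1

points = [(x, y) for x in range(10) for y in range(10)]  # 100 points on a grid
graph = build_knn_graph(points, m=4)
query = (7.2, 3.1)

found, hops = greedy_search(points, graph, query, entry=0)
exact = min(range(len(points)), key=lambda j: math.dist(points[j], query))
print(points[found], hops, found == exact)
```

The walk only ever measures distance to the current node's handful of neighbors, never to the whole set.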

On the AskDocs example: instead of 2,000,000 comparisons, you make roughly 1,800 comparisons across maybe 15–20 hops. That's 0.09% of the work. Latency drops from 1,240 ms to 2 ms.

The H: Hierarchy and Layer Skip

One flat graph with M neighbors per node has a problem: a greedy walk might need hundreds of hops to cross the graph. Start at one end, need to reach the other; your neighbors are all locally nearby, so you take tiny steps.

HNSW solves this with layers—a skip-list-like idea:

Layer L (top, sparse): Contains only a fraction of vectors, wired with long-range links. Think highways connecting cities 500 km apart.
Layer L-1 (denser): More vectors, tighter links. Main roads connecting towns 50 km apart.
Layer 0 (bottom, densest): Every single vector, linked to its M nearest neighbors. Local streets.

Search descends:

Start at the entry point on the top layer.
Greedily walk until no neighbor is closer.
Drop to the next layer and walk again from your current position.
Repeat until layer 0.

On layer 0, you refine the answer among nearby vectors. The higher layers got you in the right neighborhood fast; the lower layers get you to the exact street.
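The descent above can be sketched with three hand-built layers over the numbers 0 to 99. This is a toy under stated assumptions: the points are one-dimensional, each layer is a simple chain, and layer membership is regular (every 25th, every 5th, every node), whereas real HNSW assigns nodes to layers at random.

```python
def chain(nodes):
    """Link each node to its left and right neighbour within one layer."""
    graph = {}
    for i, node in enumerate(nodes):
        graph[node] = [nodes[j] for j in (i - 1, i + 1) if 0 <= j < len(nodes)]
    return graph

def greedy_walk(graph, query, start):
    """Walk one layer until no neighbour is closer; return (node, hops)."""
    current, hops = start, 0
    while True:
        best = min(graph[current], key=lambda n: abs(n - query))
        if abs(best - query) >= abs(current - query):
            return current, hops
        current, hops = best, hops + 1

def layered_search(layers, query, entry):
    """Run the greedy walk on each layer, top to bottom, carrying the position down."""
    current, total_hops = entry, 0
    for graph in layers:
        current, hops = greedy_walk(graph, query, current)
        total_hops += hops
    return current, total_hops

layers = [
    chain(list(range(0, 100, 25))),  # top: highways
    chain(list(range(0, 100, 5))),   # middle: main roads
    chain(list(range(100))),         # layer 0: every node
]

found, hops = layered_search(layers, query=62.3, entry=0)
flat_found, flat_hops = greedy_walk(layers[-1], 62.3, 0)
print(found, hops, flat_hops)
```

Both searches end on node 62, but the layered one gets there in 6 hops while the flat walk on layer 0 alone takes 62.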

Result: 15–20 hops instead of hundreds. And the number of hops is logarithmic in the dataset size.

Two Tuning Knobs: M and ef_search

HNSW has two main hyperparameters:

M: Edges per node (default ~16). Higher M = higher recall, but more memory and slower insertions.

ef_search: The size of a candidate list kept during search. The algorithm explores the graph, keeping the ef_search best candidates seen so far. Higher ef_search = more of the graph explored = higher recall but slower search.

You set ef_search at query time, not build time. This lets you tune recall vs. latency per query. A strict recall requirement? Raise ef_search. A latency deadline? Lower it.
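To see what the candidate list buys you, here is a sketch of a best-first search that keeps at most ef results. The four-node graph is hand-built and contrived on purpose (an assumption for illustration, with one-dimensional positions) so that the path to the true nearest node runs through a neighbor that looks worse at first.

```python
import heapq

def search_layer(points, graph, query, entry, ef):
    """Best-first search keeping at most `ef` results; returns [(distance, node)], nearest first."""
    def dist(node):
        return abs(points[node] - query)

    visited = {entry}
    to_explore = [(dist(entry), entry)]  # min-heap: closest unexplored node first
    best = [(-dist(entry), entry)]       # max-heap via negation: worst kept result on top
    while to_explore:
        d, node = heapq.heappop(to_explore)
        if d > -best[0][0]:
            break  # the closest unexplored node is worse than everything we kept
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            if len(best) < ef or dist(nb) < -best[0][0]:
                heapq.heappush(to_explore, (dist(nb), nb))
                heapq.heappush(best, (-dist(nb), nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the worst to stay within ef
    return sorted((-neg, node) for neg, node in best)

points = {0: 0.0, 1: 6.0, 2: 4.0, 3: 9.5}        # node -> position
graph = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}   # node 3 is reachable only through node 2

narrow = search_layer(points, graph, query=10.0, entry=0, ef=1)
wide = search_layer(points, graph, query=10.0, entry=0, ef=2)
print(narrow[0], wide[0])
```

With ef=1 the search discards node 2 and settles on node 1, at distance 4. With ef=2 it keeps node 2 around long enough to discover node 3, at distance 0.5. Same graph, same query, one knob.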

The Catch: Approximation Is a Tradeoff

HNSW is approximate, not exact. A greedy walk can settle into a local minimum—the best neighbor you can reach from your current position, but not the global best. This is especially likely in sparse regions of the embedding space or when M is small.

Accuracy is measured as recall: the fraction of true nearest neighbors found. A recall of 0.99 means 99% of your top-10 results would have appeared in a brute-force top-10.
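Measured in code, recall is just set overlap against the brute-force answer. The IDs below are made up for illustration.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

exact_top10 = list(range(10))              # what brute force returns
approx_top10 = list(range(9)) + [42]       # the index missed one true neighbour

print(recall_at_k(approx_top10, exact_top10))
```

Averaging this over a sample of real queries gives the recall figure you tune M and ef_search against.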

Recall is tunable:

Raise M → more edges per node → more paths to the true nearest → higher recall, higher memory, slower build.
Raise ef_search → larger candidate set during search → more graph explored → higher recall, slower query.

You cannot have perfect recall at perfect speed. You choose your point on the recall-latency curve. Most production systems run at 95–99% recall to stay sub-10ms; some (e.g., recommendation systems) drop to 90% recall, accepting a small recall penalty to save roughly 0.5 ms per query.

The Payoff

Brute force on 2,000,000 vectors:

2,000,000 comparisons
~1,240 ms
100% recall
O(n) scaling

HNSW with M=16, ef_search=200:

~1,800 comparisons
~2 ms
~99% recall
O(log n) scaling

You trade 1% accuracy for a 600× speedup. In practice, that 1% is almost never noticed by users—the wrong answer is so close in meaning space that it is functionally equivalent.

One Table: HNSW vs. Alternatives

Approach	Lookups (2M vectors)	Latency	Recall	Scaling	Memory Overhead
Brute Force	2,000,000	~1,240 ms	100%	O(n)	Minimal
IVFFlat (clustering)	~100,000	~60 ms	~98%	O(n) worst-case	Low
HNSW (hierarchical graph)	~1,800	~2 ms	~99%	O(log n)	Moderate (M × n edges)

HNSW for LLM Inference and RAG

In Retrieval-Augmented Generation (RAG), you embed a user's prompt and search a knowledge base to fetch relevant context before passing it to an LLM. Latency here is TTFT (time-to-first-token): every millisecond spent searching is a millisecond the user waits before the model starts generating.

HNSW is essential because:

Vector databases store millions of chunks (customer docs, code, research papers). Brute-force search would add 1+ seconds per query—unacceptable for interactive LLM chat.
HNSW keeps retrieval under 10 ms, so retrieval adds almost nothing to TTFT and the bottleneck shifts to the LLM itself (prefill, then TPOT, time-per-output-token).
Approximate recall is fine here. An LLM is robust to slightly off-topic context; 99% recall is indistinguishable from 100% in practice.

When building a RAG pipeline, choose HNSW (via FAISS, Qdrant, Weaviate, or pgvector) over brute-force search; your TTFT will then be bound by the LLM, not the retriever.

Tools That Use HNSW

FAISS (Meta, open-source): low-level vector search library; HNSW is one of many indices.
hnswlib (Yu. Malkov, open-source): the reference HNSW implementation; often embedded in other databases.
pgvector (Postgres extension): HNSW available via CREATE INDEX ... USING hnsw.
Qdrant (vector database): HNSW is the default index type.
Weaviate (vector database): supports HNSW alongside other structures.
Milvus (open-source vector database): HNSW available as an index option.

Verdict

Reach for brute-force search when you have <100k vectors and latency is not a constraint (offline analytics, one-time batch jobs).

Reach for HNSW when you have millions of vectors, need sub-10ms latency, and can tolerate 1–5% recall loss (RAG, recommendation systems, real-time search).

Watch the 90-second reel on YouTube to see this in motion: the walk down the layers, the greedy hops, the latency ticking from 1,240 ms down to 2 ms.

RAG vs Fine-tuning

Vahid Aghajani — Wed, 15 Jul 2026 07:35:10 +0000

Originally published on software-engineer-blog.com.

The single most expensive misconception in applied AI right now is that fine-tuning teaches a model your documents. It doesn't — and entire GPU budgets get torched on this one mistake. The real split is clean, but almost nobody frames it this way: knowledge versus behaviour.

Mental model: RAG is a library you hand the model at query time; fine-tuning is teaching the model a writing style.

The confusion

You have a set of internal documents. You want the model to answer questions about them. Two paths show up:

Fine-tune the model on those documents.
Use RAG (retrieval-augmented generation).

Most teams pick option 1, expecting the model to "know" the docs. Six weeks later, on GPU credits they can't get back, they realize the model still hallucinates. Then they find out about RAG. Then they argue about which one to use.

The argument ends when you stop conflating knowledge with behaviour.

RAG: Knowledge at query time

RAG doesn't change the model at all. It changes the prompt.

Here's the flow:

Your documents live in a search index (typically a vector database: Pinecone, Weaviate, Milvus, or a simpler inverted index).
A user asks a question.
You retrieve the K most relevant chunks from that index.
You paste those chunks into the prompt, above or below the user's question.
The model reads them and answers.
The weights never change.
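The flow above fits in a few lines once you strip it to its skeleton. In this sketch, word overlap stands in for vector search, the three chunks are made-up examples, and the prompt template is an assumption rather than a recommended format. There is no model call: the point is that everything RAG does happens to the prompt.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, k=2):
    """Rank chunks by words shared with the question (a stand-in for vector search)."""
    q = tokens(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Paste the retrieved chunks above the question; the model is untouched."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

chunks = [  # hypothetical indexed documents
    "The API rate limit is 1000 requests per minute.",
    "Refunds are processed within 14 days.",
    "Support is available on weekdays.",
]
question = "What is the API rate limit?"

retrieved = retrieve(question, chunks)
prompt = build_prompt(question, retrieved)
print(prompt)
```

Change the first chunk to say 2000 and the next prompt carries the new number. Nothing was retrained.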

Because the weights never change:

Sources are citable. The model can say "according to page 47 of your policy manual."
Facts stay current. When your pricing changes, you reindex the new document. You don't retrain.
Hallucination is local. If retrieval fails (wrong chunk fetched), you get a confident wrong answer from that chunk. The failure is visible — you can debug the retrieval pipeline.

The cost of RAG is the pipeline itself. You now own:

A chunking strategy (how do you split documents so the model can read them?)
Embeddings (how do you represent each chunk as a vector?)
A vector store (where do chunks live, and how do you search them?)
Re-ranking or filtering (do the top K results actually matter?)
Latency and token overhead (every prompt now includes retrieved chunks).

The weak link is retrieval. Fetch the wrong chunk and the model answers confidently from it. This is operationally messier than it sounds: your monitoring has to watch for semantic drift in retrieval quality, not just model accuracy. A change to your embedding model can silently degrade recall.

Fine-tuning: Behaviour, not knowledge

Fine-tuning changes the model's weights. You feed it thousands of input-output pairs, and the model adjusts its parameters to predict those outputs given those inputs.

Fine-tuning teaches style. Examples:

"Always output valid JSON, never markdown."
"Use our internal terminology: 'customer mandate' not 'contract'."
"Never refuse; reframe instead."
"Your tone is formal, clinical, concise."

These patterns — the structural regularities in your training data — get absorbed into the weights. The model learns to emit outputs that match the shape of your examples.

Fine-tuning does not teach new facts. This is the misconception that matters.

Why? Because facts are not patterns. A fact is a specific piece of information: "Our API rate limit is 1000 requests per minute." When you fine-tune on documents containing that fact, the model doesn't store the fact. It learns correlations: tokens near "rate limit" tend to be followed by numbers in a certain range. Those correlations smear across the weight matrix. The model still hallucinates. It still gets the limit wrong half the time. And when you change the limit to 2000 requests per minute, there is no weight to edit — you have to retrain.

Because weights are opaque, the model can't cite the source. It can't distinguish between what it learned during pre-training and what it learned during fine-tuning. Everything is probability.

Side-by-side: when each breaks down

Dimension	RAG	Fine-tuning
What you're teaching	New knowledge (facts, data)	Consistent behavior (style, format, tone)
Do the model's weights change?	No	Yes
Can the model cite sources?	Yes (if retrieval includes source metadata)	No
How do you update when facts change?	Reindex (days, sometimes hours)	Retrain (weeks, GPU-intensive)
Hallucination risk if retrieval fails	High (wrong chunk, confident wrong answer)	High (weights encode fuzzy patterns, not facts)
Cost per inference	Higher (chunk tokens in prompt)	Lower (no retrieval overhead)
Operational complexity	Retrieval pipeline, embedding drift, chunk quality	Training infrastructure, data labeling, version control

The LLM inference lens

If you're running a serving system, RAG and fine-tuning hit your latency budget differently.

RAG adds latency to time to first token (TTFT). The retrieval call (embedding the query, vector search, maybe re-ranking) happens before you send the prompt to the model. With roughly 100 ms to embed the query, plus the vector search and any re-ranking, you're looking at 150–300 ms added to TTFT before the model sees a token. The retrieved chunks then lengthen the prompt, which adds prefill time and increases the time per output token (TPOT) because the KV cache is larger.

Fine-tuning shifts latency to training time (offline). Inference is faster — shorter prompts, no retrieval. But you pay for retraining whenever behavior needs to change.

If you need both low latency and up-to-date facts, RAG is the only option. If you can tolerate retraining cycles, fine-tuning for behavior + a smaller RAG pipeline (for critical facts only) can reduce TTFT.

The verdict

Reach for RAG when: Your knowledge changes (documents, prices, policies, product specs). You need sources. You want to debug failures. You can afford the retrieval pipeline.

Reach for fine-tuning when: Your model's behaviour must stay consistent (output format, tone, terminology, refusal strategy). You're not adding facts; you're teaching a style.

If you need both: Fine-tune the behaviour first. Then wrap the fine-tuned model in a RAG pipeline that retrieves facts. Fine-tuning should not carry the burden of knowledge management — it will fail at that job, and you'll waste time and GPU budget figuring out why.

Watch the 90-second reel for the quick framing.

bcrypt vs SHA-256: Why a Password Hash Should Be Slow on Purpose

Vahid Aghajani — Mon, 13 Jul 2026 14:04:40 +0000

Originally published on software-engineer-blog.com.

"We hash our passwords" is the sentence that ends most security reviews. It should start them — because the next question is the one that actually matters: with what?

Both bcrypt and SHA-256 are one-way hashes. Both turn hunter2 into a fixed-length blob you can't read backwards. And yet, when the database leaks, only one of them holds.

The difference is speed. And it's the opposite of what your instincts say.

SHA-256: a brilliant hash, doing the wrong job

Let's be clear about something up front, because the internet gets this wrong constantly: SHA-256 is not broken.

It's one-way. It's collision-resistant. It's the hash under git commits, TLS certificates, Bitcoin, and every checksum you've ever verified. It is a genuinely excellent cryptographic primitive, and it was designed with one dominant goal: be fast. Hash a gigabyte of file in a blink. Hash a million records without noticing.

import hashlib

h = hashlib.sha256(b"hunter2").hexdigest()
# f52fbd32b2b3b86ff88ef6c490628285f482af15ddcb29541f94bcf526a3f6c7

Same input, same output. Every time. Which is exactly the property you want for a checksum — and exactly the property that kills you here.

Attackers don't reverse your hash

This is the mental model flip. Nobody is sitting there trying to invert SHA-256. That's hard, and they don't need to.

They guess. They take a wordlist — the top 10 million leaked passwords, dictionary words, Summer2024! and its ten thousand cousins — hash each guess, and compare against your dump. It's a race, and the only thing that limits them is how many guesses per second the hardware can do.

With SHA-256, that number is obscene. A single consumer GPU chews through roughly 10,000,000,000 SHA-256 hashes per second. Ten billion. Per second. On one card.

Your leaked user table doesn't survive the weekend. It survives hours.

And without a salt, it's worse

SHA-256 has no salt built in. If two of your users pick the same password, they get the same hash, sitting right next to each other in the dump:

alice@corp.com   f52fbd32b2b3b86f...
bob@corp.com     f52fbd32b2b3b86f...   ← same hash = same password
carol@corp.com   f52fbd32b2b3b86f...   ← this one's popular, crack it once

Crack it once, own three accounts. Worse still, an attacker doesn't even have to do the work — they can precompute a giant table of hash → password once (a rainbow table) and then just look yours up. The cracking cost drops to a database join.
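To make that concrete, here is a toy version of the lookup against a made-up dump. The users, the four-word wordlist, and dave's password are all hypothetical; a real wordlist has millions of entries, but the mechanics are identical.

```python
import hashlib

def sha256_hex(password):
    return hashlib.sha256(password.encode()).hexdigest()

# A made-up leaked table of unsalted SHA-256 password hashes.
leaked = {
    "alice@corp.com": sha256_hex("hunter2"),
    "bob@corp.com": sha256_hex("hunter2"),
    "carol@corp.com": sha256_hex("hunter2"),
    "dave@corp.com": sha256_hex("correct-horse-7f3a91c2"),  # not in any wordlist
}

wordlist = ["123456", "password", "hunter2", "Summer2024!"]

# Precompute hash -> password once; it works against every unsalted dump.
lookup = {sha256_hex(word): word for word in wordlist}

# The "database join": no cracking per user, just a dictionary hit.
cracked = {user: lookup[h] for user, h in leaked.items() if h in lookup}
print(cracked)
```

One hash of hunter2 falls out as three accounts. Only the password that was never in the wordlist survives, which is the gap a per-password salt and a slow hash are meant to close for everyone else.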

bcrypt: a hash that inverts every property on purpose

bcrypt isn't a general-purpose hash. It's a password hash, and it was designed by someone who had already thought through everything above. It takes SHA-256's virtues and deliberately throws them away.

1. It's slow. That's the entire point. bcrypt is built around a deliberately expensive key-setup step. It cannot be made fast, and it resists the GPU parallelism that makes SHA-256 cracking so cheap.

2. Every hash carries its own random salt. You don't manage it, you don't store it in a second column — bcrypt generates it per password and writes it inside the hash string. Same password, two users, two completely different hashes. Rainbow tables die instantly.

3. The cost is a dial you control. That's the 12 below — the work factor. Cost 12 means 2¹² = 4,096 rounds of key setup. Each +1 doubles the work, forever. Hardware gets faster? Bump the number.

import bcrypt

# Hashing — the salt is generated for you and baked into the output
hashed = bcrypt.hashpw(b"hunter2", bcrypt.gensalt(rounds=12))
# $2b$12$eImiTXuWVxfM37uY4JANjQuwoNw2NvJ2ZbcFTz4dGrvQoOHnBrOZK
#  │   │  └─ the salt, right there in the string
#  │   └──── cost factor: 12
#  └──────── algorithm: bcrypt

# Verifying — no salt column to look up, it's already in the hash
bcrypt.checkpw(b"hunter2", hashed)   # True

Read that output format for a second, because it's the elegant part: the algorithm, the cost, and the salt all travel with the hash. You can raise the cost factor next year and old hashes still verify — they carry their own instructions.
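Because the cost travels inside the string, you can read it back with nothing but string handling. This sketch uses only the standard library; the function names and the cost-10 hash are hypothetical, and it assumes the hash has been decoded to str (the bcrypt package returns bytes).

```python
def stored_cost(bcrypt_hash):
    """Read the cost factor out of a bcrypt hash string like '$2b$12$...'."""
    _empty, _version, cost, _salt_and_digest = bcrypt_hash.split("$", 3)
    return int(cost)

def needs_rehash(bcrypt_hash, target_cost):
    """True when a stored hash was made with a lower cost than you now want."""
    return stored_cost(bcrypt_hash) < target_cost

current = "$2b$12$eImiTXuWVxfM37uY4JANjQuwoNw2NvJ2ZbcFTz4dGrvQoOHnBrOZK"
older = "$2b$10$eImiTXuWVxfM37uY4JANjQuwoNw2NvJ2ZbcFTz4dGrvQoOHnBrOZK"  # hypothetical cost-10 hash

print(stored_cost(current), needs_rehash(current, 12), needs_rehash(older, 12))
```

A check like this only makes sense right after a successful login, because that is the one moment you hold the plaintext password needed to hash it again at the higher cost.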

What that does to the attacker

At cost 12, one hash takes roughly 250 milliseconds. For a user logging in, that's imperceptible. For someone with your entire database and a rack of GPUs, it's a wall:

	SHA-256	bcrypt (cost 12)
Guesses/sec (1 GPU)	~10,000,000,000	~5,000
Salt	None by default	Per-password, automatic
Identical passwords	Identical hashes	Different hashes
Rainbow tables	Effective	Useless
Tunable over time	No	Yes (cost factor)
Time to crack a leak	Hours	Centuries
Right job	Integrity, signatures, HMAC	Passwords

Same leak. Same hardware. Same passwords. Hours versus centuries — and the only thing that changed is that you picked a hash that refuses to hurry.

What bcrypt actually costs you (it isn't free)

Anyone who sells you bcrypt as a pure win is skipping the invoice. There are three real costs, and all three have bitten production systems.

1. You pay the 250 ms on every single login. That's CPU, on your servers, per authentication. It's fine at a trickle. But a Monday-morning login storm — or a credential-stuffing bot hammering /login — turns a traffic spike into a CPU spike, and your own auth endpoint becomes the DoS. The fix isn't to lower the cost until it stops hurting; it's to rate-limit login attempts and size the box for the peak.

2. The work factor is a knob you have to keep turning. A cost that was painful for attackers in 2015 is comfortable for them now. The number isn't set-and-forget — it's a budget: pick the highest cost that keeps you around ~250 ms on your hardware, and re-measure every couple of years. (Because old hashes carry their own cost factor, you can upgrade lazily: on a successful login, if the stored cost is below your current target, re-hash and store.)

3. bcrypt silently truncates past 72 bytes. This one is a genuine footgun. Feed bcrypt a long passphrase and everything beyond byte 72 is ignored — no error, no warning. Two different 80-character passphrases sharing a 72-byte prefix will happily verify against each other. If you encourage long passphrases (you should), you need to know this.
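The truncation is easy to model without calling bcrypt at all. This sketch only reproduces the 72-byte rule described above; the helper names and passphrases are made up, and how a given bcrypt library version handles over-long input can differ, so check yours.

```python
BCRYPT_MAX_BYTES = 72

def bcrypt_effective_input(passphrase):
    """The bytes that actually influence the hash under 72-byte truncation."""
    return passphrase.encode("utf-8")[:BCRYPT_MAX_BYTES]

def exceeds_bcrypt_limit(passphrase):
    """True when part of the passphrase would be ignored."""
    return len(passphrase.encode("utf-8")) > BCRYPT_MAX_BYTES

prefix = ("correct horse battery staple " * 3)[:72]  # 72 shared characters
passphrase_a = prefix + "AAAAAAAA"                   # 80 characters
passphrase_b = prefix + "BBBBBBBB"                   # 80 characters, different ending

print(passphrase_a != passphrase_b)
print(bcrypt_effective_input(passphrase_a) == bcrypt_effective_input(passphrase_b))
```

Note that the limit counts bytes, not characters, so a passphrase with accented letters or emoji hits it sooner. A guard like exceeds_bcrypt_limit lets you reject or handle long input deliberately instead of finding out later.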

The escape hatch: if any of that makes you nervous, reach for argon2id instead. It's the modern recommendation — no truncation, and it's tunable on memory as well as time, which makes GPU and ASIC attacks even more expensive. bcrypt is fine, battle-tested, and everywhere; argon2id is what you'd pick starting fresh today. Either one is a correct answer. SHA-256 is not.

The AI-era version of the same mistake

Here's where this gets freshly relevant, because the same decision shows up in every LLM app being built right now — and the right answer flips.

Your AI product ships an API. Every inference request arrives with an API key (sk-live-9f3a...), and you have to check it against the database on every call. So: bcrypt, right? Slow is safe, we just established that.

No. Do that and you've bolted 250 ms of CPU onto every single request to your model endpoint — on an endpoint whose whole selling point may be a sub-second time-to-first-token. You've made your auth layer slower than your LLM, and you've handed anyone with a load generator a trivial way to melt your gateway.

The reason it flips is the thing bcrypt was compensating for in the first place: entropy.

A password is chosen by a human. It's low-entropy, it's guessable, it's in a wordlist. bcrypt's slowness exists to make guessing uneconomical.
An API key is generated by you, from a CSPRNG, with 256 bits of randomness. It is not in any wordlist. There is nothing to guess. Ten billion guesses per second against a 256-bit random key is still, functionally, forever.

So for high-entropy secrets you've generated yourself — API keys, session tokens, password-reset tokens, webhook signatures — the fast hash is the correct hash. Store SHA-256(key), look it up by that digest on every request, and compare in constant time. Fast lookup, nothing sensitive at rest, no CPU tax on your hot path.

import hashlib, secrets

# Issue: 256 bits from a CSPRNG. Show it to the user exactly once.
raw_key = "sk-live-" + secrets.token_urlsafe(32)

# Store: only the fast digest. A leaked table of these is worthless —
# there's no wordlist for 256 random bits.
key_digest = hashlib.sha256(raw_key.encode()).hexdigest()

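# Verify, on every inference request: hash the presented key, then one indexed lookup.
presented_digest = hashlib.sha256(raw_key.encode()).hexdigest()
stored_digest = key_digest  # stand-in for the digest your lookup returns from the table
# Constant-time compare to avoid leaking the digest byte-by-byte.
is_valid = secrets.compare_digest(presented_digest, stored_digest)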

Same two algorithms. Opposite verdict. Because the question was never "which hash is stronger" — it was always "what is the attacker's cheapest path in?" For a human-chosen password, that path is guessing, so you make guessing slow. For a 256-bit random token, that path doesn't exist, so you optimize for the thing that does matter: throughput.

That's the real skill. Not memorizing "bcrypt good, SHA-256 bad" — but knowing which threat you're actually paying to defend against.

The verdict

The whole lesson compresses into one line:

Storing user passwords? → bcrypt (cost ~12, budget the 250 ms, rate-limit your login endpoint, mind the 72-byte limit) — or argon2id if you're starting fresh.
Integrity, checksums, git objects, digital signatures, HMAC, hashing high-entropy API keys? → SHA-256, and don't feel bad about it for a second. That's the job it was built for, and it's superb at it.

SHA-256 isn't broken. It never was. It's just the wrong tool for this one job — and "we hashed it" was never the same sentence as "it's safe."

So: which one is in your users table?
