Embeddings on the Edge: sentence-transformers vs Hosted APIs

Gabriel Anhaia


A team I talked to last quarter was paying around eleven thousand dollars a month, by their account, to embed product reviews on text-embedding-3-small. Roughly two hundred million chunks, refreshed weekly. Their on-call engineer ran a spike on BGE-large-en-v1.5 with text-embeddings-inference on a single H100. He came back two days later. Same recall on their eval set, as he told it. Approximately seventy dollars a day in GPU time on a spot instance. The same week, a friend at a four-person startup did the opposite migration. Ripped out their self-hosted MiniLM container, moved everything to OpenAI, and watched the bill drop from about a thousand dollars in GPU time to ninety in tokens, by his numbers.

Both teams were right. The default answer to "local or hosted embeddings?" is "it depends on your scale, your data, and what your team already operates." True and useless. This post is the honest version of "it depends."

What counts as local in 2026

Hosted: text-embedding-3-small, text-embedding-3-large, Voyage 3 / 3 Lite, Cohere Embed v4, Gemini Embedding 2. You POST text, you get a vector, you pay per million input tokens.
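The integration surface really is that small. A minimal sketch with the OpenAI Python client (the other hosted providers look roughly the same; the query string is made up, and the key is assumed to be in your environment):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["wireless headphones under $100"],
)
vector = resp.data[0].embedding  # list of 1536 floats for this model
```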

Local: sentence-transformers on top of an open backbone like BAAI/bge-large-en-v1.5, BAAI/bge-m3, nomic-ai/nomic-embed-text-v2, or the small workhorses (all-MiniLM-L6-v2, bge-small-en at 384 dims). For production you wrap it in text-embeddings-inference (TEI) or Baseten's BEI. vLLM also exposes an embedding mode. Some teams still hand-roll FastAPI plus an ONNX-quantised checkpoint. The model file is a download. The infra is yours.

Pricing snapshot, end of April 2026

Verified on 2026-04-29.

| Provider / model | Cost per 1M input tokens | Dimensions | Notes |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | $0.02 | 1536 (Matryoshka) | Batch tier 50% off |
| OpenAI text-embedding-3-large | $0.13 | 3072 (Matryoshka) | Batch tier 50% off |
| Voyage voyage-3-large | $0.12 | 1024 | Tops retrieval-leaning MTEB |
| Voyage voyage-3-lite | $0.02 | 512 | Direct competitor to 3-small |
| Cohere Embed v4 | $0.12 | 1536 / 256 | Multimodal (text + image) |
| Self-hosted BGE-large | ~$0 marginal | 1024 | You pay for the GPU |

Sources: OpenAI new embedding models post, Voyage pricing, Cohere pricing, Google Gemini embedding pricing. Prices drop roughly every six months; the comparison shape stays.

On the MTEB leaderboard as of 2026-04-29, Voyage 3 Large leads the retrieval slice and NV-Embed-v2 leads the overall average. BGE-M3 is the strongest open multilingual option you can download today. all-MiniLM-L6-v2 is still serviceable for English-only retrieval at 384 dims, and it runs free on the CPU you already own — though it does not beat the frontier.

The crossover formula

The right comparison is total cost at your monthly volume, not "per token" against "per hour."

Hosted is direct: T million tokens per month at price p costs T * p. With text-embedding-3-small at $0.02 and a corpus producing 500M tokens of new chunks plus 50M of query traffic, you pay 550 * 0.02 = $11/month. Hosted wins at small and medium scale by a wide margin.

Local is the GPU plus the ops. A reasonable setup is one or two L4 instances for low-to-medium traffic, or an H100 (or A100) for high traffic. Rough estimates based on public AWS spot pricing as of 2026-04-29 (see Vantage's spot history for live numbers): an L4 lands around $0.40–$0.70/hour, an H100 around $2–$3/hour. Always-on monthly works out to roughly $400 for the L4 and $1,800 for the H100. Add SRE time, a second region, and 20% for when the spot pool runs dry.

A note before the math: the GPU numbers below are optimistic. Real production runs two replicas and autoscales on weekends; spot pool drains push you to on-demand during outages. Double the GPU side to stay honest, then read the crossover figures.

Crossover is the volume T_break where T_break * p = monthly_GPU_cost.

```
text-embedding-3-small at $0.02/1M tokens
  vs always-on L4 at ~$400/month:
  T_break = 400 / 0.02 = 20B tokens/month

text-embedding-3-large at $0.13/1M tokens
  vs always-on L4 at ~$400/month:
  T_break = 400 / 0.13 ≈ 3.1B tokens/month

voyage-3-large at $0.12/1M tokens
  vs always-on H100 at ~$1,800/month:
  T_break = 1800 / 0.12 = 15B tokens/month
```
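The same arithmetic as a throwaway helper, if you want to plug in your own GPU quote and token price (the numbers below just mirror the scenarios above, not a benchmark):

```python
def crossover_m_tokens(gpu_monthly_usd: float, price_per_m_tokens_usd: float) -> float:
    """Monthly volume, in millions of tokens, where hosted cost equals the GPU bill."""
    return gpu_monthly_usd / price_per_m_tokens_usd

print(crossover_m_tokens(400, 0.02))   # 20000.0 M -> 20B tokens/month
print(crossover_m_tokens(400, 0.13))   # ~3077 M   -> ~3.1B tokens/month
print(crossover_m_tokens(1800, 0.12))  # 15000.0 M -> 15B tokens/month
```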

You need billions of tokens per month before the GPU bill beats the cheapest hosted model. The crossover for text-embedding-3-large is roughly a sixth of the small model's; the pricier the hosted model, the sooner the GPU pays for itself.

Hosted wins until you are pushing billions of tokens per month. Below that, you run your own embedder for latency, residency, or predictable cost — not for the dollar figure.

Where local actually wins

Four cases. Usually about something other than cost.

Latency-critical retrieval. A round trip to OpenAI from us-east-1 is typically 30–100 ms based on community latency reports, and slower the rest of the time on the kind of "degraded latency for some customers" days you see on every hosted-AI status page. A local all-MiniLM-L6-v2 on a CPU returns an embedding in roughly 5–15 ms for a single short sentence on a modern x86 CPU (figures vary heavily by sequence length and hardware). If retrieval lives in the user's typing path (autocomplete, instant suggestions, search-as-you-type), those 50 ms matter, and the variance hurts more than the median. Hermes IDE's local-context indexer hit this. The network round trip wrecked perceived snappiness even when the API was fast, so we landed on a small int8-quantised open model behind the editor.
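Do not take my latency figures on faith either; the local baseline is a five-line measurement. A sketch (the query string is made up, and the first call is excluded because it pays one-time model setup):

```python
import time

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")
model.encode(["warmup"])  # first call includes one-time model/tokenizer setup

start = time.perf_counter()
model.encode(["wireless headphones under $100"])
print(f"single-sentence embed: {(time.perf_counter() - start) * 1000:.1f} ms")
```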

Air-gapped or regulated data. Hospitals, banks, government tenants on air-gapped networks. Data does not leave the perimeter. Ask the vendor for a BAA and a private-link endpoint and wait six weeks for legal, or run BGE-large on a GPU you already own. A data-residency decision, not a cost one.

Very high steady-state throughput. Hundreds of millions of tokens per day on a mostly-static corpus. The GPU bill is fixed; the hosted bill scales linearly. Above the crossover, the GPU pays for itself in weeks. Search-engine builders, code-search products with tens of millions of files, and content platforms that re-embed the whole corpus whenever the model ships a minor version all land here.

Predictable cost. A CFO can plan around a known monthly GPU bill. Token pricing is fine until the product goes viral and the next bill is several times the previous one. Local flattens the curve at the cost of provisioning headroom.

What looks like a local win but is not: "I want to fine-tune the embeddings." Keep text-embedding-3-large as the embedder and train a thin reranker on top of it against a curated triplets dataset. You do not need to own the embedder to own the relevance.
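A sketch of that shape using an off-the-shelf cross-encoder from sentence-transformers; the model name here is a stock checkpoint standing in for the one you would fine-tune on your own triplets:

```python
from sentence_transformers import CrossEncoder

# Stock cross-encoder as a placeholder for your fine-tuned reranker.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score every (query, candidate) pair, then keep the best top_k.
    scores = reranker.predict([(query, c) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: -float(scores[i]))
    return [candidates[i] for i in order[:top_k]]
```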

Where hosted wins (most of the time)

Most teams below the crossover should pick hosted, especially when there is no GPU ops budget. Niche queries are where this gets interesting: right now Voyage 3 Large and text-embedding-3-large both sit above almost any open model on retrieval-leaning MTEB tasks, and that gap matters when your use case falls inside their lead. Multilingual is similar: if the team is too small to evaluate bge-m3 honestly against the hosted competitors, hosted is the safer default.

Hosted also wins on two ops dimensions most comparison tables miss. The model upgrade is somebody else's problem, and so is the recall regression test that comes with it. When OpenAI eventually ships the next embedding model, you get a config change. When the BGE team drops v2, you get a re-index plus the maintenance window that goes with it. That work is real.

A real serving setup

If you run local in production, the small end of "good" looks like one process on the GPU box, behind your HTTP stack, with a content-hashed cache in front.

```python
import os
from typing import Sequence

import numpy as np
from sentence_transformers import SentenceTransformer

# Model name is overridable via the EMBED_MODEL environment variable,
# so a different checkpoint can be tested without a code change.
MODEL_NAME = os.environ.get(
    "EMBED_MODEL",
    "BAAI/bge-large-en-v1.5",
)

_model = SentenceTransformer(MODEL_NAME, device="cuda")


def embed(
    texts: Sequence[str],
    batch_size: int = 64,
) -> np.ndarray:
    # Returns L2-normalized vectors, so a downstream dot product
    # and cosine similarity agree.
    return _model.encode(
        list(texts),
        batch_size=batch_size,
        normalize_embeddings=True,
        convert_to_numpy=True,
        show_progress_bar=False,
    )
```

normalize_embeddings=True matters because an inner-product index over un-normalized vectors is a footgun nobody catches in code review; normalized, dot product and cosine agree. batch_size=64 is a starting point because BGE-large saturates an H100 well below batch-256 and a too-large batch hurts online latency. device="cuda" because the default CPU path on a 1024-dim model is much slower than people expect.

For real production, swap this for text-embeddings-inference running the same checkpoint. TEI handles dynamic batching and CUDA-graph optimisation that hand-rolled sentence-transformers does not. Hugging Face's TEI benchmarks have reported approximately 450 req/s on bge-base-en-v1.5 on a single A10G at 512-token sequence length (TEI v1.x as of April 2026; check the repo for current numbers). Larger GPUs and dynamic batching push that further, but the exact multiplier depends on sequence length and batch settings, so measure on your own workload before you plan capacity off a published number.
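Calling TEI from application code is a plain HTTP POST to its /embed route. A minimal sketch; the URL and timeout are deployment-specific assumptions:

```python
import requests

TEI_URL = "http://localhost:8080/embed"  # wherever your TEI container listens

def embed_via_tei(texts: list[str]) -> list[list[float]]:
    resp = requests.post(TEI_URL, json={"inputs": texts}, timeout=30)
    resp.raise_for_status()
    return resp.json()  # one vector per input, in order
```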

A pragmatic decision tree

Starting today on a small or medium RAG corpus? Hosted. text-embedding-3-small or voyage-3-lite for cheap; text-embedding-3-large or voyage-3-large when retrieval quality matters more than spend. Cache against (model_name, model_version, content_hash) so re-embeds do not multiply the bill.
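One way to wire that cache key, sketched against a plain dict; swap in Redis or your KV of choice, and the version string is whatever you stamp on deploys:

```python
import hashlib
from typing import Callable, Sequence

def cache_key(model_name: str, model_version: str, text: str) -> str:
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_name}:{model_version}:{content_hash}"

def embed_cached(
    texts: Sequence[str],
    cache: dict,  # stand-in for Redis or similar
    embed_fn: Callable[[list[str]], list],
    model_name: str,
    model_version: str,
) -> list:
    # Only embed texts whose (model, version, content) key is missing.
    keys = [cache_key(model_name, model_version, t) for t in texts]
    misses = [(k, t) for k, t in zip(keys, texts) if k not in cache]
    if misses:
        vectors = embed_fn([t for _, t in misses])
        for (k, _), v in zip(misses, vectors):
            cache[k] = v
    return [cache[k] for k in keys]
```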

If retrieval lives in a user-facing latency budget under 100 ms end-to-end, prototype with all-MiniLM-L6-v2 on the application machine. Know the local baseline before you choose.

Data that cannot leave the network? The choice was made for you. Run BGE-M3 or Nomic Embed v2 on the GPU you already operate.

Monthly token budget in the billions on a mostly-static corpus? Bake-off: hosted text-embedding-3-large versus bge-large-en-v1.5 and bge-m3 on your own eval set. Whichever ships acceptable recall at the lower total cost wins. Cost has to include SRE time, not just the GPU.

Most teams skip the bake-off and pick by reflex. Local embeddings usually land within roughly 2–5% of the hosted frontier on English retrieval per the MTEB retrieval slice as of April 2026, and on niche domains the gap can flip. A weekend of evaluation is cheaper than a year on the wrong default.
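The core of that bake-off harness is small. A sketch of recall@k over pre-computed, L2-normalized vectors (brute-force dot product, which is fine at eval-set scale):

```python
import numpy as np

def recall_at_k(
    query_vecs: np.ndarray,    # (n_queries, dim), L2-normalized
    doc_vecs: np.ndarray,      # (n_docs, dim), L2-normalized
    relevant: list[set[int]],  # relevant doc indices per query
    k: int = 10,
) -> float:
    # On normalized vectors, cosine similarity is just a dot product.
    sims = query_vecs @ doc_vecs.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    per_query = [
        len(set(row.tolist()) & rel) / max(len(rel), 1)
        for row, rel in zip(topk, relevant)
    ]
    return float(np.mean(per_query))
```

Run it once per candidate model over the same queries and corpus, and the 2–5% claim above becomes a number for your data.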


If this is useful

The RAG Pocket Guide walks the retrieval stack end to end: chunking, embedding-model selection, index choice, reranking, and the eval discipline that makes any of this measurable. The embeddings chapter has the longer version of this post, including the bake-off harness, the dimension-truncation trade for Matryoshka models, and the patterns for swapping models without invalidating an index.

