Danis @ Artain

Posted on May 20

Embedding 685 million texts in 32 minutes

#rust #machinelearning #performance #opensource

My embedding pipeline used to take many hours. I'd kick it off, go do other work, come back, find a problem in the output, fix it, run it again. A single version of my semantic pipeline took three or four full runs before I was happy with it. That's a full working day burned on plumbing before the actual model work even starts.

Most embedding infrastructure is built to serve requests. I needed something different: a file-to-file engine that could keep eight GPUs busy until the corpus was done.

Now it takes 32 minutes. 357,893 msg/s sustained on 8x A100s, 2.6x faster than TEI in the reproducible benchmark, $6.75 in spot compute for the full production run.

Why I needed this

I'm building a behavioral intelligence model that needs semantic representations of public text. Think LinkedIn's "People You May Know," but generalized to open questions about a population: who's emerging in the Rust scene, which ideas are gaining momentum among the people building the next decade.

The corpus starts as billions of raw messages. After filtering bots, deleted accounts, and noise, hundreds of millions of usable texts remain. Every one of them has to be embedded before the model ever sees it.

The part that bit me was iteration. You don't run an embedding pipeline once. You run it, look at the output, realize your filtering is wrong, fix it, run again. Change the model, run again. A single "version" is three to five full passes before you're happy with it. Whatever your wall-clock time is, multiply by about four. That's the real cost.

What I tried first

I started with a Python pipeline running multilingual-e5-small on 8x A100s. PyTorch with BF16 for inference, HuggingFace tokenizers, manual multi-GPU dispatch, JSONL in and .npy out. This was already heavily optimized: stock SentenceTransformer.encode() only gets about 2,500 msg/s on a single A100 (CPU tokenization bottlenecks the whole thing), so the custom batching and dispatch were doing 14x that out of the box. The first fully optimized run came in around 35K msg/s, 3+ hours wall clock on the full corpus. Over the next few weeks I pushed it to ~45K. That was the ceiling.

The deeper problem only showed up after four full runs. Every new dataset wanted its own optimization pass: different file shapes, different size distributions, different bottlenecks. The throughput was bad. Instability made it worse. I'd push it to 45K on one corpus, add a new source, and watch it regress to 37K. I was spending time re-tuning a pipeline that was too slow even when perfectly tuned.

I spent a day on landscape research: docs, design notes, GitHub issues, back-of-envelope throughput math. The closest off-the-shelf option was HuggingFace's Text Embeddings Inference (TEI), the obvious choice for embedding throughput. But TEI is a server. Every batch goes through an HTTP interface, and the architecture is built around answering API calls, not chewing through files on disk. OpenAI: $932 per pass, around $4K to validate a single pipeline change. Triton: another serving stack, gRPC overhead, complex deployment. candle: Rust, native, right direction, but not the TensorRT-style compiled inference path I needed.

After that pass, I stopped thinking of this as a "faster inference" problem. Everything I found assumed you were serving embeddings to clients. I was trying to produce them from a corpus. Different problem, different architecture.

What I needed was physical and boring: read hundreds of huge compressed files, extract text, embed every line, and write output with every CPU core pinned, every byte of RAM doing useful work, and every GPU saturated. No idle hardware anywhere in the chain. If I'm paying for an 8-GPU box by the hour, every component should be earning its keep.

Hitting the wall

On the last day with Python I threw the kitchen sink at it. Five attempts, four failures and one tiny win:

Multi-stream CUDA. Theory: threads release the GIL during the GPU forward pass, so data prep overlaps across streams. Reality: tensor creation, .to(device), and the tokenizer wrappers all hold the GIL for ~100ms between launches. Only one or two GPUs were ever active at once. No measurable gain.
torch.compile. CUDA graphs are shape-specific, so every new padded length triggered a multi-minute recompilation. Some runs came in nearly 3x slower than baseline. 64 simultaneous compilations across 8 GPUs hung a 96-vCPU machine for over an hour. Net negative.
ONNX Runtime with the TensorRT execution provider. ORT needs specific CUDA libraries on LD_LIBRARY_PATH, but PyTorch installs CUDA under non-standard pip paths ORT can't see. ORT silently fell back to its CPU execution provider (100x slower, indistinguishable from a hang) and I lost an hour before catching it.
SDPA / Flash Attention. Already on by default in transformers 4.48. No improvement.
Double-buffer pipeline. Tokenize the next chunk while inference runs on the current one. A real 10% gain. The only thing that moved the needle.
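
For illustration, the double-buffer idea from the last item can be sketched in a few lines of Python. This is a minimal sketch, not the code from the original pipeline: double_buffered, tokenize, and infer are hypothetical names, and the overlap only pays off to the extent that both callables release the GIL while they work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

C = TypeVar("C")  # raw chunk
T = TypeVar("T")  # tokenized chunk
R = TypeVar("R")  # inference result

_DONE = object()  # marks an exhausted input iterator


def double_buffered(
    chunks: Iterable[C],
    tokenize: Callable[[C], T],
    infer: Callable[[T], R],
) -> Iterator[R]:
    """Yield infer(tokenize(chunk)) for each chunk, in order.

    While infer() runs on chunk i, tokenize() for chunk i+1 is already
    running on a single background thread.
    """
    it = iter(chunks)
    first = next(it, _DONE)
    if first is _DONE:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(tokenize, first)
        for chunk in it:
            current = pending.result()              # wait for the prepared chunk
            pending = pool.submit(tokenize, chunk)  # start preparing the next one
            yield infer(current)                    # inference overlaps with it
        yield infer(pending.result())               # last chunk, nothing to prefetch


if __name__ == "__main__":
    # Stand-ins: "tokenize" splits on whitespace, "infer" counts tokens.
    for n in double_buffered(["a b", "c d e", "f"], str.split, len):
        print(n)
```

Only one chunk is prepared ahead of time, so memory stays bounded at two chunks in flight. A deeper prefetch queue would be a different trade-off.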

That night I wrote a retrospective. The diagnosis was structural, not tunable. The inference tooling available in Python is good at single-process model calls and latency-sensitive serving. Batch embedding at scale (millions of texts, eight GPUs, throughput-only) is neither. The GIL serializes every CPU-side operation. Library boundaries create version-coupling problems. Abstractions hide failures. For a 33M-parameter model where CPU work is 25-30% of total time, no amount of tuning gets past that ceiling. You have to leave Python.

What changed when I left Python

The constraint wasn't inference speed. It was system boundaries. In Python, I was orchestrating GPU inference through layers of abstraction: framework APIs, runtime sessions, serialization between processes. Each boundary adds latency that's easy to miss in a profile but obvious when GPUs are waiting for work.

Rust gave me one process controlling everything: file reading, tokenization, batching, multi-GPU dispatch, inference, and output. No GIL. No IPC. No abstraction between "prepare the batch" and "run inference." When you own the whole path in one address space, you can keep hardware saturated in ways that no amount of Python optimization allows.

I set targets before writing code: 3x conservative, 5x optimistic. Then I built the smallest thing that could prove or kill the idea. One long push: start single-threaded, measure, parallelize one stage, measure again. Started at 10K msg/s, hit 196K by dinner. Past the optimistic target.

Then I profiled ONNX Runtime's session.run(): 160ms per call, only 10ms was actual GPU compute. Wrote a C++ wrapper around TensorRT's native API, called it from Rust via FFI. With correct batch sizing: 265K msg/s on 8x A100. 5x the Python baseline.
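
To put those profile numbers in proportion, here is the arithmetic as a small sketch. It assumes calls on a given GPU are issued back to back with no overlap, which the profile above does not state; the function names are hypothetical.

```python
def gpu_busy_fraction(call_ms: float, compute_ms: float) -> float:
    """Share of each call during which the GPU is actually computing."""
    return compute_ms / call_ms


def ideal_speedup(call_ms: float, compute_ms: float) -> float:
    """Upper bound on per-call speedup if all non-compute overhead vanished."""
    return call_ms / compute_ms


if __name__ == "__main__":
    call_ms, compute_ms = 160.0, 10.0  # figures quoted in the paragraph above
    print(f"GPU busy: {gpu_busy_fraction(call_ms, compute_ms):.2%}")   # 6.25%
    print(f"Ceiling:  {ideal_speedup(call_ms, compute_ms):.0f}x per call")  # 16x
```

The 16x figure is a ceiling on a single call under that assumption, not a prediction of end-to-end throughput.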

The prototype proved the architecture. I committed a clean break: moved it to prototype/ and started the real engine from scratch. Next day: proper library plus thin CLI. Pointed it at the full corpus and watched it sustain 245K msg/s for over an hour. Three days to prove it could be done. One more to make it reusable. The remaining week and a half was optimization, hardening, and benchmarking.

The bottleneck nobody warns you about

At 8 GPUs, when you're pushing 250K+ messages per second through inference, the bottleneck moves to the CPU. Not a metaphor. Literally.

250K tokenization calls per second on the CPU side. If you can't keep up, GPUs idle. You're paying for 8 A100s and they're waiting for work.
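
A back-of-envelope sketch of what that rate means per core. It assumes the 96-vCPU machine mentioned earlier and, optimistically, that every vCPU does nothing but tokenize; both are assumptions for illustration, and the function names are hypothetical.

```python
def per_core_rate(msgs_per_s: float, cores: int) -> float:
    """Messages each core must tokenize per second to keep up."""
    return msgs_per_s / cores


def per_message_budget_us(msgs_per_s: float, cores: int) -> float:
    """CPU time available per message on one core, in microseconds."""
    return cores * 1e6 / msgs_per_s


if __name__ == "__main__":
    rate, cores = 250_000, 96
    print(f"{per_core_rate(rate, cores):.0f} msg/s per core")          # 2604
    print(f"{per_message_budget_us(rate, cores):.0f} us per message")  # 384
```

Any core spent on reading, decompression, batching, or writing shrinks that budget further.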

Model	1 GPU (msg/s)	8 GPU (msg/s)	Scaling
e5-small-v2	50,127	253,578	5.1x
multilingual-e5-small	41,842	253,950	6.1x
multilingual-e5-base	17,319	122,169	7.1x
multilingual-e5-large	5,610	41,668	7.4x

multilingual-e5-large gets 7.4x scaling from 8 GPUs. e5-small-v2 only gets 5.1x. With a small fast model, the GPUs finish their work before the CPU can prep the next batch. With a larger model, inference is slow enough that the CPU stays ahead easily.
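
The scaling column can be recomputed from the table, along with how close each model gets to linear scaling. A small sketch using only the table's numbers; "efficiency" here simply means scaling divided by GPU count, and the names are hypothetical.

```python
# Throughput in msg/s, copied from the table above: (1 GPU, 8 GPUs).
THROUGHPUT = {
    "e5-small-v2": (50_127, 253_578),
    "multilingual-e5-small": (41_842, 253_950),
    "multilingual-e5-base": (17_319, 122_169),
    "multilingual-e5-large": (5_610, 41_668),
}
N_GPUS = 8


def scaling(one_gpu: float, n_gpu: float) -> float:
    """Multi-GPU throughput relative to single-GPU throughput."""
    return n_gpu / one_gpu


def efficiency(one_gpu: float, n_gpu: float, gpus: int = N_GPUS) -> float:
    """Fraction of perfect linear scaling actually achieved."""
    return scaling(one_gpu, n_gpu) / gpus


if __name__ == "__main__":
    for model, (one, eight) in THROUGHPUT.items():
        print(f"{model:24s} {scaling(one, eight):.1f}x  "
              f"{efficiency(one, eight):.0%} of linear")
```

On these numbers e5-small-v2 reaches roughly 63% of linear scaling and multilingual-e5-large roughly 93%, which is the CPU-bound versus GPU-bound split described above.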

The fix was architectural. Lock-free queues between every pipeline stage, one process driving all GPUs:

reader → tokenizer → batcher → GPU workers → writer

Every stage concurrent. While the tokenizer prepares batch N+1, the GPUs run batch N, and the writer flushes batch N-1 to disk. The work after the prototype was keeping GPUs fed well enough that they're never sitting idle.
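
The stage layout above can be sketched as a chain of threads connected by bounded queues. This is only a toy to show the topology and the back-pressure: it uses Python's lock-based queue.Queue, whereas IgniteMS is a Rust process with lock-free queues. run_pipeline, the stage functions, and the depth parameter are all hypothetical, and there is one worker per stage rather than one per GPU.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, List

_STOP = object()  # end-of-stream marker passed down the chain


def _drain(in_q: queue.Queue) -> Iterator:
    """Yield items from in_q until the end-of-stream marker arrives."""
    while True:
        item = in_q.get()
        if item is _STOP:
            return
        yield item


def _feed(items: Iterable, out_q: queue.Queue) -> None:
    """Push items downstream; put() blocks when the bounded queue is full."""
    try:
        for item in items:
            out_q.put(item)
    finally:
        out_q.put(_STOP)  # always tell the next stage we are done


def _run_stage(stage: Callable, in_q: queue.Queue, out_q: queue.Queue) -> None:
    _feed(stage(_drain(in_q)), out_q)


def run_pipeline(source: Iterable, stages: List[Callable], depth: int = 4) -> list:
    """Run every stage in its own thread and collect the final output.

    Each stage is a function from an iterator of inputs to an iterator of
    outputs. Error propagation and shutdown are deliberately left out.
    """
    q = queue.Queue(maxsize=depth)
    threads = [threading.Thread(target=_feed, args=(source, q))]
    for stage in stages:
        nxt = queue.Queue(maxsize=depth)
        threads.append(threading.Thread(target=_run_stage, args=(stage, q, nxt)))
        q = nxt
    for t in threads:
        t.start()
    results = list(_drain(q))  # the caller plays the role of the writer
    for t in threads:
        t.join()
    return results


def tokenize(lines: Iterable[str]) -> Iterator[List[str]]:
    for line in lines:
        yield line.split()  # stand-in for a real tokenizer


def batch(size: int) -> Callable:
    def _batch(items: Iterable) -> Iterator[list]:
        buf = []
        for item in items:
            buf.append(item)
            if len(buf) == size:
                yield buf
                buf = []
        if buf:
            yield buf  # flush the final partial batch
    return _batch


def embed(batches: Iterable[list]) -> Iterator[List[float]]:
    for b in batches:
        yield [float(len(tokens)) for tokens in b]  # stand-in for GPU inference


if __name__ == "__main__":
    print(run_pipeline(["a b", "c", "d e f"], [tokenize, batch(2), embed]))
```

The bounded queues are what provide back-pressure: a fast stage blocks on put() instead of buffering the whole corpus in memory while a slower stage catches up.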

What IgniteMS does differently

IgniteMS is not an embedding server. It's a batch engine.

It owns the entire path: reading compressed input, tokenization, batching, multi-GPU dispatch, TensorRT inference, and writing output. No HTTP. No container-per-GPU stack. No inter-process communication. One process, lock-free, file-to-file.

This is the boring part that matters. At this scale, the overhead between stages is the product. Every serialization boundary, every context switch, every idle millisecond waiting for work compounds across hundreds of millions of texts. When you own the whole pipeline in one address space, those overheads stop dominating the run.

The numbers

Reproducible benchmark (1M MSMARCO passages, e5-small-v2, 8x A100 p4d):

Engine	1 GPU	8 GPU
IgniteMS	50,127 msg/s	253,578 msg/s
Hugging Face TEI	16,648 msg/s	96,492 msg/s
SentenceTransformers	~2,500 msg/s	n/a

3x faster than TEI per GPU. 2.6x faster across all 8. Reproducible with python3 benchmark.py on the same hardware, data, and model.

Production run (real workload, not synthetic):

685,520,494 messages embedded
357,893 msg/s sustained (506K peak, 196K low on dense files)
31.9 minutes wall clock
1x p4d.24xlarge (8x A100 40GB), spot instance

Real social-media text. Variable length, multiple languages, messy formatting.

Cost: $6.75 at spot pricing. Embedding the same number of texts with OpenAI text-embedding-3-small: $932. That's $0.01 per million messages vs $1.36 per million.
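
The wall-clock and per-million figures follow directly from the numbers above. A small sketch of that arithmetic, using only values quoted in this post; the names are hypothetical.

```python
MESSAGES = 685_520_494       # messages embedded in the production run
THROUGHPUT = 357_893         # sustained msg/s
SPOT_COST_USD = 6.75         # full run at spot pricing
OPENAI_COST_USD = 932.0      # same volume with text-embedding-3-small


def wall_clock_minutes(messages: int, msgs_per_s: float) -> float:
    """Run time implied by a sustained throughput."""
    return messages / msgs_per_s / 60


def usd_per_million(total_usd: float, messages: int) -> float:
    """Cost normalized to one million messages."""
    return total_usd / (messages / 1_000_000)


if __name__ == "__main__":
    print(f"{wall_clock_minutes(MESSAGES, THROUGHPUT):.1f} min")              # 31.9 min
    print(f"${usd_per_million(SPOT_COST_USD, MESSAGES):.2f} per million")    # $0.01
    print(f"${usd_per_million(OPENAI_COST_USD, MESSAGES):.2f} per million")  # $1.36
    print(f"{OPENAI_COST_USD / SPOT_COST_USD:.0f}x cost ratio")              # 138x
```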

What it doesn't do

Batch only. IgniteMS is for offline corpus embedding. If you need low-latency request/response serving today, use TEI, Triton, or another serving stack.
NVIDIA only. TensorRT. No AMD, no Apple Silicon.
Slow first run. TRT compiles engines for your GPU on first use (~5 min). Cached after that.
Tested on A100. Should work on Ampere+, haven't verified smaller GPUs.
HuggingFace encoders only. Anything that exports to ONNX with standard pooling. Tested with e5 family.

What this unlocks

At 35K msg/s, every run was a commitment. I'd plan my day around it. I'd avoid reprocessing unless I absolutely had to. I'd batch fixes: change three things, run once, hope they all worked. When they didn't I'd lose another day untangling which change broke what.

At 358K msg/s, embedding is a build step. I run it, check the result, act on it. If the output looks wrong I fix the issue and rerun in 30 minutes. I test ideas I wouldn't have tried before because the cost of being wrong is $7.

Embedding stops driving the schedule. It becomes another stage in the pipeline.

Try it

# Quickstart: downloads a dataset, embeds it, writes output
python3 quickstart.py

# Reproduce the benchmark against TEI
python3 benchmark.py --model e5-small-v2 --gpu-counts 1,8

# Production
docker run --rm --gpus all \
  -v "$PWD/data:/data" \
  -v ignite-ms-cache:/cache \
  ghcr.io/artain-ai/ignite-ms:latest \
  embed \
  --model intfloat/e5-small-v2 \
  --input /data/input.jsonl \
  --output /data/embeddings.npy \
  --cache-dir /cache \
  --gpus all

Input: JSONL or plain text (.zst and .gz supported). Output: .npy or .parquet.

If your embedding workload is measured in files, shards, or corpora instead of API requests, this is what IgniteMS is built for.

If you run it on another encoder, I'd like to see the numbers. BGE, GTE, and newer multilingual models are next on my list.

Disclosure: I used AI tools to help edit and tighten this post. The system, benchmark runs, production numbers, and technical claims are mine and were checked against run logs.

Apache 2.0. Full engine, no gated features.

GitHub: https://github.com/Artain-AI/ignite-ms