DEV Community

Nasit Sony
I Built a Complete AI Infrastructure Stack from Scratch — Here's What I Learned

Most AI projects start at the top of the stack.

You grab an LLM API, wire up a vector database, build a RAG pipeline, and ship. That works — until it doesn't. Until your training job crashes at hour 6. Until your inference cache fills up and nobody knows why. Until a worker dies mid-processing and your embeddings are corrupted.

I wanted to understand what happens below the API layer. So I built the whole thing from scratch.


The Stack

Over the past few months I built four interconnected systems that form a complete AI infrastructure stack:

VeriStore          → Storage layer (WAL, Raft, crash recovery)
      ↓
llm-serving-cache  → Inference serving (KV cache, GPU memory, routing)
      ↓
Veriflow           → Workload orchestration (training jobs, checkpoints, GPU scheduling)
      ↓
SmartSearch        → AI data pipeline (async ingestion, Kafka, RAG, fault tolerance)

Each layer builds on the one before it. Each solves a real problem I kept running into. And each taught me something I couldn't have learned from reading documentation.


Layer 1 — VeriStore: How Data Actually Survives Crashes

GitHub: https://github.com/NasitSony/VeriStore

The first question I wanted to answer: how does data survive a crash?

Not "what does the documentation say" — but what actually happens at the byte level when a process dies mid-write.

VeriStore is a correctness-first key-value storage engine in C++ built from first principles:

  • Write-Ahead Log (WAL) — every write is logged before being applied. On crash, the log is replayed deterministically.
  • CRC validation — partial or torn writes are detected and ignored.
  • Group commit batching — instead of fsyncing every write, writes are batched. This improved throughput by ~2.7× in benchmarks.
  • Snapshot + compaction — periodic snapshots eliminate the need for full log replay on restart.
  • Raft consensus replication — a 3-node cluster with leader election, majority-based commit, and follower catch-up after crashes.
  • Mini S3-style object storage — built on top of the KV engine with chunked writes, prefix listing, and mark-and-sweep garbage collection.
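
To make the WAL and CRC bullets concrete, here is a minimal Python sketch of the idea (VeriStore itself is C++). The record layout (4-byte length, 4-byte CRC32, payload) and the names `append_record` and `replay` are assumptions made for illustration, not VeriStore's actual format.

```python
import os
import struct
import zlib

# Assumed record layout for this sketch: length (uint32), CRC32 (uint32), payload.
HEADER = struct.Struct("<II")


def append_record(path: str, payload: bytes) -> None:
    """Append one checksummed record and force it to stable storage."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())


def replay(path: str) -> list[bytes]:
    """Return every intact record, stopping at the first torn or corrupt one."""
    with open(path, "rb") as f:
        data = f.read()
    records = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # partial or torn write: ignore it and everything after it
        records.append(payload)
        offset = start + length
    return records
```

Replay is deterministic because it depends only on the bytes in the log: the same file always yields the same list of records.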

What I learned

fsync is expensive, but skipping it is dangerous. Group commit is the right tradeoff: batch writes, then fsync once at the batch boundary. RocksDB, PostgreSQL, and etcd all batch their log flushes in some form.
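
A rough Python sketch of the group commit idea follows. The class name `GroupCommitLog`, the `batch_size` knob, and the `fsync_calls` counter are hypothetical and exist only to make the tradeoff visible. A real engine would also hold back each writer's acknowledgement until the batch containing its record has been fsynced.

```python
import os


class GroupCommitLog:
    """Buffers appended records and issues one fsync per batch."""

    def __init__(self, path: str, batch_size: int = 8):
        self._file = open(path, "ab")
        self._batch_size = batch_size  # hypothetical tuning knob
        self._pending = 0
        self.fsync_calls = 0  # counted only to show how many fsyncs were paid

    def append(self, record: bytes) -> None:
        self._file.write(record)
        self._pending += 1
        if self._pending >= self._batch_size:
            self.commit()

    def commit(self) -> None:
        """Make every buffered record durable with a single fsync."""
        if self._pending == 0:
            return
        self._file.flush()
        os.fsync(self._file.fileno())
        self.fsync_calls += 1
        self._pending = 0

    def close(self) -> None:
        self.commit()
        self._file.close()
```

With a batch size of 8, appending 20 records costs 3 fsyncs instead of 20.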

The WAL commit point is everything. An object is valid only if its metadata is committed. This single rule makes crash recovery deterministic — you either have the commit record or you don't.

Raft is simpler than it sounds, but the edge cases are brutal. Leader crash during log replication, follower log divergence, split-brain scenarios — each required careful handling.


Layer 2 — llm-serving-cache: Where Does the KV Cache Live?

GitHub: https://github.com/NasitSony/llm-serving-cache

LLM inference is expensive. The prefill step (processing the prompt) is a cost you pay on every request, even for a prompt the system has already processed. If you've seen the same prompt before, you shouldn't have to recompute it.

llm-serving-cache is a control-plane service that tracks where cached attention prefixes live across distributed inference nodes and routes requests to maximize cache reuse.
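
The routing idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the project's code: prompts are token-ID lists, a cached prefix is reusable only when the prompt starts with all of it, and the `route` function and the node names are hypothetical.

```python
def route(prompt_tokens: list[int], cached_prefixes: dict[str, list[list[int]]]):
    """Pick the node holding the longest cached prefix of the prompt.

    Returns (node, reused_token_count), or (None, 0) when nothing matches.
    """
    best_node, best_len = None, 0
    for node, prefixes in cached_prefixes.items():
        for prefix in prefixes:
            n = len(prefix)
            # Reusable only if the prompt begins with the entire cached prefix.
            if n > best_len and prompt_tokens[:n] == prefix:
                best_node, best_len = node, n
    return best_node, best_len
```

A request with no match would fall back to whatever default placement policy the control plane uses.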

Key results from benchmarks:

Scenario        Avg Latency    Hit Rate
No Cache        1405 ms        0%
Prefix Reuse    985 ms         50%
Exact Cache     205 ms         100%
GPU-Aware       843 ms         25%

Exact cache reuse reduces latency by ~85% compared to no cache.

The system models GPU memory as discrete blocks and uses best-fit placement to minimize fragmentation. Under memory pressure, it evicts the oldest inactive requests and retries allocation before rejecting.
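
Here is one way the block model could look in Python. It is a sketch under assumptions: memory is a set of free contiguous extents, best-fit means picking the smallest extent that fits, and the `BlockAllocator` class with its methods is hypothetical rather than taken from the repo.

```python
from collections import OrderedDict


class BlockAllocator:
    def __init__(self, total_blocks: int):
        self.free = [(0, total_blocks)]  # free extents as (start, length)
        self.allocs = OrderedDict()      # request_id -> [start, length, active], oldest first

    def _take_best_fit(self, n: int):
        """Carve n blocks out of the smallest free extent that can hold them."""
        fits = [e for e in self.free if e[1] >= n]
        if not fits:
            return None
        start, length = min(fits, key=lambda e: e[1])
        self.free.remove((start, length))
        if length > n:
            self.free.append((start + n, length - n))
        return start

    def _release(self, start: int, length: int) -> None:
        """Return an extent to the free list and merge adjacent extents."""
        merged = []
        for s, l in sorted(self.free + [(start, length)]):
            if merged and merged[-1][0] + merged[-1][1] == s:
                merged[-1] = (merged[-1][0], merged[-1][1] + l)
            else:
                merged.append((s, l))
        self.free = merged

    def mark_inactive(self, request_id: str) -> None:
        self.allocs[request_id][2] = False

    def allocate(self, request_id: str, n: int) -> bool:
        """Try best-fit; under pressure evict oldest inactive requests, then retry."""
        while True:
            start = self._take_best_fit(n)
            if start is not None:
                self.allocs[request_id] = [start, n, True]
                return True
            victim = next((r for r, a in self.allocs.items() if not a[2]), None)
            if victim is None:
                return False  # nothing left to evict: reject
            s, l, _ = self.allocs.pop(victim)
            self._release(s, l)
```

Active requests are never evicted in this sketch, so a full cache of active requests leads to rejection rather than preemption.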

I also validated this against a real Ollama backend running Llama 3.1 8B:

  • Cold request: ~8,488 ms
  • Warm request (same prompt): ~5,520 ms
  • Prompt eval dropped from 177 ms to 47 ms on warm requests

What I learned

Cache hits matter enormously for prefill, but decode dominates total latency. A warm request still takes ~5.5 seconds because token generation is slow regardless of caching. Real serving optimization needs to address decode efficiency too.

Admission control is more important than caching. Accepting all requests under load causes queue growth and latency explosion. Rejecting excess load with a hard concurrency limit keeps tail latency controlled.
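
A hard concurrency limit can be as small as a non-blocking semaphore. The sketch below is illustrative; the `AdmissionController` name and its methods are hypothetical, and a real server would map a rejection to an error response such as HTTP 429.

```python
import threading


class AdmissionController:
    """Admits at most max_concurrent requests; rejects the rest immediately."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_admit(self) -> bool:
        # Non-blocking: an over-limit request is rejected instead of queued.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Call once for every admitted request when it finishes.
        self._slots.release()
```

Rejecting instead of queueing is the point: the queue never grows, so admitted requests keep a bounded wait.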

Single-request latency is misleading. At concurrency=10, P95 latency was 53.5 seconds — nearly 3× the single-request time. Production serving systems need batching, scheduling, and admission control, not just cache reuse.


Layer 3 — Veriflow: Treating Training Jobs as Distributed Systems

GitHub: https://github.com/NasitSony/veriflow-control-plane

The pain that started everything: training jobs that crash at hour 6 with no checkpoint, no retry, and no idea why.

Veriflow is a Kubernetes-based job orchestrator that treats AI training as what it actually is — a distributed systems problem.

The key insight: checkpoints need to be first-class citizens.

Most job runners treat AI training like a simple script: run it, and if it fails, restart from zero. Veriflow models job lifecycle as a state machine with checkpoint-aware retry:

JOB_SUBMITTED → RUN_CREATED → POD_RUNNING
→ CHECKPOINT_SAVED            ← checkpoint URI persisted
→ RUN_FAILED                  ← something went wrong
→ RETRY_TRIGGERED             ← scheduler picks it up
→ TRAINING_RESUMED            ← resumes from checkpoint
→ JOB_SUCCEEDED
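
The lifecycle above can be written as an explicit transition table. The state names come from the diagram; the exact set of allowed transitions, the `Job` class, and the example checkpoint URI are assumptions made for this Python sketch.

```python
# Allowed transitions. The states are from the lifecycle above; the precise
# edges (e.g. repeated checkpoints) are an assumption of this sketch.
RUNNING_NEXT = {"CHECKPOINT_SAVED", "RUN_FAILED", "JOB_SUCCEEDED"}
ALLOWED = {
    "JOB_SUBMITTED": {"RUN_CREATED"},
    "RUN_CREATED": {"POD_RUNNING"},
    "POD_RUNNING": RUNNING_NEXT,
    "CHECKPOINT_SAVED": RUNNING_NEXT,
    "RUN_FAILED": {"RETRY_TRIGGERED"},
    "RETRY_TRIGGERED": {"TRAINING_RESUMED"},
    "TRAINING_RESUMED": RUNNING_NEXT,
    "JOB_SUCCEEDED": set(),
}


class Job:
    def __init__(self):
        self.state = "JOB_SUBMITTED"
        self.checkpoint_uri = None  # latest persisted checkpoint, if any

    def transition(self, new_state: str, checkpoint_uri: str = None) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        if new_state == "CHECKPOINT_SAVED":
            if checkpoint_uri is None:
                raise ValueError("CHECKPOINT_SAVED requires a checkpoint URI")
            self.checkpoint_uri = checkpoint_uri
        self.state = new_state

    def resume_point(self):
        """Where a retry should resume from; None means start from scratch."""
        return self.checkpoint_uri
```

Because the checkpoint URI is stored on the job record at the moment of the transition, a retry never has to guess where the last good state lives.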

The scheduler uses FOR UPDATE SKIP LOCKED in Postgres for concurrency-safe job claiming — tested with two concurrent scheduler instances processing 20 burst-submitted jobs with zero duplicate dispatches.
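
For readers who have not used it, a job-claiming query built on SKIP LOCKED might look like the sketch below. The `jobs` table, its columns, and the `QUEUED`/`RUNNING` status values are hypothetical, and the `%s` placeholder assumes a psycopg-style DB-API connection; Veriflow's actual schema may differ.

```python
# Hypothetical schema: jobs(id, status, claimed_by, created_at).
CLAIM_SQL = """
UPDATE jobs
SET status = 'RUNNING', claimed_by = %s
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'QUEUED'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""


def claim_next_job(conn, scheduler_id: str):
    """Atomically claim the oldest queued job, or return None if none is free.

    Rows locked by another scheduler's in-flight claim are skipped rather
    than waited on, so two schedulers never receive the same job.
    """
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, (scheduler_id,))
        row = cur.fetchone()
    conn.commit()
    return row[0] if row else None
```

The select and the update happen in one statement inside one transaction, which is what removes the window for duplicate dispatch.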

GPU-aware placement matches jobs to nodes by GPU type, count, and memory requirements using best-fit allocation. Queue-level fairness and quota enforcement prevent one greedy queue from monopolizing the cluster.

What I learned

FOR UPDATE SKIP LOCKED is underrated. Most people reach for Redis or a dedicated queue for concurrent job processing. Postgres with SKIP LOCKED handles it correctly — and you get transactions and consistency for free.

The scheduler is a control plane, not a cron job. A cron fires and forgets. A control plane continuously reconciles desired state with actual state. This distinction is what makes checkpoint-aware recovery possible.

Checkpoint URIs should be in your job spec from day one. Treating them as an afterthought means you'll always restart from scratch when things go wrong.


Layer 4 — SmartSearch: What Happens When the Pipeline Breaks?

GitHub: https://github.com/NasitSony/SmartSearch

Most RAG demos show the happy path: ingest document, generate embeddings, search, return results.

SmartSearch asks a different question: what happens when things fail?

  • What if the worker crashes mid-processing?
  • What if Kafka replays messages?
  • What if the database goes down?
  • What if duplicate requests arrive?

The system is built to handle these scenarios deterministically:

  • Idempotent ingestion — duplicate Kafka messages don't create duplicate embeddings, enforced via unique constraints
  • Job lifecycle state machine: PENDING → PROCESSING → READY | FAILED, no hidden progress
  • Bounded retries + DLQ — failed jobs retry with limits, then go to a dead letter queue
  • Full observability — Prometheus + Grafana dashboards for pipeline pressure, retry rates, and processing age
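
The idempotent ingestion bullet is easy to demonstrate with a unique constraint. The Python sketch below uses in-memory SQLite as a stand-in so it runs anywhere; the `embeddings` table, its columns, and `ingest_chunk` are hypothetical names, and on Postgres the equivalent statement would use `ON CONFLICT DO NOTHING`.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE embeddings (
            doc_id      TEXT    NOT NULL,
            chunk_index INTEGER NOT NULL,
            vector      BLOB    NOT NULL,
            UNIQUE (doc_id, chunk_index)
        )
        """
    )
    return conn


def ingest_chunk(conn, doc_id: str, chunk_index: int, vector: bytes) -> bool:
    """Insert a chunk once; a replayed message hits the constraint and is skipped.

    Returns True if a row was written, False if it was a duplicate.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO embeddings (doc_id, chunk_index, vector) VALUES (?, ?, ?)",
        (doc_id, chunk_index, vector),
    )
    conn.commit()
    return cur.rowcount == 1
```

Pushing deduplication into the database means the guarantee holds no matter which worker handles the replayed message.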

What I learned

At-least-once + idempotency is the right default. Exactly-once semantics in Kafka are possible but complex. At-least-once delivery with idempotent writes gives this pipeline the same end result (each document is embedded once, however many times its message is delivered) with far less operational complexity.

Processing age is the most important metric nobody talks about. Latency tells you how fast things are going. Processing age tells you how much work is piling up. A rising processing age means your pipeline is falling behind — before latency spikes make it obvious.
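
One simple way to compute it, assuming "processing age" means the age of the oldest job that has not finished yet (my reading for this sketch; the function name and the use of Unix timestamps are likewise assumptions):

```python
import time


def oldest_processing_age(enqueued_at: list[float], now: float = None) -> float:
    """Age in seconds of the oldest unfinished job, or 0.0 if the queue is empty.

    enqueued_at holds the enqueue timestamps (Unix seconds) of every job
    that is still pending or in progress.
    """
    if now is None:
        now = time.time()
    return max((now - t for t in enqueued_at), default=0.0)
```

Exported as a gauge, this value climbs steadily while the pipeline falls behind, even if per-job latency still looks healthy.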

The visibility invariant matters. A document is searchable if and only if its state is READY. This single rule prevents partial visibility and makes the system's behavior predictable under any failure scenario.


The Bigger Picture

Building these four systems taught me something that documentation never could:

Every layer of the AI stack is a distributed systems problem.

  • Storage is about durability and consensus
  • Inference serving is about routing and resource management
  • Workload orchestration is about scheduling and fault recovery
  • Data pipelines are about correctness under partial failure

Most AI engineers work at the top of this stack and treat the layers below as black boxes. That works until scale, failure, or cost forces the question: what's actually happening down there?

Understanding these layers doesn't just make you a better infrastructure engineer. It makes you better at every layer above — because you understand what guarantees you can actually rely on, and what you need to handle yourself.


What's Next

  • Demo GIFs for all four projects
  • Distributed control plane for llm-serving-cache (Raft-backed metadata)
  • Web UI for Veriflow job monitoring
  • Exactly-once semantics for SmartSearch (Kafka transactions)

If you found this useful, all four repos are on GitHub. Stars and feedback welcome!
