DEV Community: Nasit Sony

I Built a Production-Style RAG Backend — Focused on What Happens When Things Break

Nasit Sony — Fri, 29 May 2026 19:07:26 +0000

Most RAG tutorials show you the happy path.

Ingest document → generate embeddings → store in vector DB → search → return results.

It works great in demos. But what happens when:

The worker crashes mid-processing?
Kafka replays messages and you get duplicates?
The database goes down during ingestion?
A malformed document gets stuck in an infinite retry loop?

I built SmartSearch to answer those questions: a correctness-first ingestion and retrieval backend designed to handle failures deterministically.

The Problem With Most RAG Systems

Most RAG implementations are optimized for the happy path. They work well when everything goes right, and fail in unpredictable ways when things go wrong.

The result is systems where:

A worker crash leaves jobs in an unknown state
Kafka replays create duplicate embeddings
A bad document retries forever and blocks the queue
Nobody knows why a document isn't searchable

SmartSearch is built to make failures explicit, recoverable, and observable.

Architecture

Client
  ↓
API Service (Spring Boot)
  ↓
Kafka (async decoupling + replay)
  ↓
Worker (consumes, embeds, writes)
  ↓
Postgres + pgvector (embeddings + similarity search)
  ↓
Prometheus + Grafana (observability)

The key design decision: decouple ingestion from processing via Kafka. This gives you replay, retry, and resilience — at the cost of eventual consistency.

The Job Lifecycle State Machine

Every ingestion request has an explicit state:

PENDING → PROCESSING → READY
                     → FAILED

Why this matters:

No hidden progress — you always know exactly where a job is
Failures are visible — FAILED jobs appear in the system pressure dashboard
Recovery is deterministic: on restart, PROCESSING jobs are retried

The lifecycle invariant: state transitions are monotonic. A job never goes backwards from PROCESSING to PENDING. Once FAILED, it stays FAILED unless explicitly retried.
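
A minimal Python sketch of the monotonic-transition rule, not taken from the SmartSearch codebase. The `ALLOWED` table and the `transition` helper are hypothetical names, and the explicit retry path out of FAILED is left out on purpose:

```python
from enum import Enum


class JobState(Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    READY = "READY"
    FAILED = "FAILED"


# Forward-only transitions. READY and FAILED are terminal in this sketch;
# an explicit retry of a FAILED job would be a separate, deliberate operation.
ALLOWED = {
    JobState.PENDING: {JobState.PROCESSING},
    JobState.PROCESSING: {JobState.READY, JobState.FAILED},
    JobState.READY: set(),
    JobState.FAILED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not a legal forward step."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the legal moves in one table means a backwards step such as PROCESSING to PENDING fails loudly instead of silently rewriting history.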

Idempotent Ingestion

Kafka guarantees at-least-once delivery. This means the same message can arrive multiple times — on retry, on replay, or after a broker restart.

SmartSearch handles this via unique constraints:

UNIQUE(doc_id, chunk_id)

If a chunk already exists, the write is a no-op. This means:

Reprocessing the same message is always safe
No duplicate embeddings, ever
Workers can crash and restart without corrupting state

This is the idempotency invariant: reprocessing the same request does not change the final database state.
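
The no-op write can be sketched as follows. This uses Python's built-in `sqlite3` as a stand-in so the example runs anywhere; the table layout and the `write_chunk` helper are assumptions, not SmartSearch's schema. In Postgres the equivalent statement would be `INSERT ... ON CONFLICT (doc_id, chunk_id) DO NOTHING`.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table with the unique constraint from the text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks ("
        " doc_id TEXT NOT NULL,"
        " chunk_id INTEGER NOT NULL,"
        " embedding TEXT NOT NULL,"
        " UNIQUE (doc_id, chunk_id))"
    )
    return conn


def write_chunk(conn: sqlite3.Connection, doc_id: str, chunk_id: int, embedding: str) -> bool:
    """Insert a chunk; return True if a row was written, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO chunks (doc_id, chunk_id, embedding) VALUES (?, ?, ?)",
        (doc_id, chunk_id, embedding),
    )
    conn.commit()
    return cur.rowcount == 1
```

Calling `write_chunk` twice with the same `(doc_id, chunk_id)` leaves exactly one row, which is the property a replayed Kafka message relies on.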

Failure Handling + DLQ

Workers retry failed jobs with bounded attempts. After exhausting retries:

Job is marked FAILED
Message is sent to a Dead Letter Queue (DLQ)
The job stops blocking other work

This prevents poison messages from retrying forever and starving the queue.

The failure isolation invariant: a FAILED job does not corrupt other documents.
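
A rough sketch of bounded retries with a dead letter queue. The function name, the default of three attempts, and the plain list standing in for the DLQ topic are all assumptions made for illustration:

```python
from typing import Callable, List


def process_with_retries(
    message: str,
    handler: Callable[[str], None],
    dead_letters: List[str],
    max_attempts: int = 3,
) -> str:
    """Run handler up to max_attempts times and return the final job state."""
    for _attempt in range(max_attempts):
        try:
            handler(message)
            return "READY"
        except Exception:
            # A real worker would log the error and back off before retrying.
            continue
    # Retries exhausted: park the message so it stops blocking other work.
    dead_letters.append(message)
    return "FAILED"
```

The important detail is that the loop has a hard upper bound, so a malformed document costs at most `max_attempts` tries before it leaves the main queue.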

Observability

The system exposes a /api/system/pressure endpoint showing live counts:

{
  "pending": 12,
  "processing": 3,
  "ready": 847,
  "failed": 2
}

Prometheus metrics via Spring Boot Actuator:

HTTP request rate and latency
Ingestion pipeline metrics (received, succeeded, failed, retries, DLQ)
Processing age — how long jobs wait before being processed
Database connection pool metrics

Processing age is the metric most people overlook. Latency tells you how fast things are going. Processing age tells you how much work is piling up. A rising processing age is an early warning signal before latency spikes become visible.
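
One plausible way to compute processing age is the age of the oldest job that is still waiting. That definition and the `oldest_pending_age` helper are assumptions; the post does not say how SmartSearch derives the metric.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def oldest_pending_age(
    enqueued_at: Iterable[datetime], now: Optional[datetime] = None
) -> timedelta:
    """Age of the oldest job still waiting; zero when nothing is pending."""
    now = now or datetime.now(timezone.utc)
    ages = [now - t for t in enqueued_at]
    return max(ages, default=timedelta(0))
```

Exported as a gauge, this value climbs steadily while the worker falls behind, even if each individual request still completes quickly.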

Failure Matrix

Failure Scenario	Expected Behavior
Worker crash mid-processing	Job retried, no duplicate chunks
Worker crash after DB write	Reprocessing occurs, idempotency holds
Kafka broker restart	Processing resumes, no message loss
Postgres outage	Worker retries, job eventually READY or FAILED
Poison message	Retries exhausted → FAILED + DLQ
Duplicate request	No duplicate embeddings created

All six scenarios were tested and verified to behave as specified.

What I Learned

At-least-once + idempotency is the right default. Exactly-once semantics in Kafka are possible but operationally complex. At-least-once delivery with idempotent writes leaves the database in the same final state with far less complexity.

The visibility invariant is underrated. A document should be searchable if and only if its state is READY. This simple rule prevents partial visibility and makes the system's behavior predictable under any failure scenario.

Processing age is the most important metric nobody talks about. Every pipeline should expose how long work sits before being processed. It's the earliest signal of a system falling behind.

Kafka adds complexity but the tradeoffs are worth it. You get replay, retry, and resilience. The operational overhead is real, but for any system where correctness under failure matters, it's the right call.

Try It Yourself

git clone https://github.com/NasitSony/SmartSearch.git
cd SmartSearch
docker compose up -d

# API available at http://localhost:8080
# Grafana at http://localhost:3000
# Prometheus at http://localhost:9090

# Ingest a document
curl -X POST http://localhost:8080/api/documents \
  -H "Content-Type: application/json" \
  -d '{"content": "your document text here"}'

# Search
curl "http://localhost:8080/api/search?q=your+query"

# Check system pressure
curl http://localhost:8080/api/system/pressure

GitHub: https://github.com/NasitSony/SmartSearch

SmartSearch is the data pipeline layer of a larger AI infrastructure stack I've been building. The full stack story is covered in my article: I Built a Complete AI Infrastructure Stack from Scratch.

If you found this useful, a ⭐ on GitHub goes a long way!

How I Built a KV-Cache Control Plane for LLM Inference — With Real Benchmark Results

Nasit Sony — Fri, 29 May 2026 18:55:54 +0000

LLM inference is expensive. The prefill step — processing the prompt — is the biggest cost. If you've seen the same prompt before, you shouldn't have to recompute it.

That's the core idea behind KV-cache reuse. But in a distributed system with multiple inference nodes, a new problem emerges: where is the cached prefix stored, and how do you route requests to maximize reuse?

I built llm-serving-cache to answer that question — a metadata-driven control plane for LLM KV-cache placement and routing.

The Problem

In a single-node setup, KV-cache reuse is straightforward. The cache is local and the router is trivial.

In a distributed setup:

Cached prefixes are scattered across nodes
The same prompt might be cached on node-a but the request lands on node-b
Cache misses are expensive — full prefill cost, every time
GPU memory is finite, so you need admission control and eviction

You need a control plane that knows where every cached prefix lives and routes requests intelligently.

The Architecture

Client Request
      ↓
Router
      ↓
Session Affinity Check   → route to same node if session exists
      ↓
Exact Cache Hit?         → reuse cached result, skip prefill
      ↓
Prefix Match?            → reuse partial computation
      ↓
Cache Miss               → select best node, trigger cache fill
      ↓
[If full] Evict          → remove oldest inactive request
      ↓
Inference + Register     → store new cache entry
      ↓
WAL-backed Metadata Store

Core Components

Router — handles exact hits, prefix matches, session affinity, and cache misses.

Node Registry — tracks available nodes, GPU memory, and utilization.

Metadata Store — persists cache entries and session routes via a WAL-backed KV engine (VeriStore).

Placement Policy — best-fit node selection based on available GPU memory blocks.
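
The routing order from the architecture diagram can be sketched in Python. This is an illustration only: it matches on raw prompt strings where a real KV cache would match on token prefixes, and the `route` function and its dictionary arguments are hypothetical.

```python
from typing import Dict, Optional, Tuple


def route(
    session_id: Optional[str],
    prompt: str,
    sessions: Dict[str, str],  # session_id -> node
    cache: Dict[str, str],     # cached prompt prefix -> node
) -> Tuple[str, Optional[str]]:
    """Return (decision, node), checking the cheapest reuse options first."""
    if session_id is not None and session_id in sessions:
        return "session_affinity", sessions[session_id]
    if prompt in cache:
        return "exact_hit", cache[prompt]
    # Among cached prefixes of this prompt, the longest one saves the most prefill.
    matches = [prefix for prefix in cache if prompt.startswith(prefix)]
    if matches:
        return "prefix_match", cache[max(matches, key=len)]
    return "miss", None  # the caller then picks a node via the placement policy
```

A linear scan over cached prefixes is fine for a sketch; a trie or hashed prefix blocks would be the usual choice once the cache grows.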

Benchmark Results

I ran controlled benchmarks across five cache strategies:

Scenario	Avg Latency	P95 Latency	Hit Rate	Rejection Rate
No Cache	1405 ms	1405 ms	0%	0%
Prefix Reuse	985 ms	1405 ms	50%	0%
Exact Cache	205 ms	205 ms	100%	0%
GPU-Aware	843 ms	1405 ms	25%	25%
GPU-Aware + Eviction	1895 ms	4205 ms	25%	0%

Key observations:

Exact cache reuse reduces latency by ~85% vs no cache
Prefix reuse improves average latency but not tail latency: P95 stays high when misses are still present
Eviction reduces rejection but increases latency by admitting previously rejected expensive requests

Real Inference Validation (Ollama)

Benchmarks are useful, but I wanted to validate against real inference. I integrated Ollama running Llama 3.1 8B and ran controlled experiments:

Scenario	Total Latency	Prompt Eval	Decode
Cold request	~8,488 ms	177 ms	5,238 ms
Warm request	~5,520 ms	47 ms	5,372 ms
Prefix-related	~5,891 ms	47 ms	5,747 ms

Warm requests drop prompt evaluation from 177ms → 47ms. But total latency is still ~5.5 seconds because decode dominates.

This is the key insight: caching helps prefill, but token generation is the real bottleneck in real inference systems.

GPU Memory Model

GPU memory is modeled as discrete fixed-size blocks (16MB each):

total_blocks = total_vram_mb / block_size
required_blocks = ceil(kv_size_mb / block_size)

Best-fit placement selects the node with the minimum leftover blocks after allocation, reducing fragmentation.
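
A small Python sketch of the block arithmetic and best-fit choice described above. The 16 MB block size comes from the text; the function names and the dictionary of free blocks per node are assumptions.

```python
import math
from typing import Dict, Optional

BLOCK_SIZE_MB = 16  # block size stated in the text


def required_blocks(kv_size_mb: float) -> int:
    """Round the KV-cache size up to whole blocks."""
    return math.ceil(kv_size_mb / BLOCK_SIZE_MB)


def best_fit(free_blocks: Dict[str, int], kv_size_mb: float) -> Optional[str]:
    """Pick the node with the fewest leftover blocks that still fits the request."""
    need = required_blocks(kv_size_mb)
    leftover = {node: free - need for node, free in free_blocks.items() if free >= need}
    if not leftover:
        return None  # no node can hold the request
    return min(leftover, key=leftover.get)
```

For a 100 MB cache entry, `required_blocks` returns 7, and a node with 8 free blocks is preferred over one with 64 because it leaves only 1 block stranded.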

Under memory pressure:

1. Attempt allocation
2. If insufficient → trigger eviction of oldest inactive request
3. Retry allocation
4. If still insufficient → reject request with explicit reason
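
The four steps above can be sketched for a single node. The `Node` class, its method names, and the rejection message are hypothetical; insertion order in the `OrderedDict` stands in for request age.

```python
from collections import OrderedDict
from typing import Tuple


class Node:
    def __init__(self, total_blocks: int) -> None:
        self.free = total_blocks
        # request_id -> (blocks, active); insertion order doubles as age.
        self.entries: "OrderedDict[str, Tuple[int, bool]]" = OrderedDict()

    def _evict_oldest_inactive(self) -> bool:
        for request_id, (blocks, active) in self.entries.items():
            if not active:
                del self.entries[request_id]
                self.free += blocks
                return True  # return immediately after mutating the dict
        return False

    def allocate(self, request_id: str, blocks: int) -> Tuple[bool, str]:
        if self.free < blocks:             # step 1: first attempt does not fit
            self._evict_oldest_inactive()  # step 2: evict oldest inactive entry
        if self.free < blocks:             # steps 3 and 4: retry, else reject
            return False, "insufficient GPU memory after eviction"
        self.free -= blocks
        self.entries[request_id] = (blocks, True)
        return True, "ok"

    def release(self, request_id: str) -> None:
        """Mark a request inactive; its blocks stay cached until evicted."""
        blocks, _ = self.entries[request_id]
        self.entries[request_id] = (blocks, False)
```

Active requests are never evicted here, so a node full of in-flight work rejects new requests with a reason instead of thrashing.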

Admission Control Under Load

The most important result from the concurrent benchmark:

Concurrency	Avg Latency	P95 Latency	Throughput
1	5,771 ms	5,771 ms	0.17 req/s
3	10,963 ms	16,299 ms	0.18 req/s
5	16,560 ms	27,744 ms	0.18 req/s
10	29,040 ms	53,525 ms	0.19 req/s

Throughput stays flat while latency explodes. This is classic queuing behavior — the bottleneck is the inference runtime, not the control plane.
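
One way to see why the numbers take this shape is a toy model, sketched below under two assumptions that the post does not confirm: all requests arrive at once, and the runtime serves them strictly one at a time with the single-request latency of 5.771 s.

```python
def serial_latencies(n_requests: int, service_time_s: float) -> list:
    """Completion time of each request when all arrive together and run serially."""
    return [service_time_s * (i + 1) for i in range(n_requests)]


def summarize(n_requests: int, service_time_s: float) -> dict:
    latencies = serial_latencies(n_requests, service_time_s)
    return {
        "avg_s": sum(latencies) / n_requests,
        "max_s": latencies[-1],
        "throughput_rps": n_requests / latencies[-1],
    }
```

For 10 requests the model gives an average of about 31.7 s, a worst case of about 57.7 s, and a throughput of about 0.17 req/s. That is close to the measured 29.0 s, 53.5 s, and 0.19 req/s, which is consistent with requests simply waiting in line behind each other.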

With admission control (--max-active=3):

	No Control	With Control
Accepted	10	3
P95 Latency	~53.5s	~20.7s

Good systems don't try to serve everyone. They protect latency by rejecting excess load.
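
A hard concurrency limit like `--max-active=3` can be sketched with a non-blocking semaphore. The `AdmissionController` class is a hypothetical illustration in Python, not the project's C++ implementation.

```python
import threading


class AdmissionController:
    """Reject requests once max_active are already in flight."""

    def __init__(self, max_active: int) -> None:
        self._slots = threading.BoundedSemaphore(max_active)

    def try_admit(self) -> bool:
        # Non-blocking: excess load is rejected immediately instead of queued.
        return self._slots.acquire(blocking=False)

    def done(self) -> None:
        """Free a slot when an admitted request finishes."""
        self._slots.release()
```

With `max_active=3`, a burst of 10 calls to `try_admit` accepts 3 and rejects 7, matching the accepted count in the table above.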

What I Learned

Prefix reuse is valuable but not sufficient. Caching eliminates prefill cost but generation cost dominates real LLM serving. Effective optimization needs to address both.

Single-request latency is misleading. Always benchmark under concurrency. P95 at concurrency=10 was 53.5 s, more than 9× the 5.8 s single-request time.

Admission control is more important than caching. A system that accepts everything under load will have terrible tail latency. Reject early, protect your SLA.

WAL-backed metadata is fast. Storage recovery for 5,000 cache entries takes ~20ms — completely invisible compared to inference latency. Persistence is free at this scale.

Try It Yourself

git clone --recurse-submodules https://github.com/NasitSony/llm-serving-cache.git
cd llm-serving-cache
cmake -S . -B build
cmake --build build

./build/routing_demo
./build/cache_register_demo

GitHub: https://github.com/NasitSony/llm-serving-cache

This project is the inference serving layer of a larger AI infrastructure stack. The storage layer underneath is VeriStore. The workload orchestration layer above is Veriflow.

If you found this useful, a ⭐ on GitHub goes a long way!

I Built a Storage Engine from Scratch in C++ — WAL, Raft, and Object Storage

Nasit Sony — Fri, 29 May 2026 18:52:54 +0000

I wanted to understand one thing: how does data actually survive a crash?

Not what the documentation says. Not what the abstraction promises. What actually happens at the byte level when a process dies mid-write, and how a storage engine recovers from it.

So I built VeriStore — a correctness-first key-value storage engine in C++, built from first principles, evolving from a simple in-memory store to a Raft-replicated distributed system with a mini S3-style object storage layer.

Why Build a Storage Engine?

Every database, stream processor, and distributed system you've ever used is built on top of primitives like:

Write-Ahead Logging (WAL)
Crash-consistent recovery
Group commit batching
Consensus replication

Understanding these primitives doesn't just make you a better infrastructure engineer; it makes you better at using these systems because you understand what guarantees they actually provide.

How It Works

v0.1 — In-Memory KV Store

The foundation: a thread-safe key-value map with PUT, GET, and DEL operations, protected by a reader-writer lock using std::shared_mutex.

Simple, but sets the pattern for everything above it.

v0.2 — Write-Ahead Log + Crash Recovery

The first real challenge: making writes survive crashes.

The WAL is an append-only log. Every write is recorded in the log before being applied to the in-memory map. On startup, the log is replayed to reconstruct state.

PUT x 100  → append to WAL → apply to map
PUT y 200  → append to WAL → apply to map
FLUSH      → fsync to disk

[crash]

restart    → replay WAL → x=100, y=200 ✓

CRC validation detects partial or torn writes — if a record is incomplete, it's ignored and replay stops at that point.
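
A compact Python sketch of CRC-guarded replay. The record layout (4-byte length, 4-byte CRC32, payload) is an assumption chosen for illustration; VeriStore's on-disk format may differ.

```python
import io
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def append_record(log: io.BytesIO, payload: bytes) -> None:
    """Append one length-prefixed, checksummed record to the log."""
    log.write(HEADER.pack(len(payload), zlib.crc32(payload)))
    log.write(payload)


def replay(data: bytes) -> list:
    """Return payloads up to the first incomplete or corrupt record."""
    records, offset = [], 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt tail: stop replay here
        records.append(payload)
        offset = start + length
    return records
```

Truncating the last few bytes of the log, as a crash mid-write would, makes `replay` return only the records before the damaged one.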

Guarantee: If a write returns OK, it survives crashes.

v0.3 — Snapshots + Log Compaction

Replaying the full WAL on every restart gets slow as the log grows. Snapshots solve this:

Serialize the current in-memory state to disk
Truncate the WAL to remove entries before the snapshot
On restart: load snapshot, then replay only the recent WAL entries

This keeps recovery time bounded regardless of how long the system has been running.

v0.4 — Group Commit + Performance

fsyncing every write is correct but slow. Group commit batches writes and fsyncs at boundaries:

Mode	Setting	Throughput
Immediate flush	SETBATCH 1	39,216 ops/s
Group commit	SETBATCH 5	104,167 ops/s

~2.7× throughput improvement by reducing fsync frequency. This is the same technique used by PostgreSQL, RocksDB, and etcd.

v0.5 — Raft Consensus Replication

Single-node durability is not enough for production systems. Raft makes the storage engine fault-tolerant across a cluster:

Leader election — nodes elect a leader via randomized timeouts
Log replication — the leader replicates writes to followers before committing
Majority quorum commit — a write is committed only when a majority of nodes acknowledge it
Follower catch-up: a crashed follower replays missed entries on restart

Example output from the Raft demo:

[raft] node 3 became LEADER term=1
ProposePut(a=100) -> true
s1.get(a)=100  s2.get(a)=100  s3.get(a)=100

=== Simulating leader crash ===
[raft] node 2 became LEADER term=2
ProposePut(b=200) -> true
s1.get(b)=200  s2.get(b)=200  s3.get(b)=200

Guarantee: The cluster remains consistent despite node failures.
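
The majority quorum rule reduces to a short calculation. This Python sketch uses a hypothetical `majority_commit_index` helper and takes, for each node including the leader, the highest log index known to be stored there.

```python
def majority_commit_index(match_index: list) -> int:
    """Highest log index that is stored on a majority of nodes."""
    ranked = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    # The majority-th largest value is present on at least `majority` nodes.
    # Full Raft additionally requires that entry to be from the leader's
    # current term before it may be committed; this sketch omits that check.
    return ranked[majority - 1]
```

In a 3-node cluster with match indexes 5, 3, and 1, the commit index is 3: entries up to 3 exist on two nodes, while entries 4 and 5 exist on only one.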

v0.6–v0.8 — Object Storage Layer

On top of the KV engine, I built a mini S3-style object storage system:

Bucket creation and object PUT/GET/DELETE
Chunked storage — large objects are split into fixed-size chunks, each stored as a KV entry
Metadata-based commit — object metadata is written last and acts as the commit point. An object is valid only if committed metadata exists.
Prefix-based listing — ListObjects(bucket, prefix) via prefix scans over the bucket index namespace
Overwrite semantics — new metadata commits atomically replace previous versions
Mark-and-sweep garbage collection: orphaned chunks from overwrites are reclaimed

The commit semantics are the key insight:

1. Write chunk data  → KV entries
2. Write metadata    → commit point

Recovery: objects without committed metadata are ignored

This makes crash recovery deterministic — you either have the full object or nothing.
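
A Python sketch of metadata-as-commit-point over a plain dictionary. The key naming scheme, the tiny chunk size, and the `crash_before_commit` flag are invented for the example, and overwrite versioning is deliberately left out.

```python
from typing import Dict, Optional

CHUNK_SIZE = 4  # tiny on purpose so the example splits visibly


def put_object(kv: Dict[str, bytes], bucket: str, key: str, data: bytes,
               crash_before_commit: bool = False) -> None:
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    for n, chunk in enumerate(chunks):
        kv[f"chunk/{bucket}/{key}/{n}"] = chunk  # 1. write chunk data
    if crash_before_commit:
        return  # simulated crash: chunks exist, metadata does not
    kv[f"meta/{bucket}/{key}"] = str(len(chunks)).encode()  # 2. commit point


def get_object(kv: Dict[str, bytes], bucket: str, key: str) -> Optional[bytes]:
    meta = kv.get(f"meta/{bucket}/{key}")
    if meta is None:
        return None  # no committed metadata: the object does not exist
    return b"".join(kv[f"chunk/{bucket}/{key}/{n}"] for n in range(int(meta)))
```

After a simulated crash the chunks are still in the store but `get_object` returns `None`; those orphans are what the garbage collector later reclaims.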

Failure Scenarios Tested

✔ Process crash (kill -9)
✔ Partial disk writes
✔ Leader node crash
✔ Follower crash and recovery
✔ Log divergence repair
✔ Replica catch-up via log backtracking

What I Learned

fsync is the durability boundary. A write is only durable once it's fsynced. Group commit is the standard tradeoff — batch writes, fsync at boundaries, accept a small window of potential data loss.

The commit point is everything. Whether it's a WAL record, a metadata entry, or a Raft log index — the commit point is the line between "this happened" and "this might not have happened." Design your commit points explicitly.

Raft is simpler than it looks, but the edge cases are brutal. The basic algorithm is straightforward. But leader crash during replication, log divergence between followers, and split-brain scenarios each required careful handling.

Mac is more forgiving than Linux. The codebase compiled perfectly on macOS but failed on Linux GCC because Apple's headers include <mutex> indirectly. Always test on Linux before shipping.

Try It Yourself

git clone https://github.com/NasitSony/VeriStore.git
cd VeriStore
cmake -S . -B build
cmake --build build

# Run the KV CLI
./build/kv_cli

# Run the Raft demo
./build/raft_demo

GitHub: https://github.com/NasitSony/VeriStore

VeriStore is the storage foundation of a larger AI infrastructure stack I've been building. The next layer up is llm-serving-cache — a KV-cache placement and routing control plane for LLM inference, backed by VeriStore.

If you found this useful, a ⭐ on GitHub goes a long way!

I Built a Complete AI Infrastructure Stack from Scratch — Here's What I Learned

Nasit Sony — Fri, 29 May 2026 18:26:15 +0000

Most AI projects start at the top of the stack.

You grab an LLM API, wire up a vector database, build a RAG pipeline, and ship. That works — until it doesn't. Until your training job crashes at hour 6. Until your inference cache fills up and nobody knows why. Until a worker dies mid-processing and your embeddings are corrupted.

I wanted to understand what happens below the API layer. So I built the whole thing from scratch.

The Stack

Over the past few months I built four interconnected systems that form a complete AI infrastructure stack:

VeriStore          → Storage layer (WAL, Raft, crash recovery)
      ↓
llm-serving-cache  → Inference serving (KV cache, GPU memory, routing)
      ↓
Veriflow           → Workload orchestration (training jobs, checkpoints, GPU scheduling)
      ↓
SmartSearch        → AI data pipeline (async ingestion, Kafka, RAG, fault tolerance)

Each layer depends on the one below it. Each solves a real problem I kept running into. And each taught me something I couldn't have learned from reading documentation.

Layer 1 — VeriStore: How Data Actually Survives Crashes

GitHub: https://github.com/NasitSony/VeriStore

The first question I wanted to answer: how does data survive a crash?

Not "what does the documentation say" — but what actually happens at the byte level when a process dies mid-write.

VeriStore is a correctness-first key-value storage engine in C++ built from first principles:

Write-Ahead Log (WAL) — every write is logged before being applied. On crash, the log is replayed deterministically.
CRC validation — partial or torn writes are detected and ignored.
Group commit batching — instead of fsyncing every write, writes are batched. This improved throughput by ~2.7× in benchmarks.
Snapshot + compaction — periodic snapshots eliminate the need for full log replay on restart.
Raft consensus replication — a 3-node cluster with leader election, majority-based commit, and follower catch-up after crashes.
Mini S3-style object storage — built on top of the KV engine with chunked writes, prefix listing, and mark-and-sweep garbage collection.

What I learned

fsync is expensive, but skipping it is dangerous. Group commit is the right tradeoff — batch writes, fsync at boundaries. This is what RocksDB, PostgreSQL, and etcd all do.

The WAL commit point is everything. An object is valid only if its metadata is committed. This single rule makes crash recovery deterministic — you either have the commit record or you don't.

Raft is simpler than it sounds, but the edge cases are brutal. Leader crash during log replication, follower log divergence, split-brain scenarios — each required careful handling.

Layer 2 — llm-serving-cache: Where Does the KV Cache Live?

GitHub: https://github.com/NasitSony/llm-serving-cache

LLM inference is expensive. The prefill step — processing the prompt — is the main cost. If you've seen the same prompt before, you shouldn't have to recompute it.

llm-serving-cache is a control-plane service that tracks where cached attention prefixes live across distributed inference nodes and routes requests to maximize cache reuse.

Key results from benchmarks:

Scenario	Avg Latency	Hit Rate
No Cache	1405 ms	0%
Prefix Reuse	985 ms	50%
Exact Cache	205 ms	100%
GPU-Aware	843 ms	25%

Exact cache reuse reduces latency by ~85% compared to no cache.

The system models GPU memory as discrete blocks and uses best-fit placement to minimize fragmentation. Under memory pressure, it evicts the oldest inactive requests and retries allocation before rejecting.

I also validated this against a real Ollama backend running Llama 3.1 8B:

Cold request: ~8,488 ms
Warm request (same prompt): ~5,520 ms
Prompt eval dropped from 177ms → 47ms on warm requests

What I learned

Cache hits matter enormously for prefill, but decode dominates total latency. A warm request still takes ~5.5 seconds because token generation is slow regardless of caching. Real serving optimization needs to address decode efficiency too.

Admission control is more important than caching. Accepting all requests under load causes queue growth and latency explosion. Rejecting excess load with a hard concurrency limit keeps tail latency controlled.

Single-request latency is misleading. At concurrency=10, P95 latency was 53.5 seconds, more than 9× the ~5.8 seconds measured at concurrency=1. Production serving systems need batching, scheduling, and admission control, not just cache reuse.

Layer 3 — Veriflow: Treating Training Jobs as Distributed Systems

GitHub: https://github.com/NasitSony/veriflow-control-plane

The pain that started everything: training jobs that crash at hour 6 with no checkpoint, no retry, and no idea why.

Veriflow is a Kubernetes-based job orchestrator that treats AI training as what it actually is — a distributed systems problem.

The key insight: checkpoints need to be first-class citizens.

Most job runners treat AI training like a simple script: run it, and if it fails, restart from zero. Veriflow models job lifecycle as a state machine with checkpoint-aware retry:

JOB_SUBMITTED → RUN_CREATED → POD_RUNNING
→ CHECKPOINT_SAVED            ← checkpoint URI persisted
→ RUN_FAILED                  ← something went wrong
→ RETRY_TRIGGERED             ← scheduler picks it up
→ TRAINING_RESUMED            ← resumes from checkpoint
→ JOB_SUCCEEDED

The scheduler uses FOR UPDATE SKIP LOCKED in Postgres for concurrency-safe job claiming — tested with two concurrent scheduler instances processing 20 burst-submitted jobs with zero duplicate dispatches.

GPU-aware placement matches jobs to nodes by GPU type, count, and memory requirements using best-fit allocation. Queue-level fairness and quota enforcement prevent one greedy queue from monopolizing the cluster.

What I learned

FOR UPDATE SKIP LOCKED is underrated. Most people reach for Redis or a dedicated queue for concurrent job processing. Postgres with SKIP LOCKED handles it correctly — and you get transactions and consistency for free.

The scheduler is a control plane, not a cron job. A cron fires and forgets. A control plane continuously reconciles desired state with actual state. This distinction is what makes checkpoint-aware recovery possible.

Checkpoint URIs should be in your job spec from day one. Treating them as an afterthought means you'll always restart from scratch when things go wrong.

Layer 4 — SmartSearch: What Happens When the Pipeline Breaks?

GitHub: https://github.com/NasitSony/SmartSearch

Most RAG demos show the happy path: ingest document, generate embeddings, search, return results.

SmartSearch asks a different question: what happens when things fail?

What if the worker crashes mid-processing?
What if Kafka replays messages?
What if the database goes down?
What if duplicate requests arrive?

The system is built to handle these scenarios deterministically:

Idempotent ingestion — duplicate Kafka messages don't create duplicate embeddings, enforced via unique constraints
Job lifecycle state machine — PENDING → PROCESSING → READY | FAILED, no hidden progress
Bounded retries + DLQ — failed jobs retry with limits, then go to a dead letter queue
Full observability — Prometheus + Grafana dashboards for pipeline pressure, retry rates, and processing age

What I learned

At-least-once + idempotency is the right default. Exactly-once semantics in Kafka are possible but complex. At-least-once with idempotent writes gets you the same correctness guarantees with far less operational complexity.

Processing age is the most important metric nobody talks about. Latency tells you how fast things are going. Processing age tells you how much work is piling up. A rising processing age means your pipeline is falling behind — before latency spikes make it obvious.

The visibility invariant matters. A document is searchable if and only if its state is READY. This single rule prevents partial visibility and makes the system's behavior predictable under any failure scenario.

The Bigger Picture

Building these four systems taught me something that documentation never could:

Every layer of the AI stack is a distributed systems problem.

Storage is about durability and consensus
Inference serving is about routing and resource management
Workload orchestration is about scheduling and fault recovery
Data pipelines are about correctness under partial failure

Most AI engineers work at the top of this stack and treat the layers below as black boxes. That works until scale, failure, or cost forces the question: what's actually happening down there?

Understanding these layers doesn't just make you a better infrastructure engineer. It makes you better at every layer above — because you understand what guarantees you can actually rely on, and what you need to handle yourself.

What's Next

Demo GIFs for all four projects
Distributed control plane for llm-serving-cache (Raft-backed metadata)
Web UI for Veriflow job monitoring
Exactly-once semantics for SmartSearch (Kafka transactions)

If you found this useful, all four repos are on GitHub. Stars and feedback welcome!

VeriStore: https://github.com/NasitSony/VeriStore
llm-serving-cache: https://github.com/NasitSony/llm-serving-cache
Veriflow: https://github.com/NasitSony/veriflow-control-plane
SmartSearch: https://github.com/NasitSony/SmartSearch

I Got Tired of Training Jobs Crashing at Hour 6 — So I Built Veriflow

Nasit Sony — Fri, 29 May 2026 04:59:30 +0000

You know the feeling.

You kick off a training job before bed. 8 hours of compute. You wake up, grab your coffee, open the terminal — and see it crashed at hour 6. No checkpoint. No retry. No clue why.

Restart from zero.

That pain is what led me to build Veriflow — a checkpoint-aware, fault-tolerant job orchestrator for AI training workloads on Kubernetes.

The Problem With Existing Tools

Most job runners treat AI training like a simple script:

"Run it. If it fails, restart it."

But training jobs are not simple scripts. They are:

Long-running — hours or days, not seconds
Stateful — they produce checkpoints as they run
Expensive — GPU time costs real money
Distributed — they touch storage, databases, and compute simultaneously

Restarting from zero every time a job fails is not just annoying — it is wasteful and often unacceptable in production.

What you actually need is a system that treats AI workloads as what they are: distributed systems problems.

What Veriflow Does Differently

1. Checkpoint-Aware Retry

When a job fails, Veriflow does not restart from scratch. It resumes from the latest saved checkpoint.

The lifecycle looks like this:

JOB_SUBMITTED
JOB_SCHEDULED
RUN_CREATED
POD_RUNNING
TRAINING_PROGRESS
CHECKPOINT_SAVED        ← checkpoint URI persisted
RUN_FAILED              ← something went wrong
RETRY_TRIGGERED         ← scheduler picks it up
TRAINING_RESUMED        ← resumes from checkpoint
JOB_SUCCEEDED

The checkpoint URI is a first-class citizen in the job spec — not an afterthought bolted on later.
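
A minimal Python sketch of checkpoint-aware retry, reusing the event names from the lifecycle above. The `Job` dataclass, the `max_retries` field, and the example URI are hypothetical; Veriflow itself is written in Go.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Job:
    job_id: str
    max_retries: int = 3
    attempts: int = 0
    checkpoint_uri: Optional[str] = None
    events: List[str] = field(default_factory=list)


def on_checkpoint(job: Job, uri: str) -> None:
    """Persist the latest checkpoint URI on the job record."""
    job.checkpoint_uri = uri
    job.events.append("CHECKPOINT_SAVED")


def on_failure(job: Job) -> Tuple[bool, Optional[str]]:
    """Return (should_retry, resume_from); resume_from is None for a fresh start."""
    job.events.append("RUN_FAILED")
    if job.attempts >= job.max_retries:
        return False, None  # retries exhausted
    job.attempts += 1
    job.events.append("RETRY_TRIGGERED")
    if job.checkpoint_uri is not None:
        job.events.append("TRAINING_RESUMED")
    return True, job.checkpoint_uri
```

Because the URI lives on the job record rather than inside the training script, the retry path can hand it to the next run without guessing where the last checkpoint went.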

2. Concurrency-Safe Scheduling

Veriflow uses PostgreSQL's FOR UPDATE SKIP LOCKED for job claiming. This means:

Multiple scheduler instances can run simultaneously
No duplicate job dispatches — ever
No complex distributed locking needed

Tested with two concurrent scheduler instances processing 20 burst-submitted jobs — zero duplicate dispatches observed.

3. GPU-Aware Placement

Jobs declare their GPU requirements upfront:

{
  "gpuCount": 2,
  "gpuType": "A100",
  "minGpuMemoryMb": 30000
}

The scheduler matches jobs to nodes that satisfy all constraints, using best-fit placement to avoid fragmentation. If no node satisfies the constraints, the job is deferred with an explicit reason — not silently dropped.
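
Constraint matching with an explicit deferral reason might look like this in Python. The spec keys come from the JSON above; the `GpuNode` dataclass, the `place` function, and the reason strings are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GpuNode:
    name: str
    gpu_type: str
    free_gpus: int
    gpu_memory_mb: int


def place(spec: dict, nodes: List[GpuNode]) -> Tuple[Optional[str], str]:
    """Return (node name, reason); the node is None when the job is deferred."""
    candidates = [
        n for n in nodes
        if n.gpu_type == spec["gpuType"]
        and n.free_gpus >= spec["gpuCount"]
        and n.gpu_memory_mb >= spec["minGpuMemoryMb"]
    ]
    if not candidates:
        return None, "deferred: no node satisfies GPU type, count, and memory"
    # Best fit: leave the fewest GPUs idle on the chosen node.
    best = min(candidates, key=lambda n: n.free_gpus - spec["gpuCount"])
    return best.name, "placed"
```

Returning the reason alongside the decision is what lets a deferred job show up in the event log with an explanation attached.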

4. Queue-Level Fairness and Quota

Each queue has a GPU quota. Jobs that exceed their queue's quota are deferred, not rejected. The scheduler rotates through queues to prevent starvation — one greedy queue cannot monopolize the cluster.
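
A rough Python sketch of one scheduling pass with quotas and rotation. The one-job-per-queue-per-pass rule and the `schedule_round` function are assumptions about how such rotation could work, not a description of Veriflow's scheduler.

```python
from collections import deque
from typing import Dict, List


def schedule_round(
    queues: Dict[str, deque],  # queue name -> jobs as (job_id, gpus)
    quota: Dict[str, int],     # queue name -> GPU quota
    in_use: Dict[str, int],    # queue name -> GPUs currently held
) -> List[str]:
    """One pass over all queues: dispatch at most one job per queue."""
    dispatched = []
    for name, jobs in queues.items():
        if not jobs:
            continue
        job_id, gpus = jobs[0]
        if in_use.get(name, 0) + gpus > quota[name]:
            continue  # over quota: the job stays queued (deferred, not rejected)
        jobs.popleft()
        in_use[name] = in_use.get(name, 0) + gpus
        dispatched.append(job_id)
    return dispatched
```

A queue with many waiting jobs still gets only one dispatch per pass and stops entirely at its quota, so a smaller queue is never starved behind it.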

5. Full Event-Sourced Lifecycle

Every state transition emits an event. This means you always know:

Why a job failed
When a checkpoint was saved
How many retry attempts were made
Exactly how long each phase took

Architecture

Veriflow follows a classic control-plane + data-plane split:

Client
  │  POST /v1/jobs  (Idempotency-Key)
  ▼
Job API (Go)
  │  writes jobs/spec to Postgres
  ▼
Postgres (jobs, runs, events)
  ▲
  │  claim (FOR UPDATE SKIP LOCKED)
  │  dispatch → Kubernetes Job
  │  reconcile runtime + K8s state
  ▼
Scheduler (Go) ───────────► Kubernetes Job / Pod

Control plane = Job API + Scheduler + Postgres
Data plane = Kubernetes Jobs and Pods

This separation makes the system easy to reason about, scale, and debug.

What I Learned Building This

FOR UPDATE SKIP LOCKED is underrated.
Most people reach for Redis or a dedicated queue when they need concurrent job processing. But Postgres with SKIP LOCKED handles it beautifully — and you get transactions, consistency, and a single source of truth for free.

Checkpoint URIs need to be first-class.
The biggest mistake I see in ML infra is treating checkpoints as an implementation detail. They need to be in your job spec, tracked in your database, and passed explicitly on retry. If your orchestrator does not know about checkpoints, you will always restart from zero.

Model your job lifecycle as a state machine.
Once I stopped thinking about jobs as "running or not running" and started modeling them as state machines with explicit transitions, failure handling became trivial. Every failure has a cause. Every retry has a reason. Nothing is ambiguous.

The scheduler is a control plane, not a cron job.
A cron job fires and forgets. A control plane continuously reconciles desired state with actual state. Veriflow's scheduler constantly reconciles Kubernetes pod states, runtime signals, and database state — which is what makes checkpoint-aware recovery possible.

Try It Yourself

git clone https://github.com/NasitSony/veriflow-control-plane.git
cd veriflow-control-plane
make up
make api
make sched
make demo-success
make events

The demo runs a full end-to-end job — submission, scheduling, execution, checkpointing, and success — in under a minute.

What's Next

Metrics and Prometheus integration — expose scheduler and job metrics
Web UI — visualize job lifecycle and GPU utilization
Multi-cluster support — dispatch jobs across multiple Kubernetes clusters

Feedback Welcome

Veriflow is early-stage and I am actively looking for feedback from anyone doing ML infra or platform engineering. What features would make this useful for your workloads?

GitHub: https://github.com/NasitSony/veriflow-control-plane

If you found this useful, a ⭐ on GitHub goes a long way!