<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nasit Sony</title>
    <description>The latest articles on DEV Community by Nasit Sony (@nasit_sony).</description>
    <link>https://dev.to/nasit_sony</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3957722%2F5233643b-3bce-4005-b0aa-5a2a8d611555.jpg</url>
      <title>DEV Community: Nasit Sony</title>
      <link>https://dev.to/nasit_sony</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nasit_sony"/>
    <language>en</language>
    <item>
      <title>I Built a Production-Style RAG Backend — Focused on What Happens When Things Break</title>
      <dc:creator>Nasit Sony</dc:creator>
      <pubDate>Fri, 29 May 2026 19:07:26 +0000</pubDate>
      <link>https://dev.to/nasit_sony/i-built-a-production-style-rag-backend-focused-on-what-happens-when-things-break-468c</link>
      <guid>https://dev.to/nasit_sony/i-built-a-production-style-rag-backend-focused-on-what-happens-when-things-break-468c</guid>
      <description>&lt;h1&gt;
  
  
  I Built a Production-Style RAG Backend — Focused on What Happens When Things Break
&lt;/h1&gt;

&lt;p&gt;Most RAG tutorials show you the happy path.&lt;/p&gt;

&lt;p&gt;Ingest document → generate embeddings → store in vector DB → search → return results.&lt;/p&gt;

&lt;p&gt;It works great in demos. But what happens when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The worker crashes mid-processing?&lt;/li&gt;
&lt;li&gt;Kafka replays messages and you get duplicates?&lt;/li&gt;
&lt;li&gt;The database goes down during ingestion?&lt;/li&gt;
&lt;li&gt;A malformed document gets stuck in an infinite retry loop?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built SmartSearch to answer those questions. It is a correctness-first ingestion and retrieval backend designed to handle failures deterministically.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Most RAG Systems
&lt;/h2&gt;

&lt;p&gt;Most RAG implementations are optimized for the happy path. They work well when everything goes right, and fail in unpredictable ways when things go wrong.&lt;/p&gt;

&lt;p&gt;The result is systems where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A worker crash leaves jobs in an unknown state&lt;/li&gt;
&lt;li&gt;Kafka replays create duplicate embeddings&lt;/li&gt;
&lt;li&gt;A bad document retries forever and blocks the queue&lt;/li&gt;
&lt;li&gt;Nobody knows why a document isn't searchable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SmartSearch is built to make failures explicit, recoverable, and observable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client
  ↓
API Service (Spring Boot)
  ↓
Kafka (async decoupling + replay)
  ↓
Worker (consumes, embeds, writes)
  ↓
Postgres + pgvector (embeddings + similarity search)
  ↓
Prometheus + Grafana (observability)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design decision: &lt;strong&gt;decouple ingestion from processing via Kafka.&lt;/strong&gt; This gives you replay, retry, and resilience — at the cost of eventual consistency.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Job Lifecycle State Machine
&lt;/h2&gt;

&lt;p&gt;Every ingestion request has an explicit state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PENDING → PROCESSING → READY
                     → FAILED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No hidden progress — you always know exactly where a job is&lt;/li&gt;
&lt;li&gt;Failures are visible — FAILED jobs appear in the system pressure dashboard&lt;/li&gt;
&lt;li&gt;Recovery is deterministic: on restart, PROCESSING jobs are retried&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The lifecycle invariant: state transitions are monotonic. A job never goes backwards from PROCESSING to PENDING. Once FAILED, it stays FAILED unless explicitly retried.&lt;/p&gt;
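&lt;p&gt;To make the monotonic lifecycle concrete, here is a minimal Python sketch of a transition guard. It is an illustration only, not SmartSearch's Java code: the &lt;code&gt;JobState&lt;/code&gt; enum, the &lt;code&gt;ALLOWED&lt;/code&gt; table, and &lt;code&gt;transition&lt;/code&gt; are hypothetical names, and an explicit retry of a FAILED job is assumed to be a separate operation that is not modeled here.&lt;/p&gt;

```python
from enum import Enum


class JobState(Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    READY = "READY"
    FAILED = "FAILED"


# Allowed forward transitions; READY and FAILED are terminal in this sketch.
# (Assumption: an explicit retry of a FAILED job is handled elsewhere.)
ALLOWED = {
    JobState.PENDING: {JobState.PROCESSING},
    JobState.PROCESSING: {JobState.READY, JobState.FAILED},
    JobState.READY: set(),
    JobState.FAILED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move would break monotonicity."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

&lt;p&gt;Keeping the table in one place means every state change goes through the same check, so a backwards move such as PROCESSING to PENDING fails loudly instead of silently corrupting the lifecycle.&lt;/p&gt;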




&lt;h2&gt;
  
  
  Idempotent Ingestion
&lt;/h2&gt;

&lt;p&gt;A Kafka consumer that commits offsets only after processing gets at-least-once delivery. This means the same message can arrive more than once: on retry, on replay, or after a consumer or broker restart.&lt;/p&gt;

&lt;p&gt;SmartSearch handles this via unique constraints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a chunk already exists, the write is a no-op. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reprocessing the same message is always safe&lt;/li&gt;
&lt;li&gt;No duplicate embeddings, ever&lt;/li&gt;
&lt;li&gt;Workers can crash and restart without corrupting state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the idempotency invariant: reprocessing the same request does not change the final database state.&lt;/p&gt;
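&lt;p&gt;A small sketch of the unique-constraint idea, using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; so it runs anywhere. This is a stand-in, not the SmartSearch schema: the &lt;code&gt;chunks&lt;/code&gt; table, its &lt;code&gt;content&lt;/code&gt; column, and &lt;code&gt;write_chunk&lt;/code&gt; are assumptions for illustration. In Postgres the equivalent statement would be &lt;code&gt;INSERT ... ON CONFLICT (doc_id, chunk_id) DO NOTHING&lt;/code&gt;.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE chunks (
        doc_id    TEXT NOT NULL,
        chunk_id  INTEGER NOT NULL,
        content   TEXT NOT NULL,
        UNIQUE (doc_id, chunk_id)
    )
    """
)


def write_chunk(doc_id: str, chunk_id: int, content: str) -> bool:
    """Insert a chunk; return True if a row was written, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO chunks (doc_id, chunk_id, content) VALUES (?, ?, ?)",
        (doc_id, chunk_id, content),
    )
    conn.commit()
    return cur.rowcount == 1


def chunk_count() -> int:
    return conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]


first = write_chunk("doc-1", 0, "hello")    # new row
replayed = write_chunk("doc-1", 0, "hello")  # same message delivered again
```

&lt;p&gt;The second call changes nothing, so a replayed Kafka message leaves the table exactly as it was.&lt;/p&gt;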




&lt;h2&gt;
  
  
  Failure Handling + DLQ
&lt;/h2&gt;

&lt;p&gt;Workers retry failed jobs with bounded attempts. After exhausting retries:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Job is marked &lt;code&gt;FAILED&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Message is sent to a Dead Letter Queue (DLQ)&lt;/li&gt;
&lt;li&gt;The job stops blocking other work&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This prevents poison messages from retrying forever and starving the queue.&lt;/p&gt;

&lt;p&gt;The failure isolation invariant: a FAILED job does not corrupt other documents.&lt;/p&gt;
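&lt;p&gt;The bounded-retry flow can be sketched in a few lines of Python. This is a simplified model under stated assumptions: &lt;code&gt;process_with_retries&lt;/code&gt;, the in-memory &lt;code&gt;dead_letters&lt;/code&gt; list standing in for the DLQ topic, and the default of three attempts are all hypothetical, not values taken from SmartSearch.&lt;/p&gt;

```python
from typing import Callable, List


def process_with_retries(
    message: str,
    handler: Callable[[str], None],
    dead_letters: List[str],
    max_attempts: int = 3,
) -> str:
    """Try the handler up to max_attempts times; park the message in the DLQ on exhaustion."""
    for _attempt in range(max_attempts):
        try:
            handler(message)
            return "READY"
        except Exception:
            continue  # a real worker would log the error and back off here
    dead_letters.append(message)  # stand-in for publishing to the DLQ topic
    return "FAILED"
```

&lt;p&gt;Because the loop is bounded, a poison message costs at most &lt;code&gt;max_attempts&lt;/code&gt; handler calls before the worker moves on to the next message.&lt;/p&gt;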




&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;The system exposes a &lt;code&gt;/api/system/pressure&lt;/code&gt; endpoint showing live counts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pending"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"processing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ready"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;847&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"failed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prometheus metrics via Spring Boot Actuator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP request rate and latency&lt;/li&gt;
&lt;li&gt;Ingestion pipeline metrics (received, succeeded, failed, retries, DLQ)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing age&lt;/strong&gt; — how long jobs wait before being processed&lt;/li&gt;
&lt;li&gt;Database connection pool metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Processing age is the metric most people overlook.&lt;/strong&gt; Latency tells you how fast things are going. Processing age tells you how much work is piling up. A rising processing age is an early warning signal before latency spikes become visible.&lt;/p&gt;
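&lt;p&gt;One way to compute processing age, sketched in Python. The definition used here (age of the oldest job still waiting) is an assumption for illustration, as is the function name &lt;code&gt;processing_age_seconds&lt;/code&gt;. SmartSearch may define the metric differently.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def processing_age_seconds(pending_created_at: Iterable[datetime], now: datetime) -> float:
    """Age of the oldest job still waiting, in seconds; 0.0 when nothing is waiting."""
    ages = [(now - created).total_seconds() for created in pending_created_at]
    return max(ages, default=0.0)


now = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
waiting = [now - timedelta(seconds=5), now - timedelta(seconds=42)]
oldest = processing_age_seconds(waiting, now)  # 42.0
```

&lt;p&gt;Exported as a gauge, this value climbs steadily while the queue backs up, even when each individual request still completes quickly.&lt;/p&gt;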




&lt;h2&gt;
  
  
  Failure Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Scenario&lt;/th&gt;
&lt;th&gt;Expected Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Worker crash mid-processing&lt;/td&gt;
&lt;td&gt;Job retried, no duplicate chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worker crash after DB write&lt;/td&gt;
&lt;td&gt;Reprocessing occurs, idempotency holds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kafka broker restart&lt;/td&gt;
&lt;td&gt;Processing resumes, no message loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Postgres outage&lt;/td&gt;
&lt;td&gt;Worker retries, job eventually READY or FAILED&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Poison message&lt;/td&gt;
&lt;td&gt;Retries exhausted → FAILED + DLQ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate request&lt;/td&gt;
&lt;td&gt;No duplicate embeddings created&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All six scenarios were tested and verified to behave as specified.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;At-least-once + idempotency is the right default.&lt;/strong&gt; Exactly-once semantics in Kafka are possible but operationally complex. At-least-once delivery with idempotent writes produces the same final database state with far less complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The visibility invariant is underrated.&lt;/strong&gt; A document should be searchable if and only if its state is READY. This simple rule prevents partial visibility and makes the system's behavior predictable under any failure scenario.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processing age is the most important metric nobody talks about.&lt;/strong&gt; Every pipeline should expose how long work sits before being processed. It's the earliest signal of a system falling behind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka adds complexity but the tradeoffs are worth it.&lt;/strong&gt; You get replay, retry, and resilience. The operational overhead is real, but for any system where correctness under failure matters, it's the right call.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/NasitSony/SmartSearch.git
&lt;span class="nb"&gt;cd &lt;/span&gt;SmartSearch
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# API available at http://localhost:8080&lt;/span&gt;
&lt;span class="c"&gt;# Grafana at http://localhost:3000&lt;/span&gt;
&lt;span class="c"&gt;# Prometheus at http://localhost:9090&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Ingest a document&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/api/documents &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"content": "your document text here"}'&lt;/span&gt;

&lt;span class="c"&gt;# Search&lt;/span&gt;
curl &lt;span class="s2"&gt;"http://localhost:8080/api/search?q=your+query"&lt;/span&gt;

&lt;span class="c"&gt;# Check system pressure&lt;/span&gt;
curl http://localhost:8080/api/system/pressure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/NasitSony/SmartSearch" rel="noopener noreferrer"&gt;https://github.com/NasitSony/SmartSearch&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;SmartSearch is the data pipeline layer of a larger AI infrastructure stack I've been building. The full stack story is covered in my article: &lt;a href="https://dev.to"&gt;I Built a Complete AI Infrastructure Stack from Scratch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you found this useful, a ⭐ on GitHub goes a long way!&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>mlop</category>
      <category>java</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How I Built a KV-Cache Control Plane for LLM Inference — With Real Benchmark Results</title>
      <dc:creator>Nasit Sony</dc:creator>
      <pubDate>Fri, 29 May 2026 18:55:54 +0000</pubDate>
      <link>https://dev.to/nasit_sony/how-i-built-a-kv-cache-control-plane-for-llm-inference-with-real-benchmark-results-n5e</link>
      <guid>https://dev.to/nasit_sony/how-i-built-a-kv-cache-control-plane-for-llm-inference-with-real-benchmark-results-n5e</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a KV-Cache Control Plane for LLM Inference — With Real Benchmark Results
&lt;/h1&gt;

&lt;p&gt;LLM inference is expensive. The prefill step (processing the prompt) is a cost you pay again on every request, even when the prompt has not changed. If you've seen the same prompt before, you shouldn't have to recompute it.&lt;/p&gt;

&lt;p&gt;That's the core idea behind KV-cache reuse. But in a distributed system with multiple inference nodes, a new problem emerges: &lt;em&gt;where is the cached prefix stored, and how do you route requests to maximize reuse?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I built llm-serving-cache to answer that question — a metadata-driven control plane for LLM KV-cache placement and routing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;In a single-node setup, KV-cache reuse is straightforward. The cache is local and the router is trivial.&lt;/p&gt;

&lt;p&gt;In a distributed setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cached prefixes are scattered across nodes&lt;/li&gt;
&lt;li&gt;The same prompt might be cached on node-a but the request lands on node-b&lt;/li&gt;
&lt;li&gt;Cache misses are expensive — full prefill cost, every time&lt;/li&gt;
&lt;li&gt;GPU memory is finite: you need admission control and eviction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need a control plane that knows where every cached prefix lives and routes requests intelligently.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Request
      ↓
Router
      ↓
Session Affinity Check   → route to same node if session exists
      ↓
Exact Cache Hit?         → reuse cached result, skip prefill
      ↓
Prefix Match?            → reuse partial computation
      ↓
Cache Miss               → select best node, trigger cache fill
      ↓
[If full] Evict          → remove oldest inactive request
      ↓
Inference + Register     → store new cache entry
      ↓
WAL-backed Metadata Store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Router&lt;/strong&gt; — handles exact hits, prefix matches, session affinity, and cache misses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node Registry&lt;/strong&gt; — tracks available nodes, GPU memory, and utilization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata Store&lt;/strong&gt; — persists cache entries and session routes via a WAL-backed KV engine (VeriStore).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Placement Policy&lt;/strong&gt; — best-fit node selection based on available GPU memory blocks.&lt;/p&gt;
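&lt;p&gt;The routing order in the diagram can be sketched as a small Python class. This is a toy model, not the project's C++ router: the &lt;code&gt;Router&lt;/code&gt; class and its fields are hypothetical, prompts are matched as plain strings (a real KV-cache would match token prefixes), and the caller is assumed to supply the node chosen by the placement policy as &lt;code&gt;fallback_node&lt;/code&gt;.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Router:
    sessions: Dict[str, str] = field(default_factory=dict)  # session_id -> node
    cache: Dict[str, str] = field(default_factory=dict)     # cached prompt -> node

    def route(self, session_id: str, prompt: str, fallback_node: str) -> Tuple[str, str]:
        """Return (decision, node) and remember which node the session used."""
        decision, node = self._decide(session_id, prompt, fallback_node)
        self.sessions[session_id] = node
        return decision, node

    def _decide(self, session_id: str, prompt: str, fallback_node: str) -> Tuple[str, str]:
        if session_id in self.sessions:
            return "session_affinity", self.sessions[session_id]
        if prompt in self.cache:
            return "exact_hit", self.cache[prompt]
        prefix = self._longest_cached_prefix(prompt)
        if prefix is not None:
            return "prefix_match", self.cache[prefix]
        # Miss: register the new cache entry on the node picked by the caller.
        self.cache[prompt] = fallback_node
        return "miss", fallback_node

    def _longest_cached_prefix(self, prompt: str) -> Optional[str]:
        candidates = [p for p in self.cache if prompt.startswith(p)]
        return max(candidates, key=len, default=None)
```

&lt;p&gt;Checking the cheapest, most specific signals first keeps the common path short: a known session or an exact hit never reaches the placement logic.&lt;/p&gt;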




&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;I ran controlled benchmarks across five cache strategies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;P95 Latency&lt;/th&gt;
&lt;th&gt;Hit Rate&lt;/th&gt;
&lt;th&gt;Rejection Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No Cache&lt;/td&gt;
&lt;td&gt;1405 ms&lt;/td&gt;
&lt;td&gt;1405 ms&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefix Reuse&lt;/td&gt;
&lt;td&gt;985 ms&lt;/td&gt;
&lt;td&gt;1405 ms&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exact Cache&lt;/td&gt;
&lt;td&gt;205 ms&lt;/td&gt;
&lt;td&gt;205 ms&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU-Aware&lt;/td&gt;
&lt;td&gt;843 ms&lt;/td&gt;
&lt;td&gt;1405 ms&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU-Aware + Eviction&lt;/td&gt;
&lt;td&gt;1895 ms&lt;/td&gt;
&lt;td&gt;4205 ms&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exact cache reuse reduces latency by ~85%&lt;/strong&gt; vs no cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefix reuse improves average latency but not tail latency&lt;/strong&gt; — P95 stays high when misses are still present&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eviction reduces rejection but increases latency&lt;/strong&gt; by admitting previously rejected expensive requests&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real Inference Validation (Ollama)
&lt;/h2&gt;

&lt;p&gt;Benchmarks are useful, but I wanted to validate against real inference. I integrated Ollama running Llama 3.1 8B and ran controlled experiments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Total Latency&lt;/th&gt;
&lt;th&gt;Prompt Eval&lt;/th&gt;
&lt;th&gt;Decode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cold request&lt;/td&gt;
&lt;td&gt;~8,488 ms&lt;/td&gt;
&lt;td&gt;177 ms&lt;/td&gt;
&lt;td&gt;5,238 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm request&lt;/td&gt;
&lt;td&gt;~5,520 ms&lt;/td&gt;
&lt;td&gt;47 ms&lt;/td&gt;
&lt;td&gt;5,372 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefix-related&lt;/td&gt;
&lt;td&gt;~5,891 ms&lt;/td&gt;
&lt;td&gt;47 ms&lt;/td&gt;
&lt;td&gt;5,747 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Warm requests drop prompt evaluation from 177 ms to 47 ms. But total latency is still ~5.5 seconds because &lt;strong&gt;decode dominates&lt;/strong&gt;: it accounts for about 5.4 of those seconds.&lt;/p&gt;

&lt;p&gt;This is the key insight: caching helps prefill, but token generation is the real bottleneck in real inference systems.&lt;/p&gt;
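&lt;p&gt;The arithmetic behind that observation, as a short Python sketch. The numbers are copied from the Ollama table; the &lt;code&gt;runs&lt;/code&gt; dictionary and &lt;code&gt;share&lt;/code&gt; helper are just illustrative names.&lt;/p&gt;

```python
# Ollama timings in milliseconds, copied from the table above.
runs = {
    "cold": {"total": 8488, "prompt_eval": 177, "decode": 5238},
    "warm": {"total": 5520, "prompt_eval": 47, "decode": 5372},
    "prefix": {"total": 5891, "prompt_eval": 47, "decode": 5747},
}


def share(run: str, phase: str) -> float:
    """Fraction of a run's total latency spent in one phase."""
    return runs[run][phase] / runs[run]["total"]


prompt_eval_saved_ms = runs["cold"]["prompt_eval"] - runs["warm"]["prompt_eval"]
warm_decode_share = share("warm", "decode")
warm_prompt_share = share("warm", "prompt_eval")
```

&lt;p&gt;Caching saves 130 ms of prompt evaluation, while decode makes up roughly 97% of the warm request and prompt evaluation under 1%. Any further prefill optimization is bounded by that small share.&lt;/p&gt;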




&lt;h2&gt;
  
  
  GPU Memory Model
&lt;/h2&gt;

&lt;p&gt;GPU memory is modeled as discrete fixed-size blocks (16MB each):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;total_blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_vram_mb&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;
&lt;span class="n"&gt;required_blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kv_size_mb&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Best-fit placement selects the node with the minimum leftover blocks after allocation, reducing fragmentation.&lt;/p&gt;

&lt;p&gt;Under memory pressure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Attempt allocation&lt;/li&gt;
&lt;li&gt;If insufficient → trigger eviction of oldest inactive request&lt;/li&gt;
&lt;li&gt;Retry allocation&lt;/li&gt;
&lt;li&gt;If still insufficient → reject request with explicit reason&lt;/li&gt;
&lt;/ol&gt;
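&lt;p&gt;Best-fit placement and the allocate, evict, retry, reject sequence can be sketched as follows. This is a simplified Python model, not the project's C++ implementation: the &lt;code&gt;Node&lt;/code&gt; dataclass, the &lt;code&gt;(last_used_tick, blocks)&lt;/code&gt; tuples for inactive entries, and the rejection message are assumptions. Only the 16 MB block size comes from the text.&lt;/p&gt;

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

BLOCK_SIZE_MB = 16  # block size stated in the text


def required_blocks(kv_size_mb: float) -> int:
    return math.ceil(kv_size_mb / BLOCK_SIZE_MB)


@dataclass
class Node:
    name: str
    total_blocks: int
    used_blocks: int = 0
    # (last_used_tick, blocks) for cached entries not serving a request right now;
    # their blocks are already counted in used_blocks.
    inactive: List[Tuple[int, int]] = field(default_factory=list)

    def free_blocks(self) -> int:
        return self.total_blocks - self.used_blocks


def best_fit(nodes: List[Node], need: int) -> Optional[Node]:
    """Pick the node with the fewest blocks left over after the allocation."""
    fitting = [n for n in nodes if n.free_blocks() >= need]
    return min(fitting, key=lambda n: n.free_blocks() - need, default=None)


def evict_oldest_inactive(nodes: List[Node]) -> bool:
    """Free the blocks of the globally oldest inactive entry; False if there is none."""
    holders = [n for n in nodes if n.inactive]
    if not holders:
        return False
    victim_node = min(holders, key=lambda n: min(n.inactive))
    entry = min(victim_node.inactive)
    victim_node.inactive.remove(entry)
    victim_node.used_blocks -= entry[1]
    return True


def allocate(nodes: List[Node], kv_size_mb: float) -> str:
    need = required_blocks(kv_size_mb)
    node = best_fit(nodes, need)                       # 1. attempt allocation
    if node is None and evict_oldest_inactive(nodes):  # 2. evict oldest inactive
        node = best_fit(nodes, need)                   # 3. retry allocation
    if node is None:
        return "rejected: not enough free GPU memory blocks"  # 4. explicit reason
    node.used_blocks += need
    return node.name
```

&lt;p&gt;Choosing the tightest fit leaves the larger free regions on other nodes intact, which is why it tends to reduce fragmentation compared with always picking the emptiest node.&lt;/p&gt;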

&lt;h2&gt;
  
  
  Admission Control Under Load
&lt;/h2&gt;

&lt;p&gt;The most important result from the concurrent benchmark:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concurrency&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;P95 Latency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5,771 ms&lt;/td&gt;
&lt;td&gt;5,771 ms&lt;/td&gt;
&lt;td&gt;0.17 req/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;10,963 ms&lt;/td&gt;
&lt;td&gt;16,299 ms&lt;/td&gt;
&lt;td&gt;0.18 req/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;16,560 ms&lt;/td&gt;
&lt;td&gt;27,744 ms&lt;/td&gt;
&lt;td&gt;0.18 req/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;29,040 ms&lt;/td&gt;
&lt;td&gt;53,525 ms&lt;/td&gt;
&lt;td&gt;0.19 req/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Throughput stays flat while latency explodes. This is classic queuing behavior — the bottleneck is the inference runtime, not the control plane.&lt;/p&gt;
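&lt;p&gt;A back-of-the-envelope model shows why this shape is expected. The Python sketch below assumes the simplest possible case: all requests arrive at once and the runtime serves them strictly one at a time. That serial assumption is mine, not something measured in the benchmark, and the function names are illustrative.&lt;/p&gt;

```python
from typing import Dict, List


def serial_queue_latencies(concurrency: int, service_time_s: float) -> List[float]:
    """Completion time of each request when all arrive together and run one at a time."""
    return [service_time_s * position for position in range(1, concurrency + 1)]


def summarize(concurrency: int, service_time_s: float) -> Dict[str, float]:
    latencies = serial_queue_latencies(concurrency, service_time_s)
    return {
        "avg_s": sum(latencies) / len(latencies),
        "max_s": latencies[-1],
        "throughput_rps": concurrency / latencies[-1],
    }


SINGLE_REQUEST_S = 5.771  # measured latency at concurrency 1, from the table above
model = {n: summarize(n, SINGLE_REQUEST_S) for n in (1, 3, 5, 10)}
```

&lt;p&gt;In this model throughput is fixed at one request per service time (about 0.17 req/s) no matter how many requests are in flight, while average latency grows linearly with concurrency (about 31.7 s at 10). The measured numbers are in the same range, which is consistent with the inference runtime being the single bottleneck.&lt;/p&gt;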

&lt;p&gt;With admission control (&lt;code&gt;--max-active=3&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;No Control&lt;/th&gt;
&lt;th&gt;With Control&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accepted&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 Latency&lt;/td&gt;
&lt;td&gt;~53.5s&lt;/td&gt;
&lt;td&gt;~20.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Good systems don't try to serve everyone. They protect latency by rejecting excess load.&lt;/strong&gt;&lt;/p&gt;
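&lt;p&gt;A minimal sketch of the admission check, in Python. The &lt;code&gt;AdmissionController&lt;/code&gt; class and its method names are hypothetical. Only the limit of three active requests mirrors the &lt;code&gt;--max-active=3&lt;/code&gt; setting above.&lt;/p&gt;

```python
import threading


class AdmissionController:
    """Caps the number of in-flight requests; excess requests are rejected, not queued."""

    def __init__(self, max_active: int) -> None:
        self.max_active = max_active
        self.active = 0
        self.lock = threading.Lock()

    def try_admit(self) -> bool:
        with self.lock:
            if self.active >= self.max_active:
                return False
            self.active += 1
            return True

    def release(self) -> None:
        """Call when an admitted request finishes, successfully or not."""
        with self.lock:
            self.active = max(0, self.active - 1)


controller = AdmissionController(max_active=3)
decisions = [controller.try_admit() for _ in range(10)]  # ten requests arrive at once
```

&lt;p&gt;Three of the ten requests are admitted and seven are rejected immediately, which matches the Accepted row in the table. Rejected callers get a fast answer and can retry later instead of waiting in a queue of unknown length.&lt;/p&gt;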




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prefix reuse is valuable but not sufficient.&lt;/strong&gt; Caching cuts prefill cost, but generation cost dominates real LLM serving. Effective optimization needs to address both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-request latency is misleading.&lt;/strong&gt; Always benchmark under concurrency. P95 at concurrency=10 was more than 9× the single-request time (~53.5 s vs ~5.8 s).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Admission control is more important than caching.&lt;/strong&gt; A system that accepts everything under load will have terrible tail latency. Reject early, protect your SLA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WAL-backed metadata is fast.&lt;/strong&gt; Storage recovery for 5,000 cache entries takes ~20ms — completely invisible compared to inference latency. Persistence is free at this scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--recurse-submodules&lt;/span&gt; https://github.com/NasitSony/llm-serving-cache.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llm-serving-cache
cmake &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-B&lt;/span&gt; build
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build

./build/routing_demo
./build/cache_register_demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/NasitSony/llm-serving-cache" rel="noopener noreferrer"&gt;https://github.com/NasitSony/llm-serving-cache&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;This project is the inference serving layer of a larger AI infrastructure stack. The storage layer underneath is &lt;a href="https://github.com/NasitSony/VeriStore" rel="noopener noreferrer"&gt;VeriStore&lt;/a&gt;. The workload orchestration layer above is &lt;a href="https://github.com/NasitSony/veriflow-control-plane" rel="noopener noreferrer"&gt;Veriflow&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you found this useful, a ⭐ on GitHub goes a long way!&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>cpp</category>
      <category>mlop</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Built a Storage Engine from Scratch in C++ — WAL, Raft, and Object Storage</title>
      <dc:creator>Nasit Sony</dc:creator>
      <pubDate>Fri, 29 May 2026 18:52:54 +0000</pubDate>
      <link>https://dev.to/nasit_sony/i-built-a-storage-engine-from-scratch-in-c-wal-raft-and-object-storage-4in</link>
      <guid>https://dev.to/nasit_sony/i-built-a-storage-engine-from-scratch-in-c-wal-raft-and-object-storage-4in</guid>
      <description>&lt;h1&gt;
  
  
  I Built a Storage Engine from Scratch in C++ — WAL, Raft, and Object Storage
&lt;/h1&gt;

&lt;p&gt;I wanted to understand one thing: &lt;em&gt;how does data actually survive a crash?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not what the documentation says. Not what the abstraction promises. What actually happens at the byte level when a process dies mid-write, and how a storage engine recovers from it.&lt;/p&gt;

&lt;p&gt;So I built VeriStore — a correctness-first key-value storage engine in C++, built from first principles, evolving from a simple in-memory store to a Raft-replicated distributed system with a mini S3-style object storage layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Build a Storage Engine?
&lt;/h2&gt;

&lt;p&gt;Every database, stream processor, and distributed system you've ever used is built on top of primitives like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write-Ahead Logging (WAL)&lt;/li&gt;
&lt;li&gt;Crash-consistent recovery&lt;/li&gt;
&lt;li&gt;Group commit batching&lt;/li&gt;
&lt;li&gt;Consensus replication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these primitives doesn't just make you a better infrastructure engineer. It makes you better at &lt;em&gt;using&lt;/em&gt; these systems because you understand what guarantees they actually provide.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  v0.1 — In-Memory KV Store
&lt;/h3&gt;

&lt;p&gt;The foundation: a thread-safe key-value map with &lt;code&gt;PUT&lt;/code&gt;, &lt;code&gt;GET&lt;/code&gt;, and &lt;code&gt;DEL&lt;/code&gt; operations, protected by a reader-writer lock using &lt;code&gt;std::shared_mutex&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Simple, but sets the pattern for everything above it.&lt;/p&gt;

&lt;h3&gt;
  
  
  v0.2 — Write-Ahead Log + Crash Recovery
&lt;/h3&gt;

&lt;p&gt;The first real challenge: making writes survive crashes.&lt;/p&gt;

&lt;p&gt;The WAL is an append-only log. Every write is recorded in the log &lt;em&gt;before&lt;/em&gt; being applied to the in-memory map. On startup, the log is replayed to reconstruct state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT x 100  → append to WAL → apply to map
PUT y 200  → append to WAL → apply to map
FLUSH      → fsync to disk

[crash]

restart    → replay WAL → x=100, y=200 ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CRC validation detects partial or torn writes — if a record is incomplete, it's ignored and replay stops at that point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Once a write has been fsynced and returns OK, it survives crashes.&lt;/p&gt;
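&lt;p&gt;Here is a minimal Python sketch of CRC-checked log records and replay that stops at a torn tail. It illustrates the idea only and is not VeriStore's C++ record format: the 8-byte header layout, the &lt;code&gt;key=value&lt;/code&gt; payload encoding, and the function names are all assumptions.&lt;/p&gt;

```python
import struct
import zlib
from typing import Dict

HEADER = struct.Struct(">II")  # payload length, CRC32 of payload (big-endian)


def encode_record(key: str, value: str) -> bytes:
    payload = f"{key}={value}".encode("utf-8")
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log: bytes) -> Dict[str, str]:
    """Rebuild state from the log, stopping at the first incomplete or corrupt record."""
    state: Dict[str, str] = {}
    offset = 0
    while len(log) - offset >= HEADER.size:
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # torn or corrupt record: ignore it and everything after it
        key, _, value = payload.decode("utf-8").partition("=")
        state[key] = value
        offset = start + length
    return state


log = encode_record("x", "100") + encode_record("y", "200")
torn = log + encode_record("z", "300")[:-2]  # crash partway through the third write
recovered = replay(torn)
```

&lt;p&gt;Replay of the torn log recovers &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; and drops the incomplete &lt;code&gt;z&lt;/code&gt; record, so a half-written tail can never leak into the reconstructed state.&lt;/p&gt;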

&lt;h3&gt;
  
  
  v0.3 — Snapshots + Log Compaction
&lt;/h3&gt;

&lt;p&gt;Replaying the full WAL on every restart gets slow as the log grows. Snapshots solve this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Serialize the current in-memory state to disk&lt;/li&gt;
&lt;li&gt;Truncate the WAL to remove entries before the snapshot&lt;/li&gt;
&lt;li&gt;On restart: load snapshot, then replay only the recent WAL entries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This keeps recovery time bounded regardless of how long the system has been running.&lt;/p&gt;
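&lt;p&gt;The three steps can be modeled in memory with a few lines of Python. This is a conceptual sketch: the &lt;code&gt;Store&lt;/code&gt; class is hypothetical, and plain dictionaries and lists stand in for the on-disk snapshot file and WAL.&lt;/p&gt;

```python
from typing import Dict, List, Tuple


class Store:
    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.snapshot: Dict[str, str] = {}     # stand-in for the snapshot file
        self.wal: List[Tuple[str, str]] = []   # entries written since the last snapshot

    def put(self, key: str, value: str) -> None:
        self.wal.append((key, value))  # log first
        self.state[key] = value        # then apply

    def take_snapshot(self) -> None:
        self.snapshot = dict(self.state)  # 1. serialize the current state
        self.wal.clear()                  # 2. drop log entries the snapshot covers

    def recover(self) -> Dict[str, str]:
        recovered = dict(self.snapshot)   # 3. load the snapshot ...
        for key, value in self.wal:       #    ... then replay only the recent entries
            recovered[key] = value
        return recovered
```

&lt;p&gt;Recovery cost now depends on the number of writes since the last snapshot rather than on the full history. A real engine also has to make the snapshot durable before truncating the log, which this in-memory model does not show.&lt;/p&gt;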

&lt;h3&gt;
  
  
  v0.4 — Group Commit + Performance
&lt;/h3&gt;

&lt;p&gt;fsyncing every write is correct but slow. Group commit batches writes and fsyncs at boundaries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Immediate flush&lt;/td&gt;
&lt;td&gt;SETBATCH 1&lt;/td&gt;
&lt;td&gt;39,216 ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Group commit&lt;/td&gt;
&lt;td&gt;SETBATCH 5&lt;/td&gt;
&lt;td&gt;104,167 ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;~2.7× throughput improvement&lt;/strong&gt; by reducing fsync frequency. This is the same technique used by PostgreSQL, RocksDB, and etcd.&lt;/p&gt;
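&lt;p&gt;To see where the savings come from, the Python sketch below counts how many fsync calls a given batch size would issue. It is a counting model only: &lt;code&gt;GroupCommitLog&lt;/code&gt; and &lt;code&gt;fsyncs_for&lt;/code&gt; are hypothetical names, and no real file I/O is performed.&lt;/p&gt;

```python
class GroupCommitLog:
    """Counts how many fsync calls a given batch size would issue."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.pending = 0
        self.fsync_calls = 0

    def append(self) -> None:
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.fsync_calls += 1  # a real log would call os.fsync(fd) here
            self.pending = 0


def fsyncs_for(writes: int, batch_size: int) -> int:
    log = GroupCommitLog(batch_size)
    for _ in range(writes):
        log.append()
    log.flush()  # flush the final partial batch, if any
    return log.fsync_calls
```

&lt;p&gt;For 1,000 writes, a batch size of 1 issues 1,000 fsyncs and a batch size of 5 issues 200. The measured speedup is smaller than 5× because fsync is not the only cost per write, and writes still waiting in a batch are the ones at risk if the process crashes before the flush.&lt;/p&gt;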

&lt;h3&gt;
  
  
  v0.5 — Raft Consensus Replication
&lt;/h3&gt;

&lt;p&gt;Single-node durability is not enough for production systems. Raft makes the storage engine fault-tolerant across a cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leader election&lt;/strong&gt; — nodes elect a leader via randomized timeouts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log replication&lt;/strong&gt; — the leader replicates writes to followers before committing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Majority quorum commit&lt;/strong&gt; — a write is committed only when a majority of nodes acknowledge it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Follower catch-up&lt;/strong&gt;: a crashed follower replays missed entries on restart&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example output from the Raft demo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[raft] node 3 became LEADER term=1
ProposePut(a=100) -&amp;gt; true
s1.get(a)=100  s2.get(a)=100  s3.get(a)=100

=== Simulating leader crash ===
[raft] node 2 became LEADER term=2
ProposePut(b=200) -&amp;gt; true
s1.get(b)=200  s2.get(b)=200  s3.get(b)=200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; The cluster remains consistent despite node failures.&lt;/p&gt;
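&lt;p&gt;The majority-quorum rule reduces to a small calculation, sketched here in Python. The function names are illustrative and this is not VeriStore's code. &lt;code&gt;match_index&lt;/code&gt; is assumed to hold the highest replicated log index for every node, leader included. Full Raft adds one more condition that this sketch omits: a leader only commits entries from its own current term this way.&lt;/p&gt;

```python
from typing import List


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def commit_index(match_index: List[int]) -> int:
    """Highest log index stored on a majority of nodes."""
    ranked = sorted(match_index, reverse=True)
    return ranked[majority(len(match_index)) - 1]


three_node = commit_index([5, 3, 2])        # leader at 5, followers at 3 and 2
five_node = commit_index([7, 7, 4, 2, 1])
```

&lt;p&gt;In the three-node case index 3 is the highest entry present on two nodes, so it is the commit point. Entries 4 and 5 exist only on the leader and could still be lost if it crashed.&lt;/p&gt;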

&lt;h3&gt;
  
  
  v0.6–v0.8 — Object Storage Layer
&lt;/h3&gt;

&lt;p&gt;On top of the KV engine, I built a mini S3-style object storage system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bucket creation&lt;/strong&gt; and object &lt;code&gt;PUT&lt;/code&gt;/&lt;code&gt;GET&lt;/code&gt;/&lt;code&gt;DELETE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunked storage&lt;/strong&gt; — large objects are split into fixed-size chunks, each stored as a KV entry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata-based commit&lt;/strong&gt; — object metadata is written last and acts as the commit point. An object is valid only if committed metadata exists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefix-based listing&lt;/strong&gt; — &lt;code&gt;ListObjects(bucket, prefix)&lt;/code&gt; via prefix scans over the bucket index namespace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overwrite semantics&lt;/strong&gt; — new metadata commits atomically replace previous versions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mark-and-sweep garbage collection&lt;/strong&gt;: orphaned chunks from overwrites are reclaimed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The commit semantics are the key insight:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Write chunk data  → KV entries
2. Write metadata    → commit point

Recovery: objects without committed metadata are ignored
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes crash recovery deterministic — you either have the full object or nothing.&lt;/p&gt;
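&lt;p&gt;A compact Python sketch of metadata-as-commit-point. It is illustrative only: the &lt;code&gt;ObjectStore&lt;/code&gt; class, the &lt;code&gt;chunk/&lt;/code&gt; and &lt;code&gt;meta/&lt;/code&gt; key layout, the tiny chunk size, and the &lt;code&gt;crash_before_commit&lt;/code&gt; flag are assumptions, and overwrite versioning and garbage collection are left out.&lt;/p&gt;

```python
from typing import Dict, Optional

CHUNK_SIZE = 4  # tiny on purpose so the example splits visibly


class ObjectStore:
    def __init__(self) -> None:
        self.kv: Dict[str, bytes] = {}  # stand-in for the underlying KV engine

    def put_object(self, bucket: str, key: str, data: bytes,
                   crash_before_commit: bool = False) -> None:
        chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
        for index, chunk in enumerate(chunks):
            self.kv[f"chunk/{bucket}/{key}/{index}"] = chunk  # 1. write chunk data
        if crash_before_commit:
            return  # simulated crash: chunks exist, metadata does not
        self.kv[f"meta/{bucket}/{key}"] = str(len(chunks)).encode()  # 2. commit point

    def get_object(self, bucket: str, key: str) -> Optional[bytes]:
        meta = self.kv.get(f"meta/{bucket}/{key}")
        if meta is None:
            return None  # no committed metadata: the object does not exist
        count = int(meta.decode())
        return b"".join(self.kv[f"chunk/{bucket}/{key}/{i}"] for i in range(count))
```

&lt;p&gt;Readers only ever look at metadata first, so chunks left behind by an interrupted upload are invisible. They are garbage to be reclaimed later, never a partially readable object.&lt;/p&gt;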




&lt;h2&gt;
  
  
  Failure Scenarios Tested
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✔ Process crash (&lt;code&gt;kill -9&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;✔ Partial disk writes&lt;/li&gt;
&lt;li&gt;✔ Leader node crash&lt;/li&gt;
&lt;li&gt;✔ Follower crash and recovery&lt;/li&gt;
&lt;li&gt;✔ Log divergence repair&lt;/li&gt;
&lt;li&gt;✔ Replica catch-up via log backtracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;fsync is the durability boundary.&lt;/strong&gt; A write is only durable once it's fsynced. Group commit is the standard tradeoff — batch writes, fsync at boundaries, accept a small window of potential data loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The commit point is everything.&lt;/strong&gt; Whether it's a WAL record, a metadata entry, or a Raft log index — the commit point is the line between "this happened" and "this might not have happened." Design your commit points explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raft is simpler than it looks, but the edge cases are brutal.&lt;/strong&gt; The basic algorithm is straightforward. But leader crash during replication, log divergence between followers, and split-brain scenarios each required careful handling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macOS is more forgiving than Linux.&lt;/strong&gt; The codebase compiled cleanly on macOS but failed with GCC on Linux: the code relied on &lt;code&gt;&amp;lt;mutex&amp;gt;&lt;/code&gt; without including it directly, and Apple's headers happened to pull it in indirectly. Include what you use, and always build on Linux before shipping.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/NasitSony/VeriStore.git
&lt;span class="nb"&gt;cd &lt;/span&gt;VeriStore
cmake &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-B&lt;/span&gt; build
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build

&lt;span class="c"&gt;# Run the KV CLI&lt;/span&gt;
./build/kv_cli

&lt;span class="c"&gt;# Run the Raft demo&lt;/span&gt;
./build/raft_demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/NasitSony/VeriStore" rel="noopener noreferrer"&gt;https://github.com/NasitSony/VeriStore&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;VeriStore is the storage foundation of a larger AI infrastructure stack I've been building. The next layer up is &lt;a href="https://github.com/NasitSony/llm-serving-cache" rel="noopener noreferrer"&gt;llm-serving-cache&lt;/a&gt; — a KV-cache placement and routing control plane for LLM inference, backed by VeriStore.&lt;/p&gt;

&lt;p&gt;If you found this useful, a ⭐ on GitHub goes a long way!&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>cpp</category>
      <category>mlops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Built a Complete AI Infrastructure Stack from Scratch — Here's What I Learned</title>
      <dc:creator>Nasit Sony</dc:creator>
      <pubDate>Fri, 29 May 2026 18:26:15 +0000</pubDate>
      <link>https://dev.to/nasit_sony/i-built-a-complete-ai-infrastructure-stack-from-scratch-heres-what-i-learned-1de6</link>
      <guid>https://dev.to/nasit_sony/i-built-a-complete-ai-infrastructure-stack-from-scratch-heres-what-i-learned-1de6</guid>
      <description>&lt;h1&gt;
  
  
  I Built a Complete AI Infrastructure Stack from Scratch — Here's What I Learned
&lt;/h1&gt;

&lt;p&gt;Most AI projects start at the top of the stack.&lt;/p&gt;

&lt;p&gt;You grab an LLM API, wire up a vector database, build a RAG pipeline, and ship. That works — until it doesn't. Until your training job crashes at hour 6. Until your inference cache fills up and nobody knows why. Until a worker dies mid-processing and your embeddings are corrupted.&lt;/p&gt;

&lt;p&gt;I wanted to understand what happens &lt;em&gt;below&lt;/em&gt; the API layer. So I built the whole thing from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;Over the past few months I built four interconnected systems that form a complete AI infrastructure stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VeriStore          → Storage layer (WAL, Raft, crash recovery)
      ↓
llm-serving-cache  → Inference serving (KV cache, GPU memory, routing)
      ↓
Veriflow           → Workload orchestration (training jobs, checkpoints, GPU scheduling)
      ↓
SmartSearch        → AI data pipeline (async ingestion, Kafka, RAG, fault tolerance)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer builds on the one before it, so VeriStore at the top of the diagram is the foundation. Each solves a real problem I kept running into. And each taught me something I couldn't have learned from reading documentation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1 — VeriStore: How Data Actually Survives Crashes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/NasitSony/VeriStore" rel="noopener noreferrer"&gt;https://github.com/NasitSony/VeriStore&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first question I wanted to answer: &lt;em&gt;how does data survive a crash?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not "what does the documentation say" — but what actually happens at the byte level when a process dies mid-write.&lt;/p&gt;

&lt;p&gt;VeriStore is a correctness-first key-value storage engine in C++ built from first principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write-Ahead Log (WAL)&lt;/strong&gt; — every write is logged before being applied. On crash, the log is replayed deterministically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CRC validation&lt;/strong&gt; — partial or torn writes are detected and ignored.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group commit batching&lt;/strong&gt; — instead of fsyncing every write, writes are batched. This improved throughput by ~2.7× in benchmarks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot + compaction&lt;/strong&gt; — periodic snapshots eliminate the need for full log replay on restart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raft consensus replication&lt;/strong&gt; — a 3-node cluster with leader election, majority-based commit, and follower catch-up after crashes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mini S3-style object storage&lt;/strong&gt; — built on top of the KV engine with chunked writes, prefix listing, and mark-and-sweep garbage collection.&lt;/li&gt;
&lt;/ul&gt;
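&lt;p&gt;The WAL and CRC bullets above can be sketched in a few lines of Python. The record layout here (4-byte length, 4-byte CRC32, payload) is an assumption for illustration and not VeriStore's on-disk format:&lt;/p&gt;

```python
import struct
import zlib

# Hypothetical record layout for illustration: 4-byte length, 4-byte CRC32, payload.
HEADER = struct.Struct(">II")


def encode_record(payload):
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log):
    """Return payloads of valid records, stopping at the first torn or corrupt one."""
    records = []
    pos = 0
    while len(log) - pos >= HEADER.size:
        length, crc = HEADER.unpack_from(log, pos)
        start = pos + HEADER.size
        payload = log[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # partial write or corruption: ignore the tail
        records.append(payload)
        pos = start + length
    return records


log = encode_record(b"put k1 v1") + encode_record(b"put k2 v2")
torn = log + encode_record(b"put k3 v3")[:-3]  # crash in the middle of the third record
print(replay(torn))  # only the two complete records come back
```

&lt;p&gt;Replay stops at the first bad record rather than skipping it, so the recovered state is always a prefix of what was written.&lt;/p&gt;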

&lt;h3&gt;
  
  
  What I learned
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;fsync is expensive, but skipping it is dangerous.&lt;/strong&gt; Group commit is the right tradeoff — batch writes, fsync at boundaries. This is what RocksDB, PostgreSQL, and etcd all do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The commit point is everything.&lt;/strong&gt; In the object store, an object is valid only if its metadata is committed. This single rule makes crash recovery deterministic: you either have the commit record or you don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raft is simpler than it sounds, but the edge cases are brutal.&lt;/strong&gt; Leader crash during log replication, follower log divergence, split-brain scenarios — each required careful handling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 2 — llm-serving-cache: Where Does the KV Cache Live?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/NasitSony/llm-serving-cache" rel="noopener noreferrer"&gt;https://github.com/NasitSony/llm-serving-cache&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLM inference is expensive. The prefill step — processing the prompt — is the main cost. If you've seen the same prompt before, you shouldn't have to recompute it.&lt;/p&gt;

&lt;p&gt;llm-serving-cache is a control-plane service that tracks where cached attention prefixes live across distributed inference nodes and routes requests to maximize cache reuse.&lt;/p&gt;
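&lt;p&gt;A rough Python sketch of the routing idea: send each request to the node whose cached prefix overlaps the prompt the most. The node names, token lists, and tie-breaking rule are assumptions for illustration, not the project's actual policy:&lt;/p&gt;

```python
# Sketch only: node names, the token lists, and the tie-breaking rule are
# assumptions for illustration, not the project's actual routing policy.

def shared_prefix_len(a, b):
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt_tokens, cached_prefixes):
    """Pick the node whose cached prefix overlaps the prompt the most.

    cached_prefixes maps node name -> list of cached token sequences.
    Returns (node, reused_tokens); node is None when nothing overlaps.
    """
    best_node, best_len = None, 0
    for node, prefixes in sorted(cached_prefixes.items()):
        for prefix in prefixes:
            reused = shared_prefix_len(prompt_tokens, prefix)
            if reused > best_len:
                best_node, best_len = node, reused
    return best_node, best_len


cache_index = {
    "node-a": [["sys", "you", "are"]],
    "node-b": [["sys", "you", "are", "a", "bot"]],
}
print(route(["sys", "you", "are", "a", "cat"], cache_index))  # node-b reuses 4 tokens
```

&lt;p&gt;A real router would also weigh node load and free GPU memory; this sketch looks at prefix overlap only.&lt;/p&gt;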

&lt;p&gt;Key results from benchmarks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Hit Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No Cache&lt;/td&gt;
&lt;td&gt;1405 ms&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefix Reuse&lt;/td&gt;
&lt;td&gt;985 ms&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exact Cache&lt;/td&gt;
&lt;td&gt;205 ms&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU-Aware&lt;/td&gt;
&lt;td&gt;843 ms&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Exact cache reuse reduces latency by ~85% compared to no cache.&lt;/p&gt;

&lt;p&gt;The system models GPU memory as discrete blocks and uses best-fit placement to minimize fragmentation. Under memory pressure, it evicts the oldest inactive requests and retries allocation before rejecting.&lt;/p&gt;

&lt;p&gt;I also validated this against a real Ollama backend running Llama 3.1 8B:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cold request: ~8,488 ms&lt;/li&gt;
&lt;li&gt;Warm request (same prompt): ~5,520 ms&lt;/li&gt;
&lt;li&gt;Prompt eval dropped from 177 ms to 47 ms on warm requests&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What I learned
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cache hits matter enormously for prefill, but decode dominates total latency.&lt;/strong&gt; A warm request still takes ~5.5 seconds because token generation is slow regardless of caching. Real serving optimization needs to address decode efficiency too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Admission control is more important than caching.&lt;/strong&gt; Accepting all requests under load causes queue growth and latency explosion. Rejecting excess load with a hard concurrency limit keeps tail latency controlled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-request latency is misleading.&lt;/strong&gt; At concurrency=10, P95 latency was 53.5 seconds — nearly 3× the single-request time. Production serving systems need batching, scheduling, and admission control, not just cache reuse.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 3 — Veriflow: Treating Training Jobs as Distributed Systems
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/NasitSony/veriflow-control-plane" rel="noopener noreferrer"&gt;https://github.com/NasitSony/veriflow-control-plane&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pain that started everything: training jobs that crash at hour 6 with no checkpoint, no retry, and no idea why.&lt;/p&gt;

&lt;p&gt;Veriflow is a Kubernetes-based job orchestrator that treats AI training as what it actually is — a distributed systems problem.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;checkpoints need to be first-class citizens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most job runners treat AI training like a simple script: run it, and if it fails, restart from zero. Veriflow models job lifecycle as a state machine with checkpoint-aware retry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;JOB_SUBMITTED → RUN_CREATED → POD_RUNNING
→ CHECKPOINT_SAVED            ← checkpoint URI persisted
→ RUN_FAILED                  ← something went wrong
→ RETRY_TRIGGERED             ← scheduler picks it up
→ TRAINING_RESUMED            ← resumes from checkpoint
→ JOB_SUCCEEDED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scheduler uses &lt;code&gt;FOR UPDATE SKIP LOCKED&lt;/code&gt; in Postgres for concurrency-safe job claiming — tested with two concurrent scheduler instances processing 20 burst-submitted jobs with zero duplicate dispatches.&lt;/p&gt;

&lt;p&gt;GPU-aware placement matches jobs to nodes by GPU type, count, and memory requirements using best-fit allocation. Queue-level fairness and quota enforcement prevent one greedy queue from monopolizing the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I learned
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;FOR UPDATE SKIP LOCKED is underrated.&lt;/strong&gt; Most people reach for Redis or a dedicated queue for concurrent job processing. Postgres with SKIP LOCKED handles it correctly — and you get transactions and consistency for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scheduler is a control plane, not a cron job.&lt;/strong&gt; A cron fires and forgets. A control plane continuously reconciles desired state with actual state. This distinction is what makes checkpoint-aware recovery possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkpoint URIs should be in your job spec from day one.&lt;/strong&gt; Treating them as an afterthought means you'll always restart from scratch when things go wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 4 — SmartSearch: What Happens When the Pipeline Breaks?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/NasitSony/SmartSearch" rel="noopener noreferrer"&gt;https://github.com/NasitSony/SmartSearch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most RAG demos show the happy path: ingest document, generate embeddings, search, return results.&lt;/p&gt;

&lt;p&gt;SmartSearch asks a different question: &lt;em&gt;what happens when things fail?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What if the worker crashes mid-processing?&lt;/li&gt;
&lt;li&gt;What if Kafka replays messages?&lt;/li&gt;
&lt;li&gt;What if the database goes down?&lt;/li&gt;
&lt;li&gt;What if duplicate requests arrive?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system is built to handle these scenarios deterministically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idempotent ingestion&lt;/strong&gt; — duplicate Kafka messages don't create duplicate embeddings, enforced via unique constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job lifecycle state machine&lt;/strong&gt; — &lt;code&gt;PENDING → PROCESSING → READY | FAILED&lt;/code&gt;, no hidden progress&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bounded retries + DLQ&lt;/strong&gt; — failed jobs retry with limits, then go to a dead letter queue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full observability&lt;/strong&gt; — Prometheus + Grafana dashboards for pipeline pressure, retry rates, and processing age&lt;/li&gt;
&lt;/ul&gt;
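&lt;p&gt;Idempotent ingestion via a unique constraint can be sketched as follows. SQLite stands in for the real database here, and the table and column names are hypothetical:&lt;/p&gt;

```python
import sqlite3

# Sketch only: SQLite stands in for the real database, and the table and
# column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings ("
    " doc_id TEXT NOT NULL,"
    " chunk_index INTEGER NOT NULL,"
    " vector TEXT NOT NULL,"
    " UNIQUE (doc_id, chunk_index))"
)


def ingest(doc_id, chunk_index, vector):
    """Insert an embedding; a replayed message hits the unique constraint and is a no-op."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO embeddings (doc_id, chunk_index, vector) VALUES (?, ?, ?)",
            (doc_id, chunk_index, vector),
        )
    return cur.rowcount == 1  # True only for the first delivery


print(ingest("doc-1", 0, "[0.1, 0.2]"))  # first delivery: True
print(ingest("doc-1", 0, "[0.1, 0.2]"))  # Kafka replay of the same message: False
print(conn.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0])  # 1
```

&lt;p&gt;Because the database enforces uniqueness, the consumer does not need to remember which messages it has already seen.&lt;/p&gt;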

&lt;h3&gt;
  
  
  What I learned
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;At-least-once + idempotency is the right default.&lt;/strong&gt; Exactly-once semantics in Kafka are possible but complex. At-least-once delivery with idempotent writes gives the same end result for those writes, because a replayed message becomes a no-op, with far less operational complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processing age is the most important metric nobody talks about.&lt;/strong&gt; Latency tells you how fast things are going. Processing age tells you how much work is piling up. A rising processing age means your pipeline is falling behind — before latency spikes make it obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The visibility invariant matters.&lt;/strong&gt; A document is searchable if and only if its state is READY. This single rule prevents partial visibility and makes the system's behavior predictable under any failure scenario.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Building these four systems taught me something that documentation never could:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every layer of the AI stack is a distributed systems problem.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage is about durability and consensus&lt;/li&gt;
&lt;li&gt;Inference serving is about routing and resource management&lt;/li&gt;
&lt;li&gt;Workload orchestration is about scheduling and fault recovery&lt;/li&gt;
&lt;li&gt;Data pipelines are about correctness under partial failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most AI engineers work at the top of this stack and treat the layers below as black boxes. That works until scale, failure, or cost forces the question: &lt;em&gt;what's actually happening down there?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Understanding these layers doesn't just make you a better infrastructure engineer. It makes you better at every layer above — because you understand what guarantees you can actually rely on, and what you need to handle yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Demo GIFs&lt;/strong&gt; for all four projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed control plane&lt;/strong&gt; for llm-serving-cache (Raft-backed metadata)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web UI&lt;/strong&gt; for Veriflow job monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-once semantics&lt;/strong&gt; for SmartSearch (Kafka transactions)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you found this useful, all four repos are on GitHub. Stars and feedback welcome!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VeriStore: &lt;a href="https://github.com/NasitSony/VeriStore" rel="noopener noreferrer"&gt;https://github.com/NasitSony/VeriStore&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;llm-serving-cache: &lt;a href="https://github.com/NasitSony/llm-serving-cache" rel="noopener noreferrer"&gt;https://github.com/NasitSony/llm-serving-cache&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Veriflow: &lt;a href="https://github.com/NasitSony/veriflow-control-plane" rel="noopener noreferrer"&gt;https://github.com/NasitSony/veriflow-control-plane&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SmartSearch: &lt;a href="https://github.com/NasitSony/SmartSearch" rel="noopener noreferrer"&gt;https://github.com/NasitSony/SmartSearch&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>mlops</category>
      <category>cpp</category>
      <category>go</category>
    </item>
    <item>
      <title>I Got Tired of Training Jobs Crashing at Hour 6 — So I Built Veriflow</title>
      <dc:creator>Nasit Sony</dc:creator>
      <pubDate>Fri, 29 May 2026 04:59:30 +0000</pubDate>
      <link>https://dev.to/nasit_sony/i-got-tired-of-training-jobs-crashing-at-hour-6-so-i-built-veriflow-797</link>
      <guid>https://dev.to/nasit_sony/i-got-tired-of-training-jobs-crashing-at-hour-6-so-i-built-veriflow-797</guid>
      <description>&lt;h1&gt;
  
  
  I Got Tired of Training Jobs Crashing at Hour 6 — So I Built Veriflow
&lt;/h1&gt;

&lt;p&gt;You know the feeling.&lt;/p&gt;

&lt;p&gt;You kick off a training job before bed. 8 hours of compute. You wake up, grab your coffee, open the terminal — and see it crashed at hour 6. No checkpoint. No retry. No clue why.&lt;/p&gt;

&lt;p&gt;Restart from zero.&lt;/p&gt;

&lt;p&gt;That pain is what led me to build &lt;strong&gt;Veriflow&lt;/strong&gt; — a checkpoint-aware, fault-tolerant job orchestrator for AI training workloads on Kubernetes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Existing Tools
&lt;/h2&gt;

&lt;p&gt;Most job runners treat AI training like a simple script:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Run it. If it fails, restart it."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But training jobs are not simple scripts. They are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-running&lt;/strong&gt; — hours or days, not seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful&lt;/strong&gt; — they produce checkpoints as they run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expensive&lt;/strong&gt; — GPU time costs real money&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed&lt;/strong&gt; — they touch storage, databases, and compute simultaneously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Restarting from zero every time a job fails is not just annoying — it is wasteful and often unacceptable in production.&lt;/p&gt;

&lt;p&gt;What you actually need is a system that treats AI workloads as what they are: &lt;strong&gt;distributed systems problems&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Veriflow Does Differently
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Checkpoint-Aware Retry
&lt;/h3&gt;

&lt;p&gt;When a job fails, Veriflow does not restart from scratch. It resumes from the latest saved checkpoint.&lt;/p&gt;

&lt;p&gt;The lifecycle looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;JOB_SUBMITTED
JOB_SCHEDULED
RUN_CREATED
POD_RUNNING
TRAINING_PROGRESS
CHECKPOINT_SAVED        ← checkpoint URI persisted
RUN_FAILED              ← something went wrong
RETRY_TRIGGERED         ← scheduler picks it up
TRAINING_RESUMED        ← resumes from checkpoint
JOB_SUCCEEDED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The checkpoint URI is a first-class citizen in the job spec — not an afterthought bolted on later.&lt;/p&gt;
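&lt;p&gt;One way to picture the lifecycle is as an explicit transition table. The Python sketch below is a simplified reading of the event list above; the allowed transitions, the &lt;code&gt;Job&lt;/code&gt; class, and the example URI are assumptions, not Veriflow's scheduler code (which is written in Go):&lt;/p&gt;

```python
# Sketch only: the transition table is a simplified reading of the event list
# above, and resume_uri is a hypothetical helper, not Veriflow's scheduler code.
ALLOWED = {
    "JOB_SUBMITTED": {"JOB_SCHEDULED"},
    "JOB_SCHEDULED": {"RUN_CREATED"},
    "RUN_CREATED": {"POD_RUNNING"},
    "POD_RUNNING": {"TRAINING_PROGRESS", "RUN_FAILED"},
    "TRAINING_PROGRESS": {"CHECKPOINT_SAVED", "RUN_FAILED", "JOB_SUCCEEDED"},
    "CHECKPOINT_SAVED": {"TRAINING_PROGRESS", "RUN_FAILED", "JOB_SUCCEEDED"},
    "RUN_FAILED": {"RETRY_TRIGGERED"},
    "RETRY_TRIGGERED": {"TRAINING_RESUMED"},
    "TRAINING_RESUMED": {"TRAINING_PROGRESS", "CHECKPOINT_SAVED", "RUN_FAILED", "JOB_SUCCEEDED"},
    "JOB_SUCCEEDED": set(),
}


class Job:
    def __init__(self):
        self.state = "JOB_SUBMITTED"
        self.checkpoint_uri = None
        self.events = []

    def transition(self, new_state, checkpoint_uri=None):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        if new_state == "CHECKPOINT_SAVED":
            self.checkpoint_uri = checkpoint_uri  # remember the latest checkpoint
        self.state = new_state
        self.events.append(new_state)

    def resume_uri(self):
        """Where a retry should start: the latest checkpoint, or None for a cold start."""
        return self.checkpoint_uri


job = Job()
for state in ["JOB_SCHEDULED", "RUN_CREATED", "POD_RUNNING", "TRAINING_PROGRESS"]:
    job.transition(state)
job.transition("CHECKPOINT_SAVED", checkpoint_uri="s3://bucket/ckpt-100")  # made-up URI
job.transition("RUN_FAILED")
job.transition("RETRY_TRIGGERED")
job.transition("TRAINING_RESUMED")
print(job.resume_uri())  # the retry starts from the saved checkpoint
```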

&lt;h3&gt;
  
  
  2. Concurrency-Safe Scheduling
&lt;/h3&gt;

&lt;p&gt;Veriflow uses PostgreSQL's &lt;code&gt;FOR UPDATE SKIP LOCKED&lt;/code&gt; for job claiming. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple scheduler instances can run simultaneously&lt;/li&gt;
&lt;li&gt;No duplicate job dispatches — ever&lt;/li&gt;
&lt;li&gt;No complex distributed locking needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tested with two concurrent scheduler instances processing 20 burst-submitted jobs — zero duplicate dispatches observed.&lt;/p&gt;
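&lt;p&gt;For readers who have not used it, the claim query typically looks something like the sketch below. The table and column names (&lt;code&gt;jobs&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;) and the &lt;code&gt;claim_next_job&lt;/code&gt; helper are hypothetical; Veriflow's real schema and Go code may differ. The function accepts any Python DB-API cursor connected to PostgreSQL:&lt;/p&gt;

```python
# Sketch only: table and column names (jobs, status, created_at) are
# hypothetical. Works with any DB-API cursor connected to PostgreSQL.
CLAIM_SQL = """
UPDATE jobs
SET status = 'CLAIMED'
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'PENDING'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""


def claim_next_job(cursor):
    """Claim one pending job, or return None if none is available.

    Rows locked by another scheduler's open transaction are skipped rather
    than waited on, so concurrent schedulers pick different rows. The caller
    is expected to commit the surrounding transaction after dispatching.
    """
    cursor.execute(CLAIM_SQL)
    row = cursor.fetchone()
    return row[0] if row else None
```

&lt;p&gt;The important part is that the lock and the status change happen in one transaction, so a scheduler crash before commit simply releases the row for someone else.&lt;/p&gt;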

&lt;h3&gt;
  
  
  3. GPU-Aware Placement
&lt;/h3&gt;

&lt;p&gt;Jobs declare their GPU requirements upfront:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gpuCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gpuType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"A100"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"minGpuMemoryMb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scheduler matches jobs to nodes that satisfy all constraints, using best-fit placement to avoid fragmentation. If no node satisfies the constraints, the job is deferred with an explicit reason — not silently dropped.&lt;/p&gt;
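&lt;p&gt;A small Python sketch of constraint matching with best-fit selection. The node fields and the "fewest leftover GPUs" scoring are assumptions for illustration; only the job fields come from the spec above:&lt;/p&gt;

```python
# Sketch only: node fields and the "fewest leftover GPUs" scoring are
# assumptions for illustration, not Veriflow's actual placement code.

def place(job, nodes):
    """Return (node_name, None) for the best fit, or (None, reason) to defer."""
    candidates = [
        node for node in nodes
        if node["gpuType"] == job["gpuType"]
        and node["freeGpus"] >= job["gpuCount"]
        and node["gpuMemoryMb"] >= job["minGpuMemoryMb"]
    ]
    if not candidates:
        return None, "no node satisfies gpuType/gpuCount/minGpuMemoryMb"
    # Best fit: the node left with the fewest free GPUs after placement.
    best = min(candidates, key=lambda node: node["freeGpus"] - job["gpuCount"])
    return best["name"], None


job = {"gpuCount": 2, "gpuType": "A100", "minGpuMemoryMb": 30000}
nodes = [
    {"name": "node-1", "gpuType": "A100", "freeGpus": 8, "gpuMemoryMb": 40960},
    {"name": "node-2", "gpuType": "A100", "freeGpus": 2, "gpuMemoryMb": 40960},
    {"name": "node-3", "gpuType": "T4", "freeGpus": 4, "gpuMemoryMb": 16384},
]
print(place(job, nodes))  # node-2: exact fit, keeps node-1 free for larger jobs
```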

&lt;h3&gt;
  
  
  4. Queue-Level Fairness and Quota
&lt;/h3&gt;

&lt;p&gt;Each queue has a GPU quota. Jobs that exceed their queue's quota are deferred, not rejected. The scheduler rotates through queues to prevent starvation — one greedy queue cannot monopolize the cluster.&lt;/p&gt;
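&lt;p&gt;A sketch of what quota-aware rotation could look like. The one-job-per-queue-per-round policy, the queue names, and the data shapes are assumptions for illustration, not Veriflow's exact algorithm:&lt;/p&gt;

```python
from collections import deque

# Sketch only: one-job-per-queue-per-round rotation and the data shapes are
# assumptions for illustration.

def schedule_round(queues, quota, in_use):
    """Visit every queue once; dispatch its head job if the queue stays within quota.

    queues: queue name -> deque of (job_id, gpu_count)
    quota / in_use: queue name -> GPU count
    Returns (dispatched job ids, deferred job ids).
    """
    dispatched, deferred = [], []
    for name in sorted(queues):
        if not queues[name]:
            continue
        job_id, gpus = queues[name][0]
        if in_use[name] + gpus > quota[name]:
            deferred.append(job_id)  # stays queued for a later round
            continue
        queues[name].popleft()
        in_use[name] += gpus
        dispatched.append(job_id)
    return dispatched, deferred


queues = {
    "research": deque([("r1", 4), ("r2", 4), ("r3", 4)]),
    "prod": deque([("p1", 2)]),
}
quota = {"research": 8, "prod": 4}
in_use = {"research": 0, "prod": 0}
for _ in range(3):
    print(schedule_round(queues, quota, in_use))
```

&lt;p&gt;In the third round the research queue is at its quota of 8 GPUs, so &lt;code&gt;r3&lt;/code&gt; is reported as deferred and stays at the head of its queue.&lt;/p&gt;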

&lt;h3&gt;
  
  
  5. Full Event-Sourced Lifecycle
&lt;/h3&gt;

&lt;p&gt;Every state transition emits an event. This means you always know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why a job failed&lt;/li&gt;
&lt;li&gt;When a checkpoint was saved&lt;/li&gt;
&lt;li&gt;How many retry attempts were made&lt;/li&gt;
&lt;li&gt;Exactly how long each phase took&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Veriflow follows a classic control-plane + data-plane split:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client
  │  POST /v1/jobs  (Idempotency-Key)
  ▼
Job API (Go)
  │  writes jobs/spec to Postgres
  ▼
Postgres (jobs, runs, events)
  ▲
  │  claim (FOR UPDATE SKIP LOCKED)
  │  dispatch → Kubernetes Job
  │  reconcile runtime + K8s state
  ▼
Scheduler (Go) ───────────► Kubernetes Job / Pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control plane&lt;/strong&gt; = Job API + Scheduler + Postgres&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data plane&lt;/strong&gt; = Kubernetes Jobs and Pods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation makes the system easy to reason about, scale, and debug.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned Building This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;FOR UPDATE SKIP LOCKED is underrated.&lt;/strong&gt;&lt;br&gt;
Most people reach for Redis or a dedicated queue when they need concurrent job processing. But Postgres with &lt;code&gt;SKIP LOCKED&lt;/code&gt; handles it beautifully — and you get transactions, consistency, and a single source of truth for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkpoint URIs need to be first-class.&lt;/strong&gt;&lt;br&gt;
The biggest mistake I see in ML infra is treating checkpoints as an implementation detail. They need to be in your job spec, tracked in your database, and passed explicitly on retry. If your orchestrator does not know about checkpoints, you will always restart from zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model your job lifecycle as a state machine.&lt;/strong&gt;&lt;br&gt;
Once I stopped thinking about jobs as "running or not running" and started modeling them as state machines with explicit transitions, failure handling became trivial. Every failure has a cause. Every retry has a reason. Nothing is ambiguous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scheduler is a control plane, not a cron job.&lt;/strong&gt;&lt;br&gt;
A cron job fires and forgets. A control plane continuously reconciles desired state with actual state. Veriflow's scheduler constantly reconciles Kubernetes pod states, runtime signals, and database state — which is what makes checkpoint-aware recovery possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/NasitSony/veriflow-control-plane.git
&lt;span class="nb"&gt;cd &lt;/span&gt;veriflow-control-plane
make up
make api
make sched
make demo-success
make events
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo runs a full end-to-end job — submission, scheduling, execution, checkpointing, and success — in under a minute.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics and Prometheus integration&lt;/strong&gt; — expose scheduler and job metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web UI&lt;/strong&gt; — visualize job lifecycle and GPU utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cluster support&lt;/strong&gt; — dispatch jobs across multiple Kubernetes clusters&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Feedback Welcome
&lt;/h2&gt;

&lt;p&gt;Veriflow is early-stage and I am actively looking for feedback from anyone doing ML infra or platform engineering. What features would make this useful for your workloads?&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/NasitSony/veriflow-control-plane" rel="noopener noreferrer"&gt;https://github.com/NasitSony/veriflow-control-plane&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you found this useful, a ⭐ on GitHub goes a long way!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>machinelearning</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
