I Got Tired of Training Jobs Crashing at Hour 6 — So I Built Veriflow

#ai #kubernetes #machinelearning #showdev

You know the feeling.

You kick off a training job before bed. 8 hours of compute. You wake up, grab your coffee, open the terminal — and see it crashed at hour 6. No checkpoint. No retry. No clue why.

Restart from zero.

That pain is what led me to build Veriflow — a checkpoint-aware, fault-tolerant job orchestrator for AI training workloads on Kubernetes.

The Problem With Existing Tools

Most job runners treat AI training like a simple script:

"Run it. If it fails, restart it."

But training jobs are not simple scripts. They are:

Long-running — hours or days, not seconds
Stateful — they produce checkpoints as they run
Expensive — GPU time costs real money
Distributed — they touch storage, databases, and compute simultaneously

Restarting from zero every time a job fails is not just annoying — it is wasteful and often unacceptable in production.

What you actually need is a system that treats AI workloads as what they are: distributed systems problems.

What Veriflow Does Differently

1. Checkpoint-Aware Retry

When a job fails, Veriflow does not restart from scratch. It resumes from the latest saved checkpoint.

The lifecycle looks like this:

JOB_SUBMITTED
JOB_SCHEDULED
RUN_CREATED
POD_RUNNING
TRAINING_PROGRESS
CHECKPOINT_SAVED        ← checkpoint URI persisted
RUN_FAILED              ← something went wrong
RETRY_TRIGGERED         ← scheduler picks it up
TRAINING_RESUMED        ← resumes from checkpoint
JOB_SUCCEEDED

The checkpoint URI is a first-class citizen in the job spec — not an afterthought bolted on later.
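
To make the retry step concrete, here is a minimal Go sketch of how a retry could carry the checkpoint forward. The `JobSpec` fields, the `train` command, and the `--resume-from` flag are all hypothetical names chosen for illustration; they are not taken from the Veriflow source.

```go
package main

import "fmt"

// JobSpec is a hypothetical, trimmed-down job spec. Field names are
// illustrative assumptions, not Veriflow's actual schema.
type JobSpec struct {
	ID            string
	CheckpointURI string // latest persisted checkpoint; empty if none was saved
	Attempt       int    // attempts made so far, starting at 1
	MaxAttempts   int
}

// NextAttempt returns the spec for a retry, or false when the retry budget
// is exhausted. The checkpoint URI is copied over unchanged so the new run
// can resume instead of starting from step zero.
func NextAttempt(spec JobSpec) (JobSpec, bool) {
	if spec.Attempt >= spec.MaxAttempts {
		return JobSpec{}, false
	}
	next := spec
	next.Attempt++
	return next, true
}

// ContainerArgs builds the training command. The "--resume-from" flag is an
// assumed convention between the orchestrator and the training script.
func ContainerArgs(spec JobSpec) []string {
	args := []string{"train"}
	if spec.CheckpointURI != "" {
		args = append(args, "--resume-from="+spec.CheckpointURI)
	}
	return args
}

func main() {
	failed := JobSpec{
		ID:            "job-1",
		CheckpointURI: "s3://example-bucket/job-1/step-500",
		Attempt:       1,
		MaxAttempts:   3,
	}
	if next, ok := NextAttempt(failed); ok {
		fmt.Println(next.Attempt, ContainerArgs(next))
	}
}
```

The point of the sketch is that the retry path never has to guess where the checkpoint lives: it reads it from the spec and passes it on explicitly.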

2. Concurrency-Safe Scheduling

Veriflow uses PostgreSQL's FOR UPDATE SKIP LOCKED for job claiming. This means:

Multiple scheduler instances can run simultaneously
No two schedulers claim the same job, because a row locked by one instance is skipped by the others
No complex distributed locking needed

In a test with two concurrent scheduler instances processing 20 burst-submitted jobs, I observed zero duplicate dispatches.
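
For readers who have not used this pattern, a claim query can look roughly like the sketch below. The table name `jobs`, the columns `status` and `created_at`, and the status values are assumptions for illustration; Veriflow's actual schema may differ. The Go side uses only the standard `database/sql` package and assumes a Postgres driver is registered elsewhere.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// claimQuery picks the oldest queued job and marks it scheduled in a single
// statement. SKIP LOCKED makes a second scheduler skip rows that another
// transaction has already locked instead of waiting on them.
// Table, column, and status names are assumptions.
const claimQuery = `
UPDATE jobs
SET status = 'SCHEDULED'
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'QUEUED'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id`

// ClaimNext claims at most one job. It returns ("", false, nil) when no
// job is currently claimable.
func ClaimNext(ctx context.Context, db *sql.DB) (string, bool, error) {
	var id string
	err := db.QueryRowContext(ctx, claimQuery).Scan(&id)
	if errors.Is(err, sql.ErrNoRows) {
		return "", false, nil
	}
	if err != nil {
		return "", false, err
	}
	return id, true, nil
}

func main() {
	// Opening a connection needs a registered Postgres driver, which is
	// left out here to keep the sketch dependency-free.
	fmt.Println(claimQuery)
}
```

Because the status change happens in the same statement as the lock, a job that has been claimed no longer matches `status = 'QUEUED'` once the transaction commits.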

3. GPU-Aware Placement

Jobs declare their GPU requirements upfront:

{
  "gpuCount": 2,
  "gpuType": "A100",
  "minGpuMemoryMb": 30000
}

The scheduler matches jobs to nodes that satisfy all constraints, using best-fit placement to avoid fragmentation. If no node satisfies the constraints, the job is deferred with an explicit reason — not silently dropped.
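
As an illustration of best-fit placement, the sketch below filters nodes by the three constraints from the JSON above and then picks the feasible node with the fewest free GPUs. The `Node` type and the wording of the deferral reason are hypothetical; "fewest free GPUs" is one common reading of best-fit and may not match Veriflow's exact scoring.

```go
package main

import "fmt"

// GPURequest mirrors the JSON fields shown above.
type GPURequest struct {
	GPUCount       int
	GPUType        string
	MinGPUMemoryMB int
}

// Node is a hypothetical view of a cluster node's GPU inventory.
type Node struct {
	Name        string
	GPUType     string
	GPUMemoryMB int // memory per GPU
	FreeGPUs    int
}

// BestFit returns the feasible node that leaves the fewest GPUs idle after
// placement. If no node qualifies, it returns an empty Node and a
// human-readable deferral reason.
func BestFit(req GPURequest, nodes []Node) (Node, string) {
	best := -1
	for i, n := range nodes {
		if n.GPUType != req.GPUType ||
			n.GPUMemoryMB < req.MinGPUMemoryMB ||
			n.FreeGPUs < req.GPUCount {
			continue // node violates at least one constraint
		}
		if best == -1 || n.FreeGPUs < nodes[best].FreeGPUs {
			best = i
		}
	}
	if best == -1 {
		return Node{}, fmt.Sprintf("no node with %d free %s GPUs and at least %d MB per GPU",
			req.GPUCount, req.GPUType, req.MinGPUMemoryMB)
	}
	return nodes[best], ""
}

func main() {
	nodes := []Node{
		{Name: "node-a", GPUType: "A100", GPUMemoryMB: 40000, FreeGPUs: 8},
		{Name: "node-b", GPUType: "A100", GPUMemoryMB: 40000, FreeGPUs: 2},
	}
	req := GPURequest{GPUCount: 2, GPUType: "A100", MinGPUMemoryMB: 30000}
	node, reason := BestFit(req, nodes)
	fmt.Println(node.Name, reason)
}
```

Choosing the tightest fit keeps the large node free for a later job that needs all eight GPUs, which is the fragmentation concern mentioned above.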

4. Queue-Level Fairness and Quota

Each queue has a GPU quota. Jobs that exceed their queue's quota are deferred, not rejected. The scheduler rotates through queues to prevent starvation — one greedy queue cannot monopolize the cluster.
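
One way to combine quota checks with rotation is a round-robin cursor over the queues, sketched below. The `Queue` type and the `PickNext` function are hypothetical, and the sketch only models GPU counts, not full job specs.

```go
package main

import "fmt"

// Queue is a hypothetical in-memory view of a queue and its quota usage.
type Queue struct {
	Name     string
	GPUQuota int
	GPUsUsed int
	Pending  []int // GPU counts requested by pending jobs, oldest first
}

// PickNext scans the queues round-robin starting at index start. It returns
// the index of the first queue whose head job fits within its quota, plus
// the cursor to use on the next call. A queue whose head job would exceed
// the quota is skipped: the job is deferred and stays in the queue.
// It returns -1 when nothing can be dispatched.
func PickNext(queues []Queue, start int) (picked int, next int) {
	n := len(queues)
	for i := 0; i < n; i++ {
		idx := (start + i) % n
		q := queues[idx]
		if len(q.Pending) == 0 {
			continue
		}
		if q.GPUsUsed+q.Pending[0] > q.GPUQuota {
			continue // over quota: deferred, not rejected
		}
		return idx, (idx + 1) % n
	}
	return -1, start
}

func main() {
	queues := []Queue{
		{Name: "research", GPUQuota: 4, Pending: []int{2, 2, 2}},
		{Name: "prod", GPUQuota: 2, Pending: []int{1}},
	}
	cursor := 0
	for {
		idx, next := PickNext(queues, cursor)
		if idx == -1 {
			break
		}
		q := &queues[idx]
		q.GPUsUsed += q.Pending[0]
		q.Pending = q.Pending[1:]
		fmt.Printf("dispatched from %s, now using %d/%d GPUs\n", q.Name, q.GPUsUsed, q.GPUQuota)
		cursor = next
	}
}
```

Advancing the cursor past the queue that was just served is what stops a queue with a deep backlog from being picked on every pass.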

5. Full Event-Sourced Lifecycle

Every state transition emits an event. This means you always know:

Why a job failed
When a checkpoint was saved
How many retry attempts were made
Exactly how long each phase took
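
Those four questions can be answered by folding over the event log. The sketch below is a minimal illustration: the `Event` shape, the `Detail` field, and the sample failure reason are assumptions, while the event type names come from the lifecycle shown earlier.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a hypothetical row from the events table.
type Event struct {
	Type   string
	At     time.Time
	Detail string // for example a failure reason or a checkpoint URI
}

// Summary holds the answers derived from one job's event log.
type Summary struct {
	Retries        int
	Checkpoints    int
	FailureReasons []string
	PhaseDurations map[string]time.Duration // keyed by the event that opened the phase
}

// Summarize walks the events in order. The duration of a phase is the gap
// between an event and the one that follows it.
func Summarize(events []Event) Summary {
	s := Summary{PhaseDurations: map[string]time.Duration{}}
	for i, e := range events {
		switch e.Type {
		case "RETRY_TRIGGERED":
			s.Retries++
		case "CHECKPOINT_SAVED":
			s.Checkpoints++
		case "RUN_FAILED":
			s.FailureReasons = append(s.FailureReasons, e.Detail)
		}
		if i+1 < len(events) {
			s.PhaseDurations[e.Type] += events[i+1].At.Sub(e.At)
		}
	}
	return s
}

// at builds a timestamp a given number of seconds after the Unix epoch,
// which keeps the sample data short.
func at(sec int) time.Time {
	return time.Unix(int64(sec), 0).UTC()
}

func main() {
	events := []Event{
		{Type: "POD_RUNNING", At: at(0)},
		{Type: "CHECKPOINT_SAVED", At: at(60), Detail: "s3://example-bucket/job-1/step-500"},
		{Type: "RUN_FAILED", At: at(90), Detail: "illustrative failure reason"},
		{Type: "RETRY_TRIGGERED", At: at(95)},
		{Type: "JOB_SUCCEEDED", At: at(200)},
	}
	s := Summarize(events)
	fmt.Println(s.Retries, s.Checkpoints, s.FailureReasons, s.PhaseDurations)
}
```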

Architecture

Veriflow follows a classic control-plane + data-plane split:

Client
  │  POST /v1/jobs  (Idempotency-Key)
  ▼
Job API (Go)
  │  writes jobs/spec to Postgres
  ▼
Postgres (jobs, runs, events)
  ▲
  │  claim (FOR UPDATE SKIP LOCKED)
  │  dispatch → Kubernetes Job
  │  reconcile runtime + K8s state
  ▼
Scheduler (Go) ───────────► Kubernetes Job / Pod

Control plane = Job API + Scheduler + Postgres
Data plane = Kubernetes Jobs and Pods

This separation makes the system easy to reason about, scale, and debug.
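
The Idempotency-Key header in the diagram deserves a quick illustration. The sketch below shows the idea with an in-memory map: a repeated key returns the job that was already created instead of creating a second one. The `JobStore` type is hypothetical, and a real Job API would more likely enforce this with a unique constraint in Postgres than with a map.

```go
package main

import (
	"fmt"
	"sync"
)

// JobStore is a hypothetical in-memory stand-in for the jobs table.
type JobStore struct {
	mu     sync.Mutex
	byKey  map[string]string // idempotency key -> job ID
	nextID int
}

func NewJobStore() *JobStore {
	return &JobStore{byKey: map[string]string{}}
}

// Submit creates a job for a new key and returns the existing job ID for a
// key it has seen before. The second return value reports whether a new
// job was created.
func (s *JobStore) Submit(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if id, ok := s.byKey[key]; ok {
		return id, false
	}
	s.nextID++
	id := fmt.Sprintf("job-%d", s.nextID)
	s.byKey[key] = id
	return id, true
}

func main() {
	store := NewJobStore()
	first, created := store.Submit("key-abc")
	fmt.Println(first, created)
	// A client retrying the same POST after a timeout sends the same key.
	again, createdAgain := store.Submit("key-abc")
	fmt.Println(again, createdAgain)
}
```

This matters for training workloads because a client that retries a timed-out submission should not end up paying for two identical GPU jobs.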

What I Learned Building This

FOR UPDATE SKIP LOCKED is underrated.
Most people reach for Redis or a dedicated queue when they need concurrent job processing. But Postgres with SKIP LOCKED handles it beautifully — and you get transactions, consistency, and a single source of truth for free.

Checkpoint URIs need to be first-class.
The biggest mistake I see in ML infra is treating checkpoints as an implementation detail. They need to be in your job spec, tracked in your database, and passed explicitly on retry. If your orchestrator does not know about checkpoints, you will always restart from zero.

Model your job lifecycle as a state machine.
Once I stopped thinking about jobs as "running or not running" and started modeling them as state machines with explicit transitions, failure handling became much simpler. Every failure has a recorded cause. Every retry has a recorded reason. Nothing is ambiguous.
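
A state machine like this can be as small as a transition table plus one validation function. In the sketch below the state names are loosely borrowed from the lifecycle events earlier in the post; the exact set of states and transitions in Veriflow may differ.

```go
package main

import "fmt"

// State is a job state. The names are illustrative assumptions.
type State string

const (
	Submitted State = "SUBMITTED"
	Scheduled State = "SCHEDULED"
	Running   State = "RUNNING"
	Failed    State = "FAILED"
	Succeeded State = "SUCCEEDED"
)

// transitions lists every allowed move. Anything not listed is rejected,
// so an unexpected state change shows up as an error instead of silently
// corrupting the job record.
var transitions = map[State][]State{
	Submitted: {Scheduled},
	Scheduled: {Running},
	Running:   {Failed, Succeeded},
	Failed:    {Scheduled}, // a retry goes back through scheduling
}

// Transition validates a move from one state to another.
func Transition(from, to State) error {
	for _, allowed := range transitions[from] {
		if allowed == to {
			return nil
		}
	}
	return fmt.Errorf("illegal transition %s -> %s", from, to)
}

func main() {
	fmt.Println(Transition(Running, Failed))
	fmt.Println(Transition(Succeeded, Running))
}
```

Succeeded has no outgoing transitions in this table, which makes it a terminal state by construction.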

The scheduler is a control plane, not a cron job.
A cron job fires and forgets. A control plane continuously reconciles desired state with actual state. Veriflow's scheduler constantly reconciles Kubernetes pod states, runtime signals, and database state — which is what makes checkpoint-aware recovery possible.
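
The core of a reconcile step is a pure decision function: given what the database believes and what Kubernetes reports, decide what to do. The sketch below is a simplified illustration. The action names and database state strings are hypothetical; the pod phases are the standard Kubernetes ones (Pending, Running, Succeeded, Failed, Unknown).

```go
package main

import "fmt"

// PodPhase mirrors the Kubernetes pod phase values.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// Action is what the scheduler should do next. Names are illustrative.
type Action string

const (
	NoOp          Action = "NOOP"
	MarkRunning   Action = "MARK_RUNNING"
	MarkSucceeded Action = "MARK_SUCCEEDED"
	MarkFailed    Action = "MARK_FAILED" // a failed run becomes eligible for retry
)

// Reconcile compares the state recorded in the database with the observed
// pod and returns the action that brings the two back in line.
func Reconcile(dbState string, podExists bool, phase PodPhase) Action {
	switch dbState {
	case "SCHEDULED":
		if podExists && phase == PodRunning {
			return MarkRunning
		}
	case "RUNNING":
		if !podExists || phase == PodFailed {
			return MarkFailed // pod vanished or crashed
		}
		if phase == PodSucceeded {
			return MarkSucceeded
		}
	}
	return NoOp
}

func main() {
	fmt.Println(Reconcile("SCHEDULED", true, PodRunning))
	fmt.Println(Reconcile("RUNNING", false, ""))
	fmt.Println(Reconcile("RUNNING", true, PodRunning))
}
```

Running this function on every tick, rather than only when a job is submitted, is what lets the scheduler notice a pod that disappeared without reporting anything.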

Try It Yourself

git clone https://github.com/NasitSony/veriflow-control-plane.git
cd veriflow-control-plane
make up
make api
make sched
make demo-success
make events

The demo runs a full end-to-end job — submission, scheduling, execution, checkpointing, and success — in under a minute.

What's Next

Metrics and Prometheus integration — expose scheduler and job metrics
Web UI — visualize job lifecycle and GPU utilization
Multi-cluster support — dispatch jobs across multiple Kubernetes clusters

Feedback Welcome

Veriflow is early-stage and I am actively looking for feedback from anyone doing ML infra or platform engineering. What features would make this useful for your workloads?

GitHub: https://github.com/NasitSony/veriflow-control-plane

If you found this useful, a ⭐ on GitHub goes a long way!