akirayuusha

Posted on Jun 13 • Edited on Jun 14

Launching BabyChain: durable image and video model chains on AWS Aurora and Vercel

#ai #programming #productivity #opensource

Today we are launching BabyChain: a self-hosted canvas studio and durable chain API for image and video model workflows.

The short version is this: BabyChain lets you design a ComfyUI-style media chain on a canvas, then call that same chain from product code as POST /api/v1/chains/runs. Every step executes through provider APIs with server-side credentials, every state transition persists to AWS Aurora, and Vercel functions stay stateless.

The product has one invariant: every output becomes the next input.

Website: https://babychain.babysea.live
GitHub: https://github.com/babysea-community/babychain

If you run model chains on a local GPU workbench today, BabyChain is the version of that workflow you can deploy, call from a backend, and keep forever. The canvas is not a demo shell. It is a visual editor on top of the same durable contract your application calls.

Why we built it: canvas workflows are not production infrastructure

Real generative media work is rarely one model call. It is an image model feeding an image-to-video model, often with a refine step in the middle and a video-to-video step at the end. Canvas tools made that composable, but most of them are creative workbenches. The workflow lives inside a UI, expects a local GPU or a managed model runtime, and does not naturally become an authenticated API that another product can call.

We kept hitting the same wall in our own projects: the moment a visual workflow needed to become product infrastructure (authenticated, retryable, callable from a queue, safe to expose to another backend), we had to rewrite it as glue code.

BabyChain's design goal was to make that distance zero.

Design the chain on a canvas. Call the same chain from your backend. Those should not be two different systems.

What ships today

Multi-flow canvas studio. Many independent image → video flows side by side on one permanent workspace, with autosave and a library of saved chains plus their results.
57 image and video models across Alibaba Cloud, Black Forest Labs, BytePlus, Google, OpenAI, and Runway, for 78,948 valid chain combinations.
Schema-true node cards, powered by Semantic Lady. Every card's fields, enum options, ranges, and defaults are generated from that model's schema, so the UI cannot offer a parameter the API would reject.
BYOK execution. Provider keys live in the server environment, never in the browser and never in caller requests. Caller apps authenticate with BabyChain API keys.
Durable runs on Aurora. Ordered steps, provider request ids, generation ids, outputs, failures, callbacks, and a timeline are all persisted. When a webhook URL is supplied, one signed callback delivers the run's terminal state.

Semantic Lady became necessary because BabyChain supports two execution worlds. In BabySea mode, the BabySea SDK already gives us a normalized generation_* contract. In BYOK mode, BabyChain talks directly to provider APIs, and every provider brings its own schema vocabulary. Semantic Lady is the local SDK that gives BYOK mode a real schema layer: model metadata, field definitions, enum options, defaults, and constraints that the canvas and API can trust before a provider request is sent.

Self-host it in an afternoon

BabyChain is Apache-2.0 and built to be deployed, not hosted by us:

git clone https://github.com/babysea-community/babychain.git
cd babychain && pnpm install --frozen-lockfile
cp .env.example .env.local   # DATABASE_URL, owner login, provider keys
pnpm run aurora:migrate      # applies the schema, idempotent
pnpm dev                     # or use the one-click Vercel deploy button

For production, BabyChain is designed around AWS Aurora. For local development, it can also point at a local PostgreSQL database. The README walks through creating the Aurora cluster, setting DATABASE_URL, applying the schema, and deploying the app on Vercel.

The rest of this post is the architecture: how a canvas workflow becomes durable infrastructure on Aurora and Vercel.

The architecture: Aurora remembers, Vercel advances

The naive way to run a multi-model chain on serverless is to hold the whole chain in one function invocation. That dies quickly. A single image → video → video-modify workflow can spend several minutes inside provider queues, and a stateless function should not be asked to babysit that entire wait.

The durable way is to make the database the only place workflow state lives:

Aurora owns every fact about a run.
Vercel functions are stateless workers.
Each invocation advances a run by at most one step.
Any instance can pick up any run mid-chain.

When a caller creates a run, BabyChain persists the run and its ordered steps to Aurora, may opportunistically advance the first ready step, and returns without waiting for the full chain. Each subsequent poll of GET /api/v1/chains/get/{runId}, or a cron sweep, loads the run from Aurora, advances exactly one provider step (submit or poll), persists the result, and returns. Long chains survive serverless limits because no instance ever needs to outlive a step.
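The "at most one step per invocation" rule can be sketched as a single function. This is a minimal illustration, not BabyChain's runner: the `Run`, `Step`, and `Provider` shapes are hypothetical, the store is an in-memory object rather than Aurora, and the provider calls are synchronous to keep the example short.

```typescript
// Hypothetical shapes; the real chain_run / chain_step columns may differ.
type StepStatus = "queued" | "running" | "succeeded" | "failed" | "skipped";

interface Step {
  index: number;
  status: StepStatus;
  providerRequestId?: string;
  output?: string;
}

interface Run {
  id: string;
  status: "running" | "succeeded" | "failed";
  steps: Step[];
}

// Stand-in for a provider adapter (assumed interface, not a real SDK).
interface Provider {
  submit(step: Step, input: string | undefined): string; // returns a request id
  poll(requestId: string): { done: boolean; output?: string; error?: string };
}

// Advance a run by at most one provider action (submit or poll), then return.
function advanceOnce(run: Run, provider: Provider): Run {
  if (run.status !== "running") return run; // terminal runs are never touched
  const step = run.steps.find((s) => s.status === "queued" || s.status === "running");
  if (!step) {
    run.status = "succeeded";
    return run;
  }
  if (step.status === "queued") {
    // Every output becomes the next input.
    const previous = run.steps[step.index - 1];
    step.providerRequestId = provider.submit(step, previous?.output);
    step.status = "running";
    return run;
  }
  const result = provider.poll(step.providerRequestId!);
  if (result.error) {
    // Fail fast: downstream steps can never receive their input.
    step.status = "failed";
    run.status = "failed";
    for (const s of run.steps) if (s.status === "queued") s.status = "skipped";
  } else if (result.done) {
    step.status = "succeeded";
    step.output = result.output;
    if (run.steps.every((s) => s.status === "succeeded")) run.status = "succeeded";
  }
  return run;
}
```

In a deployed version, each call would load the run from the database, perform the one action, and persist before returning, so a poll handler or a cron sweep can call it interchangeably.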

Aurora owns every fact about a run, so a Vercel function is allowed to disappear at any moment. The chain is not.

Aurora Serverless v2 fits the workload: bursty, low-idle, spiky on demo days. When a cluster is configured to pause, the connection pool uses a 30-second connection timeout so the first request can wait out the Aurora wake-up. For Aurora/RDS endpoints, deployers keep ?sslmode=require in DATABASE_URL; BabyChain strips that driver-level query parameter and connects with TLS, including the RDS CA behavior expected by the Node.js pg client.
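As a rough sketch of that connection handling, the helper below turns a DATABASE_URL into options shaped like the ones the Node.js pg `Pool` accepts (`connectionString`, `ssl`, `connectionTimeoutMillis`). The function name and the exact TLS settings are assumptions for illustration; wiring in the RDS CA bundle is left out.

```typescript
// Hypothetical helper: shapes a DATABASE_URL into pg-style pool options.
interface PoolOptions {
  connectionString: string;
  ssl: false | { rejectUnauthorized: boolean };
  connectionTimeoutMillis: number;
}

function poolOptionsFromUrl(databaseUrl: string): PoolOptions {
  const url = new URL(databaseUrl);
  const sslmode = url.searchParams.get("sslmode");
  // Strip the driver-level param so TLS is configured in exactly one place.
  url.searchParams.delete("sslmode");
  const wantsTls = sslmode !== null && sslmode !== "disable";
  return {
    connectionString: url.toString(),
    // Assumption: verify the server certificate whenever TLS is requested.
    ssl: wantsTls ? { rejectUnauthorized: true } : false,
    // 30 seconds, long enough to wait out a paused cluster waking up.
    connectionTimeoutMillis: 30_000,
  };
}
```

The returned object could be passed to `new Pool(...)` from pg; a local PostgreSQL URL without `sslmode` simply yields `ssl: false`.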

The schema: seven tables, one source of truth

Everything durable lives in one private schema, babychain_private, applied idempotently by pnpm aurora:migrate:

| Table | Owns |
| --- | --- |
| `chain_run` | Run lifecycle: status, input, output, error code/message, idempotency key hash, callback intent |
| `chain_step` | Ordered steps: per-step params, provider request ids, generation ids, output files, failure details |
| `canvas` | Saved node graphs as `jsonb`, owner-scoped, with a touch trigger and a `(owner_email, updated_at desc)` index |
| `api_key` | Hashed caller keys with scopes |
| `audit_event` | Append-only audit trail |
| `callback_delivery` | Final signed-webhook delivery state |
| `babysea_webhook_delivery` | Inbound provider webhook bookkeeping |

Two design details earned their keep:

The input_order sidecar. PostgreSQL jsonb does not preserve key order, but the public run resource echoes the caller's input back in API responses, and key order is part of how people read their own request. Run creation stores a small jsonb array of the caller's original key order alongside the canonicalized input, and the presenter re-applies it on the way out. It is a small detail, but it matters when an API response is also a debugging surface.
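The sidecar idea fits in two small functions. This is a sketch under assumptions: the function names are made up, it handles top-level keys only, and JavaScript itself always lists integer-like keys first, so it suits ordinary string keys.

```typescript
type Json = Record<string, unknown>;

// On create: store a canonical (sorted-key) copy plus the caller's key order.
function splitInput(input: Json): { canonical: Json; inputOrder: string[] } {
  const inputOrder = Object.keys(input);
  const canonical: Json = {};
  for (const key of [...inputOrder].sort()) canonical[key] = input[key];
  return { canonical, inputOrder };
}

// On read: re-apply the recorded order when presenting the run.
function presentInput(stored: Json, inputOrder: string[]): Json {
  const out: Json = {};
  for (const key of inputOrder) if (key in stored) out[key] = stored[key];
  // Keys missing from the sidecar still appear, after the ordered ones.
  for (const key of Object.keys(stored)) if (!(key in out)) out[key] = stored[key];
  return out;
}
```

A caller who sends `prompt` first and `model` last sees the same order echoed back, even though the stored copy is sorted.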

Guarded state transitions. A step only leaves the queued state through an update guarded by where status = 'queued'. That single predicate makes the fail-fast path race-safe: when a step fails, the runner marks the run failed and sweeps every still-queued downstream step to skipped (their input can never arrive) without ever clobbering a step that a concurrent invocation already started.
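The guard behaves like a compare-and-set. The sketch below simulates it in memory so the race is easy to see; the SQL in the comments and the column names (`id`, `run_id`) are assumptions, not the project's actual statements.

```typescript
type Status = "queued" | "running" | "succeeded" | "failed" | "skipped";

interface StepRow {
  id: number;
  runId: string;
  status: Status;
}

// In-memory stand-in for:
//   update chain_step set status = $next where id = $id and status = 'queued'
// Returns the number of rows changed, like a SQL row count.
function leaveQueued(rows: StepRow[], id: number, next: Status): number {
  const row = rows.find((r) => r.id === id && r.status === "queued");
  if (!row) return 0; // someone else already moved this step
  row.status = next;
  return 1;
}

// In-memory stand-in for:
//   update chain_step set status = 'skipped'
//   where run_id = $run and status = 'queued'
function skipQueued(rows: StepRow[], runId: string): number {
  let changed = 0;
  for (const row of rows) {
    if (row.runId === runId && row.status === "queued") {
      row.status = "skipped";
      changed += 1;
    }
  }
  return changed;
}
```

Because both updates filter on the queued status, a step that a concurrent invocation has already started is invisible to the sweep, and a second attempt to start the same step changes zero rows.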

Idempotency end to end

Generative media is expensive enough that retries must not multiply spend. BabyChain makes idempotency a property of the whole pipeline, not one endpoint:

Run creation hashes the caller's Idempotency-Key per principal and stores it on chain_run with a unique constraint. A retried create replays the stored run: same id, same response, zero new provider calls.
Step submission derives a deterministic idempotency key per run, step, and chain version. If a function instance dies between submitting to a provider and persisting the result, the retry resubmits with the same key, and providers that honor idempotency can deduplicate server-side.
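Both keys can be illustrated with a few lines of hashing. Treat this as a sketch: SHA-256, the JSON-array encoding, and the function names are assumptions, and the `Map` stands in for the unique constraint on chain_run.

```typescript
import { createHash } from "node:crypto";

function sha256Hex(value: string): string {
  return createHash("sha256").update(value).digest("hex");
}

// Scope the caller's Idempotency-Key to the principal before hashing, so two
// callers can reuse the same key string without colliding.
function runKeyHash(principalId: string, idempotencyKey: string): string {
  return sha256Hex(JSON.stringify([principalId, idempotencyKey]));
}

// Deterministic per run, step, and chain version: a retry after a crash
// derives the same key, so a provider that honors idempotency can deduplicate.
function stepKey(runId: string, stepIndex: number, chainVersion: number): string {
  return sha256Hex(JSON.stringify(["step", runId, stepIndex, chainVersion]));
}

// In-memory stand-in for the unique constraint on the stored key hash.
const runsByKeyHash = new Map<string, { id: string }>();
let nextRunNumber = 1;

function createRun(principalId: string, idempotencyKey: string): { id: string; replayed: boolean } {
  const hash = runKeyHash(principalId, idempotencyKey);
  const existing = runsByKeyHash.get(hash);
  if (existing) return { id: existing.id, replayed: true }; // replay, no new spend
  const run = { id: `run_${nextRunNumber++}` };
  runsByKeyHash.set(hash, run);
  return { id: run.id, replayed: false };
}
```

Encoding the parts as a JSON array avoids ambiguity that plain string concatenation would introduce, such as "a" + "bc" hashing the same as "ab" + "c".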

The same discipline applies on the way out: when a run includes a webhook URL, the terminal callback is claimed on the run row and each signed delivery attempt is recorded in callback_delivery, so concurrent instances do not both send the same terminal callback.
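On the receiving side, a signed callback is only useful if the receiver verifies it. The post does not spell out the signing scheme, so the following assumes HMAC-SHA256 over the raw request body with a shared secret; the function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed scheme: hex-encoded HMAC-SHA256 of the raw body.
function signBody(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

function verifyBody(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(signBody(secret, body), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Verification should run against the exact bytes received, before any JSON parsing, because re-serializing the payload can change the bytes and break the signature.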

A canvas that cannot lose your work

The studio is a multi-flow React Flow canvas. Every edit autosaves to the canvas table in Aurora, and surviving real-world usage took three iterations:

Debounced autosave is a data-loss bug with good intentions: it can drop the last burst of edits before a reload.

The original debounce-based autosave silently dropped the last burst of edits before a reload. We replaced it with a dirty flag plus a steady flush loop, so the saved state is never more than ~1.5 seconds behind your latest edit.
A sendBeacon final flush covers tab close, reload, navigation, and tab-hide through an owner-authenticated endpoint.
Hydration runs exactly once per mount, so router refreshes can never reset live canvas state.
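The dirty-flag pattern is small enough to show in full. This is a simplified sketch with a hypothetical class name and a synchronous `save` callback; a real version would also need to set the flag again if a save fails.

```typescript
class AutosaveBuffer<T> {
  private dirty = false;
  private latest: T;
  private readonly save: (state: T) => void;

  constructor(initial: T, save: (state: T) => void) {
    this.latest = initial;
    this.save = save;
  }

  // Called on every edit. Cheap: it never resets a timer, so a burst of
  // edits cannot keep postponing the save the way a debounce does.
  update(state: T): void {
    this.latest = state;
    this.dirty = true;
  }

  // Called by a steady interval, and once more when the tab is closing.
  // Returns true when something was actually saved.
  flush(): boolean {
    if (!this.dirty) return false;
    this.dirty = false;
    this.save(this.latest);
    return true;
  }
}
```

In the browser this would be driven by something like `setInterval(() => buffer.flush(), 1500)`, with a final flush on `pagehide` or `visibilitychange` that sends the state through `navigator.sendBeacon`.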

The result is the demo I like most: edit a prompt, log out, log back in on another machine. The edit is there, served from Aurora. Close the tab mid-run, reopen, and the run resumes, because run progress was never in the browser to begin with.

Where the real pain lived: provider normalization

Six providers (Alibaba Cloud, Black Forest Labs, BytePlus, Google, OpenAI, and Runway), 57 supported models, 78,948 valid chain combinations. No two providers agree on what "give me a 16:9 image" means.

The deepest rabbit hole was Alibaba Cloud output sizes. Each model family has different rules, documented nowhere and discovered only by probing the live API:

qwen-image / qwen-image-plus accept exactly five sizes. Anything else is a 400.
qwen-image-max and z-image-turbo cap each dimension at 2048.
The wan2.6 / wan2.7 families enforce per-model pixel budgets.

Provider docs are a starting point. The live API is the truth.

So the adapter computes sizes per model. For budgeted models, a requested ratio w:h is fitted into a pixel budget P_max:

scale = sqrt(P_max / (w * h))
W = floor(scale * w / 16) * 16
H = floor(scale * h / 16) * 16

…and snapped-size models get a lookup table instead, because they allow no freedom at all. Wrong sizes now physically cannot be sent.
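The budget formula translates directly into code. The function name and the example budget below are illustrative only; real per-model budgets come from the provider.

```typescript
// Fit an aspect ratio w:h into a pixel budget, snapping down to a multiple
// of `step` so the result never exceeds the budget.
function fitToPixelBudget(
  w: number,
  h: number,
  pMax: number,
  step = 16,
): { width: number; height: number } {
  const scale = Math.sqrt(pMax / (w * h));
  const width = Math.floor((scale * w) / step) * step;
  const height = Math.floor((scale * h) / step) * step;
  return { width, height };
}

// Example with a made-up budget of 1920 * 1080 pixels:
// scale = sqrt(2073600 / 144) = 120, so 16:9 fits as 1920 x 1072.
const example = fitToPixelBudget(16, 9, 1920 * 1080);
console.log(`${example.width}x${example.height}`); // prints "1920x1072"
```

Flooring both dimensions is what keeps the result inside the budget: each side is at most its unrounded value, so their product is at most P_max. The price is a slight drift from the exact ratio, as the 1072 above shows.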

The same empirical attitude shaped everything at the boundary: Runway's per-endpoint pixel ratios, OpenAI's permanent quota 429s masquerading as transient rate limits, and BFL output URLs that expire after ~10 minutes (the UI shows honest loading and expiry states instead of a broken image and its alt text).

One structural decision keeps this manageable: the canvas node cards are generated from Semantic Lady schemas (fields, enum options, ranges, defaults). BabySea mode can lean on BabySea's normalized contract; BYOK mode gets a local provider-aware schema catalog instead of a pile of hand-built forms. The UI cannot offer a parameter the API would reject, because both are projections of the same source of truth.
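To make "projections of the same source of truth" concrete, here is a toy schema that drives both form defaults and request validation. The `FieldSchema` shape and function names are invented for this sketch and are not Semantic Lady's actual API.

```typescript
// Hypothetical schema shape, for illustration only.
type FieldSchema =
  | { kind: "enum"; options: string[]; default: string }
  | { kind: "range"; min: number; max: number; default: number };

type ModelSchema = Record<string, FieldSchema>;

// Projection 1: the values a node card starts with.
function defaultParams(schema: ModelSchema): Record<string, string | number> {
  const out: Record<string, string | number> = {};
  for (const [name, field] of Object.entries(schema)) out[name] = field.default;
  return out;
}

// Projection 2: the checks the API runs before any provider request is sent.
function validateParams(schema: ModelSchema, params: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const [name, value] of Object.entries(params)) {
    const field = schema[name];
    if (!field) {
      errors.push(`${name}: unknown field`);
    } else if (field.kind === "enum") {
      if (typeof value !== "string" || !field.options.includes(value)) {
        errors.push(`${name}: not an allowed option`);
      }
    } else if (typeof value !== "number" || value < field.min || value > field.max) {
      errors.push(`${name}: out of range`);
    }
  }
  return errors;
}
```

Since the card and the validator read the same schema object, a value the form can produce is by construction a value the validator accepts.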

What keeps it honest

BabyChain is built around runtime invariants instead of optimistic workflows. A chain should be able to fail cleanly, resume after an interrupted function, reject invalid model roles and normalized inputs before dispatch, and preserve canvas state even if the browser disappears mid-edit.

The runtime behavior we validated end to end:

Step fails             -> run goes terminal, downstream steps skipped, caller sees the provider's real error
Function instance dies -> next poll resumes the run from Aurora, idempotent resubmit
Client retries create  -> same run replayed, zero duplicate spend
Tab closes mid-edit    -> sendBeacon flush, canvas intact after re-login
Aurora wake-up         -> 30s connection budget absorbs it when pause is enabled

The current project gate is 237 tests plus typecheck, lint, and production build. The tests cover the runner, provider adapters, templates, API behavior, migrations, idempotency errors, callback behavior, and the schema rules that keep the canvas and API aligned.

What I would do next

BabyChain is already usable as a deployable starter, but the next layer is about making runs cheaper to inspect, easier to share, and safer to operate for teams:

Output archival. Copy provider outputs to S3 before short-lived URLs expire, especially for providers whose generated URLs are only useful for minutes.
Branching chains. The canvas already runs flows side by side; the runner should support fan-out inside one chain, such as one image feeding multiple video treatments.
Team workspaces. Add multi-user accounts with per-key scopes and quotas on top of the existing api_key model.
Run economics. Surface per-run cost estimates and provider spend from the data Aurora already stores.

The takeaway

Statelessness is a feature you design for, not a constraint you fight. Once every fact about a run lives in Aurora (runs, steps, provider ids, outputs, failures, callbacks, canvases, audit), serverless time limits, cold starts, and instance churn stop being the center of the system. Vercel gives the control plane instant deployment; Aurora gives it durable memory.

Design on the canvas. Ship the same contract as an API. Let the database remember everything.

Creators and developers: deploy it, chain your own models in your own cloud, and tell us what you automate first.

Try it: https://babychain.babysea.live
Deploy it: https://github.com/babysea-community/babychain

Hackathon note

BabyChain is our entry to the H0: Hack the Zero Stack with Vercel v0 & AWS Databases hackathon. This post was created for the purposes of entering that hackathon. #H0Hackathon

DEV Community