The Brain runs on a schedule now

#opensource #python #postgres #devops

Originally published on mihaibuilds.com. Cross-posting here because dev.to is where I read a lot of work like this myself.

Two weeks ago I shipped the first milestone of The Brain — the bare runner. A Python file with a sequence of steps, brain run path/to/workflow.py, the run lands in Postgres, you inspect it from the CLI. That was M1. It works, and you can run it on demand whenever you want.

Today M2 is done. The Brain now runs on a schedule, on its own, without you in the loop.

Why this matters

The most useful workflow automation only kicks in when you stop having to babysit it — daily digests, scheduled exports, nightly summaries, anything that compounds. M1 proved the runner works. M2 is the milestone where leaving it alone is a reasonable thing to do.

That's the whole point of M2 in one sentence. The rest of this post is what that looks like in practice — and what it deliberately doesn't try to do.

What M2 ships

Cron schedules. Register a workflow on a standard 5-field cron expression. The Brain writes the schedule to Postgres next to your run history.

docker compose exec brain brain register examples/daily_digest.py --cron "0 9 * * 1-5"

The schedule validates the cron expression and the workflow file before it lands in the database. Duplicate schedule names are rejected — no silent overwrite. You can list everything that's registered, see when each one ran last and when it'll fire next, pause and resume schedules (idempotent), and unregister them when you're done. Same CLI you used in M1, against the same database.
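To make "validates the cron expression" concrete, here is a minimal sketch of what a 5-field check could look like. It is not The Brain's actual validator: `validate_cron`, `_validate_part`, and `FIELD_RANGES` are hypothetical names, and the sketch only understands numbers, `*`, ranges, lists, and steps (no month or weekday names).

```python
# Hypothetical validator for standard 5-field cron expressions.
# Field order: minute, hour, day-of-month, month, day-of-week.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]


def _validate_part(part: str, lo: int, hi: int) -> None:
    """Check one comma-separated item such as '5', '1-5', '*', or '*/15'."""
    base, _, step = part.partition("/")
    if "/" in part and (not step.isdigit() or int(step) == 0):
        raise ValueError(f"bad step in {part!r}")
    if base == "*":
        return
    start, sep, end = base.partition("-")
    bounds = [start, end] if sep else [start]
    for value in bounds:
        if not value.isdigit() or not lo <= int(value) <= hi:
            raise ValueError(f"{value!r} is outside {lo}-{hi}")
    if sep and int(start) > int(end):
        raise ValueError(f"range {base!r} is reversed")


def validate_cron(expr: str) -> None:
    """Raise ValueError unless expr is a well-formed 5-field expression."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    for field, (lo, hi) in zip(fields, FIELD_RANGES):
        for part in field.split(","):
            _validate_part(part, lo, hi)


validate_cron("0 9 * * 1-5")  # weekdays at 09:00, passes silently
```

Rejecting the expression before anything is written is what keeps a typo from turning into a schedule that silently never fires.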

A scheduler daemon. The container now runs a long-running process — a daemon — that polls the schedule table every 10 seconds and fires whatever is due. SIGTERM finishes the currently-running workflow before exiting cleanly. On a crash, any run that was in flight gets recovered as a failed run with a clear error, so the run history never lies about what's running and what isn't.

The daemon and the CLI are separate processes against the same database. You don't have to "stop the daemon to run a workflow" — docker compose exec brain brain run ... still works exactly as it did in M1, in parallel with whatever the daemon is doing.

A new brain daemon-status command tells you whether the daemon is alive (exit 0 if it ticked within the last 30 seconds). Docker uses the same command as its container healthcheck.
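The liveness rule is small enough to sketch. The 30-second window comes from the description above; the function name, the `None` case, and the non-zero exit value of 1 are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

MAX_HEARTBEAT_AGE = timedelta(seconds=30)  # window stated in the post


def daemon_exit_code(last_heartbeat, now):
    """Return 0 if the daemon ticked recently enough, 1 otherwise.

    last_heartbeat is None when no heartbeat has ever been recorded.
    The exit value 1 is an assumption; the post only specifies exit 0.
    """
    if last_heartbeat is None:
        return 1
    return 0 if now - last_heartbeat <= MAX_HEARTBEAT_AGE else 1


now = datetime(2025, 1, 1, 9, 0, 0, tzinfo=timezone.utc)
fresh = daemon_exit_code(now - timedelta(seconds=12), now)  # healthy
stale = daemon_exit_code(now - timedelta(minutes=5), now)   # unhealthy
```

Because the result is just an exit code, the same check can back both a human at the CLI and a container healthcheck.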

Workflows that read their previous run. A step can write {previous.<step_name>} in its prompt or command, and The Brain substitutes the same step's output from the last successful run of the same workflow.

LLMStep(
    name="summary",
    prompt=(
        "Yesterday's summary:\n{previous.summary}\n\n"
        "Today's memories:\n{recent}\n\n"
        "Write today's summary."
    ),
)

On the very first run, when there is no previous successful run, the step fails with a clear error rather than silently substituting an empty string. This is the same strict-failure shape as M1's intra-run {step_name} placeholder: better to halt loudly than to leak unresolved braces into a shell command. Once one run has succeeded, every subsequent run sees its output via the placeholder.
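A rough sketch of that strict substitution, assuming the previous run's outputs are available as a plain dict keyed by step name (or `None` when no successful run exists). The function and exception names are hypothetical, not The Brain's real API.

```python
import re

# Matches {previous.<step_name>} and captures the step name.
PREVIOUS = re.compile(r"\{previous\.([A-Za-z_][A-Za-z0-9_]*)\}")


class NoPreviousRun(Exception):
    """No earlier successful run of this workflow exists."""


class MissingPreviousStep(Exception):
    """The earlier run exists but has no output for the named step."""


def resolve_previous(template, previous_outputs):
    """Substitute {previous.X} placeholders, failing loudly when X is unavailable."""

    def replace(match):
        name = match.group(1)
        if previous_outputs is None:
            raise NoPreviousRun(f"no successful previous run for {{previous.{name}}}")
        if name not in previous_outputs:
            raise MissingPreviousStep(f"previous run has no step named {name!r}")
        return str(previous_outputs[name])

    return PREVIOUS.sub(replace, template)


prompt = resolve_previous(
    "Yesterday's summary:\n{previous.summary}\n\nWrite today's summary.",
    {"summary": "Shipped M1."},
)
```

A template with no `{previous.…}` placeholder passes through untouched even when there is no previous run, so only steps that actually depend on history can fail this way.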

An opt-in HTTP endpoint. POST /run accepts a workflow path, runs the workflow, and returns the run's metadata as JSON. Bearer token from an environment variable; without the token in the environment, the service refuses to start. Designed for server-to-server, not browsers — no CORS, no public docs, single token. Opt in by bringing up the api compose profile.

THE_BRAIN_API_TOKEN=your-secret docker compose --profile api up -d

If you want to fire workflows from another machine, this is the surface. If you don't, ignore the profile and nothing in M1 changes.

Architectural decisions worth naming

daemon_tick(now) is the unit of behavior, not the polling loop. The daemon does one thing well: a single async function takes a wall-clock moment and runs one poll cycle (heartbeat, look up due schedules, fire each sequentially, advance next_run_at). A separate run_daemon wraps it in a 10-second loop with signal handlers. Tests drive the cycle function directly with a frozen clock instead of spawning a real long-running process — the wrapper is dumb on purpose. Two hours of test-design savings every time a future scheduler concern needs a regression test.
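The shape of that split can be sketched as follows. This is an illustration, not the real function: the in-memory `state` dict replaces Postgres, `Schedule.interval` stands in for a cron expression, and the `fire` callback is a hypothetical hook for running a workflow.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Schedule:
    name: str
    next_run_at: datetime
    interval: timedelta  # stand-in for a cron expression


async def daemon_tick(now, state, fire):
    """Run one poll cycle for the moment `now` and return the names fired."""
    state["heartbeat"] = now  # record liveness first
    fired = []
    for schedule in state["schedules"]:
        if schedule.next_run_at > now:
            continue  # not due yet
        await fire(schedule.name)  # sequential: one workflow at a time
        fired.append(schedule.name)
        # Advance past `now` so missed boundaries are skipped, not replayed.
        while schedule.next_run_at <= now:
            schedule.next_run_at += schedule.interval
    return fired


async def demo():
    """Drive two ticks with a frozen clock; no sleeping, no real daemon."""
    t0 = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
    state = {"heartbeat": None,
             "schedules": [Schedule("digest", t0, timedelta(minutes=1))]}

    async def fire(name):
        pass  # a real implementation would run the workflow here

    first = await daemon_tick(t0, state, fire)
    second = await daemon_tick(t0 + timedelta(seconds=10), state, fire)
    return first, second
```

Since time is an argument rather than something the function reads from the system clock, a test can replay any sequence of moments in milliseconds.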

Skip, don't catch up. If a workflow takes longer than its cron interval (say a 1-minute cron whose last run took 5 minutes), the daemon does not queue up four backlog fires for the boundaries it missed. It fires once, advances next_run_at to the first cron boundary after the current time, and moves on. A schedule that fell six hours behind because the container was off fires once and then continues on its normal cadence. Catching up across a long outage is almost always wrong; it floods the system with stale work the moment it comes back.
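The advance step is plain arithmetic. The sketch below uses a fixed interval from an anchor time in place of a real cron expression, which is a simplifying assumption; `next_boundary_after` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def next_boundary_after(now, anchor, interval):
    """Return the first boundary anchor + k * interval strictly after `now`."""
    if now < anchor:
        return anchor
    # timedelta // timedelta yields an int: whole intervals already elapsed.
    elapsed = (now - anchor) // interval
    return anchor + (elapsed + 1) * interval


anchor = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
# A 1-minute schedule checked at 09:04:30: the 09:01 to 09:04 boundaries
# are skipped, and the next fire is simply 09:05.
resumed = next_boundary_after(
    anchor + timedelta(minutes=4, seconds=30), anchor, timedelta(minutes=1)
)
```

However far behind the schedule has fallen, the result is always exactly one upcoming boundary, never a list of missed ones.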

Sequential within a poll cycle. No concurrent workflow execution. A long-running workflow blocks the daemon from picking up other due workflows until it finishes. This is by design for v1.0 — parallel execution and a real work queue carry concurrency-control complexity that needs to wait until I have a real workload to optimize against, not a hypothetical one. v1.1 concern, called out in the explainer notes.

Crash recovery on boot, not in-flight. When run_daemon starts, it sweeps workflow_runs WHERE status='running' and marks them all failed with a fixed error message. Under the single-daemon-per-host invariant these are by definition orphans from a previous crash. No heartbeat liveness check, no leader election, no consensus protocol. Single daemon means single source of truth for what "in flight" means.
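The boot-time sweep can be sketched in a few lines. To stay self-contained this uses an in-memory SQLite database instead of Postgres, and the `error` column, the message wording, and the helper names are all assumptions; only the `workflow_runs` table and its `status` values come from the description above.

```python
import sqlite3

# Hypothetical wording; the real fixed message is not quoted in the post.
ORPHAN_ERROR = "daemon restarted while this run was in flight"


def recover_orphaned_runs(conn):
    """Mark every run still flagged 'running' as failed; return how many."""
    cursor = conn.execute(
        "UPDATE workflow_runs SET status = 'failed', error = ? "
        "WHERE status = 'running'",
        (ORPHAN_ERROR,),
    )
    conn.commit()
    return cursor.rowcount


def make_demo_db():
    """Build a tiny stand-in table with one orphaned run."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workflow_runs ("
        "id INTEGER PRIMARY KEY, workflow_name TEXT, status TEXT, error TEXT)"
    )
    conn.executemany(
        "INSERT INTO workflow_runs (workflow_name, status) VALUES (?, ?)",
        [("digest", "success"), ("digest", "running"), ("export", "failed")],
    )
    conn.commit()
    return conn
```

Running the sweep a second time changes nothing, which is what makes it safe to execute unconditionally on every daemon start.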

A planned_steps JSONB snapshot on every run. Each workflow_runs row now has the full step list at run-creation time — [{"name": ..., "type": ...}, ...]. Lets postmortem disambiguate "step absent from output because the run halted before reaching it" from "step never existed in this workflow version." One extra json.dumps per run, no extra query. The cost is rounding error; the postmortem clarity is worth it. Suggested by a comment under the M1 dev.to post — pinned to the schema before the analyzer existed.
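Here is a small sketch of the disambiguation the snapshot enables. It assumes the run's output is available as a dict keyed by step name, which the post does not spell out; the function name and return labels are made up for illustration.

```python
import json


def classify_step(planned_steps_json, output, step_name):
    """Explain why a step does or does not appear in a run's output."""
    planned = {step["name"] for step in json.loads(planned_steps_json)}
    if step_name in output:
        return "ran"
    if step_name in planned:
        # Planned at run creation but never produced output.
        return "halted before step"
    return "not in this workflow version"


snapshot = json.dumps([
    {"name": "fetch", "type": "shell"},
    {"name": "summary", "type": "llm"},
])
# The run halted after 'fetch', so 'summary' was planned but never reached.
verdict = classify_step(snapshot, {"fetch": "ok"}, "summary")
```

Without the snapshot, the last two cases are indistinguishable once the workflow file has changed.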

{previous.X} is a single indexed lookup. A partial index on workflow_runs (workflow_name, started_at DESC) WHERE status = 'success' lets the previous-run lookup find the latest successful run with one index scan, plus a single heap fetch for the output column. The previous run's output JSONB is decomposed into a step-name → output map at lookup time, which is what {previous.X} resolves against. Strict on failure: no prior successful run, or a step name missing from the previous run, each fails that step with its own clear error. Two messages, two tests pinning them.

HTTPBearer with auto_error=False. With its default auto_error=True, FastAPI's HTTPBearer returns 403 on a missing Authorization header. That's the wrong status: RFC 7235 reserves 401 for requests that lack valid credentials, while 403 means the server understood the request and refuses to authorize it. Setting auto_error=False and raising 401 manually corrects this. It's a small bug, but the kind that wastes a peer's afternoon when they're integrating against the endpoint and can't figure out why a curl request with no header gets 403 instead of 401. Pinned by three auth-branch tests.
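The decision itself can be sketched without any framework. This is a standard-library illustration of the three branches (missing header, wrong token, correct token), not the endpoint's real code. Returning 401 for a wrong token is an assumption, and `check_bearer` is a hypothetical name.

```python
import hmac


def check_bearer(authorization, expected_token):
    """Return (status_code, response_headers) for an Authorization header value."""
    challenge = {"WWW-Authenticate": "Bearer"}  # a 401 carries a challenge
    if authorization is None:
        return 401, challenge  # missing header: unauthenticated, not forbidden
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return 401, challenge  # malformed or non-Bearer credentials
    # Constant-time comparison avoids leaking token contents through timing.
    if not hmac.compare_digest(token.encode(), expected_token.encode()):
        return 401, challenge  # assumption: wrong token is also 401
    return 200, {}


missing = check_bearer(None, "s3cret")
wrong = check_bearer("Bearer nope", "s3cret")
valid = check_bearer("Bearer s3cret", "s3cret")
```

In a FastAPI dependency the same branches would raise an HTTPException with status 401 rather than return a tuple.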

What v1.0 won't do, on purpose

The daemon is not highly available. One daemon per host. Two running in parallel would clobber each other's crash-recovery logic. The single-daemon invariant is what lets the recovery sweep be a simple UPDATE WHERE status='running'. Adding HA means leader election or run-level ownership — both v1.1+ concerns.

There's no instant pickup. New registrations and cron-boundary fires land within ten seconds of being due, as long as the daemon isn't busy with another run. Postgres LISTEN/NOTIFY would close that gap but adds complexity that 10s polling makes unnecessary for v1.0. Most workflows run on minute-or-coarser cron expressions; 10s is rounding error.

There's no queue for missed fires. Skip-don't-catch-up is the locked behavior. If you genuinely need every fire to land, write a workflow that does its own backfill — The Brain won't second-guess your cron expression.

The HTTP endpoint isn't a public API. Single token, no CORS, opt-in, designed for known callers on the same network. Path allowlisting and per-caller scoping are v1.1+. The threat model is single-token-server-to-server; anyone with the token can execute arbitrary server-side Python by pointing the endpoint at any file on the host. The token is the only gate. Treat it like a database password.

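Scheduled workflows still execute one at a time per host. The daemon fires them sequentially within a tick; only manual brain run invocations from the CLI run alongside it. Concurrent scheduled execution is v1.1 territory and brings concurrency-control problems that need to wait for a real workload to design against.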

These are deliberate trade-offs. M2 is the smallest correct unattended-runner, not the most ambitious one.

Who this is for

Same audience as M1, with one addition: anyone who needs a scheduled workflow runner they can self-host and inspect end-to-end — and who's tired of either rolling their own cron-in-a-container with no run history, or paying for a managed orchestrator that owns their data.

If you've ever written a Python script, wired it to a system cron entry, then realized a week later you have no record of which days it failed and why — this is for you.

What's next

Milestone 3 is the reactive layer — webhook triggers and file-watcher triggers. That's when The Brain stops only firing on the clock and starts firing in response to things that happen.

The full roadmap and milestone progress table live in the repo's README. Each milestone gets a dev-log post here as it ships — one of four dev.to posts across the build period.

Try it

git clone https://github.com/MihaiBuilds/the-brain
cd the-brain
docker compose up -d
docker compose exec brain brain daemon-status
docker compose exec brain brain register examples/hello.py --cron "*/1 * * * *"
docker compose exec brain brain list

Wait a minute, run brain history, and you'll see the daemon-fired run sitting in there alongside any brain run invocations from the M1 quickstart — same row shape, same inspection commands, same database. The repo has the longer version with state-across-runs and the HTTP endpoint walkthrough.