Kan-Chen Lin

Posted on Jun 15

Hardening a Go Task Scheduler for Production (Part 2) — Tenancy, DST, Idempotency & LLM Reliability

#ai #architecture #go #systemdesign

📦 Repo: github.com/linkc0829/go-chatgpt-tasks

This is Part 2 of 2. Part 1 built the scheduler from scratch — the MCP interface, the queue-decoupled watcher → worker execution path, and the AI workflow that drove it. This part takes that prototype and hardens it for multi-tenant production.

In Part 1 I built an MCP-driven task scheduler: a watcher scans for due jobs, a Redis Streams queue decouples it from a pool of workers, and a clean Job / JobRun / RunEvent state machine ties it together. It worked — but it was a prototype, and prototypes have tells.

It was MCP-only: no HTTP handler, no authentication, driven by an unauthenticated stdio server. There was no tenant_id or user_id anywhere. Recurrence was a naive interval_seconds added to the last run — no timezone or DST awareness. The run_events table was status-only (id, status, created_at) with no event type, payload, or error detail. The executor was a literal no-op stub. And while the Redis queue delivered at-least-once, there was no idempotency on real side effects.

This is the production-hardening cycle that fixed every one of those. Six problem areas, the internal/task/ package rewritten across seven vertical slices, ~8,200 lines of diff. This post covers:

A quick recap of the AI workflow (introduced in Part 1) and how it handled a large change
The high-level design it produced (with a diagram)
The system-design decisions and their trade-offs

1. The AI Workflow at Scale

Part 1 introduced the staged QRSPI workflow — Question → Research → Design → Structure → Plan → Worktree → Implement → PR — where each step produces a written artifact and nothing downstream starts until the upstream artifact exists. Where Part 1 used it to establish patterns in a greenfield feature, Part 2 leaned on it to keep a sprawling six-area change reviewable.

1. Question   → neutral research questions (no opinions)
2. Research   → objective answers, grounded in the actual code
3. Design     → where are we going, and why (+ explicit non-goals)
4. Structure  → vertical slices + test checkpoints
5. Plan       → the tactical, file-by-file working doc
6. Worktree   → isolated git worktree for implementation
7. Implement  → execute phase-by-phase, verify each
8. PR         → description grounded in design + the real diff

Three things mattered most on a change this size:

Research was read-only and opinion-free. Before touching anything, I re-mapped reality: what columns are on jobs / job_runs / run_events today? How does the recurring watcher compute the next run? What does the Redis consumer-group setup actually look like? So decisions referenced recurring_watcher.go:53, not a hallucinated guess.
Design recorded decisions and the rejected alternatives. The design doc explicitly listed the non-goals: fair/weighted scheduling, a real Anthropic adapter, full RFC-5545 RRULE, a DAG engine, exactly-once. Naming what we weren't doing is what stopped a six-area change from becoming a twelve-area one.
Structure forced altitude control. Seven slices, ordered by blast radius — identity first (every later slice needs it on the service signature), then the typed-event/metrics foundation, then quota, recurrence, idempotency, chains, and LLM reliability. Each slice crosses the full stack (migration → domain → service → API/MCP → tests) and ends in a concrete verification checkpoint. Each is independently shippable; drop a later one and the earlier ones still deliver.

The payoff: because the "why" already lived in design.md, the PR description explained reasoning instead of restating the diff — and a reviewer could read the design and structure docs and already know why every decision was made.

2. High-Level Design

The architecture is unchanged from Part 1 — hexagonal (ports & adapters), feature-first — but the task hexagon grew a lot of new surface. The biggest structural change: an authenticated HTTP API alongside the existing MCP transport, both funneling through the same service via an explicit identity.

Read it top-down: both client types funnel into the same service through an explicit Identity; the service talks only to interfaces in ports.go; adapters (Postgres, Redis, the LLM client, and user.Service as a cross-feature TenantLookup port) sit at the edge; and the worker drains the queue, executes idempotently behind the LLM reliability wrapper, and — like the service — emits typed events that feed Prometheus.

That user.Service-as-a-port wiring is worth a note: cross-feature communication never imports another feature directly. task declares a TenantLookup capability interface in its own ports.go, user.Service satisfies it implicitly (Go interfaces are structural, so no "implements" declaration is needed), and the composition root (bootstrap/wire.go) injects it. That's how task resolves a tenant from a user without ever importing internal/user.

3. System-Design Decisions

These are the calls that mattered, with the reasoning the workflow forced into writing.

Identity is an explicit argument, not a context value

Service methods take Identity{TenantID, UserID} as a first-class first argument, rather than smuggling it through context.Context. The reason: two very different callers (HTTP, where identity comes from a JWT subject, and MCP, where it comes from a service principal) must supply identity the same way. An explicit parameter makes the dependency visible in the signature and impossible to forget. For v1, with no tenants table in the spec, tenant_id is resolved 1:1 from the user via the TenantLookup port.

Stacked migrations: add-nullable → backfill → NOT NULL

The tables already existed in production-shaped form (from Part 1), so each schema change is a three-step dance within one migration pair: add the column nullable, backfill a sensible default (timezone_id = 'UTC', interval_seconds → equivalent recurrence_rule), then enforce NOT NULL. Heavier than a greenfield template needs, but it's the price of evolving a live schema safely.

A hand-rolled RRULE subset, not a dependency

Recurrence supports FREQ=DAILY|WEEKLY (+ optional INTERVAL) with a local_time and an IANA timezone_id, computed over the stdlib's time.LoadLocation. DST is the whole point: 08:00 America/New_York must stay 8 AM local across a spring-forward boundary. The DST policy is explicit — skipped local times roll to the next valid instant, ambiguous ones pick the first occurrence — and the decision is recorded as a RunEvent so it's auditable. No external RFC-5545 library; the surface we need is small, and a full RRULE engine is a non-goal.
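The core of the DST behaviour can be sketched with the standard library alone. This is an illustration under assumptions, not the repo's implementation: the Rule struct and NextRun are hypothetical names, only FREQ=DAILY is covered, and the sketch merely detects a skipped local time instead of rolling it forward. Go's time.Date does not guarantee which side of a transition it picks for a skipped or ambiguous wall-clock time, so a real policy needs explicit handling there.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the IANA database so LoadLocation works without system zoneinfo
)

// Rule is a hypothetical shape for the FREQ=DAILY subset.
type Rule struct {
	IntervalDays int    // INTERVAL; values below 1 are treated as 1
	Hour, Minute int    // local_time
	TimezoneID   string // IANA name, e.g. "America/New_York"
}

// NextRun steps whole calendar days in the rule's zone and rebuilds the
// wall-clock time there, so 08:00 stays 08:00 local across a DST change.
// shifted reports that the requested local time did not exist on that day.
func NextRun(last time.Time, r Rule) (next time.Time, shifted bool, err error) {
	loc, err := time.LoadLocation(r.TimezoneID)
	if err != nil {
		return time.Time{}, false, err
	}
	step := r.IntervalDays
	if step < 1 {
		step = 1
	}
	y, m, d := last.In(loc).Date()
	// time.Date normalizes day overflow (e.g. March 32 becomes April 1).
	next = time.Date(y, m, d+step, r.Hour, r.Minute, 0, 0, loc)
	shifted = next.Hour() != r.Hour || next.Minute() != r.Minute
	return next, shifted, nil
}

// mustParse is a small helper for the demo below.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	// 2024-03-09 08:00 in New York (EST) is 13:00 UTC; DST starts on 2024-03-10.
	last := mustParse("2024-03-09T13:00:00Z")
	rule := Rule{IntervalDays: 1, Hour: 8, Minute: 0, TimezoneID: "America/New_York"}

	next, shifted, err := NextRun(last, rule)
	if err != nil {
		panic(err)
	}
	fmt.Println("rule-based:", next.UTC().Format(time.RFC3339), "shifted:", shifted)
	fmt.Println("naive +24h:", last.Add(24*time.Hour).UTC().Format(time.RFC3339))
}
```

The rule-based result lands at 12:00 UTC (08:00 EDT), while the naive interval lands at 13:00 UTC, which is 09:00 local. That one-hour drift is exactly what interval_seconds could not avoid.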

At-least-once delivery + idempotency, not exactly-once

Part 1 gave us at-least-once delivery via Redis Streams consumer-group reclaim. Part 2 makes the handlers idempotent rather than chasing distributed exactly-once (a tar pit). An idempotency_records table backs a check → in-progress → side effect → completed contract, and JobRunMsg now carries an idempotency_key. Duplicate delivery of the same message runs the side effect exactly once — verified by a fake handler that counts its own calls.
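The check → in-progress → side effect → completed contract can be sketched with an in-memory map standing in for the idempotency_records table. Store, Do, and the failure policy (release the claim so a redelivery can retry) are assumptions for this sketch; in Postgres the claim step would more plausibly be an insert that does nothing on conflict.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type status int

const (
	inProgress status = iota + 1
	completed
)

// ErrInProgress signals that another delivery currently holds the key.
var ErrInProgress = errors.New("idempotency: key already in progress")

// Store is an in-memory stand-in for the idempotency_records table.
type Store struct {
	mu      sync.Mutex
	records map[string]status
}

func NewStore() *Store { return &Store{records: make(map[string]status)} }

// Do runs effect only if key has not completed and is not currently claimed.
// ran reports whether effect was invoked by this call.
func (s *Store) Do(key string, effect func() error) (ran bool, err error) {
	// Check and claim atomically under one lock.
	s.mu.Lock()
	switch s.records[key] {
	case completed:
		s.mu.Unlock()
		return false, nil // duplicate delivery: acknowledge without re-running
	case inProgress:
		s.mu.Unlock()
		return false, ErrInProgress
	}
	s.records[key] = inProgress
	s.mu.Unlock()

	if err := effect(); err != nil {
		// Assumed policy: release the claim so a later redelivery can retry.
		s.mu.Lock()
		delete(s.records, key)
		s.mu.Unlock()
		return true, err
	}

	s.mu.Lock()
	s.records[key] = completed
	s.mu.Unlock()
	return true, nil
}

func main() {
	store := NewStore()
	calls := 0
	sendEmail := func() error { calls++; return nil } // counting fake handler

	// The same message is delivered twice under at-least-once semantics.
	for i := 0; i < 2; i++ {
		ran, err := store.Do("job-run-42", sendEmail)
		fmt.Println("ran:", ran, "err:", err)
	}
	fmt.Println("side effect calls:", calls)
}
```

The counting handler mirrors the verification described above: two deliveries, one recorded call.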

Single stream now, fairness deferred — but the door is left open

Fair/weighted per-tenant scheduling was explicitly deferred. But tenant_id was added to the queue payload now, and a cheap intra-batch per-tenant round-robin (Worker.fairOrder) reorders only the messages a worker already read. That's a bounded down-payment on fairness — no per-tenant queues, no cross-worker coordination — that avoids a queue rewrite later.
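One way such an intra-batch round-robin could look, written as a free function for brevity. The JobRunMsg fields and the first-seen tenant ordering are assumptions; the repo's Worker.fairOrder may differ in detail.

```go
package main

import "fmt"

// JobRunMsg is trimmed to the two fields this sketch needs.
type JobRunMsg struct {
	ID       string
	TenantID string
}

// fairOrder interleaves one already-read batch round-robin by tenant.
// It keeps each tenant's messages in their original relative order and
// visits tenants in first-seen order. No state survives between batches.
func fairOrder(batch []JobRunMsg) []JobRunMsg {
	var tenants []string // tenants in first-seen order
	queues := make(map[string][]JobRunMsg)
	for _, m := range batch {
		if _, seen := queues[m.TenantID]; !seen {
			tenants = append(tenants, m.TenantID)
		}
		queues[m.TenantID] = append(queues[m.TenantID], m)
	}

	out := make([]JobRunMsg, 0, len(batch))
	for len(out) < len(batch) {
		// One pass takes at most one message from each tenant.
		for _, t := range tenants {
			if q := queues[t]; len(q) > 0 {
				out = append(out, q[0])
				queues[t] = q[1:]
			}
		}
	}
	return out
}

func main() {
	batch := []JobRunMsg{
		{ID: "a1", TenantID: "A"},
		{ID: "a2", TenantID: "A"},
		{ID: "a3", TenantID: "A"},
		{ID: "b1", TenantID: "B"},
		{ID: "c1", TenantID: "C"},
	}
	for _, m := range fairOrder(batch) {
		fmt.Print(m.ID, " ")
	}
	fmt.Println()
}
```

With tenant A flooding the batch, B and C still get served after A's first message instead of waiting behind all three. Because the function only reorders a slice the worker already holds, it needs no coordination with other workers.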

LLM behind a port + a reliability wrapper + a fake

Part 1's no-op StubExecutor is replaced by an executor that dispatches by job type and wraps an LLMClient interface in a reliability layer: context.WithTimeout, output-schema validation, retry within a budget, and a pre-run cost estimate checked against the tenant's max_daily_llm_cost_cents. Invalid LLM output is never marked success — it retries, then fails. A deterministic fake client backs all tests; a real Anthropic adapter is a thin, deliberately-later add. The interesting logic lives in the wrapper, not the vendor SDK, so it's fully testable without a live API key.

Typed events + feature-level metrics

run_events graduated from status-only to event_type TEXT + event_payload JSONB with error code/message fields, and every lifecycle transition emits one. Alongside, the task feature got its own Prometheus vectors (task_runs_total{status}, task_run_duration_seconds, task_dlq_total, per-tenant quota rejections). The diff also ships a Grafana dashboard JSON and alert rules — though actual dashboard provisioning is correctly scoped as infra work, not part of the code change.

A Few Things Worth Stealing

Write the non-goals down. The single most effective scope-control tool was an explicit "What We're NOT Doing" list — and it's where you put the almost-deferred items (like intra-batch fairness) with a note on exactly how far they go.
Slice by blast radius, end every slice in a checkpoint. Identity went first because every later signature depended on it. Each of the seven slices was independently shippable.
Idempotency beats exactly-once. If you're tempted to chase distributed exactly-once, redirect that energy into making handlers idempotent under at-least-once delivery.
Keep external dependencies behind a port with a fake. The LLM reliability layer was fully testable before a single real API call existed.
Let the design doc write the PR. If your "why" already exists in writing, the pull request stops being a chore and becomes genuinely useful to reviewers.

The throughline across both parts: a disciplined, artifact-producing AI workflow turns both a greenfield build and a sprawling six-area hardening pass into something a reviewer can actually follow — because every decision was written down before it was coded, not reverse-engineered from the diff afterward.

Missed the start? Part 1, "Building the scheduler", covers the MCP interface, the watcher → queue → worker execution path, and the design decisions behind the prototype.
