<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kan-Chen Lin</title>
    <description>The latest articles on DEV Community by Kan-Chen Lin (@kanchen_lin_331136af621d).</description>
    <link>https://dev.to/kanchen_lin_331136af621d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3634020%2F1154df49-6bd8-4f77-84c8-6b2cd0e465e7.jpg</url>
      <title>DEV Community: Kan-Chen Lin</title>
      <link>https://dev.to/kanchen_lin_331136af621d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kanchen_lin_331136af621d"/>
    <language>en</language>
    <item>
      <title>Hardening a Go Task Scheduler for Production (Part 2) — Tenancy, DST, Idempotency &amp; LLM Reliability</title>
      <dc:creator>Kan-Chen Lin</dc:creator>
      <pubDate>Mon, 15 Jun 2026 01:53:07 +0000</pubDate>
      <link>https://dev.to/kanchen_lin_331136af621d/hardening-a-go-task-scheduler-for-production-part-2-tenancy-dst-idempotency-llm-reliability-5373</link>
      <guid>https://dev.to/kanchen_lin_331136af621d/hardening-a-go-task-scheduler-for-production-part-2-tenancy-dst-idempotency-llm-reliability-5373</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;📦 &lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/linkc0829/go-chatgpt-tasks" rel="noopener noreferrer"&gt;github.com/linkc0829/go-chatgpt-tasks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;Part 2 of 2&lt;/strong&gt;. Part 1 built the scheduler from scratch — the MCP interface, the queue-decoupled watcher → worker execution path, and the AI workflow that drove it. This part takes that prototype and hardens it for multi-tenant production.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Part 1 I built an MCP-driven task scheduler: a watcher scans for due jobs, a Redis Streams queue decouples it from a pool of workers, and a clean &lt;code&gt;Job&lt;/code&gt; / &lt;code&gt;JobRun&lt;/code&gt; / &lt;code&gt;RunEvent&lt;/code&gt; state machine ties it together. It worked — but it was a prototype, and prototypes have tells.&lt;/p&gt;

&lt;p&gt;It was &lt;strong&gt;MCP-only&lt;/strong&gt;: no HTTP handler, no authentication, driven by an unauthenticated stdio server. There was no &lt;code&gt;tenant_id&lt;/code&gt; or &lt;code&gt;user_id&lt;/code&gt; anywhere. Recurrence was a naive &lt;code&gt;interval_seconds&lt;/code&gt; added to the last run — no timezone or DST awareness. The &lt;code&gt;run_events&lt;/code&gt; table was status-only (&lt;code&gt;id, status, created_at&lt;/code&gt;) with no event type, payload, or error detail. The executor was a literal no-op stub. And while the Redis queue delivered at-least-once, there was &lt;strong&gt;no idempotency&lt;/strong&gt; on real side effects.&lt;/p&gt;

&lt;p&gt;This is the production-hardening cycle that fixed every one of those. Six problem areas, the &lt;code&gt;internal/task/&lt;/code&gt; package rewritten across seven vertical slices, ~8,200 lines of diff. This post covers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A quick recap of the &lt;strong&gt;AI workflow&lt;/strong&gt; (introduced in Part 1) and how it handled a &lt;em&gt;large&lt;/em&gt; change&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;high-level design&lt;/strong&gt; it produced (with a diagram)&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;system-design decisions&lt;/strong&gt; and their trade-offs&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. The AI Workflow at Scale
&lt;/h2&gt;

&lt;p&gt;Part 1 introduced the staged &lt;strong&gt;QRSPI&lt;/strong&gt; workflow — Question → Research → Design → Structure → Plan → Worktree → Implement → PR — where each step produces a written artifact and nothing downstream starts until the upstream artifact exists. Where Part 1 used it to &lt;em&gt;establish&lt;/em&gt; patterns in a greenfield feature, Part 2 leaned on it to keep a sprawling six-area change &lt;em&gt;reviewable&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Question   → neutral research questions (no opinions)
2. Research   → objective answers, grounded in the actual code
3. Design     → where are we going, and why (+ explicit non-goals)
4. Structure  → vertical slices + test checkpoints
5. Plan       → the tactical, file-by-file working doc
6. Worktree   → isolated git worktree for implementation
7. Implement  → execute phase-by-phase, verify each
8. PR         → description grounded in design + the real diff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things mattered most on a change this size:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Research was read-only and opinion-free.&lt;/strong&gt; Before touching anything, I re-mapped reality: &lt;em&gt;what columns are on &lt;code&gt;jobs&lt;/code&gt; / &lt;code&gt;job_runs&lt;/code&gt; / &lt;code&gt;run_events&lt;/code&gt; today? How does the recurring watcher compute the next run? What does the Redis consumer-group setup actually look like?&lt;/em&gt; So decisions referenced &lt;code&gt;recurring_watcher.go:53&lt;/code&gt;, not a hallucinated guess.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Design recorded decisions &lt;em&gt;and&lt;/em&gt; the rejected alternatives.&lt;/strong&gt; The design doc explicitly listed the non-goals: fair/weighted scheduling, a real Anthropic adapter, full RFC-5545 RRULE, a DAG engine, exactly-once. Naming what we &lt;em&gt;weren't&lt;/em&gt; doing is what stopped a six-area change from becoming a twelve-area one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structure forced altitude control.&lt;/strong&gt; Seven slices, ordered by &lt;strong&gt;blast radius&lt;/strong&gt; — identity first (every later slice needs it on the service signature), then the typed-event/metrics foundation, then quota, recurrence, idempotency, chains, and LLM reliability. Each slice crosses the full stack (migration → domain → service → API/MCP → tests) and ends in a concrete verification checkpoint. Each is independently shippable; drop a later one and the earlier ones still deliver.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The payoff: because the "why" already lived in &lt;code&gt;design.md&lt;/code&gt;, the PR description explained reasoning instead of restating the diff — and a reviewer could read the design and structure docs and already know why every decision was made.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. High-Level Design
&lt;/h2&gt;

&lt;p&gt;The architecture is unchanged from Part 1 — &lt;strong&gt;hexagonal (ports &amp;amp; adapters), feature-first&lt;/strong&gt; — but the &lt;code&gt;task&lt;/code&gt; hexagon grew a lot of new surface. The biggest structural change: an &lt;strong&gt;authenticated HTTP API&lt;/strong&gt; alongside the existing MCP transport, both funneling through the &lt;em&gt;same&lt;/em&gt; service via an explicit identity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvg1dudfhh5j2jih8mpj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvg1dudfhh5j2jih8mpj.png" alt="gpt-task-scheduler-design"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read it top-down:&lt;/strong&gt; both client types funnel into the &lt;em&gt;same&lt;/em&gt; service through an explicit &lt;code&gt;Identity&lt;/code&gt;; the service talks only to interfaces in &lt;code&gt;ports.go&lt;/code&gt;; adapters (Postgres, Redis, the LLM client, and &lt;code&gt;user.Service&lt;/code&gt; as a cross-feature &lt;code&gt;TenantLookup&lt;/code&gt; port) sit at the edge; and the worker drains the queue, executes idempotently behind the LLM reliability wrapper, and — like the service — emits typed events that feed Prometheus.&lt;/p&gt;

&lt;p&gt;That &lt;code&gt;user.Service&lt;/code&gt;-as-a-port wiring is worth a note: features never import one another directly. &lt;code&gt;task&lt;/code&gt; declares a &lt;code&gt;TenantLookup&lt;/code&gt; &lt;strong&gt;capability interface&lt;/strong&gt; in its own &lt;code&gt;ports.go&lt;/code&gt;, &lt;code&gt;user.Service&lt;/code&gt; satisfies it implicitly (Go matches interfaces structurally, with no &lt;code&gt;implements&lt;/code&gt; declaration), and the composition root (&lt;code&gt;bootstrap/wire.go&lt;/code&gt;) injects it. That's how &lt;code&gt;task&lt;/code&gt; resolves a tenant from a user without ever importing &lt;code&gt;internal/user&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. System-Design Decisions
&lt;/h2&gt;

&lt;p&gt;These are the calls that mattered, with the reasoning the workflow forced into writing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Identity is an explicit argument, not a context value
&lt;/h3&gt;

&lt;p&gt;Service methods take &lt;code&gt;Identity{TenantID, UserID}&lt;/code&gt; as a &lt;strong&gt;first-class first argument&lt;/strong&gt;, rather than smuggling it through &lt;code&gt;context.Context&lt;/code&gt;. The reason: two very different callers (HTTP, where identity comes from a JWT subject, and MCP, where it comes from a service principal) must supply identity &lt;em&gt;the same way&lt;/em&gt;. An explicit parameter makes the dependency visible in the signature and impossible to forget. For v1, with no tenants table in the spec, &lt;code&gt;tenant_id&lt;/code&gt; is resolved 1:1 from the user via the &lt;code&gt;TenantLookup&lt;/code&gt; port.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stacked migrations: add-nullable → backfill → NOT NULL
&lt;/h3&gt;

&lt;p&gt;The tables already existed in production-shaped form (from Part 1), so each schema change is a three-step dance within one migration pair: add the column nullable, backfill a sensible default (&lt;code&gt;timezone_id = 'UTC'&lt;/code&gt;, &lt;code&gt;interval_seconds&lt;/code&gt; → equivalent &lt;code&gt;recurrence_rule&lt;/code&gt;), then enforce &lt;code&gt;NOT NULL&lt;/code&gt;. Heavier than a greenfield template needs, but it's the price of evolving a live schema safely.&lt;/p&gt;

&lt;h3&gt;
  
  
  A hand-rolled RRULE &lt;em&gt;subset&lt;/em&gt;, not a dependency
&lt;/h3&gt;

&lt;p&gt;Recurrence supports &lt;code&gt;FREQ=DAILY|WEEKLY&lt;/code&gt; (+ optional &lt;code&gt;INTERVAL&lt;/code&gt;) with a &lt;code&gt;local_time&lt;/code&gt; and an IANA &lt;code&gt;timezone_id&lt;/code&gt;, computed over the stdlib's &lt;code&gt;time.LoadLocation&lt;/code&gt;. &lt;strong&gt;DST is the whole point&lt;/strong&gt;: &lt;code&gt;08:00 America/New_York&lt;/code&gt; must stay 8 AM local across a spring-forward boundary. The DST policy is explicit — skipped local times roll to the next valid instant, ambiguous ones pick the first occurrence — and the decision is recorded as a &lt;code&gt;RunEvent&lt;/code&gt; so it's auditable. No external RFC-5545 library; the surface we need is small, and a full RRULE engine is a non-goal.&lt;/p&gt;

&lt;h3&gt;
  
  
  At-least-once delivery + idempotency, not exactly-once
&lt;/h3&gt;

&lt;p&gt;Part 1 gave us at-least-once delivery via Redis Streams consumer-group reclaim. Part 2 makes the &lt;em&gt;handlers&lt;/em&gt; idempotent rather than chasing distributed exactly-once (a tar pit). An &lt;code&gt;idempotency_records&lt;/code&gt; table backs a &lt;code&gt;check → in-progress → side effect → completed&lt;/code&gt; contract, and &lt;code&gt;JobRunMsg&lt;/code&gt; now carries an &lt;code&gt;idempotency_key&lt;/code&gt;. Duplicate delivery of the same message runs the side effect exactly once — verified by a fake handler that counts its own calls.&lt;/p&gt;
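
&lt;p&gt;The contract can be sketched with an in-memory store. Everything here (&lt;code&gt;Store&lt;/code&gt;, &lt;code&gt;RunOnce&lt;/code&gt;, the state names) is a hypothetical stand-in for the Postgres-backed &lt;code&gt;idempotency_records&lt;/code&gt; table, and it leaves out concerns a real implementation needs, such as expiring a stale in-progress record after a crash.&lt;/p&gt;

```go
package main

import "sync"

type state int

const (
	stateInProgress state = iota + 1
	stateCompleted
)

// Store is an in-memory stand-in for the idempotency_records table.
type Store struct {
	mu      sync.Mutex
	records map[string]state
}

func NewStore() *Store {
	s := new(Store)
	s.records = make(map[string]state)
	return s
}

// begin claims the key. It reports false if the key was already claimed,
// whether that earlier attempt is still in progress or has completed.
func (s *Store) begin(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, seen := s.records[key]; seen {
		return false
	}
	s.records[key] = stateInProgress
	return true
}

func (s *Store) complete(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.records[key] = stateCompleted
}

// release drops the claim so a later redelivery can retry a failed effect.
func (s *Store) release(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.records, key)
}

// RunOnce follows check, in-progress, side effect, completed. The bool
// reports whether the side effect was actually invoked for this delivery.
func RunOnce(s *Store, key string, effect func() error) (bool, error) {
	if !s.begin(key) {
		return false, nil
	}
	if err := effect(); err != nil {
		s.release(key)
		return true, err
	}
	s.complete(key)
	return true, nil
}
```

&lt;p&gt;Delivering the same key twice invokes the effect once; the second delivery returns without running it, which is the property the counting fake handler checks.&lt;/p&gt;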

&lt;h3&gt;
  
  
  Single stream now, fairness deferred — but the door is left open
&lt;/h3&gt;

&lt;p&gt;Fair/weighted per-tenant scheduling was explicitly deferred. But &lt;code&gt;tenant_id&lt;/code&gt; was added to the queue payload &lt;em&gt;now&lt;/em&gt;, and a cheap intra-batch per-tenant round-robin (&lt;code&gt;Worker.fairOrder&lt;/code&gt;) reorders only the messages a worker already read. That's a bounded down-payment on fairness — no per-tenant queues, no cross-worker coordination — that avoids a queue rewrite later.&lt;/p&gt;
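
&lt;p&gt;One way such an intra-batch round-robin might look. The &lt;code&gt;Msg&lt;/code&gt; type and the free function &lt;code&gt;FairOrder&lt;/code&gt; are assumptions for this sketch (the post names a &lt;code&gt;Worker.fairOrder&lt;/code&gt; method but does not show it).&lt;/p&gt;

```go
package main

// Msg is a minimal stand-in for a queue message that carries tenant_id.
type Msg struct {
	TenantID string
	RunID    string
}

// FairOrder interleaves a batch so each tenant gets one slot per round.
// Tenants are visited in order of first appearance, and each tenant's own
// messages keep their original relative order.
func FairOrder(batch []Msg) []Msg {
	var tenants []string
	byTenant := make(map[string][]Msg)
	for _, m := range batch {
		if _, ok := byTenant[m.TenantID]; !ok {
			tenants = append(tenants, m.TenantID)
		}
		byTenant[m.TenantID] = append(byTenant[m.TenantID], m)
	}

	out := make([]Msg, 0, len(batch))
	for len(batch) > len(out) {
		for _, t := range tenants {
			queue := byTenant[t]
			if len(queue) == 0 {
				continue
			}
			out = append(out, queue[0])
			byTenant[t] = queue[1:]
		}
	}
	return out
}
```

&lt;p&gt;A batch of &lt;code&gt;A1 A2 A3 B1 C1&lt;/code&gt; comes out as &lt;code&gt;A1 B1 C1 A2 A3&lt;/code&gt;: tenant A can no longer starve B and C within the batch, yet nothing outside the messages this worker already holds is touched.&lt;/p&gt;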

&lt;h3&gt;
  
  
  LLM behind a port + a reliability wrapper + a fake
&lt;/h3&gt;

&lt;p&gt;Part 1's no-op &lt;code&gt;StubExecutor&lt;/code&gt; is replaced by an executor that dispatches by job type and wraps an &lt;code&gt;LLMClient&lt;/code&gt; interface in a reliability layer: &lt;code&gt;context.WithTimeout&lt;/code&gt;, output-schema validation, retry within a budget, and a pre-run cost estimate checked against the tenant's &lt;code&gt;max_daily_llm_cost_cents&lt;/code&gt;. &lt;strong&gt;Invalid LLM output is never marked success&lt;/strong&gt; — it retries, then fails. A deterministic fake client backs all tests; a real Anthropic adapter is a thin, deliberately-later add. The &lt;em&gt;interesting logic&lt;/em&gt; lives in the wrapper, not the vendor SDK, so it's fully testable without a live API key.&lt;/p&gt;

&lt;h3&gt;
  
  
  Typed events + feature-level metrics
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;run_events&lt;/code&gt; graduated from status-only to &lt;code&gt;event_type TEXT&lt;/code&gt; + &lt;code&gt;event_payload JSONB&lt;/code&gt; with error code/message fields, and every lifecycle transition emits one. Alongside, the task feature got its own Prometheus vectors (&lt;code&gt;task_runs_total{status}&lt;/code&gt;, &lt;code&gt;task_run_duration_seconds&lt;/code&gt;, &lt;code&gt;task_dlq_total&lt;/code&gt;, per-tenant quota rejections). The diff also ships a Grafana dashboard JSON and alert rules — though actual dashboard provisioning is correctly scoped as infra work, not part of the code change.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Few Things Worth Stealing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write the non-goals down.&lt;/strong&gt; The single most effective scope-control tool was an explicit "What We're NOT Doing" list — and it's where you put the &lt;em&gt;almost&lt;/em&gt;-deferred items (like intra-batch fairness) with a note on exactly how far they go.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slice by blast radius, end every slice in a checkpoint.&lt;/strong&gt; Identity went first because every later signature depended on it. Each of the seven phases was independently shippable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency beats exactly-once.&lt;/strong&gt; If you're tempted to chase distributed exactly-once, redirect that energy into making handlers idempotent under at-least-once delivery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep external dependencies behind a port with a fake.&lt;/strong&gt; The LLM reliability layer was fully testable before a single real API call existed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let the design doc write the PR.&lt;/strong&gt; If your "why" already exists in writing, the pull request stops being a chore and becomes genuinely useful to reviewers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The throughline across both parts: a disciplined, artifact-producing AI workflow turns both a greenfield build &lt;em&gt;and&lt;/em&gt; a sprawling six-area hardening pass into something a reviewer can actually follow — because every decision was written down &lt;em&gt;before&lt;/em&gt; it was coded, not reverse-engineered from the diff afterward.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Missed the start? &lt;strong&gt;Part 1: Building the scheduler&lt;/strong&gt; covers the MCP interface, the watcher → queue → worker execution path, and the design decisions behind the prototype.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>ai</category>
      <category>systemdesign</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Building a ChatGPT Task Scheduler in Go (Part 1) — MCP, Queues, and a Research-First AI Workflow</title>
      <dc:creator>Kan-Chen Lin</dc:creator>
      <pubDate>Mon, 15 Jun 2026 01:46:14 +0000</pubDate>
      <link>https://dev.to/kanchen_lin_331136af621d/building-a-chatgpt-task-scheduler-in-go-part-1-mcp-queues-and-a-research-first-ai-workflow-25kn</link>
      <guid>https://dev.to/kanchen_lin_331136af621d/building-a-chatgpt-task-scheduler-in-go-part-1-mcp-queues-and-a-research-first-ai-workflow-25kn</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;📦 &lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/linkc0829/go-chatgpt-tasks" rel="noopener noreferrer"&gt;github.com/linkc0829/go-chatgpt-tasks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;Part 1 of 2&lt;/strong&gt;. Part 1 builds the scheduler from scratch — the MCP interface, the queue-decoupled execution path, and the AI workflow that drove it. Part 2 takes this prototype and hardens it for multi-tenant production (tenancy, DST-safe recurrence, idempotency, job chains, LLM reliability, observability).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The brief was deceptively small: &lt;em&gt;build a job scheduler with an MCP (Model Context Protocol) interface.&lt;/em&gt; Users schedule tasks via MCP tool calls, a background watcher scans for due jobs and pushes them onto a queue, workers pull and execute them, and the whole thing supports create / list / status / cancel. The interesting constraints were the ones hiding behind that one-liner: how do you make it &lt;em&gt;scale&lt;/em&gt; (the design target was 10K jobs/sec), and how do you make execution &lt;em&gt;reliable&lt;/em&gt; (at-least-once delivery without double-running side effects)?&lt;/p&gt;

&lt;p&gt;This post covers two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;AI workflow&lt;/strong&gt; that turned that one-liner into a designed, sliced, verifiable feature&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;high-level design and the system-design decisions&lt;/strong&gt; behind the prototype&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. The AI Workflow: Research and Design Before Code
&lt;/h2&gt;

&lt;p&gt;The failure mode of "AI, build me a task scheduler" is that the model starts editing files in minute two — before anyone has agreed on what's actually being built, or even mapped what already exists. I used a staged workflow (internally I call it &lt;strong&gt;QRSPI&lt;/strong&gt;) that produces a written artifact at every gate, and nothing downstream starts until the upstream artifact exists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Question   → neutral research questions (no opinions)
2. Research   → objective answers, grounded in the actual code
3. Design     → where are we going, and why
4. Structure  → vertical slices + test checkpoints
5. Plan       → the tactical, file-by-file working doc
6. Worktree   → isolated git worktree for implementation
7. Implement  → execute phase-by-phase, verify each
8. PR         → description grounded in design + the real diff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this bought me on a greenfield feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Research mapped the terrain before designing on it.&lt;/strong&gt; The first pass answered factual questions about the &lt;em&gt;existing&lt;/em&gt; codebase: &lt;em&gt;Is there any scheduler/cron/worker precedent? Is Redis wired up? How does the app lifecycle start and stop goroutines? How many inbound transports exist?&lt;/em&gt; The answers were sobering and useful — there was &lt;strong&gt;no&lt;/strong&gt; background-loop precedent, Redis was wired but passed around as an unused &lt;code&gt;_&lt;/code&gt;, the lifecycle ran exactly one goroutine, and there was exactly one transport (HTTP). Every later decision referenced those facts instead of a guess.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Design named the non-goals.&lt;/strong&gt; Before any code, the design doc wrote down what we were &lt;strong&gt;NOT&lt;/strong&gt; doing: no Python server (the ticket's Python verification commands became Go equivalents), no real task side effects (a stub executor), no auth on the MCP transport this pass, and no actual load test at 10K jobs/sec, only the &lt;em&gt;shape&lt;/em&gt; that target implies. Naming the non-goals is what kept a "small" scheduler from sprawling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structure sliced by demoability.&lt;/strong&gt; Four phases, each independently valuable: (1) the MCP CRUD surface, (2) the watcher + queue, (3) the worker pool + execution + DLQ, (4) recurring re-scheduling. If phase 3 stalled, phases 1–2 still delivered a working MCP surface and a queue-publishing watcher.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The meta-lesson: &lt;strong&gt;on a feature with no local precedent, the research phase is where the value is.&lt;/strong&gt; Half the design decisions were really "establish the &lt;em&gt;first&lt;/em&gt; instance of a pattern this repo will reuse" — and you can only know that if you've mapped what exists first.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. High-Level Design
&lt;/h2&gt;

&lt;p&gt;The codebase is a &lt;strong&gt;hexagonal (ports &amp;amp; adapters) Go backend with a feature-first layout&lt;/strong&gt;: each feature is one package under &lt;code&gt;internal/&amp;lt;feature&amp;gt;/&lt;/code&gt; following an 11-file structure (domain, service, ports, adapters as separate files), with composition happening only in &lt;code&gt;internal/bootstrap/wire.go&lt;/code&gt;. The scheduler became a new &lt;code&gt;internal/task/&lt;/code&gt; slice plus a second inbound transport.&lt;/p&gt;

&lt;p&gt;The core idea is a &lt;strong&gt;producer/consumer split with a queue in the middle&lt;/strong&gt;, so the thing that &lt;em&gt;finds&lt;/em&gt; due work and the thing that &lt;em&gt;executes&lt;/em&gt; it can scale independently:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4e37o26ib0ve69bdouq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4e37o26ib0ve69bdouq.png" alt="gpt-task-scheduler-design" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lifecycle of a job, read top-down:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An MCP tool call (&lt;code&gt;task.create&lt;/code&gt;) goes through a registry into the shared &lt;code&gt;task.Service&lt;/code&gt;, which persists a &lt;code&gt;Job&lt;/code&gt; (the definition) and a first &lt;code&gt;JobRun&lt;/code&gt; in &lt;code&gt;pending&lt;/code&gt; status.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;watcher&lt;/strong&gt; periodically queries for &lt;code&gt;pending&lt;/code&gt; runs due within a 5-minute window, pushes each to Redis, and marks it &lt;code&gt;queued&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;worker&lt;/strong&gt; from the pool consumes the stream, runs the (stub) &lt;code&gt;Executor&lt;/code&gt;, and drives the state machine: &lt;code&gt;queued → running → success&lt;/code&gt;, or &lt;code&gt;retry → requeue&lt;/code&gt;, or after max attempts &lt;code&gt;failed → DLQ&lt;/code&gt;. Every transition appends a &lt;code&gt;RunEvent&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;recurring watcher&lt;/strong&gt; polls those &lt;code&gt;RunEvent&lt;/code&gt;s; when a recurring job's run reaches a terminal state, it creates the next &lt;code&gt;pending&lt;/code&gt; &lt;code&gt;JobRun&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The same &lt;code&gt;task.Service&lt;/code&gt; is wired into two processes — &lt;code&gt;cmd/api&lt;/code&gt; (HTTP API + the background goroutines) and &lt;code&gt;cmd/mcp&lt;/code&gt; (the stdio MCP server) — sharing one Postgres and one Redis.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. System-Design Decisions
&lt;/h2&gt;

&lt;p&gt;These are the calls that mattered, with the reasoning the workflow forced into writing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Redis Streams as the queue
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;XADD&lt;/code&gt; to enqueue, a &lt;strong&gt;consumer group&lt;/strong&gt; + &lt;code&gt;XREADGROUP&lt;/code&gt; for per-message exclusivity, &lt;code&gt;XACK&lt;/code&gt; on success, and &lt;code&gt;XAUTOCLAIM&lt;/code&gt; (idle-based reclaim) for the visibility-timeout redelivery the brief required. A separate stream is the DLQ. This gives &lt;strong&gt;at-least-once&lt;/strong&gt; delivery natively; pairing it with &lt;code&gt;job_run_id&lt;/code&gt; as an idempotency key approximates exactly-once processing, as long as handlers deduplicate on that key. Streams hit the sweet spot: richer than a plain list, far less operational weight than Kafka for a prototype.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a queue between watcher and worker at all
&lt;/h3&gt;

&lt;p&gt;A single cron that &lt;em&gt;both&lt;/em&gt; scans and executes can't keep up at the target rate, and has nowhere to put a failed job. The queue decouples producer from consumer so workers scale horizontally; it lets a worker &lt;strong&gt;requeue and retry&lt;/strong&gt; on failure; it's the layer that provides the at-least-once guarantee; and it gives failed-past-max-retries jobs a &lt;strong&gt;DLQ&lt;/strong&gt; to land in for later review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time-bucket partitioning for the watcher query
&lt;/h3&gt;

&lt;p&gt;Instead of &lt;code&gt;SELECT … WHERE scheduled_at &amp;lt;= now()&lt;/code&gt; scanning an ever-growing table, &lt;code&gt;job_runs&lt;/code&gt; carries an hourly &lt;code&gt;time_bucket&lt;/code&gt; column and the watcher filters &lt;code&gt;status='pending' AND time_bucket = $bucket AND scheduled_at &amp;lt;= now()+5min&lt;/code&gt;, backed by a composite index. At 600K jobs/minute a naive predicate forces the DB to collect matching rows across the whole table; bucketing keeps each scan local. (Native Postgres range partitioning was the aspiration; the prototype ships the single-table + index form, which satisfies the same query pattern.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Supervised goroutines, never fire-and-forget
&lt;/h3&gt;

&lt;p&gt;The existing app ran a single goroutine and had a fixed-order sequential shutdown. The watcher, worker pool, and recurring watcher all launch under a &lt;strong&gt;derived &lt;code&gt;context.Context&lt;/code&gt; and a &lt;code&gt;sync.WaitGroup&lt;/code&gt;&lt;/strong&gt; inside &lt;code&gt;App.Run&lt;/code&gt;; &lt;code&gt;App.Shutdown&lt;/code&gt; cancels the context and waits for every goroutine to drain &lt;em&gt;before&lt;/em&gt; closing Redis and Postgres. No leaked goroutines, no work cut off mid-flight. This deliberately extended the lifecycle pattern rather than bolting background loops on the side.&lt;/p&gt;

&lt;h3&gt;
  
  
  Domain model: &lt;code&gt;Job&lt;/code&gt; / &lt;code&gt;JobRun&lt;/code&gt; / &lt;code&gt;RunEvent&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Job&lt;/code&gt; is the definition (one-off or recurring + schedule spec). &lt;code&gt;JobRun&lt;/code&gt; is a single execution attempt with a status. &lt;code&gt;RunEvent&lt;/code&gt; is an append-only audit log of every transition and, crucially, the thing the recurring watcher polls to decide when to schedule the next run. Entities hold unexported fields, validate invariants in &lt;code&gt;New*&lt;/code&gt; constructors, and expose &lt;strong&gt;pure, context-free transition methods&lt;/strong&gt; (&lt;code&gt;MarkRunning&lt;/code&gt;, &lt;code&gt;MarkSuccess&lt;/code&gt;, &lt;code&gt;MarkRetry&lt;/code&gt;, &lt;code&gt;MarkFailed&lt;/code&gt;, &lt;code&gt;Cancel&lt;/code&gt;). Because the fields are unexported, code outside the package cannot assemble an entity in an invalid state field by field; it has to go through the validating constructors.&lt;/p&gt;

&lt;h3&gt;
  
  
  A registry, not an if-else chain, for MCP tools
&lt;/h3&gt;

&lt;p&gt;Tool dispatch is a &lt;code&gt;map[string]toolHandler&lt;/code&gt; with O(1) routing. The pragmatic reason: adding the 20th tool shouldn't mean editing a 20-branch conditional. The naming convention (&lt;code&gt;task.create&lt;/code&gt;, not &lt;code&gt;createTask&lt;/code&gt;) is a &lt;strong&gt;namespace + action-verb&lt;/strong&gt; pattern that helps the LLM on the other end pick the right tool more reliably.&lt;/p&gt;

&lt;h3&gt;
  
  
  A pluggable, stubbed &lt;code&gt;Executor&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Real task execution sits behind an &lt;code&gt;Executor&lt;/code&gt; port; the prototype ships a &lt;code&gt;StubExecutor&lt;/code&gt; that logs and marks success (with configurable failure for tests). This was deliberate: it exercises the &lt;em&gt;entire&lt;/em&gt; queue / retry / DLQ / event machinery without committing to real side effects, so the reliability behavior is fully testable before any real handler exists. (Part 2 replaces this stub with a real job-type-dispatching executor and an LLM reliability layer.)&lt;/p&gt;




&lt;h2&gt;
  
  
  A Few Things Worth Stealing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Research the existing codebase before you design.&lt;/strong&gt; Half my "decisions" were really "this is the first instance of a pattern" — only visible because research established there was no precedent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decouple producer from consumer with a queue.&lt;/strong&gt; The watcher finds work; workers execute it; the queue absorbs bursts, enables retries, and gives you a DLQ for free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At-least-once + an idempotency key beats chasing exactly-once.&lt;/strong&gt; Let the queue guarantee delivery and make the &lt;em&gt;work&lt;/em&gt; idempotent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supervise your goroutines.&lt;/strong&gt; A derived context plus a &lt;code&gt;WaitGroup&lt;/code&gt; turns "background loops" into something that shuts down cleanly instead of leaking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stub the side effect, exercise the machinery.&lt;/strong&gt; A pluggable stub executor let me prove the whole retry/DLQ path before writing a single real handler.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the prototype: an MCP-driven, queue-decoupled scheduler with a clean state machine and a testable execution path — but single-tenant, timezone-naive, with a no-op executor and no idempotency on real side effects. &lt;strong&gt;Part 2 is the production-hardening cycle&lt;/strong&gt; that fixes every one of those.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: &lt;strong&gt;Part 2: Hardening the scheduler for production&lt;/strong&gt; covers tenancy, DST-safe recurrence, idempotency under at-least-once delivery, linear job chains, an LLM reliability layer, and observability.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>ai</category>
      <category>systemdesign</category>
      <category>architecture</category>
    </item>
    <item>
      <title>From Template to Production-Shaped: An AI-Native Dev Flow for Go Side Projects</title>
      <dc:creator>Kan-Chen Lin</dc:creator>
      <pubDate>Tue, 26 May 2026 02:07:42 +0000</pubDate>
      <link>https://dev.to/kanchen_lin_331136af621d/from-template-to-production-shaped-an-ai-native-dev-flow-for-go-side-projects-245g</link>
      <guid>https://dev.to/kanchen_lin_331136af621d/from-template-to-production-shaped-an-ai-native-dev-flow-for-go-side-projects-245g</guid>
      <description>&lt;p&gt;I wanted my next side project to &lt;em&gt;look&lt;/em&gt; like the kind of code I'd ship at work — hexagonal architecture, sqlc, depguard, integration tests — without the usual side-project tax of spending three evenings on scaffolding before writing the first line of domain logic. So I built it twice. First, I forked a Go backend template I'd been hardening for months. Then I drove every feature on top of it through a structured AI workflow I call &lt;strong&gt;qrspi&lt;/strong&gt;: &lt;em&gt;question → research → structure → plan → implement&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The product itself is unremarkable on purpose: a QR code generator. Paste a URL, get back a scannable PNG and a &lt;code&gt;/r/:token&lt;/code&gt; redirect, with per-link scan counts and a soft-delete kill switch. The interesting part — the part I'd want a reviewer to look at — is the &lt;em&gt;process&lt;/em&gt; that produced it.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/linkc0829/go-qrcode-generator" rel="noopener noreferrer"&gt;linkc0829/go-qrcode-generator&lt;/a&gt;. Every artifact mentioned in this post is committed there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Choose the template, then commit to its rules
&lt;/h2&gt;

&lt;p&gt;The first step was choosing what to build on. I shortlisted several Go backend templates, walked through each one with Claude to pressure-test the architecture, and landed on the one I'd been hardening for a while: &lt;a href="https://github.com/linkc0829/go-backend-template" rel="noopener noreferrer"&gt;linkc0829/go-backend-template&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The template is a feature-first hexagonal Go backend. Each feature lives in a single package under &lt;code&gt;internal/&amp;lt;feature&amp;gt;/&lt;/code&gt;, and inside that package, &lt;code&gt;domain.go&lt;/code&gt;, &lt;code&gt;service.go&lt;/code&gt;, &lt;code&gt;ports.go&lt;/code&gt;, and the adapters sit side-by-side as separate files. The Go package boundary &lt;em&gt;is&lt;/em&gt; the hexagon edge.&lt;/p&gt;

&lt;p&gt;What makes it stick is &lt;code&gt;depguard&lt;/code&gt; in &lt;code&gt;.golangci.yml&lt;/code&gt;. The build fails if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;domain.go&lt;/code&gt; imports anything beyond stdlib and shared value objects&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;service.go&lt;/code&gt; reaches for a driver or web framework&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;handler_*.go&lt;/code&gt; touches a repo or cache directly&lt;/li&gt;
&lt;li&gt;one feature imports another feature&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last rule is the one that pays the most rent. Cross-feature dependencies are forced through &lt;strong&gt;capability ports&lt;/strong&gt; — feature A defines an interface named after the capability it needs, and the composition root in &lt;code&gt;internal/bootstrap/wire.go&lt;/code&gt; injects feature B's service to satisfy it. The features never know about each other.&lt;/p&gt;
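
&lt;p&gt;A minimal sketch of the capability-port idea, collapsed into one file so it runs on its own. The names &lt;code&gt;OwnerChecker&lt;/code&gt;, &lt;code&gt;QRService&lt;/code&gt;, &lt;code&gt;UserService&lt;/code&gt;, and &lt;code&gt;wire&lt;/code&gt; are hypothetical and are not taken from the template. In the real layout each feature would sit in its own package under &lt;code&gt;internal/&lt;/code&gt;, and only the composition root would import both.&lt;/p&gt;

```go
package main

import "fmt"

// --- feature "qrcode" (hypothetical): declares the capability it needs ---

// OwnerChecker is a capability port. It is named after what this feature
// needs, not after whichever feature ends up providing it.
type OwnerChecker interface {
	UserExists(id string) bool
}

// QRService depends only on the port, never on another feature's types.
type QRService struct {
	owners OwnerChecker
}

func NewQRService(owners OwnerChecker) QRService {
	return QRService{owners: owners}
}

// Create rejects links for owners the injected capability does not know.
func (s QRService) Create(ownerID, url string) error {
	if !s.owners.UserExists(ownerID) {
		return fmt.Errorf("unknown owner %q", ownerID)
	}
	return nil // persistence omitted in this sketch
}

// --- feature "user" (hypothetical): knows nothing about qrcode ---

type UserService struct {
	known map[string]bool
}

// UserExists happens to match OwnerChecker; Go interfaces are satisfied
// implicitly, so this feature never imports the port.
func (u UserService) UserExists(id string) bool { return u.known[id] }

// --- composition root: the only place both features meet ---

func wire() QRService {
	users := UserService{known: map[string]bool{"u1": true}}
	return NewQRService(users)
}

func main() {
	svc := wire()
	fmt.Println(svc.Create("u1", "https://example.com") == nil) // known owner
	fmt.Println(svc.Create("u2", "https://example.com") == nil) // unknown owner
}
```

&lt;p&gt;Because Go satisfies interfaces implicitly, the providing feature needs no import of the port, which is what lets a lint rule forbid feature-to-feature imports outright.&lt;/p&gt;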

&lt;p&gt;This was the first decision I had to actually live with. The template ships with demo &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;order&lt;/code&gt;, and &lt;code&gt;payment&lt;/code&gt; slices. My project has no orders and no payments. The rule is: don't leave dead code as "future scaffolding." Delete the whole slice — the package, the wire block, the SQL queries, the migration tables, the OpenAPI paths, the depguard block. &lt;code&gt;make lint &amp;amp;&amp;amp; make test&lt;/code&gt; after each removal flushes out dangling references. By the time I started writing QR code logic, the repo only knew about things that existed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Build the spec from a real system-design prompt
&lt;/h2&gt;

&lt;p&gt;The functional spec came from a system-design exercise I'd worked through separately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate a QR code from a URL&lt;/li&gt;
&lt;li&gt;302 redirect through our server on every scan (so we can count, and so we can kill a link)&lt;/li&gt;
&lt;li&gt;Targets: redirect latency &amp;lt; 100 ms, 1B codes, 100M users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The high-level design called out the load shape — read-heavy, one write to thousands of reads — and the levers that fall out of it: stateless API behind a gateway, cache &lt;code&gt;qr_token → image_url&lt;/code&gt;, CDN the PNGs, index on &lt;code&gt;qr_token&lt;/code&gt;. Tokens were originally specified as &lt;code&gt;base62(SHA-256(url + user_secret))&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For the local build, I wrote down explicit &lt;strong&gt;deviations&lt;/strong&gt; from the spec rather than pretending they didn't exist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokens are 96 bits from &lt;code&gt;crypto/rand&lt;/code&gt;, encoded as &lt;code&gt;base64url&lt;/code&gt;. This gives up idempotency for repeated (user, url) pairs but avoids the deterministic-token leak surface.&lt;/li&gt;
&lt;li&gt;The CDN tier is dropped. The browser fetches PNGs directly from MinIO using its anonymous &lt;code&gt;download&lt;/code&gt; bucket policy. Same architectural shape as S3+CloudFront, minus the edge cache.&lt;/li&gt;
&lt;li&gt;Soft delete and PUT/DELETE endpoints land in a later slice.&lt;/li&gt;
&lt;/ul&gt;
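
&lt;p&gt;The token deviation is small enough to sketch. This is one plausible shape, assuming unpadded base64url; &lt;code&gt;newToken&lt;/code&gt; and &lt;code&gt;tokenBytes&lt;/code&gt; are illustrative names, not the repo's.&lt;/p&gt;

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// tokenBytes is 12 bytes, i.e. 96 bits of entropy.
const tokenBytes = 12

// newToken returns a random URL-safe token. Twelve bytes encode to exactly
// 16 base64url characters, so the unpadded encoding loses nothing.
func newToken() (string, error) {
	buf := make([]byte, tokenBytes)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("read random bytes: %w", err)
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

func main() {
	tok, err := newToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(tok), tok) // length is always 16; the token itself is random
}
```

&lt;p&gt;Two calls with the same URL yield different tokens, which is exactly the idempotency loss noted above.&lt;/p&gt;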

&lt;p&gt;Writing the deviations down up front is the part that makes the design honest. It's also the part that makes a portfolio reviewer's job easier — they can see &lt;em&gt;what was traded and why&lt;/em&gt;, not just what got built.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: qrspi — the workflow that does the actual building
&lt;/h2&gt;

&lt;p&gt;The workflow idea started from &lt;a href="https://github.com/humanlayer/advanced-context-engineering-for-coding-agents" rel="noopener noreferrer"&gt;Research-Plan-Implement&lt;/a&gt; (RPI). qrspi is an eight-phase extension of it that I picked up from community discussions and adapted for this project.&lt;/p&gt;

&lt;p&gt;Once the spec was on paper, every feature went through the same eight phases. Each phase is a slash command backed by a skill, and each one writes its artifact to &lt;code&gt;thoughts/qrspi/&amp;lt;date&amp;gt;-&amp;lt;slug&amp;gt;/&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/qrspi:1_question&lt;/code&gt;&lt;/strong&gt; — decompose the ticket into neutral research questions. No opinions yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/qrspi:2_research&lt;/code&gt;&lt;/strong&gt; — answer the questions by reading the codebase. Facts only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/qrspi:3_design&lt;/code&gt;&lt;/strong&gt; — discuss &lt;em&gt;where&lt;/em&gt; we're going before &lt;em&gt;how&lt;/em&gt;. Trade-offs surface here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/qrspi:4_structure&lt;/code&gt;&lt;/strong&gt; — outline vertical slices with test checkpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/qrspi:5_plan&lt;/code&gt;&lt;/strong&gt; — the tactical implementation plan; my working document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/qrspi:6_worktree&lt;/code&gt;&lt;/strong&gt; — isolated git worktree so the main checkout stays clean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/qrspi:7_implement&lt;/code&gt;&lt;/strong&gt; — execute the plan phase by phase, verifying at each checkpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/qrspi:8_pr&lt;/code&gt;&lt;/strong&gt; — open a PR that carries the design context forward into review.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The MinIO feature shows the whole thing on disk: a &lt;code&gt;ticket.md&lt;/code&gt; pulled from the Notion source via MCP, then &lt;code&gt;questions.md&lt;/code&gt;, &lt;code&gt;research.md&lt;/code&gt;, &lt;code&gt;design.md&lt;/code&gt;, &lt;code&gt;structure.md&lt;/code&gt;, and &lt;code&gt;plan.md&lt;/code&gt;. Each one builds on the last. By the time implementation starts, the agent isn't guessing — it's executing a plan I already agreed with.&lt;/p&gt;

&lt;p&gt;The follow-up Redis redirect cache shipped the same way. The plan called out the read-heavy shape, picked a write-behind click-count buffer to avoid hammering Postgres on every scan, and named the cache invariants explicitly. The implementation was almost mechanical because the design phase had already resolved the interesting questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this buys, and what it costs
&lt;/h2&gt;

&lt;p&gt;The cost is real: each feature carries five or six markdown files of design artifacts. For a single-developer side project, that's overhead I wouldn't tolerate in a freeform sketch.&lt;/p&gt;

&lt;p&gt;What it buys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The diff is reviewable.&lt;/strong&gt; Every commit is small, scoped, and traceable back to a design decision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The architecture holds.&lt;/strong&gt; depguard catches the slow-drift violations (handler reaches into a repo, feature A imports feature B) the moment they appear, not three months later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The agent stays useful past 2,000 LOC.&lt;/strong&gt; Most AI-coding flows degrade as the codebase grows because the model loses the plot. Writing the plot down — in &lt;code&gt;design.md&lt;/code&gt;, in &lt;code&gt;plan.md&lt;/code&gt; — keeps the next session grounded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The portfolio story is the &lt;em&gt;process&lt;/em&gt;, not the artifact.&lt;/strong&gt; Anyone can ship a QR code generator. Shipping one where the architecture, the trade-offs, and the deviations from spec are all written down on disk is a different signal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The template is on GitHub; the qrspi artifacts are committed alongside the code. If you want to see how a single feature flows from ticket to PR, &lt;a href="https://github.com/linkc0829/go-qrcode-generator/tree/main/thoughts/qrspi/2026-05-15-minio-local-object-storage" rel="noopener noreferrer"&gt;the MinIO slice&lt;/a&gt; is the cleanest example. The architecture ADRs in &lt;code&gt;docs/adr/&lt;/code&gt; cover the two foundational decisions: feature-first hexagonal, and sqlc over an ORM.&lt;/p&gt;

&lt;p&gt;Next up: a metrics slice (Prometheus + a Grafana dashboard for redirect latency), and a proper deletion follow-up so the spec's full CRUD surface lands. Both will go through qrspi. That's the point.&lt;/p&gt;

</description>
      <category>go</category>
      <category>architecture</category>
      <category>ai</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
