Hot Reloading Go Video Services with Air Without Losing State

#go #webdev #productivity #tutorial

The 40-Second Feedback Loop That Was Killing My Productivity

Last quarter I rewrote the trending-video ingestion service at TopVideoHub from a PHP cron worker into a long-running Go service. The reason was simple: we aggregate trending video metadata across roughly a dozen Asia-Pacific regions, and each region has its own polling cadence, its own language tokenizer quirks, and its own rate-limit budget. A request-scoped PHP process spinning up per cron tick could not hold the in-memory region schedulers, the warm HTTP connection pools to upstream providers, or the prepared statement cache against our SQLite store. Go could.

The trouble showed up on day two. Every time I touched a handler, a region poller, or a CJK normalization helper, I had to Ctrl-C the running binary, run go build ./cmd/ingest, wait for the compile, restart it, and then wait again while the service re-warmed its connection pools and re-read the region config from disk. On my machine that round trip was somewhere between 30 and 45 seconds. When you are debugging why Japanese trending titles are getting tokenized as one giant token, you do that loop forty or fifty times an hour. The math is brutal: I was spending more time waiting for restarts than reading code.

This is the exact problem Air solves. It watches your source tree, rebuilds the binary on change, and restarts the process for you. But the naive setup most tutorials show you will actively hurt a video service: it rebuilds on irrelevant file changes, it kills in-flight upstream fetches mid-write, and it does nothing to preserve the slow-to-rebuild state that made you choose a long-running service in the first place. This article is how I configured Air for a real ingestion service, what broke, and the patterns that made the inner loop feel instant.

What Air Actually Does, and Why a Video Service Stresses It

Air is a file watcher plus a build-and-restart supervisor. The loop is: detect a write under a watched path, debounce, run your build command, and if the build succeeds, kill the old process and exec the new one. Conceptually trivial. The friction comes from the assumptions.
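
The debounce step is the least obvious part of that loop. As a rough mental model only (this sketches trailing-edge debouncing in general, not Air's actual code, and the function name is made up for illustration), the following computes when builds would start for a series of file writes:

```go
package main

// debounce models the "wait for quiet" step. Given file-write timestamps in
// milliseconds (ascending) and a quiet period, it returns the times at which
// a build would start. A build starts delayMs after the last write of each
// burst; writes closer together than delayMs belong to the same burst.
func debounce(writesMs []int, delayMs int) []int {
	var builds []int
	for i, w := range writesMs {
		isLast := i == len(writesMs)-1
		if isLast || writesMs[i+1]-w >= delayMs {
			builds = append(builds, w+delayMs)
		}
	}
	return builds
}
```

With writes at 0, 50, 120, and 1000 ms and a 200 ms delay, this yields builds at 320 and 1200: three rapid saves from the editor collapse into a single build.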

A typical web tutorial assumes your service is stateless and starts in milliseconds. A video aggregation service violates both assumptions:

Startup is expensive. We open a SQLite database with FTS5 enabled, run a quick integrity check, warm a pool of HTTP clients with custom transports, and load region schedules. Cold start is two to three seconds before the first poll fires.
There is real I/O in flight. When Air kills the process, there may be a half-completed batch insert of trending videos for the JP region. Kill it at the wrong moment and you get a partial write or a locked database file.
Generated and data files live in the tree. Our repo carries a data/ directory with the dev SQLite file, a tmp/ build output dir, and generated tokenizer dictionaries. If Air watches those, the service rebuilds itself in an infinite loop the moment it writes its own database.

So the goal is not "turn on Air." The goal is to watch exactly the Go source that matters, rebuild fast, and shut down cleanly enough that a restart never corrupts the dev database.

A Minimal but Honest Air Configuration

Air reads a .air.toml from the project root. Here is the one I run, with short comments marking each group of settings; I will walk through the decisions afterward.

root = "."
tmp_dir = "tmp"

[build]
  # Build only the ingest entrypoint, output to tmp/
  cmd = "go build -o ./tmp/ingest ./cmd/ingest"
  bin = "./tmp/ingest"
  full_bin = "APP_ENV=dev ./tmp/ingest --config=config/dev.yaml"

  # Only these extensions trigger a rebuild
  include_ext = ["go", "yaml", "sql"]

  # Never watch these — they cause rebuild loops or noise
  exclude_dir = ["tmp", "data", "vendor", "testdata", "node_modules", ".git"]
  exclude_regex = ["_test\\.go", "_gen\\.go"]

  # Debounce: wait 200ms after the last write before building
  delay = 200

  # Give the old process time to shut down gracefully
  kill_delay = "5s"
  send_interrupt = true
  stop_on_error = true

[log]
  time = true

[misc]
  clean_on_exit = true

The settings that matter most for a video service are the watch filters and the shutdown options in [build].

include_ext is an explicit allowlist; I set it myself instead of relying on Air's defaults. I include yaml because our region schedules and rate-limit budgets live in config, and I genuinely want a restart when I tweak a polling interval. I include sql because we keep migration files and FTS5 setup statements as .sql and embed them, so changing one should rebuild. Everything else is excluded by omission.

exclude_dir lists data and tmp explicitly. This is the single most important line. The service writes its dev SQLite file into data/, and Air writes the compiled binary into tmp/. If either is watched, you get a rebuild loop: the service runs, writes the DB, Air sees the write, rebuilds, restarts, which writes the DB again. I have watched a junior engineer lose an afternoon to exactly this before realizing the watcher was chasing its own tail.

exclude_regex drops _test.go and our generated _gen.go files. I do not want a restart of the running service every time I save a test file — tests run in a separate go test watcher. And the generated tokenizer files change as a side effect of builds, so watching them re-triggers builds.
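
To sanity-check a filter like this before trusting it, it can help to restate the three rules as a small predicate. The sketch below is an approximation of the intent of include_ext, exclude_dir, and exclude_regex, not Air's own matching code; the type, the function names, and the example paths in the usage note are all hypothetical.

```go
package main

import (
	"path/filepath"
	"regexp"
	"strings"
)

// watchRules mirrors the three filter settings from the .air.toml above.
type watchRules struct {
	includeExt   []string
	excludeDirs  []string
	excludeRegex []*regexp.Regexp
}

// devRules returns the same values the config uses.
func devRules() watchRules {
	return watchRules{
		includeExt:  []string{"go", "yaml", "sql"},
		excludeDirs: []string{"tmp", "data", "vendor", "testdata", "node_modules", ".git"},
		excludeRegex: []*regexp.Regexp{
			regexp.MustCompile(`_test\.go`),
			regexp.MustCompile(`_gen\.go`),
		},
	}
}

// triggersRebuild reports whether a write to relPath (relative to the
// project root) should cause a rebuild under these rules.
// Assumption: excluded directories are matched at the project root only.
func (r watchRules) triggersRebuild(relPath string) bool {
	relPath = filepath.ToSlash(filepath.Clean(relPath))
	for _, d := range r.excludeDirs {
		if relPath == d || strings.HasPrefix(relPath, d+"/") {
			return false
		}
	}
	for _, re := range r.excludeRegex {
		if re.MatchString(relPath) {
			return false
		}
	}
	ext := strings.TrimPrefix(filepath.Ext(relPath), ".")
	for _, e := range r.includeExt {
		if ext == e {
			return true
		}
	}
	return false
}
```

Under these rules, devRules().triggersRebuild("config/dev.yaml") is true, while "data/dev.sqlite", "tmp/ingest", and any path ending in _test.go or _gen.go are false.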

send_interrupt = true plus kill_delay = "5s" is the graceful-shutdown contract. Without send_interrupt, Air kills the process outright, which is a hard stop with no cleanup. With it, Air sends SIGINT first (the option is not supported on Windows), gives the process kill_delay to exit, and only then kills it. That five-second window is what lets an in-flight JP batch insert finish before the process dies.
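
The interrupt, wait, then kill sequence is easy to picture as code. This is a minimal sketch of the general supervisor pattern, not Air's implementation: stopGracefully, startChild, and graceWindow are names invented here, the signal goes to the child only (not its process group), and sending os.Interrupt this way works on Unix-like systems only.

```go
package main

import (
	"os"
	"os/exec"
	"time"
)

// graceWindow plays the role of kill_delay in this sketch.
const graceWindow = 2 * time.Second

// startChild launches a child process, as a supervisor would.
func startChild(name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(name, args...)
	return cmd, cmd.Start()
}

// stopGracefully sends SIGINT, waits up to grace for the child to exit,
// and hard-kills it only if it is still running after that.
// It reports whether the hard kill was needed.
func stopGracefully(cmd *exec.Cmd, grace time.Duration) (killed bool, err error) {
	if err := cmd.Process.Signal(os.Interrupt); err != nil {
		return false, err
	}
	done := make(chan struct{})
	go func() {
		_ = cmd.Wait() // reaps the child; an exit error is expected after a signal
		close(done)
	}()
	select {
	case <-done:
		return false, nil
	case <-time.After(grace):
		_ = cmd.Process.Kill()
		<-done // wait for the reaper goroutine before returning
		return true, nil
	}
}
```

A child that handles SIGINT and exits inside the window is never hard-killed; one that ignores the signal or drains too slowly is, which is exactly the case the drain timeout in the service has to avoid.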

Making the Service Honor the Interrupt

Air sending SIGINT does nothing useful unless your service listens for it. This is where most setups quietly corrupt their dev database without anyone noticing, because the corruption is intermittent. Here is the shutdown path in our ingest service's main.

package main

import (
    "context"
    "log/slog"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    cfg := mustLoadConfig()
    app, err := NewApp(cfg) // opens SQLite, warms pools, loads region schedules
    if err != nil {
        slog.Error("startup failed", "err", err)
        os.Exit(1)
    }

    // Root context cancelled on SIGINT/SIGTERM — the signal Air sends.
    ctx, stop := signal.NotifyContext(context.Background(),
        syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    // Start region pollers; they select on ctx.Done().
    app.Run(ctx)

    <-ctx.Done()
    slog.Info("interrupt received, draining")

    // Bounded drain: let in-flight batch inserts finish, then close cleanly.
    drainCtx, cancel := context.WithTimeout(context.Background(), 4*time.Second)
    defer cancel()

    if err := app.Shutdown(drainCtx); err != nil {
        slog.Error("unclean shutdown", "err", err)
        os.Exit(1)
    }
    slog.Info("drained, exiting")
}

The key relationship is between Air's kill_delay = "5s" and the 4*time.Second drain timeout. The drain must finish comfortably inside the kill delay, otherwise Air escalates to SIGKILL and you are back to hard kills. I keep a one-second safety margin.

The Shutdown method does the boring but essential work: it stops accepting new poll ticks, waits for any goroutine currently writing a batch to release its transaction, runs a PRAGMA wal_checkpoint(TRUNCATE) so the WAL file does not grow unbounded across dozens of dev restarts, and closes the database handle. Closing the SQLite handle cleanly is what prevents the "database is locked" errors that otherwise appear randomly when Air restarts mid-write.
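
A drain of that shape can be sketched without any SQLite dependency. Everything below is an assumption made for illustration rather than the service's real code: the store interface stands in for the database handle, beginBatch and endBatch are hypothetical hooks that pollers would call around each batch insert, and recordingStore and drainWithin exist only to exercise the sketch.

```go
package main

import (
	"context"
	"errors"
	"sync"
	"time"
)

// store is a hypothetical stand-in for the SQLite handle.
type store interface {
	Checkpoint(ctx context.Context) error // e.g. PRAGMA wal_checkpoint(TRUNCATE)
	Close() error
}

// App tracks in-flight batch writes so Shutdown can wait for them.
type App struct {
	db       store
	mu       sync.Mutex
	stopping bool
	writers  sync.WaitGroup
}

// beginBatch registers a batch write. It returns false once shutdown has
// started, so pollers stop scheduling new work.
func (a *App) beginBatch() bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.stopping {
		return false
	}
	a.writers.Add(1)
	return true
}

// endBatch marks a batch write as finished (after commit or rollback).
func (a *App) endBatch() { a.writers.Done() }

// Shutdown refuses new batches, waits for in-flight ones until ctx expires,
// then checkpoints the WAL and closes the handle.
func (a *App) Shutdown(ctx context.Context) error {
	a.mu.Lock()
	a.stopping = true
	a.mu.Unlock()

	drained := make(chan struct{})
	go func() {
		a.writers.Wait()
		close(drained)
	}()
	select {
	case <-drained:
	case <-ctx.Done():
		return ctx.Err() // writers still busy: report an unclean shutdown
	}
	// Close even if the checkpoint fails, and report both errors.
	return errors.Join(a.db.Checkpoint(ctx), a.db.Close())
}

// recordingStore is a test double that records the calls it receives.
type recordingStore struct{ calls []string }

func (s *recordingStore) Checkpoint(context.Context) error {
	s.calls = append(s.calls, "checkpoint")
	return nil
}

func (s *recordingStore) Close() error {
	s.calls = append(s.calls, "close")
	return nil
}

// drainBudget matches the 4-second drain timeout used in main.
const drainBudget = 4 * time.Second

// drainWithin mirrors the bounded drain in main.
func drainWithin(a *App, budget time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	return a.Shutdown(ctx)
}
```

The mutex around the stopping flag is there so that a poller cannot register a new batch at the same moment Shutdown starts waiting, which would be a misuse of sync.WaitGroup.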

A subtle point: because we use SQLite in WAL mode for the dev database, an unclean kill leaves the -wal and -shm sidecar files behind. SQLite recovers from them on next open, but if a restart happens during checkpoint you can hit a brief lock. The graceful drain plus checkpoint-on-shutdown removes that class of flakiness entirely. This mirrors the durability discipline we run in production, where the same store backs our FTS5 search index — the dev loop should not be sloppier than prod just because it is dev.

Cutting the Rebuild Time, Not Just Automating It

Air restarting for you is only half the win. If each rebuild still takes 30 seconds because you compile the whole monorepo, you have automated the slowness, not removed it. Three changes took our rebuild from ~12 seconds to under 2.

First, build a single binary, not the world. The cmd = "go build -o ./tmp/ingest ./cmd/ingest" line targets one entrypoint. Earlier I had go build ./... out of laziness, which recompiled every command in the repo on every save. Targeting ./cmd/ingest only rebuilds that package and its changed dependencies.

Second, lean on Go's build cache and skip work that does not help the inner loop. I disable inlining and optimizations during dev builds, which speeds compilation and improves debugger stepping:

[build]
  cmd = "go build -gcflags='all=-N -l' -o ./tmp/ingest ./cmd/ingest"

The -N -l flags turn off optimization and inlining. You would never ship this, but for a dev rebuild it shaves time and makes Delve breakpoints land where you expect. The first build after a go clean -cache is still slow, and so is the first build after you change the gcflags, because the flags are part of what the build cache keys on; every incremental build after that is fast because Go caches compiled packages.

Third, and this is the one people miss: keep your slow startup work out of the rebuild path by separating it from the thing that changes often. Our CJK tokenizer dictionary loads from disk at startup and takes about 800ms to parse. I do not want to pay that on every restart while iterating on a handler. So, where possible, I keep the parsed dictionary in a memory-mapped file that survives restarts, and where that is not possible, I gate the load behind a fast-path check. The principle generalizes: anything expensive and unchanging should not be recomputed on every Air restart.

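Here is a compact example of that gating pattern: a lazy loader for the region dictionary that logs what each restart actually paid.

// dictCache loads the CJK tokenizer dictionary at most once per process
// lifetime and logs how long the load took, so I can see whether a given
// restart paid the full price or hit a warm path. sync.Once does not
// survive a restart; the cross-restart warm path lives in parseDictionary.
// Needs the log/slog, sync, and time imports.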
type dictCache struct {
    once sync.Once
    dict *Dictionary
    err  error
}

func (c *dictCache) Load(path string) (*Dictionary, error) {
    c.once.Do(func() {
        start := time.Now()
        c.dict, c.err = parseDictionary(path) // ~800ms cold
        slog.Info("dictionary loaded",
            "path", path,
            "took", time.Since(start).Round(time.Millisecond))
    })
    return c.dict, c.err
}

The logging matters more than it looks. When I can see took=812ms versus took=3ms in the restart logs, I immediately know whether my config change accidentally invalidated the warm path. Observability of the inner loop is how you keep the inner loop fast over months, not just on the day you set it up.
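
For readers wondering what a warm path that survives a restart might look like, here is one possible shape. It deliberately swaps in a simpler technique than the memory-mapped file mentioned earlier: a gob-encoded cache file that is reused when it is at least as new as the source dictionary. The Dictionary type, loadWithCache, and coldThenWarm are illustrative names, and whether decoding a cache actually beats parsing depends on the real dictionary format.

```go
package main

import (
	"encoding/gob"
	"os"
	"path/filepath"
)

// Dictionary is a simplified stand-in for the tokenizer dictionary.
type Dictionary struct{ Entries map[string]int }

// loadWithCache returns the dictionary and whether the warm path was used.
// The cache is trusted only if it is at least as new as the source file.
func loadWithCache(srcPath, cachePath string, parse func(string) (*Dictionary, error)) (*Dictionary, bool, error) {
	src, err := os.Stat(srcPath)
	if err != nil {
		return nil, false, err
	}
	if c, err := os.Stat(cachePath); err == nil && !c.ModTime().Before(src.ModTime()) {
		if d, err := readCache(cachePath); err == nil {
			return d, true, nil
		}
		// Unreadable cache (e.g. truncated by a hard kill): rebuild it.
	}
	d, err := parse(srcPath)
	if err != nil {
		return nil, false, err
	}
	return d, false, writeCache(cachePath, d)
}

func readCache(path string) (*Dictionary, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	d := new(Dictionary)
	return d, gob.NewDecoder(f).Decode(d)
}

func writeCache(path string, d *Dictionary) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if err := gob.NewEncoder(f).Encode(d); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// coldThenWarm loads twice from a scratch directory and reports whether
// each load took the warm path.
func coldThenWarm() (first, second bool, err error) {
	dir, err := os.MkdirTemp("", "dict")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	src := filepath.Join(dir, "dict.txt")
	if err = os.WriteFile(src, []byte("東京 1\n"), 0o644); err != nil {
		return false, false, err
	}
	parse := func(string) (*Dictionary, error) { // stand-in for the slow parser
		return &Dictionary{Entries: map[string]int{"東京": 1}}, nil
	}
	cache := filepath.Join(dir, "dict.gob")
	if _, first, err = loadWithCache(src, cache, parse); err != nil {
		return false, false, err
	}
	_, second, err = loadWithCache(src, cache, parse)
	return first, second, err
}
```

The first load parses and writes the cache; the second finds a fresh cache and skips the parser. A cache file like this belongs in a directory the watcher excludes, for the same reason the dev database does.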

Watching More Than Go — Templates, Config, and Migrations

A video service is rarely pure Go. Ours renders a handful of admin HTML templates, reads YAML region config, and runs SQL migrations. I want a restart on the right subset of these without drowning in false triggers.

The include_ext = ["go", "yaml", "sql"] line handles config and migrations. For templates, I made a deliberate choice not to restart the whole service. Templates are reparsed lazily on each admin request in dev mode via a build tag, so editing a template shows up on the next page refresh with no rebuild at all. Mixing two strategies — full restart for code, hot reparse for templates — is faster than forcing everything through Air.
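
The template half of that split is small. The sketch below uses a plain boolean where the service uses a build tag, and the renderer type, its methods, and the editAndRefresh helper are hypothetical names chosen for illustration:

```go
package main

import (
	"html/template"
	"os"
	"path/filepath"
	"strings"
	"sync"
)

// renderer reparses templates on every call in dev mode and parses them
// once otherwise.
type renderer struct {
	glob string
	dev  bool
	once sync.Once
	tmpl *template.Template
	err  error
}

// templates returns a fresh parse in dev mode, or the cached set otherwise.
func (r *renderer) templates() (*template.Template, error) {
	if r.dev {
		return template.ParseGlob(r.glob)
	}
	r.once.Do(func() { r.tmpl, r.err = template.ParseGlob(r.glob) })
	return r.tmpl, r.err
}

// render executes the named template and returns the output as a string.
func (r *renderer) render(name string, data any) (string, error) {
	t, err := r.templates()
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	err = t.ExecuteTemplate(&sb, name, data)
	return sb.String(), err
}

// editAndRefresh renders a page, edits the template on disk, and renders
// again, returning both outputs.
func editAndRefresh(dev bool) (before, after string, err error) {
	dir, err := os.MkdirTemp("", "tmpl")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	page := filepath.Join(dir, "regions.html")
	if err = os.WriteFile(page, []byte("<h1>{{.}} v1</h1>"), 0o644); err != nil {
		return "", "", err
	}
	r := &renderer{glob: filepath.Join(dir, "*.html"), dev: dev}
	if before, err = r.render("regions.html", "JP"); err != nil {
		return "", "", err
	}
	if err = os.WriteFile(page, []byte("<h1>{{.}} v2</h1>"), 0o644); err != nil {
		return "", "", err
	}
	after, err = r.render("regions.html", "JP")
	return before, after, err
}
```

In dev mode the second render picks up the edit without any restart; with dev set to false both renders return the first version, which is the behavior you would want outside development.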

The decision rule I use for any file type is a short checklist:

Does changing it require recompilation? Go source and embedded SQL do. Templates loaded at runtime do not.
Does a change need fresh in-memory state? Region schedules and rate budgets do, so YAML triggers a restart. A CSS tweak does not.
Does the file get written by the service itself? The dev SQLite file and the WAL sidecars do, so they are excluded — watching them is the rebuild-loop trap.

Run every file type through those three questions and your .air.toml writes itself. The mistake is treating Air as all-or-nothing. It is a tool for the subset of changes that genuinely need a recompile-and-restart, and the more precisely you scope it, the faster and quieter it is.
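
Written down as code, the checklist is a three-way decision. The type, field names, and return values below are invented for illustration; the only point is the order of the checks, with the self-written test first because it overrides the other two.

```go
package main

// fileKind records the answers to the three checklist questions.
type fileKind struct {
	needsRecompile   bool // does changing it require recompilation?
	needsFreshState  bool // does a change need fresh in-memory state?
	writtenByService bool // does the running service write it?
}

// watchDecision maps the answers to what the watcher should do.
func watchDecision(k fileKind) string {
	switch {
	case k.writtenByService:
		return "exclude" // watching it is the rebuild-loop trap
	case k.needsRecompile || k.needsFreshState:
		return "restart"
	default:
		return "ignore" // handled at runtime, e.g. lazily reparsed templates
	}
}
```

Go source and embedded SQL come out as "restart" on the first question, YAML config on the second, the dev SQLite file as "exclude", and runtime-loaded templates as "ignore".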

Running Air Alongside the Rest of the Stack

The ingest service does not run alone. In dev I also run the PHP frontend (the same LiteSpeed-backed stack that serves the public site), a local Cloudflare-equivalent proxy for testing cache headers, and a go test watcher. Trying to cram all of that into Air is a mistake — Air supervises one process. I use a small process manager and let Air own only the Go ingest service.

The one cross-cutting concern is ports and locks. When Air restarts the ingest service, the new process must be able to re-bind the local admin port and re-open the SQLite file immediately. The graceful shutdown above is what guarantees the old process has released both before the new one starts. Without it, you get "address already in use" or "database is locked" on roughly one restart in ten: frequent enough to be maddening, rare enough that you blame your code instead of the watcher.

One more practical tip: turn on timestamps in Air's log output ([log] time = true) and keep the terminal visible. The restart cadence becomes a diagnostic. If I save a handler and see a restart take eight seconds instead of two, something pulled a heavy dependency into the rebuild path, and I want to know immediately, not three weeks later when the inner loop has silently degraded.

What This Bought Us

The honest accounting: setting up Air properly took an afternoon, most of it spent on the graceful-shutdown path and chasing one rebuild loop caused by watching data/. In exchange, the edit-to-running-service loop went from 30-45 seconds to about 2 seconds, and the intermittent "database is locked" errors that used to punctuate a debugging session disappeared because restarts now drain cleanly.

The lessons that generalize beyond our stack:

Air automates the loop; you still have to make the loop fast. Build one binary, use the build cache, and keep expensive startup work off the rebuild path.
Graceful shutdown is not optional for a stateful service. Match your drain timeout to Air's kill_delay, handle SIGINT, and checkpoint your database on the way out.
Scope the watcher precisely. Allowlist extensions, exclude self-written data and build directories, and run each file type through the recompile/state/self-written checklist.
Instrument the inner loop. Log startup cost per restart so degradation is visible the day it happens.

A long-running Go service was the right call for region-aware video ingestion, but only because the development loop stayed tight. Air, configured with the same care you would give a production process, is what keeps a stateful service pleasant to work on day after day.