
Yussuf Ajao

Building a stateless process supervisor in Go: what the OS fights you on

Most "run my stack locally" setups converge on one of:

  • a pile of terminal tabs and shell history,
  • Docker Compose (great when you want containers, heavy when you want a binary and a log file),
  • a real supervisor (supervisord, systemd user units) — powerful, but you are signing up for install surface, config dialects, and often a long-lived daemon.

I wanted something narrower: declare processes in YAML, start them in the background, know if they are alive, stop them predictably, tail logs, and reconcile config changes — without a resident agent. That constraint is the core design: each CLI invocation is a fresh process that reads YAML, inspects the filesystem under a well-known home, and mutates OS state (spawn, signal / terminate, unlink PID files).

This post is implementation-oriented: what invariants we keep, where the OS fights you, and why the boring parts (PID files, digests, polling) matter more than the CLI glue.


Model: stateless supervision

ops does not keep an in-memory registry. "What is running?" is always derived from:

  1. On-disk PID files: ~/.ops/pids/<name>.pid (override base dir with OPS_HOME for tests and hermetic CI).
  2. OS liveness — "does this PID still denote a live process?"

That makes the tool trivial to reason about in CI: no socket, no coordinator, no leader election. It also means no automatic restart unless something external calls ops start again. That is an explicit trade: simplicity and auditability over availability.


Filesystem contract: OpsHome and PID layout

Persistence is intentionally boring: default ~/.ops, overridable with OPS_HOME for tests and hermetic automation.

// internal/process/pid.go
func OpsHome() (string, error) {
    if override := os.Getenv("OPS_HOME"); override != "" {
        return override, nil
    }
    home, err := os.UserHomeDir()
    if err != nil {
        return "", fmt.Errorf("could not determine home directory: %w", err)
    }
    return filepath.Join(home, ".ops"), nil
}

PID files live under pids/; applied spec fingerprints under applied/ (shown next). Every CLI run re-derives "what is running?" from PID file + kernel, not from memory.

Under the ops home:

Path                     Role
pids/<name>.pid          Integer PID written at successful start; removed on clean stop / stale cleanup
applied/<name>.digest    SHA-256 fingerprint of the applied process spec (see reload)

Invariant: a PID file is meaningful only if paired with a successful liveness check. Stale PIDs must converge to "not running" on status/stop paths, or operators stop trusting the tool.


Starting a child: isolation of concerns

At a high level, start:

  • validates config and name,
  • refuses double-start when PID file + liveness agree the child is up,
  • merges environment,
  • redirects child stdout/stderr to the configured log file (append semantics on disk),
  • records the child PID,
  • writes the applied digest so reload can tell whether the running world matches the declared world.

The parent CLI exits immediately after the child is spawned and recorded. There is no follow-up goroutine in the parent holding pipes open for streaming — logs are file-backed, which matters for the follow implementation below.


Reload: digest over the fields that actually matter

reload compares the YAML spec to the last applied digest. The fingerprint is deterministic: command, args, sorted env lines, log_file, and stop_timeout, then SHA-256. Sorting keys avoids spurious reloads when map iteration order changes.

// internal/process/applied.go
func ProcessDigest(p config.Process) string {
    var b strings.Builder
    b.WriteString(p.Command)
    b.WriteByte('\n')
    for _, a := range p.Args {
        b.WriteString(a)
        b.WriteByte('\n')
    }
    keys := make([]string, 0, len(p.Env))
    for k := range p.Env {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    for _, k := range keys {
        fmt.Fprintf(&b, "%s=%s\n", k, p.Env[k])
    }
    b.WriteString(p.LogFile)
    fmt.Fprintf(&b, "\nstop_timeout=%d\n", p.StopTimeout)
    sum := sha256.Sum256([]byte(b.String()))
    return hex.EncodeToString(sum[:])
}

The interesting engineering point is not the hash — it is choosing which fields belong in the fingerprint. Omit something material and reload lies; include volatile fields and reload thrashes.
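
A related subtlety is how fields are delimited before hashing. With plain newline separators, args ["a\nb"] and args ["a", "b"] serialize to the same bytes. The sketch below is one possible hardening, not what the project does: it length-prefixes each field so the encoding is unambiguous. The spec type is a hypothetical stand-in for config.Process, and digest and writeField are illustrative names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// spec is a hypothetical stand-in for config.Process.
type spec struct {
	Command     string
	Args        []string
	Env         map[string]string
	LogFile     string
	StopTimeout int
}

// writeField emits "tag:length:value" so a value containing newlines or
// separators cannot be confused with a field boundary.
func writeField(b *strings.Builder, tag, v string) {
	fmt.Fprintf(b, "%s:%d:%s\n", tag, len(v), v)
}

func digest(p spec) string {
	var b strings.Builder
	writeField(&b, "cmd", p.Command)
	for _, a := range p.Args {
		writeField(&b, "arg", a)
	}
	keys := make([]string, 0, len(p.Env))
	for k := range p.Env {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order must not leak into the hash
	for _, k := range keys {
		writeField(&b, "envk", k)
		writeField(&b, "envv", p.Env[k])
	}
	writeField(&b, "log", p.LogFile)
	fmt.Fprintf(&b, "stop_timeout=%d\n", p.StopTimeout)
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	one := spec{Command: "srv", Args: []string{"a\nb"}}
	two := spec{Command: "srv", Args: []string{"a", "b"}}
	fmt.Println("distinct digests:", digest(one) != digest(two))
}
```

For a local dev tool the collision is unlikely to matter in practice; the point is that the serialization format is part of the fingerprint contract, alongside the field selection.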

The reconcile loop unions names in YAML with names that still have PID files, then classifies each:

  • removed from YAML but still running → stop
  • unchanged digest → noop
  • changed digest while running → stop + restart
  • changed digest while stopped → update digest only

// internal/process/manager.go (excerpt)
func (m *Manager) Reload(dryRun bool) ([]ReloadStep, error) {
    // ...
    for _, name := range names {
        _, inYAML := m.Config.Processes[name]
        st, err := m.statusOne(name)
        // ...
        if !inYAML {
            if st.State == StateRunning || st.State == StateUnknown {
                // stopped_removed → Stop(...)
                continue
            }
            // noop: not in yaml, not running
            continue
        }
        proc := m.Config.Processes[name]
        want := ProcessDigest(proc)
        old, err := ReadAppliedDigest(name)
        // ...
        if old == want {
            // noop: spec unchanged
            continue
        }
        if st.State == StateRunning {
            // restarted → Stop then Start
            continue
        }
        // updated_stopped → WriteAppliedDigest only
    }
    return steps, nil
}

That structure is what keeps a stateless CLI honest: you can always print --dry-run steps, then execute the same transitions for real.
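
The classification can be isolated as a pure function, which is what makes dry-run and real execution share one decision path. This is a sketch only: classify and its signature are hypothetical, the step labels are taken from the comments in the excerpt, and the real code also treats an unknown state like running when a process has been removed from YAML.

```go
package main

import "fmt"

// classify (hypothetical) maps observed state to a reload step without
// touching the OS, so the same result can be printed or executed.
func classify(inYAML, running bool, oldDigest, wantDigest string) string {
	switch {
	case !inYAML && running:
		return "stopped_removed" // removed from YAML but still running
	case !inYAML:
		return "noop" // not declared, not running
	case oldDigest == wantDigest:
		return "noop" // spec unchanged
	case running:
		return "restarted" // changed digest while running
	default:
		return "updated_stopped" // changed digest while stopped
	}
}

func main() {
	fmt.Println(classify(false, true, "abc", ""))
	fmt.Println(classify(true, true, "abc", "abc"))
	fmt.Println(classify(true, true, "abc", "def"))
	fmt.Println(classify(true, false, "abc", "def"))
}
```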


Logs: file-backed, follow without holding the child

ops logs opens the log path independently of whether the process is still running. That is a deliberate decoupling:

  • crash investigation still works after exit,
  • tailing does not require attaching to the original os/exec session.

--follow seeks to EOF and polls the file on a ticker, with a signal.NotifyContext cancellation path for SIGINT/SIGTERM when the CLI itself is in follow mode.

// internal/process/logs.go (excerpt)
func Follow(ctx context.Context, w io.Writer, path string, poll time.Duration) error {
    if poll <= 0 {
        poll = 300 * time.Millisecond
    }
    f, err := os.Open(path)
    if err != nil {
        return fmt.Errorf("%w: open log: %w", ErrOperation, err)
    }
    defer f.Close()
    // seek to end so we only see new content
    if _, err := f.Seek(0, io.SeekEnd); err != nil {
        return fmt.Errorf("%w: seek log: %w", ErrOperation, err)
    }
    ticker := time.NewTicker(poll)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return nil
        case <-ticker.C:
            // pending and readChunk are set up in code elided from this excerpt.
            if err := followDrainGrowth(ctx, w, f, &pending, readChunk); err != nil {
                return err
            }
        }
    }
}

Polling costs a wake every few hundred milliseconds; the win is portable behavior without tying log tailing to the original exec session or platform-specific notify APIs (inotify, kqueue). For a dev tool, that tradeoff is correct.


isAlive: same symbol, mutually exclusive files

On Unix, "signal 0" via os.Process.Signal is the traditional existence probe: it does not deliver a signal, but errors if the process is gone or permissions deny the check.

On Windows, that story breaks down for arbitrary child PIDs. The implementation opens the process with limited rights, checks GetExitCodeProcess for the well-known STILL_ACTIVE (259) sentinel, and uses a zero-timeout WaitForSingleObject as a second check.

Unix:

// internal/process/isalive_unix.go
//go:build !windows

func isAlive(pid int) bool {
    proc, err := os.FindProcess(pid)
    if err != nil {
        return false
    }
    err = proc.Signal(syscall.Signal(0))
    return err == nil
}

Windows:

// internal/process/isalive_windows.go
//go:build windows

const stillActiveExitCode = 259

func isAlive(pid int) bool {
    if pid <= 0 {
        return false
    }
    access := uint32(windows.SYNCHRONIZE | windows.PROCESS_QUERY_LIMITED_INFORMATION)
    h, err := windows.OpenProcess(access, false, uint32(pid))
    if err != nil {
        return false
    }
    defer windows.CloseHandle(h)
    var exitCode uint32
    if err := windows.GetExitCodeProcess(h, &exitCode); err == nil && exitCode == stillActiveExitCode {
        return true
    }
    ev, werr := windows.WaitForSingleObject(h, 0)
    if werr != nil {
        return false
    }
    switch ev {
    case windows.WAIT_OBJECT_0:
        return false
    case 258: // WAIT_TIMEOUT
        return true
    default:
        return false
    }
}

Build tags (//go:build windows vs //go:build !windows) ensure exactly one implementation is compiled per target GOOS: no #ifdef, no runtime dispatch table for this boundary. Same symbol, mutually exclusive compilation units.


Stopping: the Unix vs Windows semantic gap

Unix supervision has a comfortable story: SIGTERM for cooperative shutdown, SIGKILL as the irreversible backstop. You can encode product policy in that split — refusing SIGKILL unless the operator passes --force.

Windows is not a POSIX signal machine. A naive taskkill without /F often leaves a Go HTTP server happily running, which produces the classic failure mode: poll until timeout, then error — even though the operator asked to stop.

The Stop loop resolves timeout, calls gracefulStop, polls isAlive until deadline, then either errors (Unix, no --force) or proceeds to forceStop:

// internal/process/manager.go (excerpt)
deadline := time.Now().Add(stopTimeout)
for time.Now().Before(deadline) {
    if !isAlive(pid) {
        if err := RemovePID(name); err != nil {
            return fmt.Errorf("%w: remove pid for %q: %w", ErrOperation, name, err)
        }
        return nil
    }
    time.Sleep(500 * time.Millisecond)
}
if !isAlive(pid) {
    if err := RemovePID(name); err != nil {
        return fmt.Errorf("%w: remove pid for %q: %w", ErrOperation, name, err)
    }
    return nil
}
if !force && runtime.GOOS != "windows" {
    return fmt.Errorf("%w: process %q did not stop within %s", ErrOperation, name, stopTimeout)
}
if err := forceStop(pid); err != nil {
    // ...
}

The runtime.GOOS != "windows" branch is the policy seam: Unix keeps "no SIGKILL without --force" meaningful; Windows still gets a hard terminate path after the grace window because there is no separate SIGKILL tier in the kernel API you are targeting.
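
Written as a pure decision, the seam looks like the sketch below. afterGrace and its string results are hypothetical names; the conditions are the ones in the excerpt, with the OS passed in as a parameter so each branch can be exercised on any platform.

```go
package main

import "fmt"

// afterGrace (hypothetical) decides what Stop does once the grace window
// has elapsed. goos takes the place of runtime.GOOS.
func afterGrace(alive, force bool, goos string) string {
	switch {
	case !alive:
		return "cleanup" // process exited: remove the PID file
	case !force && goos != "windows":
		return "error" // Unix without --force: report the timeout
	default:
		return "force" // --force, or Windows: hard terminate
	}
}

func main() {
	fmt.Println(afterGrace(false, false, "linux"))
	fmt.Println(afterGrace(true, false, "linux"))
	fmt.Println(afterGrace(true, true, "linux"))
	fmt.Println(afterGrace(true, false, "windows"))
}
```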

Unix — SIGTERM / SIGKILL:

// internal/process/stop_signal_unix.go
//go:build !windows

func gracefulStop(pid int) error {
    proc, err := os.FindProcess(pid)
    if err != nil {
        return fmt.Errorf("find process %d: %w", pid, err)
    }
    return proc.Signal(syscall.SIGTERM)
}

func forceStop(pid int) error {
    proc, err := os.FindProcess(pid)
    if err != nil {
        return fmt.Errorf("find process %d: %w", pid, err)
    }
    return proc.Signal(syscall.SIGKILL)
}

Windows — tree kill + TerminateProcess:

// internal/process/stop_signal_windows.go
//go:build windows

func gracefulStop(pid int) error {
    _ = exec.Command("taskkill", "/F", "/PID", strconv.Itoa(pid), "/T").Run()
    return nil
}

func forceStop(pid int) error {
    if !isAlive(pid) {
        return nil
    }
    h, err := windows.OpenProcess(windows.PROCESS_TERMINATE, false, uint32(pid))
    // ...
    if err := windows.TerminateProcess(h, 1); err != nil {
        // ...
    }
    return nil
}

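gracefulStop intentionally ignores taskkill errors: the authority for "is it dead yet?" is isAlive polling in Stop, not the external tool's exit code. Because it passes /F, the Windows "graceful" step is already a forced tree kill; the name describes its slot in the Stop sequence rather than its gentleness, and forceStop is the backstop for anything taskkill missed.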


Failure modes worth keeping in mind

  • Security: YAML selects arbitrary commands; treat it like code. Same class of risk as Make or CI YAML.
  • Permissions: terminating another user's process or protected processes will fail regardless of tool quality.
  • PID reuse race: narrow but real. Short-lived processes and aggressive PID cycling on some systems mean "PID file says X" is only a hint. A liveness check confirms that some process currently holds that PID, not that it is the process ops started.
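
One way to narrow the PID reuse window, sketched here as an assumption rather than something ops does, is to store a second identity token next to the PID and compare it on later invocations. On Linux, field 22 of /proc/<pid>/stat (starttime, in clock ticks since boot) can serve as that token. The startTicks helper below is hypothetical and Linux-specific; it only parses the stat line.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// startTicks (hypothetical) extracts field 22 (starttime) from the contents
// of /proc/<pid>/stat on Linux.
func startTicks(stat string) (string, error) {
	// Field 2 (comm) is parenthesised and may itself contain spaces or ')',
	// so split only what follows the last ')'.
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return "", errors.New("malformed stat line: no ')'")
	}
	rest := strings.Fields(stat[i+1:])
	const idx = 22 - 3 // rest[0] is field 3 (state)
	if len(rest) <= idx {
		return "", errors.New("malformed stat line: too few fields")
	}
	return rest[idx], nil
}

func main() {
	// A made-up stat line; a real caller would read /proc/<pid>/stat.
	sample := "4242 (my server) S 1 4242 4242 0 -1 4194304 100 0 0 0 5 3 0 0 20 0 1 0 987654 1000 200"
	fmt.Println(startTicks(sample))
}
```

If the token recorded at start differs from the one read later, the PID has been recycled and the PID file can be treated as stale. Other platforms would need their own source for a start-time token.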

Why Go for this

Go gives a single static binary distribution story, solid os/exec ergonomics, reasonable cross-compilation, and golang.org/x/sys/windows for the handful of syscalls you do not want to hand-roll. Cobra handles command routing and flag parsing. The intellectual weight is entirely in process state, not CLI glue.


Closing

ops is intentionally small: a filesystem-backed contract between CLI invocations, with explicit OS-specific adapters for the two places kernels disagree — liveness and termination. If that sounds interesting, the code is at https://github.com/carissaayo/go-ops-cli. The README is the operator manual; this post is the rationale.

Try it in one minute:

git clone https://github.com/carissaayo/go-ops-cli
cd go-ops-cli
go build -o ops ./internal/cmd/ops
# drop an ops.yaml beside the binary, then:
./ops start <name>
./ops status

If you have built something similar — or hit the Windows termination edge cases from a different angle — I'd like to hear how you handled it.


ops runs commands from your configuration. Only point it at projects and binaries you trust.
