
UBA-code


Building Taskmaster: A Go-Powered Process Supervisor from Scratch

How two 42 School students reimagined process management with Go's concurrency model


If you've ever managed a production server, you know the pain: a critical process crashes at 3 AM, no one notices until morning, and your users are left with a broken experience. Tools like Supervisor were built to solve this — but they come with Python overhead, complex configurations, and aging architectures.

We decided to build our own. Taskmaster is a lightweight, production-ready process supervisor written in Go — and building it taught us more about operating systems, concurrency, and daemon design than any textbook could.


What Is Taskmaster?

Taskmaster is a process control daemon. It manages the full lifecycle of your processes — starting, stopping, restarting, and monitoring them — all through a simple interactive shell. Think of it as a modern, Go-native alternative to Supervisor or systemd service management, without the complexity.

$ ./bin/taskmaster config.yaml
Taskmaster> status
Task          Status    PID    Uptime       Restarts  Command
nginx         RUNNING   1234   2m15s        0         /usr/local/bin/nginx
worker_1      RUNNING   1235   2m15s        1         python3 worker.py
worker_2      STOPPED   -      -            0         python3 worker.py

Taskmaster> start worker_2
Process 'worker_2' started with PID 1240

Taskmaster> logs worker_1 5
[2026-02-02 10:15:25] Processing task #1
[2026-02-02 10:15:26] Task completed
[2026-02-02 10:15:27] Waiting for tasks...

A single YAML file to configure everything. A single binary to run it.


Why Go?

The choice of Go wasn't arbitrary. Process supervision is inherently concurrent — you need to monitor dozens of processes simultaneously without blocking. Traditionally, this is solved with threads and shared memory, which leads to complex locking, race conditions, and hard-to-debug crashes.

Go offers a better model: goroutines and channels.

  • Goroutines are lightweight: each starts with a stack of a few KB, versus the megabytes typically reserved for an OS thread, so thousands can be spawned without issue.
  • Channels provide safe, structured communication between concurrent components — no mutexes, no shared state hell.

This aligned perfectly with our architecture: one goroutine per managed process, communicating via channels. Elegant, efficient, and easy to reason about.


Architecture Deep Dive

The High-Level Picture

┌─────────────────────────────────────────────────────────────┐
│                        Main Process                          │
│  ┌──────────────┐  ┌─────────────┐  ┌──────────────────┐   │
│  │   CLI Loop   │  │   Config    │  │  Signal Handler  │   │
│  │  (readline)  │  │   Parser    │  │    (SIGHUP)      │   │
│  └──────────────┘  └─────────────┘  └──────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                            │
                ┌───────────┴───────────┐
                │    Task Manager       │
                └───────────┬───────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
   ┌────▼────┐         ┌────▼────┐        ┌────▼────┐
   │ Process │         │ Process │        │ Process │
   │ Monitor │   ...   │ Monitor │        │ Monitor │
   │(gorout.)│         │(gorout.)│        │(gorout.)│
   └────┬────┘         └────┬────┘        └────┬────┘
        │                   │                   │
   ┌────▼────┐         ┌────▼────┐        ┌────▼────┐
   │  Child  │         │  Child  │        │  Child  │
   │ Process │         │ Process │        │ Process │
   └─────────┘         └─────────┘        └─────────┘

The main process runs the interactive CLI (using the readline library for history and completion) and listens for system signals. It owns a Task Manager which holds the state of all configured processes. Each managed process gets its own goroutine — a StartTaskManager — that is entirely responsible for that process's lifecycle.

Goroutine per Process

Each StartTaskManager goroutine does three things independently:

  1. Listens for control commands over a buffered channel (CmdChan)
  2. Monitors the child process for exits and unexpected crashes
  3. Handles restarts based on the configured policy

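Because each process is managed by its own goroutine, a slow or blocked monitor (for example, one waiting out a long graceful shutdown) does not stall the others. The system stays responsive even when managing hundreds of processes. One caveat: goroutines are not a fault boundary, and an unrecovered panic in any of them still brings down the whole daemon.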
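
As a rough illustration of the three responsibilities listed above, a per-process monitor loop might look like the sketch below. The names monitor, Command, CmdStart, and simulate are hypothetical and not taken from Taskmaster's source. The child is simulated by a function that returns an exit code, so the example runs without spawning real processes, and stop handling is left out to keep it short.

```go
package main

import "fmt"

// Command is a hypothetical control message; Taskmaster's real type may differ.
type Command string

const CmdStart Command = "start"

// monitor owns one child for its whole lifetime. run stands in for "start the
// process and block until it exits" and returns the exit code.
func monitor(cmds <-chan Command, run func() int, maxRestarts int, events chan<- string) {
	defer close(events)
	var done chan int // nil while no child is alive; a nil channel never fires in select
	restarts := 0

	launch := func() {
		d := make(chan int, 1)
		done = d
		go func() { d <- run() }()
		events <- "started"
	}

	for {
		select {
		case c, ok := <-cmds: // 1. control commands
			if !ok {
				return // command channel closed: shut the monitor down
			}
			if c == CmdStart && done == nil {
				restarts = 0
				launch()
			}
		case <-done: // 2. the child exited
			done = nil
			if restarts < maxRestarts { // 3. restart policy
				restarts++
				events <- "restarting"
				launch()
			} else {
				events <- "fatal"
			}
		}
	}
}

// simulate runs a child that always crashes and records what the monitor does.
func simulate(maxRestarts int) []string {
	cmds := make(chan Command, 1)
	events := make(chan string, 32)
	go monitor(cmds, func() int { return 1 }, maxRestarts, events)
	cmds <- CmdStart

	var got []string
	for e := range events {
		got = append(got, e)
		if e == "fatal" {
			close(cmds) // ask the monitor goroutine to exit
		}
	}
	return got
}

func main() {
	fmt.Println(simulate(2)) // [started restarting started restarting started fatal]
}
```

Setting done to nil while no child is alive is what lets a single select serve both the idle and the running case.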

Channel-Based Communication

The CLI never directly kills or starts a process. Instead, it sends a message over a channel:

CLI ──── "stop nginx" ───► nginx's CmdChan ──► goroutine acts on it

This decoupling means the CLI remains non-blocking regardless of how long a process takes to shut down. The goroutine handles timeouts, fallback signals, and cleanup entirely on its own.

Three channel types power the system:

  • Command Channels — carry control messages (start, stop, restart)
  • Done Channels — signal that a child process has exited
  • Timeout Channels — implement deadlines for startup grace periods and graceful shutdowns
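
To make the channel roles above concrete, here is a small sketch of a non-blocking send on a buffered command channel and a wait that races a done channel against a timeout. The helpers trySend and exitedWithin are hypothetical names, and using select with a default case for the send is an assumption about how a CLI could stay non-blocking, not a description of Taskmaster's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// trySend queues a command without blocking the caller. It reports false when
// the buffer is full, so a CLI loop could report "busy" instead of hanging.
func trySend(cmds chan<- string, cmd string) bool {
	select {
	case cmds <- cmd:
		return true
	default:
		return false
	}
}

// exitedWithin waits for the done channel to be closed, giving up once the
// timeout channel returned by time.After fires.
func exitedWithin(done <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	cmds := make(chan string, 1)                // buffered command channel
	fmt.Println(trySend(cmds, "stop nginx"))    // true: the buffer has room
	fmt.Println(trySend(cmds, "restart nginx")) // false: buffer full, no receiver yet

	done := make(chan struct{})
	close(done) // simulate a child that has already exited
	fmt.Println(exitedWithin(done, 50*time.Millisecond))                // true
	fmt.Println(exitedWithin(make(chan struct{}), 50*time.Millisecond)) // false
}
```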

Process State Machine

Every process moves through a well-defined set of states:

[STOPPED] → start → [STARTED] → (after successfulStartTimeout) → [RUNNING]
                        │
                        └─ unexpected exit ──► [FATAL]
                                                   │
                                         restart policy applies
  • STOPPED: Not running (intentionally or not yet started)
  • STARTED: Running but in the startup grace period
  • RUNNING: Confirmed healthy and past the startup timeout
  • FATAL: Crashed and restart attempts exhausted

The successfulStartTimeout parameter is a key reliability feature. A process that starts and immediately crashes is different from one that runs for 30 seconds before failing — Taskmaster treats them differently.
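
One way to encode these states is a lookup table of allowed transitions. The sketch below is an interpretation rather than Taskmaster's implementation: the event names are hypothetical labels for the arrows in the diagram, and it assumes that an unexpected exit from either STARTED or RUNNING leads to FATAL and that a FATAL process can be started again by hand. Restart-policy handling is deliberately left out.

```go
package main

import "fmt"

type State string

const (
	Stopped State = "STOPPED"
	Started State = "STARTED"
	Running State = "RUNNING"
	Fatal   State = "FATAL"
)

// Event values are hypothetical names for the arrows in the diagram.
type Event string

const (
	EvStart               Event = "start"
	EvStartTimeoutElapsed Event = "start-timeout-elapsed"
	EvStop                Event = "stop"
	EvUnexpectedExit      Event = "unexpected-exit"
)

// transitions lists every allowed move; anything absent is rejected.
var transitions = map[State]map[Event]State{
	Stopped: {EvStart: Started},
	Started: {EvStartTimeoutElapsed: Running, EvStop: Stopped, EvUnexpectedExit: Fatal},
	Running: {EvStop: Stopped, EvUnexpectedExit: Fatal},
	Fatal:   {EvStart: Started}, // assumption: a manual start is allowed from FATAL
}

// next returns the state after ev, or the unchanged state and false when the
// transition is not allowed.
func next(s State, ev Event) (State, bool) {
	to, ok := transitions[s][ev]
	if !ok {
		return s, false
	}
	return to, true
}

func main() {
	s := Stopped
	for _, ev := range []Event{EvStart, EvStartTimeoutElapsed, EvUnexpectedExit} {
		s, _ = next(s, ev)
		fmt.Println(ev, "->", s)
	}
}
```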


Configuration: Simple and Expressive

Taskmaster tasks are defined in a single YAML file. Here's a real-world example managing a web server and a pool of background workers:

tasks:
  web_server:
    command: "/usr/local/bin/nginx -g 'daemon off;'"
    instances: 1
    autoLaunch: true
    restart: on-failure
    expectedExitCodes: [0]
    successfulStartTimeout: 3
    restartsAttempts: 3
    stopingSignal: SIGTERM
    gracefulStopTimeout: 10
    stdout: /var/log/taskmaster/nginx.out.log
    stderr: /var/log/taskmaster/nginx.err.log
    environment:
      PORT: "8080"
      ENV: "production"
    workingDirectory: /var/www

  worker:
    command: "python3 worker.py"
    instances: 5
    autoLaunch: true
    restart: always
    restartsAttempts: 5
    gracefulStopTimeout: 15
    stdout: /var/log/taskmaster/worker.out.log
    stderr: /var/log/taskmaster/worker.err.log

A few things worth highlighting:

instances: 5 — Taskmaster automatically spawns 5 copies of worker.py and names them worker_1 through worker_5. You manage them individually or all at once with restart all.
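
The naming scheme can be sketched in a few lines. The function instanceNames is a hypothetical helper; the assumption that a single-instance task keeps its bare name (and that a missing instances value is treated as 1) is inferred from the status output shown earlier, not confirmed against the source.

```go
package main

import "fmt"

// instanceNames expands a task into per-instance names: task_1 .. task_N.
// Assumption: with one instance (or none specified) the bare name is used.
func instanceNames(task string, instances int) []string {
	if instances <= 1 {
		return []string{task}
	}
	names := make([]string, 0, instances)
	for i := 1; i <= instances; i++ {
		names = append(names, fmt.Sprintf("%s_%d", task, i))
	}
	return names
}

func main() {
	fmt.Println(instanceNames("worker", 5))     // [worker_1 worker_2 worker_3 worker_4 worker_5]
	fmt.Println(instanceNames("web_server", 1)) // [web_server]
}
```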

restart: on-failure vs restart: always — The distinction matters in production. on-failure only restarts if the exit code isn't in expectedExitCodes. An intentional exit(0) won't trigger a restart. always is for long-running daemons that should never stop.
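
The decision described above fits in one small function. This is a sketch under assumptions: shouldRestart is a hypothetical name, and capping both policies at restartsAttempts is a guess based on the example config, where the always worker also sets that key.

```go
package main

import "fmt"

// shouldRestart decides whether an exited process is relaunched.
// Assumption: restartsAttempts caps restarts for every policy.
func shouldRestart(policy string, exitCode int, expected []int, restartsSoFar, maxRestarts int) bool {
	if restartsSoFar >= maxRestarts {
		return false
	}
	switch policy {
	case "always":
		return true
	case "on-failure":
		for _, code := range expected {
			if code == exitCode {
				return false // an expected exit code is not a failure
			}
		}
		return true
	default:
		return false // unknown policy: do not restart
	}
}

func main() {
	fmt.Println(shouldRestart("on-failure", 0, []int{0}, 0, 3)) // false: clean exit
	fmt.Println(shouldRestart("on-failure", 1, []int{0}, 0, 3)) // true: unexpected code
	fmt.Println(shouldRestart("always", 0, nil, 0, 5))          // true: always restarts
	fmt.Println(shouldRestart("always", 0, nil, 5, 5))          // false: attempts used up
}
```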

gracefulStopTimeout: 15 - When you issue a stop, Taskmaster sends the configured signal and waits up to 15 seconds for a clean exit. If the process still hasn't stopped, it gets SIGKILL, so a hung process can't ignore a stop request indefinitely.
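
The signal-then-kill pattern can be sketched with the standard library alone. The function names stopGracefully and demoStop are hypothetical, and the example assumes a Unix-like system with a sleep binary on the PATH; it illustrates the technique and is not Taskmaster's actual stop routine.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// stopGracefully sends sig to the child, waits up to timeout for it to exit,
// and falls back to SIGKILL. It reports whether the fallback was needed.
// done must deliver the result of Wait from the goroutine that owns the child.
func stopGracefully(proc *os.Process, done <-chan error, sig syscall.Signal, timeout time.Duration) (bool, error) {
	if err := proc.Signal(sig); err != nil {
		return false, err
	}
	select {
	case <-done:
		return false, nil // exited within the grace period
	case <-time.After(timeout):
		if err := proc.Kill(); err != nil {
			return false, err
		}
		<-done // wait for Wait to return so the child is reaped
		return true, nil
	}
}

// demoStop starts a long-running child and stops it with SIGTERM.
func demoStop() (bool, error) {
	cmd := exec.Command("sleep", "30")
	if err := cmd.Start(); err != nil {
		return false, err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	return stopGracefully(cmd.Process, done, syscall.SIGTERM, 2*time.Second)
}

func main() {
	killed, err := demoStop()
	fmt.Println("needed SIGKILL:", killed, "error:", err)
}
```

Receiving from done after Kill matters: it is the call to Wait, not the signal itself, that reaps the child and prevents a zombie entry.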


Hot Reload: Zero-Downtime Config Changes

One of the features we're most proud of: you can change your configuration file and apply it without restarting the daemon or killing your processes.

Taskmaster> reload
Configuration reloaded.

Under the hood, this sends a SIGHUP signal (or you can do it from outside with kill -HUP <pid>). The config parser re-reads the YAML, diffs it against the current state, and applies changes incrementally. New tasks get started; removed tasks get stopped; modified tasks get restarted. Running tasks that haven't changed? They keep running, untouched.
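
The diffing step might be sketched as follows. TaskConfig here is a deliberately reduced, hypothetical struct with only three fields, and diffConfigs is an illustrative name; the real configuration has more keys, and fields such as expectedExitCodes would need a deeper comparison than the == used here.

```go
package main

import (
	"fmt"
	"sort"
)

// TaskConfig is a reduced, hypothetical view of one task's settings.
type TaskConfig struct {
	Command   string
	Instances int
	Restart   string
}

// diffConfigs compares the old and new task maps and returns the names that
// must be started (added), stopped (removed), or restarted (changed).
func diffConfigs(oldCfg, newCfg map[string]TaskConfig) (added, removed, changed []string) {
	for name, n := range newCfg {
		o, ok := oldCfg[name]
		switch {
		case !ok:
			added = append(added, name)
		case o != n:
			changed = append(changed, name)
		}
	}
	for name := range oldCfg {
		if _, ok := newCfg[name]; !ok {
			removed = append(removed, name)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(added)
	sort.Strings(removed)
	sort.Strings(changed)
	return added, removed, changed
}

func main() {
	oldCfg := map[string]TaskConfig{
		"web_server": {Command: "nginx", Instances: 1, Restart: "on-failure"},
		"worker":     {Command: "python3 worker.py", Instances: 5, Restart: "always"},
		"legacy":     {Command: "old.sh", Instances: 1, Restart: "always"},
	}
	newCfg := map[string]TaskConfig{
		"web_server": {Command: "nginx", Instances: 1, Restart: "on-failure"}, // untouched
		"worker":     {Command: "python3 worker.py", Instances: 8, Restart: "always"},
		"mailer":     {Command: "mailer", Instances: 1, Restart: "always"},
	}
	added, removed, changed := diffConfigs(oldCfg, newCfg)
	fmt.Println(added, removed, changed) // [mailer] [legacy] [worker]
}
```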


Signal Propagation and Graceful Shutdown

Getting shutdown right is tricky. We had to handle:

  1. The daemon receiving SIGTERM (e.g., from the OS on shutdown)
  2. Propagating the right signal to each child process
  3. Waiting for all children to exit before the daemon itself exits

We used Go's sync.WaitGroup for this. Each monitor goroutine is added to a global WaitGroup before it starts and marks itself done when it exits. The main process waits on this group before terminating, so no child processes are left orphaned.

// Simplified version of the shutdown flow
tasks.WaitGroup.Wait() // Block until all process monitors are done
os.Exit(0)
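
For a slightly fuller picture of the same flow, the sketch below wires a cancellable context to a WaitGroup. The names superviseUntil and demoShutdown are hypothetical, and using a context to broadcast shutdown is an assumption about one reasonable design, not a description of Taskmaster's code. A short timeout stands in for SIGTERM so the example terminates on its own.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// superviseUntil starts n monitor goroutines, blocks until ctx is cancelled
// and every monitor has returned, and reports how many shut down.
func superviseUntil(ctx context.Context, n int) int {
	var wg sync.WaitGroup
	var stopped int64
	for i := 0; i < n; i++ {
		wg.Add(1) // register before the goroutine starts, never inside it
		go func() {
			defer wg.Done()
			<-ctx.Done() // a real monitor would stop its child here
			atomic.AddInt64(&stopped, 1)
		}()
	}
	wg.Wait() // block until all process monitors are done
	return int(atomic.LoadInt64(&stopped))
}

// demoShutdown cancels after a short delay. In a daemon the context could
// instead come from signal.NotifyContext(context.Background(), syscall.SIGTERM).
func demoShutdown(n int) int {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return superviseUntil(ctx, n)
}

func main() {
	fmt.Println("monitors stopped:", demoShutdown(3)) // monitors stopped: 3
}
```

Calling Add in the parent rather than inside the goroutine avoids a race in which Wait could return before a late-starting monitor has registered.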

Key Takeaways

Building Taskmaster from scratch taught us three things:

1. Go's concurrency model is genuinely different. Not just syntactically different from threads — conceptually different. "Don't communicate by sharing memory; share memory by communicating" isn't just a motto. It's a design philosophy that produces cleaner, more correct code.

2. UNIX process management is a deep topic. Process groups, session leaders, signal inheritance, file descriptor leaks, zombie processes — every one of these is a footgun waiting to go off. We hit most of them.

3. Small surface area wins. Taskmaster has one config file, one binary, and one shell. No agents, no web UIs, no databases. This simplicity makes it auditable, embeddable, and easy to debug.


Try It Yourself

Taskmaster is open source and available on GitHub:

git clone https://github.com/UBA-code/taskmaster.git
cd taskmaster
make build
./bin/taskmaster  # Generates an example config and starts the shell

If you're building something in Go that needs process management, or if you're curious about how supervisors work under the hood, we hope Taskmaster serves as a useful reference.


Built with ❤️ by Yassine Bel Hachmi and Hassan Idhmmououhya as part of the 42 School curriculum.

⭐ Star the repo if you find it useful!
