
Vitalii

Posted on • Originally published at vitalii-nosov.hashnode.dev

From Netdata Inspiration to SaaS MVP: Server Monitoring with Bun + Claude Code Opus 4.6

If you've ever set up Netdata, you know that feeling — hundreds of real-time charts, per-second granularity, metrics you didn't even know your kernel exposed. It's a wonderful piece of software, genuinely one of the best open-source monitoring tools out there.

But here's the thing: I run a small fleet of CDN servers. I don't need 2,000 charts. I need to glance at a single dashboard and know: are my servers healthy or not?

So I built my own lightweight version. And my co-pilot for this entire build was Claude Code Opus 4.6.

This is the story of how it went — from reading /proc files with zero npm dependencies to a working SaaS-ready monitoring dashboard.


What I Built

The system has three components:

1. cdn-agent — A tiny Bun process that runs on each server. It reads Linux /proc files every 10 seconds and POSTs the metrics to my backend. Zero npm dependencies.

2. Backend API — A Bun server that ingests metrics into PostgreSQL (Supabase) and serves aggregated time-series data to the dashboard.

3. Dashboard — A React SPA with live gauges, alert cards, and 10 historical charts per server.

```
cdn-agent (10s) ──POST──> Backend API ──> Supabase (PostgreSQL)
                                               │
Dashboard (React) <──── GET aggregated data ───┘
```
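For scale, here's roughly what the receiving end looks like. This is a hypothetical sketch, not the actual backend code (which isn't shown in this post): the `x-agent-key` header, the in-memory key map, and the `stored` array stand in for real authentication and the Supabase insert.

```typescript
type MetricsPayload = Record<string, unknown>

// Stub auth + storage — a real backend would check agent keys against the DB
// and INSERT the payload into PostgreSQL instead of pushing to an array.
const agentKeys = new Map([['xxx', 'server-1']])
const stored: { serverId: string; data: MetricsPayload; ts: string }[] = []

export async function handleIngest(req: Request): Promise<Response> {
    if (req.method !== 'POST') {
        return new Response('method not allowed', { status: 405 })
    }
    const serverId = agentKeys.get(req.headers.get('x-agent-key') ?? '')
    if (!serverId) return new Response('unauthorized', { status: 401 })

    const data = (await req.json()) as MetricsPayload
    stored.push({ serverId, data, ts: new Date().toISOString() })
    return new Response('ok')
}
```

Dropped into `Bun.serve({ fetch: handleIngest })` and pointed at a real table, that's essentially the whole ingestion path.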

Image of ServersPage — the grid view with server cards showing gauges and alerts


The Agent: Zero Dependencies, Pure /proc

This is probably my favorite part. The monitoring agent has no runtime dependencies — just Bun reading the Linux virtual filesystem directly.

Here's the entire main loop:

```typescript
import { collectCpu } from './collectors/cpu'
import { collectMemory } from './collectors/memory'
import { collectDisks } from './collectors/disk'
import { collectNetwork } from './collectors/network'
import { collectProcesses } from './collectors/processes'
import { collectSystem } from './collectors/system'
import { sendMetrics } from './sender'

const INTERVAL_MS = 10_000

async function collect() {
    const [cpu, memory, drives, network, processes, system] =
        await Promise.all([
            collectCpu(),
            collectMemory(),
            collectDisks(),
            collectNetwork(),
            collectProcesses(),
            collectSystem(),
        ])

    return { ...cpu, ...memory, ...system, drives, network, ...processes }
}

async function loop() {
    // First collection is a warmup — CPU/disk/network
    // deltas need a previous snapshot to calculate rates
    await collect()
    console.log('[cdn-agent] Warmup done, starting main loop')

    while (true) {
        await Bun.sleep(INTERVAL_MS)
        const data = await collect()
        await sendMetrics(data)
    }
}

loop()
```

Six collectors run in parallel via Promise.all, each responsible for one slice of the system:

| Collector | Source | What it reports |
| --- | --- | --- |
| CPU | `/proc/stat` | Usage %, I/O wait % |
| Memory | `/proc/meminfo` | Used/total/cached RAM, swap |
| Disk | `df` + `/proc/diskstats` | Per-drive usage, read/write MB/s |
| Network | `/proc/net/dev` | Per-interface RX/TX MB/s, errors, drops |
| Processes | `/proc/[pid]/stat` | Top 5 by CPU, top 5 by memory |
| System | `/proc/loadavg`, `/proc/uptime` | Load avg, uptime, TCP connections |

Notice the warmup pattern — the first collection runs but its results are thrown away. Why? Because metrics like CPU usage and network throughput are calculated as deltas between two snapshots. The first run has no "previous" to compare against, so it would always report 0%. One dummy collection solves that.

Here's how the CPU collector works — 33 lines, no dependencies:

```typescript
let prevIdle = 0
let prevIowait = 0
let prevTotal = 0

export async function collectCpu() {
    const stat = await Bun.file('/proc/stat').text()
    const parts = stat.split('\n')[0]!.split(/\s+/).slice(1).map(Number)

    const idle = parts[3]! + parts[4]!   // idle + iowait
    const iowait = parts[4]!
    const total = parts.reduce((a, b) => a + b, 0)

    const diffIdle = idle - prevIdle
    const diffIowait = iowait - prevIowait
    const diffTotal = total - prevTotal

    prevIdle = idle
    prevIowait = iowait
    prevTotal = total

    if (diffTotal === 0) return { cpu_percent: 0, iowait_percent: 0 }

    return {
        cpu_percent:
            Math.round(((diffTotal - diffIdle) / diffTotal) * 100 * 100) / 100,
        iowait_percent:
            Math.round((diffIowait / diffTotal) * 100 * 100) / 100,
    }
}
```

`Bun.file('/proc/stat').text()` — that's all it takes to read kernel CPU counters. No `child_process`, no `exec`, no parsing library. Just read the file and do the math.
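The other collectors follow the same pattern. As an illustration, here's a sketch of the memory collector's parsing step (assumed — the post only shows the CPU collector). `/proc/meminfo` reports values in kB; the collector itself would just call `parseMeminfo(await Bun.file('/proc/meminfo').text())`.

```typescript
// Parse /proc/meminfo lines like "MemTotal:    16314268 kB" into MB figures.
// The returned field names are illustrative, not the agent's actual schema.
export function parseMeminfo(text: string) {
    const kb: Record<string, number> = {}
    for (const line of text.split('\n')) {
        const m = line.match(/^(\w+(?:\(\w+\))?):\s+(\d+)/)
        if (m) kb[m[1]!] = Number(m[2])
    }
    const swapUsedKb = (kb['SwapTotal'] ?? 0) - (kb['SwapFree'] ?? 0)
    return {
        mem_total_mb: Math.round((kb['MemTotal'] ?? 0) / 1024),
        mem_used_mb: Math.round(((kb['MemTotal'] ?? 0) - (kb['MemAvailable'] ?? 0)) / 1024),
        mem_cached_mb: Math.round((kb['Cached'] ?? 0) / 1024),
        swap_used_mb: Math.round(swapUsedKb / 1024),
    }
}
```

Using `MemAvailable` rather than `MemFree` is deliberate: the kernel already accounts for reclaimable caches, so "used" reflects real pressure.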


The Dashboard: 10 Charts, One Page

The server detail page packs a lot of information into a single view:

  • Circular gauges for CPU and RAM (green/yellow/red based on thresholds)
  • Live stats for network throughput, load average, connections
  • 10 historical charts — CPU, memory, network TX/RX, disk I/O, connections, load avg, utilization, errors
  • Time range selector — 15min, 1h, 6h, 24h, 7 days

Image of ServerDetailPage — the full chart view with gauges at the top and historical charts below

Smart Time-Range Aggregation

One of the trickier problems: how do you show 7 days of data collected every 10 seconds without drowning the browser in 60,000+ data points?

The backend handles this with in-memory bucketing:

```typescript
const rangeConfig = {
    '15m': { minutes: 15,    bucketSeconds: 10 },    // raw data
    '1h':  { minutes: 60,    bucketSeconds: 10 },    // raw data
    '6h':  { minutes: 360,   bucketSeconds: 60 },    // 1-min averages
    '24h': { minutes: 1440,  bucketSeconds: 300 },   // 5-min averages
    '7d':  { minutes: 10080, bucketSeconds: 1800 },  // 30-min averages
}
```

For short ranges (15m, 1h), the raw 10-second data goes straight to the chart. For longer ranges, the backend fetches all raw rows, groups them into time buckets, and averages the numeric fields:

```typescript
// `rows` are the raw DB rows for the requested range;
// `config` is the matching rangeConfig entry
const buckets = new Map<number, { ts: string; points: unknown[] }>()

// Bucket metrics by time intervals
for (const row of rows) {
    const t = new Date(row.ts).getTime()
    const bucketKey =
        Math.floor(t / (config.bucketSeconds * 1000))
        * (config.bucketSeconds * 1000)

    if (!buckets.has(bucketKey)) {
        buckets.set(bucketKey, {
            ts: new Date(bucketKey).toISOString(),
            points: [],
        })
    }
    buckets.get(bucketKey)!.points.push(row.data)
}

// Average each bucket
const aggregated = Array.from(buckets.values()).map((bucket) => ({
    ts: bucket.ts,
    data: averageMetrics(bucket.points),
}))
```

No pre-aggregation tables, no materialized views, no time-series database. Just PostgreSQL with a JSONB column and a few lines of bucketing logic. For a handful of servers, this works perfectly — and it's one less thing to maintain.
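The `averageMetrics` helper referenced above isn't shown in the post, but a minimal sketch could look like this: average every top-level numeric field, and carry the last value through for anything non-numeric (nested arrays like `drives` and `network` would need their own per-item averaging).

```typescript
// Average numeric fields across all points in a bucket; non-numeric fields
// (strings, nested arrays) fall back to the most recent value.
function averageMetrics(points: Record<string, unknown>[]) {
    const out: Record<string, unknown> = {}
    if (points.length === 0) return out
    for (const key of Object.keys(points[0]!)) {
        const nums = points
            .map((p) => p[key])
            .filter((v): v is number => typeof v === 'number')
        out[key] = nums.length
            ? Math.round((nums.reduce((a, b) => a + b, 0) / nums.length) * 100) / 100
            : points[points.length - 1]![key]
    }
    return out
}
```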

Dynamic Network Charts

Another nice pattern: the network charts build themselves based on whatever interfaces the server actually has. No hardcoded eth0 or ens3:

```typescript
const networkData = metrics.map(m => {
    const row: Record<string, unknown> = { ts: m.ts }
    for (const n of m.data.network) {
        row[`${n.iface}_tx`] = n.tx_mb_s
        row[`${n.iface}_rx`] = n.rx_mb_s
    }
    return row
})
```

If a server has eth0 and eth1, you get two lines on the chart. If another server has ens3, that's what shows up. The dashboard adapts to whatever the agent reports.

Alert System

The overview page shows all servers as cards with color-coded borders and alert badges:

Thresholds are explicit and layered:

| Metric | Warning | Critical |
| --- | --- | --- |
| CPU | > 80% | > 95% |
| RAM | > 85% | > 95% |
| Disk | > 90% | > 95% |
| Swap | > 50% | > 80% |
| I/O Wait | > 20% | > 40% |
| Offline | — | last seen > 30s ago |
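Evaluating layered thresholds like these takes only a few lines. Here's a sketch (the real implementation isn't shown in the post, and the metric field names like `mem_percent` are my assumptions): escalate to the worst level across all checks.

```typescript
type Level = 'ok' | 'warning' | 'critical'

// Hardcoded thresholds, mirroring the table above.
const thresholds: Record<string, { warning: number; critical: number }> = {
    cpu_percent: { warning: 80, critical: 95 },
    mem_percent: { warning: 85, critical: 95 },
    disk_percent: { warning: 90, critical: 95 },
    swap_percent: { warning: 50, critical: 80 },
    iowait_percent: { warning: 20, critical: 40 },
}

function evaluate(metrics: Record<string, number>): Level {
    let worst: Level = 'ok'
    for (const [key, t] of Object.entries(thresholds)) {
        const v = metrics[key]
        if (v === undefined) continue
        if (v > t.critical) return 'critical'   // can't get worse — bail early
        if (v > t.warning) worst = 'warning'
    }
    return worst
}
```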

The "online" check is probably the simplest pattern in the whole system, and one I'm quite happy with:

```typescript
online: server.last_seen_at
    ? Date.now() - new Date(server.last_seen_at).getTime() < 30_000
    : false
```

No heartbeat daemon, no WebSocket connection tracking. The agent sends metrics every 10 seconds — if we haven't heard from it in 30 seconds, it's offline. Computed on-the-fly, never stored.


The SaaS Angle

This started as a tool for my own infrastructure, but I realized it has legs as a product. If you're running 2-10 servers — maybe a small startup, a side project with a VPS, or a self-hosted setup — you probably don't want to set up Prometheus + Grafana or pay for Datadog.

What you want is:

  • A single Bun script you can scp to your server
  • A dashboard that shows red/yellow/green at a glance
  • Historical charts for when something goes wrong at 3am
  • 7-day retention so you can spot trends

That's what this is. The agent deploys in 3 commands:

```shell
scp -r cdn-agent root@server:/opt/cdn-agent
# On the server (printf, because plain echo won't expand \n portably):
printf 'AGENT_KEY=xxx\nAGENT_ENDPOINT=https://api.example.com/api/metrics-ingest\n' > .env
pm2 start bun --name cdn-agent -- run src/index.ts
```

Building with Claude Code Opus 4.6

I want to be transparent: this feature was built almost entirely in collaboration with Claude Code Opus 4.6. Not as a code autocomplete — as an actual architectural partner.

Here's what that looked like in practice:

  • Architecture decisions: I described what I wanted ("a lightweight Netdata for my CDN servers"), and we iterated on the three-component design together. The in-memory bucketing approach instead of a time-series DB was Claude's suggestion after I explained my scale (~5 servers, 7-day retention).

  • The /proc collectors: Claude knew the exact format of /proc/stat, /proc/meminfo, /proc/net/dev and how to parse them. The delta-based calculation pattern for CPU and network throughput came out correct on the first try.

  • The warmup pattern: When I noticed the first data point was always zero, Claude immediately identified the cause (no previous snapshot for delta calculation) and suggested the warmup loop — a clean solution I might not have thought of as quickly.

  • Speed: The entire feature — agent, backend endpoints, dashboard with 10 charts — came together in a focused session. That's not weeks of development compressed into hours. It's a different way of working, where you're constantly iterating on a working system instead of staring at a blank file.

It's not perfect. Some of the chart styling needed manual tweaking. The alert thresholds are currently hardcoded (they should be configurable). But as a tool for going from idea to working product, Claude Code is genuinely impressive.


What's Next

  • WebSocket for real-time updates — Currently the dashboard polls every 60s. Live streaming would make it feel more like Netdata.
  • Configurable alert thresholds — Per-server, via the dashboard UI.
  • Notifications — Telegram/email alerts when a server goes critical.
  • Public SaaS launch — If there's enough interest, I'd love to open this up.

If you're building monitoring tools, or if you've used Claude Code for a full-feature build, I'd love to hear about your experience in the comments.

Built with Bun, React, Recharts, Supabase, and Claude Code Opus 4.6.
