Farhan Munir

Build Log: Shipping a Lean Python Telemetry Agent (CPU, Memory, Disk)

Build Log (April 8, 2026)

Today I implemented the first production-ready telemetry collectors for heka-insights-agent and wired them into the main polling loop.

What I built

  • Added an optimized CPUCollector in src/collectors/cpu.py
  • Added a MemoryCollector in src/collectors/memory.py
  • Added a DiskCollector in src/collectors/disk.py
  • Wired all collectors into src/main.py with a shared loop
  • Added environment-based poll interval support via CPU_POLL_INTERVAL_SECONDS
  • Added python-dotenv in requirements.txt

CPU collector design

I built CPU collection around psutil.cpu_times(...) snapshots and delta math, so each cycle reads from a single source instead of calling both cpu_percent and cpu_times_percent.

Key design points:

  • No thread offloading (to_thread) for this workload
  • First cycle is warm-up by design
  • Supports basic and detailed output modes
  • Optional per-core output
  • Uses MonotonicTicker to keep fixed cadence without drift
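The snapshot-and-delta idea can be sketched as follows. This is a minimal illustration, not the actual CPUCollector: the names cpu_percentages and CpuDeltaSampler are hypothetical, and the snapshots are assumed to be namedtuples of cumulative seconds with identical fields, which is the shape psutil.cpu_times() returns.

```python
from collections import namedtuple


def cpu_percentages(prev, curr):
    """Convert two cumulative CPU-time snapshots into per-field percentages.

    Both arguments are namedtuples with the same fields, for example two
    results of psutil.cpu_times() taken one poll interval apart.
    """
    deltas = {
        field: max(getattr(curr, field) - getattr(prev, field), 0.0)
        for field in curr._fields
    }
    total = sum(deltas.values())
    if total <= 0:
        # No measurable time elapsed between the two snapshots.
        return {field: 0.0 for field in deltas}
    return {field: round(100.0 * d / total, 2) for field, d in deltas.items()}


class CpuDeltaSampler:
    """Hypothetical helper that remembers the previous snapshot between cycles."""

    def __init__(self, read_snapshot):
        self._read = read_snapshot  # e.g. psutil.cpu_times
        self._prev = None

    def sample(self):
        curr = self._read()
        prev, self._prev = self._prev, curr
        if prev is None:
            return None  # warm-up cycle: nothing to diff against yet
        return cpu_percentages(prev, curr)
```

Passing psutil.cpu_times as read_snapshot gives one system call per cycle, and the first sample() returns None, which matches the warm-up behaviour described above. One caveat for a real collector: on Linux, guest time is also counted inside user time, so summing every field as this sketch does would slightly overstate the total.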

Memory collector design

Memory collection is intentionally lightweight:

  • One call each to psutil.virtual_memory() and psutil.swap_memory()
  • basic mode returns compact key fields
  • detailed mode returns full psutil fields
  • Raw byte values are preserved (server-side compute handles transformations)
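A rough sketch of the basic/detailed split, under stated assumptions: the function name build_memory_payload and the choice of "compact key fields" are mine, since the post does not list them. The inputs are assumed to be the namedtuples returned by psutil.virtual_memory() and psutil.swap_memory().

```python
from collections import namedtuple

# Assumption: these are the compact fields; the real collector may pick others.
BASIC_VM_FIELDS = ("total", "available", "used", "percent")
BASIC_SWAP_FIELDS = ("total", "used", "percent")


def build_memory_payload(vm, swap, mode="basic"):
    """Build a memory payload from two psutil-style namedtuples.

    Values are passed through untouched, so byte counts stay raw bytes.
    """
    if mode == "detailed":
        # Every field the platform exposes, without renaming or conversion.
        return {"memory": dict(vm._asdict()), "swap": dict(swap._asdict())}
    if mode != "basic":
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "memory": {f: getattr(vm, f) for f in BASIC_VM_FIELDS},
        "swap": {f: getattr(swap, f) for f in BASIC_SWAP_FIELDS},
    }
```

In the agent this would be called once per cycle as build_memory_payload(psutil.virtual_memory(), psutil.swap_memory(), mode), so each psutil function is still called exactly once.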

Disk collector design

For disk, I chose cumulative I/O counters (not rates) because rates are computed server-side.

  • Uses psutil.disk_io_counters(perdisk=True)
  • Returns aggregate and per-disk counters
  • Filters to physical devices only
  • Excludes partitions from per-disk payload
  • Added device-name cache with periodic refresh to reduce repeated filtering overhead
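One way the device-name cache with periodic refresh could look. This is a sketch under assumptions: DeviceNameCache and list_block_devices are hypothetical names, the 60-second refresh is illustrative, and "physical" is approximated as "listed in /sys/block, excluding loop and ram devices". On Linux, /sys/block holds whole block devices, while partitions sit one level deeper, so they are excluded automatically.

```python
import os
import time


def list_block_devices(sys_block="/sys/block"):
    """Return whole block device names on Linux (no partitions)."""
    try:
        names = os.listdir(sys_block)
    except OSError:
        return frozenset()
    # Assumption: loop and ram devices do not count as physical disks here.
    return frozenset(n for n in names if not n.startswith(("loop", "ram")))


class DeviceNameCache:
    """Caches the device-name set and re-reads it after refresh_seconds."""

    def __init__(self, lister=list_block_devices, refresh_seconds=60.0,
                 clock=time.monotonic):
        self._lister = lister
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._names = None
        self._loaded_at = 0.0

    def names(self):
        now = self._clock()
        expired = now - self._loaded_at >= self._refresh_seconds
        if self._names is None or expired:
            # Only this branch touches the filesystem.
            self._names = frozenset(self._lister())
            self._loaded_at = now
        return self._names
```

The refresh exists so that hot-plugged or removed disks are eventually noticed; between refreshes, filtering the psutil.disk_io_counters(perdisk=True) result is a set-membership check per device.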

Main loop wiring

src/main.py now runs:

  • CPU collector
  • Memory collector
  • Disk collector

All three run on the same interval, and each collector writes its own log line.
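A shared loop with a drift-free cadence might be sketched like this. FixedCadenceTicker is a hypothetical stand-in for the agent's MonotonicTicker, whose implementation is not shown in this post, and run_cycles is likewise illustrative. The key idea is to schedule against absolute deadlines from time.monotonic() rather than sleeping a fixed amount after each cycle.

```python
import time


class FixedCadenceTicker:
    """Sleeps until the next deadline on a fixed grid of monotonic time."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self._interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_deadline = clock() + interval

    def wait(self):
        remaining = self._next_deadline - self._clock()
        if remaining > 0:
            self._sleep(remaining)
            steps = 1
        else:
            # The cycle overran: skip missed ticks and stay on the grid.
            steps = int(-remaining // self._interval) + 1
        self._next_deadline += steps * self._interval


def run_cycles(collectors, ticker, cycles, emit=print):
    """Run every collector once per tick, one output line per collector."""
    for _ in range(cycles):
        for name, collect in collectors.items():
            emit(f"{name}: {collect()}")
        ticker.wait()
```

Because each deadline is derived from the previous deadline and not from "now", the time the collectors spend working is absorbed into the sleep instead of accumulating as drift.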

Poll interval is loaded from .env via:

CPU_POLL_INTERVAL_SECONDS

Invalid values fall back to the default of 5.0 seconds.
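The fallback behaviour can be sketched as below. The function name load_poll_interval is hypothetical, and treating zero, negative, NaN, and infinite values as invalid is my assumption; the post only says that invalid values fall back to 5.0.

```python
import math
import os

DEFAULT_POLL_INTERVAL = 5.0


def load_poll_interval(env=None, key="CPU_POLL_INTERVAL_SECONDS",
                       default=DEFAULT_POLL_INTERVAL):
    """Read the poll interval from the environment, falling back on bad input."""
    env = os.environ if env is None else env
    raw = env.get(key)
    if raw is None:
        return default
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default
    # Assumption: non-finite or non-positive intervals count as invalid.
    if not math.isfinite(value) or value <= 0:
        return default
    return value
```

Calling load_dotenv() from python-dotenv at startup copies the entries of .env into os.environ, after which load_poll_interval() with no arguments picks up the configured value.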

Profiling notes

I profiled a 120-second run and reviewed both process stats and cProfile output.

Key findings:

  • Agent CPU cost is very low (near-idle for this polling interval)
  • Max RSS is about 15 MB
  • Runtime is dominated by intentional sleep (expected)
  • Collector costs are small; disk collection is the heaviest of the three
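For readers who want to take similar measurements, here is a small sketch using only the standard library. It is not the exact procedure used for the 120-second run; profile_call and max_rss_kib are hypothetical helper names.

```python
import cProfile
import io
import pstats
import resource


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Cumulative time shows which call trees dominate the run.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def max_rss_kib():
    """Peak resident set size of this process.

    On Linux ru_maxrss is reported in kilobytes (macOS reports bytes).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```

Wrapping the agent's main loop in profile_call for a bounded number of cycles gives a report in which the sleep call is expected to account for most of the cumulative time, with the collectors listed underneath it.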

What changed after profiling

Based on profile output, I optimized disk collection further:

  • Added cached physical-device list to avoid filtering every cycle
  • Kept output shape unchanged (disk_io + disk_io_perdisk)
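The unchanged output shape could be assembled roughly as follows. This is a sketch: build_disk_payload is a hypothetical name, and computing the aggregate by summing the physical devices is my assumption about how disk_io is derived. Summing only whole devices avoids counting a partition's I/O twice.

```python
from collections import namedtuple


def build_disk_payload(per_disk, physical_names):
    """Build the disk_io + disk_io_perdisk payload.

    per_disk maps device name to a counter namedtuple, the shape returned
    by psutil.disk_io_counters(perdisk=True). physical_names is the cached
    set of device names to keep.
    """
    kept = {
        name: dict(counters._asdict())
        for name, counters in per_disk.items()
        if name in physical_names
    }
    totals = {}
    for counters in kept.values():
        for field, value in counters.items():
            # Counters stay cumulative; rates are left to the server.
            totals[field] = totals.get(field, 0) + value
    return {"disk_io": totals, "disk_io_perdisk": kept}
```

With the device-name set coming from a cache, the per-cycle cost is one psutil call plus a dictionary pass over the devices that psutil reports.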

Current status

The agent now has a clean baseline telemetry pipeline with low overhead and clear extension points for transport/shipping.

Next planned work:

  • Add payload shipping to backend endpoint
  • Add bounded retry/backoff
  • Add collector-focused tests

Repo URL

ronin1770 / heka-insights-agent

A lightweight agent for collecting essential Linux system telemetry and shipping it to a configurable backend.
