Fabricio Amorim
Your server is already dead. Your monitoring just doesn't know it yet.

Your server starts dying at T=0.

Prometheus detects it at T=100s.

In between, you're blind. That gap — I call it the Lethal Interval — is where OOM kills happen, where memory leaks spiral, where a payment service crashes while your dashboard still shows green.

This isn't a criticism of Prometheus or Datadog. They're excellent at what they do. The problem is structural: every centralized monitoring system works the same way.

collect metrics on node
  → transmit over network
    → store in TSDB
      → evaluate rules (cpu > 90% for 1m)
        → fire alert

Every step adds latency. By the time the alert fires, your node is already dead.
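To make that structural latency concrete, here is the back-of-envelope arithmetic with illustrative defaults (a 15s scrape interval, 15s rule evaluation, a `for: 1m` clause, roughly 10s of Alertmanager batching; these numbers are assumptions, not measurements of any particular stack):

```go
package main

import "fmt"

// worstCaseLatency sums the stages of a centralized alerting pipeline.
// All values are illustrative defaults, not measurements of any system.
func worstCaseLatency() float64 {
	scrapeInterval := 15.0 // seconds until the next scrape observes the change
	ruleEval := 15.0       // rule evaluation interval
	forDuration := 60.0    // `for: 1m`: the condition must hold before firing
	groupWait := 10.0      // Alertmanager batching delay
	return scrapeInterval + ruleEval + forDuration + groupWait
}

func main() {
	fmt.Printf("worst-case alert latency: ~%.0fs\n", worstCaseLatency()) // ~100s
}
```

Tune any stage and the total shifts, but no tuning removes the `for` clause or the scrape gap: the floor stays in the tens of seconds.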

I built HOSA to fix this.


The biological insight

When you touch something hot, your spinal cord pulls your hand back in milliseconds. Your brain is notified after the reflex — not before.

Your brain is excellent at planning, learning, and making complex decisions. But it's structurally too slow to protect your hand in real time. Evolution solved this by putting a local reflex arc in the spinal cord, completely independent of the brain.

Your servers have the same problem. The "brain" — your Prometheus/Alertmanager/PagerDuty stack — is excellent. But it's structurally too slow for local collapses. HOSA is the spinal cord.


What the Lethal Interval looks like in practice

Here's a concrete scenario: a memory leak at 50MB/s.

Time  │ Node State          │ HOSA (on-node)          │ Prometheus (central)
──────┼─────────────────────┼─────────────────────────┼──────────────────────
t=0   │ Leak starts         │ D_M=1.1 (homeostasis)   │ Last scrape was 8s ago
      │ mem: 61%            │ Level 0                 │ Shows: "healthy"
      │                     │                         │
t=1   │ mem: 64%            │ ⚡ D_M=2.8 — DETECTS    │ (no scrape)
      │ PSI rising          │ Sampling: 10s → 10ms    │
      │                     │                         │
t=2   │ mem: 68%            │ ⚡ D_M=4.7 — CONTAINS   │ (no scrape)
      │ swap activating     │ memory.high throttled   │
      │                     │ Webhook fired           │
      │                     │                         │
t=8   │ mem: 74%            │ ✓ STABILIZED            │ (no scrape)
      │ (contained)         │ System degraded         │
      │                     │ but functional          │
      │                     │                         │
t=100 │ ✓ Recovering        │ ✓ Rollback complete     │ ⚠ ALERT FIRED
      │ (with HOSA)         │                         │ (100x too late)
──────┴─────────────────────┴─────────────────────────┴──────────────────────

Without HOSA, the counterfactual: OOM-kill at t=40, CrashLoopBackOff at t=80, 502s for users from t=40 to t=100. Prometheus fires its first alert at t=100, one minute after the first crash.


How HOSA detects anomalies

The core insight is that static thresholds are fundamentally broken for correlated systems.

CPU at 85% with stable memory, low I/O, and normal network is probably a legitimate video transcode job. CPU at 85% with rising memory pressure, I/O stalls, and network latency spikes is a collapse in progress.

These are the same CPU reading. A static threshold can't tell them apart.

Mahalanobis Distance can.

Instead of monitoring individual metrics, HOSA learns the normal behavioral profile of your node — the covariance structure across CPU, memory, I/O, network, and scheduler metrics simultaneously. Anomaly detection becomes: how far is this moment's system state from the node's learned normal profile, accounting for the correlations between dimensions?

D_M(x) = √((x − μ)ᵀ Σ⁻¹ (x − μ))

Where:

  • x is the current metric vector
  • μ is the learned mean vector (updated incrementally via Welford)
  • Σ is the learned covariance matrix (captures inter-metric correlations)

But magnitude alone isn't enough. HOSA also tracks the first and second derivatives of the anomaly score using EWMA smoothing. It detects that you're heading toward collapse, not just that you've arrived.

This is the difference between "CPU is high" and "the anomaly score is accelerating."


Collection: eBPF in kernel space

Metrics are collected via eBPF probes attached to kernel tracepoints — no polling, no scraping, no userland agents reading /proc every N seconds.

Four probes currently:

  • sched_wakeup — scheduler pressure
  • sys_brk — memory allocation events
  • page_fault — memory pressure
  • block_rq_issue — I/O activity

Data flows from kernel space to userland via a lock-free BPF_MAP_TYPE_RINGBUF with 1–10μs latency. The full decision cycle — collection through actuation — targets under 1ms.

I wrote a custom eBPF loader (internal/sysbpf) without third-party dependencies: direct SYS_BPF syscalls, ELF parsing with BPF relocation resolution. No libbpf. This keeps the binary self-contained and eliminates runtime dependency issues across kernel versions.


Actuation: graduated response, not binary kill

When HOSA detects an accelerating anomaly, it doesn't kill processes. It throttles them, through the same kernel mechanisms your orchestrator uses, but 100x faster.

Positive semi-axis (over-demand):

| Level | Name | Action |
| --- | --- | --- |
| +1 | Plateau Shift | Sampling frequency increases, baseline refinement pauses |
| +2 | Seasonality | Moderate cpu/memory containment via cgroups v2 |
| +3 | Adversarial | Aggressive containment, XDP packet filtering |
| +4 | Local Failure | Hard limits, webhook to orchestrator |
| +5 | Viral Propagation | Full quarantine, SIGSTOP non-critical processes |

Negative semi-axis (under-demand) — this is the part most people don't expect:

| Level | Name | Trigger | Action |
| --- | --- | --- | --- |
| -1 | Legitimate Idleness | Below-baseline activity coherent with time window | GreenOps: reduce CPU frequency, increase sampling interval |
| -2 | Structural Idleness | Node permanently oversized | FinOps report: EPI calculation, right-sizing suggestion |
| -3 | Anomalous Silence | Traffic drop incoherent with temporal context | Vigilance → Active Containment |

Level -3 is a security scenario. Traditional monitoring says "all healthy" when a server stops receiving traffic (CPU low, memory free, no errors). HOSA detects that the silence itself is the anomaly — possible DNS hijack, silent failure, upstream attack.


Design decisions I'm happy to defend

Why Mahalanobis and not an ML model?

Memory is O(n²) in the number of metric dimensions and constant over time. No GPU. No training pipeline. No data windows stored. Sub-millisecond inference. Interpretable output: every decision comes with the complete mathematical justification, meaning the D_M value, the derivative, the threshold crossed, and the contributing dimensions.

It runs on a Raspberry Pi.

Autoencoders were considered and rejected: opaque, large footprint, and they require training infrastructure. Isolation Forest was rejected too: it needs stored data windows and isn't incremental.

Why Welford incremental covariance?

O(n²) per sample with O(1) allocation. The mean vector and covariance matrix update with each new sample — no data windows stored, no unbounded memory growth. Predictable footprint regardless of uptime.
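A self-contained sketch of that update (naive indexing, with a per-call scratch slice that a real zero-allocation hot path would pre-allocate; this is an illustration of the multivariate Welford recurrence, not HOSA's implementation):

```go
package main

import "fmt"

// welford maintains a running mean vector and co-moment matrix.
// No window of past samples is ever stored; memory stays fixed at
// O(n²) in the number of metric dimensions regardless of uptime.
type welford struct {
	n    int
	mean []float64
	m2   [][]float64 // co-moments; cov(i,j) = m2[i][j] / (n-1)
}

func newWelford(dim int) *welford {
	m2 := make([][]float64, dim)
	for i := range m2 {
		m2[i] = make([]float64, dim)
	}
	return &welford{mean: make([]float64, dim), m2: m2}
}

// update folds one sample in. deltaOld uses the pre-update mean,
// the inner term uses the post-update mean; their product is the
// standard incremental co-moment update.
func (w *welford) update(x []float64) {
	w.n++
	deltaOld := make([]float64, len(x)) // a real hot path would pre-allocate this
	for i := range x {
		deltaOld[i] = x[i] - w.mean[i]
		w.mean[i] += deltaOld[i] / float64(w.n)
	}
	for i := range x {
		for j := range x {
			w.m2[i][j] += deltaOld[i] * (x[j] - w.mean[j])
		}
	}
}

func (w *welford) cov(i, j int) float64 {
	return w.m2[i][j] / float64(w.n-1)
}

func main() {
	w := newWelford(2)
	for _, x := range [][]float64{{1, 2}, {3, 4}, {5, 6}} {
		w.update(x)
	}
	fmt.Println(w.mean, w.cov(0, 0), w.cov(0, 1)) // [3 4] 4 4
}
```

The two perfectly correlated toy metrics end with equal variance and covariance, which is exactly the structure Σ feeds back into the Mahalanobis distance.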

Why Go and not Rust or C?

Pragmatic. This is a research project — faster iteration matters more than zero-cost abstractions right now. The hot path uses zero-allocation patterns (sync.Pool, pre-allocated slices). GC pauses are sub-millisecond on Go 1.22+. If benchmarks show GC impact on detection latency under adversarial allocation pressure, hot-path migration to Rust or C via CGo is planned.
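The sync.Pool pattern mentioned above, sketched (the buffer type, capacity, and function names are invented; the point is the shape of the borrow/return cycle):

```go
package main

import (
	"fmt"
	"sync"
)

// samplePool recycles metric buffers so the per-cycle hot path stops
// allocating after warm-up. A pointer to the slice is pooled, the
// idiomatic way to avoid an allocation on every Put.
var samplePool = sync.Pool{
	New: func() any {
		b := make([]float64, 0, 16)
		return &b
	},
}

// processCycle borrows a buffer, does its work, and returns the buffer
// to the pool. In steady state this allocates nothing per cycle.
func processCycle(raw []float64) float64 {
	bp := samplePool.Get().(*[]float64)
	buf := (*bp)[:0] // reuse capacity, reset length
	buf = append(buf, raw...)

	var sum float64
	for _, v := range buf {
		sum += v
	}

	*bp = buf // keep any growth the append caused
	samplePool.Put(bp)
	return sum
}

func main() {
	fmt.Println(processCycle([]float64{1, 2, 3})) // 6
}
```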

Why complement monitoring instead of replacing it?

Different timescales, different problems. HOSA solves the Lethal Interval — the milliseconds between collapse onset and external system awareness. Prometheus/Datadog solve capacity planning, trend analysis, cross-service correlation, dashboards. These are genuinely different problems. Trying to do both from one system would compromise both.


What HOSA is not

  • Not a monitoring system. It doesn't replace Prometheus or Datadog.
  • Not a HIDS. It doesn't detect intrusions by signature.
  • Not an orchestrator. It doesn't schedule pods or manage clusters.
  • Not magic. It has a cold start window (default 5 minutes). A sophisticated attacker who understands the architecture can execute a low-and-slow evasion. Throttling has side effects.

Where it is now

Early alpha. Phase 1 is complete:

  • eBPF probes for memory, CPU, I/O, scheduler
  • Welford incremental covariance matrix
  • Mahalanobis Distance calculation
  • Hardware proprioception (automatic topology discovery via sysfs — CPU cores, NUMA nodes, L3 cache, VM detection)
  • EWMA smoothing + first and second temporal derivatives
  • Graduated response system (Levels -3 to +5)
  • Thalamic Filter (telemetry suppression during homeostasis — no noise at steady state)
  • Benchmark suite: detection latency (p50/p99/p999), false positive rate, memory footprint and allocs per cycle

Requires: Linux ≥ 5.8, x86_64 or arm64.

Phase 2 (in progress): webhooks for K8s HPA/KEDA, Prometheus-compatible metrics endpoint, Kubernetes DaemonSet deployment.

Phase 3 (planned): local SLM for post-containment root cause analysis, Bloom Filter in eBPF for known-pattern fast-path blocking, federated baseline sharing across fleet.


Academic context

This comes out of a Master's research project at IMECC/Unicamp (University of Campinas, Brazil). The full theoretical foundation — including the mathematical formulation, the adversarial model, and the formal definition of the Lethal Interval — is in the whitepaper in the repo.

The core concept is what I call Endogenous Resilience: the idea that each node should possess autonomous detection and mitigation capability, independent of network connectivity and external control planes. The dominant model today is Exogenous Telemetry — ship data out, receive instructions back. HOSA proposes a complementary layer that operates entirely locally.


Contribute

GitHub: github.com/bricio-sr/hosa

Areas where contributions are especially welcome:

  • eBPF expertise: CO-RE compatibility testing across kernel versions, probe overhead optimization
  • Statistical validation: Testing Mahalanobis robustness under non-Gaussian workload distributions
  • Chaos engineering: Fault injection scenarios for benchmark coverage

If you want to go deep before touching code, read the whitepaper first — it explains why before how.


"Orchestrators and centralized monitoring are essential for capacity planning and long-term governance. But they are structurally — not accidentally — too slow to guarantee a node's survival in real time. If collapse happens in the interval between perception and exogenous action, the capacity for immediate decision must reside in the node itself."
