<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fabricio Amorim</title>
    <description>The latest articles on DEV Community by Fabricio Amorim (@bricio-sr).</description>
    <link>https://dev.to/bricio-sr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818766%2F11c3c61b-0944-4dfc-8909-fd5c5142b56f.jpeg</url>
      <title>DEV Community: Fabricio Amorim</title>
      <link>https://dev.to/bricio-sr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bricio-sr"/>
    <language>en</language>
    <item>
      <title>Your server is already dead. Your monitoring just doesn't know it yet.</title>
      <dc:creator>Fabricio Amorim</dc:creator>
      <pubDate>Wed, 25 Mar 2026 19:24:05 +0000</pubDate>
      <link>https://dev.to/bricio-sr/your-server-is-already-dead-your-monitoring-just-doesnt-know-it-yet-25ek</link>
      <guid>https://dev.to/bricio-sr/your-server-is-already-dead-your-monitoring-just-doesnt-know-it-yet-25ek</guid>
      <description>&lt;p&gt;Your server starts dying at &lt;strong&gt;T=0&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Prometheus detects it at &lt;strong&gt;T=100s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In between, you're blind. That gap — I call it the &lt;strong&gt;Lethal Interval&lt;/strong&gt; — is where OOM kills happen, where memory leaks spiral, where a payment service crashes while your dashboard still shows green.&lt;/p&gt;

&lt;p&gt;This isn't a criticism of Prometheus or Datadog. They're excellent at what they do. The problem is structural: every centralized monitoring system works the same way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect metrics on node
  → transmit over network
    → store in TSDB
      → evaluate rules (cpu &amp;gt; 90% for 1m)
        → fire alert
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every step adds latency. By the time the alert fires, your node is already dead.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/bricio-sr/hosa" rel="noopener noreferrer"&gt;HOSA&lt;/a&gt; to fix this.&lt;/p&gt;




&lt;h2&gt;
  
  
  The biological insight
&lt;/h2&gt;

&lt;p&gt;When you touch something hot, your spinal cord pulls your hand back in milliseconds. Your brain is notified &lt;em&gt;after&lt;/em&gt; the reflex — not before.&lt;/p&gt;

&lt;p&gt;Your brain is excellent at planning, learning, and making complex decisions. But it's structurally too slow to protect your hand in real time. Evolution solved this by putting a local reflex arc in the spinal cord, completely independent of the brain.&lt;/p&gt;

&lt;p&gt;Your servers have the same problem. The "brain" — your Prometheus/Alertmanager/PagerDuty stack — is excellent. But it's structurally too slow for local collapses. HOSA is the spinal cord.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Lethal Interval looks like in practice
&lt;/h2&gt;

&lt;p&gt;Here's a concrete scenario: a memory leak at 50MB/s.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Time  │ Node State          │ HOSA (on-node)          │ Prometheus (central)
──────┼─────────────────────┼─────────────────────────┼──────────────────────
t=0   │ Leak starts         │ D_M=1.1 (homeostasis)   │ Last scrape was 8s ago
      │ mem: 61%            │ Level 0                 │ Shows: "healthy"
      │                     │                         │
t=1   │ mem: 64%            │ ⚡ D_M=2.8 — DETECTS    │ (no scrape)
      │ PSI rising          │ Sampling: 10s → 10ms    │
      │                     │                         │
t=2   │ mem: 68%            │ ⚡ D_M=4.7 — CONTAINS   │ (no scrape)
      │ swap activating     │ memory.high throttled   │
      │                     │ Webhook fired           │
      │                     │                         │
t=8   │ mem: 74%            │ ✓ STABILIZED            │ (no scrape)
      │ (contained)         │ System degraded         │
      │                     │ but functional          │
      │                     │                         │
t=100 │ ✓ Recovering        │ ✓ Rollback complete     │ ⚠ ALERT FIRED
      │ (with HOSA)         │                         │ (100x too late)
──────┴─────────────────────┴─────────────────────────┴──────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without HOSA, the counterfactual: OOM-kill at t=40, CrashLoopBackOff at t=80, 502s for users from t=40 to t=100. Prometheus fires its first alert at t=100, one minute after the first crash.&lt;/p&gt;




&lt;h2&gt;
  
  
  How HOSA detects anomalies
&lt;/h2&gt;

&lt;p&gt;The core insight is that static thresholds are fundamentally broken for correlated systems.&lt;/p&gt;

&lt;p&gt;CPU at 85% with stable memory, low I/O, and normal network is probably a legitimate video transcode job. CPU at 85% with &lt;em&gt;rising memory pressure, I/O stalls, and network latency spikes&lt;/em&gt; is a collapse in progress.&lt;/p&gt;

&lt;p&gt;These are the same CPU reading. A static threshold can't tell them apart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mahalanobis Distance can.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of monitoring individual metrics, HOSA learns the &lt;strong&gt;normal behavioral profile&lt;/strong&gt; of your node — the covariance structure across CPU, memory, I/O, network, and scheduler metrics simultaneously. Anomaly detection becomes: how far is this moment's system state from the node's learned normal profile, accounting for the correlations between dimensions?&lt;/p&gt;

&lt;p&gt;Formally, the anomaly score is the Mahalanobis Distance:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;D_M(x) = √( (x − μ)ᵀ Σ⁻¹ (x − μ) )&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;x&lt;/code&gt; is the current metric vector&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;μ&lt;/code&gt; is the learned mean vector (updated incrementally via Welford)
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Σ&lt;/code&gt; is the learned covariance matrix (captures inter-metric correlations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But magnitude alone isn't enough. HOSA also tracks the &lt;strong&gt;first and second derivatives&lt;/strong&gt; of the anomaly score using EWMA smoothing. It detects that you're &lt;em&gt;heading toward&lt;/em&gt; collapse, not just that you've arrived.&lt;/p&gt;

&lt;p&gt;This is the difference between "CPU is high" and "the anomaly score is accelerating."&lt;/p&gt;
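&lt;p&gt;The derivative tracking can be sketched like this (the smoothing factor and structure are illustrative, not HOSA's actual implementation):&lt;/p&gt;

```go
package main

import "fmt"

// Trend smooths the raw anomaly score with an EWMA and tracks discrete
// first and second differences: how fast the score is moving, and
// whether that movement is itself accelerating.
type Trend struct {
	alpha  float64 // EWMA smoothing factor, e.g. 0.3 (placeholder value)
	score  float64
	d1, d2 float64 // velocity and acceleration of the smoothed score
	prevD1 float64
	seen   bool
}

func (t *Trend) Observe(dm float64) {
	if !t.seen {
		t.score = dm
		t.seen = true
		return
	}
	prev := t.score
	t.score = t.alpha*dm + (1-t.alpha)*t.score
	t.d1 = t.score - prev  // first derivative: velocity
	t.d2 = t.d1 - t.prevD1 // second derivative: acceleration
	t.prevD1 = t.d1
}

func main() {
	tr := new(Trend)
	tr.alpha = 0.3
	for _, dm := range []float64{1.1, 1.2, 1.5, 2.1, 3.2, 4.8} { // accelerating
		tr.Observe(dm)
	}
	fmt.Printf("score=%.2f velocity=%.2f acceleration=%.2f\n", tr.score, tr.d1, tr.d2)
}
```

With an accelerating input series both derivatives come out positive: the signal that you are heading toward collapse, before any absolute threshold is crossed.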




&lt;h2&gt;
  
  
  Collection: eBPF in kernel space
&lt;/h2&gt;

&lt;p&gt;Metrics are collected via eBPF probes attached to kernel tracepoints — no polling, no scraping, no userland agents reading &lt;code&gt;/proc&lt;/code&gt; every N seconds.&lt;/p&gt;

&lt;p&gt;Four probes currently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sched_wakeup&lt;/code&gt; — scheduler pressure&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sys_brk&lt;/code&gt; — memory allocation events
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;page_fault&lt;/code&gt; — memory pressure&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;block_rq_issue&lt;/code&gt; — I/O activity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data flows from kernel space to userland via a lock-free &lt;code&gt;BPF_MAP_TYPE_RINGBUF&lt;/code&gt; with 1–10μs latency. The full decision cycle — collection through actuation — targets under 1ms.&lt;/p&gt;

&lt;p&gt;I wrote a custom eBPF loader (&lt;code&gt;internal/sysbpf&lt;/code&gt;) without third-party dependencies: direct &lt;code&gt;SYS_BPF&lt;/code&gt; syscalls, ELF parsing with BPF relocation resolution. No libbpf. This keeps the binary self-contained and eliminates runtime dependency issues across kernel versions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Actuation: graduated response, not binary kill
&lt;/h2&gt;

&lt;p&gt;When HOSA detects accelerating anomaly, it doesn't kill processes. It throttles them — through the same kernel mechanisms your orchestrator uses, but 100x faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Positive semi-axis (over-demand):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;+1&lt;/td&gt;
&lt;td&gt;Plateau Shift&lt;/td&gt;
&lt;td&gt;Sampling frequency increases, baseline refinement pauses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+2&lt;/td&gt;
&lt;td&gt;Seasonality&lt;/td&gt;
&lt;td&gt;Moderate cpu/memory containment via cgroups v2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+3&lt;/td&gt;
&lt;td&gt;Adversarial&lt;/td&gt;
&lt;td&gt;Aggressive containment, XDP packet filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+4&lt;/td&gt;
&lt;td&gt;Local Failure&lt;/td&gt;
&lt;td&gt;Hard limits, webhook to orchestrator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+5&lt;/td&gt;
&lt;td&gt;Viral Propagation&lt;/td&gt;
&lt;td&gt;Full quarantine, SIGSTOP non-critical processes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Negative semi-axis (under-demand) — this is the part most people don't expect:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;td&gt;Legitimate Idleness&lt;/td&gt;
&lt;td&gt;Below-baseline activity coherent with time window&lt;/td&gt;
&lt;td&gt;GreenOps: reduce CPU frequency, increase sampling interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-2&lt;/td&gt;
&lt;td&gt;Structural Idleness&lt;/td&gt;
&lt;td&gt;Node permanently oversized&lt;/td&gt;
&lt;td&gt;FinOps report: EPI calculation, right-sizing suggestion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;-3&lt;/td&gt;
&lt;td&gt;Anomalous Silence&lt;/td&gt;
&lt;td&gt;Traffic drop &lt;em&gt;incoherent&lt;/em&gt; with temporal context&lt;/td&gt;
&lt;td&gt;Vigilance → Active Containment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Level -3 is a security scenario.&lt;/strong&gt; Traditional monitoring says "all healthy" when a server stops receiving traffic (CPU low, memory free, no errors). HOSA detects that the &lt;em&gt;silence itself&lt;/em&gt; is the anomaly — possible DNS hijack, silent failure, upstream attack.&lt;/p&gt;
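&lt;p&gt;To make the escalation logic concrete, here is a toy version of the positive semi-axis mapping. Every numeric threshold below is invented for illustration; the real cut points come from the whitepaper, and the negative levels additionally need temporal-context checks that this sketch omits:&lt;/p&gt;

```go
package main

import "fmt"

// responseLevel maps anomaly magnitude and velocity to a positive
// semi-axis level. Stateless toy: a real controller also holds and
// decays the current level rather than recomputing from scratch.
func responseLevel(dm, velocity float64) int {
	if !(velocity > 0) {
		// score stable or falling: no escalation from this signal alone
		return 0
	}
	switch {
	case dm > 12:
		return 5 // full quarantine
	case dm > 9:
		return 4 // hard limits, webhook to orchestrator
	case dm > 6:
		return 3 // aggressive containment, XDP filtering
	case dm > 4:
		return 2 // moderate cgroup containment
	case dm > 2.5:
		return 1 // plateau shift: sample faster, pause baseline refinement
	default:
		return 0
	}
}

func main() {
	fmt.Println(responseLevel(2.8, 0.4))  // the t=1 detection in the timeline
	fmt.Println(responseLevel(4.7, 0.9))  // the t=2 containment
	fmt.Println(responseLevel(4.7, -0.1)) // stabilizing: no further escalation
}
```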




&lt;h2&gt;
  
  
  Design decisions I'm happy to defend
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why Mahalanobis and not an ML model?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;O(n²) memory in the number of metric dimensions, constant over time. No GPU. No training pipeline. No data windows stored. Sub-millisecond inference. Interpretable output — every decision comes with the complete mathematical justification: the D_M value, the derivative, the threshold crossed, and the contributing dimensions.&lt;/p&gt;

&lt;p&gt;It runs on a Raspberry Pi.&lt;/p&gt;

&lt;p&gt;Autoencoders were considered and rejected: opaque, heavy footprint, and they require training infrastructure. Isolation Forest was rejected too: it needs stored data windows and isn't incremental.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Welford incremental covariance?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;O(n²) per sample with O(1) allocation. The mean vector and covariance matrix update with each new sample — no data windows stored, no unbounded memory growth. Predictable footprint regardless of uptime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Go and not Rust or C?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pragmatic. This is a research project — faster iteration matters more than zero-cost abstractions right now. The hot path uses zero-allocation patterns (&lt;code&gt;sync.Pool&lt;/code&gt;, pre-allocated slices). GC pauses are sub-millisecond on Go 1.22+. If benchmarks show GC impact on detection latency under adversarial allocation pressure, hot-path migration to Rust or C via CGo is planned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why complement monitoring instead of replacing it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Different timescales, different problems. HOSA solves the Lethal Interval — the milliseconds between collapse onset and external system awareness. Prometheus/Datadog solve capacity planning, trend analysis, cross-service correlation, dashboards. These are genuinely different problems. Trying to do both from one system would compromise both.&lt;/p&gt;




&lt;h2&gt;
  
  
  What HOSA is not
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;strong&gt;Not a monitoring system.&lt;/strong&gt; It doesn't replace Prometheus or Datadog.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Not a HIDS.&lt;/strong&gt; It doesn't detect intrusions by signature.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Not an orchestrator.&lt;/strong&gt; It doesn't schedule pods or manage clusters.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Not magic.&lt;/strong&gt; It has a cold start window (default 5 minutes). A sophisticated attacker who understands the architecture can execute a low-and-slow evasion. Throttling has side effects.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Where it is now
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Early alpha.&lt;/strong&gt; Phase 1 is complete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eBPF probes for memory, CPU, I/O, scheduler&lt;/li&gt;
&lt;li&gt;Welford incremental covariance matrix&lt;/li&gt;
&lt;li&gt;Mahalanobis Distance calculation&lt;/li&gt;
&lt;li&gt;Hardware proprioception (automatic topology discovery via sysfs — CPU cores, NUMA nodes, L3 cache, VM detection)&lt;/li&gt;
&lt;li&gt;EWMA smoothing + first and second temporal derivatives&lt;/li&gt;
&lt;li&gt;Graduated response system (Levels -3 to +5)&lt;/li&gt;
&lt;li&gt;Thalamic Filter (telemetry suppression during homeostasis — no noise at steady state)&lt;/li&gt;
&lt;li&gt;Benchmark suite: detection latency (p50/p99/p999), false positive rate, memory footprint and allocs per cycle&lt;/li&gt;
&lt;/ul&gt;
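&lt;p&gt;Conceptually, the Thalamic Filter is a gate in front of telemetry emission. A deliberately minimal sketch of the idea:&lt;/p&gt;

```go
package main

import "fmt"

// thalamicFilter drops telemetry while the node is in homeostasis
// (level 0). Downstream systems hear nothing at steady state, so every
// event that does get emitted is, by construction, signal.
func thalamicFilter(level int, event string, emit func(string)) {
	if level == 0 {
		return // suppressed: healthy nodes are silent
	}
	emit(fmt.Sprintf("level=%d %s", level, event))
}

func main() {
	out := func(s string) { fmt.Println(s) }
	thalamicFilter(0, "dm=1.1", out) // silence
	thalamicFilter(2, "dm=4.7 memory.high throttled", out)
}
```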

&lt;p&gt;Requires: Linux ≥ 5.8, x86_64 or arm64.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2&lt;/strong&gt; (in progress): webhooks for K8s HPA/KEDA, Prometheus-compatible metrics endpoint, Kubernetes DaemonSet deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3&lt;/strong&gt; (planned): local SLM for post-containment root cause analysis, Bloom Filter in eBPF for known-pattern fast-path blocking, federated baseline sharing across fleet.&lt;/p&gt;




&lt;h2&gt;
  
  
  Academic context
&lt;/h2&gt;

&lt;p&gt;This comes out of a Master's research project at &lt;strong&gt;IMECC/Unicamp&lt;/strong&gt; (University of Campinas, Brazil). The full theoretical foundation — including the mathematical formulation, the adversarial model, and the formal definition of the Lethal Interval — is in the whitepaper in the repo.&lt;/p&gt;

&lt;p&gt;The core concept is what I call &lt;strong&gt;Endogenous Resilience&lt;/strong&gt;: the idea that each node should possess autonomous detection and mitigation capability, independent of network connectivity and external control planes. The dominant model today is Exogenous Telemetry — ship data out, receive instructions back. HOSA proposes a complementary layer that operates entirely locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Contribute
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/bricio-sr/hosa" rel="noopener noreferrer"&gt;github.com/bricio-sr/hosa&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Areas where contributions are especially welcome:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;eBPF expertise&lt;/strong&gt;: CO-RE compatibility testing across kernel versions, probe overhead optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical validation&lt;/strong&gt;: Testing Mahalanobis robustness under non-Gaussian workload distributions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos engineering&lt;/strong&gt;: Fault injection scenarios for benchmark coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to go deep before touching code, read the whitepaper first — it explains &lt;em&gt;why&lt;/em&gt; before &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;"Orchestrators and centralized monitoring are essential for capacity planning and long-term governance. But they are structurally — not accidentally — too slow to guarantee a node's survival in real time. If collapse happens in the interval between perception and exogenous action, the capacity for immediate decision must reside in the node itself."&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>go</category>
      <category>architecture</category>
      <category>linux</category>
    </item>
    <item>
      <title>Your Monitoring Stack Has a Blind Spot. Here's the 2-Second Window Where Servers Die</title>
      <dc:creator>Fabricio Amorim</dc:creator>
      <pubDate>Wed, 11 Mar 2026 19:18:28 +0000</pubDate>
      <link>https://dev.to/bricio-sr/your-monitoring-stack-has-a-blind-spot-heres-the-2-second-window-where-servers-die-5fg1</link>
      <guid>https://dev.to/bricio-sr/your-monitoring-stack-has-a-blind-spot-heres-the-2-second-window-where-servers-die-5fg1</guid>
      <description>&lt;h3&gt;
  
  
  Why I Replaced Static Thresholds with Mahalanobis Distance to Detect Server Failures in Milliseconds
&lt;/h3&gt;




&lt;p&gt;It's 2am. Your payment service is dying.&lt;/p&gt;

&lt;p&gt;Not slowly — right now, at 50MB/s, memory is leaking into the void. But your dashboards? Green. Every single one. Prometheus scraped 8 seconds ago. It saw nothing unusual. Your alerting rule says &lt;code&gt;container_memory_usage &amp;gt; 1.8GB for 1m&lt;/code&gt; — and you're not there yet. So every on-call rotation sleeps peacefully while three hundred transactions start walking toward a cliff.&lt;/p&gt;

&lt;p&gt;At t=40 seconds, the Linux OOM-Killer makes the decision you were too slow to make. The process is gone. The transactions are corrupted. The alert fires at t=100 seconds — a full minute and forty seconds after the point where something &lt;em&gt;could&lt;/em&gt; have acted.&lt;/p&gt;

&lt;p&gt;This is not a story about a bad alerting threshold. You can tune thresholds forever and still lose this race.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if the problem was never the alerting threshold — but the model of detection itself?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How the monitoring stack actually works (and where it breaks)
&lt;/h2&gt;

&lt;p&gt;The standard observability loop is deceptively simple: a local agent collects metrics, sends them over the network, a time-series database stores them, a rules engine evaluates them, and if a threshold is crossed, an alert fires. You know this stack. Prometheus, Alertmanager, Grafana — you've probably set it up more than once.&lt;/p&gt;

&lt;p&gt;The problem isn't any of those tools. They're excellent at what they do. The problem is structural, and it lives in two places.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency of Awareness.&lt;/strong&gt; Every scrape interval is a gap in your knowledge. At 15–60 seconds per cycle, you're always making decisions based on a photograph of the past. The evaluation happens after collection, after transmission, after storage. By the time &lt;code&gt;container_memory_usage &amp;gt; 1.8GB for 1m&lt;/code&gt; becomes true, the rule has to remain true &lt;em&gt;for another full minute&lt;/em&gt; before the alert fires. The math is brutal: you could be 2+ minutes behind a fast-moving failure before anyone is paged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fragility of Connection.&lt;/strong&gt; The exogenous monitoring model has a silent assumption baked in: the network is always up. But consider a DDoS attack — the very event that saturates your outbound bandwidth also blinds your observability stack. Your Prometheus scrape fails. Your agent can't push metrics. Your node is simultaneously under attack and operationally invisible. The systems that should be protecting you lose eyes at exactly the moment you need them most.&lt;/p&gt;

&lt;p&gt;There's a name for what lives between these two failure modes. I call it the &lt;strong&gt;Lethal Interval&lt;/strong&gt; — the window between when a server starts dying and when your monitoring system first notices. A memory leak at 50MB/s gives you a Lethal Interval of roughly 100 seconds. A fast OOM scenario might be half that. A DDoS with bandwidth saturation could stretch it indefinitely.&lt;/p&gt;
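&lt;p&gt;The arithmetic behind both numbers is worth writing down (values illustrative):&lt;/p&gt;

```go
package main

import "fmt"

// lethalIntervalSeconds: how long until memory headroom is exhausted
// at a constant leak rate.
func lethalIntervalSeconds(headroomMB, leakMBps float64) float64 {
	return headroomMB / leakMBps
}

// worstCaseAlertDelaySeconds: the detection side of the gap is the sum
// of the scrape interval, the rule's "for" duration, and rule
// evaluation latency.
func worstCaseAlertDelaySeconds(scrape, forDuration, eval float64) float64 {
	return scrape + forDuration + eval
}

func main() {
	// 5 GB of free memory leaking at 50 MB/s: about 100 s to exhaustion.
	fmt.Printf("lethal interval: %.0fs\n", lethalIntervalSeconds(5000, 50))
	// 60s scrape + 60s "for 1m" + 15s evaluation: 135 s behind the failure.
	fmt.Printf("alert delay: %.0fs\n", worstCaseAlertDelaySeconds(60, 60, 15))
}
```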

&lt;p&gt;This is where most production incidents are born. Not in the explosion — in the silence before it.&lt;/p&gt;




&lt;h2&gt;
  
  
  A different mental model: the spinal reflex
&lt;/h2&gt;

&lt;p&gt;Here's a question that has nothing to do with servers: when you touch a hot stove, does your brain decide to pull your hand back?&lt;/p&gt;

&lt;p&gt;It doesn't. Your spinal cord does — in milliseconds. The signal never reaches your brain before the withdrawal is already executing. Your brain finds out &lt;em&gt;after&lt;/em&gt; the reflex has fired. The architecture of your nervous system doesn't make the cerebral cortex the bottleneck for survival-critical responses. It separates fast local reflexes from slow central reasoning, and both coexist perfectly.&lt;/p&gt;

&lt;p&gt;That's the architecture I built HOSA on.&lt;/p&gt;

&lt;p&gt;Your Kubernetes control plane is your brain. It has global context, complex reasoning, resource scheduling, orchestration. It's irreplaceable. But it operates on timescales of seconds to minutes — and it has the same structural blindspot as Prometheus: it depends on network connectivity and periodic reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Kubernetes control plane is your brain. HOSA is your spinal cord. Both are necessary. They operate at different speeds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;HOSA doesn't replace your observability stack. It operates in the interval where your observability stack is structurally incapable of acting — the milliseconds between the start of a collapse and the arrival of the first metric at an external system. Each node runs its own detection and response, autonomously, locally, without waiting for the central orchestrator to notice and send instructions back.&lt;/p&gt;

&lt;p&gt;This is what the whitepaper formalizes as &lt;strong&gt;Endogenous Resilience&lt;/strong&gt;: the capacity of a node to detect and contain its own collapse, independent of network connectivity, operating at kernel speed.&lt;/p&gt;




&lt;h2&gt;
  
  
  How HOSA actually works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The sensing layer: eBPF
&lt;/h3&gt;

&lt;p&gt;HOSA doesn't poll metrics like Prometheus does. It uses eBPF probes attached directly to the Linux kernel — tracepoints and kprobes — to observe CPU scheduling, memory pressure, I/O wait, and network state &lt;strong&gt;continuously&lt;/strong&gt;, at sub-millisecond granularity.&lt;/p&gt;

&lt;p&gt;If you're not familiar with eBPF: it lets you run sandboxed programs inside the Linux kernel without modifying kernel source code. Think of it as a microscope you can attach to any kernel event. No scrape interval. No transmission delay. Just continuous, in-process observation of what the kernel is actually doing.&lt;/p&gt;

&lt;p&gt;This is the first architectural break from the exogenous model: HOSA sees the system in real time, from the inside.&lt;/p&gt;

&lt;h3&gt;
  
  
  The detection engine: Mahalanobis Distance
&lt;/h3&gt;

&lt;p&gt;This is the core insight, and it's worth slowing down for.&lt;/p&gt;

&lt;p&gt;Most monitoring asks a question like: &lt;em&gt;"Is CPU above 90%?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;HOSA asks a different question: &lt;em&gt;"Is the current combination of CPU, memory pressure, I/O latency, and network state statistically unusual — compared to how this specific server normally behaves?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The distinction matters enormously. CPU at 85% during a video encoding job is completely normal. CPU at 85% &lt;em&gt;while memory pressure is spiking and I/O latency is tripling simultaneously&lt;/em&gt; is a completely different event — one that a threshold-based system cannot distinguish from the first case.&lt;/p&gt;

&lt;p&gt;The Mahalanobis Distance is a statistical measure of how far a point sits from the center of a multivariate distribution, accounting for the correlations between its variables. It answers the question: "Given everything I know about how these metrics usually move together, how anomalous is this specific combination right now?"&lt;/p&gt;

&lt;p&gt;HOSA builds a multivariate statistical baseline for each node — incrementally, using the Welford online algorithm, with no offline training phase (only a short cold-start window while the baseline settles). It then continuously computes the Mahalanobis Distance of the current metric vector against that baseline. When the distance spikes, and particularly when the &lt;em&gt;rate of change&lt;/em&gt; of that distance is accelerating, HOSA knows something is wrong — often before any single metric has crossed a static threshold.&lt;/p&gt;

&lt;p&gt;That's how it detects at t=1s what Prometheus won't see until t=100s.&lt;/p&gt;
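&lt;p&gt;A toy two-metric example makes the contrast concrete. The baseline below is hand-written and hypothetical (a node whose CPU legitimately swings while memory stays tight); the covariance term is what lets the identical CPU reading score differently:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// mahalanobis2 computes D_M for a 2-metric vector against a learned
// mean and 2x2 covariance matrix, inverting the matrix in closed form.
func mahalanobis2(x, mean [2]float64, cov [2][2]float64) float64 {
	det := cov[0][0]*cov[1][1] - cov[0][1]*cov[1][0]
	dx := x[0] - mean[0]
	dy := x[1] - mean[1]
	q := (cov[1][1]*dx*dx - 2*cov[0][1]*dx*dy + cov[0][0]*dy*dy) / det
	return math.Sqrt(q)
}

func main() {
	// Hypothetical learned profile: CPU varies widely (encoding jobs),
	// memory stays near 60% with small variance, mild positive correlation.
	mean := [2]float64{55, 60}
	cov := [2][2]float64{{400, 30}, {30, 16}}

	transcode := [2]float64{85, 61} // high CPU, memory normal
	collapse := [2]float64{85, 75}  // same CPU, memory pressure rising

	fmt.Printf("transcode D_M ~ %.2f\n", mahalanobis2(transcode, mean, cov))
	fmt.Printf("collapse  D_M ~ %.2f\n", mahalanobis2(collapse, mean, cov))
}
```

Same CPU reading, very different distances: the first point follows the learned structure, the second breaks it.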

&lt;h3&gt;
  
  
  The response system: graduated, not binary
&lt;/h3&gt;

&lt;p&gt;Most automated responses have two states: do nothing, or kill the process. This binary is dangerous. Killing a payment service to prevent a memory leak is not a win.&lt;/p&gt;

&lt;p&gt;HOSA implements six graduated response levels, ranging from increased observation to full network isolation. The key design principle: &lt;strong&gt;contain first, preserve function where possible, escalate only if containment fails&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The levels in brief:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Level 0&lt;/strong&gt; — Normal operation, baseline monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 1&lt;/strong&gt; — Vigilância (Watchfulness): anomaly detected, sampling frequency increases from 100ms to 10ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 2&lt;/strong&gt; — Contenção (Containment): apply cgroup memory limits, reduce ceiling without killing the process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 3&lt;/strong&gt; — Pressão (Pressure): further resource restriction, throttle CPU scheduling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 4&lt;/strong&gt; — Isolamento (Isolation): network traffic throttled via XDP, service degraded but alive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 5&lt;/strong&gt; — Quarentena (Quarantine): full network isolation, node preserved for forensics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The escalation is governed by both the magnitude of the Mahalanobis Distance and its acceleration — so a fast spike that stabilizes won't needlessly escalate, while a slow but continuously accelerating deviation will.&lt;/p&gt;
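&lt;p&gt;A minimal sketch of that hold-or-escalate rule (thresholds invented for illustration):&lt;/p&gt;

```go
package main

import "fmt"

// step advances the response level using both magnitude and velocity:
// escalate only while the smoothed distance is high AND still rising;
// hold while it is high but stable; step down as it decays.
func step(level int, dm, velocity float64) int {
	if dm > 2.5 {
		if velocity > 0.1 {
			if level > 4 {
				return 5 // already at quarantine ceiling
			}
			return level + 1
		}
		return level // anomalous but stabilizing: hold, don't escalate
	}
	if level > 0 {
		return level - 1 // recovered: de-escalate gradually
	}
	return 0
}

func main() {
	level := 0
	// fast spike that stabilizes: escalates once, then holds
	for _, obs := range [][2]float64{{3.0, 0.9}, {3.1, 0.05}, {3.1, 0.0}} {
		level = step(level, obs[0], obs[1])
	}
	fmt.Println("after spike-and-stabilize:", level)
}
```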




&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;Let's run the 2am scenario again, but with HOSA running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;t=0&lt;/strong&gt; — The payment service begins leaking memory at 50MB/s. Every dashboard shows green. Memory is at 61%. HOSA's baseline says this is within normal range. D_M = 1.1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;t=1s&lt;/strong&gt; — Memory climbs to 64%. PSI (Pressure Stall Information) ticks up to 18%. No single threshold is broken. But the combination — memory trending up, PSI trending up, swap beginning to activate — produces D_M = 2.8. HOSA detects the anomaly. It increases its own sampling frequency from 100ms to 10ms. No alert fires. It just starts watching more closely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;t=2s&lt;/strong&gt; — D_M = 4.7. The rate of deviation is accelerating, not stabilizing. HOSA moves to Level 2: Containment. It adjusts the cgroup memory ceiling for the payment-service container from 2GB to 1.6GB, applying backpressure without killing the process. A webhook fires to the operator: container name, resource affected, action taken, current metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;t=4s&lt;/strong&gt; — The rate of memory growth slows. The containment is working. Memory stabilizes around 72%. HOSA confirms the intervention is effective and holds at Level 2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;t=8s&lt;/strong&gt; — System stabilized at a degraded but functional state. Transactions are still processing, at reduced throughput. The service is alive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;t=100s&lt;/strong&gt; — Prometheus fires its first alert. By this point, HOSA has been managing the situation for 92 seconds. The operator already has full context from the t=2s webhook: what happened, what was done, whether it worked.&lt;/p&gt;

&lt;p&gt;The service stayed alive. Transactions were preserved. The operator received actionable context instead of a wake-up call and a dead container.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the project is today
&lt;/h2&gt;

&lt;p&gt;HOSA is a research project — the foundation of my Master's thesis at Unicamp's IMECC, in Applied and Computational Mathematics.&lt;/p&gt;

&lt;p&gt;The mathematical framework is fully documented in the whitepaper (v2.1). The architecture is defined in detail: the eBPF sensing layer, the Welford incremental covariance computation, the Mahalanobis detection engine, the cgroup/XDP response system, the graduated response model.&lt;/p&gt;

&lt;p&gt;What's working in the current alpha: the eBPF probes, the Welford online covariance update, and the basic Mahalanobis calculation. The architecture is being built out toward experimental validation and comparative benchmarks against Prometheus + Alertmanager under controlled failure scenarios.&lt;/p&gt;

&lt;p&gt;What's next: the experimental validation suite, the academic paper, and the benchmark data that will either validate the model or force me to revise it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/bricio-sr/hosa" rel="noopener noreferrer"&gt;https://github.com/bricio-sr/hosa&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project site:&lt;/strong&gt; &lt;a href="https://hosaproject.org" rel="noopener noreferrer"&gt;https://hosaproject.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you work with Linux infrastructure, distributed systems, or anomaly detection — I'd genuinely love your feedback. Open an issue, send a message, or just read the whitepaper.&lt;/p&gt;




&lt;h2&gt;
  
  
  The question that doesn't go away
&lt;/h2&gt;

&lt;p&gt;Back to 2am.&lt;/p&gt;

&lt;p&gt;Your infrastructure will face a Lethal Interval. The question isn't whether — the interval is a structural property of any system that monitors from the outside. The question is whether something is watching closely enough, from the inside, to act before your monitoring stack even notices.&lt;/p&gt;

&lt;p&gt;Prometheus tells you what happened. HOSA is designed to act while it's still happening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your current strategy for the gap between anomaly start and first alert? I'd genuinely like to know.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Fabricio Amorim is a Software Engineer &amp;amp; SRE based in Brazil, building distributed systems and critical infrastructure with Go, Python, and C. He is currently researching autonomous anomaly detection at the Linux kernel level as part of his Master's in Applied Mathematics at Unicamp's IMECC.&lt;br&gt;
Creator of HOSA (Homeostasis OS Agent) — an open-source autonomous SRE agent that detects and contains system failures in milliseconds using eBPF and multivariate statistics, without depending on external monitoring infrastructure.&lt;br&gt;
🔗 hosaproject.org · github.com/bricio-sr/hosa · linkedin.com/in/fabricioroney&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>linux</category>
      <category>go</category>
    </item>
  </channel>
</rss>
