Mohamed Zrouga

Posted on May 25

I'm not an ML engineer. I built one anyway.

#machinelearning #linux #security #go

Low-resource anomaly detection for ARM devices

Not because I wanted to — but because every tool I tried on ARM edge devices either needed the cloud, needed a GPU, or needed more RAM than the service it was supposed to be watching.

So this post isn't really about Cerberus. It's about the problem it forced me to actually understand: what does anomaly detection require on constrained hardware, and how do you get there without black-box ML?

Here's what I learned.

The observability stack was heavier than the workload

When you're deploying on cloud VMs, the weight of your tooling is invisible. You have RAM to spare. You have fast networks. You have Prometheus scraping endpoints on a LAN that never goes offline.

Drop the same assumptions onto an ARM gateway at a remote industrial site and things break differently. The telemetry pipeline competes with the workload for CPU cycles. The collector needs connectivity that doesn't exist. The ML inference endpoint is somewhere in a cloud region the device can't reach.

The problem isn't the tools — they were built for a different environment. The problem is treating cloud-native observability as a default rather than a choice.

Once I asked "what does edge observability actually need?" the answer was much smaller than I expected:

Did traffic behavior change?
Is something probing unusual ports?
Are protocol patterns different from yesterday?
Is there unexplained traffic acceleration?
Which specific device changed?

That's not a distributed tracing problem. It's a behavioral signal problem.

Why eBPF is the right layer for this

The kernel already sees everything. Every packet, every connection, every flag — it all passes through the network stack before any userspace process touches it.

eBPF lets you attach small programs directly to that stack using TC (Traffic Control) or XDP hooks. Instead of running tcpdump through a pipe, or copying full payloads into userspace for inspection, you write a kernel-side filter that extracts only the metadata you care about and hands it to you via a ring buffer.

For Cerberus, that's roughly 208 bytes per event:

struct network_event {
    __u8  event_type;       // ARP / TCP / UDP / DNS / TLS / HTTP / ICMP
    __u32 src_ip;
    __u32 dst_ip;
    __u16 src_port;
    __u16 dst_port;
    __u8  tcp_flags;
    __u8  l7_payload[128];  // first 128 bytes for L7 inspection
    // ...
};

The kernel filters. The ring buffer delivers. Userspace gets a clean event stream at near-zero overhead — no full payload copies, no extra processes, no agents fighting the workload for CPU.

On ARM systems, this difference is measurable.

What "ML-Lite" actually means

Before I go further: I want to be clear that I'm not an ML engineer. What I built is better described as applied statistics with some online learning layered on top. I'm calling it ML-Lite because that's what it is, not because it sounds impressive.

The instinct when building anomaly detection is to reach for a neural network or a heavy ML runtime. On constrained hardware that's a dead end — both because of resource cost and because the explainability disappears. An operator at 2am staring at an alert doesn't want a confidence score. They want to know what changed.

So the system works in three stages.

Stage 1: Aggregate into windows

Every 30 seconds, the event stream is compressed into a feature vector:

[packet_rate, dns_rate, tls_rate, syn_rate, entropy, unusual_ports]

This is the "network behavior as numbers" step. Each window becomes a compact snapshot of what the device was doing.
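A minimal Go sketch of this windowing step, under some loud assumptions: the `Event` type, its field names, and the "expected ports" allowlist are hypothetical and not taken from Cerberus. The entropy slot is left out here because it follows the formula given in Stage 2.

```go
package main

import "fmt"

// Event is a hypothetical userspace view of one kernel event.
// Field names are assumptions for this sketch, not Cerberus's real types.
type Event struct {
	Kind    string // "TCP", "UDP", "DNS", "TLS", ...
	DstPort uint16
	SYN     bool // true for a TCP segment with the SYN flag set
}

// Features holds per-second rates for one window plus the number of
// distinct destination ports outside the expected set.
type Features struct {
	PacketRate, DNSRate, TLSRate, SYNRate float64
	UnusualPorts                          int
}

// Aggregate folds one window of events into a feature vector.
// windowSecs is the window length in seconds (30 in the text above).
func Aggregate(events []Event, windowSecs float64, expected map[uint16]bool) Features {
	if windowSecs <= 0 {
		return Features{}
	}
	var packets, dns, tls, syn int
	unusual := map[uint16]bool{}
	for _, e := range events {
		packets++
		switch e.Kind {
		case "DNS":
			dns++
		case "TLS":
			tls++
		}
		if e.SYN {
			syn++
		}
		if !expected[e.DstPort] {
			unusual[e.DstPort] = true
		}
	}
	return Features{
		PacketRate:   float64(packets) / windowSecs,
		DNSRate:      float64(dns) / windowSecs,
		TLSRate:      float64(tls) / windowSecs,
		SYNRate:      float64(syn) / windowSecs,
		UnusualPorts: len(unusual),
	}
}

func main() {
	events := []Event{
		{Kind: "DNS", DstPort: 53},
		{Kind: "TLS", DstPort: 443},
		{Kind: "TCP", DstPort: 3389, SYN: true},
	}
	f := Aggregate(events, 30, map[uint16]bool{53: true, 443: true})
	fmt.Printf("%+v\n", f)
}
```

Counting distinct unusual ports rather than unusual packets is a design choice in this sketch: it keeps one chatty flow to an odd port from dominating the feature.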

Stage 2: Build a baseline

As windows accumulate, the system learns what normal looks like using three tools:

Median + MAD (Median Absolute Deviation): robust to outliers in a way mean/stddev aren't. A single window with a traffic spike barely moves the median, so the baseline holds.
EWMA (Exponentially Weighted Moving Average): weights recent windows more heavily than old ones, so the baseline follows gradual change without jumping on every new window.
Centroid distance: tracks how far the current feature vector is from the center of historical observations.
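To make the second and third tools concrete, here is a small Go sketch of an EWMA update and a running centroid with Euclidean distance. The names `ewma` and `Centroid`, the smoothing factor, and the use of plain Euclidean distance on unscaled features are all assumptions for illustration; a real deployment would probably normalize features first so that one large-valued feature does not dominate the distance.

```go
package main

import (
	"fmt"
	"math"
)

// ewma blends a new observation into the running average.
// alpha is in (0, 1]; a smaller alpha means slower adaptation.
func ewma(prev, x, alpha float64) float64 {
	return alpha*x + (1-alpha)*prev
}

// Centroid tracks the running mean of every feature vector seen so far.
type Centroid struct {
	mean []float64
	n    int
}

// Update folds one feature vector into the running mean.
func (c *Centroid) Update(v []float64) {
	if c.mean == nil {
		c.mean = make([]float64, len(v))
	}
	c.n++
	for i, x := range v {
		c.mean[i] += (x - c.mean[i]) / float64(c.n)
	}
}

// Distance returns the Euclidean distance from v to the centroid,
// or 0 if no vector has been observed yet.
func (c *Centroid) Distance(v []float64) float64 {
	if c.mean == nil {
		return 0
	}
	var sum float64
	for i, x := range v {
		d := x - c.mean[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

func main() {
	baseline := 10.0
	for _, x := range []float64{11, 9, 10, 40} {
		baseline = ewma(baseline, x, 0.1)
	}
	fmt.Printf("ewma baseline: %.2f\n", baseline)

	c := &Centroid{}
	c.Update([]float64{0, 0})
	c.Update([]float64{2, 2})
	fmt.Println("distance:", c.Distance([]float64{4, 5}))
}
```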

The scoring formula for each feature is:

robust_z = |x - median| / MAD
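The formula above can be sketched in Go as follows. The function names and the `eps` floor are assumptions added for this example: a completely flat history has a MAD of zero, so something has to guard the division.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// median returns the middle value of xs without modifying the caller's slice.
func median(xs []float64) float64 {
	n := len(xs)
	if n == 0 {
		return math.NaN()
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// mad returns the median absolute deviation of xs around med.
func mad(xs []float64, med float64) float64 {
	dev := make([]float64, len(xs))
	for i, x := range xs {
		dev[i] = math.Abs(x - med)
	}
	return median(dev)
}

// robustZ scores x against the history of one feature.
// eps is an assumed floor that avoids dividing by zero on a flat history.
func robustZ(x float64, history []float64, eps float64) float64 {
	med := median(history)
	return math.Abs(x-med) / math.Max(mad(history, med), eps)
}

func main() {
	// One spiking window (500) sits in the history.
	history := []float64{10, 12, 11, 13, 12, 11, 500}
	med := median(history)
	fmt.Println("median:", med, "MAD:", mad(history, med))
	fmt.Println("robust_z(40):", robustZ(40, history, 1e-9))
}
```

With this history the median is 12 and the MAD is 1 despite the 500 spike, so a new value of 40 scores 28. A mean/stddev baseline over the same history would have been dragged far upward by that single window.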

And entropy is computed as:

H(X) = -Σ p(x) log₂ p(x)

Where x ranges over the destination ports seen in the window and p(x) is the share of events going to port x. Normal traffic hits the same handful of ports repeatedly, so entropy stays low. A port scan touches 22, 23, 80, 443, 445, 3389 in sequence, spreading traffic across many ports, so entropy spikes.
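A short Go sketch of the entropy calculation, assuming the window has already been reduced to a map of destination port to event count (the `portEntropy` name and that input shape are choices made for this example):

```go
package main

import (
	"fmt"
	"math"
)

// portEntropy returns the Shannon entropy, in bits, of the destination-port
// distribution in one window. counts maps port -> number of events.
func portEntropy(counts map[uint16]int) float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	if total == 0 {
		return 0
	}
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue // a zero-count port contributes nothing
		}
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	normal := map[uint16]int{443: 90, 53: 10}
	scan := map[uint16]int{22: 1, 23: 1, 80: 1, 443: 1, 445: 1, 3389: 1}
	fmt.Printf("normal: %.2f bits\n", portEntropy(normal))
	fmt.Printf("scan:   %.2f bits\n", portEntropy(scan))
}
```

The "normal" window, dominated by port 443, comes out at about 0.47 bits. The six-port scan is uniform, so it reaches log₂ 6, about 2.58 bits.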

Stage 3: Explain the score

This is the part I cared most about. The system doesn't just surface a number — it surfaces which features drove it:

WHY?
+ High SYN rate
+ Port entropy spike
+ Traffic acceleration

An operator can act on that immediately.
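One way to produce that kind of output, sketched in Go: keep the per-feature scores instead of collapsing them into one number, then report the features above a threshold, strongest first. The `Contribution` type, the `explain` function, and the 3.5 threshold are assumptions for this example rather than Cerberus internals.

```go
package main

import (
	"fmt"
	"sort"
)

// Contribution pairs a feature name with its robust z-score for one window.
type Contribution struct {
	Feature string
	Z       float64
}

// explain returns the features whose score exceeds threshold, strongest first.
// The threshold is an assumed tuning knob, not a recommended value.
func explain(scores map[string]float64, threshold float64) []Contribution {
	var out []Contribution
	for name, z := range scores {
		if z > threshold {
			out = append(out, Contribution{Feature: name, Z: z})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Z != out[j].Z {
			return out[i].Z > out[j].Z
		}
		return out[i].Feature < out[j].Feature // stable order on ties
	})
	return out
}

func main() {
	scores := map[string]float64{"syn_rate": 9.1, "entropy": 6.4, "dns_rate": 0.8}
	fmt.Println("WHY?")
	for _, c := range explain(scores, 3.5) {
		fmt.Printf("+ %s (z=%.1f)\n", c.Feature, c.Z)
	}
}
```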

The evolution wasn't planned

The detection model went through several iterations, each adding a layer without replacing what came before:

v1 — Statistical detection: Median, MAD, thresholds, entropy. Worked. Noisy on IoT networks.

v2 — Adaptive learning: EWMA, rolling baselines, per-device profiles. Reduced false positives significantly once the baseline had enough history.

v3 — Isolation Forest: Unsupervised ML, tree isolation, outlier scoring. Doesn't need labeled attack data. Effective for genuinely novel patterns.

v4 — Tiny autoencoders: Architecture is 9→16→4→16→9. The bottleneck (4 dimensions) forces the model to compress normal behavior into a compact representation. Reconstruction error is the anomaly signal — if the current window can't be reconstructed well, it's unusual.
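As a rough illustration of the v4 idea, here is a Go sketch of a forward pass through a 9→16→4→16→9 network and the reconstruction error it yields. Everything beyond the layer sizes is an assumption: the ReLU activation, the mean-squared-error metric, and the `Layer`, `forward`, and `newLayer` names are chosen for the example, and the weights are zero-initialised placeholders where a trained model would supply real values.

```go
package main

import "fmt"

// Layer is one dense layer: out = act(W*in + b). W is indexed [output][input].
type Layer struct {
	W    [][]float64
	B    []float64
	ReLU bool // apply ReLU after the affine step (assumed activation)
}

// forward runs x through the layers in order and returns the final output.
func forward(layers []Layer, x []float64) []float64 {
	for _, l := range layers {
		out := make([]float64, len(l.B))
		for i := range out {
			sum := l.B[i]
			for j, v := range x {
				sum += l.W[i][j] * v
			}
			if l.ReLU && sum < 0 {
				sum = 0
			}
			out[i] = sum
		}
		x = out
	}
	return x
}

// reconstructionError is the mean squared error between input and output.
func reconstructionError(in, out []float64) float64 {
	var sum float64
	for i := range in {
		d := in[i] - out[i]
		sum += d * d
	}
	return sum / float64(len(in))
}

// newLayer allocates a zero-initialised layer; real weights come from training.
func newLayer(in, out int, relu bool) Layer {
	w := make([][]float64, out)
	for i := range w {
		w[i] = make([]float64, in)
	}
	return Layer{W: w, B: make([]float64, out), ReLU: relu}
}

func main() {
	// 9→16→4→16→9, matching the shape described above.
	net := []Layer{
		newLayer(9, 16, true),
		newLayer(16, 4, true),
		newLayer(4, 16, true),
		newLayer(16, 9, false),
	}
	x := []float64{1, 1, 1, 1, 1, 1, 1, 1, 1}
	y := forward(net, x)
	fmt.Println(len(y), reconstructionError(x, y))
}
```

With untrained zero weights the output is all zeros, so the error here is simply the mean of the squared inputs. After training on normal windows, the same error becomes the anomaly signal: low for familiar behavior, high for windows the bottleneck cannot represent.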

v5 (in progress) — Temporal Graph ML: Device graphs, sequence analysis, behavior prediction. The goal is to model relationships between devices over time, not just each device in isolation.

The design constraint throughout: offline, CPU/ARM friendly, explainable, hackable. No iteration added a cloud dependency.

Honest limitations

Behavioral models come with real tradeoffs I haven't fully solved:

Baseline drift: Normal behavior changes over time. A device that starts a new cron job at 3am will generate false positives until the baseline adapts.
Encrypted traffic: TLS SNI is usually visible in the handshake (unless Encrypted Client Hello is in use), but payload content isn't. The entropy signals still work on port distributions, but deep inspection has limits.
Noisy IoT environments: Some IoT devices have genuinely chaotic traffic patterns. Per-device profiles help, but they need enough history to be meaningful.
Cold start: Until a device accumulates enough windows to build a stable baseline, scoring is unreliable.
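Two of these limitations, baseline drift and cold start, are often handled with the same small structure: a bounded history per feature plus a readiness check. The Go sketch below is one possible shape, not a description of how Cerberus does it; the `Rolling` type, its capacity, and the minimum-window count are all hypothetical.

```go
package main

import "fmt"

// Rolling keeps the most recent `size` values for one feature of one device.
// Old windows fall out, so the baseline can follow slow drift.
type Rolling struct {
	vals []float64
	size int
	next int // index of the oldest value once the buffer is full
}

// NewRolling returns an empty history that retains at most size values.
func NewRolling(size int) *Rolling { return &Rolling{size: size} }

// Add records a value, overwriting the oldest once the buffer is full.
func (r *Rolling) Add(x float64) {
	if r.size <= 0 {
		return
	}
	if len(r.vals) < r.size {
		r.vals = append(r.vals, x)
		return
	}
	r.vals[r.next] = x
	r.next = (r.next + 1) % r.size
}

// Ready reports whether enough windows exist to trust a score.
// Callers can suppress or down-weight alerts while this is false.
func (r *Rolling) Ready(minWindows int) bool { return len(r.vals) >= minWindows }

// Values returns a copy of the retained history (insertion order is not preserved).
func (r *Rolling) Values() []float64 { return append([]float64(nil), r.vals...) }

func main() {
	r := NewRolling(3)
	for _, x := range []float64{1, 2, 3, 4} {
		r.Add(x)
		fmt.Println(r.Ready(3), r.Values())
	}
}
```

The tradeoff is the one named in the list above: a short history adapts quickly but can absorb a slow attack into the baseline, while a long history stays stricter but keeps flagging legitimate changes for longer.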

This is not a full IDS. It's an operational visibility tool that can surface unusual behavior — with the reasoning attached.

What I'm still trying to understand

This is where I'd genuinely value input from people with more ML background than me:

For edge anomaly detection, is unsupervised learning (Isolation Forest, autoencoders) fundamentally the right approach, or are there better-suited algorithms for streaming, low-resource environments?
How do you handle baseline drift without introducing false negatives for genuine behavioral shifts?
For temporal modeling on embedded systems, are there efficient graph-based approaches that don't require the full overhead of a GNN framework?
Is there a better feature set for network behavioral anomaly detection beyond what's described here?

I'm building in a space I find genuinely interesting but I'm not formally trained in — any feedback on the ML design is more valuable to me than stars.

Thank you

Cerberus wouldn't be what it is without the people who showed up and contributed real work to it.

Huge thanks to @SvenNellerz and @alexmchughdev — your contributions made the project meaningfully better and I genuinely appreciate it.

The project also stands on cilium/ebpf, BuntDB, and golang-lru — solid libraries that made the Go + eBPF combination practical without CGO headaches.

If you're working in this space — embedded Linux, ARM infrastructure, eBPF, IoT security, or lightweight ML — the repo is at github.com/zrougamed/cerberus and I'd genuinely love to hear how you're approaching the same problems.

Top comments (4)

VoltageGPU • May 27

It's an interesting approach to run ML locally on ARM when cloud isn't an option. Have you looked into using TensorRT or ONNX Runtime for better performance? I've seen similar setups on embedded systems where we had to optimize inference with limited resources.

Mohamed Zrouga • May 28

@voltagegpu Thanks! I have looked at ONNX Runtime and it's definitely on my radar for the autoencoder work. TensorRT is a bit less relevant for my target deployments since most of the environments I'm testing on are CPU-only ARM Raspberry Pi boards rather than NVIDIA-based edge devices.
One of the design goals for Cerberus is to stay lightweight and explainable, so I started with statistical models, Isolation Forests, and tiny autoencoders before introducing heavier ML infrastructure. That said, benchmarking ONNX Runtime on ARM for the autoencoder stage would be an interesting next step.
Out of curiosity, what kind of embedded systems were you optimizing with ONNX/TensorRT?

Harjot Singh • May 31

The "I'm not X, built one anyway" genre is the defining story of right now, and it's a net good - expertise gates are dropping and more people get to make things. The honest nuance: AI gets you a working ML pipeline fast, but the part it can't hand you is the judgment to know when it's quietly wrong (leaky validation, a metric that looks great and means nothing). The build democratized; the skepticism still has to be learned.

So the move isn't "don't build without the credential," it's "build, but keep verifying what you don't yet know to distrust." Same reason I care about owned, legible output over black-box magic in Moonshift (prompt to a shipped SaaS on your own GitHub+Vercel) - you can actually inspect and learn from what shipped. Respect for diving in; what tripped you up most - the data plumbing or trusting the eval numbers? (Moonshift's first run's free if useful.)

Mohamed Zrouga • May 31

Thanks, that's exactly the challenge I'm working through. Building the pipeline was the easy part; building confidence in the scoring and evaluation is where the real learning happens. Explainability became essential because operators need to know why something was flagged. The good news is it's now being stress-tested by real users and environments, which is helping expose both the strengths and the blind spots.