Satya Kilaru

Posted on • Originally published at satyasaikilaru.vercel.app

Building a Real-Time Network Anomaly Detector with PyTorch and Go

TL;DR — I built a full-stack intrusion detection platform: a hybrid ML model (Focal Loss classifier + autoencoder for zero-days), a 4-service Go/Python backend, and an 11-page React dashboard. R2L recall doubled (14% → 29%), zero-day detection works via reconstruction error, and the whole pipeline runs in <2ms per prediction.


The Problem No One Talks About

Every network security tutorial shows you how to train a classifier on NSL-KDD and call it done. But classifiers have a fatal blind spot: they can only detect attacks they've trained on.

The NSL-KDD test set alone contains 17 attack subtypes that don't appear in training. In production the gap is far wider — every new CVE, every new botnet, every novel C2 protocol is invisible to a classifier until you retrain.

I wanted to build something that catches both: known attacks with high precision and novel threats it's never seen before.


Architecture at a Glance

```
┌─────────────────────────┐
│   React Dashboard :5173 │  11 pages, ECharts, AG Grid, Firebase Auth
│   Apollo Client (GQL+WS)│
└───────────┬─────────────┘
            │
┌───────────▼─────────────┐
│  GraphQL Gateway :4000  │  gqlgen, Firebase Admin SDK, SQLite
│  Auth · Persistence     │
└───────────┬─────────────┘
            │ HTTP proxy
┌───────────▼─────────────┐
│  Go API Backend :8080   │  REST, WebSocket broadcast, Simulation
│  Packet capture control │
└───────────┬─────────────┘
            │ HTTP /api/predict
┌───────────▼─────────────┐
│  FastAPI ML Server :8000│  PyTorch → ONNX inference, <2ms/sample
└─────────────────────────┘
```

Why 4 services? The ML server needs high memory and bursty CPU. The Go API needs low memory and high concurrency. Scaling them as one binary wastes money. Each service deploys and scales independently.


The Hybrid ML Model

Problem 1: Rare attacks are invisible to standard loss functions

NSL-KDD class distribution:

| Class | Training % | What it is |
| --- | --- | --- |
| Normal | 53% | Legitimate traffic |
| DoS | 36% | Denial of service |
| Probe | 9.4% | Port scanning, reconnaissance |
| R2L | 0.8% | Remote-to-Local exploits |
| U2R | 0.04% | User-to-Root privilege escalation |

Standard CrossEntropy loss treats all samples equally — so the model essentially ignores R2L and U2R. Why learn to detect 0.04% of the data when you can boost accuracy by getting the 53% right?

Fix: Focal Loss.

```
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)
```

Focal Loss down-weights easy examples (the Normal traffic the model already handles) and forces it to focus on hard, rare samples. The γ parameter controls how aggressively.
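The formula above can be sketched in a few lines of plain Python. The α and γ values here (0.25 and 2.0) are common defaults, not necessarily the ones used in the post's model:

```python
import math

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).

    p_t is the model's predicted probability for the true class.
    The (1 - p_t)^gamma factor shrinks the loss on easy examples
    (p_t near 1), so hard, rare samples dominate the gradient.
    """
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy Normal sample contributes almost nothing to the loss,
# while a hard R2L sample still carries a large loss.
easy = focal_loss(0.9)  # confident, correct prediction
hard = focal_loss(0.1)  # low probability on the true class
```

With γ = 0 and α = 1 the formula reduces to plain cross-entropy, which makes the down-weighting effect easy to verify numerically.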

Result:

| Metric | CrossEntropy (V1) | Focal Loss (V2) | Change |
| --- | --- | --- | --- |
| Probe Recall | 68% | 88% | +20 pts |
| R2L Recall | 14% | 29% | +15 pts (≈2×) |
| Macro F1 | 0.558 | 0.589 | +5.5% (relative) |

Problem 2: Zero-days are invisible to any classifier

A classifier is a lookup table — "I've seen this pattern before, it's DoS." For the 17 novel attack subtypes in the test set (and the infinite unknown attacks in the wild), we need something fundamentally different.

Fix: Autoencoder trained only on Normal traffic.

The autoencoder learns to compress and reconstruct "what normal looks like." At inference:

  • Low reconstruction error → looks normal → pass
  • High reconstruction error → looks like nothing I've ever seen → anomaly flag

This is zero-day detection by definition: it catches anything that deviates from normal, regardless of whether it matches any known signature.
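The pass/flag logic above reduces to a reconstruction-error score plus a threshold calibrated on held-out Normal traffic. A minimal dependency-free sketch (the percentile choice is an assumption, not a value from the post):

```python
def reconstruction_error(x, x_hat):
    # Mean squared error between a feature vector and its reconstruction.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def calibrate_threshold(normal_errors, percentile=99):
    # Cut at a high percentile of errors observed on held-out Normal
    # traffic, so ~1% of benign connections trip the flag at worst.
    ranked = sorted(normal_errors)
    idx = min(len(ranked) - 1, int(len(ranked) * percentile / 100))
    return ranked[idx]

def is_zero_day_candidate(x, x_hat, threshold):
    # High error means the autoencoder could not reproduce this
    # connection from its learned model of "normal" — flag it.
    return reconstruction_error(x, x_hat) > threshold
```

In the real pipeline `x_hat` comes from the trained autoencoder's forward pass; here it is just a second vector so the scoring logic stays self-contained.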

The 3-Vote Consensus

Two detectors mean disagreements. The policy:

| Classifier | Autoencoder | Action |
| --- | --- | --- |
| Attack | High error | High confidence — auto-block above threshold |
| Normal | High error | Possible zero-day — escalate to operator |
| Attack | Low error | Known pattern — treat at face value |
| Normal | Low error | Green — pass |

The second row is the gold: a connection that looks normal per-feature but whose overall shape is anomalous. That's where zero-days hide.
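The decision table maps naturally onto a small dispatch function. The 0.95 auto-block threshold and the action names are illustrative placeholders; the post gates auto-blocking on a confidence threshold but doesn't state the number:

```python
def consensus(classifier_label, confidence, recon_anomalous,
              block_threshold=0.95):
    """Combine the classifier verdict with the autoencoder anomaly flag.

    classifier_label: predicted class ("Normal", "DoS", "Probe", ...)
    confidence:       classifier probability for that label
    recon_anomalous:  True if reconstruction error exceeded the threshold
    """
    is_attack = classifier_label != "Normal"
    if is_attack and recon_anomalous:
        # Both detectors agree: block automatically if confident enough.
        return "auto-block" if confidence >= block_threshold else "alert"
    if not is_attack and recon_anomalous:
        # Looks normal per-feature, abnormal in shape: possible zero-day.
        return "escalate"
    if is_attack:
        # Known pattern with a normal reconstruction: take at face value.
        return "alert"
    return "pass"
```

Keeping the policy in one pure function makes it trivial to unit-test each row of the table.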


The Dashboard — 11 Pages

Not just a prediction endpoint. Operators need context:

| Page | Why it exists |
| --- | --- |
| World Map | Animated attack routes with GeoIP — "where is this coming from?" |
| Traffic Feed | AG Grid, bulk select, CSV export — "show me everything" |
| Incidents | AI threat summary + timeline by class — "what's happening now?" |
| Analytics | Time-series, attack distribution — "what's the trend?" |
| Network Graph | Force-directed IP topology — "who's talking to whom?" |
| Model Compare | Radar chart, confusion matrix — "how good are we?" |
| SOAR | Auto-block on confidence thresholds — "respond automatically" |
| Settings | Profile, theme, webhooks, Slack — "configure everything" |
| System Status | Service health, latency sparklines — "is the platform healthy?" |

Firebase Auth handles Google Sign-In + Email/Password. First user = admin in SQLite; subsequent = viewers. RBAC enforced at the gateway, not just the UI.
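The first-user-becomes-admin rule is simple to express against SQLite. This is a hypothetical Python sketch of the idea (the actual gateway is Go, and the table schema here is invented for illustration):

```python
import sqlite3

# In-memory database stands in for the gateway's SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (uid TEXT PRIMARY KEY, role TEXT)")

def register_user(conn, uid):
    # First row ever inserted gets 'admin'; everyone after is a 'viewer'.
    # In production this check and the insert should share a transaction
    # to avoid a race between two first sign-ins.
    count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    role = "admin" if count == 0 else "viewer"
    conn.execute("INSERT INTO users (uid, role) VALUES (?, ?)", (uid, role))
    return role
```

The role then lives server-side, so the gateway can enforce RBAC on every resolver regardless of what the UI shows.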


Performance

| Class | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Normal | 91.0% | 88.9% | 90.0% |
| DoS | 95.8% | 85.6% | 90.4% |
| Probe | 60.4% | 88.3% | 71.7% |
| R2L | 53.9% | 29.4% | 38.0% |
| U2R | 2.2% | 41.8% | 4.2% |

Overall: 80.0% accuracy · 0.589 Macro F1

R2L/U2R are still the weakest — genuinely hard with <1% of training data. The autoencoder compensates by catching them as distribution anomalies even when the classifier misses.


What I'd Do Differently

  1. Add a χ² frequency-distribution detector — the same math that cracks a Caesar cipher (compare observed letter frequencies to English) applied to network feature distributions. Catches slow horizontal scans that per-packet models miss.

  2. Per-host behavioral baselines — a global Normal model doesn't notice "this IP was quiet for 6 months and just started scanning port 445." A per-host baseline does.

  3. Adversarial training — FGSM/PGD perturbations during training to harden against evasion. If someone discovers your model, they'll craft packets to slip past it.
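The χ² detector from item 1 is little more than Pearson's statistic over a feature histogram. A minimal sketch, assuming you already bin a traffic window into observed counts and keep a baseline of expected counts (the 16.27 critical value is approximately χ² at df=3, α=0.001):

```python
def chi_square_stat(observed, expected):
    # Pearson's chi-square: sum over bins of (O - E)^2 / E.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def is_anomalous(observed, expected, critical_value):
    # Flag a traffic window whose feature histogram deviates too far
    # from the baseline distribution. critical_value comes from the
    # chi-square table for the chosen significance level and
    # degrees of freedom (number of bins minus one).
    return chi_square_stat(observed, expected) > critical_value
```

Because it compares whole-window distributions rather than single packets, a slow horizontal scan that looks benign per-packet still skews the histogram enough to trip the test.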


Stack

Go · Python · PyTorch · ONNX · FastAPI · React 19 · Vite · Tailwind CSS · ECharts · AG Grid · Apollo Client · GraphQL (gqlgen) · Firebase Auth · SQLite · scapy · GeoIP2


Portfolio · GitHub

If you're building something similar or want to discuss the architecture — connect with me on LinkedIn.

Top comments (2)

Sylwia Vargas

Wow that's really cool! Did you build it as a side project or as a part of a bigger system?

Satya Kilaru

Thanks! Personal project — wanted to go end-to-end on ML + network security outside of work.