
Nehemiah

Building a Real-Time Anomaly Detection Daemon for a Live Cloud Storage Platform

We built a real-time traffic defense system for a public Nextcloud instance that could detect abnormal request behavior, block abusive clients in seconds, and keep the service available under unpredictable attack traffic.
The Problem
A public Nextcloud instance is always exposed. It accepts traffic continuously, stores user files, and presents a login surface to the internet at all times. That makes it a natural target for abuse: credential stuffing, path scanning, burst traffic, and opportunistic denial-of-service attempts.
The challenge was to keep the service live and responsive without knowing what attack traffic would look like in advance. We did not know when traffic would spike, how many source IPs would be involved, or whether the traffic would arrive as a flood, a scan, or a distributed burst. Any defensive system had to be deployed before the attack, adapt to live traffic patterns, and respond automatically in real time.
Static defenses were not enough. Hardcoded rate limits fail when normal traffic fluctuates. Static IP blocklists do nothing against unknown sources. Cron-based monitoring is too slow to matter when abuse happens in seconds. This required a streaming system that could observe traffic as it happened, learn what normal looked like, and react immediately when behavior deviated from it.
The Solution
We built a stateful Python daemon that continuously tails Nginx access logs, learns normal request patterns from live traffic, detects anomalies in real time, and responds automatically by blocking abusive IPs at the firewall.
The system runs as a real-time streaming pipeline. Nginx emits structured JSON logs, the daemon ingests each request as it is written, updates in-memory traffic windows, compares current behavior against a learned baseline, and triggers automated response when traffic becomes abnormal.
Because the daemon runs continuously, it keeps all of its working state in memory: recent request history, baseline statistics, active bans, escalation tiers, and audit state. That makes it fast enough to detect and respond within seconds, without the delay or overhead of periodic polling.
Architecture Overview
The system is built as a lightweight streaming pipeline running entirely on the VPS.
Traffic flows through five stages:
Nginx JSON logs → log watcher → sliding windows → anomaly detector → automated response
Nginx writes each request as newline-delimited JSON. A watchdog-based file observer tails the access log in real time and reads new entries as they are written. Each log line is parsed into a structured event containing source IP, timestamp, request path, method, status code, and response size.
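A minimal sketch of that stage, assuming the watchdog library and an Nginx JSON log format; the log path and JSON key names (remote_addr, request_uri, and so on) are illustrative rather than the project's actual configuration:

```python
import json
import os
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

ACCESS_LOG = "/var/log/nginx/access.json.log"   # hypothetical path

class LogTailer(FileSystemEventHandler):
    """Tails the access log and emits one parsed event per request."""
    # reopening the file after log rotation is omitted here for brevity

    def __init__(self, path, on_event):
        self.path = path
        self.on_event = on_event
        self._fh = open(path, "r")
        self._fh.seek(0, os.SEEK_END)            # only process new requests

    def on_modified(self, event):
        if event.src_path != self.path:
            return
        while True:
            line = self._fh.readline()
            if not line:
                break                            # caught up with the writer
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue                         # skip partially written lines
            self.on_event({
                "ip": entry.get("remote_addr"),
                "path": entry.get("request_uri"),
                "method": entry.get("request_method"),
                "status": int(entry.get("status", 0)),
                "bytes": int(entry.get("body_bytes_sent", 0)),
                "ts": time.time(),
            })

def start_tailer(on_event):
    observer = Observer()
    observer.schedule(LogTailer(ACCESS_LOG, on_event), path="/var/log/nginx")
    observer.start()
    return observer
```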
Those events are pushed into in-memory sliding windows that maintain a rolling view of traffic over the last sixty seconds. The detector evaluates those windows once per second, compares live traffic against a learned baseline, and decides whether behavior is normal or anomalous.
When a per-IP anomaly is detected, the daemon inserts a firewall rule to drop traffic from that source, sends an alert to Slack, and records the event in an audit log. A background unban loop later removes expired bans and notifies operators automatically.
Detection Model
The detector uses adaptive thresholds instead of static rules.
It maintains rolling sixty-second windows for:
global request traffic
per-IP request traffic
global error traffic (4xx/5xx)
per-IP error traffic
These windows are stored in memory using deques, which allow constant-time inserts, expirations, and reads under sustained load.
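A minimal sketch of those windows, assuming sixty-second horizons and deque-backed storage; the class and field names are illustrative:

```python
import time
from collections import deque, defaultdict

WINDOW_SECONDS = 60   # sixty-second rolling view, per the design above

class SlidingWindow:
    """Rolling window of event timestamps; stale entries expire on read."""

    def __init__(self, horizon=WINDOW_SECONDS):
        self.horizon = horizon
        self.events = deque()                 # O(1) append and popleft

    def add(self, ts=None):
        self.events.append(time.time() if ts is None else ts)

    def count(self, now=None):
        cutoff = (now or time.time()) - self.horizon
        while self.events and self.events[0] < cutoff:
            self.events.popleft()             # drop entries older than the window
        return len(self.events)

# one window per traffic view the detector evaluates
global_requests = SlidingWindow()
global_errors = SlidingWindow()
per_ip_requests = defaultdict(SlidingWindow)
per_ip_errors = defaultdict(SlidingWindow)

def record(event):
    global_requests.add(event["ts"])
    per_ip_requests[event["ip"]].add(event["ts"])
    if event["status"] >= 400:                # 4xx/5xx counts as error traffic
        global_errors.add(event["ts"])
        per_ip_errors[event["ip"]].add(event["ts"])
```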
Every second, the daemon computes current request rates and compares them against a rolling baseline built from recent traffic. That baseline is learned continuously from per-second request counts and expressed as a mean and standard deviation.
An anomaly is triggered when traffic meets either of two conditions:
its z-score rises significantly above normal behavior
its request rate spikes far beyond its learned baseline
This dual-threshold model catches both gradual abuse and sudden bursts. The z-score identifies statistically abnormal traffic even when the increase is moderate. The hard spike threshold catches aggressive surges even when historical variance is high.
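A minimal sketch of that dual check; the z-score threshold, spike multiplier, and baseline horizon are illustrative values, not the project's tuned parameters:

```python
import statistics
from collections import deque

BASELINE_HORIZON = 300      # seconds of per-second counts kept; illustrative
Z_THRESHOLD = 4.0           # illustrative, not the project's tuned value
SPIKE_MULTIPLIER = 5.0      # illustrative hard-spike factor

class Baseline:
    """Rolling baseline of per-second request counts."""

    def __init__(self):
        self.samples = deque(maxlen=BASELINE_HORIZON)

    def update(self, count_this_second):
        self.samples.append(count_this_second)

    def stats(self):
        if len(self.samples) < 30:                 # not enough history yet
            return None
        return statistics.fmean(self.samples), statistics.pstdev(self.samples)

def is_anomalous(current_rate, baseline):
    stats = baseline.stats()
    if stats is None:
        return False                                # still learning
    mean, stdev = stats
    z = (current_rate - mean) / stdev if stdev > 0 else 0.0
    if z > Z_THRESHOLD:                             # statistically abnormal
        return True
    if current_rate > mean * SPIKE_MULTIPLIER:      # hard spike far above baseline
        return True
    return False
```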
The detector also tracks error-heavy behavior. If a source begins generating abnormal volumes of 4xx or 5xx responses, its thresholds are tightened automatically. This makes the system more sensitive to scanning and credential-stuffing patterns before they become full attacks.
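A small sketch of how such tightening could look; the error-ratio trigger and scaling factor are illustrative assumptions:

```python
ERROR_RATIO_TRIGGER = 0.5    # illustrative: half of an IP's requests are 4xx/5xx
TIGHTEN_FACTOR = 0.5         # illustrative: halve the effective thresholds

def effective_thresholds(requests_60s, errors_60s,
                         z_threshold=4.0, spike_multiplier=5.0):
    """Tighten per-IP thresholds when a source is error-heavy."""
    if requests_60s > 0 and errors_60s / requests_60s >= ERROR_RATIO_TRIGGER:
        return z_threshold * TIGHTEN_FACTOR, spike_multiplier * TIGHTEN_FACTOR
    return z_threshold, spike_multiplier
```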
Automated Response
Response behavior is intentionally asymmetric.
When a single IP becomes anomalous, the daemon responds automatically (sketched below):
inserts an iptables DROP rule
sends a Slack alert
records the event in the audit log
starts a timed ban
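A minimal sketch of that response path; the Slack webhook environment variable, audit log path, and ban bookkeeping are illustrative assumptions rather than the project's actual interfaces:

```python
import json
import os
import subprocess
import threading
import time
import urllib.request

SLACK_WEBHOOK = os.environ.get("SLACK_WEBHOOK_URL")    # hypothetical env var
AUDIT_LOG = "/var/log/anomaly-daemon/audit.jsonl"       # hypothetical path

active_bans = {}   # ip -> unban timestamp

def ban_ip(ip, duration_seconds, reason):
    # drop all traffic from the offending source at the host firewall
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)
    active_bans[ip] = time.time() + duration_seconds

    event = {"ts": time.time(), "ip": ip, "duration": duration_seconds, "reason": reason}
    with open(AUDIT_LOG, "a") as fh:                     # append-only audit trail
        fh.write(json.dumps(event) + "\n")

    # alert asynchronously so a slow or failed webhook never blocks enforcement
    threading.Thread(target=_notify_slack, args=(event,), daemon=True).start()

def _notify_slack(event):
    if not SLACK_WEBHOOK:
        return
    payload = json.dumps(
        {"text": f"Banned {event['ip']} for {event['duration']}s: {event['reason']}"}
    ).encode()
    req = urllib.request.Request(SLACK_WEBHOOK, data=payload,
                                 headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=5)
    except Exception:
        pass                                             # alert failure must not break detection

def unban_expired():
    # called from the background unban loop; operator notification omitted here
    now = time.time()
    for ip, expiry in list(active_bans.items()):
        if expiry <= now:
            subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"], check=False)
            del active_bans[ip]
```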
When traffic is globally anomalous across many IPs, the system alerts but does not block automatically. A global spike may indicate abuse, but it may also reflect legitimate load. Blocking globally without human review risks taking down real users, so global anomalies are escalated to an operator instead of enforced automatically.
Per-IP bans escalate with repeat offenses. A first violation receives a short temporary ban. Repeated offenses receive progressively longer bans, eventually ending in a permanent block. This prevents repeated low-grade abuse from cycling indefinitely through short ban windows.
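A small sketch of such an escalation ladder; the tier durations are illustrative, since the article does not specify them:

```python
from collections import defaultdict

# escalation ladder; the actual durations are not given in the article,
# so these values are purely illustrative
BAN_TIERS = [60, 600, 3600, None]   # 1 min, 10 min, 1 hour, then permanent

offense_counts = defaultdict(int)

def next_ban_duration(ip):
    """Return the ban duration for this IP's next offense (None = permanent)."""
    tier = min(offense_counts[ip], len(BAN_TIERS) - 1)
    offense_counts[ip] += 1
    return BAN_TIERS[tier]
```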
All alerts are dispatched asynchronously so notification failures cannot block detection or enforcement.
Hardest Engineering Problem
The most difficult part of the system was not detection. It was enforcement.
The detector runs in Docker, but the firewall rules needed to affect traffic arriving at the host’s public interface. On modern Debian, Docker containers and the host can appear to share iptables while actually writing to different packet-filtering backends under nftables compatibility mode. The result is deceptive: a rule inserted inside the container appears valid locally, but has no effect on incoming host traffic.
The solution was to run the detector on the host network rather than in an isolated container network, so its iptables rules were written to the same backend that filters traffic on the host's public interface.
Without that fix, detection worked and enforcement appeared successful, but abusive traffic was never actually blocked.
Observability
To make the system operationally usable, we paired the detector with a live monitoring dashboard.
A lightweight FastAPI service (sketched below) exposes real-time metrics for:
global request rate
anomaly state
top source IPs
active bans
baseline mean and deviation
CPU and memory usage
service uptime
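A minimal sketch of such an endpoint, assuming FastAPI and psutil; the route, field names, and the stubbed detector state are illustrative rather than the project's actual API:

```python
import os
import time

import psutil                     # assumed here for CPU / memory metrics
from fastapi import FastAPI

app = FastAPI()
START_TIME = time.time()

@app.get("/metrics")
def metrics():
    # detector_state would be shared with the daemon in the real service;
    # it is stubbed here to keep the sketch self-contained
    detector_state = {
        "request_rate": 0.0,
        "anomaly": False,
        "top_ips": [],
        "active_bans": [],
        "baseline_mean": 0.0,
        "baseline_stdev": 0.0,
    }
    proc = psutil.Process(os.getpid())
    return {
        **detector_state,
        "cpu_percent": psutil.cpu_percent(interval=None),
        "memory_mb": proc.memory_info().rss / 1_048_576,
        "uptime_seconds": time.time() - START_TIME,
    }
```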
The frontend polls these metrics every few seconds and renders a live incident view without requiring page reloads. This gave operators immediate visibility into what the daemon was seeing and how it was responding during active tests.
Results
The system met the operational goal it was designed for.
It continuously monitored live traffic, learned a working baseline from real request behavior, and reacted automatically when that behavior changed.
In testing, it:
detected anomalous request bursts within minutes
enforced per-IP bans within the required response window
handled sustained synthetic traffic spikes without service interruption
recovered cleanly from log rotation without daemon restart
maintained real-time operator visibility through the dashboard
The final system ran as a single stateful Python service with no external queue, broker, or stream processor. For the scale and latency requirements of a single VPS, that design kept the system fast, simple, and reliable.
Key Lessons
A streaming detector is fundamentally different from periodic monitoring. Once detection depends on timing, polling becomes the bottleneck.
Adaptive thresholds are more reliable than static rate limits when traffic patterns are unknown in advance.
Per-IP enforcement is safe to automate. Global enforcement usually is not.
In containerized environments, successful firewall writes do not guarantee effective firewall enforcement.
The hardest production problems were not statistical. They were operational.
Conclusion
This system was built to solve a practical problem: protect a public-facing Nextcloud instance from unpredictable attack traffic without relying on static assumptions or manual intervention.
It does not attempt deep traffic inspection or sophisticated behavioral modeling. What it does provide is continuous monitoring, adaptive anomaly detection, and automated response fast enough to matter during a live attack.
The most important design choice was treating traffic as a stream instead of a periodic metric. Once every request became an event to process in real time, the rest of the system followed naturally: rolling state, adaptive baselines, immediate detection, and automated enforcement.
