One quiet evening, a cloud server looked normal on the surface.
Users logged in, files moved, and everything felt stable.
But hidden inside the traffic were strange patterns: sudden spikes, repeated requests, and behavior that did not look human.
My task was simple to say, but serious to build: teach the server what "normal" traffic looks like, then react fast when traffic becomes dangerous.
Here is the world of this project, in one clear flow:
- Nginx stands at the gate and receives every request.
- Nextcloud serves the actual application.
- Nginx JSON logs record each request as structured data.
- Detector daemon reads those logs in real time, learns normal behavior, and flags anomalies.
- iptables blocks abusive IPs when needed.
- Slack alerts report important incidents immediately.
- Dashboard shows live health, rates, and bans.
In short: this is not just monitoring. It is a live defense loop.
Request arrives -> Request is logged -> Behavior is measured -> Anomaly is detected -> Response is applied
Why this matters: fixed limits fail in real systems. Day traffic and night traffic are different. Normal today may look abnormal tomorrow.
So instead of hardcoding guesses, this system learns from recent traffic and makes decisions from evidence.
I will show how one log line becomes a usable event, field by field, and why that data quality decides whether your detector is smart or blind.
GitHub repo: Codebase
02 - From Log Line to Signal
Every defense system begins with one question: what exactly happened?
In this project, the answer comes from Nginx JSON access logs.
Each request is written as one structured line in:
/var/log/nginx/hng-access.log
A typical line includes:
- source_ip: who sent the request
- timestamp: when it happened
- method: how it was sent (GET, POST, etc.)
- path: what endpoint was requested
- status: how the server responded (200, 404, 500, ...)
- response_size: how much data was returned
This is important because attackers leave patterns in these fields:
- same IP sending too many requests
- repeated hits on sensitive paths
- unusual spikes in 4xx/5xx errors
The daemon continuously tails this file, line by line, in real time.
For each line, it does three clear steps:
- Parse the JSON safely.
- Validate required fields.
- Normalize values into one clean event object used by the detector.
In the implementation, this is handled by NginxLogMonitor.parse_line() and NginxLogMonitor.follow() in detector/monitor.py, where malformed JSON is ignored safely and valid lines are converted to typed events.
If a line is malformed, it is skipped and logged, not allowed to crash the daemon.
That keeps the detector resilient during noisy production traffic.
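As a rough illustration of that parse-validate-normalize step, here is a minimal sketch. The field names, timestamp format, and helper name are assumptions for illustration; the real NginxLogMonitor.parse_line() in detector/monitor.py may differ in detail.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Hypothetical required fields, mirroring the log fields described above.
REQUIRED_FIELDS = {"source_ip", "timestamp", "method", "path", "status", "response_size"}

def parse_line(raw: str) -> Optional[dict]:
    """Turn one JSON log line into a normalized event, or None if malformed."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # skip (and log) malformed lines instead of crashing the daemon
    if not REQUIRED_FIELDS.issubset(record):
        return None  # missing fields: not a trusted input
    return {
        "source_ip": record["source_ip"],
        "timestamp": datetime.fromisoformat(record["timestamp"]).astimezone(timezone.utc),
        "method": record["method"].upper(),
        "path": record["path"],
        "status": int(record["status"]),
        "response_size": int(record["response_size"]),
    }

# Example line (synthetic, ISO timestamp assumed for simplicity):
sample = ('{"source_ip": "203.0.113.7", "timestamp": "2024-05-01T12:00:00+00:00", '
          '"method": "GET", "path": "/login", "status": 401, "response_size": 512}')
print(parse_line(sample))
```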
Detector daemon processing live log lines

At this stage, the system is not "guessing attacks" yet.
It is building trusted inputs. And trusted inputs are everything.
Because if your input data is wrong, your baseline is wrong.
If your baseline is wrong, your alerting is noise.
Next, we move from clean events to live behavior modeling using 60-second deque windows and a rolling 30-minute baseline.
03 - Teaching the System What "Normal" Means
Now the detector can read traffic.
The next question is: how does it know what is normal?
I used two ideas together:
- a 60-second sliding window for live speed
- a 30-minute rolling baseline for memory
Part 1: Sliding Window (what is happening right now)
Think of a moving glass box that always holds only the last 60 seconds of requests.
- Every new request timestamp is added.
- Any timestamp older than 60 seconds is removed.
- The number left in the box is the current traffic pressure.
I keep two window views:
- global window: all requests together
- per-IP windows: each IP gets its own 60-second queue
This gives immediate answers:
- global req/s
- top talkers by IP
- who is suddenly noisy
Why deque?
Because it adds and removes from both ends in constant time, which is perfect for real-time eviction.
You can see this directly in detector/detector.py, where SlidingWindowEngine keeps global_window and ip_windows as deque objects and evicts old timestamps inside _evict_old().
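To make the eviction idea concrete, here is a minimal sketch of a 60-second window built on deques. It only illustrates the mechanism; the real SlidingWindowEngine carries more state and configuration.

```python
import time
from collections import defaultdict, deque

class SlidingWindow:
    """Keep only the last `window_seconds` of request timestamps, globally and per IP."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.global_window = deque()          # timestamps of all requests
        self.ip_windows = defaultdict(deque)  # source IP -> its own timestamp queue

    def _evict_old(self, window: deque, now: float) -> None:
        # Pop from the left while the oldest timestamp has fallen outside the window.
        while window and now - window[0] > self.window_seconds:
            window.popleft()

    def record(self, source_ip: str) -> None:
        now = time.monotonic()
        self.global_window.append(now)
        self.ip_windows[source_ip].append(now)

    def global_rate(self) -> float:
        now = time.monotonic()
        self._evict_old(self.global_window, now)
        return len(self.global_window) / self.window_seconds  # requests per second

    def ip_rate(self, source_ip: str) -> float:
        now = time.monotonic()
        window = self.ip_windows[source_ip]
        self._evict_old(window, now)
        return len(window) / self.window_seconds
```

Both append and popleft on a deque are O(1), which is exactly why the structure fits continuous real-time eviction.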
Part 2: Rolling Baseline (what is usually happening)
A spike is only suspicious if you compare it to history.
So every second, the system stores request counts and keeps only the last 30 minutes (1800 seconds).
Every 60 seconds, it recalculates:
- effective_mean (average normal load)
- effective_stddev (how much normal load fluctuates)
- error_mean (average 4xx/5xx pressure)
It also keeps hourly slots and prefers the current hour once enough data exists.
That prevents midnight traffic patterns from being judged with daytime expectations.
This behavior is implemented in RollingBaselineEngine.recalculate() in detector/baseline.py, where counts are grouped by hour key and the current hour is preferred when min_current_hour_samples is satisfied.
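The sketch below shows the shape of that recalculation under stated assumptions: one count per second, a 1800-second horizon, hour-keyed samples, and a small stddev floor. The real RollingBaselineEngine.recalculate() in detector/baseline.py may structure this differently.

```python
import statistics
from collections import deque
from datetime import datetime

class RollingBaseline:
    def __init__(self, horizon_seconds: int = 1800, min_current_hour_samples: int = 120):
        # One (hour_key, request_count) sample per second; maxlen keeps only the last 30 minutes.
        self.samples = deque(maxlen=horizon_seconds)
        self.min_current_hour_samples = min_current_hour_samples
        self.effective_mean = 0.0
        self.effective_stddev = 1.0  # floor avoids divide-by-zero in later z-scores

    def add_second(self, request_count: int, when: datetime) -> None:
        self.samples.append((when.hour, request_count))

    def recalculate(self, now: datetime) -> None:
        counts = [c for _, c in self.samples]
        # Prefer samples from the current hour once there are enough of them,
        # so night traffic is not judged against daytime expectations.
        current_hour = [c for h, c in self.samples if h == now.hour]
        if len(current_hour) >= self.min_current_hour_samples:
            counts = current_hour
        if len(counts) >= 2:
            self.effective_mean = statistics.fmean(counts)
            self.effective_stddev = max(statistics.stdev(counts), 0.1)
```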
Baseline trend over time with hourly difference

Why this pair works
- Sliding window = fast eyes (present moment)
- Baseline = memory (recent behavior)
Together, they answer the key security question:
Is this traffic high, or just high compared to what is normal right now?
Next, I will show how the detector makes a final anomaly decision using z-score, multiplier checks, and error-surge tightening.
04 - How the Detector Decides "This Is an Attack"
At this point, the detector has two things:
- live request rate (from the sliding window)
- normal behavior reference (from the baseline)
Now it must decide: alert or ignore.
I use two decision checks.
If either one fires, the traffic is marked anomalous.
Rule 1: Z-score check
Z-score asks: how far is current traffic from the average, measured in standard deviations?
In simple terms:
- small z-score = normal fluctuation
- high z-score = unusual surge
Default trigger:
z-score > 3.0
In code, this check is in AnomalyEvaluator.evaluate() (detector/detector.py) where global_z and per-IP ip_z are computed against effective_mean and effective_stddev.
Rule 2: Multiplier check
This one asks: is current traffic many times bigger than normal mean?
Default trigger:
current_rate > 5 x baseline_mean
The same method (AnomalyEvaluator.evaluate()) also applies the multiplier branch when z-score is not the first trigger.
Why keep both rules?
- z-score catches statistical outliers
- multiplier catches blunt spikes even when variance is noisy
So if one misses, the other often catches.
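Put together, the decision is small. This sketch uses the default thresholds described above (z > 3.0, rate > 5 x mean); the Verdict type and function signature are illustrative, not the exact shape of AnomalyEvaluator.evaluate().

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    anomalous: bool
    reason: str = ""

def evaluate(current_rate: float, mean: float, stddev: float,
             z_threshold: float = 3.0, multiplier_threshold: float = 5.0) -> Verdict:
    # Rule 1: statistical outlier relative to recent variance.
    z = (current_rate - mean) / stddev if stddev > 0 else 0.0
    if z > z_threshold:
        return Verdict(True, f"z-score {z:.1f} > {z_threshold}")
    # Rule 2: blunt spike many times above the baseline mean.
    if mean > 0 and current_rate > multiplier_threshold * mean:
        return Verdict(True, f"rate {current_rate:.1f} > {multiplier_threshold} x mean {mean:.1f}")
    return Verdict(False)

# Example: baseline mean 2 req/s with stddev 0.5, current burst at 12 req/s.
print(evaluate(12.0, 2.0, 0.5))  # fires on the z-score rule first
```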
Global vs Per-IP decisions
The same logic is applied in two scopes:
- global scope: total traffic rate
- per-IP scope: each source IP rate
This matters because some attacks are distributed (global spike), while others are noisy from one source (single-IP flood).
Error-surge tightening (adaptive sensitivity)
Attack traffic often produces lots of 4xx/5xx responses.
So if an IP's error rate is much higher than normal, the detector becomes stricter for that IP.
Default logic:
- if an IP's 4xx/5xx rate > 3 x baseline error rate, use tighter thresholds
This is implemented by comparing per-IP error window rate with baseline error_mean, then switching to tighter thresholds (tightened_z_threshold, tightened_multiplier_threshold) for that source.
This helps catch abusive behavior earlier, even if total request rate is not yet extreme.
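A minimal sketch of that threshold switch is below. The tightened values and the surge multiplier are illustrative assumptions; only the names (tightened_z_threshold, tightened_multiplier_threshold) follow the description above.

```python
def thresholds_for_ip(ip_error_rate: float, baseline_error_mean: float,
                      normal: tuple = (3.0, 5.0), tightened: tuple = (2.0, 3.0),
                      error_surge_multiplier: float = 3.0) -> tuple:
    """Return (z_threshold, multiplier_threshold) to use for one source IP."""
    if baseline_error_mean > 0 and ip_error_rate > error_surge_multiplier * baseline_error_mean:
        return tightened  # this IP is producing an error surge: alert and ban earlier
    return normal

print(thresholds_for_ip(ip_error_rate=4.5, baseline_error_mean=1.0))  # -> (2.0, 3.0)
```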
Cooldown to avoid alert spam
Without control, the same spike could trigger repeated alerts every second.
So I added a short cooldown window between repeated alerts for the same scope/IP.
That cooldown is enforced in detector/detector.py using monotonic timestamps (_last_global_alert_at and _last_ip_alert_at).
That keeps alerts actionable instead of noisy.
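The cooldown logic itself is a few lines. This sketch assumes a single 60-second cooldown and one dictionary keyed by scope ("global" or a source IP); the real daemon tracks _last_global_alert_at and _last_ip_alert_at separately.

```python
import time

class AlertCooldown:
    def __init__(self, cooldown_seconds: float = 60.0):
        self.cooldown_seconds = cooldown_seconds
        self._last_alert_at = {}  # scope key -> monotonic timestamp of last alert

    def should_alert(self, key: str) -> bool:
        now = time.monotonic()
        last = self._last_alert_at.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown window: suppress the repeat alert
        self._last_alert_at[key] = now
        return True
```

Monotonic time matters here: it cannot jump backwards if the system clock is adjusted, so the cooldown never misbehaves mid-incident.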
What this looked like in my real test
During burst testing, I observed both:
- scope=global anomaly alerts
- scope=ip anomaly alerts with ban_candidate
I also saw the detector switch between:
- z-score trigger
- rate-multiplier trigger
That confirmed the decision engine is not relying on one fragile rule.
Next, I will show the response side: iptables blocking, Slack notifications, and timed unban logic.
05 - When the Detector Acts (Not Just Watches)
Detection alone is not defense.
A real system must respond.
In this project, once an anomaly is confirmed, the flow becomes:
Anomaly -> Action -> Alert -> Audit
1) Per-IP anomaly response
If one IP behaves abnormally, the detector:
- inserts an iptables DROP rule for that source IP
- sends a Slack ban alert
- writes an audit log entry
- registers the ban in the unban scheduler
The action chain is orchestrated in run() inside detector/main.py, calling IptablesBlocker.block_ip() (detector/blocker.py), then UnbanScheduler.register_ban() (detector/unbanner.py), then notifier and audit handlers.
This is the direct containment path.
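At its core, the blocking step is a single iptables rule. The sketch below shows only that call (it needs root privileges); the real IptablesBlocker.block_ip() in detector/blocker.py presumably adds safeguards such as deduplication and protected-range checks.

```python
import subprocess

def block_ip(source_ip: str) -> None:
    # Insert a DROP rule at the top of the INPUT chain for this source IP.
    subprocess.run(["iptables", "-I", "INPUT", "-s", source_ip, "-j", "DROP"], check=True)

def unblock_ip(source_ip: str) -> None:
    # Remove the matching DROP rule when the ban expires.
    subprocess.run(["iptables", "-D", "INPUT", "-s", source_ip, "-j", "DROP"], check=True)
```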
iptables output showing blocked IP

2) Global anomaly response
If the whole traffic pattern spikes abnormally, but no single attacker is isolated, the detector:
- sends Slack global alert only
- writes audit event
- does not auto-ban all traffic
This avoids over-blocking during distributed spikes.
3) Slack notifications (operational visibility)
Every critical event is pushed to Slack with context:
- condition triggered
- current rate
- baseline value
- timestamp
- ban duration (for IP bans)
Those Slack payloads are assembled in detector/notifier.py via send_global_alert(), send_ban_alert(), and send_unban_alert().
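For flavour, here is a minimal sketch of a Slack incoming-webhook post carrying that context. The webhook URL and message layout are assumptions; detector/notifier.py builds its own payloads.

```python
import json
import urllib.request

def send_ban_alert(webhook_url: str, ip: str, condition: str,
                   current_rate: float, baseline_mean: float, ban_seconds: int) -> None:
    text = (f":no_entry: Banned {ip} for {ban_seconds}s\n"
            f"condition: {condition} | rate: {current_rate:.1f} req/s | "
            f"baseline: {baseline_mean:.1f} req/s")
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(webhook_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)  # fire-and-forget; failures are logged in practice
```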
Ban notification in Slack
So the security response is visible in real time, not hidden in terminal logs.
4) Auto-unban backoff policy
A banned IP is not always permanent.
The scheduler uses progressive durations:
- 1st offense: 10 minutes
- 2nd offense: 30 minutes
- 3rd offense: 2 hours
- later offenses: permanent
This balances two goals:
- recover from possible false positives
- become stricter with repeat offenders
The backoff sequence is configured in detector/config.yaml (blocking.ban_durations_seconds) and applied by UnbanScheduler.register_ban() in detector/unbanner.py.
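The lookup itself reduces to a small function. The durations below mirror the list above (10 min, 30 min, 2 h, then permanent); the authoritative values live in the config file.

```python
from typing import Optional

BAN_DURATIONS_SECONDS = [600, 1800, 7200]  # 1st, 2nd, 3rd offense; beyond that: permanent

def ban_duration(offense_count: int) -> Optional[int]:
    """Return the ban length in seconds, or None for a permanent ban."""
    if offense_count <= len(BAN_DURATIONS_SECONDS):
        return BAN_DURATIONS_SECONDS[offense_count - 1]
    return None  # permanent: the unban scheduler never releases this IP

print(ban_duration(1), ban_duration(4))  # 600 None
```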
5) Audit trail (compliance + investigation)
For every key action, a structured line is written:
[timestamp] ACTION ip | condition | rate | baseline | duration
This exact format is written by AuditLogger.write() in detector/audit.py, and called from detector/main.py for BASELINE_RECALC, GLOBAL_ALERT, BAN, and UNBAN.
Examples include: BASELINE_RECALC, GLOBAL_ALERT, BAN, and UNBAN.
This made troubleshooting much easier during testing because I could reconstruct exactly what happened and when.
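A sketch of writing that exact line format is below; the helper name and argument order are assumptions, while the format string follows the layout shown above.

```python
from datetime import datetime, timezone

def write_audit(path: str, action: str, ip: str, condition: str,
                rate: float, baseline: float, duration: str) -> None:
    # [timestamp] ACTION ip | condition | rate | baseline | duration
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"[{stamp}] {action} {ip} | {condition} | {rate:.1f} | {baseline:.1f} | {duration}\n"
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line)

write_audit("audit.log", "BAN", "203.0.113.7", "ip_rate > 5x baseline", 42.0, 2.1, "600s")
```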
Audit log with BAN, UNBAN, and BASELINE_RECALC events

What I observed in live testing
During burst tests:
- anomaly detection trigger on both global and per-IP scopes
- iptables DROP rules added for abusive traffic
- matching structured audit entries written immediately
- detector continuing to run and recalculate baseline after enforcement
That confirmed the project had moved from "monitoring script" to a real response loop.
Next, I will cover the dashboard and the testing strategy.
06 - Dashboard and Testing
At this stage, the core detector could already identify and respond to bad traffic.
1) Live metrics dashboard
I added a lightweight dashboard that refreshes every 3 seconds and shows:
- current global request rate
- top 10 source IPs
- currently banned IPs
- effective baseline mean/stddev
- CPU and memory usage
- detector uptime
This is implemented in detector/dashboard.py and started from detector/main.py when dashboard.enabled is true in detector/config.yaml.
The dashboard exposes:
- / for the human-readable UI
- /metrics for structured JSON output
That gave me two benefits:
- instant operator visibility during attacks
- clear proof of live behavior
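To give a feel for the /metrics side, here is a minimal sketch of a JSON endpoint using only the standard library. The field names are assumptions based on the metrics listed above; detector/dashboard.py serves the real UI and payload.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative snapshot; in the real daemon these values come from live engines.
METRICS = {
    "global_rps": 2.4,
    "baseline_mean": 2.1,
    "baseline_stddev": 0.6,
    "top_ips": {"203.0.113.7": 1.2},
    "banned_ips": [],
    "uptime_seconds": 5400,
}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = json.dumps(METRICS).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), MetricsHandler).serve_forever()
```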
2) Testing strategy:
I tested in layers, not all at once.
- Ingestion test: Confirm log parsing with real Nginx traffic.
curl -I http://<SERVER_IP>/
Expected: a new parsed event appears in detector output.
- Window test: Verify counts grow and decay correctly over 60 seconds.
for i in $(seq 1 40); do curl -s -o /dev/null -I http://<SERVER_IP>/; done
Expected: global_rps and per-IP counts increase, then decay after traffic stops.
- Baseline test: Verify periodic recalculation and floor behavior.
tail -f detection-engine/detector/audit.log
Expected: BASELINE_RECALC entries every configured interval with mean/stddev values.
- Detection test: Trigger both z-score and multiplier conditions.
for i in $(seq 1 300); do curl -s -o /dev/null -I http://<SERVER_IP>/; done
Expected: anomaly events for both global and per-IP scopes.
- Response test: Confirm ban, unban, Slack alert, and audit logging.
sudo iptables -L -n | rg DROP
tail -n 30 detection-engine/detector/audit.log
Expected: blocked IP rule appears, audit logs show BAN/UNBAN, and Slack receives matching alerts.
- Dashboard test: Confirm metrics update every <= 3 seconds.
curl http://<METRICS_SUBDOMAIN>/metrics
Expected: live JSON metrics update continuously (global_rps, baseline, banned IPs, uptime).
This step-by-step test order made debugging faster because each stage had a single purpose.
3) Practical lessons that mattered most
- Separate detection from action. Decision logic stayed cleaner when firewall and notifier code lived in dedicated modules.
- Structured logs save time. During debugging, audit lines were faster to reason about than raw terminal noise.
- Baseline quality is everything. If baseline windows are weak, detector confidence collapses.
- Networking details matter. Container-to-host routing and bind addresses can break dashboards even when app logic is correct.
- Build first, polish later. Drafting documentation in batches prevented context loss while implementation evolved.
One caveat from real testing: I once got locked out of my own server because the detector correctly flagged and blocked the source IP I was using for SSH.
That incident taught me to always protect administrator IP ranges in blocking.protected_cidrs, keep a recovery path (AWS Console/SSM), and run the detector under systemd so it can recover safely after interruptions.
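The guard against that lockout is simple to express. This sketch uses the standard ipaddress module with placeholder CIDRs; the real ranges come from blocking.protected_cidrs in detector/config.yaml.

```python
import ipaddress

# Placeholder administrator ranges; replace with the values from blocking.protected_cidrs.
PROTECTED_CIDRS = [ipaddress.ip_network(c) for c in ("203.0.113.0/24", "10.0.0.0/8")]

def is_protected(source_ip: str) -> bool:
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in PROTECTED_CIDRS)

# The blocking path should skip the iptables rule (and only alert) when this returns True.
print(is_protected("10.0.0.5"), is_protected("198.51.100.9"))  # True False
```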
Closing note
This project started as "watch traffic" and ended as a full loop:
observe -> learn -> detect -> respond -> explain
And that last part, explain, is what makes the system defensible in both engineering reviews and security operations.



