One quiet evening, a cloud server looked normal on the surface.
Users logged in, files moved, and everything felt stable.
But hidden inside the traffic were strange patterns: sudden spikes, repeated requests, and behavior that did not look human.
My task was simple to say, but serious to build: teach the server what "normal" traffic looks like, then react fast when traffic becomes dangerous.
Here is the world of this project, in one clear flow:
- Nginx stands at the gate and receives every request.
- Nextcloud serves the actual application.
- Nginx JSON logs record each request as structured data.
- Detector daemon reads those logs in real time, learns normal behavior, and flags anomalies.
- iptables blocks abusive IPs when needed.
- Slack alerts report important incidents immediately.
- Dashboard shows live health, rates, and bans.
In short: this is not just monitoring. It is a live defense loop.
Request arrives -> Request is logged -> Behavior is measured -> Anomaly is detected -> Response is applied
Why this matters: fixed limits fail in real systems. Day traffic and night traffic are different. Normal today may look abnormal tomorrow.
So instead of hardcoding guesses, this system learns from recent traffic and makes decisions from evidence.
I will show how one log line becomes a usable event, field by field, and why that data quality decides whether your detector is smart or blind.
GitHub repo: Codebase
02 - From Log Line to Signal
Every defense system begins with one question: what exactly happened?
In this project, the answer comes from Nginx JSON access logs.
Each request is written as one structured line in:
/var/log/nginx/hng-access.log
A typical line includes:
- source_ip: who sent the request
- timestamp: when it happened
- method: how it was sent (GET, POST, etc.)
- path: what endpoint was requested
- status: how the server responded (200, 404, 500, ...)
- response_size: how much data was returned
This is important because attackers leave patterns in these fields:
- same IP sending too many requests
- repeated hits on sensitive paths
- unusual spikes in 4xx/5xx errors
The daemon continuously tails this file, line by line, in real time.
For each line, it does three clear steps:
- Parse the JSON safely.
- Validate required fields.
- Normalize values into one clean event object used by the detector.
In the implementation, this is handled by NginxLogMonitor.parse_line() and NginxLogMonitor.follow() in detector/monitor.py, where malformed JSON is ignored safely and valid lines are converted to typed events.
If a line is malformed, it is skipped and logged, not allowed to crash the daemon.
That keeps the detector resilient during noisy production traffic.
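As a rough illustration of that parse-validate-normalize step, here is a minimal sketch. The field names, timestamp format, and helper name are assumptions for illustration; the real NginxLogMonitor.parse_line() in detector/monitor.py may differ in detail.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Hypothetical required fields, mirroring the log fields described above.
REQUIRED_FIELDS = {"source_ip", "timestamp", "method", "path", "status", "response_size"}

def parse_line(raw: str) -> Optional[dict]:
    """Turn one JSON log line into a normalized event, or None if malformed."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # skip (and log) malformed lines instead of crashing the daemon
    if not REQUIRED_FIELDS.issubset(record):
        return None  # missing fields: not a trusted input
    return {
        "source_ip": record["source_ip"],
        "timestamp": datetime.fromisoformat(record["timestamp"]).astimezone(timezone.utc),
        "method": record["method"].upper(),
        "path": record["path"],
        "status": int(record["status"]),
        "response_size": int(record["response_size"]),
    }

# Example line (synthetic, ISO timestamp assumed for simplicity):
sample = ('{"source_ip": "203.0.113.7", "timestamp": "2024-05-01T12:00:00+00:00", '
          '"method": "GET", "path": "/login", "status": 401, "response_size": 512}')
print(parse_line(sample))
```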
Detector daemon processing live log lines

At this stage, the system is not "guessing attacks" yet.
It is building trusted inputs. And trusted inputs are everything.
Because if your input data is wrong, your baseline is wrong.
If your baseline is wrong, your alerting is noise.
Next, we move from clean events to live behavior modeling using 60-second deque windows and a rolling 30-minute baseline.
03 - Teaching the System What "Normal" Means
Now the detector can read traffic.
The next question is: how does it know what is normal?
I used two ideas together:
- a 60-second sliding window for live speed
- a 30-minute rolling baseline for memory
Part 1: Sliding Window (what is happening right now)
Think of a moving glass box that always holds only the last 60 seconds of requests.
- Every new request timestamp is added.
- Any timestamp older than 60 seconds is removed.
- The number left in the box is the current traffic pressure.
I keep two window views:
- global window: all requests together
- per-IP windows: each IP gets its own 60-second queue
This gives immediate answers:
- global req/s
- top talkers by IP
- who is suddenly noisy
Why deque?
Because it adds and removes from both ends in constant time, which is perfect for real-time eviction.
You can see this directly in detector/detector.py, where SlidingWindowEngine keeps global_window and ip_windows as deque objects and evicts old timestamps inside _evict_old().
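To make the eviction idea concrete, here is a minimal sketch of a 60-second window built on deques. It only illustrates the mechanism; the real SlidingWindowEngine carries more state and configuration.

```python
import time
from collections import defaultdict, deque

class SlidingWindow:
    """Keep only the last `window_seconds` of request timestamps, globally and per IP."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.global_window = deque()          # timestamps of all requests
        self.ip_windows = defaultdict(deque)  # source IP -> its own timestamp queue

    def _evict_old(self, window: deque, now: float) -> None:
        # Pop from the left while the oldest timestamp has fallen outside the window.
        while window and now - window[0] > self.window_seconds:
            window.popleft()

    def record(self, source_ip: str) -> None:
        now = time.monotonic()
        self.global_window.append(now)
        self.ip_windows[source_ip].append(now)

    def global_rate(self) -> float:
        now = time.monotonic()
        self._evict_old(self.global_window, now)
        return len(self.global_window) / self.window_seconds  # requests per second

    def ip_rate(self, source_ip: str) -> float:
        now = time.monotonic()
        window = self.ip_windows[source_ip]
        self._evict_old(window, now)
        return len(window) / self.window_seconds
```

Both append and popleft on a deque are O(1), which is exactly why the structure fits continuous real-time eviction.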
Part 2: Rolling Baseline (what is usually happening)
A spike is only suspicious if you compare it to history.
So every second, the system stores request counts and keeps only the last 30 minutes (1800 seconds).
Every 60 seconds, it recalculates:
- effective_mean (average normal load)
- effective_stddev (how much normal load fluctuates)
- error_mean (average 4xx/5xx pressure)
It also keeps hourly slots and prefers the current hour once enough data exists.
That prevents midnight traffic patterns from being judged with daytime expectations.
This behavior is implemented in RollingBaselineEngine.recalculate() in detector/baseline.py, where counts are grouped by hour key and the current hour is preferred when min_current_hour_samples is satisfied.
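The sketch below shows the shape of that recalculation under stated assumptions: one count per second, a 1800-second horizon, hour-keyed samples, and a small stddev floor. The real RollingBaselineEngine.recalculate() in detector/baseline.py may structure this differently.

```python
import statistics
from collections import deque
from datetime import datetime

class RollingBaseline:
    def __init__(self, horizon_seconds: int = 1800, min_current_hour_samples: int = 120):
        # One (hour_key, request_count) sample per second; maxlen keeps only the last 30 minutes.
        self.samples = deque(maxlen=horizon_seconds)
        self.min_current_hour_samples = min_current_hour_samples
        self.effective_mean = 0.0
        self.effective_stddev = 1.0  # floor avoids divide-by-zero in later z-scores

    def add_second(self, request_count: int, when: datetime) -> None:
        self.samples.append((when.hour, request_count))

    def recalculate(self, now: datetime) -> None:
        counts = [c for _, c in self.samples]
        # Prefer samples from the current hour once there are enough of them,
        # so night traffic is not judged against daytime expectations.
        current_hour = [c for h, c in self.samples if h == now.hour]
        if len(current_hour) >= self.min_current_hour_samples:
            counts = current_hour
        if len(counts) >= 2:
            self.effective_mean = statistics.fmean(counts)
            self.effective_stddev = max(statistics.stdev(counts), 0.1)
```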
Baseline trend over time with hourly difference

Why this pair works
- Sliding window = fast eyes (present moment)
- Baseline = memory (recent behavior)
Together, they answer the key security question:
Is this traffic high, or just high compared to what is normal right now?
Next, I will show how the detector makes a final anomaly decision using z-score, multiplier checks, and error-surge tightening.
04 - How the Detector Decides "This Is an Attack"
At this point, the detector has two things:
- live request rate (from the sliding window)
- normal behavior reference (from the baseline)
Now it must decide: alert or ignore.
I use two decision checks.
If either one fires, the traffic is marked anomalous.
Rule 1: Z-score check
Z-score asks: how far is current traffic from the average, measured in standard deviations?
In simple terms:
- small z-score = normal fluctuation
- high z-score = unusual surge
Default trigger:
z-score > 3.0
In code, this check is in AnomalyEvaluator.evaluate() (detector/detector.py) where global_z and per-IP ip_z are computed against effective_mean and effective_stddev.
Rule 2: Multiplier check
This one asks: is current traffic many times bigger than normal mean?
Default trigger:
current_rate > 5 x baseline_mean
The same method (AnomalyEvaluator.evaluate()) also applies the multiplier branch when z-score is not the first trigger.
Why keep both rules?
- z-score catches statistical outliers
- multiplier catches blunt spikes even when variance is noisy
So if one misses, the other often catches.
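Put together, the decision is small. This sketch uses the default thresholds described above (z > 3.0, rate > 5 x mean); the Verdict type and function signature are illustrative, not the exact shape of AnomalyEvaluator.evaluate().

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    anomalous: bool
    reason: str = ""

def evaluate(current_rate: float, mean: float, stddev: float,
             z_threshold: float = 3.0, multiplier_threshold: float = 5.0) -> Verdict:
    # Rule 1: statistical outlier relative to recent variance.
    z = (current_rate - mean) / stddev if stddev > 0 else 0.0
    if z > z_threshold:
        return Verdict(True, f"z-score {z:.1f} > {z_threshold}")
    # Rule 2: blunt spike many times above the baseline mean.
    if mean > 0 and current_rate > multiplier_threshold * mean:
        return Verdict(True, f"rate {current_rate:.1f} > {multiplier_threshold} x mean {mean:.1f}")
    return Verdict(False)

# Example: baseline mean 2 req/s with stddev 0.5, current burst at 12 req/s.
print(evaluate(12.0, 2.0, 0.5))  # fires on the z-score rule first
```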
Global vs Per-IP decisions
The same logic is applied in two scopes:
- global scope: total traffic rate
- per-IP scope: each source IP rate
This matters because some attacks are distributed (global spike), while others are noisy from one source (single-IP flood).
Error-surge tightening (adaptive sensitivity)
Attack traffic often produces lots of 4xx/5xx responses.
So if an IP's error rate is much higher than normal, the detector becomes stricter for that IP.
Default logic:
- if an IP's 4xx/5xx rate > 3 x baseline error rate, use tighter thresholds
This is implemented by comparing per-IP error window rate with baseline error_mean, then switching to tighter thresholds (tightened_z_threshold, tightened_multiplier_threshold) for that source.
This helps catch abusive behavior earlier, even if total request rate is not yet extreme.
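A minimal sketch of that threshold switch is below. The tightened values and the surge multiplier are illustrative assumptions; only the names (tightened_z_threshold, tightened_multiplier_threshold) follow the description above.

```python
def thresholds_for_ip(ip_error_rate: float, baseline_error_mean: float,
                      normal: tuple = (3.0, 5.0), tightened: tuple = (2.0, 3.0),
                      error_surge_multiplier: float = 3.0) -> tuple:
    """Return (z_threshold, multiplier_threshold) to use for one source IP."""
    if baseline_error_mean > 0 and ip_error_rate > error_surge_multiplier * baseline_error_mean:
        return tightened  # this IP is producing an error surge: alert and ban earlier
    return normal

print(thresholds_for_ip(ip_error_rate=4.5, baseline_error_mean=1.0))  # -> (2.0, 3.0)
```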
Cooldown to avoid alert spam
Without control, the same spike could trigger repeated alerts every second.
So I added a short cooldown window between repeated alerts for the same scope/IP.
That cooldown is enforced in detector/detector.py using monotonic timestamps (_last_global_alert_at and _last_ip_alert_at).
That keeps alerts actionable instead of noisy.
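The cooldown logic itself is a few lines. This sketch assumes a single 60-second cooldown and one dictionary keyed by scope ("global" or a source IP); the real daemon tracks _last_global_alert_at and _last_ip_alert_at separately.

```python
import time

class AlertCooldown:
    def __init__(self, cooldown_seconds: float = 60.0):
        self.cooldown_seconds = cooldown_seconds
        self._last_alert_at = {}  # scope key -> monotonic timestamp of last alert

    def should_alert(self, key: str) -> bool:
        now = time.monotonic()
        last = self._last_alert_at.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown window: suppress the repeat alert
        self._last_alert_at[key] = now
        return True
```

Monotonic time matters here: it cannot jump backwards if the system clock is adjusted, so the cooldown never misbehaves mid-incident.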
What this looked like in my real test
During burst testing, I observed both:
- scope=global anomaly alerts
- scope=ip anomaly alerts with ban_candidate
I also saw the detector switch between:
- z-score trigger
- rate-multiplier trigger
That confirmed the decision engine is not relying on one fragile rule.
Next, I will show the response side: iptables blocking, Slack notifications, and timed unban logic.
05 - When the Detector Acts (Not Just Watches)
Detection alone is not defense.
A real system must respond.
In this project, once an anomaly is confirmed, the flow becomes:
Anomaly -> Action -> Alert -> Audit
1) Per-IP anomaly response
If one IP behaves abnormally, the detector:
- inserts an iptables DROP rule for that source IP
- sends a Slack ban alert
- writes an audit log entry
- registers the ban in the unban scheduler
The action chain is orchestrated in run() inside detector/main.py, calling IptablesBlocker.block_ip() (detector/blocker.py), then UnbanScheduler.register_ban() (detector/unbanner.py), then notifier and audit handlers.
This is the direct containment path.
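At its core, the blocking step is a single iptables rule. The sketch below shows only that call (it needs root privileges); the real IptablesBlocker.block_ip() in detector/blocker.py presumably adds safeguards such as deduplication and protected-range checks.

```python
import subprocess

def block_ip(source_ip: str) -> None:
    # Insert a DROP rule at the top of the INPUT chain for this source IP.
    subprocess.run(["iptables", "-I", "INPUT", "-s", source_ip, "-j", "DROP"], check=True)

def unblock_ip(source_ip: str) -> None:
    # Remove the matching DROP rule when the ban expires.
    subprocess.run(["iptables", "-D", "INPUT", "-s", source_ip, "-j", "DROP"], check=True)
```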
iptables output showing blocked IP

2) Global anomaly response
If the whole traffic pattern spikes abnormally, but no single attacker is isolated, the detector:
- sends Slack global alert only
- writes audit event
- does not auto-ban all traffic
This avoids over-blocking during distributed spikes.
3) Slack notifications (operational visibility)
Every critical event is pushed to Slack with context:
- condition triggered
- current rate
- baseline value
- timestamp
- ban duration (for IP bans)
Those Slack payloads are assembled in detector/notifier.py via send_global_alert(), send_ban_alert(), and send_unban_alert().
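For flavour, here is a minimal sketch of a Slack incoming-webhook post carrying that context. The webhook URL and message layout are assumptions; detector/notifier.py builds its own payloads.

```python
import json
import urllib.request

def send_ban_alert(webhook_url: str, ip: str, condition: str,
                   current_rate: float, baseline_mean: float, ban_seconds: int) -> None:
    text = (f":no_entry: Banned {ip} for {ban_seconds}s\n"
            f"condition: {condition} | rate: {current_rate:.1f} req/s | "
            f"baseline: {baseline_mean:.1f} req/s")
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(webhook_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)  # fire-and-forget; failures are logged in practice
```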
Ban notification in Slack
So the security response is visible in real time, not hidden in terminal logs.
4) Auto-unban backoff policy
A banned IP is not always permanent.
The scheduler uses progressive durations:
- 1st offense: 10 minutes
- 2nd offense: 30 minutes
- 3rd offense: 2 hours
- later offenses: permanent
This balances two goals:
- recover from possible false positives
- become stricter with repeat offenders
The backoff sequence is configured in detector/config.yaml (blocking.ban_durations_seconds) and applied by UnbanScheduler.register_ban() in detector/unbanner.py.
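The lookup itself reduces to a small function. The durations below mirror the list above (10 min, 30 min, 2 h, then permanent); the authoritative values live in the config file.

```python
from typing import Optional

BAN_DURATIONS_SECONDS = [600, 1800, 7200]  # 1st, 2nd, 3rd offense; beyond that: permanent

def ban_duration(offense_count: int) -> Optional[int]:
    """Return the ban length in seconds, or None for a permanent ban."""
    if offense_count <= len(BAN_DURATIONS_SECONDS):
        return BAN_DURATIONS_SECONDS[offense_count - 1]
    return None  # permanent: the unban scheduler never releases this IP

print(ban_duration(1), ban_duration(4))  # 600 None
```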
5) Audit trail (compliance + investigation)
For every key action, a structured line is written:
[timestamp] ACTION ip | condition | rate | baseline | duration
This exact format is written by AuditLogger.write() in detector/audit.py, and called from detector/main.py for BASELINE_RECALC, GLOBAL_ALERT, BAN, and UNBAN.
Examples include: BASELINE_RECALC, GLOBAL_ALERT, BAN, and UNBAN.
This made troubleshooting much easier during testing because I could reconstruct exactly what happened and when.
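A sketch of writing that exact line format is below; the helper name and argument order are assumptions, while the format string follows the layout shown above.

```python
from datetime import datetime, timezone

def write_audit(path: str, action: str, ip: str, condition: str,
                rate: float, baseline: float, duration: str) -> None:
    # [timestamp] ACTION ip | condition | rate | baseline | duration
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"[{stamp}] {action} {ip} | {condition} | {rate:.1f} | {baseline:.1f} | {duration}\n"
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line)

write_audit("audit.log", "BAN", "203.0.113.7", "ip_rate > 5x baseline", 42.0, 2.1, "600s")
```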
Audit log with BAN, UNBAN, and BASELINE_RECALC events

What I observed in live testing
During burst tests:
- anomaly detection trigger on both global and per-IP scopes
- iptables DROP rules added for abusive traffic
- matching structured audit entries written immediately
- detector continuing to run and recalculate baseline after enforcement
That confirmed the project had moved from "monitoring script" to a real response loop.
Next, I will cover the dashboard and the testing strategy.
06 - Dashboard and Testing
At this stage, the core detector could already identify and respond to bad traffic.
1) Live metrics dashboard
I added a lightweight dashboard that refreshes every 3 seconds and shows:
- current global request rate
- top 10 source IPs
- currently banned IPs
- effective baseline mean/stddev
- CPU and memory usage
- detector uptime
This is implemented in detector/dashboard.py and started from detector/main.py when dashboard.enabled is true in detector/config.yaml.
The dashboard exposes:
- / for the human-readable UI
- /metrics for structured JSON output
That gave me two benefits:
- instant operator visibility during attacks
- clear proof of live behavior
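To give a feel for the /metrics side, here is a minimal sketch of a JSON endpoint using only the standard library. The field names are assumptions based on the metrics listed above; detector/dashboard.py serves the real UI and payload.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative snapshot; in the real daemon these values come from live engines.
METRICS = {
    "global_rps": 2.4,
    "baseline_mean": 2.1,
    "baseline_stddev": 0.6,
    "top_ips": {"203.0.113.7": 1.2},
    "banned_ips": [],
    "uptime_seconds": 5400,
}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = json.dumps(METRICS).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), MetricsHandler).serve_forever()
```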
2) Testing strategy:
I tested in layers, not all at once.
- Ingestion test: Confirm log parsing with real Nginx traffic.
curl -I http://<SERVER_IP>/
Expected: a new parsed event appears in detector output.
- Window test: Verify counts grow and decay correctly over 60 seconds.
for i in $(seq 1 40); do curl -s -o /dev/null -I http://<SERVER_IP>/; done
Expected: global_rps and per-IP counts increase, then decay after traffic stops.
- Baseline test: Verify periodic recalculation and floor behavior.
tail -f detection-engine/detector/audit.log
Expected: BASELINE_RECALC entries every configured interval with mean/stddev values.
- Detection test: Trigger both z-score and multiplier conditions.
for i in $(seq 1 300); do curl -s -o /dev/null -I http://<SERVER_IP>/; done
Expected: anomaly events for both global and per-IP scopes.
- Response test: Confirm ban, unban, Slack alert, and audit logging.
sudo iptables -L -n | rg DROP
tail -n 30 detection-engine/detector/audit.log
Expected: blocked IP rule appears, audit logs show BAN/UNBAN, and Slack receives matching alerts.
- Dashboard test: Confirm metrics update every <= 3 seconds.
curl http://<METRICS_SUBDOMAIN>/metrics
Expected: live JSON metrics update continuously (global_rps, baseline, banned IPs, uptime).
This step-by-step test order made debugging faster because each stage had a single purpose.
3) Practical lessons that mattered most
- Separate detection from action. Decision logic stayed cleaner when firewall and notifier code lived in dedicated modules.
- Structured logs save time. During debugging, audit lines were faster to reason about than raw terminal noise.
- Baseline quality is everything. If baseline windows are weak, detector confidence collapses.
- Networking details matter. Container-to-host routing and bind addresses can break dashboards even when app logic is correct.
- Build first, polish later. Drafting documentation in batches prevented context loss while implementation evolved.
One caveat from real testing: I once got locked out of my own server because the detector correctly flagged and blocked the source IP I was using for SSH.
That incident taught me to always protect administrator IP ranges in blocking.protected_cidrs, keep a recovery path (AWS Console/SSM), and run the detector under systemd so it can recover safely after interruptions.
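The guard against that lockout is simple to express. This sketch uses the standard ipaddress module with placeholder CIDRs; the real ranges come from blocking.protected_cidrs in detector/config.yaml.

```python
import ipaddress

# Placeholder administrator ranges; replace with the values from blocking.protected_cidrs.
PROTECTED_CIDRS = [ipaddress.ip_network(c) for c in ("203.0.113.0/24", "10.0.0.0/8")]

def is_protected(source_ip: str) -> bool:
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in PROTECTED_CIDRS)

# The blocking path should skip the iptables rule (and only alert) when this returns True.
print(is_protected("10.0.0.5"), is_protected("198.51.100.9"))  # True False
```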
Closing note
This project started as "watch traffic" and ended as a full loop:
observe -> learn -> detect -> respond -> explain
And that last part, explain, is what makes the system defensible in both engineering reviews and security operations.



