DEV Community

Charles Givre

Posted on • Originally published at gtkcyber.com

How Anomaly Detection Actually Works in Security Operations

Most vendors describe anomaly detection the same way: "our system learns what normal looks like and alerts on deviations." That description is technically accurate and practically useless. It doesn't tell you which deviations get flagged, which ones don't, or why your model fires every Monday morning when the backup job runs.

Understanding what's actually happening mathematically changes how you tune, interpret, and trust these systems.

What "Anomaly" Actually Means

In statistical terms, an anomaly is a data point that is unlikely under the distribution of normal data. That definition is deceptively simple: what makes something unlikely depends entirely on how you model normal.

Three classes of models dominate security applications:

  • Statistical models: Fit a distribution to your data (Gaussian, Poisson). Flag points that fall beyond a threshold, typically 3 standard deviations from the mean. Fast and interpretable, but fragile when data isn't actually Gaussian. Login counts at 3am are not Gaussian.
  • Isolation-based models: Build random decision trees that split features. Points that are isolated quickly (short average path length across trees) are anomalies. IsolationForest in scikit-learn implements this. Handles high-dimensional feature spaces without assuming a distribution.
  • Density-based models: Flag points in low-density regions. DBSCAN labels points as noise if they have fewer than min_samples neighbors within eps distance. Captures non-spherical clusters but requires careful tuning of both parameters.

Each approach has a different definition of "unusual." Picking the wrong one for your data structure is a common reason anomaly detection fails in production.
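To make the statistical approach concrete, here is a minimal sketch on synthetic data: a 3-standard-deviation z-score check over hourly login counts. The Poisson data and the burst value are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic hourly login counts: mostly quiet, plus one burst of 40
counts = np.concatenate([rng.poisson(5, 200), [40]])

mean, std = counts.mean(), counts.std()
z_scores = (counts - mean) / std

# Flag anything more than 3 standard deviations from the mean
flags = np.abs(z_scores) > 3
print(counts[flags])   # → flags the 40-login burst
```

Note that the burst itself inflates the mean and standard deviation it is judged against, which is one reason this method degrades as the anomaly rate rises.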

Detecting Anomalies in Auth Logs

Authentication data is a natural fit. Users log in at predictable times, from predictable locations, and fail at predictable rates. Build features per user per time bucket, then score new observations against the learned baseline.

A useful feature set for per-user, per-hour auth anomaly detection:

  • Login hour (0-23)
  • Day of week
  • Source IP geolocation (country or ASN)
  • Failed login count in the past hour (Windows Security Event ID 4625)
  • Time since last successful login, in hours (Event ID 4624)
  • Distinct source IPs in the past 24 hours
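The feature-extraction step can be sketched with a pandas groupby. The raw column names (`user`, `timestamp`, `event_id`, `source_ip`) are assumptions about your log schema, and for brevity `distinct_ips` is computed per hour bucket here rather than over a rolling 24-hour window.

```python
import pandas as pd

# Hypothetical raw auth events; column names are assumptions
events = pd.DataFrame({
    'user': ['alice', 'alice', 'bob', 'alice'],
    'timestamp': pd.to_datetime(['2024-05-06 09:02', '2024-05-06 09:40',
                                 '2024-05-06 23:15', '2024-05-07 03:05']),
    'event_id': [4624, 4625, 4624, 4624],   # 4624 = success, 4625 = failure
    'source_ip': ['10.0.0.5', '10.0.0.5', '10.1.1.9', '203.0.113.7'],
})

# Bucket events into (user, hour) groups
events['hour_bucket'] = events['timestamp'].dt.floor('h')

per_bucket = (events.groupby(['user', 'hour_bucket'])
    .agg(failed_count=('event_id', lambda s: (s == 4625).sum()),
         distinct_ips=('source_ip', 'nunique'))
    .reset_index())
per_bucket['hour'] = per_bucket['hour_bucket'].dt.hour
per_bucket['day_of_week'] = per_bucket['hour_bucket'].dt.dayofweek
```

The `hours_since_last` feature would come from a per-user `diff()` on successful login timestamps, omitted here to keep the sketch short.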

With these features extracted, fitting an IsolationForest looks like this:

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder

# df: one row per (user, hour) with columns as above
le = LabelEncoder()
df['country_encoded'] = le.fit_transform(df['source_country'])

features = ['hour', 'day_of_week', 'country_encoded',
            'failed_count', 'hours_since_last', 'distinct_ips']
X = df[features].fillna(0)

model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
df['anomaly_flag'] = model.fit_predict(X)   # -1 = anomaly, 1 = normal
df['raw_score'] = model.score_samples(X)    # lower (more negative) = more anomalous

anomalies = df[df['anomaly_flag'] == -1].sort_values('raw_score')

contamination=0.01 tells the model to treat the lowest-scoring 1% of data as anomalous. In a healthy environment, genuinely malicious activity is far rarer than 1%, so that setting mostly surfaces benign outliers. Start at 0.005 and adjust based on analyst bandwidth, not on how quiet you want the queue.

The raw_score is more useful than the binary flag. A score of -0.80 deserves a closer look before a score of -0.55. Sort by raw score, not just by the flag.
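One way to put the raw score to work, sketched here on synthetic feature vectors: fit on a baseline window, then score new observations against the frozen model, choosing the alert threshold from the baseline's own score distribution rather than the contamination parameter. The data and percentile are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_baseline = rng.normal(size=(1000, 4))        # stand-in for 30 days of features
X_new = np.vstack([rng.normal(size=(50, 4)),   # a new day's observations
                   [[8.0, 8.0, 8.0, 8.0]]])    # one obvious outlier

model = IsolationForest(n_estimators=200, random_state=42).fit(X_baseline)

scores = model.score_samples(X_new)
# Threshold at the 0.5th percentile of the baseline's own scores
threshold = np.percentile(model.score_samples(X_baseline), 0.5)
alerts = np.where(scores < threshold)[0]       # indices to triage, worst first
```

This separates training from scoring, which also keeps today's anomalies from contaminating the baseline they are judged against.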

Detecting Anomalies in Network Data

Zeek conn.log gives you connection-level telemetry: source/destination IPs, ports, bytes transferred, duration, and protocol. Long-duration connections with low byte counts can indicate C2 channels that hold a session open while moving very little data (MITRE ATT&CK T1071.001).

import pandas as pd
from sklearn.ensemble import IsolationForest
import numpy as np

df = pd.read_csv('conn.log', sep='\t', comment='#',
    names=['ts', 'uid', 'src_ip', 'src_port', 'dst_ip', 'dst_port',
           'proto', 'service', 'duration', 'orig_bytes', 'resp_bytes',
           'conn_state', 'missed_bytes', 'history', 'orig_pkts', 'orig_ip_bytes',
           'resp_pkts', 'resp_ip_bytes', 'tunnel_parents'])

# Zeek writes '-' for missing values, so coerce every numeric field we use
for col in ['duration', 'orig_bytes', 'orig_pkts', 'resp_pkts']:
    df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)

df['bytes_per_second'] = df['orig_bytes'] / (df['duration'] + 1e-9)
df['log_duration'] = np.log1p(df['duration'])

features = ['log_duration', 'bytes_per_second', 'orig_pkts', 'resp_pkts']
X = df[features].fillna(0)

model = IsolationForest(n_estimators=100, contamination=0.005, random_state=42)
df['anomaly'] = model.fit_predict(X)

One important limit: IsolationForest won't reliably catch beaconing if the individual connection parameters look ordinary. A beacon that fires every 60 seconds with 500 bytes transferred each time may sit comfortably inside the "normal" cluster on those features. For periodicity-based detection, apply autocorrelation or a fast Fourier transform on inter-arrival times per destination IP. The anomaly score from the model is a starting point, not the final answer.
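A minimal sketch of the FFT idea, run on one destination's timestamps (in practice you would apply it per destination IP, e.g. via `df.groupby('dst_ip')`): bin connection times into a per-second event-count series and look for the dominant spectral peak. The beacon interval and jitter below are invented for illustration.

```python
import numpy as np

def dominant_period(timestamps, bin_seconds=1.0):
    """Estimate the strongest periodic component in a series of
    connection timestamps via an FFT over binned event counts."""
    t = np.sort(np.asarray(timestamps, dtype=float))
    t = t - t[0]
    counts, _ = np.histogram(
        t, bins=np.arange(0, t[-1] + bin_seconds, bin_seconds))
    spectrum = np.abs(np.fft.rfft(counts - counts.mean()))
    freqs = np.fft.rfftfreq(len(counts), d=bin_seconds)
    peak = freqs[1:][spectrum[1:].argmax()]    # skip the DC bin
    return 1.0 / peak

# Hypothetical 60-second beacon with ~2 seconds of jitter
rng = np.random.default_rng(7)
beacon_ts = np.arange(100) * 60.0 + rng.normal(0, 2, 100)
print(dominant_period(beacon_ts))   # close to 60 seconds
```

A genuinely periodic destination shows one sharp peak; interactive browsing spreads its energy across the spectrum. A simpler variant is to compute the coefficient of variation of inter-arrival times, which approaches zero for a fixed-interval beacon.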

What Anomaly Detection Won't Catch

This is the part that gets underemphasized in vendor presentations.

Anomaly detection finds things that are statistically unusual relative to your training data. It does not:

  • Detect living-off-the-land (LOTL) techniques. An attacker using wmic.exe to enumerate domain controllers (MITRE ATT&CK T1047) looks like an IT admin. If IT admins regularly run wmic, the model will not flag it.
  • Find slow and careful attackers. With contamination set to 0.01, the model flags only the most extreme 1% of observations. An attacker who keeps their behavior inside the other 99% of normal won't appear.
  • Survive concept drift. A new remote work policy, a merger, or a change in shift schedules shifts your baseline. Your false positive rate climbs until you retrain. Track the mean and variance of your raw anomaly scores over time; a sustained drift signals a baseline problem.
  • Cover new users or new systems. If an account is compromised before you have behavioral history for it, you have no baseline to compare against. Rules and threat intelligence are still necessary for first-seen events.

Anomaly detection is a complement to signature-based rules and threat intelligence, not a replacement. Map your gap explicitly: which MITRE ATT&CK techniques in your threat model produce anomalous signals, and which ones don't?
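The drift check mentioned above can be sketched with nothing but the raw scores: compare the current window's mean score against the baseline mean, scaled by the sampling error you would expect from a window of that size. The score distributions below are synthetic stand-ins.

```python
import numpy as np

def drift_check(baseline_scores, current_scores, z_limit=3.0):
    """Flag a sustained shift in the raw-score distribution by comparing
    the current window's mean to the baseline, in standard-error units."""
    base = np.asarray(baseline_scores)
    cur = np.asarray(current_scores)
    se = base.std(ddof=1) / np.sqrt(len(cur))   # expected wobble of a window mean
    z = (cur.mean() - base.mean()) / se
    return abs(z) > z_limit, z

rng = np.random.default_rng(1)
baseline = rng.normal(-0.45, 0.05, 5000)   # raw scores at training time
shifted  = rng.normal(-0.52, 0.05, 1000)   # scores after a policy change

drifted, z = drift_check(baseline, shifted)
print(drifted)   # True: the window mean sits far outside the baseline
```

Run this on each day's scores; a single noisy day will stay under the limit, while a policy change or merger produces a z-value that stays large day after day, which is the retraining signal.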

Practical Tuning Notes

A few rules that hold across most deployments:

  • Train on at least 30 days of data before going live. One week isn't enough to capture weekly cycles: payroll jobs, patch windows, scheduled reports.
  • Retrain on a fixed schedule (monthly is a common starting point), or trigger retraining when you detect significant distribution shift in your raw score distribution.
  • Tune against confirmed true positive rate, not alert volume. A queue of 50 alerts with 40 confirmed true positives is far more valuable than 10 alerts with 2.
  • Log the raw anomaly scores alongside binary flags. Operational teams need the gradient to prioritize, not just a Boolean.

GTK Cyber's applied data science training covers building, calibrating, and evaluating ML-based detection systems with real security datasets, including hands-on labs that walk through exactly the kind of feature engineering and model selection described above.
