The outages that hurt the most are not the dramatic ones.
A cell that drops from 99% to 40% RRC success rate gets noticed within minutes — alarms fire, dashboards turn red, someone calls someone. Those are survivable. The ones that cause real damage are the cells that drift from 98.4% to 97.1% to 96.3% over four days. Each step looks like noise. The trend is not.
By the time a cluster of customer complaints arrives, the problem has been running for a week.
This post is about catching that kind of degradation before it becomes visible to anyone outside the operations center.
## Why Thresholds Always Fail
Every network operations team has tried thresholds. If RRC success rate drops below 95%, fire an alert. Simple. Understandable. Wrong for most cells most of the time.
A downtown office cell has near-zero RRC attempts at 2 AM. Its success rate at that hour is either 100% (one device connected, one succeeded) or undefined (zero attempts). A threshold fires constantly or never — neither is useful.
A rural coverage cell carries steady low traffic all day. A drop to 94% at any hour is significant and worth investigating immediately.
A stadium cell behaves completely differently on game days versus off days, and again during halftime versus kickoff.
One threshold for all of these is wrong for most of them. Each cell needs to define its own normal.
## Building Per-Cell Baselines
For each cell and KPI, learn what the typical value looks like at each hour of the week. Eight weeks of history captures enough weekly cycles without letting a configuration change from months ago contaminate the numbers.
```python
import pandas as pd
import numpy as np
from dataclasses import dataclass, field
from typing import Optional
import logging

logger = logging.getLogger(__name__)


@dataclass
class CellBaseline:
    cell_ldn: str
    kpi: str
    stats: dict = field(default_factory=dict)  # keyed by hour_of_week (0-167)
    is_valid: bool = False
    computed_at: Optional[str] = None


def build_baseline(
    df: pd.DataFrame,
    cell_ldn: str,
    kpi: str,
    min_weeks: int = 4,
) -> CellBaseline:
    baseline = CellBaseline(cell_ldn=cell_ldn, kpi=kpi)

    # Only clean records contribute — late arrivals and schema gaps skew the baseline
    cell_df = df[
        (df["cell_ldn"] == cell_ldn)
        & (~df["is_late_arrival"])
        & (~df["is_schema_gap"])
        & (df[kpi].notna())
    ].copy()

    if cell_df.empty:
        logger.warning("no_data cell=%s kpi=%s", cell_ldn, kpi)
        return baseline

    cell_df["hour_of_week"] = (
        cell_df["collection_time"].dt.dayofweek * 24
        + cell_df["collection_time"].dt.hour
    )

    for hour, group in cell_df.groupby("hour_of_week"):
        if len(group) < min_weeks:
            continue
        values = group[kpi].values
        std = float(np.std(values, ddof=1))
        baseline.stats[int(hour)] = {
            "mean": float(np.mean(values)),
            "std": std if std > 0 else None,
            "p10": float(np.percentile(values, 10)),
            "p90": float(np.percentile(values, 90)),
            "count": len(values),
        }

    baseline.is_valid = len(baseline.stats) >= 100
    baseline.computed_at = pd.Timestamp.now(tz="UTC").isoformat()
    return baseline
```
`std = None` for constant cells is deliberate, not lazy. A remote industrial cell that consistently reports the same counter value every interval has a standard deviation of zero. Storing `None` forces the scorer to handle this explicitly rather than silently dividing by zero downstream. The `min_weeks=4` guard skips any hour slot with fewer than four historical data points — a sparse baseline is worse than no baseline at all.
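As a quick sanity check on the hour-of-week indexing (a hypothetical example, using hourly rather than 15-minute timestamps for brevity): eight full weeks starting on a Monday should cover all 168 slots with exactly eight samples each.

```python
import pandas as pd

# Eight weeks of hourly timestamps, starting on a Monday (2024-01-01).
idx = pd.date_range("2024-01-01", periods=24 * 7 * 8, freq="h", tz="UTC")

# Same formula as build_baseline: Monday 00:00 is slot 0, Sunday 23:00 is slot 167.
hour_of_week = idx.dayofweek * 24 + idx.hour
counts = pd.Series(hour_of_week).value_counts()

print(len(counts), counts.min(), counts.max())  # 168 slots, 8 samples per slot
```

With real 15-minute counters the same eight weeks give 32 samples per slot, comfortably above the `min_weeks=4` floor.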
## Scoring Each New Observation
Every 15-minute counter update gets compared to that cell's expected behavior for that specific hour of the week. The result is either `None` — nothing to flag — or an `AnomalyEvent` with enough context for an engineer to act on.
```python
@dataclass
class AnomalyEvent:
    cell_ldn: str
    kpi: str
    collection_time: str
    observed: float
    baseline_mean: float
    z_score: float
    direction: str  # "low" or "high"


def score_observation(
    cell_ldn: str,
    kpi: str,
    collection_time: pd.Timestamp,
    observed_value: float,
    baseline: CellBaseline,
    z_threshold: float = 3.0,
) -> Optional[AnomalyEvent]:
    if not baseline.is_valid:
        return None

    hour = collection_time.dayofweek * 24 + collection_time.hour
    hour_stats = baseline.stats.get(hour)
    if hour_stats is None:
        return None

    if hour_stats["std"] is None:
        # Constant cell — any deviation is notable but z-score is undefined
        if observed_value != hour_stats["mean"]:
            return AnomalyEvent(
                cell_ldn=cell_ldn,
                kpi=kpi,
                collection_time=collection_time.isoformat(),
                observed=observed_value,
                baseline_mean=hour_stats["mean"],
                z_score=float("inf"),
                direction="low" if observed_value < hour_stats["mean"] else "high",
            )
        return None

    z = (observed_value - hour_stats["mean"]) / hour_stats["std"]
    if abs(z) <= z_threshold:
        return None

    return AnomalyEvent(
        cell_ldn=cell_ldn,
        kpi=kpi,
        collection_time=collection_time.isoformat(),
        observed=observed_value,
        baseline_mean=hour_stats["mean"],
        z_score=round(z, 3),
        direction="low" if z < 0 else "high",
    )
```
The threshold of 3.0 is a starting point, not a rule. In production I use different values per KPI category — availability counters at 2.5, throughput counters at 3.5. Start at 3.0 everywhere and adjust based on the false positive rate you see in the first two weeks.
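One way to wire that in is a small per-category lookup. This is a sketch — the category names and the mapping are illustrative, not a real production config:

```python
# Hypothetical per-category overrides; tune against the false-positive
# rate you observe in the first two weeks.
Z_THRESHOLDS = {
    "availability": 2.5,  # drops here are almost always real
    "throughput": 3.5,    # bursty by nature, needs more slack
}
DEFAULT_Z_THRESHOLD = 3.0


def z_threshold_for(kpi_category: str) -> float:
    """Per-category override with a conservative default."""
    return Z_THRESHOLDS.get(kpi_category, DEFAULT_Z_THRESHOLD)
```

Pass the result into `score_observation` as `z_threshold=z_threshold_for(category)` so the scoring code itself stays threshold-agnostic.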
## Storing Events in Snowflake
Detections are only useful if they feed back into something. The `acknowledged` and `resolved_at` columns are the feedback mechanism — false positives acknowledged immediately tell you the threshold is too sensitive; events left open for hours tell you it is too loose.
```sql
CREATE TABLE anomaly_events (
    event_id       VARCHAR(36) DEFAULT UUID_STRING(),
    detected_at    TIMESTAMP_TZ NOT NULL,
    cell_ldn       VARCHAR(200) NOT NULL,
    kpi            VARCHAR(100) NOT NULL,
    observed_value FLOAT,
    baseline_mean  FLOAT,
    z_score        FLOAT,
    direction      VARCHAR(5),
    acknowledged   BOOLEAN DEFAULT FALSE,
    resolved_at    TIMESTAMP_TZ,
    PRIMARY KEY (event_id)
)
CLUSTER BY (DATE_TRUNC('day', detected_at), cell_ldn);
```
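One way to read that feedback back out — a sketch against the table above; the 14-day window and the per-KPI grouping are illustrative choices, not a fixed recommendation:

```sql
SELECT
    kpi,
    COUNT(*)                                          AS events,
    AVG(IFF(acknowledged, 1, 0))                      AS ack_rate,
    AVG(DATEDIFF('minute', detected_at, resolved_at)) AS avg_minutes_open
FROM anomaly_events
WHERE detected_at >= DATEADD('day', -14, CURRENT_TIMESTAMP())
GROUP BY kpi
ORDER BY events DESC;
```

A KPI whose events are acknowledged within minutes at a high rate is a candidate for a higher `z_threshold`; one whose events sit unresolved for hours deserves a lower one.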
## What Surprised Me in Production
The cells generating the most false positives were not the noisiest. They were recently reconfigured cells — a parameter change, a software upgrade — where the historical baseline no longer matched new behavior. The fix: suppress anomaly scoring for 72 hours after any configuration change and reset the baseline. Change management gets its own post later in this series.
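The suppression check itself is trivial once you have the change timestamp. Where `last_config_change` comes from (a change-management feed) is out of scope for this post, so treat this as a sketch:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# 72-hour quiet period after any configuration change.
SUPPRESSION_WINDOW = timedelta(hours=72)


def is_suppressed(last_config_change: Optional[datetime], now: datetime) -> bool:
    """True while the cell is inside the post-change suppression window."""
    if last_config_change is None:
        return False
    return now - last_config_change < SUPPRESSION_WINDOW
```

Call it before `score_observation` and skip scoring entirely when it returns `True`; rebuild the baseline once the window closes.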
The most operationally useful catches were not the dramatic drops. Those get noticed regardless. The value was in slow drift — a cell moving from 98.4% to 97.1% to 96.3% over four days, each step individually within noise, but the trend unmistakably downward. That is what per-cell baselines surface that threshold alerts never do.
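A hypothetical complement to the per-observation z-score makes that concrete: flag a sustained run of below-baseline observations even when no single point crosses the alert threshold. The `min_run` and `min_z` values here are illustrative starting points, not tuned numbers:

```python
from typing import Sequence


def is_drifting(z_scores: Sequence[float], min_run: int = 8, min_z: float = 1.0) -> bool:
    """True if the last min_run z-scores all sit at least min_z below zero.

    Each point alone is within noise (|z| < 3), but a one-sided run of
    eight is a trend worth a ticket.
    """
    if len(z_scores) < min_run:
        return False
    return all(z <= -min_z for z in z_scores[-min_run:])
```

Feed it the rolling z-scores that `score_observation` already computes, and the 98.4% → 97.1% → 96.3% slide gets flagged days before any single interval would.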
## What Comes Next
The next post switches from cells to devices — how to detect NB-IoT and low-power IoT devices that have gone silent using EDR data. Different problem, different data shape, same principle: define normal, flag when it stops.
Follow me to get a notification when Part 3 drops!
I have been building these systems at national scale for several years. If your KPI names differ, your counter granularity is sub-15-minute, or you have edge cases I haven't covered — I would genuinely like to hear about it in the comments.