SESHA GONABOYINA

50,000 Cells. One Network. How Do You Know Which One Is Quietly Breaking?

The outages that hurt the most are not the dramatic ones.

A cell that drops from 99% to 40% RRC success rate gets noticed within minutes — alarms fire, dashboards turn red, someone calls someone. Those are survivable. The ones that cause real damage are the cells that drift from 98.4% to 97.1% to 96.3% over four days. Each step looks like noise. The trend is not.

By the time a cluster of customer complaints arrives, the problem has been running for a week.

This post is about catching that kind of degradation before it becomes visible to anyone outside the operations center.


Note: In part one, I ended with this: "The next post builds cell-specific anomaly detection on top of this foundation — how to learn what 'normal' looks like for 50,000 different cells." This is that post.

Why Thresholds Always Fail

Every network operations team has tried thresholds. If RRC success rate drops below 95%, fire an alert. Simple. Understandable. Wrong for most cells most of the time.

A downtown office cell has near-zero RRC attempts at 2 AM. Its success rate at that hour is either 100% (one device connected, one succeeded) or undefined (zero attempts). A threshold fires constantly or never — neither is useful.

A rural coverage cell carries steady low traffic all day. A drop to 94% at any hour is significant and worth investigating immediately.

A stadium cell behaves completely differently on game days versus off days, and again during halftime versus kickoff.

One threshold for all of these is wrong for most of them. Each cell needs to define its own normal.
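To make the contrast concrete, here is an illustrative sketch with synthetic data (the cell profiles and all numbers are invented for demonstration): the same fixed 95% threshold that carries no information for the bursty office cell would correspond to a massive deviation on the steady rural cell.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2 AM samples over 8 weeks for two invented cell profiles
office_2am = np.array([100.0] * 30 + [np.nan] * 26)   # one device, or zero attempts
rural_2am  = rng.normal(loc=98.0, scale=0.5, size=56) # steady low traffic

# The fixed 95% threshold never fires on the office cell: its rate is
# either 100% or undefined, so the alert carries no information.
office_below_95 = int(np.sum(office_2am < 95.0))

# On the rural cell, a reading of 94% is many standard deviations out,
# yet the same fixed threshold treats both cells identically.
z_rural_at_94 = (94.0 - rural_2am.mean()) / rural_2am.std(ddof=1)
```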


Building Per-Cell Baselines

For each cell and KPI, learn what the typical value looks like at each hour of the week. Eight weeks of history captures enough weekly cycles without letting a configuration change from months ago contaminate the numbers.

The full baseline builder:
import pandas as pd
import numpy as np
from dataclasses import dataclass, field
from typing import Optional
import logging

logger = logging.getLogger(__name__)


@dataclass
class CellBaseline:
    cell_ldn:    str
    kpi:         str
    stats:       dict = field(default_factory=dict)  # keyed by hour_of_week (0-167)
    is_valid:    bool = False
    computed_at: Optional[str] = None


def build_baseline(
    df: pd.DataFrame,
    cell_ldn: str,
    kpi: str,
    min_weeks: int = 4,
) -> CellBaseline:
    baseline = CellBaseline(cell_ldn=cell_ldn, kpi=kpi)

    # Only clean records contribute — late arrivals and schema gaps skew the baseline
    cell_df = df[
        (df["cell_ldn"] == cell_ldn)
        & (~df["is_late_arrival"])
        & (~df["is_schema_gap"])
        & (df[kpi].notna())
    ].copy()

    if cell_df.empty:
        logger.warning("no_data cell=%s kpi=%s", cell_ldn, kpi)
        return baseline

    cell_df["hour_of_week"] = (
        cell_df["collection_time"].dt.dayofweek * 24
        + cell_df["collection_time"].dt.hour
    )

    for hour, group in cell_df.groupby("hour_of_week"):
        if len(group) < min_weeks:
            continue

        values = group[kpi].values
        std    = float(np.std(values, ddof=1))

        baseline.stats[int(hour)] = {
            "mean":  float(np.mean(values)),
            "std":   std if std > 0 else None,
            "p10":   float(np.percentile(values, 10)),
            "p90":   float(np.percentile(values, 90)),
            "count": len(values),
        }

    # Require coverage of most of the 168 hour-of-week slots before trusting scores
    baseline.is_valid    = len(baseline.stats) >= 100
    baseline.computed_at = pd.Timestamp.now(tz="UTC").isoformat()
    return baseline

std = None for constant cells is deliberate, not lazy. A remote industrial cell that consistently reports the same counter value every interval has a standard deviation of zero. Storing None forces the scorer to handle this explicitly rather than silently dividing by zero downstream. The min_weeks=4 guard skips any hour slot with fewer than four historical data points — a sparse baseline is worse than no baseline at all.
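A quick standalone illustration of the constant-cell case (the values are invented):

```python
import numpy as np

# A constant cell: identical readings every interval for this hour slot
values = np.array([97.5, 97.5, 97.5, 97.5])

std = float(np.std(values, ddof=1))   # exactly 0.0
# Storing None instead of 0 makes the divide-by-zero case impossible to
# miss: the scorer must branch on it rather than silently producing inf.
stored_std = std if std > 0 else None
```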


Scoring Each New Observation

Every 15-minute counter update gets compared to that cell's expected behavior for that specific hour of the week. The result is either None — nothing to flag — or an AnomalyEvent with enough context for an engineer to act on.

@dataclass
class AnomalyEvent:
    cell_ldn:        str
    kpi:             str
    collection_time: str
    observed:        float
    baseline_mean:   float
    z_score:         float
    direction:       str   # "low" or "high"


def score_observation(
    cell_ldn:        str,
    kpi:             str,
    collection_time: pd.Timestamp,
    observed_value:  float,
    baseline:        CellBaseline,
    z_threshold:     float = 3.0,
) -> Optional[AnomalyEvent]:

    if not baseline.is_valid:
        return None

    hour       = collection_time.dayofweek * 24 + collection_time.hour
    hour_stats = baseline.stats.get(hour)

    if hour_stats is None:
        return None

    if hour_stats["std"] is None:
        # Constant cell — any deviation is notable but z-score is undefined
        if observed_value != hour_stats["mean"]:
            return AnomalyEvent(
                cell_ldn=cell_ldn,
                kpi=kpi,
                collection_time=collection_time.isoformat(),
                observed=observed_value,
                baseline_mean=hour_stats["mean"],
                z_score=float("inf"),
                direction="low" if observed_value < hour_stats["mean"] else "high",
            )
        return None

    z = (observed_value - hour_stats["mean"]) / hour_stats["std"]

    if abs(z) <= z_threshold:
        return None

    return AnomalyEvent(
        cell_ldn=cell_ldn,
        kpi=kpi,
        collection_time=collection_time.isoformat(),
        observed=observed_value,
        baseline_mean=hour_stats["mean"],
        z_score=round(z, 3),
        direction="low" if z < 0 else "high",
    )

The threshold of 3.0 is a starting point, not a rule. In production I use different values per KPI category — availability counters at 2.5, throughput counters at 3.5. Start at 3.0 everywhere and adjust based on the false positive rate you see in the first two weeks.
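One way to wire that in is a small lookup keyed by KPI category (the category names and values below are illustrative, not a fixed taxonomy from this pipeline):

```python
# Hypothetical per-category thresholds; tune against observed false positives
Z_THRESHOLDS = {
    "availability": 2.5,  # tighter: availability regressions matter early
    "throughput":   3.5,  # looser: throughput is naturally noisier
}
DEFAULT_Z = 3.0           # the starting point for anything uncategorized

def z_threshold_for(kpi_category: str) -> float:
    # Fall back to the global default for KPIs without a category
    return Z_THRESHOLDS.get(kpi_category, DEFAULT_Z)
```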


Storing Events in Snowflake

Detections are only useful if they feed back into something. The acknowledged and resolved_at columns are the feedback mechanism — false positives acknowledged and dismissed within minutes tell you the threshold is too sensitive; events that sit open for hours before resolution tell you the detection fired late, meaning the threshold is too loose.

CREATE TABLE anomaly_events (
    event_id        VARCHAR(36)   DEFAULT UUID_STRING(),
    detected_at     TIMESTAMP_TZ  NOT NULL,
    cell_ldn        VARCHAR(200)  NOT NULL,
    kpi             VARCHAR(100)  NOT NULL,
    observed_value  FLOAT,
    baseline_mean   FLOAT,
    z_score         FLOAT,
    direction       VARCHAR(5),
    acknowledged    BOOLEAN       DEFAULT FALSE,
    resolved_at     TIMESTAMP_TZ,
    PRIMARY KEY (event_id)
)
CLUSTER BY (DATE_TRUNC('day', detected_at), cell_ldn);
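To close the loop, something like this turns those columns into a tuning signal. This is a sketch: it assumes the events have been pulled into a pandas DataFrame with the schema above, and the 15-minute "dismissed on sight" cutoff is a hypothetical choice, not a fixed rule.

```python
import pandas as pd

# Hypothetical sample of rows pulled from anomaly_events
events = pd.DataFrame({
    "detected_at":  pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-01 11:00", "2024-01-01 12:00"], utc=True),
    "acknowledged": [True, True, False],
    "resolved_at":  pd.to_datetime(
        ["2024-01-01 10:05", None, None], utc=True),
})

# Events resolved within minutes of detection were likely dismissed on
# sight; a high share of these suggests the threshold is too sensitive.
resolved = events.dropna(subset=["resolved_at"])
minutes_open = (resolved["resolved_at"] - resolved["detected_at"]).dt.total_seconds() / 60
quick_dismiss_share = (minutes_open < 15).mean()
```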

What Surprised Me in Production

The cells generating the most false positives were not the noisiest. They were recently reconfigured cells: a parameter change or a software upgrade meant the historical baseline no longer matched the new behavior. The fix: suppress anomaly scoring for 72 hours after any configuration change and reset the baseline. Change management gets its own post later in this series.
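A minimal sketch of that suppression check, assuming a per-cell record of the last configuration change is available (how that record is maintained is out of scope here):

```python
from typing import Optional

import pandas as pd

SUPPRESSION_WINDOW = pd.Timedelta(hours=72)

def is_suppressed(collection_time: pd.Timestamp,
                  last_config_change: Optional[pd.Timestamp]) -> bool:
    # While post-change behavior settles, the old baseline no longer
    # describes this cell, so skip scoring entirely.
    if last_config_change is None:
        return False
    return collection_time - last_config_change < SUPPRESSION_WINDOW
```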

The most operationally useful catches were not the dramatic drops. Those get noticed regardless. The value was in slow drift: a cell moving from 98.4% to 97.1% to 96.3% over four days, each step individually within noise, but the trend unmistakably downward. That is what per-cell baselines surface and threshold alerts never do.
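The z-score scorer above only fires on single large deviations, so slow drift needs its own lightweight check. One simple, illustrative option (not the production code): look for long one-sided runs against the hourly baseline, since a streak of below-mean readings is improbable under symmetric noise even when every individual z-score is small.

```python
def longest_negative_run(z_scores: list[float]) -> int:
    # Length of the longest streak of below-baseline observations.
    # Under symmetric noise each reading falls below the mean with
    # p ~ 0.5, so a long one-sided run flags drift that no single
    # z-score would.
    longest = current = 0
    for z in z_scores:
        current = current + 1 if z < 0 else 0
        longest = max(longest, current)
    return longest

# Every step is within noise (|z| < 1), but the run is one-sided
drift = [-0.4, -0.6, -0.3, -0.7, -0.5, -0.8, -0.6, -0.9]
```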


What Comes Next

The next post switches from cells to devices: how to detect NB-IoT and low-power IoT devices that have gone silent using EDR data. Different problem, different data shape, same principle: define normal, flag when it stops.

Follow me to get a notification when Part 3 drops!

I have been building these systems at national scale for several years. If your KPI names differ, your counter granularity is sub-15-minute, or you have edge cases I haven't covered, I would genuinely like to hear about it in the comments.
