DEV Community: Paolo D'Egidio

I built an AI log monitor for my homelab — local LLM reads my *arr logs so I don't have to

Paolo D'Egidio — Fri, 17 Apr 2026 21:35:43 +0000

My homelab runs the usual stack — Sonarr, Radarr, Prowlarr, qBittorrent, Plex. I was getting ntfy alerts at all hours for things like ffprobe metadata reads and HTTP 429s from indexers. Not actionable, just noise.

So I built Cortex: a monitoring layer that sends Docker logs through a local LLM (Ollama) every 30 minutes, filters the noise, and routes only meaningful alerts to my phone.

The problem with threshold-based monitoring

Standard monitoring tools watch numbers. CPU > 80%? Alert. Disk > 90%? Alert. That works for infrastructure — it doesn't work for application logs.

An *arr log line like:

[Warn] NzbDrone.Core.Download.TrackedDownloads.TrackedDownloadService: 
Couldn't import album track / No files found are eligible for import

Is that a problem? Maybe. Depends on context. Is it a one-off, or has it been happening for 6 hours? Is the download queue healthy? Did the episode actually get imported by another path?

A fixed threshold can't answer that. A language model can.

Architecture

Docker logs → Cortex → Ollama (local LLM) → parsed report → ntfy
                                                    ↓
                                           Prometheus metrics

Every 30 minutes, cortex-monitor.py runs via cron:

Collects recent log lines from each monitored container
Filters known noise patterns (ffprobe, VideoFileInfoReader, HTTP 429, etc.)
Sends the filtered logs to a local Ollama endpoint
Parses the LLM response into structured alerts
Routes alerts by priority — INFO goes to the daily digest, WARNING/CRITICAL go to ntfy immediately
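
The first step above can be sketched with the Docker CLI's `--since` flag, which limits output to a recent time window. This is a minimal sketch rather than the code in cortex-monitor.py: the function names and the container list are assumptions made for illustration.

```python
import subprocess

# Hypothetical container names; the real list would come from the installer config.
CONTAINERS = ["sonarr", "radarr", "prowlarr", "qbittorrent", "plex"]


def build_logs_command(container: str, minutes: int = 30) -> list[str]:
    """Build a `docker logs` call limited to the last `minutes` minutes."""
    return ["docker", "logs", "--since", f"{minutes}m", container]


def collect_logs(container: str, minutes: int = 30) -> list[str]:
    """Return recent log lines for one container, stdout and stderr merged."""
    result = subprocess.run(
        build_logs_command(container, minutes),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # the container's stderr arrives on stderr
        text=True,
        check=False,  # a stopped container should not crash the whole run
    )
    return result.stdout.splitlines()
```

Matching the `--since` window to the cron interval means each run sees roughly the lines produced since the previous one.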

The Ollama Modelfile

The key is giving the LLM enough context to understand what it's reading. The Modelfile bakes in knowledge of the stack:

SYSTEM """
You are an infrastructure monitoring assistant for a self-hosted homelab.
You analyse log output from Docker containers running *arr media services.

NOISE — these are NOT alerts:
- ffprobe metadata reads
- VideoFileInfoReader routine scans  
- HTTP 429 rate limiting from indexers (expected, indexers throttle)
- Prowlarr health check on port 9696

SIGNAL — these ARE worth reporting:
- Import failures after successful downloads
- Indexer connectivity issues lasting > 30 minutes
- Download client queue stalls
- Authentication errors
- Database errors

Output format:
ALERT_LEVEL: INFO|WARNING|CRITICAL
SUMMARY: one sentence
DETAIL: what happened and why it matters
RECOMMENDATION: what to check or do
"""

A temperature of 0.2 keeps the output consistent from run to run. It doesn't make it strictly deterministic, but you don't want creative variation in monitoring alerts.
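
To make the round trip concrete, here is a minimal sketch of calling Ollama's `/api/generate` endpoint with that temperature and parsing the four-field output format into a dict. The model name `cortex`, the default port, and the fallback to WARNING for unparseable replies are assumptions, not details from the repo.

```python
import json
import urllib.request

# Assumptions: Ollama on its default port, a model called "cortex" built from the Modelfile.
OLLAMA_URL = "http://localhost:11434/api/generate"
FIELDS = ("ALERT_LEVEL", "SUMMARY", "DETAIL", "RECOMMENDATION")
LEVELS = ("INFO", "WARNING", "CRITICAL")


def build_payload(logs: list[str], model: str = "cortex") -> dict:
    """Request body for a single, non-streaming completion."""
    return {
        "model": model,
        "prompt": "\n".join(logs),
        "stream": False,
        "options": {"temperature": 0.2},
    }


def ask_ollama(logs: list[str]) -> str:
    """Send the filtered logs and return the model's raw text reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(logs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]


def parse_report(text: str) -> dict:
    """Turn 'KEY: value' lines into a dict, ignoring anything unexpected."""
    report = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().upper()
        if sep and key in FIELDS:
            report[key.lower()] = value.strip()
    # Assumed policy: a reply without a valid level is surfaced, not dropped.
    if report.get("alert_level") not in LEVELS:
        report["alert_level"] = "WARNING"
    return report
```

This parser only handles single-line fields. A reply that wraps DETAIL over several lines would need a slightly smarter loop.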

Noise filtering before the LLM

The LLM call costs time (2-4 seconds on a local GPU). Filtering before sending keeps the context window clean and the latency low:

NOISE_PATTERNS = [
    "ffprobe",
    "VideoFileInfoReader", 
    "429",
    "invalid torrent",
    "9696/",
]

def filter_noise(log_lines: list) -> list:
    return [
        line for line in log_lines
        if not any(pattern in line for pattern in NOISE_PATTERNS)
    ]

On a normal day, this drops 60-70% of log volume before it ever reaches Ollama.
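
One caveat with plain substring matching: a pattern like "429" also drops any line that merely contains those digits, such as a byte count or an ID. A stricter variant could use regular expressions. The sketch below is illustrative only, and the exact wording matched around the status code is an assumption about the log format.

```python
import re

# Illustrative patterns; adjust to the wording your containers actually log.
NOISE_REGEXES = [
    re.compile(r"ffprobe"),
    re.compile(r"VideoFileInfoReader"),
    # 429 only when it follows "HTTP" or "status", not inside a longer number
    re.compile(r"\b(?:HTTP|status)\D{0,10}429\b", re.IGNORECASE),
    re.compile(r"invalid torrent", re.IGNORECASE),
    re.compile(r"9696/"),
]


def filter_noise_regex(log_lines: list[str]) -> tuple[list[str], int]:
    """Return the lines worth sending to the LLM and the number dropped."""
    kept = [
        line for line in log_lines
        if not any(rx.search(line) for rx in NOISE_REGEXES)
    ]
    return kept, len(log_lines) - len(kept)
```

Returning the dropped count alongside the kept lines gives a natural source for a counter like cortex_noise_filtered_total.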

Alert routing with cooldown

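Not every WARNING needs a repeat ntfy push. Cortex keeps a cooldown per container and alert level to avoid notification fatigue:

import time

# COOLDOWNS maps alert level -> seconds between repeat notifications;
# levels without an entry fall back to one hour.

def route_alert(alert: dict, state: dict) -> bool:
    """Return True if the alert should be pushed now, False if still in cooldown."""
    key = f"{alert['container']}:{alert['alert_level']}"
    last_sent = state.get(key, 0)
    cooldown = COOLDOWNS.get(alert['alert_level'], 3600)

    if time.time() - last_sent < cooldown:
        return False  # still in cooldown

    state[key] = time.time()
    return True

INFO alerts accumulate and go into the daily digest at 09:00. The first WARNING or CRITICAL for a container goes out immediately; repeats of the same container and level are held back until the cooldown expires.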
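
To see the cooldown in action, here is a self-contained variant with an injectable clock so the behaviour can be checked without waiting. The COOLDOWNS values are example numbers chosen for illustration, not the repo's defaults, and the `now` parameter is an addition for testing.

```python
import time
from typing import Optional

# Example values in seconds; assumptions for illustration only.
COOLDOWNS = {"WARNING": 3600, "CRITICAL": 900}


def route_alert(alert: dict, state: dict, now: Optional[float] = None) -> bool:
    """Return True if the alert should be pushed, False if still in cooldown."""
    now = time.time() if now is None else now
    key = f"{alert['container']}:{alert['alert_level']}"
    last_sent = state.get(key)
    cooldown = COOLDOWNS.get(alert["alert_level"], 3600)
    if last_sent is not None and now - last_sent < cooldown:
        return False  # same container and level was pushed too recently
    state[key] = now
    return True


state: dict = {}
alert = {"container": "sonarr", "alert_level": "WARNING"}
first = route_alert(alert, state, now=1_000.0)   # nothing sent yet: push
repeat = route_alert(alert, state, now=1_600.0)  # 10 minutes later: suppressed
later = route_alert(alert, state, now=5_000.0)   # more than an hour later: push
```

Because the monitor runs from cron, `state` has to survive between runs. Persisting it to a small JSON file is one straightforward option.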

The daily digest

Every morning at 09:00, cortex-digest.py sends a summary via ntfy:

📊 Cortex Daily Digest — 2026-04-17

Containers: 5/5 healthy
Alerts last 24h: 2 (1 WARNING, 1 INFO)
Noise filtered: 847 log entries

Top event: prowlarr indexer timeout on NZBgeek (non-critical)
Recommendation: check NZBgeek API key expiry

Imports: 4 episodes, 3 movies — all clean

One message per day with everything that actually happened. No alert fatigue.
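
A digest like this boils down to formatting a few counters and making one HTTP POST to the ntfy topic. The sketch below is a simplified version under assumptions: the server URL, topic name, and function names are placeholders, and only the first three lines of the message are reproduced.

```python
import urllib.request

# Assumptions: a self-hosted ntfy instance and a topic named "cortex".
NTFY_URL = "http://your-ntfy:8080"
NTFY_TOPIC = "cortex"


def format_digest(healthy: int, total: int, alerts: dict, noise: int) -> str:
    """Render the counters as the plain-text body of the digest."""
    count = sum(alerts.values())
    breakdown = ", ".join(f"{n} {level}" for level, n in alerts.items() if n)
    lines = [
        f"Containers: {healthy}/{total} healthy",
        f"Alerts last 24h: {count}" + (f" ({breakdown})" if breakdown else ""),
        f"Noise filtered: {noise} log entries",
    ]
    return "\n".join(lines)


def build_ntfy_request(title: str, body: str, priority: str = "default"):
    """Prepare the POST; ntfy reads the title and priority from headers."""
    return urllib.request.Request(
        f"{NTFY_URL}/{NTFY_TOPIC}",
        data=body.encode("utf-8"),
        headers={"Title": title, "Priority": priority},  # keep the title ASCII
        method="POST",
    )


# To actually send: urllib.request.urlopen(request, timeout=10)
```

Keeping the request construction separate from the send makes the formatting easy to test without a running ntfy server.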

Prometheus metrics

cortex-exporter.py exposes metrics on port 9192 for Grafana:

cortex_alerts_total
cortex_last_run_timestamp
cortex_containers_monitored
cortex_noise_filtered_total
cortex_digest_last_sent

The last-run timestamp is particularly useful: alert on its age (time() - cortex_last_run_timestamp in PromQL). If Cortex stops running, the age climbs and you get a Grafana alert.
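
For readers who want to see what such an exporter emits, here is a minimal sketch that renders a few of these metrics in the Prometheus text exposition format using only the standard library. It is not the code in cortex-exporter.py, and `cortex_last_run_age_seconds` is a hypothetical extra metric added here to show the age calculation.

```python
import time
from typing import Optional


def render_metrics(last_run_ts: float, alerts_total: int, noise_total: int,
                   now: Optional[float] = None) -> str:
    """Render metrics as Prometheus text exposition lines."""
    now = time.time() if now is None else now
    age = max(0.0, now - last_run_ts)  # seconds since the last successful run
    lines = [
        "# TYPE cortex_alerts_total counter",
        f"cortex_alerts_total {alerts_total}",
        "# TYPE cortex_noise_filtered_total counter",
        f"cortex_noise_filtered_total {noise_total}",
        "# TYPE cortex_last_run_timestamp gauge",
        f"cortex_last_run_timestamp {last_run_ts}",
        # Hypothetical metric, not in the list above
        "# TYPE cortex_last_run_age_seconds gauge",
        f"cortex_last_run_age_seconds {age}",
    ]
    return "\n".join(lines) + "\n"
```

With a 30-minute cron cadence, an age above roughly 3600 seconds would mean at least one run was missed.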

Hardware requirements

CPU-only: 16GB RAM minimum — runs qwen2.5:7b adequately
GPU: 8GB VRAM — runs qwen2.5:14b comfortably (recommended)

I run it on a machine with a modest GPU. The 30-minute cron cadence means inference load is negligible — one batch call every half hour, not a continuous service.

Getting started

git clone https://github.com/pdegidio/cortex-homelab.git
cd cortex-homelab
bash install.sh

The installer walks you through Ollama endpoint, ntfy config, container names, and cron setup. Done in ~15 minutes.

Full repo: github.com/pdegidio/cortex-homelab — MIT license.

Going further: Cortex Core

After running Cortex on top of my media stack for a while, I packaged the stack itself: a complete Docker Compose for Sonarr + Radarr + Prowlarr + qBittorrent (via Gluetun VPN) + Plex + Authelia 2FA + NGINX Proxy Manager + ntfy + CrowdSec + Watchtower, with 32 documented gotchas from real-world incidents.

If you're starting fresh or rebuilding, that's github.com/pdegidio/cortex-core, also MIT.

The two together (with email support and update notifications) are bundled on Gumroad as Cortex Pro for €35: paolodegidio.gumroad.com/l/xixnpn. The GitHub versions are free and stay free.

What's your biggest source of homelab alert noise? I'm curious whether the noise filter patterns generalise beyond my stack or if everyone's list is completely different.

How to get individual SMART data from a TerraMaster DAS (and build a failure forecaster around it)

Paolo D'Egidio — Fri, 17 Apr 2026 21:27:29 +0000

If you have a TerraMaster D5-300 — or any DAS with a JMicron JMB576 bridge chip — you've probably hit this wall:

$ smartctl -a /dev/sdb
...
SMART overall-health self-assessment test result: PASSED
...

One result. For the whole enclosure. Not the 5 individual WD Red Pros inside it.

Most SMART monitoring guides stop here and tell you to use the vendor app. I went a different way.

The fix: JMicron pass-through

The JMB576 chip supports SMART pass-through via a specific smartctl flag:

smartctl -a -d jmb39x,0 /dev/sdb   # slot 1
smartctl -a -d jmb39x,1 /dev/sdb   # slot 2
smartctl -a -d jmb39x,2 /dev/sdb   # slot 3
smartctl -a -d jmb39x,3 /dev/sdb   # slot 4
smartctl -a -d jmb39x,4 /dev/sdb   # slot 5

The N in jmb39x,N is zero-based, so physical slot S maps to N = S - 1. Run this and you get full per-disk SMART output: temperature, reallocated sectors, CRC errors, pending sectors, everything.
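
For scripting, the same commands can be driven from Python with smartctl's `--json` output. This is a minimal sketch, assuming smartctl 7.x and the usual JSON layout where ATA attributes sit under `ata_smart_attributes.table`. The helper names are made up for illustration.

```python
import json
import subprocess

DEVICE = "/dev/sdb"  # the DAS as seen by the host, as in the commands above


def slot_device_type(slot: int) -> str:
    """Map a physical slot (1-5) to the smartctl device type."""
    if not 1 <= slot <= 5:
        raise ValueError(f"slot must be 1-5, got {slot}")
    return f"jmb39x,{slot - 1}"


def smartctl_command(slot: int, device: str = DEVICE) -> list[str]:
    return ["smartctl", "-a", "--json", "-d", slot_device_type(slot), device]


def extract_attributes(report: dict) -> dict:
    """Flatten the attribute table to {lowercase_name: raw_value}."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return {row["name"].lower(): row["raw"]["value"] for row in table}


def read_slot(slot: int) -> dict:
    """Run smartctl for one slot (needs root) and return its attributes."""
    # smartctl's exit status is a bitmask, so non-zero is not always a failure
    result = subprocess.run(
        smartctl_command(slot), capture_output=True, text=True, check=False
    )
    return extract_attributes(json.loads(result.stdout))
```

Lowercasing the names turns smartctl's `UDMA_CRC_Error_Count` into the `udma_crc_error_count` form used in the rest of this post.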

Tested on: TerraMaster D5-300, Debian 12.5, smartctl 7.3.

Why this matters

With per-disk SMART data, you can actually monitor disk health instead of just hoping the enclosure is OK. But raw SMART numbers aren't enough — what you really want is trends.

A single udma_crc_error_count = 1 on a 4-year-old disk is probably fine. udma_crc_error_count going from 1 to 3 to 7 to 12 over 60 days is not fine — it's a cable or backplane issue that will get worse.

Building a forecaster

I built Argus around this discovery. It collects SMART attributes every 6 hours, builds a 180-day rolling history, and runs a linear regression over the last 30 days to forecast when each attribute will hit its critical threshold.

The output looks like this:

👁️  Argus SMART Analysis — 2026-04-17T09:00:00+00:00
   Samples: 28 (forecast window: 30d)
   Overall: WARNING

✅ ssd-system (SanDisk Ultra II 960GB)   health=95/100  status=OK
✅ das-slot1  (WD Red Pro 8TB)           health=100/100 status=OK
✅ das-slot2  (WD Red Pro 8TB)           health=100/100 status=OK
✅ das-slot3  (WD Red Pro 8TB)           health=100/100 status=OK
✅ das-slot4  (WD Red Pro 8TB)           health=100/100 status=OK
🟡 das-slot5  (WD Red Pro 8TB)           health=70/100  status=WARNING
    🟡 udma_crc_error_count=1 ≥ WARN (5)
    📈 udma_crc_error_count: 1→100 forecast in 142d

Slot 5 is trending. Not critical yet — but I know about it 142 days before it becomes a problem.

How the forecast works

The core is a simple linear regression over (days_elapsed, attribute_value) pairs:

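def linear_forecast(points: list[tuple[float, float]], target: float) -> float | None:
    """Days until the fitted line reaches `target`, or None if there is no usable upward trend."""
    n = len(points)
    if n < 2:
        return None  # a line needs at least two points
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares: slope = cov(x, y) / var(x)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    if den == 0 or (slope := num / den) <= 0:
        return None  # flat or improving trend: nothing to forecast
    intercept = mean_y - slope * mean_x
    x_target = (target - intercept) / slope
    days_until = x_target - max(xs)
    # Only report crossings that lie in the future and within 180 days
    return round(days_until, 1) if 0 < days_until <= 180 else None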

Nothing fancy. But with collection every 6 hours, a full 30-day window holds up to 120 samples per attribute (4 per day), enough to catch real trends while ignoring one-off noise.
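
As a worked example, take the CRC series from earlier (1, 3, 7, 12), spaced at 20-day intervals over 60 days. The sketch below restates the same least-squares math in a self-contained form. The threshold of 20 is a made-up value for illustration, and the function name is hypothetical.

```python
from typing import Optional


def days_until_threshold(points, target: float, horizon: float = 180.0) -> Optional[float]:
    """Least-squares line through (day, value) points; days until it hits target."""
    n = len(points)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    if den == 0 or num / den <= 0:
        return None  # no upward trend
    slope = num / den
    intercept = mean_y - slope * mean_x
    days = (target - intercept) / slope - max(x for x, _ in points)
    return round(days, 1) if 0 < days <= horizon else None


crc_series = [(0, 1), (20, 3), (40, 7), (60, 12)]
print(days_until_threshold(crc_series, target=20))   # 47.0
print(days_until_threshold(crc_series, target=100))  # None
```

The fit gives a slope of 0.185 errors per day and an intercept of 0.2. The line reaches 20 at about day 107, which is 47 days after the last sample. A target of 100 would be roughly 479 days out, beyond the 180-day horizon, so the function returns None.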

Thresholds

Argus uses Backblaze-calibrated thresholds rather than vendor defaults. A few notes:

seek_error_rate is excluded: on Seagate drives the 48-bit raw value packs the seek error count into the upper 16 bits and the total seek count into the lower 32, so the raw number is meaningless for cross-vendor comparison. Backblaze doesn't use it either.
udma_crc_error_count warns at 5, not 1: a single historical CRC error on a multi-year disk is normal. Growth is what matters, and the forecast captures it.
Temperature anomaly uses z-score over 10+ samples, not a fixed threshold — so a disk that normally runs at 28°C will alert at 36°C, while one that normally runs at 38°C won't.
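
The temperature rule can be sketched as a one-sided z-score check. This is an illustration, not the code Argus ships: the cutoff of 3 standard deviations and the 0.5 °C floor on the standard deviation are assumptions chosen for the example.

```python
import statistics

MIN_SAMPLES = 10    # from the text: z-score over 10+ samples
Z_THRESHOLD = 3.0   # assumed cutoff
MIN_STDEV = 0.5     # assumed floor (°C) so a flat history cannot divide by zero


def temperature_anomaly(history: list[float], current: float) -> bool:
    """True if `current` is unusually hot relative to this disk's own history."""
    if len(history) < MIN_SAMPLES:
        return False  # not enough data to know what normal looks like
    mean = statistics.fmean(history)
    stdev = max(statistics.stdev(history), MIN_STDEV)
    z = (current - mean) / stdev
    return z > Z_THRESHOLD  # one-sided: only hotter than usual counts


cool_disk = [27, 28, 28, 29, 28, 27, 28, 29, 28, 28]
warm_disk = [38, 37, 38, 39, 38, 38, 37, 39, 38, 38]
print(temperature_anomaly(cool_disk, 36))  # True
print(temperature_anomaly(warm_disk, 36))  # False
```

The same 36 °C reading alerts for the disk that normally sits at 28 °C and stays quiet for the one that normally sits at 38 °C.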

Config for a TerraMaster D5-300

# /opt/argus/config/argus.conf

[argus]
history_file = /var/lib/argus/argus-history.json
retention_days = 180

[ntfy]
url   = http://your-ntfy:8080
topic = argus-disk

[disk:ssd-system]
device = /dev/sda
type   = sat
class  = ssd

[disk:das-slot1]
device = /dev/sdb
type   = jmb39x,0
class  = hdd

[disk:das-slot2]
device = /dev/sdb
type   = jmb39x,1
class  = hdd

[disk:das-slot3]
device = /dev/sdb
type   = jmb39x,2
class  = hdd

[disk:das-slot4]
device = /dev/sdb
type   = jmb39x,3
class  = hdd

[disk:das-slot5]
device = /dev/sdb
type   = jmb39x,4
class  = hdd
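
The layout above looks like standard INI, so a reader for it can be sketched with Python's configparser. That the real Argus loader works this way is an assumption, and the helper names below are illustrative.

```python
import configparser

# A shortened copy of the config above, used as sample input.
SAMPLE = """
[argus]
history_file = /var/lib/argus/argus-history.json
retention_days = 180

[disk:ssd-system]
device = /dev/sda
type   = sat
class  = ssd

[disk:das-slot1]
device = /dev/sdb
type   = jmb39x,0
class  = hdd
"""


def load_disks(config_text: str) -> dict:
    """Return {disk_name: {device, type, class}} for every [disk:...] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    disks = {}
    for section in parser.sections():
        if section.startswith("disk:"):
            name = section.split(":", 1)[1]
            disks[name] = dict(parser[section])
    return disks


def smartctl_args(disk: dict) -> list[str]:
    """Build the smartctl call for one configured disk."""
    return ["smartctl", "-a", "-d", disk["type"], disk["device"]]


disks = load_disks(SAMPLE)
```

Every DAS slot shares the same device node, so the `type` value is the only thing that tells the disks apart.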

Getting started

git clone https://github.com/pdegidio/argus-disk.git
cd argus-disk
bash install.sh

The installer auto-discovers your disks via smartctl --scan, walks you through the config, and sets up cron jobs for collection and alerting.

Source: github.com/pdegidio/argus-disk — MIT license.

Has anyone else found JMicron pass-through working on other enclosure brands? Curious how far jmb39x,N generalises beyond TerraMaster.