Stephen Souza

Posted on May 12

Stop checking dashboards. Build the monitoring layer that checks for you.

#saas #startup #monitoring #devops

Every developer has a version of this story.

You open your laptop in the morning. Something feels off.

You check Stripe: green.
You check the server: up.
You check the error logs: clean.

You spend 20 minutes clicking through dashboards before realising signups stopped 6 hours ago.

That's not monitoring. That's detective work.

Dashboards are useful. The problem is knowing which one to open. Silent failure monitoring solves that — it watches for anomalies, silence, and broken flows across your stack, then tells you exactly where to look. You still investigate. You just stop guessing where to start.

What silent failures look like in production

Silent failures don't throw errors. They don't trigger uptime alerts. They just quietly cost you money until someone notices.

No new signups in 4 hours on a Wednesday, when your baseline is 12 per hour.
A payment.initiated event that never reaches payment.completed.
A nightly sync that ran at midnight, processed zero records, and exited clean with no error thrown.
An AI agent that looped 47 times, burned its token budget, and produced nothing.
A cron job that stopped running on Thursday.
A Zapier workflow that hasn't fired since Tuesday.

None of these show up as red in any dashboard. All of them show up in revenue.
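To see why four quiet hours is a strong signal rather than bad luck, here is a back-of-the-envelope check. It assumes signups arrive independently at a steady average rate (a Poisson model). That assumption and the function name are illustrative, not something NotiLens documents.

```python
import math


def prob_of_silence(rate_per_hour: float, hours: float) -> float:
    """Probability of seeing zero events in `hours`, assuming a Poisson process."""
    if rate_per_hour < 0 or hours < 0:
        raise ValueError("rate and duration must be non-negative")
    return math.exp(-rate_per_hour * hours)


# Baseline from the example above: 12 signups per hour, 4 hours of silence.
p = prob_of_silence(12, 4)
```

With these numbers `p` comes out around 1.4e-21, so under this model the silence is almost certainly a failure and not a slow morning.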

Before you start

1. Create an account at notilens.com.
2. Create a source in the dashboard. You'll get a token and a secret.
3. Install the SDK.

Install NotiLens

Python

pip install notilens

Node.js

npm install @notilens/notilens

Initialize

Python

import notilens

nl = notilens.init(
    name="my-app",
    token="YOUR_TOKEN",
    secret="YOUR_SECRET"
)

Node.js

import { NotiLens } from '@notilens/notilens';

const nl = NotiLens.init('my-app', {
  token: 'YOUR_TOKEN',
  secret: 'YOUR_SECRET'
});

Credentials are saved to ~/.notilens_config.json after first use. Use environment variables NOTILENS_TOKEN and NOTILENS_SECRET to avoid hardcoding.
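The post does not say whether the SDK picks these variables up on its own, so here is a minimal sketch that reads them explicitly and fails fast when one is missing. The helper name `load_credentials` is hypothetical; only the variable names and the `token` / `secret` keywords come from the text above.

```python
import os
from typing import Mapping


def load_credentials(env: Mapping[str, str] = os.environ) -> dict:
    """Build init keyword arguments from NOTILENS_TOKEN and NOTILENS_SECRET."""
    missing = [k for k in ("NOTILENS_TOKEN", "NOTILENS_SECRET") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"token": env["NOTILENS_TOKEN"], "secret": env["NOTILENS_SECRET"]}


# Intended use, following the init call shown above:
#   nl = notilens.init(name="my-app", **load_credentials())
```

Failing at startup is deliberate: a monitoring client that silently runs without credentials would be its own silent failure.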

Cron job monitoring — know when jobs run and what they actually did

The most expensive silent failure for most teams: a cron job stops running and nobody notices for days. Or worse, it runs but processes nothing, exits clean, and looks healthy.

NotiLens catches both. Your job pings when it starts, reports the records processed and time taken, and pings again when it completes. Zero records on a job that normally touches 500 triggers an alert. So does a job that never pings at all.

Python

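import time

import notilens

nl  = notilens.init(name="invoice-sync", token="TOKEN", secret="SECRET")
run = nl.task("nightly-sync")
run.start()
started = time.monotonic()

try:
    run.progress("Fetching invoices")
    records = process_invoices()  # your own sync function; returns the record count

    run.metric("records", records)
    # Report the measured duration rather than a hardcoded value
    run.metric("duration_ms", int((time.monotonic() - started) * 1000))

    run.complete(f"Processed {records} invoices")
except Exception as e:
    run.fail(str(e))
    raise  # re-raise so the job still exits non-zero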

Node.js

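import { NotiLens } from '@notilens/notilens';

const nl  = NotiLens.init('invoice-sync', { token: 'TOKEN', secret: 'SECRET' });
const run = nl.task('nightly-sync');
run.start();
const startedAt = Date.now();

try {
  run.progress('Fetching invoices');
  const records = await processInvoices(); // your own sync function; resolves to the record count

  run.metric('records', records);
  // Report the measured duration rather than a hardcoded value
  run.metric('duration_ms', Date.now() - startedAt);

  run.complete(`Processed ${records} invoices`);
} catch (err) {
  run.fail(err.message);
  throw err; // rethrow so the job still exits non-zero
}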

NotiLens fires a silence alert if the job doesn't ping within its expected window. You open the dashboard knowing exactly which job missed and when.
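How NotiLens evaluates the window internally is not described here, but the two checks from this section are easy to picture. The sketch below is an illustration under stated assumptions: a fixed schedule interval, a grace period, and a "typical" record count supplied by the caller. All names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def missed_ping(last_ping: Optional[datetime], now: datetime,
                interval: timedelta, grace: timedelta) -> bool:
    """True if the job should have pinged by `now` but has not."""
    if last_ping is None:
        return True  # the job has never pinged at all
    return now - last_ping > interval + grace


def suspicious_run(records: int, typical_records: int,
                   min_ratio: float = 0.1) -> bool:
    """True if a run finished but processed far fewer records than usual."""
    return typical_records > 0 and records < typical_records * min_ratio
```

The first function covers the job that stopped running; the second covers the job that ran, exited clean, and did nothing.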

AI agent monitoring — loops, hangs, token burns, and human approval waits

AI agents fail differently from regular software. They don't crash — they drift, loop, stall, and consume resources while producing nothing. Standard error monitoring has no vocabulary for this.

NotiLens tracks the full agent lifecycle — start, progress, loop detection, output, completion. When something goes wrong you get an alert with context: which agent, which task, how many loops, what metrics looked like before it stalled.

Python

import notilens

nl  = notilens.init(
    name="outreach-agent",
    token="TOKEN",
    secret="SECRET",
    patch=True  # auto-instruments OpenAI, Anthropic, LangChain calls
)

run = nl.task("email-campaign")
run.start()

try:
    run.progress("Fetching leads")

    # `leads` and `agent` come from your own application code
    for i, lead in enumerate(leads):
        run.loop(f"Processing lead {i+1} of {len(leads)}")
        result = agent.process(lead)
        run.metric("tokens", result.usage.total_tokens)
        run.metric("cost", result.usage.cost)

    run.output_generated("Campaign emails ready")
    run.complete(f"Processed {len(leads)} leads")
except Exception as e:
    run.fail(str(e))

Node.js

import { NotiLens } from '@notilens/notilens';

const nl  = NotiLens.init('outreach-agent', { token: 'TOKEN', secret: 'SECRET' });
const run = nl.task('email-campaign');
run.start();

try {
  run.progress('Fetching leads');

  // `leads` and `agent` come from your own application code
  for (const [i, lead] of leads.entries()) {
    run.loop(`Processing lead ${i+1} of ${leads.length}`);
    const result = await agent.process(lead);
    run.metric('tokens', result.usage.totalTokens);
    run.metric('cost', result.usage.cost);
  }

  run.outputGenerated('Campaign emails ready');
  run.complete(`Processed ${leads.length} leads`);
} catch (err) {
  run.fail(err.message);
}
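Alerting on a runaway loop is useful, and you may also want the agent to stop itself. Here is a small local guard you could pair with the calls above. It is a sketch, not part of the NotiLens SDK, and the class names and limits are made up for illustration.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run goes past its iteration or token budget."""


class LoopBudget:
    def __init__(self, max_iterations: int, max_tokens: int) -> None:
        self.max_iterations = max_iterations
        self.max_tokens = max_tokens
        self.iterations = 0
        self.tokens = 0

    def charge(self, tokens: int) -> None:
        """Record one loop iteration and its token usage; raise once over budget."""
        self.iterations += 1
        self.tokens += tokens
        if self.iterations > self.max_iterations:
            raise BudgetExceeded(
                f"{self.iterations} iterations > limit {self.max_iterations}")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(
                f"{self.tokens} tokens > limit {self.max_tokens}")
```

Calling `budget.charge(result.usage.total_tokens)` inside the Python loop above would raise into the existing `except` block, so `run.fail` reports why the run stopped.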

Human-in-the-loop — when your agent needs approval:

Python

run.input_required("Please confirm before sending emails")
# ... wait here for the user's decision, using your own approval flow ...
run.input_approved("User confirmed")

Node.js

run.inputRequired('Please confirm before sending emails');
// ... wait here for the user's decision, using your own approval flow ...
run.inputApproved('User confirmed');
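The two calls bracket a wait that your own code has to implement. One way to do it is a polling helper with a timeout, sketched below. `get_decision` is a hypothetical callback that returns True, False, or None while no one has answered. Nothing here is a NotiLens API.

```python
import time
from typing import Callable, Optional


def wait_for_approval(get_decision: Callable[[], Optional[bool]],
                      timeout_s: float, poll_s: float = 1.0,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[bool]:
    """Poll `get_decision` until it returns True/False or the timeout passes.

    Returns the decision, or None if nobody answered in time.
    """
    deadline = clock() + timeout_s
    while True:
        decision = get_decision()
        if decision is not None:
            return decision
        if clock() >= deadline:
            return None
        sleep(poll_s)
```

You would call it between `run.input_required(...)` and `run.input_approved(...)`. The post shows no call for a rejected or timed-out request, so what to report in those cases is left open here.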

Business event monitoring — payments, signups, orders

For business events that don't fit a task lifecycle — payment received, order placed, signup completed — use track. NotiLens watches the frequency of these events and fires a silence alert when they stop arriving.

Python

nl.track("payment.completed", "Payment received", meta={"amount": 149.99})
nl.track("user.signup", "New user registered")
nl.track("order.placed", "Order #1234", meta={"amount": 89.00})

Node.js

nl.track('payment.completed', 'Payment received', { meta: { amount: 149.99 } });
nl.track('user.signup', 'New user registered');
nl.track('order.placed', 'Order #1234', { meta: { amount: 89.00 } });

When payment.completed stops arriving for longer than your baseline window, a silence alert fires. You open Stripe knowing exactly what to look for.
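The post doesn't spell out how a "baseline window" is derived. As a rough mental model, assume it is the longest gap seen between past events, scaled by a safety factor. The sketch below implements that assumption. The real NotiLens logic may differ, and both function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence


def silence_threshold(timestamps: Sequence[datetime],
                      factor: float = 3.0) -> timedelta:
    """Longest gap between consecutive past events, scaled by `factor`."""
    if len(timestamps) < 2:
        raise ValueError("need at least two events to measure a gap")
    ordered = sorted(timestamps)
    longest = max(b - a for a, b in zip(ordered, ordered[1:]))
    return longest * factor


def is_silent(timestamps: Sequence[datetime], now: datetime,
              factor: float = 3.0) -> bool:
    """True if the time since the latest event exceeds the threshold."""
    return now - max(timestamps) > silence_threshold(timestamps, factor)
```

Unlike a cron schedule, payment and signup streams are irregular, which is why the window here is derived from observed gaps and not from a fixed interval.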

ML anomaly detection — no thresholds to configure

Most monitoring requires you to know what's wrong before you can alert on it. Set a threshold, define a rule, write the condition. The problem: you can't set a threshold for a failure you haven't seen yet.

NotiLens learns your baseline automatically. Your Wednesday signup rate. Your typical payment volume. Your agent's normal run count. Your cron's expected record output. No configuration needed — it figures out what normal looks like and alerts when reality diverges from it.
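To make "learning a baseline" concrete, here is one simple way such a model could work: bucket historical hourly counts by weekday and hour, then score a new value by how far it sits from that bucket's mean. This is an illustrative z-score sketch, not NotiLens's actual model, and every name in it is an assumption.

```python
import math
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple

Bucket = Tuple[int, int]  # (weekday, hour); Monday is weekday 0


def build_baseline(samples: Iterable[Tuple[datetime, float]]
                   ) -> Dict[Bucket, Tuple[float, float]]:
    """Group hourly counts by (weekday, hour); return (mean, std-dev) per bucket."""
    grouped: Dict[Bucket, List[float]] = defaultdict(list)
    for ts, value in samples:
        grouped[(ts.weekday(), ts.hour)].append(value)
    return {bucket: (mean(vals), pstdev(vals)) for bucket, vals in grouped.items()}


def z_score(baseline: Dict[Bucket, Tuple[float, float]],
            ts: datetime, value: float) -> float:
    """Standard deviations between `value` and the usual level for this time slot.

    Raises KeyError for a slot with no history yet (the cold-start case).
    """
    mu, sigma = baseline[(ts.weekday(), ts.hour)]
    if sigma == 0:
        return 0.0 if value == mu else math.copysign(math.inf, value - mu)
    return (value - mu) / sigma
```

A large negative score on a Wednesday at 2pm is the "drop" case; a large positive one is the "spike" case.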

What it catches that threshold alerts miss:

Spike detection — payment volume 3x above your Wednesday baseline at 2pm. Could be legitimate. Could be fraud. Either way, worth knowing.

Drop detection — signups down 80% from your Tuesday morning baseline. Server is up. No errors. Just unusually quiet.

Drift detection — API response time slowly climbing from 120ms to 250ms over two weeks. No single data point crosses a threshold. The trend is the signal.

Broken flow detection — payment.initiated normally reaches payment.completed within 3 minutes. When the gap extends, NotiLens catches it before the window closes on recovery.
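The broken-flow case can be pictured as matching start events to end events and flagging the ones left hanging. The sketch below assumes each event carries a correlation id (for example a payment id). The `track` calls shown earlier don't include one, so treat that field and the function name as hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Set, Tuple

Event = Tuple[str, str, datetime]  # (event name, correlation id, timestamp)


def stuck_flows(events: Iterable[Event], now: datetime,
                start: str = "payment.initiated",
                end: str = "payment.completed",
                max_wait: timedelta = timedelta(minutes=3)) -> List[str]:
    """Return ids that started more than `max_wait` ago and never reached `end`."""
    started: Dict[str, datetime] = {}
    finished: Set[str] = set()
    for name, flow_id, ts in events:
        if name == start:
            started.setdefault(flow_id, ts)  # keep the first start time
        elif name == end:
            finished.add(flow_id)
    return sorted(fid for fid, ts in started.items()
                  if fid not in finished and now - ts > max_wait)
```

The 3-minute default mirrors the example above. In practice the wait would come from the learned baseline, not a constant.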

No YAML. No threshold tuning. No false alerts during warm-up — NotiLens shows calibration progress in cold-start mode so you know when the model is ready.

The alert you get isn't "value exceeded X". It's "this is genuinely abnormal for your business at this time of day on this day of week."
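The drift case above (120ms climbing to 250ms over two weeks) shows why trends need their own check. One simple approach, assumed here for illustration and not taken from NotiLens, is to fit a straight line through the recent values and look at the total change it implies.

```python
from typing import Sequence


def slope_per_step(values: Sequence[float]) -> float:
    """Least-squares slope of `values` against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def is_drifting(values: Sequence[float], min_total_change: float) -> bool:
    """True if the fitted trend rose by at least `min_total_change` over the window."""
    return slope_per_step(values) * (len(values) - 1) >= min_total_change
```

A series that only wobbles around 120ms has a slope near zero and passes, while a steady climb of 10ms per day is flagged even though no single day looks alarming.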

Quick alerts — no task context needed

Python

nl.notify("disk.space", "Only 1GB left", level="warning")
nl.notify("deploy.done", "Deployed to production",
    open_url="https://dashboard.example.com/deploys"
)

Node.js

nl.notify('disk.space', 'Only 1GB left', { level: 'warning' });
nl.notify('deploy.done', 'Deployed to production', {
  openUrl: 'https://dashboard.example.com/deploys'
});
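For the disk-space example, the value has to come from somewhere. Here is a minimal sketch that reads free space with the standard library and decides whether an alert is warranted. The 2 GiB cutoff and the helper names are arbitrary choices for illustration.

```python
import shutil
from typing import Optional, Tuple

GIB = 1024 ** 3


def disk_alert(free_bytes: int,
               warn_below_gib: float = 2.0) -> Optional[Tuple[str, str, str]]:
    """Return (event, message, level) when free space is low, else None."""
    free_gib = free_bytes / GIB
    if free_gib >= warn_below_gib:
        return None
    return ("disk.space", f"Only {free_gib:.1f}GB left", "warning")


def check_root_disk() -> Optional[Tuple[str, str, str]]:
    """Check real free space on "/" using shutil.disk_usage."""
    return disk_alert(shutil.disk_usage("/").free)


# Intended use, following the notify call shown above:
#   alert = check_root_disk()
#   if alert:
#       nl.notify(alert[0], alert[1], level=alert[2])
```

Keeping the decision in a pure function makes it easy to test without filling a disk.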

CLI — for shell scripts and bash pipelines

No code changes needed. Register once, use anywhere:

notilens init --name my-app --token YOUR_TOKEN --secret YOUR_SECRET

notilens notify order.placed "Order #1234" --name my-app
notilens notify disk.space "Disk 7.5GB" --name my-app --meta size=7.5

notilens start    --name my-app --task nightly-sync
notilens progress "Fetching records" --name my-app --task nightly-sync
notilens metric   records=461 --name my-app --task nightly-sync
notilens complete "Done" --name my-app --task nightly-sync
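If you'd rather wrap an existing command than edit the script itself, a thin Python wrapper around the CLI is one option. This sketch uses only the `start` and `complete` subcommands shown above. The post shows no failure subcommand, so on a non-zero exit it sends nothing further and leaves the run without a completion. The function name and the injectable `run` parameter are assumptions for illustration.

```python
import subprocess
from typing import Callable, Sequence


def run_with_pings(cmd: Sequence[str], name: str, task: str,
                   run: Callable = subprocess.run) -> int:
    """Run `cmd` between `notilens start` and `notilens complete`; return its exit code."""
    scope = ["--name", name, "--task", task]
    run(["notilens", "start", *scope], check=False)
    result = run(list(cmd), check=False)
    if result.returncode == 0:
        run(["notilens", "complete", "Done", *scope], check=False)
    # On failure nothing more is sent: no failure subcommand is shown in the
    # post, so this sketch leaves the run without a completion ping.
    return result.returncode
```

For example, `run_with_pings(["./nightly-sync.sh"], "my-app", "nightly-sync")` would wrap a shell script named `nightly-sync.sh` (a placeholder) without touching its contents.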

What you get

One feed across your entire stack. Cron jobs, AI agents, payments, signups, orders, servers — all in one place with context. When an alert fires, you know which dashboard to open, what to look for, and roughly when it started.

No more guessing. No more morning dashboard checks hoping something is obviously wrong.

The dashboards are still there. The monitoring layer just tells you which one matters right now.

DEV Community
