Ilya Ploskovitov

Posted on Jan 3

I got tired of guessing why my server crashed: Building a "Smart" Monitor with Global Checks & JSON Validation

#devops #monitoring #cloudfunctions #webdev

TL;DR: Basic uptime tools just tell you "It's down." I wanted a tool that tells me why (DNS? SSL? App crash?), checks from Tokyo/NY, and validates JSON schemas. So I built OpsPulse.

The "It’s Just Down" Problem

Every developer has been there. It’s 3:00 AM. PagerDuty/Telegram screams "Service Down." You wake up, rush to the terminal, check the logs... and the service is fine.

Was it a network blip? Did the load balancer choke? Did an ISP in Europe drop packets?

Most uptime monitors are lazy. They check for a 200 OK from a single region (usually AWS us-east-1) and call it a day. That wasn't enough for me. I decided to build a platform that digs deeper without costing as much as Datadog.

Here is how I built OpsPulse and what makes it different.

1. Smart Diagnostics (Root Cause Analysis)

The killer feature of OpsPulse is context. It doesn’t just yell "Error!"; it tries to diagnose the patient.

When an HTTP check fails, the worker triggers a cascade of lower-level checks:

Ping (ICMP): Is the server even reachable?
TCP Connect: Is the port open even though Nginx is hanging?
SSL Handshake: Did the cert expire, or is the chain of trust broken?

The Result: instead of a generic "Error 500", you get a Telegram alert saying:

🔴 Status: DOWN
📉 Reason: Web Server Error
🧠 Context: Port 443 open, Ping OK, but Nginx returned 502 Bad Gateway. The issue is on the backend.
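
To make the cascade concrete, here is a minimal Python sketch of how the lower-level probes and the final verdict could fit together. It is an illustration under assumptions, not the actual OpsPulse worker: the function names (tcp_connect_ok, tls_handshake_ok, diagnose) and the reason strings are hypothetical, and the ICMP result is passed in as a plain boolean because a real ping usually needs raw-socket privileges.

```python
import socket
import ssl
from typing import Optional


def tcp_connect_ok(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def tls_handshake_ok(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TLS handshake with certificate verification succeeds."""
    # The default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # ssl.SSLError (including certificate errors) is a subclass of OSError.
        return False


def diagnose(ping_ok: bool, tcp_ok: bool, tls_ok: bool,
             http_status: Optional[int]) -> str:
    """Turn raw probe results into a human-readable reason (hypothetical labels)."""
    if not ping_ok and not tcp_ok:
        return "Host unreachable"
    if not tcp_ok:
        return "Port closed or filtered"
    if not tls_ok:
        return "SSL/TLS handshake failed"
    if http_status is None:
        return "Web server not responding"
    if http_status >= 500:
        return "Web Server Error"
    return "Unexpected HTTP response"
```

Note that a failed ping alone does not end the diagnosis, since many hosts drop ICMP while still serving traffic over TCP.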

2. True Global Monitoring (Multi-Region)

Local checks lie. To verify availability properly, I integrated Google Cloud Functions. OpsPulse spins up ephemeral runners to check your resource simultaneously from the US, Europe, and Asia.

This enabled a Global DNS Monitor:

Checks propagation of A, MX, TXT records worldwide.
Uses Fuzzy Matching (handling trailing dots and format quirks).
If your site is up in NY but down in Tokyo, you’ll know.
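
The fuzzy matching and the per-region comparison can be sketched roughly like this. The normalization rules (collapse whitespace, drop the trailing dot, ignore case for everything except TXT) and the names normalize_record and regions_out_of_sync are my assumptions for illustration; the real resolver calls in each region are left out, so the function only compares answers that were already collected.

```python
from typing import Dict, Iterable, List


def normalize_record(rtype: str, value: str) -> str:
    """Normalize one DNS answer so cosmetic differences do not cause mismatches."""
    value = value.strip()
    if rtype.upper() == "TXT":
        # TXT data is often returned wrapped in quotes; its content is case-sensitive.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        return value
    # Hostname-like data: collapse whitespace, drop the trailing dot, ignore case.
    return " ".join(value.split()).rstrip(".").lower()


def regions_out_of_sync(rtype: str, expected: Iterable[str],
                        observed_by_region: Dict[str, List[str]]) -> List[str]:
    """Return the regions whose answers differ from the expected record set."""
    want = {normalize_record(rtype, v) for v in expected}
    return sorted(
        region
        for region, values in observed_by_region.items()
        if {normalize_record(rtype, v) for v in values} != want
    )
```

Comparing sets rather than lists means the order in which a resolver returns records does not matter.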

3. Dev-Centric Features (Not just for websites)

I built this for developers, not just for marketing landing pages.

Advanced HTTP Monitor: It supports custom headers, all methods (GET, POST, PATCH), and strict Content Validation:

Positive Match: Ensure the response contains "Success".
Negative Match: Alert if the response contains "Exception" or "MySQL Error".
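
In code, the two match rules boil down to a pair of substring checks. The sketch below is a simplified assumption of how this could work (plain case-sensitive substrings, a hypothetical content_problems helper), not the exact OpsPulse logic.

```python
from typing import Iterable, List


def content_problems(body: str,
                     must_contain: Iterable[str] = (),
                     must_not_contain: Iterable[str] = ()) -> List[str]:
    """Return a list of content-validation failures; an empty list means the body passed."""
    problems = []
    for needle in must_contain:
        # Positive match: the response has to include this text.
        if needle not in body:
            problems.append(f"missing expected text: {needle!r}")
    for needle in must_not_contain:
        # Negative match: this text must never appear, even on a 200 OK.
        if needle in body:
            problems.append(f"found forbidden text: {needle!r}")
    return problems
```

This is what catches the classic "200 OK with a stack trace in the body" case that a status-code check alone would miss.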

Heartbeat (Dead Man's Switch) with JSON Schema: Perfect for backups and cron jobs.

Scenario: Your backup script sends { "status": "ok", "size_mb": 2 }.
Config: Alert if size_mb < 50. The job reports "ok", but a 2 MB backup is suspiciously small, so the alert fires.
The Magic: I added JSON Schema Validation. You can enforce a strict structure on your incoming webhooks. It turns uptime monitoring into business-metric monitoring.
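
Here is a small sketch of the idea using only the Python standard library. It is a hand-rolled structural check standing in for full JSON Schema validation (in practice a dedicated validator library would likely do this job), and the EXPECTED_FIELDS table, MIN_SIZE_MB constant, and check_heartbeat function are hypothetical names built around the backup scenario above.

```python
import json
from typing import List

# Hypothetical expected shape: field name -> accepted Python type(s).
EXPECTED_FIELDS = {"status": str, "size_mb": (int, float)}
MIN_SIZE_MB = 50  # threshold taken from the backup scenario


def check_heartbeat(raw_body: str) -> List[str]:
    """Validate a heartbeat payload; an empty list means nothing to alert on."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return ["payload is not valid JSON"]
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]

    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(payload[field], bool) or not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    if problems:
        return problems  # structure is broken, so skip the business rules

    if payload["status"] != "ok":
        problems.append(f"status is {payload['status']!r}")
    if payload["size_mb"] < MIN_SIZE_MB:
        problems.append(f"size_mb {payload['size_mb']} is below {MIN_SIZE_MB}")
    return problems
```

The structure check runs first so that the threshold rule never has to deal with a missing or mistyped field.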

4. Security First

Since a monitoring tool sends requests everywhere, I had to prevent abuse:

SSRF Protection: Strict blocking of internal network scanning (localhost, 192.168.x.x) and cloud metadata endpoints.
SSL Chain Validation: OpsPulse doesn't just check the expiry date. It validates the full chain of trust.
Header Sanitization: Stripping dangerous headers before webhook dispatch.
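
For the SSRF part, a minimal sketch built on Python's standard ipaddress module could look like the following. The helper names (is_blocked_ip, is_safe_target) are hypothetical, and a production guard needs more than this, for example connecting to the exact address that was validated so a second DNS lookup cannot return something different.

```python
import ipaddress
import socket


def is_blocked_ip(address: str) -> bool:
    """Return True if the address must never be probed by the monitor."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return True  # anything unparseable is rejected rather than trusted
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local   # covers 169.254.169.254, the usual cloud metadata address
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_safe_target(hostname: str) -> bool:
    """Resolve a hostname and allow it only if every resolved address is public."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)
```

Checking every resolved address matters because a hostname can return a public and an internal address at the same time.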

5. Alerts You Actually Want to Read

Grace Period: Ignore 1-second network hiccups.
Recovery Alerts: Get notified when systems are back online.
Channels: Telegram (bot), Slack (rich formatting), and custom Webhooks.
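
The grace period and recovery alerts are essentially a tiny state machine. The sketch below shows one way to model it; the AlertState class, its field names, and the 60-second default are assumptions made for illustration, not the real OpsPulse settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    grace_seconds: float = 60.0         # hypothetical default grace period
    down_since: Optional[float] = None  # time of the first failure in the current streak
    alerted: bool = False               # whether a DOWN alert was already sent

    def observe(self, is_up: bool, now: float) -> Optional[str]:
        """Feed one check result; return "DOWN", "RECOVERED", or None (stay quiet)."""
        if is_up:
            was_alerted = self.alerted
            self.down_since = None
            self.alerted = False
            # Only announce a recovery if the outage was actually reported.
            return "RECOVERED" if was_alerted else None
        if self.down_since is None:
            self.down_since = now
        if not self.alerted and now - self.down_since >= self.grace_seconds:
            self.alerted = True
            return "DOWN"
        return None
```

A short blip that clears before the grace period ends produces no message at all, and a long outage produces exactly one DOWN and one RECOVERED alert.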

The Tech Stack

Frontend: Next.js + React (Real-time Dashboard).
Backend Worker: Python (For heavy lifting and network checks).
Cloud: Google Cloud Functions (For global nodes).
Database: PostgreSQL (via Supabase).

Conclusion

OpsPulse started as a side project to stop the 3 AM guessing game. Now it’s a full platform that helps me sleep better.

OpsPulse

What checks are missing from your current monitoring tools? Let me know in the comments! 👇

DEV Community