<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anubhav Rai</title>
    <description>The latest articles on DEV Community by Anubhav Rai (@acrticsludge).</description>
    <link>https://dev.to/acrticsludge</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3819468%2Fbb344ad0-04b6-4992-bf91-b806b2bf82c3.png</url>
      <title>DEV Community: Anubhav Rai</title>
      <link>https://dev.to/acrticsludge</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/acrticsludge"/>
    <language>en</language>
    <item>
      <title>I got surprised by a GitHub Actions quota. Built a tool to make sure it never happens again, here's how I built it</title>
      <dc:creator>Anubhav Rai</dc:creator>
      <pubDate>Wed, 25 Mar 2026 10:13:03 +0000</pubDate>
      <link>https://dev.to/acrticsludge/i-got-surprised-by-a-github-actions-quota-built-a-tool-to-make-sure-it-never-happens-again-heres-mb3</link>
      <guid>https://dev.to/acrticsludge/i-got-surprised-by-a-github-actions-quota-built-a-tool-to-make-sure-it-never-happens-again-heres-mb3</guid>
      <description>&lt;p&gt;Few weeks ago I was pushing a fix for a small project I made related to a minecraft server i play :p. My data updates just stopped... Turns out I burned through the free Actions minutes three days earlier. GitHub doesn't email you, they just silently stop running your jobs.&lt;/p&gt;

&lt;p&gt;I checked Vercel the next day. 91% bandwidth. Two days from getting throttled.&lt;/p&gt;

&lt;p&gt;That's when I realised I was doing manual laps of four different billing pages every week just to feel safe: GitHub, Vercel, Supabase, Railway, each buried under a different nav, none of them proactively alerting you. I had just started college and wanted to build something meaningful, so with the help of Claude Code I built Stackwatch. Here's how it actually works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The polling worker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The core is a standalone Node.js worker running on Railway. It's dead simple: a cron job (&lt;code&gt;node-cron&lt;/code&gt;) that fires every 5 minutes and loops through every connected integration in the database.&lt;/p&gt;

&lt;p&gt;The clever bit is tier-aware polling. Free users get 15-minute intervals, Pro gets 5. The worker runs on a 5-minute tick but filters out integrations that synced too recently for their tier:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;const dueIntegrations = integrations.filter((i) =&amp;gt; {&lt;br&gt;
  const tier = tierMap.get(i.user_id) ?? "free";&lt;br&gt;
  const interval = tier === "free" ? FREE_POLL_INTERVAL_MS : PRO_POLL_INTERVAL_MS;&lt;br&gt;
  if (!i.last_synced_at) return true;&lt;br&gt;
  return now - new Date(i.last_synced_at).getTime() &amp;gt;= interval;&lt;br&gt;
});&lt;/code&gt;&lt;br&gt;
One worker, two polling rates, no separate queues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storing API keys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Users paste their tokens, which get encrypted before hitting the database. I went with AES-256-GCM so I get authenticated encryption and the auth tag catches tampering. Each encryption generates a fresh random IV, and the stored value is &lt;code&gt;iv:authTag:ciphertext&lt;/code&gt;. Decryption validates the tag before returning anything:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;const ALGORITHM = "aes-256-gcm";&lt;br&gt;&lt;br&gt;
export function encrypt(plaintext: string): string {&lt;br&gt;
  const iv = randomBytes(12);&lt;br&gt;
  const cipher = createCipheriv(ALGORITHM, key, iv);&lt;br&gt;
  const encrypted = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);&lt;br&gt;
  const authTag = cipher.getAuthTag();&lt;br&gt;
  return `${iv.toString("hex")}:${authTag.toString("hex")}:${encrypted.toString("hex")}`;&lt;br&gt;
}&lt;/code&gt;&lt;br&gt;
The encryption key is a 64-char hex env var (32 bytes). Raw API keys never touch logs.&lt;/p&gt;
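&lt;p&gt;The matching decrypt is a short inversion of that. Here's a self-contained sketch (it restates encrypt and uses a random in-memory key so it runs standalone; the real key comes from the env var):&lt;/p&gt;

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

const ALGORITHM = "aes-256-gcm";
// Real app: Buffer.from(process.env.ENCRYPTION_KEY!, "hex"). A random key is
// used here only so the sketch is self-contained.
const key = randomBytes(32);

export function encrypt(plaintext: string): string {
  const iv = randomBytes(12); // fresh random IV per encryption
  const cipher = createCipheriv(ALGORITHM, key, iv);
  const encrypted = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return `${iv.toString("hex")}:${cipher.getAuthTag().toString("hex")}:${encrypted.toString("hex")}`;
}

export function decrypt(stored: string): string {
  const [ivHex, tagHex, dataHex] = stored.split(":");
  const decipher = createDecipheriv(ALGORITHM, key, Buffer.from(ivHex, "hex"));
  decipher.setAuthTag(Buffer.from(tagHex, "hex"));
  return Buffer.concat([
    decipher.update(Buffer.from(dataHex, "hex")),
    decipher.final(), // throws here if the auth tag doesn't match (tampering)
  ]).toString("utf8");
}
```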

&lt;p&gt;&lt;strong&gt;Auth and data isolation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Auth is Supabase Auth via email/password, magic link, GitHub and Google OAuth. Every table has Row Level Security enabled so users can only ever read their own rows. The worker uses a service-role key (bypasses RLS intentionally) because it needs to poll all users. The frontend client uses the anon key and relies on RLS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alerts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When usage crosses a threshold (default 80%, user-configurable per metric) the worker fires alerts via Resend (email), Slack webhooks, or Discord webhooks. To prevent spam, it stores a record in &lt;code&gt;alert_history&lt;/code&gt; and won't re-alert on the same metric until usage drops below the threshold and crosses it again.&lt;/p&gt;
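&lt;p&gt;The crossing logic is small enough to show. A sketch of the idea (names are illustrative, not the actual Stackwatch schema):&lt;/p&gt;

```typescript
// Fire an alert only on an upward crossing of the threshold; stay quiet while
// usage remains above it, and re-arm once it drops back below.
type AlertState = { active: boolean };

function shouldAlert(usagePct: number, thresholdPct: number, state: AlertState): boolean {
  if (usagePct >= thresholdPct) {
    if (state.active) return false; // already alerted for this crossing
    state.active = true;
    return true; // first tick at or above the threshold: alert once
  }
  state.active = false; // back below: the next crossing may alert again
  return false;
}
```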

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next.js App Router, TypeScript throughout. Server components by default, client components only where there's interactivity. The dashboard auto-refreshes every 5 minutes. Usage history graphs are built with Recharts. One gotcha: if you use a formatted date string (like "Mar 21") as your Recharts dataKey and you have multiple snapshots on the same day, the tooltip snaps to the first point of that date. The fix is to use the raw ISO timestamp as the dataKey and format it only in &lt;code&gt;tickFormatter&lt;/code&gt; and &lt;code&gt;labelFormatter&lt;/code&gt;.&lt;/p&gt;
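&lt;p&gt;A sketch of the formatter side of that fix (the component wiring is assumed to look like &lt;code&gt;&amp;lt;XAxis dataKey="timestamp" tickFormatter={formatTick} /&amp;gt;&lt;/code&gt;):&lt;/p&gt;

```typescript
// Keep the raw ISO timestamp as the Recharts dataKey so every snapshot is a
// distinct point, and only format for display. Formatting in UTC keeps ticks
// stable across client time zones (an assumption, not from the post).
function formatTick(iso: string): string {
  return new Date(iso).toLocaleDateString("en-US", {
    month: "short",
    day: "numeric",
    timeZone: "UTC",
  });
}
```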

&lt;p&gt;&lt;strong&gt;Stack summary&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next.js (App Router) on Vercel&lt;/li&gt;
&lt;li&gt;Supabase for auth, database, and RLS&lt;/li&gt;
&lt;li&gt;Railway for the polling worker&lt;/li&gt;
&lt;li&gt;Resend for email&lt;/li&gt;
&lt;li&gt;Recharts for usage graphs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TypeScript everywhere, no exceptions.&lt;/p&gt;

&lt;p&gt;It's live at &lt;a href="https://stackwatch.pulsemonitor.dev" rel="noopener noreferrer"&gt;https://stackwatch.pulsemonitor.dev&lt;/a&gt;. The free tier covers one account per service, which is enough for most solo founders. Happy to answer questions about any part of the build.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>nextjs</category>
      <category>react</category>
    </item>
    <item>
      <title>How I built root cause analysis into my free API uptime monitor</title>
      <dc:creator>Anubhav Rai</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:09:52 +0000</pubDate>
      <link>https://dev.to/acrticsludge/how-i-built-root-cause-analysis-into-my-free-api-uptime-monitor-56pe</link>
      <guid>https://dev.to/acrticsludge/how-i-built-root-cause-analysis-into-my-free-api-uptime-monitor-56pe</guid>
      <description>&lt;p&gt;Most uptime monitors tell you your API is down. Mine tells you why.&lt;br&gt;
I got tired of waking up to a vague "monitor failed" alert with zero context. Is it a DNS issue? Did the server crash? Is it a TLS problem? You have no idea until you log in, dig through logs, and piece it together yourself.&lt;br&gt;
So when I built Pulse — my own API monitoring tool — I made root cause analysis the core feature. Here's how I implemented it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most monitors do something like this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;const response = await axios.get(url);&lt;br&gt;
if (response.status !== 200) {&lt;br&gt;
  sendAlert('monitor is down');&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That tells you nothing. You know the request failed. You don't know where.&lt;br&gt;
An HTTP request isn't a single operation — it's a pipeline of stages. DNS lookup, TCP connection, TLS handshake, time to first byte. Each stage can fail independently and each failure means something completely different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Switching to native http with timing hooks&lt;/strong&gt;&lt;br&gt;
Axios doesn't expose per-stage timing. Node's built-in http/https module does via socket events. I rewrote the ping function to capture each stage separately:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;const req = transport.request(options, (res) =&amp;gt; {&lt;br&gt;
  timings.ttfb = Date.now() - startTime;&lt;br&gt;
  res.resume(); // drain the body so 'end' actually fires&lt;br&gt;
  res.on('end', () =&amp;gt; {&lt;br&gt;
    timings.total = Date.now() - startTime;&lt;br&gt;
  });&lt;br&gt;
});&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;req.on('socket', (socket) =&amp;gt; {&lt;br&gt;
  socket.on('lookup', () =&amp;gt; {&lt;br&gt;
    timings.dnsLookup = Date.now() - startTime;&lt;br&gt;
  });&lt;br&gt;
  socket.on('connect', () =&amp;gt; {&lt;br&gt;
    timings.tcpConnect = Date.now() - startTime - timings.dnsLookup;&lt;br&gt;
  });&lt;br&gt;
  socket.on('secureConnect', () =&amp;gt; {&lt;br&gt;
    // subtract both earlier deltas so this is the handshake alone&lt;br&gt;
    timings.tlsHandshake = Date.now() - startTime - timings.dnsLookup - timings.tcpConnect;&lt;br&gt;
  });&lt;br&gt;
});&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now every ping stores &lt;code&gt;dns_lookup_ms&lt;/code&gt;, &lt;code&gt;tcp_connect_ms&lt;/code&gt;, &lt;code&gt;tls_handshake_ms&lt;/code&gt;, and &lt;code&gt;ttfb_ms&lt;/code&gt; separately in the database alongside the usual &lt;code&gt;status_code&lt;/code&gt; and &lt;code&gt;response_time_ms&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The inference logic&lt;/strong&gt;&lt;br&gt;
With per-stage timings stored, I wrote a pure function that compares the failed ping against the historical baseline for that monitor and infers the likely cause:&lt;br&gt;
&lt;code&gt;// DNS spiked but TCP was fine — DNS issue&lt;br&gt;
if (dnsRatio &amp;gt; 3) {&lt;br&gt;
  return { cause: 'DNS resolution failure', confidence: 75 }&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;// TCP failed entirely — server unreachable&lt;br&gt;
if (!tcpConnectMs) {&lt;br&gt;
  return { cause: 'Server unreachable — connection refused', confidence: 85 }&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;// Everything fine until TTFB — server-side problem&lt;br&gt;
if (ttfbRatio &amp;gt; 5) {&lt;br&gt;
  return { cause: 'Upstream server overload or slow database query', confidence: 78 }&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;// Status code tells us exactly what happened&lt;br&gt;
if (statusCode === 503) {&lt;br&gt;
  return { cause: 'Service unavailable — server overloaded or in maintenance', confidence: 92 }&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;No ML, no black box. Just rule-based inference against a baseline. A 503 with normal DNS/TCP/TLS timings but a spiked TTFB looks completely different from a connection timeout with no TCP at all.&lt;/p&gt;
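&lt;p&gt;Stitched together, the rules above amount to one pure function. A sketch (the ratios and confidence numbers are the ones from the snippets; the input shape and rule ordering are my guesses):&lt;/p&gt;

```typescript
type Timings = { dnsMs: number; tcpMs: number; ttfbMs: number };
type Diagnosis = { cause: string; confidence: number };

// Rule-based root cause inference: compare a failed ping's per-stage timings
// against the monitor's baseline, most specific signal first.
function inferCause(failed: Timings, baseline: Timings, statusCode?: number): Diagnosis {
  if (statusCode === 503) {
    return { cause: "Service unavailable - server overloaded or in maintenance", confidence: 92 };
  }
  if (!failed.tcpMs) {
    // TCP never completed: nothing listening, or a firewall dropped the SYN
    return { cause: "Server unreachable - connection refused", confidence: 85 };
  }
  if (failed.dnsMs / baseline.dnsMs > 3) {
    return { cause: "DNS resolution failure", confidence: 75 };
  }
  if (failed.ttfbMs / baseline.ttfbMs > 5) {
    return { cause: "Upstream server overload or slow database query", confidence: 78 };
  }
  return { cause: "Unknown", confidence: 0 };
}
```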




&lt;p&gt;&lt;strong&gt;What it looks like in practice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a monitor goes down, instead of just logging the failure, Pulse shows:&lt;/p&gt;

&lt;p&gt;Root Cause Analysis&lt;br&gt;
Likely cause: Upstream server overload (78% confidence)&lt;/p&gt;

&lt;p&gt;DNS Lookup      → 34ms   normal&lt;br&gt;
TCP Connect     → 28ms   normal&lt;br&gt;
TLS Handshake   → 71ms   normal&lt;br&gt;
Time to First Byte → 8432ms  CRITICAL (56x baseline)&lt;/p&gt;

&lt;p&gt;Suggestion: Server is responding but very slowly —&lt;br&gt;
check database queries and server load&lt;/p&gt;

&lt;p&gt;That's immediately actionable. You know it's not a network problem. It's not DNS. The server is reachable but something on the backend is choking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The baseline problem&lt;/strong&gt;&lt;br&gt;
The tricky part was making the comparisons meaningful. A 200ms TTFB is great for one endpoint and terrible for another. I compute a rolling baseline from the last 20 successful pings for each monitor individually, so the thresholds adapt to the normal behavior of that specific endpoint.&lt;/p&gt;
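&lt;p&gt;The per-monitor baseline itself is just a rolling mean. A sketch of what that can look like (the 20-ping window is from the post; the helper shape is mine):&lt;/p&gt;

```typescript
// Mean of the last `window` successful samples for one timing column of one
// monitor; each monitor gets its own baseline so thresholds adapt per endpoint.
function rollingBaseline(successfulMs: number[], window = 20): number {
  const recent = successfulMs.slice(-window);
  if (recent.length === 0) return 0; // no history yet: caller should skip ratio checks
  return recent.reduce((sum, v) => sum + v, 0) / recent.length;
}
```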

&lt;p&gt;&lt;strong&gt;What I learned&lt;/strong&gt;&lt;br&gt;
The biggest insight was that most of the value isn't in the ML or the fancy inference — it's just in capturing the right data at ping time. Once you have per-stage timings stored, the analysis is mostly pattern matching. The hard part was switching from axios to raw http and making sure the timing hooks fired reliably across both HTTP and HTTPS endpoints.&lt;br&gt;
The second thing I learned: storing this data costs almost nothing. Four extra integer columns per ping row. The diagnostic value is completely disproportionate to the storage cost.&lt;/p&gt;

&lt;p&gt;Pulse is free — 5 monitors, no credit card. If you want to see the root cause analysis in action or poke around the implementation: &lt;a href="https://pulsemonitor.dev" rel="noopener noreferrer"&gt;Pulse&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy to hear opinions!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>node</category>
      <category>javascript</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
