
Diven Rastdus

Originally published at astraedus.dev

5 Monitoring Blind Spots That Let My Side Projects Fail Silently

I run four side projects: a journaling app, an Android app blocker, a healthcare AI tool, and a content pipeline. Total monitoring budget: $0.

Last month, one of them went down for 24 hours. Nobody told me. I found out by accident.

That scared me enough to audit all four projects. I found the same five blind spots across every single one.

1. No Uptime Checks (The "It's Probably Fine" Gap)

My journaling app runs on Supabase's free tier. Free-tier projects auto-pause after 7 days of inactivity. I knew this in theory.

In practice, I shipped a demo to a potential client. The project had been idle. Supabase paused it. The API returned nothing. The frontend showed a blank screen.

For 24 hours, anyone visiting saw a broken app. I only discovered it when I opened the dashboard for an unrelated reason.

The fix: A cron job that hits every critical endpoint every 6 hours.

#!/bin/bash
# health-check.sh - runs via cron every 6h
ENDPOINTS=(
  "https://my-app.vercel.app/api/health"
  "https://my-backend.supabase.co/rest/v1/"
)

for url in "${ENDPOINTS[@]}"; do
  status=$(curl -sf -o /dev/null -w "%{http_code}" "$url" --max-time 10)
  if [ "$status" != "200" ]; then
    python3 ~/bin/alert.py --message "DOWN: $url returned $status"
  fi
done

Total cost: $0. It runs on the same machine that runs everything else. Hosted services like Better Stack or UptimeRobot are better options, but this costs nothing and catches roughly 80% of failures, by my estimate.
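The health check shells out to `~/bin/alert.py`, which isn't shown. Here is a minimal sketch of what such a script could look like, assuming alerts go to a Telegram bot through the Bot API `sendMessage` method. The environment variable names `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID` are hypothetical placeholders, not necessarily my actual setup.

```python
"""alert.py - hypothetical sketch: forward a message to a Telegram chat."""
import argparse
import json
import os
import urllib.request


def build_request(token, chat_id, message):
    """Build the HTTP request for the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(message, timeout=10):
    """Send the alert; raises if the env vars are missing or the call fails."""
    token = os.environ["TELEGRAM_BOT_TOKEN"]  # assumed variable name
    chat_id = os.environ["TELEGRAM_CHAT_ID"]  # assumed variable name
    request = build_request(token, chat_id, message)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status == 200


def parse_args(argv=None):
    """Parse the --message flag used by health-check.sh."""
    parser = argparse.ArgumentParser(description="Send a push alert")
    parser.add_argument("--message", required=True)
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point: parse the flag and send the alert."""
    send_alert(parse_args(argv).message)
```

Calling `main()` under an `if __name__ == "__main__":` guard turns this into the command-line script the cron job expects. Letting a failed send raise is deliberate: an alerting script that swallows its own errors recreates the blind spot.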

2. Unstructured Logs Across Services

My healthcare AI tool runs three microservices on Cloud Run: an MCP server, an orchestrator agent, and an interaction checker. When a patient reconciliation fails, which service caused it?

With print() statements, the answer is "good luck." Cloud Run interleaves logs from all services. One request touches all three. There's no correlation ID linking them.

I spent 40 minutes tracing a bug that turned out to be a timeout in the MCP server. The orchestrator logged "reconciliation failed." The MCP server logged nothing useful. The interaction checker never got called.

The fix: Structured JSON logs with a request ID passed through every service call.

import json, uuid, sys

def log(level, msg, **extra):
    entry = {"severity": level, "message": msg, **extra}
    print(json.dumps(entry), file=sys.stderr)

# At the request entry point
request_id = str(uuid.uuid4())[:8]
log("INFO", "Reconciliation started",
    request_id=request_id, patient_id=patient_id)

# Pass request_id to downstream services via header
# X-Request-ID: {request_id}

Cloud Run forwards JSON lines written to stdout or stderr to Cloud Logging, which parses them into structured fields; most log aggregators do the same. Now I can filter by request_id and see the full trace across services. Structured logging was the single most impactful monitoring improvement I made.
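The "pass request_id downstream" step is where this usually breaks, so it's worth spelling out. Here is a minimal sketch, assuming each service can read its incoming headers as a plain dict. The helper names `incoming_request_id` and `outgoing_headers` are hypothetical, not from any framework.

```python
import json
import sys
import uuid

HEADER = "X-Request-ID"


def incoming_request_id(headers):
    """Reuse the caller's request ID if present, otherwise start a new trace."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == HEADER.lower() and value:
            return value
    return uuid.uuid4().hex[:8]


def outgoing_headers(request_id, extra=None):
    """Headers to attach to every downstream call so the ID follows the request."""
    headers = dict(extra or {})
    headers[HEADER] = request_id
    return headers


def log(level, msg, **extra):
    """Emit one JSON object per line on stderr."""
    entry = {"severity": level, "message": msg, **extra}
    print(json.dumps(entry), file=sys.stderr)
```

Each service calls `incoming_request_id` once at its entry point, includes the result in every `log` call, and sends `outgoing_headers(request_id)` with every downstream request. Only the first service in the chain ever generates a new ID.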

3. Zero Mobile Crash Visibility

My Android app blocker uses Kotlin and Jetpack Compose. R8 (Android's code shrinker) silently removed a class my accessibility service needed. The app installed fine. It launched fine. The core feature just... didn't work.

I found this bug during manual testing on a real device. If this had shipped to users, I would have had zero visibility. No crash reports. No error logs. Nothing.

Android's logcat only helps while the device is attached to your development machine over adb. Once the app is on someone else's phone, you're blind.

The fix: At minimum, catch uncaught exceptions and log them somewhere you can read later.

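// Keep the existing handler so the process still terminates normally
val previousHandler = Thread.getDefaultUncaughtExceptionHandler()

Thread.setDefaultUncaughtExceptionHandler { thread, throwable ->
    val report = buildString {
        appendLine("Thread: ${thread.name}")
        appendLine("Error: ${throwable.message}")
        appendLine(throwable.stackTraceToString().take(2000))
    }
    // Write to local file, upload on next app launch.
    // runCatching: a failed write must not mask the original crash.
    runCatching { File(context.filesDir, "crash.log").writeText(report) }
    // Hand the exception back to the previous handler
    previousHandler?.uncaughtException(thread, throwable)
}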

Crashlytics, Sentry, or Bugfender give you stack traces, device info, and occurrence counts out of the box. This basic handler still beats flying blind when you're not ready to pay for one.

4. API Quota Exhaustion With No Warning

This week, my social media automation stopped working. No errors in my code. No exceptions. Just... nothing posted.

The X (Twitter) API returns a CreditsDepleted error when you hit your monthly quota. My posting script caught the error, logged it to a file, and moved on. Nobody reads log files proactively.

I discovered the issue 2 days later when I manually checked why engagement dropped to zero.

The fix: Treat quota and billing errors as alerts, not log lines.

try:
    response = api.post_tweet(text)
except ApiError as e:
    # CreditsDepleted = all posting dead until cycle resets. Treat as outage.
    if "CreditsDepleted" in str(e) or e.status_code == 429:
        send_alert(f"API QUOTA HIT: {e}. All posting blocked until reset.")
    else:
        logger.error(f"Tweet failed: {e}")

The distinction matters. A 500 error is usually transient and often clears on retry. A quota error means everything is broken until the billing cycle resets. That deserves a push notification, not a log line buried in a file.
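One way to encode that distinction is to retry transient failures and escalate blocking ones immediately. Here is a minimal sketch, assuming the client raises an exception that carries a `status_code`. `ApiError` is a stand-in class, and `send_alert` is passed in as a callable rather than taken from any real library.

```python
import time


class ApiError(Exception):
    """Stand-in for the API client's error type (hypothetical)."""

    def __init__(self, message, status_code=None):
        super().__init__(message)
        self.status_code = status_code


def is_blocking(error):
    """Quota and rate-limit errors will not clear up on an immediate retry."""
    return "CreditsDepleted" in str(error) or error.status_code == 429


def call_with_escalation(action, send_alert, retries=3, base_delay=1.0,
                         sleep=time.sleep):
    """Run action(); retry transient errors with backoff, alert on blocking ones."""
    for attempt in range(retries):
        try:
            return action()
        except ApiError as error:
            if is_blocking(error):
                send_alert(f"API QUOTA HIT: {error}. All posting blocked until reset.")
                return None
            if attempt == retries - 1:
                # A "transient" error that survives every retry is an outage too.
                send_alert(f"Still failing after {retries} attempts: {error}")
                return None
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

Usage would look like `call_with_escalation(lambda: api.post_tweet(text), send_alert)`. The `sleep` parameter exists only so the backoff can be stubbed out in tests.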

5. CI Tests That Don't Run Against Production

Last week, my healthcare tool's production API broke. I didn't find out from my CI pipeline. I found out from a GitHub notification that sat unread in my inbox for 3 days.

The problem: my end-to-end tests run against local Docker containers. They pass every time. But the deployed Cloud Run services had drifted. CI was green. Production was broken.

The fix: A scheduled workflow that hits the real production URLs every 6 hours.

# .github/workflows/e2e-smoke.yml
name: e2e smoke tests
on:
  schedule:
    - cron: '0 */6 * * *'
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install httpx pytest
      - run: pytest tests/e2e/ -v
        env:
          MCP_SERVER_URL: ${{ secrets.PROD_MCP_URL }}

CI passing on localhost doesn't mean production works. Scheduled tests against real endpoints catch the drift. I also routed failure notifications to Telegram instead of GitHub's notification bell. GitHub is too noisy. A direct push notification cuts through.
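The workflow runs whatever lives in `tests/e2e/`, which this post doesn't show. Here is a minimal sketch of the kind of check such a test could make, assuming the service exposes a `/health` route (an assumption; substitute whatever endpoint your deployed service actually serves). It uses only the standard library rather than httpx.

```python
import urllib.error
import urllib.request


def health_url(base_url, path="/health"):
    """Join the base URL and the health path without doubling slashes."""
    return base_url.rstrip("/") + path


def fetch_status(url, timeout=10):
    """Return the HTTP status code, including for 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on error statuses; the code is still what we want.
        return err.code


def assert_healthy(base_url, fetch=fetch_status):
    """Fail loudly, with the offending URL and status, if the service is not OK."""
    status = fetch(health_url(base_url))
    assert status == 200, f"{base_url} returned {status}"
```

A pytest test in `tests/e2e/` then reduces to a one-liner that calls `assert_healthy(os.environ["MCP_SERVER_URL"])`. Reading the variable with `os.environ[...]` rather than `.get()` means a missing secret fails the run instead of silently skipping the check.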

The Pattern

Every one of these gaps follows the same shape: something fails, nothing tells me, I find out too late.

The fixes are embarrassingly simple. A cron job. A JSON log line with a request ID. A try/except that sends a push notification instead of writing to a file. None of this is hard.

Monitoring isn't about the tool. It's about closing the loop between "something broke" and "someone who can fix it found out." If that loop is open, nothing else matters.
