<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SecEngineerX</title>
    <description>The latest articles on DEV Community by SecEngineerX (@secengineerx).</description>
    <link>https://dev.to/secengineerx</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3680957%2F0035d42c-6c7e-472c-adcd-2ab9f0d77299.png</url>
      <title>DEV Community: SecEngineerX</title>
      <link>https://dev.to/secengineerx</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/secengineerx"/>
    <language>en</language>
    <item>
      <title>How I Built a Distributed Uptime Monitoring System with FastAPI</title>
      <dc:creator>SecEngineerX</dc:creator>
      <pubDate>Thu, 05 Mar 2026 09:34:55 +0000</pubDate>
      <link>https://dev.to/secengineerx/how-i-built-a-distributed-uptime-monitoring-system-with-fastapi-1a2h</link>
      <guid>https://dev.to/secengineerx/how-i-built-a-distributed-uptime-monitoring-system-with-fastapi-1a2h</guid>
      <description>&lt;p&gt;&lt;strong&gt;The Real Problem With Uptime Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most uptime monitoring tools work like this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single server sends a request to your endpoint every few minutes.&lt;/p&gt;

&lt;p&gt;If the request fails, the system declares downtime.&lt;br&gt;
Simple.&lt;br&gt;
Also very wrong.&lt;/p&gt;

&lt;p&gt;A single monitor cannot reliably determine whether an application is actually down. Network routing issues, DNS delays, or temporary congestion can produce false downtime alerts even when the service is functioning normally.&lt;/p&gt;

&lt;p&gt;In production environments, false positives create a serious problem.&lt;/p&gt;

&lt;p&gt;Engineers lose trust in the monitoring system.&lt;/p&gt;

&lt;p&gt;Once that happens, alerts stop being useful.&lt;/p&gt;

&lt;p&gt;So when I started building &lt;em&gt;TrustMonitor&lt;/em&gt;, the first design constraint was simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The monitoring system itself must be reliable enough to be trusted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture Overview&lt;/strong&gt;&lt;br&gt;
Instead of relying on a single monitor, the system uses a distributed verification approach.&lt;br&gt;
The monitoring flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scheduler
   ↓
Primary Monitor
   ↓
Secondary Verification
   ↓
Incident Recording
   ↓
Signed Incident Report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage reduces the probability of false alerts and ensures that incidents cannot be modified after they are recorded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor Scheduling&lt;/strong&gt;&lt;br&gt;
A scheduler is responsible for dispatching monitoring jobs at defined intervals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each job contains:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;endpoint URL&lt;/li&gt;
&lt;li&gt;expected status code&lt;/li&gt;
&lt;li&gt;timeout threshold&lt;/li&gt;
&lt;li&gt;verification rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example structure:&lt;/em&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.example.com/health"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expected_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The scheduler pushes these jobs into a queue where worker nodes perform the actual checks.&lt;/p&gt;

&lt;p&gt;Separating scheduling from execution prevents monitoring delays if a worker becomes slow or temporarily unavailable.&lt;/p&gt;
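&lt;p&gt;The scheduler/worker split can be sketched with an &lt;code&gt;asyncio.Queue&lt;/code&gt;. This is a minimal illustration of the design, not TrustMonitor’s actual code; the &lt;code&gt;MonitorJob&lt;/code&gt; shape and interval are assumptions:&lt;/p&gt;

```python
import asyncio
from dataclasses import dataclass

@dataclass
class MonitorJob:
    endpoint: str
    expected_status: int = 200
    timeout: float = 5.0

async def scheduler(queue, jobs, interval=60.0):
    # Dispatch every job into the queue at a fixed interval,
    # regardless of how long the checks themselves take.
    while True:
        for job in jobs:
            await queue.put(job)
        await asyncio.sleep(interval)

async def worker(queue, check):
    # Workers pull jobs independently, so one slow or stalled
    # check never delays scheduling of the next round.
    while True:
        job = await queue.get()
        try:
            await check(job)
        finally:
            queue.task_done()
```

&lt;p&gt;Because the scheduler only enqueues and the workers only execute, either side can fall behind without stalling the other.&lt;/p&gt;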

&lt;p&gt;&lt;strong&gt;Primary Monitor&lt;/strong&gt;&lt;br&gt;
The primary monitor sends the initial request to the target endpoint.&lt;/p&gt;

&lt;p&gt;In the current implementation, this is handled using FastAPI workers running asynchronous HTTP checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example simplified check:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the response matches the expected conditions, the monitor records a successful check.&lt;/p&gt;

&lt;p&gt;If not, &lt;strong&gt;the system does not immediately declare downtime.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where most monitoring tools fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secondary Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before an incident is recorded, a secondary verification monitor repeats the check.&lt;/p&gt;

&lt;p&gt;This step confirms whether the failure is real or caused by temporary network conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification logic:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Primary Monitor detects failure
        ↓
Secondary Monitor runs verification check
        ↓
If failure confirmed → incident recorded
If success → ignore false positive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple mechanism significantly reduces false downtime alerts.&lt;/p&gt;
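&lt;p&gt;The verification flow above can be sketched as a small coroutine. &lt;code&gt;verified_check&lt;/code&gt; and its parameters are illustrative, and in a real deployment the second pass would run from a different node so the two checks fail independently:&lt;/p&gt;

```python
import asyncio

async def verified_check(check, url, delay=10.0):
    # `check` is any coroutine function url -> bool, e.g. an
    # httpx-based probe. Only a failure confirmed by both the
    # primary and the verification pass becomes an incident.
    if await check(url):
        return "UP"
    await asyncio.sleep(delay)   # let transient network conditions clear
    if await check(url):
        return "UP"              # primary failure was a false positive
    return "DOWN"
```

&lt;p&gt;Passing the probe in as a parameter also makes the policy testable without a network.&lt;/p&gt;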

&lt;p&gt;&lt;strong&gt;Incident Recording&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the failure is verified, the system records an incident containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timestamp&lt;/li&gt;
&lt;li&gt;endpoint&lt;/li&gt;
&lt;li&gt;failure reason&lt;/li&gt;
&lt;li&gt;verification results&lt;/li&gt;
&lt;li&gt;response data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example incident structure:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"api.example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DOWN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-05T10:20:15Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verified"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, recording incidents alone is not enough.&lt;/p&gt;

&lt;p&gt;Monitoring systems must also guarantee data integrity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cryptographic Incident Signing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A key design decision in TrustMonitor is that incident records are cryptographically signed.&lt;/p&gt;

&lt;p&gt;This prevents incidents from being altered later.&lt;/p&gt;

&lt;p&gt;Each incident is hashed using a cryptographic digest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example concept:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;incident_data → SHA256 → incident_signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The signature proves that the incident existed at a specific time and has not been modified.&lt;/p&gt;
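&lt;p&gt;As a sketch of this idea: the post describes a SHA-256 digest, and one common way to make the digest tamper-evident is a keyed HMAC over a canonical serialization of the record. The key and field names below are placeholders, not TrustMonitor’s actual scheme:&lt;/p&gt;

```python
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-real-key"   # illustrative only

def sign_incident(incident: dict, key: bytes = SECRET_KEY) -> str:
    # Serialize deterministically so the same incident always
    # produces the same bytes, then take a keyed SHA-256 digest.
    payload = json.dumps(incident, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()

def verify_incident(incident: dict, signature: str, key: bytes = SECRET_KEY) -> bool:
    # Any edit to the record changes the digest, so a stored
    # signature exposes after-the-fact modifications.
    return hmac.compare_digest(sign_incident(incident, key), signature)
```

&lt;p&gt;An auditor holding the key can recompute the signature and detect any record that was altered after the fact.&lt;/p&gt;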

&lt;p&gt;&lt;strong&gt;This becomes useful for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;post-incident audits&lt;/li&gt;
&lt;li&gt;SLA verification&lt;/li&gt;
&lt;li&gt;infrastructure debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building monitoring systems revealed a few important realities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-location monitoring is unreliable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Network issues happen constantly. A single monitor cannot determine service health with certainty.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Verification layers are essential.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring systems must be trustworthy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If alerts generate too many false positives, engineers eventually ignore them.&lt;/p&gt;

&lt;p&gt;A monitoring system that isn’t trusted is worse than having none at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident integrity matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring data should be tamper-resistant. Signed incidents create verifiable records of infrastructure events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring infrastructure sounds simple on paper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice, reliability requires careful design around:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;verification&lt;/li&gt;
&lt;li&gt;distributed checks&lt;/li&gt;
&lt;li&gt;incident integrity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TrustMonitor is still evolving, but building it has already surfaced interesting engineering challenges around monitoring accuracy and system trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future improvements will focus on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multi-region verification&lt;/li&gt;
&lt;li&gt;anomaly detection&lt;/li&gt;
&lt;li&gt;improved alert reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Because in monitoring systems, trust is everything.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>fastapi</category>
      <category>monitoring</category>
      <category>python</category>
    </item>
    <item>
      <title>Catching Silent API Failures: A Micro-Lab</title>
      <dc:creator>SecEngineerX</dc:creator>
      <pubDate>Mon, 02 Mar 2026 17:09:36 +0000</pubDate>
      <link>https://dev.to/secengineerx/catching-silent-api-failures-a-micro-lab-3i3</link>
      <guid>https://dev.to/secengineerx/catching-silent-api-failures-a-micro-lab-3i3</guid>
      <description>&lt;p&gt;In most systems, monitoring only checks if an API is “reachable.”&lt;/p&gt;

&lt;p&gt;That’s not enough.&lt;/p&gt;

&lt;p&gt;Consider a silent failure: the endpoint responds with 200 OK, logs show success, but the data returned is wrong.&lt;/p&gt;

&lt;p&gt;Users see broken features, and engineers often don’t know until it’s too late.&lt;/p&gt;

&lt;p&gt;I’m exploring this using the OpenAI API structure for my TrustMonitor project. &lt;/p&gt;

&lt;p&gt;Screenshot attached shows the full API layout I’m analyzing.&lt;/p&gt;

&lt;p&gt;The goal is simple: verify not just uptime, but correctness of the response.&lt;/p&gt;

&lt;p&gt;Once verified, silent failures can be caught early, saving time, money, and credibility.&lt;/p&gt;

&lt;p&gt;Takeaway: Monitoring isn’t just about uptime; it’s about proof that your system actually does what it promises.&lt;/p&gt;

&lt;p&gt;Next step: automate response verification and alerting, turning silent failures into visible signals.&lt;/p&gt;
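&lt;p&gt;Automated response verification can start as simply as checking the shape of the payload against declared expectations. The helper and the rule format below are hypothetical, meant only to show the idea:&lt;/p&gt;

```python
def verify_response(payload: dict, rules: dict) -> list:
    # Return a list of rule violations; an empty list means the
    # response is not just reachable but structurally correct.
    # `rules` maps field names to expected types.
    violations = []
    for field, expected_type in rules.items():
        if field not in payload:
            violations.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            violations.append(f"wrong type for {field}")
    return violations
```

&lt;p&gt;A 200 OK whose body fails these rules is exactly the silent failure described above: reachable, but wrong.&lt;/p&gt;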

</description>
      <category>api</category>
      <category>backend</category>
      <category>monitoring</category>
      <category>testing</category>
    </item>
    <item>
      <title>Retry Logic Is a Policy Decision, Not a Code Pattern</title>
      <dc:creator>SecEngineerX</dc:creator>
      <pubDate>Sat, 31 Jan 2026 21:54:40 +0000</pubDate>
      <link>https://dev.to/secengineerx/retry-logic-is-a-policy-decision-not-a-code-pattern-1lmi</link>
      <guid>https://dev.to/secengineerx/retry-logic-is-a-policy-decision-not-a-code-pattern-1lmi</guid>
      <description>&lt;p&gt;I used to think retry logic was an implementation detail.&lt;/p&gt;

&lt;p&gt;It isn’t.&lt;/p&gt;

&lt;p&gt;Retries encode assumptions about failure, time, trust, and responsibility. When those assumptions are wrong, systems don’t crash. They lie quietly.&lt;/p&gt;

&lt;p&gt;This post isn’t about elegance. It’s about being explicit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The mistake people make&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most retry implementations answer the wrong question.&lt;/p&gt;

&lt;p&gt;They ask: “How do I try again?”&lt;/p&gt;

&lt;p&gt;The real question is: “Under what failures am I allowed to try again?”&lt;/p&gt;

&lt;p&gt;Those are not the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Retries are not resilience by default&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Blind retries are comforting. They make engineers feel proactive.&lt;/p&gt;

&lt;p&gt;In reality, they often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mask real outages&lt;/li&gt;
&lt;li&gt;Amplify load during partial failures&lt;/li&gt;
&lt;li&gt;Destroy observability&lt;/li&gt;
&lt;li&gt;Delay alerts until damage is done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A retry without a failure model is just noise with a sleep call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;What I learned building a monitoring primitive&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While building an async endpoint checker, I was forced to confront a few uncomfortable truths.&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;Parameters are contracts&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If a function exposes a parameter that is not used, it is lying.&lt;/p&gt;

&lt;p&gt;Dead parameters rot APIs. They create false confidence and future bugs. Removing them is not cleanup. It’s honesty.&lt;/p&gt;




&lt;ol start="2"&gt;
&lt;li&gt;Catching &lt;code&gt;Exception&lt;/code&gt; inside retries is negligence&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Retrying on all exceptions means retrying on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Programmer errors&lt;/li&gt;
&lt;li&gt;Logic bugs&lt;/li&gt;
&lt;li&gt;Invalid states&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those failures should terminate execution immediately.&lt;/p&gt;

&lt;p&gt;Retries are for expected, transient failures only. Anything else must fail fast.&lt;/p&gt;




&lt;ol start="3"&gt;
&lt;li&gt;HTTP retries without backoff are hostile behavior&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Retrying immediately on 500s or 429s is not resilience.&lt;/p&gt;

&lt;p&gt;It’s pressure.&lt;/p&gt;

&lt;p&gt;If your system retries aggressively during degradation, it becomes part of the outage. Good retry logic reduces harm. Bad retry logic accelerates it.&lt;/p&gt;
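&lt;p&gt;The points above can be sketched as an explicit policy: retry only statuses the policy declares transient, back off exponentially with jitter, and report the final attempt unchanged. The status set and parameters here are illustrative assumptions, not a prescription:&lt;/p&gt;

```python
import random
import time

TRANSIENT_STATUSES = {429, 500, 502, 503, 504}

def retry_with_backoff(call, max_attempts=4, base_delay=0.5):
    # `call` returns an HTTP status code; anything it *raises* is a
    # programmer error and must propagate immediately (fail fast).
    for attempt in range(max_attempts):
        status = call()
        if status not in TRANSIENT_STATUSES:
            return status   # success, or a failure the policy won't retry
        if attempt + 1 == max_attempts:
            return status   # report the last attempt truthfully
        # Exponential backoff with jitter, so retries never hammer
        # a service that is already degrading.
        delay = base_delay * (2 ** attempt) * (0.5 + random.random())
        time.sleep(delay)
    return status
```

&lt;p&gt;Note what the sketch does &lt;em&gt;not&lt;/em&gt; do: it never catches exceptions, and it never rewrites the final result.&lt;/p&gt;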




&lt;ol start="4"&gt;
&lt;li&gt;Time must have a single owner&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If multiple layers measure “total time”, metrics become contradictory.&lt;/p&gt;

&lt;p&gt;Time is a resource, not a side effect.&lt;/p&gt;

&lt;p&gt;Only one layer should own it. Everything else should report partial truth or nothing at all.&lt;/p&gt;
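&lt;p&gt;One way to illustrate single ownership of time: the top layer fixes a deadline and hands each step the time it has left, instead of letting every layer keep its own stopwatch. A sketch of the principle, with invented names:&lt;/p&gt;

```python
import time

def run_with_budget(steps, budget_seconds=5.0):
    # One layer owns the clock: each step is handed the time it has
    # left, and no inner layer measures "total time" on the side.
    deadline = time.monotonic() + budget_seconds
    results = []
    for name, step in steps:
        if time.monotonic() > deadline:
            results.append((name, "skipped: budget exhausted"))
            continue
        remaining = deadline - time.monotonic()
        results.append((name, step(remaining)))
    return results
```

&lt;p&gt;Inner steps may report how long they took, but only the owner decides whether time ran out.&lt;/p&gt;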




&lt;ol start="5"&gt;
&lt;li&gt;Helpers should not know semantics&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;A retry helper that understands HTTP status codes is doing too much.&lt;/p&gt;

&lt;p&gt;Helpers should be stupid and obedient. Policy belongs to the caller.&lt;/p&gt;

&lt;p&gt;When helpers start making decisions, architecture leaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The most dangerous bug&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the final retry, it’s tempting to overwrite the result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Force failure&lt;/li&gt;
&lt;li&gt;Normalize fields&lt;/li&gt;
&lt;li&gt;“Clean things up”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;That destroys information.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The last attempt is still truth. Corrupting it poisons analytics, alerting, and postmortems. These bugs don’t show up in logs. They show up in lost trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why this matters in monitoring systems&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some failures justify retries. &lt;br&gt;
Others demand immediate alerts.&lt;br&gt;
Some should be recorded but not acted on.&lt;/p&gt;

&lt;p&gt;If a monitoring system cannot explain why something failed, it cannot be trusted when it claims something is broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing thought&lt;/strong&gt;&lt;br&gt;
Retry logic is not a loop. It’s a statement about how you believe the world behaves under stress.&lt;/p&gt;

&lt;p&gt;If that statement is vague, your system will be vague when it matters most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Explicit beats clever. Every time.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>devops</category>
      <category>observability</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Learning to Model Failure Properly While Building a Monitoring Tool in Python</title>
      <dc:creator>SecEngineerX</dc:creator>
      <pubDate>Fri, 30 Jan 2026 15:34:34 +0000</pubDate>
      <link>https://dev.to/secengineerx/learning-to-model-failure-properly-while-building-a-monitoring-tool-in-python-24f6</link>
      <guid>https://dev.to/secengineerx/learning-to-model-failure-properly-while-building-a-monitoring-tool-in-python-24f6</guid>
      <description>





&lt;p&gt;I’m currently building TrustMonitor, a small website and API monitoring tool using FastAPI, asyncio, and httpx.&lt;/p&gt;

&lt;p&gt;One thing that surprised me early was how vague the word failure becomes if you’re not careful.&lt;/p&gt;

&lt;p&gt;At first, any unsuccessful check was treated the same. If the request didn’t succeed within a timeout, it failed. That worked, but it hid important differences and made retries noisy.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What I changed&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Instead of treating all failures equally, I started separating them into two broad groups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transport-level failures&lt;/li&gt;
&lt;li&gt;application-level failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transport-level failures happen before an HTTP response exists. Examples include DNS resolution errors, connection timeouts, TLS issues, and read timeouts.&lt;/p&gt;

&lt;p&gt;Application-level failures are valid HTTP responses that still indicate a problem, such as 4xx or 5xx status codes.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;A simplified example&lt;/strong&gt;&lt;/p&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;connect_timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_error:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn’t final or elegant, but it’s explicit. Naming the failure before reacting to it made retries and alerts easier to reason about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some failures justify retries&lt;/li&gt;
&lt;li&gt;Others should alert immediately&lt;/li&gt;
&lt;li&gt;Aggressive retries can hide real outages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Without clear failure modeling, retries just add noise.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing thought&lt;/strong&gt;&lt;br&gt;
Even in a small project, thinking about time budgets, failure domains, and observability early makes a big difference.&lt;br&gt;
If a monitoring system can’t explain why something failed, it’s hard to trust it when things go wrong.&lt;/p&gt;

</description>
      <category>python</category>
      <category>cybersecurity</category>
      <category>opensource</category>
      <category>backend</category>
    </item>
    <item>
      <title>Most people learn cybersecurity by watching.
I learn by building, breaking, and publishing proof.
This is a fundamentals-first Python project, documented in public.
Code over noise.
https://github.com/SecEngineerX/text-analysis-python</title>
      <dc:creator>SecEngineerX</dc:creator>
      <pubDate>Sat, 27 Dec 2025 07:31:14 +0000</pubDate>
      <link>https://dev.to/secengineerx/most-people-learn-cybersecurity-by-watching-i-learn-by-building-breaking-and-publishing-proof-2o1f</link>
      <guid>https://dev.to/secengineerx/most-people-learn-cybersecurity-by-watching-i-learn-by-building-breaking-and-publishing-proof-2o1f</guid>
      <description>&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://github.com/SecEngineerX/text-analysis-python" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopengraph.githubassets.com%2F5fd6cbcd6ccf144a0d71576c2cd57a3941e51497aeb1faf87533d03a162f164e%2FSecEngineerX%2Ftext-analysis-python" height="600" class="m-0" width="1200"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://github.com/SecEngineerX/text-analysis-python" rel="noopener noreferrer" class="c-link"&gt;
            GitHub - SecEngineerX/text-analysis-python
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Contribute to SecEngineerX/text-analysis-python development by creating an account on GitHub.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.githubassets.com%2Ffavicons%2Ffavicon.svg" width="32" height="32"&gt;
          github.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
    </item>
  </channel>
</rss>
