<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SecEngineerX</title>
    <description>The latest articles on DEV Community by SecEngineerX (@secengineerx).</description>
    <link>https://dev.to/secengineerx</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3680957%2F0035d42c-6c7e-472c-adcd-2ab9f0d77299.png</url>
      <title>DEV Community: SecEngineerX</title>
      <link>https://dev.to/secengineerx</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/secengineerx"/>
    <language>en</language>
    <item>
      <title>How I Built a Distributed Uptime Monitoring System with FastAPI</title>
      <dc:creator>SecEngineerX</dc:creator>
      <pubDate>Thu, 05 Mar 2026 09:34:55 +0000</pubDate>
      <link>https://dev.to/secengineerx/how-i-built-a-distributed-uptime-monitoring-system-with-fastapi-1a2h</link>
      <guid>https://dev.to/secengineerx/how-i-built-a-distributed-uptime-monitoring-system-with-fastapi-1a2h</guid>
      <description>&lt;p&gt;&lt;strong&gt;The Real Problem With Uptime Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most uptime monitoring tools work like this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single server sends a request to your endpoint every few minutes.&lt;/p&gt;

&lt;p&gt;If the request fails, the system declares downtime.&lt;br&gt;
Simple.&lt;br&gt;
Also very wrong.&lt;/p&gt;

&lt;p&gt;A single monitor cannot reliably determine whether an application is actually down. Network routing issues, DNS delays, or temporary congestion can produce false downtime alerts even when the service is functioning normally.&lt;/p&gt;

&lt;p&gt;In production environments, false positives create a serious problem.&lt;/p&gt;

&lt;p&gt;Engineers lose trust in the monitoring system.&lt;/p&gt;

&lt;p&gt;Once that happens, alerts stop being useful.&lt;/p&gt;

&lt;p&gt;So when I started building &lt;em&gt;TrustMonitor&lt;/em&gt;, the first design constraint was simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The monitoring system itself must be reliable enough to be trusted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture Overview&lt;/strong&gt;&lt;br&gt;
Instead of relying on a single monitor, the system uses a distributed verification approach.&lt;br&gt;
The monitoring flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scheduler
   ↓
Primary Monitor
   ↓
Secondary Verification
   ↓
Incident Recording
   ↓
Signed Incident Report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage reduces the probability of false alerts and ensures that incidents cannot be modified after they are recorded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor Scheduling&lt;/strong&gt;&lt;br&gt;
A scheduler is responsible for dispatching monitoring jobs at defined intervals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each job contains:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;endpoint URL&lt;/li&gt;
&lt;li&gt;expected status code&lt;/li&gt;
&lt;li&gt;timeout threshold&lt;/li&gt;
&lt;li&gt;verification rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example structure:&lt;/em&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.example.com/health"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expected_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The scheduler pushes these jobs into a queue where worker nodes perform the actual checks.&lt;/p&gt;

&lt;p&gt;Separating scheduling from execution prevents monitoring delays if a worker becomes slow or temporarily unavailable.&lt;/p&gt;
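&lt;p&gt;The scheduler/worker split can be sketched with an &lt;code&gt;asyncio.Queue&lt;/code&gt;. This is a minimal illustration of the design, not TrustMonitor’s actual code; the &lt;code&gt;MonitorJob&lt;/code&gt; shape and interval are assumptions:&lt;/p&gt;

```python
import asyncio
from dataclasses import dataclass

@dataclass
class MonitorJob:
    endpoint: str
    expected_status: int = 200
    timeout: float = 5.0

async def scheduler(queue, jobs, interval=60.0):
    # Dispatch every job into the queue at a fixed interval,
    # regardless of how long the checks themselves take.
    while True:
        for job in jobs:
            await queue.put(job)
        await asyncio.sleep(interval)

async def worker(queue, check):
    # Workers pull jobs independently, so one slow or stalled
    # check never delays scheduling of the next round.
    while True:
        job = await queue.get()
        try:
            await check(job)
        finally:
            queue.task_done()
```

&lt;p&gt;Because the scheduler only enqueues and the workers only execute, either side can fall behind without stalling the other.&lt;/p&gt;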

&lt;p&gt;&lt;strong&gt;Primary Monitor&lt;/strong&gt;&lt;br&gt;
The primary monitor sends the initial request to the target endpoint.&lt;/p&gt;

&lt;p&gt;In the current implementation, this is handled using FastAPI workers running asynchronous HTTP checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example simplified check:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the response matches the expected conditions, the monitor records a successful check.&lt;/p&gt;

&lt;p&gt;If not, &lt;strong&gt;the system does not immediately declare downtime.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where most monitoring tools fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secondary Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before an incident is recorded, a secondary verification monitor repeats the check.&lt;/p&gt;

&lt;p&gt;This step confirms whether the failure is real or caused by temporary network conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification logic:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Primary Monitor detects failure
        ↓
Secondary Monitor runs verification check
        ↓
If failure confirmed → incident recorded
If success → ignore false positive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple mechanism significantly reduces false downtime alerts.&lt;/p&gt;
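&lt;p&gt;The verification flow above can be sketched as a small coroutine. &lt;code&gt;verified_check&lt;/code&gt; and its parameters are illustrative, and in a real deployment the second pass would run from a different node so the two checks fail independently:&lt;/p&gt;

```python
import asyncio

async def verified_check(check, url, delay=10.0):
    # `check` is any coroutine function url -> bool, e.g. an
    # httpx-based probe. Only a failure confirmed by both the
    # primary and the verification pass becomes an incident.
    if await check(url):
        return "UP"
    await asyncio.sleep(delay)   # let transient network conditions clear
    if await check(url):
        return "UP"              # primary failure was a false positive
    return "DOWN"
```

&lt;p&gt;Passing the probe in as a parameter also makes the policy testable without a network.&lt;/p&gt;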

&lt;p&gt;&lt;strong&gt;Incident Recording&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the failure is verified, the system records an incident containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timestamp&lt;/li&gt;
&lt;li&gt;endpoint&lt;/li&gt;
&lt;li&gt;failure reason&lt;/li&gt;
&lt;li&gt;verification results&lt;/li&gt;
&lt;li&gt;response data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example incident structure:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"api.example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DOWN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-05T10:20:15Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verified"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, recording incidents alone is not enough.&lt;/p&gt;

&lt;p&gt;Monitoring systems must also guarantee data integrity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cryptographic Incident Signing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A key design decision in TrustMonitor is that incident records are cryptographically signed.&lt;/p&gt;

&lt;p&gt;This prevents incidents from being altered later.&lt;/p&gt;

&lt;p&gt;Each incident is hashed using a cryptographic digest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example concept:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;incident_data → SHA256 → incident_signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The signature proves that the incident existed at a specific time and has not been modified.&lt;/p&gt;
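&lt;p&gt;As a sketch of this idea: the post describes a SHA-256 digest, and one common way to make the digest tamper-evident is a keyed HMAC over a canonical serialization of the record. The key and field names below are placeholders, not TrustMonitor’s actual scheme:&lt;/p&gt;

```python
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-real-key"   # illustrative only

def sign_incident(incident: dict, key: bytes = SECRET_KEY) -> str:
    # Serialize deterministically so the same incident always
    # produces the same bytes, then take a keyed SHA-256 digest.
    payload = json.dumps(incident, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()

def verify_incident(incident: dict, signature: str, key: bytes = SECRET_KEY) -> bool:
    # Any edit to the record changes the digest, so a stored
    # signature exposes after-the-fact modifications.
    return hmac.compare_digest(sign_incident(incident, key), signature)
```

&lt;p&gt;An auditor holding the key can recompute the signature and detect any record that was altered after the fact.&lt;/p&gt;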

&lt;p&gt;&lt;strong&gt;This becomes useful for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;post-incident audits&lt;/li&gt;
&lt;li&gt;SLA verification&lt;/li&gt;
&lt;li&gt;infrastructure debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building monitoring systems revealed a few important realities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-location monitoring is unreliable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Network issues happen constantly. A single monitor cannot determine service health with certainty.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Verification layers are essential.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring systems must be trustworthy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If alerts generate too many false positives, engineers eventually ignore them.&lt;/p&gt;

&lt;p&gt;A monitoring system that isn’t trusted is worse than having none at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident integrity matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring data should be tamper-resistant. Signed incidents create verifiable records of infrastructure events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring infrastructure sounds simple on paper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice, reliability requires careful design around:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;verification&lt;/li&gt;
&lt;li&gt;distributed checks&lt;/li&gt;
&lt;li&gt;incident integrity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TrustMonitor is still evolving, but building it has already surfaced interesting engineering challenges around monitoring accuracy and system trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future improvements will focus on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multi-region verification&lt;/li&gt;
&lt;li&gt;anomaly detection&lt;/li&gt;
&lt;li&gt;improved alert reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Because in monitoring systems, trust is everything.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>fastapi</category>
      <category>monitoring</category>
      <category>python</category>
    </item>
    <item>
      <title>Catching Silent API Failures: A Micro-Lab</title>
      <dc:creator>SecEngineerX</dc:creator>
      <pubDate>Mon, 02 Mar 2026 17:09:36 +0000</pubDate>
      <link>https://dev.to/secengineerx/catching-silent-api-failures-a-micro-lab-3i3</link>
      <guid>https://dev.to/secengineerx/catching-silent-api-failures-a-micro-lab-3i3</guid>
      <description>&lt;p&gt;In most systems, monitoring only checks if an API is “reachable.”&lt;/p&gt;

&lt;p&gt;That’s not enough.&lt;/p&gt;

&lt;p&gt;Consider a silent failure: the endpoint responds with 200 OK, logs show success, but the data returned is wrong.&lt;/p&gt;

&lt;p&gt;Users see broken features, and engineers often don’t know until it’s too late.&lt;/p&gt;

&lt;p&gt;I’m exploring this using the OpenAI API structure for my TrustMonitor project. &lt;/p&gt;

&lt;p&gt;Screenshot attached shows the full API layout I’m analyzing.&lt;/p&gt;

&lt;p&gt;The goal is simple: verify not just uptime, but correctness of the response.&lt;/p&gt;

&lt;p&gt;Once verified, silent failures can be caught early, saving time, money, and credibility.&lt;/p&gt;

&lt;p&gt;Takeaway: Monitoring isn’t just about uptime; it’s about proof that your system actually does what it promises.&lt;/p&gt;

&lt;p&gt;Next step: automate response verification and alerting, turning silent failures into visible signals.&lt;/p&gt;
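&lt;p&gt;Automated response verification can start as simply as checking the shape of the payload against declared expectations. The helper and the rule format below are hypothetical, meant only to show the idea:&lt;/p&gt;

```python
def verify_response(payload: dict, rules: dict) -> list:
    # Return a list of rule violations; an empty list means the
    # response is not just reachable but structurally correct.
    # `rules` maps field names to expected types.
    violations = []
    for field, expected_type in rules.items():
        if field not in payload:
            violations.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            violations.append(f"wrong type for {field}")
    return violations
```

&lt;p&gt;A 200 OK whose body fails these rules is exactly the silent failure described above: reachable, but wrong.&lt;/p&gt;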

</description>
      <category>api</category>
      <category>backend</category>
      <category>monitoring</category>
      <category>testing</category>
    </item>
    <item>
      <title>Retry Logic Is a Policy Decision, Not a Code Pattern</title>
      <dc:creator>SecEngineerX</dc:creator>
      <pubDate>Sat, 31 Jan 2026 21:54:40 +0000</pubDate>
      <link>https://dev.to/secengineerx/retry-logic-is-a-policy-decision-not-a-code-pattern-1lmi</link>
      <guid>https://dev.to/secengineerx/retry-logic-is-a-policy-decision-not-a-code-pattern-1lmi</guid>
      <description>&lt;p&gt;I used to think retry logic was an implementation detail.&lt;/p&gt;

&lt;p&gt;It isn’t.&lt;/p&gt;

&lt;p&gt;Retries encode assumptions about failure, time, trust, and responsibility. When those assumptions are wrong, systems don’t crash. They lie quietly.&lt;/p&gt;

&lt;p&gt;This post isn’t about elegance. It’s about being explicit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The mistake people make&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most retry implementations answer the wrong question.&lt;/p&gt;

&lt;p&gt;They ask: “How do I try again?”&lt;/p&gt;

&lt;p&gt;The real question is: “Under what failures am I allowed to try again?”&lt;/p&gt;

&lt;p&gt;Those are not the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Retries are not resilience by default&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Blind retries are comforting. They make engineers feel proactive.&lt;/p&gt;

&lt;p&gt;In reality, they often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mask real outages&lt;/li&gt;
&lt;li&gt;Amplify load during partial failures&lt;/li&gt;
&lt;li&gt;Destroy observability&lt;/li&gt;
&lt;li&gt;Delay alerts until damage is done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A retry without a failure model is just noise with a sleep call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;What I learned building a monitoring primitive&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While building an async endpoint checker, I was forced to confront a few uncomfortable truths.&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;Parameters are contracts&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If a function exposes a parameter that is not used, it is lying.&lt;/p&gt;

&lt;p&gt;Dead parameters rot APIs. They create false confidence and future bugs. Removing them is not cleanup. It’s honesty.&lt;/p&gt;




&lt;ol start="2"&gt;
&lt;li&gt;Catching &lt;code&gt;Exception&lt;/code&gt; inside retries is negligence&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Retrying on all exceptions means retrying on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Programmer errors&lt;/li&gt;
&lt;li&gt;Logic bugs&lt;/li&gt;
&lt;li&gt;Invalid states&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those failures should terminate execution immediately.&lt;/p&gt;

&lt;p&gt;Retries are for expected, transient failures only. Anything else must fail fast.&lt;/p&gt;




&lt;ol start="3"&gt;
&lt;li&gt;HTTP retries without backoff are hostile behavior&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Retrying immediately on 500s or 429s is not resilience.&lt;/p&gt;

&lt;p&gt;It’s pressure.&lt;/p&gt;

&lt;p&gt;If your system retries aggressively during degradation, it becomes part of the outage. Good retry logic reduces harm. Bad retry logic accelerates it.&lt;/p&gt;
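&lt;p&gt;The points above can be sketched as an explicit policy: retry only statuses the policy declares transient, back off exponentially with jitter, and report the final attempt unchanged. The status set and parameters here are illustrative assumptions, not a prescription:&lt;/p&gt;

```python
import random
import time

TRANSIENT_STATUSES = {429, 500, 502, 503, 504}

def retry_with_backoff(call, max_attempts=4, base_delay=0.5):
    # `call` returns an HTTP status code; anything it *raises* is a
    # programmer error and must propagate immediately (fail fast).
    for attempt in range(max_attempts):
        status = call()
        if status not in TRANSIENT_STATUSES:
            return status   # success, or a failure the policy won't retry
        if attempt + 1 == max_attempts:
            return status   # report the last attempt truthfully
        # Exponential backoff with jitter, so retries never hammer
        # a service that is already degrading.
        delay = base_delay * (2 ** attempt) * (0.5 + random.random())
        time.sleep(delay)
    return status
```

&lt;p&gt;Note what the sketch does &lt;em&gt;not&lt;/em&gt; do: it never catches exceptions, and it never rewrites the final result.&lt;/p&gt;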




&lt;ol start="4"&gt;
&lt;li&gt;Time must have a single owner&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If multiple layers measure “total time”, metrics become contradictory.&lt;/p&gt;

&lt;p&gt;Time is a resource, not a side effect.&lt;/p&gt;

&lt;p&gt;Only one layer should own it. Everything else should report partial truth or nothing at all.&lt;/p&gt;
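&lt;p&gt;One way to illustrate single ownership of time: the top layer fixes a deadline and hands each step the time it has left, instead of letting every layer keep its own stopwatch. A sketch of the principle, with invented names:&lt;/p&gt;

```python
import time

def run_with_budget(steps, budget_seconds=5.0):
    # One layer owns the clock: each step is handed the time it has
    # left, and no inner layer measures "total time" on the side.
    deadline = time.monotonic() + budget_seconds
    results = []
    for name, step in steps:
        if time.monotonic() > deadline:
            results.append((name, "skipped: budget exhausted"))
            continue
        remaining = deadline - time.monotonic()
        results.append((name, step(remaining)))
    return results
```

&lt;p&gt;Inner steps may report how long they took, but only the owner decides whether time ran out.&lt;/p&gt;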




&lt;ol start="5"&gt;
&lt;li&gt;Helpers should not know semantics&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;A retry helper that understands HTTP status codes is doing too much.&lt;/p&gt;

&lt;p&gt;Helpers should be stupid and obedient. Policy belongs to the caller.&lt;/p&gt;

&lt;p&gt;When helpers start making decisions, architecture leaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The most dangerous bug&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the final retry, it’s tempting to overwrite the result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Force failure&lt;/li&gt;
&lt;li&gt;Normalize fields&lt;/li&gt;
&lt;li&gt;“Clean things up”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;That destroys information.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The last attempt is still truth. Corrupting it poisons analytics, alerting, and postmortems. These bugs don’t show up in logs. They show up in lost trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why this matters in monitoring systems&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some failures justify retries. &lt;br&gt;
Others demand immediate alerts.&lt;br&gt;
Some should be recorded but not acted on.&lt;/p&gt;

&lt;p&gt;If a monitoring system cannot explain why something failed, it cannot be trusted when it claims something is broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing thought&lt;/strong&gt;&lt;br&gt;
Retry logic is not a loop. It’s a statement about how you believe the world behaves under stress.&lt;/p&gt;

&lt;p&gt;If that statement is vague, your system will be vague when it matters most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Explicit beats clever. Every time.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>devops</category>
      <category>observability</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Learning to Model Failure Properly While Building a Monitoring Tool in Python</title>
      <dc:creator>SecEngineerX</dc:creator>
      <pubDate>Fri, 30 Jan 2026 15:34:34 +0000</pubDate>
      <link>https://dev.to/secengineerx/learning-to-model-failure-properly-while-building-a-monitoring-tool-in-python-24f6</link>
      <guid>https://dev.to/secengineerx/learning-to-model-failure-properly-while-building-a-monitoring-tool-in-python-24f6</guid>
      <description>





&lt;p&gt;I’m currently building TrustMonitor, a small website and API monitoring tool using FastAPI, asyncio, and httpx.&lt;/p&gt;

&lt;p&gt;One thing that surprised me early was how vague the word failure becomes if you’re not careful.&lt;/p&gt;

&lt;p&gt;At first, any unsuccessful check was treated the same. If the request didn’t succeed within a timeout, it failed. That worked, but it hid important differences and made retries noisy.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What I changed&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Instead of treating all failures equally, I started separating them into two broad groups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transport-level failures&lt;/li&gt;
&lt;li&gt;application-level failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transport-level failures happen before an HTTP response exists. Examples include DNS resolution errors, connection timeouts, TLS issues, and read timeouts.&lt;/p&gt;

&lt;p&gt;Application-level failures are valid HTTP responses that still indicate a problem, such as 4xx or 5xx status codes.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;A simplified example&lt;/strong&gt;&lt;/p&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;connect_timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_error:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn’t final or elegant, but it’s explicit. Naming the failure before reacting to it made retries and alerts easier to reason about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some failures justify retries&lt;/li&gt;
&lt;li&gt;Others should alert immediately&lt;/li&gt;
&lt;li&gt;Aggressive retries can hide real outages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Without clear failure modeling, retries just add noise.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing thought&lt;/strong&gt;&lt;br&gt;
Even in a small project, thinking about time budgets, failure domains, and observability early makes a big difference.&lt;br&gt;
If a monitoring system can’t explain why something failed, it’s hard to trust it when things go wrong.&lt;/p&gt;

</description>
      <category>python</category>
      <category>cybersecurity</category>
      <category>opensource</category>
      <category>backend</category>
    </item>
    <item>
      <title>Most people learn cybersecurity by watching.
I learn by building, breaking, and publishing proof.
This is a fundamentals-first Python project, documented in public.
Code over noise.
https://github.com/SecEngineerX/text-analysis-python</title>
      <dc:creator>SecEngineerX</dc:creator>
      <pubDate>Sat, 27 Dec 2025 07:31:14 +0000</pubDate>
      <link>https://dev.to/secengineerx/most-people-learn-cybersecurity-by-watching-i-learn-by-building-breaking-and-publishing-proof-2o1f</link>
      <guid>https://dev.to/secengineerx/most-people-learn-cybersecurity-by-watching-i-learn-by-building-breaking-and-publishing-proof-2o1f</guid>
      <description>&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://github.com/SecEngineerX/text-analysis-python" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopengraph.githubassets.com%2F5fd6cbcd6ccf144a0d71576c2cd57a3941e51497aeb1faf87533d03a162f164e%2FSecEngineerX%2Ftext-analysis-python" height="600" class="m-0" width="1200"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://github.com/SecEngineerX/text-analysis-python" rel="noopener noreferrer" class="c-link"&gt;
            GitHub - SecEngineerX/text-analysis-python
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Contribute to SecEngineerX/text-analysis-python development by creating an account on GitHub.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.githubassets.com%2Ffavicons%2Ffavicon.svg" width="32" height="32"&gt;
          github.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
    </item>
  </channel>
</rss>
