DEV Community: Ahmed Rakan

Learning Load Balancer Algorithms With Nginx

Ahmed Rakan — Wed, 06 May 2026 08:18:32 +0000

Introduction

A load balancer is a traffic director that sits between clients and a pool of backend servers, deciding which server handles each request and distributing that traffic in the specific way the developer configures.

The goal of a load balancer is to distribute workload, scale traffic dynamically, and improve service availability: three properties you don't get from a single server, no matter how big or sophisticated you make it.

Load balancers come in two main kinds:

L4 ( Transport Layer ) : Often a hardware-based load balancer found in data centers. Routes based on IP, port and TCP/UDP connection state. Doesn't inspect content, which maximizes speed. Fast, and suited to simple load-balancing applications.
L7 ( Application Layer ) : Inspects HTTP headers, paths, cookies, even request bodies. Enables content-based routing and robust security. Slower per request than L4 but vastly more flexible. Runs on commodity hardware, which makes it suitable for a wider range of applications.

A third category is Global Server Load Balancers ( GSLB ). They operate above both, distributing traffic across regions using DNS-based or anycast routing. That's out of scope for this post, but it's worth knowing that GSLB usually sits on top of the L4/L7 layer.

L4 and L7 load balancers can be managed or self-hosted. With managed L7 you get less ops work but less low-level control. With managed L4 you get limited features in exchange for minimal ops, without the complexity of running L4 load balancers yourself.

One interesting design aspect worth exploring in depth is the algorithm a load balancer uses to pick the destination.

Types of Load-balancer algorithms

There are two types of load balancer algorithms: stateful and stateless. Stateful algorithms track per-backend metrics to make a decision. Stateless algorithms make decisions from configuration alone. Stateful algorithms are smarter but require the LB to track more, while stateless algorithms are faster but fit more specific use cases.

Round Robin Algorithm: Rotate through available servers in a loop.
IP Hash: We want the same client to always hit the same backend. We can do this with the function hash(client_ip) % servers_count.
Weighted Round Robin: We want to send proportionally more traffic to more capable servers. If A has weight 5 and B and C have weight 1 each, A gets ~5/7 of the traffic.
URL Hash: Certain paths always go to the same backend. One use case is cache locality: for example, if /users has a hot cache for user data, we want clients to hit the specific backend where that hot cache lives so we can save memory elsewhere.
Least Connections : Route to the backend that has the fewest active connections right now.
Random Two with Least Connections : Pick two backends at random, send to whichever has fewer connections.
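To make the list above concrete, here is a minimal Python sketch of each selection rule. It is an illustration only: the function names, the backend names, and the use of CRC32 as the hash are my own assumptions, and this is not how Nginx implements these algorithms internally (for instance, the weighted rotation here is the naive "repeat by weight" version).

```python
import itertools
import random
import zlib

BACKENDS = ["backend1", "backend2", "backend3"]


def round_robin(backends):
    """Yield backends in a fixed rotation, forever."""
    return itertools.cycle(backends)


def weighted_round_robin(weights):
    """Naive weighted rotation: repeat each backend `weight` times per cycle."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def ip_hash(client_ip, backends):
    """Hash the client IP first, then take the result modulo the pool size."""
    return backends[zlib.crc32(client_ip.encode()) % len(backends)]


def least_connections(active):
    """Pick the backend with the fewest active connections."""
    return min(active, key=active.get)


def random_two_least_conn(active, rng=random):
    """Sample two distinct backends, keep the one with fewer connections."""
    first, second = rng.sample(sorted(active), 2)
    return first if active[first] <= active[second] else second


rr = round_robin(BACKENDS)
first_four = [next(rr) for _ in range(4)]

wrr = weighted_round_robin({"A": 5, "B": 1, "C": 1})
one_cycle = [next(wrr) for _ in range(7)]  # A appears 5 times out of 7

active_connections = {"backend1": 12, "backend2": 3, "backend3": 7}
```

Round robin, weighted round robin and the hash need no runtime state about the backends, while the last two functions need a live connection count per backend. That is the stateless versus stateful split described earlier.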

One algorithm missing from that list is sticky cookie, where the load balancer pins each client session to one backend.

Its use case looks identical to IP hashing, but the two are different. A cookie gives more accurate routing: each individual client is routed to a known backend, even when several clients share one IP address.
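A small Python sketch of that difference, under stated assumptions: the cookie name srv_id is borrowed from the Nginx Plus example later in this post, the cookie jar is a plain dict standing in for real HTTP cookies, and the IP address is a documentation-range placeholder.

```python
import itertools
import zlib

BACKENDS = ["backend1", "backend2", "backend3"]
_rotation = itertools.cycle(BACKENDS)


def route_by_ip(client_ip):
    """IP hash: every client behind the same address shares one backend."""
    return BACKENDS[zlib.crc32(client_ip.encode()) % len(BACKENDS)]


def route_by_cookie(cookies):
    """Sticky cookie: honour a valid srv_id cookie, otherwise assign one."""
    backend = cookies.get("srv_id")
    if backend not in BACKENDS:
        backend = next(_rotation)      # first request of this session
        cookies["srv_id"] = backend    # stands in for a Set-Cookie header
    return backend


# Two different users behind one office NAT address.
alice, bob = {}, {}
shared_ip = "203.0.113.7"

same_by_ip = route_by_ip(shared_ip) == route_by_ip(shared_ip)  # always True
alice_first = route_by_cookie(alice)
bob_first = route_by_cookie(bob)
alice_again = route_by_cookie(alice)
```

With IP hashing both users land on the same backend because they share an address. With the cookie, each session gets its own assignment and keeps it on later requests.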

As the saying goes, "talk is cheap, show me the code". It's actually really easy to do this with well-known load balancers such as Nginx.

Round Robin with Nginx

events { worker_connections 1024; }

http {
    upstream backend {
        server backend1:5678;
        server backend2:5678;
        server backend3:5678;
    }
    server {
        listen 80;
        location / { proxy_pass http://backend; }
    }
}

IP Hashing

events { worker_connections 1024; }

http {
    upstream backend {
        ip_hash;
        server backend1:5678;
        server backend2:5678;
        server backend3:5678;
    }
    server {
        listen 80;
        location / { proxy_pass http://backend; }
    }
}

Weighted Round Robin

events { worker_connections 1024; }

http {
    upstream backend {
        server backend1:5678 weight=5;
        server backend2:5678 weight=1;
        server backend3:5678 weight=1;
    }
    server {
        listen 80;
        location / { proxy_pass http://backend; }
    }
}

URL hashing

events { worker_connections 1024; }

http {
    upstream backend {
        hash $request_uri consistent;
        server backend1:5678;
        server backend2:5678;
        server backend3:5678;
    }
    server {
        listen 80;
        location / { proxy_pass http://backend; }
    }
}

Least Connection

events { worker_connections 1024; }

http {
    upstream backend {
        least_conn;
        server backend1:5678;
        server backend2:5678;
        server backend3:5678;
    }
    server {
        listen 80;
        location / { proxy_pass http://backend; }
    }
}

The more modern approach ( Power of Two )

events { worker_connections 1024; }

http {
    upstream backend {
        random two least_conn;
        server backend1:5678;
        server backend2:5678;
        server backend3:5678;
    }
    server {
        listen 80;
        location / { proxy_pass http://backend; }
    }
}

Finally, the sticky cookie ( note: Nginx Plus only ) :

events { worker_connections 1024; }

http {
    upstream backend {
        sticky cookie srv_id expires=1h path=/;
        server backend1:5678;
        server backend2:5678;
        server backend3:5678;
    }
    server {
        listen 80;
        location / { proxy_pass http://backend; }

    }
}

Bonus: Observability

Algorithms are only part of the story. The other part is knowing what's actually happening in production. Load balancers give you huge leverage when it comes to visibility. Here is what load balancers help you see about your services :

Performance: Response time, LB metrics, Throughput, Latency P50, P99.
Health : Failed health checks, Active health checks.
Errors: HTTP error rates, Dropped connections.
Traffic: Total connections, Request Rate.

All of these can be achieved for free with open-source, self-managed load balancers. Nevertheless, if you want more you must pay. Pay for what exactly? Here are some examples of what paid LB versions give you:

Active health checks: Nginx OSS does only passive checks; it marks a backend down after live requests to it fail. Active checks help you go further and detect states such as ( slow-but-alive ) backends.
More stateful algorithms: Nginx OSS gives you IP hash, but many people mistake it for a solution to session-based load balancing. The edge case they miss is that multiple clients can share the same IP, which makes the IP a poor proxy for a session.
Least time : Nginx Plus allows you to route by observed response latency.
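To show what "passive" means in practice, here is a simplified Python sketch. The parameter names max_fails and fail_timeout are borrowed from the Nginx upstream server options, but the class and its logic are my own illustration, not Nginx's implementation.

```python
class PassiveHealth:
    """Mark a backend down after max_fails failed requests within fail_timeout seconds."""

    def __init__(self, max_fails=3, fail_timeout=10.0):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.failures = []      # timestamps of recent failed requests
        self.down_until = 0.0   # backend is skipped until this time

    def record(self, ok, now):
        """Feed in the outcome of a real client request."""
        if ok:
            return
        # Keep only failures that fall inside the sliding window.
        self.failures = [t for t in self.failures if now - t < self.fail_timeout]
        self.failures.append(now)
        if len(self.failures) >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.failures.clear()

    def available(self, now):
        return now >= self.down_until
```

The only signal here is real traffic that already failed. A backend that answers slowly but successfully never trips the check, which is the gap active health checks are meant to close.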

Best regards,

Ahmed

New Way to Explore Source Code ( vscode extension )

Ahmed Rakan — Tue, 21 Apr 2026 18:06:24 +0000

Structura is a free, open-source VS Code extension that brings next-generation code exploration and analysis to your editor. It transforms your codebase into a live, incrementally expanded interactive graph that aims to reduce cognitive load, eliminate mouse lag, and let you present source code to yourself or others in a fast, visually appealing way.

Here is the demo link :

YouTubeVideo
In the video :

Why code exploration needs to evolve.
Live demo ( quick and longer versions )
Keyboard-first navigation walkthrough
Roadmap and what's coming next

Currently : Supports JS, TS, JSX. Support for other programming languages is planned and help is needed; check out parser.md in the source code.

Key-features:

Incremental graph exploration ( no hair balls ) .
Neo-vim inspired keyboard navigation ( minimal for maximum movement ).
Near-real-time graph updates ( extremely fast, novel architecture ).
Semantic language and visuals of code intent.
And more...

The novel architecture :

Repo: https://github.com/ARAldhafeeri/structura-v2

Marketplace : https://marketplace.visualstudio.com/items?itemName=AhmedRakan.structura-v2

I let AI Run My Onboarding Here is what happened

Ahmed Rakan — Tue, 21 Apr 2026 17:08:34 +0000

So I sat with myself for a few hours a month ago, thinking deeply about what other ways I could leverage agentic AI in my solutions.

A chatbot seems to be the default everyone is going for, but I had recently launched a mini paid ad campaign for finbooki.com.

The marketing funnel focused on rebranding the solution around money clarity.

I asked myself how I could deliver the promise ( money clarity ) to users within the first few seconds of them entering the software.

And there I got the aha moment I am trying to deliver to the users.

I could use agentic AI to help the user achieve in a few seconds a result that normally takes minutes in the product. And that's what I went for.

On first login, the user is now asked to describe their current financial situation in a short paragraph.

A sub-agent takes that paragraph and turns it into categories, income, expenses, budgets, goals, savings, and so on.

I think many teams could take on this type of innovation and implement it.

Why provide a really long free trial when you can deliver the promise within a few seconds, leaving the user a tiny ( but sufficient ) window to decide whether they will pay or not?

As a result, looking at the data, my free trial cost now feels really low, and users either get hooked on the signal or leave.

Conversion increased: we saw a 30% increase in converted users. I also asked a few existing customers and put them through the onboarding experience too; 9 out of 10 said that if they had been through this earlier, they would have subscribed well before their original subscription date.

Building the Next-Gen Way Developers Explore Code

Ahmed Rakan — Fri, 09 Jan 2026 19:53:48 +0000

I spent a few hours this Friday revisiting some of my older projects, code I wrote long before LLMs existed in the form we know today.

The motivation was simple: I don't see LLMs as an enemy to developers, but as a force multiplier, allowing us to reason faster, explore deeper, and understand systems more clearly.

The difference is dramatic, and you can see it directly in the YT video linked below.

The goal of this experiment is to build an out-of-the-box code exploration and documentation tool, one that can 10-100× the speed at which developers form a mental model of unfamiliar codebases.

Today, most code exploration relies on:

Traditional IDE graphical interfaces
Power-user editors like NeoVim with complex keyboard workflows

This experiment explores a different approach:

A simple, interactive network graph
Powerful search and navigation
Tight integration with familiar IDEs like VS Code

This is an early experiment, but it points toward what next-generation code exploration could look like.

I am sharing this to look for early users ( adopters ), supporters and contributors.

Here is the discord channel I built for this purpose:

https://discord.gg/KvJ3GWEb

Youtube Video :

Agentic AI Didn't Break Automation. We Did. Here's the Fix

Ahmed Rakan — Fri, 02 Jan 2026 13:41:41 +0000

Introduction

Today's automation landscape for new solutions is dominated by LLMs and AI agents, yet critical gaps remain. Despite significant investment in AI-driven automation, a fundamental issue persists: trust. In enterprise environments, even a small failure rate (around 1-10%) can undermine confidence. These failures often manifest not as total breakdowns, but as unpredictable, nonsensical outputs in edge cases, security vulnerabilities, or subtle biases that escape initial review. The core challenge is clear: LLMs are probabilistic, people are misusing them, and trustworthy automation requires determinism. This mismatch is the barrier to true, reliable automation at scale.

The Solution: Introducing the Automation Trust Protocol ( draft )

The real value of LLMs lies in interpretation—yet most automation efforts misuse them for execution. The leap forward isn't more intelligence; it's a trust infrastructure that existing tools lack. Automation people will trust requires:

Predictability: Known outcomes for given inputs.
Observability: Full visibility into each step.
Controllability: The ability to pause or modify execution.
Accountability: Clear attribution for failures.
Recoverability: Mechanisms to undo errors.

Current solutions offer observability and controllability to some extent, but fall short on predictability, accountability, and recoverability. The Automation Trust Protocol bridges this gap by separating intelligence from execution:

Separation of Concerns: AI interprets intent; traditional automation engines handle execution.
Risk-Adaptive Boundaries: Trust boundaries that expand with proven reliability.
Temporal Safety: Built-in review periods, verification, and automatic rollback.
Complete Observability: Audit trails, explanations, and compliance-ready reporting.
Gradual Autonomy: Trust is earned through demonstrated reliability, not assumed.

This protocol addresses the "last 1-10%" of failures that block full enterprise adoption of LLMs and agentic AI, creating a foundation for automation that is both intelligent and trustworthy.

Setting the Stage for the Protocol

Why hasn’t this been built yet? Venture capital typically funds end-user solutions, not underlying protocols. It also requires deep integration across AI, automation, and compliance domains—a complex intersection. It's the opposite of the funded narrative that AI will replace us all; it will open a massive amount of opportunities for any enterprise that aims to create automation around it. However, the need is becoming urgent: regulated companies are losing trust in LLM-based agents, the insurance industry may soon require such safeguards, and compliance is growing more demanding. By establishing a standard for trust in automation, this protocol can realign the industry's trajectory toward reliable, scalable automation.

Automation Trust Protocol (ATP) is a standard for automation systems to communicate risk, ensure accountability, and enable safe execution of automated actions across any platform. Think of how OAuth, as a protocol, brought trust to authorization; ATP aims to do the same for automation. OAuth didn't reinvent authorization: it defined the trust boundary and a set of battle-tested, well-defined flows, and left your specific use case to you.

The only way people will see the value of the Automation Trust Protocol (ATP) is through a concrete, practical example. This post aims to demonstrate the protocol by walking through its 9 technical layers with a real-world scenario, followed by a demo video at the end that showcases a simple automation platform built around ATP.

The protocol consists of nine layers that directly address the five principles outlined earlier: Separation of Concerns, Risk-Adaptive Boundaries, Temporal Safety, Complete Observability, and Gradual Autonomy.

Protocol Layers

Layer 0 - Identity and Authorization

When introducing agency into automation—whether a human, an AI agent, or a scheduled task—every action must be identifiable and authorized. This foundational layer answers: Who did what, and were they allowed to? It creates an immutable anchor for all downstream accountability.

{
  "action_id": "uuid-v4",
  "workflow_id": "wf_customer_refund_v3",
  "initiator": {
    "type": "human|ai_agent|scheduled|event_triggered",
    "user_id": "user_123",
    "agent_id": "agent_gpt4",
    "session_id": "session_456"
  },
  "timestamp": "2025-12-25T10:30:00Z",
  "parent_action_id": "uuid-parent"
}
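As a rough illustration, a Layer 0 record like the one above could be produced by a small helper. This is a sketch under my own assumptions: the function name new_action_identity and the validation rule are hypothetical, and only the field names come from the JSON example.

```python
import json
import uuid
from datetime import datetime, timezone

# Initiator types taken from the JSON example above.
ALLOWED_INITIATORS = {"human", "ai_agent", "scheduled", "event_triggered"}


def new_action_identity(workflow_id, initiator_type, parent_action_id=None, **ids):
    """Build a Layer 0 envelope; reject unknown initiator types up front."""
    if initiator_type not in ALLOWED_INITIATORS:
        raise ValueError(f"unknown initiator type: {initiator_type}")
    return {
        "action_id": str(uuid.uuid4()),
        "workflow_id": workflow_id,
        "initiator": {"type": initiator_type, **ids},
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "parent_action_id": parent_action_id,
    }


envelope = new_action_identity(
    "wf_customer_refund_v3", "human", user_id="user_123", session_id="session_456"
)
print(json.dumps(envelope, indent=2))
```

Generating the action_id and timestamp in one place means every later layer can refer back to the same identifiers.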

Layer 1 - Action Declaration

Before execution, the system must declare its intent. This enables predictability (the workflow's path is known in advance), observability (both declared and executed states are logged), and forms the basis for controllability, accountability, and recoverability.

{
  "action": {
    "type": "database.update|api.call|email.send|payment.process|...",
    "target": {
      "system": "stripe",
      "resource": "charges",
      "operation": "refund"
    },
    "payload": {
      "charge_id": "ch_123",
      "amount": 5000,
      "currency": "USD",
      "reason": "customer_request"
    },
    "idempotency_key": "refund_order_789_attempt_1"
  },
  "context": {
    "business_reason": "Customer requested refund within 30-day window",
    "related_entities": ["customer:c_789", "order:ord_789"],
    "prior_actions": ["email_received", "verified_order_date"]
  }
}
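The idempotency_key in the declaration is what makes retries safe. Here is a minimal in-memory sketch of that idea; the ActionRegistry class is hypothetical and a real system would back it with durable storage.

```python
class ActionRegistry:
    """In-memory store that makes action declarations idempotent."""

    def __init__(self):
        self._by_key = {}

    def declare(self, declaration):
        """Return (record, created). A repeated key returns the original record."""
        key = declaration["action"]["idempotency_key"]
        if key in self._by_key:
            return self._by_key[key], False
        record = {"state": "declared", **declaration}
        self._by_key[key] = record
        return record, True


registry = ActionRegistry()
declaration = {
    "action": {
        "type": "payment.process",
        "idempotency_key": "refund_order_789_attempt_1",
    }
}
first, created = registry.declare(declaration)
retry, created_again = registry.declare(declaration)
```

A network retry of the same declaration therefore cannot produce a second refund: the second call gets the first record back instead of a new one.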

Layer 2 - Risk Assessment Request

Here, the system requests a risk evaluation. This is where LLMs excel at interpretation. Risk assessment is inherently probabilistic; within defined trust boundaries, this evaluation determines the subsequent workflow path.

{
  "risk_assessment_request": {
    "action_id": "uuid-v4",
    "evaluate": [
      "financial_risk",
      "compliance_risk",
      "operational_risk",
      "reputational_risk"
    ],
    "require_approvals": "auto_determine"
  }
}

Risk Assessment Response (From AI Agent):

{
  "risk_assessment": {
    "action_id": "uuid-v4",
    "timestamp": "2025-12-25T10:30:01Z",
    "risk_score": {
      "overall": 0.23,
      "financial": 0.15,
      "compliance": 0.05,
      "operational": 0.42,
      "reputational": 0.12
    },
    "risk_factors": [
      {
        "factor": "amount_exceeds_threshold",
        "severity": "medium",
        "threshold": 1000,
        "actual": 5000,
        "multiplier": 5.0
      },
      {
        "factor": "customer_account_age",
        "severity": "low",
        "details": "Account created 2 years ago"
      }
    ],
    "similar_actions": {
      "past_30_days": 147,
      "success_rate": 0.994,
      "average_completion_time": "2.3s",
      "anomalies_detected": 0
    },
    "recommendation": "auto_approve|human_review|reject",
    "confidence": 0.87
  }
}

Layer 3 - Approval Flow

Based on the risk result, the system routes the action for approval. This is not binary. By defining confidence boundaries (e.g., risk < 0.25 auto-approve, 0.25-0.75 human review, >0.75 reject), businesses can create as many trust tiers as needed.

{
  "approval_request": {
    "action_id": "uuid-v4",
    "risk_score": 0.23,
    "approval_type": "human_required|ai_sufficient|pre_approved",
    "approvers": {
      "required": ["role:finance_manager", "role:customer_service_lead"],
      "optional": ["role:ceo"],
      "escalation_after": "1h",
      "auto_approve_if_no_response": false
    },
    "deadline": "2025-12-25T12:30:00Z",
    "priority": "normal|high|critical"
  }
}

Approval Response:

{
  "approval": {
    "action_id": "uuid-v4",
    "decision": "approved|rejected|modified",
    "approver": "user_456",
    "timestamp": "2025-12-25T10:35:00Z",
    "reason": "Within normal parameters, customer has good history",
    "modifications": {
      "amount": 4500,
      "reason": "Waiving shipping fee only, not full refund"
    },
    "conditions": [
      {
        "type": "notification_required",
        "notify": ["user_789"],
        "message": "Large refund processed"
      }
    ]
  }
}
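The confidence boundaries described for Layer 3 can be sketched as a small routing function. The thresholds are the example values from the text (below 0.25 auto-approve, 0.25 to 0.75 human review, above 0.75 reject); the function name and the handling of the exact boundary values are my assumptions.

```python
def route_by_risk(risk_score, auto_below=0.25, reject_above=0.75):
    """Map a risk score in [0, 1] to an approval tier."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    if risk_score < auto_below:
        return "auto_approve"
    if risk_score > reject_above:
        return "reject"
    return "human_review"


decision = route_by_risk(0.23)  # the overall score from the example above
```

Adding more tiers means adding more thresholds; the routing stays a deterministic function of the score even though the score itself came from a probabilistic assessment.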

Layer 4 - Pre-Execution Verification

Before the action is sent to the target system, a final set of deterministic checks is performed. This can be a sub-workflow of test cases or an AI-aided verification step.

{
  "pre_execution_check": {
    "action_id": "uuid-v4",
    "checks": [
      {
        "type": "data_validation",
        "status": "pass",
        "details": "All required fields present and valid"
      },
      {
        "type": "preconditions",
        "status": "pass",
        "verified": [
          "charge_exists",
          "charge_not_previously_refunded",
          "within_refund_window"
        ]
      },
      {
        "type": "rate_limit",
        "status": "pass",
        "current": "12 refunds in past hour",
        "limit": "50 per hour"
      },
      {
        "type": "dependency_health",
        "status": "pass",
        "dependencies": [
          {"service": "stripe_api", "status": "healthy", "latency": "120ms"}
        ]
      }
    ],
    "ready_for_execution": true
  }
}
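One way to picture the deterministic part of this layer is a list of named predicates that must all pass. The sketch below is an assumption-laden illustration: the two checks, the refunds_past_hour field, and the limit of 50 mirror the JSON example but are not a defined part of the protocol.

```python
def run_pre_execution_checks(action, checks):
    """Run every check and report whether the action is ready to execute."""
    results = []
    for check_type, predicate in checks:
        status = "pass" if predicate(action) else "fail"
        results.append({"type": check_type, "status": status})
    return {
        "pre_execution_check": {
            "action_id": action["action_id"],
            "checks": results,
            "ready_for_execution": all(r["status"] == "pass" for r in results),
        }
    }


# Hypothetical checks for the refund example.
CHECKS = [
    ("data_validation", lambda a: isinstance(a.get("amount"), int) and a["amount"] > 0),
    ("rate_limit", lambda a: a["refunds_past_hour"] < 50),
]

refund = {"action_id": "uuid-v4", "amount": 5000, "refunds_past_hour": 12}
report = run_pre_execution_checks(refund, CHECKS)
```

Every check still runs after a failure, so the report lists all failed checks rather than only the first one.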

Layer 5 - Execution with Proof

The action executes against the target system, producing immutable, detailed logs and cryptographic proof. Think of it as a detailed audit record for accountability and future recoverability of the same automated workflow.

{
  "execution": {
    "action_id": "uuid-v4",
    "started_at": "2025-12-25T10:35:05Z",
    "completed_at": "2025-12-25T10:35:07Z",
    "status": "success|failure|partial",
    "result": {
      "refund_id": "re_456",
      "status": "succeeded",
      "amount_refunded": 5000,
      "currency": "USD"
    },
    "proof": {
      "execution_hash": "sha256_hash_of_inputs_and_outputs",
      "signature": "digital_signature",
      "witnesses": ["stripe_api", "internal_ledger"],
      "receipts": [
        {
          "system": "stripe",
          "transaction_id": "re_456",
          "timestamp": "2025-12-25T10:35:06Z"
        }
      ]
    },
    "side_effects": [
      {
        "type": "email_sent",
        "to": "customer@example.com",
        "template": "refund_confirmation",
        "message_id": "msg_789"
      },
      {
        "type": "database_updated",
        "table": "orders",
        "record_id": "ord_789",
        "field": "status",
        "old_value": "completed",
        "new_value": "refunded"
      }
    ]
  }
}
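The execution_hash and signature fields could be produced along these lines. Note the swap: this sketch uses an HMAC with a shared secret as a stand-in for a true digital signature, which would normally use an asymmetric key pair. The function names and the canonical-JSON step are my assumptions.

```python
import hashlib
import hmac
import json


def canonical(obj):
    """Serialize deterministically so the same data always hashes the same."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


def build_proof(inputs, outputs, signing_key):
    """Hash inputs and outputs together, then sign the hash with HMAC-SHA256."""
    digest = hashlib.sha256(canonical({"inputs": inputs, "outputs": outputs})).hexdigest()
    signature = hmac.new(signing_key, digest.encode(), hashlib.sha256).hexdigest()
    return {"execution_hash": digest, "signature": signature}


def verify_proof(inputs, outputs, proof, signing_key):
    """Recompute the proof and compare in constant time."""
    expected = build_proof(inputs, outputs, signing_key)
    return hmac.compare_digest(
        expected["execution_hash"], proof["execution_hash"]
    ) and hmac.compare_digest(expected["signature"], proof["signature"])


key = b"demo-key-do-not-use"  # placeholder secret for the example
inputs = {"charge_id": "ch_123", "amount": 5000}
outputs = {"refund_id": "re_456", "amount_refunded": 5000}
proof = build_proof(inputs, outputs, key)
```

If anyone later edits the logged inputs or outputs, verify_proof returns False, which is what makes the audit record tamper-evident.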

Layer 6 - Post-Execution Verification

The system independently verifies that the action achieved its intended outcome and no unintended side effects occurred.

{
  "verification": {
    "action_id": "uuid-v4",
    "timestamp": "2025-12-25T10:35:10Z",
    "checks": [
      {
        "type": "state_consistency",
        "status": "pass",
        "verified": "Order status matches refund status in Stripe"
      },
      {
        "type": "downstream_effects",
        "status": "pass",
        "verified": [
          "customer_notified",
          "accounting_updated",
          "analytics_recorded"
        ]
      },
      {
        "type": "no_unintended_consequences",
        "status": "pass",
        "checked": [
          "no_duplicate_refunds",
          "customer_balance_correct",
          "inventory_not_affected"
        ]
      }
    ],
    "overall_status": "verified|anomaly_detected|verification_failed",
    "confidence": 0.95
  }
}

Layer 7 - Rollback Capability

If verification fails, the protocol enables a rollback to a previous stable state. This is achieved via compensating transactions or state restoration mechanisms.

{
  "rollback_request": {
    "action_id": "uuid-v4",
    "reason": "downstream_verification_failed",
    "details": "Customer balance shows incorrect amount",
    "strategy": "compensating_transaction|state_restoration",
    "compensating_actions": [
      {
        "type": "api.call",
        "target": "stripe.charges.capture",
        "payload": {...}
      },
      {
        "type": "database.update",
        "target": "orders.status",
        "restore_to": "completed"
      }
    ]
  }
}

Rollback Response:

{
  "rollback": {
    "action_id": "uuid-v4",
    "original_action_id": "uuid-original",
    "status": "completed|partial|failed",
    "compensating_actions_executed": 2,
    "state_restored": true,
    "residual_effects": [
      {
        "type": "audit_trail",
        "description": "Refund attempt recorded in logs",
        "cleanup_required": false
      }
    ]
  }
}
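A compensating-transaction rollback is usually run in reverse order of execution. The sketch below illustrates that pattern with hypothetical step names and an in-memory state dict; the returned "errors" list is my own addition and not a field of the protocol.

```python
def run_rollback(executed_steps, compensations):
    """Undo executed steps in reverse order, collecting any failures."""
    undone, errors = [], []
    for step in reversed(executed_steps):
        try:
            compensations[step]()
            undone.append(step)
        except Exception as exc:  # keep going so other steps still get undone
            errors.append({"step": step, "error": str(exc)})
    if not errors:
        status = "completed"
    elif undone:
        status = "partial"
    else:
        status = "failed"
    return {
        "status": status,
        "compensating_actions_executed": len(undone),
        "state_restored": not errors,
        "errors": errors,
    }


# Hypothetical state after a refund went through.
state = {"order_status": "refunded", "refund_active": True}
compensations = {
    "refund_issued": lambda: state.update(refund_active=False),
    "order_marked_refunded": lambda: state.update(order_status="completed"),
}
result = run_rollback(["refund_issued", "order_marked_refunded"], compensations)
```

Reversing the order matters when later steps depend on earlier ones: the order record is restored before the refund itself is reversed.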

Layer 8 - Learning & Feedback

The system records outcomes to improve future risk assessments, creating a feedback loop for continuous learning and human correction.

{
  "feedback": {
    "action_id": "uuid-v4",
    "outcome": "success|failure|partial|rolled_back",
    "actual_risk_materialized": false,
    "predicted_risk": 0.23,
    "actual_risk": 0.05,
    "learning_signals": [
      {
        "signal": "risk_overestimated",
        "factor": "customer_account_age",
        "adjustment": "lower_weight_for_established_customers"
      },
      {
        "signal": "execution_time",
        "expected": "2.3s",
        "actual": "2.1s",
        "within_normal": true
      }
    ],
    "human_feedback": {
      "provided_by": "user_456",
      "rating": "appropriate_approval_required",
      "comments": "Good catch on the amount threshold"
    }
  }
}
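The risk_overestimated signal in the example can be derived by comparing predicted and actual risk. This is a deliberately simple sketch: the tolerance of 0.1 and the other two signal names are my assumptions, not part of the protocol.

```python
def learning_signal(predicted_risk, actual_risk, tolerance=0.1):
    """Classify how far the prediction was from what actually happened."""
    delta = predicted_risk - actual_risk
    if delta > tolerance:
        return "risk_overestimated"
    if delta < -tolerance:
        return "risk_underestimated"
    return "risk_well_calibrated"


signal = learning_signal(predicted_risk=0.23, actual_risk=0.05)
```

With the values from the feedback record above (0.23 predicted, 0.05 actual) the gap is 0.18, so the function reports risk_overestimated, matching the first learning signal in the JSON.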

Protocol Endpoints

ATP-compliant systems must implement these core endpoints to facilitate the layered interaction.

Required Endpoints:

POST /atp/v1/actions/declare
- Declare intent before execution.
- Returns: action_id and initial risk assessment.
GET /atp/v1/actions/{action_id}/risk
- Request comprehensive risk assessment.
- Returns: risk scores, factors, recommendation.
POST /atp/v1/actions/{action_id}/approve
- Submit approval decision.
- Returns: execution authorization or rejection.
POST /atp/v1/actions/{action_id}/execute
- Execute the approved action.
- Returns: execution result with proof.
GET /atp/v1/actions/{action_id}/verify
- Verify the action's outcome.
- Returns: verification status.
POST /atp/v1/actions/{action_id}/rollback
- Initiate a compensating transaction or rollback.
- Returns: rollback status.
POST /atp/v1/actions/{action_id}/feedback
- Submit learning feedback for the action.
- Returns: acknowledgment.

Optional Endpoints:

GET /atp/v1/actions/{action_id}/explain
- Get a natural language explanation of the action and its context.
GET /atp/v1/actions/{action_id}/audit-trail
- Retrieve the full, compliance-ready audit trail.
GET /atp/v1/patterns/similar
- Find similar historical actions for pattern analysis.

From Theory to Practice: Show Me the Code

So far, everything looks good on paper. But does this protocol actually solve the automation problems we've identified? To prove it hits all five critical requirements:

Predictability: Known outcomes for given inputs
Observability: Full visibility into each step
Controllability: The ability to pause or modify execution
Accountability: Clear attribution for failures
Recoverability: Mechanisms to undo errors

We need a concrete implementation. Let's walk through a real-world scenario.

The Infrastructure Problem

Consider a typical modern stack:

Monitoring and Alerting System for monitoring and outage notifications
Automation Engine as the automation workflow engine
CI/CD Stack, e.g. GitHub Actions, ArgoCD, Kubernetes, for continuous integration and continuous delivery.

The DevOps setup is solid—until something breaks. Here's what happens today:

The Monitoring and Alerting System detects a service failure and triggers an automation engine workflow
The workflow sends notifications to the team (that's it)
Continuous delivery may automatically roll back to a previous version (seconds to minutes recovery)
Engineers scramble to debug, potentially taking hours or days depending on failure severity

This setup nails observability and controllability, but completely misses:

Predictability (will this automated response actually fix things?)
Accountability (who approved this rollback? Why was it chosen?)
Recoverability (what if the rollback makes things worse?)

Bridging the Gap with ATP

Instead of direct automation, we insert an ATP Gateway between monitoring and execution. For simplicity, we use Uptime Kuma for monitoring and alerting, and n8n as the automation engine:

The image illustrates the complete flow, but let me walk you through the implementation:

Uptime Kuma sends the notification to our ATP layer instead of the automation engine.
The ATP gateway declares an action: roll back the deployment, with ArgoCD as the target and production as the namespace.
The ATP gateway uses LLMs for risk assessment, checking all the risk factors given the description of the situation, and picks the proper next step given the risk result; for example, high risk means human review is a must.
Approval flow: low risk is auto-approved ( rollback ), high risk requires human review.
Deterministic workflow execution via the automation engine, which receives the execution request with ATP metadata.
The automation engine executes the deterministic workflow: a. Call the ArgoCD API to roll back. b. Wait for the deployment to complete. c. Check service health. d. Report back to the ATP gateway.
Verification inside the ATP gateway: ATP verifies the outcome via specific defined checks: execution completed, service health, no side effects via the dependencies list, error rate. The result is a probabilistic score.
The ATP gateway records the outcome for future risk assessment.
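The steps above can be compressed into one function to show where the probabilistic and deterministic parts meet. Everything here is a hypothetical sketch: handle_alert, its callable parameters, and the 0.25 threshold are my assumptions, and the real Uptime Kuma, n8n and ArgoCD integrations are replaced by plain Python callables.

```python
def handle_alert(alert, assess_risk, execute, verify, history, auto_below=0.25):
    """Run one alert through declare, assess, approve, execute, verify, record."""
    # Declare intent before doing anything.
    action = {
        "type": "deployment.rollback",
        "target": "argocd",
        "namespace": "production",
        "alert": alert,
    }
    # Probabilistic step: in the post this is the LLM risk assessment.
    risk = assess_risk(action)
    if risk >= auto_below:
        outcome = {"action": action, "risk": risk, "status": "pending_human_review"}
    else:
        # Deterministic step: in the post this is the automation engine workflow.
        result = execute(action)
        status = "verified" if verify(result) else "verification_failed"
        outcome = {"action": action, "risk": risk, "status": status, "result": result}
    # Record the outcome so future assessments can learn from it.
    history.append(outcome)
    return outcome


history = []
low_risk = handle_alert(
    {"service": "api", "state": "down"},
    assess_risk=lambda action: 0.1,
    execute=lambda action: {"healthy": True},
    verify=lambda result: result["healthy"],
    history=history,
)
```

The model only produces a score; whether anything executes is decided by ordinary code, and execution never runs when the score calls for human review.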

So what is the result? In my humble opinion, here it is:

Feature	Plain n8n	Pure AI Agent	ATP Solution
Risk Assessment	❌ None	⚠️ Basic, probabilistic	✅ AI-powered, quantitative scoring
Approval Flow	❌ Manual only	⚠️ Ad-hoc, inconsistent	✅ Risk-adaptive, multi-tier rules
Audit Trail	⚠️ Basic logs only	❌ Limited or none	✅ Immutable, cryptographic proof
Rollback	❌ Manual recovery	⚠️ Unreliable or missing	✅ Automated, verified rollback
Learning	❌ None	✅ Yes, but unstable	✅ Continuous, stable improvement
Predictability	⚠️ Brittle workflows	❌ Unpredictable outputs	✅ Declared intent, deterministic execution
Accountability	⚠️ Limited attribution	❌ Unclear responsibility	✅ Clear identity & action tracing
Control	✅ Manual overrides	❌ Limited intervention	✅ Granular, risk-based controls
Execution Type	✅ Deterministic	❌ Probabilistic	✅ Deterministic with AI interpretation
Explainability	✅ Clear workflow steps	❌ Black-box decisions	✅ Transparent decision rationale
Compliance	⚠️ Manual reporting	❌ Difficult to audit	✅ Built-in compliance verification
Trust Boundaries	❌ All-or-nothing	❌ Unbounded autonomy	✅ Configurable, earned trust
Reliability at Scale	✅ High for simple tasks	⚠️ ~90% success rate	✅ 99.9%+ with safeguards
Human Oversight	✅ Required for all	❌ Optional or absent	✅ Risk-adaptive, always available
Recovery Speed	⚠️ Manual, slow	❌ Unpredictable	✅ Automated, verified compensation
Best For	Simple, repetitive tasks	Creative, exploratory tasks	Mission-critical, regulated automation

Source Code | Video Demonstration | Early Adopters Discord

ATP proves that we can have intelligent automation without sacrificing determinism, and automated execution with human-level accountability.

StackUp: One Command to Rule Your Dev Environment

Ahmed Rakan — Thu, 25 Dec 2025 10:18:38 +0000

The Problem

I swapped hard drives between my PC and my brother's gaming rig because I had lost interest in graphics-heavy games ( except one ) and AI experiments, so I gave my brother the high-end PC and kept the OK one. Bad idea. Windows wouldn't even let me log in without a full reset.

As I reinstalled Git, Node, Docker, and everything else for the third time that month on yet another machine, setting up new environments for experimentation in my home lab, I thought: there has to be a better way.

The Solution

StackUp lets you define your entire development environment in a single YAML file and install it with one command across Windows, Linux, and macOS.

profile: web-dev

tools:
  - name: git
    version: latest
    linux:
      package_names:
        apt: git
    macos:
      brew: git
    windows:
      package_names:
        winget: Git.Git

  - name: node
    version: "20.x"
    dependencies: ["git"]

Run it:

./stackup install dev.yaml

That's it. StackUp detects your OS, lets you pick the right package manager ( for each tool ), and installs everything in the correct order.
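"The correct order" is a topological sort of the dependencies field. The Python sketch below illustrates the idea using the standard library; it is not StackUp's actual implementation (which this post does not show), and the function names are hypothetical.

```python
import platform
from graphlib import CycleError, TopologicalSorter


def install_order(tools):
    """Return tool names so that every dependency comes before its dependents.

    Raises CycleError if the dependencies form a loop.
    """
    graph = {tool["name"]: set(tool.get("dependencies", [])) for tool in tools}
    return list(TopologicalSorter(graph).static_order())


def current_os():
    """Map platform.system() onto the OS keys used in the YAML example."""
    names = {"Linux": "linux", "Darwin": "macos", "Windows": "windows"}
    return names.get(platform.system(), "unknown")


# Same tools as the YAML example, deliberately listed out of order.
tools = [
    {"name": "node", "version": "20.x", "dependencies": ["git"]},
    {"name": "git", "version": "latest"},
]
order = install_order(tools)
```

Even though node is listed first, git is installed first because node declares it as a dependency. A circular dependency is reported as an error instead of looping forever.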

What Makes It Different

Cross-platform by design. Define once, run on any OS. No more maintaining separate setup scripts for Windows, Mac, and Linux.

Smart dependency handling. Need WSL before Docker on Windows? StackUp figures it out.

Complex installations made simple. Multi-step installs, pre/post hooks, and custom commands for tools that don't play nice with package managers.

Package managers built in. Works with choco or winget on Windows, and apt, dnf, or pacman on Linux.

Use Cases

New machine setup. New hire onboarding. Team environment standardization. Moving between personal and work machines.

Instead of a 10-page wiki with screenshots, your team gets a single YAML file they can trust.

A Word of Caution

StackUp runs installations with elevated privileges. It can execute any command you put in your config file.

Never run a config file you haven't reviewed yourself.

I recommend teams store configs in Git and review them like any other infrastructure code (following a GitOps approach). Don't pass YAML files around in Slack.

What's Next

I'm working on better security guardrails, an interactive config builder, and proper update/rollback commands. But I wanted to ship this now and get feedback from real users.

The code is open source under MIT. Try it out, break it, tell me what's missing.

GitHub: github.com/araldhafeeri/stackup

Would love to hear what you think. Does this solve a problem you have? What would make it more useful?

Best,

Ahmed

Stop Writing System Logs For Your Mental Model - Write For Your User's Instead

Ahmed Rakan — Sun, 07 Dec 2025 23:57:26 +0000

The Mental Model Mismatch in Logging

Your logs are telling the wrong story.

You're documenting your understanding of the code - the functions, classes, and internal states. But your users (developers, operators, SREs) need to understand their system - the applications, services, and business operations.

The Symptoms

You see this everywhere in production logs:

import logging

logger = logging.getLogger(__name__)

# Developer mental model
logger.error("ImagePullBackOff for pod webapp-7f8d9")
logger.error("HTTP 500 at /api/v1/process")
logger.error("Database connection timeout")

# vs. User mental model
logger.error("Application 'webapp' cannot start: container image unavailable")
logger.error("Payment processing failed: internal server error for order #12345")
logger.error("User authentication service unavailable: database unreachable")

Both are clear. Both are professional. But only one answers: "What's broken, for whom, and what do I do?"

The Shift

From documenting code flow → To telling the service story

Stop thinking: "What's happening in my function?"
Start thinking: "What business operation is failing?"

From isolated events → To correlated journeys

Every log should answer: "Which user request/service operation does this belong to?" Use correlation IDs religiously.
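
One way to make correlation IDs automatic is to attach them in the logging layer rather than in every message. This is a minimal sketch using only the Python standard library; the names `correlation_id`, `CorrelationFilter`, `build_logger`, and `handle_request` are made up for illustration, and a real service would usually take the ID from an incoming request header instead of generating it.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(name="payments"):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def handle_request(logger, order_id):
    # One ID per request; every log line emitted below carries it.
    token = correlation_id.set(uuid.uuid4().hex[:8])
    try:
        logger.info("Payment processing started for order #%s", order_id)
        logger.error(
            "Payment processing failed: internal server error for order #%s",
            order_id,
        )
    finally:
        correlation_id.reset(token)


if __name__ == "__main__":
    handle_request(build_logger(), 12345)
```

Both lines for the same order share one ID, so whoever reads them at 3 AM can follow a single request through the service with one search.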

From technical states → To business impact

"Database connection failed" → "User signups blocked: authentication database unavailable"

Practical Shift

Before writing a log, ask:

Who will read this at 3 AM?
What do they need to know about the service, not the code?
What action should they take?

Your logs shouldn't document your codebase. They should document your service's behavior for the people who keep it running.

Write for the humans who debug and operate the system, not the humans who develop it.

Building a Production-Grade MongoDB Cluster on Kubernetes: A Complete Guide to Horizontal Scalability

Ahmed Rakan — Wed, 26 Nov 2025 13:47:38 +0000

Introduction

Distributed systems expertise remains one of the most sought-after skills in software engineering. Engineers who can design, implement, and scale distributed databases command premium compensation for good reason—these systems form the backbone of modern applications serving millions of users.

In this comprehensive guide, we'll build a highly available, horizontally scalable MongoDB cluster using Kubernetes. You'll learn how to create a production-ready database infrastructure that can grow from a single node to hundreds of nodes, scaling seamlessly to meet demanding workloads.

Technology Stack

Our infrastructure leverages three powerful open-source technologies:

MongoDB: A distributed NoSQL database designed for horizontal scalability and high availability
MicroK8s: An ultra-lightweight Kubernetes distribution from Canonical (the creators of Ubuntu), optimized for both development and production environments
OpenEBS: A cloud-native distributed storage solution for Kubernetes that provides persistent volume management

This combination enables true horizontal scalability—you can expand your cluster's capacity by adding more nodes rather than being limited by vertical scaling constraints.

Prerequisites and Initial Setup

Installing MicroK8s

First, set up your MicroK8s cluster. The installation process is straightforward and well-documented:

MicroK8s Getting Started Guide

Follow the official documentation to install MicroK8s on your nodes. Once complete, verify your installation:

microk8s status --wait-ready

Enabling OpenEBS Storage

OpenEBS integration with MicroK8s is remarkably simple, requiring just two commands:

microk8s enable community
microk8s enable openebs

These commands enable the community addon repository and install OpenEBS components into your cluster.

Configuring Distributed Storage

Installing iSCSI on Every Node

OpenEBS relies on iSCSI (Internet Small Computer Systems Interface) for distributed block storage. This protocol enables nodes to access block-level storage over TCP/IP networks, which is essential for our distributed architecture.

Critical: Install iSCSI on every node in your cluster:

sudo apt update
sudo apt install open-iscsi
sudo systemctl enable open-iscsi
sudo systemctl enable iscsid
sudo systemctl start iscsid

Verify that the iSCSI daemon is running:

systemctl status iscsid

You should see output similar to:

● iscsid.service - iSCSI initiator daemon (iscsid)
     Loaded: loaded (/lib/systemd/system/iscsid.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2025-11-18 20:05:03 +03; 1 week 1 day ago
TriggeredBy: ● iscsid.socket
       Docs: man:iscsid(8)
   Main PID: 9887 (iscsid)
      Tasks: 2 (limit: 9298)
     Memory: 5.4M
        CPU: 9.154s
     CGroup: /system.slice/iscsid.service
             ├─9886 /sbin/iscsid
             └─9887 /sbin/iscsid

Verifying Storage Classes

Check that OpenEBS storage classes are available in your cluster:

kubectl get storageclass

Expected output includes multiple OpenEBS storage classes:

NAME                          PROVISIONER            RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
microk8s-hostpath (default)   microk8s.io/hostpath   Delete          WaitForFirstConsumer   false                  256d
openebs-device                openebs.io/local       Delete          WaitForFirstConsumer   false                  7d20h
openebs-hostpath              openebs.io/local       Delete          WaitForFirstConsumer   false                  7d20h
openebs-jiva                  jiva.csi.openebs.io    Delete          Immediate              true                   44m
openebs-jiva-csi-default      jiva.csi.openebs.io    Delete          Immediate              true                   7d20h

Configuring High Availability with Replica Count

For production deployments, configure the replication factor for your storage. This ensures data redundancy and high availability:

cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-jiva
provisioner: jiva.csi.openebs.io
parameters:
  replicaCount: "3"  # Use 3 for production, 2 minimum for HA
  policy: openebs-policy-default
allowVolumeExpansion: true
EOF

Replication guidelines:

3 replicas: Recommended for production (tolerates 1 node failure)
2 replicas: Minimum for high availability
Ensure your cluster has at least as many nodes as your replica count

Deploying MongoDB

Creating the Namespace and Service Account

First, create a dedicated namespace for your databases:

kubectl create namespace databases

Now create a service account with appropriate permissions. MongoDB's sidecar container needs to discover other pods in the replica set:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: mongo
  namespace: databases

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: read-pod-service-endpoint
rules:
- apiGroups: [""]
  resources: ["pods", "services", "endpoints"]
  verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:serviceaccount:databases:mongo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: read-pod-service-endpoint
subjects:
- kind: ServiceAccount
  name: mongo
  namespace: databases

Apply the configuration:

kubectl apply -f service-account.yaml

Creating Credentials Secret

Before deploying MongoDB, create a secret for authentication:

kubectl create secret generic mongo-secret \
  --from-literal=mongo-user=admin \
  --from-literal=mongo-password='YourSecurePassword123!' \
  -n databases

Security note: In production, use a secrets management solution like HashiCorp Vault or sealed-secrets instead of plain Kubernetes secrets.

Deploying the StatefulSet

StatefulSets are designed for stateful applications like databases. Unlike Deployments, they provide:

Stable, unique network identifiers
Stable, persistent storage
Ordered, graceful deployment and scaling

Here's the complete MongoDB StatefulSet configuration:

apiVersion: v1
kind: Service
metadata:
  name: mongo
  namespace: databases
  labels:
    name: mongo
spec:
  selector:
    app: mongo
  ports:
  - protocol: TCP
    port: 27017
    targetPort: 27017
  clusterIP: None  # Headless service for StatefulSet

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongo
  namespace: databases
  labels:
    app: mongo
spec:
  serviceName: "mongo"
  replicas: 3
  selector:
    matchLabels:
      app: mongo
  template:
    metadata:
      labels:
        app: mongo
        role: mongo
        environment: production
    spec:
      serviceAccountName: mongo
      automountServiceAccountToken: true
      terminationGracePeriodSeconds: 30

      containers:
      - name: mongodb
        image: mongo:5.0
        command:
        - mongod
        - "--replSet=rs0"
        - "--bind_ip=0.0.0.0"
        ports:
        - name: mongodb
          containerPort: 27017
          protocol: TCP

        resources:
          requests:
            memory: "1Gi"
            cpu: "1"
          limits:
            memory: "2Gi"
            cpu: "2"

        env:
        - name: MONGO_INITDB_ROOT_USERNAME
          valueFrom:
            secretKeyRef:
              name: mongo-secret
              key: mongo-user
        - name: MONGO_INITDB_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mongo-secret
              key: mongo-password

        volumeMounts:
        - name: mongo-persistent-storage
          mountPath: /data/db

        livenessProbe:
          exec:
            command:
            - mongosh
            - --eval
            - "db.adminCommand('ping')"
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5

        readinessProbe:
          exec:
            command:
            - mongosh
            - --eval
            - "db.adminCommand('ping')"
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3

      - name: mongo-sidecar
        image: morphy/k8s-mongo-sidecar
        env:
        - name: KUBERNETES_POD_LABELS
          value: "app=mongo,role=mongo"
        - name: KUBERNETES_SERVICE_NAME
          value: "mongo"
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MONGODB_USERNAME
          valueFrom:
            secretKeyRef:
              name: mongo-secret
              key: mongo-user
        - name: MONGODB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mongo-secret
              key: mongo-password

  volumeClaimTemplates:
  - metadata:
      name: mongo-persistent-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "openebs-jiva-csi-default"
      resources:
        requests:
          storage: 50Gi

Key configuration details:

Headless Service (clusterIP: None): Provides stable DNS entries for each pod (mongo-0, mongo-1, mongo-2)
Volume Claim Templates: Each pod gets its own 50Gi persistent volume
Health Probes: Liveness and readiness probes ensure pod health
Sidecar Container: Automatically manages replica set configuration
Resource Limits: Prevents resource exhaustion and enables proper scheduling

Deploy the StatefulSet:

kubectl apply -f statefulset.yaml

Monitor the deployment:

kubectl get pods -n databases -w

Wait for all pods to reach the Running state.

Initializing the Replica Set

Once all pods are running, initialize the MongoDB replica set. Connect to the first pod:

kubectl exec -it mongo-0 -n databases -- mongosh -u admin -p

Enter your password when prompted, then initialize the replica set:

rs.initiate({
  _id: "rs0",
  version: 1,
  members: [
    { _id: 0, host: "mongo-0.mongo.databases.svc.cluster.local:27017" },
    { _id: 1, host: "mongo-1.mongo.databases.svc.cluster.local:27017" },
    { _id: 2, host: "mongo-2.mongo.databases.svc.cluster.local:27017" }
  ]
})

The DNS names follow the pattern: <pod-name>.<service-name>.<namespace>.svc.cluster.local:27017
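
Because the pattern is fully determined by the StatefulSet name, the headless service name, and the namespace, the member list can be generated rather than typed by hand. A small Python sketch, assuming the default `cluster.local` cluster domain; the helper name `replica_set_members` is hypothetical:

```python
def replica_set_members(statefulset, service, namespace, replicas, port=27017):
    """Build the members list for rs.initiate() from StatefulSet naming rules."""
    return [
        {
            "_id": i,
            "host": f"{statefulset}-{i}.{service}.{namespace}.svc.cluster.local:{port}",
        }
        for i in range(replicas)
    ]


for member in replica_set_members("mongo", "mongo", "databases", 3):
    print(member)
```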

Verifying Replica Set Status

Check the replica set configuration:

rs.status()

Look for these key indicators of a healthy cluster:

"ok": 1 at the end of the output
One PRIMARY member
Two SECONDARY members
All members showing "health": 1

Example output:

{
  set: 'rs0',
  members: [
    {
      _id: 0,
      name: 'mongo-0.mongo.databases.svc.cluster.local:27017',
      health: 1,
      state: 1,
      stateStr: 'PRIMARY',
      ...
    },
    {
      _id: 1,
      name: 'mongo-1.mongo.databases.svc.cluster.local:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      ...
    },
    ...
  ],
  ok: 1
}

Testing and Validation

Testing Data Persistence

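Let's verify that data persists correctly across the cluster. Insert a test document through the primary, since only the primary accepts writes (mongo-1 in this example; check rs.status() for yours):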

kubectl exec -it -n databases mongo-1 -- mongosh -u admin -p --eval '
db = db.getSiblingDB("testdb");
db.testcollection.insertOne({
  name: "test-record",
  timestamp: new Date(),
  message: "Testing OpenEBS Jiva storage",
  node: "mongo-1"
});
db.testcollection.find().pretty();
'

Expected output:

[
  {
    _id: ObjectId('6926f6c7d79787542f544ca7'),
    name: 'test-record',
    timestamp: ISODate('2025-11-26T12:47:03.263Z'),
    message: 'Testing OpenEBS Jiva storage',
    node: 'mongo-1'
  }
]

Now verify the data is replicated by querying a different pod:

kubectl exec -it -n databases mongo-0 -- mongosh -u admin -p --eval '
db = db.getSiblingDB("testdb");
db.testcollection.find().pretty();
'

You should see the same document, confirming replication is working.

Testing High Availability

The true test of high availability is failover. Let's simulate a node failure by deleting the primary pod.

First, identify the primary:

kubectl exec -it -n databases mongo-0 -- mongosh -u admin -p --eval "rs.isMaster().primary"

Output example:

mongo-1.mongo.databases.svc.cluster.local:27017

Now delete the primary pod:

kubectl delete pod mongo-1 -n databases

The replica set should automatically elect a new primary. Check the new primary:

kubectl exec -it -n databases mongo-0 -- mongosh -u admin -p --eval "rs.isMaster().primary"

Output:

mongo-0.mongo.databases.svc.cluster.local:27017

What just happened?

The primary pod was deleted
The remaining members detected the failure within seconds
An automatic election occurred
A new primary was elected
Kubernetes recreated the deleted pod
The recreated pod rejoined as a secondary

This demonstrates true high availability—your application experiences minimal disruption during node failures.
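
The election only succeeds because two of the three members were still up: a primary needs votes from a strict majority of the voting members. The arithmetic is worth seeing once, sketched here in Python with illustrative helper names:

```python
def election_majority(voting_members):
    """Votes needed to elect a primary: a strict majority of voting members."""
    return voting_members // 2 + 1


def tolerated_failures(voting_members):
    """How many voting members can be down while a primary can still be elected."""
    return voting_members - election_majority(voting_members)


for n in (3, 4, 5):
    print(n, election_majority(n), tolerated_failures(n))
```

Three and four members both tolerate a single failure, while five tolerate two, which is why replica sets are normally sized with an odd number of voting members.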

Testing Read Operations During Failover

For a more realistic test, run continuous read operations while deleting a pod:

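# In terminal 1, start continuous reads.
# Credentials are read from the pod's own environment (populated from mongo-secret),
# because a password prompt cannot be answered inside a non-interactive loop.
while true; do
  kubectl exec -n databases mongo-0 -- sh -c '
    mongosh -u "$MONGO_INITDB_ROOT_USERNAME" -p "$MONGO_INITDB_ROOT_PASSWORD" --quiet \
      --eval "db.getSiblingDB(\"testdb\").testcollection.findOne()"
  ' 2>/dev/null && echo "✓ Read successful" || echo "✗ Read failed"
  sleep 1
done

# In terminal 2, delete the current primary (mongo-1 here; substitute yours)
kubectl delete pod mongo-1 -n databases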

You'll notice only a brief interruption (typically 5-10 seconds) during the election process.

Scaling Your Cluster

Horizontal Scaling

To scale your MongoDB cluster, simply increase the replica count:

kubectl scale statefulset mongo --replicas=5 -n databases

After the new pods are running, add them to the replica set:

kubectl exec -it mongo-0 -n databases -- mongosh -u admin -p

rs.add("mongo-3.mongo.databases.svc.cluster.local:27017")
rs.add("mongo-4.mongo.databases.svc.cluster.local:27017")

Vertical Scaling

To increase resources for existing pods, update the StatefulSet:

resources:
  requests:
    memory: "2Gi"
    cpu: "2"
  limits:
    memory: "4Gi"
    cpu: "4"

Apply the changes and perform a rolling update:

kubectl apply -f statefulset.yaml
kubectl rollout status statefulset/mongo -n databases

Storage Expansion

OpenEBS Jiva supports volume expansion. To increase storage for an existing pod:

kubectl patch pvc mongo-persistent-storage-mongo-0 -n databases -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

Note: Not all storage classes support volume expansion. Verify with kubectl get storageclass and check the ALLOWVOLUMEEXPANSION column.

Monitoring and Maintenance

Essential Monitoring Metrics

Deploy monitoring for these critical metrics:

Replica Set Health

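   # Credentials come from the pod's environment, so no password prompt is needed
   kubectl exec -n databases mongo-0 -- sh -c 'mongosh -u "$MONGO_INITDB_ROOT_USERNAME" -p "$MONGO_INITDB_ROOT_PASSWORD" --quiet --eval "rs.status()"' | grep -A 3 "stateStr"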

Pod Status

   kubectl get pods -n databases -o wide

Storage Usage

   kubectl exec -n databases mongo-0 -- df -h /data/db

Resource Consumption

   kubectl top pods -n databases

Backup Strategy

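Implement regular backups using mongodump. The command below writes the dump to /tmp inside the pod, so copy it out afterwards (for example with kubectl cp) or point --out at a mounted backup volume: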

kubectl exec -n databases mongo-0 -- mongodump \
  --username=admin \
  --password=YourPassword \
  --authenticationDatabase=admin \
  --out=/tmp/backup-$(date +%Y%m%d)

For production environments, consider using:

Velero: Kubernetes-native backup solution
MongoDB Ops Manager: MongoDB's enterprise backup solution
Kanister: Application-level data management platform

Production Considerations

Security Hardening

Enable TLS/SSL: Encrypt data in transit

   - "--tlsMode=requireTLS"
   - "--tlsCertificateKeyFile=/etc/mongodb/certs/mongodb.pem"

Network Policies: Restrict pod-to-pod communication

   apiVersion: networking.k8s.io/v1
   kind: NetworkPolicy
   metadata:
     name: mongo-netpol
     namespace: databases
   spec:
     podSelector:
       matchLabels:
         app: mongo
     policyTypes:
     - Ingress
     ingress:
     - from:
       - namespaceSelector:
           matchLabels:
             name: application
       ports:
       - protocol: TCP
         port: 27017

Pod Security Standards: Apply baseline security policies
Secrets Management: Use external secrets management (Vault, AWS Secrets Manager)

Performance Optimization

Anti-Affinity Rules: Distribute pods across nodes

   affinity:
     podAntiAffinity:
       requiredDuringSchedulingIgnoredDuringExecution:
       - labelSelector:
           matchExpressions:
           - key: app
             operator: In
             values:
             - mongo
         topologyKey: kubernetes.io/hostname

Resource Tuning: Adjust based on workload patterns
WiredTiger Cache: Configure based on available memory
Connection Pooling: Optimize application connection settings
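
For the WiredTiger cache in particular, it helps to know what you get if you set nothing. MongoDB documents the default internal cache size as the larger of 50% of (RAM minus 1 GB) or 256 MB, and inside a container the memory limit is used in place of total system RAM. The Python sketch below only reproduces that arithmetic for the limits used in this guide; the function name is illustrative.

```python
def default_wiredtiger_cache_gib(memory_gib):
    """Documented default: the larger of 50% of (RAM - 1 GiB) or 0.25 GiB."""
    return max(0.5 * (memory_gib - 1), 0.25)


# Memory limits used in this guide: 2Gi initially, 4Gi after vertical scaling.
for limit in (2, 4):
    print(f"{limit} GiB limit -> {default_wiredtiger_cache_gib(limit)} GiB cache")
```

With a 2Gi limit the default cache is only about 0.5 GiB, so raise the limit or set --wiredTigerCacheSizeGB explicitly if the working set is larger.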

Disaster Recovery

Multi-Zone and Multi-Region Deployment: Spread members across availability zones, and across regions where latency allows
Regular Backup Testing: Verify backup integrity and restoration procedures
Runbook Documentation: Document recovery procedures
Automated Failover Testing: Regularly test failover mechanisms

Conclusion

You've successfully built a production-grade, horizontally scalable MongoDB cluster on Kubernetes. This setup provides:

High Availability: Automatic failover with minimal downtime
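Horizontal Scalability: Scale out from 3 members by adding replicas (a single replica set supports up to 50 members, 7 of them voting; growing to hundreds of nodes requires sharding)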
Data Durability: Replicated storage with OpenEBS
Operational Flexibility: Kubernetes-native management

While this guide makes the deployment appear straightforward, the reality of managing distributed databases in production is complex. This complexity explains why managed database services like MongoDB Atlas command premium pricing—they handle the operational burden of:

24/7 monitoring and alerting
Automated backups and point-in-time recovery
Performance optimization and query analysis
Security patches and upgrades
Multi-region replication and disaster recovery
Expert support and SLA guarantees

When to self-host vs. use managed services:

Self-hosting makes sense when:

You have experienced DevOps and database engineers
You require specific configurations not available in managed services
Cost optimization is critical at scale (100+ nodes)
You need on-premises deployment for compliance reasons
You want complete control over your infrastructure

Managed services make sense when:

Your team lacks Kubernetes and database operations expertise
You want to focus on application development, not infrastructure
You need guaranteed uptime SLAs
You require enterprise support and consulting
Your workload doesn't justify a dedicated operations team

The skills you've developed in this guide—Kubernetes orchestration, distributed systems design, and operational excellence—are valuable regardless of your deployment choice. Understanding how these systems work at a fundamental level makes you a better engineer, whether you're managing your own infrastructure or architecting applications on managed platforms.

Remember: the goal isn't to rebuild MongoDB Atlas, but to understand the principles that make distributed databases resilient and scalable. This knowledge will serve you well in designing and operating any distributed system.

Next Steps

To further enhance your deployment:

Implement Prometheus monitoring with MongoDB exporter
Deploy Grafana dashboards for visualization
Set up automated backups with Velero or Kanister
Configure alerting with AlertManager
Implement GitOps with ArgoCD or Flux for declarative management
Explore sharding for extreme scale (10TB+ datasets)
Test disaster recovery procedures regularly

The journey from a basic deployment to a robust, production-ready system is iterative. Start with this foundation and continuously improve based on your specific requirements and operational experience.

Why Even Cloudflare Struggles with DNS: The Deceptively Complex Foundation of the Internet

Ahmed Rakan — Thu, 20 Nov 2025 16:38:13 +0000

The Deceptive Simplicity of DNS

One of the foundational components at Cloudflare is DNS. As one of the largest enterprises in the software industry, managing over 20% of the world's internet traffic, Cloudflare has built its reputation for security, CDN services, and other products on its DNS expertise.

Yet even they experience DNS issues. DNS problems are one of the worst nightmares technical teams face because they cascade across infrastructure like nothing else.

If Cloudflare, with all their expertise, has DNS problems, what does that tell us?

Is DNS Simple? Yes and No.

What is DNS? The Domain Name System is a hierarchical and distributed naming service that provides a naming system for computers, services, and other resources on the internet.

Basically, anything with an IP address can get a domain name. Those domain names usually point to IPs via DNS records.

DNS records live on authoritative name servers. Public resolvers such as Google's 8.8.8.8 and Cloudflare's 1.1.1.1 don't hold your records; they are recursive resolvers that look answers up on your behalf and cache them.

When you visit example.com, your device first checks the browser cache and the local DNS cache. If no record is found, it follows your network configuration to reach a recursive resolver (your network's DNS server, or a public one). On a cache miss, the resolver contacts three kinds of servers in sequence.

First is the root server, which stores information about TLDs like .io and .com and identifies which TLD server is responsible for each. Next, the TLD server directs the resolver to the authoritative name server for the specific domain. That authoritative server contains the DNS records you’ve configured. Once retrieved, the result is returned to your browser and cached for future use.
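
The lookup chain above can be sketched as a toy model in Python. Everything here is made up for illustration: the server labels, the dictionaries standing in for servers, and the address (taken from a range reserved for documentation). A real resolver speaks the DNS protocol over the network and also handles TTLs, retries, and much more.

```python
# Toy data: each dictionary stands in for one kind of DNS server.
ROOT = {"com": "tld-com"}                            # root: which server handles each TLD
TLD = {"tld-com": {"example.com": "ns-example"}}     # TLD: which name server owns a domain
AUTHORITATIVE = {"ns-example": {"example.com": "192.0.2.10"}}  # the records themselves

cache = {}


def resolve(name):
    """Walk root -> TLD -> authoritative, the way a recursive resolver would."""
    if name in cache:
        return cache[name], ["cache"]
    tld = name.rsplit(".", 1)[-1]
    tld_server = ROOT[tld]
    domain = ".".join(name.split(".")[-2:])
    auth_server = TLD[tld_server][domain]
    address = AUTHORITATIVE[auth_server][name]
    cache[name] = address
    return address, ["root", tld_server, auth_server]


print(resolve("example.com"))  # first lookup walks all three levels
print(resolve("example.com"))  # second lookup is answered from the cache
```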

Fun fact: DNS is often described as one of the most heavily queried distributed databases in the world.

Authoritative DNS servers get much of their speed from how they store records: many of them load zone files, which are structured text files, into RAM for extremely fast access.

The "This is Simple" Illusion: Simple Surface, Infinite Depth

Most people add a few records, run some queries, see things working, and think they've mastered DNS. But that's just the tip of the iceberg. As mentioned earlier, one of the hardest problems you'll encounter in software is almost always related to DNS in some way.

Think of DNS as your home address. If no one knows that address, no one can reach you. But unlike a home address that you share with a few people, DNS is an address that needs to propagate to billions of devices worldwide once authoritative name servers get a record of it.

A single misconfiguration in DNS doesn't affect just one component—it cascades faster than almost any other system failure.

This speed is fundamental to how DNS was designed. You create a DNS record so that a name points at the resource you want other systems or people to reach, and the authoritative servers usually start serving it within milliseconds. The familiar warning that "changes take 48 hours to appear" is about caches holding the old answer until its TTL expires, not about the update itself being slow. The speed is a feature, not a bug. But it means mistakes spread just as quickly as corrections.

Common DNS Problems That Keep Engineers Up at Night

1. TTL (Time To Live) Misconfiguration

Imagine setting a TTL of 86400 seconds (24 hours) on a critical record, then needing to change it urgently. You're now stuck waiting up to 24 hours for this change to fully propagate because caching servers worldwide will hold onto the old value. The cache will only invalidate once the TTL of your previous record expires.
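
The effect is easy to reproduce with a toy cache. This Python sketch is an illustration only: `TtlCache` and the `zone` dictionary are hypothetical stand-ins for a caching resolver and an authoritative server, and the addresses come from a documentation range.

```python
class TtlCache:
    """Resolver-style cache: an answer is reused until its TTL runs out."""

    def __init__(self):
        self._entries = {}  # name -> (value, expires_at)

    def lookup(self, name, now, authoritative):
        entry = self._entries.get(name)
        if entry and now < entry[1]:
            return entry[0]  # still fresh: the authoritative server is not asked
        value, ttl = authoritative[name]
        self._entries[name] = (value, now + ttl)
        return value


zone = {"api.example.com": ("192.0.2.10", 86400)}
cache = TtlCache()
first = cache.lookup("api.example.com", now=0, authoritative=zone)

zone["api.example.com"] = ("192.0.2.20", 300)  # urgent change at the source
stale = cache.lookup("api.example.com", now=3600, authoritative=zone)
fresh = cache.lookup("api.example.com", now=86400, authoritative=zone)
print(first, stale, fresh)
```

An hour after the change the cache still returns the old address; only when the original 24-hour TTL has run out does it ask again. That is why TTLs are usually lowered well ahead of a planned change.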

2. CNAME Chain Loops

You create a CNAME record pointing to another domain, which points to another, which accidentally points back to the first. Resolvers chasing the chain never reach an address; they give up after a bounded number of hops and return an error (typically SERVFAIL), so queries fail and your entire service becomes unreachable. These chains can be hard to spot in large infrastructures with multiple teams managing different zones.
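
A loop like this is cheap to catch before it ships. Below is a minimal Python sketch of a pre-deployment check, assuming the CNAME records are available as a plain dictionary; `follow_cnames` and its `max_hops` limit are illustrative choices, not values taken from any particular resolver.

```python
def follow_cnames(name, cname_records, max_hops=8):
    """Follow a CNAME chain; raise if it loops or grows past max_hops."""
    seen = [name]
    while name in cname_records:
        name = cname_records[name]
        if name in seen:
            raise ValueError("CNAME loop: " + " -> ".join(seen + [name]))
        seen.append(name)
        if len(seen) - 1 > max_hops:
            raise ValueError(f"CNAME chain longer than {max_hops} hops")
    return name


records = {
    "www.example.com": "app.example.net",
    "app.example.net": "lb.example.org",
    "lb.example.org": "www.example.com",  # the accidental link back to the start
}

try:
    follow_cnames("www.example.com", records)
except ValueError as err:
    print(err)
```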

3. Split-Horizon DNS Conflicts

Your internal DNS says api.example.com points to 10.0.0.5, but your external DNS says it points to 52.123.45.67. An employee working remotely suddenly can't access the internal service because their VPN isn't routing DNS queries correctly. Debugging takes hours because the problem appears and disappears based on network location.
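
What makes this hard to debug is that the answer depends on where the query comes from. A toy Python model of the two views follows; the `10.0.0.0/8` internal range and the client addresses are assumptions for illustration, while the two answers are the ones from the example above.

```python
import ipaddress

INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]  # assumed corporate/VPN range

VIEWS = {
    "internal": {"api.example.com": "10.0.0.5"},
    "external": {"api.example.com": "52.123.45.67"},
}


def answer_for(client_ip, name):
    """Pick the zone view from the client's source address, as split-horizon DNS does."""
    address = ipaddress.ip_address(client_ip)
    view = "internal" if any(address in net for net in INTERNAL_NETWORKS) else "external"
    return view, VIEWS[view][name]


print(answer_for("10.8.0.12", "api.example.com"))    # query arriving through the VPN
print(answer_for("203.0.113.7", "api.example.com"))  # query leaking to the public side
```

The same name returns two different addresses, so a remote employee whose DNS queries bypass the VPN gets the external answer and cannot reach the internal service.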

4. DNSSEC Validation Failures

You enable DNSSEC for security, but a key rotation goes wrong or a signature expires. Now, instead of your site being accessible but potentially vulnerable, it's completely unreachable for anyone with DNSSEC validation enabled, with cryptic error messages that don't mention DNS at all.

5. Propagation Delays and Race Conditions

You update a DNS record and immediately deploy new infrastructure to that address. Some users get the new record instantly, while others are still seeing the cached old record for minutes or hours.
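
How long "minutes or hours" lasts depends on the TTL and on when each resolver last cached the old record. The Python sketch below is a deliberately simplified model: it assumes resolvers cached the old answer at evenly spaced moments during the TTL window before the change, which real traffic will not match exactly. The function name is illustrative.

```python
def share_serving_new_record(ttl, resolvers, seconds_since_change):
    """Fraction of resolvers that have dropped the old answer by a given time.

    Assumes each resolver cached the old record at an evenly spaced moment
    during the TTL window before the change, so expiries are spread uniformly.
    """
    refreshed = 0
    for i in range(resolvers):
        cached_at = -ttl * (i + 1) / resolvers  # seconds relative to the change
        expires_at = cached_at + ttl
        if seconds_since_change >= expires_at:
            refreshed += 1
    return refreshed / resolvers


for t in (0, 900, 1800, 2700):
    print(t, share_serving_new_record(ttl=3600, resolvers=4, seconds_since_change=t))
```

Under this model the share of resolvers handing out the new address climbs steadily and only reaches 100% close to one full TTL after the change, so the old infrastructure has to keep serving traffic for at least that long.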

The DNS Learning Simulation: A Lesson in Humility

One interesting project I worked on with an intern involved creating a mini DNS simulation. We had fun, but the real purpose was to teach both of us a lesson: we will never know everything. Our brains aren't designed to store complete knowledge about any complex system. We have limited cognitive capacity, and our best approach is to know just enough to get the job done effectively and to know where to look things up when we need to refresh our memory.

This principle holds true even for self-proclaimed experts in their fields. Take C++ as an example. The language has gone through multiple standards (C++98, C++11, C++14, C++17, C++20, C++23), each with hundreds of features, edge cases, and gotchas. If someone claims they know everything about C++, you can easily construct a scenario involving template metaprogramming, undefined behavior, or obscure standard library details that will humble them quickly.

DNS is no different. The tip of the iceberg is genuinely simple—point a name to an IP address. But once you decide to dive deeper, there's no end. It's like a decision tree where every node branches into multiple paths, and each path leads to more branches.

Consider this example: You start investigating why a query is slow. That leads you to examine authoritative nameservers, which leads to TTL settings, which leads to caching behavior across multiple resolver layers, which leads to anycast routing, which leads to BGP configurations, which leads to geographic DNS policies, which leads to EDNS client subnet considerations, which leads to privacy implications, which leads to DNS-over-HTTPS versus DNS-over-TLS debates, which leads to studying Certification Authority Authorization (CAA) records, which leads back to DNSSEC... and before you arrive at the depth you were searching for, you've probably forgotten which root node your investigation began at. Was it the slow query? The failed health check? The intermittent timeout?

The Missing Tool: Why We Need a DNS Simulator

The best way to truly understand DNS complexity would be through a comprehensive DNS simulator. To my surprise, I couldn't find a production-quality tool that does this. In the current software engineering industry, even at the biggest companies with the best engineers, a DNS change is at most an educated guess backed by experience and prayer.

They run staging environments, yes. They have monitoring, absolutely. But they can't truly simulate how a DNS change will propagate across thousands of recursive resolvers with different caching policies, how it will interact with CDN configurations, how mobile devices switching between networks will handle it, or how edge cases in specific resolver implementations will respond.

This tool would need to model:

Multiple recursive resolver behaviors (Google DNS, Cloudflare DNS, OpenDNS, ISP resolvers)
Caching layers at different TTL stages
DNSSEC validation chains
Anycast routing scenarios
Network partition simulations
DNS cache poisoning attempts
Rate limiting behaviors
EDNS extensions and compatibility

Building this will take significant time, likely months of dedicated development to reach even a minimally viable prototype. But it's on my 2026 calendar because the industry desperately needs it. Every day, engineers at companies large and small make DNS changes hoping they won't cause the next cascading failure. A proper simulator could transform DNS operations from educated guessing into confident engineering.

The Reality of DNS in Production

DNS combines several challenging aspects of distributed systems:

Global scope: Your changes affect the entire internet
Caching complexity: Multiple layers with independent policies
No rollback mechanism: Once propagated, you can't easily undo a DNS change
Debugging difficulty: Problems manifest differently based on location, resolver, and timing
Security implications: DNS is a frequent attack vector (DDoS amplification, cache poisoning, subdomain takeovers)

Even Cloudflare, with their massive infrastructure and DNS expertise, has experienced outages traced back to DNS issues.

The lesson here isn't that DNS is impossible to master. It's that treating it as "simple" is the fastest path to production incidents. Respect its complexity, document your configurations meticulously, make changes conservatively, and always have a rollback plan (even if it means waiting out a TTL period).

Until we have better simulation tools, DNS operations will remain part science, part art, and part crossing your fingers.

Your Understanding of Abstraction is Incomplete (And It's Holding You Back)

Ahmed Rakan — Sat, 15 Nov 2025 15:04:09 +0000

The Hidden Truth About Software Mastery

If there's one concept that separates good developers from exceptional ones, it's abstraction. Yet after 7+ years in professional software engineering and entrepreneurship, I've witnessed countless talented developers fall into the same trap—they use abstraction without truly understanding it.

What Most Developers Get Wrong About Abstraction

Ask any senior software engineer to define abstraction, and you'll typically hear:

"Abstraction is simplifying complex systems by focusing on important characteristics while hiding implementation details."

This definition is correct but dangerously incomplete.

Yes, abstraction allows us to create clean interfaces for complex systems. Yes, it makes frameworks feel "easy to use." But here's the trap: this false sense of simplicity breeds mediocrity.

The Authentication Trap

Here's a pattern I see repeatedly:

The mediocre developer thinks: "The framework provides authentication? Perfect. I'll just call the API and—magic—my application has authentication!"

The great developer asks: "How does this authentication mechanism actually work? What are the security implications? What happens when it fails?"

You cannot hide implementation details effectively if you don't understand them deeply.

The Abstraction Layers: Where Software Actually Lives

Software isn't just "code that runs." It's a carefully orchestrated stack of abstraction layers, each building on the one below:

Every feature you build, every bug you debug, every scaling challenge you face—they all exist somewhere within these layers. The developers who understand layer interactions solve problems 10x faster.

The Down-Up, Up-Down Methodology

I developed this approach to systematically master complex systems beyond their simple interfaces. It's deceptively simple but incredibly powerful:

The Core Principle

Never move to the next abstraction layer until you completely grasp the current one.

When to Use Each Approach

Top-Down (Start at Application Layer):

Security vulnerabilities
Performance optimization
Feature debugging
API design

Bottom-Up (Start at Infrastructure Layer):

Scaling architecture
Reliability improvements
Network issues
Infrastructure debugging

Where to Stop?

Top Layer: Usually obvious—it's your application code or user interface
Bottom Layer: In software, you rarely need to go beyond the OS kernel. Hardware, driver, and other low-level programmers may need to go deeper than that.

Real-World Case Study: The 413 Error Mystery

Let me show you how abstraction mastery solves real problems.

The Situation

A client's CI/CD pipeline had been broken for a week. Their entire team was stumped. Only one pipeline failed, returning 413 Request Entity Too Large from their self-hosted container registry.

Their Stack:

Cloud load balancer
Kubernetes cluster
Cloudflare (proxy enabled)
Self-hosted container registry

The Investigation: Layer-by-Layer Analysis

The Three Culprits

Cloudflare Proxy (Layer 5): 500MB request limit for Enterprise plan
- Solution: Disable proxy for registry endpoint
Ingress Controller (Layer 6): Default request size limits
- Solution: Add annotation: nginx.ingress.kubernetes.io/proxy-body-size
Container Registry (Layer 7): Configuration limits
- Solution: Update configuration parameters
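
One way to picture why a single error had three causes: an upload only succeeds if it fits under the limit of every hop it passes through. The effective limit is the smallest one on the path, and raising one limit simply exposes the next. The TypeScript sketch below uses hypothetical hop names and limits. Only the 500 MB Cloudflare figure comes from the case above; the other numbers are placeholders.

```typescript
// Hypothetical per-hop body-size limits, in megabytes.
// Only the Cloudflare figure comes from the case study; the rest are placeholders.
type Hop = { name: string; maxBodyMb: number };

const hops: Hop[] = [
  { name: "cloudflare-proxy", maxBodyMb: 500 },
  { name: "ingress-controller", maxBodyMb: 1 },
  { name: "container-registry", maxBodyMb: 100 },
];

// Return every hop that would reject a request of the given size, in path order.
function rejectingHops(path: Hop[], bodyMb: number): string[] {
  return path.filter((hop) => bodyMb > hop.maxBodyMb).map((hop) => hop.name);
}

// The effective limit of the whole path is the smallest limit on it.
function effectiveLimitMb(path: Hop[]): number {
  return Math.min(...path.map((hop) => hop.maxBodyMb));
}

const blockedBy = rejectingHops(hops, 800); // a hypothetical 800 MB image layer
const limit = effectiveLimitMb(hops);
```

With these placeholder numbers an 800 MB push is rejected by all three hops, which is why fixing only one of them changes nothing visible.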

One visible error. Three interconnected root causes across different abstraction layers.

Their team spent a week looking at logs. I solved it in hours by systematically analyzing each layer.

Practical Steps to Master Abstraction

1. Read the Source Code

At least once, read the source code of critical tools you use:

Your web framework
Your database driver
Your authentication library
Your cloud SDK

You'll never look at these tools the same way again.

2. Practice Layer-by-Layer Debugging

Next time you encounter a bug:

3. Ask Deeper Questions

When using any framework or tool:

How does this actually work under the hood?
What assumptions is this abstraction making?
What happens when things go wrong?
Which layers does this touch?

4. Build Mental Models

Create diagrams (like the ones in this post) for systems you work with. Visualizing abstraction layers dramatically improves understanding.

The Scalability Question

Here's a common scenario in technical meetings:

Manager: "How do we scale this solution?"

This isn't really a question—it's a disguised request: "Teach me about scalability."

The truth: Scalability, availability, security, robustness, and reliability all come down to understanding abstraction.

Scaling is Layer-by-Layer

You can't architect scalability if you only understand one layer. You need to see how they interact.

The Competitive Advantage

The professionals who truly excel in software engineering are those who:

✅ Understand how abstraction layers interact

✅ Can debug across multiple layers simultaneously

✅ Don't treat frameworks as magic black boxes

✅ Read source code regularly

✅ Apply systematic investigation methodologies

Stop treating abstraction as just theory. It's the practical framework that separates good engineers from great ones.

Your Action Plan

This Week: Pick one framework you use daily and read its source code for 1 hour
This Month: Practice the down-up, up-down approach on your next bug
This Quarter: Create abstraction diagrams for your main systems
This Year: Become the engineer who solves problems others can't

Conclusion

Your understanding of abstraction is likely incomplete—and that's okay. Recognition is the first step.

The question is: What will you do about it?

The developers who master abstraction don't just write code—they architect systems that scale, debug issues that mystify others, and build careers that others envy.

Abstraction isn't just a concept. It's your competitive advantage.

What's your experience with abstraction in software engineering? Have you encountered situations where understanding multiple layers made the difference? Share your stories in the comments below.

Beyond the Hype: Technologies That Will Outlive the AI Bubble

Ahmed Rakan — Sat, 08 Nov 2025 16:08:34 +0000

Understanding the AI Bubble

The "AI bubble" refers to a period of inflated hype, valuations, and investment in artificial intelligence technologies, the kind of phenomenon that historically contracts when expectations outpace reality. We have seen this pattern before with the dot-com bubble of the late 1990s and the blockchain craze of 2017-2018.

How We Got Here

The current AI fervor centers on one audacious promise: Artificial General Intelligence (AGI)—a system that could theoretically solve all problems. The logic seems circular: if we weren't intelligent enough to solve our current problems, how will we create something that solves all problems? Yet this promise has driven unprecedented investment.

The financial stakes are staggering. AI startups require massive upfront capital: data centers, specialized hardware, talent acquisition (with pay on par with CTOs and CEOs), and computational resources. To justify these costs, companies made bold promises centered on one concept: intelligence at scale.

The watershed moment came when OpenAI, initially founded as a non-profit in 2015 with backing from Elon Musk, Sam Altman, and others, transitioned to a "capped-profit" model in 2019 after securing initial funding. This shift signaled that AGI wasn't just a research goal—it was a market opportunity. Major tech companies, nations, and venture capitalists rushed in, inflating valuations to bubble territory.

What Could Burst the Bubble?

The most likely catalyst: failure to deliver on superintelligence promises. When investors and businesses realize that general intelligence remains elusive, or that the returns don't justify the astronomical investments, a correction becomes inevitable.

But here's the crucial insight: when bubbles burst, they don't destroy everything. The technologies that survive are those with fundamental utility—tools that solve concrete, enduring problems regardless of hype cycles.

This article explores those resilient technologies. Not to dismiss AI's legitimate achievements, but to help individuals and organizations position themselves wisely for what comes next.

I. Foundational Compute and Infrastructure

The bedrock of all digital systems isn't going anywhere. Even AI systems depend entirely on these fundamentals.

1. Semiconductors & Chip Design

Why it matters: The world will always need faster, more efficient processors. Whether the focus is CPUs, GPUs, NPUs, or AI accelerators, the fundamental need for better silicon is eternal.

Market reality: The semiconductor industry represents a $500+ billion global market with applications far beyond AI—automotive, telecommunications, consumer electronics, defense, and medical devices all depend on continuous chip innovation.

Key players: TSMC, NVIDIA, Intel, AMD, Samsung, ASML

2. Cloud Computing

Why it matters: Demand for on-demand computing power and storage is at an all-time high. Even if AI-specific workloads decrease, the global trend toward digitization and remote everything ensures cloud longevity.

Market reality: Cloud infrastructure spending exceeded $240 billion in 2024, driven by enterprises migrating critical workloads, remote work infrastructure, streaming services, and global-scale applications.

Key players: AWS, Microsoft Azure, Google Cloud, Alibaba Cloud

3. Quantum Computing (Research Field)

Why it matters: Quantum computers' ability to solve specific classically intractable problems—materials science, drug discovery, cryptography, optimization—ensures long-term investment despite commercial viability remaining years away.

Current state: Still largely in the research phase, but companies like IBM, Google, and IonQ are making steady progress. The technology targets problems that classical computers cannot solve in any practical amount of time, making it strategically important.

II. Software Engineering & Development

1. Cybersecurity

Why it matters: As long as digital systems exist, malicious actors will try to exploit them. Cybersecurity is an eternal cat-and-mouse game that becomes more critical as systems grow more sophisticated.

Market reality: The global cybersecurity market is projected to reach $400+ billion by 2030, driven by increasing attack sophistication, regulatory requirements (GDPR, CCPA, NIS2), and the expanding attack surface of IoT and cloud systems.

Persistent threats: Ransomware, supply chain attacks, state-sponsored espionage, and zero-day exploits ensure this field remains mission-critical.

2. Open-Source Software

Why it matters: The vast majority of the internet, cloud infrastructure, and embedded systems run on open-source software. Linux powers over 90% of cloud infrastructure. This collaborative model for building foundational tools is proven and permanent.

Examples: Linux kernel, Kubernetes, PostgreSQL, Python, React, TensorFlow—these projects form the backbone of modern technology and aren't owned by any single entity.

3. Databases & Data Engineering

Why it matters: "Data is the new oil" may be cliché, but it's accurate. The ability to store, manage, process, and move large amounts of data reliably is fundamental to every modern business—AI-driven or not.

Enduring truth: SQL, created in the 1970s, remains ubiquitous in 2025. People will likely still write SQL in 3025 if civilization survives. Data engineering (ETL pipelines, data warehousing, real-time streaming) solves problems that don't disappear with hype cycles.

Key technologies: PostgreSQL, Apache Kafka, Snowflake, Apache Spark, Redis

4. Low-Level Programming Languages (Rust, C, C++)

Why it matters: These languages aren't replaceable. They're essential for building operating systems, browsers, game engines, embedded systems, and performance-critical applications.

Why they persist: When you need direct hardware control, predictable performance, and minimal overhead, high-level abstractions won't suffice. These languages will likely outlive everything else on this list.

Examples: Windows, Linux, Chrome, Firefox, Unreal Engine, and most firmware are written in these languages.

III. Hardware and Connectivity

1. Robotics and Automation

Why it matters: The desire to automate dangerous, dirty, dull, or precision-requiring tasks is a fundamental economic driver. From manufacturing and logistics to surgery, robotics solves clear physical problems.

Economic incentive: Companies invest in robotics to turn expensive manual tasks into cheaper, more autonomous, streamlined operations. This equation doesn't change with AI hype cycles.

Applications: Warehouse automation (Amazon), surgical robots (da Vinci), manufacturing (Tesla Gigafactories), agriculture (autonomous tractors)

2. Internet of Things (IoT)

Why it matters: The ability to gather real-world data and remotely control devices has vast utility in agriculture, logistics, smart cities, healthcare, and industrial settings.

Scale: By 2025, there are over 30 billion connected IoT devices globally, enabling everything from precision farming to predictive maintenance in factories.

3. Networking (5G, 6G, and Beyond)

Why it matters: The world's demand for faster, more reliable, and lower-latency connectivity is insatiable. Network infrastructure is the backbone of modern society.

Evolution: Each generation of wireless technology enables new use cases—3G enabled mobile internet, 4G enabled streaming and social media, 5G enables real-time applications and IoT at scale. This progression continues regardless of AI trends.

4. Renewable Energy & Battery Technology

Why it matters: The transition to sustainable energy is one of the defining challenges of our century—arguably the real next industrial revolution. Technologies for generating, storing, and managing clean energy are always critical.

Market forces: Climate change, energy security, and economics all drive renewable adoption. Solar, wind, battery storage, and grid management technologies will remain strategic priorities for decades.

IV. Emerging Software Paradigms

1. DevOps and Platform Engineering

Why it matters: The culture and practice of streamlining software development, deployment, and maintenance is all about efficiency and reliability—goals that remain in demand regardless of technology trends.

Evolution: The shift from DevOps to Platform Engineering reflects the maturation of these practices, focusing on building internal developer platforms that improve productivity across organizations.

2. Privacy-Enhancing Technologies (PETs)

Why it matters: As digital awareness grows, so does demand for privacy. Technologies like differential privacy, zero-knowledge proofs, homomorphic encryption, and end-to-end encryption will become standard, not optional.

Regulatory pressure: GDPR, CCPA, and emerging AI regulations worldwide are making privacy a legal requirement, not just a nice-to-have.

Examples: Signal's encryption protocol, Apple's differential privacy implementations, blockchain privacy solutions

3. Digital Identity and Authentication

Why it matters: Proving who you are online is a foundational problem that needs increasingly robust, secure solutions. As digital interactions grow, so does identity fraud—making this an arms race.

Emerging solutions: Passwordless authentication, biometrics, decentralized identity, WebAuthn, and multi-factor authentication are all evolving to meet growing security demands.

What Will Likely Die?

1. AI-Washed Products

Companies that simply slapped an "AI" label on mediocre products without real technological edge or solid business models. The market eventually punishes branding over substance.

2. Purely Speculative Startups

Startups with huge valuations based on "future AI potential" but no clear path to profitability, defensible moat, or definable market. When capital becomes expensive, these companies evaporate.

3. Undifferentiated Foundation Models

Many companies building giant, general-purpose LLMs from scratch will struggle to compete with established players like OpenAI, Google DeepMind, Anthropic, and Meta. The "me-too" models will consolidate or disappear as the economics become clear—training costs billions, and monetization remains challenging.

The Bottom Line

History teaches us that technological bubbles don't destroy innovation—they expose what's truly valuable. The dot-com crash didn't kill the internet; it killed companies with unsustainable business models. The survivors—Amazon, Google, eBay—built on genuine utility.

The same pattern will repeat with AI. The technologies that survive will be those solving concrete, enduring problems: secure systems, efficient infrastructure, data management, physical automation, energy sustainability, and human connectivity.

The wise strategy isn't to abandon AI entirely, but to recognize where genuine value lies. Build skills in fundamentals. Invest in technologies with clear use cases. Bet on problems that won't disappear when the hype cycle turns.

As the saying goes: when the tide goes out, you see who's been swimming naked. The technologies listed here? They're wearing suits made of steel.

Building MeridianDB: Solving AI's Memory Crisis with Multi-Dimensional RAG

Ahmed Rakan — Wed, 05 Nov 2025 08:42:34 +0000

Why I Built This

When exploring cloud platforms, I don't just read documentation—I build something substantial. Recently, I dove deep into Cloudflare Workers, and I wanted to tackle a problem that's becoming critical in today's AI landscape: catastrophic forgetting.

The Problem: AI Agents That Forget

Traditional RAG (Retrieval-Augmented Generation) systems use vector databases to enhance AI outputs by storing data as embeddings—multi-dimensional vectors that machines can understand. When you search, the system transforms your query into vectors and performs similarity searches using mathematical distance calculations.

This approach searches for meaning, not just text. But it fails to solve a fundamental problem in agentic AI: catastrophic forgetting—when AI systems learn new information, they often forget old knowledge.
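
As a small illustration of the "mathematical distance" part, here is a TypeScript sketch of cosine similarity plus a brute-force top-k search. The three-dimensional vectors and the `stored` array are made up for the example. Real embeddings have hundreds of dimensions, and a vector database uses an index instead of scanning everything.

```typescript
// Cosine similarity between two equal-length vectors: 1 = same direction, 0 = orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical 3-dimensional "embeddings", for illustration only.
const stored = [
  { id: "m1", vector: [1, 0, 0] },
  { id: "m2", vector: [0.6, 0.8, 0] },
  { id: "m3", vector: [0, 0, 1] },
];

// Rank stored vectors by similarity to the query and keep the ids of the top k.
function topK(query: number[], k: number): string[] {
  return stored
    .map((m) => ({ id: m.id, score: cosineSimilarity(query, m.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((m) => m.id);
}

const nearest = topK([0.9, 0.1, 0], 2); // m1 first, then m2
```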

Standard RAG mitigates this issue but doesn't fundamentally solve it. As user data grows exponentially, two critical questions emerge:

How does retrieved data affect AI generation quality?
How relevant is this data over time?

The Solution: Multi-Dimensional Memory

MeridianDB goes beyond traditional RAG by adding multiple dimensions on top of semantic search. Built entirely on Cloudflare's infrastructure (Workers, D1, Vectorize, KV, Queues, and R2), it provides Auto-RAG that's highly scalable, performant, and runs at the edge—near your users, without headaches.

The Four Dimensions of Memory

1. Semantic Search

Like any RAG database, MeridianDB uses Cloudflare Vectorize at its core. When your AI agent sends a query, it performs semantic search to retrieve meaningful data. We recommend over-fetching to allow other features to refine results.

2. Behavioral Learning

When your agent retrieves data, you can add like/dislike buttons to generated responses. User feedback creates behavioral signals—all memories retrieved get penalized for negative signals. Combined with agent configuration, this filters out memories that produce poor results.

3. Temporal Decay

Facts become irrelevant over time. We provide temporal features where you can:

Mark data as factual (always included, no decay)
Mark data as irrelevant (always excluded)
Let intelligent active/passive learning determine inclusion based on smart filtering and access patterns

Our exponential decay algorithm with frequency boost ensures recent and frequently accessed memories stay relevant while old, unused memories naturally fade.

4. Contextual Filtering

Developers or other AI agents can describe memories for specific tasks. This additional metadata helps task-performing agents find precisely what they need.

The Science Behind It

We considered adding graph capabilities—giving agentic AI the ability to build knowledge graphs would be powerful. We could implement this with edge columns and JOIN queries, but decided against it for now to maintain simplicity and performance.

The core challenge is balancing stability and plasticity:

Stability: AI systems must consolidate old knowledge when learning new things
Plasticity: AI agents must learn new things quickly

This balance varies wildly by use case. A chatbot's stability-plasticity requirements differ dramatically from a coding agent, which needs longer memory consolidation and slower learning rates.

MeridianDB's federated database is extremely configurable, with passive/active learning controlled through agent configuration.

Architecture Decisions

Handling Consistency

Many developers overlook a critical question: when building RAG, your queries are federated (affecting multiple databases)—how do you handle consistency?

Data can go out of sync. Embeddings may succeed while record insertion fails. Lots can go wrong.

MeridianDB handles all of this out of the box.

Our white paper details our approach:

Queue-based writes ensure eventual consistency without manual orchestration
Data is redundantly stored across Vectorize (which stores the embedding plus only the ID of the memory in D1) and D1 (which stores the memory content) to preserve multi-dimensional context
Automatic retries, failover, graceful degradation on retrieval, NewSQL-inspired transactions, and event-driven processing
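
To give a feel for the queue-based write idea, here is a rough TypeScript sketch. It is not MeridianDB's actual code. The `QueueMessage` interface is a local stand-in that mirrors the ack/retry shape of Cloudflare Queues consumer messages, and `ContentStore`, `VectorStore`, and `handleBatch` are hypothetical names. The write order (content first, then vector) is also an assumption.

```typescript
// Minimal local stand-in for a queue message with ack/retry semantics.
interface QueueMessage<T> {
  body: T;
  ack(): void;
  retry(): void;
}

type MemoryWrite = { id: string; content: string; embedding: number[] };

// Hypothetical store interfaces; in MeridianDB these roles are played by D1 and Vectorize.
interface ContentStore {
  upsert(id: string, content: string): Promise<void>;
}
interface VectorStore {
  upsert(id: string, embedding: number[]): Promise<void>;
}

// Process each message independently: ack only when both writes succeed,
// otherwise ask the queue to redeliver. Upserts keep redelivery idempotent.
async function handleBatch(
  messages: QueueMessage<MemoryWrite>[],
  content: ContentStore,
  vectors: VectorStore,
): Promise<void> {
  for (const msg of messages) {
    try {
      await content.upsert(msg.body.id, msg.body.content);
      await vectors.upsert(msg.body.id, msg.body.embedding);
      msg.ack();
    } catch {
      msg.retry();
    }
  }
}
```

Because a failed message is redelivered rather than dropped, the two stores converge eventually even when one of the writes fails the first time.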

The Learning Phases

We recommend operating agents in two phases:

Phase 1: Passive Learning

Start with successRate: 0.0 and stabilityThreshold: 0.0. This prevents false positives when the system lacks sufficient data. The agent collects interaction data without aggressive filtering.

Phase 2: Active Learning

Once you've accumulated meaningful data, activate filtering by setting appropriate thresholds. The system automatically filters out:

Memories with low success rates (behavioral)
Memories with low stability scores (temporal)
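
The two phases can be pictured as the same filter with different thresholds. The TypeScript sketch below is illustrative only. The names `successRate` and `stabilityThreshold` come from the text above, while the `ScoredMemory` shape, the `filterMemories` helper, and the active-phase numbers are assumptions.

```typescript
// Hypothetical shapes; only successRate and stabilityThreshold are named in the text.
type ScoredMemory = { id: string; successRate: number; stability: number };
type AgentThresholds = { successRate: number; stabilityThreshold: number };

// Keep only memories that meet both the behavioral and the temporal threshold.
function filterMemories(memories: ScoredMemory[], t: AgentThresholds): ScoredMemory[] {
  return memories.filter(
    (m) => m.successRate >= t.successRate && m.stability >= t.stabilityThreshold,
  );
}

const candidates: ScoredMemory[] = [
  { id: "a", successRate: 0.9, stability: 0.8 },
  { id: "b", successRate: 0.2, stability: 0.9 }, // poor feedback
  { id: "c", successRate: 0.7, stability: 0.1 }, // decayed
];

// Phase 1 (passive): zero thresholds let everything through.
const passive = filterMemories(candidates, { successRate: 0.0, stabilityThreshold: 0.0 });

// Phase 2 (active): example thresholds, chosen arbitrarily for this sketch.
const active = filterMemories(candidates, { successRate: 0.5, stabilityThreshold: 0.3 });
```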

Temporal Configuration

We use exponential decay with frequency boost. Each agent has its own configuration:

Balanced (Default)

{
  halfLifeHours: 168,      // 7 days
  timeWeight: 0.6,
  frequencyWeight: 0.4,
  decayCurve: 'hybrid',
  decayFloor: 0.15
}

Aggressive Decay (for chatbots)

{
  halfLifeHours: 72,       // 3 days
  timeWeight: 0.7,
  frequencyWeight: 0.3,
  decayCurve: 'exponential'
}

Long-Term Memory (for knowledge bases)

{
  halfLifeHours: 720,      // 30 days
  timeWeight: 0.5,
  frequencyWeight: 0.5,
  decayCurve: 'polynomial'
}

The recency score calculation runs in SQL, keeping retrieval latency at 300-500ms.
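
For intuition, here is one way the configuration fields above could combine into a score, written in TypeScript for readability. This is a sketch under assumptions, not the formula MeridianDB runs in SQL. It covers only a plain exponential curve, and the frequency boost `accessCount / (accessCount + 1)` is invented for the example.

```typescript
type DecayConfig = {
  halfLifeHours: number;
  timeWeight: number;
  frequencyWeight: number;
  decayFloor?: number;
};

// Exponential decay: the score halves every halfLifeHours and never drops below the floor.
function timeScore(ageHours: number, cfg: DecayConfig): number {
  const decayed = Math.pow(0.5, ageHours / cfg.halfLifeHours);
  return Math.max(cfg.decayFloor ?? 0, decayed);
}

// Assumed frequency boost: 0 for never-accessed memories, approaching 1 as accesses grow.
function frequencyScore(accessCount: number): number {
  return accessCount / (accessCount + 1);
}

// Weighted blend of the time component and the frequency component.
function recencyScore(ageHours: number, accessCount: number, cfg: DecayConfig): number {
  return (
    cfg.timeWeight * timeScore(ageHours, cfg) +
    cfg.frequencyWeight * frequencyScore(accessCount)
  );
}

const balanced: DecayConfig = {
  halfLifeHours: 168,
  timeWeight: 0.6,
  frequencyWeight: 0.4,
  decayFloor: 0.15,
};

const weekOldUnused = recencyScore(168, 0, balanced); // 0.6 * 0.5 + 0.4 * 0, about 0.30
const weekOldPopular = recencyScore(168, 9, balanced); // 0.6 * 0.5 + 0.4 * 0.9, about 0.66
const veryOldUnused = recencyScore(1680, 0, balanced); // floor applies: 0.6 * 0.15, about 0.09
```

Under these assumptions a week-old memory that keeps getting accessed scores about twice as high as one that was never touched, which is the behavior the frequency boost is meant to produce.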

Behavioral Configuration

Behavioral features use the Wilson score confidence interval—a statistically robust method for scoring with sparse data:

function wilsonScore(success: number, failure: number, z = 1.96) {
  // z is the standard normal quantile, not the confidence level itself:
  // 1.96 corresponds to a 95% two-sided interval.
  const total = success + failure;
  if (total === 0) return 0;

  const p = success / total;

  const denominator = 1 + (z * z) / total;
  const center = p + (z * z) / (2 * total);
  const spread = z * Math.sqrt((p * (1 - p) + (z * z) / (4 * total)) / total);

  return Math.max(0, (center - spread) / denominator);
}

This prevents manipulation from sparse data and provides conservative scoring for new memories.
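
For reference, the quantity being computed is the lower bound of the Wilson score interval. In the notation below, which is mine and not the white paper's, n is the number of feedback events, p-hat is the observed success fraction, and z is the standard normal quantile for the chosen confidence level (about 1.96 for 95%). The second line works through a single like with no dislikes.

```latex
\[
\text{lower} =
\frac{\hat{p} + \dfrac{z^{2}}{2n}
      - z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^{2}}{4n^{2}}}}
     {1 + \dfrac{z^{2}}{n}},
\qquad n = \text{success} + \text{failure},
\quad \hat{p} = \frac{\text{success}}{n}
\]

\[
n = 1,\ \hat{p} = 1:\qquad
\text{lower} = \frac{1 + \frac{z^{2}}{2} - \frac{z^{2}}{2}}{1 + z^{2}}
             = \frac{1}{1 + z^{2}} \approx 0.207 \quad (z = 1.96)
\]
```

So a memory with one like and nothing else scores about 0.21 rather than 1.0, which is the conservative behavior described above.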

Developer Experience

Simple SDK

Install via npm:

npm i meridiandb-sdk

Three core methods: store, retrieve, recordFeedback.

Example usage:

import { MeridianDBClient } from "meridiandb-sdk";

const client = new MeridianDBClient({
  url: "https://api.meridiandb.com",
  accessToken: "your-token"
});

// Retrieve memories
const memories = await client.retrieveMemoriesSingleAgent({
  query: "user preferences"
});

// Store new memory
await client.storeMemory({
  agentId: "chatbot-v1",
  content: "User prefers dark mode",
  isFactual: true,
  context: "UI preferences"
});

// Record feedback
await client.recordFeedback({
  success: true,
  memories: ["memory-id-1", "memory-id-2"]
});

Admin Portal

Built with React and Vite, deployable to Cloudflare Pages. The operator UI provides observability, data management, and debugging tools.

Technical Stack

Cloudflare D1: Relational metadata & feature storage
Cloudflare Vectorize: Embedding storage & similarity search
Cloudflare KV: Session state, counters, cache
Cloudflare R2: Object storage for models, artifacts, backups
Cloudflare Workers: Edge-native compute
Cloudflare Queues: Event-driven processing (enterprise version)

For development/free tier, we provide cfw-poor-man-queue—a lightweight distributed queue implementation that lets you run MeridianDB on Cloudflare's free plan.

Performance & Scalability

<500ms retrieval latency including multi-dimensional filtering
Global edge deployment for low-latency access worldwide
SQL-based scoring for maximum scalability
Event-driven updates prevent write-on-read latency penalties
Horizontally scalable architecture

Limitations

Being transparent about trade-offs:

Eventual consistency: Reads may slightly lag behind writes
Manual context: Developers must supply contextual features (auto-generation coming)
Storage constraints: D1 has a 10GB limit per database
Platform coupling: Optimized for the Cloudflare ecosystem, but replacing D1 with SQLite, Workers with Node.js, Vectorize with ChromaDB, and Cloudflare Queues or PMQ with RabbitMQ or Kafka is entirely doable.
Learning curve: Multi-dimensional retrieval differs from traditional vector search

Getting Started

Clone the repository

   git clone https://github.com/ARAldhafeeri/MeridianDB
   cd MeridianDB
   npm install

Set up Cloudflare resources

   # Create vectorize index
   npx wrangler vectorize create meridiandb --dimensions=768 --metric=cosine

   # Create metadata index for agent isolation
   npx wrangler vectorize create-metadata-index meridiandb --property-name=agentId --type=string

Run migrations

   npm run server:migrations
   npm run server:migrate:local

Start development

   npm run dev

Initialize super admin

   Hit the /auth/init endpoint to set up admin access

Resources

Home Page
GitHub Repository: Source code
Documentation: Full API reference and guides
White Paper: Mathematical foundations and research
Postman Collections: API examples and testing

Why This Matters

Cloudflare offers Auto-RAG as a product. But if you want state-of-the-art RAG that actively learns from user behavior, adapts over time, and balances stability with plasticity—try MeridianDB.

The future of AI agents depends on memory systems that don't just store and retrieve, but actively curate knowledge based on utility, recency, and performance. MeridianDB makes this vision practical and deployable today.

Interested in using MeridianDB for your team? Book a meeting to discuss your use case.

Scientific Foundation

MeridianDB's approach is grounded in established research:

Ebbinghaus (1885): Forgetting curve and memory decay models
Wilson (1927): Confidence intervals for behavioral scoring
Mikolov et al. (2013): Word embeddings and semantic representations
Parisi et al. (2019): Continual learning in neural networks
Randazzo et al. (2022): Memory models for spaced repetition

By combining neuroscience-inspired principles with modern vector databases and edge computing, MeridianDB offers a mathematically grounded solution to one of AI's most challenging problems: building agents that learn continuously without forgetting what matters.