DEV Community: Tobi Lekan Adeosun

Your services already know why they broke. You just delete that knowledge at deploy time.

Tobi Lekan Adeosun — Fri, 17 Jul 2026 03:14:31 +0000

I want to start with a moment most of us have lived through.

It's 3 a.m. A dashboard is red. You're eight terminals deep in grep, trying to work out which service actually fell over and why. And the whole time there's this nagging feeling that you're doing archaeology on a system you wrote last month.

Here's what got under my skin about it. The answer was never actually lost. Back in the source code it said, in plain terms, that the payment service talks to Postgres through a connection pool. That this retry backs off three times. That this particular dependency is external and you must never, ever try to "just restart it." That was all right there at build time. Then we packaged everything up, deployed, threw that structure in the bin, and asked a sleep-deprived human to reconstruct it from log lines.

That gap bugged me enough that I spent a while building something around it. This post is about that.

Autoscaling is good at the wrong problem

We've gotten genuinely good at reacting to resource pressure. Traffic climbs, a box gets slow, CPU pins, and the autoscaler adds capacity or sheds load. No complaints there, it's kept things running for years.

The problem is it has no idea what your app is for. It can't tell a service that's slow because it's healthy and hammered from a service that's fast because it's quietly writing garbage to the database. It never had a model of the application in the first place.

So a whole category of failures just sails right past it. A connection pool getting drained by something downstream. A poison message kicking off a retry storm. A schema change that breaks one code path and leaves the other one looking perfectly fine. Infrastructure that only thinks in CPU and memory is blind to all of that.

The "throw an LLM at it" era

The going answer right now is to bolt a large language model onto your observability stack. Fire hose all the logs, traces, and metrics at a big central model and ask it what happened.

I get why. I also think it's the wrong shape, and I keep coming back to three things.

It's expensive in a way that doesn't survive a finance review. Frontier models bill per token, telemetry is enormous, and you're proposing to push all of it through the priciest thing in your stack.

It's centralized, so you've built yourself a single brain that now needs its own babysitting and happens to sit a long way from the failure it's supposed to understand.

And the part I actually find frustrating: it hallucinates architecture. We hand the model a heap of logs and ask it to infer who calls who, what a retry means here, what's safe to restart. That information wasn't fuzzy or missing. It was sitting in the code, exact and unambiguous, until we deleted it at deploy time and then paid a model to guess it back, usually worse.

Which made me stop asking "how do we read logs better" and start asking a different question: why can't the application just reason about itself?

What I ended up building

The thing I built, the Cognitive Autonomic Framework, does four things. I'll be honest that none of the individual pieces are new. The bet is on how they fit together.

The self-model comes out of the code. At build time a compiler walks the AST and pulls out what I call a Runtime Semantic Topology, or RST. It's a small, dense graph: who depends on who, failure domains, retry policies, latency budgets, and the closed list of repairs an agent is even permitted to try on each node. A whole microservice graph fits in a few kilobytes, so it just rides along inside the container. When something breaks, nothing has to guess the system's shape. It reads it.

A node in that graph is roughly this:

{
  "service": "payment",
  "depends_on": ["postgres", "stripe"],
  "failure_domain": "billing",
  "criticality": "high",
  "retry_policy": { "max_attempts": 3, "backoff": "exponential" },
  "external_dependencies": ["stripe"],
  "permitted_repairs": ["restart_pool", "shed_load", "open_circuit"]
}

The bit I care about is permitted_repairs, and the fact that it's a closed list. The agent is allowed to restart the pool. It is not allowed to restart stripe, because the model flatly says that's an external dependency. The safety lives in the structure, not in whether the model happened to be in a good mood.

Reasoning stays near the failure. Every service gets a small local agent, and they sit in three tiers:

Tier 0   reflex        circuit breakers, timeouts       ~ms      free
Tier 1   node-local    reads evidence + self-model      ~100ms   cheap
Tier 2   global        frontier model, off to the side  rare     expensive

Most incidents never leave Tier 0, where boring, dependable machinery handles them. Tier 1 reads the local evidence, checks the self-model, and tries a bounded fix. The big expensive model at Tier 2 only gets pulled in for the genuinely nasty stuff, and even then it's off the critical path, not sitting in the middle of it.

Nodes gossip beliefs, not heartbeats. Usually gossip is health and counters. Here nodes gossip hunches: "I think the billing pool is saturated, confidence 0.9, here's my evidence." If neighbors independently land on the same thing, confidence climbs and you've got a systemic fault. If nobody else sees it, it's isolated, and the node can heal it locally. Telling "my problem" apart from "everyone's problem," with no central coordinator making the call, turns out to be the useful trick.

Autonomy is fenced in on purpose. This is the part I won't wave away. An agent can never run an arbitrary command. It picks from the closed set the model declared for that node, and any repair it proposes has to clear verification in a shadow environment before it's allowed anywhere near production.

proposes  ->  in permitted_repairs?  ->  passes shadow check?  ->  ships to prod
                   | no                       | no
                   v                          v
                reject                     reject

The model suggests, the pipeline decides. I wanted safety to be a property of the plumbing, not a thing you hope the model gets right.

What I can actually stand behind

There's a running implementation, not a diagram. Six real services, each in its own container, agents gossiping beliefs over actual TCP, and a diagnostic reasoner that's deterministic out of the box but swaps out for a hosted model if you want one. It's public, and every number in the paper came off that running system. Nothing's simulated.

What it showed:

Giving the reasoner its self-model roughly doubled how often it found the real root cause.
That came from the structure, not the model size. gpt-4o-mini and gpt-4o scored the same.
The safety boundary threw out every repair it wasn't allowed to make, including the ones where the model confidently pointed at the wrong thing.
Per-node overhead stayed small.

And what I'm not claiming, because it's easy to oversell this stuff: I haven't shown it holds at fleet scale, across a wide spread of fault types, over a long deployment. The economics pitch, that most incidents get handled cheaply and locally, is still something I'm measuring, not a result I get to wave around yet. The limitations section in the paper is written to be read, not skimmed past.

Why it might matter past DevOps

The framing I actually care about isn't "AI for ops." It's software that carries an explicit model of itself and reasons over it. Reliability is just the first place that's obviously useful. The same fabric could just as easily carry adaptive routing, or intrusion detection, or a service that rewrites its own model as it changes.

The whole thing boils down to one stubborn idea: an application should carry and reason over a model of itself, instead of handing that job to some outside observer that has to rebuild the picture from scratch every single time something goes wrong.

If that lands for you, or if you can see exactly where it falls apart, I'd genuinely like to hear it.

Listen: [https://notebooklm.google.com/notebook/4779c275-9060-44bb-b0a9-bd2eb953592f?authuser=6]

Paper: zenodo.org/records/21352597 (DOI 10.5281/zenodo.21352597)

Code: github.com/tflux2011/caf-prototype (archived at DOI 10.5281/zenodo.21363579)

I'm Tobi Adeosun, an independent researcher in Texas. You can reach me at me@tadeosun.com.

Why Merging AI Models Fails (And How a 'Gossip Handshake' Fixed It)

Tobi Lekan Adeosun — Sat, 07 Mar 2026 06:54:38 +0000

The Problem: AI is Too Centralized
Right now, the "AI Arms Race" is happening in giant data centers. But what happens in a rural village in Africa, or a high-security office with no internet? These communities need to share knowledge between their local AI models without a central server.

I spent the last few months researching Decentralized Knowledge Sharing. The goal: Could two different AI "experts"—say, an Agronomy Expert and a Veterinary Expert, combine their brains into one?

The "Common Sense" Failure: Weight-Space Merging
The current trend in AI is called Weight-Space Merging (like TIES-Merging). It basically tries to "average" the math of two models to create a single super-model.

I tested this, and the results were catastrophic.

When I merged a model that knew how to fix tractors with a model that knew how to treat cattle, the resulting "merged" model scored below random chance. It didn't just forget; it got confused. It tried to apply tractor repair logic to sick cows.

I call this the Specialization Paradox: The smarter your individual AI models get, the harder they are to merge.

The Solution: The Gossip Handshake Protocol
Instead of trying to smash two brains together, I built the Gossip Handshake.

Instead of merging weights, we:

Gossip: Devices discover each other via Bluetooth (BLE) and swap tiny 50MB "LoRA adapters" (knowledge packets).

Handshake: The device stores these adapters in a local library.

Route: When you ask a question, a lightweight Semantic Router picks the right expert for the job.

The Results: 13x Better Performance
I ran this on Apple Silicon (M-series) using the Qwen2.5 model family (0.5B and 1.5B parameters).

Method	Configuration	Agronomy	Veterinary	Overall Score
Baseline	Standalone Expert	68.0%	92.0%	80.0%
Standard Merge	TIES-Merging (d=0.5)	20.0%	8.0%	14.0%
Our Approach	Gossip Handshake	64.0%	92.0%	78.0%

The gap is massive. By simply switching instead of merging, we achieved a 5.6x to 13x leap in performance.

Why This Matters for Digital Sovereignty
This isn't just about better scores; it's about Sovereignty.

Zero Internet: This protocol works in "Zero-G" zones.
Privacy: Your raw data never leaves your device. Only the "math" (the adapter) is shared.
Scalable: You can add 100 experts to a single phone, and it only takes milliseconds to switch between them.

Try it Yourself (Open Source)
I've open-sourced the entire pipeline. You can generate the synthetic data, train the adapters, and run the Gossip Protocol on your own laptop.

👉 GitHub Repository: https://github.com/tflux2011/gossip-handshake

Final Thoughts
We need to stop trying to force AI into a "one size fits all" box. The future of AI is Modular, Decentralized, and Local.

I’d love to hear from you: Have you tried merging LoRA adapters? What were your results? Let’s discuss in the comments!

Stop Trusting Your AI Agents: How to Build a "Constitutional Sentinel"

Tobi Lekan Adeosun — Sat, 21 Feb 2026 19:37:45 +0000

In my last post, I wrote about why "Always-Online" AI agents fail in the real world and how to build an offline-first architecture.

But solving the connectivity problem introduces a much scarier problem: Autonomous Risk. When an AI agent is operating offline or at the edge, it is making decisions without immediate human oversight. LLMs are notoriously "confident idiots", they will happily generate code that grants isAdmin=true to a guest user, or confidently drop a database table because it misunderstood a prompt.

If you are building Agentic workflows, you cannot just hook an LLM directly to your execution environment. You need a middleman.

In my Contextual Engineering framework, we call this the Constitutional Sentinel.

What is a Constitutional Sentinel?
A Sentinel is a deterministic safety layer (hardcoded logic) that wraps around your probabilistic AI agent. Before the agent is allowed to execute any tool_call or API request, the Sentinel intercepts the payload, evaluates it against a set of hard constraints (the "Constitution"), and decides whether to:

Allow the execution.

Block the execution and return an error to the agent to try again.

Escalate to a Human-in-the-Loop (HITL).

The Implementation (Python)
Here is a simplified look at how to implement a Sentinel pattern to catch dangerous agent actions before they execute.

class ConstitutionalSentinel:
    def __init__(self):
        # Hardcoded constraints the AI is NEVER allowed to break
        self.banned_actions = ["drop_table", "delete_user", "grant_admin"]
        self.max_spending_limit = 50.00

    def evaluate_action(self, agent_proposed_action, payload):
        """
        Intercepts the agent's decision BEFORE execution.
        """
        print(f"🔍 Sentinel Intercept: Evaluating '{agent_proposed_action}'...")

        # 1. Check for universally banned actions
        if agent_proposed_action in self.banned_actions:
            return self._block(f"Action '{agent_proposed_action}' violates core safety constitution.")

        # 2. Check context-specific constraints (e.g., financial limits)
        if agent_proposed_action == "issue_refund":
            amount = payload.get("amount", 0)
            if amount > self.max_spending_limit:
                return self._escalate_to_human(agent_proposed_action, amount)

        # 3. If it passes all checks, allow execution
        return self._allow()

    def _block(self, reason):
        print(f"❌ BLOCKED: {reason}")
        # Return context back to the LLM so it can correct its mistake
        return {"status": "blocked", "feedback": reason}

    def _escalate_to_human(self, action, context):
        print(f"⚠️ ESCALATED: Human approval required for {action} ({context})")
        return {"status": "pending_human_review"}

    def _allow(self):
        print("✅ ALLOWED: Action passed constitutional checks.")
        return {"status": "approved"}


# --- Example Usage in your Agent Loop ---
sentinel = ConstitutionalSentinel()

# The AI Agent decides it wants to grant admin access based on a user prompt
proposed_action = "grant_admin"
payload = {"user_id": "9942"}

# The Sentinel intercepts it
decision = sentinel.evaluate_action(proposed_action, payload)

if decision["status"] == "approved":
    execute_tool(proposed_action, payload)
else:
    print("Execution halted. Agent must rethink or wait for human.")

Why "Green Checkmarks" Are Dangerous
Without a Sentinel, your tests might pass because the AI successfully generated the correct JSON structure for the API call. But structurally correct doesn't mean logically safe.

The Sentinel shifts your architecture from "Assuming the AI is right" to "Assuming the AI is a liability." It forces the system to prove its safety deterministically.

The Full Blueprint
The Constitutional Sentinel is just one piece of the Contextual Engineering architecture.

If you want to see how this Sentinel integrates with the Sync-Later Queue and the Hybrid Router to build resilient, offline-first AI for low-resource environments, I’ve open-sourced the complete reference manuscript.

You can download the full PDF on Zenodo for free (recently crossed 200+ downloads by other builders!):
👉 https://zenodo.org/records/18005435

Let’s stop building agents that just "work," and start building agents we can actually trust.

Building Offline-First AI Agents: Why "Always-Online" Architectures Fail in the Real World

Tobi Lekan Adeosun — Wed, 28 Jan 2026 03:21:18 +0000

The "Happy Path" Problem
If you look at the documentation for most AI agent frameworks (LangChain, AutoGPT, CrewAI), they all share a dangerous assumption: Abundance Connectivity.

They assume your API calls to OpenAI will always succeed. They assume your websocket will never drop. They assume your user has stable 5G.

But I build software for Lagos, Nigeria. Here, power flickers, fiber cuts happen, and latency is a physical constraint, not an edge case. When I tried deploying standard agentic workflows here, they didn't just fail, they failed catastrophically. Users lost data, workflows hallucinated, and API credits were wasted on timeouts.

I call this the "Agentic Gap", the massive divide between how AI works in a demo video in San Francisco and how it works in a resource-constrained environment.

We Need "Contextual Engineering"I spent the last year re-architecting how we build these systems. I call the approach Contextual Engineering. It’s not about making models smarter; it’s about making the system around them resilient.

Here are two architectural patterns I built to fix this, which you can use in your own Python projects today.

Pattern 1: The "Sync-Later" Queue
Most agents use a synchronous User -> LLM -> Response loop. If the network dies in the middle, the context is lost.

Instead, we treat every user intent as a Transaction.

Serialize the Intent: When a user prompts the agent, we don't hit the API immediately. We serialize the request and store it in a local SQLite queue.

Cryptographic Signing: We sign the request to ensure integrity.

Opportunistic Sync: A background worker checks for connectivity (Ping/Heartbeat). Only when $N(t) = 1$ (network is available) do we flush the queue.

The Python Implementation:Instead of a direct requests.post, we use a local buffer. Here is the logic from the open-source framework:

import sqlite3
import uuid

def queue_action(user_input, intent_type):
    # 1. Create a transaction ID
    tx_id = str(uuid.uuid4())

    # 2. Store locally first (Offline-First)
    conn = sqlite3.connect('agent_state.db')
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO pending_actions (id, input, status) VALUES (?, ?, 'PENDING')",
        (tx_id, user_input)
    )
    conn.commit()

    # 3. Try to sync (if online)
    if check_connectivity():
        sync_manager.flush()
    else:
        print(f"Network down. Action {tx_id} queued for later.")

This ensures Zero Data Loss. The user can keep working, and the agent "catches up" when the internet comes back.

Pattern 2: The Hybrid Inference Router
Why route a simple "Hello" or "Summarize this text" to GPT-4? It’s slow, expensive, and requires a heavy internet connection.

I implemented a Router Logic Gate that inspects the prompt before it leaves the device.

Low Complexity? → Route to a local SLM (like Llama-3-8B or Phi-2) running on-device. (Cost: $0, Latency: Low).
High Complexity? → Route to the Cloud (GPT-4o).

The decision function looks like this:

# The Routing Logic
if network_is_down() or complexity < threshold:
    model = "Local Llama-3 (8B)" # Free, Fast, Offline
else:
    model = "GPT-4o"             # Smart, Costly, Online

This simple check saved us about 40-60% on API costs and made the application feel "instant" for basic tasks, even on 3G networks.

The "Contextual Engineering" Framework
These patterns aren't just hacks; they are part of a broader discipline I’m trying to formalize called Contextual Engineering. It’s about building AI that respects the Contextual Tuple (C = {I, K, R}): Infrastructure, Knowledge (Culture), and Regulation.

I’ve open-sourced the entire reference architecture. It includes the routing logic, the SQLite queue wrappers, and the "Constitutional Sentinel" for safety.

Where to find the code
I want to see more engineers building specifically for the Global South. You can find the full Python implementation here:

👉 Star the GitHub Repository

The Deep Dive (Free Book)
For those who want the math and the full architectural theory, I also wrote a 90-page reference manuscript titled "Contextual Engineering: Architectural Patterns for Resilient AI." It covers the full "Agentic Gap" theory and detailed diagrams.

📖 Download the PDF (Open Access)

Let me know in the comments: How do you handle network flakes in your LLM apps?