<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shakib S.</title>
    <description>The latest articles on DEV Community by Shakib S. (@workspacedex).</description>
    <link>https://dev.to/workspacedex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3773333%2F1a24000d-1620-4beb-bd9c-d71676ed32f8.jpg</url>
      <title>DEV Community: Shakib S.</title>
      <link>https://dev.to/workspacedex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/workspacedex"/>
    <language>en</language>
    <item>
      <title>The Hitchhiker's Guide to Running Agentic Systems Locally</title>
      <dc:creator>Shakib S.</dc:creator>
      <pubDate>Thu, 09 Apr 2026 18:07:09 +0000</pubDate>
      <link>https://dev.to/workspacedex/the-hitchhikers-guide-to-running-agentic-systems-locally-312p</link>
      <guid>https://dev.to/workspacedex/the-hitchhikers-guide-to-running-agentic-systems-locally-312p</guid>
      <description>&lt;p&gt;&lt;strong&gt;Engineering a Hybrid LLM Router for Production Agentic Systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every agentic system eventually confronts the same wall: intelligence costs latency, and latency destroys experience. The standard prescription — throw more compute at it — is lazy engineering disguised as ambition.&lt;/p&gt;

&lt;p&gt;After months of iterating on agentic workflows on an Arch Linux rig, I found a third path. Not faster models. Not cheaper models. A smarter layer that decides which model to use — and when.&lt;/p&gt;

&lt;p&gt;Small open-weight models are excellent for routine tasks: fast, private, and inexpensive to run. Their limitation appears when prompts require multi-step reasoning, structured output, or strict tool use.&lt;/p&gt;

&lt;p&gt;But small models hit what I call a &lt;strong&gt;reasoning ceiling&lt;/strong&gt;, a hard limit where multi-step logical deduction collapses into the &lt;strong&gt;confidence loop&lt;/strong&gt;: the model executes the wrong tool with complete conviction, or hallucinates a JSON schema that does not exist.&lt;/p&gt;

&lt;p&gt;Frontier APIs — DeepSeek V3.2, GPT-4o — solve the ceiling problem. But they introduce extra latency, especially on simple requests, and a monthly bill that resembles a lease payment more than a software expense.&lt;/p&gt;

&lt;p&gt;The more practical solution is not to depend on one model for everything, but to add a routing layer that chooses the right model for the task.&lt;/p&gt;

&lt;p&gt;What follows is a production engineering account of how to build one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Routing Layer: Theory vs. Reality
&lt;/h2&gt;

&lt;p&gt;The naive formulation is seductive: route 'trivial' tasks to the local 9B, 'complex' tasks to the cloud. The production reality is that complexity is not a binary flag — and the cost of misclassification flows in both directions.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Failure of Keyword-Based Routing
&lt;/h3&gt;

&lt;p&gt;My first iteration used a keyword router. Prompts containing 'analyze' or 'compare' were dispatched to the cloud. It produced two failure modes that made it unusable in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False positives:&lt;/strong&gt; A prompt like 'Compare 2+2 and 3+3' hit the cloud, wasting API credits and adding 2+ seconds of network round-trip for a task any model could answer instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False negatives:&lt;/strong&gt; A deceptively simple-looking prompt — 'Summarize this 10k-token log and find the one-line error' — was sent local. The 9B choked on the context window and hallucinated the result with full confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Confidence-Based Architecture
&lt;/h3&gt;

&lt;p&gt;The solution was to evaluate prompts across three independent signal vectors rather than classify them by keyword pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Constraint Density.&lt;/strong&gt; Does the prompt contain more than three strict, simultaneous constraints? ('JSON output,' 'Under 50 words,' 'Reference Page 4.') High constraint density is a reliable predictor of structured output failure in quantized models. Route to cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Pressure.&lt;/strong&gt; Is the input token count above 8k? Local 9B models begin to exhibit needle-in-a-haystack degradation in this range — they process the context window but lose positional accuracy on retrievals. Route to cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Scout Classifier.&lt;/strong&gt; A dedicated 1B model — lightweight enough to run in under 50ms — whose sole function is to categorize incoming prompts as Trivial, Standard, or Complex. It adds minimal overhead while dramatically improving routing accuracy.&lt;/p&gt;
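
&lt;p&gt;The first two signals are cheap enough to compute without any model at all. A minimal sketch — the marker list and the 4-characters-per-token estimate are illustrative values, not production-tuned ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Illustrative markers for strict output constraints.
CONSTRAINT_MARKERS = [
    r"\bjson\b", r"\bunder \d+ words\b", r"\bexactly\b",
    r"\bpage \d+\b", r"\bschema\b", r"\bbullet points\b",
]

def constraint_density(prompt: str) -> int:
    # Count how many strict, simultaneous constraints the prompt names.
    text = prompt.lower()
    return sum(1 for pattern in CONSTRAINT_MARKERS if re.search(pattern, text))

def context_pressure(prompt: str, limit_tokens: int = 8000) -> bool:
    # Rough estimate: about 4 characters per token for English text.
    return len(prompt) / 4 > limit_tokens

def needs_cloud(prompt: str) -> bool:
    return constraint_density(prompt) > 3 or context_pressure(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;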

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xe81fokkc8xaatd4kbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xe81fokkc8xaatd4kbj.png" alt=" " width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Unit Economics of Agentic Compute
&lt;/h2&gt;

&lt;p&gt;The standard metric — monthly API spend — is the wrong unit of measurement for agentic systems. It optimizes for the wrong variable and obscures the actual cost structure of the work.&lt;/p&gt;

&lt;p&gt;The correct metric is &lt;strong&gt;Cost per Successful Task (CPST)&lt;/strong&gt;. If a local model is 'free' but fails 30% of the time, requiring manual human correction, the cost is not zero — it is your time, which is the most expensive resource in the system. If a cloud model charges $0.05 and succeeds 100% of the time, it is, by any rational accounting, cheaper.&lt;/p&gt;

&lt;p&gt;Free models are not free. They externalize cost onto the operator's attention.&lt;/p&gt;
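
&lt;p&gt;The arithmetic is worth making explicit. A sketch, using the 30% failure rate from above; the ten-minute correction time and $60/hour rate are illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cost_per_successful_task(api_cost: float, success_rate: float,
                             correction_minutes: float = 10.0,
                             hourly_rate: float = 60.0) -> float:
    # Failed runs cost operator time; fold that into the per-task price.
    # correction_minutes and hourly_rate are illustrative assumptions.
    failure_rate = 1.0 - success_rate
    correction_cost = failure_rate * (correction_minutes / 60.0) * hourly_rate
    return api_cost + correction_cost

local_cpst = cost_per_successful_task(0.0, 0.70)   # 'free' model: about 3.00
cloud_cpst = cost_per_successful_task(0.05, 1.00)  # paid model: 0.05
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under those assumptions, the 'free' model costs sixty times more per successful task than the $0.05 cloud call.&lt;/p&gt;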

&lt;h3&gt;
  
  
  The Tradeoffs in the Quantization Curve
&lt;/h3&gt;

&lt;p&gt;A week of benchmarking q4_K_M vs. q8_0 (GGUF) produced a finding that materially changes how the hybrid system should be designed:&lt;/p&gt;

&lt;p&gt;For most routine tasks, q4_K_M performs close to full precision, but structured tool-calling is where reliability begins to degrade.&lt;/p&gt;

&lt;p&gt;For structured tool-calling, q4 quantization introduces an intermittent &lt;strong&gt;bracket-drop failure&lt;/strong&gt;: occasionally missing a closing brace in a generated JSON schema. This single failure mode propagates up the agent loop and crashes the entire execution chain.&lt;/p&gt;

&lt;p&gt;The engineering resolution is straightforward: use q4_K_M for scout classification and general conversational tasks; reserve a dedicated q8_0 inference slice for the tool-calling engine specifically. The additional memory overhead is modest. The reliability gain is non-trivial.&lt;/p&gt;
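
&lt;p&gt;In practice this is a small routing table. A sketch (the model tags are placeholders for whatever you actually serve locally):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Task classes mapped to quantization slices; the tags are placeholders.
MODEL_SLICES = {
    "scout":     "scout-1b:q4_K_M",     # classification only
    "chat":      "general-9b:q4_K_M",   # routine conversational work
    "tool_call": "tools-9b:q8_0",       # structured output gets full q8
}

def pick_model(task_class: str) -> str:
    # Unknown classes fall back to the conversational slice.
    return MODEL_SLICES.get(task_class, MODEL_SLICES["chat"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;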

&lt;h3&gt;
  
  
  ROI Breakdown: Daily Heavy Usage
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F122tuw3rgqp8nazx7eek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F122tuw3rgqp8nazx7eek.png" alt=" " width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The daily operating cost of this architecture — under heavy professional use — is approximately &lt;strong&gt;$0.17&lt;/strong&gt;. For comparison: a single GPT-4o API session on a complex document task can easily exceed that figure on its own.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation: Beyond the Toy Script
&lt;/h2&gt;

&lt;p&gt;A production router cannot be a collection of if-statements around &lt;code&gt;os.popen&lt;/code&gt; calls. The core requirements are: asynchronous evaluation, type-safe output validation, and resilient fallback semantics.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Async + Pydantic Stack
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F307okcfexossp4mavebl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F307okcfexossp4mavebl.png" alt=" " width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The current implementation uses &lt;code&gt;asyncio&lt;/code&gt; for parallel prompt evaluation — the scout classification and context pressure check run concurrently, not sequentially. Pydantic enforces the routing schema: if the local model produces an invalid tool call, the resulting &lt;code&gt;ValidationError&lt;/code&gt; is caught at the boundary, and execution silently fails over to the cloud model. The user sees a valid response. The failure is logged for analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Production routing logic - async, type-safe, resilient
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;LLMResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;## Parallel evaluation: scout + context pressure
&lt;/span&gt;    &lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pressure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;scout_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;check_context_pressure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;classification&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;pressure&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cloud_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;local_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ResponseSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Pydantic guard
&lt;/span&gt;    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LocalInferenceError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;## Graceful degradation - invisible to the caller
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cloud_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ValidationError&lt;/code&gt; catch is the critical architectural seam. Without it, a single malformed tool call from the local model becomes a process-level exception that kills the agent loop.&lt;/p&gt;

&lt;p&gt;With it, the system degrades gracefully to the cloud path without user-visible impact, while preserving the performance characteristics for the 90%+ of tasks that the local model handles correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability: What to Instrument
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Route decision distribution per session (local% vs. cloud%)&lt;/li&gt;
&lt;li&gt;Local validation failure rate — a rising trend signals model drift or prompt distribution shift&lt;/li&gt;
&lt;li&gt;End-to-end CPST across task categories — the ground truth metric for system health&lt;/li&gt;
&lt;li&gt;Scout classifier latency — should remain under 80ms or the overhead defeats its purpose&lt;/li&gt;
&lt;/ul&gt;
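
&lt;p&gt;None of this needs heavy tooling to start. A minimal in-process sketch (names are hypothetical; swap in your actual metrics backend):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter

class RouterMetrics:
    # Minimal in-process counters for the signals listed above.
    def __init__(self) -> None:
        self.routes = Counter()          # "local" / "cloud"
        self.validation_failures = 0
        self.scout_latencies_ms = []

    def record(self, route: str, scout_ms: float, validation_ok: bool = True) -> None:
        self.routes[route] += 1
        self.scout_latencies_ms.append(scout_ms)
        if not validation_ok:
            self.validation_failures += 1

    def local_share(self) -> float:
        total = sum(self.routes.values())
        return self.routes["local"] / total if total else 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;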




&lt;h2&gt;
  
  
  Computational Sovereignty: The Strategic Dimension
&lt;/h2&gt;

&lt;p&gt;This architecture is more than a cost optimization exercise. It is a hedge against &lt;strong&gt;dependency risk&lt;/strong&gt; — a class of risk that most engineering teams do not model until it becomes acute.&lt;/p&gt;

&lt;p&gt;Depending entirely on a cloud provider for inference capability introduces a category of operational risk that conventional SLAs do not address. API pricing is not fixed. Rate limits shift. Providers modify moderation behavior. A workflow that runs cleanly today may be rejected tomorrow due to policy changes applied server-side, without notice, to weights you do not own.&lt;/p&gt;

&lt;p&gt;By maintaining a local 9B baseline, you own the weights. You retain a survival-minimum of intelligence that operates offline, during outages, and independent of any provider's policy decisions.&lt;/p&gt;

&lt;p&gt;The hybrid architecture is not a concession to the limits of local models. It is a deliberate design choice that treats cloud inference as a performance upgrade — powerful, but optional — rather than a fundamental dependency. The baseline capability is yours. The ceiling is rented.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: The Personal Intelligence Stack
&lt;/h2&gt;

&lt;p&gt;The pattern described here — scout, route, validate, fallback — is not novel as an abstract architecture. Routing layers exist across distributed systems engineering. What is new is applying it to the inference layer of agentic AI, at the level of the individual practitioner.&lt;/p&gt;

&lt;p&gt;Today, this looks like a sophisticated personal optimization. A few years from now, it will look like table stakes. The practitioners who ship reliable agentic systems are not waiting for a single model to solve the latency-intelligence paradox. They are building the routing layer that makes the paradox irrelevant.&lt;/p&gt;

&lt;p&gt;People won't talk about using AI tools. They'll talk about running &lt;strong&gt;personal intelligence stacks&lt;/strong&gt;. This is what that looks like — early.&lt;/p&gt;

&lt;p&gt;The weights are already cheap. The inference hardware is already accessible. The only remaining variable is the engineering discipline to compose them correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the router.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>local</category>
      <category>llm</category>
      <category>architecture</category>
      <category>cloud</category>
    </item>
    <item>
      <title>I Tried to Build a Local Claude-Style Assistant</title>
      <dc:creator>Shakib S.</dc:creator>
      <pubDate>Mon, 09 Mar 2026 11:35:53 +0000</pubDate>
      <link>https://dev.to/workspacedex/i-tried-to-build-a-local-claude-style-assistant-gjf</link>
      <guid>https://dev.to/workspacedex/i-tried-to-build-a-local-claude-style-assistant-gjf</guid>
      <description>&lt;p&gt;I didn't want a demo. I wanted a real assistant.&lt;/p&gt;

&lt;p&gt;The plan was simple: use &lt;a href="https://openclaw.ai/" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; with Qwen's latest 3.5 models.&lt;/p&gt;

&lt;p&gt;Not another local LLM that could summarize text and write boilerplate. I wanted something with memory — something that could keep a user profile, retrieve my notes, use tools, live in Telegram, and actually feel persistent. The kind of thing you close your laptop and trust is still &lt;em&gt;there&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So I did what the local AI community makes look achievable: I tried to build it myself.&lt;/p&gt;

&lt;p&gt;My machine: Arch Linux, an RTX 3050-class GPU with 6 GB VRAM, ~12 GB of system RAM. Enough to run small models. Enough to experiment. Enough, I thought, to build something real.&lt;/p&gt;

&lt;p&gt;What I got instead was a sharp education in the gap between "running a model locally" and "running an agent framework locally." They are not the same workload, and conflating them is the most common mistake in this space.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack That Made Sense on Paper
&lt;/h2&gt;

&lt;p&gt;The tool I wanted to build on was &lt;strong&gt;OpenClaw&lt;/strong&gt; — an open-source agent framework that layers tools, memory, sessions, multi-channel support, and structured workflows on top of a language model backend. It promised to be the missing piece between "I can run a model" and "I have an assistant."&lt;/p&gt;

&lt;p&gt;The plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; to serve models locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen&lt;/strong&gt; as the model (small, recent, reportedly strong for its size)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw&lt;/strong&gt; as the agent layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On paper, this is a reasonable stack. In practice, it surfaces every assumption that local AI discourse glosses over.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mistake #1: Assuming the Model Was the Whole Problem
&lt;/h2&gt;

&lt;p&gt;I started with &lt;code&gt;qwen3.5:4b&lt;/code&gt; through Ollama.&lt;/p&gt;

&lt;p&gt;It ran. That wasn't the issue.&lt;/p&gt;

&lt;p&gt;The issue was behavioral: the model kept slipping into visible chain-of-thought output. Every response came with narrated reasoning. Not broken, but wrong — asking for a daily assistant and getting a model that wanted to think out loud at every turn is like hiring a receptionist who reads their internal monologue aloud before answering the phone.&lt;/p&gt;

&lt;p&gt;So I did what everyone does: I tried to prompt my way out of it. Stricter system prompts. Custom Modelfiles. Context limits. Persona instructions. Explicit &lt;code&gt;/no_think&lt;/code&gt; directives.&lt;/p&gt;

&lt;p&gt;That helped. It didn't solve the core mismatch.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;qwen3.5:4b&lt;/code&gt; model is from the &lt;em&gt;thinking&lt;/em&gt; branch of the Qwen family — it's optimized for reasoning tasks, not conversational fluency. The model family matters. Using a reasoning model for a chat assistant is a category error, not a configuration problem.&lt;/p&gt;

&lt;p&gt;The fix was straightforward once I admitted it: switch to an instruct model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen3:4b-instruct-2507-q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;q4_K_M&lt;/code&gt; quantization is a solid default for 4B models on 6 GB VRAM — it cuts memory footprint meaningfully without destroying output quality. With a pure instruct model, the behavioral issues cleared up immediately.&lt;/p&gt;

&lt;p&gt;But the harder problem was still waiting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mistake #2: Thinking "Model Works" Means "Stack Works"
&lt;/h2&gt;

&lt;p&gt;Here's what a plain local chat app sends to a model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[system prompt]
[recent conversation turns]
[user message]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. A few hundred tokens, maybe a few thousand if you keep a long history. Totally manageable.&lt;/p&gt;

&lt;p&gt;Here's what an agent framework sends to a model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[system prompt]
[tool schemas — every tool the agent can call]
[session state and memory]
[workspace context]
[bootstrap instructions]
[structured output expectations]
[past actions and results]
[conversation history]
[user message]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every one of those elements costs tokens. And on a local setup, tokens aren't a billing abstraction — they're a &lt;strong&gt;memory and latency problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is the thing local AI content almost never explains clearly: the model is not the product. The &lt;em&gt;framework around the model&lt;/em&gt; is the product. And that framework has weight.&lt;/p&gt;
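
&lt;p&gt;You can put rough numbers on that weight. The per-component budgets below are illustrative, not measured from OpenClaw, but the shape of the problem is accurate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative token budgets for each prompt component an agent
# framework injects; real numbers vary by framework and configuration.
SCAFFOLDING_TOKENS = {
    "system_prompt": 800,
    "tool_schemas": 3500,
    "session_memory": 2000,
    "workspace_context": 1500,
    "bootstrap_instructions": 1200,
    "past_actions": 2500,
}

def remaining_budget(num_ctx: int, history_tokens: int) -> int:
    overhead = sum(SCAFFOLDING_TOKENS.values())
    return num_ctx - overhead - history_tokens

# At a 16K window, scaffolding alone eats most of the budget:
print(remaining_budget(16384, history_tokens=2000))  # 2884
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;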

&lt;p&gt;OpenClaw made this visible very quickly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Context Window Trap: 4K vs 16K vs 262K
&lt;/h2&gt;

&lt;p&gt;The first error I hit was blunt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: context window too small. Minimum required: 16000 tokens. Current: 4096.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fine. I increased the context window.&lt;/p&gt;

&lt;p&gt;Then Ollama started allocating memory as if the context was 262,144 tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: model requires ~38.9 GiB of system memory. Available: ~12.5 GiB.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's not a typo. 38.9 GB for a 4B model, because of context window size.&lt;/p&gt;

&lt;p&gt;Here's why: the KV cache (the memory structure that stores the keys and values computed for every token in context) scales linearly with the context window you allocate, and it is reserved up front. For a 4B model at &lt;code&gt;q4_K_M&lt;/code&gt; quantization, the model weights themselves are around 2.5 GB. But a 262K-token KV cache can dwarf that by an order of magnitude.&lt;/p&gt;

&lt;p&gt;The math, roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KV cache size ≈ 2 × layers × heads × head_dim × context_length × bytes_per_element

For Qwen 4B:
≈ 2 × 32 × 8 × 128 × 262144 × 2 bytes
≈ ~34 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see how a "reasonable" context ceiling becomes a hardware wall fast.&lt;/p&gt;
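
&lt;p&gt;The same estimate as a function, so you can plug in your own model's geometry. The layer and head counts are the rough figures used above, not official specs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_length: int, bytes_per_element: int = 2) -> int:
    # The leading 2 accounts for the separate K and V tensors per layer.
    return 2 * layers * kv_heads * head_dim * context_length * bytes_per_element

# The rough Qwen-4B estimate above, at a 262,144-token window:
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_length=262144)
print(f"{size / 1024**3:.1f} GiB")  # prints "32.0 GiB" (about 34 GB decimal)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;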

&lt;p&gt;The trap is that the minimum functional context OpenClaw needs (16K) is already well above what fits comfortably in a 6 GB VRAM setup if you want any headroom for inference. And 16K is the &lt;em&gt;floor&lt;/em&gt;, not the sweet spot.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Configuration Spiral
&lt;/h2&gt;

&lt;p&gt;What followed was a long sequence of plausible-looking fixes that went nowhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;num_ctx: 8192&lt;/code&gt; in Ollama → OpenClaw complained the session minimum wasn't met&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;num_ctx: 16384&lt;/code&gt; → worked, but sessions kept inflating back to 262K in the state view&lt;/li&gt;
&lt;li&gt;Manually edited OpenClaw config files → watched changes revert&lt;/li&gt;
&lt;li&gt;Checked whether I was editing the wrong state directory → yes, sometimes&lt;/li&gt;
&lt;li&gt;Created local model aliases with explicit context caps → partial success&lt;/li&gt;
&lt;li&gt;Hard-capped context in every config location I could find → Ollama respected it; OpenClaw didn't always agree with the result&lt;/li&gt;
&lt;/ul&gt;
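
&lt;p&gt;One approach that did hold up was passing the cap explicitly per request. Ollama's &lt;code&gt;/api/generate&lt;/code&gt; endpoint accepts an &lt;code&gt;options&lt;/code&gt; object, so the context size travels with the call instead of living in a config file something else can overwrite. A sketch that only builds the request body:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Request body for Ollama's /api/generate endpoint with an explicit
# context cap, so the server does not fall back to a model default.
payload = {
    "model": "qwen3:4b-instruct-2507-q4_K_M",
    "prompt": "Summarize the last three log lines.",
    "options": {"num_ctx": 16384},
    "stream": False,
}
body = json.dumps(payload)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Modelfile route (&lt;code&gt;PARAMETER num_ctx 16384&lt;/code&gt; baked into a local alias) is the server-side equivalent.&lt;/p&gt;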

&lt;p&gt;The pattern is seductive. Every partial success feels like momentum. The model loaded? Great. The context lowered? Okay. The agent accepted the config? Almost there.&lt;/p&gt;

&lt;p&gt;Then the next hidden assumption surfaces.&lt;/p&gt;

&lt;p&gt;The system wasn't randomly broken. It was &lt;em&gt;consistently&lt;/em&gt; surfacing the same underlying incompatibility: the framework assumed a context budget that my hardware couldn't provide. Every workaround was borrowing against that fundamental gap, not closing it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Insight That Reframed Everything
&lt;/h2&gt;

&lt;p&gt;At some point I stopped treating this as a configuration bug and asked a different question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What kind of system is OpenClaw actually designed for?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "can I make it work with a 4B model on 6 GB VRAM?" but "what does the framework assume about its environment?"&lt;/p&gt;

&lt;p&gt;The answer, once you look honestly at the architecture, is clear:&lt;/p&gt;

&lt;p&gt;OpenClaw is built for models with &lt;strong&gt;real context headroom&lt;/strong&gt; — 32K, 64K, 128K tokens where the agent scaffolding is a small fraction of available budget rather than the entire budget.&lt;/p&gt;

&lt;p&gt;It's built for models with &lt;strong&gt;low-latency inference&lt;/strong&gt; — where tool call round-trips and multi-step reasoning don't become multi-minute waits.&lt;/p&gt;

&lt;p&gt;It's built for &lt;strong&gt;the API tier&lt;/strong&gt;, not the consumer GPU tier.&lt;/p&gt;

&lt;p&gt;That's not a criticism. It's a design reality. The framework does things that genuinely require those resources: persistent memory, multi-turn tool use, session-aware behavior, complex orchestration. That stuff is the whole value proposition. And it has a minimum viable substrate.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed When I Switched to a Cloud Backend
&lt;/h2&gt;

&lt;p&gt;I pointed OpenClaw at Kimi K2.5 via a cloud API and the experience shifted immediately.&lt;/p&gt;

&lt;p&gt;Not magically. The framework still has quirks. But the fundamental friction — the constant negotiation over whether the infrastructure could physically support the next operation — disappeared.&lt;/p&gt;

&lt;p&gt;Messages went through cleanly. Context stopped being the entire conversation. The tool layer worked the way the documentation described. I could actually evaluate the product rather than fighting the substrate.&lt;/p&gt;

&lt;p&gt;The comparison is useful: the same framework, the same prompts, the same configuration. The only variable was whether the model backend could absorb the overhead without drowning.&lt;/p&gt;

&lt;p&gt;Local: every interaction was a resource negotiation.&lt;br&gt;&lt;br&gt;
Cloud: the framework did what it was supposed to do.&lt;/p&gt;


&lt;h2&gt;
  
  
  What This Actually Means for Local AI
&lt;/h2&gt;

&lt;p&gt;I want to be precise here, because "just use the cloud" is a lazy conclusion and I don't believe it.&lt;/p&gt;

&lt;p&gt;Small local models are genuinely good at a real set of tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plain conversational chat&lt;/strong&gt; — instruct models at 4B–8B are solid here&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focused code help&lt;/strong&gt; — constrained tasks where context window is not the bottleneck&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document drafting&lt;/strong&gt; — one-shot or few-shot generation over small inputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local RAG&lt;/strong&gt; — retrieval over a small, well-scoped document set&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-sensitive workflows&lt;/strong&gt; — anything that shouldn't leave your machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where local models struggle is not a model quality problem. It's a &lt;strong&gt;systems design problem&lt;/strong&gt;: if your tool layer, memory layer, session model, and orchestration all assume a generous context budget, a small local setup will spend most of its energy surviving the framework rather than doing useful work.&lt;/p&gt;

&lt;p&gt;The honest reframe is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Running a 4B model in a simple chat interface and running that same 4B model inside a full agent framework are not the same workload. One fits on your GPU. The other assumes datacenter-class headroom.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Treating them as equivalent is why so many local AI projects stall out in configuration hell rather than producing something useful.&lt;/p&gt;


&lt;h2&gt;
  
  
  Practical Recommendations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If your goal is a local everyday assistant:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use a small instruct model in a lean interface. Ollama + Open WebUI or a minimal Python frontend. Keep the context requirement below 8K. Avoid frameworks that inject large amounts of scaffolding unless you've measured the overhead. A &lt;code&gt;q4_K_M&lt;/code&gt; quantized 4B–8B instruct model in a simple chat loop is genuinely useful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull llama3.2:3b-instruct-q4_K_M  &lt;span class="c"&gt;# ~2GB, fast, good for chat&lt;/span&gt;
ollama pull qwen3:8b-instruct-q4_K_M     &lt;span class="c"&gt;# ~5GB, better reasoning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
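
&lt;p&gt;A lean loop needs surprisingly little code. A sketch against Ollama's &lt;code&gt;/api/chat&lt;/code&gt; endpoint, with a crude history trimmer to keep the context bounded; the 4-characters-per-token estimate and the budget are assumptions to tune:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/chat"
TOKEN_BUDGET = 6000  # stay well under an 8K num_ctx

def approx_tokens(messages: list) -> int:
    # Crude estimate: about 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def trim_history(messages: list, budget: int = TOKEN_BUDGET) -> list:
    # Keep the system prompt, drop the oldest turns until we fit.
    system, turns = messages[:1], messages[1:]
    while turns and approx_tokens(system + turns) > budget:
        turns = turns[2:]  # drop one user/assistant pair
    return system + turns

def chat(messages: list) -> str:
    body = json.dumps({"model": "qwen3:8b-instruct-q4_K_M",
                       "messages": trim_history(messages),
                       "stream": False}).encode()
    req = request.Request(OLLAMA_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;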



&lt;p&gt;&lt;strong&gt;If your goal is to actually experience OpenClaw:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with a cloud-backed model. Let the framework do what it was designed to do before you optimize for local deployment. You'll learn what the product actually is rather than spending all your time fighting the substrate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're committed to local + agent features:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You need either a machine with 24+ GB VRAM (RTX 4090, A-series workstation GPUs), or you need to be very intentional about which agent features you enable and what their context cost is. Profile the token overhead of each feature before enabling it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Question I Should Have Asked First
&lt;/h2&gt;

&lt;p&gt;I went into this project asking: &lt;em&gt;"Which model should I run?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's the wrong question. The right question is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What kind of system am I trying to run, and what does that system actually require?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Plain chat and agent frameworks are different categories with different resource profiles. A model that works beautifully in a simple interface can fail badly inside a framework that assumes 10x the context budget.&lt;/p&gt;

&lt;p&gt;Understanding that distinction early would have saved me a lot of configuration spirals. It's also just a more accurate mental model for thinking about local AI in general — not as "models you run" but as "systems with compute requirements," where the framework overhead is often larger than the model itself.&lt;/p&gt;

&lt;p&gt;That's the lesson that actually transfers.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you run into context window or memory walls with local agent frameworks? What workarounds have actually held up? I'd like to hear what's working.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>Why Local AI Agents Fail Silently — and What to Measure Before You Ship</title>
      <dc:creator>Shakib S.</dc:creator>
      <pubDate>Sun, 08 Mar 2026 18:10:44 +0000</pubDate>
      <link>https://dev.to/workspacedex/small-llms-arent-dumb-theyre-just-missing-tools-2fnh</link>
      <guid>https://dev.to/workspacedex/small-llms-arent-dumb-theyre-just-missing-tools-2fnh</guid>
      <description>&lt;p&gt;If you spend enough time around AI engineering, you eventually run into the same frustration.&lt;/p&gt;

&lt;p&gt;You use a cloud model like ChatGPT or Claude, and it feels impressively capable. It can reason through multi-step tasks, fetch up-to-date information, write code, and respond with the kind of fluency that makes it feel far more useful than a simple text generator.&lt;/p&gt;

&lt;p&gt;Then you run a local model on your own machine — Llama, Mistral, Qwen, or another open model — and the experience feels much more limited.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It cannot answer questions about current events.&lt;/li&gt;
&lt;li&gt;It struggles with tasks that require live information.&lt;/li&gt;
&lt;li&gt;It often feels weaker than the cloud systems you are used to.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The immediate reaction is: &lt;em&gt;"Open-source models just aren't as good."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But that explanation is incomplete.&lt;/p&gt;

&lt;p&gt;The real difference between a cloud AI product and a local model is rarely just model quality. More often, &lt;strong&gt;the gap comes from the surrounding infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud AI systems are rarely "just a model." They are packaged with orchestration layers, tool calling, retrieval systems, search, memory, and routing logic that make the model feel more capable than it would on its own.&lt;/p&gt;

&lt;p&gt;When you run a local model, you are usually interacting with the raw foundation model directly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To make a local LLM genuinely useful, you often do not need a bigger model first. You need to give it access to tools. You need to turn it into an &lt;strong&gt;agent&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this tutorial, we will build a simple local AI agent using Python and Ollama. By the end, your local model will be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reason about a task&lt;/li&gt;
&lt;li&gt;decide when it needs external information&lt;/li&gt;
&lt;li&gt;call a web search tool&lt;/li&gt;
&lt;li&gt;return answers that are more useful and up to date&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What We Will Build
&lt;/h2&gt;

&lt;p&gt;By the end of this tutorial, you will have a local AI agent that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;run a local LLM with Ollama&lt;/li&gt;
&lt;li&gt;decide when it needs external data&lt;/li&gt;
&lt;li&gt;call a web search tool automatically&lt;/li&gt;
&lt;li&gt;integrate tool results into its reasoning loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User
  │
  ▼
Local LLM (Ollama)
  │
  ▼
Agent Loop (ReAct)
  │
  ▼
Tool Router
  │
  └── Web Search Tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of treating the model as an all-knowing oracle, we treat it as the &lt;strong&gt;reasoning engine&lt;/strong&gt; inside a larger application.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Local LLMs Feel Weak
&lt;/h2&gt;

&lt;p&gt;Before writing any code, it's important to understand why local models feel weaker out of the box.&lt;/p&gt;

&lt;p&gt;The issue isn't necessarily the model. &lt;strong&gt;The issue is the missing infrastructure around the model.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Missing Orchestration
&lt;/h3&gt;

&lt;p&gt;When you chat with systems like ChatGPT, your message is not simply passed to a model. Behind the scenes, an orchestration layer decides things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether the model should search the web&lt;/li&gt;
&lt;li&gt;whether it should execute Python&lt;/li&gt;
&lt;li&gt;whether additional context should be retrieved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your local LLM does none of this. It simply predicts the next token in a sequence. Without orchestration, the model is forced to guess information it cannot access.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Lack of External Tools
&lt;/h3&gt;

&lt;p&gt;LLMs are excellent reasoning engines, but terrible databases. If you ask a model for today's weather, the correct answer requires live data.&lt;/p&gt;

&lt;p&gt;Humans solve this by using tools: calculators, web browsers, APIs. LLMs can do the same — but only if you give them access to those tools. Without tools, the model is effectively trapped inside its training data.&lt;/p&gt;
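<p>To make "tool" concrete before we build the search tool: a tool is just an ordinary function with a clear contract. Here is a minimal, hypothetical <code>calculate()</code> tool (our own illustration, not part of any library) that evaluates arithmetic safely instead of reaching for <code>eval()</code>:</p>

```python
import ast
import operator as op

# Whitelisted operators; anything else in the expression is rejected
_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
        ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg}

def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression, e.g. '2 * (3 + 4)'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return str(_eval(ast.parse(expression, mode="eval").body))

print(calculate("2 * (3 + 4)"))  # 14
```

<p>The model never executes anything itself. It only emits a request like <code>calculate("2 * (3 + 4)")</code>; your code runs the function and feeds the result back.</p>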

&lt;h3&gt;
  
  
  3. Missing Middleware and State
&lt;/h3&gt;

&lt;p&gt;Production AI systems include middleware that handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context management&lt;/li&gt;
&lt;li&gt;memory summarization&lt;/li&gt;
&lt;li&gt;structured tool outputs&lt;/li&gt;
&lt;li&gt;retry logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these systems, a local model can quickly lose context or fail at multi-step tasks.&lt;/p&gt;

&lt;p&gt;This leads to an important shift in perspective:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;❌ Old thinking&lt;/th&gt;
&lt;th&gt;✅ New thinking&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"The model should know everything."&lt;/td&gt;
&lt;td&gt;"The model should decide which tools to use."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Your LLM becomes the &lt;strong&gt;CPU&lt;/strong&gt; of an AI application.&lt;/p&gt;




&lt;h2&gt;
  
  
  Agent Architecture
&lt;/h2&gt;

&lt;p&gt;To enable tool usage, we need a simple agent architecture. One of the most widely used patterns is called &lt;strong&gt;ReAct (Reasoning + Acting)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of generating a single response, the model runs inside a reasoning loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user sends a query&lt;/li&gt;
&lt;li&gt;The model reasons about the problem&lt;/li&gt;
&lt;li&gt;The model decides if it needs a tool&lt;/li&gt;
&lt;li&gt;The tool executes&lt;/li&gt;
&lt;li&gt;The result is returned to the model&lt;/li&gt;
&lt;li&gt;The model produces the final answer&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Components of the System
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Brain (LLM)&lt;/strong&gt;&lt;br&gt;
The local model running in Ollama. It reads the query and decides what to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Tool Library&lt;/strong&gt;&lt;br&gt;
A collection of Python functions such as &lt;code&gt;search_web()&lt;/code&gt;, &lt;code&gt;read_file()&lt;/code&gt;, &lt;code&gt;calculate()&lt;/code&gt;. Each tool exposes a clear schema so the model knows how to call it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Router&lt;/strong&gt;&lt;br&gt;
The router connects the LLM to the tools. If the LLM requests a tool call, the router identifies the tool, executes the Python function, and returns the result to the model.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1 — Install Ollama
&lt;/h2&gt;

&lt;p&gt;Ollama makes it easy to run local models with an API interface similar to OpenAI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Download Ollama&lt;/strong&gt; from &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;https://ollama.com&lt;/a&gt; and install it for your OS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pull a tool-capable model:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull llama3.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 Instruct-tuned models generally perform better with tool calling.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Start the model:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run llama3.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pull step downloads approximately 4–5 GB of model weights; &lt;code&gt;ollama run&lt;/code&gt; then opens an interactive session. Exit with &lt;code&gt;/bye&lt;/code&gt; or &lt;code&gt;Ctrl+D&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set up your Python environment:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv agent_env
&lt;span class="nb"&gt;source &lt;/span&gt;agent_env/bin/activate
pip &lt;span class="nb"&gt;install &lt;/span&gt;ollama duckduckgo-search
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ollama&lt;/code&gt; for model interaction&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;duckduckgo-search&lt;/code&gt; for live web search&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 2 — Add a Web Search Tool
&lt;/h2&gt;

&lt;p&gt;Create a file called &lt;code&gt;agent.py&lt;/code&gt; and add the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;duckduckgo_search&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DDGS&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Search the web for up-to-date information.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;[Tool] Searching the web for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DDGS&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;formatted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;formatted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;formatted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why docstrings matter
&lt;/h3&gt;

&lt;p&gt;When Ollama exposes tools to the model, it builds a schema from the function name, its arguments, and its &lt;strong&gt;docstring&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The model uses that schema to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what the tool does&lt;/li&gt;
&lt;li&gt;when it should be used&lt;/li&gt;
&lt;li&gt;what arguments to pass&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ If your tool descriptions are vague, the model is more likely to hallucinate bad tool calls. Good docstrings directly improve reliability.&lt;/p&gt;
&lt;/blockquote&gt;
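<p>To see roughly what the model receives, here is a simplified sketch of deriving a schema from a function. The helper <code>tool_schema()</code> is our own illustration; the real ollama client does this internally and produces a fuller JSON schema, but the ingredients (name, parameters, docstring) are the same:</p>

```python
import inspect

def tool_schema(fn):
    """Illustrative schema builder; the real ollama client builds a
    fuller JSON schema from the same ingredients."""
    sig = inspect.signature(fn)
    params = {}
    for name, p in sig.parameters.items():
        if p.annotation is inspect.Parameter.empty:
            params[name] = "any"
        else:
            params[name] = p.annotation.__name__
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": params,
    }

def web_search(query: str) -> str:
    """Search the web for up-to-date information."""
    return ""

print(tool_schema(web_search))
```

<p>A vague docstring produces a vague <code>description</code> field, and the model picks tools based on exactly that text.</p>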




&lt;h2&gt;
  
  
  Step 3 — Implement a Tool Router
&lt;/h2&gt;

&lt;p&gt;Next, add a router that safely executes tool calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="n"&gt;AVAILABLE_TOOLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;web_search&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;function_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
    &lt;span class="n"&gt;arguments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;function_name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;AVAILABLE_TOOLS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: tool &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AVAILABLE_TOOLS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool execution error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This router acts as a &lt;strong&gt;security layer&lt;/strong&gt;. If the LLM hallucinates a tool like &lt;code&gt;hack_mainframe()&lt;/code&gt;, the router safely blocks it.&lt;/p&gt;
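<p>You can watch the whitelist work without a running model by handing the router a fabricated tool call. This snippet re-declares a condensed router so it runs standalone; <code>SimpleNamespace</code> stands in for the client's tool-call object:</p>

```python
from types import SimpleNamespace

AVAILABLE_TOOLS = {"web_search": lambda query: f"results for {query}"}

def execute_tool_call(tool_call):
    # Condensed version of the router above: same whitelist check
    name = tool_call.function.name
    if name not in AVAILABLE_TOOLS:
        return f"Error: tool {name} not found"
    try:
        return AVAILABLE_TOOLS[name](**tool_call.function.arguments)
    except Exception as e:
        return f"Tool execution error: {e}"

# A hallucinated tool call never reaches real code
fake = SimpleNamespace(function=SimpleNamespace(name="hack_mainframe", arguments={}))
print(execute_tool_call(fake))  # Error: tool hack_mainframe not found

# A legitimate call is dispatched normally
real = SimpleNamespace(function=SimpleNamespace(name="web_search",
                                                arguments={"query": "ollama"}))
print(execute_tool_call(real))  # results for ollama
```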




&lt;h2&gt;
  
  
  Step 4 — Run the Agent Loop
&lt;/h2&gt;

&lt;p&gt;Now implement the full ReAct agent loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an assistant with access to tools.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This loop implements the full ReAct cycle. The model can reason, call a tool, receive results, and generate a final answer — all in one flow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5 — Test It
&lt;/h2&gt;

&lt;p&gt;Add this to the bottom of &lt;code&gt;agent.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Who won the Super Bowl in 2024?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python agent.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: Who won the Super Bowl in 2024?
[Tool] Searching the web for: Super Bowl 2024 winner
Agent: The Kansas City Chiefs won Super Bowl LVIII in 2024 with a score of 25-22 against the San Francisco 49ers.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, the local model has done something important:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It recognized it did not know the answer&lt;/li&gt;
&lt;li&gt;It decided to call the web search tool&lt;/li&gt;
&lt;li&gt;It retrieved fresh information&lt;/li&gt;
&lt;li&gt;It incorporated that result into its final response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your model is no longer limited to its training data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full &lt;code&gt;agent.py&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;duckduckgo_search&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DDGS&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Search the web for up-to-date information.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;[Tool] Searching the web for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DDGS&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;formatted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;formatted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;formatted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;AVAILABLE_TOOLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;web_search&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;function_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
    &lt;span class="n"&gt;arguments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;function_name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;AVAILABLE_TOOLS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: tool &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AVAILABLE_TOOLS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool execution error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an assistant with access to tools.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Who won the Super Bowl in 2024?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;What's Next?&lt;/h2&gt;

&lt;p&gt;Using the same pattern, you can extend the agent with tools that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;query databases&lt;/li&gt;
&lt;li&gt;read PDFs&lt;/li&gt;
&lt;li&gt;run shell commands&lt;/li&gt;
&lt;li&gt;call external APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some ideas to try:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add more tools to AVAILABLE_TOOLS
&lt;/span&gt;&lt;span class="n"&gt;AVAILABLE_TOOLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run_python_snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_database&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
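&lt;p&gt;Each entry maps a name to a plain callable. As a sketch, here are minimal implementations of &lt;code&gt;read_file&lt;/code&gt; and &lt;code&gt;query_database&lt;/code&gt;; the signatures and error handling are illustrative assumptions, not a fixed API:&lt;/p&gt;

```python
import sqlite3
from pathlib import Path

def read_file(path: str) -> str:
    """Return a file's text, or an error string the model can recover from."""
    try:
        return Path(path).read_text()
    except OSError as e:
        return f"Error reading {path}: {e}"

def query_database(sql: str, db_path: str = "app.db") -> str:
    """Run a query and return rows as plain text the model can read."""
    try:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(sql).fetchall()
        return "\n".join(str(row) for row in rows) or "(no rows)"
    except sqlite3.Error as e:
        return f"Database error: {e}"
```

&lt;p&gt;With functions shaped like this, adding a capability is one dictionary entry; the agent loop itself never changes. Note that errors come back as strings, so the model can read the failure and try something else.&lt;/p&gt;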






&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The perceived "incompetence" of local LLMs is often not a model problem at all. &lt;strong&gt;It is an infrastructure problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you wrap a local model in an agent architecture — with a reasoning loop, tool library, and routing layer — it becomes far more useful than a raw chat interface suggests.&lt;/p&gt;

&lt;p&gt;Cloud AI systems will continue to dominate on raw scale and infrastructure maturity. But local agents offer something different: &lt;strong&gt;control, privacy, flexibility&lt;/strong&gt;, and the ability to shape the system around your own workflow.&lt;/p&gt;

&lt;p&gt;Once you start thinking this way, local models stop feeling like weak copies of cloud AI — and start feeling like programmable building blocks.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this useful? Drop a ❤️ and follow for more AI engineering content.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Why Your Local LLM Feels "Dumb" Compared to Cloud APIs</title>
      <dc:creator>Shakib S.</dc:creator>
      <pubDate>Sat, 28 Feb 2026 09:28:10 +0000</pubDate>
      <link>https://dev.to/workspacedex/why-your-local-llm-feels-dumb-compared-to-cloud-apis-4id7</link>
      <guid>https://dev.to/workspacedex/why-your-local-llm-feels-dumb-compared-to-cloud-apis-4id7</guid>
      <description>&lt;h3&gt;
  
  
  The Experiment That Changed How I Think About AI
&lt;/h3&gt;

&lt;p&gt;I had just set up a fully local AI stack. Ollama running Llama 3 8B, clean terminal, no API keys, no monthly bill. I typed a real-world task:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Research the latest patch notes for Elden Ring and save a summary to my desktop."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It failed. Politely, apologetically — but it failed.&lt;/p&gt;

&lt;p&gt;Then I opened Claude. Same task. It searched the web, summarized the results, and handed me a file.&lt;/p&gt;

&lt;p&gt;My first instinct was the obvious one: &lt;em&gt;Claude is just smarter.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That instinct was wrong. And understanding why it's wrong is the most useful thing you can learn about AI right now.&lt;/p&gt;




&lt;h2&gt;The Brain in a Jar Problem&lt;/h2&gt;

&lt;p&gt;A local LLM in its default state is exactly this: &lt;strong&gt;a brain in a jar.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It has knowledge. Enormous knowledge, compressed into billions of parameters. But it has no limbs. It cannot reach the internet. It cannot write to your filesystem. It cannot verify its own outputs. It cannot call an API. It just... predicts the next token, over and over, inside a sealed container.&lt;/p&gt;

&lt;p&gt;When Claude "thinks" through a complex task, it isn't doing so purely with a bigger or smarter model. It's using &lt;strong&gt;tool calling&lt;/strong&gt; — a mechanism where the model emits structured instructions, and a surrounding system executes them.&lt;/p&gt;

&lt;p&gt;The model says: &lt;em&gt;"I need to search the web."&lt;/em&gt;&lt;br&gt;&lt;br&gt;
The system does the search.&lt;br&gt;&lt;br&gt;
The results come back into context.&lt;br&gt;&lt;br&gt;
The model says: &lt;em&gt;"Now write this to a file."&lt;/em&gt;&lt;br&gt;&lt;br&gt;
The system writes the file.&lt;/p&gt;

&lt;p&gt;The model itself never touched the internet. Never touched your filesystem. It just coordinated — and the orchestration layer did the actual work.&lt;/p&gt;
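&lt;p&gt;The exchange above can be sketched as a plain message transcript. The tool name, argument shape, and content strings here are illustrative, not any particular vendor's API:&lt;/p&gt;

```python
import json

# Illustrative transcript of one tool-calling round trip.
# The tool name and argument shape are hypothetical.
transcript = [
    {"role": "user", "content": "Summarize the latest patch notes."},
    # The model answers with a structured instruction, not prose:
    {"role": "assistant",
     "content": json.dumps({"tool": "web_search",
                            "args": {"query": "latest patch notes"}})},
    # The orchestration layer runs the search and feeds results back:
    {"role": "tool", "content": "Patch 1.2: balance changes to ..."},
    # With the data now in context, the model answers in plain text:
    {"role": "assistant", "content": "Summary: patch 1.2 rebalances ..."},
]

def is_tool_call(message: dict) -> bool:
    """The orchestrator's core test: is this message a structured instruction?"""
    try:
        return "tool" in json.loads(message["content"])
    except (json.JSONDecodeError, TypeError):
        return False
```

&lt;p&gt;Everything the model "does" reduces to emitting one of these structured messages and reading whatever the orchestrator sends back.&lt;/p&gt;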

&lt;p&gt;&lt;strong&gt;That orchestration layer is what you're paying for when you pay for Claude or GPT-4.&lt;/strong&gt; The model is increasingly a commodity. The IP is the system built around it.&lt;/p&gt;


&lt;h2&gt;Proving It: Raw vs. Orchestrated&lt;/h2&gt;

&lt;p&gt;To make this concrete, I ran an experiment with a simple setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inference: Ollama (Llama 3 8B)&lt;/li&gt;
&lt;li&gt;Orchestration: Python middleware with basic function calling&lt;/li&gt;
&lt;li&gt;Tools: A web search API (Tavily) and a local filesystem writer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The logic flow for the Elden Ring task:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without orchestration, the model has no path to success. It knows what patch notes are. It knows how to summarize. But it cannot get the data, so the task dies before it starts.&lt;/p&gt;

&lt;p&gt;With orchestration, the flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User sends the request&lt;/li&gt;
&lt;li&gt;The model emits structured JSON: &lt;code&gt;{"tool": "web_search", "query": "Elden Ring patch notes 2026"}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The Python middleware intercepts this, runs the search, returns the text&lt;/li&gt;
&lt;li&gt;The model summarizes and emits: &lt;code&gt;{"tool": "write_file", "filename": "summary.txt", "content": "..."}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The middleware writes the file&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Same 8B model. Completely different result.&lt;/p&gt;

&lt;p&gt;The finding was stark: &lt;strong&gt;an orchestrated 8B model consistently outperforms a raw 70B model on tasks that require external data or side effects.&lt;/strong&gt; Not because it's smarter — because it has agency. It has limbs.&lt;/p&gt;


&lt;h2&gt;The Code That Makes It Real&lt;/h2&gt;

&lt;p&gt;Here's a minimal Python implementation of this pattern. This is not pseudocode — this runs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;OLLAMA_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;TOOLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;write_to_disk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Replace with your Tavily or SearXNG endpoint
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.tavily.com/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_to_disk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File written: /tmp/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are an agent with access to tools. When you need to use a tool, 
    respond ONLY with valid JSON in this format:
    {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arg1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}

    Available tools:
    - web_search: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}
    - write_file: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}

    When the task is complete, respond normally in plain text.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# max 5 tool calls
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OLLAMA_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TOOLS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TOOLS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Model responded in plain text — task is done
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final answer:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max iterations reached.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research the latest Elden Ring patch notes and save a summary to my desktop.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the bridge. Not complex — but powerful. Once you have this pattern, you can add any tool: database lookups, calendar access, code execution, anything.&lt;/p&gt;
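&lt;p&gt;One hedged way to keep that extensible is a plain registry: each tool is a named Python function, and the loop dispatches by name. Everything below (tool names, the example tools, the payload shape) is illustrative, not a fixed API:&lt;/p&gt;

```python
# A tiny tool registry: the agent loop looks up tools by name and calls them.
TOOLS = {}

def tool(name):
    # Decorator that registers a function under the name the model will use
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("read_file")
def read_file(path):
    with open(path) as f:
        return f.read()

@tool("word_count")
def word_count(text):
    return len(text.split())

def dispatch(name, **kwargs):
    # The agent loop calls this with the parsed {"tool": ..., "args": ...} payload
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    return TOOLS[name](**kwargs)

print(dispatch("word_count", text="hello agentic world"))
```

&lt;p&gt;Adding a database lookup or calendar tool is then one decorated function, with no change to the loop itself.&lt;/p&gt;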




&lt;h2&gt;
  
  
  The Distillation Bonus
&lt;/h2&gt;

&lt;p&gt;There's a second force making this even more interesting right now.&lt;/p&gt;

&lt;p&gt;Larger models are being used to train smaller ones — a process called &lt;strong&gt;distillation&lt;/strong&gt;. The practical result: we are approaching a point where much of the reasoning capability of a trillion-parameter model gets compressed into a 7-billion-parameter one.&lt;/p&gt;

&lt;p&gt;Models like DeepSeek-R1, Qwen3, and Mistral's latest releases are examples of this trend. The gap between a "small" local model and a frontier cloud model is shrinking every quarter.&lt;/p&gt;

&lt;p&gt;This creates a compounding advantage for the local orchestration approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The model gets smarter&lt;/strong&gt; (distillation brings frontier reasoning to edge hardware)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The system gives it agency&lt;/strong&gt; (orchestration adds tools and memory)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your data stays local&lt;/strong&gt; (the privacy moat stays intact)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You get increasing capability without increasing your exposure. That's a rare combination.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Practical Takeaway
&lt;/h2&gt;

&lt;p&gt;Stop searching for a smarter model. The model is rarely your bottleneck.&lt;/p&gt;

&lt;p&gt;The three things that actually determine usefulness are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Web access&lt;/strong&gt; — A model without current information is operating blind. Add a search tool. Even a free SearXNG instance changes everything.&lt;/p&gt;
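&lt;p&gt;A minimal sketch of such a tool, assuming a local SearXNG instance at &lt;code&gt;localhost:8080&lt;/code&gt; with its JSON output format enabled in &lt;code&gt;settings.yml&lt;/code&gt;:&lt;/p&gt;

```python
import json
import urllib.parse
import urllib.request

# Assumed local SearXNG endpoint; "json" must be listed under search formats
# in its settings.yml for format=json to be accepted.
SEARXNG_URL = "http://localhost:8080/search"

def format_results(data, max_results=5):
    # Trim the raw SearXNG payload down to what the model actually needs
    return [
        {"title": r.get("title"), "url": r.get("url"), "snippet": r.get("content")}
        for r in data.get("results", [])[:max_results]
    ]

def web_search(query, max_results=5):
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    with urllib.request.urlopen(f"{SEARXNG_URL}?{params}", timeout=10) as resp:
        return format_results(json.load(resp), max_results)

# Example (requires a running instance):
# print(web_search("latest Elden Ring patch notes"))
```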

&lt;p&gt;&lt;strong&gt;2. Memory / context persistence&lt;/strong&gt; — By default, each conversation starts from zero. A simple vector store (Chroma, Qdrant) or even a plain text log fed back into context gives your model continuity.&lt;/p&gt;
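&lt;p&gt;The plain-text-log variant really is as small as it sounds. A sketch (the file name and note format are just assumptions):&lt;/p&gt;

```python
from pathlib import Path

LOG = Path("agent_memory.log")  # hypothetical location for the rolling log

def remember(line):
    # Append one note per line to the plain-text memory file
    with LOG.open("a") as f:
        f.write(line.rstrip() + "\n")

def recall(max_lines=20):
    # Feed the most recent notes back into the prompt as context
    if not LOG.exists():
        return ""
    lines = LOG.read_text().splitlines()[-max_lines:]
    return "Notes from earlier sessions:\n" + "\n".join(lines)

remember("User prefers concise answers.")
system_prompt = "You are a helpful assistant.\n\n" + recall()
```

&lt;p&gt;Swap the flat file for Chroma or Qdrant later if you need semantic lookup; the shape of the pattern stays the same.&lt;/p&gt;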

&lt;p&gt;&lt;strong&gt;3. File system / execution access&lt;/strong&gt; — The ability to write, read, and run code transforms the model from an advisor into an agent.&lt;/p&gt;
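&lt;p&gt;A hedged sketch of a file-writing tool; the sandbox directory is a made-up convention, but confining writes to one folder is the part worth keeping:&lt;/p&gt;

```python
from pathlib import Path

# Hypothetical convention: the agent may only write inside this one directory
SANDBOX = Path("agent-workspace")

def write_file(relative_path, content):
    # Resolve the target and refuse anything that escapes the sandbox
    target = (SANDBOX / relative_path).resolve()
    if SANDBOX.resolve() not in target.parents:
        raise ValueError("refusing to write outside the sandbox")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return str(target)

print(write_file("notes/summary.txt", "Patch 1.12 highlights, summarized."))
```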

&lt;p&gt;These are not advanced features. Each one is a weekend project. And together, they close most of the gap between a local 8B model and a productized cloud API.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing: The Race Has Already Shifted
&lt;/h2&gt;

&lt;p&gt;The AI competition used to be about who could train the biggest model. That race is becoming irrelevant for most practical use cases.&lt;/p&gt;

&lt;p&gt;The new race is about who builds the best system around the most efficient model.&lt;/p&gt;

&lt;p&gt;Cloud providers understood this first — that's why their APIs feel so capable. But the tools to replicate that architecture locally are open, documented, and running on consumer hardware right now.&lt;/p&gt;

&lt;p&gt;You don't need a bigger model.&lt;br&gt;&lt;br&gt;
You need better plumbing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want to go deeper on the self-hosting stack itself — Ollama, Open WebUI, and SearXNG working together — I covered the full setup in &lt;a href="https://medium.com/@strangelyevil/replacing-cloud-ai-with-a-privacy-first-local-llm-stack-8d9c651a0710" rel="noopener noreferrer"&gt;Part 1 of this series&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
    <item>
      <title>"Fuck You NVIDIA" (and What I Learned Staring at a Blank Screen)</title>
      <dc:creator>Shakib S.</dc:creator>
      <pubDate>Thu, 26 Feb 2026 18:13:03 +0000</pubDate>
      <link>https://dev.to/workspacedex/fuck-you-nvidia-and-what-i-learned-staring-at-a-blank-screen-3g1g</link>
      <guid>https://dev.to/workspacedex/fuck-you-nvidia-and-what-i-learned-staring-at-a-blank-screen-3g1g</guid>
      <description>&lt;p&gt;When I A bug or an system display related issue I found in Arch Linux running with SDDM.&lt;/p&gt;

&lt;p&gt;It wakes up.&lt;/p&gt;

&lt;p&gt;Black screen.&lt;/p&gt;

&lt;p&gt;I'm running &lt;strong&gt;Arch Linux with SDDM&lt;/strong&gt;. NVIDIA GPU. Consumer card.&lt;/p&gt;

&lt;p&gt;Here's the thing about NVIDIA on Linux: their &lt;strong&gt;power management on consumer-level hardware is broken by design.&lt;/strong&gt; When the display sleeps and wakes, the driver conflicts. The screen doesn't recover. You're left staring at nothing, wondering if you broke something or if something was always broken.&lt;/p&gt;

&lt;p&gt;It's a known issue. Documented in forums. Mentioned in bug trackers. NVIDIA just hasn't cared enough to properly fix it for us.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyfr5os6vci9brb5pk0v.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyfr5os6vci9brb5pk0v.gif" alt=" " width="498" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Linus Torvalds, 2012 (Still accurate)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A Year on Arch (Without Going Down the Rice Hole)
&lt;/h2&gt;

&lt;p&gt;I've been on Arch for a year now.&lt;/p&gt;

&lt;p&gt;I didn't rice it. Not obsessively, anyway. I knew the trap — you spend three months building a desktop that's perfectly yours: every keybinding, every color, every font chosen by your own hands, understood by exactly one person on Earth. Beautiful to you. Useless to your deadline.&lt;/p&gt;

&lt;p&gt;That wasn't wise for me. Not yet.&lt;/p&gt;

&lt;p&gt;But I still learned &lt;em&gt;more&lt;/em&gt; from this OS than any hand-holding distro ever taught me. Arch doesn't protect you from yourself. It hands you a blank canvas, a wiki, and your own stubbornness — then steps back.&lt;/p&gt;

&lt;p&gt;You learn because you &lt;em&gt;have&lt;/em&gt; to. And somehow that sticks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Today's Rabbit Hole
&lt;/h2&gt;

&lt;p&gt;I was installing &lt;strong&gt;Omarchy&lt;/strong&gt; on a CachyOS base — Hyprland setup, fresh install, ready to go. Used what was labeled a "safe" test script from GitHub.&lt;/p&gt;

&lt;p&gt;It wasn't perfect.&lt;/p&gt;

&lt;p&gt;I started troubleshooting. Logs, terminal output, forum threads from 2019 that are somehow still the most relevant thing on the internet. Feeding output to my AI, clicking through configs, muttering to myself.&lt;/p&gt;

&lt;p&gt;And then — the pieces connected.&lt;/p&gt;

&lt;p&gt;That display bug I'd been living with for &lt;em&gt;months&lt;/em&gt;? I finally traced it. NVIDIA drivers conflicting on wake from sleep. Consumer power management. A problem that's been sitting in plain sight, documented and unfixed, waiting for me to finally look it in the eye.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hours You Spend Confused Are the Investment
&lt;/h2&gt;

&lt;p&gt;Nobody tells you this when you install Arch.&lt;/p&gt;

&lt;p&gt;The blank screens, the &lt;code&gt;journalctl&lt;/code&gt; rabbit holes, the 3am forum threads — that's not wasted time. That's &lt;strong&gt;tuition&lt;/strong&gt;. You're paying for a mental model of your own system. One that no YouTube tutorial can hand you.&lt;/p&gt;

&lt;p&gt;I questioned those moments. Hard. Staring at nothing, wondering why I was doing this to myself, wondering what normal people do with their evenings.&lt;/p&gt;

&lt;p&gt;But today I can say: &lt;em&gt;I know what was wrong. I understand my system.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And the fix?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In /etc/systemd/logind.conf&lt;/span&gt;
&lt;span class="nv"&gt;HandleLidSwitch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ignore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. Tell logind to ignore the lid switch, and the machine never suspends on lid close. The sleep/wake conflict never gets a chance to trigger.&lt;/p&gt;
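&lt;p&gt;One way to apply it (a sketch; hand-editing the file and rebooting works just as well, and the &lt;code&gt;sed&lt;/code&gt; line assumes the stock commented-out entry is present):&lt;/p&gt;

```shell
# Set the option whether the line is commented out or set to something else,
# then reload logind so the change takes effect.
sudo sed -i 's/^#\?HandleLidSwitch=.*/HandleLidSwitch=ignore/' /etc/systemd/logind.conf
sudo systemctl restart systemd-logind   # or just reboot
```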

&lt;h2&gt;
  
  
  That's it. That's the ending. Months of blank screens, solved by one config line I could've written on day one — if I'd known enough to write it.
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Still With You, Linus
&lt;/h2&gt;

&lt;p&gt;NVIDIA makes powerful hardware. They also make Linux users' lives unnecessarily difficult — and have for decades. The open-source community has worked around them, patched around them, and occasionally yelled at them in legendary fashion.&lt;/p&gt;

&lt;p&gt;I'm still here. Still on Arch. Still learning things the hard way, which turns out to be the only way that actually sticks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The system I'm running today — I &lt;em&gt;understand&lt;/em&gt; it. Not perfectly. Not completely. But more than I did yesterday, and infinitely more than if I'd stayed somewhere comfortable.
&lt;/h2&gt;

&lt;h2&gt;
  
  
  What I Actually Learned (TL;DR for the skimmers)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA + Linux + display sleep = known conflict, poorly maintained&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;HandleLidSwitch=ignore&lt;/code&gt; in &lt;code&gt;/etc/systemd/logind.conf&lt;/code&gt; sidesteps the wake issue&lt;/li&gt;
&lt;li&gt;A year on Arch without ricing was the right call for me — depth over aesthetics&lt;/li&gt;
&lt;li&gt;The painful hours are the curriculum. There's no shortcut that gives you the same understanding&lt;/li&gt;
&lt;li&gt;Omarchy on CachyOS/Hyprland is worth exploring — just go in with eyes open&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Running Arch. Still here. Send help (or just more coffee).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you've hit the same NVIDIA sleep bug — drop your fix in the comments. There are a hundred ways to solve this and I've probably only found one of them.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>archlinux</category>
      <category>nvidia</category>
      <category>devjournal</category>
    </item>
    <item>
      <title>LeetCode: The “Contains Duplicate” Problem</title>
      <dc:creator>Shakib S.</dc:creator>
      <pubDate>Mon, 16 Feb 2026 19:25:23 +0000</pubDate>
      <link>https://dev.to/workspacedex/leetcode-the-contains-duplicate-problem-23m7</link>
      <guid>https://dev.to/workspacedex/leetcode-the-contains-duplicate-problem-23m7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Day 2 of Refusing to Write Code Without Understanding It.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6kcgc8gfl6tcfy1ogkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6kcgc8gfl6tcfy1ogkm.png" alt=" " width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There’s a specific question the computer is asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Does this list have the same number appearing more than once?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Simple. Almost boring.&lt;/p&gt;

&lt;p&gt;But I made one rule for myself on this journey:&lt;/p&gt;

&lt;p&gt;I don’t just want the green “Accepted” badge on LeetCode.&lt;/p&gt;

&lt;p&gt;I want to see the code actually run.&lt;/p&gt;

&lt;p&gt;I want to watch the output.&lt;/p&gt;

&lt;p&gt;I want to feel the logic execute.&lt;/p&gt;

&lt;p&gt;That rule has kept this process fulfilling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem (Explained)
&lt;/h2&gt;

&lt;p&gt;Examples everyone understands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[3, 7, 1, 9]   → false 
[3, 7, 3, 9]   → true 
[5]            → false 
[8, 8]         → true 
[]             → false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need a reliable way to detect if any number appears more than once.&lt;/p&gt;

&lt;p&gt;Let’s solve it the most human way first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The “Looking With Your Eyes” Method
&lt;/h2&gt;

&lt;p&gt;Imagine this list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[4, 1, 7, 2, 9, 4, 8]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You would naturally do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Take 4 → remember it
Take 1 → new → remember
Take 7 → new
Take 2 → new
Take 9 → new
Take 4 → wait… I’ve seen 4 before → duplicate!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s the entire algorithm.&lt;/p&gt;

&lt;p&gt;Remember what you’ve seen.&lt;/p&gt;

&lt;p&gt;If you see it again → stop.&lt;/p&gt;

&lt;p&gt;Now let’s write that in code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning Human Thinking Into Python
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def has_duplicate(nums):
   seen = []                # empty list = our memory

   for num in nums:         # look at each number one by one
       if num in seen:      # is it already in memory?
           return True      # yes → duplicate found
       seen.append(num)     # no → remember it

   return False             # no duplicates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s mentally run it:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Input: [4, 1, 7, 4]&lt;br&gt;
4 → add → [4]&lt;br&gt;
1 → add → [4,1]&lt;br&gt;
7 → add → [4,1,7]&lt;br&gt;
4 → already in → return True&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Perfect.&lt;/p&gt;

&lt;p&gt;But here’s where things get interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Performance Reality
&lt;/h2&gt;

&lt;p&gt;Lists in Python are slow at checking membership.&lt;/p&gt;

&lt;p&gt;When you do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if num in seen:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python checks every element in the list until it finds a match.&lt;/p&gt;

&lt;p&gt;That’s &lt;a href="https://en.wikipedia.org/wiki/Big_O_notation" rel="noopener noreferrer"&gt;O(n)&lt;/a&gt; time.&lt;/p&gt;

&lt;p&gt;If the list has 100,000 elements, that check might scan through all 100,000.&lt;/p&gt;

&lt;p&gt;And we do that inside a loop.&lt;/p&gt;

&lt;p&gt;That makes the total time complexity:&lt;br&gt;
O(n²)&lt;/p&gt;

&lt;p&gt;For interviews? Not good.&lt;/p&gt;
&lt;h2&gt;
  
  
  Enter the Hash Map (via Set)
&lt;/h2&gt;

&lt;p&gt;Python has a built-in data structure called a &lt;code&gt;set&lt;/code&gt;. This is the secret sauce.&lt;/p&gt;

&lt;p&gt;A set is implemented using a hash table (a structure that allows near constant-time lookup).&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;p&gt;Checking &lt;code&gt;num in seen&lt;/code&gt; becomes approximately O(1).&lt;/p&gt;

&lt;p&gt;Now the entire loop becomes O(n).&lt;/p&gt;

&lt;p&gt;Here’s the optimized version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def has_duplicate(nums):
   seen = set()             # fast lookup structure

   for num in nums:
       if num in seen:
           return True
       seen.add(num)

   return False

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same logic.&lt;br&gt;
Different container.&lt;br&gt;
Massive performance difference.&lt;br&gt;
That’s the power of data structures.&lt;/p&gt;
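&lt;p&gt;If you’d rather measure that than take it on faith, a quick benchmark sketch (sizes picked arbitrarily) makes the gap concrete:&lt;/p&gt;

```python
import time

def has_duplicate_list(nums):
    # O(n^2): every membership check scans the list front to back
    seen = []
    for num in nums:
        if num in seen:
            return True
        seen.append(num)
    return False

def has_duplicate_set(nums):
    # O(n): set membership is a single hash lookup
    seen = set()
    for num in nums:
        if num in seen:
            return True
        seen.add(num)
    return False

worst_case = list(range(10_000))  # no duplicates, so every element gets checked

start = time.perf_counter()
has_duplicate_list(worst_case)
list_time = time.perf_counter() - start

start = time.perf_counter()
has_duplicate_set(worst_case)
set_time = time.perf_counter() - start

print(f"list: {list_time:.3f}s, set: {set_time:.3f}s")
```

&lt;p&gt;Expect the list version to be orders of magnitude slower at this size, and the gap widens as n grows. A common shorthand for the set approach is &lt;code&gt;len(set(nums)) != len(nums)&lt;/code&gt;, though it always builds the full set instead of exiting early on the first duplicate.&lt;/p&gt;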
&lt;h2&gt;
  
  
  The Lab Setup (The Rule I Follow. So Should You.)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyw6iochdt0fs0wmtoqfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyw6iochdt0fs0wmtoqfh.png" alt=" " width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where my personal rule kicks in.&lt;/p&gt;

&lt;p&gt;I don’t just submit to LeetCode.&lt;/p&gt;

&lt;p&gt;I build a tiny lab and run it myself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Solution:
   def containsDuplicate(self, nums):
       seen = set()
       for num in nums:
           if num in seen:
               return True
           seen.add(num)
       return False

s = Solution()
nums = [1, 0, 2, 5, 8, 9, 1]
result = s.containsDuplicate(nums)
print(result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s more satisfying.&lt;/p&gt;

&lt;p&gt;It feels real.&lt;/p&gt;

&lt;p&gt;It feels engineered.&lt;/p&gt;

&lt;p&gt;Not gamified.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Problem Actually Taught Me
&lt;/h2&gt;

&lt;p&gt;This wasn’t about duplicates.&lt;/p&gt;

&lt;p&gt;It was about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choosing the right data structure&lt;/li&gt;
&lt;li&gt;Understanding time complexity&lt;/li&gt;
&lt;li&gt;Thinking in terms of scale&lt;/li&gt;
&lt;li&gt;Translating human logic into machine logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LeetCode is helping me build algorithmic discipline.&lt;/p&gt;

&lt;p&gt;And that discipline matters when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Designing APIs&lt;/li&gt;
&lt;li&gt;Optimizing backend systems&lt;/li&gt;
&lt;li&gt;Handling large datasets&lt;/li&gt;
&lt;li&gt;Preventing performance bottlenecks&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Thanks for reading.&lt;/p&gt;

&lt;p&gt;This week I’m focused on hash maps and strings.&lt;br&gt;
Cheers to learning.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>learning</category>
      <category>leetcode</category>
    </item>
  </channel>
</rss>
