<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hieu Pham</title>
    <description>The latest articles on DEV Community by Hieu Pham (@minhiu).</description>
    <link>https://dev.to/minhiu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F387749%2Fc4816425-d3a8-4c9c-b553-2a73d08083f7.jpg</url>
      <title>DEV Community: Hieu Pham</title>
      <link>https://dev.to/minhiu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/minhiu"/>
    <language>en</language>
    <item>
      <title>Why Your Custom NemoClaw LLM Takes Forever to Respond (Or Completely Ignores You)</title>
      <dc:creator>Hieu Pham</dc:creator>
      <pubDate>Tue, 24 Mar 2026 18:45:16 +0000</pubDate>
      <link>https://dev.to/minhiu/why-your-custom-nemoclaw-llm-takes-forever-to-respond-or-completely-ignores-you-237h</link>
      <guid>https://dev.to/minhiu/why-your-custom-nemoclaw-llm-takes-forever-to-respond-or-completely-ignores-you-237h</guid>
      <description>&lt;p&gt;You finally set up a local AI agent to help you tackle your dev backlog (if you haven't yet, check out my guide on &lt;a href="https://dev.to/minhiu/how-to-run-nemoclaw-with-a-local-llm-connect-to-telegram-without-losing-your-mind-3lk"&gt;how to run NemoClaw with a local LLM &amp;amp; connect to Telegram&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;The goal is simple: feed it your local codebase so it can help you refactor complex components, map out new business logic, or write comprehensive unit tests—all without sending proprietary company code to an external API.&lt;/p&gt;

&lt;p&gt;You fire up an agentic framework like NemoClaw on your RTX 4080, paste in your prompt, and... the agent completely loses its mind. &lt;/p&gt;

&lt;p&gt;Instead of writing code, it either ghosts you, dumps a wall of unformatted JSON into your terminal, or gets trapped in an infinite retry loop that fires every three seconds until the session crashes. &lt;/p&gt;

&lt;p&gt;After spending a full day digging through API logs, I realized this isn't a network bug. It is a fundamental flaw in how local agent frameworks handle context windows, and it affects almost every developer trying to build private AI workflows. &lt;/p&gt;

&lt;p&gt;If your local agent is stuck in an infinite loop or timing out, here is the exact architectural bottleneck causing it, and how to permanently fix it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Root Cause: The Hidden ReAct Loop Trap
&lt;/h2&gt;

&lt;p&gt;Frameworks like NemoClaw, AutoGen, and LangChain operate on a "Reasoning and Acting" (ReAct) loop. To make the AI autonomous, the framework secretly injects a massive set of invisible system instructions, tool schemas, and strict JSON formatting rules before it even attaches your actual question.&lt;/p&gt;

&lt;p&gt;By the time you ask the agent to review a few hundred lines of code, your total prompt size easily explodes past &lt;strong&gt;12,000 tokens&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Here is where the pipeline breaks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The 4k Wall:&lt;/strong&gt; By default, local inference engines like Ollama cap their context window at 4,096 tokens to save VRAM. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Decapitation:&lt;/strong&gt; When the framework sends that massive 12k-token prompt, the inference engine blindly chops off the oldest 8,000 tokens to make it fit. Unfortunately, those oldest tokens contain the framework's critical JSON formatting rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Infinite Loop:&lt;/strong&gt; The lobotomized model replies with broken, plain-text formatting. The framework's parser catches the bad JSON, slaps the model on the wrist, and automatically replies: &lt;em&gt;"Invalid JSON schema, try again."&lt;/em&gt; The model tries again, gets truncated again, and you are officially trapped in a rapid-fire retry loop that hammers your GPU until the 60-second gateway timeout drops the connection.&lt;/li&gt;
&lt;/ol&gt;
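
&lt;p&gt;If you want to confirm the truncation is actually happening, Ollama's server logs usually warn when a prompt blows past the context window (a quick sketch; the exact log wording varies by version, and this assumes Ollama runs as a systemd service):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Tail the server logs for truncation warnings while the agent runs
journalctl -u ollama -f | grep -i truncat
# A line like "truncating input prompt ... limit=4096" confirms the 4k wall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;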

&lt;h2&gt;
  
  
  The False Fix (&lt;code&gt;OLLAMA_NUM_CTX&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Your first instinct as a lead developer is probably to just restart the server and force a larger context window via an environment variable: &lt;code&gt;OLLAMA_NUM_CTX=16384 ollama serve&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This will not work.&lt;/strong&gt; Most agent frameworks communicate with Ollama via the OpenAI compatibility endpoint (&lt;code&gt;/v1/chat/completions&lt;/code&gt;). If the client framework doesn't explicitly declare a custom context size in its JSON payload, that specific endpoint completely ignores your environment variable and forces the model back to its 4k default.&lt;/p&gt;
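
&lt;p&gt;You can see the asymmetry for yourself. Ollama's native endpoint accepts a per-request &lt;code&gt;num_ctx&lt;/code&gt; in its &lt;code&gt;options&lt;/code&gt; object, but the OpenAI-compatible payload has no equivalent field (a sketch assuming Ollama on its default port 11434):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Native API: per-request context size is honored
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:14b",
  "prompt": "ping",
  "options": { "num_ctx": 16384 }
}'

# OpenAI-compatible API: nowhere to put num_ctx,
# so the model silently falls back to its baked-in default
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "qwen2.5:14b",
  "messages": [{ "role": "user", "content": "ping" }]
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;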

&lt;p&gt;To fix this, you have to take the setting out of the request path entirely and bake the larger context window directly into the model's DNA. &lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Fix: Building a Custom Modelfile
&lt;/h2&gt;

&lt;p&gt;First, you need a highly capable "Instruct" model. With 16GB of VRAM on an RTX 4080, you have the perfect amount of hardware headroom to run a brilliant mid-weight model (like &lt;code&gt;qwen2.5:14b&lt;/code&gt;) &lt;em&gt;and&lt;/em&gt; a massive 16k context window without spilling over into agonizingly slow system RAM. &lt;/p&gt;

&lt;h3&gt;
  
  
  1. Bake the 16k Context into the DNA
&lt;/h3&gt;

&lt;p&gt;In your terminal, create a custom Ollama model with the 16k limit hardcoded using a &lt;code&gt;Modelfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"FROM qwen2.5:14b"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Modelfile
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"PARAMETER num_ctx 16384"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; Modelfile
ollama create qwen14b-agent-16k &lt;span class="nt"&gt;-f&lt;/span&gt; Modelfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
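
&lt;p&gt;Before rewiring the gateway, it's worth confirming the parameter actually stuck (on recent Ollama builds, &lt;code&gt;ollama show&lt;/code&gt; prints a model's baked-in parameters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama show qwen14b-agent-16k
# The Parameters section should now list: num_ctx 16384
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;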



&lt;h3&gt;
  
  
  2. Update the Gateway Route
&lt;/h3&gt;

&lt;p&gt;Tell your framework's API gateway to route all inference to your newly minted, wide-context model. (For OpenShell/NemoClaw, it looks like this):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openshell inference &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--provider&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; qwen14b-agent-16k
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Wipe the Corrupted Memory
&lt;/h3&gt;

&lt;p&gt;Because your agent just spent the last 20 minutes screaming at itself in broken JSON, its session history is deeply corrupted. If you don't wipe it, the memory manager will crash trying to read the garbage data on your next prompt. Clear out the session storage before testing again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# For NemoClaw users:&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; /sandbox/.openclaw-data/agents/main/sessions/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Because the massive system prompt is no longer being decapitated, the &lt;code&gt;14b&lt;/code&gt; model perfectly understands the framework's JSON instructions. It can hold its tool schemas, its system prompt, and your entire codebase in its head simultaneously.&lt;/p&gt;

&lt;p&gt;It executes its tool calls seamlessly and replies in natural language in just a few seconds. &lt;/p&gt;

&lt;p&gt;You now have a lightning-fast, fully autonomous local agent running securely on your own hardware, taking full advantage of that 16GB of VRAM. &lt;/p&gt;

&lt;p&gt;Have you tried pushing the limits of your GPU with local agent frameworks? Let me know your stack in the comments!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>nemoclaw</category>
      <category>ai</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>How to Run NemoClaw with a Local LLM &amp; Connect to Telegram (Without Losing Your Mind)</title>
      <dc:creator>Hieu Pham</dc:creator>
      <pubDate>Tue, 24 Mar 2026 16:51:54 +0000</pubDate>
      <link>https://dev.to/minhiu/how-to-run-nemoclaw-with-a-local-llm-connect-to-telegram-without-losing-your-mind-3lk</link>
      <guid>https://dev.to/minhiu/how-to-run-nemoclaw-with-a-local-llm-connect-to-telegram-without-losing-your-mind-3lk</guid>
      <description>&lt;p&gt;I just spent a full day wrestling with NemoClaw so you don’t have to. &lt;/p&gt;

&lt;p&gt;NemoClaw is an incredible agentic framework, but because it is still in beta, it has its fair share of quirks, undocumented networking hurdles, and strict kernel-level sandboxing that will block your local connections by default. &lt;/p&gt;

&lt;p&gt;My goal was to run a fully private, locally hosted AI agent using a local LLM that I could text from my phone via Telegram. Working with an RTX 4080 and its strict 16GB VRAM limit meant I had to optimize my model choice and bypass a maze of container networks to get everything talking. &lt;/p&gt;

&lt;p&gt;If you are trying to ditch the cloud and run OpenClaw locally on WSL2, here is the exact step-by-step fix to get your agent online.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: Escaping the Sandbox (Connecting the Local LLM)
&lt;/h2&gt;

&lt;p&gt;By default, NemoClaw runs your agent inside a nested Kubernetes (&lt;code&gt;k3s&lt;/code&gt;) container within WSL2. If you try to point it to your local Ollama instance using &lt;code&gt;localhost&lt;/code&gt; or the default Docker bridge, the sandbox's strict egress policies will hit you with an endless stream of &lt;code&gt;HTTP 503&lt;/code&gt; errors. &lt;/p&gt;

&lt;p&gt;To fix this, we have to route the traffic out the "front door" via your primary WSL network interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Force Ollama to Listen on All Interfaces
&lt;/h3&gt;

&lt;p&gt;Stop your background Ollama service and force it to broadcast to your local WSL network:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop ollama
&lt;span class="nv"&gt;OLLAMA_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0.0.0 ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
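
&lt;p&gt;Running &lt;code&gt;ollama serve&lt;/code&gt; in the foreground works for testing, but it dies with your terminal. If your Ollama was installed as a systemd service, you can make the binding persistent with a drop-in override instead (a sketch; skip this if your install doesn't use systemd):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;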



&lt;h3&gt;
  
  
  2. Grab Your Primary WSL IP
&lt;/h3&gt;

&lt;p&gt;In a new terminal tab (outside the OpenShell sandbox), grab your virtual machine's IP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;WSL_IP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ip &lt;span class="nt"&gt;-4&lt;/span&gt; addr show eth0 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-Po&lt;/span&gt; &lt;span class="s1"&gt;'inet \K[\d.]+'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
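
&lt;p&gt;Before wiring the gateway, sanity-check that Ollama is actually reachable on that IP (&lt;code&gt;/api/tags&lt;/code&gt; simply lists your installed models):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://$WSL_IP:11434/api/tags
# A JSON list of models means the route is open;
# "connection refused" means OLLAMA_HOST didn't take effect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;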



&lt;h3&gt;
  
  
  3. Wire the OpenShell Gateway
&lt;/h3&gt;

&lt;p&gt;Delete the broken default provider and recreate it pointing to your WSL IP, then set your inference route (I used &lt;code&gt;qwen3.5:9b&lt;/code&gt; as it comfortably fits my hardware constraints):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openshell provider delete ollama

openshell provider create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; ollama &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; openai &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--credential&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;empty &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://&lt;span class="nv"&gt;$WSL_IP&lt;/span&gt;:11434/v1

openshell inference &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--provider&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; qwen3.5:9b &lt;span class="nt"&gt;--no-verify&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Clear the Stale Locks
&lt;/h3&gt;

&lt;p&gt;If your agent crashed during setup, clear the locked session files inside the sandbox or your next prompt will time out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; /sandbox/.openclaw-data/agents/main/sessions/&lt;span class="k"&gt;*&lt;/span&gt;.lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 2: The Telegram Integration
&lt;/h2&gt;

&lt;p&gt;NemoClaw has a built-in Telegram bridge, but attempting to run it with the default Nemotron cloud model is notoriously unstable. I found that the connection repeatedly gets dropped. &lt;/p&gt;

&lt;p&gt;Switching the "brain" over to the local LLM we just configured fixes this pipeline entirely. &lt;/p&gt;

&lt;h3&gt;
  
  
  1. Get Your Token
&lt;/h3&gt;

&lt;p&gt;Message the &lt;strong&gt;BotFather&lt;/strong&gt; on Telegram to create a new bot and grab your HTTP API token.&lt;/p&gt;
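
&lt;p&gt;Before handing the token to the bridge, you can verify it directly against the Telegram Bot API (&lt;code&gt;getMe&lt;/code&gt; is a standard method that just echoes your bot's identity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;TOKEN="your_token_here"
curl -s "https://api.telegram.org/bot${TOKEN}/getMe"
# A response beginning {"ok":true,... confirms the token is valid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;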

&lt;h3&gt;
  
  
  2. Export and Start
&lt;/h3&gt;

&lt;p&gt;On your host WSL terminal (not inside the sandbox), pass the token to the service manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;TELEGRAM_BOT_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_token_here"&lt;/span&gt;
nemoclaw start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Troubleshooting Tip:&lt;/strong&gt; If &lt;code&gt;nemoclaw status&lt;/code&gt; says the bridge is running but it keeps crashing, you likely have a stale PID file. Run &lt;code&gt;kill -9 &amp;lt;PID&amp;gt;&lt;/code&gt; to clear the zombie process, run &lt;code&gt;nemoclaw stop&lt;/code&gt;, and try starting it again.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Once that bridge is live, you have a completely private, localized AI agent running on your own GPU that you can text from anywhere in the world.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚠️ Wait, is your agent ghosting you or trapped in a loop?
&lt;/h3&gt;

&lt;p&gt;If you got the bridge running successfully, but your AI is taking forever to respond, spitting out raw JSON, or stuck in an infinite retry loop, you have likely hit the hidden context window trap. &lt;/p&gt;

&lt;p&gt;I wrote a complete follow-up guide on exactly why this happens. Check out &lt;strong&gt;&lt;a href="https://dev.to/minhiu/why-your-custom-nemoclaw-llm-takes-forever-to-respond-or-completely-ignores-you-237h"&gt;Why Your Custom NemoClaw LLM Takes Forever to Respond (Or Completely Ignores You)&lt;/a&gt;&lt;/strong&gt; to learn how to permanently fix the truncation error by building a custom Modelfile.&lt;/p&gt;

&lt;p&gt;Have you experimented with NemoClaw or OpenShell yet? Let me know in the comments if you've hit any other weird WSL networking snags!&lt;/p&gt;

</description>
      <category>agents</category>
      <category>llm</category>
      <category>openclaw</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
