<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hieu Pham</title>
    <description>The latest articles on DEV Community by Hieu Pham (@minhiu).</description>
    <link>https://dev.to/minhiu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F387749%2Fc4816425-d3a8-4c9c-b553-2a73d08083f7.jpg</url>
      <title>DEV Community: Hieu Pham</title>
      <link>https://dev.to/minhiu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/minhiu"/>
    <language>en</language>
    <item>
      <title>Why Your Custom NemoClaw LLM Takes Forever to Respond (Or Completely Ignores You)</title>
      <dc:creator>Hieu Pham</dc:creator>
      <pubDate>Tue, 24 Mar 2026 18:45:16 +0000</pubDate>
      <link>https://dev.to/minhiu/why-your-custom-nemoclaw-llm-takes-forever-to-respond-or-completely-ignores-you-237h</link>
      <guid>https://dev.to/minhiu/why-your-custom-nemoclaw-llm-takes-forever-to-respond-or-completely-ignores-you-237h</guid>
      <description>&lt;p&gt;You finally set up a local AI agent to help you tackle your dev backlog (if you haven't yet, check out my guide on &lt;a href="https://dev.to/minhiu/how-to-run-nemoclaw-with-a-local-llm-connect-to-telegram-without-losing-your-mind-3lk"&gt;how to run NemoClaw with a local LLM &amp;amp; connect to Telegram&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;The goal is simple: feed it your local codebase so it can help you refactor complex components, map out new business logic, or write comprehensive unit tests—all without sending proprietary company code to an external API.&lt;/p&gt;

&lt;p&gt;You fire up an agentic framework like NemoClaw on your RTX 4080, paste in your prompt, and... the agent completely loses its mind. &lt;/p&gt;

&lt;p&gt;Instead of writing code, it either ghosts you, dumps a wall of unformatted JSON into your terminal, or gets trapped in an infinite retry loop that fires every three seconds until the session crashes. &lt;/p&gt;

&lt;p&gt;After spending a full day digging through API logs, I realized this isn't a network bug. It is a fundamental flaw in how local agent frameworks handle context windows, and it affects almost every developer trying to build private AI workflows. &lt;/p&gt;

&lt;p&gt;If your local agent is stuck in an infinite loop or timing out, here is the exact architectural bottleneck causing it, and how to permanently fix it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Root Cause: The Hidden ReAct Loop Trap
&lt;/h2&gt;

&lt;p&gt;Frameworks like NemoClaw, AutoGen, and LangChain operate on a "Reasoning and Acting" (ReAct) loop. To make the AI autonomous, the framework secretly injects a massive set of invisible system instructions, tool schemas, and strict JSON formatting rules before it even attaches your actual question.&lt;/p&gt;

&lt;p&gt;By the time you ask the agent to review a few hundred lines of code, your total prompt size easily explodes past &lt;strong&gt;12,000 tokens&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Here is where the pipeline breaks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The 4k Wall:&lt;/strong&gt; By default, local inference engines like Ollama cap their context window at 4,096 tokens to save VRAM. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Decapitation:&lt;/strong&gt; When the framework sends that massive 12k-token prompt, the inference engine blindly chops off the oldest 8,000 tokens to make it fit. Unfortunately, those oldest tokens contain the framework's critical JSON formatting rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Infinite Loop:&lt;/strong&gt; The lobotomized model replies with broken, plain-text formatting. The framework's parser catches the bad JSON, slaps the model on the wrist, and automatically replies: &lt;em&gt;"Invalid JSON schema, try again."&lt;/em&gt; The model tries again, gets truncated again, and you are officially trapped in a rapid-fire retry loop that hammers your GPU until the 60-second gateway timeout drops the connection.&lt;/li&gt;
&lt;/ol&gt;
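
&lt;p&gt;If you want to confirm the truncation is actually happening, Ollama's server logs usually warn when a prompt blows past the context window (a quick sketch; the exact log wording varies by version, and this assumes Ollama runs as a systemd service):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Tail the server logs for truncation warnings while the agent runs
journalctl -u ollama -f | grep -i truncat
# A line like "truncating input prompt ... limit=4096" confirms the 4k wall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;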

&lt;h2&gt;
  
  
  The False Fix (&lt;code&gt;OLLAMA_NUM_CTX&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Your first instinct as a lead developer is probably to just restart the server and force a larger context window via an environment variable: &lt;code&gt;OLLAMA_NUM_CTX=16384 ollama serve&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This will not work.&lt;/strong&gt; Most agent frameworks communicate with Ollama via the OpenAI compatibility endpoint (&lt;code&gt;/v1/chat/completions&lt;/code&gt;). If the client framework doesn't explicitly declare a custom context size in its JSON payload, that specific endpoint completely ignores your environment variable and forces the model back to its 4k default.&lt;/p&gt;
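
&lt;p&gt;You can see the asymmetry for yourself. Ollama's native endpoint accepts a per-request &lt;code&gt;num_ctx&lt;/code&gt; in its &lt;code&gt;options&lt;/code&gt; object, but the OpenAI-compatible payload has no equivalent field (a sketch assuming Ollama on its default port 11434):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Native API: per-request context size is honored
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:14b",
  "prompt": "ping",
  "options": { "num_ctx": 16384 }
}'

# OpenAI-compatible API: nowhere to put num_ctx,
# so the model silently falls back to its baked-in default
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "qwen2.5:14b",
  "messages": [{ "role": "user", "content": "ping" }]
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;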

&lt;p&gt;To fix this, you have to take the setting out of the request path entirely and bake the larger context window directly into the model's DNA. &lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Fix: Building a Custom Modelfile
&lt;/h2&gt;

&lt;p&gt;First, you need a highly capable "Instruct" model. With 16GB of VRAM on an RTX 4080, you have the perfect amount of hardware headroom to run a brilliant mid-weight model (like &lt;code&gt;qwen2.5:14b&lt;/code&gt;) &lt;em&gt;and&lt;/em&gt; a massive 16k context window without spilling over into agonizingly slow system RAM. &lt;/p&gt;

&lt;h3&gt;
  
  
  1. Bake the 16k Context into the DNA
&lt;/h3&gt;

&lt;p&gt;In your terminal, create a custom Ollama model with the 16k limit hardcoded using a &lt;code&gt;Modelfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"FROM qwen2.5:14b"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Modelfile
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"PARAMETER num_ctx 16384"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; Modelfile
ollama create qwen14b-agent-16k &lt;span class="nt"&gt;-f&lt;/span&gt; Modelfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
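
&lt;p&gt;Before rewiring the gateway, it's worth confirming the parameter actually stuck (on recent Ollama builds, &lt;code&gt;ollama show&lt;/code&gt; prints a model's baked-in parameters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama show qwen14b-agent-16k
# The Parameters section should now list: num_ctx 16384
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;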



&lt;h3&gt;
  
  
  2. Update the Gateway Route
&lt;/h3&gt;

&lt;p&gt;Tell your framework's API gateway to route all inference to your newly minted, wide-context model. (For OpenShell/NemoClaw, it looks like this):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openshell inference &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--provider&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; qwen14b-agent-16k
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Wipe the Corrupted Memory
&lt;/h3&gt;

&lt;p&gt;Because your agent just spent the last 20 minutes screaming at itself in broken JSON, its session history is deeply corrupted. If you don't wipe it, the memory manager will crash trying to read the garbage data on your next prompt. Clear out the session storage before testing again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# For NemoClaw users:&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; /sandbox/.openclaw-data/agents/main/sessions/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Because the massive system prompt is no longer being decapitated, the &lt;code&gt;14b&lt;/code&gt; model perfectly understands the framework's JSON instructions. It can hold its tool schemas, its system prompt, and your entire codebase in its head simultaneously.&lt;/p&gt;

&lt;p&gt;It executes its tool calls seamlessly and replies in natural language in just a few seconds. &lt;/p&gt;

&lt;p&gt;You now have a lightning-fast, fully autonomous local agent running securely on your own hardware, taking full advantage of that 16GB of VRAM. &lt;/p&gt;

&lt;p&gt;Have you tried pushing the limits of your GPU with local agent frameworks? Let me know your stack in the comments!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>nemoclaw</category>
      <category>ai</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>How to Run NemoClaw with a Local LLM &amp; Connect to Telegram (Without Losing Your Mind)</title>
      <dc:creator>Hieu Pham</dc:creator>
      <pubDate>Tue, 24 Mar 2026 16:51:54 +0000</pubDate>
      <link>https://dev.to/minhiu/how-to-run-nemoclaw-with-a-local-llm-connect-to-telegram-without-losing-your-mind-3lk</link>
      <guid>https://dev.to/minhiu/how-to-run-nemoclaw-with-a-local-llm-connect-to-telegram-without-losing-your-mind-3lk</guid>
      <description>&lt;p&gt;I just spent a full day wrestling with NemoClaw so you don’t have to. &lt;/p&gt;

&lt;p&gt;NemoClaw is an incredible agentic framework, but because it is still in beta, it has its fair share of quirks, undocumented networking hurdles, and strict kernel-level sandboxing that will block your local connections by default. &lt;/p&gt;

&lt;p&gt;My goal was to run a fully private, locally hosted AI agent using a local LLM that I could text from my phone via Telegram. Working with an RTX 4080 and its strict 16GB VRAM limit meant I had to optimize my model choice and bypass a maze of container networks to get everything talking. &lt;/p&gt;

&lt;p&gt;If you are trying to ditch the cloud and run OpenClaw locally on WSL2, here is the exact step-by-step fix to get your agent online.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: Escaping the Sandbox (Connecting the Local LLM)
&lt;/h2&gt;

&lt;p&gt;By default, NemoClaw runs your agent inside a nested Kubernetes (&lt;code&gt;k3s&lt;/code&gt;) container within WSL2. If you try to point it to your local Ollama instance using &lt;code&gt;localhost&lt;/code&gt; or the default Docker bridge, the sandbox's strict egress policies will hit you with an endless stream of &lt;code&gt;HTTP 503&lt;/code&gt; errors. &lt;/p&gt;

&lt;p&gt;To fix this, we have to route the traffic out the "front door" via your primary WSL network interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Force Ollama to Listen on All Interfaces
&lt;/h3&gt;

&lt;p&gt;Stop your background Ollama service and force it to broadcast to your local WSL network:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop ollama
&lt;span class="nv"&gt;OLLAMA_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0.0.0 ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
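
&lt;p&gt;Running &lt;code&gt;ollama serve&lt;/code&gt; in the foreground works for testing, but it dies with your terminal. If your Ollama was installed as a systemd service, you can make the binding persistent with a drop-in override instead (a sketch; skip this if your install doesn't use systemd):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;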



&lt;h3&gt;
  
  
  2. Grab Your Primary WSL IP
&lt;/h3&gt;

&lt;p&gt;In a new terminal tab (outside the OpenShell sandbox), grab your virtual machine's IP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;WSL_IP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ip &lt;span class="nt"&gt;-4&lt;/span&gt; addr show eth0 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-Po&lt;/span&gt; &lt;span class="s1"&gt;'inet \K[\d.]+'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
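
&lt;p&gt;Before wiring the gateway, sanity-check that Ollama is actually reachable on that IP (&lt;code&gt;/api/tags&lt;/code&gt; simply lists your installed models):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://$WSL_IP:11434/api/tags
# A JSON list of models means the route is open;
# "connection refused" means OLLAMA_HOST didn't take effect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;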



&lt;h3&gt;
  
  
  3. Wire the OpenShell Gateway
&lt;/h3&gt;

&lt;p&gt;Delete the broken default provider and recreate it pointing to your WSL IP, then set your inference route (I used &lt;code&gt;qwen3.5:9b&lt;/code&gt; as it comfortably fits my hardware constraints):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openshell provider delete ollama

openshell provider create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; ollama &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; openai &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--credential&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;empty &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://&lt;span class="nv"&gt;$WSL_IP&lt;/span&gt;:11434/v1

openshell inference &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--provider&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; qwen3.5:9b &lt;span class="nt"&gt;--no-verify&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Clear the Stale Locks
&lt;/h3&gt;

&lt;p&gt;If your agent crashed during setup, clear the locked session files inside the sandbox or your next prompt will time out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; /sandbox/.openclaw-data/agents/main/sessions/&lt;span class="k"&gt;*&lt;/span&gt;.lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 2: The Telegram Integration
&lt;/h2&gt;

&lt;p&gt;NemoClaw has a built-in Telegram bridge, but attempting to run it with the default Nemotron cloud model is notoriously unstable. I found that the connection repeatedly gets dropped. &lt;/p&gt;

&lt;p&gt;Switching the "brain" over to the local LLM we just configured fixes this pipeline entirely. &lt;/p&gt;

&lt;h3&gt;
  
  
  1. Get Your Token
&lt;/h3&gt;

&lt;p&gt;Message the &lt;strong&gt;BotFather&lt;/strong&gt; on Telegram to create a new bot and grab your HTTP API token.&lt;/p&gt;
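
&lt;p&gt;Before handing the token to the bridge, you can verify it directly against the Telegram Bot API (&lt;code&gt;getMe&lt;/code&gt; is a standard method that just echoes your bot's identity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;TOKEN="your_token_here"
curl -s "https://api.telegram.org/bot${TOKEN}/getMe"
# A response beginning {"ok":true,... confirms the token is valid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;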

&lt;h3&gt;
  
  
  2. Export and Start
&lt;/h3&gt;

&lt;p&gt;On your host WSL terminal (not inside the sandbox), pass the token to the service manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;TELEGRAM_BOT_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_token_here"&lt;/span&gt;
nemoclaw start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Troubleshooting Tip:&lt;/strong&gt; If &lt;code&gt;nemoclaw status&lt;/code&gt; says the bridge is running but it keeps crashing, you likely have a stale PID file. Run &lt;code&gt;kill -9 &amp;lt;PID&amp;gt;&lt;/code&gt; to clear the zombie process, run &lt;code&gt;nemoclaw stop&lt;/code&gt;, and try starting it again.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Once that bridge is live, you have a completely private, localized AI agent running on your own GPU that you can text from anywhere in the world.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚠️ Wait, is your agent ghosting you or trapped in a loop?
&lt;/h3&gt;

&lt;p&gt;If you got the bridge running successfully, but your AI is taking forever to respond, spitting out raw JSON, or stuck in an infinite retry loop, you have likely hit the hidden context window trap. &lt;/p&gt;

&lt;p&gt;I wrote a complete follow-up guide on exactly why this happens. Check out &lt;strong&gt;&lt;a href="https://dev.to/minhiu/why-your-custom-nemoclaw-llm-takes-forever-to-respond-or-completely-ignores-you-237h"&gt;Why Your Custom NemoClaw LLM Takes Forever to Respond (Or Completely Ignores You)&lt;/a&gt;&lt;/strong&gt; to learn how to permanently fix the truncation error by building a custom Modelfile.&lt;/p&gt;

&lt;p&gt;Have you experimented with NemoClaw or OpenShell yet? Let me know in the comments if you've hit any other weird WSL networking snags!&lt;/p&gt;

</description>
      <category>agents</category>
      <category>llm</category>
      <category>openclaw</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
