<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arjun Singh</title>
    <description>The latest articles on DEV Community by Arjun Singh (@ansh0x).</description>
    <link>https://dev.to/ansh0x</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839660%2Fade22745-48f9-48db-b465-9887cbdf9e24.jpeg</url>
      <title>DEV Community: Arjun Singh</title>
      <link>https://dev.to/ansh0x</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ansh0x"/>
    <language>en</language>
    <item>
      <title>Your Agentic AI's Safety System Gets Dumber As It Thinks Longer (And how to fix it)</title>
      <dc:creator>Arjun Singh</dc:creator>
      <pubDate>Mon, 30 Mar 2026 11:35:11 +0000</pubDate>
      <link>https://dev.to/ansh0x/your-agentic-ais-safety-system-gets-dumber-as-it-thinks-longer-2731</link>
      <guid>https://dev.to/ansh0x/your-agentic-ais-safety-system-gets-dumber-as-it-thinks-longer-2731</guid>
      <description>&lt;p&gt;Agentic AI systems fail in production all the time. The usual fix? A strongly-worded system prompt. That's not safety engineering, that's hoping the model behaves. Here's why prompt-based guardrails are fundamentally broken, and what an actual architectural solution looks like.&lt;/p&gt;

&lt;h2&gt;
  The Problem
&lt;/h2&gt;

&lt;p&gt;LLMs generate text by navigating a vector space, finding relevant regions based on the input context. But safety guardrails added via system prompts are themselves just tokens, competing for attention like everything else.&lt;/p&gt;

&lt;p&gt;This introduces two failure modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Jailbreaking&lt;/strong&gt; — because all possible outputs exist somewhere in the model's vector space (a product of pretraining on human-generated text, including harmful content), prompt-based guardrails can only make certain regions harder to reach, not impossible. With the right prompt framing you can always nudge the model's internal state toward those regions and elicit the harmful responses. You can't delete a region from the vector space with a prompt.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context Window Dilution&lt;/strong&gt; — transformers use attention, which is essentially a key-value lookup weighted by relevance. A guardrail system prompt at position 0 of a long context competes with everything that comes after it. As the context fills, nearby tokens dominate attention and the guardrail's influence weakens — it gets "forgotten" not because the model ignores it but because attention naturally prioritizes recent and contextually relevant tokens. The guardrail was never architecturally special, just another token sequence.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
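&lt;p&gt;To see the dilution effect concretely, here's a toy numerical sketch (my own illustration, not the attention math of any specific model): one "guardrail" token competes with an ever-larger pool of equally relevant tokens, and its softmax attention weight shrinks roughly in proportion to 1/context length.&lt;/p&gt;

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def guardrail_weight(context_len, seed=0):
    """Toy model: one query attending over `context_len` keys.
    The guardrail sits at position 0 and gets no special treatment;
    every token draws a similar random relevance score."""
    rng = np.random.default_rng(seed)
    scores = rng.normal(size=context_len)  # all tokens compete equally
    return float(softmax(scores)[0])

# The guardrail's share of attention collapses as the context fills up
for n in (8, 64, 512, 4096):
    print(n, round(guardrail_weight(n), 5))
```

The exact numbers depend on the random scores, but the trend is the point: nothing about position 0 protects the guardrail once thousands of other tokens are competing for the same attention budget.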

&lt;h2&gt;
  The Solution — Overseer Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzty8z7gorzsuieu5eui1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzty8z7gorzsuieu5eui1.png" alt="Diagram showing the LLM passing its output input into Overseer model sitting outside" width="715" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of relying on a guardrail living inside the main model's context, use a separate small fine-tuned LLM as an external validator — the Overseer.&lt;/p&gt;

&lt;h4&gt;
  How it works:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Overseer is initialized once with just the guardrails; its state is fixed at that point&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It never sees the full growing conversation context&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It only ever receives prompt-response pairs from the main model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It's fine-tuned specifically to detect when a response violates the original guardrail intent&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
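&lt;p&gt;A minimal sketch of that loop in Python. Note that &lt;code&gt;main_model&lt;/code&gt;, &lt;code&gt;overseer&lt;/code&gt;, and the ALLOW/BLOCK verdict format are illustrative assumptions, not a specific library's API:&lt;/p&gt;

```python
# Minimal sketch of the Overseer loop. `main_model` and `overseer` are any two
# callables that take a prompt string and return a response string; the names
# and the ALLOW/BLOCK verdict format are assumptions for illustration.

GUARDRAILS = "Never emit commands that delete files. Never reveal credentials."

def overseer_verdict(overseer, prompt, response):
    # The Overseer sees only the fixed guardrails plus one prompt-response
    # pair -- never the main model's growing conversation context.
    judgement = overseer(
        f"Guardrails:\n{GUARDRAILS}\n\n"
        f"User prompt:\n{prompt}\n\n"
        f"Model response:\n{response}\n\n"
        "Does the response violate the guardrails? Answer ALLOW or BLOCK."
    )
    return "BLOCK" not in judgement.upper()

def guarded_generate(main_model, overseer, prompt):
    response = main_model(prompt)
    if overseer_verdict(overseer, prompt, response):
        return response
    return "[response blocked by Overseer]"
```

Because the Overseer's context never grows, its guardrails can't be diluted, and because it's a separate model, the main model's jailbroken state can't reach it.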


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Key Insight&lt;/strong&gt; is separating the guardrail from the generation context entirely, rather than trying to make it persist inside the same context window where it will always eventually lose to dilution.&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://www.linkedin.com/in/ansh0x" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Connect on LinkedIn for further Discussion&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>agents</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Fine-tuning a 0.5B LLM to run on a potato laptop</title>
      <dc:creator>Arjun Singh</dc:creator>
      <pubDate>Mon, 23 Mar 2026 15:03:10 +0000</pubDate>
      <link>https://dev.to/ansh0x/fine-tuning-a-05b-llm-to-run-on-a-potato-laptop-4okb</link>
      <guid>https://dev.to/ansh0x/fine-tuning-a-05b-llm-to-run-on-a-potato-laptop-4okb</guid>
      <description>&lt;p&gt;I wanted a tool that could automate stuff on my laptop without sending everything to the cloud. Problem: my laptop is ancient. There's only so much a laptop with Intel Pentium CPU, 8GB RAM and HDD can do. Regardless of the constraints, I set out the sails to work on it.&lt;/p&gt;

&lt;p&gt;I won't say it's SOTA quality, but it's decent, and it works.&lt;br&gt;
The first problem was very clear: finding an SLM (Small Language Model) that is good, even for its size. After searching for a bit, I settled on &lt;a href="https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct" rel="noopener noreferrer"&gt;Qwen2.5-0.5B-Instruct&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, for the data generation and how I did it. This part was the most tedious. First, I prompted &lt;a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct" rel="noopener noreferrer"&gt;Qwen2.5-7B-Instruct&lt;/a&gt; to generate instructions (like "Open Reddit", "Copy X file from Y to Z", etc.). Then I iterated through each instruction and prompted the model to first generate some paraphrases, and then the corresponding input and output elements. But the data wasn't clean, and I had to regenerate it numerous times. Even after that, when I started fine-tuning, the model kept overfitting. So I took a closer look at the data and found that even after all the regeneration it wasn't consistent: in examples where the &lt;code&gt;directory&lt;/code&gt; field had Windows file paths, the &lt;code&gt;response&lt;/code&gt; still had Linux commands in it. So I went back and regenerated the data from scratch, and then, finally, I had data that was usable for fine-tuning.&lt;br&gt;
The structure I trained the model on looks like this:&lt;br&gt;
&lt;strong&gt;Input prompt&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;lt;task&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"directory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;lt;directories&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Needs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;pass&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;full&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;paths&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;right&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;now&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"available_hotkeys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;lt;list_of_hotkeys_relevant_for_this_task&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iteration_context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;lt;iteration_context&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Usually&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;only&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;used&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;when&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;task&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;repetitive&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Response&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"task_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;lt;task_type&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;predicts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;task&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;types:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;atomic&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;repetitive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;clarification&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"execution_plan"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hotkeys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;hotkey_plan&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"cli"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;cli_plan&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, you might wonder why I chose this structured input-output format instead of streaming the output and doing what people usually call &lt;em&gt;tool-calling&lt;/em&gt;. But let's be honest, there's a problem with that: &lt;em&gt;what if the model fails midway and starts emitting wrong output?&lt;/em&gt; Or worse, &lt;em&gt;destructive output&lt;/em&gt;? &lt;/p&gt;

&lt;p&gt;You might be aware of a recent case that was trending, where Meta researcher Summer Yue told her OpenClaw agent, quote: "Check this inbox too and suggest what you would archive or delete, don’t action until I tell you to", and it started deleting her emails without permission.&lt;/p&gt;

&lt;p&gt;Instead, by having the model output its full plan before execution, you get the power to either let it execute or not. This might be a bad UX choice for some users, though there are hypothetical ways to counter that too, at least in my view; discussing them would be off-topic for this post. &lt;/p&gt;

&lt;p&gt;So that's why I chose to have the model output a full execution plan before taking any action. Anyway, that's how the tool works. It's called ACE: Adaptive Command Executor.&lt;/p&gt;

&lt;p&gt;Though calling it Adaptive might be overkill, given that the model is a bit overfit and doesn't always emit hotkeys correctly, thanks to some beginner mistakes I made during fine-tuning.&lt;/p&gt;

&lt;p&gt;Now for some minor details: besides Qwen as its main generative model, ACE uses &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/" rel="noopener noreferrer"&gt;all-MiniLM-L6-v2&lt;/a&gt; to embed the task and the hotkeys' descriptions, then uses cosine similarity to retrieve hotkeys for the current task. Recomputing the description embeddings on every run would be inefficient, so ACE caches them in an &lt;code&gt;.npz&lt;/code&gt; file. While loading the models, it also checks whether &lt;code&gt;hotkeys.json&lt;/code&gt; has changed and recomputes the embeddings accordingly.&lt;/p&gt;
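&lt;p&gt;Here's roughly how that retrieval-plus-caching could look. This is a sketch under assumptions: &lt;code&gt;embed&lt;/code&gt; stands in for the all-MiniLM-L6-v2 encoder, and the function names are mine for illustration, not ACE's actual code:&lt;/p&gt;

```python
import numpy as np

def cosine_sim(a, b):
    # a: (d,) query vector, b: (n, d) matrix -> (n,) cosine similarities
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b @ a

def load_or_embed(descriptions, embed, cache_path):
    # Cache the description embeddings in an .npz file; recompute only when
    # the description list (i.e. hotkeys.json) has changed since last run.
    try:
        cached = np.load(cache_path)
        if list(cached["descriptions"]) == list(descriptions):
            return cached["embeddings"]
    except FileNotFoundError:
        pass
    embeddings = np.stack([embed(d) for d in descriptions])
    np.savez(cache_path, descriptions=np.array(descriptions),
             embeddings=embeddings)
    return embeddings

def retrieve_hotkeys(task, descriptions, embed, top_k=3,
                     cache_path="hotkey_embeds.npz"):
    embeddings = load_or_embed(descriptions, embed, cache_path)
    sims = cosine_sim(embed(task), embeddings)
    return [descriptions[i] for i in np.argsort(sims)[::-1][:top_k]]
```

With the real encoder, `embed` would be `model.encode(text)` from a loaded SentenceTransformer; anything that maps text to a fixed-size vector works the same way here.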

&lt;p&gt;As for the LoRA configuration, I used &lt;code&gt;r = 16&lt;/code&gt;, &lt;code&gt;alpha = 32&lt;/code&gt;, and &lt;code&gt;dropout = 0.2&lt;/code&gt;, and trained for 1750 steps.&lt;/p&gt;
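&lt;p&gt;For reference, that configuration would look like this with Hugging Face's &lt;code&gt;peft&lt;/code&gt; library. The &lt;code&gt;target_modules&lt;/code&gt; below are an assumption (typical attention-projection choices); the post only states r, alpha, dropout, and the step count:&lt;/p&gt;

```python
# LoRA setup matching the numbers in the post, using Hugging Face `peft`.
# target_modules are assumed -- the post doesn't list which modules were adapted.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
lora_config = LoraConfig(
    r=16,             # rank of the low-rank update matrices
    lora_alpha=32,    # scaling factor (effective scale = alpha / r = 2)
    lora_dropout=0.2,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Only the injected low-rank adapters train; the 0.5B base weights stay frozen, which is what makes this feasible on weak hardware.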

&lt;p&gt;In future versions, I plan to do something similar to search for the files the user wants to operate on, removing the need to pass full paths.&lt;/p&gt;

&lt;p&gt;All of this back and forth between data generation and fine-tuning took about 6 weeks, plus 2 extra weeks of coding to make the model actually work as an automation system. It might've taken less if I had leaned more on AI, but then there wouldn't have been as much learning from facing errors. I learned quite a lot: fine-tuning foremost, and a gist of how RAG systems work while implementing the hotkey retrieval. Beyond that, I learned why data quality is the most important thing, not just in LLM fine-tuning but in training any kind of model. &lt;/p&gt;

&lt;p&gt;Well, that's it. If you want to look at the code or models:&lt;br&gt;
Code: &lt;a href="https://github.com/ansh0x/ace" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
Models: &lt;a href="https://huggingface.co/ansh0x/ace-0.5b-gguf" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
