<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: BN</title>
    <description>The latest articles on DEV Community by BN (@bn3020).</description>
    <link>https://dev.to/bn3020</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906874%2F8ba9b1d6-61d5-48b5-9e4d-c46da33d3fc5.png</url>
      <title>DEV Community: BN</title>
      <link>https://dev.to/bn3020</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bn3020"/>
    <language>en</language>
    <item>
      <title>I built a token-level debugger for comparing two LLMs</title>
      <dc:creator>BN</dc:creator>
      <pubDate>Tue, 26 May 2026 00:14:28 +0000</pubDate>
      <link>https://dev.to/bn3020/i-built-a-token-level-debugger-for-comparing-two-llms-5cn3</link>
      <guid>https://dev.to/bn3020/i-built-a-token-level-debugger-for-comparing-two-llms-5cn3</guid>
      <description>&lt;p&gt;Same prompt, two models, different outputs. No tooling was actually showing me where they diverged.&lt;br&gt;
So I built tokenflame, which gives you entropy heatmaps, tokenizer diffs, divergence markers, and token-by-token replay. One command, one HTML file.&lt;br&gt;
pip install tokenflame&lt;/p&gt;

</description>
      <category>llm</category>
      <category>mlops</category>
      <category>rag</category>
    </item>
    <item>
      <title>I built a vector embedding cache that makes stale hits structurally impossible</title>
      <dc:creator>BN</dc:creator>
      <pubDate>Sat, 16 May 2026 21:49:52 +0000</pubDate>
      <link>https://dev.to/bn3020/i-built-a-vector-embedding-cache-that-makes-stale-hits-structurally-impossible-gjo</link>
      <guid>https://dev.to/bn3020/i-built-a-vector-embedding-cache-that-makes-stale-hits-structurally-impossible-gjo</guid>
      <description>&lt;p&gt;Wrote up the design behind embcache, a GPU-native two-tier cache for embeddings and KV states.&lt;/p&gt;

&lt;p&gt;The problem it solves: embedding caches that key on content hash alone silently return stale vectors after a model upgrade or tokenizer change. The cache looks healthy. The vectors are wrong.&lt;/p&gt;

&lt;p&gt;The fix is a composite EmbeddingFingerprint covering model_id, tokenizer hash, chunking strategy, normalization version, prompt template, and dataset version. No partial matches, so no path to a stale hit from a pipeline change.&lt;/p&gt;
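
&lt;p&gt;A minimal sketch of the idea, not embcache's actual API: the class, field, and function names below are assumptions that mirror the axes listed above. The cache key is built from the whole fingerprint plus the content hash, so changing any single field produces a different key and therefore a miss.&lt;/p&gt;

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbeddingFingerprint:
    # Hypothetical field names; the real embcache schema may differ.
    model_id: str
    tokenizer_hash: str
    chunking_strategy: str
    normalization_version: str
    prompt_template: str
    dataset_version: str

    def digest(self) -> str:
        # Canonical JSON so identical fields always hash to the same value.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_key(fingerprint: EmbeddingFingerprint, content: str) -> str:
    # Combine the full fingerprint digest with the content hash.
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"{fingerprint.digest()}:{content_hash}"


class EmbeddingCache:
    """Toy in-memory cache: exact key match only, no partial matches."""

    def __init__(self) -> None:
        self._store = {}

    def get(self, fingerprint: EmbeddingFingerprint, content: str):
        # Returns None on a miss, including after any pipeline change.
        return self._store.get(cache_key(fingerprint, content))

    def put(self, fingerprint: EmbeddingFingerprint, content: str, vector) -> None:
        self._store[cache_key(fingerprint, content)] = list(vector)
```

&lt;p&gt;With this shape, a vector stored under one model_id is simply not found when the same text is looked up under another, so the caller re-embeds instead of reusing a stale vector.&lt;/p&gt;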

&lt;p&gt;Full writeup with benchmarks (98.3% hit rate, 400-450x speedup on KV cache hits) on Medium: &lt;a href="https://bh3r1th.medium.com/the-vector-embedding-cache-bug-that-costs-nothing-and-corrupts-everything-157be6c575e8" rel="noopener noreferrer"&gt;https://bh3r1th.medium.com/the-vector-embedding-cache-bug-that-costs-nothing-and-corrupts-everything-157be6c575e8&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/bh3r1th/embcache" rel="noopener noreferrer"&gt;https://github.com/bh3r1th/embcache&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not on PyPI yet. Looking for feedback, especially on whether the fingerprint schema covers all the axes that could cause a stale hit in your pipeline.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>vectordatabase</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>Most RAG failures don’t crash. They silently return bad answers. I built a repair layer for that.</title>
      <dc:creator>BN</dc:creator>
      <pubDate>Sun, 10 May 2026 01:51:54 +0000</pubDate>
      <link>https://dev.to/bn3020/most-rag-failures-dont-crash-they-silently-return-bad-answers-i-built-a-repair-layer-for-that-2609</link>
      <guid>https://dev.to/bn3020/most-rag-failures-dont-crash-they-silently-return-bad-answers-i-built-a-repair-layer-for-that-2609</guid>
      <description>&lt;p&gt;Most RAG tooling provides a score but fails to specify what actually went wrong.&lt;/p&gt;

&lt;p&gt;I had retrieval failures, grounding issues, and generation going sideways, all collapsed into a single number. There was no way to tell which failure broke which run, and no way to fix it without guessing.&lt;/p&gt;

&lt;p&gt;So I built ragbolt.&lt;/p&gt;

&lt;p&gt;ragbolt is a failure-aware repair layer for RAG pipelines that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects whether the failure originated from retrieval, generation, or grounding&lt;/li&gt;
&lt;li&gt;Applies one bounded repair at a time&lt;/li&gt;
&lt;li&gt;Re-verifies the result&lt;/li&gt;
&lt;li&gt;Emits a full trace to show exactly what changed and why&lt;/li&gt;
&lt;/ul&gt;
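
&lt;p&gt;The loop above can be sketched roughly as follows. This is an illustration under assumptions, not ragbolt's real interface: run_with_repairs, RepairResult, the detect callback, and the repairs mapping are all hypothetical names.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RepairResult:
    answer: str
    ok: bool
    trace: list = field(default_factory=list)


def run_with_repairs(
    answer: str,
    detect: Callable[[str], Optional[str]],
    repairs: dict,
    max_repairs: int = 2,
) -> RepairResult:
    """Detect the failing stage, apply one repair, re-verify, and record each step."""
    trace = []
    repairs_used = 0
    while True:
        # detect returns "retrieval", "generation", "grounding", or None if the answer passes.
        stage = detect(answer)
        if stage is None:
            return RepairResult(answer, True, trace)
        if repairs_used == max_repairs or stage not in repairs:
            # Hard stop: repair budget spent, or no repair known for this stage.
            trace.append({"stage": stage, "action": "stop"})
            return RepairResult(answer, False, trace)
        repaired = repairs[stage](answer)
        trace.append({"stage": stage, "action": "repair", "before": answer, "after": repaired})
        answer = repaired
        repairs_used += 1
```

&lt;p&gt;Each pass applies exactly one repair and then runs detection again, so the trace shows which stage failed, what changed, and where the loop gave up.&lt;/p&gt;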

&lt;p&gt;It’s not a framework.&lt;br&gt;
Not an agent.&lt;br&gt;
Not "self-healing RAG".&lt;/p&gt;

&lt;p&gt;Just a small wrapper around existing RAG pipelines with explicit repair limits, auditability, and a hard stop when confidence breaks down.&lt;/p&gt;

&lt;p&gt;It runs standalone and integrates with LangChain + LlamaIndex.&lt;/p&gt;

&lt;p&gt;pip install ragbolt&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>ai</category>
      <category>deterministic</category>
    </item>
    <item>
      <title>Deterministic reliability stack for LLM pipelines</title>
      <dc:creator>BN</dc:creator>
      <pubDate>Sat, 09 May 2026 18:28:13 +0000</pubDate>
      <link>https://dev.to/bn3020/deterministic-reliability-stack-for-llm-pipelines-24ba</link>
      <guid>https://dev.to/bn3020/deterministic-reliability-stack-for-llm-pipelines-24ba</guid>
      <description>&lt;p&gt;I have spent the last few months wiring up a deterministic reliability stack for structured LLM pipelines.&lt;/p&gt;

&lt;p&gt;Today, LLM Contract Check (locc) and Release Governor went live on PyPI. EGA went live last week.&lt;/p&gt;

&lt;p&gt;The stack is straightforward:&lt;br&gt;
LLM Contract Check - CI contract testing to catch schema regressions.&lt;br&gt;
Release Governor - Blocks staging promotion if malformed outputs leak.&lt;br&gt;
EGA - Runtime enforcement. Forces outputs to ground against source evidence before they move downstream.&lt;/p&gt;
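
&lt;p&gt;To make the first layer concrete, here is a generic sketch of a contract check that could run in CI. It is not the locc API: CONTRACT, its field names, and contract_violations are made up for illustration, and it only checks field presence and top-level types.&lt;/p&gt;

```python
import json

# Hypothetical contract: required field names mapped to expected Python types.
CONTRACT = {"invoice_id": str, "total": float, "line_items": list}


def contract_violations(raw_output: str, contract: dict) -> list:
    """Return human-readable violations; an empty list means the output conforms."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for name, expected in contract.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    # Fields the contract does not know about are also reported.
    for name in sorted(set(data) - set(contract)):
        problems.append(f"unexpected field: {name}")
    return problems
```

&lt;p&gt;A CI job would run this over a fixed set of recorded model outputs and fail the build when any list comes back non-empty.&lt;/p&gt;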

&lt;p&gt;The idea is simple:&lt;br&gt;
don’t wait until production logs or human evals tell you something broke.&lt;/p&gt;

&lt;p&gt;Try to catch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unstable contracts in CI&lt;/li&gt;
&lt;li&gt;leakage before deploy&lt;/li&gt;
&lt;li&gt;unsupported outputs at runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still early.&lt;br&gt;
Not benchmarked.&lt;br&gt;
Definitely not claiming this "solves AI safety."&lt;/p&gt;

&lt;p&gt;I'm mainly looking for engineers building RAG or structured-output systems who are willing to plug pieces of this in and tell me where the assumptions break.&lt;/p&gt;

&lt;p&gt;pip install llm-locc&lt;br&gt;
pip install llm-release-governor&lt;br&gt;
pip install ega&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mlops</category>
      <category>rag</category>
    </item>
    <item>
      <title>EGA: Runtime Enforcement for LLM Outputs (v1.0.0)</title>
      <dc:creator>BN</dc:creator>
      <pubDate>Fri, 01 May 2026 01:36:39 +0000</pubDate>
      <link>https://dev.to/bn3020/ega-runtime-enforcement-for-llm-outputs-v100-1b89</link>
      <guid>https://dev.to/bn3020/ega-runtime-enforcement-for-llm-outputs-v100-1b89</guid>
      <description>&lt;p&gt;I built EGA, a runtime enforcement layer for LLM outputs.&lt;/p&gt;

&lt;p&gt;The problem: eval tools usually score after something already went wrong.&lt;/p&gt;

&lt;p&gt;They do not stop bad outputs from going downstream.&lt;/p&gt;

&lt;p&gt;EGA sits in the runtime path and checks the model output against the source before letting it pass through.&lt;/p&gt;

&lt;p&gt;If something does not have support, it gets dropped or flagged.&lt;/p&gt;
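
&lt;p&gt;A toy sketch of that gate, to show the shape rather than EGA's implementation: gate_output and min_overlap are hypothetical names, and plain word overlap stands in for whatever support check EGA actually uses.&lt;/p&gt;

```python
import re


def _tokens(text: str) -> set:
    # Lowercased alphanumeric words; deliberately naive.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def gate_output(output: str, source: str, min_overlap: float = 0.6):
    """Split the output into sentences; pass those supported by the source, flag the rest."""
    source_tokens = _tokens(source)
    passed, flagged = [], []
    # Naive sentence split on ., ! and ? (it will also split decimals such as 3.5).
    for chunk in re.findall(r"[^.!?]+[.!?]?", output):
        sentence = chunk.strip()
        words = _tokens(sentence)
        if not words:
            continue
        overlap = len(words.intersection(source_tokens)) / len(words)
        if overlap >= min_overlap:
            passed.append(sentence)
        else:
            flagged.append(sentence)
    return passed, flagged
```

&lt;p&gt;Only the passed sentences would move downstream; the flagged ones are either dropped or surfaced for review.&lt;/p&gt;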

&lt;p&gt;v1.0.0 is live on PyPI today.&lt;/p&gt;

&lt;p&gt;This is still early:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;not benchmarked yet&lt;/li&gt;
&lt;li&gt;not production-grade calibration yet&lt;/li&gt;
&lt;li&gt;needs real RAG pipeline feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I am looking for engineers building RAG pipelines who are willing to plug this in and tell me where it breaks.&lt;/p&gt;

&lt;p&gt;pip install ega&lt;br&gt;
GitHub: &lt;a href="https://github.com/bh3r1th/llm-evidence-gated-generation" rel="noopener noreferrer"&gt;https://github.com/bh3r1th/llm-evidence-gated-generation&lt;/a&gt;&lt;br&gt;
PyPI: &lt;a href="https://pypi.org/project/ega/1.0.0/" rel="noopener noreferrer"&gt;https://pypi.org/project/ega/1.0.0/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>mlops</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
