<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mirage995</title>
    <description>The latest articles on DEV Community by Mirage995 (@mirage995).</description>
    <link>https://dev.to/mirage995</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842293%2Fee1dc3f5-3c84-4bf8-998a-a58b299095c0.png</url>
      <title>DEV Community: Mirage995</title>
      <link>https://dev.to/mirage995</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mirage995"/>
    <language>en</language>
    <item>
      <title>I wrapped Gemini Flash with memory and a swarm. It went from 9/12 to 12/12 on a bug benchmark, and the 3 it failed were brutal</title>
      <dc:creator>Mirage995</dc:creator>
      <pubDate>Tue, 24 Mar 2026 23:11:12 +0000</pubDate>
      <link>https://dev.to/mirage995/i-wrapped-gemini-flash-with-memory-and-a-swarm-it-went-from-912-to-1212-on-a-bug-benchmark-and-3ic0</link>
      <guid>https://dev.to/mirage995/i-wrapped-gemini-flash-with-memory-and-a-swarm-it-went-from-912-to-1212-on-a-bug-benchmark-and-3ic0</guid>
      <description>&lt;p&gt;I've been building SHARD for a few months: an agentic scaffold that wraps LLMs with persistent memory, multi-agent&lt;br&gt;
swarms, and a nightly self-study loop. Last night I ran a full benchmark — 12 hard Python bug-fix tasks, naked Gemini Flash vs SHARD wrapping the same model.&lt;/p&gt;

&lt;p&gt;Tasks fully solved: naked 9/12 → SHARD 12/12.&lt;/p&gt;

&lt;p&gt;The 3 the naked model couldn't close on its own are worth examining.&lt;/p&gt;

&lt;p&gt;The 3 tasks the naked LLM failed&lt;/p&gt;

&lt;p&gt;T1 — html_trap (naked: 38.9%, SHARD: 100%)&lt;br&gt;
  HTML rendering pipeline with XSS injection via unescaped f-strings. The model kept fixing the obvious paths and&lt;br&gt;
  missing the edge cases. SHARD's Security reviewer flagged the exact injection vector on attempt 2.&lt;/p&gt;

&lt;p&gt;T10 — template_parser (naked: 20%, SHARD: 100%)&lt;br&gt;
  Real bug from pylint#7993 — regex .+? vs \w+? inside a template parser. Naked model passed 2/10 tests and confidently&lt;br&gt;
  produced wrong output. SHARD passed all 10 on attempt 1 because the GraphRAG had causal context from a prior study&lt;br&gt;
  session on regex semantics.&lt;/p&gt;

&lt;p&gt;T2 — ghost_bug (naked: 93.8%, SHARD: 100%)&lt;br&gt;
  Almost there: the naked model missed 1 test out of 16. SHARD closed it on attempt 3 via the swarm — the EdgeCases&lt;br&gt;
  reviewer found the boundary condition the solo model skipped.&lt;/p&gt;




&lt;p&gt;The architecture that fills the gap&lt;/p&gt;

&lt;p&gt;Attempt 1: LLM solo&lt;br&gt;
  Attempt 2+: Architect → Coder → [Concurrency, Security, EdgeCases, Performance, DataIntegrity]&lt;br&gt;
  On attempt 2+, GraphRAG injects context into the Architect, e.g. "regex .+? matches newlines — causes_conflict with line-by-line parsers"&lt;/p&gt;
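&lt;p&gt;The escalation above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: run_attempt, graph_rag.lookup, and the llm callable are hypothetical stand-ins of mine, not SHARD's actual API.&lt;/p&gt;

```python
# Hypothetical sketch of the attempt escalation: solo call first,
# swarm only on attempt 2+. Names are illustrative, not SHARD's API.
from concurrent.futures import ThreadPoolExecutor

REVIEWERS = ["Concurrency", "Security", "EdgeCases", "Performance", "DataIntegrity"]

def run_attempt(attempt, task, llm, graph_rag):
    if attempt == 1:
        # Attempt 1: one cheap solo call, no swarm overhead.
        return llm(task.prompt)
    # Attempt 2+: GraphRAG context feeds the Architect before the Coder runs.
    context = graph_rag.lookup(task)
    plan = llm(f"Architect: plan a fix for:\n{task.prompt}\nContext: {context}")
    patch = llm(f"Coder: implement this plan:\n{plan}")
    # Every reviewer runs in parallel over the same patch.
    with ThreadPoolExecutor() as pool:
        critiques = list(pool.map(
            lambda role: llm(f"{role} reviewer: critique this patch:\n{patch}"),
            REVIEWERS))
    # Critiques are merged into a single follow-up patch prompt.
    merged = "\n".join(critiques)
    return llm(f"Coder: revise the patch given these critiques:\n{merged}")
```

&lt;p&gt;The point of the shape: attempt 1 stays a single call, and the cost of the Architect/Coder/reviewer round-trip is only paid once the solo model has already failed.&lt;/p&gt;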

&lt;p&gt;Every reviewer runs in parallel. Their critiques are merged into a single patch prompt. If the same test stays stuck&lt;br&gt;
  for 2+ rounds, Focus Mode fires: all reviewers silenced, direct Architect → Coder channel only. This killed a&lt;br&gt;
  previously unresolvable stuck-at-15/16 loop in testing.&lt;/p&gt;
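&lt;p&gt;The stuck-test trigger reduces to a per-test counter of consecutive failures. A minimal sketch, assuming names of my own (FocusTracker and its methods are hypothetical, not the repo's): any single test failing for 2+ rounds in a row flips the run into Focus Mode.&lt;/p&gt;

```python
# Hypothetical sketch of the Focus Mode trigger: track how many
# consecutive rounds each test has failed; fire at 2+.
from collections import defaultdict

STUCK_THRESHOLD = 2

class FocusTracker:
    def __init__(self):
        self.consecutive_fails = defaultdict(int)

    def record_round(self, failing_tests):
        # Reset counters for tests that now pass; bump the ones still failing.
        for name in list(self.consecutive_fails):
            if name not in failing_tests:
                del self.consecutive_fails[name]
        for name in failing_tests:
            self.consecutive_fails[name] += 1

    def focus_mode(self):
        # Once any test has been stuck for 2+ rounds, silence the reviewers
        # and leave only the direct Architect-to-Coder channel.
        return any(n >= STUCK_THRESHOLD for n in self.consecutive_fails.values())
```

&lt;p&gt;Resetting the counter the moment a test passes matters: a test that flickers between pass and fail never triggers Focus Mode, only one that is genuinely stuck.&lt;/p&gt;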




&lt;p&gt;What surprised me&lt;/p&gt;

&lt;p&gt;The cases where SHARD adds zero value (tasks 3-9, 11-12) are exactly the tasks where the bug is local and syntactic.&lt;br&gt;
  One wrong operator, one missing None check. A single LLM call is fine for those.&lt;/p&gt;

&lt;p&gt;The delta appears on structural bugs: things that require understanding the interaction between components, historical&lt;br&gt;
   context about why a pattern breaks, or cross-attempt reasoning. That's where stateless prompting hits its ceiling.&lt;/p&gt;




&lt;p&gt;Try it on your own code&lt;/p&gt;

&lt;p&gt;git clone &lt;a href="https://github.com/Mirage995/shard-v1" rel="noopener noreferrer"&gt;https://github.com/Mirage995/shard-v1&lt;/a&gt;&lt;br&gt;
  cd shard-v1&lt;br&gt;
  python shard_challenge.py buggy.py tests.py&lt;/p&gt;

&lt;p&gt;# With a GitHub repo as extra context (uses Repomix)&lt;br&gt;
  python shard_challenge.py buggy.py tests.py --repo &lt;a href="https://github.com/you/repo" rel="noopener noreferrer"&gt;https://github.com/you/repo&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Full benchmark results, task definitions, and architecture doc are in the repo. Happy to dig into any of the&lt;br&gt;
  components.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
