<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nikita Benkovich</title>
    <description>The latest articles on DEV Community by Nikita Benkovich (@nikita_benkovich_eb86e54d).</description>
    <link>https://dev.to/nikita_benkovich_eb86e54d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3801565%2F78591683-574b-4474-994c-a33c345b62a5.png</url>
      <title>DEV Community: Nikita Benkovich</title>
      <link>https://dev.to/nikita_benkovich_eb86e54d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nikita_benkovich_eb86e54d"/>
    <language>en</language>
    <item>
      <title>We Investigated Why Claude’s Memory Fails. Here’s What We Learned</title>
      <dc:creator>Nikita Benkovich</dc:creator>
      <pubDate>Wed, 01 Apr 2026 06:47:24 +0000</pubDate>
      <link>https://dev.to/nikita_benkovich_eb86e54d/we-investigated-why-claudes-memory-fails-heres-what-we-learned-3pl6</link>
      <guid>https://dev.to/nikita_benkovich_eb86e54d/we-investigated-why-claudes-memory-fails-heres-what-we-learned-3pl6</guid>
      <description>&lt;p&gt;Memory is arguably the most fundamental capability for a useful agent. Without it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent can't adapt to a user's preferences over time&lt;/li&gt;
&lt;li&gt;It repeats the same mistakes across sessions&lt;/li&gt;
&lt;li&gt;The user has to re-prompt the same rules every conversation ("always use strict typing", "never mock the database in tests", "our auth token expiry is 900s")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a minor UX annoyance — it's the difference between an agent that gets better the more you use it and one that resets to zero every session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem with Claude's native memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Code gives you two options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLAUDE.md&lt;/strong&gt; — you write it, you maintain it. Rules the agent should follow, project context, conventions. Fully manual.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MEMORY.md&lt;/strong&gt; — Claude maintains it automatically, appending notes during sessions. No manual work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MEMORY.md sounds like it solves the problem. In practice it has three hard constraints:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Constraint&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-repo scope&lt;/td&gt;
&lt;td&gt;Global preferences (code style, team conventions) have to be duplicated into every project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-time startup injection&lt;/td&gt;
&lt;td&gt;The file is read once at session start, not re-surfaced when relevant mid-session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size-bounded by position&lt;/td&gt;
&lt;td&gt;Only the first N lines are read — notes get dropped by position, not by relevance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The result: as MEMORY.md grows, older notes fall off the bottom regardless of how important they are. And a rule written on day 1 of project A doesn't exist in project B.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we wanted instead&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We set a design target before writing any code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context-activated&lt;/strong&gt; — hints surface when relevant to the current work, not at startup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No extra user-facing calls&lt;/strong&gt; — injection should be automatic, invisible, fast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unbounded storage&lt;/strong&gt; — adding notes shouldn't degrade performance or relevance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global&lt;/strong&gt; — preferences like "never use duck typing when object structure is defined" should apply across all repos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cheap&lt;/strong&gt; — not a $0.10/tool-call overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy to install&lt;/strong&gt; — one command installation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The core idea: a dedicated LLM whose context is your memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional retrieval (keyword search, vector embeddings) matches on surface similarity. It doesn't understand what the agent is currently doing.&lt;/p&gt;

&lt;p&gt;The approach here is different: instead of a search index, use a separate LLM that holds your stored notes as its context (system prompt), and prompt it with the current run context to retrieve relevant hints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│  Memory LLM                     │
│  system prompt: stored notes    │  ← memory lives here as context
│                                 │
│  user prompt: current transcript│  ← what is the agent doing right now?
│              + upcoming tool    │
│                                 │
│  output: relevant hints         │  ← surfaces only what matters
└─────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM reasons about relevance rather than matching strings. Each note is stored with a plain-language &lt;code&gt;--when&lt;/code&gt; activation condition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cmr-memory write &lt;span class="s2"&gt;"auth token expiry is 900s not 3600s"&lt;/span&gt; &lt;span class="nt"&gt;--when&lt;/span&gt; &lt;span class="s2"&gt;"working on auth, tokens, config.ts"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A note with &lt;code&gt;--when "working on authentication"&lt;/code&gt; will surface when the agent is editing a JWT middleware file, even if neither "authentication" nor "JWT" appears in the transcript at that moment. The memory LLM understands context — it's not doing substring search.&lt;/p&gt;
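&lt;p&gt;As a rough sketch (hypothetical code, not the actual cmr-memory internals), the retrieval prompts could be assembled like this: the stored notes become the memory LLM's system prompt, and the tail of the live transcript becomes the user prompt.&lt;/p&gt;

```python
# Hypothetical sketch only: shows how notes with --when conditions could be
# packed into the memory LLM's system prompt, with the current transcript
# as the user prompt. Function and field names are assumptions.

def build_retrieval_prompts(notes, transcript_tail):
    """notes: list of dicts with 'text' and 'when' keys."""
    lines = [
        "You are a memory retriever. Given the notes below, return only",
        "the notes relevant to what the agent is doing right now.",
    ]
    for i, note in enumerate(notes, 1):
        lines.append(f"{i}. {note['text']} (activate when: {note['when']})")
    system_prompt = "\n".join(lines)
    user_prompt = "Current agent activity:\n" + "\n".join(transcript_tail)
    return system_prompt, user_prompt

notes = [
    {"text": "auth token expiry is 900s not 3600s",
     "when": "working on auth, tokens, config.ts"},
]
system, user = build_retrieval_prompts(notes, ["Edit: src/auth/jwt.ts"])
```

&lt;p&gt;The memory LLM then reasons over both prompts at once, which is what lets a note about "authentication" fire while the agent edits a JWT middleware file.&lt;/p&gt;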

&lt;p&gt;&lt;strong&gt;Deduplication keeps context bounded&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once a hint is injected, it's already in the transcript. Re-injecting it on the next tool call adds tokens but zero information. Before running retrieval, we extract hints already present in the conversation and skip anything that's already there — including semantically equivalent hints worded differently.&lt;/p&gt;

&lt;p&gt;Context overhead stays bounded by the number of unique topics the agent works on in a session, not by session length or total notes stored.&lt;/p&gt;
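&lt;p&gt;The article's dedup is semantic (the memory LLM recognizes equivalently worded hints); a minimal sketch of just the cheap normalized-string pass you would run first might look like this:&lt;/p&gt;

```python
# Simplified deduplication pass (illustrative only): drop any candidate hint
# whose normalized text already appears in the transcript. The real system
# additionally filters semantically equivalent hints via the memory LLM.
import re

def normalize(text):
    return re.sub(r"\s+", " ", text.strip().lower())

def filter_already_injected(candidates, transcript_text):
    seen = normalize(transcript_text)
    fresh = []
    for hint in candidates:
        if normalize(hint) not in seen:
            fresh.append(hint)
    return fresh

hints = ["Auth token expiry is 900s", "Never mock the database in tests"]
transcript = "system-reminder: auth token expiry is 900s"
print(filter_already_injected(hints, transcript))
# prints ['Never mock the database in tests']
```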

&lt;p&gt;&lt;strong&gt;Second core concept: map-reduce over memory chunks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Memory retrieval runs on every tool call — not occasionally, but every single time the agent does anything. That makes latency and cost non-negotiable. A single LLM call over all notes fails on three dimensions simultaneously: context window limits (1,000 notes × ~400 tokens = ~400k tokens), attention degradation in long contexts ("Lost in the Middle", Liu et al. 2023), and cost at scale.&lt;/p&gt;

&lt;p&gt;The solution — split notes into fixed-size chunks and apply a map-reduce pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Notes split into chunks (~50 notes each)
         │
    ┌────┴────┐
  chunk-1  chunk-2  ...  chunk-N     ← map: one parallel Haiku call per chunk
    │         │               │
  hints     hints           hints
    └────┬────┘
       reduce: merge + deduplicate against hints already in context
         │
    inject only new hints as &amp;lt;system-reminder&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Map (scatter)&lt;/strong&gt;: one Haiku call per chunk fires in parallel. Each chunk is small enough for accurate attention. Calls are parallel, so time doesn't grow with memory size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce (gather)&lt;/strong&gt;: merges all candidates, filters against hints already in the transcript, returns only what's new.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We chose Haiku specifically for this role: it's fast, cheap, and performs well on focused tasks with small contexts — exactly what each chunk call is. You don't need a frontier model to decide whether "auth token expiry is 900s" is relevant to what the agent is currently doing.&lt;/p&gt;

&lt;p&gt;Two compounding benefits make this fast and cheap:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parallelism&lt;/strong&gt; — 20 chunks take roughly the same wall-clock time as 1 chunk, because all map calls fire simultaneously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt caching&lt;/strong&gt; — each chunk's notes live in the system prompt, which is stable and never changes once the chunk is sealed (full). Anthropic's prompt caching means repeated retrievals against the same chunk are served from cache — dramatically lower cost and faster responses on every subsequent tool call.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
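&lt;p&gt;A minimal scatter-gather sketch, with a stub standing in for the per-chunk Haiku call (&lt;code&gt;query_chunk&lt;/code&gt; and the note schema are assumptions, not the real implementation):&lt;/p&gt;

```python
# Map-reduce over memory chunks, sketched with a thread pool. In the real
# system each map call is a parallel Haiku request whose system prompt is
# the (prompt-cached) chunk of notes; here query_chunk is a keyword stub.
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 50

def chunked(notes, size=CHUNK_SIZE):
    return [notes[i:i + size] for i in range(0, len(notes), size)]

def query_chunk(chunk, context):
    # Stub for the map step: the real call asks Haiku which notes
    # are relevant to the current context.
    return [n for n in chunk if any(w in context for w in n["when"].split())]

def retrieve(notes, context, already_injected):
    chunks = chunked(notes)
    with ThreadPoolExecutor() as pool:  # map: all chunk calls in parallel
        results = pool.map(lambda c: query_chunk(c, context), chunks)
    merged = [n for hits in results for n in hits]  # reduce: merge
    # reduce: drop hints already present in the conversation
    return [n for n in merged if n["text"] not in already_injected]
```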

&lt;p&gt;&lt;strong&gt;Hooks make it fully automatic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The entire retrieval pipeline runs via Claude Code's PreToolUse/PostToolUse hooks — the agent doesn't call memory explicitly at all.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PreToolUse hook&lt;/strong&gt;: fires before every tool call. Reads the conversation transcript to understand current context, runs scatter-gather retrieval, and injects relevant hints as a &lt;code&gt;&amp;lt;system-reminder&amp;gt;&lt;/code&gt; block. The agent sees the hints without doing anything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostToolUse hook&lt;/strong&gt;: fires after every tool call. Sends a static nudge (~15 tokens, no LLM call) asking the agent whether anything noteworthy just happened. If yes, the agent writes a note. No forced writes — the agent decides.&lt;/li&gt;
&lt;/ul&gt;
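&lt;p&gt;For orientation, hook registration in Claude Code happens through its settings JSON; the sketch below generates a plausible shape for it. The &lt;code&gt;cmr-memory hook pre/post&lt;/code&gt; subcommands are hypothetical placeholders, not documented CLI flags.&lt;/p&gt;

```python
# Illustrative sketch of what "registering the hooks" could produce in
# Claude Code's settings JSON. The hook subcommand names are assumptions.
import json

settings = {
    "hooks": {
        "PreToolUse": [
            {"matcher": "", "hooks": [
                {"type": "command", "command": "cmr-memory hook pre"}]}
        ],
        "PostToolUse": [
            {"matcher": "", "hooks": [
                {"type": "command", "command": "cmr-memory hook post"}]}
        ],
    }
}
print(json.dumps(settings, indent=2))
```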

&lt;p&gt;The transcript is the key input: it gives the memory LLM a full picture of what the agent is working on right now, which is what makes context-activated retrieval possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation: a single CLI package&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything ships as one npm package. Running &lt;code&gt;init&lt;/code&gt; does three things automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Registers the hooks&lt;/strong&gt; — wires PreToolUse and PostToolUse into Claude Code's config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Injects memory rules&lt;/strong&gt; — adds instructions to CLAUDE.md so the agent knows how and when to write notes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configures the memory LLM&lt;/strong&gt; — sets up Haiku as the retrieval model with your Anthropic API key&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two commands to get started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @agynio/cmr-memory
cmr-memory init &lt;span class="nt"&gt;--api-key&lt;/span&gt; sk-ant-...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Now Claude can use memory.&lt;/p&gt;




&lt;p&gt;We built this as part of our open-source research into multi-agent engineering systems at &lt;a href="https://agyn.io" rel="noopener noreferrer"&gt;agyn.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>Coding Agent Teams Outperform Solo Agents: 72.2% on SWE-bench Verified</title>
      <dc:creator>Nikita Benkovich</dc:creator>
      <pubDate>Mon, 02 Mar 2026 12:10:49 +0000</pubDate>
      <link>https://dev.to/nikita_benkovich_eb86e54d/coding-agent-teams-outperform-solo-agents-722-on-swe-bench-verified-4of5</link>
      <guid>https://dev.to/nikita_benkovich_eb86e54d/coding-agent-teams-outperform-solo-agents-722-on-swe-bench-verified-4of5</guid>
      <description>&lt;p&gt;Most AI coding agents work alone. You give them an issue, they figure it out, they hand you a fix. It's the AI equivalent of a lone wolf developer — capable, but not how real software teams actually operate.&lt;/p&gt;

&lt;p&gt;A team of researchers at &lt;a href="https://agyn.io" rel="noopener noreferrer"&gt;Agyn&lt;/a&gt; asked a different question: &lt;strong&gt;what if instead of a single agent, you used a coding agent team — with real roles, real review loops, and real coordination?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The results are hard to ignore.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea: Stop Treating Issue Resolution as a Solo Task
&lt;/h2&gt;

&lt;p&gt;Real software development involves coordination. A problem lands, someone researches it, someone else implements a fix, a reviewer pushes back, things iterate. The system that emerges from that process is more robust than anything one person (or one agent) would ship alone.&lt;/p&gt;

&lt;p&gt;The Agyn system — described in a &lt;a href="https://arxiv.org/abs/2602.01465" rel="noopener noreferrer"&gt;paper published on arXiv&lt;/a&gt; — encodes this directly. Rather than routing a GitHub issue through a single agent with a big context window, it spins up a team:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manager&lt;/strong&gt; — coordinates execution, communication, and knows when to stop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Researcher&lt;/strong&gt; — explores the repository, gathers context, writes the specification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineer&lt;/strong&gt; — implements the fix, debugs failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reviewer&lt;/strong&gt; — evaluates the PR and enforces acceptance criteria&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent has a clearly scoped role, runs in its own isolated sandbox, and communicates through standard GitHub artifacts — commits, PR descriptions, and review comments. Just like a real team would.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Coding Agent Teams Work Better Than Solo Agents
&lt;/h2&gt;

&lt;p&gt;A few design decisions make this more than just "more agents":&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolated execution environments.&lt;/strong&gt; Each agent gets its own sandbox with shell access. No shared filesystem. Agents can install dependencies, run tests, and configure their environment without stepping on each other. Failures are easy to attribute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explicit role enforcement.&lt;/strong&gt; Every role specifies which model to use, what reasoning level, what tools, and what responsibilities. This prevents the "do everything" trap where a single agent accumulates too much context and starts hallucinating. It also means you can allocate expensive, high-reasoning models only where they're needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured communication, not a fixed pipeline.&lt;/strong&gt; The Manager dynamically coordinates execution rather than following a script. If the Reviewer rejects the PR, the Engineer iterates. The system adapts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context management for long tasks.&lt;/strong&gt; Large artifacts are persisted to the filesystem rather than stuffed into the model context. Accumulated context is summarized automatically. This is how you run a system end-to-end on complex issues without it falling apart.&lt;/p&gt;
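&lt;p&gt;To make the role-enforcement idea concrete, here is a purely illustrative role table. This is not Agyn's actual configuration format; the model names and tool lists are assumptions drawn only from the roles described above.&lt;/p&gt;

```python
# Illustrative only: each role pins a model, a reasoning level, and a
# scoped toolset, so no agent accumulates "do everything" context.
ROLES = {
    "manager":    {"model": "gpt-5",       "reasoning": "medium",
                   "tools": ["assign", "message", "close"]},
    "researcher": {"model": "gpt-5",       "reasoning": "medium",
                   "tools": ["shell", "read_repo", "write_spec"]},
    "engineer":   {"model": "gpt-5-codex", "reasoning": "medium",
                   "tools": ["shell", "edit", "run_tests", "open_pr"]},
    "reviewer":   {"model": "gpt-5",       "reasoning": "medium",
                   "tools": ["read_pr", "comment", "approve", "reject"]},
}

def model_for(role):
    return ROLES[role]["model"]
```

&lt;p&gt;The point of the structure is visible even in a toy table: the expensive coding-specialized model goes only where code is written, and every other role runs a cheaper configuration.&lt;/p&gt;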

&lt;h2&gt;
  
  
  The Benchmark Results
&lt;/h2&gt;

&lt;p&gt;The team evaluated the system on SWE-bench Verified — a widely used benchmark where models must resolve real GitHub issues by modifying codebases and producing PRs that pass the project's test suite.&lt;/p&gt;

&lt;p&gt;The system resolved &lt;strong&gt;72.2% of tasks&lt;/strong&gt;, using GPT-5 and GPT-5-Codex at medium reasoning levels.&lt;/p&gt;

&lt;p&gt;Here's how that compares to other top systems at evaluation time:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Model(s)&lt;/th&gt;
&lt;th&gt;Resolved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;agyn&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-5 / GPT-5-Codex (medium reasoning)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenHands&lt;/td&gt;
&lt;td&gt;GPT-5 (high reasoning)&lt;/td&gt;
&lt;td&gt;71.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mini-SWE-agent&lt;/td&gt;
&lt;td&gt;GPT-5.2 (high reasoning)&lt;/td&gt;
&lt;td&gt;71.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mini-SWE-agent&lt;/td&gt;
&lt;td&gt;GPT-5 (medium reasoning)&lt;/td&gt;
&lt;td&gt;65.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key detail: &lt;strong&gt;this system wasn't tuned for the benchmark&lt;/strong&gt;. The same prompts, role definitions, tools, and execution model used in production were applied directly. It outperformed competitors using higher-reasoning model variants — without needing them.&lt;/p&gt;

&lt;p&gt;The 7.2-point gain over the single-agent baseline using the same model class (72.2% vs. 65.0% for mini-SWE-agent with GPT-5 at medium reasoning) comes purely from the team structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Agent Design
&lt;/h2&gt;

&lt;p&gt;The paper makes an argument that's easy to overlook in the current race to improve models: &lt;strong&gt;organizational design matters as much as model quality&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We've spent a lot of energy making individual models smarter. But real-world software development scaled because of &lt;em&gt;how teams work&lt;/em&gt; — division of labor, code review, shared artifacts, iteration. Replicating that structure in an agent system produces measurable gains without touching the underlying model.&lt;/p&gt;

&lt;p&gt;The results suggest a few things worth taking seriously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Role separation reduces errors.&lt;/strong&gt; When each agent has a narrow job, there's less opportunity for confusion and accumulated mistakes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review loops improve output quality.&lt;/strong&gt; Having a dedicated Reviewer that can send work back to the Engineer catches problems before they become permanent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You don't always need the biggest model.&lt;/strong&gt; Allocating medium-reasoning models across a well-structured team can beat a single high-reasoning agent doing everything.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;The Agyn platform is open source on GitHub: &lt;a href="https://github.com/agynio/platform" rel="noopener noreferrer"&gt;https://github.com/agynio/platform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We believe the future is not a single general-purpose “super agent,” but &lt;strong&gt;teams of specialized agents&lt;/strong&gt;, organized the way real organizations operate. Different roles. Different responsibilities. Clear coordination. Explicit review. Shared context.&lt;/p&gt;

&lt;p&gt;And we’re building toward that vision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coming Next
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Flexible, Modular Agent Organizations
&lt;/h4&gt;

&lt;p&gt;Instead of a fixed pipeline, you’ll be able to compose agent teams like building blocks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define custom roles&lt;/li&gt;
&lt;li&gt;Assign different models per role&lt;/li&gt;
&lt;li&gt;Configure tools and permissions&lt;/li&gt;
&lt;li&gt;Isolate execution environments&lt;/li&gt;
&lt;li&gt;Design explicit coordination flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not a monolith. An organization.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. New Agent Communication Paradigms
&lt;/h4&gt;

&lt;p&gt;Real teams do not operate in a single synchronous loop. They:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open threads
&lt;/li&gt;
&lt;li&gt;Leave structured comments
&lt;/li&gt;
&lt;li&gt;Request reviews
&lt;/li&gt;
&lt;li&gt;Resume work later
&lt;/li&gt;
&lt;li&gt;Escalate decisions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are introducing structured communication protocols between agents, including &lt;strong&gt;asynchronous collaboration&lt;/strong&gt;, so coordination can happen across time, not just across steps.&lt;/p&gt;

&lt;p&gt;The lone wolf agent had a good run. The team might take it from here.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Paper: &lt;a href="https://arxiv.org/abs/2602.01465" rel="noopener noreferrer"&gt;Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering&lt;/a&gt; — Nikita Benkovich, Vitalii Valkov (2026)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Blog post: &lt;a href="https://agyn.io/blog/we-tested-ai-team-swe-bench-verified" rel="noopener noreferrer"&gt;We tested how an AI team improves issue resolution on SWE-bench Verified&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwareengineering</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
