<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shakib S. </title>
    <description>The latest articles on DEV Community by Shakib S.  (@oppenheimerrick).</description>
    <link>https://dev.to/oppenheimerrick</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3995583%2Fac93ab64-f264-49fc-a283-ced514023fe2.jpg</url>
      <title>DEV Community: Shakib S. </title>
      <link>https://dev.to/oppenheimerrick</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/oppenheimerrick"/>
    <language>en</language>
    <item>
      <title>I Built an Open-Source AI Agent That Benchmarks Itself (And It's Actually Good)</title>
      <dc:creator>Shakib S. </dc:creator>
      <pubDate>Sun, 21 Jun 2026 17:42:25 +0000</pubDate>
      <link>https://dev.to/oppenheimerrick/i-built-an-open-source-ai-agent-that-benchmarks-itself-and-its-actually-good-4dln</link>
      <guid>https://dev.to/oppenheimerrick/i-built-an-open-source-ai-agent-that-benchmarks-itself-and-its-actually-good-4dln</guid>
      <description>&lt;p&gt;&lt;strong&gt;No API costs. No VC funding. Just 3,000 lines of Python and a llama.cpp backend.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Hello I've build yet another terminal IDE too that works out-of-the-box with a local LLM backed.&lt;/p&gt;

&lt;p&gt;It's called &lt;a href="https://github.com/oppenheimer-rick/open-agent" rel="noopener noreferrer"&gt;open-agent&lt;/a&gt;. Running Qwen 3.6 35B on 6GB VRAM with pretty incredible results for a privacy centered setup/environment.&lt;/p&gt;

&lt;p&gt;Here's the deep dive into Why's and How's that'll hopefully help some fellow builders.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Every Agent Framework
&lt;/h2&gt;

&lt;p&gt;I spent months testing LangChain, CrewAI, AutoGen, and the rest.&lt;/p&gt;

&lt;p&gt;They all share the same DNA: &lt;strong&gt;they're API wrappers dressed up as agents&lt;/strong&gt;. You configure a pipeline, wire it to GPT-4, and call it a day. The moment your credit card runs out, so does your agent.&lt;/p&gt;

&lt;p&gt;And the benchmarks? Most frameworks cherry-pick numbers from someone else's paper. They don't run SWE-bench on their own code. They don't prove their agent can &lt;em&gt;actually&lt;/em&gt; fix a bug in a repo it's never seen before.&lt;/p&gt;

&lt;p&gt;I wanted something different.&lt;/p&gt;

&lt;p&gt;A single-file agent that runs on my laptop. No API keys. No cloud dependencies. An agent that &lt;strong&gt;benchmarks itself&lt;/strong&gt; using the same industry standards as OpenAI and DeepSeek — inside its own loop, not in some external harness that hides the cracks.&lt;/p&gt;

&lt;p&gt;So I built one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Meet Open-Agent
&lt;/h2&gt;

&lt;p&gt;Three thousand lines of Python. A single file. Twenty-four tools. Eleven REPL commands. Four benchmarks. Zero API costs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loop.py  ─  The whole thing
benchmark/
  ├── bigcodebench.py    Code synthesis (1140 problems)
  ├── swebench.py        Software engineering (Docker eval)
  ├── agentic_bench.py   Multi-step tool use (10 tasks)
  └── gaia.py            Meta's reasoning benchmark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It doesn't wrap an API. It &lt;strong&gt;is&lt;/strong&gt; the agent. Every tool, every system prompt, every context management trick — it's all right there in one file you can read, modify, and understand.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The core is a &lt;strong&gt;ReAct loop&lt;/strong&gt; — Reason + Act, repeated until the task is done. But it's the details that matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Loop
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. System prompt → injected with your bio, preferences, and 24 tool definitions
2. Preflight    → maps your project, searches the web for context
3. Think        → LLM decides what to do next
4. Act          → executes a tool (edit a file, search the web, run Python)
5. Observe      → feeds the result back into context
6. Repeat       → until the task is complete
7. Return       → final message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing revolutionary on paper. The magic is in what happens between the steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Management That Actually Works
&lt;/h3&gt;

&lt;p&gt;Small models (7B, 14B, 35B) fill their context window fast. The naive approach — keep appending turns until you hit the limit — works for about 20 minutes before the model forgets what it's doing.&lt;/p&gt;

&lt;p&gt;Open-agent uses a &lt;strong&gt;rolling window&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│  System prompt          ─  always kept  │
│  Grounding context      ─  always kept  │
│  Memory / bio / prefs   ─  always kept  │
│  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│  Middle turns           ─  archived     │
│  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│  Last N turns           ─  preserved    │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first 3 messages (system, grounding, memory) stay. The last 9 turns stay. Everything in between gets compressed into a "Shadow Context" summary.&lt;/p&gt;

&lt;p&gt;Result: the agent can run 500+ steps without losing the plot. On a 7B model.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Web-First Philosophy
&lt;/h3&gt;

&lt;p&gt;Large language models are frozen in time. Yours, mine, everyone's. Their training data is at least six months old, often older.&lt;/p&gt;

&lt;p&gt;Open-agent treats the web as its &lt;strong&gt;primary reasoning engine&lt;/strong&gt;, not a fallback.&lt;/p&gt;

&lt;p&gt;Every non-trivial task starts with &lt;code&gt;search_web&lt;/code&gt; — not as a checkbox feature, but as a hard requirement embedded in the system prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You are FORBIDDEN from writing any implementation code during Step 1 and Step 2. Your FIRST action MUST be to call search_web."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This discipline — research first, code second — is what makes small models punch above their weight. A 35B model with good search results beats a 70B model guessing from memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 24 Tools
&lt;/h2&gt;

&lt;p&gt;Tools are the agent's hands. Each one is a Python function registered as an LLM callable via function calling.&lt;/p&gt;

&lt;h3&gt;
  
  
  File Operations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;read_file_section&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reads 20-50 lines (context discipline baked in)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;write_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Creates new files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;patch_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Precision edits — no full rewrites&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;outline_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scans structure without reading content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Search
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_web&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SearXNG + Mojeek fallback, multi-variant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;web_fetch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Downloads pages, smart-slices first 1000 lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scout_website&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Recursive doc hub extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep_codebase&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Regex across the project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;graph_search&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AST-level symbol lookup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Execution
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;run_python&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sandboxed execution (30s timeout)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;run_bash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any shell command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git_status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check what changed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Planning &amp;amp; Memory
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;todo_write / read / update&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mission-critical plan tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;memory_save / load&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Persistent session facts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;consolidate_goals&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scans memory, triggers deep research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;summarize_progress&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shadow Context compression&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Meta &amp;amp; Self-Improvement
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sentinel_map_codebase&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Global project blueprint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;skill_factory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Records patterns as reusable skills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;load_skill&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fetches skill definitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;verify_syntax&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Catches hallucinated syntax errors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Part That Actually Excites Me: Self-Benchmarking
&lt;/h2&gt;

&lt;p&gt;Every agent framework claims performance numbers. Almost none of them &lt;strong&gt;run their own benchmarks inside their own agent loop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Open-agent does.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;benchmark.bigcodebench&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_benchmark&lt;/span&gt;

&lt;span class="c1"&gt;# Same function used in interactive mode
&lt;/span&gt;&lt;span class="nf"&gt;run_benchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_instances&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or from the REPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;/benchmark bigcodebench --instances 50 --subset hard
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent calls &lt;code&gt;run_agent()&lt;/code&gt; — the same function you use in interactive mode — on every benchmark problem. Same tools. Same context management. Same system prompts. No subprocess. No wrapper. No cheating.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four Benchmarks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;BigCodeBench&lt;/strong&gt; — 1,140 code synthesis problems with embedded unittest test cases. Used by Qwen and DeepSeek. Evaluated locally — no external package needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWE-bench Lite&lt;/strong&gt; — 300 real GitHub bugs from 12 popular Python repos. The agent clones each repo, explores the codebase, applies a fix, and produces a git patch. Evaluated with swebench's official Docker harness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic Bench&lt;/strong&gt; — 10 deterministic tool-use tasks: build an OpenAI-compatible proxy for llama.cpp, a model router, a log analyser, a context window visualiser, a skill generator. Everything about self-hosted LLM infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GAIA&lt;/strong&gt; — Meta's gold standard for multi-step reasoning. The agent searches the web, downloads files, processes data, and synthesises answers. Requires HuggingFace auth.&lt;/p&gt;

&lt;p&gt;Each benchmark module is a standalone file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;benchmark/
  bigcodebench.py    ←  imports run_agent() directly
  swebench.py        ←  imports run_agent() in cloned repo
  agentic_bench.py   ←  imports run_agent() in temp dir
  gaia.py            ←  imports run_agent() directly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;No dispatcher layer. No CLI runner. No abstraction indirection.&lt;/strong&gt; Each benchmark is a self-contained function you can call from Python or from the REPL.&lt;/p&gt;




&lt;h2&gt;
  
  
  What 3,000 Lines Buys You
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lines of Python&lt;/td&gt;
&lt;td&gt;3,062&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-callable tools&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REPL commands&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System prompts&lt;/td&gt;
&lt;td&gt;2 (general + coding)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmarks&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search backends&lt;/td&gt;
&lt;td&gt;2 (SearXNG + Mojeek)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;Rolling (12-turn sliding)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File watcher&lt;/td&gt;
&lt;td&gt;Bidirectional (editor ↔ agent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API cost&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It runs on llama.cpp at localhost:8083. It falls back to any OpenAI-compatible endpoint. It never pays per token.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Open Source Matters Here
&lt;/h2&gt;

&lt;p&gt;The agent framework space is crowded with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vendor playthings&lt;/strong&gt; — frameworks designed to sell you API credits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Academic prototypes&lt;/strong&gt; — papers with GitHub repos that haven't been touched in months&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration nightmares&lt;/strong&gt; — YAML files for days&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open-agent is none of those.&lt;/p&gt;

&lt;p&gt;It's a single Python file you can read in an afternoon. It doesn't hide complexity behind abstractions — it puts everything in the open. The benchmarks are real. The evaluation is honest. The tools are practical.&lt;/p&gt;

&lt;p&gt;And because it's a single file, you can fork it, gut it, rewrite the system prompts, add your own tools, and understand every line that runs on your machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Roadmap
&lt;/h2&gt;

&lt;p&gt;What comes next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent orchestration&lt;/strong&gt; — spawn sub-agents for parallel research&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision tools&lt;/strong&gt; — process screenshots and diagrams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term memory&lt;/strong&gt; — vector store for cross-session recall&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket bridge&lt;/strong&gt; — attach to VS Code as a copilot alternative&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the foundation is already solid. An agent that runs locally, works reliably, and tells you honestly how it performs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/your-username/open-agent
&lt;span class="nb"&gt;cd &lt;/span&gt;open-agent
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Start llama.cpp on port 8083&lt;/span&gt;
&lt;span class="c"&gt;# Then:&lt;/span&gt;
python loop.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the benchmarks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; benchmark.bigcodebench &lt;span class="nt"&gt;--instances&lt;/span&gt; 10
python &lt;span class="nt"&gt;-m&lt;/span&gt; benchmark.swebench &lt;span class="nt"&gt;--instances&lt;/span&gt; 5
python &lt;span class="nt"&gt;-m&lt;/span&gt; benchmark.agentic_bench

&lt;span class="c"&gt;# Or from inside the REPL&lt;/span&gt;
&lt;span class="c"&gt;# /benchmark bigcodebench --instances 10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Built with llama.cpp, Python, and the unshakeable belief that local AI is the future. No API keys were harmed in the making of this agent.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agentic</category>
      <category>design</category>
    </item>
  </channel>
</rss>
