<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Richard Simmons</title>
    <description>The latest articles on DEV Community by Richard Simmons (@lam8da).</description>
    <link>https://dev.to/lam8da</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3868113%2F26c7a40a-0e11-46d6-b9d3-d566fc7d71ab.png</url>
      <title>DEV Community: Richard Simmons</title>
      <link>https://dev.to/lam8da</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lam8da"/>
    <language>en</language>
    <item>
      <title>I Built a Tool to Test Whether Multiple LLMs Working Together Can Beat a Single Model</title>
      <dc:creator>Richard Simmons</dc:creator>
      <pubDate>Wed, 08 Apr 2026 19:58:08 +0000</pubDate>
      <link>https://dev.to/lam8da/i-built-a-tool-to-test-whether-multiple-llms-working-together-can-beat-a-single-model-4g0l</link>
      <guid>https://dev.to/lam8da/i-built-a-tool-to-test-whether-multiple-llms-working-together-can-beat-a-single-model-4g0l</guid>
      <description>&lt;h2&gt;
  
  
  The Question
&lt;/h2&gt;

&lt;p&gt;Can you get a better answer by having multiple LLMs collaborate than by just asking one directly?&lt;/p&gt;

&lt;p&gt;That's the thesis behind &lt;strong&gt;&lt;a href="https://github.com/rich1398/Multi-Model-Benchmarking" rel="noopener noreferrer"&gt;Occursus Benchmark&lt;/a&gt;&lt;/strong&gt; — an open-source benchmarking platform that systematically tests multi-model LLM synthesis pipelines against single-model baselines across 4 providers and 29 orchestration strategies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblvsgswu2tl4zhn6o81f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblvsgswu2tl4zhn6o81f.png" alt=" " width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsy9oj2n8xi0a5acqurr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsy9oj2n8xi0a5acqurr.png" alt=" " width="800" height="707"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo49bgqv6hec4mte5qqfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo49bgqv6hec4mte5qqfl.png" alt=" " width="800" height="548"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;Occursus Benchmark runs the same task through 29 different orchestration strategies — from a simple single-model call to a 13-call graph-mesh collaboration — and scores every output using &lt;strong&gt;dual blind judging&lt;/strong&gt; (two frontier models score independently on a 0-100 scale, averaged). This tells you whether adding pipeline complexity actually improves quality, or just burns tokens and money.&lt;/p&gt;
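
&lt;p&gt;In code terms the core loop is tiny. Here is a minimal Python sketch of the idea, using illustrative names (&lt;code&gt;pipeline.run&lt;/code&gt;, &lt;code&gt;judge.score&lt;/code&gt;) rather than the repo's actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of the benchmark loop: every (pipeline, task) cell
# produces one output, each output gets two independent judge scores,
# and the cell score is their average. All names are illustrative.

def benchmark(pipelines, tasks, judges):
    results = {}
    for pipeline in pipelines:
        for task in tasks:
            output = pipeline.run(task)  # anywhere from 1 to 17 calls
            scores = [judge.score(task, output) for judge in judges]
            results[(pipeline.name, task.name)] = sum(scores) / len(scores)
    return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;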

&lt;p&gt;The tool supports &lt;strong&gt;4 LLM providers&lt;/strong&gt;: Ollama (local/free), OpenAI (GPT-5.4), Anthropic (Claude Opus 4.6), and Google (Gemini 2.5 Pro). You toggle models on and off; the tool auto-assigns them to pipeline roles (generator, critic, synthesizer, reviewer).&lt;/p&gt;

&lt;h3&gt;
  
  
  The 29 Pipelines
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Pipelines&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1 — Baseline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single Model, Best of 3, Sample &amp;amp; Vote&lt;/td&gt;
&lt;td&gt;Direct call and simple selection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2 — Synthesis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full Merge, Critique Then Merge, Ranked Merge&lt;/td&gt;
&lt;td&gt;Multi-persona generation + synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3 — Adversarial&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2-Way Debate, Dissent Merge, Red Team/Blue Team, Expert Routing, Constraint Checker&lt;/td&gt;
&lt;td&gt;Models challenge each other's work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4 — Deep&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chain of Verification, Iterative Refinement, Mixture of Agents, Self-MoA, Adaptive Debate, Reflexion&lt;/td&gt;
&lt;td&gt;Multi-round reasoning loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5 — Experimental&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Persona Council, Adversarial Decomposition, Reverse Engineer, Tournament, Graph-Mesh, Mesh+Verify, Mesh+Ranked, GSV, Mesh+Ranked+Verify, Adaptive Cascade, Managed Team, Corp Hierarchy&lt;/td&gt;
&lt;td&gt;Heavy orchestration and combination pipelines (2-17 calls)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Several pipelines implement architectures from recent research papers (Self-MoA, the simplest of these, is sketched after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-MoA&lt;/strong&gt; (Princeton 2025): Same-model sampling outperforms multi-model mixing by 6.6%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Debate / A-HMAD&lt;/strong&gt; (2025): Specialist debaters achieved +13.2% over baselines on GSM8K&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reflexion&lt;/strong&gt; (2023+): Verbal self-reflection memory produces &amp;gt;18% accuracy gains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph-Mesh&lt;/strong&gt; (MultiAgentBench ACL 2025): All-to-all topology outperforms star/chain/tree&lt;/li&gt;
&lt;/ul&gt;
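
&lt;p&gt;To make one of these concrete: Self-MoA boils down to sampling a single model several times and then letting the same model synthesize its own candidates. A minimal sketch, assuming a hypothetical async &lt;code&gt;call_model(prompt, temperature)&lt;/code&gt; wrapper:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Self-MoA in miniature: sample ONE model several times, then have the
# same model aggregate its own samples. call_model() is a stand-in for
# any provider client, not the repo's actual API.
import asyncio

async def self_moa(call_model, prompt, k=4):
    # High temperature buys diversity from a single model.
    samples = await asyncio.gather(
        *[call_model(prompt, temperature=0.9) for _ in range(k)]
    )
    numbered = "\n\n".join(
        f"Candidate {i + 1}:\n{s}" for i, s in enumerate(samples)
    )
    # The same model synthesizes a final answer from its own candidates.
    return await call_model(
        f"Combine the best parts of these answers into one:\n\n{numbered}",
        temperature=0.2,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;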




&lt;h2&gt;
  
  
  Three Ways to Call LLMs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Speed (29 pipelines x 8 tasks)&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10-14 hours&lt;/td&gt;
&lt;td&gt;~$50-80&lt;/td&gt;
&lt;td&gt;Fastest, full parameter control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hybrid&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12-18 hours&lt;/td&gt;
&lt;td&gt;~$15-25&lt;/td&gt;
&lt;td&gt;Best balance — CLI primary, API for parallelism&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subscription CLI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30-45 hours&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Cheapest, overnight/weekend runs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  API Mode
&lt;/h3&gt;

&lt;p&gt;Standard REST API calls. Full control over temperature, token limits, and concurrency. Uses flagship models: Claude Opus 4.6, GPT-5.4, Gemini 2.5 Pro.&lt;/p&gt;
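
&lt;p&gt;The parameter and concurrency control is the part the CLIs can't offer. A sketch of bounded parallelism, assuming a hypothetical &lt;code&gt;provider.complete&lt;/code&gt; client rather than the repo's actual wrapper:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Bounded concurrency for API mode. provider.complete() is a
# hypothetical client method, not the repo's actual interface.
import asyncio

MAX_CONCURRENT = 8
_sem = asyncio.Semaphore(MAX_CONCURRENT)

async def call_api(provider, model, prompt, temperature=0.7, max_tokens=4096):
    async with _sem:  # never exceed MAX_CONCURRENT in-flight requests
        return await provider.complete(
            model=model,
            prompt=prompt,
            temperature=temperature,
            max_tokens=max_tokens,
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;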

&lt;h3&gt;
  
  
  Hybrid Mode (Recommended)
&lt;/h3&gt;

&lt;p&gt;CLI primary ($0), API as parallel fallback. When a subscription CLI call would be blocked waiting for another call to the same provider, the system automatically uses API instead. If API credits run out mid-run, it &lt;strong&gt;gracefully falls back to CLI-only&lt;/strong&gt; — no data loss, just slower. ~2x faster than pure CLI at ~30% of pure API cost.&lt;/p&gt;
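
&lt;p&gt;The routing decision itself is simple enough to sketch. The names below (&lt;code&gt;call_api&lt;/code&gt;, &lt;code&gt;call_cli&lt;/code&gt;, &lt;code&gt;OutOfCreditsError&lt;/code&gt;) are illustrative, not the repo's actual identifiers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hybrid routing in a nutshell: prefer the free CLI, spill over to the
# API when that provider's CLI is busy, and stop touching the API once
# credits run out. OutOfCreditsError is a hypothetical exception type.

async def hybrid_call(provider, prompt, state):
    cli_busy = state.cli_lock[provider].locked()
    if cli_busy and state.api_credits_ok:
        try:
            return await call_api(provider, prompt)
        except OutOfCreditsError:
            state.api_credits_ok = False  # degrade to CLI-only, keep going
    async with state.cli_lock[provider]:  # one CLI call per provider
        return await call_cli(provider, prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;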

&lt;h3&gt;
  
  
  Subscription CLI Mode
&lt;/h3&gt;

&lt;p&gt;Routes calls through your existing paid subscriptions at &lt;strong&gt;$0 extra cost&lt;/strong&gt; (a subprocess sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude via &lt;code&gt;claude -p --model opus&lt;/code&gt; (Anthropic Pro/Max subscription)&lt;/li&gt;
&lt;li&gt;ChatGPT via &lt;code&gt;codex exec&lt;/code&gt; (OpenAI subscription)&lt;/li&gt;
&lt;li&gt;Gemini via &lt;code&gt;gemini -p --model gemini-2.5-pro&lt;/code&gt; (Google subscription)&lt;/li&gt;
&lt;/ul&gt;
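
&lt;p&gt;Under the hood this is subprocess plumbing. A minimal sketch using the exact commands listed above; passing the prompt on stdin is my assumption, not necessarily how the repo does it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Shelling out to subscription CLIs. The commands come from the list
# above; prompt-on-stdin and the timeout value are assumptions.
import subprocess

CLI_COMMANDS = {
    "anthropic": ["claude", "-p", "--model", "opus"],
    "openai": ["codex", "exec"],
    "google": ["gemini", "-p", "--model", "gemini-2.5-pro"],
}

def call_cli(provider, prompt, timeout=600):
    result = subprocess.run(
        CLI_COMMANDS[provider],
        input=prompt,  # assumption: prompt arrives on stdin
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    result.check_returncode()  # raise if the CLI exited nonzero
    return result.stdout.strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;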

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Toggle models on/off&lt;/strong&gt; — 6 preset models across 4 providers, simple checkboxes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Select pipelines and tasks&lt;/strong&gt; — Choose which strategies to benchmark against which problems&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Click Run&lt;/strong&gt; — The tool auto-assigns models to roles (one plausible reading is sketched after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude or GPT as the primary generator and synthesizer&lt;/li&gt;
&lt;li&gt;The other as critic and alternative generator&lt;/li&gt;
&lt;li&gt;Gemini for diversity in multi-model pipelines&lt;/li&gt;
&lt;li&gt;Ollama for speed&lt;/li&gt;
&lt;/ul&gt;
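
&lt;p&gt;One plausible reading of those rules in Python; the preference order and role names are my guesses, not the repo's actual logic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical role auto-assignment. `enabled` is the set of provider
# names the user toggled on in the UI.

def assign_roles(enabled):
    primary = "anthropic" if "anthropic" in enabled else "openai"
    secondary = "openai" if primary == "anthropic" else "anthropic"
    roles = {
        "generator": primary,
        "synthesizer": primary,
        "critic": secondary,
        "alt_generator": secondary,
    }
    if "google" in enabled:
        roles["diversity"] = "google"  # Gemini for varied perspectives
    if "ollama" in enabled:
        roles["fast_drafts"] = "ollama"  # local model for cheap speed
    return roles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;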

&lt;p&gt;&lt;strong&gt;4. Watch results stream in&lt;/strong&gt; — Real-time Server-Sent Events update a score matrix, bar charts, and statistics as each cell completes&lt;/p&gt;
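
&lt;p&gt;For reference, the streaming side is a standard FastAPI pattern. A generic sketch of the wire format the browser consumes, not the repo's actual endpoint:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Generic Server-Sent Events endpoint in FastAPI. The fake loop stands
# in for "push each finished (pipeline, task) cell as it completes".
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/events")
async def events():
    async def stream():
        for i in range(3):  # placeholder for real benchmark progress
            payload = json.dumps({"cell": i, "score": 90.0 + i})
            yield f"data: {payload}\n\n"  # SSE frame: data line + blank line
            await asyncio.sleep(1)
    return StreamingResponse(stream(), media_type="text/event-stream")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;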

&lt;p&gt;&lt;strong&gt;5. Dual blind judge&lt;/strong&gt; — Both Claude and GPT score every output independently. Scores are averaged into a single 0-100 result. Neither judge knows which pipeline produced the output.&lt;/p&gt;
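
&lt;p&gt;Concretely, "blind" means the judge prompt carries the task and the answer but never the pipeline's identity or call count. A minimal sketch, assuming a hypothetical async &lt;code&gt;judge&lt;/code&gt; callable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Dual blind judging: each judge sees only task + answer, and the two
# scores are averaged. The prompt wording here is illustrative.
import re
import statistics

JUDGE_PROMPT = (
    "Score the ANSWER to the TASK from 0 to 100. "
    "Reply with the number only.\n\nTASK:\n{task}\n\nANSWER:\n{answer}"
)

async def dual_judge(judges, task, answer):
    scores = []
    for judge in judges:  # e.g. one Claude judge, one GPT judge
        reply = await judge(JUDGE_PROMPT.format(task=task, answer=answer))
        match = re.search(r"\d+(?:\.\d+)?", reply)
        scores.append(float(match.group()))
    return statistics.mean(scores)  # single 0-100 result per output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;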




&lt;h2&gt;
  
  
  4 Task Suites
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Suite&lt;/th&gt;
&lt;th&gt;Tasks&lt;/th&gt;
&lt;th&gt;Difficulty&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Smoke&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;Quick validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Core&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Easy-Medium&lt;/td&gt;
&lt;td&gt;Standard benchmark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stress&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Hard&lt;/td&gt;
&lt;td&gt;Complex reasoning and planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Thesis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Very Hard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Designed to break single-model ceilings&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;thesis tasks&lt;/strong&gt; specifically target areas where research suggests multi-model approaches should excel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-domain synthesis&lt;/strong&gt;: Design silicon-based biology with chemistry equations (requires deep knowledge from two unrelated fields)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-file code refactoring&lt;/strong&gt;: Refactor Flask to FastAPI with 6 simultaneous requirements (SQL injection fix, OAuth2, async, Pydantic, bcrypt, preserve JSON schema)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constraint satisfaction&lt;/strong&gt;: Write a debate without using the letter 'z', with exactly 3 rhetorical questions and a 10-word final sentence (mechanically checkable; see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Needle-in-haystack&lt;/strong&gt;: Find contradictions across 5 quarterly financial reports and calculate EBITDA&lt;/li&gt;
&lt;/ul&gt;
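
&lt;p&gt;The constraint task is attractive precisely because compliance can be verified without any judge model. A toy checker for its three constraints (treating every question mark as a rhetorical question, which is a simplification):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Mechanical verification of the constraint-satisfaction task.
import re

def check_debate(text):
    parts = [p.strip() for p in re.split(r"[.!?]", text) if p.strip()]
    final = parts[-1] if parts else ""
    return {
        "no_letter_z": "z" not in text.lower(),
        "exactly_3_questions": text.count("?") == 3,
        "final_sentence_10_words": len(final.split()) == 10,
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;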

&lt;p&gt;These are problems where a single LLM routinely drops constraints, misses cross-domain connections, or loses track of conflicting information.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enhancement Toggles
&lt;/h2&gt;

&lt;p&gt;Beyond pipeline selection, the tool offers toggles that modify how every pipeline behaves (several are sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chain-of-Thought&lt;/strong&gt; — Forces step-by-step reasoning before final answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Budget Management&lt;/strong&gt; — Reserves 60% of the token budget for the synthesis step (prevents verbose intermediate steps from starving the final answer)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Temperature&lt;/strong&gt; — Auto-classifies each task (factual/code/analytical/creative) and sets a temperature suited to that class&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeat Runs&lt;/strong&gt; — Run each cell 1/3/5 times and report mean ± std dev, making run-to-run variance visible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Tracking&lt;/strong&gt; — Display estimated $ per pipeline using published per-token pricing&lt;/li&gt;
&lt;/ul&gt;
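
&lt;p&gt;A few of these toggles are easy to sketch. The classification keywords and temperature values below are illustrative guesses, not the repo's actual tables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Adaptive temperature, token budgeting, and repeat-run stats, sketched.
# The 60% synthesis share mirrors the bullet above; values are guesses.
import statistics

TEMPS = {"factual": 0.1, "code": 0.2, "analytical": 0.5, "creative": 0.9}

def adaptive_temperature(task_text):
    t = task_text.lower()
    if "refactor" in t or "implement" in t:
        return TEMPS["code"]
    if "story" in t or "debate" in t:
        return TEMPS["creative"]
    return TEMPS["analytical"]

def split_budget(total_tokens, synthesis_share=0.6):
    # Reserve 60% for the final synthesis step so verbose intermediate
    # steps cannot starve the answer that actually gets judged.
    synthesis = int(total_tokens * synthesis_share)
    return total_tokens - synthesis, synthesis  # (intermediate, synthesis)

def repeat_stats(scores):
    # Repeat Runs toggle: mean and spread over 3 or 5 repetitions.
    return statistics.mean(scores), statistics.stdev(scores)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;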




&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  With flagship models and strict dual judging (Opus 4.6 + GPT-5.4)
&lt;/h3&gt;

&lt;p&gt;On hard thesis tasks (cross-domain synthesis, code refactoring, constraint satisfaction, needle-in-haystack):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Pipeline&lt;/th&gt;
&lt;th&gt;Avg Score&lt;/th&gt;
&lt;th&gt;vs Baseline&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Graph-Mesh Collab&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;93.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+10.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Chain of Verification&lt;/td&gt;
&lt;td&gt;91.3&lt;/td&gt;
&lt;td&gt;+8.8&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Ranked Merge&lt;/td&gt;
&lt;td&gt;91.2&lt;/td&gt;
&lt;td&gt;+8.7&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Tournament&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;td&gt;+7.5&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Reverse Engineer&lt;/td&gt;
&lt;td&gt;89.7&lt;/td&gt;
&lt;td&gt;+7.2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Single (baseline)&lt;/td&gt;
&lt;td&gt;82.5&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Multi-model pipelines beat single-model by up to 10.8 points on hard tasks.&lt;/strong&gt; This is not the 1-2% margin we saw on easy tasks — it's a genuine, substantial improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key findings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Selection pressure&lt;/strong&gt; (Sample &amp;amp; Vote, Ranked Merge) consistently outperforms synthesis — generating multiple candidates and picking the best beats trying to merge them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph-mesh topology&lt;/strong&gt; (all-to-all agent communication) is the strongest collaboration pattern&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured verification&lt;/strong&gt; catches errors that single models miss, especially on constraint-heavy tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debate-style pipelines&lt;/strong&gt; underperform — forcing adversarial opposition on complex tasks destroys quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flagship judges are stricter&lt;/strong&gt; — Opus 4.6 + GPT-5.4 produce more discriminating scores than mid-tier models, making the differentiation between pipelines more meaningful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2-2-2 provider balance&lt;/strong&gt; — assigning two models from each of three providers spreads the roles evenly and maximizes parallelism&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What doesn't work
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Generic debate (57.1) — forcing opposition on settled facts&lt;/li&gt;
&lt;li&gt;Dissent-then-merge (68.9) — harsh critique without structure loses good content&lt;/li&gt;
&lt;li&gt;Self-MoA on hard tasks (77.0) — same-model sampling lacks diverse perspectives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All 29 pipelines, including the 5 combination variants (fusing the top-3 architectures) and the 2 corporate-hierarchy variants, are currently being benchmarked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: Python, FastAPI, fully async&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Providers&lt;/strong&gt;: Ollama, OpenAI, Anthropic, Gemini — with auto-routing by model name and retry with exponential backoff (sketched below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Vanilla HTML/JS/CSS, Chart.js, Server-Sent Events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: SQLite (WAL mode), CSV/JSON export&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipelines&lt;/strong&gt;: 10 module files implementing the 29 strategies, all sharing a common &lt;code&gt;BasePipeline&lt;/code&gt; interface&lt;/li&gt;
&lt;/ul&gt;
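
&lt;p&gt;The retry logic is the classic exponential-backoff-with-jitter pattern; a sketch, assuming a hypothetical &lt;code&gt;TransientProviderError&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Retry with exponential backoff plus jitter; a standard pattern shown
# as a sketch, not the repo's actual implementation.
import asyncio
import random

async def with_retries(call, attempts=5, base_delay=1.0):
    for attempt in range(attempts):
        try:
            return await call()
        except TransientProviderError:  # hypothetical exception type
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)  # 1s, 2s, 4s, 8s... plus jitter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;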




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/rich1398/Multi-Model-Benchmarking.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Multi-Model-Benchmarking
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8000&lt;/code&gt;, configure your API keys (or just use Ollama for free local testing), and run your first benchmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/rich1398/Multi-Model-Benchmarking" rel="noopener noreferrer"&gt;github.com/rich1398/Multi-Model-Benchmarking&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is an active research project. The next benchmark run will test all 29 pipelines against the thesis task suite with enhancement toggles enabled. If you have ideas for pipeline architectures that might beat single-model baselines, open an issue or PR.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
