<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matthew Karaula</title>
    <description>The latest articles on DEV Community by Matthew Karaula (@karamatt_).</description>
    <link>https://dev.to/karamatt_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3870896%2F694b2914-e442-40b8-8150-719fdf1e96fb.png</url>
      <title>DEV Community: Matthew Karaula</title>
      <link>https://dev.to/karamatt_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karamatt_"/>
    <language>en</language>
    <item>
      <title>How I Rebuilt My AI Decision Tool From a Summarizer Into a Constraint-Driven Arbitrator</title>
      <dc:creator>Matthew Karaula</dc:creator>
      <pubDate>Fri, 10 Apr 2026 04:47:49 +0000</pubDate>
      <link>https://dev.to/karamatt_/how-i-rebuilt-my-ai-decision-tool-from-a-summarizer-into-a-constraint-driven-arbitrator-5fc7</link>
      <guid>https://dev.to/karamatt_/how-i-rebuilt-my-ai-decision-tool-from-a-summarizer-into-a-constraint-driven-arbitrator-5fc7</guid>
      <description>&lt;p&gt;A few weeks ago, I shipped a tool called Arbiter that takes a business decision, runs it through GPT-4o, and returns a structured analysis. The output looked impressive. Recommendation, confidence score, pros and cons, risk ratings, next steps. Everything you'd expect from an AI decision tool.&lt;/p&gt;

&lt;p&gt;Then I posted it on Reddit and got destroyed in the comments.&lt;br&gt;
Not because the output was wrong. Because the output was vague. One commenter pointed out that the AI was just hand-waving its way to a conclusion. Another asked how it handled contradictory evidence between different perspectives. A third said the confidence scores felt arbitrary: there was no mechanism that would actually drop confidence when the evidence was weak.&lt;/p&gt;

&lt;p&gt;They were right. I was running a single LLM call with a clever prompt and pretending it was decision intelligence.&lt;/p&gt;

&lt;p&gt;This post is about how I rebuilt the pipeline to actually adjudicate decisions instead of summarizing them, and the architectural decisions that made the difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The original and how it failed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first version was simple. One system prompt, one user prompt, one JSON response.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;User input → GPT-4o (with structured prompt) → JSON output&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The prompt asked the model to play "senior strategy analyst," analyze options, return pros and cons, and assign a confidence score. It worked in the sense that it produced reasonable-looking output. It failed in three specific ways.&lt;/p&gt;

&lt;p&gt;First, the model could justify any conclusion with confident-sounding prose. There was no internal mechanism forcing it to actually weigh evidence; it just had to sound like it did.&lt;/p&gt;

&lt;p&gt;Second, confidence scores were cosmetic. The model would output 85% confidence on a vague decision and 75% on a well-defined one, with no consistent logic. I couldn't trace where the score came from.&lt;/p&gt;

&lt;p&gt;Third, when the same decision was run twice, the recommendations would sometimes flip. A single LLM call has no internal debate mechanism; whichever framing the model latched onto first won.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The redesign: separating extraction from advocacy from adjudication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The core insight was that real decision-making isn't a single act of reasoning. It's at least three distinct cognitive operations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defining what success looks like&lt;/strong&gt; (constraints, criteria, non-negotiables)&lt;br&gt;
&lt;strong&gt;Building the strongest case for each option&lt;/strong&gt; (advocacy)&lt;br&gt;
&lt;strong&gt;Evaluating each case against the success criteria&lt;/strong&gt; (adjudication)&lt;/p&gt;

&lt;p&gt;A single LLM call was trying to do all three at once, which is why it could rationalize any answer. The fix was to separate them into distinct stages where each stage's output became a hard input to the next. Here's the new pipeline:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;User input
  ↓
Stage 1: Constraint Extraction
  ↓
Stage 2: Research (with web search)
  ↓
Stage 3: Independent Advocates (parallel)
  ↓
Stage 4: Arbitrator
  ↓
Decision Brief&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each stage has a specific job, runs as its own LLM call with its own system prompt, and passes structured JSON to the next stage. Let me walk through what each one does and why it matters.&lt;/p&gt;
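&lt;p&gt;To make the hand-off concrete, here's a minimal sketch of the staged design in Python. This is an illustration, not Arbiter's actual code: run_llm_stage is a stub standing in for a real LLM call (the real version would send the system prompt plus the serialized payload to the model and parse the JSON it returns):&lt;/p&gt;

```python
def run_llm_stage(system_prompt, payload):
    """Stub standing in for one LLM call. It just echoes what it was
    given so the data flow between stages is visible."""
    return {"stage": system_prompt, "input_keys": sorted(payload)}

def run_pipeline(user_input):
    # Each stage's structured output becomes a hard input to the next.
    constraints = run_llm_stage("constraint-extraction", user_input)
    research = run_llm_stage("research", {"constraints": constraints, **user_input})
    # One advocate call per option (shown sequentially here for clarity;
    # the real pipeline runs them in parallel).
    advocates = [
        run_llm_stage("advocate", {"option": opt,
                                   "constraints": constraints,
                                   "research": research})
        for opt in user_input.get("options", [])
    ]
    return run_llm_stage("arbitrator", {"constraints": constraints,
                                        "research": research,
                                        "advocates": advocates})
```

&lt;p&gt;The point of the shape is that the arbitrator only ever sees structured inputs produced upstream, never the raw user prose.&lt;/p&gt;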

&lt;p&gt;&lt;strong&gt;Stage 1: Constraint extraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the biggest unlock. Before any reasoning happens, the system extracts a normalized constraint framework from the user's inputs.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "hard_constraints": [
    {"id": "HC1", "constraint": "Budget capped at $300K", "source": "user_input"}
  ],
  "soft_constraints": [
    {"id": "SC1", "constraint": "Minimize disruption to existing team", "weight": "high"}
  ],
  "decision_criteria": [
    {"id": "DC1", "criterion": "Operational within 4 months", "measurable": "go-live date"}
  ],
  "risk_tolerance": "moderate",
  "non_negotiables": ["No customer downtime"],
  "unknown_critical_inputs": ["Current team capacity"]
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The point isn't the format. The point is that every downstream stage now references the same constraint IDs. When an advocate argues for an option, they have to explicitly show which constraints their option satisfies. When the Arbitrator scores options, it scores them against the same constraint set, not against free-form prose.&lt;/p&gt;

&lt;p&gt;This single change eliminated about 80% of the hand-waving. The model couldn't just say "this option seems best" anymore. It had to point to specific constraints and show how each one is satisfied.&lt;/p&gt;

&lt;p&gt;The other useful thing constraint extraction does is identify what the user didn't tell you. The unknown_critical_inputs field forces the model to flag missing information. That data later becomes input to the confidence calculation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Research with real web search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The original tool relied entirely on training data for "industry context." The output looked authoritative but was completely ungrounded — citing statistics that may or may not exist, referencing competitor moves the model imagined.&lt;/p&gt;

&lt;p&gt;The fix was Tavily, a search API designed for LLM consumption. The Research Agent generates three focused search queries from the decision context, executes them in parallel, and synthesizes the results into structured findings.&lt;/p&gt;
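&lt;p&gt;The fan-out itself is plain thread-pool concurrency. A sketch, with search as a placeholder for the actual Tavily call (which issues an HTTP request per query):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def search(query):
    """Placeholder for a search-API call (e.g. Tavily). The real version
    would hit the network and return ranked results for the query."""
    return {"query": query, "results": []}

def run_searches(queries):
    # Execute the focused queries concurrently. pool.map preserves input
    # order, so results line up with the queries that produced them.
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        return list(pool.map(search, queries))
```

&lt;p&gt;With three queries, wall-clock time is roughly one search round-trip instead of three.&lt;/p&gt;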

&lt;p&gt;The key design decision was how to handle uncertainty about source quality. Rather than pretending every claim is equally evidenced, every finding gets tagged:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "claim": "Australian SaaS NRR averaged 112% in Q4 2025",
  "evidence_strength": "high",
  "source_type": "cited",
  "source_url": "https://...",
  "source_title": "..."
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;source_type is one of cited, inference, or model_knowledge. evidence_strength is high, medium, or low. The rule baked into the prompt: a claim cannot be marked as high-strength evidence unless it has a real URL backing it.&lt;/p&gt;

&lt;p&gt;This sounds obvious, but it took multiple iterations to get the model to actually respect it. Models have a strong default behavior of confidently asserting things. Breaking that habit required restating the rule in three different places in the prompt and explicitly forbidding fabricated citations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Parallel advocates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each option the user provides, an advocate LLM call runs in parallel, building the strongest possible case. The system prompt instructs them to be persuasive but honest, and crucially:&lt;/p&gt;

&lt;p&gt;Your argument must be structured around the CONSTRAINTS defined by the decision analyst. You cannot hand-wave; you must explicitly show how your option satisfies each hard constraint, decision criterion, and key soft constraint.&lt;/p&gt;

&lt;p&gt;Each advocate returns:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "option": "Option A",
  "executive_argument": "...",
  "constraint_satisfaction": [
    {"constraint_id": "HC1", "satisfied": "yes|partial|no", "reasoning": "..."}
  ],
  "supporting_evidence": [
    {"point": "...", "evidence_strength": "high|medium|low", "source_ref": "..."}
  ],
  "acknowledged_weaknesses": [...]
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The acknowledged_weaknesses field matters. Without it, advocates produced suspiciously one-sided arguments. Forcing them to acknowledge their own option's weaknesses produced more honest output, and gave the Arbitrator material to work with in the next stage.&lt;/p&gt;

&lt;p&gt;Running advocates in parallel was an obvious win for latency. Three options means three concurrent LLM calls instead of three sequential ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 4: The Arbitrator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where the real adjudication happens. The Arbitrator receives the constraint framework, the research findings, and all advocate arguments. Its system prompt explicitly tells it that its job is not to summarize:&lt;/p&gt;

&lt;p&gt;You are NOT summarizing the advocates. You are ADJUDICATING.&lt;br&gt;
Your process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Score each option against the constraints&lt;/li&gt;
&lt;li&gt;Identify contradictions between advocate arguments and resolve them with evidence&lt;/li&gt;
&lt;li&gt;Assess evidence strength for each advocate's claims&lt;/li&gt;
&lt;li&gt;Deliver a clear ruling. Do not hedge.&lt;/li&gt;
&lt;li&gt;Assess your own confidence based on constraint clarity, evidence quality, advocate agreement, and unknown critical inputs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The output includes a constraint scorecard that maps every constraint to a pass/partial/fail rating per option, a list of contradictions between advocates with how they were resolved, sensitivity variables (concrete values that would flip the ruling), and the actual ruling itself.&lt;br&gt;
The most important field is certainty_rationale. The model has to explain why its confidence is what it is. This makes the score legible — you can see whether the 72% confidence comes from "strong evidence but advocates disagree" or "weak evidence but clear constraint winner." Two different stories that should produce different actions from the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this cost me&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single LLM call with the original architecture was about $0.02 per analysis on GPT-4o-mini. The new pipeline runs six LLM calls (constraint extraction, research synthesis, three advocates, arbitrator) plus three Tavily searches. Cost per brief is now closer to $0.10 on the same model. Latency went from ~15 seconds to ~45 seconds.&lt;/p&gt;

&lt;p&gt;That's a 5x cost increase and 3x latency increase. For most consumer products it would be a bad trade. For a tool whose entire value proposition is "give me a structured ruling I can act on," it's worth it. Users will wait 45 seconds for output that actually helps them. They won't pay for output that looks like a ChatGPT response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three things from the rebuild that I'd apply to any multi-stage LLM system.&lt;/p&gt;

&lt;p&gt;Separation of concerns matters more than prompt engineering. I spent weeks trying to make a single prompt produce better output. Splitting that prompt into four prompts, each with a narrow job, did more in a day than the prompt tweaks did in two weeks. Each stage gets to specialize. Each stage's output becomes a hard constraint for the next stage instead of a suggestion.&lt;/p&gt;

&lt;p&gt;Models will fabricate confidence unless you make confidence expensive. The original tool happily output 90% confidence because nothing in the prompt punished it for being overconfident. The new tool ties certainty to specific factors (evidence strength, advocate agreement, missing inputs) and forces the model to justify its score in writing. When the model has to explain its confidence, it gets more conservative.&lt;/p&gt;

&lt;p&gt;Adversarial structure produces better reasoning than collaborative structure. The original prompt asked the model to "consider all perspectives." The new architecture has independent advocates each arguing their case, then a neutral arbitrator weighing them against criteria. The adversarial setup produces sharper arguments because each advocate is incentivized to make the strongest case. The arbitrator then has real material to weigh instead of mush.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here are screenshots of a real Decision Brief from the new pipeline. The constraint scorecard at the top is the most visually distinctive thing — every option scored against every extracted constraint. Below it, the research section shows cited findings with evidence strength badges and clickable source URLs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiit39zdon26mgpqr1vae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiit39zdon26mgpqr1vae.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83481rx3vj3iw7oggg44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83481rx3vj3iw7oggg44.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's next&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pipeline still has weak spots. Constraint extraction is fragile when user inputs are sparse: garbage in, garbage out. I'm working on a constraint review step where the user can edit the extracted framework before advocates run. Evidence strength calibration is also conservative; the model defaults to "medium" for almost everything unless there's a clearly cited stat. I'm experimenting with explicit calibration examples in the prompt.&lt;/p&gt;

&lt;p&gt;If you want to play with the tool, it's at &lt;a href="https://arbiter-frontend-iota.vercel.app/" rel="noopener noreferrer"&gt;https://arbiter-frontend-iota.vercel.app/&lt;/a&gt;. Free tier gives you a few briefs per month, no credit card. Genuinely interested in feedback on where the pipeline breaks for your use case.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
