Yes/no agents outperform open-ended ones in production: retention and reliability data from Watching Agents by Inithouse

#ai #agents #machinelearning #productivity

Binary constraints make agents more useful. That is the core finding after months of running Watching Agents in production, a platform from Inithouse, a studio shipping a growing portfolio of products in parallel. Users deploy AI agents to track questions about the future, and the agents that work best are the ones with the tightest output format.

When we first built the system, agents could answer in open-ended prose. Users got paragraphs. They read them once and never came back. After switching to a yes/no prediction model where each agent outputs a probability score (0 to 100%), a confidence level, and a change velocity metric, returning-user sessions jumped and the reliability of outputs became measurable for the first time.

Here is what we learned and why it matters for anyone building production AI agents.

The problem with open-ended agent output

Open-ended agents produce plausible text. That is the trap. When you ask an agent "Will remote work become the default by 2028?" and it returns three paragraphs of hedged analysis, there is no way to tell whether the answer has changed since last week. There is no anchor for comparison.

We saw this across our portfolio. At Be Recommended, an AI visibility scoring tool, the shift from prose explanations to a 0 to 100 score per AI engine made the product stickier overnight. Users came back to check their number, not to re-read analysis. The pattern repeated at Watching Agents.

Open-ended output creates three specific problems in production:

No diffability. You cannot compare yesterday's three paragraphs to today's three paragraphs in a way that surfaces what actually changed. Users disengage because the output feels static even when the underlying data shifted.

No accountability. If the agent said "tensions are rising" last Tuesday and says "tensions are rising" this Tuesday, did anything happen? Without quantification, the agent cannot be wrong. And an agent that cannot be wrong is an agent nobody trusts.

No trigger for re-engagement. Push notifications need a delta. "Probability dropped from 72% to 58%" is a notification. "The situation continues to evolve" is not.

What binary constraints actually change

When we rebuilt Watching Agents around yes/no predictions with explicit probability scores, three things improved:

1. Structured hypothesis tracking.
Each agent now maintains a set of competing hypotheses, not just one answer. An agent tracking "Will the EU regulate foundation models by 2027?" carries four or five scenarios, each with its own probability, trend direction, and evidence links. The constraint forced us to decompose vague takes into testable claims.

In practice, this means each hypothesis has confirming and disconfirming conditions defined up front. The agent is not free to drift. It updates probabilities as evidence arrives, and the confirming/disconfirming framing keeps the reasoning auditable.
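A minimal sketch of what such a hypothesis record could look like, assuming probabilities are stored as percentage points. The class name, field names, and the `apply_evidence` method are hypothetical illustrations, not the Watching Agents schema. The point it shows is that evidence can only move the number if it matches a condition defined up front.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """One testable scenario under a yes/no question (hypothetical schema)."""
    claim: str
    probability: int              # percentage points, 0 to 100
    confirming: list[str]         # conditions defined up front that raise the probability
    disconfirming: list[str]      # conditions defined up front that lower it
    evidence: list[str] = field(default_factory=list)  # audit trail of applied evidence

    def apply_evidence(self, condition: str, shift: int) -> None:
        """Move the probability only when evidence matches a predefined condition."""
        if condition in self.confirming:
            delta = abs(shift)
        elif condition in self.disconfirming:
            delta = -abs(shift)
        else:
            # Evidence outside the predefined conditions cannot move the number.
            raise ValueError(f"condition not defined up front: {condition!r}")
        # Clamp so the probability stays within 0 to 100.
        self.probability = min(100, max(0, self.probability + delta))
        self.evidence.append(condition)


h = Hypothesis(
    claim="Binding rules adopted before 2027",
    probability=40,
    confirming=["final text published"],
    disconfirming=["vote postponed"],
)
h.apply_evidence("final text published", shift=10)
print(h.probability)  # 50
```

Rejecting evidence that matches no predefined condition is one way to keep the agent from drifting. A real system would more likely queue such evidence for review than raise an error.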

2. Measurable reliability.
With a probability output, we can track calibration. If an agent says 70% across many predictions, roughly 70% of them should resolve yes. We observed that constrained agents calibrate within a tighter band than open-ended ones, which occasionally produce overconfident prose. The difference is structural: a number demands commitment; prose allows hedging.
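The calibration check can be sketched in a few lines: group resolved predictions by their stated probability and compare each group with how often it resolved yes. The function name, the input shape, and the 10-point bucket width are assumptions for illustration, not a description of the production pipeline.

```python
from collections import defaultdict


def calibration_table(resolved, bucket_width=10):
    """Compare stated probabilities with observed outcomes.

    `resolved` is an iterable of (probability_percent, resolved_yes) pairs.
    Returns {bucket_start: (mean_stated_percent, observed_yes_percent, count)}.
    """
    buckets = defaultdict(list)
    for prob, outcome in resolved:
        # Put 100% into the top bucket instead of giving it a bucket of its own.
        start = min(int(prob // bucket_width) * bucket_width, 100 - bucket_width)
        buckets[start].append((prob, outcome))

    table = {}
    for start in sorted(buckets):
        rows = buckets[start]
        mean_stated = sum(p for p, _ in rows) / len(rows)
        observed = 100 * sum(1 for _, yes in rows if yes) / len(rows)
        table[start] = (mean_stated, observed, len(rows))
    return table


# Ten predictions stated at 70%, seven of which resolved yes (made-up data).
history = [(70, True)] * 7 + [(70, False)] * 3
print(calibration_table(history))  # {70: (70.0, 70.0, 10)}
```

A well-calibrated agent shows stated and observed values close together in every bucket. The gap between them is the number to watch over time.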

We run evolution graphs on every agent page, showing how the probability moved over weeks and months. Users check these graphs the way traders check charts. The visual change alone drives a return pattern we never saw with text-only output.

3. Retention through change signals.
Each Watching Agents page exposes change velocity, latest shift reasoning, and watch signals (leading, confirming, disconfirming). When a probability moves by more than five percentage points, the agent flags it. This creates a natural reason to return.
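As a rough sketch, the flagging rule and a change-velocity figure could be computed from two consecutive readings. The five-point threshold comes from the text above. Everything else is an assumption: the `Reading` type, the function names, and the definition of velocity as points per day.

```python
from dataclasses import dataclass
from typing import Optional

FLAG_THRESHOLD = 5  # percentage points; moves larger than this get flagged


@dataclass
class Reading:
    day: int            # days since the agent started (hypothetical time axis)
    probability: float  # percent


def change_signal(previous: Reading, current: Reading) -> Optional[str]:
    """Return a notification string if the move exceeds the threshold, else None."""
    delta = current.probability - previous.probability
    if abs(delta) <= FLAG_THRESHOLD:
        return None
    verb = "rose" if delta > 0 else "dropped"
    return (
        f"Probability {verb} from {previous.probability:g}% "
        f"to {current.probability:g}%"
    )


def change_velocity(previous: Reading, current: Reading) -> float:
    """Percentage points per day between two readings (assumed definition)."""
    days = current.day - previous.day
    if days <= 0:
        raise ValueError("current reading must be later than previous reading")
    return (current.probability - previous.probability) / days


last_week, today = Reading(day=0, probability=72), Reading(day=7, probability=58)
print(change_signal(last_week, today))    # Probability dropped from 72% to 58%
print(change_velocity(last_week, today))  # -2.0
```

A move of exactly five points returns `None` here, matching the "more than five" wording.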

Compared to our earlier open-ended version, the structured output generates an obvious re-engagement loop: check score, see if it moved, read why, check related agents. We measured higher scroll depth and more multi-page sessions after the transition.

Design implications for production agents

If you are building agents that users interact with repeatedly, consider these patterns we adopted:

Force a quantified output. Even if the domain feels qualitative, find a scoring dimension. At Verdict Buddy, our AI conflict resolution tool built on Gottman and NVC frameworks, we score conflict resolution paths rather than just describing them. The score anchors the conversation.

Separate structure from explanation. The prediction score is the primary output. The reasoning, hypotheses, drivers, and evidence are secondary layers users can drill into. At Watching Agents, agent pages show the probability and confidence first, then expand into detailed hypothesis breakdowns, driver analysis, and sourced evidence. Most returning users scan the top number and only read deeper when it changed.
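One way to picture this layering is a small data model where the score and confidence are always rendered and the explanation is rendered only on request. All names here (`AgentOutput`, `Explanation`, `render`) and the confidence labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Explanation:
    """Secondary layer: shown only when the user drills in."""
    hypotheses: list[str] = field(default_factory=list)
    drivers: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)


@dataclass
class AgentOutput:
    question: str
    probability: int   # percent; the primary output
    confidence: str    # e.g. "low", "medium", "high" (assumed labels)
    explanation: Explanation = field(default_factory=Explanation)

    def headline(self) -> str:
        return f"{self.question} {self.probability}% yes ({self.confidence} confidence)"

    def render(self, expanded: bool = False) -> str:
        """Always lead with the number; append the detail layers only if expanded."""
        lines = [self.headline()]
        if expanded:
            sections = (
                ("Hypotheses", self.explanation.hypotheses),
                ("Drivers", self.explanation.drivers),
                ("Evidence", self.explanation.evidence),
            )
            for title, items in sections:
                if items:  # skip empty layers
                    lines.append(f"{title}:")
                    lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


page = AgentOutput(
    question="Will remote work become the default by 2028?",
    probability=58,
    confidence="medium",
    explanation=Explanation(drivers=["office lease renewals"]),
)
print(page.render())               # top line only
print(page.render(expanded=True))  # top line plus the non-empty detail layers
```

Keeping the headline independent of the explanation means a returning user can scan the top number without loading or reading anything else.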

Build the diff into the product. Every agent page shows a probability history graph and a "latest change" section with reasoning. This is not analytics; it is the product. Users told us they share the evolution graph more than the prediction itself.

Make agents decompose, not summarize. An open-ended agent summarizes. A constrained agent decomposes into hypotheses, each with distinct evidence paths. This decomposition is what makes the output trustworthy. Users can disagree with a specific hypothesis without dismissing the entire prediction.

What this means for the AI agent landscape

The industry defaults to chat-style open-ended agents because that is what LLMs do naturally. But production retention tells a different story. Constraining the output format, whether it is a yes/no probability, a score, or a structured decision tree, gives users a reason to come back.

At Inithouse, we have seen this pattern in multiple products. The constraint does not limit the agent; it focuses it. And focused agents are the ones users actually keep using.

If you are building something similar, take a look at how Watching Agents structures its agent pages. Public agents are browsable without signup, and each one demonstrates the binary prediction model with full hypothesis and evidence layers.

Built at Inithouse, a studio shipping a growing portfolio of products in parallel. Watching Agents lets you deploy an AI agent to watch any question about the future.