Most monitoring tools watch for a specific event. Price drops below $X. Server returns 5xx. Keyword rank changes. The trigger is predefined, the response is mechanical.
We wanted something different at Inithouse. We wanted agents that could watch a question, not a metric, and figure out on their own when the answer shifted.
That's what we built with Watching Agents.
The core problem: questions aren't metrics
A metric has a value. A question has evidence. You can't poll "Will the EU regulate foundation models by Q3 2026?" like a server health check. The answer lives in a web of signals: legislative drafts, committee votes, industry lobbying disclosures, media sentiment shifts.
The agent needs to do three things: build an initial hypothesis from available evidence, continuously collect new evidence, and detect when new evidence materially changes the hypothesis.
Each of these is a distinct engineering challenge.
Hypothesis formation: yes/no vs open-ended
We found early on that question type changes everything about the architecture.
Binary questions ("Will X happen by Y?") produce a hypothesis with a confidence score. The agent collects evidence, classifies each piece as supporting or contradicting, and maintains a running confidence. When confidence crosses a threshold or flips direction, it alerts.
The signal detection here is relatively clean. Each new piece of evidence either moves the needle or doesn't. We track cumulative evidence weight over time and look for inflection points.
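As a rough illustration of the binary case, the sketch below keeps a running confidence and reports when it flips direction or crosses a threshold. The update rule (moving confidence toward 0 or 1 in proportion to evidence weight), the neutral 0.5 prior, the 0.8 threshold, and all names are assumptions made for this example, not the actual Watching Agents implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BinaryHypothesis:
    """Running confidence for a yes/no question (illustrative only)."""
    confidence: float = 0.5        # assumed neutral prior
    alert_threshold: float = 0.8   # hypothetical threshold, mirrored at 0.2
    history: List[float] = field(default_factory=list)

    def update(self, supports: bool, weight: float) -> List[str]:
        """Apply one piece of evidence (weight in [0, 1]); return triggered alerts."""
        before = self.confidence
        target = 1.0 if supports else 0.0
        # Move confidence toward the target in proportion to the evidence weight.
        self.confidence = before + weight * (target - before)
        self.history.append(self.confidence)

        alerts = []
        # A flip means confidence moved from one side of 0.5 to the other.
        if (before - 0.5) * (self.confidence - 0.5) < 0:
            alerts.append("direction_flip")
        hi, lo = self.alert_threshold, 1 - self.alert_threshold
        crossed_high = before < hi <= self.confidence
        crossed_low = before > lo >= self.confidence
        if crossed_high or crossed_low:
            alerts.append("threshold_crossed")
        return alerts
```

Keeping `history` around is what makes it possible to look for inflection points later, rather than reacting only to the latest value.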
Open-ended questions ("How is the AI developer tools market evolving?") are harder. There's no single confidence score. Instead, the agent maintains a structured summary of the current state and uses semantic comparison to detect meaningful drift.
We tried cosine similarity between summary embeddings at first. Too noisy. Minor phrasing changes triggered false positives. What worked: extracting named claims from each summary version and diffing the claim sets. A new claim appearing, or an existing claim being contradicted by evidence, constitutes a real change.
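A minimal sketch of the claim-diffing idea, assuming claims have already been extracted and normalized into comparable strings (that extraction step is the hard part and is out of scope here). The `ClaimDiff` type and `diff_claims` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class ClaimDiff:
    added: FrozenSet[str]
    dropped: FrozenSet[str]
    contradicted: FrozenSet[str]

    @property
    def is_real_change(self) -> bool:
        # Per the description above: a new claim, or an existing claim
        # contradicted by evidence, counts as a real change.
        return bool(self.added or self.contradicted)


def diff_claims(
    previous: Iterable[str],
    current: Iterable[str],
    contradicted: Iterable[str] = (),
) -> ClaimDiff:
    """Compare two summary versions by their claim sets, not their wording."""
    prev, cur = set(previous), set(current)
    return ClaimDiff(
        added=frozenset(cur - prev),
        dropped=frozenset(prev - cur),
        # Only claims that were actually held before can be contradicted.
        contradicted=frozenset(set(contradicted) & prev),
    )
```

Because the comparison runs on sets of claims, a summary that is merely rephrased yields an empty diff as long as it normalizes to the same claims, which is the property the embedding comparison lacked.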
Evidence collection and reliability
Not all sources are equal. An agent watching a regulatory question should weight an official government gazette differently from a tweet. We built a three-tier reliability model:
- Primary sources: official documents, published data, peer-reviewed research. High weight, low frequency.
- Secondary sources: established media, analyst reports, industry publications. Medium weight, medium frequency.
- Signal sources: social media, forums, blog posts. Low individual weight, but useful for early detection when multiple signals cluster.
The agent scores each piece of evidence by source tier and recency. Evidence decays: a six-month-old analyst prediction carries less weight than a fresh legislative committee vote.
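One way to combine tier and recency is a tier weight multiplied by an exponential decay. The sketch below assumes that form; the specific tier weights and the 90-day half-life are placeholder values chosen for the example, not figures from our system.

```python
from datetime import datetime

# Hypothetical weights for the three tiers described above.
TIER_WEIGHTS = {"primary": 1.0, "secondary": 0.5, "signal": 0.1}
HALF_LIFE_DAYS = 90.0  # assumed: weight halves every 90 days


def evidence_weight(tier: str, published: datetime, now: datetime) -> float:
    """Score one piece of evidence by source tier and age."""
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
    return TIER_WEIGHTS[tier] * decay
```

With these placeholder numbers, a fresh committee vote (primary) scores 1.0, while a 180-day-old analyst prediction (secondary) scores 0.5 × 0.25 = 0.125.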
In practice, across the 117 public agents running on Watching Agents, roughly 60% of meaningful alerts are triggered by primary source changes, even though primary sources account for only about 15% of collected evidence by volume. The signal-to-noise ratio at the top tier is dramatically better.
The alert decision
The hardest engineering question: when does change become worth alerting about?
Alert too often and users tune out. Alert too rarely and they miss the changes that matter. We experimented with fixed thresholds (alert when confidence moves by more than 15%), but this failed across domains. A 15% confidence shift in a stable regulatory question is significant. The same shift in a volatile crypto market is noise.
Our current approach: each agent learns its own baseline volatility during the first 48 hours of monitoring. Alerts fire when a change exceeds twice the baseline volatility for that specific question. This adapts per domain without manual tuning.
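A simplified sketch of the adaptive threshold. The 48-hour warm-up and the 2x multiplier come from the description above; measuring volatility as the mean absolute change between consecutive observations is an assumption made for this example, as is the `VolatilityGate` name.

```python
from datetime import datetime, timedelta
from statistics import fmean
from typing import List, Optional

WARMUP = timedelta(hours=48)  # learning window described above
MULTIPLIER = 2.0              # alert when change exceeds 2x baseline


class VolatilityGate:
    """Learns a per-question baseline, then flags unusually large changes."""

    def __init__(self, started_at: datetime):
        self.started_at = started_at
        self.warmup_changes: List[float] = []
        self.last_value: Optional[float] = None

    def observe(self, at: datetime, value: float) -> bool:
        """Record one observation; return True if it should alert."""
        if self.last_value is None:
            self.last_value = value
            return False
        change = abs(value - self.last_value)
        self.last_value = value
        if at - self.started_at < WARMUP:
            # Still learning: collect changes, never alert.
            self.warmup_changes.append(change)
            return False
        if not self.warmup_changes:
            return False  # no baseline was learned during warm-up
        baseline = fmean(self.warmup_changes)
        return change > MULTIPLIER * baseline
```

A question that wobbles by about 0.02 during warm-up will ignore a later 0.01 move and alert on a 0.07 move. A real system would likely also want a floor on the baseline so that a perfectly flat warm-up does not make every later change alert.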
We also added a "quiet period." After an alert fires, the agent suppresses further alerts for a configurable window (default: 6 hours) to avoid alert storms during cascading news events.
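The quiet period is essentially a cooldown check in front of the alert sender. A minimal sketch, with the 6-hour default from above and a hypothetical `QuietPeriod` class:

```python
from datetime import datetime, timedelta
from typing import Optional


class QuietPeriod:
    """Suppresses alerts for a fixed window after one has fired."""

    def __init__(self, window: timedelta = timedelta(hours=6)):
        self.window = window
        self.last_alert_at: Optional[datetime] = None

    def allow(self, at: datetime) -> bool:
        """Return True if an alert may fire at this time, and record it."""
        if self.last_alert_at is not None and at - self.last_alert_at < self.window:
            return False  # still inside the quiet window
        self.last_alert_at = at
        return True
```

In this version, suppressed alerts do not extend the window, so a long news cascade still produces one alert per window instead of silencing the agent indefinitely. Whether that is the right trade-off depends on the domain.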
What we got wrong
Two design decisions we reversed:
Summarizing too aggressively. Early versions compressed evidence into a single paragraph. This lost the evidence trail. Users couldn't understand why the agent changed its mind. We switched to maintaining a timeline of key evidence with explicit links between evidence and hypothesis changes.
Treating all questions as perpetual. Some questions have natural expiration dates. "Will the merger close by June?" becomes irrelevant in July. We added optional expiration dates and a resolution mechanism: when an agent detects that a question has been definitively answered, it marks itself resolved and sends a final summary.
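The question lifecycle can be sketched as a small state check. The field names and the three-state enum here are assumptions for illustration; how an agent decides that a question has been definitively answered is not shown.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    ACTIVE = "active"
    RESOLVED = "resolved"
    EXPIRED = "expired"


@dataclass
class WatchedQuestion:
    text: str
    expires_on: Optional[date] = None   # optional expiration date
    resolution: Optional[str] = None    # set once definitively answered

    def status(self, today: date) -> Status:
        # A definitive answer takes precedence over the calendar.
        if self.resolution is not None:
            return Status.RESOLVED
        if self.expires_on is not None and today > self.expires_on:
            return Status.EXPIRED
        return Status.ACTIVE
```

Checking resolution before expiration means a question answered on its last day is reported as resolved, with its final summary, rather than silently expiring.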
The public agent model
One architectural choice that shaped everything: we made agents publicly visible by default. Each of the 117 agents on watchingagents.com has its own page showing the current hypothesis, evidence timeline, and confidence trajectory.
This constraint forced us to make the reasoning transparent. An agent can't just say "confidence changed." It has to show which evidence caused the change and why that evidence was weighted the way it was.
It also turned out to be a useful distribution mechanism. Public agent pages get indexed. People searching for "will X happen" find an agent already tracking their question.
What's next
We're working on agent-to-agent evidence sharing. If one agent tracking EU AI regulation collects a relevant document, another agent tracking global AI policy should benefit from it without re-crawling. The challenge is relevance scoring across different question scopes. Sharing everything creates noise; sharing nothing wastes work.
If you're building monitoring systems that go beyond simple threshold alerts, the core insight from our work is this: model the question explicitly, not just the data stream. The question type determines your entire detection architecture.
Try it at watchingagents.com.
Top comments (1)
Watching a question instead of a metric is a strong distinction. The tricky part is evidence drift: the agent needs to know whether the world changed, the source quality changed, or its own interpretation changed. I would log those separately, because the response should be different for each.