DEV Community

EstatePass

Three Checks We Added After One AI Prompt Started Giving the Wrong Kind of Help

When a product team says an AI feature is “helpful,” that sentence usually hides an unresolved question: helpful for whom, and helpful at which point in the workflow?

We ran into this the hard way while working across two very different content and assistance surfaces. One side served real estate exam-prep users who needed precise study guidance. The other served licensed agents who needed faster drafts, reusable workflow structure, and less blank-page time. The original AI layer was clean, shared, and efficient. It was also too generic.

The model could answer both kinds of requests. That was not the same thing as helping both groups well.

What finally improved the system was not another round of prompt polishing. It was adding three product checks that forced us to separate “sounds useful” from “creates the right next action.” This post breaks down those checks, why the original workflow failed, and why the fix mattered more than the model change.

Disclosure: these lessons come from product work tied to EstatePass, but the point of this write-up is the implementation pattern, not a product pitch.

Where the original setup broke

The original pattern was easy to justify.

  • Use one strong system prompt.
  • Route users by context.
  • Adjust examples and tone.
  • Let the model return a recommendation, explanation, or draft.

That seemed efficient because the top-level requests often looked similar across users.
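A minimal sketch of that original shared-prompt pattern, with hypothetical names (`SYSTEM_PROMPT`, `EXAMPLES`, `build_prompt` are illustrative, not the actual EstatePass implementation):

```python
# Hypothetical sketch of the original pattern: one strong system prompt,
# audiences routed only by swapping examples and tone.
SYSTEM_PROMPT = "You are a helpful real estate assistant."

EXAMPLES = {
    "learner": "Q: What should I review next? A: ...",
    "agent": "Q: How do I rewrite this listing copy? A: ...",
}

def build_prompt(audience: str, user_message: str) -> str:
    """One shared prompt; only the few-shot examples vary by audience."""
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Example:\n{EXAMPLES[audience]}\n\n"
        f"User: {user_message}"
    )
```

The weakness is visible in the code: nothing about the audience changes the definition of a good answer, only the flavor of the prompt.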

Learners asked:

  • What should I review next?
  • Why am I still missing this topic?
  • How do I know if I am ready for the exam?

Agents asked:

  • What should I send next?
  • How do I rewrite this listing copy?
  • What should this workflow step look like?

At the request layer, all of these can be labeled “guidance.” But the risk profile behind them is different.

A learner uses the answer to decide what to study, what to repeat, and whether they are close to test-ready. An agent uses the answer more like a working draft or a structure to adapt. One side needs tighter judgment. The other needs faster leverage.

The product problem only became obvious once we looked at failure behavior instead of output tone.

The model was producing plenty of responses that sounded supportive and coherent. But some of those responses were too generic to change learner behavior, while similar output could still be useful for an agent who only needed a strong starting point.

That meant the same prompt quality score was masking two very different product outcomes.

Check 1: Can the answer survive contact with the user’s next decision?

This became the first filter.

A response should not count as good just because it reads well. It should count as good if the user can take the next step with less confusion than before.

For exam prep, that threshold is much higher than many teams expect.

Consider a response like this:

  • review your weak areas
  • keep practicing consistently
  • revisit missed questions

Nothing there is wrong. It is just too vague. A learner who is struggling with state-specific law, real estate math, or question-stem misreads still does not know what to do in the next 30 minutes.

So the first check became: if the user followed this answer exactly, would the next study block improve?

If the answer could not survive that test, we stopped calling it helpful.
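One way to operationalize this check is a crude gate that rejects answers made up entirely of generic study advice. This is a hypothetical heuristic, not the production filter; the phrase list is illustrative:

```python
# Hypothetical Check 1 gate: does the answer give the learner a concrete
# next step, or only generic encouragement? The phrase set is illustrative.
GENERIC_PHRASES = {
    "review your weak areas",
    "keep practicing consistently",
    "revisit missed questions",
}

def survives_next_decision(answer: str) -> bool:
    """Reject answers composed entirely of generic study advice."""
    lines = [
        line.strip("• -").strip().lower()
        for line in answer.splitlines()
        if line.strip()
    ]
    # An empty answer, or one made only of stock phrases, fails the check.
    return bool(lines) and not all(line in GENERIC_PHRASES for line in lines)
```

A real version would use a classifier or rubric rather than string matching, but even this crude gate forces the question the prose asks: would the next 30 minutes of study actually change?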

Agent workflows passed this check much more easily. An imperfect but structured draft can still move the work forward. The user can revise tone, facts, or sequencing. The answer does not need to function as a diagnostic instrument.

That difference forced the first product split: not every “next step” answer should be evaluated against the same standard.

Check 2: Does the feedback point to a real failure mode, or only to a theme?

This was the second big change.

The first version of the system was too willing to speak in themes:

  • contracts need more review
  • timing needs improvement
  • confidence is still low
  • state material needs more attention

That language can sound analytical. But it often hides the absence of a true failure mode.

A real failure mode is narrower and more actionable. It says something like:

  • the learner is recognizing vocabulary in isolation but missing it inside long scenario questions
  • the learner understands the formula after seeing it worked out, but cannot set up the steps from scratch
  • the agent keeps getting usable drafts, but the workflow loses time because property details are not structured before generation begins

Once we forced the system to identify failure modes instead of broad themes, response quality improved for both audiences.
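One way to force that distinction is a structured output shape that makes a bare theme invalid. A sketch, with hypothetical field names:

```python
# Hypothetical Check 2 schema: the model must name a narrow failure mode
# (what happens, when it breaks, what to do), not a broad theme.
from dataclasses import dataclass

@dataclass
class FailureMode:
    observed: str     # what the user actually does
    breaks_when: str  # the condition under which it fails
    next_action: str  # what to do in the next session

# Broad themes that should be rejected as a diagnosis on their own.
BROAD_THEMES = {"contracts", "timing", "confidence", "state material"}

def is_real_failure_mode(fm: FailureMode) -> bool:
    """A bare theme ('contracts') is rejected; a mechanism is required."""
    return (
        fm.observed.strip().lower() not in BROAD_THEMES
        and bool(fm.breaks_when.strip())
        and bool(fm.next_action.strip())
    )
```

The point of the schema is that "contracts need more review" cannot be expressed in it without leaving `breaks_when` empty, which the validator catches.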

For learners, the benefit was obvious. The recommendation became specific enough to act on.

For agents, the improvement was different. The tool stopped pretending the bottleneck was “better writing” when the real issue was missing inputs, bad sequencing, or unstructured source material.

That mattered because AI systems often absorb blame for workflow problems they did not create. A model can only do so much if the handoff into the model is weak.

Check 3: Are we measuring the same kind of success across both audiences?

This check forced the product team to stop using one success story for two different jobs.

Before the split, the easy metric was user satisfaction at the response level:

  • Did the user continue?
  • Did they click again?
  • Did they say the answer was useful?

Those signals are not useless, but they are weak when the product supports different kinds of decisions.

For learners, a “useful” answer should eventually show up in performance improvement:

  • fewer repeated misses in the same concept group
  • clearer readiness decisions
  • better targeted review behavior
  • lower ambiguity about what to study next

For agents, the success markers are more operational:

  • faster time to first draft
  • less redundant rewriting
  • more reuse of proven workflow patterns
  • lower effort per listing, follow-up sequence, or content asset

The old analytics layer blurred those together. Once we separated them, the product looked less uniformly successful, but far more honestly measurable.

That change was uncomfortable. It also made iteration possible.
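The separation might be sketched as two distinct outcome records with their own success definitions. Field names and thresholds here are illustrative assumptions, not the actual metrics:

```python
# Hypothetical per-audience success metrics, replacing one blended
# satisfaction score. Thresholds are illustrative, not the real ones.
from dataclasses import dataclass

@dataclass
class LearnerOutcome:
    repeated_misses_in_group: int  # same concept group missed again
    readiness_decision_made: bool  # learner knows if they are test-ready

@dataclass
class AgentOutcome:
    minutes_to_first_draft: float
    rewrite_passes: int            # redundant rewriting after generation

def learner_success(o: LearnerOutcome) -> bool:
    """Learner success shows up as performance, not clicks."""
    return o.repeated_misses_in_group == 0 and o.readiness_decision_made

def agent_success(o: AgentOutcome) -> bool:
    """Agent success is operational: speed and less rework."""
    return o.minutes_to_first_draft < 10 and o.rewrite_passes <= 1
```

The design choice is that the two functions cannot be averaged into one score, which is exactly what the old analytics layer was doing implicitly.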

What changed in practice

The most important shift was not a technical breakthrough. It was a product constraint.

We stopped asking the AI layer to be one universal helper and started forcing it to prove value against the user’s actual next move.

That led to three concrete product changes.

Learner-facing responses became more diagnostic

We pushed the system to answer a stricter sequence:

  1. What went wrong?
  2. Why did it go wrong?
  3. What should happen in the next study block?
  4. What would better performance look like next time?

That sequence is harder to generate, but much easier to trust.
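The four-step sequence can be enforced as a required output shape rather than a prompt suggestion. A sketch, assuming the model returns a keyed response (the step names are illustrative):

```python
# Hypothetical validator for the stricter learner-facing sequence.
# The model's response is assumed to be a dict keyed by step name.
DIAGNOSTIC_STEPS = [
    "what_went_wrong",
    "why_it_went_wrong",
    "next_study_block",
    "what_better_looks_like",
]

def validate_diagnostic(response: dict) -> list[str]:
    """Return the steps the model skipped; an empty list means complete."""
    return [
        step for step in DIAGNOSTIC_STEPS
        if not response.get(step, "").strip()
    ]
```

A response that skips a step gets regenerated or flagged instead of shipped, which is what makes the harder-to-generate sequence easier to trust.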

Agent-facing responses became more operational

For agents, the system became more useful when it focused less on explanation and more on usable structure:

  • draft this from the available inputs
  • show what is missing before drafting
  • preserve reusable parts
  • shorten the time from notes to a usable asset

This reduced friction without overpromising precision the workflow did not actually need.
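The "show what is missing before drafting" step can be sketched as a pre-draft gate. The required-field list is a hypothetical example, not the actual listing schema:

```python
# Hypothetical pre-draft gate for agent workflows: surface missing
# property inputs before generation starts, instead of drafting around gaps.
REQUIRED_LISTING_INPUTS = ["address", "price", "beds", "baths", "sqft"]

def missing_inputs(listing: dict) -> list[str]:
    """Fields the agent must supply before a draft is generated."""
    return [f for f in REQUIRED_LISTING_INPUTS if not listing.get(f)]

def ready_to_draft(listing: dict) -> bool:
    return not missing_inputs(listing)
```

This is also where Check 2 paid off for agents: when the gate fires, the bottleneck is visibly unstructured input, not "better writing."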

The orchestration layer became stricter than the prompt layer

This was the real lesson.

Teams often react to bad AI output by rewriting the prompt again. We did some of that too. But the higher leverage move was tightening the orchestration rules around the prompt.

The system needed to know:

  • which audience it was serving
  • what kind of mistake was unacceptable
  • what kind of evidence had to inform the response
  • what counted as success after the answer was delivered

Without those rules, the prompt stayed broad and the product stayed vague.
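Those four rules can live as explicit per-audience configuration around the prompt. A sketch with illustrative values; the field names and rule contents are assumptions, not the actual orchestration layer:

```python
# Hypothetical orchestration rules that sit around the prompt: audience,
# unacceptable mistakes, required evidence, and the post-answer success signal.
from dataclasses import dataclass

@dataclass
class OrchestrationRules:
    audience: str
    unacceptable: list[str]       # mistakes that must block the response
    required_evidence: list[str]  # data the response must be grounded in
    success_signal: str           # what counts as success afterward

LEARNER_RULES = OrchestrationRules(
    audience="learner",
    unacceptable=["generic advice with no next step"],
    required_evidence=["recent misses", "concept-group stats"],
    success_signal="fewer repeated misses in the same concept group",
)

AGENT_RULES = OrchestrationRules(
    audience="agent",
    unacceptable=["fabricated property details"],
    required_evidence=["structured listing inputs"],
    success_signal="faster time to first draft",
)
```

The prompt itself can stay relatively broad once this layer exists; the rules, not the wording, carry the audience split.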

Why this matters outside one product

This is not specific to real estate. Any dual-audience product can fall into the same trap.

If two user groups ask similar questions but act on the answer under different stakes, they probably need different feedback loops. Shared infrastructure is fine. Shared editing tools can be fine. Shared context objects can even be fine.

What is dangerous to share blindly is the definition of a “good answer.”

That definition should change when:

  • one audience needs diagnostic clarity
  • one audience mainly needs speed and leverage
  • one audience can safely revise a weak output
  • one audience needs the system to reduce ambiguity before a high-stakes decision

Once we started designing around that distinction, the product became much easier to reason about.

The practical takeaway

If your AI feature serves more than one audience, add these three checks before you trust the output too much:

  1. Can the answer survive the user’s immediate next decision?
  2. Does it point to a real failure mode instead of a broad theme?
  3. Are we measuring success in a way that matches this user’s actual job?

Those questions are simple, but they change what a product team notices. They move the conversation away from “the model sounds good” and toward “the workflow got sharper.”

That is the shift that mattered for us. Once we stopped treating every request for help as the same kind of help, both sides of the product improved.

And if I had to keep only one lesson from that process, it would be this: a shared prompt is cheap, but a shared trust model usually is not.
