DEV Community

Aman
How I Scope an LLM Feature Before Writing Any Code


Before I build any LLM feature, I spend time narrowing the problem, defining failure modes, and deciding what “good” actually means. That scoping work usually matters more than the first version of the code.


One of the easiest mistakes in AI product work is jumping into implementation too early.

A team gets excited about a model, a use case sounds promising, and the first instinct is often:

“Let’s build a quick prototype and see what happens.”

Sometimes that works.

Most of the time, it creates confusion.

Over time, I’ve learned that the quality of an LLM feature is heavily shaped before any code is written. The scoping phase decides whether the feature will solve a real problem, whether it can be evaluated clearly, and whether it has a realistic path to production.

So before I build anything, I slow down and answer a few practical questions.

1. What exact user problem are we solving?

This is the first filter, and it is more important than the model choice.

A lot of weak AI features are not weak because the model is bad. They are weak because the problem definition is vague.

For example, these are too broad:

  • help users with documents
  • answer questions intelligently
  • automate customer support
  • make internal workflows smarter

Those sound useful, but they are not scoped enough to build well.

I try to turn them into something more specific:

  • generate a first draft reply for support tickets about billing issues
  • extract structured fields from uploaded intake forms
  • answer employee questions using a defined internal knowledge base
  • classify inbound requests into a fixed set of actions

That shift matters a lot.

The narrower the problem, the easier it is to define useful behavior, identify edge cases, and improve quality over time.

If I cannot describe the task clearly in one or two sentences, the scope is usually still too fuzzy.

2. Why does this need an LLM at all?

This question saves time.

Not every workflow problem needs a model. Some are better solved with rules, search, templates, or normal backend logic.

Before choosing an LLM approach, I ask:

  • Is the task language-heavy?
  • Does it involve ambiguity or messy inputs?
  • Would fixed rules become hard to maintain?
  • Is there enough value to justify model cost and complexity?
  • Can the output be verified or constrained?

Sometimes the answer is yes, and an LLM is the right tool.

Sometimes the answer is “partially,” which usually means the best solution is a hybrid system: standard software for the predictable parts, and model-based logic only where flexibility is actually needed.

That tends to produce more reliable products than trying to make the model do everything.
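A hybrid like that can be sketched in a few lines: deterministic rules catch the predictable requests, and only the leftovers fall through to the model. The patterns, labels, and the `classify_with_llm` stub below are all hypothetical, just to show the shape.

```python
# Hypothetical hybrid router: plain rules handle the predictable
# cases, and only unmatched inputs fall through to the model call.
import re

def classify_with_llm(text: str) -> str:
    # Placeholder for a real model call; returns a fixed label here.
    return "needs_triage"

def route_request(text: str) -> str:
    # Predictable patterns resolved with rules -- no model needed.
    if re.search(r"\b(reset|forgot)\b.*\bpassword\b", text, re.IGNORECASE):
        return "password_reset"
    if re.search(r"\b(refund|chargeback)\b", text, re.IGNORECASE):
        return "billing"
    # Everything else goes to the flexible, model-based path.
    return classify_with_llm(text)
```

The rules stay cheap, testable, and auditable; the model only pays its cost where ambiguity actually lives.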

3. What does success actually look like?

This is where a lot of teams stay too abstract.

They say things like:

  • make it helpful
  • make it accurate
  • make it feel smart
  • improve the user experience

Those are directionally fine, but they are not enough to guide implementation.

I try to translate success into something more concrete:

  • draft quality is good enough that users only make light edits
  • extraction accuracy is above a usable threshold for the top document types
  • answers cite relevant internal sources
  • classification output maps cleanly to downstream actions
  • the feature reduces time spent on a task by a meaningful amount

When success is vague, evaluation becomes vague too.

And once evaluation is vague, the team starts arguing from opinions instead of evidence.

A good scoped feature has a definition of “useful” that multiple people can agree on.

4. What are the most likely failure modes?

This is one of the most important parts of scoping.

Before building the happy path, I want to understand how the feature will fail.

Common failure modes for LLM features include:

  • wrong but confident answers
  • incomplete extraction
  • low-quality formatting
  • ignoring instructions
  • using stale or irrelevant context
  • over-triggering automation
  • producing output that looks valid but is not trustworthy

I like to ask:

If this feature fails in production, what kind of failure will hurt the user most?

That question is often more useful than asking how to improve average-case performance.

For example:

  • In a support workflow, a bad draft may be acceptable if a human reviews it.
  • In a compliance-sensitive workflow, even a small hallucination may be unacceptable.
  • In document extraction, missing one field may be manageable, but assigning the wrong value may be much worse.

Understanding the failure shape affects architecture decisions early:

  • Do we need human review?
  • Do we need citations?
  • Do we need confidence thresholds?
  • Do we need schema validation?
  • Do we need a fallback?

Those choices should come from scope, not from cleanup after launch.
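To make the schema-validation and fallback choices concrete, here is a minimal sketch for an extraction feature. The field names and the `needs_review` fallback label are invented for the example; the point is that invalid or incomplete model output never flows silently downstream.

```python
# Minimal output guardrail: validate the model's JSON output against
# required fields, and fall back to human review when validation fails.
import json

REQUIRED_FIELDS = {"customer_name", "invoice_number", "amount"}

def validate_extraction(raw_output: str) -> dict:
    # Output that is not even valid JSON goes straight to review.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"status": "needs_review", "reason": "invalid JSON"}
    # Output missing required fields is flagged, not passed through.
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        return {"status": "needs_review", "reason": f"missing: {sorted(missing)}"}
    return {"status": "ok", "data": data}
```

A real version might also check types and value ranges, but even this much turns "looks valid but is not trustworthy" into an explicit, routable state.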

5. What context will the model need?

Many LLM features do not fail because of poor reasoning.

They fail because the system does not provide the right information.

So before coding, I think carefully about context:

  • Will the model rely only on the user’s input?
  • Does it need internal documentation?
  • Does it need historical examples?
  • Does it need structured product data?
  • Does it need permissions-aware retrieval?
  • How fresh does the information need to be?

This is usually the moment where the real architecture starts to appear.

A simple drafting feature may only need prompt structure and user input.

A knowledge feature may need retrieval and ranking.

An action-oriented feature may need tool access plus strict validation.

Scoping the context layer early helps avoid a common mistake:
building a nice prompt around weak or incomplete inputs.
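To make the context questions concrete, here is a hypothetical context-assembly step that applies permissions and freshness filters before anything reaches the prompt. The document fields and the 90-day default are assumptions for illustration.

```python
# Illustrative context assembly: filter retrieved documents by the
# user's permissions and by freshness before they enter the prompt.
from datetime import datetime, timedelta

def build_context(docs, user_groups, max_age_days=90, now=None):
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [
        d["text"] for d in docs
        if d["group"] in user_groups   # permissions-aware retrieval
        and d["updated"] >= cutoff     # freshness requirement
    ]
```

Deciding these filters during scoping is much cheaper than discovering after launch that the model was answering from stale or unauthorized documents.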

6. What should be deterministic, and what should stay flexible?

One of the best ways to improve LLM features is to reduce how much you leave open-ended.

I try to separate the workflow into two parts:

Deterministic parts

  • permissions
  • routing
  • calculations
  • database writes
  • state transitions
  • validations

Flexible parts

  • summarization
  • classification with ambiguous inputs
  • drafting
  • extraction from messy text
  • natural language interpretation

This separation matters because it keeps the model focused on the parts where flexibility adds value.

The more deterministic logic you push into standard software, the easier the feature is to trust, debug, and maintain.

In my experience, good scoping often means deciding not just what the model should do, but also what it definitely should not do.
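The split can be sketched simply: the model proposes a label (the flexible part), and plain code enforces the allowed set before anything downstream happens (the deterministic part). The action names here are made up.

```python
# The model suggests; deterministic code decides what is allowed.
ALLOWED_ACTIONS = {"refund", "escalate", "close"}

def apply_classification(model_label: str) -> str:
    label = model_label.strip().lower()
    if label not in ALLOWED_ACTIONS:
        # Deterministic guard: an unexpected label never drives state.
        return "manual_review"
    return label
```

The state transition itself stays in ordinary software; the model's only job is the ambiguous mapping from messy text to a label.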

7. How will we evaluate the first version?

I never want evaluation to be an afterthought.

Before building, I try to identify a lightweight but useful way to assess quality.

That can include:

  • a small set of representative examples
  • side-by-side output review
  • human scoring with a simple rubric
  • pass/fail checks for structured outputs
  • task completion rate
  • edit distance from final accepted output
  • user acceptance or override behavior

The goal is not to build a perfect benchmark on day one.

The goal is to avoid launching a feature with no real feedback loop.

Even a simple evaluation setup creates discipline. It forces the team to define what matters and gives the feature a path for improvement beyond opinions and demos.
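A setup like that can stay very small. Here is an illustrative harness that combines a pass/fail check with a similarity score against an accepted reference output, using `difflib`'s ratio as a cheap stand-in for edit distance. The example structure is an assumption, not a standard format.

```python
# Lightweight evaluation over representative examples: one structural
# pass/fail check plus a similarity score against an accepted output.
from difflib import SequenceMatcher

def evaluate(examples, generate):
    results = []
    for ex in examples:
        output = generate(ex["input"])
        similarity = SequenceMatcher(None, output, ex["accepted"]).ratio()
        results.append({
            "input": ex["input"],
            "non_empty": bool(output.strip()),   # simple pass/fail check
            "similarity": round(similarity, 2),  # closeness to accepted draft
        })
    return results
```

Swapping `generate` for the real model call gives a repeatable score on the same examples after every prompt change, which is already a feedback loop.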

8. What is the smallest version worth shipping?

This question helps prevent overbuilding.

A lot of AI features become bloated before they ever reach users. Teams try to support too many use cases, too many workflows, and too many edge cases in version one.

I prefer to find the smallest version that is still genuinely useful.

That might be:

  • one document type instead of ten
  • one internal knowledge domain instead of the whole company wiki
  • draft suggestions only, without auto-send
  • classification only, without downstream automation
  • one user segment first, before expanding

Smaller scope creates faster learning.

And in AI work, learning quickly from real usage is usually more valuable than shipping an overly ambitious first release.

9. What needs a human in the loop?

I do not treat human review as a weakness. I treat it as a design tool.

Before writing code, I ask where humans should stay involved:

  • review every output?
  • review only low-confidence cases?
  • approve actions before execution?
  • correct extracted data?
  • flag bad answers for retraining or prompt updates?

This is especially important when the feature touches business operations, healthcare, internal knowledge, or customer communication.

A good human-in-the-loop step can dramatically reduce risk while still delivering most of the time savings the product needs.

Trying to remove humans too early often leads to fragile systems and lower trust.
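One hypothetical way to encode such a review policy, assuming the model exposes a confidence score (the thresholds and route labels below are invented):

```python
# Route each output to a handling path based on risk and confidence.
def review_route(confidence: float, high_risk: bool) -> str:
    if high_risk:
        return "human_approval"   # risky actions always need sign-off
    if confidence >= 0.85:
        return "auto_suggest"     # shown to the user as a draft
    return "human_review"         # low-confidence cases get reviewed
```

Even a crude policy like this makes the human-in-the-loop decision explicit and tunable, instead of implicit in scattered product behavior.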

10. Is this feature a demo, a workflow, or a product capability?

This is the final framing question I like to ask.

Because those three things are different.

A demo is built to impress.

A workflow tool is built to save time on a task.

A product capability is built to behave consistently over time inside a larger system.

If the goal is only to demonstrate possibility, the bar is lower.

If the goal is to support real work, the bar is much higher:
better context, better guardrails, clearer evaluation, better observability, and better UX around failure.

Knowing which one you are building changes what “done” means.

Final thoughts

When I scope an LLM feature well, the implementation usually becomes simpler.

Not because the work is easy, but because the uncertainty is lower.

I know:

  • what problem I’m solving
  • why an LLM is justified
  • what success looks like
  • what failure looks like
  • what context is required
  • what stays deterministic
  • how the first version will be evaluated
  • what the smallest useful release actually is

That is why I try not to jump into code too fast.

In AI product development, the first technical decision is often not about the stack, the framework, or even the model.

It is about whether the feature has been scoped clearly enough to deserve being built.

And in my experience, that step is where a lot of the real engineering judgment begins.
