I have been thinking about this after reading the rLLM work on post-training language agents.
The big idea in that work is right: if agents are going to improve, they need a loop.
Not just inference.
A loop of action, feedback, and improvement.
What I want to argue in this post is simple:
- The frontend is the best place to collect the heuristics for that loop.
- Most teams should push prompt augmentation much further before training an offline reward model.
tl;dr
- Agent quality is now a feedback-loop problem, not only a model-size problem.
- The frontend is the only place where intent, correction, and outcome are visible together.
- Product heuristics can drive real gains without starting with offline reward-model training.
- Prompt augmentation gets most teams very far on short- and medium-horizon workflows.
- Move to offline RL only when prompt gains flatten or long-horizon credit assignment becomes the bottleneck.
From static intelligence to live behavior
Reasoning models can do very well on static tasks.
Agentic software lives in dynamic tasks.
That means we care less about one-shot correctness and more about behavioral quality over a sequence of steps:
- Did the system choose a sensible next action?
- Did the user need to undo or rewrite it?
- Did the workflow actually complete?
This is why post-training agents matter.
And this is why product teams need to treat interaction data as first-class infrastructure.
Why the frontend is the best reward surface
The backend can tell us what was called.
The model logs can tell us what token came next.
Only the frontend can reliably tell us whether the user felt the action was useful.
The frontend gives us a dense stream of behavior signals:
- Accept vs reject
- Minor edit vs full rewrite
- One-shot completion vs retry loops
- Continue vs abandon
- Agent flow vs human escalation
These are not perfect labels.
They are heuristics.
But they are high signal and available immediately.
In practice, this is the best early reward surface most teams have.
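To make these signals usable later, they need to be captured with stable identifiers from day one. Here is a minimal sketch of what an interaction event could look like; the schema and field names are my own assumptions, not a standard:

```python
from dataclasses import dataclass, field
import time
import uuid

# Hypothetical event schema for frontend behavior signals.
# Field names are illustrative and would be adapted per product.
@dataclass
class AgentInteractionEvent:
    task_id: str                 # stable ID linking events across one workflow
    session_id: str
    event_type: str              # "accept" | "reject" | "edit" | "retry" | "abandon" | "escalate"
    agent_output_len: int = 0    # kept so edit-distance ratios can be computed later
    final_output_len: int = 0
    timestamp: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

# Example: a user lightly edited a draft before submitting it.
event = AgentInteractionEvent(
    task_id="task-123",
    session_id="sess-456",
    event_type="edit",
    agent_output_len=420,
    final_output_len=430,
)
```

The point of the stable `task_id` is that accept/edit/retry/abandon events from the same workflow can be joined into one trajectory later.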
A practical heuristic set
If I were starting today, I would implement four buckets.
1. Correction heuristics
- Edit distance between agent output and final submitted value
- Undo/revert frequency after agent actions
- Manual overwrite rate
2. Friction heuristics
- Time-to-next-meaningful-action
- Retry count for same intent
- Dead-end path frequency
3. Outcome heuristics
- Task completion rate
- Reopen/regression rate
- SLA-aware completion time
4. Trust heuristics
- Delegation rate for repeat tasks
- Human escalation rate
- Opt-out after poor outcomes
No single metric is enough.
A composite score is usually better than betting on one proxy.
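As a concrete illustration of a composite score, here is a sketch that combines the four buckets with fixed weights. The weights, the undo penalty, and the use of a similarity ratio as the correction signal are all assumptions to be tuned per product:

```python
from difflib import SequenceMatcher

# Illustrative weights over the four heuristic buckets; these are
# assumptions, not recommendations, and should be tuned empirically.
WEIGHTS = {"correction": 0.35, "friction": 0.25, "outcome": 0.25, "trust": 0.15}

def correction_score(agent_output: str, final_output: str, undo_count: int) -> float:
    # Similarity ratio in [0, 1]: 1.0 means the user kept the output verbatim.
    similarity = SequenceMatcher(None, agent_output, final_output).ratio()
    return max(0.0, similarity - 0.1 * undo_count)  # each undo docks the score

def composite_score(correction: float, friction: float, outcome: float, trust: float) -> float:
    parts = {"correction": correction, "friction": friction, "outcome": outcome, "trust": trust}
    return sum(WEIGHTS[k] * v for k, v in parts.items())

# Example: near-verbatim accept, no undos, smooth flow, completed task, delegated again.
c = correction_score("Draft reply to the customer", "Draft reply to the customer.", 0)
score = composite_score(c, friction=0.9, outcome=1.0, trust=0.8)
```

Because the weights sum to 1.0, the composite stays in [0, 1] and individual buckets can be swapped or reweighted without changing the scale downstream systems see.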
The part I want to challenge: do we need offline reward models first?
Usually, no.
I think most teams can get much further than expected with prompt augmentation before training a separate reward model offline.
When I say prompt augmentation, I mean the full runtime strategy:
- Better task framing
- Better context packing from recent user behavior
- Better tool constraints and checks
- Better fallback prompts for low-confidence cases
- Better routing between prompt variants
This is still behavior optimization.
It is just inference-time optimization.
For many product workflows, that is enough to unlock most early gains.
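The routing and fallback pieces of that runtime strategy can be sketched in a few lines. Everything here is hypothetical: the variant names, the confidence threshold, and the idea of feeding rolling heuristic scores back into routing are illustrations of the approach, not a prescribed design:

```python
# Sketch of inference-time routing between prompt variants, driven by
# heuristic scores. Variant texts and thresholds are illustrative.
PROMPT_VARIANTS = {
    "concise": "Answer in one short paragraph. Cite the field you changed.",
    "guided":  "Work step by step. Confirm each field before moving on.",
}

# Rolling per-variant scores, e.g. refreshed weekly from the heuristic scorecard.
variant_scores = {"concise": 0.72, "guided": 0.61}

def route_prompt(model_confidence: float, low_conf_threshold: float = 0.5) -> str:
    # Low-confidence cases fall back to the slower, more guided variant;
    # otherwise pick whichever variant the scorecard currently favors.
    if model_confidence < low_conf_threshold:
        return PROMPT_VARIANTS["guided"]
    best = max(variant_scores, key=variant_scores.get)
    return PROMPT_VARIANTS[best]
```

A call like `route_prompt(0.3)` returns the guided fallback, while `route_prompt(0.9)` returns whichever variant currently scores best. Nothing here touches model weights; it is purely inference-time behavior optimization.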
How far prompt augmentation gets us
Prompt augmentation gets us very far on workflows with:
- Bounded scope
- Clear completion criteria
- Frequent user corrections
- Fast evaluation loops
Examples:
- Internal support triage
- Structured drafting and form completion
- Actionable summarization
- Multi-step UI workflows with explicit completion states
Where it starts to fail:
- Long trajectories with delayed outcomes
- Heavy credit assignment ambiguity
- High cost/latency from complex inference-time strategies
- Persistent instability across prompt variants
That is when offline trajectory training and RL become worth the investment.
Recommended rollout
If you are building agentic product features now, this is the sequence I would use:
- Instrument frontend interaction events with stable task/session IDs.
- Define a small heuristic scorecard (correction, friction, outcome, trust).
- Use that scorecard to drive prompt and routing updates weekly.
- Build an eval flywheel with holdouts and regression checks.
- Introduce offline reward-model training only after improvements plateau.
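The regression-check step of that flywheel can be as simple as comparing a candidate prompt's heuristic scores against a holdout baseline before rollout. This is a minimal sketch; the score lists, the tolerance, and the mean-based comparison are all assumptions:

```python
import statistics

# Sketch of a weekly regression check: a candidate prompt variant must not
# regress the mean composite score on a holdout set by more than `tolerance`.
def passes_regression_check(baseline: list[float], candidate: list[float],
                            tolerance: float = 0.02) -> bool:
    return statistics.mean(candidate) >= statistics.mean(baseline) - tolerance

# Illustrative composite scores from a holdout set.
baseline_scores = [0.71, 0.68, 0.74, 0.70]
candidate_scores = [0.73, 0.69, 0.75, 0.72]
ok = passes_regression_check(baseline_scores, candidate_scores)
```

In practice you would likely want a significance test rather than a raw mean comparison, but even this crude gate prevents silently shipping a prompt change that regresses the scorecard.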
This keeps teams moving quickly while preserving a path to deeper optimization.
Guardrails
None of this works well without guardrails.
At minimum:
- Privacy-safe event collection
- Anti-gaming checks for proxy metrics
- Correctness-weighted objectives (not just speed)
- Human review samples to calibrate heuristic drift
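To make the correctness-weighted point concrete, here is one way a guarded objective could look. The weighting scheme and the speed cap are illustrative assumptions, not a recommended formula:

```python
# Sketch of a correctness-weighted objective with an anti-gaming cap:
# speed only counts when the task was actually correct, and the speed
# bonus is capped so latency-gaming cannot dominate the metric.
def guarded_objective(correct: bool, seconds_to_complete: float,
                      target_seconds: float = 60.0) -> float:
    if not correct:
        return 0.0  # speed without correctness earns nothing
    speed_bonus = min(1.0, target_seconds / max(seconds_to_complete, 1.0))
    return 0.8 + 0.2 * speed_bonus  # correctness dominates; speed caps at 20%
```

Under this shape, an agent that answers instantly but wrongly scores 0.0, while a correct answer within the target window scores 1.0. That asymmetry is the guardrail.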
If we do not add guardrails, we optimize the dashboard instead of the product.
Closing
The backend can train the policy.
But the frontend is where the reward signal is born.
If we want agents that improve from real human interaction, we need to treat frontend behavior as training infrastructure.
And we should be honest about sequencing:
Start with prompt augmentation.
Push it hard.
Then add offline RL when the data proves you need it.
References
- rLLM: A Framework for Post-Training Language Agents
- rLLM Docs: rLLM Project Documentation
- Sutton and Silver: Welcome to the Era of Experience