I have been thinking about this after reading the rLLM work on post-training language agents.
The big idea in that work is right: if agents are going to improve, they need a loop.
Not just inference.
A loop of action, feedback, and improvement.
What I want to argue in this post is simple:
- The frontend is the best place to collect the heuristics for that loop.
- Most teams should push prompt augmentation much further before training an offline reward model.
tl;dr
- Agent quality is now a feedback-loop problem, not only a model-size problem.
- The frontend is the only place where intent, correction, and outcome are visible together.
- Product heuristics can drive real gains without starting with offline reward-model training.
- Prompt augmentation gets most teams very far on short- and medium-horizon workflows.
- Move to offline RL only when prompt gains flatten or long-horizon credit assignment becomes the bottleneck.
From static intelligence to live behavior
Reasoning models can do very well on static tasks.
Agentic software lives in dynamic tasks.
That means we care less about one-shot correctness and more about behavioral quality over a sequence of steps:
- Did the system choose a sensible next action?
- Did the user need to undo or rewrite it?
- Did the workflow actually complete?
This is why post-training agents matter.
And this is why product teams need to treat interaction data as first-class infrastructure.
Why the frontend is the best reward surface
The backend can tell us what was called.
The model logs can tell us what token came next.
Only the frontend can reliably tell us whether the user felt the action was useful.
The frontend gives us a dense stream of behavior signals:
- Accept vs reject
- Minor edit vs full rewrite
- One-shot completion vs retry loops
- Continue vs abandon
- Agent flow vs human escalation
These are not perfect labels.
They are heuristics.
But they are high signal and available immediately.
In practice, this is the best early reward surface most teams have.
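To make these signals usable later, they need to be captured with stable identifiers from day one. Here is a minimal sketch of what an interaction event could look like; the schema and field names are my own assumptions, not a standard:

```python
from dataclasses import dataclass, field
import time
import uuid

# Hypothetical event schema for frontend behavior signals.
# Field names are illustrative and would be adapted per product.
@dataclass
class AgentInteractionEvent:
    task_id: str                 # stable ID linking events across one workflow
    session_id: str
    event_type: str              # "accept" | "reject" | "edit" | "retry" | "abandon" | "escalate"
    agent_output_len: int = 0    # kept so edit-distance ratios can be computed later
    final_output_len: int = 0
    timestamp: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

# Example: a user lightly edited a draft before submitting it.
event = AgentInteractionEvent(
    task_id="task-123",
    session_id="sess-456",
    event_type="edit",
    agent_output_len=420,
    final_output_len=430,
)
```

The point of the stable `task_id` is that accept/edit/retry/abandon events from the same workflow can be joined into one trajectory later.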
A practical heuristic set
If I were starting today, I would implement four buckets.
1. Correction heuristics
- Edit distance between agent output and final submitted value
- Undo/revert frequency after agent actions
- Manual overwrite rate
2. Friction heuristics
- Time-to-next-meaningful-action
- Retry count for same intent
- Dead-end path frequency
3. Outcome heuristics
- Task completion rate
- Reopen/regression rate
- SLA-aware completion time
4. Trust heuristics
- Delegation rate for repeat tasks
- Human escalation rate
- Opt-out after poor outcomes
No single metric is enough.
A composite score is usually better than betting on one proxy.
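As a concrete illustration of a composite score, here is a sketch that combines the four buckets with fixed weights. The weights, the undo penalty, and the use of a similarity ratio as the correction signal are all assumptions to be tuned per product:

```python
from difflib import SequenceMatcher

# Illustrative weights over the four heuristic buckets; these are
# assumptions, not recommendations, and should be tuned empirically.
WEIGHTS = {"correction": 0.35, "friction": 0.25, "outcome": 0.25, "trust": 0.15}

def correction_score(agent_output: str, final_output: str, undo_count: int) -> float:
    # Similarity ratio in [0, 1]: 1.0 means the user kept the output verbatim.
    similarity = SequenceMatcher(None, agent_output, final_output).ratio()
    return max(0.0, similarity - 0.1 * undo_count)  # each undo docks the score

def composite_score(correction: float, friction: float, outcome: float, trust: float) -> float:
    parts = {"correction": correction, "friction": friction, "outcome": outcome, "trust": trust}
    return sum(WEIGHTS[k] * v for k, v in parts.items())

# Example: near-verbatim accept, no undos, smooth flow, completed task, delegated again.
c = correction_score("Draft reply to the customer", "Draft reply to the customer.", 0)
score = composite_score(c, friction=0.9, outcome=1.0, trust=0.8)
```

Because the weights sum to 1.0, the composite stays in [0, 1] and individual buckets can be swapped or reweighted without changing the scale downstream systems see.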
The part I want to challenge: do we need offline reward models first?
Usually, no.
I think most teams can get much further than expected with prompt augmentation before training a separate reward model offline.
When I say prompt augmentation, I mean the full runtime strategy:
- Better task framing
- Better context packing from recent user behavior
- Better tool constraints and checks
- Better fallback prompts for low-confidence cases
- Better routing between prompt variants
This is still behavior optimization.
It is just inference-time optimization.
For many product workflows, that is enough to unlock most early gains.
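The routing and fallback pieces of that runtime strategy can be sketched in a few lines. Everything here is hypothetical: the variant names, the confidence threshold, and the idea of feeding rolling heuristic scores back into routing are illustrations of the approach, not a prescribed design:

```python
# Sketch of inference-time routing between prompt variants, driven by
# heuristic scores. Variant texts and thresholds are illustrative.
PROMPT_VARIANTS = {
    "concise": "Answer in one short paragraph. Cite the field you changed.",
    "guided":  "Work step by step. Confirm each field before moving on.",
}

# Rolling per-variant scores, e.g. refreshed weekly from the heuristic scorecard.
variant_scores = {"concise": 0.72, "guided": 0.61}

def route_prompt(model_confidence: float, low_conf_threshold: float = 0.5) -> str:
    # Low-confidence cases fall back to the slower, more guided variant;
    # otherwise pick whichever variant the scorecard currently favors.
    if model_confidence < low_conf_threshold:
        return PROMPT_VARIANTS["guided"]
    best = max(variant_scores, key=variant_scores.get)
    return PROMPT_VARIANTS[best]
```

A call like `route_prompt(0.3)` returns the guided fallback, while `route_prompt(0.9)` returns whichever variant currently scores best. Nothing here touches model weights; it is purely inference-time behavior optimization.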
How far prompt augmentation gets us
Prompt augmentation gets us very far on workflows with:
- Bounded scope
- Clear completion criteria
- Frequent user corrections
- Fast evaluation loops
Examples:
- Internal support triage
- Structured drafting and form completion
- Actionable summarization
- Multi-step UI workflows with explicit completion states
Where it starts to fail:
- Long trajectories with delayed outcomes
- Heavy credit assignment ambiguity
- High cost/latency from complex inference-time strategies
- Persistent instability across prompt variants
That is when offline trajectory training and RL become worth the investment.
Recommended rollout
If you are building agentic product features now, this is the sequence I would use:
- Instrument frontend interaction events with stable task/session IDs.
- Define a small heuristic scorecard (correction, friction, outcome, trust).
- Use that scorecard to drive prompt and routing updates weekly.
- Build an eval flywheel with holdouts and regression checks.
- Introduce offline reward-model training only after improvements plateau.
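The regression-check step of that flywheel can be as simple as comparing a candidate prompt's heuristic scores against a holdout baseline before rollout. This is a minimal sketch; the score lists, the tolerance, and the mean-based comparison are all assumptions:

```python
import statistics

# Sketch of a weekly regression check: a candidate prompt variant must not
# regress the mean composite score on a holdout set by more than `tolerance`.
def passes_regression_check(baseline: list[float], candidate: list[float],
                            tolerance: float = 0.02) -> bool:
    return statistics.mean(candidate) >= statistics.mean(baseline) - tolerance

# Illustrative composite scores from a holdout set.
baseline_scores = [0.71, 0.68, 0.74, 0.70]
candidate_scores = [0.73, 0.69, 0.75, 0.72]
ok = passes_regression_check(baseline_scores, candidate_scores)
```

In practice you would likely want a significance test rather than a raw mean comparison, but even this crude gate prevents silently shipping a prompt change that regresses the scorecard.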
This keeps teams moving quickly while preserving a path to deeper optimization.
Guardrails
None of this works well without guardrails.
At minimum:
- Privacy-safe event collection
- Anti-gaming checks for proxy metrics
- Correctness-weighted objectives (not just speed)
- Human review samples to calibrate heuristic drift
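To make the correctness-weighted point concrete, here is one way a guarded objective could look. The weighting scheme and the speed cap are illustrative assumptions, not a recommended formula:

```python
# Sketch of a correctness-weighted objective with an anti-gaming cap:
# speed only counts when the task was actually correct, and the speed
# bonus is capped so latency-gaming cannot dominate the metric.
def guarded_objective(correct: bool, seconds_to_complete: float,
                      target_seconds: float = 60.0) -> float:
    if not correct:
        return 0.0  # speed without correctness earns nothing
    speed_bonus = min(1.0, target_seconds / max(seconds_to_complete, 1.0))
    return 0.8 + 0.2 * speed_bonus  # correctness dominates; speed caps at 20%
```

Under this shape, an agent that answers instantly but wrongly scores 0.0, while a correct answer within the target window scores 1.0. That asymmetry is the guardrail.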
If we do not add guardrails, we optimize the dashboard instead of the product.
Closing
The backend can train the policy.
But the frontend is where the reward signal is born.
If we want agents that improve from real human interaction, we need to treat frontend behavior as training infrastructure.
And we should be honest about sequencing:
Start with prompt augmentation.
Push it hard.
Then add offline RL when the data proves you need it.
References
- rLLM: A Framework for Post-Training Language Agents
- rLLM Docs: rLLM Project Documentation
- Sutton and Silver: Welcome to the Era of Experience