DEV Community

Nelson Lin

Design offline metrics for Agentic AI trading

In machine learning evaluation frameworks, there are two types of metrics: online and offline.
Online metrics are real-world measurements that carry direct business value in a production environment. At sandx.ai, for example, the actual return generated by AI agents is an online metric.

Offline metrics are reproducible using fixed inputs and variables with ground truth. They are extremely useful for testing without having to wait for real-time outcomes.

For AI agent systems, designing offline metrics is much more complex than for a typical ML system.

To evaluate LLMs on investment decisions, I came up with the following idea.

At sandx.ai, while the agent continuously invests and manages a portfolio, I retain every message, its historical reasoning, and the final stock recommendation. Specifically, the system stores a snapshot that includes:

1) The LLM's internal state and context
2) The overall market situation
3) The agent's tool-calling results

Together, these form X, while the final stock recommendation forms Y.

Formally, for a given time $t$:

$X_t$ = snapshot of:
1) LLM configuration (model version, weights, prompt, temperature, available tools)
2) Market conditions (prices, volume, volatility, sentiment, macro data, order book, etc.)
3) Agent's tool-calling results (e.g., retrieved data, backtest execution, risk computation)

$Y_t$ = final stock recommendation (e.g., buy/sell/hold, target price, allocation percentage)

Thus, the system produces a time-ordered sequence:

$(X_1, Y_1), (X_2, Y_2), \dots, (X_t, Y_t)$
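
As a minimal sketch of how each (X, Y) pair could be stored, the Python below models a snapshot and its recommendation as immutable records in an append-only, time-ordered log. All class and field names here (`LLMConfig`, `Snapshot`, `Recommendation`, `SnapshotLog`) are hypothetical and chosen for illustration; they are not the actual sandx.ai schema, and model weights are represented only by a version string.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class LLMConfig:
    # Hypothetical fields; weights are referenced by version, not stored inline.
    model_version: str
    prompt: str
    temperature: float
    available_tools: Tuple[str, ...]


@dataclass(frozen=True)
class Snapshot:
    """X: everything the agent could see at decision time."""
    taken_at: datetime
    llm_config: LLMConfig
    market: Dict[str, float]        # e.g. prices, volume, volatility
    tool_results: Dict[str, Any]    # raw outputs of the agent's tool calls


@dataclass(frozen=True)
class Recommendation:
    """Y: the agent's final output for this snapshot."""
    ticker: str
    action: str                     # "buy", "sell", or "hold"
    target_price: Optional[float] = None
    allocation_pct: Optional[float] = None


@dataclass
class SnapshotLog:
    """Append-only sequence of (X, Y) pairs in strictly increasing time order."""
    records: List[Tuple[Snapshot, Recommendation]] = field(default_factory=list)

    def append(self, x: Snapshot, y: Recommendation) -> None:
        if self.records and x.taken_at <= self.records[-1][0].taken_at:
            raise ValueError("snapshots must be appended in increasing time order")
        self.records.append((x, y))


# Example usage with made-up values.
config = LLMConfig("model-a", "You are a portfolio manager.", 0.2, ("get_prices",))
log = SnapshotLog()
log.append(
    Snapshot(datetime(2024, 1, 2), config, {"close": 100.0}, {"get_prices": [99.0, 100.0]}),
    Recommendation("ABC", "buy", target_price=110.0, allocation_pct=5.0),
)
log.append(
    Snapshot(datetime(2024, 1, 3), config, {"close": 101.5}, {"get_prices": [100.0, 101.5]}),
    Recommendation("ABC", "hold"),
)
```

Freezing the records and rejecting out-of-order appends is one way to keep the inputs fixed, which is what makes a metric computed on them reproducible.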

Based on these historical snapshots, we can create an offline metric to evaluate which LLM performs best at predicting future market movements. This can be done by assessing how closely a model's past prediction, made in a context similar to the current one, aligns with the actual market outcome.
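
One possible way to turn this into a number is a directional hit rate: replay each stored snapshot through a candidate model and check whether its recommendation agrees with the return realized afterwards. The sketch below is an assumption about how such a metric could look, not the framework's actual definition. The names `is_hit`, `hit_rate`, and `rank_models`, the `hold_band` threshold, and the toy "models" (plain functions standing in for an LLM replayed on a snapshot) are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Features = Dict[str, float]            # simplified stand-in for a stored snapshot
Model = Callable[[Features], str]      # returns "buy", "sell", or "hold"


def is_hit(action: str, forward_return: float, hold_band: float = 0.01) -> bool:
    """A recommendation is a hit if it matches the direction of the realized return."""
    if action == "buy":
        return forward_return > hold_band
    if action == "sell":
        return forward_return < -hold_band
    if action == "hold":
        return abs(forward_return) <= hold_band
    raise ValueError(f"unknown action: {action!r}")


def hit_rate(actions: Sequence[str], forward_returns: Sequence[float],
             hold_band: float = 0.01) -> float:
    """Fraction of recommendations that agree with the realized outcome."""
    if len(actions) != len(forward_returns):
        raise ValueError("actions and forward_returns must have the same length")
    if not actions:
        raise ValueError("cannot score an empty history")
    hits = sum(is_hit(a, r, hold_band) for a, r in zip(actions, forward_returns))
    return hits / len(actions)


def rank_models(models: Dict[str, Model], snapshots: Sequence[Features],
                forward_returns: Sequence[float],
                hold_band: float = 0.01) -> List[Tuple[str, float]]:
    """Replay every snapshot through each model and sort models by hit rate."""
    scores = {
        name: hit_rate([model(x) for x in snapshots], forward_returns, hold_band)
        for name, model in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: forward returns are attached only at scoring time, never fed to the model.
snapshots = [{"momentum": 0.5}, {"momentum": -0.3}, {"momentum": 0.0}, {"momentum": 0.2}]
forward_returns = [0.03, -0.02, 0.005, -0.04]


def momentum_model(x: Features) -> str:
    if x["momentum"] > 0.1:
        return "buy"
    if x["momentum"] < -0.1:
        return "sell"
    return "hold"


def always_hold(x: Features) -> str:
    return "hold"


ranking = rank_models({"momentum": momentum_model, "always_hold": always_hold},
                      snapshots, forward_returns)
```

On this toy data, `ranking` puts `momentum` first with a hit rate of 0.75 and `always_hold` second with 0.25. Keeping the realized returns outside the snapshot is a deliberate choice in this sketch, so that nothing observed after the decision can reach the model being scored.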

If you're interested in agent evaluation, particularly in AI trading, stay tuned for this offline agent evaluation framework; I will share the results.

My LinkedIn: https://www.linkedin.com/in/nelson-l-842564164/

Top comments (1)

Harjot Singh

The online-vs-offline split is the right backbone, and trading makes the offline-metrics problem brutal because the ground truth is path-dependent: the "correct" decision at t depends on a future you can't replay cleanly without leaking it. Two traps I'd watch: lookahead bias creeping in through your fixed inputs (any feature computed with post-decision data quietly inflates the offline score), and reward attribution: a good return can come from a bad-but-lucky decision, so scoring the outcome rewards noise. The fix that's worked for me is separating decision quality from outcome: score the agent's reasoning against what was knowable at decision time, then track realized return as a separate online signal. That's the same verify-the-process-not-just-the-output principle I build into Moonshift's eval layer. How are you handling the counterfactual: do you replay alternate actions, or only score the path the agent actually took?