Cheng Peng

Posted on Jun 3

Why Your LLM Agent Gives a Different P-Value Every Time (And What to Build Instead)

#datascience #llm #python #opensource

Hand the same paired before/after dataset (n = 25) to ChatGPT five times. Same prompt: "These are the same subjects measured before and after an intervention. Did their scores change significantly?"

Four of the five runs return p = 0.009 from a paired t-test.

The fifth run does a Shapiro–Wilk normality check on the differences first, decides they're non-normal, switches to a Wilcoxon signed-rank test, and reports p = 0.000018.

All five reach the same conclusion (significant). But notice what happened: only one run out of five thought to check an assumption you'd want it to check. The other four skipped it. The choice of method — and the test statistic, and the p-value — depended on whether the LLM happened to run an assumption check that time. On borderline data, this is the difference between reject and don't reject.
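To make the divergence concrete, here is a minimal sketch of the two analysis paths run on the same data. The data is synthetic and only illustrative (it is not the dataset from the runs above), so the p-values it prints will not match the ones quoted.

```python
import numpy as np
from scipy import stats

# Synthetic paired data, for illustration only (an assumption, not the post's dataset).
rng = np.random.default_rng(0)
before = rng.normal(50.0, 10.0, size=25)
# Skewed shifts, so the paired differences are plausibly non-normal.
after = before + rng.exponential(scale=4.0, size=25)
diff = after - before

# Path A: go straight to a paired t-test.
t_res = stats.ttest_rel(after, before)

# Path B: check normality of the differences first, then use a rank-based test.
sw_stat, sw_p = stats.shapiro(diff)
w_res = stats.wilcoxon(after, before)

print(f"paired t-test:        p = {t_res.pvalue:.3g}")
print(f"Shapiro-Wilk on diff: p = {sw_p:.3g}")
print(f"Wilcoxon signed-rank: p = {w_res.pvalue:.3g}")
```

Both paths are defensible, and they report different p-values for the same rows. Which one you get depends on whether the normality check was run at all.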

If you're using LLMs for exploratory data analysis on a weekend project, you might shrug. If you're using them for anything that gets cited, gets submitted to a regulator, or gets handed to a clinician, this is a problem. It's a known problem — Cui & Alexander (2026) documented exactly this kind of method-divergence empirically; AIRepr (Zeng et al., 2025) shows the same thing across reproducibility metrics. The current answer in the literature is to constrain the agent so its execution is replayable. But replayability fixes "did we run the same code." It doesn't fix "did we run the right analysis."

I've spent the last two months building a different fix. The more interesting half is the architecture. Let me walk through it.

The real problem isn't temperature

The first reflex is "set temperature=0." It's not enough.

temperature=0 doesn't make a tool-using agent deterministic across runs. Three reasons:

Inference isn't bitwise deterministic, even at temperature=0. Production LLM serving batches requests dynamically, and the attention kernels aren't batch-invariant — so the same input produces different output tokens depending on what other requests it gets batched with. Thinking Machines Lab and SGLang are still treating this as an active engineering problem in 2026.
Plausible methods have no principled tiebreaker. When a paired t-test and Wilcoxon signed-rank are both reasonable for a moderate-skew paired sample, there's no rule in the model's weights that says which to pick. It picks based on whichever rationale chain it happened to generate (as in the n=25 example above).
Whether an assumption check is even run is stochastic. The same dataset, asked the same question, sometimes triggers a Shapiro–Wilk check and sometimes doesn't. When the check runs and flags non-normality, the agent routes to a non-parametric test; when it doesn't run, the model defaults to a paired t-test. The case above is exactly this: one in five runs decided to check, four didn't.

The deeper issue is that LLM agents try to do two jobs at once: choosing which analysis to run, and running it. The first is a judgment problem the LLM is reasonably good at. The second is a computation problem the LLM is bad at, because its output is inherently stochastic and you can't verify the resulting numbers by inspection.

"Just write the code yourself"

Natural reaction: stop using the LLM for the computation. Write the scipy code yourself.

This is right — but it throws out the half that's actually useful. When a researcher says "compare the post-treatment scores between cohorts and tell me if the intervention worked," the value of the LLM is mapping that informal request to (a) the right columns in the dataframe, (b) the right method given assumptions, (c) the right multiple-comparison correction, (d) a plain-English summary at the end. That mapping is genuinely hard to encode as a fixed program. Throwing the whole LLM out is overcorrecting.

What you actually want: keep the LLM for the routing decision, but pin the computation to a fixed, validated implementation that cannot vary across runs.

LLM routes; engine computes

That's the architecture:

natural-language request
        │
        ▼
   LLM Supervisor ─────────► chooses ONE next action at a time
        │                    (a tool call, or a final answer)
        ▼
 Deterministic plugin ─────► runs a hardcoded statistical method,
        │                    cross-validated against scipy/statsmodels
        ▼
 Claims ledger + gate ─────► verifies that every reported number came
        │                    from an actual plugin run
        ▼
   Auditable report

This pattern — let the LLM choose tools, but pin the computation — isn't novel. Variants of it show up in domains as different as devops automation and financial reporting. What I think is specific to applying it to statistical inference is the anti-fabrication discipline below: a generic deterministic tool ecosystem still allows the LLM to paraphrase or round the numbers it received. The claims ledger pattern makes that structurally impossible.

I built this as StatGuard Agent. The supervisor LLM (currently gpt-4o) picks one of 27 hardcoded analysis plugins per step. The plugins do all numerical work; the LLM never emits a number. Given the same plugin and the same arguments, the output is byte-identical across runs — the variability that remains is in plugin selection, which is what the validation framework below targets.
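A rough sketch of the "LLM routes; engine computes" split, under stated assumptions: the registry, the `run_action` helper, and the shape of the action dictionary are all hypothetical names for illustration, not StatGuard's actual API. The point is only that the LLM's output names a plugin and columns, and every number comes from fixed code.

```python
from typing import Callable, Dict

import numpy as np
from scipy import stats

# Hypothetical plugin registry: name -> deterministic function.
PLUGINS: Dict[str, Callable[..., dict]] = {}


def plugin(name: str):
    """Register a function under a stable plugin name."""
    def register(fn):
        PLUGINS[name] = fn
        return fn
    return register


@plugin("paired_t")
def paired_t(before, after) -> dict:
    res = stats.ttest_rel(after, before)
    return {"method": "paired_t", "n": int(len(before)),
            "statistic": float(res.statistic), "p_value": float(res.pvalue)}


@plugin("wilcoxon_signed_rank")
def wilcoxon_signed_rank(before, after) -> dict:
    res = stats.wilcoxon(after, before)
    return {"method": "wilcoxon_signed_rank", "n": int(len(before)),
            "statistic": float(res.statistic), "p_value": float(res.pvalue)}


def run_action(action: dict, data: dict) -> dict:
    """Execute one routing decision. The action names a plugin and columns, never numbers."""
    name = action["plugin"]
    if name not in PLUGINS:
        raise ValueError(f"unknown plugin: {name}")
    # Map each plugin parameter to the column the router selected.
    columns = {param: np.asarray(data[col], dtype=float)
               for param, col in action["args"].items()}
    return PLUGINS[name](**columns)


# What a routing decision might look like (hypothetical shape).
data = {"pre": [1.0, 2.0, 3.0, 4.0, 5.0], "post": [1.5, 2.9, 3.1, 4.8, 5.6]}
action = {"plugin": "paired_t", "args": {"before": "pre", "after": "post"}}
first = run_action(action, data)
second = run_action(action, data)
print(first == second)
```

Re-running the same action on the same data gives the same result dictionary, so any remaining run-to-run variation has to come from which action the router picked.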

The interesting design choice was not "LLM picks tools" — that's standard agent stuff now. The interesting choice was making sure the LLM never gets to emit a number.

The piece I'd argue should be standard: a claims ledger

Here's the failure mode I really wanted to prevent. Take the opening example: a paired t-test on the n = 25 dataset returns p = 0.009. Now the LLM produces a final summary for the user. The most likely failure isn't that the wrong test was chosen — we can catch that in routing tests. The most likely failure is that the LLM, in its summary, writes "p = 0.01", or "p < 0.01", or hallucinates a confidence interval that nobody computed. Over a multi-step analysis, what got computed and what got reported can drift apart silently.

The pattern that fixes this:

Every plugin run emits structured claims with stable IDs: claim_42 = {value: 0.009, kind: "p_value", method: "paired_t", n: 25, ...}.
The LLM, during its working session, sees only a list of claim IDs with their semantic tags ("there is a p-value claim with ID 42"). It does not see the literal numbers in its scratchpad.
When the LLM emits a final report, it must reference claims by ID: "The intervention shows {claim_42}, suggesting...".
A separate, deterministic render layer substitutes claim IDs with the verified text from the original plugin output: "...shows p = 0.009 (paired t-test, n = 25)...".
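The four steps above can be sketched in a few lines. This is a minimal illustration, not StatGuard's implementation: the `ClaimsLedger` class, the `{claim_N}` placeholder syntax, and the crude "no digits outside a placeholder" rule are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    claim_id: str
    kind: str   # semantic tag, e.g. "p_value"
    text: str   # verified, pre-formatted text taken from the plugin output


class ClaimsLedger:
    """Hypothetical ledger: plugins write claims, the LLM only sees IDs and kinds."""

    _REF = re.compile(r"\{(claim_\d+)\}")

    def __init__(self):
        self._claims = {}
        self._next_id = 1

    def record(self, kind: str, text: str) -> str:
        claim_id = f"claim_{self._next_id}"
        self._next_id += 1
        self._claims[claim_id] = Claim(claim_id, kind, text)
        return claim_id

    def visible_to_llm(self):
        # IDs and semantic tags only; no literal values.
        return [(c.claim_id, c.kind) for c in self._claims.values()]

    def render(self, draft: str) -> str:
        # Crude assumption for this sketch: any digit outside a placeholder is rejected.
        if re.search(r"\d", self._REF.sub("", draft)):
            raise ValueError("draft contains a literal number outside a claim reference")

        def substitute(match):
            claim_id = match.group(1)
            if claim_id not in self._claims:
                raise KeyError(f"draft references an unknown claim: {claim_id}")
            return self._claims[claim_id].text

        return self._REF.sub(substitute, draft)


ledger = ClaimsLedger()
cid = ledger.record("p_value", "p = 0.009 (paired t-test, n = 25)")
report = ledger.render("The intervention shows {" + cid + "}, suggesting a real change.")
print(report)
```

The renderer is ordinary string substitution, which is the point: the only path from a number to the report goes through the ledger, and the LLM has no write access to it.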

The result: the LLM cannot insert a number that wasn't computed. It cannot round. It cannot round-trip. It cannot paraphrase a statistic into something subtly different. It can only point at claims. A coverage gate also enforces that every required piece of evidence (for a group comparison: test statistic, p-value, effect size, assumption check) has been produced before a final answer is allowed.
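The coverage gate can be sketched as a set difference between the evidence a given analysis type requires and the claim kinds already in the ledger. The names below (`REQUIRED_EVIDENCE`, `missing_evidence`, `allow_final_answer`) are hypothetical; the required kinds for a group comparison are the ones listed above.

```python
# Hypothetical requirement table; the group-comparison entry mirrors the text above.
REQUIRED_EVIDENCE = {
    "group_comparison": {"test_statistic", "p_value", "effect_size", "assumption_check"},
}


def missing_evidence(analysis_type: str, claim_kinds) -> list:
    """Return the required claim kinds that no plugin run has produced yet."""
    return sorted(REQUIRED_EVIDENCE[analysis_type] - set(claim_kinds))


def allow_final_answer(analysis_type: str, claim_kinds) -> bool:
    """The supervisor may finish only when nothing required is missing."""
    return not missing_evidence(analysis_type, claim_kinds)


so_far = ["test_statistic", "p_value"]
print(missing_evidence("group_comparison", so_far))
print(allow_final_answer("group_comparison", so_far))
```

When the gate refuses, the list of missing kinds can be handed back to the supervisor as the reason it has to keep going.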

I'd argue this pattern should be standard for any agent that produces structured numerical output, not just statistics ones. The principle: LLMs are pointers, not values. Numbers, dates, quotes from documents, monetary amounts — anything where "almost right" is wrong — should be produced by a deterministic tool, given a claim ID, and stitched into the final text by a renderer that the LLM cannot touch.

How do we actually know it works

Two layers of validation.

Layer 1 — plugin carpet benchmark. For every plugin, generate scenarios with fixed seeds and known ground truth, then check the plugin's output against an independent scipy/statsmodels computation of the same quantity. The current carpet is 362 cases, all passing. This validates the plugins as plugins, with the LLM out of the picture.
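As an illustration of what one family of carpet cases might look like, here is a sketch that checks a hand-written paired t-test against `scipy.stats.ttest_rel` over a small grid of seeds, sample sizes, and shifts. The `paired_t_plugin` function and the grid are assumptions for the example, and the "plugin" still borrows scipy's t distribution for the p-value, so it is only partly independent of the reference.

```python
import math

import numpy as np
from scipy import stats


def paired_t_plugin(before, after) -> dict:
    """Stand-in plugin: paired t-test computed from the definition."""
    d = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / math.sqrt(n))
    p = 2.0 * stats.t.sf(abs(t), df=n - 1)  # two-sided p-value
    return {"statistic": float(t), "p_value": float(p), "n": int(n)}


def carpet_case(seed: int, n: int, shift: float) -> bool:
    """One fixed-seed scenario: plugin output must match the scipy reference."""
    rng = np.random.default_rng(seed)
    before = rng.normal(0.0, 1.0, size=n)
    after = before + shift + rng.normal(0.0, 0.5, size=n)
    got = paired_t_plugin(before, after)
    ref = stats.ttest_rel(after, before)
    return (math.isclose(got["statistic"], float(ref.statistic), rel_tol=1e-8)
            and math.isclose(got["p_value"], float(ref.pvalue), rel_tol=1e-8))


results = [carpet_case(seed, n, shift)
           for seed in range(5)
           for n in (10, 25, 100)
           for shift in (0.0, 0.3)]
print(f"{sum(results)}/{len(results)} carpet cases passed")
```

Because every scenario is seeded, a failing case can be replayed exactly, which is what makes this layer useful as a regression suite.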

Layer 2 — end-to-end agent benchmark. Drive the full LLM-supervised pipeline on a representative 42-case subset of the same matrix. Each case is judged on four dimensions: (a) the LLM picked the right plugin (routing), (b) the agent reached a final answer (no-error), (c) the claims ledger is clean — every reported number traceable to a plugin run (honesty), (d) the final numerical output is within tolerance of the ground truth (accuracy). Current pass rate: 42/42 on all four.
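A per-case judge for those four dimensions might look roughly like this. The `CaseResult` fields and the tolerance value are hypothetical; the real benchmark's record format isn't shown here.

```python
import math
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical record of one end-to-end run."""
    expected_plugin: str
    chosen_plugin: str
    reached_final_answer: bool
    untraceable_numbers: int   # reported numbers with no matching plugin run
    reported_value: float
    ground_truth: float


def judge(result: CaseResult, rel_tol: float = 1e-6) -> dict:
    """Score one case on routing, no-error, honesty, and accuracy."""
    return {
        "routing": result.chosen_plugin == result.expected_plugin,
        "no_error": result.reached_final_answer,
        "honesty": result.untraceable_numbers == 0,
        "accuracy": math.isclose(result.reported_value, result.ground_truth,
                                 rel_tol=rel_tol),
    }


case = CaseResult("paired_t", "paired_t", True, 0, 0.0090001, 0.009)
verdict = judge(case, rel_tol=1e-3)
print(verdict, all(verdict.values()))
```

Keeping the four verdicts separate matters: a case that fails only on routing points at a specification gap, while one that fails only on accuracy points at plugin code.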

Plus 764 deterministic unit/integration tests for everything else.

The most useful lesson came during e2e validation. The first run passed routing on 36 of 38 cases. The two failures were prompts framed for FDA submission or audit-grade contexts, where the LLM didn't reach for the more rigorous bootstrap mode it should have. That kind of failure isn't a computation bug, it's a judgment bug, and it only surfaces in an e2e benchmark, not a plugin-layer one. I tightened the plugin's use_when specification with explicit triggers ("FDA", "audit-grade", "clinical", "third-party re-run"), re-ran, and got 38/38. The pattern: e2e benchmarks find specification gaps; plugin benchmarks find code gaps.

One feature worth mentioning by name

The bootstrap_inference plugin produces confidence intervals for paired-difference statistics under percentile, basic, and BCa methods, all cross-validated against scipy.stats.bootstrap. It also has an opt-in Sequential Bootstrap mode (Peng 2025) for cases where the bootstrap CI itself needs to be more stable across RNG seeds — regulated submissions, audit reports. Every call emits a cross-seed CI endpoint-stability diagnostic so you can compare the two modes on your data.
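For reference, here is a sketch of the ordinary (non-sequential) version using `scipy.stats.bootstrap` directly: the three CI methods for the mean paired difference, plus a simple cross-seed spread of the CI endpoints. The data is synthetic, the helper names are hypothetical, and the spread measure (standard deviation of each endpoint across seeds) is one plausible reading of an endpoint-stability diagnostic, not necessarily the one the plugin emits. Sequential Bootstrap is not implemented here.

```python
import numpy as np
from scipy import stats

# Synthetic paired data, for illustration only.
rng = np.random.default_rng(1)
before = rng.normal(50.0, 10.0, size=25)
after = before + rng.normal(3.0, 5.0, size=25)
diff = after - before


def mean_diff_ci(diff, method: str, seed: int, n_resamples: int = 2000):
    """95% bootstrap CI for the mean paired difference with a fixed seed."""
    res = stats.bootstrap((diff,), np.mean, n_resamples=n_resamples,
                          confidence_level=0.95, method=method,
                          vectorized=True, random_state=seed)
    ci = res.confidence_interval
    return float(ci.low), float(ci.high)


def endpoint_spread(diff, method: str, seeds):
    """Std. dev. of the lower and upper endpoints across RNG seeds."""
    ends = np.array([mean_diff_ci(diff, method, s) for s in seeds])
    return ends.std(axis=0, ddof=1)


cis = {m: mean_diff_ci(diff, m, seed=0) for m in ("percentile", "basic", "BCa")}
spread = endpoint_spread(diff, "BCa", seeds=range(10))

for method, (low, high) in cis.items():
    print(f"{method:>10}: [{low:.2f}, {high:.2f}]")
print(f"BCa endpoint spread across seeds: low={spread[0]:.3f}, high={spread[1]:.3f}")
```

With a fixed seed the interval is reproducible; the spread shows how much the endpoints would move if someone re-ran the analysis with a different seed.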

What this isn't

Up front:

Pre-adoption. v0.2.0 just dropped. Real-world users are zero or one (you, possibly).
Scope is narrow and intentional. Standard univariate statistical inference and OLS. No mixed models, no factorial ANOVA yet, no survival analysis, no deep learning. The design philosophy is "reproducible analysis uses validated methods" — so the framework only covers methods I can validate against a reference implementation.
Routing is not perfect. The LLM still makes routing mistakes; the 42-case e2e benchmark is how we catch them and tighten the plugin specs. New plugins will need new e2e cases.
License: MIT. Just install and use.

What's next

Concrete things on the roadmap:

More plugins. Mixed-effects models (LMM / GLMM) for repeated-measures designs. Two-way / factorial ANOVA with interaction effects. Survival analysis (Cox PH, log-rank). Each new plugin gets its own carpet cases and e2e routing cases before merge.
Better routing on ambiguous prompts. When a user says "compare these groups" without specifying paired / independent / repeated, the LLM has to infer. The current routing logic is one-shot; I want to add a clarification loop where the agent asks one targeted question rather than guessing.
Jupyter cell magic. Most data scientists live in notebooks. A %%statguard compare cohort_A vs cohort_B cell magic returning a reproducible report in the next cell is more useful than the current Streamlit-only entry point.
Scale routing to more plugins without bloating the tool-selection context. With 27 plugins the tool-description payload is manageable. At 100 plugins it won't be: the LLM context fills with metadata that's irrelevant to the current request. Likely path: a two-stage router that first picks a plugin family (comparison / regression / description / SQL), then picks the specific plugin within that family, so each turn carries only one family's metadata instead of the full catalog.
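The two-stage router idea from the last item can be sketched as two menus of different sizes. The family names come from the list above; the plugin names, descriptions, and helper functions are invented for illustration.

```python
# Hypothetical catalog: family names are from the roadmap, plugin names are made up.
CATALOG = {
    "comparison": {
        "paired_t": "paired t-test for before/after measurements",
        "wilcoxon_signed_rank": "rank-based paired comparison",
        "welch_t": "two independent groups, unequal variances",
    },
    "regression": {"ols": "ordinary least squares"},
    "description": {"summary_stats": "per-column descriptive statistics"},
    "sql": {"run_query": "read-only query against the loaded table"},
}


def stage_one_menu() -> list:
    """First turn: the LLM sees only family names."""
    return sorted(CATALOG)


def stage_two_menu(family: str) -> dict:
    """Second turn: the LLM sees only the plugins in the chosen family."""
    return dict(CATALOG[family])


def flat_menu() -> dict:
    """One-stage routing for comparison: every plugin description at once."""
    return {name: desc for plugins in CATALOG.values() for name, desc in plugins.items()}


print(stage_one_menu())
print(len(stage_two_menu("comparison")), "of", len(flat_menu()), "plugin descriptions shown")
```

The trade-off is an extra LLM turn per request, and a wrong family choice in stage one cannot be recovered in stage two without a retry path.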

If you build agents that produce structured numerical output and want to talk about the claims-ledger pattern, I'd love to hear from you. If you're a statistician with an opinion on what's missing from the plugin set, file an issue. If you're hiring for ML / data engineering / AI applications roles in the US, I'm currently looking — reach out if you're sourcing.

The repo:

Cheng-Peng0718 / StatGuard-Agent

An auditable statistical analysis framework pairing LLM orchestration with a deterministic, scipy-cross-validated statistics engine. The LLM routes; the engine computes and self-verifies.

StatGuard Agent

An auditable statistical analysis framework that pairs LLM orchestration with a deterministic, cross-validated statistics engine.

StatGuard Agent turns a natural-language analysis request into an end-to-end, reproducible statistical report. It is built on a deliberate separation of concerns:

The LLM orchestrates — it reads the request, inspects the data, and decides which analysis to run next.
The deterministic engine computes — every statistic is produced by hardcoded, plugin-based methods that are cross-validated against scipy / statsmodels, never by the LLM itself.

This division is the core design principle. A general-purpose LLM asked to "compare these groups" may silently pick the wrong test, skip an assumption check, or report a number it did not actually compute — and may do so differently every time it is run. A traditional tool like SPSS is reproducible but cannot interpret an open-ended request. StatGuard Agent aims for both: as adaptable as…


Stars, issues, and adversarial test cases all welcome.

Top comments (4)

Tae Kim • Jun 3

Ran into this with a pipeline that extracted causal relations from trade news articles. Two runs, same article, same model - different entity spans, different confidence scores, occasionally contradictory relations. The fix that actually worked wasn't prompt engineering: I added a commit-time check where extracted evidence had to be a verbatim substring of the source. That caught the nondeterminism early - if the model was varying its output, the extracted spans would fail the check and the doc would be flagged for review rather than silently written to the graph. The statistical framing in this article is the right one; nondeterminism is a feature you have to design around, not a bug you can prompt away.

Cheng Peng • Jun 3

Yeah this matches pretty closely. The substring check and claims ledger are basically doing the same job from different sides. Neither lets the model output a fact that doesn't trace back to something we verified. Different anchors though: source text vs plugin output objects.

Curious if paraphrases ever broke the check. The model saying the right thing in slightly different words would fail a strict substring match. Was that a real issue in practice, or did it not really come up?

Tae Kim • Jun 4

Hit this in eval. I used fuzzy match for entity labels (Levenshtein < 3), strict substring for evidence text. Entity references get flex, evidence anchor stays verbatim. Caught legitimate paraphrases without letting fabrications through.

Tae Kim • Jun 21

Non-determinism bit me in a knowledge graph extraction pipeline where I needed reproducible relation IDs for deduplication. Setting temperature=0 helped but did not eliminate variance because sampling is not the only randomness source in a typical inference stack. The fix that worked was separating extraction from normalization: extraction can vary, normalization is deterministic code. Makes the non-determinism irrelevant to the final stored result, which is what actually matters for consistency downstream.