<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cheng Peng</title>
    <description>The latest articles on DEV Community by Cheng Peng (@cheng-peng0718).</description>
    <link>https://dev.to/cheng-peng0718</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3965684%2Fd9f4892f-e41b-4c02-8198-0083b26608eb.jpg</url>
      <title>DEV Community: Cheng Peng</title>
      <link>https://dev.to/cheng-peng0718</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cheng-peng0718"/>
    <language>en</language>
    <item>
      <title>Why Your LLM Agent Gives a Different P-Value Every Time (And What to Build Instead)</title>
      <dc:creator>Cheng Peng</dc:creator>
      <pubDate>Wed, 03 Jun 2026 06:34:28 +0000</pubDate>
      <link>https://dev.to/cheng-peng0718/why-your-llm-agent-gives-a-different-p-value-every-time-and-what-to-build-instead-5dc6</link>
      <guid>https://dev.to/cheng-peng0718/why-your-llm-agent-gives-a-different-p-value-every-time-and-what-to-build-instead-5dc6</guid>
      <description>&lt;p&gt;Hand the same paired before/after dataset (n = 25) to ChatGPT five times. Same prompt: &lt;em&gt;"These are the same subjects measured before and after an intervention. Did their scores change significantly?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Four of the five runs return &lt;code&gt;p = 0.009&lt;/code&gt; from a paired t-test.&lt;/p&gt;

&lt;p&gt;The fifth run does a Shapiro–Wilk normality check on the differences first, decides they're non-normal, switches to a Wilcoxon signed-rank test, and reports &lt;code&gt;p = 0.000018&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs8mc80s9ty9g2pa61bj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs8mc80s9ty9g2pa61bj.png" alt=" " width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo0f6j7t7rwsnl2gtrlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo0f6j7t7rwsnl2gtrlg.png" alt=" " width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All five reach the same conclusion (significant). But notice what happened: only one run out of five checked an assumption you'd want checked. The other four skipped it. The choice of &lt;em&gt;method&lt;/em&gt;, and with it the test statistic and the p-value, depended on whether the LLM happened to run an assumption check that time. On borderline data, that is the difference between rejecting the null hypothesis and failing to reject it.&lt;/p&gt;

&lt;p&gt;If you're using LLMs for exploratory data analysis on a weekend project, you might shrug. If you're using them for anything that gets cited, gets submitted to a regulator, or gets handed to a clinician, this is a problem. It's a known problem — &lt;a href="https://arxiv.org/abs/2602.14349" rel="noopener noreferrer"&gt;Cui &amp;amp; Alexander (2026)&lt;/a&gt; documented exactly this kind of method-divergence empirically; &lt;a href="https://arxiv.org/abs/2502.16395" rel="noopener noreferrer"&gt;AIRepr (Zeng et al., 2025)&lt;/a&gt; shows the same thing across reproducibility metrics. The current answer in the literature is to &lt;em&gt;constrain&lt;/em&gt; the agent so its execution is replayable. But replayability fixes "did we run the same code." It doesn't fix "did we run the &lt;em&gt;right&lt;/em&gt; analysis."&lt;/p&gt;

&lt;p&gt;I've spent the last two months building a different fix. The more interesting half is the architecture. Let me walk through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problem isn't temperature
&lt;/h2&gt;

&lt;p&gt;The first reflex is "set &lt;code&gt;temperature=0&lt;/code&gt;." It's not enough.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;temperature=0&lt;/code&gt; doesn't make a tool-using agent deterministic across runs. Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inference isn't bitwise deterministic, even at temperature=0.&lt;/strong&gt; Production LLM serving batches requests dynamically, and the attention kernels aren't batch-invariant — so the same input produces different output tokens depending on what other requests it gets batched with. &lt;a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/" rel="noopener noreferrer"&gt;Thinking Machines Lab&lt;/a&gt; and &lt;a href="https://www.lmsys.org/blog/2025-09-22-sglang-deterministic/" rel="noopener noreferrer"&gt;SGLang&lt;/a&gt; are still treating this as an active engineering problem in 2026.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plausible methods have no principled tiebreaker.&lt;/strong&gt; When a paired t-test and a Wilcoxon signed-rank test are both reasonable for a moderately skewed paired sample, there's no rule in the model's weights that says which to pick. It picks based on whichever chain of reasoning it happened to generate (as in the n = 25 example above).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whether an assumption check is even run is stochastic.&lt;/strong&gt; The same dataset, asked the same question, sometimes triggers a Shapiro–Wilk check and sometimes doesn't. When the check is run, it routes to a non-parametric test; when it isn't, the model defaults to a paired t-test. The case above is exactly this: one in five runs decided to check, four didn't.&lt;/li&gt;
&lt;/ol&gt;
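
&lt;p&gt;To make points 2 and 3 concrete, here is a minimal sketch of what a fixed tiebreaker could look like: the assumption check always runs, and its result decides the method by an explicit rule. The function name, the method labels, and the 0.05 threshold are illustrative assumptions, not StatGuard's actual routing logic, and whether a pre-test should gate the method at all is a separate statistical question.&lt;/p&gt;

```python
def choose_paired_method(normality_p, alpha=0.05):
    """Pick a paired-sample test from a normality-check p-value.

    Hypothetical rule for illustration: the check is mandatory, and the
    same input always maps to the same method.
    """
    if normality_p >= alpha:
        # No evidence against normality of the paired differences.
        return "paired_t"
    # Differences look non-normal: fall back to a rank-based test.
    return "wilcoxon_signed_rank"


# The same normality result routes the same way on every call.
for p in (0.40, 0.003):
    print(p, "->", choose_paired_method(p))
```

&lt;p&gt;The point is not this particular rule. It is that the rule lives in code, so two runs on the same data cannot disagree about which test was used.&lt;/p&gt;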

&lt;p&gt;The deeper issue: LLM agents try to do two jobs at once. They &lt;em&gt;choose&lt;/em&gt; which analysis to run, and they &lt;em&gt;run&lt;/em&gt; it. The first is a judgment problem the LLM is reasonably good at. The second is a computation problem the LLM is bad at, because its output is inherently stochastic and the numbers it produces can't be verified by inspection.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Just write the code yourself"
&lt;/h2&gt;

&lt;p&gt;Natural reaction: stop using the LLM for the computation. Write the scipy code yourself.&lt;/p&gt;

&lt;p&gt;This is right — but it throws out the half that's actually useful. When a researcher says &lt;em&gt;"compare the post-treatment scores between cohorts and tell me if the intervention worked,"&lt;/em&gt; the value of the LLM is mapping that informal request to (a) the right columns in the dataframe, (b) the right method given assumptions, (c) the right multiple-comparison correction, (d) a plain-English summary at the end. That mapping is genuinely hard to encode as a fixed program. Throwing the whole LLM out is overcorrecting.&lt;/p&gt;

&lt;p&gt;What you actually want: keep the LLM for the routing decision, but pin the computation to a fixed, validated implementation that &lt;em&gt;cannot&lt;/em&gt; vary across runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM routes; engine computes
&lt;/h2&gt;

&lt;p&gt;That's the architecture:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;natural-language request
        │
        ▼
   LLM Supervisor ─────────► chooses ONE next action at a time
        │                    (a tool call, or a final answer)
        ▼
 Deterministic plugin ─────► runs a hardcoded statistical method,
        │                    cross-validated against scipy/statsmodels
        ▼
 Claims ledger + gate ─────► verifies that every reported number came
        │                    from an actual plugin run
        ▼
   Auditable report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This pattern — let the LLM choose tools, but pin the computation — isn't novel. Variants of it show up in domains as different as &lt;a href="https://dev.to/demianbrecht/stop-asking-llms-to-be-deterministic-e32"&gt;devops automation&lt;/a&gt; and &lt;a href="https://dev.to/nodefiend/trust-the-server-not-the-llm-a-deterministic-approach-to-llm-accuracy-20ag"&gt;financial reporting&lt;/a&gt;. What I think is specific to applying it to statistical inference is the &lt;strong&gt;anti-fabrication discipline&lt;/strong&gt; below: a generic deterministic tool ecosystem still allows the LLM to paraphrase or round the numbers it received. The claims ledger pattern makes that structurally impossible.&lt;/p&gt;

&lt;p&gt;I built this as &lt;a href="https://github.com/Cheng-Peng0718/StatGuard-Agent" rel="noopener noreferrer"&gt;StatGuard Agent&lt;/a&gt;. The supervisor LLM (currently &lt;code&gt;gpt-4o&lt;/code&gt;) picks one of 27 hardcoded analysis plugins per step. The plugins do &lt;em&gt;all&lt;/em&gt; numerical work; the LLM never emits a number. Given the same plugin and the same arguments, the output is byte-identical across runs — the variability that remains is in plugin selection, which is what the validation framework below targets.&lt;/p&gt;

&lt;p&gt;The interesting design choice was not "LLM picks tools" — that's standard agent stuff now. The interesting choice was making sure the LLM never gets to &lt;em&gt;emit a number&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  The piece I'd argue should be standard: a claims ledger
&lt;/h2&gt;

&lt;p&gt;Here's the failure mode I really wanted to prevent. Take the opening example: a paired t-test on the n = 25 dataset returns &lt;code&gt;p = 0.009&lt;/code&gt;. Now the LLM produces a final summary for the user. The most likely failure isn't that the wrong test was chosen — we can catch that in routing tests. The most likely failure is that the LLM, in its summary, writes &lt;code&gt;"p = 0.01"&lt;/code&gt;, or &lt;code&gt;"p &amp;lt; 0.01"&lt;/code&gt;, or hallucinates a confidence interval that nobody computed. Over a multi-step analysis, &lt;em&gt;what got computed&lt;/em&gt; and &lt;em&gt;what got reported&lt;/em&gt; can drift apart silently.&lt;/p&gt;

&lt;p&gt;The pattern that fixes this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every plugin run emits structured &lt;strong&gt;claims&lt;/strong&gt; with stable IDs: &lt;code&gt;claim_42 = {value: 0.009, kind: "p_value", method: "paired_t", n: 25, ...}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The LLM, during its working session, sees only a &lt;em&gt;list of claim IDs&lt;/em&gt; with their semantic tags ("there is a p-value claim with ID 42"). It does not see the literal numbers in its scratchpad.&lt;/li&gt;
&lt;li&gt;When the LLM emits a final report, it must reference claims by ID: &lt;code&gt;"The intervention shows {claim_42}, suggesting..."&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A separate, deterministic &lt;strong&gt;render layer&lt;/strong&gt; substitutes claim IDs with the verified text from the original plugin output: &lt;code&gt;"...shows p = 0.009 (paired t-test, n = 25)..."&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
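
&lt;p&gt;The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not StatGuard's code: the &lt;code&gt;Claim&lt;/code&gt; and &lt;code&gt;ClaimsLedger&lt;/code&gt; names, the &lt;code&gt;{claim_N}&lt;/code&gt; placeholder syntax, and the blunt "no digits outside placeholders" check are all hypothetical.&lt;/p&gt;

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    claim_id: str
    kind: str   # semantic tag the LLM is allowed to see
    text: str   # verified text produced by the plugin run


class ClaimsLedger:
    def __init__(self):
        self._claims = {}

    def record(self, kind, text):
        """Store a plugin result and return its stable ID."""
        claim_id = f"claim_{len(self._claims) + 1}"
        self._claims[claim_id] = Claim(claim_id, kind, text)
        return claim_id

    def visible_to_llm(self):
        """IDs and tags only: no literal numbers leave the ledger here."""
        return [(c.claim_id, c.kind) for c in self._claims.values()]

    def render(self, draft):
        """Substitute placeholders; reject digits the LLM typed itself."""
        prose = re.sub(r"\{claim_\d+\}", "", draft)
        if re.search(r"\d", prose):
            raise ValueError("draft contains a number that is not a claim reference")

        def substitute(match):
            claim_id = match.group(1)
            if claim_id not in self._claims:
                raise KeyError(f"unknown claim: {claim_id}")
            return self._claims[claim_id].text

        return re.sub(r"\{(claim_\d+)\}", substitute, draft)


ledger = ClaimsLedger()
ledger.record("p_value", "p = 0.009 (paired t-test, n = 25)")
print(ledger.visible_to_llm())
print(ledger.render("The intervention shows {claim_1}, suggesting a real change."))
```

&lt;p&gt;A draft that writes &lt;code&gt;p = 0.01&lt;/code&gt; directly raises &lt;code&gt;ValueError&lt;/code&gt; in this sketch, and a draft that points at a claim nobody recorded raises &lt;code&gt;KeyError&lt;/code&gt;. A real renderer would need a more careful policy for harmless digits such as section numbers.&lt;/p&gt;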

&lt;p&gt;The result: the LLM cannot insert a number that wasn't computed. It cannot round. It cannot round-trip. It cannot paraphrase a statistic into something subtly different. It can only point at claims. A coverage gate also enforces that every required piece of evidence (for a group comparison: test statistic, p-value, effect size, assumption check) has been produced before a final answer is allowed.&lt;/p&gt;
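
&lt;p&gt;The coverage gate is the simpler half. A minimal sketch, assuming claims carry a &lt;code&gt;kind&lt;/code&gt; tag and that requirements are keyed by analysis type (the table and function names below are hypothetical; the four evidence kinds are the ones listed above):&lt;/p&gt;

```python
# Hypothetical requirement table: analysis type -> claim kinds that must exist.
REQUIRED_EVIDENCE = {
    "group_comparison": {
        "test_statistic",
        "p_value",
        "effect_size",
        "assumption_check",
    },
}


def missing_evidence(analysis_type, claim_kinds):
    """Return the required claim kinds that no plugin run has produced yet."""
    return sorted(REQUIRED_EVIDENCE[analysis_type] - set(claim_kinds))


def may_finalize(analysis_type, claim_kinds):
    """The gate: a final answer is allowed only when nothing is missing."""
    return not missing_evidence(analysis_type, claim_kinds)


so_far = ["test_statistic", "p_value"]
print(missing_evidence("group_comparison", so_far))
print(may_finalize("group_comparison", so_far))
```

&lt;p&gt;Here the gate reports that the effect size and the assumption check are still missing, so the supervisor has to call more plugins before it is allowed to answer.&lt;/p&gt;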

&lt;p&gt;I'd argue this pattern should be standard for any agent that produces structured numerical output, not just statistics ones. The principle: &lt;strong&gt;LLMs are pointers, not values.&lt;/strong&gt; Numbers, dates, quotes from documents, monetary amounts — anything where "almost right" is wrong — should be produced by a deterministic tool, given a claim ID, and stitched into the final text by a renderer that the LLM cannot touch.&lt;/p&gt;
&lt;h2&gt;
  
  
  How do we actually know it works
&lt;/h2&gt;

&lt;p&gt;Two layers of validation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — plugin carpet benchmark.&lt;/strong&gt; For every plugin, generate scenarios with fixed seeds and known ground truth, then check the plugin's output against an independent &lt;code&gt;scipy&lt;/code&gt;/&lt;code&gt;statsmodels&lt;/code&gt; computation of the same quantity. The current carpet is 362 cases, all passing. This validates the plugins as plugins, with the LLM out of the picture.&lt;/p&gt;
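
&lt;p&gt;A toy version of one carpet row, to show the shape of the check. Everything here is an assumption made for illustration: the "plugin" is a hand-written paired t statistic, and the independent reference is Python's &lt;code&gt;statistics&lt;/code&gt; module standing in for the &lt;code&gt;scipy&lt;/code&gt;/&lt;code&gt;statsmodels&lt;/code&gt; reference the real carpet uses.&lt;/p&gt;

```python
import math
import random
import statistics


def plugin_paired_t(before, after):
    """Stand-in plugin: paired t statistic written out by hand."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


def reference_paired_t(before, after):
    """Independent code path for the same quantity."""
    diffs = [a - b for a, b in zip(after, before)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def make_scenario(seed, n=25, shift=0.5):
    """Fixed seed in, identical paired data out."""
    rng = random.Random(seed)
    before = [rng.gauss(10.0, 2.0) for _ in range(n)]
    after = [b + shift + rng.gauss(0.0, 1.0) for b in before]
    return before, after


def run_carpet(seeds, rel_tol=1e-9):
    """Return the seeds where plugin and reference disagree."""
    failures = []
    for seed in seeds:
        before, after = make_scenario(seed)
        got = plugin_paired_t(before, after)
        want = reference_paired_t(before, after)
        if not math.isclose(got, want, rel_tol=rel_tol):
            failures.append(seed)
    return failures


print(run_carpet(range(20)))
```

&lt;p&gt;An empty list means every seeded scenario agreed within tolerance. Because the seeds are fixed, a failure is reproducible and points at one specific case.&lt;/p&gt;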

&lt;p&gt;&lt;strong&gt;Layer 2 — end-to-end agent benchmark.&lt;/strong&gt; Drive the full LLM-supervised pipeline on a representative 42-case subset of the same matrix. Each case is judged on four dimensions: (a) the LLM picked the right plugin (routing), (b) the agent reached a final answer (no-error), (c) the claims ledger is clean — every reported number traceable to a plugin run (honesty), (d) the final numerical output is within tolerance of the ground truth (accuracy). Current pass rate: 42/42 on all four.&lt;/p&gt;

&lt;p&gt;Plus 764 deterministic unit/integration tests for everything else.&lt;/p&gt;

&lt;p&gt;The most useful lesson came during e2e validation. On the first run, 36 of 38 cases passed routing. The two failures were prompts framed for FDA submission or audit-grade contexts, where the LLM didn't reach for the more rigorous bootstrap mode it should have. That kind of failure isn't a computation bug, it's a &lt;em&gt;judgment&lt;/em&gt; bug, and it only surfaces in an e2e benchmark, not a plugin-layer one. I tightened the plugin's &lt;code&gt;use_when&lt;/code&gt; specification with explicit triggers ("FDA", "audit-grade", "clinical", "third-party re-run"), re-ran, and got 38/38. The pattern: e2e benchmarks find specification gaps; plugin benchmarks find code gaps.&lt;/p&gt;
&lt;h2&gt;
  
  
  One feature worth mentioning by name
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;bootstrap_inference&lt;/code&gt; plugin produces confidence intervals for paired-difference statistics under percentile, basic, and BCa methods, all cross-validated against &lt;code&gt;scipy.stats.bootstrap&lt;/code&gt;. It also has an opt-in &lt;strong&gt;Sequential Bootstrap&lt;/strong&gt; mode (&lt;a href="https://arxiv.org/abs/2511.18065" rel="noopener noreferrer"&gt;Peng 2025&lt;/a&gt;) for cases where the bootstrap CI itself needs to be more stable across RNG seeds — regulated submissions, audit reports. Every call emits a cross-seed CI endpoint-stability diagnostic so you can compare the two modes on your data.&lt;/p&gt;
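
&lt;p&gt;To show what a cross-seed endpoint-stability diagnostic can measure, here is a rough sketch using a plain percentile bootstrap from the standard library. It does not implement Sequential Bootstrap or BCa, and defining stability as the standard deviation of each endpoint across seeds is my assumption for illustration, not necessarily the plugin's definition.&lt;/p&gt;

```python
import random
import statistics


def percentile_ci(diffs, seed, n_boot=2000, level=0.95):
    """Percentile bootstrap CI for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    # Drop the same number of resampled means from each tail.
    k = round((1.0 - level) / 2.0 * n_boot)
    return means[k], means[n_boot - 1 - k]


def endpoint_stability(diffs, seeds):
    """Spread of each CI endpoint across RNG seeds (smaller is more stable)."""
    cis = [percentile_ci(diffs, seed) for seed in seeds]
    lowers = [lo for lo, _ in cis]
    uppers = [hi for _, hi in cis]
    return {
        "lower_sd": statistics.stdev(lowers),
        "upper_sd": statistics.stdev(uppers),
    }


# Synthetic paired differences, generated with a fixed seed.
data_rng = random.Random(0)
diffs = [data_rng.gauss(0.6, 1.0) for _ in range(25)]
print(endpoint_stability(diffs, seeds=range(10)))
```

&lt;p&gt;If the endpoint spread is large relative to the interval width, two analysts re-running the same bootstrap with different seeds could report visibly different intervals, which is the situation a more seed-stable mode is meant to address.&lt;/p&gt;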
&lt;h2&gt;
  
  
  What this isn't
&lt;/h2&gt;

&lt;p&gt;Up front:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-adoption.&lt;/strong&gt; v0.2.0 just dropped. Real-world users are zero or one (you, possibly).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope is narrow and intentional.&lt;/strong&gt; Standard univariate statistical inference and OLS. No mixed models, no factorial ANOVA yet, no survival analysis, no deep learning. The design philosophy is "reproducible analysis uses validated methods" — so the framework only covers methods I can validate against a reference implementation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing is not perfect.&lt;/strong&gt; The LLM still makes routing mistakes; the 42-case e2e benchmark is how we catch them and tighten the plugin specs. New plugins will need new e2e cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License: MIT.&lt;/strong&gt; Just install and use.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Concrete things on the roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More plugins.&lt;/strong&gt; Mixed-effects models (LMM / GLMM) for repeated-measures designs. Two-way / factorial ANOVA with interaction effects. Survival analysis (Cox PH, log-rank). Each new plugin gets its own carpet cases and e2e routing cases before merge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better routing on ambiguous prompts.&lt;/strong&gt; When a user says &lt;em&gt;"compare these groups"&lt;/em&gt; without specifying paired / independent / repeated, the LLM has to infer. The current routing logic is one-shot; I want to add a clarification loop where the agent asks one targeted question rather than guessing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jupyter cell magic.&lt;/strong&gt; Most data scientists live in notebooks. A &lt;code&gt;%%statguard compare cohort_A vs cohort_B&lt;/code&gt; cell magic returning a reproducible report in the next cell is more useful than the current Streamlit-only entry point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale routing to more plugins without bloating the tool-selection context.&lt;/strong&gt; With 27 plugins the tool-description payload is manageable. At 100 plugins it won't be: the LLM context fills with metadata that's irrelevant to the current request. Likely path: a two-stage router that first picks a plugin &lt;em&gt;family&lt;/em&gt; (comparison / regression / description / SQL), then picks the specific plugin within that family, so each turn carries only the family list or one family's plugin descriptions instead of all of them.&lt;/li&gt;
&lt;/ul&gt;
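
&lt;p&gt;For the last item, a two-stage router could look roughly like this. It is a sketch of the idea only: the registry contents are made up, and the two picker callables stand in for LLM calls so the example runs without a model.&lt;/p&gt;

```python
# Hypothetical plugin registry: family -> {plugin name: short description}.
REGISTRY = {
    "comparison": {
        "paired_t": "paired samples, roughly normal differences",
        "wilcoxon_signed_rank": "paired samples, non-normal differences",
    },
    "regression": {"ols": "linear model with a continuous outcome"},
    "description": {"summary_stats": "means, medians, spreads"},
}


def stage_one_payload():
    """What the router sees first: family names only."""
    return sorted(REGISTRY)


def stage_two_payload(family):
    """What the router sees second: only the chosen family's plugins."""
    return dict(REGISTRY[family])


def route(pick_family, pick_plugin):
    """Two-stage routing with validation after each stage."""
    family = pick_family(stage_one_payload())
    if family not in REGISTRY:
        raise ValueError(f"unknown family: {family}")
    plugins = stage_two_payload(family)
    plugin = pick_plugin(plugins)
    if plugin not in plugins:
        raise ValueError(f"plugin {plugin} is not in family {family}")
    return family, plugin


# Stub pickers in place of the two LLM calls.
choice = route(
    pick_family=lambda families: "comparison",
    pick_plugin=lambda plugins: "paired_t",
)
print(choice)
```

&lt;p&gt;The saving only shows up once families are large: with a handful of plugins, the extra family-selection turn costs more than it saves. Validating each stage's answer against the registry also keeps a wrong family from silently producing a plugin that doesn't exist.&lt;/p&gt;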

&lt;p&gt;If you build agents that produce structured numerical output and want to talk about the claims-ledger pattern, I'd love to hear from you. If you're a statistician with an opinion on what's missing from the plugin set, file an issue. If you're hiring for ML / data engineering / AI applications roles in the US, I'm currently looking — reach out if you're sourcing.&lt;/p&gt;

&lt;p&gt;The repo:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Cheng-Peng0718" rel="noopener noreferrer"&gt;
        Cheng-Peng0718
      &lt;/a&gt; / &lt;a href="https://github.com/Cheng-Peng0718/StatGuard-Agent" rel="noopener noreferrer"&gt;
        StatGuard-Agent
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      An auditable statistical analysis framework pairing LLM orchestration with a deterministic, scipy-cross-validated statistics engine. The LLM routes; the engine computes and self-verifies.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;&lt;a href="https://doi.org/10.5281/zenodo.20519404" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/1c0f1e39774a95105d2054e9e2e083e76c7647a88c27e0b25db602cee9617735/68747470733a2f2f7a656e6f646f2e6f72672f62616467652f444f492f31302e353238312f7a656e6f646f2e32303531393430342e737667" alt="DOI"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;StatGuard Agent&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;An auditable statistical analysis framework that pairs LLM orchestration with a deterministic, cross-validated statistics engine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;StatGuard Agent turns a natural-language analysis request into an end-to-end, reproducible statistical report. It is built on a deliberate separation of concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The LLM orchestrates&lt;/strong&gt; — it reads the request, inspects the data, and decides &lt;em&gt;which&lt;/em&gt; analysis to run next.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The deterministic engine computes&lt;/strong&gt; — every statistic is produced by hardcoded, plugin-based methods that are cross-validated against &lt;code&gt;scipy&lt;/code&gt; / &lt;code&gt;statsmodels&lt;/code&gt;, never by the LLM itself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This division is the core design principle. A general-purpose LLM asked to "compare these groups" may silently pick the wrong test, skip an assumption check, or report a number it did not actually compute — and may do so &lt;em&gt;differently every time it is run&lt;/em&gt;. A traditional tool like SPSS is reproducible but cannot interpret an open-ended request. StatGuard Agent aims for both: &lt;strong&gt;as adaptable as&lt;/strong&gt;…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Cheng-Peng0718/StatGuard-Agent" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;Stars, issues, and adversarial test cases all welcome.&lt;/p&gt;

</description>
      <category>python</category>
      <category>llm</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
