This post is based on sections 3--5 of "Understanding why deterministic output from LLMs is nearly impossible" by Shuveb Hussain at Unstract (Oct. 2025). The material has been rewritten and extended with practical code examples using Pydantic and Snowflake Cortex.
Motivation
When you build a pipeline that sends the same document through an LLM twice and expects the same structured output both times, you will eventually be surprised. Not because you've made a mistake, but because LLMs are fundamentally non-deterministic: the same prompt can produce different tokens across runs, even when you set temperature=0.
The root cause is that modern LLMs run on massively parallel GPU hardware, where floating-point arithmetic is not associative. The order in which thousands of parallel threads accumulate intermediate values is not guaranteed to be identical run-to-run, so the final token probability scores shift by tiny amounts. When two candidate tokens are close in probability, that tiny shift can flip which one gets selected. Because LLMs generate text auto-regressively --- each token conditions all subsequent tokens --- a single early flip can cascade into a structurally different response by the tenth token.
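The non-associativity of floating-point addition is easy to see outside the GPU. A minimal sketch in plain Python, using illustrative values rather than anything from a real model:

```python
# Floating-point addition is not associative: the grouping of operands
# changes the result by a tiny amount -- the same kind of drift that
# different parallel reduction orders produce on a GPU.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one accumulation order
right = a + (b + c)  # another accumulation order

print(left == right)  # False
print(left - right)   # tiny but nonzero difference
```

When two candidate tokens' scores differ by less than this kind of drift, the argmax can flip between runs, and the auto-regressive loop amplifies that single flip into a visibly different response.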
You cannot eliminate this. What you can do is design your system so that it is robust to it. The rest of this article shows you how, concretely, using Pydantic and Snowflake Cortex.
What Providers Actually Promise
No major LLM provider guarantees deterministic output. Snowflake Cortex, which hosts models like mistral-large2 and claude-3-5-sonnet via the complete function in snowflake-ml-python, is no exception. You can observe the problem directly:
from snowflake.cortex import complete
from snowflake.snowpark.context import get_active_session
session = get_active_session()
prompt = (
    "Extract the vendor name and total amount from this invoice: "
    "Acme Corp, Invoice #1042, Total: EUR 3,200.00. "
    "Respond in JSON."
)
run_1 = complete("mistral-large2", prompt, session=session)
run_2 = complete("mistral-large2", prompt, session=session)
print(run_1) # {"vendor": "Acme Corp", "total_amount": 3200.00}
print(run_2) # {"vendor": "Acme Corp", "amount": 3200.0} ← different field name
The values agree, but the field names differ. A downstream system parsing total_amount will fail silently on the second response.
The reasons providers haven't solved this are pragmatic. Deterministic GPU operations are substantially slower, routing across a distributed fleet makes identical hardware execution nearly impossible, and most applications tolerate small variations gracefully. The seed parameter, where available, only controls sampling randomness --- it does nothing about floating-point drift from parallel reductions, which is the dominant source of variation at temperature=0.
The practical takeaway: treat non-determinism as a fixed environmental property, like network latency. You don't eliminate it; you engineer around it.
Best Practices for Structured Extraction
Enforce Structure with response_format and Pydantic
The most effective tool available is Snowflake Cortex's structured output feature. You define your expected output as a Pydantic BaseModel, convert it to a JSON schema with .model_json_schema(), and pass it to complete via CompleteOptions. The model is then constrained to emit output conforming to that schema, and you validate the result back into your Pydantic object.
from pydantic import BaseModel, Field
from snowflake.cortex import complete, CompleteOptions
from snowflake.snowpark.context import get_active_session

class InvoiceExtraction(BaseModel):
    vendor_name: str = Field(description="Full legal name of the vendor")
    invoice_number: str = Field(description="Invoice identifier, e.g. INV-1042")
    total_amount: float = Field(description="Total amount due, numeric only")
    currency: str = Field(description="ISO 4217 currency code, e.g. EUR")

session = get_active_session()

options = CompleteOptions(
    temperature=0,
    response_format={
        "type": "json",
        "schema": InvoiceExtraction.model_json_schema(),
    },
)

prompt = "Extract structured data from: Acme Corp, Invoice #1042, Total: EUR 3,200.00"

raw = complete(
    model="mistral-large2",
    prompt=prompt,
    session=session,
    options=options,
)

result = InvoiceExtraction.model_validate_json(raw)
print(result.vendor_name)   # "Acme Corp"
print(result.total_amount)  # 3200.0
print(result.currency)      # "EUR"
The response_format eliminates the most common failure mode: field name variation. The model no longer chooses between total_amount, amount, and sum --- the schema decides. The subsequent model_validate_json call gives you a fully typed Python object and raises a ValidationError if anything is malformed.
Write Unambiguous Schemas
The schema is a contract. Vague field descriptions produce vague outputs. Be explicit about types, formats, and what to do when data is missing. Note that Snowflake Cortex has some schema constraints --- numeric range keywords like minimum/maximum are not supported, and property names may only contain letters, digits, hyphens, and underscores.
from typing import Optional

from pydantic import BaseModel, Field

class LineItem(BaseModel):
    description: str = Field(description="Product or service name, as written on the invoice")
    quantity: int = Field(description="Number of units. Convert 'dozen' to 12, 'pair' to 2.")
    unit_price: float = Field(description="Price per unit in the invoice currency, numeric only")

class InvoiceExtraction(BaseModel):
    vendor_name: str = Field(description="Full legal name of the vendor")
    invoice_date: Optional[str] = Field(
        default=None,
        description="Invoice date in YYYY-MM-DD format. Convert all date formats; null if no date is present."
    )
    total_amount: float = Field(description="Total amount due, numeric only, no currency symbol")
    currency: str = Field(description="ISO 4217 currency code, e.g. EUR, USD, GBP")
    line_items: list[LineItem] = Field(description="All line items listed on the invoice")
    purchase_order_number: Optional[str] = Field(
        default=None,
        description="PO reference number if present on the invoice, otherwise null"
    )
The Optional[str] with default=None on purchase_order_number is important. Without it, the model might hallucinate a PO number when none is present rather than omit the field --- a subtle but production-breaking behavior.
Anchor Behavior with Few-Shot Examples
Schema constraints define the structure; few-shot examples define the behavior within that structure. They are especially valuable for edge cases: relative dates, quantity words like "dozen", and optional fields that should be null. With a simple string prompt, the examples go directly into the prompt text.
from snowflake.cortex import complete, CompleteOptions
from snowflake.snowpark.context import get_active_session

session = get_active_session()

options = CompleteOptions(
    temperature=0,
    response_format={
        "type": "json",
        "schema": InvoiceExtraction.model_json_schema(),
    },
)

def build_prompt(invoice_text: str) -> str:
    return f"""You extract structured invoice data. Respond in JSON.

Example 1:
Input: "From: Bolt GmbH | Ref: RE-2024-991 | Date: yesterday | 5x Widget A @ EUR 12.50 | Total: EUR 62.50"
Output: {{"vendor_name": "Bolt GmbH", "invoice_date": "2025-03-24", "total_amount": 62.50,
"currency": "EUR", "line_items": [{{"description": "Widget A", "quantity": 5, "unit_price": 12.50}}],
"purchase_order_number": null}}

Example 2:
Input: "DataPipe Inc | Invoice 5531 | PO: PO-8821 | 1 doz. API calls @ USD 0.01 | USD 0.12 total"
Output: {{"vendor_name": "DataPipe Inc", "invoice_date": null, "total_amount": 0.12,
"currency": "USD", "line_items": [{{"description": "API calls", "quantity": 12, "unit_price": 0.01}}],
"purchase_order_number": "PO-8821"}}

Now extract from:
{invoice_text}"""

invoice_text = "Acme Corp, Invoice #1042, Total: EUR 3,200.00"

raw = complete(
    model="mistral-large2",
    prompt=build_prompt(invoice_text),
    session=session,
    options=options,
)
result = InvoiceExtraction.model_validate_json(raw)
Without the "dozen → 12" example, the model might return quantity=1 with description="dozen API calls". With it, the conversion is consistent.
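The variance-measurement helper in the next section calls an extract_with_retry wrapper. Here is a minimal, dependency-free sketch of its core logic, with the model call abstracted as an injected callable so the retry behavior is visible on its own. In the real pipeline the callable would wrap complete(), the validator would be InvoiceExtraction.model_validate_json (pydantic.ValidationError subclasses ValueError), and the wrapper would take invoice_text and session instead:

```python
import json
from typing import Callable, Optional

def extract_with_retry(
    call_llm: Callable[[], str],      # returns raw model output, e.g. from complete(...)
    validate: Callable[[str], dict],  # raises ValueError (or a subclass) if malformed
    max_attempts: int = 3,
) -> Optional[dict]:
    """Re-invoke the model when its output fails validation.

    Non-determinism works in our favor here: a malformed response on one
    run is often well-formed on the next.
    """
    for attempt in range(max_attempts):
        raw = call_llm()
        try:
            return validate(raw)
        except ValueError:
            if attempt == max_attempts - 1:
                return None  # all attempts exhausted
    return None

# Demo with a stub that returns malformed JSON once, then valid JSON:
responses = iter(['{"total_amount": broken', '{"total_amount": 3200.0}'])
result = extract_with_retry(lambda: next(responses), json.loads)
print(result)  # {'total_amount': 3200.0}
```

Retrying on ValidationError rather than on exact-match comparison is deliberate: the goal is a response that satisfies the contract, not a response identical to the previous run.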
Measure Variance to Drive Improvement
Variance you don't measure is variance you can't improve. Run the same documents through your pipeline multiple times and compare the results. High variance on specific fields points directly at under-specified Field(description=...) strings or missing few-shot examples.
from collections import Counter

def measure_extraction_variance(
    invoice_text: str,
    n_runs: int = 5,
    session=None,
) -> dict:
    results = []
    for _ in range(n_runs):
        # extract_with_retry: wrapper around complete() + model_validate_json
        # that retries on ValidationError, assumed defined elsewhere.
        result = extract_with_retry(invoice_text, session=session)
        if result:
            results.append(result.model_dump())

    if not results:
        return {"error": "all runs failed"}

    variance_report = {}
    for field in results[0]:
        values = [str(r[field]) for r in results]
        _, top_count = Counter(values).most_common(1)[0]
        variance_report[field] = {
            "stable": len(set(values)) == 1,
            "unique_values": list(set(values)),
            "agreement_rate": top_count / n_runs,
        }
    return variance_report
# Example output:
# {
# "vendor_name": {"stable": True, "agreement_rate": 1.0, ...},
# "invoice_date": {"stable": False, "agreement_rate": 0.6, ...}, # ← needs attention
# }
A field with agreement_rate < 0.8 across five runs is a direct signal to improve its description or add a targeted few-shot example. This feedback loop is how schemas and prompts mature over time.
The Right Mental Model
The most useful shift you can make is to stop treating non-determinism as a bug waiting to be fixed and start treating it as a property of the environment --- one you design around, not against.
The analogy from distributed systems is apt. TCP/IP is built on top of unreliable packet delivery; the reliability lives in the protocol layer, not in the assumption of a perfect physical network. A reliable LLM pipeline puts its correctness guarantees in the surrounding system --- Pydantic validation, retry logic, normalization, business rules --- not in the assumption that the model will always produce identical output.
With Snowflake Cortex and snowflake-ml-python, this maps to a clear, implementable layering:
Raw document
↓
complete(prompt=..., options=CompleteOptions(response_format=...)) ← flexibility lives here
↓
model_validate_json() + retry on ValidationError ← structure enforced here
↓
NormalizedInvoice model_validator ← canonical field names here
↓
Business logic / Snowflake table ← determinism lives here
The complete call is allowed to be flexible --- that flexibility is precisely what lets it handle invoice formats you have never seen before. Every layer below it progressively tightens the guarantees, so that by the time data reaches a Snowflake table, it is in a predictable, validated shape regardless of what the model happened to call any given field on that particular run.
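The normalization layer in that diagram can start as something as simple as an alias map applied before validation. A minimal sketch, where the alias table is illustrative and, in the real pipeline, this logic would live in a model_validator on the NormalizedInvoice model:

```python
# Canonical-name normalization: whatever the model called a field on this
# particular run, downstream code only ever sees one name.
FIELD_ALIASES = {
    "amount": "total_amount",
    "sum": "total_amount",
    "total": "total_amount",
    "vendor": "vendor_name",
    "supplier": "vendor_name",
}

def normalize_fields(payload: dict) -> dict:
    """Rewrite known alias keys to their canonical names, pass others through."""
    return {FIELD_ALIASES.get(key, key): value for key, value in payload.items()}

print(normalize_fields({"vendor": "Acme Corp", "amount": 3200.0}))
# {'vendor_name': 'Acme Corp', 'total_amount': 3200.0}
```

With response_format enforcing the schema, this layer rarely fires; it exists as a safety net for the runs where the model drifts anyway.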
Non-determinism is not a problem to eliminate. It is the price of flexibility, and a reasonable one --- as long as you don't ask the model to be your schema enforcer, your validator, and your business-logic layer all at once. Those jobs belong to Pydantic.
Original article: Shuveb Hussain, "Understanding why deterministic output from LLMs is nearly impossible," Unstract Blog, October 8, 2025. https://unstract.com/blog/understanding-why-deterministic-output-from-llms-is-nearly-impossible/