DEV Community

Daniel Romitelli

Posted on • Originally published at craftedbydaniel.com

I Turned Temperature Up to Save My Extractions: The 3‑Node LangGraph That Trades Variance for Truth

I noticed it in the most annoying way: the email clearly contained a job title — not perfectly, not cleanly, but enough — and my extraction came back with nothing.

Not wrong. Not malformed. Just… absent.

That’s the failure mode that hurts in enterprise systems, because downstream code can’t tell the difference between “not present” and “present but partially occluded.” Silence looks like confidence.

This is Part 2 of my series, “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)”.

  • In Part 0 (“Continuity Is a Feature”), I dealt with the brutal reality that enterprise workflows aren’t stateless: threads span days, people reply out of order, and the system has to remember what it already learned.
  • In Part 1 (“Adversarial Emails and the Front Door Problem”), I hardened the intake boundary so forwarded threads, disclaimers, and hostile payloads don’t poison the run.

Here in Part 2, I’m going to show the decision that surprised me most:

High variance + downstream validation beats low variance + brittle parsing.

Yes, that means my extraction calls run at temperature=1. And no, it’s not a vibe. It’s architecture.


## The moment I stopped trusting temperature=0

Every “structured extraction” playbook eventually tells you some version of: set temperature to 0.

I did that. I shipped it. It looked clean in tests.

Then production taught me a different lesson: when the input is imperfect — and email is always imperfect — low-variance decoding tends to pick the “safe” behavior: omit any field that isn’t slam-dunk supported.

That’s not caution. That’s data loss.

And the inputs that trigger it are exactly the distribution I live in:

  • forwarded threads with multiple voices
  • signature sludge (“Sent from my iPhone”, compliance blocks, calendars)
  • partially visible details (“Role: Sr. Eng…”, “Title: Dir. of …”)
  • weird formatting (bullets, tables, inline replies)

When you force low variance, the model often chooses to be conservative. The output looks tidy, but it quietly drops the weak signals you needed most.

So I made the opposite bet:

  1. Let the model take a swing at imperfect fields (temperature=1).
  2. Treat the extraction as candidates, not truth.
  3. Make validation earn the right to write downstream fields.

That’s a systems decision, not a model decision.


## Key insight: for extraction, recall comes before precision

A lot of teams optimize extraction like it’s a database migration: strict, deterministic, no surprises.

But extraction from messy human text has a different primary constraint:

Can you preserve weak signals long enough to verify them?

If you run temperature=0 and the model omits a partially visible job title, you cannot validate it later — it never enters the system.

If you run temperature=1, you’ll get more attempts (including some garbage). But now you have something to:

  • sanitize
  • reject
  • corroborate during research

One analogy (and only one): it’s the difference between a cautious intern who only writes down what they’re 100% sure about, and a stenographer who writes down what they think they heard so you can fact-check it later.

I chose the stenographer. Then I built the fact-checker.


## The 3‑node LangGraph: extract → research → validate

My email workflow is a three-node LangGraph:

  1. extract — pull structured fields from the email (high recall, permissive)
  2. research — enrich and cross-check (adds external context)
  3. validate — sanitize + reject bad fields (strict correctness boundary)

The non-obvious design choice: it’s sequential. No branching, no early exits, no “if confidence < x then …” graph gymnastics.

Why I keep it linear:

  • The failure mode I care about is partial truth, not total failure.
  • Conditional routing forces a hard split (“good” vs “bad”) too early.
  • In production debugging, I want every trace to have the same three chapters.

This graph behaves like a careful human: write down your first pass, go look things up, then decide what you actually believe.

Here’s a self-contained, runnable version of the pattern (async nodes, typed state, sequential edges). You can paste this into a file and run it.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Any, Dict, TypedDict

from langgraph.graph import StateGraph, START, END


class EmailState(TypedDict, total=False):
    # Input
    email_subject: str
    email_body: str

    # Outputs as the workflow progresses
    extracted: Dict[str, Any]
    researched: Dict[str, Any]
    validated: Dict[str, Any]


@dataclass
class EmailGraph:
    """Minimal extract → research → validate graph.

    In my production code this lives inside the email intake pipeline manager,
    where the nodes call real services (LLM + search + internal data sources).
    """

    async def extract_node(self, state: EmailState) -> EmailState:
        # Placeholder to keep the example runnable; the real LLM call is shown later.
        body = state.get("email_body", "")
        subject = state.get("email_subject", "")

        # A tiny toy heuristic so you can see data moving through.
        extracted = {
            "position_title": "Director of Engineering" if "Director" in body else None,
            "candidate_name": "Alex Morgan" if "Alex" in body else None,
            "source_subject": subject,
        }
        return {**state, "extracted": extracted}

    async def research_node(self, state: EmailState) -> EmailState:
        extracted = state.get("extracted", {})

        # Again: runnable placeholder. In production this is where I call tools
        # (company research, domain checks, internal CRM lookups, etc.).
        researched = {
            "position_title": extracted.get("position_title"),
            "title_seen_on_linkedin": bool(extracted.get("position_title")),
        }
        return {**state, "researched": researched}

    async def validate_node(self, state: EmailState) -> EmailState:
        extracted = state.get("extracted", {})
        researched = state.get("researched", {})

        # The real sanitization + title validation logic is implemented later.
        # This is just enough to show the flow.
        validated = dict(extracted)
        if researched.get("title_seen_on_linkedin") is False:
            validated["position_title"] = None

        return {**state, "validated": validated}

    def compile(self):
        workflow = StateGraph(EmailState)
        workflow.add_node("extract", self.extract_node)
        workflow.add_node("research", self.research_node)
        workflow.add_node("validate", self.validate_node)

        workflow.add_edge(START, "extract")
        workflow.add_edge("extract", "research")
        workflow.add_edge("research", "validate")
        workflow.add_edge("validate", END)

        return workflow.compile()


async def main():
    graph = EmailGraph().compile()

    result = await graph.ainvoke(
        {
            "email_subject": "Re: Intro",
            "email_body": "Hi — I’m Alex. Title: Director of Engineering. Let's talk.",
        }
    )

    print("Extracted:", result.get("extracted"))
    print("Researched:", result.get("researched"))
    print("Validated:", result.get("validated"))


if __name__ == "__main__":
    asyncio.run(main())
```

And here’s the exact shape: three nodes, chained, no conditionals.

```mermaid
flowchart TD
  startNode[START] --> extractNode[extract]
  extractNode --> researchNode[research]
  researchNode --> validateNode[validate]
  validateNode --> finishNode[END]
```



That “always the same three chapters” property matters more than people expect. When you’re triaging a production incident, consistency beats cleverness.
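That property is easy to make concrete in logging. Here’s a minimal sketch; the `trace_record` helper and its field names are my own illustration, not the production pipeline:

```python
from typing import Any, Dict


def trace_record(run_id: str, state: Dict[str, Any]) -> Dict[str, Any]:
    """Build a log record with the same three chapters for every run.

    Chapters that never produced output appear explicitly as None instead
    of being absent, so triage never has to ask "did research run?".
    """
    return {
        "run_id": run_id,
        "extracted": state.get("extracted"),
        "researched": state.get("researched"),
        "validated": state.get("validated"),
    }


# Every record has the same shape, even for a run that died mid-flight.
partial_run = trace_record("run-42", {"extracted": {"position_title": "CTO"}})
```

Because the graph has no branches, this record never needs a "which path ran?" field — the absence of a chapter is itself the diagnostic.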

---

## The counterintuitive part: temperature=1 *in extraction*

My extraction step runs with **temperature=1**.

The point is not to be random. The point is to avoid the most damaging behavior I observed at temperature=0: **silent omission** of partially-supported fields.

Temperature=1 gives the model permission to output a best-effort candidate instead of going blank.

The tradeoff is obvious: you’ll see more variance, and some candidates will be wrong. That’s why I treat the extractor like an untrusted witness. The validate node is the court.

But if the extractor returns nothing, the court never gets the case.

---

## Show me the real call: a runnable LLM extractor with temperature=1

Here's a self-contained extractor function that calls an LLM with temperature=1, asks for strict JSON, parses the response safely, and returns a Python dict you can pass into the graph.

This is the exact structure I use in production: strict schema on the output, permissive decoding on the input.



```python
from __future__ import annotations

import json
from typing import Any, Dict, Optional

from openai import AsyncOpenAI


EXTRACTION_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "candidate_name": {"type": ["string", "null"]},
        "contact_email": {"type": ["string", "null"]},
        "phone": {"type": ["string", "null"]},
        "position_title": {"type": ["string", "null"]},
        "location": {"type": ["string", "null"]},
    },
    "required": ["candidate_name", "contact_email", "phone", "position_title", "location"],
}


def _safe_json_loads(text: str) -> Dict[str, Any]:
    """Parse JSON defensively. Raises ValueError on failure."""
    text = text.strip()
    if not text:
        raise ValueError("Empty response")
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    return data


async def extract_email_fields(
    client: AsyncOpenAI,
    *,
    model: str,
    subject: str,
    body: str,
) -> Dict[str, Any]:
    """LLM extraction with temperature=1.

    Returns a dict with keys from EXTRACTION_SCHEMA.
    """

    system = (
        "You extract recruiting intake fields from messy email text. "
        "Return ONLY JSON matching the provided schema. "
        "If a field is not present, use null. "
        "Do not invent facts. Best-effort partial extraction is allowed."
    )

    user = (
        f"EMAIL SUBJECT:\n{subject}\n\n"
        f"EMAIL BODY:\n{body}\n"
    )

    resp = await client.chat.completions.create(
        model=model,
        temperature=1,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "email_intake_extraction",
                "schema": EXTRACTION_SCHEMA,
                "strict": True,
            },
        },
    )

    content = resp.choices[0].message.content or ""
    return _safe_json_loads(content)
```

A few design notes that matter in the real world:

  • Temperature=1 doesn’t mean “no structure.” I keep structure by forcing a strict JSON schema. Temperature affects which candidate the model proposes, not whether the output is parseable.
  • I explicitly instruct “use null” for missing fields so the model doesn’t get cute.
  • This function returns a dict that flows into sanitization and validation before it touches storage.

## What actually changes at temperature=0 vs temperature=1

Here’s what I saw repeatedly on the same kind of email:

  • With temperature=0, the model often omits fields that are partially present.
  • With temperature=1, the model is more willing to output a plausible candidate.

The important point isn’t “temperature=1 is better.” The important point is that temperature=0 tends to fail silently under messy input.

And silent failures are poison in enterprise pipelines because they look like success.

So I optimize for:

  • high recall under messy input (extract)
  • additional context (research)
  • strict correctness boundary (validate)
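Splitting the stages this way also gives you a metric temperature=0 can never give you: how often validation rejects a candidate. A small observability helper I’d use for that — the name and shape are illustrative, not from the post’s code:

```python
from typing import Any, Dict, List


def dropped_fields(extracted: Dict[str, Any], validated: Dict[str, Any]) -> List[str]:
    """Fields the extractor proposed but validation nulled out.

    At temperature=0 these candidates are never emitted, so the drop is
    invisible; here it becomes a countable event you can chart over time.
    """
    return sorted(
        key
        for key, value in extracted.items()
        if value is not None and validated.get(key) is None
    )


dropped = dropped_fields(
    {"position_title": "I am looking for a Director role.", "phone": None},
    {"position_title": None, "phone": None},
)
# Only position_title counts: phone was never proposed in the first place.
```

A rising drop rate tells you the extractor is getting noisier; a drop rate of zero tells you validation may be asleep. Both are signals a silent pipeline can’t produce.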

## Validation Layer 1: sanitization that turns fake nulls into real nulls

One of the most expensive bugs in AI pipelines is also one of the dumbest: models output strings that look like nulls.

You’ve seen them:

  • "Unknown"
  • "N/A"
  • "null"

If you store those, downstream systems treat them as real values. Then you get UI bugs (“Unknown” displayed as if it were user-provided), analytics bugs (NULL checks don’t work), and search bugs (“Unknown” becomes a term).

So I normalize aggressively.

Here’s the sanitization function I use. It’s simple on purpose: it’s correct, testable, and it makes the rest of the pipeline calmer.

```python
from __future__ import annotations

import re
from typing import Any, Dict, Iterable, Optional


_NULL_TOKENS = {
    "unknown",
    "n/a",
    "na",
    "none",
    "null",
    "not provided",
    "not specified",
    "-",
}


_ws = re.compile(r"\s+")


def normalize_scalar(value: Any) -> Any:
    """Normalize common LLM extraction artifacts.

    - Trims strings and collapses whitespace.
    - Converts null-like tokens to None.
    - Leaves non-strings untouched.
    """
    if value is None:
        return None

    if isinstance(value, str):
        s = _ws.sub(" ", value.strip())
        if not s:
            return None
        if s.casefold() in _NULL_TOKENS:
            return None
        return s

    return value


def sanitize_extraction(extracted: Dict[str, Any], *, keys: Optional[Iterable[str]] = None) -> Dict[str, Any]:
    """Sanitize a dict of extracted values.

    If keys is provided, only sanitize those keys; otherwise sanitize all keys.
    """
    if keys is None:
        keys = extracted.keys()

    out = dict(extracted)
    for k in keys:
        if k in out:
            out[k] = normalize_scalar(out[k])
    return out
```

This is the kind of code that feels too basic to mention until you’ve watched a “null-ish string” contaminate downstream reporting for weeks.

Yes, there’s a theoretical edge case where the legitimate value is literally “Unknown.” In my domain, that’s not actionable data anyway, so I’d rather collapse it to None than pretend it’s a real value.


## Validation Layer 2: job title garbage detection (the hallucination bouncer)

When you turn temperature up, you buy recall. You also buy occasional nonsense.

A very common failure mode for titles is that the model outputs a sentence or a whole clause:

  • “He’s currently a senior manager leading platform initiatives…”
  • “Looking for a role as a director of engineering, preferably remote.”

That’s not a title. That’s a summary.

So I enforce a “bouncer” rule:

  • reject titles longer than 60 characters
  • reject titles containing sentence boundaries

Here’s the concrete implementation.

```python
from __future__ import annotations

import re
from typing import Optional


_SENTENCE_BOUNDARY = re.compile(r"[\.!?]|\n|\r")


def validate_position_title(title: Optional[str]) -> Optional[str]:
    """Return a cleaned title if plausible; otherwise None."""
    if title is None:
        return None

    title = title.strip()
    if not title:
        return None

    # Hard cap: job titles are short.
    if len(title) > 60:
        return None

    # Sentence boundary detection: reject full sentences.
    if _SENTENCE_BOUNDARY.search(title):
        return None

    # A small extra guard that catches many "title: blah blah" artifacts.
    lowered = title.casefold()
    if lowered.startswith("title:"):
        title = title.split(":", 1)[1].strip()
        if not title or len(title) > 60 or _SENTENCE_BOUNDARY.search(title):
            return None

    return title
```

That’s it. No embeddings, no classifier, no “AI safety layer.” Just blunt constraints that reflect the real domain boundary.

This is also why I’m comfortable with temperature=1: I’m not pretending the extractor is authoritative. I’m giving it room to propose candidates, then I enforce rules that keep the system honest.


## Put it together: validate node that sanitizes + rejects bad titles

Now we connect the pieces: sanitize the extracted dict, validate the job title field, and produce a validated payload for downstream systems.

```python
from __future__ import annotations

from typing import Any, Dict

# sanitize_extraction and validate_position_title come from the sections above.


def validate_extraction_payload(extracted: Dict[str, Any]) -> Dict[str, Any]:
    """Apply sanitization + field-level validation."""

    cleaned = sanitize_extraction(extracted)

    # Title validation
    cleaned["position_title"] = validate_position_title(cleaned.get("position_title"))

    return cleaned
```

In my production graph, this runs after the research node so I can also reject fields that conflict with corroborated data (for example, a title that doesn’t match what I found via company research). The exact cross-checks vary by tenant and workflow, but the principle stays the same: validation is where strictness lives.
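As one concrete (and deliberately simplified) example of that kind of cross-check, here’s a sketch that rejects a title contradicted by corroborated research data. The `corroborated_title` field and the function name are hypothetical — the real checks vary by tenant, as noted above:

```python
from typing import Any, Dict


def cross_check_title(validated: Dict[str, Any], researched: Dict[str, Any]) -> Dict[str, Any]:
    """Null out a title that conflicts with corroborated research data.

    Only acts when both sides have a value: missing research evidence
    is not treated as a contradiction.
    """
    out = dict(validated)
    title = out.get("position_title")
    corroborated = researched.get("corroborated_title")
    if title and corroborated and title.casefold() != corroborated.casefold():
        out["position_title"] = None
    return out


conflicting = cross_check_title(
    {"position_title": "CTO"},
    {"corroborated_title": "VP, Engineering"},
)
matching = cross_check_title(
    {"position_title": "cto"},
    {"corroborated_title": "CTO"},
)
```

Note the asymmetry: absent evidence leaves the candidate alone, because “we couldn’t corroborate it” is not the same claim as “we found it to be wrong.”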


## Tests and examples: accepted vs rejected titles

If you’re going to treat validation logic as part of your correctness boundary, it needs to be crisp and testable.

Here’s a runnable pytest-style test module you can drop into your repo.

```python
from __future__ import annotations

# Adjust this import to wherever the two helpers live in your repo.
from your_extraction_module import sanitize_extraction, validate_position_title


def test_validate_position_title_accepts_normal_titles():
    assert validate_position_title("Director of Engineering") == "Director of Engineering"
    assert validate_position_title("VP, People") == "VP, People"
    assert validate_position_title("  Senior Software Engineer  ") == "Senior Software Engineer"


def test_validate_position_title_rejects_sentences_and_long_text():
    assert validate_position_title("I am looking for a Director of Engineering role.") is None
    assert validate_position_title("Director of Engineering! Remote preferred") is None
    assert validate_position_title("A" * 61) is None


def test_sanitize_extraction_null_tokens():
    payload = {
        "candidate_name": "Unknown",
        "contact_email": " N/A ",
        "position_title": "null",
        "location": "  New   York ",
    }
    cleaned = sanitize_extraction(payload)
    assert cleaned["candidate_name"] is None
    assert cleaned["contact_email"] is None
    assert cleaned["position_title"] is None
    assert cleaned["location"] == "New York"
```

This is not optional engineering ceremony. It’s how you keep the “temperature=1” decision from turning into a permanent debugging tax.


## Why I don’t use conditional routing in this graph

It’s tempting to add conditional edges:

  • “If extraction confidence is low, go to research.”
  • “If job title looks weird, rerun extraction.”
  • “If phone is missing, branch into a second pass.”

I’ve built those systems. They can work.

But for this workflow, a fixed 3-step path wins because it behaves like a ledger:

  • every run creates an extraction attempt
  • every run creates enrichment context
  • every run runs validation

That consistency matters when you’re debugging a real incident from logs:

  • You don’t need to remember which branch ran.
  • You don’t need to wonder why research didn’t happen.
  • You don’t end up with half the runs missing evidence.

The deeper reason is that early conditional routing usually relies on weak proxies (“confidence”) computed from the same messy text. If you’re wrong early, you route wrong. I’d rather keep the pipeline deterministic in shape, then be strict at the end.


## The systems trade: variance is fine; brittleness is not

Here’s the honest cost of this approach:

  • temperature=1 increases candidate variance
  • validation code becomes part of the correctness boundary
  • you must invest in tests and observability for validation outcomes

But the benefit is the thing that actually matters in production: fewer silent failures.

A brittle extractor feels great in unit tests and collapses in the wild. A high-recall extractor feels noisy until you build the right shock absorbers — sanitizers, validators, corroboration, caching — and then it becomes dependable.

The most important reframing is this:

  • Extraction is not truth.
  • Validation is where truth enters the system.

Once you adopt that, temperature=1 stops being scary. It becomes a knob you can tune because you have a place to be strict.


## How this fits into the rest of the architecture

Part 1 in this series was about adversarial input: stripping garbage before it hits the model. That’s the front door.

This post is the middle: once you call the model, you decide what you’re optimizing for.

I optimize for:

  • recall on messy input (so weak signals survive)
  • context via research (so candidates can be corroborated)
  • strict validation before persistence (so the database doesn’t become a rumor mill)

And this only stays sane because the workflow is sequential:

  • extract can be generous
  • research can add context
  • validate can be ruthless

That’s the division of labor that made my pipeline stop “going quiet” when reality got messy.


## What I’d change next (and what I wouldn’t)

I wouldn’t go back to temperature=0 for this pipeline. The “clean output” is a mirage when the input is enterprise email.

If I were iterating further (and I am), I’d push the same principle into operations: don’t ask a single step to be both optimistic and correct. Let steps do their job, then verify.

In Part 3, I’m going to move from correctness to survivability: retries, caching, and what happens when a node fails mid-run — including how I keep the workflow debuggable when it hits the dead-letter queue.


🎧 Listen to the Enterprise AI Architecture audiobook
📖 Read the full 13-part series with an AI assistant
