Nimesh Kulkarni

Four LLM Workflows That Actually Survive Production

Most teams waste time trying to ship a magical assistant before they have one boring workflow that makes money or saves hours. The production wins usually come from narrow tasks, hard guardrails, and obvious success metrics.

If you are responsible for getting an LLM feature past the demo stage, these are the patterns I have seen hold up when traffic, messy input, and annoyed users show up.

1. extraction beats conversation when you need reliability

A lot of business data is trapped in PDFs, emails, tickets, forms, and chat transcripts. LLMs are very good at turning ugly text into structured objects if you stop asking for prose and start asking for a schema.

The key is to make the model do one job: read, normalize, and return fields you can validate. Do not ask it to explain itself unless you need human review. In production, explanations create longer outputs, higher cost, and more room for format drift.

A prompt like this is already better than most first attempts:

system: |
  Extract support case details from raw text.
  Return valid JSON only.
  If a field is missing, use null.
user: |
  Schema:
  {
    "customer_name": string | null,
    "issue_type": string | null,
    "priority": "low" | "medium" | "high" | null,
    "refund_requested": boolean | null
  }

  Raw text:
  {{ticket_text}}

Then validate the response like you would validate any untrusted input:

from pydantic import BaseModel, ValidationError

class TicketFields(BaseModel):
    customer_name: str | None
    issue_type: str | None
    priority: str | None
    refund_requested: bool | None

raw = llm_extract(ticket_text)  # your wrapper around the model call
try:
    fields = TicketFields.model_validate_json(raw)
except ValidationError:
    fields = None

This pattern works because the model handles fuzzy language, while your application still controls the contract. If validation fails, you retry with a narrower prompt or send the case to manual review. That is a real system, not a vibe.
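A minimal sketch of that fallback path, reusing TicketFields and ValidationError from above, and assuming your llm_extract wrapper can take a stricter prompt variant and that send_to_review pushes a case onto your manual queue (both are placeholder names, not a fixed API):

def extract_with_fallback(ticket_text: str) -> TicketFields | None:
    # First attempt with the normal extraction prompt.
    try:
        return TicketFields.model_validate_json(llm_extract(ticket_text))
    except ValidationError:
        pass

    # One retry with a narrower, schema-only prompt variant.
    try:
        return TicketFields.model_validate_json(llm_extract(ticket_text, strict=True))
    except ValidationError:
        # Still invalid: route the case to humans instead of guessing.
        send_to_review(ticket_text, reason="extraction_failed")
        return None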

2. draft generation works when a deterministic layer owns the facts

Teams get burned when they ask a model to generate customer emails, incident summaries, or release notes directly from memory. The fix is simple: split fact gathering from language generation.

Build a deterministic context object first. Pull the ticket fields, database values, latest order state, or incident timeline from trusted systems. Then ask the model to turn that context into copy for a human or a downstream tool.

def build_context(ticket, order, policy):
    return {
        "customer": ticket.customer_name,
        "issue": ticket.issue_type,
        "order_status": order.status,
        "eligible_refund": policy.allows_refund(order),
        "refund_amount": order.refund_amount,
    }

prompt = f"""
Write a support reply in plain English.
Use only these facts: {build_context(ticket, order, policy)}
Do not invent policy details.
Keep it under 120 words.
"""
reply = llm_generate(prompt)

Now the model is doing style and synthesis, which is where it shines. Your software still owns eligibility rules, prices, account status, and policy logic. This is the difference between a useful assistant and a liability.

3. LLM triage is strong when confidence controls the handoff

One of the best practical uses is first-pass triage: classify tickets, route alerts, label feedback, or score leads. The mistake is forcing the model to make every decision. You want confidence thresholds and an escape hatch.

A clean pattern is to ask for both a label and a confidence score, then route based on score bands. High confidence gets automated handling. Medium confidence goes to a queue with the model suggestion attached. Low confidence falls back to your existing process.
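A sketch of those score bands, assuming the classifier call returns JSON with a label and a confidence value, and that auto_handle, queue_for_review, and fallback_process are your own hooks (all hypothetical names):

import json

AUTO_THRESHOLD = 0.9     # tune both bands from your own error review
REVIEW_THRESHOLD = 0.6

def triage(text: str) -> None:
    raw = llm_classify(text)  # expected to return {"label": "...", "confidence": 0.0-1.0}
    try:
        result = json.loads(raw)
        label = result["label"]
        confidence = float(result["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        fallback_process(text)  # malformed output counts as low confidence
        return

    if confidence >= AUTO_THRESHOLD:
        auto_handle(text, label)
    elif confidence >= REVIEW_THRESHOLD:
        queue_for_review(text, suggested_label=label)
    else:
        fallback_process(text)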

That gives you an upgrade path. You can start conservative, inspect errors, and gradually automate more categories without betting the whole workflow on day one.

what actually trips people up

The model is rarely the main problem. The ugly failures usually come from everything around it.

First, prompt drift sneaks in through product changes. Someone adds a new field, another team renames a status, and nobody updates the prompt or schema. The feature still works on easy cases, so the breakage sits there quietly.

Second, teams skip adversarial inputs. They test on clean examples, not on OCR garbage, sarcastic customers, mixed languages, copied email chains, or logs pasted into a support box. Your eval set should look like your worst Tuesday, not your nicest demo.

Third, people do not budget for retries, rate limits, and timeout behavior. If the model call fails, what happens to the request? Do you drop the job, retry safely, or create duplicates? Production systems need idempotency keys and queue semantics long before they need a fancier prompt.
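A rough sketch of those semantics, with an in-memory dict standing in for a durable job store, call_model as a placeholder for your model call with its own timeout, and send_to_dead_letter for wherever failed jobs should land:

import hashlib
import time

completed = {}  # stand-in for a durable store keyed by idempotency key

def handle_job(ticket_id: str, ticket_text: str, max_attempts: int = 3):
    # The same ticket always maps to the same key, so a redelivered or
    # retried job never creates a duplicate side effect.
    key = hashlib.sha256(f"{ticket_id}:{ticket_text}".encode()).hexdigest()
    if key in completed:
        return completed[key]

    for attempt in range(max_attempts):
        try:
            result = call_model(ticket_text)
            completed[key] = result
            return result
        except TimeoutError:
            time.sleep(2 ** attempt)  # simple exponential backoff

    # Out of retries: park the job for inspection instead of dropping it.
    send_to_dead_letter(ticket_id)
    return None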

Fourth, nobody agrees on what good means. "Helpful" is not a metric. Pick something you can measure: exact field accuracy, handle time reduction, first response quality score, deflection rate, or human acceptance rate. If you cannot score it, you cannot improve it.

4. retrieval is useful, but only after you fix document hygiene

A lot of teams rush into retrieval-augmented generation and blame the model when answers are weak. Usually the real problem is garbage source material. If your runbooks conflict, your docs are stale, and your naming is inconsistent, retrieval just delivers bad context faster.

Before you spend a week tuning chunk sizes, clean the corpus. Remove duplicates, add ownership, stamp update dates, and split giant pages into stable sections. Then keep retrieval narrow. Search within the right product area, customer tier, or service boundary before you send context to the model.
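What keeping retrieval narrow can look like, assuming a vector_store.search call that accepts metadata filters; most vector databases expose something similar, but the exact API differs, so treat these names as placeholders:

def retrieve_context(query: str, product_area: str, customer_tier: str, k: int = 5):
    # Filter to the relevant slice of the corpus before similarity ranking,
    # so the model never sees pages from the wrong product or tier.
    return vector_store.search(
        query=query,
        filters={
            "product_area": product_area,
            "customer_tier": customer_tier,
        },
        top_k=k,
    )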

A small, clean corpus beats a giant messy one. Every time.

the stack that tends to age well

You do not need a giant platform to get results. A simple stack covers most teams:

  • queue for asynchronous jobs
  • typed validation for model output
  • prompt templates in version control
  • tracing for latency, token use, and failures
  • a review UI for low-confidence cases
  • offline evals before you change prompts or models

That stack is boring on purpose. Boring is good when your feature touches customers or internal operations.

If you are picking one practical LLM project this week, start with extraction or triage on a workflow your team already understands. Instrument the baseline, automate only the high-confidence slice, and review the misses every Friday. By the end of the month you will have a system that either saves real time or gives you clean evidence that it should not ship. Today, pick one queue with repetitive text input, define a schema or label set, and put the first hundred examples into an eval file.
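To make that eval file concrete, a minimal sketch: one JSON object per line with the raw text and the expected fields, scored by exact field accuracy (extract_fields stands in for your own extraction pipeline returning a dict):

import json

def field_accuracy(eval_path: str) -> float:
    # Each line: {"text": "...", "expected": {"issue_type": "billing", ...}}
    total = correct = 0
    with open(eval_path) as f:
        for line in f:
            example = json.loads(line)
            predicted = extract_fields(example["text"])
            for field, expected in example["expected"].items():
                total += 1
                if predicted.get(field) == expected:
                    correct += 1
    return correct / total if total else 0.0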
