
Manjunath


Building Production-Grade Human-in-the-Loop Workflow Automation with LangGraph

The Problem With Enterprise Approval Workflows

Most enterprise approval workflows are not systems. They are sequences of emails.

A compliance review is filed. Someone forwards it to a reviewer. The reviewer replies. A manager is CC'd. Someone updates a spreadsheet. Three days later, the spreadsheet has a new column that no one agreed to add.

When something goes wrong - a decision is disputed, an auditor asks questions, a regulator wants a decision log - the answer is in someone's inbox. If the reviewer has left the company, the answer may not be recoverable at all.

The pattern breaks down further when workflows cross systems. A procurement approval might require a vendor check, a budget validation, a legal review, and a final sign-off. Each step is handled by a different team, in a different system, with no shared state. When step three fails, starting over means starting from step one.

The technical problem is the absence of persistent, structured workflow state. A workflow that lives in email has no state. It can't be paused and resumed. It can't be audited. It can't be recovered if a step fails.
This post covers how I built a platform to solve this using LangGraph, FastAPI, and SQLite - with a production path to Azure.
Why LangGraph
The core requirement was a workflow engine that could pause at a human decision point and resume from that exact position - surviving server restarts between the pause and the resume.
LangGraph's StateGraph is well-suited to this because it separates the workflow structure from the workflow state. The graph is a set of nodes (agent functions) and edges (routing logic). The state is a typed dictionary that flows through the graph. Checkpointing saves the state at each transition.
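
To make the structure/state split concrete, here is a minimal plain-Python sketch. It is not LangGraph itself: the state fields (document_id, risk_score, route), the node names, and the threshold of 70 are all assumptions made up for illustration.

```python
from typing import Callable, TypedDict


class ReviewState(TypedDict, total=False):
    # Hypothetical fields; the platform's real state schema is not shown in this post.
    document_id: str
    risk_score: int
    route: str


def score_risk(state: ReviewState) -> ReviewState:
    # A node reads the state and returns only the keys it changes.
    return {"risk_score": 74}


def choose_route(state: ReviewState) -> ReviewState:
    # Routing logic inspects the state; it does not own any data itself.
    return {"route": "human_review" if state["risk_score"] >= 70 else "auto_approve"}


# The structure: an ordered list of nodes. It holds no workflow data.
NODES: list[Callable[[ReviewState], ReviewState]] = [score_risk, choose_route]


def run(state: ReviewState) -> ReviewState:
    for node in NODES:
        state = {**state, **node(state)}  # merge each node's update into the state
    return state


final_state = run({"document_id": "doc-1"})
```

Because the nodes never hold data between calls, the state dictionary is the only thing that has to be saved to pause a workflow, which is what checkpointing relies on.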

Two specific LangGraph primitives made this practical:

  • interrupt_before: The graph can be compiled with a list of node names that should trigger an interrupt before execution. When the graph reaches one of those nodes, it halts, persists the current state to the checkpointer, and returns control to the caller. The graph resumes when explicitly invoked again with the same thread ID.
  • AsyncSqliteSaver: A persistent checkpoint backend that writes graph state to SQLite. Unlike the in-memory MemorySaver, which is process-local, AsyncSqliteSaver persists across server restarts. Any process that opens the same database file can read the checkpoints.
These two primitives are the foundation of the human-in-the-loop pattern described in the next section.
The Checkpoint Pattern
The most common mistake in stateful workflow systems is assuming process memory is durable.
If the workflow is running inside a long-lived process, and that process restarts, the workflow state is gone. In practice, this means every server restart, every deployment, and every crash silently kills every in-flight workflow.
The fix is to write state to a persistent store at every transition, not just at the end.

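# In newer LangGraph releases this import lives at langgraph.checkpoint.sqlite.aio.
from langgraph.checkpoint.aiosqlite import AsyncSqliteSaver

# Open the checkpoint store, build the graph against it, and run one thread.
async with AsyncSqliteSaver.from_conn_string(CHECKPOINT_DB_URL) as checkpointer:
    graph = workflow_module.build_graph(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": workflow_id}}
    result = await graph.ainvoke(input_state, config=config)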

Invoking the graph again with the same thread_id, passing None as the input, resumes from the last persisted checkpoint. If the server restarts between the risk scoring step and the human review step, the next invocation picks up from the risk scoring output, not from the beginning.
In production, CHECKPOINT_DB_URL is a Postgres connection string and the checkpointer is a Postgres-backed saver such as AsyncPostgresSaver. The graph and node code do not change.
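
The idea of persisting at every transition can be sketched with nothing but the standard library. This is an illustration of the pattern, not how LangGraph's checkpointers are implemented; the stage names, the checkpoints table, and the crash_before parameter are all hypothetical.

```python
import json
import os
import sqlite3
import tempfile

# Hypothetical stages; each returns the keys it adds to the state.
STAGES = [
    ("intake", lambda s: {"intake_done": True}),
    ("risk_scoring", lambda s: {"risk_score": 74}),
    ("report", lambda s: {"report": f"score={s['risk_score']}"}),
]


def open_store(path):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "thread_id TEXT PRIMARY KEY, next_stage INTEGER, state TEXT)"
    )
    return conn


def run(conn, thread_id, initial_state, crash_before=None):
    row = conn.execute(
        "SELECT next_stage, state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    # Resume from the saved position if one exists, otherwise start fresh.
    index, state = (row[0], json.loads(row[1])) if row else (0, dict(initial_state))
    executed = []
    for i in range(index, len(STAGES)):
        name, stage = STAGES[i]
        if name == crash_before:
            raise RuntimeError("simulated crash")
        state.update(stage(state))
        executed.append(name)
        with conn:  # commit after every transition, not only at the end
            conn.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (thread_id, i + 1, json.dumps(state)),
            )
    return state, executed


db_path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")
conn = open_store(db_path)
try:
    run(conn, "wf-1", {"document_id": "doc-1"}, crash_before="report")
except RuntimeError:
    pass
conn.close()  # the "process" dies here

conn = open_store(db_path)  # a fresh connection, as after a restart
state, executed = run(conn, "wf-1", {"document_id": "doc-1"})
```

After the simulated restart, only the report stage runs; intake and risk scoring are not repeated because their results were already on disk.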
The Human Pause: Interrupt vs Polling
The conventional approach to human-in-the-loop is a polling loop: an agent writes a "pending review" flag to a database, and a background process polls until a human updates the flag.
This has two failure modes. First, the polling process itself is a point of failure - if it crashes, the workflow never resumes. Second, concurrent reviewers can both see "pending" and submit conflicting decisions before either decision is reflected.
The interrupt approach eliminates both.

# Pause before decision_agent so a human can review the automated analysis.
graph = builder.compile(
    checkpointer=checkpointer,
    interrupt_before=["decision_agent"],
)

When the graph reaches decision_agent, it halts. The caller receives control. The workflow state is in the checkpoint store. No polling. No flags. No background process.
Resume happens via a single API call:

# Human submits decision via POST /api/workflows/{id}/decide
config = {"configurable": {"thread_id": workflow_id}}
await graph.aupdate_state(
    config=config,
    values={"human_decision": decision, "decision_notes": notes},
)
# Passing None as input resumes from the checkpoint instead of starting over.
result = await graph.ainvoke(None, config=config)

The graph loads the checkpoint, applies the updated state, and continues from decision_agent. The reviewer's decision, identity, and timestamp are written to the audit trail before the graph resumes.
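
The pause/update/resume cycle can be modelled in a few lines of plain Python. This is a simplified sketch of the control flow only, not LangGraph's implementation: the in-memory checkpoints dictionary stands in for the checkpoint store, and invoke and update_state are hypothetical names that loosely mirror ainvoke and aupdate_state.

```python
from typing import Any, Callable, Optional

State = dict[str, Any]


def risk_scoring(state: State) -> State:
    return {"risk_score": 74}  # hypothetical automated stage


def decision_agent(state: State) -> State:
    # Runs only after a human decision has been merged into the state.
    approved = state["human_decision"] == "approve"
    return {"outcome": "certificate" if approved else "rejection_report"}


NODES: list[tuple[str, Callable[[State], State]]] = [
    ("risk_scoring", risk_scoring),
    ("decision_agent", decision_agent),
]
INTERRUPT_BEFORE = {"decision_agent"}

# Stand-in for the checkpoint store: thread_id -> (index of next node, state).
checkpoints: dict[str, tuple[int, State]] = {}


def invoke(thread_id: str, new_input: Optional[State] = None) -> State:
    index, state = checkpoints.get(thread_id, (0, {}))
    resuming = new_input is None  # None means "continue past the pause"
    state = {**state, **(new_input or {})}
    while index < len(NODES):
        name, node = NODES[index]
        if name in INTERRUPT_BEFORE and not resuming:
            checkpoints[thread_id] = (index, state)  # persist, then hand back control
            return state
        state = {**state, **node(state)}
        index += 1
        resuming = False
        checkpoints[thread_id] = (index, state)
    return state


def update_state(thread_id: str, values: State) -> None:
    index, state = checkpoints[thread_id]
    checkpoints[thread_id] = (index, {**state, **values})


paused = invoke("wf-1", {"document_id": "doc-1"})  # stops before decision_agent
update_state("wf-1", {"human_decision": "approve", "decision_notes": "ok"})
final = invoke("wf-1")  # resumes at decision_agent
```

Nothing runs between the pause and the resume: the workflow exists only as a stored position and a stored state until the reviewer's request arrives.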
Immutable Audit Trails
An audit trail that can be modified after the fact is not an audit trail.
Every event in this platform is appended to a log. No update operations. No delete operations. The audit logger exposes a single method:

await audit_logger.log(
    workflow_id=workflow_id,
    stage="risk_scoring",
    actor="SYSTEM",
    event_type="RISK_SCORE_COMPUTED",
    data={"risk_score": 74, "reasoning_summary": "Three rule failures in financial controls section"},
)

The data field is intentionally sanitized before logging. Document content - extracted text, raw field values, personal data - is never written to the audit trail. The log records what the system did (risk score computed, rule evaluated, human decision submitted) and the structured metadata that supports that record. Not the raw content that was processed.
This matters when the audit trail is itself subject to data retention requirements. A log that contains full document text is subject to the same retention and access controls as the document. A log that contains metadata is not.
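
One way to get both properties, append-only storage and sanitized payloads, is sketched below with SQLite triggers and an allowlist. This is an assumption about how such a logger could be built, not the platform's actual code: the table layout, the ALLOWED_KEYS set, and the log_event helper are hypothetical.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Assumption: an explicit allowlist of metadata keys that are safe to log.
ALLOWED_KEYS = {"risk_score", "reasoning_summary", "rule_id", "decision"}

SCHEMA = """
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    logged_at TEXT NOT NULL,
    workflow_id TEXT NOT NULL,
    stage TEXT NOT NULL,
    actor TEXT NOT NULL,
    event_type TEXT NOT NULL,
    data TEXT NOT NULL
);
CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
"""


def sanitize(data):
    # Drop anything not explicitly allowed, e.g. extracted text or raw field values.
    return {key: value for key, value in data.items() if key in ALLOWED_KEYS}


def log_event(conn, workflow_id, stage, actor, event_type, data):
    with conn:  # one committed INSERT per event
        conn.execute(
            "INSERT INTO audit_log"
            " (logged_at, workflow_id, stage, actor, event_type, data)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), workflow_id, stage, actor,
             event_type, json.dumps(sanitize(data))),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
log_event(conn, "wf-1", "risk_scoring", "SYSTEM", "RISK_SCORE_COMPUTED",
          {"risk_score": 74, "extracted_text": "full document text"})

stored = json.loads(conn.execute("SELECT data FROM audit_log").fetchone()[0])

try:
    conn.execute("UPDATE audit_log SET actor = 'someone-else'")
    update_blocked = False
except sqlite3.DatabaseError:
    update_blocked = True
```

The triggers stop accidental updates and deletes from application code. They do not stop someone with direct database access from dropping the triggers, so database permissions still matter.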
Pluggable Workflow Registry
The architecture has a single orchestration engine and multiple workflow modules. Adding a new workflow requires one new folder in workflows/, implementing a standard interface:

from langgraph.graph import StateGraph

class WorkflowModule:
    name: str
    description: str

    def build_graph(self, checkpointer) -> StateGraph:
        ...

    def get_input_schema(self) -> dict:
        ...

The registry discovers and loads modules at startup. The API, the dashboard, and the audit trail require no changes when a new workflow is added.
The platform currently ships with two modules: compliance review and procurement. Both were added without modifying the orchestration engine. The third module - whatever it is - will be added the same way.
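
A registry of this kind can be sketched as follows. The WorkflowRegistry class and both example modules are hypothetical, build_graph returns a placeholder dictionary rather than a real graph, and modules are registered explicitly here instead of being discovered from the workflows/ folder.

```python
from typing import Any, Protocol


class WorkflowModule(Protocol):
    name: str
    description: str

    def build_graph(self, checkpointer: Any) -> Any: ...
    def get_input_schema(self) -> dict: ...


class WorkflowRegistry:
    """Hypothetical registry; the platform's real discovery code is not shown here."""

    def __init__(self) -> None:
        self._modules: dict[str, WorkflowModule] = {}

    def register(self, module: WorkflowModule) -> None:
        if module.name in self._modules:
            raise ValueError(f"duplicate workflow name: {module.name}")
        self._modules[module.name] = module

    def get(self, name: str) -> WorkflowModule:
        return self._modules[name]

    def names(self) -> list[str]:
        return sorted(self._modules)


class ComplianceReview:
    name = "compliance_review"
    description = "Scores a document, then pauses for a human decision."

    def build_graph(self, checkpointer: Any) -> Any:
        # Placeholder standing in for a compiled graph.
        return {"workflow": self.name, "checkpointer": checkpointer}

    def get_input_schema(self) -> dict:
        return {"type": "object", "required": ["document_id"]}


class Procurement:
    name = "procurement"
    description = "Vendor check, budget validation, legal review, sign-off."

    def build_graph(self, checkpointer: Any) -> Any:
        return {"workflow": self.name, "checkpointer": checkpointer}

    def get_input_schema(self) -> dict:
        return {"type": "object", "required": ["vendor_id", "amount"]}


registry = WorkflowRegistry()
for module in (ComplianceReview(), Procurement()):
    registry.register(module)

graph = registry.get("procurement").build_graph(checkpointer=None)
```

The engine only ever talks to the registry and the shared interface, so a third module is one more register call and no change to the engine.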
What This Enables
The compliance review workflow demonstrates the pattern at its most structured. Six automated stages produce a risk score and a rule evaluation before a human reviewer sees the workflow. The reviewer sees the complete automated analysis - not a summary, the full output - and submits a decision. The workflow generates a compliance certificate or a rejection report. The audit trail records every stage from document intake to certificate generation.
The same pattern applies to any workflow where:

  • Multiple sequential steps process the same input
  • A human decision is required at a defined checkpoint
  • The decision and its context must be traceable after the fact

Vendor onboarding, contract review, budget approval, incident escalation - all of these map cleanly to the same architecture.
The platform is local-first, with a documented path to Azure: SQLite to Postgres, local file storage to Blob Storage, API keys to Key Vault, uvicorn to Container Apps. One environment variable change per component.
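
As a rough sketch of what a one-variable switch could look like for the checkpoint store, the snippet below picks a backend from the URL scheme. The Settings class, the load_settings helper, and the default value of checkpoints.db are hypothetical; only the CHECKPOINT_DB_URL name comes from the code earlier in this post.

```python
import os
from dataclasses import dataclass
from typing import Mapping
from urllib.parse import urlparse


@dataclass(frozen=True)
class Settings:
    checkpoint_db_url: str
    checkpoint_backend: str  # "sqlite" or "postgres"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # Assumption: a bare file path means local SQLite, a postgres URL means Postgres.
    url = env.get("CHECKPOINT_DB_URL", "checkpoints.db")
    scheme = urlparse(url).scheme
    backend = "postgres" if scheme in {"postgres", "postgresql"} else "sqlite"
    return Settings(checkpoint_db_url=url, checkpoint_backend=backend)


local = load_settings({})
cloud = load_settings({"CHECKPOINT_DB_URL": "postgresql://user@db.example/workflows"})
```

The rest of the application would read checkpoint_backend once at startup to decide which checkpointer to construct.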

Conclusion

The technical foundation for reliable enterprise workflow automation is not complicated. Persistent state, genuine human-in-the-loop interrupts, and an immutable audit log cover the majority of the requirements in regulated industries.
The difficulty is in the details: checkpoints that survive restarts, interrupt/resume that doesn't require polling, audit logs that capture decisions without capturing personal data.
The full platform, including architecture diagrams, state machine documentation, a working demo, and 56 passing tests, is at:
https://github.com/manjunath-hanmantgad/multi-agent-orchestration

Built with LangGraph, FastAPI, SQLite (Postgres-ready), and Tailwind CSS.
