Building a chatbot is easy. Building an AI agent that can review a 50-page Master Services Agreement and suggest redlines without breaking the document formatting is a different problem entirely.
In this post, I'll walk through how I dug into this problem as a weekend project and ended up designing a system that automates contract review for legal teams. The challenge wasn't calling an LLM API—it was everything else: maintaining document structure, handling mutable paragraphs, and generating valid Microsoft Word tracked changes programmatically.
The Problem Space
Contract review follows a predictable pattern. A legal team receives a counterparty's redlined contract, reviews each change against the organization's risk tolerance, and accepts, rejects, or modifies each suggestion. This process takes hours for a single contract.
So when I set out to automate this process, I realized it involves several requirements, including:
- Analyze contracts against specific guidelines
- Generate specific text suggestions with rationale
- Apply changes as Word tracked changes—not as plain text replacements
- Survive document mutations—users edit contracts while analysis runs
The last two requirements—applying tracked changes and surviving mutations—are where most tools struggle. They output suggestions in a chat interface. Users still have to manually copy-paste and format. That's not automation; it's a fancier Ctrl+F.
System Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Web App │ │ Word Add-in │ │ Backend │ │
│ │ (Next.js) │ │ (Office.js) │ │ (FastAPI) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ │ REST + SSE │ REST + SSE │ │
│ └────────────────────┴────────────────────┘ │
│ │ │
│ ┌──────────┴──────────┐ │
│ │ Analysis Engine │ │
│ │ ┌───────────────┐ │ │
│ │ │ DSPy + LLM │ │ │
│ │ │ (OpenAI / │ │ │
│ │ │ Mistral) │ │ │
│ │ └───────────────┘ │ │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Three components:
- Web Dashboard: Next.js application for rule management, analytics, and administration
- Word Add-in: Microsoft Office plugin (React + Office.js) where users actually review contracts
- Backend API: FastAPI service handling analysis, LLM orchestration, and document processing
The interesting engineering lives in the Word Add-in (document manipulation) and Backend API (analysis pipeline).
Challenge #1: The Mutable Document Problem
Here's a scenario that breaks naive implementations:
- User uploads 50-page contract
- System analyzes paragraphs 1-50, stores suggestions keyed by paragraph index
- User deletes paragraph 12 while waiting
- System returns: "Paragraph 47 needs revision"
- Paragraph 47 is now paragraph 46. Suggestion applied to wrong location.
Solution: Paragraph Anchoring
During preprocessing, I assign each paragraph a persistent ID embedded in the document itself. Because the ID lives in the OOXML, it survives even when users:
- Delete adjacent paragraphs
- Cut and paste sections
- Accept/reject other tracked changes
On the frontend, a Zustand store maintains bidirectional mappings:
```typescript
interface ParagraphStore {
  indexToPersistentIdMap: Map<number, string>;   // index → UUID
  persistentIdToIndexMap: Map<string, number>;   // UUID → index
  findAnchorByText(text: string): string | null; // fallback matching
}
```
When analysis results return, they reference UUIDs. The store resolves current paragraph indices at application time—not at analysis time.
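The resolution logic can be sketched in plain Python (the real store is TypeScript/Zustand; the class and method names here are hypothetical):

```python
from typing import Optional

class ParagraphAnchors:
    """Bidirectional index <-> persistent-ID mapping with a text fallback."""

    def __init__(self) -> None:
        self.id_to_index: dict[str, int] = {}
        self.id_to_text: dict[str, str] = {}

    def register(self, index: int, pid: str, text: str) -> None:
        self.id_to_index[pid] = index
        self.id_to_text[pid] = text

    def resolve(self, pid: str, current_paragraphs: list[str]) -> Optional[int]:
        """Return the paragraph's CURRENT index, not its index at analysis time."""
        # Fast path: the stored index still points at the same paragraph.
        idx = self.id_to_index.get(pid)
        if idx is not None and idx < len(current_paragraphs) \
                and current_paragraphs[idx] == self.id_to_text.get(pid):
            return idx
        # Fallback: the document mutated; re-locate the anchor by its text.
        try:
            return current_paragraphs.index(self.id_to_text[pid])
        except (ValueError, KeyError):
            return None  # paragraph was deleted entirely
```

The fallback matters: if a user deletes a paragraph above the anchor, the stored index is stale, but the text match still recovers the right location.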
Challenge #2: Generating Word Tracked Changes
This is the hard part.
Office.js provides no API for creating tracked changes. The paragraph.insertText() method just replaces text. To create actual redlines (strikethrough deletions, colored insertions), you must:
- Generate a diff between original and suggested text
- Convert that diff to OOXML elements
- Apply these elements to the actual content using Office.js
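For context, the tracked-change elements step 2 produces look roughly like this—a sketch that builds the `w:del` and `w:ins` fragments as strings (function names are my own; the real system assembles complete OOXML and applies it through Office.js):

```python
from xml.sax.saxutils import escape

def tracked_delete(text: str, author: str, date: str, rid: int) -> str:
    # A tracked deletion wraps the run in <w:del> and uses <w:delText>
    # instead of <w:t>; Word renders it as strikethrough.
    return (f'<w:del w:id="{rid}" w:author="{author}" w:date="{date}">'
            f'<w:r><w:delText xml:space="preserve">{escape(text)}</w:delText>'
            f'</w:r></w:del>')

def tracked_insert(text: str, author: str, date: str, rid: int) -> str:
    # A tracked insertion wraps an ordinary run in <w:ins>;
    # Word renders it as colored, underlined text.
    return (f'<w:ins w:id="{rid}" w:author="{author}" w:date="{date}">'
            f'<w:r><w:t xml:space="preserve">{escape(text)}</w:t>'
            f'</w:r></w:ins>')
```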
For the diff generation, I implemented token-based diffing.
Character-level diffs create garbage in Word. "The quick brown fox" → "A quick red fox" would show:
T̶h̶e̶A quick b̶r̶o̶w̶n̶red fox
Unusable. Token-level diffs are cleaner:
The → A quick brown → red fox
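A minimal version of token-level diffing, with Python's standard `difflib` standing in for the production algorithm:

```python
import difflib

def token_diff(original: str, suggested: str) -> list[tuple[str, str, str]]:
    """Word-level diff: tokenize on whitespace FIRST, then run the matcher."""
    a, b = original.split(), suggested.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        # Each op maps cleanly to OOXML: "replace"/"delete" -> w:del,
        # "replace"/"insert" -> w:ins, "equal" -> untouched runs.
        ops.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return ops
```

Running it on the example above yields whole-word operations—`The → A`, `brown → red`—instead of character fragments.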
Another important consideration is preserving the original paragraph properties. Contracts have formatting—numbering, indentation, styles—and naive replacement destroys it.
Challenge #3: Long-Running Analysis
A 50-page contract with 30 playbook rules can take 2-3 minutes to analyze. HTTP requests shouldn't block that long.
Session-Based Async Processing
Client Server
│ │
│ POST │
│ ─────────────────────────────>│
│ │ Create session
│ { session_id: "abc123" } │ Start background task
│ <─────────────────────────────│
│ │
│ GET /sessions/abc123 │
│ ─────────────────────────────>│
│ { status: "processing", │
│ progress: 45% } │
│ <─────────────────────────────│
│ │
│ ... poll every 3 seconds ... │
│ │
│ GET /sessions/abc123 │
│ ─────────────────────────────>│
│ { status: "complete", │
│ results: [...] } │
│ <─────────────────────────────│
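The polling flow above can be sketched in plain Python—a daemon thread stands in for the real FastAPI background task, and the function names map loosely to the two endpoints:

```python
import threading
import time
import uuid

sessions: dict[str, dict] = {}  # in production: a database, not a dict

def start_analysis(paragraphs: list[str]) -> str:
    """POST /analyze — create a session and kick off a background task."""
    session_id = uuid.uuid4().hex[:8]
    sessions[session_id] = {"status": "processing", "progress": 0, "results": []}

    def worker() -> None:
        for i, _para in enumerate(paragraphs, 1):
            time.sleep(0.01)  # stand-in for per-paragraph LLM analysis
            sessions[session_id]["progress"] = int(100 * i / len(paragraphs))
        sessions[session_id].update(status="complete", results=["..."])

    threading.Thread(target=worker, daemon=True).start()
    return session_id

def get_session(session_id: str) -> dict:
    """GET /sessions/{id} — what the client polls every few seconds."""
    return sessions[session_id]
```

The hard parts aren't shown here: persisting sessions so they survive server restarts, and letting a client that refreshed mid-analysis re-attach by session ID.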
Cache Validation with Content Hashing
Users often analyze the same contract multiple times—different guidelines, or checking after minor edits. Re-analyzing unchanged content wastes time and API costs.
The hash comparison catches:
- Re-uploads of identical files
- "Analyze again" clicks without actual changes
- Multiple users analyzing the same template
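A minimal sketch of the cache key, assuming SHA-256 over normalized paragraph content plus the selected rule set (the helper name is hypothetical):

```python
import hashlib

def cache_key(paragraphs: list[str], rule_ids: list[str]) -> str:
    """SHA-256 over normalized content + rule set; identical inputs hit the cache."""
    h = hashlib.sha256()
    for p in paragraphs:
        h.update(p.strip().encode("utf-8"))
        h.update(b"\x00")  # separator so paragraph boundaries affect the hash
    # Sort rule IDs so the key is independent of selection order.
    h.update(",".join(sorted(rule_ids)).encode("utf-8"))
    return h.hexdigest()
```

Hashing the rule set alongside the content matters: the same contract analyzed against different guidelines must not share a cache entry.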
Cache hit rate in production: ~40% for typical contract review workflows.
Challenge #4: Grounding and Hallucination Prevention
Legal documents require precision. An AI suggesting "Vendor liability is capped at $1M" when the contract says "$500K" is worse than no suggestion at all.
The solution: structured output with explicit citations.
Every suggestion must reference the exact source text.
In practice, this catches cases where the model paraphrases slightly instead of quoting exactly.
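The validation layer reduces to a check like this—a sketch assuming a hypothetical schema where each suggestion carries a `source_quote` field:

```python
import re

def _norm(s: str) -> str:
    """Collapse whitespace so line wrapping doesn't cause false rejections."""
    return re.sub(r"\s+", " ", s).strip()

def validate_suggestion(suggestion: dict, paragraph_text: str) -> bool:
    """Reject any suggestion whose quoted source isn't verbatim in the paragraph."""
    quote = _norm(suggestion.get("source_quote", ""))
    return bool(quote) and quote in _norm(paragraph_text)
```

Suggestions that fail this check are dropped before they ever reach the user.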
The Analysis Pipeline
Putting it together:
┌────────────────────────────────────────────────────────────────┐
│ Redline Analysis Pipeline │
├────────────────────────────────────────────────────────────────┤
│ │
│ 1. DOCUMENT INGESTION │
│ ┌─────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ DOCX │────>│ Extract │────>│ Paragraph │ │
│ │ File │ │ OOXML │ │ Anchoring │ │
│ └─────────┘ └─────────────┘ └──────────────┘ │
│ │
│ 2. CONTENT NORMALIZATION │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ OOXML with │────>│ Unified │ │
│ │ Tracked │ │ Markdown │ │
│ │ Changes │ │ (Original + │ │
│ │ │ │ Revised views) │ │
│ └─────────────┘ └─────────────────┘ │
│ │
│ 3. LLM ANALYSIS │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ │────>│ DSPy │────>│ Structured │ │
│ │ Rules │ │ Signatures │ │ Suggestions │ │
│ └─────────────┘ └─────────────┘ └──────────────┘ │
│ │
│ 4. OUTPUT GENERATION │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ Suggestions │────>│ Token Diff │────>│ OOXML │ │
│ │ + Rationale │ │ Algorithm │ │ │ │
│ └─────────────┘ └─────────────┘ └──────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
The OOXML-to-Markdown conversion deserves special mention. Incoming contracts often already have tracked changes from counterparty negotiations. The converter:
- Parses elements
- Generates two synchronized views: Original (with deletions, without insertions) and Revised (with insertions, without deletions)
- Preserves paragraph IDs from content controls
This abstraction means the LLM analyzes clean markdown, not raw XML. The complexity stays in the conversion layer.
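The dual-view generation can be sketched with runs tagged by kind—a simplification of the real OOXML parse, where each run is already classified as plain text, insertion, or deletion:

```python
def dual_views(runs: list[tuple[str, str]]) -> tuple[str, str]:
    """Build the two synchronized views from tagged runs.

    Each run is (kind, text) with kind in {"text", "ins", "del"}.
    Original keeps deletions and drops insertions; Revised does the opposite.
    """
    original = "".join(t for k, t in runs if k in ("text", "del"))
    revised = "".join(t for k, t in runs if k in ("text", "ins"))
    return original, revised
```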
Results
The system processes a 20-page contract in approximately 30-45 seconds, depending on rules complexity. Key metrics:
- Cache hit rate: ~40% (saves re-analysis on unchanged content)
- Hallucination rate: <5% (caught by validation, not surfaced to users)
- Format preservation: 95% (paragraph properties maintained)
- Tracked change accuracy: Token-level precision
Lessons Learned
Office.js is powerful but underdocumented. The OOXML manipulation pattern isn't in any official guide. I reverse-engineered it by exporting documents and reading the XML.
Character-level diffs are wrong for documents. Always tokenize first. General-purpose diff libraries don't know about words.
Async patterns matter more than you think. The session-based polling approach sounds simple, but handling edge cases (browser refresh, network drops, server restarts) required careful state management.
Ground everything. LLMs will confidently cite text that doesn't exist. Validation layers catch this, but only if you design the output schema to require explicit source references.
Content hashing is cheap insurance. The SHA-256 computation is negligible compared to LLM costs. Cache validation paid for itself in the first week.
Tech Stack Summary
| Layer | Technology | Why |
|---|---|---|
| Backend API | FastAPI (Python) | Async-native, great for long-running tasks |
| LLM Orchestration | DSPy | Structured outputs, provider-agnostic |
| LLM Providers | OpenAI, Mistral | Redundancy, cost optimization |
| Database | Supabase (PostgreSQL) | Real-time subscriptions, hosted |
| Web Frontend | Next.js | SSR for dashboard, API routes |
| Word Add-in | React + Office.js | Only option for Word integration |
| Document Processing | python-docx, custom OOXML | No library handles tracked changes |
Closing Thoughts
The interesting engineering in "AI for X" products is rarely the AI part. Calling an LLM API is straightforward. The challenge is everything around it: maintaining document fidelity, handling state across long-running operations, and building validation layers that catch model failures before users see them.
Legal redlining pushed me to solve problems I didn't anticipate—paragraph anchoring, OOXML manipulation, token-based diffing. Each solution came from understanding the domain deeply, not from finding a better prompt.
If you're building in this space, I'd be interested to hear about your approach.
Arun Venkataramanan is a Senior Software Engineer at Ottimate, where he works on architecting solutions for accounts payable automation. With a background spanning core banking systems (TCS), fintech platforms, and enterprise automation, he focuses on building tools that help users automate repetitive tasks in their day-to-day work.
Connect on LinkedIn