Building a chatbot is easy. Building an AI agent that can review a 50-page Master Services Agreement and suggest redlines without breaking the document formatting is a different problem entirely.
In this post, I'll walk through how I dug into this problem as a weekend project and ended up designing a system that automates contract review for legal teams. The challenge wasn't calling an LLM API—it was everything else: maintaining document structure, handling mutable paragraphs, and generating valid Microsoft Word tracked changes programmatically.
The Problem Space
Contract review follows a predictable pattern. A legal team receives a counterparty's redlined contract, reviews each change against the organization's risk tolerance, and accepts, rejects, or modifies each suggestion. This process takes hours for a single contract.
So when I set out to automate this process, I realized it involves several requirements, including:
- Analyze contracts against specific guidelines
- Generate specific text suggestions with rationale
- Apply changes as Word tracked changes—not as plain text replacements
- Survive document mutations—users edit contracts while analysis runs
The last two requirements—applying tracked changes and surviving mutations—are where most tools struggle. They output suggestions in a chat interface. Users still have to manually copy-paste and format. That's not automation; it's a fancier Ctrl+F.
System Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Web App │ │ Word Add-in │ │ Backend │ │
│ │ (Next.js) │ │ (Office.js) │ │ (FastAPI) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ │ REST + SSE │ REST + SSE │ │
│ └────────────────────┴────────────────────┘ │
│ │ │
│ ┌──────────┴──────────┐ │
│ │ Analysis Engine │ │
│ │ ┌───────────────┐ │ │
│ │ │ DSPy + LLM │ │ │
│ │ │ (OpenAI / │ │ │
│ │ │ Mistral) │ │ │
│ │ └───────────────┘ │ │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Three components:
- Web Dashboard: Next.js application for rule management, analytics, and administration
- Word Add-in: Microsoft Office plugin (React + Office.js) where users actually review contracts
- Backend API: FastAPI service handling analysis, LLM orchestration, and document processing
The interesting engineering lives in the Word Add-in (document manipulation) and Backend API (analysis pipeline).
Challenge #1: The Mutable Document Problem
Here's a scenario that breaks naive implementations:
- User uploads 50-page contract
- System analyzes paragraphs 1-50, stores suggestions keyed by paragraph index
- User deletes paragraph 12 while waiting
- System returns: "Paragraph 47 needs revision"
- Paragraph 47 is now paragraph 46. Suggestion applied to wrong location.
Solution: Paragraph Anchoring
During preprocessing, I assign each paragraph a persistent ID embedded in the document itself. Because the ID lives in the OOXML, it survives even when users:
- Delete adjacent paragraphs
- Cut and paste sections
- Accept/reject other tracked changes
On the frontend, a Zustand store maintains bidirectional mappings:
```typescript
interface ParagraphStore {
  indexToPersistentIdMap: Map<number, string>;   // index → UUID
  persistentIdToIndexMap: Map<string, number>;   // UUID → index
  findAnchorByText(text: string): string | null; // fallback matching
}
```
When analysis results return, they reference UUIDs. The store resolves current paragraph indices at application time—not at analysis time.
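The resolution logic can be sketched in plain Python (the real store is TypeScript/Zustand; the class and method names here are hypothetical):

```python
from typing import Optional

class ParagraphAnchors:
    """Bidirectional index <-> persistent-ID mapping with a text fallback."""

    def __init__(self) -> None:
        self.id_to_index: dict[str, int] = {}
        self.id_to_text: dict[str, str] = {}

    def register(self, index: int, pid: str, text: str) -> None:
        self.id_to_index[pid] = index
        self.id_to_text[pid] = text

    def resolve(self, pid: str, current_paragraphs: list[str]) -> Optional[int]:
        """Return the paragraph's CURRENT index, not its index at analysis time."""
        # Fast path: the stored index still points at the same paragraph.
        idx = self.id_to_index.get(pid)
        if idx is not None and idx < len(current_paragraphs) \
                and current_paragraphs[idx] == self.id_to_text.get(pid):
            return idx
        # Fallback: the document mutated; re-locate the anchor by its text.
        try:
            return current_paragraphs.index(self.id_to_text[pid])
        except (ValueError, KeyError):
            return None  # paragraph was deleted entirely
```

The fallback matters: if a user deletes a paragraph above the anchor, the stored index is stale, but the text match still recovers the right location.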
Challenge #2: Generating Word Tracked Changes
This is the hard part.
Office.js provides no API for creating tracked changes. The paragraph.insertText() method just replaces text. To create actual redlines (strikethrough deletions, colored insertions), you must:
- Generate a diff between original and suggested text
- Convert that diff to OOXML elements
- Apply these elements to the actual content using Office.js
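For context, the tracked-change elements step 2 produces look roughly like this—a sketch that builds the `w:del` and `w:ins` fragments as strings (function names are my own; the real system assembles complete OOXML and applies it through Office.js):

```python
from xml.sax.saxutils import escape

def tracked_delete(text: str, author: str, date: str, rid: int) -> str:
    # A tracked deletion wraps the run in <w:del> and uses <w:delText>
    # instead of <w:t>; Word renders it as strikethrough.
    return (f'<w:del w:id="{rid}" w:author="{author}" w:date="{date}">'
            f'<w:r><w:delText xml:space="preserve">{escape(text)}</w:delText>'
            f'</w:r></w:del>')

def tracked_insert(text: str, author: str, date: str, rid: int) -> str:
    # A tracked insertion wraps an ordinary run in <w:ins>;
    # Word renders it as colored, underlined text.
    return (f'<w:ins w:id="{rid}" w:author="{author}" w:date="{date}">'
            f'<w:r><w:t xml:space="preserve">{escape(text)}</w:t>'
            f'</w:r></w:ins>')
```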
For the diff generation, I implemented token-based diffing.
Character-level diffs create garbage in Word. "The quick brown fox" → "A quick red fox" would show:
T̶h̶e̶A quick b̶r̶o̶w̶n̶red fox
Unusable. Token-level diffs are cleaner:
The → A quick brown → red fox
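A minimal version of token-level diffing, with Python's standard `difflib` standing in for the production algorithm:

```python
import difflib

def token_diff(original: str, suggested: str) -> list[tuple[str, str, str]]:
    """Word-level diff: tokenize on whitespace FIRST, then run the matcher."""
    a, b = original.split(), suggested.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        # Each op maps cleanly to OOXML: "replace"/"delete" -> w:del,
        # "replace"/"insert" -> w:ins, "equal" -> untouched runs.
        ops.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return ops
```

Running it on the example above yields whole-word operations—`The → A`, `brown → red`—instead of character fragments.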
Another important consideration is preserving the original paragraph properties. Contracts have formatting—numbering, indentation, styles—and naive replacement destroys it.
Challenge #3: Long-Running Analysis
A 50-page contract with 30 playbook rules can take 2-3 minutes to analyze. HTTP requests shouldn't block that long.
Session-Based Async Processing
Client Server
│ │
│ POST │
│ ─────────────────────────────>│
│ │ Create session
│ { session_id: "abc123" } │ Start background task
│ <─────────────────────────────│
│ │
│ GET /sessions/abc123 │
│ ─────────────────────────────>│
│ { status: "processing", │
│ progress: 45% } │
│ <─────────────────────────────│
│ │
│ ... poll every 3 seconds ... │
│ │
│ GET /sessions/abc123 │
│ ─────────────────────────────>│
│ { status: "complete", │
│ results: [...] } │
│ <─────────────────────────────│
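The polling flow above can be sketched in plain Python—a daemon thread stands in for the real FastAPI background task, and the function names map loosely to the two endpoints:

```python
import threading
import time
import uuid

sessions: dict[str, dict] = {}  # in production: a database, not a dict

def start_analysis(paragraphs: list[str]) -> str:
    """POST /analyze — create a session and kick off a background task."""
    session_id = uuid.uuid4().hex[:8]
    sessions[session_id] = {"status": "processing", "progress": 0, "results": []}

    def worker() -> None:
        for i, _para in enumerate(paragraphs, 1):
            time.sleep(0.01)  # stand-in for per-paragraph LLM analysis
            sessions[session_id]["progress"] = int(100 * i / len(paragraphs))
        sessions[session_id].update(status="complete", results=["..."])

    threading.Thread(target=worker, daemon=True).start()
    return session_id

def get_session(session_id: str) -> dict:
    """GET /sessions/{id} — what the client polls every few seconds."""
    return sessions[session_id]
```

The hard parts aren't shown here: persisting sessions so they survive server restarts, and letting a client that refreshed mid-analysis re-attach by session ID.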
Cache Validation with Content Hashing
Users often analyze the same contract multiple times—different guidelines, or checking after minor edits. Re-analyzing unchanged content wastes time and API costs.
The hash comparison catches:
- Re-uploads of identical files
- "Analyze again" clicks without actual changes
- Multiple users analyzing the same template
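A minimal sketch of the cache key, assuming SHA-256 over normalized paragraph content plus the selected rule set (the helper name is hypothetical):

```python
import hashlib

def cache_key(paragraphs: list[str], rule_ids: list[str]) -> str:
    """SHA-256 over normalized content + rule set; identical inputs hit the cache."""
    h = hashlib.sha256()
    for p in paragraphs:
        h.update(p.strip().encode("utf-8"))
        h.update(b"\x00")  # separator so paragraph boundaries affect the hash
    # Sort rule IDs so the key is independent of selection order.
    h.update(",".join(sorted(rule_ids)).encode("utf-8"))
    return h.hexdigest()
```

Hashing the rule set alongside the content matters: the same contract analyzed against different guidelines must not share a cache entry.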
Cache hit rate in production: ~40% for typical contract review workflows.
Challenge #4: Grounding and Hallucination Prevention
Legal documents require precision. An AI suggesting "Vendor liability is capped at $1M" when the contract says "$500K" is worse than no suggestion at all.
The solution: structured output with explicit citations.
Every suggestion must reference the exact source text.
In practice, this catches cases where the model paraphrases slightly instead of quoting exactly.
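The validation layer reduces to a check like this—a sketch assuming a hypothetical schema where each suggestion carries a `source_quote` field:

```python
import re

def _norm(s: str) -> str:
    """Collapse whitespace so line wrapping doesn't cause false rejections."""
    return re.sub(r"\s+", " ", s).strip()

def validate_suggestion(suggestion: dict, paragraph_text: str) -> bool:
    """Reject any suggestion whose quoted source isn't verbatim in the paragraph."""
    quote = _norm(suggestion.get("source_quote", ""))
    return bool(quote) and quote in _norm(paragraph_text)
```

Suggestions that fail this check are dropped before they ever reach the user.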
The Analysis Pipeline
Putting it together:
┌────────────────────────────────────────────────────────────────┐
│ Redline Analysis Pipeline │
├────────────────────────────────────────────────────────────────┤
│ │
│ 1. DOCUMENT INGESTION │
│ ┌─────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ DOCX │────>│ Extract │────>│ Paragraph │ │
│ │ File │ │ OOXML │ │ Anchoring │ │
│ └─────────┘ └─────────────┘ └──────────────┘ │
│ │
│ 2. CONTENT NORMALIZATION │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ OOXML with │────>│ Unified │ │
│ │ Tracked │ │ Markdown │ │
│ │ Changes │ │ (Original + │ │
│ │ │ │ Revised views) │ │
│ └─────────────┘ └─────────────────┘ │
│ │
│ 3. LLM ANALYSIS │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ │────>│ DSPy │────>│ Structured │ │
│ │ Rules │ │ Signatures │ │ Suggestions │ │
│ └─────────────┘ └─────────────┘ └──────────────┘ │
│ │
│ 4. OUTPUT GENERATION │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ Suggestions │────>│ Token Diff │────>│ OOXML │ │
│ │ + Rationale │ │ Algorithm │ │ │ │
│ └─────────────┘ └─────────────┘ └──────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
The OOXML-to-Markdown conversion deserves special mention. Incoming contracts often already have tracked changes from counterparty negotiations. The converter:
- Parses elements
- Generates two synchronized views: Original (with deletions, without insertions) and Revised (with insertions, without deletions)
- Preserves paragraph IDs from content controls
This abstraction means the LLM analyzes clean markdown, not raw XML. The complexity stays in the conversion layer.
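The dual-view generation can be sketched with runs tagged by kind—a simplification of the real OOXML parse, where each run is already classified as plain text, insertion, or deletion:

```python
def dual_views(runs: list[tuple[str, str]]) -> tuple[str, str]:
    """Build the two synchronized views from tagged runs.

    Each run is (kind, text) with kind in {"text", "ins", "del"}.
    Original keeps deletions and drops insertions; Revised does the opposite.
    """
    original = "".join(t for k, t in runs if k in ("text", "del"))
    revised = "".join(t for k, t in runs if k in ("text", "ins"))
    return original, revised
```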
Results
The system processes a 20-page contract in approximately 30-45 seconds, depending on rules complexity. Key metrics:
- Cache hit rate: ~40% (saves re-analysis on unchanged content)
- Hallucination rate: <5% (caught by validation, not surfaced to users)
- Format preservation: 95% (paragraph properties maintained)
- Tracked change accuracy: Token-level precision
Lessons Learned
Office.js is powerful but underdocumented. The OOXML manipulation pattern isn't in any official guide. I reverse-engineered it by exporting documents and reading the XML.
Character-level diffs are wrong for documents. Always tokenize first. General-purpose diff libraries don't know about words.
Async patterns matter more than you think. The session-based polling approach sounds simple, but handling edge cases (browser refresh, network drops, server restarts) required careful state management.
Ground everything. LLMs will confidently cite text that doesn't exist. Validation layers catch this, but only if you design the output schema to require explicit source references.
Content hashing is cheap insurance. The SHA-256 computation is negligible compared to LLM costs. Cache validation paid for itself in the first week.
Tech Stack Summary
| Layer | Technology | Why |
|---|---|---|
| Backend API | FastAPI (Python) | Async-native, great for long-running tasks |
| LLM Orchestration | DSPy | Structured outputs, provider-agnostic |
| LLM Providers | OpenAI, Mistral | Redundancy, cost optimization |
| Database | Supabase (PostgreSQL) | Real-time subscriptions, hosted |
| Web Frontend | Next.js | SSR for dashboard, API routes |
| Word Add-in | React + Office.js | Only option for Word integration |
| Document Processing | python-docx, custom OOXML | No library handles tracked changes |
Closing Thoughts
The interesting engineering in "AI for X" products is rarely the AI part. Calling an LLM API is straightforward. The challenge is everything around it: maintaining document fidelity, handling state across long-running operations, and building validation layers that catch model failures before users see them.
Legal redlining pushed me to solve problems I didn't anticipate—paragraph anchoring, OOXML manipulation, token-based diffing. Each solution came from understanding the domain deeply, not from finding a better prompt.
If you're building in this space, I'd be interested to hear about your approach.
Arun Venkataramanan is a Senior Software Engineer at Ottimate, where he works on architecting solutions for accounts payable automation. With a background spanning core banking systems (TCS), fintech platforms, and enterprise automation, he focuses on building tools that help users automate repetitive tasks in their day-to-day work.
Connect on LinkedIn