How I built an AI contract risk analyzer with Qwen on AMD
1. Project concept
ClauseGuard analyzes legal contracts and flags risky clauses. Instead of reading through 20 pages of legalese, you upload a PDF or DOCX and get a structured risk report with plain English explanations, safer alternatives, and negotiation scripts.
The core idea: chain multiple specialized AI agents together, each handling one part of the analysis, rather than asking a single model to do everything at once.
2. Architecture decisions
Why a 5-agent pipeline
Early experiments with a single prompt (extract clauses AND classify AND score risks) produced inconsistent results. The model would skip steps or produce shallow analysis. Splitting into 5 dedicated agents gave each one a narrow, focused task:
Extractor → Classifier → Risk Scorer → Translator → Reporter
Each agent receives the output of the previous one. If any agent fails, the pipeline degrades gracefully rather than crashing entirely.
Why Streamlit
FastAPI with React would have given more control, but Streamlit let me build the full UI in hours instead of days. Session state handles data sharing between pages naturally, and dark-theme customization with custom CSS gave the app a polished look without frontend overhead.
Why Qwen 2.5 1.5B on AMD
The original backend was the DeepSeek API, but the goal was always a self-hosted open-source model. Qwen 2.5 1.5B is small enough to run with low latency yet capable enough for structured legal analysis when given focused prompts. An AMD MI300X running vLLM provided the OpenAI-compatible API layer, so the same Python SDK worked without changes.
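A minimal sketch of what that swap looks like; the URL and model identifier below are placeholders for whatever the vLLM server actually exposes:

```python
from openai import AsyncOpenAI

# Same SDK as before; only base_url changes to point at vLLM's
# OpenAI-compatible server. vLLM does not require a real API key.
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",   # assumed vLLM serve endpoint
    api_key="EMPTY",
)
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"        # illustrative model ID
```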
3. Data models
Everything flows through Pydantic models. This enforced type safety across all 5 agents and made parsing reliable. Twelve clause types are defined as a strict enum: NDA, IP_ASSIGNMENT, NON_COMPETE, ARBITRATION, AUTO_RENEWAL, LIABILITY_CAP, TERMINATION, DATA_SHARING, GOVERNING_LAW, PAYMENT, INDEMNIFICATION, and OTHER.
Each Clause has an ID, raw text, plain English translation, classified type, section heading, and position. RiskFindings carry severity (CRITICAL through INFO), a specific risk reason citing the actual clause language, a recommended action, a safer rewritten version, a negotiation message, and impact scenarios. The FinalReport ties everything together with summary statistics, a human-readable markdown report, and a machine-readable CSV export.
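A trimmed sketch of the schemas, with field names inferred from the descriptions above (FinalReport omitted for brevity):

```python
from enum import Enum
from pydantic import BaseModel

class ClauseType(str, Enum):
    NDA = "NDA"
    IP_ASSIGNMENT = "IP_ASSIGNMENT"
    NON_COMPETE = "NON_COMPETE"
    ARBITRATION = "ARBITRATION"
    AUTO_RENEWAL = "AUTO_RENEWAL"
    LIABILITY_CAP = "LIABILITY_CAP"
    TERMINATION = "TERMINATION"
    DATA_SHARING = "DATA_SHARING"
    GOVERNING_LAW = "GOVERNING_LAW"
    PAYMENT = "PAYMENT"
    INDEMNIFICATION = "INDEMNIFICATION"
    OTHER = "OTHER"

class Severity(str, Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"
    INFO = "INFO"

class Clause(BaseModel):
    id: str
    raw_text: str
    plain_english: str
    clause_type: ClauseType
    section_heading: str
    position: int

class RiskFinding(BaseModel):
    clause_id: str
    severity: Severity
    risk_reason: str            # must cite the actual clause language
    recommended_action: str
    safer_rewrite: str
    negotiation_message: str
    impact_scenarios: list[str]
```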
4. File parsing layer
Three file types needed support: PDF, DOCX, and TXT. For PDF I used PyMuPDF as the primary extractor with pdfplumber as a fallback: some PDFs yield no text through PyMuPDF, and pdfplumber recovers several of those edge cases (truly scanned pages with no text layer would need OCR). For DOCX, python-docx extracts paragraphs while preserving order. TXT files use chardet for automatic encoding detection (UTF-8, Latin-1, Windows-1252). Files over 10 MB are rejected before processing.
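A sketch of the parsing helpers under those assumptions (function names are mine):

```python
import chardet
import fitz  # PyMuPDF
import pdfplumber

MAX_BYTES = 10 * 1024 * 1024  # reject files over 10 MB up front

def extract_pdf_text(path: str) -> str:
    # Primary: PyMuPDF, fast and fine for most digitally produced PDFs.
    with fitz.open(path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    if text.strip():
        return text
    # Fallback: pdfplumber, for PDFs where PyMuPDF comes back empty.
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

def decode_txt(raw: bytes) -> str:
    # chardet guesses the encoding (UTF-8, Latin-1, Windows-1252, ...).
    encoding = chardet.detect(raw)["encoding"] or "utf-8"
    return raw.decode(encoding, errors="replace")
```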
5. Model service layer
The key insight: instead of each agent creating its own API client, a single shared service handles all communication with the model. A lazy singleton AsyncOpenAI client is shared across all 6 agents (the 5 pipeline agents plus the chat copilot). The call_model function wraps every request with a configurable timeout, one automatic retry on JSON parse failure, markdown fence stripping, and JSON validation.
Temperature is set to 0.0 for deterministic outputs. On JSON parse failure, the retry appends a stronger instruction to the user prompt. Timeouts prevent hanging if the model endpoint is slow. This single shared function eliminated about 150 lines of duplicated retry, timeout, and JSON cleaning code that previously existed in all 6 agent files.
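A condensed sketch of the service, assuming names like get_client and call_model (the endpoint URL and model ID are placeholders):

```python
import asyncio
import json
import re

from openai import AsyncOpenAI

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # illustrative
_client: AsyncOpenAI | None = None

def get_client() -> AsyncOpenAI:
    # Lazy singleton: created on first use, shared by every agent.
    global _client
    if _client is None:
        _client = AsyncOpenAI(base_url="http://localhost:8000/v1",
                              api_key="EMPTY")
    return _client

def _strip_fences(text: str) -> str:
    # Small models often wrap JSON in ```json ... ``` fences; remove them.
    return re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())

async def call_model(system: str, user: str, timeout: float = 60.0) -> dict:
    prompt = user
    for attempt in range(2):  # one automatic retry on bad JSON
        resp = await asyncio.wait_for(
            get_client().chat.completions.create(
                model=MODEL,
                temperature=0.0,  # deterministic outputs
                messages=[{"role": "system", "content": system},
                          {"role": "user", "content": prompt}],
            ),
            timeout=timeout,
        )
        raw = _strip_fences(resp.choices[0].message.content or "")
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Retry with a stronger instruction appended to the user prompt.
            prompt = user + "\n\nReturn ONLY valid JSON. No prose, no fences."
    raise ValueError("Model did not return valid JSON after retry")
```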
6. Prompt engineering
The hardest part was getting a 1.5B parameter model to produce consistent structured JSON. Several techniques made this work reliably:
Explicit schemas with examples
Every prompt includes a concrete JSON schema and at least one input/output pair. The model follows patterns better than abstract instructions.
Short focused prompts
Long system prompts confuse smaller models. I cut each prompt to under 30 lines. If a rule could be demonstrated by example, I removed the text description.
Severity rubrics
Instead of asking the model to judge risk abstractly, I gave it a decision tree:

- CRITICAL: IP assignment covering personal work, unlimited liability, or mandatory arbitration waiving jury rights.
- HIGH: non-competes over 1 year, auto-renewal without opt-out, or one-sided indemnification.
- MEDIUM: standard non-competes, net-60 payment terms, or out-of-state governing law.
- LOW: standard confidentiality and governing law clauses.
- INFO: definitions, severability, and entire agreement clauses.

This mechanical approach produced far more consistent results.
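In the system prompt the rubric can live as one compact block; a sketch of how such a constant might look, condensed from the rules above:

```python
SEVERITY_RUBRIC = """Assign severity with these rules; the first match wins.
CRITICAL: IP assignment covering personal work; unlimited liability;
  mandatory arbitration waiving jury rights.
HIGH: non-compete longer than 1 year; auto-renewal without opt-out;
  one-sided indemnification.
MEDIUM: standard non-compete; net-60 payment terms; out-of-state governing law.
LOW: standard confidentiality; standard governing law.
INFO: definitions, severability, entire agreement clauses.
Return JSON only: {"severity": "<LEVEL>", "risk_reason": "<cite the clause>"}"""
```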
Negotiation generation
For CRITICAL and HIGH clauses, the Translator agent also generates a safer clause rewrite plus a ready-to-send negotiation email. The email follows a template: opening concern, specific suggestion, fairness rationale, closing.
7. Pipeline orchestration
The orchestrator runs all 5 agents in sequence with error isolation. Each agent call is wrapped in asyncio.wait_for with a 120-second timeout. If the Extractor fails, the pipeline still produces a report marked as partial. If the Risk Scorer fails, a fallback assigns MEDIUM severity to all clauses with a "Needs Human Review" label so the user can still see the clauses instead of getting a misleading "no issues found" message.
A live event callback system emits status updates (running, completed, failed) for each agent. The UI renders these as a real-time pipeline status panel showing which agent is currently working, which completed, and which failed.
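A sketch of the orchestration loop, assuming a simple (agent_name, status) callback shape:

```python
import asyncio
from typing import Any, Awaitable, Callable

AGENT_TIMEOUT = 120  # seconds per agent

async def run_pipeline(
    document_text: str,
    agents: list[tuple[str, Callable[[Any], Awaitable[Any]]]],
    on_event: Callable[[str, str], None],
) -> dict[str, Any]:
    results: dict[str, Any] = {"partial": False}
    data: Any = document_text
    for name, agent in agents:
        on_event(name, "running")
        try:
            data = await asyncio.wait_for(agent(data), timeout=AGENT_TIMEOUT)
            results[name] = data
            on_event(name, "completed")
        except Exception:
            on_event(name, "failed")
            results["partial"] = True
            # The real orchestrator substitutes a per-agent fallback here
            # (e.g. MEDIUM / "Needs Human Review" when the Risk Scorer
            # fails) so downstream agents still receive usable input.
    return results
```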
8. Frontend design
Streamlit native components handle most of the UI. The dark theme is set via config.toml. Custom CSS handles the polish: gradient buttons, animated agent status indicators, severity color coding, responsive expanders. Each severity level has a consistent color scheme used everywhere: badges, borders, backgrounds, chart bars.
The multi-page structure splits the UI into focused views:

- Home: upload and overview.
- Clauses: individual clause cards with severity filters.
- Negotiation: side-by-side current vs safer comparisons with email templates.
- Chat: the AI assistant with full contract context.
- Downloads: markdown, CSV, and safer-contract exports.

Session state is initialized once with defaults for the report, error, analyzing status, agent statuses, chat history, and cache keys; all pages read from the same state.
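A sketch of that one-time initialization, with key names mirroring the list above:

```python
import copy

import streamlit as st

DEFAULTS = {
    "report": None,        # FinalReport once analysis completes
    "error": None,         # surfaced as the red error banner
    "analyzing": False,
    "agent_statuses": {},  # agent name -> running / completed / failed
    "chat_history": [],
    "cache_key": None,
}

def init_state() -> None:
    # Fill only the missing keys, so Streamlit reruns and page switches
    # never clobber live data. deepcopy keeps the mutable defaults
    # ({} and []) from being shared across sessions.
    for key, value in DEFAULTS.items():
        st.session_state.setdefault(key, copy.deepcopy(value))
```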
9. Error handling strategy
The biggest problem with the first version: when the model API was unreachable, the pipeline silently produced an empty report and showed "No issues found" with a green checkmark. Completely misleading. Three fixes addressed this:
Pre-flight connectivity check. Before analysis starts, a 10-second call to client.models.list() tests the endpoint. If unreachable, the user sees a specific error (Connection refused or Cannot resolve host) with instructions.
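A sketch of the pre-flight call; the error strings are illustrative:

```python
import asyncio

async def preflight_check(client) -> str | None:
    """Return None if the endpoint answers, else a user-facing error."""
    try:
        # Cheap request that any OpenAI-compatible server answers.
        await asyncio.wait_for(client.models.list(), timeout=10)
        return None
    except asyncio.TimeoutError:
        return "Model endpoint timed out after 10s. Is vLLM running?"
    except Exception as exc:  # connection refused, DNS failure, etc.
        return f"Cannot reach model endpoint: {exc}"
```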
Zero-clause detection. If the pipeline produces a FinalReport with 0 clauses, the UI shows a red error banner identifying which agent failed, instead of the empty report.
Fallback scoring. When the Risk Scorer agent fails, a fallback function builds ScoredClause objects with MEDIUM severity and "Needs Human Review" labels. Users still see their clauses with a clear warning that automated scoring was incomplete.
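The fallback can stay deliberately simple; a sketch, with plain dicts standing in for the real ScoredClause model:

```python
def fallback_scores(clauses):
    # Every clause gets MEDIUM plus an explicit human-review flag, so the
    # UI shows the clauses with a warning instead of a false "no issues".
    return [
        {
            "clause_id": clause.id,
            "severity": "MEDIUM",
            "risk_reason": "Needs Human Review: automated scoring failed.",
        }
        for clause in clauses
    ]
```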
10. Testing
13 tests cover the critical paths. Extractor tests verify JSON parsing handles both dict and list response formats, strips markdown fences, validates minimum clause counts, and processes real sample contracts. Risk scorer tests verify severity assignment logic (IP assignment should be CRITICAL, governing law should be LOW), ensure every clause gets a non-empty risk reason, and confirm all 5 severity levels are parseable.
Pipeline integration tests mock the model service to verify the full 5-agent chain produces valid FinalReport objects, detects at least one CRITICAL or HIGH finding in known risky contracts, generates non-empty markdown, and handles extractor failure gracefully. All mocks use unittest.mock.patch on the agent module call_model import paths.
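A sketch of that patching pattern; the module path and function names are hypothetical stand-ins for the real ones:

```python
from unittest.mock import AsyncMock, patch

import pytest

from agents.extractor import extract_clauses  # hypothetical import path

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_extractor_handles_list_response():
    fake = AsyncMock(return_value=[
        {"id": "c1", "raw_text": "Sample NDA text", "clause_type": "NDA"},
    ])
    # Patch the name as the agent module imported it, not the service.
    with patch("agents.extractor.call_model", fake):
        clauses = await extract_clauses("sample contract text")
    assert len(clauses) >= 1
```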
11. Project structure
The codebase is organized into clear layers:

- app.py: the Streamlit home page with upload and overview.
- ui.py: all shared rendering functions and constants.
- pages/: the 4 multi-page Streamlit files.
- agents/: the 5 pipeline agents plus the orchestrator and copilot.
- services/: the shared model service layer.
- config/: settings and all 5 system prompts.
- models/: Pydantic schemas.
- tools/: file parsing, clause utilities, and report formatting.
- tests/: 13 pytest tests covering extraction, risk scoring, and full pipeline integration.
12. What I would do differently
Better prompt iteration tooling. I wrote prompts by hand and tested by running the full pipeline. A side-by-side comparison tool that shows the same contract analyzed with different prompts would have saved hours.
Streaming responses. Currently the pipeline blocks until each agent completes. Streaming token by token would give users real-time visibility into what the model is thinking.
Caching layer. The same clause analyzed twice produces the same result. A clause-hash to response cache would eliminate redundant API calls when users re-upload similar contracts.
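A sketch of what that cache could look like, keyed on a hash of the clause text plus a prompt version so prompt changes invalidate old entries:

```python
import hashlib

_cache: dict[str, dict] = {}

def clause_key(clause_text: str, prompt_version: str) -> str:
    raw = f"{prompt_version}:{clause_text}".encode()
    return hashlib.sha256(raw).hexdigest()

async def cached_call(clause_text: str, prompt_version: str, call):
    # `call` is the async model call; it runs only on a cache miss.
    key = clause_key(clause_text, prompt_version)
    if key not in _cache:
        _cache[key] = await call(clause_text)
    return _cache[key]
```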
PDF table support. Multi-column layouts and tables in PDFs get garbled during text extraction. A layout-preserving parser would handle employment contracts and SaaS agreements with tabular fee schedules better.
Model fallback chain. Currently the app uses one model endpoint. A chain that tries Qwen first, falls back to DeepSeek API, and finally uses local regex-based extraction would make the demo work reliably on any deployment platform.