This is a submission for the Built with Google Gemini: Writing Challenge
The turning point in this project was not infrastructure. It was deciding whether a language model should sit directly in a production decision path without guardrails.
I built a document intelligence system for contract review. It ingests PDFs, extracts clause-level risk signals, and surfaces them for small businesses that cannot afford outside counsel for routine vendor agreements. The constraint was reliability. If a liability cap was hallucinated or an auto-renewal was missed, the output would not just be inaccurate. It would be harmful.
The architecture has three layers:

- A PDF extraction pipeline segments contracts into clauses using heading detection and semantic boundary shifts instead of page splits.
- A Gemini layer classifies each clause against a fixed risk taxonomy and generates a plain-language summary with a risk tier.
- A React frontend groups clauses by risk tier rather than document order, aligning with how users actually review exposure.
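Sketched in TypeScript, the heading-detection half of the segmentation step looks roughly like this. The heading pattern and `Clause` shape are simplified illustrations, not the production code, and the semantic-boundary half is omitted:

```typescript
// Hypothetical sketch: split contract text into clauses at numbered
// headings like "12. Limitation of Liability". The real pipeline also
// detects semantic boundary shifts; this shows only the heading half.

interface Clause {
  heading: string;
  body: string;
}

// Matches headings such as "1. Term" or "7.2 Indemnification"
const HEADING = /^\s*\d+(\.\d+)*\.?\s+[A-Z][^\n]*$/;

function segmentClauses(text: string): Clause[] {
  const clauses: Clause[] = [];
  let current: Clause | null = null;
  for (const line of text.split("\n")) {
    if (HEADING.test(line)) {
      if (current) clauses.push(current);
      current = { heading: line.trim(), body: "" };
    } else if (current) {
      current.body += line + "\n";
    }
  }
  if (current) clauses.push(current);
  return clauses;
}
```

Segmenting on structure rather than pages is what makes the later per-clause model calls possible.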
Each clause moves through classify, summarize, score, and validate stages. Prompts are minimal and schema-constrained. The pipeline calls gemini-1.5-pro with structured JSON output, and every response is validated against a Zod schema before entering application state. Failures are flagged for manual review. Nothing degrades silently.
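The validation boundary might look like the following. The field names and tier values are my assumptions, and the real system uses a Zod schema; a hand-rolled type guard stands in here to keep the sketch dependency-free:

```typescript
// Stand-in for the Zod gate. In the real system this would be
// z.object({...}).safeParse(raw); fields and tiers are illustrative.

type RiskTier = "low" | "medium" | "high";

interface ClauseAnalysis {
  category: string;
  tier: RiskTier;
  summary: string;
}

type ValidationResult =
  | { ok: true; value: ClauseAnalysis }
  | { ok: false; reason: string }; // failure state: routed to manual review

function validateResponse(raw: unknown): ValidationResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, reason: "not an object" };
  }
  const r = raw as Record<string, unknown>;
  if (typeof r.category !== "string") {
    return { ok: false, reason: "missing category" };
  }
  if (r.tier !== "low" && r.tier !== "medium" && r.tier !== "high") {
    return { ok: false, reason: "invalid tier" };
  }
  if (typeof r.summary !== "string" || r.summary.length === 0) {
    return { ok: false, reason: "missing summary" };
  }
  return {
    ok: true,
    value: { category: r.category, tier: r.tier as RiskTier, summary: r.summary },
  };
}
```

The point of the explicit failure state is that a malformed response never enters application state; it is surfaced, not swallowed.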
Gemini’s strength here is compression. Dense legal language becomes concise summaries that reflect practical implications. Iterating on the taxonomy exposed overlapping categories early, allowing me to reduce twelve risk types to six in days instead of weeks.
The friction appeared in nested clauses. Parent clauses with modifying sub-clauses produced summaries that were technically correct but incomplete. A liability cap without its exception clause is misleading. The solution was architectural: recursively unnest clauses, inject parent references into context, and reconstruct structure in code. That reduced summarization drift significantly.
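The unnesting step can be sketched as a recursive flatten that carries ancestor text along as context. The `ClauseNode` shape is my assumption about how the tree might be represented:

```typescript
// Hypothetical sketch: flatten nested clauses while injecting parent
// text as context, so a sub-clause (e.g. an exception to a liability
// cap) is never summarized in isolation.

interface ClauseNode {
  id: string;
  text: string;
  children: ClauseNode[];
}

interface FlatClause {
  id: string;
  text: string;
  parentContext: string[]; // ancestor texts, outermost first
}

function unnest(node: ClauseNode, ancestors: string[] = []): FlatClause[] {
  const flat: FlatClause[] = [
    { id: node.id, text: node.text, parentContext: ancestors },
  ];
  const next = [...ancestors, node.text];
  for (const child of node.children) {
    flat.push(...unnest(child, next));
  }
  return flat;
}
```

Each flattened clause is summarized with its `parentContext` prepended to the prompt, and the original hierarchy is reassembled afterward in code rather than trusted to the model.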
Cost was another constraint. A forty-page contract could trigger over thirty model calls. An embedding similarity filter now pre-classifies low-risk template clauses, reducing calls by roughly thirty-five percent without hurting recall.
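The pre-filter is conceptually simple: if a clause's embedding sits close enough to a known low-risk template clause, the model call is skipped. The 0.92 threshold is illustrative, and the vectors are assumed to come from an embedding model upstream:

```typescript
// Hypothetical pre-filter: clauses matching a known low-risk template
// (by embedding similarity) bypass the Gemini call. Threshold and
// vector source are assumptions, not the project's actual values.

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function needsModelCall(
  clauseVec: number[],
  templateVecs: number[][],
  threshold = 0.92,
): boolean {
  // Only skip the model when the clause matches some known template.
  return !templateVecs.some((t) => cosineSimilarity(clauseVec, t) >= threshold);
}
```

Erring toward a high threshold keeps the filter conservative: an ambiguous clause still goes to the model, which is why recall is preserved.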
The real shift was conceptual. I stopped treating Gemini as an assistant and started treating it as a probabilistic component with defined failure modes. Structured outputs, validation boundaries, explicit failure states, and clear separation between language understanding and business logic now define the system.
Next, I am building multi-document comparison to detect deviations across similar vendor agreements. Before release, I am running recall evaluation against a manually annotated dataset. The system works, but without measurable precision and recall under edge cases, it does not belong in decision-critical workflows.
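The evaluation itself reduces to comparing the system's per-clause risk flags against the annotators' gold labels. A minimal version, with an assumed `LabeledClause` pairing:

```typescript
// Minimal precision/recall computation over a manually annotated set.
// Each entry pairs the system's flag with the annotator's judgment.

interface LabeledClause {
  predictedRisky: boolean;
  goldRisky: boolean;
}

function precisionRecall(
  data: LabeledClause[],
): { precision: number; recall: number } {
  let tp = 0; // flagged and truly risky
  let fp = 0; // flagged but benign
  let fn = 0; // missed risky clause -- the harmful case
  for (const d of data) {
    if (d.predictedRisky && d.goldRisky) tp++;
    else if (d.predictedRisky && !d.goldRisky) fp++;
    else if (!d.predictedRisky && d.goldRisky) fn++;
  }
  return {
    precision: tp + fp === 0 ? 0 : tp / (tp + fp),
    recall: tp + fn === 0 ? 0 : tp / (tp + fn),
  };
}
```

For this system recall matters more than precision: a false positive wastes a reviewer's minute, while a false negative is a missed auto-renewal.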
This project changed how I integrate language models. They are not conversational layers on top of applications. They are bounded components inside systems designed for validation, uncertainty, and measurable reliability.