
inventiple


How We Built an AI Agent Pipeline for a Healthcare Client Using CrewAI

AI agents are changing how enterprises automate complex workflows. In this article, we break down how we built a production-grade AI pipeline using CrewAI.

When a mid-sized healthcare company approached us to automate their clinical document processing, they had a problem that traditional RPA could not solve. Their workflow involved reading unstructured PDFs, extracting patient data, cross-referencing insurance codes, and generating compliance reports — all tasks requiring contextual reasoning, not just pattern matching.
This is the story of how we designed, built, and deployed a multi-agent AI pipeline using CrewAI that now processes over 2,000 clinical documents per day with 97.3% accuracy — and what we learned along the way.
The Problem: Why Traditional Automation Failed
The client's existing workflow was manual. A team of 12 operators would receive scanned clinical documents, read through each one, extract relevant data points, validate against insurance databases, and produce standardised reports. The average processing time was 22 minutes per document. They had tried an RPA solution previously, but it broke constantly because the documents were unstructured — different hospitals used different formats, different terminologies, and different layouts.
What they needed was not a rule-based system. They needed AI agents that could reason about context, make judgement calls, and handle edge cases autonomously.
Why CrewAI? The Agent Framework Decision
We evaluated three frameworks before committing: LangChain Agents, AutoGen, and CrewAI. Each has distinct strengths.
LangChain gave us maximum flexibility but required significant boilerplate to orchestrate multi-agent workflows. AutoGen excelled at conversational agent patterns but was overkill for our use case — we did not need agents debating each other; we needed a structured pipeline. CrewAI hit the sweet spot: it provides a clean abstraction for defining agent roles, goals, and task dependencies with built-in support for sequential and hierarchical crew execution.
The deciding factor was CrewAI's task delegation model. We could define a crew where Agent A (Document Reader) feeds structured output to Agent B (Data Validator), which then passes to Agent C (Report Generator) — all with retry logic and error handling built in.
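That handoff pattern can be sketched framework-agnostically. The agent stubs below (reader, validator, reporter) are hypothetical stand-ins for the real CrewAI agents, but the shape of the data flow — each stage consuming the previous stage's output, with per-step retries — is the same orchestration CrewAI gives you out of the box:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentStep:
    """One stage in the pipeline: a name plus a callable that transforms its input."""
    name: str
    run: Callable[[dict], dict]
    max_retries: int = 2

def run_pipeline(steps: list[AgentStep], document: dict) -> dict:
    """Feed each step's output into the next, retrying a failed step before giving up."""
    payload = document
    for step in steps:
        for attempt in range(step.max_retries + 1):
            try:
                payload = step.run(payload)
                break
            except Exception:
                if attempt == step.max_retries:
                    raise
    return payload

# Hypothetical stand-ins for the real agents:
reader = AgentStep("Document Reader", lambda d: {**d, "text": d["raw"].strip()})
validator = AgentStep("Data Validator", lambda d: {**d, "valid": len(d["text"]) > 0})
reporter = AgentStep("Report Generator", lambda d: {**d, "report": f"OK: {d['text']}"})

result = run_pipeline([reader, validator, reporter], {"raw": "  lab report  "})
```

In CrewAI terms, each `AgentStep` corresponds to an `Agent` plus its `Task`, and `run_pipeline` corresponds to a `Crew` executing with a sequential process.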
Architecture: The 4-Agent Pipeline
Here is the high-level architecture we deployed:

  1. Ingestion Agent — Receives documents via API, performs OCR on scanned PDFs using Tesseract, and converts everything to clean text. This agent also classifies the document type (lab report, discharge summary, insurance claim) to route it correctly.
  2. Extraction Agent — Uses GPT-4 with a carefully crafted prompt to extract structured data fields: patient demographics, diagnosis codes (ICD-10), procedure codes (CPT), dates, and provider information. We use few-shot examples tailored to each document type.
  3. Validation Agent — Cross-references extracted data against an insurance code database and internal business rules. Flags inconsistencies (e.g., a diagnosis code that doesn't match the procedure code) and either auto-corrects obvious errors or escalates to a human reviewer.
  4. Report Agent — Generates the final compliance report in the client's required format, including audit trails of every decision the AI made. This transparency layer was critical for HIPAA compliance.

The 7 Production Lessons We Learned
1. Prompt Engineering is 60% of the Work

We spent more time refining prompts than writing infrastructure code. The difference between 85% and 97% extraction accuracy came down to prompt structure — specifically, using structured output schemas (JSON mode) and providing 8-12 few-shot examples per document type rather than relying on zero-shot extraction.
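A minimal sketch of how such a prompt can be assembled. The `FEW_SHOT` example and `SCHEMA` fields here are illustrative inventions, not the client's real prompt material (which is de-identified and far larger):

```python
import json

# Hypothetical few-shot examples keyed by document type; real ones would be
# curated from de-identified production documents.
FEW_SHOT = {
    "lab_report": [
        {"input": "Patient: Jane Roe, DOB 01/02/1970. Dx: E11.9.",
         "output": {"patient_name": "Jane Roe", "dob": "1970-01-02",
                    "icd10_codes": ["E11.9"]}},
    ],
}

# Illustrative subset of the extraction schema.
SCHEMA = {
    "patient_name": "string",
    "dob": "ISO-8601 date",
    "icd10_codes": "list of strings",
}

def build_extraction_prompt(doc_type: str, document_text: str) -> str:
    """Assemble a few-shot prompt that asks for JSON matching SCHEMA."""
    parts = [
        "Extract the following fields as JSON matching this schema:",
        json.dumps(SCHEMA, indent=2),
    ]
    for ex in FEW_SHOT[doc_type]:
        parts.append(f"Document:\n{ex['input']}")
        parts.append(f"JSON:\n{json.dumps(ex['output'])}")
    parts.append(f"Document:\n{document_text}")
    parts.append("JSON:")
    return "\n\n".join(parts)

prompt = build_extraction_prompt(
    "lab_report", "Patient: John Doe, DOB 03/04/1985. Dx: I10.")
```

Pairing a prompt like this with the API's JSON mode is what lets the downstream Validation Agent rely on a stable output structure.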
2. Agent Memory is Not Optional

Early versions of the pipeline treated each document independently. But in practice, documents from the same patient arrive in batches. When we added a shared memory layer (using Redis as a short-term context store), the Validation Agent could cross-reference previous documents from the same patient, catching errors that would have been impossible to detect in isolation.
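The interface can be sketched as below. An in-process dict stands in here for the Redis instance used in production (where a Redis hash per patient with a TTL plays the same role); the class and method names are hypothetical:

```python
import time

class PatientContextStore:
    """Short-term context keyed by patient ID. A plain dict stands in for
    Redis here; entries older than ttl_seconds are ignored on read."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store: dict[str, list[tuple[float, dict]]] = {}

    def add(self, patient_id: str, extracted: dict) -> None:
        """Record one document's extracted fields for this patient."""
        self._store.setdefault(patient_id, []).append((time.time(), extracted))

    def recent(self, patient_id: str) -> list[dict]:
        """Return extractions still inside the TTL window, oldest first."""
        cutoff = time.time() - self.ttl
        return [doc for ts, doc in self._store.get(patient_id, []) if ts >= cutoff]

store = PatientContextStore()
store.add("p123", {"icd10": "E11.9", "doc": "lab_report"})
store.add("p123", {"icd10": "I10", "doc": "discharge_summary"})
prior = store.recent("p123")  # the Validation Agent cross-references these
```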
3. Human-in-the-Loop is a Feature, Not a Fallback

We designed the Validation Agent with a confidence threshold. When confidence drops below 85%, the document is routed to a human reviewer via a simple web dashboard. In the first month, about 15% of documents required human review. By month three, after we fine-tuned prompts based on reviewer feedback, that dropped to 4%.
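The routing decision itself is simple; what matters is recording why a document was escalated, so reviewer feedback can flow back into prompt tuning. A sketch, with hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    doc_id: str
    destination: str   # "auto" or "human_review"
    confidence: float
    reason: str

def route_document(doc_id: str, confidence: float,
                   threshold: float = 0.85) -> RoutingDecision:
    """Below the threshold, the document goes to the reviewer dashboard."""
    if confidence >= threshold:
        return RoutingDecision(doc_id, "auto", confidence, "above threshold")
    return RoutingDecision(doc_id, "human_review", confidence,
                           f"confidence {confidence:.2f} < {threshold:.2f}")
```

Keeping `reason` on every decision is what made it possible to audit which document types were dragging confidence down.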
4. Structured Logging Saved Us Repeatedly

Every agent logs its input, output, reasoning chain, and token usage. When accuracy dipped for a specific document type, we could trace the exact point of failure. This observability was non-negotiable for a HIPAA-regulated environment.
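One JSON line per agent step is enough to make failures traceable. The field names below are illustrative; in a HIPAA context the input/output summaries must be de-identified before they ever reach a log store:

```python
import json
import time

def log_agent_step(agent: str, doc_id: str, input_summary: str,
                   output_summary: str, reasoning: str,
                   prompt_tokens: int, completion_tokens: int) -> str:
    """Build one structured log line per agent step.
    Summaries must contain no PHI; only de-identified metadata is logged."""
    record = {
        "ts": time.time(),
        "agent": agent,
        "doc_id": doc_id,
        "input": input_summary,
        "output": output_summary,
        "reasoning": reasoning,
        "tokens": {"prompt": prompt_tokens, "completion": completion_tokens},
    }
    return json.dumps(record)

line = log_agent_step("extraction", "doc-42", "lab report, 2 pages",
                      "5 fields extracted", "matched ICD-10 pattern", 1850, 120)
```

Because every line is machine-parseable JSON, a dip in accuracy for one document type becomes a simple filter-and-diff over the log stream.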
5. Cost Management: Smaller Models for Simpler Tasks

Not every agent needs GPT-4. The Ingestion Agent runs on GPT-3.5 Turbo (document classification is relatively simple). The Extraction Agent uses GPT-4 (accuracy matters most here). The Report Agent uses GPT-3.5 Turbo with a template system. This tiered approach reduced our API costs by roughly 40% compared to using GPT-4 across the board.
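The tiering can be as simple as a per-agent lookup with a safe default. `model_for` is a hypothetical helper; the tier assignments mirror the ones described above:

```python
# Per-agent model tiers, as described in the text; agents not listed
# default to the most capable model.
MODEL_BY_AGENT = {
    "ingestion": "gpt-3.5-turbo",   # classification: simple, cheap
    "extraction": "gpt-4",          # accuracy-critical
    "report": "gpt-3.5-turbo",      # template filling
}

def model_for(agent_name: str, default: str = "gpt-4") -> str:
    """Pick the cheapest model that is adequate for this agent's task."""
    return MODEL_BY_AGENT.get(agent_name, default)
```

Centralising the choice in one table also makes it trivial to A/B a cheaper model for a single agent without touching the rest of the pipeline.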
6. Retry Logic with Exponential Backoff

API rate limits and occasional timeouts are a reality when processing 2,000+ documents daily. CrewAI's built-in retry mechanism helped, but we added custom exponential backoff with jitter to handle burst loads during morning peak hours (when most documents arrive).
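A generic full-jitter backoff wrapper (not CrewAI's internal mechanism) looks roughly like this; the `sleep` parameter is injectable so the demo below runs without real delays:

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0,
                 max_delay: float = 60.0, sleep=time.sleep):
    """Retry fn on exception, doubling the delay cap each attempt and
    sleeping a random amount up to that cap (full jitter)."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))  # jitter spreads out burst retries

# Simulated flaky API call: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"

result = with_backoff(flaky, sleep=lambda s: None)  # skip real sleeping in the demo
```

Full jitter (random delay in `[0, cap]` rather than exactly `cap`) is what prevents a burst of morning documents from retrying in lockstep and re-triggering the rate limit.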
7. Deploy with Guardrails, Not Just Monitoring

Monitoring tells you something went wrong after the fact. Guardrails prevent it. We implemented input validation (reject documents under 100 characters, which are likely corrupt), output schema validation (reject responses that don't match the expected JSON structure), and hallucination checks (ensure no fabricated patient data leaks into reports).

Results: 6 Months In

  • Processing time: 22 minutes per document → 47 seconds (28x faster)
  • Accuracy: 97.3% automated extraction (up from 91% with the previous RPA attempt)
  • Human review rate: down from 100% to 4% of documents
  • Cost savings: the client reallocated 9 of 12 operators to higher-value tasks
  • Uptime: 99.7% over the first 6 months

When to Use Agentic AI vs Traditional Approaches

Not every problem needs AI agents. Use agentic AI when your workflow involves unstructured data that requires contextual reasoning, multi-step decision-making where each step depends on the previous one, variability in inputs that would break rigid rule-based systems, and a need for continuous improvement through feedback loops. If your data is structured and your rules are deterministic, traditional RPA or even a well-written Python script will serve you better and cost less.

Conclusion

Building AI agent pipelines for production is fundamentally different from building demos. The gap is in reliability engineering: structured logging, confidence thresholds, human escalation paths, cost management, and regulatory compliance. Frameworks like CrewAI give you the orchestration layer, but the real engineering work is in making the system robust enough that a healthcare company trusts it with patient data.

At Inventiple, we specialise in building these kinds of production-grade AI systems for enterprises, from architecture design through to deployment and ongoing optimisation. If you are exploring agentic AI for your business, feel free to reach out.
--- About the Author: Written by the engineering team at Inventiple, an enterprise AI development company building agentic AI systems, MCP servers, and cloud-native applications for global clients.


If you are exploring agentic AI for your business, learn more at:
👉 https://www.inventiple.com/services
