Built with Gradio MCP, Gemini, Claude, Modal, Blaxel + LangGraph, and ElevenLabs
Recently I explored multi-model AI orchestration - wiring together Gradio MCP, Modal for serverless compute, DSPy for structured extraction, Blaxel with LangGraph for agent hosting, and ElevenLabs for voice AI. I wanted to go deeper than tutorials - actually build something, break it, and document what I learned.
The result is VendorGuard AI - an agent that processes invoices, analyzes pricing, and negotiates with vendors via voice call.
What I Built
Invoice processing is tedious. Extract data from a PDF, compare prices to historical records, send emails asking for better rates. I wanted to see if I could wire up multiple AI services to handle this end-to-end.
The system:
- Takes an invoice (PDF or image)
- Extracts all the data automatically
- Compares prices against historical records
- Generates a negotiation strategy
- Has a voice AI agent call the vendor
- Sends a follow-up email summarizing what was agreed
The interesting part wasn't any single piece - it was how they all connected.
The Architecture (And Why Each Piece)
Here's what the system looks like:
```
Upload Invoice
      ↓
[Gradio Frontend + MCP Server]
      ↓
[Modal: OCR with Gemini Vision]
      ↓
[Modal: Structured Extraction with DSPy]
      ↓
[Convex: Store Vendors, Invoices, Price History]
      ↓
[Blaxel + LangGraph: Negotiation Strategy with Claude]
      ↓
[ElevenLabs: Voice Negotiation Call]
      ↓
[Claude: Follow-up Email]
```
I didn't start with this architecture. It evolved as I hit walls and found solutions. Let me walk through the key decisions.
Modal
I used Modal to serve custom and open-source AI models with sub-second cold starts and my choice of GPU. I define a function, add a decorator, and it runs in the cloud with GPU access. Deploy with `modal deploy tools.py`. No Dockerfile. No infrastructure config. Pay per second of actual compute.
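Here's a minimal sketch of the shape of it - the app name, GPU choice, and function body are illustrative, not my actual code:

```python
import modal

app = modal.App("vendorguard-tools")  # app name is illustrative

@app.function(gpu="T4", timeout=300)  # GPU choice is an assumption
def ocr_invoice(image_bytes: bytes) -> str:
    """Runs in a serverless container with GPU access, billed per second."""
    # ... call Gemini Vision or an open-source OCR model here ...
    return "extracted text"

@app.local_entrypoint()
def main():
    # .remote() executes the function in Modal's cloud, not locally
    print(ocr_invoice.remote(open("invoice.png", "rb").read()))
```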
DSPy: The Framework That Changed How I Think About LLMs
I started with the usual approach - prompt templates with "please return JSON in this format" instructions. You know how that goes. Half the time the model returns something slightly wrong, you add more instructions, it works for a bit, then breaks again.
DSPy flips this completely. Instead of writing prompts, you define signatures - what goes in, what comes out:
```python
import dspy

class InvoiceExtractionSignature(dspy.Signature):
    """Extract structured invoice data from OCR text."""
    ocr_text: str = dspy.InputField()
    invoice_data: InvoiceExtraction = dspy.OutputField()
```
That InvoiceExtraction is a Pydantic model with 40+ fields. DSPy handles generating the right prompt, parsing the output, and ensuring it matches the schema.
No more "please format as JSON". No more parsing errors. Just define what you want and get it.
What I learned: DSPy vs LangChain isn't about one being "better". They solve different problems. DSPy is for structured extraction - when you need reliable typed outputs. LangChain is for chains of operations. I used DSPy for the extraction step and it was rock solid.
Blaxel + LangGraph: Deploying Agents Without the Fuss
I needed to host the negotiation strategy agent somewhere. Options:
- Roll my own FastAPI + AWS/GCP - too much setup
- LangServe - tied to LangChain
- Blaxel - supports LangGraph out of the box, simple deploy
I decided to try out Blaxel. It has native LangGraph support, so I could define my agent as a graph and deploy it directly:
```python
# agent.py
import anthropic

async def agent(input_data):
    client = anthropic.Anthropic()
    response = client.messages.create(...)  # model + params elided here
    yield response.content[0].text

# main.py
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from agent import agent

app = FastAPI()

@app.post("/")
async def run_agent(request: dict):
    return StreamingResponse(agent(request.get("input")))
```
Deploy with `bl deploy` and you get a scalable endpoint.
What I learned: Blaxel is like AWS or GCP, but as managed infrastructure for AI agents. Native LangGraph support meant I could use graph-based agent patterns without wrestling with deployment infrastructure. It's also framework-agnostic if you're not all-in on LangChain.
Gradio MCP: The Feature That Surprised Me Most
Gradio 6.0 shipped with MCP (Model Context Protocol) server support. I almost missed this feature.
Add one flag to your app:
```python
demo.launch(mcp_server=True)
```
Now every function in your Gradio app becomes a tool that any MCP-compatible AI can call. The function's docstring becomes the tool description. Types are inferred.
I built four MCP tools:
- `mcp_get_vendor_data` - vendor contact info for follow-ups
- `mcp_get_vendor_price_history` - historical pricing for negotiation leverage
- `mcp_get_invoice_details` - complete invoice with line items
- `mcp_analyze_invoice_prices` - compares current vs historical prices
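Here's roughly what one of these looks like - a sketch, where `query_convex` is a hypothetical helper standing in for the real database call:

```python
import gradio as gr

def mcp_get_vendor_price_history(vendor_name: str) -> str:
    """Get 6 months of historical unit prices for a vendor.

    Args:
        vendor_name: Exact vendor name as stored in the database.
    """
    rows = query_convex("price_history", vendor=vendor_name)  # hypothetical helper
    return "\n".join(f"{r['item']}: RM {r['price']:.2f} ({r['month']})" for r in rows)

demo = gr.Interface(fn=mcp_get_vendor_price_history, inputs="text", outputs="text")
demo.launch(mcp_server=True)
```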
The Blaxel + LangGraph agent uses them internally, but they're also exposed via an SSE endpoint (`/gradio_api/mcp/sse`) for external clients like Claude Desktop.
What I learned: This is genuinely magical. No OpenAPI spec writing, no tool schema definitions. Just Python functions with docstrings. MCP is going to be big.
Multi-Model Orchestration: Play to Strengths
One insight that clicked during this build: different models are good at different things.
| Task | Model | Why |
|------|-------|-----|
| Invoice OCR | Gemini 2.0 Flash | Best vision, fast |
| Structured Extraction | Gemini + DSPy | Good at following schemas |
| Price Analysis | Gemini + DSPy | Compares 6-month history, calculates % changes |
| Negotiation Strategy | Claude Sonnet 4 | Nuanced reasoning |
| Follow-up Emails | Claude Sonnet 4 | Professional tone |
Instead of forcing one model to do everything, I let each handle what it's best at. The orchestration layer (Gradio + MCP tools) ties them together.
Context Engineering: Beyond Prompt Engineering
One pattern that emerged from this build: context engineering - systematically constructing AI contexts from multiple data sources at runtime.
This goes beyond writing good prompts. It's about assembling the right information from different places so the AI can do its job.
Multi-source aggregation:
When the voice agent starts a negotiation, it needs context from four sources:
- Business database - company profile
- Vendor database - contact info, relationship history
- Invoice records - line items, totals, payment terms
- Price history - 6 months of historical data per item
All of this gets aggregated and injected into the voice agent's system prompt via dynamic variables.
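In code, the aggregation step looks something like this - every helper and field name here is illustrative, not the actual schema:

```python
def build_negotiation_context(vendor_id: str, invoice_id: str) -> dict:
    """Assemble the voice agent's context from all four sources."""
    business = get_business_profile()       # company profile (hypothetical helpers)
    vendor = get_vendor(vendor_id)          # contact info, relationship history
    invoice = get_invoice(invoice_id)       # line items, totals, payment terms
    history = get_price_history(vendor_id)  # 6 months of prices per item

    # Each key becomes a dynamic variable in the voice agent's system prompt
    return {
        "company_name": business["name"],
        "vendor_name": vendor["name"],
        "invoice_total": invoice["total"],
        "payment_terms": invoice["payment_terms"],
        "price_insights": summarize_price_gaps(invoice["line_items"], history),
    }
```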
Computed insights:
Raw data isn't enough. The system transforms it into negotiation intelligence:
- Compares current prices against historical best prices
- Calculates percentage markups (e.g., "+17.6% above best price")
- Prioritizes items with highest negotiation potential
Example insight generated: "8-inch Shear: +17.6% above best price (best was RM 5.27, now RM 6.20)"
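The insight itself is basic arithmetic. A self-contained sketch that reproduces the example above:

```python
def price_insight(item: str, current: float, best: float) -> str:
    """Describe how far the current price sits above the historical best."""
    markup = (current - best) / best * 100
    return (f"{item}: +{markup:.1f}% above best price "
            f"(best was RM {best:.2f}, now RM {current:.2f})")

print(price_insight("8-inch Shear", 6.20, 5.27))
# 8-inch Shear: +17.6% above best price (best was RM 5.27, now RM 6.20)
```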
Adaptive strategy:
The negotiation strategy adapts based on computed context:
- Items above best price → specific line-item targets
- Prices stable → focus on payment terms, volume discounts
- No historical data → generic best-practice tactics
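The branching behind that is simple - a sketch with illustrative names:

```python
def pick_strategy(insights: list[str], has_history: bool) -> str:
    """Choose a negotiation angle from the computed context."""
    if not has_history:
        return "No price history: lead with generic best-practice tactics."
    if insights:  # items currently priced above their historical best
        return "Target specific line items: " + "; ".join(insights)
    return "Prices stable: negotiate payment terms and volume discounts."
```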
What I learned: Context engineering is underrated. The same model performs dramatically differently depending on what context you give it. Investing in context assembly paid off more than tweaking prompts.
The Rough Edges
Not everything was smooth:
ElevenLabs transcript events - The API returns different event structures depending on... something? Sometimes source, sometimes role. Sometimes message, sometimes text. Had to write defensive parsing code.
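The workaround was a small normalizer - a sketch of the kind of defensive parsing I mean:

```python
def parse_transcript_event(event: dict) -> tuple[str, str]:
    """Normalize ElevenLabs transcript events that arrive in varying shapes."""
    speaker = event.get("source") or event.get("role") or "unknown"
    text = event.get("message") or event.get("text") or ""
    return speaker, text
```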
Gradio MCP is new - The feature shipped recently and docs are sparse. If your function docstrings aren't precise, the tools become hard for AI agents to use correctly. I spent time rewriting docstrings to get reliable tool calls.
DSPy learning curve - Coming from prompt templates, the signature-based approach took adjustment. Documentation has gaps. Worth it once it clicks, but expect some ramp-up time.
What I'd Do Differently
If I rebuilt this:
Start with DSPy earlier - I wasted time on prompt engineering that DSPy would have solved immediately.
Plan the MCP tools upfront - I added them late. If I'd designed around MCP from the start, the architecture would be cleaner.
Invest more in voice analytics - ElevenLabs Conversational AI is impressive, but I barely scratched the surface. The transcripts could feed into cost analysis to identify negotiation patterns, post-call QA to improve agent responses, and better navigation for reviewing specific moments in calls. There's a lot more value to extract from the voice data.
The Takeaway
The pattern I'm most excited about: specialized services + orchestration.
Modal for compute. Blaxel + LangGraph for agents. ElevenLabs for voice. Each does one thing well. MCP ties them together.
This feels cleaner than a monolithic "do everything" agent. And it's easier to debug - when something breaks, you know which piece to look at.
Try It
Demo Video: