Built with Gradio MCP, Gemini, Claude, Modal, Blaxel + LangGraph, and ElevenLabs
Recently I explored multi-model AI orchestration - wiring together Gradio MCP, Modal for serverless compute, DSPy for structured extraction, Blaxel with LangGraph for agent hosting, and ElevenLabs for voice AI. I wanted to go deeper than tutorials - actually build something, break it, and document what I learned.
The result is VendorGuard AI - an agent that processes invoices, analyzes pricing, and negotiates with vendors via voice call.
What I Built
Invoice processing is tedious. Extract data from a PDF, compare prices to historical records, send emails asking for better rates. I wanted to see if I could wire up multiple AI services to handle this end-to-end.
The system:
- Takes an invoice (PDF or image)
- Extracts all the data automatically
- Compares prices against historical records
- Generates a negotiation strategy
- Has a voice AI agent call the vendor
- Sends a follow-up email summarizing what was agreed
The interesting part wasn't any single piece - it was how they all connected.
The Architecture (And Why Each Piece)
Here's what the system looks like:
```
Upload Invoice
      ↓
[Gradio Frontend + MCP Server]
      ↓
[Modal: OCR with Gemini Vision]
      ↓
[Modal: Structured Extraction with DSPy]
      ↓
[Convex: Store Vendors, Invoices, Price History]
      ↓
[Blaxel + LangGraph: Negotiation Strategy with Claude]
      ↓
[ElevenLabs: Voice Negotiation Call]
      ↓
[Claude: Follow-up Email]
```
I didn't start with this architecture. It evolved as I hit walls and found solutions. Let me walk through the key decisions.
Modal
I used Modal to serve custom and open-source AI models with sub-second cold starts and my choice of GPU. I define a function, add a decorator, and it runs in the cloud with GPU access. Deploy with `modal deploy tools.py`. No Dockerfile. No infrastructure config. Pay per second of actual compute.
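Here's a minimal sketch of the shape of it - the app name, GPU choice, and function body are illustrative, not my actual code:

```python
import modal

app = modal.App("vendorguard-tools")  # app name is illustrative

@app.function(gpu="T4", timeout=300)  # GPU choice is an assumption
def ocr_invoice(image_bytes: bytes) -> str:
    """Runs in a serverless container with GPU access, billed per second."""
    # ... call Gemini Vision or an open-source OCR model here ...
    return "extracted text"

@app.local_entrypoint()
def main():
    # .remote() executes the function in Modal's cloud, not locally
    print(ocr_invoice.remote(open("invoice.png", "rb").read()))
```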
DSPy: The Framework That Changed How I Think About LLMs
I started with the usual approach - prompt templates with "please return JSON in this format" instructions. You know how that goes. Half the time the model returns something slightly wrong, you add more instructions, it works for a bit, then breaks again.
DSPy flips this completely. Instead of writing prompts, you define signatures - what goes in, what comes out:
```python
import dspy

class InvoiceExtractionSignature(dspy.Signature):
    """Extract structured invoice data from OCR text."""
    ocr_text: str = dspy.InputField()
    invoice_data: InvoiceExtraction = dspy.OutputField()
```
That InvoiceExtraction is a Pydantic model with 40+ fields. DSPy handles generating the right prompt, parsing the output, and ensuring it matches the schema.
No more "please format as JSON". No more parsing errors. Just define what you want and get it.
What I learned: DSPy vs LangChain isn't about one being "better". They solve different problems. DSPy is for structured extraction - when you need reliable typed outputs. LangChain is for chains of operations. I used DSPy for the extraction step and it was rock solid.
Blaxel + LangGraph: Deploying Agents Without the Fuss
I needed to host the negotiation strategy agent somewhere. Options:
- Roll my own FastAPI + AWS/GCP - too much setup
- LangServe - tied to LangChain
- Blaxel - supports LangGraph out of the box, simple deploy
I decided to try out Blaxel. It has native LangGraph support, so I could define my agent as a graph and deploy it directly:
```python
# agent.py
import anthropic

async def agent(input_data):
    client = anthropic.Anthropic()
    response = client.messages.create(...)  # model + params elided here
    yield response.content[0].text

# main.py
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from agent import agent

app = FastAPI()

@app.post("/")
async def run_agent(request: dict):
    return StreamingResponse(agent(request.get("input")))
```
Deploy with `bl deploy` and you get a scalable endpoint.
What I learned: Blaxel is like AWS or GCP, but as managed infrastructure for AI agents. Native LangGraph support meant I could use graph-based agent patterns without wrestling with deployment infrastructure. It's also framework-agnostic if you're not all-in on LangChain.
Gradio MCP: The Feature That Surprised Me Most
Gradio 6.0 shipped with MCP (Model Context Protocol) server support. I almost missed this feature.
Add one flag to your app:
```python
demo.launch(mcp_server=True)
```
Now every function in your Gradio app becomes a tool that any MCP-compatible AI can call. The function's docstring becomes the tool description. Types are inferred.
I built four MCP tools:
- `mcp_get_vendor_data` - vendor contact info for follow-ups
- `mcp_get_vendor_price_history` - historical pricing for negotiation leverage
- `mcp_get_invoice_details` - complete invoice with line items
- `mcp_analyze_invoice_prices` - compares current vs historical prices
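Here's roughly what one of these looks like - a sketch, where `query_convex` is a hypothetical helper standing in for the real database call:

```python
import gradio as gr

def mcp_get_vendor_price_history(vendor_name: str) -> str:
    """Get 6 months of historical unit prices for a vendor.

    Args:
        vendor_name: Exact vendor name as stored in the database.
    """
    rows = query_convex("price_history", vendor=vendor_name)  # hypothetical helper
    return "\n".join(f"{r['item']}: RM {r['price']:.2f} ({r['month']})" for r in rows)

demo = gr.Interface(fn=mcp_get_vendor_price_history, inputs="text", outputs="text")
demo.launch(mcp_server=True)
```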
The Blaxel + LangGraph agent uses them internally, but they're also exposed via an SSE endpoint (`/gradio_api/mcp/sse`) for external clients like Claude Desktop.
What I learned: This is genuinely magical. No OpenAPI spec writing, no tool schema definitions. Just Python functions with docstrings. MCP is going to be big.
Multi-Model Orchestration: Play to Strengths
One insight that clicked during this build: different models are good at different things.
| Task | Model | Why |
|------|-------|-----|
| Invoice OCR | Gemini 2.0 Flash | Best vision, fast |
| Structured Extraction | Gemini + DSPy | Good at following schemas |
| Price Analysis | Gemini + DSPy | Compares 6-month history, calculates % changes |
| Negotiation Strategy | Claude Sonnet 4 | Nuanced reasoning |
| Follow-up Emails | Claude Sonnet 4 | Professional tone |
Instead of forcing one model to do everything, I let each handle what it's best at. The orchestration layer (Gradio + MCP tools) ties them together.
Context Engineering: Beyond Prompt Engineering
One pattern that emerged from this build: context engineering - systematically constructing AI contexts from multiple data sources at runtime.
This goes beyond writing good prompts. It's about assembling the right information from different places so the AI can do its job.
Multi-source aggregation:
When the voice agent starts a negotiation, it needs context from four sources:
- Business database - company profile
- Vendor database - contact info, relationship history
- Invoice records - line items, totals, payment terms
- Price history - 6 months of historical data per item
All of this gets aggregated and injected into the voice agent's system prompt via dynamic variables.
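In code, the aggregation step looks something like this - every helper and field name here is illustrative, not the actual schema:

```python
def build_negotiation_context(vendor_id: str, invoice_id: str) -> dict:
    """Assemble the voice agent's context from all four sources."""
    business = get_business_profile()       # company profile (hypothetical helpers)
    vendor = get_vendor(vendor_id)          # contact info, relationship history
    invoice = get_invoice(invoice_id)       # line items, totals, payment terms
    history = get_price_history(vendor_id)  # 6 months of prices per item

    # Each key becomes a dynamic variable in the voice agent's system prompt
    return {
        "company_name": business["name"],
        "vendor_name": vendor["name"],
        "invoice_total": invoice["total"],
        "payment_terms": invoice["payment_terms"],
        "price_insights": summarize_price_gaps(invoice["line_items"], history),
    }
```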
Computed insights:
Raw data isn't enough. The system transforms it into negotiation intelligence:
- Compares current prices against historical best prices
- Calculates percentage markups (e.g., "+17.6% above best price")
- Prioritizes items with highest negotiation potential
Example insight generated: "8-inch Shear: +17.6% above best price (best was RM 5.27, now RM 6.20)"
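The insight itself is basic arithmetic. A self-contained sketch that reproduces the example above:

```python
def price_insight(item: str, current: float, best: float) -> str:
    """Describe how far the current price sits above the historical best."""
    markup = (current - best) / best * 100
    return (f"{item}: +{markup:.1f}% above best price "
            f"(best was RM {best:.2f}, now RM {current:.2f})")

print(price_insight("8-inch Shear", 6.20, 5.27))
# 8-inch Shear: +17.6% above best price (best was RM 5.27, now RM 6.20)
```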
Adaptive strategy:
The negotiation strategy adapts based on computed context:
- Items above best price → specific line-item targets
- Prices stable → focus on payment terms, volume discounts
- No historical data → generic best-practice tactics
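The branching behind that is simple - a sketch with illustrative names:

```python
def pick_strategy(insights: list[str], has_history: bool) -> str:
    """Choose a negotiation angle from the computed context."""
    if not has_history:
        return "No price history: lead with generic best-practice tactics."
    if insights:  # items currently priced above their historical best
        return "Target specific line items: " + "; ".join(insights)
    return "Prices stable: negotiate payment terms and volume discounts."
```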
What I learned: Context engineering is underrated. The same model performs dramatically differently depending on what context you give it. Investing in context assembly paid off more than tweaking prompts.
The Rough Edges
Not everything was smooth:
ElevenLabs transcript events - The API returns different event structures depending on... something? Sometimes source, sometimes role. Sometimes message, sometimes text. Had to write defensive parsing code.
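The workaround was a small normalizer - a sketch of the kind of defensive parsing I mean:

```python
def parse_transcript_event(event: dict) -> tuple[str, str]:
    """Normalize ElevenLabs transcript events that arrive in varying shapes."""
    speaker = event.get("source") or event.get("role") or "unknown"
    text = event.get("message") or event.get("text") or ""
    return speaker, text
```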
Gradio MCP is new - The feature shipped recently and docs are sparse. If your function docstrings aren't precise, the tools become hard for AI agents to use correctly. I spent time rewriting docstrings to get reliable tool calls.
DSPy learning curve - Coming from prompt templates, the signature-based approach took adjustment. Documentation has gaps. Worth it once it clicks, but expect some ramp-up time.
What I'd Do Differently
If I rebuilt this:
Start with DSPy earlier - I wasted time on prompt engineering that DSPy would have solved immediately.
Plan the MCP tools upfront - I added them late. If I'd designed around MCP from the start, the architecture would be cleaner.
Invest more in voice analytics - ElevenLabs Conversational AI is impressive, but I barely scratched the surface. The transcripts could feed into cost analysis to identify negotiation patterns, post-call QA to improve agent responses, and better navigation for reviewing specific moments in calls. There's a lot more value to extract from the voice data.
The Takeaway
The pattern I'm most excited about: specialized services + orchestration.
Modal for compute. Blaxel + LangGraph for agents. ElevenLabs for voice. Each does one thing well. MCP ties them together.
This feels cleaner than a monolithic "do everything" agent. And it's easier to debug - when something breaks, you know which piece to look at.
Try It
Demo Video: