# Build a Multi-Modal AI Agent with GPU-Bridge (LLMs + Image + Audio)
Multi-modal AI agents that can see, hear, speak, and reason are among the most exciting developments in AI. In this tutorial, we'll build one from scratch using GPU-Bridge.
By the end, you'll have a Python agent that:
- Analyzes an image using LLaVA-34B (visual Q&A)
- Transcribes audio using Whisper Large v3
- Generates a response using Llama 3.1 70B
- Converts the response to speech using XTTS v2 voice cloning
All powered by real GPUs via the GPU-Bridge API.
## Prerequisites

```bash
pip install requests x402-client  # x402-client is optional (only for the payments section)
```
Get an API key at gpubridge.xyz.
## The Complete Agent
```python
import base64
from pathlib import Path

import requests

API_KEY = "your_gpu_bridge_api_key"
BASE_URL = "https://api.gpubridge.xyz/v1"
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

def gpu_run(service: str, input_data: dict) -> dict:
    """POST a job to GPU-Bridge and return the parsed JSON result."""
    resp = requests.post(
        f"{BASE_URL}/run",
        headers=headers,
        json={"service": service, "input": input_data},
        timeout=300,  # GPU jobs can take a while; don't hang forever
    )
    resp.raise_for_status()
    return resp.json()

def analyze_image(image_path: str) -> str:
    """Use LLaVA 34B on an RTX 4090 for visual Q&A."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    result = gpu_run("llava-4090", {
        "image": img_b64,
        "question": "Describe this image in detail",
        "max_tokens": 500,
    })
    return result["answer"]

def transcribe_audio(audio_path: str) -> str:
    """Use Whisper Large v3 on an L4 GPU for transcription."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    result = gpu_run("whisper-l4", {"audio": audio_b64, "language": "auto"})
    return result["text"]

def generate_response(image_desc: str, transcript: str) -> str:
    """Use Llama 3.1 70B on an RTX 4090 to synthesize a response."""
    result = gpu_run("llm-4090", {
        "messages": [
            {"role": "system", "content": "You are a helpful multi-modal AI assistant."},
            {"role": "user", "content": f"Image: {image_desc}\n\nAudio: {transcript}\n\nProvide a helpful response."},
        ],
        "max_tokens": 600,
    })
    return result["choices"][0]["message"]["content"]

def text_to_speech(text: str, output_path: str = "response.wav",
                   voice_sample: str | None = None) -> str:
    """Use XTTS v2 on an L4 GPU for voice synthesis (with optional voice cloning)."""
    input_data = {"text": text, "language": "en"}
    if voice_sample and Path(voice_sample).exists():
        with open(voice_sample, "rb") as f:
            input_data["voice_sample"] = base64.b64encode(f.read()).decode()
    result = gpu_run("tts-l4", input_data)
    audio_bytes = base64.b64decode(result["audio"])
    with open(output_path, "wb") as f:
        f.write(audio_bytes)
    return output_path

# Run the complete pipeline
def run_agent(image_path: str, audio_path: str, voice_sample: str | None = None):
    print("📸 Step 1: Analyzing image with LLaVA-34B...")
    image_desc = analyze_image(image_path)

    print("🎤 Step 2: Transcribing audio with Whisper Large v3...")
    transcript = transcribe_audio(audio_path)

    print("🤖 Step 3: Generating response with Llama 3.1 70B...")
    response = generate_response(image_desc, transcript)

    print("🗣️ Step 4: Converting to speech with XTTS v2...")
    audio_out = text_to_speech(response, voice_sample=voice_sample)

    print(f"✅ Done! Response saved to: {audio_out}")
    return response

if __name__ == "__main__":
    result = run_agent("input_image.jpg", "input_audio.mp3")
```
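Remote GPU jobs occasionally fail transiently (cold starts, queue pressure). The sketch below wraps any `gpu_run`-style callable in retries with exponential backoff; treating 429/5xx as retryable is my assumption here, not documented GPU-Bridge behavior.

```python
import time

import requests

def gpu_run_with_retry(run_fn, service: str, input_data: dict,
                       retries: int = 3, base_delay: float = 1.0) -> dict:
    """Call run_fn(service, input_data), retrying on assumed-transient HTTP errors."""
    for attempt in range(retries):
        try:
            return run_fn(service, input_data)
        except requests.HTTPError as e:
            status = e.response.status_code if e.response is not None else None
            # 429/5xx treated as retryable (an assumption); anything else re-raises
            if status in (429, 500, 502, 503) and attempt < retries - 1:
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
                continue
            raise
    raise RuntimeError("retries must be >= 1")
```

Drop-in usage: `gpu_run_with_retry(gpu_run, "whisper-l4", {...})` in place of a bare `gpu_run(...)` call.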
## Using x402 for Autonomous Payments
Want your agent to run without any human setup? Use the x402 protocol:
```python
from x402.client import PaymentClient

# Replace the headers-based client with x402
x402_client = PaymentClient(
    private_key="0xYOUR_BASE_L2_PRIVATE_KEY",
    chain="base",
    max_payment="0.10",  # safety limit per request
)

def gpu_run_x402(service: str, input_data: dict) -> dict:
    """x402-powered gpu_run — no API key needed."""
    response = x402_client.request(
        "POST", f"{BASE_URL}/run",
        json={"service": service, "input": input_data},
    )
    return response.json()
```
Just swap `gpu_run` → `gpu_run_x402`. Your agent now pays for each GPU call autonomously with USDC on Base L2 (<$0.01 gas, ~2 s settlement).
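`max_payment` caps a single request, but nothing above caps total session spend. Here's a minimal session-budget guard you could wrap around either client; it's my addition, not part of x402, and the prices plugged into it would be your own estimates.

```python
class BudgetGuard:
    """Refuse further GPU calls once a session-wide USD budget is exhausted."""

    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.spent = 0.0

    def charge(self, service: str, price_usd: float) -> None:
        """Record an estimated charge, raising if it would exceed the budget."""
        if self.spent + price_usd > self.budget:
            raise RuntimeError(
                f"Budget exceeded: {service} costs ~${price_usd}, "
                f"already spent ${self.spent:.3f} of ${self.budget:.2f}"
            )
        self.spent += price_usd
```

Call `guard.charge(service, estimated_price)` before each `gpu_run_x402(...)` so a looping agent can't drain the wallet.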
## Cost Analysis
| Step | Service | ~Cost |
|---|---|---|
| Image analysis | llava-4090 | $0.02 |
| Audio transcription (1 min) | whisper-l4 | $0.005 |
| LLM response | llm-4090 | $0.01 |
| TTS (100 words) | tts-l4 | $0.005 |
| **Total per run** | | **~$0.04** |
That's 25 complete pipeline runs for $1.
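The table translates into a quick budgeting helper. Note the prices below are the approximate figures from the table above, not live API rates:

```python
# Approximate per-call prices from the cost table (USD) — estimates, not live rates
PRICES = {
    "llava-4090": 0.02,    # image analysis
    "whisper-l4": 0.005,   # ~1 minute of audio
    "llm-4090": 0.01,      # LLM response
    "tts-l4": 0.005,       # ~100 words of TTS
}

def estimate_cost(runs: int) -> float:
    """Rough USD cost of `runs` full pipeline executions."""
    return round(runs * sum(PRICES.values()), 4)
```

For example, `estimate_cost(25)` confirms the "$1 for 25 runs" figure.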
## Also Available: MCP Server for Claude
GPU-Bridge also has an MCP server that gives Claude direct access to all 26 services:
```json
{
  "mcpServers": {
    "gpu-bridge": {
      "command": "npx",
      "args": ["-y", "@gpu-bridge/mcp-server"],
      "env": { "GPUBRIDGE_API_KEY": "your_key" }
    }
  }
}
```
## Links
- 🔑 Get API key: gpubridge.xyz
- 📖 Docs: gpubridge.xyz/docs
- 🐙 MCP Server: github.com/gpu-bridge/mcp-server
Have questions? Drop them in the comments!