# Build a Multi-Modal AI Agent with GPU-Bridge (LLMs + Image + Audio)
Multi-modal AI agents that can see, hear, speak, and reason are among the most exciting developments in AI. In this tutorial, we'll build one from scratch using GPU-Bridge.
By the end, you'll have a Python agent that:
- Analyzes an image using LLaVA-34B (visual Q&A)
- Transcribes audio using Whisper Large v3
- Generates a response using Llama 3.1 70B
- Converts the response to speech using XTTS v2 voice cloning
All powered by real GPUs via the GPU-Bridge API.
## Prerequisites

```bash
pip install requests x402-client  # x402-client is optional (only for the payments section)
```
Get an API key at gpubridge.xyz.
## The Complete Agent
```python
import base64
from pathlib import Path

import requests

API_KEY = "your_gpu_bridge_api_key"
BASE_URL = "https://api.gpubridge.xyz/v1"
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

def gpu_run(service: str, input_data: dict) -> dict:
    """POST a job to GPU-Bridge and return the parsed JSON result."""
    resp = requests.post(
        f"{BASE_URL}/run",
        headers=headers,
        json={"service": service, "input": input_data},
        timeout=300,  # GPU jobs can take a while; don't hang forever
    )
    resp.raise_for_status()
    return resp.json()

def analyze_image(image_path: str) -> str:
    """Use LLaVA 34B on an RTX 4090 for visual Q&A."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    result = gpu_run("llava-4090", {
        "image": img_b64,
        "question": "Describe this image in detail",
        "max_tokens": 500,
    })
    return result["answer"]

def transcribe_audio(audio_path: str) -> str:
    """Use Whisper Large v3 on an L4 GPU for transcription."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    result = gpu_run("whisper-l4", {"audio": audio_b64, "language": "auto"})
    return result["text"]

def generate_response(image_desc: str, transcript: str) -> str:
    """Use Llama 3.1 70B on an RTX 4090 to synthesize a response."""
    result = gpu_run("llm-4090", {
        "messages": [
            {"role": "system", "content": "You are a helpful multi-modal AI assistant."},
            {"role": "user", "content": f"Image: {image_desc}\n\nAudio: {transcript}\n\nProvide a helpful response."},
        ],
        "max_tokens": 600,
    })
    return result["choices"][0]["message"]["content"]

def text_to_speech(text: str, output_path: str = "response.wav",
                   voice_sample: str | None = None) -> str:
    """Use XTTS v2 on an L4 GPU for voice synthesis (with optional voice cloning)."""
    input_data = {"text": text, "language": "en"}
    if voice_sample and Path(voice_sample).exists():
        with open(voice_sample, "rb") as f:
            input_data["voice_sample"] = base64.b64encode(f.read()).decode()
    result = gpu_run("tts-l4", input_data)
    audio_bytes = base64.b64decode(result["audio"])
    with open(output_path, "wb") as f:
        f.write(audio_bytes)
    return output_path

# Run the complete pipeline
def run_agent(image_path: str, audio_path: str, voice_sample: str | None = None):
    print("📸 Step 1: Analyzing image with LLaVA-34B...")
    image_desc = analyze_image(image_path)

    print("🎤 Step 2: Transcribing audio with Whisper Large v3...")
    transcript = transcribe_audio(audio_path)

    print("🤖 Step 3: Generating response with Llama 3.1 70B...")
    response = generate_response(image_desc, transcript)

    print("🗣️ Step 4: Converting to speech with XTTS v2...")
    audio_out = text_to_speech(response, voice_sample=voice_sample)

    print(f"✅ Done! Response saved to: {audio_out}")
    return response

if __name__ == "__main__":
    result = run_agent("input_image.jpg", "input_audio.mp3")
```
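Remote GPU jobs occasionally fail transiently (cold starts, queue pressure). The sketch below wraps any `gpu_run`-style callable in retries with exponential backoff; treating 429/5xx as retryable is my assumption here, not documented GPU-Bridge behavior.

```python
import time

import requests

def gpu_run_with_retry(run_fn, service: str, input_data: dict,
                       retries: int = 3, base_delay: float = 1.0) -> dict:
    """Call run_fn(service, input_data), retrying on assumed-transient HTTP errors."""
    for attempt in range(retries):
        try:
            return run_fn(service, input_data)
        except requests.HTTPError as e:
            status = e.response.status_code if e.response is not None else None
            # 429/5xx treated as retryable (an assumption); anything else re-raises
            if status in (429, 500, 502, 503) and attempt < retries - 1:
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
                continue
            raise
    raise RuntimeError("retries must be >= 1")
```

Drop-in usage: `gpu_run_with_retry(gpu_run, "whisper-l4", {...})` in place of a bare `gpu_run(...)` call.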
## Using x402 for Autonomous Payments
Want your agent to run without any human setup? Use the x402 protocol:
```python
from x402.client import PaymentClient

# Replace the headers-based client with x402
x402_client = PaymentClient(
    private_key="0xYOUR_BASE_L2_PRIVATE_KEY",
    chain="base",
    max_payment="0.10",  # safety limit per request
)

def gpu_run_x402(service: str, input_data: dict) -> dict:
    """x402-powered gpu_run — no API key needed."""
    response = x402_client.request(
        "POST", f"{BASE_URL}/run",
        json={"service": service, "input": input_data},
    )
    return response.json()
```
Just swap `gpu_run` → `gpu_run_x402`. Your agent now pays for each GPU call autonomously with USDC on Base L2 (<$0.01 gas, ~2 s settlement).
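`max_payment` caps a single request, but nothing above caps total session spend. Here's a minimal session-budget guard you could wrap around either client; it's my addition, not part of x402, and the prices plugged into it would be your own estimates.

```python
class BudgetGuard:
    """Refuse further GPU calls once a session-wide USD budget is exhausted."""

    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.spent = 0.0

    def charge(self, service: str, price_usd: float) -> None:
        """Record an estimated charge, raising if it would exceed the budget."""
        if self.spent + price_usd > self.budget:
            raise RuntimeError(
                f"Budget exceeded: {service} costs ~${price_usd}, "
                f"already spent ${self.spent:.3f} of ${self.budget:.2f}"
            )
        self.spent += price_usd
```

Call `guard.charge(service, estimated_price)` before each `gpu_run_x402(...)` so a looping agent can't drain the wallet.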
## Cost Analysis
| Step | Service | ~Cost |
|---|---|---|
| Image analysis | llava-4090 | $0.02 |
| Audio transcription (1 min) | whisper-l4 | $0.005 |
| LLM response | llm-4090 | $0.01 |
| TTS (100 words) | tts-l4 | $0.005 |
| **Total per run** | | **~$0.04** |
That's 25 complete pipeline runs for $1.
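The table translates into a quick budgeting helper. Note the prices below are the approximate figures from the table above, not live API rates:

```python
# Approximate per-call prices from the cost table (USD) — estimates, not live rates
PRICES = {
    "llava-4090": 0.02,    # image analysis
    "whisper-l4": 0.005,   # ~1 minute of audio
    "llm-4090": 0.01,      # LLM response
    "tts-l4": 0.005,       # ~100 words of TTS
}

def estimate_cost(runs: int) -> float:
    """Rough USD cost of `runs` full pipeline executions."""
    return round(runs * sum(PRICES.values()), 4)
```

For example, `estimate_cost(25)` confirms the "$1 for 25 runs" figure.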
## Also Available: MCP Server for Claude
GPU-Bridge also has an MCP server that gives Claude direct access to all 26 services:
```json
{
  "mcpServers": {
    "gpu-bridge": {
      "command": "npx",
      "args": ["-y", "@gpu-bridge/mcp-server"],
      "env": { "GPUBRIDGE_API_KEY": "your_key" }
    }
  }
}
```
## Links
- 🔑 Get API key: gpubridge.xyz
- 📖 Docs: gpubridge.xyz/docs
- 🐙 MCP Server: github.com/gpu-bridge/mcp-server
Have questions? Drop them in the comments!