# How to Build an Autonomous Agent That Pays for Its Own GPU Inference
Autonomous agents are getting smarter. They can browse the web, write code, plan tasks, and call external APIs. But there's a gap that almost nobody talks about: agents can't pay for their own compute.
Run a LangChain agent on a free API tier and it works fine — until it hits rate limits, needs a GPU-heavy model, or has to run inference on your private data. Then what? You either pre-fund an account (centralised, manual) or the agent breaks. Neither is great if you're building something truly autonomous.
This tutorial shows a different approach: an agent that pays per inference call using USDC on Base L2, with no pre-registration and no credit card. It works because of two things:
- x402 — a draft HTTP extension for machine-native micropayments
- GPU-Bridge — a GPU inference API that accepts x402 payments natively
Let's build it.
## 1. The Problem: Agents That Need GPUs but Can't Hold Credit Cards
Modern AI agents need GPU compute for:
- Running open-source LLMs (Llama, Qwen, Mistral) without OpenAI vendor lock-in
- Speech-to-text, image generation, embeddings, vision models
- Private inference (your data never leaves your control to a third party)
- Cost control at sub-cent granularity
The traditional solution: sign up, add a credit card, get an API key, hardcode it. This works for humans. It's terrible for autonomous agents because:
- Manual setup — someone has to register each agent identity
- Shared credentials — multiple agents share one API key, no per-agent accounting
- Over-provisioning — you pre-fund an account; unused credits are wasted
- No autonomy — the agent can't spin up its own payment capability at runtime
What we want is an agent that can say "I need GPU inference, here's $0.05 in USDC, give me the result" — with no prior registration.
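Whatever the payment rail, an autonomous spender needs a hard budget. Here's a minimal sketch of a per-agent spend guard — the class and method names are my own illustration, not part of any GPU-Bridge SDK:

```python
class BudgetExceeded(Exception):
    """Raised when an agent tries to spend past its session budget."""


class SpendGuard:
    """Tracks per-agent spend in USD and enforces a hard cap.

    Illustrative only -- not part of any GPU-Bridge SDK.
    """

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def authorize(self, cost_usd: float) -> None:
        """Reserve cost_usd from the budget, or raise BudgetExceeded."""
        if self.spent_usd + cost_usd > self.budget_usd:
            remaining = self.budget_usd - self.spent_usd
            raise BudgetExceeded(
                f"call costs ${cost_usd:.3f}, only ${remaining:.3f} left"
            )
        self.spent_usd += cost_usd


# An agent funded with $0.10 can afford two $0.05 calls
guard = SpendGuard(budget_usd=0.10)
guard.authorize(0.05)
guard.authorize(0.05)
```

Call `guard.authorize(estimated_cost)` before each paid request; a third $0.05 call above would raise `BudgetExceeded` instead of silently draining the wallet.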
## 2. What is x402?
x402 is a revival of the HTTP 402 Payment Required status code that was reserved in HTTP/1.1 but never standardised.
The flow is simple:
```
Client → POST /run                     → Server: 402 + payment details
Client pays on-chain (USDC, Base L2)
Client → POST /run + X-Payment header  → Server: 200 + result
```
The payment details come back as a JSON object in the 402 response body:
- Which wallet to pay
- How much (in USDC)
- Which network (Base L2)
- A nonce to prevent replays
The client signs a payment authorisation (EIP-712) and attaches it as an X-Payment header on the retry. No registration. No API key needed. Just a wallet with some USDC.
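To make that concrete, here's the rough shape such a 402 body might take. This is illustrative — the field names mirror the ones the Python example later in this post reads (`recipient`, `amount`, `expires`, `nonce`, `domain`), but the live API's exact schema may differ:

```json
{
  "recipient": "0x1234abcd...",
  "amount": "0.05",
  "currency": "USDC",
  "network": "base:8453",
  "expires": 1773500000,
  "nonce": "0x9f2c...",
  "domain": { "name": "GPU-Bridge", "version": "1", "chainId": 8453 }
}
```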
For agents that use API keys (simpler setup), GPU-Bridge also supports prepaid credits with an X-API-Key header. Both work with the same endpoint.
## 3. GPU-Bridge API Overview
GPU-Bridge exposes 30 GPU services through a single unified API. One endpoint to rule them all: `https://api.gpubridge.xyz/run`.
The full catalog is at `GET https://api.gpubridge.xyz/catalog`. Here's a taste:
| Service | Key | Typical Cost |
|---|---|---|
| LLM Inference (Llama 3.3 70B, Qwen3 32B...) | `llm-4090` | ~$0.05 |
| Speech-to-Text (Whisper) | `whisper-l4` | ~$0.05 |
| Image Generation (SDXL) | `image-4090` | ~$0.06 |
| Text Embeddings | `embedding-l4` | ~$0.01 |
| Vision-Language Model | `llava-4090` | ~$0.05 |
| Document Reranking | `rerank` | ~$0.001 |
| Image Captioning | `caption` | ~$0.01 |
| NSFW Detection | `nsfw-detect` | ~$0.005 |
| Background Removal | `rembg-l4` | ~$0.01 |
| PDF/Document Parser | `pdf-parse` | ~$0.05 |
Request format is always:

```json
{
  "service": "llm-4090",
  "input": {
    "prompt": "Your prompt here",
    "model": "llama-3.1-8b-instant"
  }
}
```
Auth is either:
- `X-API-Key: gpub_xxx` (prepaid credits, easier to start with)
- `X-Payment: <EIP-712 signed payment>` (x402, permissionless)
## 4. Code Example: Python Agent with API Key Auth
Let's start simple — a Python script that calls GPU-Bridge with an API key and gets LLM inference:
```python
import requests

GPUBRIDGE_API_KEY = "gpub_RqQpoj6xF0znuFGe1TJ2u6B-CRgHcj1P"  # replace with yours
GPUBRIDGE_URL = "https://api.gpubridge.xyz/run"

def gpu_inference(prompt: str, model: str = "llama-3.1-8b-instant") -> str:
    """Call GPU-Bridge LLM inference."""
    response = requests.post(
        GPUBRIDGE_URL,
        headers={
            "X-API-Key": GPUBRIDGE_API_KEY,
            "Content-Type": "application/json",
        },
        json={
            "service": "llm-4090",
            "input": {
                "prompt": prompt,
                "model": model,
                "max_tokens": 512,
                "temperature": 0.7,
            }
        }
    )
    response.raise_for_status()
    data = response.json()
    # Response follows OpenAI-compatible format
    return data["choices"][0]["message"]["content"]

# Simple usage
result = gpu_inference("Explain quantum entanglement in 2 sentences.")
print(result)
```
Now let's make it x402-capable — the agent pays with USDC directly, no API key needed:
```python
import json
import requests
from eth_account import Account

# Your agent's wallet (generated or loaded from env)
AGENT_PRIVATE_KEY = "0x..."  # agent's wallet private key
GPUBRIDGE_URL = "https://api.gpubridge.xyz/run"

def sign_x402_payment(payment_details: dict, private_key: str) -> str:
    """Sign an x402 payment authorisation (EIP-712)."""
    account = Account.from_key(private_key)
    # EIP-712 typed data structure
    typed_data = {
        "types": {
            "EIP712Domain": [
                {"name": "name", "type": "string"},
                {"name": "version", "type": "string"},
                {"name": "chainId", "type": "uint256"},
            ],
            "Payment": [
                {"name": "from", "type": "address"},
                {"name": "to", "type": "address"},
                {"name": "value", "type": "uint256"},
                {"name": "validAfter", "type": "uint256"},
                {"name": "validBefore", "type": "uint256"},
                {"name": "nonce", "type": "bytes32"},
            ]
        },
        "domain": payment_details["domain"],
        "primaryType": "Payment",
        "message": {
            "from": account.address,
            "to": payment_details["recipient"],
            "value": int(float(payment_details["amount"]) * 1_000_000),  # USDC has 6 decimals
            "validAfter": 0,
            "validBefore": payment_details["expires"],
            "nonce": bytes.fromhex(payment_details["nonce"].replace("0x", "")),
        }
    }
    signed = account.sign_typed_data(full_message=typed_data)
    return signed.signature.hex()

def gpu_inference_x402(prompt: str, model: str = "llama-3.1-8b-instant") -> str:
    """Call GPU-Bridge with x402 micropayment. No API key needed."""
    payload = {
        "service": "llm-4090",
        "input": {
            "prompt": prompt,
            "model": model,
            "max_tokens": 512,
        }
    }

    # Step 1: Probe for payment requirements
    probe = requests.post(
        GPUBRIDGE_URL,
        headers={"Content-Type": "application/json"},
        json=payload
    )
    if probe.status_code == 200:
        # Paid already or free — shouldn't happen normally
        return probe.json()["choices"][0]["message"]["content"]
    if probe.status_code != 402:
        probe.raise_for_status()

    # Step 2: Parse payment requirements
    payment_info = probe.json()
    print(f"Payment required: {payment_info['amount']} USDC on Base")

    # Step 3: Sign payment authorisation
    signature = sign_x402_payment(payment_info, AGENT_PRIVATE_KEY)
    account = Account.from_key(AGENT_PRIVATE_KEY)
    x_payment = json.dumps({
        "scheme": "exact",
        "network": "base:8453",
        "payload": {
            "authorization": {
                "from": account.address,
                "to": payment_info["recipient"],
                "value": str(int(float(payment_info["amount"]) * 1_000_000)),
                "validAfter": "0",
                "validBefore": str(payment_info["expires"]),
                "nonce": payment_info["nonce"],
            },
            "signature": signature,
        }
    })

    # Step 4: Retry with payment header
    result = requests.post(
        GPUBRIDGE_URL,
        headers={
            "Content-Type": "application/json",
            "X-Payment": x_payment,
        },
        json=payload
    )
    result.raise_for_status()
    return result.json()["choices"][0]["message"]["content"]
```
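One sharp edge worth knowing about: converting a decimal USDC amount with `int(float(amount) * 1_000_000)` can truncate for some small amounts, because most decimal fractions have no exact float representation. A `decimal.Decimal`-based helper avoids this; `usdc_to_units` is my own name for it, not a GPU-Bridge API:

```python
from decimal import Decimal

USDC_DECIMALS = 6  # USDC uses 6 decimal places on-chain

def usdc_to_units(amount: str) -> int:
    """Convert a decimal USDC amount (e.g. "0.05") to on-chain integer units.

    Exact conversion via Decimal; rejects amounts with sub-unit precision,
    where float arithmetic could silently round the wrong way.
    """
    units = Decimal(amount) * (10 ** USDC_DECIMALS)
    if units != units.to_integral_value():
        raise ValueError(f"{amount} has more than {USDC_DECIMALS} decimal places")
    return int(units)
```

For typical amounts like `"0.05"` both approaches agree, but the `Decimal` version is exact for every valid USDC amount and fails loudly on invalid ones.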
Note: For x402 payments, your agent wallet needs USDC on Base L2. You can bridge from mainnet at bridge.base.org or buy directly on Coinbase. $5 worth of USDC will cover thousands of inference calls.
## 5. Code Example: LangChain Tool Wrapping GPU-Bridge
Here's where it gets interesting. You can wrap GPU-Bridge as a set of LangChain tools, making the entire 30-service catalog available to any LangChain agent:
```python
from langchain.tools import tool
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
import requests

GPUBRIDGE_API_KEY = "gpub_RqQpoj6xF0znuFGe1TJ2u6B-CRgHcj1P"
GPUBRIDGE_URL = "https://api.gpubridge.xyz/run"
HEADERS = {"X-API-Key": GPUBRIDGE_API_KEY, "Content-Type": "application/json"}

@tool
def gpu_llm(prompt: str, model: str = "llama-3.3-70b-versatile") -> str:
    """
    Run LLM inference on a GPU cluster using open-source models.
    Supports: llama-3.3-70b-versatile, llama-3.1-8b-instant, qwen3-32b, llama-4-scout-17b-16e-instruct.
    Use this for tasks requiring strong reasoning or long context.
    Cost: ~$0.05 per call.
    """
    resp = requests.post(
        GPUBRIDGE_URL,
        headers=HEADERS,
        json={"service": "llm-4090", "input": {"prompt": prompt, "model": model, "max_tokens": 1024}}
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

@tool
def gpu_embed(text: str) -> list[float]:
    """
    Generate a high-quality text embedding vector.
    Use this for semantic search, RAG retrieval, or similarity comparison.
    Returns a list of floats (embedding vector).
    Cost: ~$0.01 per call.
    """
    resp = requests.post(
        GPUBRIDGE_URL,
        headers=HEADERS,
        json={"service": "embedding-l4", "input": {"text": text}}
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

@tool
def gpu_transcribe(audio_url: str) -> str:
    """
    Transcribe audio from a URL to text using Whisper Large v3.
    Supports mp3, wav, m4a, ogg formats.
    Cost: ~$0.05 per call.
    """
    resp = requests.post(
        GPUBRIDGE_URL,
        headers=HEADERS,
        json={"service": "whisper-l4", "input": {"audio_url": audio_url}}
    )
    resp.raise_for_status()
    return resp.json()["text"]

@tool
def gpu_caption(image_url: str) -> str:
    """
    Generate a descriptive caption for an image.
    Use when you need to understand the content of an image.
    Cost: ~$0.01 per call.
    """
    resp = requests.post(
        GPUBRIDGE_URL,
        headers=HEADERS,
        json={"service": "caption", "input": {"image_url": image_url}}
    )
    resp.raise_for_status()
    return resp.json()["caption"]

@tool
def gpu_rerank(query: str, documents: list[str]) -> list[dict]:
    """
    Rerank a list of documents by relevance to a query.
    Returns documents sorted by relevance score (highest first).
    Ideal for RAG pipelines to improve retrieval quality.
    Cost: ~$0.001 per call.
    """
    resp = requests.post(
        GPUBRIDGE_URL,
        headers=HEADERS,
        json={"service": "rerank", "input": {"query": query, "documents": documents}}
    )
    resp.raise_for_status()
    return resp.json()["results"]

# Build an agent with all GPU tools available
tools = [gpu_llm, gpu_embed, gpu_transcribe, gpu_caption, gpu_rerank]

# You can use any LLM as the orchestrator — even a local one
orchestrator = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant with access to GPU-powered tools. "
               "Use the GPU tools for tasks that require heavy computation: "
               "open-source LLMs, audio transcription, image understanding, embeddings, and document reranking."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

agent = create_openai_tools_agent(orchestrator, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Example: multi-modal reasoning pipeline
result = agent_executor.invoke({
    "input": (
        "I have a podcast episode at https://example.com/episode.mp3. "
        "Transcribe it, then summarise the key points using the Llama 70B model. "
        "Finally, generate an embedding of the summary for storage."
    )
})
print(result["output"])
```
This pattern works equally well with CrewAI, AutoGen, and LlamaIndex — just wrap the `requests.post` calls in the appropriate tool format for your framework.
### CrewAI example (quick snippet)

```python
import requests
from crewai_tools import BaseTool

GPUBRIDGE_API_KEY = "gpub_xxx"  # your key

class GPUBridgeLLMTool(BaseTool):
    name: str = "GPU LLM Inference"
    description: str = "Run open-source LLMs (Llama, Qwen) via GPU-Bridge. Cost: ~$0.05/call."

    def _run(self, prompt: str) -> str:
        resp = requests.post(
            "https://api.gpubridge.xyz/run",
            headers={"X-API-Key": GPUBRIDGE_API_KEY},
            json={"service": "llm-4090", "input": {"prompt": prompt, "model": "llama-3.3-70b-versatile"}}
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```
### AutoGen example (quick snippet)

```python
import autogen
import requests

GPUBRIDGE_API_KEY = "gpub_xxx"  # your key

def gpu_bridge_inference(prompt: str, model: str = "llama-3.1-8b-instant") -> str:
    resp = requests.post(
        "https://api.gpubridge.xyz/run",
        headers={"X-API-Key": GPUBRIDGE_API_KEY},
        json={"service": "llm-4090", "input": {"prompt": prompt, "model": model}}
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Register as a function for AutoGen agents
assistant = autogen.AssistantAgent(
    "gpu_assistant",
    function_map={"gpu_inference": gpu_bridge_inference}
)
```
## 6. Cost Comparison: GPU-Bridge vs OpenAI vs Together AI
Let's be concrete about what things actually cost. All prices as of March 2026:
### LLM Inference (per 1M tokens)
| Provider | Model | Input | Output |
|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 |
| OpenAI | GPT-4o mini | $0.15 | $0.60 |
| Together AI | Llama 3.3 70B | $0.88 | $0.88 |
| GPU-Bridge | Llama 3.3 70B | ~$0.40* | ~$0.40* |
| GPU-Bridge | Llama 3.1 8B | ~$0.10* | ~$0.10* |
*GPU-Bridge charges by wall-clock GPU time (~$0.0024/sec on RTX 4090). Estimate based on typical throughput.
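You can sanity-check the footnote's numbers yourself. At ~$0.0024 per GPU-second, a call that occupies the GPU for about 20 seconds costs roughly $0.05, and the ~$0.40/1M-token figure implies an aggregate throughput of around 6,000 tok/s (my assumption, chosen to match the table — actual throughput varies with model and batching):

```python
GPU_SECOND_USD = 0.0024  # RTX 4090 rate quoted above

def call_cost(gpu_seconds: float) -> float:
    """Estimated cost of a single call, in USD, at the quoted rate."""
    return gpu_seconds * GPU_SECOND_USD

def cost_per_million_tokens(throughput_tok_s: float) -> float:
    """Implied $/1M tokens at a given aggregate generation throughput."""
    seconds_per_million = 1_000_000 / throughput_tok_s
    return seconds_per_million * GPU_SECOND_USD

# A ~20-second call costs about $0.05
print(f"20s call: ${call_cost(20):.3f}")
# ~6,000 tok/s (assumed) gives roughly the table's ~$0.40/1M tokens
print(f"$/1M tokens: ${cost_per_million_tokens(6000):.2f}")
```

The takeaway: per-second billing means your cost scales directly with how long your workload actually holds the GPU, not with a provider's token pricing tier.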
### Embeddings (per 1M tokens)
| Provider | Model | Cost |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.02 |
| Together AI | M2-BERT-80M | $0.008 |
| GPU-Bridge | GPU embedding | ~$0.01* |
### Audio Transcription (per hour)
| Provider | Model | Cost |
|---|---|---|
| OpenAI | Whisper | $0.36/hr |
| AssemblyAI | Best | $0.37/hr |
| GPU-Bridge | Whisper Large v3 Turbo | ~$0.10/hr* |
*Estimate: ~$0.05/call, typical audio ~30 min per call.
### The real advantage: per-call pricing with no commitment
GPU-Bridge charges per GPU-second with no minimum spend and no rate limits tied to plan tiers. For an autonomous agent running occasional bursts of inference:
- OpenAI/Together: You pay for a tier; unused capacity is wasted
- GPU-Bridge: You pay exactly for what you use, per call, down to fractions of a cent
For agents doing RAG pipelines (embed → retrieve → rerank → generate), the cost breakdown on GPU-Bridge looks like:
```
embed query:      $0.01
rerank 10 docs:   $0.001
generate answer:  $0.05
─────────────────────────
total:           ~$0.061 per RAG query
```
Compare to an all-OpenAI stack: ~$0.15–$0.30 per equivalent pipeline.
## 7. Putting It Together: A Complete RAG Agent
Here's a minimal but complete RAG agent that uses GPU-Bridge for every step:
"""
rag_agent.py — A simple RAG pipeline using GPU-Bridge exclusively.
pip install requests numpy
"""
import requests
import numpy as np
import json
API_KEY = "gpub_RqQpoj6xF0znuFGe1TJ2u6B-CRgHcj1P"
BASE_URL = "https://api.gpubridge.xyz/run"
HEADERS = {"X-API-Key": API_KEY, "Content-Type": "application/json"}
def embed(text: str) -> list[float]:
r = requests.post(BASE_URL, headers=HEADERS,
json={"service": "embedding-l4", "input": {"text": text}})
r.raise_for_status()
return r.json()["data"][0]["embedding"]
def rerank(query: str, docs: list[str]) -> list[str]:
r = requests.post(BASE_URL, headers=HEADERS,
json={"service": "rerank", "input": {"query": query, "documents": docs}})
r.raise_for_status()
results = r.json()["results"]
return [docs[item["index"]] for item in sorted(results, key=lambda x: x["score"], reverse=True)]
def generate(prompt: str) -> str:
r = requests.post(BASE_URL, headers=HEADERS,
json={"service": "llm-4090", "input": {
"prompt": prompt,
"model": "llama-3.3-70b-versatile",
"max_tokens": 512
}})
r.raise_for_status()
return r.json()["choices"][0]["message"]["content"]
def cosine_similarity(a: list[float], b: list[float]) -> float:
a, b = np.array(a), np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
class SimpleRAG:
def __init__(self):
self.documents = []
self.embeddings = []
def add_document(self, text: str):
print(f" Embedding: {text[:60]}...")
self.documents.append(text)
self.embeddings.append(embed(text))
def query(self, question: str, top_k: int = 3) -> str:
print(f"\nQuery: {question}")
# Step 1: Embed the query
query_emb = embed(question)
# Step 2: Retrieve top-k by cosine similarity
scores = [cosine_similarity(query_emb, doc_emb) for doc_emb in self.embeddings]
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
top_docs = [self.documents[i] for i in top_indices]
# Step 3: Rerank for precision
reranked_docs = rerank(question, top_docs)
context = "\n\n".join(reranked_docs[:2])
# Step 4: Generate answer
prompt = f"""Answer the question based on the context below.
Context:
{context}
Question: {question}
Answer:"""
return generate(prompt)
# Demo
rag = SimpleRAG()
# Index some documents
docs = [
"GPU-Bridge is a GPU-as-a-Service API that supports x402 micropayments with USDC on Base L2.",
"x402 is an HTTP-native payment protocol that uses the 402 status code to trigger crypto micropayments.",
"LangChain is a Python framework for building LLM-powered applications and autonomous agents.",
"Base L2 is an Ethereum Layer 2 network built by Coinbase, offering fast and cheap USDC transactions.",
"USDC is a stablecoin pegged 1:1 to the US dollar, issued by Circle and widely used in DeFi.",
]
print("Indexing documents...")
for doc in docs:
rag.add_document(doc)
# Query
answer = rag.query("How can an AI agent pay for GPU inference without a credit card?")
print(f"\nAnswer: {answer}")
Run it:

```bash
pip install requests numpy
python rag_agent.py
```
Expected output:

```
Indexing documents...
  Embedding: GPU-Bridge is a GPU-as-a-Service API that supports x402...
  Embedding: x402 is an HTTP-native payment protocol that uses the 402...
  ...

Query: How can an AI agent pay for GPU inference without a credit card?

Answer: An AI agent can pay for GPU inference without a credit card by using
GPU-Bridge's x402 payment system. x402 is an HTTP-native payment protocol that
uses USDC stablecoin on Base L2 — a fast, low-cost Ethereum Layer 2 network.
The agent holds a crypto wallet with USDC, and when it calls the GPU-Bridge API,
it receives a 402 response with payment details, signs a payment authorisation,
and retries the request. No registration or credit card required.
```
## Conclusion
The combination of x402 + GPU-Bridge unlocks a new pattern for AI agents: self-funded GPU compute. Instead of pre-configuring API keys and billing accounts for every agent deployment, you can:
- Give your agent a wallet with some USDC
- Point it at `https://api.gpubridge.xyz/run`
- Let it pay for exactly what it uses, per call
This is particularly powerful for:
- Multi-agent systems where each agent needs its own budget
- Serverless agent deployments where you don't control the runtime environment
- Autonomous agents that need to self-provision compute without human intervention
- Cost-sensitive pipelines where you want sub-cent granularity
## Get Started

- Browse the catalog: `GET https://api.gpubridge.xyz/catalog`
- Get API credits: gpubridge.xyz (or use x402 with a Base wallet)
- Try the API:

```bash
curl -X POST https://api.gpubridge.xyz/run \
  -H "X-API-Key: your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"service": "llm-4090", "input": {"prompt": "Hello!", "model": "llama-3.1-8b-instant"}}'
```
The future of autonomous agents isn't agents that ask humans for API keys. It's agents that pay their own way.
Found this useful? Follow @gpubridge for more tutorials on building autonomous AI agents. We're building the infrastructure layer for agentic AI — come say hi.