DEV Community

GPU-Bridge

Build a Fully Autonomous RAG Agent That Pays for Its Own Compute (x402 + GPU-Bridge)


What if your AI agent could read a PDF, understand it, and answer questions — without you ever touching a credit card?

That's not a thought experiment. It's one short Python script. Let's build it.


The Problem With RAG Pipelines Today

A standard production RAG pipeline needs:

  1. PDF parsing — AWS Textract, or Adobe PDF Services, or Unstructured.io
  2. Embeddings — OpenAI, Cohere, or Voyage AI
  3. Reranking — Cohere Rerank, or Jina AI
  4. LLM inference — OpenAI, Anthropic, or Together.ai

That's 4 billing accounts, 4 API keys, 4 rate limit policies, and 4 dashboards to monitor. Every service has its own signup flow, its own credit card requirement, its own auth scheme.

Worse: if you're building autonomous AI agents, none of this works. An agent can't create a Stripe account. It can't fill out a CAPTCHA. It can't receive a verification email. The entire payment infrastructure of the modern API economy is designed for humans, not machines.


The Solution: GPU-Bridge + x402

GPU-Bridge is a unified GPU inference API that covers the full AI stack — document parsing, embeddings, reranking, LLM inference, image generation, audio, and more — through a single endpoint: https://api.gpubridge.xyz/run.

Two payment modes:

  • Stripe credits — for developers: register with email, prepay credits, get an API key. Standard workflow.
  • x402 — for agents: no account, no API key, no signup. The server returns HTTP 402 Payment Required, your agent pays with USDC on Base L2, retries, gets the result.

This tutorial covers both. We'll build the pipeline with an API key first (easier to test), then show the x402 path that makes it truly autonomous.


Architecture

┌─────────────────────────────────────────────────────────────┐
│                     RAG Agent Pipeline                       │
│                                                              │
│  PDF URL                                                     │
│    │                                                         │
│    ▼                                                         │
│  [pdf-parse]  ──── $0.050 ──→  raw text + chunks            │
│    │                                                         │
│    ▼                                                         │
│  [embedding-l4] ── $0.010 ──→  vector embeddings            │
│    │                                                         │
│    ▼                                                         │
│  cosine similarity (local, free)                             │
│    │                                                         │
│    ▼                                                         │
│  [rerank]  ────── $0.001 ──→  top-3 most relevant chunks    │
│    │                                                         │
│    ▼                                                         │
│  [llm-4090] ───── $0.003 ──→  final answer                  │
│                                                              │
│  Total: ~$0.064 per query  │  1 endpoint  │  1 auth         │
└─────────────────────────────────────────────────────────────┘

All four steps go to POST https://api.gpubridge.xyz/run. One URL. One auth header.
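The envelope really is that uniform. Every call in the pipeline is the same two-key JSON body; only the `service` name and the shape of `input` change (the inputs below mirror the ones used later in this tutorial):

```python
# One request envelope for all four pipeline steps
payloads = [
    {"service": "pdf-parse",    "input": {"file_url": "https://arxiv.org/pdf/1706.03762", "mode": "fast"}},
    {"service": "embedding-l4", "input": {"text": ["What is attention?", "chunk one"]}},
    {"service": "rerank",       "input": {"query": "q", "documents": ["a", "b"], "top_n": 3}},
    {"service": "llm-4090",     "input": {"prompt": "Hello", "max_tokens": 512}},
]

# Every request shares the same two top-level keys
assert all(set(p) == {"service", "input"} for p in payloads)
```

Swapping a step means changing one string, not learning a new SDK.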


Prerequisites

pip install requests numpy

For the x402 path, you'll also need:

pip install web3

Get an API key (or $10 in free credits) at api.gpubridge.xyz/account/register.


The Full Pipeline (API Key Mode)

import requests
import json
import time
import numpy as np

API_KEY = "gpub_your_key_here"
BASE_URL = "https://api.gpubridge.xyz/run"

HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

def run_service(service: str, input_data: dict, poll_interval: float = 1.0) -> dict:
    """
    Submit a job and poll until complete.
    Sync services (rerank, llm-4090) return HTTP 200 directly.
    Async services return a job_id; poll /status/{job_id}.
    """
    payload = {"service": service, "input": input_data}
    resp = requests.post(BASE_URL, headers=HEADERS, json=payload)

    if resp.status_code == 200:
        return resp.json()

    if resp.status_code == 202:
        job_id = resp.json()["job_id"]
        while True:
            status_resp = requests.get(
                f"https://api.gpubridge.xyz/status/{job_id}",
                headers=HEADERS
            )
            data = status_resp.json()
            if data["status"] == "completed":
                return data["output"]
            elif data["status"] == "failed":
                raise RuntimeError(f"Job failed: {data.get('error')}")
            time.sleep(poll_interval)

    resp.raise_for_status()


def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rag_query(pdf_url: str, query: str) -> str:
    """
    Full RAG pipeline: parse → embed → rerank → generate.
    """

    # Step 1: Parse the PDF
    print("📄 Parsing PDF...")
    parse_result = run_service("pdf-parse", {
        "file_url": pdf_url,
        "mode": "fast"
    })

    # Extract text chunks (GPU-Bridge returns structured output)
    full_text = parse_result.get("text", "")
    # Chunk by paragraphs (rough split; use a proper chunker in production)
    chunks = [c.strip() for c in full_text.split("\n\n") if len(c.strip()) > 100]
    print(f"  → {len(chunks)} chunks extracted")

    # Step 2: Embed query + all chunks
    print("🔢 Generating embeddings...")
    all_texts = [query] + chunks
    embed_result = run_service("embedding-l4", {"text": all_texts})
    embeddings = embed_result["embeddings"]

    query_embedding = embeddings[0]
    chunk_embeddings = embeddings[1:]

    # Step 3: Compute similarity, take top-10 candidates
    similarities = [
        cosine_similarity(query_embedding, ce) for ce in chunk_embeddings
    ]
    top_indices = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)[:10]
    candidate_chunks = [chunks[i] for i in top_indices]

    # Step 4: Rerank — cross-encoder picks the top-3 that actually matter
    print("🎯 Reranking candidates...")
    rerank_result = run_service("rerank", {
        "query": query,
        "documents": candidate_chunks,
        "top_n": 3
    })

    # rerank returns results sorted by relevance score
    top_chunks = [r["document"]["text"] for r in rerank_result["results"]]
    context = "\n\n---\n\n".join(top_chunks)

    # Step 5: LLM inference
    print("🤖 Generating answer...")
    prompt = f"""You are a helpful assistant. Answer the question based ONLY on the provided context.

Context:
{context}

Question: {query}

Answer:"""

    llm_result = run_service("llm-4090", {
        "prompt": prompt,
        "max_tokens": 512
    })

    return llm_result.get("text", llm_result.get("output", ""))


if __name__ == "__main__":
    # Example: query a public research paper
    pdf_url = "https://arxiv.org/pdf/1706.03762"  # Attention Is All You Need
    query = "What is the main contribution of the transformer architecture?"

    answer = rag_query(pdf_url, query)
    print(f"\n{'='*60}")
    print(f"Q: {query}")
    print(f"\nA: {answer}")

Run it:

python rag_agent.py

Expected output:

📄 Parsing PDF...
  → 47 chunks extracted
🔢 Generating embeddings...
🎯 Reranking candidates...
🤖 Generating answer...

============================================================
Q: What is the main contribution of the transformer architecture?

A: The Transformer introduces a novel architecture based entirely 
on attention mechanisms, dispensing with recurrence and convolutions. 
The key innovation is multi-head self-attention, which allows the model 
to attend to information from different representation subspaces...
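One optional refinement: Step 3 computes similarities in a Python loop, one dot product per chunk. Since NumPy is already imported, the same top-k selection can be done in a single matrix operation. A drop-in sketch (the helper name is mine, not part of the pipeline above):

```python
import numpy as np

def top_k_by_cosine(query_embedding, chunk_embeddings, k=10):
    """Rank chunks by cosine similarity to the query in one vectorized pass."""
    q = np.asarray(query_embedding, dtype=float)
    m = np.asarray(chunk_embeddings, dtype=float)
    # Row-wise cosine similarity: (M · q) / (|rows| * |q|)
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]   # indices of the k most similar chunks
    return top.tolist(), sims[top].tolist()
```

`top_k_by_cosine(query_embedding, chunk_embeddings)` returns the same candidate indices as the loop, without N Python-level `cosine_similarity` calls.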

x402: Making the Agent Fully Autonomous

Now the interesting part. The x402 protocol lets an agent pay for each request on-chain — no account, no API key, no human in the loop.

Here's how the payment flow works:

Agent                          GPU-Bridge                    Base L2
  │                                │                            │
  │──── POST /run ────────────────>│                            │
  │                                │                            │
  │<─── 402 Payment Required ──────│                            │
  │     {amount: "0.050",          │                            │
  │      token: "USDC",            │                            │
  │      address: "0xB0Fd...6381"} │                            │
  │                                │                            │
  │──── send 0.050 USDC ───────────────────────────────────────>│
  │<─── txHash: "0xabc..." ─────────────────────────────────────│
  │                                │                            │
  │──── POST /run ────────────────>│                            │
  │     X-Payment: base64({        │                            │
  │       txHash: "0xabc...",      │                            │
  │       from: "0xAgent..."       │                            │
  │     })                         │                            │
  │                                │──── verify tx ────────────>│
  │                                │<─── confirmed ─────────────│
  │<─── 200 OK + result ───────────│                            │

The agent sends USDC to 0xB0FdC6030B9f30652e8B221B8090d443Dd3C6381 on Base L2, then retries with the transaction hash in the X-Payment header.
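One detail worth getting right before the full implementation: amounts in the 402 body arrive as decimal strings, and USDC uses 6 on-chain decimals. Converting via `Decimal` sidesteps float rounding entirely (a small helper sketch; the `amount` field name follows the diagram above):

```python
from decimal import Decimal

USDC_DECIMALS = 6  # USDC uses 6 decimal places on-chain

def usdc_to_raw(amount: str) -> int:
    """Convert a decimal USDC amount string (e.g. "0.050") to raw on-chain units."""
    return int(Decimal(amount) * 10 ** USDC_DECIMALS)
```

`usdc_to_raw("0.050")` yields exactly 50000 raw units, with no risk of a float like `0.049999…` truncating to one unit short of the quoted price.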

import json
import base64
import requests
from web3 import Web3

BASE_URL = "https://api.gpubridge.xyz/run"

# Base L2 setup
w3 = Web3(Web3.HTTPProvider("https://mainnet.base.org"))

# ERC-20 minimal ABI for USDC transfers
USDC_ADDRESS = "0x833589fCD6eDb6E08f4c7C32D4f71b54bdA02913"  # USDC on Base
USDC_ABI = [
    {
        "name": "transfer",
        "type": "function",
        "inputs": [
            {"name": "to", "type": "address"},
            {"name": "value", "type": "uint256"}
        ],
        "outputs": [{"name": "", "type": "bool"}]
    }
]

AGENT_PRIVATE_KEY = "0x..."  # Agent's funded wallet
AGENT_ADDRESS = w3.eth.account.from_key(AGENT_PRIVATE_KEY).address
PAYMENT_ADDRESS = "0xB0FdC6030B9f30652e8B221B8090d443Dd3C6381"

usdc = w3.eth.contract(address=Web3.to_checksum_address(USDC_ADDRESS), abi=USDC_ABI)


def pay_and_call(service: str, input_data: dict) -> dict:
    """Call GPU-Bridge with x402 payment. No API key required."""

    payload = {"service": service, "input": input_data}

    # First attempt — expect 402
    resp = requests.post(
        BASE_URL,
        headers={"Content-Type": "application/json"},
        json=payload
    )

    if resp.status_code != 402:
        return resp.json()

    payment_info = resp.json()
    amount_usdc = float(payment_info["amount"])
    amount_raw = int(round(amount_usdc * 1_000_000))  # USDC has 6 decimals; round() guards float truncation

    print(f"  💰 Paying ${amount_usdc:.4f} USDC on Base...")

    # Build and send USDC transfer
    tx = usdc.functions.transfer(
        Web3.to_checksum_address(PAYMENT_ADDRESS),
        amount_raw
    ).build_transaction({
        "from": AGENT_ADDRESS,
        "nonce": w3.eth.get_transaction_count(AGENT_ADDRESS),
        "gas": 100000,
        "maxFeePerGas": w3.eth.gas_price * 2,
        "maxPriorityFeePerGas": w3.to_wei("0.001", "gwei"),
        "chainId": 8453  # Base mainnet
    })

    signed = w3.eth.account.sign_transaction(tx, AGENT_PRIVATE_KEY)
    tx_hash = w3.eth.send_raw_transaction(signed.raw_transaction)  # .rawTransaction on web3.py < v6
    receipt = w3.eth.wait_for_transaction_receipt(tx_hash, timeout=30)

    if receipt.status != 1:
        raise RuntimeError("Payment transaction failed")

    # Retry with payment proof
    payment_header = base64.b64encode(json.dumps({
        "txHash": tx_hash.hex(),
        "from": AGENT_ADDRESS
    }).encode()).decode()

    retry_resp = requests.post(
        BASE_URL,
        headers={
            "Content-Type": "application/json",
            "X-Payment": payment_header
        },
        json=payload
    )

    return retry_resp.json()

Replace run_service() with pay_and_call() and the pipeline becomes fully autonomous. The agent funds itself, pays per request, no humans required.
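The same probe-pay-retry control flow generalizes to any x402 endpoint if you inject the HTTP and payment functions. A transport-agnostic sketch (`call_with_x402`, `post`, and `pay` are illustrative names of mine, not GPU-Bridge APIs):

```python
import base64
import json

def call_with_x402(post, pay, url, payload, agent_address):
    """Generic x402 retry loop: POST, pay on HTTP 402, retry with proof.

    `post(url, json=..., headers=...)` is any requests-style POST function and
    `pay(payment_info) -> tx_hash` performs the on-chain USDC transfer; both
    are injected so the flow itself stays transport- and chain-agnostic.
    """
    headers = {"Content-Type": "application/json"}
    resp = post(url, json=payload, headers=headers)
    if resp.status_code != 402:
        return resp.json()          # free endpoint, or already paid

    tx_hash = pay(resp.json())      # pay the amount/address quoted in the 402 body
    headers["X-Payment"] = base64.b64encode(json.dumps(
        {"txHash": tx_hash, "from": agent_address}
    ).encode()).decode()
    return post(url, json=payload, headers=headers).json()
```

Injecting `post` and `pay` also makes the loop testable with stubs, so the payment logic can be verified without touching mainnet.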


Cost Breakdown

Step                  Service       Cost    Notes
Parse PDF             pdf-parse     $0.050  Per document; PDF/DOCX/PPTX/images
Embed query + chunks  embedding-l4  $0.010  Per request, any batch size
Rerank top-10         rerank        $0.001  Jina AI cross-encoder, 89 languages
LLM answer            llm-4090      $0.003  Llama 3.3 70B, 512 tokens output
Total                               $0.064  Per end-to-end PDF query

For a knowledge base with 100 PDFs pre-indexed (parse + embed once, query many times), the per-query cost drops to $0.004 (rerank + LLM only).
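As a quick sanity check on those numbers:

```python
# Per-request prices from the cost table above
COSTS = {"pdf-parse": 0.050, "embedding-l4": 0.010, "rerank": 0.001, "llm-4090": 0.003}

cold = sum(COSTS.values())                  # first query: parse + embed + rerank + LLM
warm = COSTS["rerank"] + COSTS["llm-4090"]  # pre-indexed KB: rerank + LLM only

print(f"cold: ${cold:.3f}  warm: ${warm:.3f}")  # → cold: $0.064  warm: $0.004
```

Amortizing the one-time parse and embed cost over many queries is where the 16x drop comes from.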


Comparison: GPU-Bridge vs. Separate Providers

Capability            GPU-Bridge     Alternative                     GPU-Bridge Price   Alt Price             Savings
PDF Parsing           pdf-parse      AWS Textract                    ~$0.001/page       $1.50/1,000 pages     ~1.5x cheaper
Embeddings            embedding-l4   OpenAI text-embedding-3-small   $0.010/req         $0.020/1M tokens      Comparable
Reranking             rerank         Cohere Rerank                   $0.001/req         $2.00/1K requests     ~2x cheaper
LLM Inference         llm-4090       OpenAI GPT-4o                   ~$0.003-0.06       $0.015-0.060/1K tok   Competitive
Autonomous payments   x402 native    ❌ Not available                Included           N/A                   Only option
Accounts required     0 (x402)       4+                              N/A                N/A                   N/A
API keys required     0 (x402)       4+                              N/A                N/A                   N/A

Reranking shows the clearest per-request gap. Cohere charges $2.00 per 1,000 rerank requests, i.e. $0.002 each, versus GPU-Bridge's flat $0.001. At a million reranks a month that's $2,000 versus $1,000, and the saving compounds with volume. The bigger win, though, is structural: one vendor, one invoice, one auth scheme instead of four.


Why x402 Matters for Agents

Current agentic frameworks assume there's a human holding the credit card. LangChain, AutoGen, CrewAI — they all externalize billing to "whatever API keys you configure." This works for demos. It breaks at scale.

x402 (named after HTTP 402 "Payment Required", a status code reserved since HTTP/1.1 but almost never used, now revived as a micropayment protocol) flips the model:

  • The agent has a wallet. It holds USDC on Base L2.
  • Services declare their price. 402 + amount in the response.
  • The agent pays atomically. USDC transfer on-chain, tx hash in the retry.
  • No human involvement. No billing portal, no invoice, no credit limit.

This is how agent-to-agent economies work. One agent orchestrates, spawns sub-agents, pays for their compute out of its own wallet. The entire pipeline is on-chain and auditable.

GPU-Bridge is the first AI inference provider to implement x402 natively across its full service catalog. That means pdf-parse, embeddings, reranking, LLM inference, image generation, TTS — all payable by an autonomous agent with nothing but USDC and a wallet address.


What's Available on GPU-Bridge

Beyond the RAG stack in this tutorial:

  • Image generation — SDXL, Flux (RTX 4090)
  • Audio/TTS — Kokoro, Bark, Whisper transcription
  • Video — AnimateDiff, frame interpolation
  • Vision — LLaVA 1.6, image captioning
  • Utilities — background removal, upscaling, music generation

All through the same POST /run endpoint. All payable with API key or x402.


Get Started

Developer path (API key):

  1. Register at api.gpubridge.xyz/account/register
  2. Add $10 in credits (minimum)
  3. Run the script above

Agent path (x402):

  1. Fund a wallet with USDC on Base L2
  2. Point the agent at https://api.gpubridge.xyz/run
  3. Implement the payment retry loop above
  4. No account needed


The full pipeline code is in this article. Copy it, replace the API key or wallet, and you have a working autonomous RAG agent for $0.064 per PDF query.


GPU-Bridge · One endpoint for the full AI stack · x402 native · No account required
