Build a Fully Autonomous RAG Agent That Pays for Its Own Compute (x402 + GPU-Bridge)
What if your AI agent could read a PDF, understand it, and answer questions — without you ever touching a credit card?
That's not a thought experiment. It's about a hundred lines of Python. Let's build it.
The Problem With RAG Pipelines Today
A standard production RAG pipeline needs:
- PDF parsing — AWS Textract, or Adobe PDF Services, or Unstructured.io
- Embeddings — OpenAI, Cohere, or Voyage AI
- Reranking — Cohere Rerank, or Jina AI
- LLM inference — OpenAI, Anthropic, or Together.ai
That's 4 billing accounts, 4 API keys, 4 rate limit policies, and 4 dashboards to monitor. Every service has its own signup flow, its own credit card requirement, its own auth scheme.
Worse: if you're building autonomous AI agents, none of this works. An agent can't create a Stripe account. It can't fill out a CAPTCHA. It can't receive a verification email. The entire payment infrastructure of the modern API economy is designed for humans, not machines.
The Solution: GPU-Bridge + x402
GPU-Bridge is a unified GPU inference API that covers the full AI stack — document parsing, embeddings, reranking, LLM inference, image generation, audio, and more — through a single endpoint: https://api.gpubridge.xyz/run.
Two payment modes:
- Stripe credits — for developers: register with email, prepay credits, get an API key. Standard workflow.
- x402 — for agents: no account, no API key, no signup. The server returns HTTP 402 Payment Required; your agent pays with USDC on Base L2, retries, and gets the result.
This tutorial covers both. We'll build the pipeline with an API key first (easier to test), then show the x402 path that makes it truly autonomous.
Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                     RAG Agent Pipeline                      │
│                                                             │
│  PDF URL                                                    │
│    │                                                        │
│    ▼                                                        │
│  [pdf-parse] ───── $0.050 ──→ raw text + chunks             │
│    │                                                        │
│    ▼                                                        │
│  [embedding-l4] ── $0.010 ──→ vector embeddings             │
│    │                                                        │
│    ▼                                                        │
│  cosine similarity (local, free)                            │
│    │                                                        │
│    ▼                                                        │
│  [rerank] ──────── $0.001 ──→ top-3 most relevant chunks    │
│    │                                                        │
│    ▼                                                        │
│  [llm-4090] ────── $0.003 ──→ final answer                  │
│                                                             │
│  Total: ~$0.064 per query │ 1 endpoint │ 1 auth             │
└─────────────────────────────────────────────────────────────┘
```
All four steps go to POST https://api.gpubridge.xyz/run. One URL. One auth header.
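Concretely, every step in the diagram is the same HTTP call with a different `service` name. Here is a minimal sketch of the shared request envelope — the field names mirror the pipeline code later in this article, and the exact `input` schemas should be verified against the docs:

```python
import json

def build_call(service: str, input_data: dict, api_key: str) -> tuple[dict, dict]:
    """Return (headers, payload) for a GPU-Bridge /run call.

    Only the service name and input body change between pipeline steps;
    the URL and auth header stay identical.
    """
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {"service": service, "input": input_data}
    return headers, payload

# The four pipeline steps differ only in these two arguments
# (input fields here are illustrative, taken from the code below):
steps = [
    ("pdf-parse",    {"file_url": "https://example.com/paper.pdf", "mode": "fast"}),
    ("embedding-l4", {"text": ["what is attention?"]}),
    ("rerank",       {"query": "q", "documents": ["a", "b"], "top_n": 1}),
    ("llm-4090",     {"prompt": "Answer:", "max_tokens": 128}),
]
for service, input_data in steps:
    headers, payload = build_call(service, input_data, "gpub_demo_key")
    print(service, "->", json.dumps(payload)[:60])
```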
Prerequisites
```bash
pip install requests numpy
```
For the x402 path, you'll also need:
```bash
pip install web3
```
Get an API key (or $10 in free credits) at api.gpubridge.xyz/account/register.
The Full Pipeline (API Key Mode)
```python
import requests
import json
import time
import numpy as np

API_KEY = "gpub_your_key_here"
BASE_URL = "https://api.gpubridge.xyz/run"

HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

def run_service(service: str, input_data: dict, poll_interval: float = 1.0) -> dict:
    """
    Submit a job and poll until complete.
    Sync services (rerank, llm-4090) return HTTP 200 directly.
    Async services return a job_id; poll /status/{job_id}.
    """
    payload = {"service": service, "input": input_data}
    resp = requests.post(BASE_URL, headers=HEADERS, json=payload)
    if resp.status_code == 200:
        return resp.json()
    if resp.status_code == 202:
        job_id = resp.json()["job_id"]
        while True:
            status_resp = requests.get(
                f"https://api.gpubridge.xyz/status/{job_id}",
                headers=HEADERS
            )
            data = status_resp.json()
            if data["status"] == "completed":
                return data["output"]
            elif data["status"] == "failed":
                raise RuntimeError(f"Job failed: {data.get('error')}")
            time.sleep(poll_interval)
    resp.raise_for_status()

def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_query(pdf_url: str, query: str) -> str:
    """
    Full RAG pipeline: parse → embed → rerank → generate.
    """
    # Step 1: Parse the PDF
    print("📄 Parsing PDF...")
    parse_result = run_service("pdf-parse", {
        "file_url": pdf_url,
        "mode": "fast"
    })
    # Extract text chunks (GPU-Bridge returns structured output)
    full_text = parse_result.get("text", "")
    # Chunk by paragraphs (rough split; use a proper chunker in production)
    chunks = [c.strip() for c in full_text.split("\n\n") if len(c.strip()) > 100]
    print(f"   → {len(chunks)} chunks extracted")

    # Step 2: Embed query + all chunks
    print("🔢 Generating embeddings...")
    all_texts = [query] + chunks
    embed_result = run_service("embedding-l4", {"text": all_texts})
    embeddings = embed_result["embeddings"]
    query_embedding = embeddings[0]
    chunk_embeddings = embeddings[1:]

    # Step 3: Compute similarity, take top-10 candidates
    similarities = [
        cosine_similarity(query_embedding, ce) for ce in chunk_embeddings
    ]
    top_indices = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)[:10]
    candidate_chunks = [chunks[i] for i in top_indices]

    # Step 4: Rerank — cross-encoder picks the top-3 that actually matter
    print("🎯 Reranking candidates...")
    rerank_result = run_service("rerank", {
        "query": query,
        "documents": candidate_chunks,
        "top_n": 3
    })
    # rerank returns results sorted by relevance score
    top_chunks = [r["document"]["text"] for r in rerank_result["results"]]
    context = "\n\n---\n\n".join(top_chunks)

    # Step 5: LLM inference
    print("🤖 Generating answer...")
    prompt = f"""You are a helpful assistant. Answer the question based ONLY on the provided context.

Context:
{context}

Question: {query}

Answer:"""
    llm_result = run_service("llm-4090", {
        "prompt": prompt,
        "max_tokens": 512
    })
    return llm_result.get("text", llm_result.get("output", ""))

if __name__ == "__main__":
    # Example: query a public research paper
    pdf_url = "https://arxiv.org/pdf/1706.03762"  # Attention Is All You Need
    query = "What is the main contribution of the transformer architecture?"
    answer = rag_query(pdf_url, query)
    print(f"\n{'='*60}")
    print(f"Q: {query}")
    print(f"\nA: {answer}")
```
Run it:
```bash
python rag_agent.py
```
Expected output:
```
📄 Parsing PDF...
   → 47 chunks extracted
🔢 Generating embeddings...
🎯 Reranking candidates...
🤖 Generating answer...

============================================================
Q: What is the main contribution of the transformer architecture?

A: The Transformer introduces a novel architecture based entirely
on attention mechanisms, dispensing with recurrence and convolutions.
The key innovation is multi-head self-attention, which allows the model
to attend to information from different representation subspaces...
```
x402: Making the Agent Fully Autonomous
Now the interesting part. The x402 protocol lets an agent pay for each request on-chain — no account, no API key, no human in the loop.
Here's how the payment flow works:
```
Agent                           GPU-Bridge                    Base L2
  │                                 │                            │
  │──── POST /run ─────────────────>│                            │
  │                                 │                            │
  │<─── 402 Payment Required ───────│                            │
  │     {amount: "0.050",           │                            │
  │      token: "USDC",             │                            │
  │      address: "0xB0Fd...6381"}  │                            │
  │                                 │                            │
  │──── send 0.050 USDC ───────────────────────────────────────>│
  │                                 │                            │
  │<─── txHash: "0xabc..." ────────────────────────────────────│
  │                                 │                            │
  │──── POST /run ─────────────────>│                            │
  │     X-Payment: base64({         │                            │
  │       txHash: "0xabc...",       │                            │
  │       from: "0xAgent..."        │                            │
  │     })                          │                            │
  │                                 │──── verify tx ────────────>│
  │                                 │<─── confirmed ─────────────│
  │<─── 200 OK + result ────────────│                            │
```
The agent sends USDC to 0xB0FdC6030B9f30652e8B221B8090d443Dd3C6381 on Base L2, then retries with the transaction hash in the X-Payment header.
```python
import json
import base64
import requests
from web3 import Web3

BASE_URL = "https://api.gpubridge.xyz/run"

# Base L2 setup
w3 = Web3(Web3.HTTPProvider("https://mainnet.base.org"))

# ERC-20 minimal ABI for USDC transfers
USDC_ADDRESS = "0x833589fCD6eDb6E08f4c7C32D4f71b54bdA02913"  # USDC on Base
USDC_ABI = [
    {
        "name": "transfer",
        "type": "function",
        "inputs": [
            {"name": "to", "type": "address"},
            {"name": "value", "type": "uint256"}
        ],
        "outputs": [{"name": "", "type": "bool"}]
    }
]

AGENT_PRIVATE_KEY = "0x..."  # Agent's funded wallet
AGENT_ADDRESS = w3.eth.account.from_key(AGENT_PRIVATE_KEY).address
PAYMENT_ADDRESS = "0xB0FdC6030B9f30652e8B221B8090d443Dd3C6381"

usdc = w3.eth.contract(address=Web3.to_checksum_address(USDC_ADDRESS), abi=USDC_ABI)

def pay_and_call(service: str, input_data: dict) -> dict:
    """Call GPU-Bridge with x402 payment. No API key required."""
    payload = {"service": service, "input": input_data}

    # First attempt — expect 402
    resp = requests.post(
        BASE_URL,
        headers={"Content-Type": "application/json"},
        json=payload
    )
    if resp.status_code != 402:
        return resp.json()

    payment_info = resp.json()
    amount_usdc = float(payment_info["amount"])
    amount_raw = int(amount_usdc * 1_000_000)  # USDC has 6 decimals
    print(f"   💰 Paying ${amount_usdc:.4f} USDC on Base...")

    # Build and send USDC transfer
    tx = usdc.functions.transfer(
        Web3.to_checksum_address(PAYMENT_ADDRESS),
        amount_raw
    ).build_transaction({
        "from": AGENT_ADDRESS,
        "nonce": w3.eth.get_transaction_count(AGENT_ADDRESS),
        "gas": 100000,
        "maxFeePerGas": w3.eth.gas_price * 2,
        "maxPriorityFeePerGas": w3.to_wei("0.001", "gwei"),
        "chainId": 8453  # Base mainnet
    })
    signed = w3.eth.account.sign_transaction(tx, AGENT_PRIVATE_KEY)
    # Note: web3.py v7 renamed .rawTransaction to .raw_transaction;
    # use whichever matches your installed version.
    tx_hash = w3.eth.send_raw_transaction(signed.rawTransaction)
    receipt = w3.eth.wait_for_transaction_receipt(tx_hash, timeout=30)
    if receipt.status != 1:
        raise RuntimeError("Payment transaction failed")

    # Retry with payment proof (w3.to_hex guarantees a 0x-prefixed hash)
    payment_header = base64.b64encode(json.dumps({
        "txHash": w3.to_hex(tx_hash),
        "from": AGENT_ADDRESS
    }).encode()).decode()
    retry_resp = requests.post(
        BASE_URL,
        headers={
            "Content-Type": "application/json",
            "X-Payment": payment_header
        },
        json=payload
    )
    return retry_resp.json()
```
Replace run_service() with pay_and_call() and the pipeline becomes fully autonomous. The agent funds itself, pays per request, no humans required.
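The only protocol-specific artifact in all of this is the `X-Payment` header: base64-encoded JSON, easy to build and verify in isolation. A small sketch, using the two fields shown in the flow diagram above (any additional fields the server may accept are not covered here):

```python
import base64
import json

def encode_payment_header(tx_hash: str, from_address: str) -> str:
    """Encode an x402 payment proof as base64(JSON) for the X-Payment header."""
    proof = {"txHash": tx_hash, "from": from_address}
    return base64.b64encode(json.dumps(proof).encode()).decode()

def decode_payment_header(header: str) -> dict:
    """What the server does on its side: decode and inspect the proof."""
    return json.loads(base64.b64decode(header))

# Round-trip check with dummy values
header = encode_payment_header("0xabc123", "0xAgentWallet")
assert decode_payment_header(header) == {"txHash": "0xabc123", "from": "0xAgentWallet"}
print(header)
```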
Cost Breakdown
| Step | Service | Cost | Notes |
|---|---|---|---|
| Parse PDF | `pdf-parse` | $0.050 | Per document; PDF/DOCX/PPTX/images |
| Embed query + chunks | `embedding-l4` | $0.010 | Per request, any batch size |
| Rerank top-10 | `rerank` | $0.001 | Jina AI cross-encoder, 89 languages |
| LLM answer | `llm-4090` | $0.003 | Llama 3.3 70B, 512 tokens output |
| **Total** | | **$0.064** | Per end-to-end PDF query |
For a knowledge base with 100 PDFs pre-indexed (parse + embed once, query many times), the per-query cost drops to $0.004 (rerank + LLM only).
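The arithmetic behind that claim, as a quick sanity check (per-step prices from the cost table above; the 10,000-query volume is an illustrative assumption, not a figure from the pricing page):

```python
# Per-step prices from the cost table above (USD).
PARSE, EMBED, RERANK, LLM = 0.050, 0.010, 0.001, 0.003

# Cold query: parse + embed + rerank + generate.
cold = PARSE + EMBED + RERANK + LLM
print(f"cold query: ${cold:.3f}")                      # $0.064

# Pre-indexed knowledge base: pay parse + embed once per document,
# then only rerank + LLM per query.
n_docs, n_queries = 100, 10_000
index_once = n_docs * (PARSE + EMBED)
per_query = RERANK + LLM
print(f"index 100 PDFs once: ${index_once:.2f}")       # $6.00
print(f"per query after indexing: ${per_query:.3f}")   # $0.004
print(f"10k queries total: ${index_once + n_queries * per_query:.2f}")
```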
Comparison: GPU-Bridge vs. Separate Providers
| Capability | GPU-Bridge | Alternative | GPU-Bridge Price | Alt Price | Savings |
|---|---|---|---|---|---|
| PDF Parsing | `pdf-parse` | AWS Textract | ~$0.001/page | $1.50/1000 pages | ~1.5x cheaper |
| Embeddings | `embedding-l4` | OpenAI text-embedding-3-small | $0.010/req | $0.020/1M tokens | Comparable |
| Reranking | `rerank` | Cohere Rerank | $0.001/req | $2.00/1K requests | 2x cheaper |
| LLM Inference | `llm-4090` | OpenAI GPT-4o | ~$0.003–0.06 | $0.015–0.060/1K tok | Competitive |
| Autonomous payments | x402 native | ❌ Not available | Included | N/A | Only option |
| Accounts required | 0 (x402) | 4+ | — | — | ✓ |
| API keys required | 0 (x402) | 4+ | — | — | ✓ |
The reranking line deserves a closer look. Cohere charges $2.00 per 1,000 rerank requests — $0.002 each — while GPU-Bridge charges $0.001 per request, half the price. At high query volume, halving that line item takes real money off the monthly bill, and it comes bundled with the rest of the stack under one endpoint.
Why x402 Matters for Agents
Current agentic frameworks assume there's a human holding the credit card. LangChain, AutoGen, CrewAI — they all externalize billing to "whatever API keys you configure." This works for demos. It breaks at scale.
x402 takes its name from HTTP status code 402 ("Payment Required"), reserved since HTTP/1.1 but almost never used — and now revived as a micropayment protocol. It flips the model:
- The agent has a wallet. It holds USDC on Base L2.
- Services declare their price. HTTP 402 plus an `amount` field in the response.
- The agent pays atomically. A USDC transfer on-chain, tx hash in the retry.
- No human involvement. No billing portal, no invoice, no credit limit.
This is how agent-to-agent economies work. One agent orchestrates, spawns sub-agents, pays for their compute out of its own wallet. The entire pipeline is on-chain and auditable.
GPU-Bridge is the first AI inference provider to implement x402 natively across its full service catalog. That means pdf-parse, embeddings, reranking, LLM inference, image generation, TTS — all payable by an autonomous agent with nothing but USDC and a wallet address.
What's Available on GPU-Bridge
Beyond the RAG stack in this tutorial:
- Image generation — SDXL, Flux (RTX 4090)
- Audio/TTS — Kokoro, Bark, Whisper transcription
- Video — AnimateDiff, frame interpolation
- Vision — LLaVA 1.6, image captioning
- Utilities — background removal, upscaling, music generation
All through the same POST /run endpoint. All payable with API key or x402.
Get Started
Developer path (API key):
- Register at api.gpubridge.xyz/account/register
- Add $10 in credits (minimum)
- Run the script above
Agent path (x402):
- Fund a wallet with USDC on Base L2
- Point the agent at `https://api.gpubridge.xyz/run`
- Implement the payment retry loop above
- No account needed
Links:
- API: api.gpubridge.xyz
- Docs: gpubridge.xyz
- X/Twitter: @gpubridge
The full pipeline code is in this article. Copy it, replace the API key or wallet, and you have a working autonomous RAG agent for $0.064 per PDF query.
GPU-Bridge · One endpoint for the full AI stack · x402 native · No account required