
GPU-Bridge
GPU-Bridge + LlamaIndex: Embeddings and Reranking in One Line

Most RAG pipelines manage 3-4 separate billing accounts for embeddings, reranking, LLM, and document parsing. GPU-Bridge collapses all of that into one endpoint.

We just shipped two LlamaIndex integrations:

  • llama-index-embeddings-gpubridge — high-throughput text embeddings
  • llama-index-postprocessor-gpubridge-rerank — semantic reranking

Here's how to use them.

Install

pip install llama-index-embeddings-gpubridge
pip install llama-index-postprocessor-gpubridge-rerank

Get an API key at gpubridge.xyz (free to start).

Embeddings

from llama_index.embeddings.gpubridge import GPUBridgeEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Set your embed model
embed_model = GPUBridgeEmbedding(api_key="gpub_...")

# Build an index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embed_model,
)

# Query
query_engine = index.as_query_engine(embed_model=embed_model)
response = query_engine.query("What is GPU-Bridge?")
print(response)

Cost: ~$0.00002/call. For 100k document chunks, that's $2. Compare to OpenAI at $0.0001/1k tokens: for chunks of roughly 1k tokens each, GPU-Bridge embeddings work out about 5x cheaper.
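As a sanity check on those numbers, here is the arithmetic as a tiny script (prices taken from above; one API call per chunk is an assumption):

```python
EMBED_PRICE_PER_CALL = 0.00002  # ~$ per embedding call, per the pricing above

def embedding_cost(num_chunks: int) -> float:
    """Total cost in dollars, assuming one embedding call per chunk."""
    return num_chunks * EMBED_PRICE_PER_CALL

print(f"100k chunks: ${embedding_cost(100_000):.2f}")  # roughly $2
```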

Reranking

After vector search, reranking significantly improves result quality by reordering candidates by semantic relevance:

from llama_index.embeddings.gpubridge import GPUBridgeEmbedding
from llama_index.postprocessor.gpubridge_rerank import GPUBridgeRerank
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.query_engine import RetrieverQueryEngine

embed_model = GPUBridgeEmbedding(api_key="gpub_...")
reranker = GPUBridgeRerank(api_key="gpub_...", top_n=3)

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=10)  # retrieve 10, rerank to 3

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[reranker],
)

response = query_engine.query("Your question here")
print(response)

Reranking typically improves retrieval quality by 15-30% on standard benchmarks by catching semantically relevant results that vector similarity missed.
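Mechanically, a rerank postprocessor just rescores each (query, candidate) pair and keeps the best `top_n`. A toy stand-in for illustration (the scoring function here is a word-overlap placeholder, not the GPU-Bridge model, which uses a learned cross-encoder):

```python
def rerank(query: str, candidates: list[str], score, top_n: int = 3) -> list[str]:
    """Re-order candidates by score(query, candidate) and keep the top_n."""
    ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_n]

# Placeholder scorer: counts shared words. A real reranker reads the
# query and candidate together and outputs a relevance score.
def overlap(query: str, text: str) -> float:
    return len(set(query.lower().split()) & set(text.lower().split()))

docs = [
    "GPUs accelerate matrix math",
    "GPU pricing is per call",
    "Cats sleep a lot",
]
print(rerank("GPU pricing", docs, overlap, top_n=2))
```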

Full RAG Pipeline (Embed + Rerank + LLM)

from llama_index.embeddings.gpubridge import GPUBridgeEmbedding
from llama_index.postprocessor.gpubridge_rerank import GPUBridgeRerank
from llama_index.llms.openai import OpenAI  # or any LLM
from llama_index.core import VectorStoreIndex, Settings, SimpleDirectoryReader
from llama_index.core.query_engine import RetrieverQueryEngine

# Configure GPU-Bridge for embed + rerank, and an LLM for generation
Settings.embed_model = GPUBridgeEmbedding(api_key="gpub_...")
Settings.llm = OpenAI()
reranker = GPUBridgeRerank(api_key="gpub_...", top_n=5)

# Load and index
docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)

# Query with reranking
engine = RetrieverQueryEngine(
    retriever=index.as_retriever(similarity_top_k=15),
    node_postprocessors=[reranker],
)

print(engine.query("Explain the architecture"))

x402 Autonomous Payments

For agents that need to pay for their own compute without a pre-registered API key:

# No API key needed — agent pays autonomously via x402
embed_model = GPUBridgeEmbedding()  # will use x402 if wallet configured

The agent receives an HTTP 402 response, pays with USDC on Base L2, and the embedding executes. No human in the approval loop.
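The handshake above can be sketched as a retry loop. Names like `pay` and the `X-Payment` header are illustrative, not the real wire format, which is defined by the x402 protocol:

```python
def call_with_x402(send, pay):
    """send(headers) -> (status, body); pay(invoice) -> payment proof string.

    First attempt goes out unauthenticated. On HTTP 402, the body carries
    payment details; settle them (e.g. USDC on Base), then retry the same
    request with proof of payment attached.
    """
    status, body = send(headers={})
    if status == 402:
        proof = pay(body)
        status, body = send(headers={"X-Payment": proof})
    return status, body
```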

Available Services

GPU-Bridge exposes 30 services via POST https://api.gpubridge.xyz/run:

| Service | Use case | Price |
| --- | --- | --- |
| embedding-l4 | Text embeddings | ~$0.00002/call |
| rerank | Semantic reranking | per call |
| llm-4090 | LLM inference (Llama 70B) | ~$0.06/call |
| pdf-parse | PDF extraction | per call |
| whisper-l4 | Audio transcription | per minute |

Full catalog: curl https://api.gpubridge.xyz/catalog
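Outside LlamaIndex, any service can be hit directly over HTTP. A minimal sketch that assembles the request pieces, assuming a `{"service": ..., "input": ...}` body shape (an assumption for illustration; check the catalog for each service's real schema):

```python
import json

API_URL = "https://api.gpubridge.xyz/run"

def build_run_request(service: str, payload: dict, api_key: str) -> dict:
    """Assemble the URL, headers, and JSON body for a POST /run call."""
    return {
        "url": API_URL,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"service": service, "input": payload}),
    }

req = build_run_request("embedding-l4", {"text": "hello world"}, "gpub_...")
# then e.g.: requests.post(req["url"], headers=req["headers"], data=req["body"])
```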

GPU-Bridge is in early access. Sign up at gpubridge.xyz to get an API key.
