Your OpenAI bill is $800 this month. But which feature is consuming the most tokens? Which users hit GPT-4 vs GPT-3.5? Are duplicate queries burning money? You'd need to instrument every LLM call with custom logging. Or you could add one line of code and get all of this through Helicone.
What Helicone Actually Does
Helicone is an open-source LLM observability proxy. You change your OpenAI base URL from api.openai.com to oai.helicone.ai, and Helicone transparently logs every request — tokens, latency, cost, user, model, prompt, response — while forwarding to OpenAI normally.
No SDK changes. No wrapper functions. One URL change gives you: request logging, cost analytics, user tracking, response caching, rate limiting, prompt versioning, and custom property tagging.
Helicone supports OpenAI, Anthropic, Azure OpenAI, Google Gemini, and any OpenAI-compatible API. Run it self-hosted (open source, Apache 2.0) or use Helicone Cloud (free tier: 100K requests/month).
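Each provider gets its own proxy host. As a small sketch, a lookup table keeps the URLs in one place — the OpenAI host is the one used throughout this article, the Anthropic host is the one Helicone documents for its Anthropic gateway; verify the host for any other provider against Helicone's docs before relying on it:

```python
# Illustrative mapping of providers to Helicone proxy base URLs.
# Only openai/anthropic are listed here; check Helicone's docs for others.
HELICONE_PROXY_HOSTS = {
    "openai": "https://oai.helicone.ai/v1",
    "anthropic": "https://anthropic.helicone.ai",
}

def helicone_base_url(provider: str) -> str:
    """Return the Helicone proxy base URL for a known provider."""
    try:
        return HELICONE_PROXY_HOSTS[provider]
    except KeyError:
        raise ValueError(f"No Helicone proxy host configured for {provider!r}")
```

You would pass the returned URL as `base_url` when constructing the provider's client, exactly as in the quick-start snippet below.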
Quick Start
The simplest integration — just change the base URL:
import openai

client = openai.OpenAI(
    api_key="sk-your-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-key"
    }
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)
# Request is logged in the Helicone dashboard automatically
With user tracking and custom properties:
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": query}],
    extra_headers={
        "Helicone-User-Id": str(user.id),  # header values must be strings
        "Helicone-Property-Feature": "code-review",
        "Helicone-Property-Tier": user.plan
    }
)
Node.js / curl — same pattern:
curl https://oai.helicone.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-your-openai-key" \
  -H "Helicone-Auth: Bearer your-helicone-key" \
  -H "Helicone-User-Id: user-123" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}'
3 Practical Use Cases
1. Response Caching (Save Money)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is Python?"}],
    extra_headers={
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600"  # Cache for 1 hour
    }
)
# Second identical request → served from cache, $0 cost, <10ms latency
For FAQ-style queries, caching can cut costs by 30-50%.
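The saving is easy to estimate: cache hits cost nothing, so expected spend scales with the miss rate. A tiny illustrative helper (the `hit_rate` you would plug in comes from your own traffic, e.g. the cache-hit ratio shown in the dashboard):

```python
def cached_monthly_cost(base_cost: float, hit_rate: float) -> float:
    """Estimate spend after caching: hits cost $0, misses pay full price."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return base_cost * (1.0 - hit_rate)

# The $800/month bill from the intro, with 40% of queries served from cache:
print(cached_monthly_cost(800, 0.4))
```

At a 40% hit rate, the hypothetical $800 bill drops to $480 — the 30-50% range above corresponds to hit rates in that same band.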
2. Rate Limiting Per User
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    extra_headers={
        "Helicone-User-Id": "user-123",
        "Helicone-RateLimit-Policy": "100;w=3600;s=user",
        # 100 requests per hour, segmented per user (via Helicone-User-Id)
    }
)
Prevent abuse without building rate limiting infrastructure.
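The policy value follows a `quota;w=window` pattern, with an optional `;s=segment` suffix to scope the limit (per user, per custom property). Building the string with a helper avoids typos — note that `rate_limit_policy` is my own illustrative function name, not part of any SDK:

```python
def rate_limit_policy(quota: int, window_seconds: int, segment=None) -> str:
    """Build a Helicone-RateLimit-Policy header value.

    Format: "<quota>;w=<window_seconds>", plus ";s=<segment>" to scope
    the limit (e.g. "user" to limit per Helicone-User-Id).
    """
    policy = f"{quota};w={window_seconds}"
    if segment:
        policy += f";s={segment}"
    return policy

# 100 requests per hour, per user:
print(rate_limit_policy(100, 3600, "user"))
```

When a caller exceeds the policy, the proxy rejects the request before it reaches OpenAI, so the over-quota call costs you nothing.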
3. Prompt Versioning and Experiments
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "system", "content": system_prompt_v2}],
    extra_headers={
        "Helicone-Prompt-Id": "summarizer",
        "Helicone-Prompt-Version": "2.0"
    }
)
Compare v1 vs v2 in the dashboard — latency, cost, user satisfaction scores.
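A minimal A/B split can be done entirely client-side: pick a version at random for each request and tag the headers so the dashboard separates the two populations. A sketch with hypothetical prompt text and a `pick_prompt` helper of my own (not a Helicone API):

```python
import random

# Hypothetical prompt variants for the "summarizer" experiment.
PROMPTS = {
    "1.0": "Summarize the text in one paragraph.",
    "2.0": "Summarize the text in three bullet points.",
}

def pick_prompt(v2_share: float = 0.5, rng=random) -> tuple:
    """Return (system_prompt, extra_headers) for a randomly chosen version.

    v2_share is the fraction of traffic routed to version 2.0.
    """
    version = "2.0" if rng.random() < v2_share else "1.0"
    headers = {
        "Helicone-Prompt-Id": "summarizer",
        "Helicone-Prompt-Version": version,
    }
    return PROMPTS[version], headers
```

Pass the returned headers as `extra_headers` on each call; after enough traffic, filter the dashboard by `Helicone-Prompt-Version` to compare the variants.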
Why This Matters
Helicone is one of the lowest-friction ways to add observability to an LLM application: one URL change, no code refactoring, and instant visibility into costs, performance, and usage patterns. For a team spending more than $100/mo on LLM APIs, caching alone can pay for it.
Need custom data extraction or web scraping solutions? I build production-grade scrapers and data pipelines. Check out my Apify actors or email me at spinov001@gmail.com for custom projects.
Follow me for more free API discoveries every week!