OpenAI's Responses API is replacing Chat Completions, and we just shipped full support in Bifrost v1.3.0. After implementing both APIs, I can tell you - the new one is better designed.
maximhq/bifrost: Fastest LLM gateway (50x faster than LiteLLM) with an adaptive load balancer, cluster mode, guardrails, support for 1000+ models, and <100 µs overhead at 5k RPS.
Bifrost
The fastest way to build AI applications that never go down
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
# Install and run locally
npx -y @maximhq/bifrost
# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
Step 2: Configure via Web UI
# Open the built-in web interface
open http://localhost:8080
Step 3: Make your first API call
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello, Bifrost!"}]
}'
That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…
What Changed (And Why It Matters)
Chat Completions was OpenAI's first chat-focused LLM API, and it shows its age. The messages array gets unwieldy, tool calls are messy, and conversation state management is all client-side. The Responses API fixes this.
Input simplified. Instead of an array of message objects, you pass input - either a string or structured data. For simple prompts, this is way cleaner. For tool results, you use function_call_output objects instead of jamming them into the messages array.
Conversation chaining built-in. The previous_response_id parameter chains requests automatically. No more manually tracking message history. The API handles state.
Tool calls redesigned. In Chat Completions, tool calls live in message.tool_calls and results go back as tool-role messages. Responses puts tool calls in output with type function_call, and results use function_call_output. This is more explicit and easier to parse.
Structured output first-class. JSON mode was always a hack in Chat Completions. Responses has native json_schema support with validation.
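To make that concrete, here's a minimal sketch of a structured-output request sent to a local Bifrost instance, written in Go since that's the language Bifrost itself uses. The text.format layout follows OpenAI's published Responses spec as we read it, so verify the field names against the docs for the API version you target.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Build a /v1/responses request that constrains the output to a JSON schema.
	body, err := json.Marshal(map[string]any{
		"model": "openai/gpt-4o-mini",
		"input": "Extract the city and country from: I live in Oslo, Norway.",
		"text": map[string]any{
			"format": map[string]any{
				"type":   "json_schema",
				"name":   "location",
				"strict": true,
				"schema": map[string]any{
					"type": "object",
					"properties": map[string]any{
						"city":    map[string]any{"type": "string"},
						"country": map[string]any{"type": "string"},
					},
					"required":             []string{"city", "country"},
					"additionalProperties": false,
				},
			},
		},
	})
	if err != nil {
		panic(err)
	}
	resp, err := http.Post("http://localhost:8080/v1/responses", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // the schema-validated JSON object arrives in the response output
}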
For gateway implementers, these changes are significant. We're not just proxying requests - we're translating between API formats and provider protocols.
The Implementation Challenge
Bifrost already supports Chat Completions for 15+ providers. Adding Responses meant:
- Parse a new input format
- Translate to provider-specific protocols
- Handle streaming with different chunk structures
- Update semantic caching to work with both APIs
- Extend observability to track Responses-specific fields
The hardest part? Tool calls. Every provider handles tools differently. Chat Completions hid some of this behind the messages array. Responses makes tool execution explicit, which is better for users but means gateways need smarter translation logic.
How We Built It
Separate accumulator for streaming. Chat Completions streaming uses deltas in choices[].delta. Responses uses a different structure. We built a dedicated accumulator that:
- Buffers chunks in memory using sync.Pool for efficiency
- Reconstructs tool calls from partial deltas
- Tracks inter-token latency for performance metrics
- Calculates final token counts and costs
- Returns complete, structured responses
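In sketch form, the buffering side of that accumulator looks roughly like this; the type names are made up for the example, and the real implementation also merges tool-call deltas and usage data.
package accumulator

import (
	"strings"
	"sync"
	"time"
)

// chunk is a simplified streaming event: a text delta plus when it arrived.
type chunk struct {
	Delta      string
	ReceivedAt time.Time
}

// builderPool reuses string builders across streams to avoid per-request allocations.
var builderPool = sync.Pool{
	New: func() any { return &strings.Builder{} },
}

// Accumulator collects chunks for one stream and reconstructs the final text.
type Accumulator struct {
	sb        *strings.Builder
	lastChunk time.Time
	latencies []time.Duration // inter-token latency samples
}

func New() *Accumulator {
	return &Accumulator{sb: builderPool.Get().(*strings.Builder)}
}

// Add buffers one chunk and records the gap since the previous one.
func (a *Accumulator) Add(c chunk) {
	if !a.lastChunk.IsZero() {
		a.latencies = append(a.latencies, c.ReceivedAt.Sub(a.lastChunk))
	}
	a.lastChunk = c.ReceivedAt
	a.sb.WriteString(c.Delta)
}

// Finish returns the reconstructed text plus latency samples and releases the builder.
func (a *Accumulator) Finish() (string, []time.Duration) {
	text := a.sb.String()
	a.sb.Reset()
	builderPool.Put(a.sb)
	a.sb = nil
	return text, a.latencies
}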
Request normalization. When a Responses request hits Bifrost, we translate it to the target provider's format:
- OpenAI: native Responses endpoint
- Anthropic: convert to Messages API
- Google: convert to generateContent format
- AWS Bedrock: convert to Converse API
- Azure OpenAI: use Responses with Azure auth headers
This happens transparently. Applications use Responses API regardless of which provider processes the request.
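The dispatch itself is conceptually a switch; the sketch below uses placeholder converter functions rather than the actual Bifrost code paths.
package gateway

import "fmt"

// ResponsesRequest is a trimmed-down stand-in for the incoming /v1/responses payload.
type ResponsesRequest struct {
	Model string
	Input any // string or structured items, per the Responses spec
}

// translate routes a Responses request to a provider-specific converter.
func translate(provider string, req ResponsesRequest) (any, error) {
	switch provider {
	case "openai", "azure":
		return req, nil // native Responses; Azure only differs in auth headers
	case "anthropic":
		return toAnthropicMessages(req) // Messages API shape
	case "google":
		return toGenerateContent(req) // generateContent shape
	case "bedrock":
		return toConverse(req) // Converse API shape
	default:
		return nil, fmt.Errorf("unsupported provider: %s", provider)
	}
}

// Placeholder converters, here only to show the shape of the dispatch.
func toAnthropicMessages(r ResponsesRequest) (any, error) { return nil, nil }
func toGenerateContent(r ResponsesRequest) (any, error)   { return nil, nil }
func toConverse(r ResponsesRequest) (any, error)          { return nil, nil }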
Tool call tracking. The previous_response_id parameter chains multi-turn conversations. We added state tracking so observability plugins can correlate requests. When you look at traces, you see the complete conversation flow, not isolated API calls.
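Under the hood that correlation is a small piece of state keyed by response ID. A simplified sketch, with hypothetical names rather than the real plugin interface:
package trace

import "sync"

// Correlator remembers which trace produced each response, so a follow-up
// request carrying previous_response_id can be attached to the same trace.
type Correlator struct {
	mu       sync.RWMutex
	byRespID map[string]string // response ID -> trace ID
}

func NewCorrelator() *Correlator {
	return &Correlator{byRespID: make(map[string]string)}
}

// Record stores the trace that produced a given response.
func (c *Correlator) Record(responseID, traceID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.byRespID[responseID] = traceID
}

// TraceFor looks up the trace for a previous_response_id, if we've seen it.
func (c *Correlator) TraceFor(previousResponseID string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	t, ok := c.byRespID[previousResponseID]
	return t, ok
}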
Semantic cache updates. Our semantic cache uses embeddings to match similar requests. Responses has a different input format than Chat Completions, so we updated the cache to:
- Generate embeddings from the Responses input field
- Normalize both formats to the same embedding space
- Support cross-API cache hits
If you cache a Chat Completions request and later make a semantically similar Responses request, you get a cache hit.
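The normalization step is the interesting part: both request shapes get reduced to the same plain text before embedding, so equivalent requests land near each other in vector space. A rough sketch with simplified types, not the actual cache code:
package cache

import "strings"

// ChatMessage is a simplified Chat Completions message.
type ChatMessage struct {
	Role    string
	Content string
}

// normalizeChat flattens a Chat Completions messages array into one string.
func normalizeChat(messages []ChatMessage) string {
	parts := make([]string, 0, len(messages))
	for _, m := range messages {
		parts = append(parts, m.Role+": "+m.Content)
	}
	return strings.Join(parts, "\n")
}

// normalizeResponses handles the Responses input field, which may be a plain
// string or a list of structured items (reduced here to messages for brevity).
func normalizeResponses(input any) string {
	switch v := input.(type) {
	case string:
		return "user: " + v
	case []ChatMessage:
		return normalizeChat(v)
	default:
		return ""
	}
}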
Anthropic's Thinking Parameter
We added something OpenAI doesn't have - Anthropic's thinking parameter. Claude models with extended thinking can show their reasoning process. This is different from OpenAI's reasoning mode.
When you use Bifrost with Anthropic models through Responses API, you can pass the thinking object:
{
"model": "claude-sonnet-4",
"input": "Analyze this dataset",
"thinking": {
"type": "enabled",
"budget_tokens": 5000
}
}
Bifrost passes this through to Anthropic and strips it for providers that don't support it. Your code stays provider-agnostic.
What We Learned
String vs structured input is trickier than it looks. The input field can be a string or an array of objects. We initially only handled strings and broke on tool result inputs. Fixed by type-checking and branching logic.
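The fix boils down to deferring decoding until you know which shape you got. A minimal version of the idea, illustrative rather than the exact Bifrost parser:
package parse

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// responsesRequest defers decoding of input, which may be a string or an array.
type responsesRequest struct {
	Model string          `json:"model"`
	Input json.RawMessage `json:"input"`
}

// decodeInput branches on the underlying JSON type of the input field.
func decodeInput(raw json.RawMessage) (text string, items []map[string]any, err error) {
	raw = bytes.TrimSpace(raw)
	if len(raw) == 0 {
		return "", nil, fmt.Errorf("missing input")
	}
	switch raw[0] {
	case '"': // plain string prompt
		err = json.Unmarshal(raw, &text)
		return text, nil, err
	case '[': // structured items, e.g. function_call_output objects
		err = json.Unmarshal(raw, &items)
		return "", items, err
	default:
		return "", nil, fmt.Errorf("unsupported input type")
	}
}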
Streaming chunks need careful handling. We were including extra metadata fields that shouldn't appear in streaming responses. Some client SDKs strictly validate chunk format and rejected our responses. Fixed by filtering to spec-compliant fields only.
Tool result aggregation is subtle. When multiple tool calls execute in parallel and return results, Anthropic expects them in a specific format. We were dropping some results in edge cases. Fixed by tracking call IDs explicitly.
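The fix is to key every result by its call ID and only then build the provider payload. Sketched below for the Anthropic case, where results go back as tool_result blocks referencing tool_use_id (simplified, not the actual aggregation code):
package tools

// toolResult pairs a tool call ID with the output it produced.
type toolResult struct {
	CallID string
	Output string
}

// toAnthropicResults converts results collected by call ID into tool_result
// content blocks, so no parallel call's output gets silently dropped.
func toAnthropicResults(results map[string]toolResult) []map[string]any {
	blocks := make([]map[string]any, 0, len(results))
	for _, r := range results {
		blocks = append(blocks, map[string]any{
			"type":        "tool_result",
			"tool_use_id": r.CallID,
			"content":     r.Output,
		})
	}
	return blocks
}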
Timeout errors need context. Responses API timeouts manifest as context.Canceled, context.DeadlineExceeded, or fasthttp.ErrTimeout. Generic error messages confused users. We added specific timeout detection and clear guidance to increase timeout settings.
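Detection itself is a handful of errors.Is checks, roughly like this (the hint wording here is ours; fasthttp.ErrTimeout is the sentinel the HTTP client exposes):
package timeouts

import (
	"context"
	"errors"

	"github.com/valyala/fasthttp"
)

// timeoutHint returns a user-facing message when an error is one of the
// timeout shapes we see on Responses API calls, or "" otherwise.
func timeoutHint(err error) string {
	switch {
	case errors.Is(err, context.DeadlineExceeded),
		errors.Is(err, context.Canceled),
		errors.Is(err, fasthttp.ErrTimeout):
		return "request timed out; increase the provider timeout in your Bifrost config"
	default:
		return ""
	}
}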
Performance
We benchmarked Responses vs Chat Completions to make sure we didn't add latency:
- Request parsing: 0.3ms (same as Chat Completions)
- Streaming accumulation: 0.05ms per chunk
- Response reconstruction: 0.8ms
- Memory per stream: 1.5MB (vs 1.8MB for Chat Completions)
The new format is actually slightly more efficient. Simpler structure means less parsing work.
Using Both APIs
Bifrost supports Chat Completions and Responses simultaneously. Same observability, same caching, same governance. Pick the API that fits your use case.
Responses is better for:
- New applications (it's the future)
- Tool-heavy workflows (cleaner tool call handling)
- Conversation chaining (built-in state management)
- Structured output (native JSON schema)
Chat Completions still works for:
- Existing applications (no migration needed)
- Simple prompts (familiar format)
- Legacy codebases (don't break what works)
The semantic cache works across both. Cache a Chat Completions request, get cache hits on similar Responses requests.
Migration Is Gradual
You don't need to migrate everything at once. Run both APIs side by side. Test Responses on new features while keeping existing code on Chat Completions.
When you're ready to switch, the changes are straightforward:
Endpoint: /v1/chat/completions → /v1/responses
Request format:
// Chat Completions
{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello"}]
}
// Responses
{
"model": "gpt-4o",
"input": "Hello"
}
Response parsing:
# Chat Completions
text = response.choices[0].message.content
# Responses
text = response.output_text
Tool results:
# Chat Completions
messages.append({
"role": "tool",
"tool_call_id": call_id,
"content": result
})
# Responses
request = {
    "input": [{
        "type": "function_call_output",
        "call_id": call_id,
        "output": result
    }],
    "previous_response_id": response.id
}
Bifrost handles the rest - provider translation, streaming, caching, observability.
What's Coming
OpenAI keeps adding features to Responses API. Web search and citations are in beta. We'll add support as they stabilize.
We're also exploring better tool orchestration. The previous_response_id chaining is powerful but requires state management. We want to make multi-turn tool calls easier to implement.
For now, Responses API support is live in Bifrost v1.3.0. Both streaming and non-streaming work. All observability plugins capture Responses data. Semantic caching works across APIs.
Check the docs for examples. If you're starting a new project, use Responses. If you're on Chat Completions, you're good - we support both.
Extra links:
Website: https://www.getmaxim.ai/bifrost
Blog: https://www.getmaxim.ai/bifrost/blog
