Debby McKinney

OpenAI Responses API in an LLM Gateway: What Changed and Why It Matters

OpenAI's Responses API represents a fundamental redesign of how applications interact with language models. The API addresses pain points from Chat Completions while introducing features that make building production LLM applications cleaner. Bifrost v1.3.0 added full support for both streaming and non-streaming Responses, with transparent provider translation and semantic caching.

Understanding the API Shift

Chat Completions has been OpenAI's primary API since the GPT-3.5 era. Over time, limitations emerged:

Message array complexity. Conversations are represented as arrays of message objects with roles (system, user, assistant, tool). As conversations grow and tool calls accumulate, managing this array becomes cumbersome. Applications must track state manually.

Tool call mechanics. When a model calls a tool, the application receives tool_calls in the message. To provide results, the application appends a message with role tool to the array and makes another request. This works but feels indirect.

Implicit conversation flow. The entire conversation must be sent with every request. There's no server-side state or conversation chaining.
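
To make the first two points concrete, here is roughly what a Chat Completions request looks like after a single tool call; the tool name, call ID, and payloads are illustrative, and the application must resend this entire array on every subsequent turn:

{
  "model": "gpt-4o-mini",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_123",
          "type": "function",
          "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}
        }
      ]
    },
    {"role": "tool", "tool_call_id": "call_123", "content": "{\"temp_c\": 18}"}
  ]
}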

Responses API changes this architecture:

Simplified input. The input parameter accepts either a string (for simple prompts) or structured objects (for tool results and complex inputs). Single-turn conversations no longer need message arrays.

Explicit tool handling. Tool calls appear in output with type function_call. Results are submitted using function_call_output objects, making the flow more explicit.

Conversation chaining. The previous_response_id parameter chains requests server-side. Applications don't need to resend full conversation history.

Structured output. Native json_schema support enables schema-constrained responses with validation.
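
A minimal sketch of the first and third of these changes, using illustrative model names and response IDs:

First turn, with a plain string as input:

{
  "model": "gpt-4o-mini",
  "input": "What's the weather in Paris?"
}

Follow-up turn, chained server-side instead of resending the conversation:

{
  "model": "gpt-4o-mini",
  "previous_response_id": "resp_abc123",
  "input": "And what about tomorrow?"
}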

How Gateways Handle Both APIs

LLM gateways like Bifrost sit between applications and model providers, handling routing, caching, observability, and protocol translation. Supporting both Chat Completions and Responses APIs requires careful implementation.

GitHub: maximhq/bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost

The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Get started

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

Request Normalization

When a Responses request arrives, the gateway must translate it to provider-specific formats:

OpenAI models use the Responses endpoint natively. No translation needed.

Anthropic Claude uses the Messages API, which resembles Chat Completions more than Responses. The gateway converts Responses input to Anthropic's message format and handles tool calls accordingly.

Google Gemini uses the generateContent format with a different structure entirely. The gateway maps Responses fields to Gemini's expected format.

AWS Bedrock uses the Converse API across multiple model providers. Translation logic varies by underlying model.

Azure OpenAI supports Responses but requires Azure-specific authentication headers and endpoint patterns.

This translation happens transparently. Applications use Responses API regardless of which provider processes the request.
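
As a rough illustration of the Anthropic case, here is a simple Responses request and a Messages-style payload a gateway might derive from it (the exact mapping is internal to each gateway; values here are illustrative):

Incoming Responses request:

{
  "model": "anthropic/claude-sonnet-4",
  "input": "Summarize this article in one sentence.",
  "max_output_tokens": 200
}

Roughly equivalent Anthropic Messages payload:

{
  "model": "claude-sonnet-4",
  "max_tokens": 200,
  "messages": [
    {"role": "user", "content": "Summarize this article in one sentence."}
  ]
}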

Streaming Implementation

Streaming responses deliver tokens as they're generated, improving perceived latency. Chat Completions and Responses use different chunk structures, requiring separate streaming logic.

Chat Completions chunks include choices[].delta with partial content. Responses chunks have a different structure for tool calls, reasoning, and text deltas. Gateways must:

  1. Buffer incoming chunks per request
  2. Accumulate deltas correctly
  3. Track tool calls across chunks
  4. Reconstruct the complete response
  5. Calculate token counts and latency metrics
  6. Clean up buffers when streaming completes

Bifrost implements this with dedicated accumulators for each API, using memory pooling to minimize allocations.
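
As a heavily simplified sketch of the accumulation step (not Bifrost's actual implementation, and without the pooling, tool-call tracking, or metrics), assuming text-delta events that carry an output item ID and a text fragment:

package main

import (
	"fmt"
	"strings"
)

// deltaEvent is a simplified stand-in for a Responses streaming event
// such as response.output_text.delta: an output item ID plus a text fragment.
type deltaEvent struct {
	ItemID string
	Text   string
}

// accumulator buffers text fragments per output item until the stream completes.
type accumulator struct {
	buffers map[string]*strings.Builder
}

func newAccumulator() *accumulator {
	return &accumulator{buffers: make(map[string]*strings.Builder)}
}

// add appends one delta to the buffer for its output item.
func (a *accumulator) add(ev deltaEvent) {
	b, ok := a.buffers[ev.ItemID]
	if !ok {
		b = &strings.Builder{}
		a.buffers[ev.ItemID] = b
	}
	b.WriteString(ev.Text)
}

// finish returns the reconstructed text for each output item and releases the buffers.
func (a *accumulator) finish() map[string]string {
	out := make(map[string]string, len(a.buffers))
	for id, b := range a.buffers {
		out[id] = b.String()
	}
	a.buffers = make(map[string]*strings.Builder)
	return out
}

func main() {
	acc := newAccumulator()
	for _, ev := range []deltaEvent{
		{ItemID: "msg_1", Text: "Hello"},
		{ItemID: "msg_1", Text: ", world"},
	} {
		acc.add(ev)
	}
	fmt.Println(acc.finish()["msg_1"]) // prints: Hello, world
}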

Semantic Caching Across APIs

Semantic caching uses vector embeddings to identify similar requests, serving cached responses for semantically equivalent prompts. This reduces latency and costs dramatically for repetitive workloads.

The challenge: Chat Completions uses a messages array while Responses uses an input field. Cache implementations must:

  • Generate embeddings from both formats
  • Normalize to a common embedding space
  • Enable cache hits across APIs

When configured correctly, a cached Chat Completions request can serve a Responses request with similar semantic meaning, and vice versa.
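
For example, these two requests carry the same semantic content, so with cross-API caching configured they can resolve to the same cache entry (exact key construction depends on the cache configuration):

Chat Completions request:

{
  "model": "openai/gpt-4o-mini",
  "messages": [{"role": "user", "content": "What is the capital of France?"}]
}

Responses request:

{
  "model": "openai/gpt-4o-mini",
  "input": "What is the capital of France?"
}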

Responses API Features

Tool Execution Flow

Tool calls in Responses API follow an explicit pattern:

  1. Application sends initial request with tool definitions
  2. Model responds with tool calls in output
  3. Application executes tools and collects results
  4. Application sends results as function_call_output with previous_response_id
  5. Model incorporates results and generates final response

This differs from Chat Completions, where tool results are appended to the messages array. The Responses pattern makes tool execution state more explicit.
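
A sketch of steps 2 and 4, using illustrative IDs and a hypothetical get_weather tool:

Model output containing a tool call (step 2):

{
  "id": "resp_abc123",
  "output": [
    {
      "type": "function_call",
      "call_id": "call_456",
      "name": "get_weather",
      "arguments": "{\"city\": \"Paris\"}"
    }
  ]
}

Follow-up request submitting the result (step 4):

{
  "model": "gpt-4o-mini",
  "previous_response_id": "resp_abc123",
  "input": [
    {
      "type": "function_call_output",
      "call_id": "call_456",
      "output": "{\"temp_c\": 18}"
    }
  ]
}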

Structured Output

The json_schema response format constrains model responses to match a JSON schema. This enables reliable parsing of model outputs into typed data structures.

Example usage:

{
  "model": "gpt-4o",
  "input": "Extract person data from this text",
  "text": {
    "format": {
      "type": "json_schema",
      "name": "person",
      "schema": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "age": {"type": "integer"}
        },
        "required": ["name", "age"]
      }
    }
  }
}

The model's response conforms to the schema, or the request fails with an error.
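
An illustrative response for the request above, trimmed to the relevant fields:

{
  "id": "resp_def456",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "{\"name\": \"Ada Lovelace\", \"age\": 36}"
        }
      ]
    }
  ]
}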

Anthropic Thinking Parameter

While OpenAI's Responses API doesn't include reasoning transparency features, Anthropic's Claude models support a thinking parameter for extended thinking. Bifrost's Responses implementation includes this extension:

{
  "model": "claude-sonnet-4",
  "input": "Solve this problem",
  "thinking": {
    "type": "enabled",
    "budget_tokens": 5000
  }
}

The gateway passes this parameter to Anthropic and strips it for providers that don't support it, maintaining provider-agnostic code.

Performance Implications

Responses API's simpler structure offers minor performance advantages:

Request parsing. Fewer nested objects mean faster parsing. Benchmarks show similar performance to Chat Completions (0.3ms average).

Streaming accumulation. Simpler chunk structure reduces per-delta processing time (0.05ms per chunk).

Memory usage. Responses streaming uses slightly less memory than Chat Completions (1.5MB vs 1.8MB per active stream).

Response reconstruction. Building the final response from accumulated chunks takes 0.8ms on average.

These differences are marginal but compound at scale. For high-throughput gateways processing thousands of requests per second, efficiency matters.

Migration Considerations

When to Use Responses API

New applications should start with Responses. OpenAI has positioned it as the future standard, and Chat Completions will eventually be deprecated.

Tool-heavy workflows benefit from cleaner tool call handling. The explicit function_call_output pattern is easier to implement correctly than Chat Completions' message array manipulation.

Conversation applications using server-side state benefit from previous_response_id chaining, reducing the data sent with each request.

Structured output needs are better served by native json_schema support than JSON mode workarounds.

When to Keep Chat Completions

Existing applications don't need immediate migration. Both APIs work simultaneously through gateways like Bifrost.

Legacy codebases with significant Chat Completions integration may not justify migration costs until OpenAI announces deprecation timelines.

Framework dependencies using Chat Completions may not support Responses yet. Check library compatibility before migrating.

Gradual Migration Strategy

Gateways enabling both APIs allow gradual migration:

  1. Keep existing code on Chat Completions
  2. Use Responses for new features
  3. Test Responses thoroughly in staging
  4. Migrate critical paths when confident
  5. Monitor performance and error rates

Semantic caching works across both APIs, so cache hit rates won't drop during migration.

Implementation in Bifrost

Bifrost v1.3.0 implements Responses API with:

Full provider support. Transparent translation to OpenAI, Anthropic, Google, AWS, and Azure formats.

Streaming and non-streaming. Both modes work with complete observability.

Semantic caching. Cross-API cache hits reduce duplicate processing.

Observability integration. All logging, tracing, and metrics plugins capture Responses data.

Anthropic extensions. Native support for Claude's thinking parameter.

The implementation maintains Bifrost's low-latency characteristics: the 11 µs overhead at 5,000 RPS remains unchanged.

Looking Forward

OpenAI continues expanding Responses API features. Web search and citation capabilities are in beta. Additional structured output formats and reasoning transparency may arrive.

For gateway implementers, the key challenge is maintaining compatibility across provider ecosystems while supporting emerging features. As the API evolves, translation logic must adapt to new parameters and response types.

For application developers, Responses API offers a cleaner foundation for building LLM-powered applications. The explicit tool handling, conversation chaining, and structured output support address real pain points from Chat Completions.

Teams using gateways like Bifrost can adopt Responses incrementally, testing new code paths while keeping production systems stable. The semantic cache works across both APIs, preventing performance regressions during migration.


Learn more: https://docs.getbifrost.ai/api-reference/responses/create-response
