After Amazon Bedrock introduced OpenAI-compatible application programming interfaces (APIs) through Project Mantle, I decided to explore firsthand what this meant in practice. There's nothing like actually calling endpoints and seeing responses to build real intuition. I needed a way to quickly experiment with both the Responses API and Chat Completions API, compare their behaviors, and understand when to use each one.
That's why I put together bedrock-mantle, a command-line interface (CLI) that took shape as I tested these new endpoints. It's designed as an exploration tool—something you can fire up when you want to understand how stateful conversations work, test background processing for long-running tasks, or simply verify that your existing OpenAI software development kit (SDK) code will work with minimal changes.
In this post, I'll walk through what makes these APIs different, show you how to use the CLI for hands-on exploration, and share some insights about when each API makes sense.
What Is Project Mantle?
Before diving into the APIs, it's worth understanding what's under the hood. Project Mantle is a distributed inference engine for large-scale model serving on Amazon Bedrock. It's designed to simplify onboarding new models while providing performant serverless inference with sophisticated quality of service controls.
For developers, the practical benefit is twofold. First, Project Mantle provides out-of-the-box compatibility with OpenAI API specifications—existing code using the OpenAI SDK works with Bedrock models by changing the base URL and API key. Second, it introduces new capabilities like stateful conversation management and asynchronous inference that go beyond simple compatibility.
The Responses API currently supports the OpenAI GPT OSS 20B and 120B models, with support for additional models coming. The Chat Completions API already works with all Bedrock models powered by Project Mantle.
Getting Started
The CLI is packaged as a Python tool and uses uv for dependency management. Installation takes one command:
uv tool install .
Configuration requires two environment variables pointing to the Mantle endpoint and your API key:
export OPENAI_BASE_URL=https://bedrock-mantle.us-east-1.api.aws/v1
export OPENAI_API_KEY=your-amazon-bedrock-api-key
You can get your API key from the Amazon Bedrock console. With that configured, you're ready to explore.
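Before jumping into the CLI, you can sanity check the configuration with a few lines of Python. This is a minimal sketch, assuming the Mantle endpoint exposes the standard model listing route used by the OpenAI SDK:

from openai import OpenAI

# The SDK reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment
client = OpenAI()

# Print the model identifiers available behind the endpoint
for model in client.models.list():
    print(model.id)

If this prints a list of model identifiers, the endpoint and key are set up correctly.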
Two APIs, Two Approaches to Conversation State
The heart of Project Mantle is the choice between two APIs: the Responses API and the Chat Completions API. They solve the same fundamental problem—getting responses from language models—but they handle conversation state very differently.
All the code examples below assume you've set the OPENAI_BASE_URL and OPENAI_API_KEY environment variables as described in the Getting Started section. The OpenAI SDK reads these automatically.
The Responses API maintains conversation state server-side. When you send a message, the server remembers the context automatically using a previous_response_id. You don't need to send the full conversation history with each request because the server tracks it for you. This simplifies client code, reduces bandwidth (especially for long conversations), and makes tool use integration for agentic workflows more straightforward.
Here's a basic request using the Responses API, taken from the official AWS documentation:
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="openai.gpt-oss-120b",
    input=[
        {"role": "user", "content": "Hello! How can you help me today?"}
    ]
)
print(response)
For multi-turn conversations, chain responses using previous_response_id. Each response object includes an id field that you pass to the next request:
# First turn
response = client.responses.create(
    model="openai.gpt-oss-120b",
    input=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.id)  # e.g., "resp_abc123..."

# Second turn: pass the previous response id
response = client.responses.create(
    model="openai.gpt-oss-120b",
    input=[{"role": "user", "content": "What river runs through it?"}],
    previous_response_id=response.id
)
The server handles the history—you just chain the IDs.
The Chat Completions API follows the traditional stateless pattern. You manage conversation history client-side and send the full context with each request. The server processes the request and returns a response without retaining any state between calls. This API also supports reasoning effort configuration, giving you control over how much computational effort the model applies to generate responses.
Here's a basic chat completion request:
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
print(completion.choices[0].message)
With Chat Completions, you build and maintain the messages array yourself. Each request includes the complete conversation history, giving you full control over what context the model sees.
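As a quick illustration, here's a sketch of a two-turn exchange where the client appends the assistant's reply to the messages list before asking the follow-up. The reasoning_effort value is an assumption based on the OpenAI parameter name; check what your endpoint accepts before relying on it:

from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]
first = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=messages
)

# Client-side state: keep the assistant reply, then add the follow-up question
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "What river runs through it?"})

second = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=messages,
    reasoning_effort="low"  # assumption: mirrors the OpenAI reasoning effort parameter
)
print(second.choices[0].message.content)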
Here's what a typical session looks like with the Responses API:
$ bedrock-mantle chat --model openai.gpt-oss-120b
Starting chat session
Model: openai.gpt-oss-120b
API: Responses API
Streaming: enabled
Background: disabled
Type /quit or /q to exit, /clear to reset conversation
------------------------------------------------------------
You: What is the capital of France?
Assistant: The capital of France is Paris.
You: What river runs through it?
Assistant: The Seine River runs through Paris, flowing through the heart of the city.
You: /quit
Goodbye!
Notice how the second question ("What river runs through it?") works without explicitly mentioning Paris. Under the hood, with the Responses API, the server maintains the conversation context, so "it" resolves correctly. With the Chat Completions API, you'd need to include the previous exchange in your request to get the same behavior.
To switch to the Chat Completions API, add the --completions flag:
bedrock-mantle chat --model openai.gpt-oss-120b --completions
Background Processing for Long-Running Tasks
Some tasks take time. Complex reasoning, extensive analysis, or multi-step processes might run for minutes rather than seconds. Keeping an HTTP connection open for that duration introduces reliability concerns—network timeouts, connection drops, and client resource consumption all become issues.
The Responses API addresses this with asynchronous inference through background processing. When you enable background mode, requests are queued and processed asynchronously:
bedrock-mantle chat --model openai.gpt-oss-120b --background
The CLI submits your message and then polls for completion, showing progress as it waits. This pattern is useful for long-running inference workloads where you'd rather wait confidently than wonder whether your connection will survive.
Behind the scenes, the CLI handles the polling loop:
import time

# Background mode: poll for completion
response = client.responses.create(
    model=model,
    input=user_input,
    previous_response_id=previous_response_id,
    background=True
)
while response.status == "in_progress":
    time.sleep(1)
    response = client.responses.retrieve(response.id)
This approach translates well to production architectures where you might submit a request, receive a job ID, and check back later.
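To make that concrete, here's a sketch that keeps only the response id after submission and checks on the job later, assuming the id remains valid across processes:

from openai import OpenAI

client = OpenAI()

# Submit the request and keep only the id (for example, store it in a database)
job = client.responses.create(
    model="openai.gpt-oss-120b",
    input=[{"role": "user", "content": "Analyze these quarterly reports in depth"}],
    background=True
)
job_id = job.id

# Later, possibly from a different process, look the job up by id
result = client.responses.retrieve(job_id)
if result.status == "completed":
    print(result.output_text)  # output_text is the SDK convenience property for the generated text
else:
    print(f"Still running, status: {result.status}")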
Choosing Between APIs
The choice between APIs depends on your requirements. I've found it helpful to think about three dimensions.
State management is the most obvious differentiator. If you want the server to track conversation context automatically, use the Responses API. If you need full control over what context gets sent (perhaps for privacy reasons, or because you're doing custom context management), use the Chat Completions API.
Data retention matters for compliance-sensitive applications. The Responses API stores data for approximately 30 days to support its stateful features. The Chat Completions API follows a zero data retention model—no conversation data is stored between requests.
Model support varies between the APIs. The Responses API currently works with OpenAI GPT OSS models (20B and 120B parameters), while the Chat Completions API supports all Bedrock models powered by Project Mantle. You can check available models with:
bedrock-mantle list-models
Practical Exploration Patterns
The CLI includes several options that make exploration more productive. Disabling streaming shows you the complete response structure rather than incremental chunks:
bedrock-mantle chat --model openai.gpt-oss-120b --no-stream
Streaming is useful when you want to display responses as they arrive. Here's how streaming works with the Responses API:
from openai import OpenAI

client = OpenAI()
stream = client.responses.create(
    model="openai.gpt-oss-120b",
    input=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for event in stream:
    print(event)
And with the Chat Completions API:
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Notice the difference in how you process the stream. The Chat Completions API returns chunks with a delta containing the incremental content, while the Responses API returns typed events.
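If you only want the generated text from a Responses API stream, you can filter on the event type. This sketch uses the event names from the OpenAI SDK; treat the exact type strings as an assumption until you've verified them against the Mantle endpoint:

from openai import OpenAI

client = OpenAI()
stream = client.responses.create(
    model="openai.gpt-oss-120b",
    input=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for event in stream:
    # Assumption: incremental text arrives as response.output_text.delta events
    if event.type == "response.output_text.delta":
        print(event.delta, end="")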
Custom system prompts help you test how different personas or instructions affect behavior:
bedrock-mantle chat --model openai.gpt-oss-120b --system "You are a helpful assistant who explains concepts simply"
During a session, the /status command shows your current configuration, and /clear resets the conversation state—useful when you want to start fresh without restarting the CLI.
What I Learned Building This
Building the CLI taught me some practical lessons about working with these APIs.
First, the stateful nature of the Responses API changes how you think about error handling. If a request fails mid-conversation, you need to decide whether to retry with the same previous_response_id or reset the conversation. The server's state might be consistent even if your client didn't receive the response.
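A minimal sketch of that decision, assuming you retry once with the same previous_response_id and only reset the conversation if the retry also fails:

def send_turn(client, model, user_input, previous_response_id=None):
    """Send one turn; retry once, then fall back to a fresh conversation."""
    request = {"model": model, "input": [{"role": "user", "content": user_input}]}
    if previous_response_id:
        request["previous_response_id"] = previous_response_id
    try:
        return client.responses.create(**request)
    except Exception:
        try:
            # The server-side state may be intact even if the client never saw
            # the response, so retry with the same previous_response_id first
            return client.responses.create(**request)
        except Exception:
            # Give up on the stored state and start a new conversation
            request.pop("previous_response_id", None)
            return client.responses.create(**request)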
Second, background processing introduces its own considerations. How often should you poll? How long should you wait before giving up? The CLI uses simple fixed-interval polling, but production code might implement exponential backoff to be more efficient.
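A sketch of what that might look like, capping the delay and giving up after an overall timeout; the queued status is an assumption carried over from the OpenAI API:

import time

def wait_for_response(client, response_id, timeout=600):
    """Poll a background response with exponential backoff, up to a timeout."""
    delay = 1
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        response = client.responses.retrieve(response_id)
        if response.status not in ("queued", "in_progress"):
            return response
        time.sleep(delay)
        delay = min(delay * 2, 30)  # back off, but never wait more than 30 seconds
    raise TimeoutError(f"Response {response_id} did not finish within {timeout}s")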
Third, streaming and non-streaming responses have different structures. If you're building tooling that works with both modes, you need to handle the response parsing accordingly.
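One way to keep the rest of your tooling mode-agnostic is a small helper that always returns the full text. A sketch for the Chat Completions API:

def complete_text(client, model, messages, stream=False):
    """Return the complete response text for both streaming and non-streaming calls."""
    if not stream:
        completion = client.chat.completions.create(model=model, messages=messages)
        return completion.choices[0].message.content
    parts = []
    for chunk in client.chat.completions.create(model=model, messages=messages, stream=True):
        delta = chunk.choices[0].delta.content
        if delta is not None:
            parts.append(delta)
    return "".join(parts)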
The Migration Path
If you're currently using the OpenAI APIs and considering a move to Bedrock, the migration is straightforward. The endpoint format, request structure, and response format follow the OpenAI specification. In many cases, changing two environment variables is enough to switch between providers.
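If you'd rather not rely on environment variables, the same switch can be made explicitly when constructing the client, here with placeholder values for the endpoint and key:

from openai import OpenAI

# Point an existing OpenAI SDK integration at the Bedrock Mantle endpoint
client = OpenAI(
    base_url="https://bedrock-mantle.us-east-1.api.aws/v1",
    api_key="your-amazon-bedrock-api-key"
)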
That said, testing matters. The CLI gives you a low-friction way to verify behavior and identify any code changes you might need. Run your typical prompts through both the Responses API and Chat Completions API, observe the responses, and build confidence in how the migration could affect your application.
Try It Yourself
The CLI is available on GitHub under the MIT license. Clone it, configure your credentials, and start exploring the tool and its code. The info command shows you the current configuration, and list-models tells you what's available in your region.
Whether you're building applications on the OpenAI APIs and are curious what a migration to Bedrock would look like, or you're already using Bedrock and want to understand these new capabilities, the CLI provides a playground to build intuition before committing to architectural decisions.
I'm curious to hear what patterns you discover as you explore. Are there specific use cases where the stateful Responses API simplifies your architecture significantly? Or do you find the control of the stateless Chat Completions API more valuable for your needs? Let me know in the comments what you're building with these capabilities.
The complete code is available at github.com/danilop/bedrock-mantle. Contributions and feedback welcome.