Azeem Siddiqui

Posted on Jun 22

Putting an OpenAI-Compatible Gateway in Front of AWS Bedrock

#devops #ai #aws #bedrock

The request that kicked this off was simple enough: the developers wanted an AI coding assistant in their editors. The constraints were where it got interesting. No AWS credentials on laptops. No third-party SaaS that would trigger a data-handling review. And whoever set it up had to be able to answer "who spent what" when the bill showed up.

AWS Bedrock solved the backend question. The models are good, inference stays inside our account, and there's nothing to sign off on legally. The trouble is everything sitting in front of Bedrock.

Here's the mismatch. Bedrock has its own API and authenticates with AWS SigV4. The tools developers actually reach for — Continue.dev, Cursor, aider, a big chunk of the LangChain ecosystem — all speak the OpenAI API instead: POST /v1/chat/completions, an Authorization: Bearer header, done. So if you want your team using their preferred tools against Bedrock, something has to bridge those two worlds.

The naive bridge is to mint an IAM user per developer so their editor can sign requests. Don't do this. You'll have AWS credentials scattered across a dozen machines, no per-person attribution, and no way to throttle or cut off one person without going and editing IAM policies. I went a different way: a small gateway that speaks OpenAI on the front, Bedrock on the back, and owns auth, rate limiting, audit logging, and cost tracking in between.

I've been running a version of this in production for a team of about a dozen developers. What follows is a clean, self-contained build of it — you can have it answering requests in an afternoon.

What we're building

Developer tools (Continue.dev, Cursor, aider, curl)
        │   OpenAI format + Bearer key (sk-...)
        ▼
┌──────────────────────────────────────┐
│              Gateway                  │
│   auth → rate limit → translate →     │
│   call Bedrock → translate back →     │
│   audit + cost log                    │
└──────────────────────────────────────┘
        │   SigV4, one IAM role
        ▼
        AWS Bedrock (Converse API)

One IAM role lives on the gateway and nowhere else. Developers hold a sk-... key that I issue and can kill from one place. Every request passes through a single service, which is exactly where you want auth, throttling, and accounting to sit. That last point is the whole argument for doing this, so I'll keep coming back to it.

Before any code: should you even build this? LiteLLM has a proxy that covers a lot of the same ground, and AWS publishes a bedrock-access-gateway sample. Look at both. I went custom because I wanted auth tied to our own notion of identity, audit logs in a shape our existing tooling already understood, and per-developer cost numbers rather than a per-key blob. It also helped that the translation layer turned out to be about 80 lines. If your needs are generic, use the off-the-shelf option and save yourself the maintenance. If you want to own auth, rate limiting, audit logging, and cost tracking yourself, it's a weekend, and you'll know every line.

Prerequisites

Node.js 18+ and an AWS account with Bedrock model access turned on. (You enable it per-model in the Bedrock console. For most models it's effectively instant.)
An IAM role or user the gateway can assume, with bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream on the model ARNs you plan to serve.
Comfort with Express. Nothing fancy.

I'm using SQLite here so the whole thing runs with zero external services. In production I back it with a managed database instead. The tables and columns carry over as they are; only the SQLite-specific pieces (AUTOINCREMENT, datetime('now')) need their equivalents in the new dialect.

Step 1: Project setup

mkdir bedrock-gateway && cd bedrock-gateway
npm init -y
npm install express @aws-sdk/client-bedrock-runtime express-rate-limit better-sqlite3

Add "type": "module" to package.json so the ES module imports work.

Give the gateway its AWS credentials however you normally would — an instance role if it's on EC2 or ECS, environment variables if you're running it locally to test. The one rule: these credentials live on the gateway and only on the gateway.

export AWS_REGION=us-east-1

Step 2: The data layer (keys + usage)

Two tables. One maps an issued key to a developer. One records every call so you can audit it and bill it back.

db.js:

import Database from "better-sqlite3";

export const db = new Database("gateway.db");

db.exec(`
  CREATE TABLE IF NOT EXISTS api_keys (
    id          TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    key_hash    TEXT NOT NULL UNIQUE,
    active      INTEGER NOT NULL DEFAULT 1,
    created_at  TEXT NOT NULL DEFAULT (datetime('now'))
  );

  CREATE TABLE IF NOT EXISTS usage (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    developer_id  TEXT NOT NULL,
    ts            TEXT NOT NULL DEFAULT (datetime('now')),
    model         TEXT NOT NULL,
    input_tokens  INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    cost_usd      REAL NOT NULL,
    latency_ms    INTEGER NOT NULL,
    streamed      INTEGER NOT NULL,
    status        INTEGER NOT NULL
  );
`);

The keys table stores a hash of the key, never the key itself. You see the raw value once, at creation, and it's unrecoverable after that — same deal as a password. If gateway.db ever leaks, no usable key leaks with it.

Step 3: Issuing keys

A throwaway CLI so you can hand a developer a key without building a UI first:

create-key.js:

import crypto from "node:crypto";
import { db } from "./db.js";

const name = process.argv[2];
if (!name) {
  console.error("usage: node create-key.js \"developer name\"");
  process.exit(1);
}

const raw = "sk-" + crypto.randomBytes(24).toString("hex");
const id = crypto.randomUUID();
const keyHash = crypto.createHash("sha256").update(raw).digest("hex");

db.prepare(
  "INSERT INTO api_keys (id, name, key_hash) VALUES (?, ?, ?)"
).run(id, name, keyHash);

console.log(`Created key for ${name}`);
console.log(`Key (shown once, store it now): ${raw}`);

node create-key.js "Jordan"
# Key (shown once, store it now): sk-1f3c...e9a2

The sk- prefix is cosmetic, but it earns its keep. Some OpenAI-compatible clients sanity-check the shape of the key, and it tells the developer this drops into the same field where an OpenAI key would go. Small thing, fewer support pings.
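
Because every key from create-key.js has the same shape (sk- followed by the 48 hex characters that randomBytes(24) produces), the gateway could reject obviously malformed tokens before it bothers hashing them. A minimal sketch, assuming a hypothetical helper called looksLikeGatewayKey that is not part of the files in this post:

```javascript
// Hypothetical helper, not one of the gateway files above.
// create-key.js emits "sk-" + 24 random bytes as lowercase hex (48 chars).
const KEY_SHAPE = /^sk-[0-9a-f]{48}$/;

function looksLikeGatewayKey(token) {
  return typeof token === "string" && KEY_SHAPE.test(token);
}

console.log(looksLikeGatewayKey("sk-" + "ab".repeat(24))); // true
console.log(looksLikeGatewayKey("sk-too-short"));          // false
```

This is only a pre-filter. The hash lookup in auth.js stays the real check, and if you ever change the key length, the pattern has to change with it.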

Step 4: Auth middleware

Pull the bearer token, hash it, look it up. Attach the developer to the request so everything downstream — the rate limiter, the audit log — knows who's calling.

auth.js:

import crypto from "node:crypto";
import { db } from "./db.js";

const lookup = db.prepare(
  "SELECT id, name FROM api_keys WHERE key_hash = ? AND active = 1"
);

export function authenticate(req, res, next) {
  const header = req.get("authorization") || "";
  const match = header.match(/^Bearer\s+(.+)$/i);
  if (!match) {
    return res.status(401).json({
      error: { message: "Missing bearer token", type: "invalid_request_error" },
    });
  }

  const keyHash = crypto.createHash("sha256").update(match[1]).digest("hex");
  const dev = lookup.get(keyHash);
  if (!dev) {
    return res.status(401).json({
      error: { message: "Invalid API key", type: "invalid_request_error" },
    });
  }

  req.developer = dev;
  next();
}

I'm matching OpenAI's error envelope — { error: { message, type } } — on purpose. Client tools parse that shape. Return a bare string and some of them surface a vague generic failure instead of your message, and then you're debugging "the gateway is down" tickets that were really just expired keys. Ask me how I know.
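
Since the same envelope shows up in auth, rate limiting, and the route handler, it can be worth building it in one place. A small sketch, assuming a hypothetical helper named openAiError (the files in this post inline the object instead):

```javascript
// Hypothetical helper: builds the { error: { message, type } } body
// that OpenAI-compatible clients know how to parse.
function openAiError(message, type = "invalid_request_error") {
  return { error: { message, type } };
}

// Inside an Express handler this would be used as:
//   res.status(401).json(openAiError("Invalid API key"));
console.log(JSON.stringify(openAiError("Rate limit exceeded", "rate_limit_error")));
// {"error":{"message":"Rate limit exceeded","type":"rate_limit_error"}}
```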

Step 5: The translation layer

This is the part that does the actual work. OpenAI's chat format and Bedrock's Converse format are close, with three differences that'll trip you up if you don't watch for them:

Converse roles are only user and assistant. The system prompt goes in its own top-level system field, not in the messages array.
Converse message content is an array of blocks — [{ text: "..." }] — not a plain string.
Inference params live under inferenceConfig, and the keys are renamed (maxTokens, not max_tokens).
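
To make those three differences concrete, here is one hypothetical request written both ways. The prompt text and parameter values are made up for illustration; the model ID is the one used in this post.

```javascript
// What an OpenAI-compatible client sends to the gateway.
const openAiBody = {
  model: "claude-3-5-sonnet",
  messages: [
    { role: "system", content: "You are a terse code reviewer." },
    { role: "user", content: "Review this function." },
  ],
  max_tokens: 512,
  temperature: 0.2,
};

// The equivalent arguments for Bedrock's Converse API:
// system prompt lifted out, content wrapped in blocks, params renamed.
const converseArgs = {
  modelId: "anthropic.claude-3-5-sonnet-20240620-v1:0",
  system: [{ text: "You are a terse code reviewer." }],
  messages: [
    { role: "user", content: [{ text: "Review this function." }] },
  ],
  inferenceConfig: { maxTokens: 512, temperature: 0.2 },
};

console.log(JSON.stringify(converseArgs, null, 2));
```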

bedrock.js:

import {
  BedrockRuntimeClient,
  ConverseCommand,
  ConverseStreamCommand,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({
  region: process.env.AWS_REGION || "us-east-1",
});

// Friendly names the clients send -> real Bedrock model IDs.
// Check current IDs with: aws bedrock list-foundation-models
// IDs and regional availability move around. If you need a model in a
// region that doesn't host it directly, look at cross-region inference
// profiles (the "us." / "eu." prefixed IDs).
export const MODELS = {
  "claude-3-5-sonnet": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "claude-3-5-haiku": "anthropic.claude-3-5-haiku-20241022-v1:0",
};

// OpenAI content can be a string or an array of parts. Normalize to text.
function extractText(content) {
  if (typeof content === "string") return content;
  if (Array.isArray(content)) {
    return content
      .filter((p) => p.type === "text")
      .map((p) => p.text)
      .join("");
  }
  return "";
}

// OpenAI request body -> Converse arguments
export function toBedrock(body) {
  const modelId = MODELS[body.model];
  if (!modelId) throw new ModelNotFound(body.model);

  const system = [];
  const messages = [];

  for (const m of body.messages) {
    if (m.role === "system") {
      system.push({ text: extractText(m.content) });
    } else {
      messages.push({
        role: m.role, // "user" | "assistant"
        content: [{ text: extractText(m.content) }],
      });
    }
  }

  const inferenceConfig = {};
  if (body.max_tokens != null) inferenceConfig.maxTokens = body.max_tokens;
  if (body.temperature != null) inferenceConfig.temperature = body.temperature;
  if (body.top_p != null) inferenceConfig.topP = body.top_p;
  if (body.stop != null) {
    inferenceConfig.stopSequences = Array.isArray(body.stop) ? body.stop : [body.stop];
  }

  return {
    modelId,
    ...(system.length ? { system } : {}),
    messages,
    ...(Object.keys(inferenceConfig).length ? { inferenceConfig } : {}),
  };
}

export class ModelNotFound extends Error {}

const FINISH = {
  end_turn: "stop",
  stop_sequence: "stop",
  max_tokens: "length",
  tool_use: "tool_calls",
  content_filtered: "content_filter",
};

export function finishReason(stopReason) {
  return FINISH[stopReason] || "stop";
}

export async function converse(args) {
  return client.send(new ConverseCommand(args));
}

export async function converseStream(args) {
  return client.send(new ConverseStreamCommand(args));
}
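
One thing toBedrock does not handle is two messages in a row with the same role. Some models behind Converse expect user and assistant turns to alternate, and some clients do send back-to-back user messages, so a defensive option is to merge neighbours before sending. A minimal sketch, assuming a hypothetical helper named mergeConsecutive that runs on the already-translated messages array; whether you need it depends on the models and clients you serve:

```javascript
// Hypothetical helper: merges adjacent messages that share a role by
// concatenating their content blocks. Input and output both use the
// Converse shape: { role, content: [{ text }] }. The input is not mutated.
function mergeConsecutive(messages) {
  const merged = [];
  for (const m of messages) {
    const last = merged[merged.length - 1];
    if (last && last.role === m.role) {
      last.content.push(...m.content);
    } else {
      merged.push({ role: m.role, content: [...m.content] });
    }
  }
  return merged;
}

const example = mergeConsecutive([
  { role: "user", content: [{ text: "Here is the file." }] },
  { role: "user", content: [{ text: "What does it do?" }] },
  { role: "assistant", content: [{ text: "It parses config." }] },
]);
console.log(example.length);            // 2
console.log(example[0].content.length); // 2
```

If you adopt something like this, the natural place to call it is right before the return in toBedrock.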

Step 6: Cost calculation

Bedrock bills per token, at a rate that depends on the model. Keep a price table and compute cost from the usage Bedrock hands back. Prices change, so I keep this in one file and treat it as config, not as numbers buried in the request path.

pricing.js:

// USD per 1M tokens. Confirm against current Bedrock pricing before trusting it.
const PRICES = {
  "claude-3-5-sonnet": { input: 3.0, output: 15.0 },
  "claude-3-5-haiku":  { input: 0.8, output: 4.0 },
};

export function costUsd(model, inputTokens, outputTokens) {
  const p = PRICES[model];
  if (!p) return 0;
  return (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
}
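
A quick worked example of that formula, using made-up token counts and the Sonnet prices from the table above. The function is repeated so the snippet runs on its own:

```javascript
// Same formula as pricing.js, repeated so this example is self-contained.
const PRICES = {
  "claude-3-5-sonnet": { input: 3.0, output: 15.0 }, // USD per 1M tokens
};

function costUsd(model, inputTokens, outputTokens) {
  const p = PRICES[model];
  if (!p) return 0;
  return (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
}

// Hypothetical request: 12,000 prompt tokens, 800 completion tokens.
//   input:  12,000 / 1,000,000 * 3.00  = 0.036
//   output:    800 / 1,000,000 * 15.00 = 0.012
const cost = costUsd("claude-3-5-sonnet", 12_000, 800);
console.log(cost.toFixed(3)); // "0.048"
```

Note the `return 0` fallback: a model that gets added to MODELS but not to PRICES will be logged as free, so it is worth keeping the two tables in step.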

Step 7: Wiring it together

Now the server pulls it all in: auth, a per-developer rate limiter, the two routes, and the audit-and-cost write that fires on every request.

server.js:

import express from "express";
import crypto from "node:crypto";
import rateLimit from "express-rate-limit";
import { db } from "./db.js";
import { authenticate } from "./auth.js";
import {
  toBedrock,
  converse,
  converseStream,
  finishReason,
  MODELS,
  ModelNotFound,
} from "./bedrock.js";
import { costUsd } from "./pricing.js";

const app = express();
app.use(express.json({ limit: "2mb" }));

const recordUsage = db.prepare(`
  INSERT INTO usage
    (developer_id, model, input_tokens, output_tokens, cost_usd, latency_ms, streamed, status)
  VALUES (?, ?, ?, ?, ?, ?, ?, ?)
`);

// One audit + cost write per request. Called from the route's finally block
// so it runs even when a streaming client hangs up mid-response.
function audit(req, { model, usage, latencyMs, streamed, status }) {
  const input = usage?.inputTokens ?? 0;
  const output = usage?.outputTokens ?? 0;
  const cost = costUsd(model, input, output);

  recordUsage.run(
    req.developer.id, model, input, output, cost, latencyMs, streamed ? 1 : 0, status
  );

  // Structured line for log shipping. Metadata only — no prompt or
  // completion text. Logging content is a privacy/compliance call you
  // make on purpose, not a default you back into.
  console.log(JSON.stringify({
    evt: "llm_request",
    developer: req.developer.name,
    developer_id: req.developer.id,
    model, input_tokens: input, output_tokens: output,
    cost_usd: Number(cost.toFixed(6)), latency_ms: latencyMs,
    streamed, status,
  }));
}

app.use("/v1", authenticate);

// Per-developer rate limit. The keyGenerator is the whole trick — it makes
// the limit per-person instead of per-IP. Run more than one instance and
// you'll want a shared store (rate-limit-redis), or each instance counts
// its own window and the real limit is N times what you set.
app.use("/v1", rateLimit({
  windowMs: 60_000,
  limit: 60,
  keyGenerator: (req) => req.developer.id, // authenticate runs first, so this is always set
  standardHeaders: true,
  legacyHeaders: false,
  handler: (req, res) => res.status(429).json({
    error: { message: "Rate limit exceeded", type: "rate_limit_error" },
  }),
}));

app.get("/v1/models", (req, res) => {
  res.json({
    object: "list",
    data: Object.keys(MODELS).map((id) => ({
      id, object: "model", owned_by: "bedrock-gateway",
    })),
  });
});

app.post("/v1/chat/completions", async (req, res) => {
  const started = Date.now();
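  // Fall back to a placeholder so the audit row's NOT NULL model column
  // is satisfied even when the request body is missing or malformed.
  const model = typeof req.body?.model === "string" ? req.body.model : "unknown";
  const streamed = req.body?.stream === true;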
  let usage = null;
  let status = 200;

  try {
    const args = toBedrock(req.body);

    if (!streamed) {
      const out = await converse(args);
      usage = out.usage;
      res.json(toCompletion(model, out));
    } else {
      usage = await streamCompletion(res, model, args);
    }
  } catch (err) {
    status = err instanceof ModelNotFound ? 400 : 502;
    const payload = {
      error: {
        message: err instanceof ModelNotFound
          ? `Unknown model: ${model}`
          : "Upstream model error",
        type: err instanceof ModelNotFound ? "invalid_request_error" : "api_error",
      },
    };
    if (res.headersSent) res.end();
    else res.status(status).json(payload);
  } finally {
    audit(req, { model, usage, latencyMs: Date.now() - started, streamed, status });
  }
});

function toCompletion(model, out) {
  return {
    id: "chatcmpl-" + crypto.randomBytes(12).toString("hex"),
    object: "chat.completion",
    created: Math.floor(Date.now() / 1000),
    model,
    choices: [{
      index: 0,
      message: {
        role: "assistant",
        content: out.output.message.content.map((b) => b.text || "").join(""),
      },
      finish_reason: finishReason(out.stopReason),
    }],
    usage: {
      prompt_tokens: out.usage.inputTokens,
      completion_tokens: out.usage.outputTokens,
      total_tokens: out.usage.totalTokens,
    },
  };
}

app.listen(3000, () => console.log("Gateway listening on :3000"));
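
If it helps to see why the keyGenerator matters, the per-developer limit in server.js can be pictured as a counter per key per window. The following is a simplified model written for illustration, not express-rate-limit's actual implementation; createFixedWindowLimiter and its options are hypothetical names.

```javascript
// Simplified fixed-window limiter, for illustration only.
// Each key (here: a developer id) gets its own counter and reset time.
function createFixedWindowLimiter({ windowMs, limit }) {
  const hits = new Map(); // key -> { count, resetAt }
  return function allow(key, now = Date.now()) {
    const entry = hits.get(key);
    if (!entry || now >= entry.resetAt) {
      hits.set(key, { count: 1, resetAt: now + windowMs });
      return true;
    }
    entry.count += 1;
    return entry.count <= limit;
  };
}

const allow = createFixedWindowLimiter({ windowMs: 60_000, limit: 2 });
console.log(allow("dev-a", 0));      // true  (1st call in window)
console.log(allow("dev-a", 1));      // true  (2nd call, at the limit)
console.log(allow("dev-a", 2));      // false (over the limit)
console.log(allow("dev-b", 2));      // true  (different developer, own counter)
console.log(allow("dev-a", 60_000)); // true  (window has rolled over)
```

The Map lives in one process, which is the same reason the in-memory store undercounts once you run several instances.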

Step 8: Streaming

Streaming is where gateways tend to get sloppy, and it's also where two of the four concerns get tested at once. Bedrock's ConverseStream gives you an async iterable of events; you translate each text delta into an OpenAI SSE chunk and flush it right away. The catch that got me the first time: token counts only show up in the final metadata event. If you care about cost tracking — and the entire point here is that you do — you have to let the stream finish, or at least catch that event, even when the client has already walked away.

Add this to server.js:

async function streamCompletion(res, model, args) {
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  const id = "chatcmpl-" + crypto.randomBytes(12).toString("hex");
  const created = Math.floor(Date.now() / 1000);
  const send = (delta, finish = null) => {
    const chunk = {
      id, object: "chat.completion.chunk", created, model,
      choices: [{ index: 0, delta, finish_reason: finish }],
    };
    res.write(`data: ${JSON.stringify(chunk)}\n\n`);
  };

  let usage = null;
  const out = await converseStream(args);

  send({ role: "assistant" }); // first chunk carries the role, no content

  for await (const event of out.stream) {
    if (event.contentBlockDelta?.delta?.text) {
      send({ content: event.contentBlockDelta.delta.text });
    }
    if (event.messageStop) {
      send({}, finishReason(event.messageStop.stopReason));
    }
    if (event.metadata?.usage) {
      usage = event.metadata.usage; // this is the part you can't skip
    }
  }

  res.write("data: [DONE]\n\n");
  res.end();
  return usage;
}

Two details that matter. The first chunk announces { role: "assistant" } with no content — clients expect the role up front, once. And the terminating data: [DONE] line is part of the OpenAI contract; tools wait for it to know the stream closed cleanly, and they'll hang if it never comes.

The reason the usage capture lives where it does: that function returns usage, and the route's finally block writes it. So a developer who cancels a long generation halfway through still gets billed for the tokens Bedrock actually produced. That's the difference between cost numbers that tie out at the end of the month and cost numbers that quietly run low and make you look like you don't know what your platform costs.
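
To see the ordering problem without calling AWS, here is a dry run of the same translation over hand-written events. The event shapes mirror what streamCompletion reads; the values are made up, the chunks are trimmed to the choices field, and the finish-reason mapping is cut down to two cases.

```javascript
// Hand-written stand-ins for ConverseStream events. Note that the
// metadata event (with token counts) arrives after messageStop.
const fakeEvents = [
  { messageStart: { role: "assistant" } },
  { contentBlockDelta: { delta: { text: "Hel" } } },
  { contentBlockDelta: { delta: { text: "lo" } } },
  { messageStop: { stopReason: "end_turn" } },
  { metadata: { usage: { inputTokens: 9, outputTokens: 2, totalTokens: 11 } } },
];

function toSse(events) {
  const lines = [];
  let usage = null;
  const send = (delta, finish = null) => {
    const chunk = { choices: [{ index: 0, delta, finish_reason: finish }] };
    lines.push(`data: ${JSON.stringify(chunk)}`);
  };

  send({ role: "assistant" }); // role first, once, no content
  for (const event of events) {
    if (event.contentBlockDelta?.delta?.text) {
      send({ content: event.contentBlockDelta.delta.text });
    }
    if (event.messageStop) {
      send({}, event.messageStop.stopReason === "max_tokens" ? "length" : "stop");
    }
    if (event.metadata?.usage) usage = event.metadata.usage;
  }
  lines.push("data: [DONE]");
  return { lines, usage };
}

const result = toSse(fakeEvents);
result.lines.forEach((line) => console.log(line));
console.log(result.usage);
```

A loop that stopped as soon as it saw a finish_reason would exit before the metadata event and record zero tokens for the request.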

Testing it

Plain curl, no streaming:

curl http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer sk-YOUR-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-3-5-sonnet",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'

Streaming — note the -N so curl doesn't buffer:

curl -N http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer sk-YOUR-KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-3-5-haiku","stream":true,
       "messages":[{"role":"user","content":"Count to five."}]}'

And the test that actually matters — pointing a real editor at it. In Continue.dev, the gateway is just an "openai" provider with a custom base URL:

{
  "models": [
    {
      "title": "Bedrock Gateway",
      "provider": "openai",
      "model": "claude-3-5-sonnet",
      "apiBase": "http://your-gateway-host:3000/v1",
      "apiKey": "sk-YOUR-KEY"
    }
  ]
}

The developer never has to know Bedrock is back there. They get chat and autocomplete in their editor. You get a key you can revoke, a limit you can tune, and a usage row for every single call.

Checking the books

Cost tracking is only worth anything if you can read it back. The reason every usage row carries a developer id is so questions like "who's driving the bill this month" are one query and not an afternoon:

SELECT
  k.name,
  COUNT(*)                  AS requests,
  SUM(u.input_tokens)       AS input_tokens,
  SUM(u.output_tokens)      AS output_tokens,
  ROUND(SUM(u.cost_usd), 2) AS cost_usd
FROM usage u
JOIN api_keys k ON k.id = u.developer_id
WHERE u.ts >= datetime('now', '-30 days')
GROUP BY k.id, k.name
ORDER BY cost_usd DESC;

I wire this into a dashboard and a monthly summary, but even the raw query does the job — it catches a runaway script the day it starts instead of the day the invoice lands.

What I'd add before calling it done

The build above is honest about what it is: a clean core, not a hardened production deployment. A few things I'd want in place before I trusted it with a real team:

Shared rate-limit state. The in-memory limiter keeps its counters per process, so two instances behind a load balancer mean a developer's real limit is double what you configured. Move the window into Redis with rate-limit-redis.
A real database. SQLite is great for the tutorial and fine for one small instance. For anything multi-instance, point db.js at managed Postgres or MySQL. The tables and columns stay the same, but it's slightly more than a driver swap: AUTOINCREMENT and datetime('now') are SQLite syntax, and better-sqlite3's synchronous calls become async ones with most other drivers.
Content logging is a decision. This gateway logs metadata only. The moment you start logging prompts and completions, you've created a store of possibly-sensitive text with its own retention, access, and compliance questions hanging off it. Choose that deliberately, write down why, and tell your developers either way.
Budget guardrails. Usage rows let you see spend. A hard stop is a separate thing — add a per-developer monthly cap checked in the auth path, and set an account-level AWS budget alert as a backstop. When the app's number and AWS's number disagree, something in your accounting is wrong, and you want to find out from a graph, not a finance meeting.
Key lifecycle. You've already got active = 0 for revocation. Add rotation and an expiry column so contractor keys age out on their own instead of living forever.
Health and timeouts. A /healthz route for your load balancer, and a request timeout so one stuck upstream call doesn't pin a connection open indefinitely.
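
As one example from that list, the budget cap can be a small pure function that the auth path calls with a developer's month-to-date spend. This is a sketch under assumptions: checkBudget is a hypothetical name, the spend figure would come from a SUM(cost_usd) over the usage table, and the 429 status and insufficient_quota type are assumptions about what client tools handle gracefully.

```javascript
// Hypothetical budget check. spentUsd is the developer's month-to-date
// spend; capUsd is whatever monthly limit you assign them.
function checkBudget(spentUsd, capUsd) {
  if (spentUsd >= capUsd) {
    return {
      allowed: false,
      status: 429, // assumption: reuse the rate-limit status for hard stops
      body: {
        error: {
          message: `Monthly budget of $${capUsd.toFixed(2)} reached`,
          type: "insufficient_quota", // assumption, see note above
        },
      },
    };
  }
  return { allowed: true, remainingUsd: capUsd - spentUsd };
}

console.log(checkBudget(12.5, 50)); // { allowed: true, remainingUsd: 37.5 }
console.log(checkBudget(50, 50).allowed); // false
```

Returning a decision object rather than writing to the response keeps the check easy to unit test and leaves the Express wiring to the middleware.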

Wrapping up

The whole thing is a few hundred lines, and it converts "every developer holds AWS credentials and spend is a mystery" into "every developer holds a key I issued, every request is attributed to a person, and I can throttle or cut off anyone from one place." The translation layer (OpenAI in, Converse out) is small and it stays put. Auth, rate limiting, audit logging, and cost tracking aren't bolt-ons; they're the reason the service exists. Auth is one middleware. Rate limiting is one more. Audit and cost are a single write per request. You get all of it precisely because every call funnels through one service you control.

If your requirements are generic, save yourself the upkeep and try LiteLLM or the AWS sample first. But if you want to own these four things end to end, it's a weekend of work — and the first time something goes sideways in production, knowing exactly why each piece is there pays for itself.

DEV Community