<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gabriel Koo</title>
    <description>The latest articles on DEV Community by Gabriel Koo (@gabrielkoo).</description>
    <link>https://dev.to/gabrielkoo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F915460%2Fd0dc6e3b-e400-47c8-b6fc-ecc9cf80be4b.jpeg</url>
      <title>DEV Community: Gabriel Koo</title>
      <link>https://dev.to/gabrielkoo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gabrielkoo"/>
    <language>en</language>
    <item>
      <title>Bedrock for AI Coding Tools: Mantle vs Gateway vs LiteLLM — A Decision Guide for AWS Credit Burners</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Sun, 22 Mar 2026 08:59:54 +0000</pubDate>
      <link>https://dev.to/aws-builders/bedrock-for-ai-coding-tools-mantle-vs-gateway-vs-litellm-a-decision-guide-for-aws-credit-burners-1h01</link>
      <guid>https://dev.to/aws-builders/bedrock-for-ai-coding-tools-mantle-vs-gateway-vs-litellm-a-decision-guide-for-aws-credit-burners-1h01</guid>
      <description>&lt;p&gt;You have AWS credits. You want to use them on AI coding tools — OpenCode, Codex CLI, Claude Code, whatever. Amazon Bedrock has the models. But how do you actually connect them?&lt;/p&gt;

&lt;p&gt;There are three approaches, and picking the wrong one wastes time. Here's the decision guide I wish I had.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All data in this post is as of March 2026. Model counts and API support may change — check &lt;a href="https://amazonbedrockmodels.github.io" rel="noopener noreferrer"&gt;amazonbedrockmodels.github.io&lt;/a&gt; for the latest.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;TL;DR&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Just want it to work?&lt;/strong&gt; Mantle + OpenCode. Five minutes, zero infra.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need Claude models via OpenAI API?&lt;/strong&gt; bedrock-access-gateway on Lambda.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need Claude Code specifically?&lt;/strong&gt; LiteLLM. It's the only path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codex CLI?&lt;/strong&gt; Broken with all three. Wait for LiteLLM to fix a tool translation bug.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The three paths&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sln5ko8ufwm73r5r8tb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sln5ko8ufwm73r5r8tb.png" alt="Decision flowchart: Mantle vs bedrock-access-gateway vs LiteLLM" width="678" height="1295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One thing all three have in common: your API keys and code context stay within your AWS account or your own infrastructure.&lt;/strong&gt; Third-party AI gateways exist (Bifrost, Portkey, etc.), but they require routing your Bedrock API keys and code context through someone else's servers. Self-hosted or AWS-native — that's the baseline.&lt;/p&gt;

&lt;h3&gt;1. Bedrock Mantle — no self-hosted infra required&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-mantle.html" rel="noopener noreferrer"&gt;Mantle&lt;/a&gt; is AWS's native OpenAI-compatible endpoint. No Lambda, no container, no proxy — just set your base URL and API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://bedrock-mantle.us-east-1.api.aws/v1"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-bedrock-api-key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's on Mantle:&lt;/strong&gt; 38 open-weight models — DeepSeek, Mistral, Qwen, GLM, NVIDIA Nemotron, MiniMax, Moonshot Kimi, Google Gemma, OpenAI gpt-oss, and Writer Palmyra.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's NOT on Mantle:&lt;/strong&gt; Anthropic Claude, Amazon Nova, Meta Llama, AI21, Cohere — the proprietary/first-party models are absent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API coverage:&lt;/strong&gt; Mantle exposes Chat Completions (&lt;code&gt;/v1/chat/completions&lt;/code&gt;) and Responses API (&lt;code&gt;/v1/responses&lt;/code&gt;). No Anthropic Messages API (&lt;code&gt;/v1/messages&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The Responses API is limited — only 4 models support it: &lt;code&gt;openai.gpt-oss-120b-1:0&lt;/code&gt;, &lt;code&gt;openai.gpt-oss-20b-1:0&lt;/code&gt;, &lt;code&gt;openai.gpt-oss-120b&lt;/code&gt;, and &lt;code&gt;openai.gpt-oss-20b&lt;/code&gt;. Every other model is Chat Completions only. I verified this by scraping all 102 model card pages in the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;.&lt;/p&gt;
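&lt;p&gt;In client code, this split means routing per model. A minimal sketch (the base URL and model IDs are the ones above; the helper itself is hypothetical, not part of any SDK):&lt;/p&gt;

```python
# Sketch: route a Mantle model to the endpoint it actually supports.
# Assumes the four gpt-oss IDs listed above are the only
# Responses-capable models; everything else is Chat Completions only.

MANTLE_BASE = "https://bedrock-mantle.us-east-1.api.aws/v1"

RESPONSES_CAPABLE = {
    "openai.gpt-oss-120b-1:0",
    "openai.gpt-oss-20b-1:0",
    "openai.gpt-oss-120b",
    "openai.gpt-oss-20b",
}

def mantle_endpoint(model_id: str) -> str:
    """Return the Mantle endpoint URL for a given model ID."""
    if model_id in RESPONSES_CAPABLE:
        return MANTLE_BASE + "/responses"
    return MANTLE_BASE + "/chat/completions"
```

&lt;p&gt;So &lt;code&gt;mantle_endpoint("deepseek.v3.2")&lt;/code&gt; resolves to the Chat Completions URL, while the gpt-oss IDs resolve to &lt;code&gt;/responses&lt;/code&gt;.&lt;/p&gt;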

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Standard Bedrock on-demand pricing. No gateway markup, no infra costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; OpenCode or any tool that speaks OpenAI Chat Completions.&lt;/p&gt;

&lt;h3&gt;2. bedrock-access-gateway — self-hosted, all models&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/aws-samples/bedrock-access-gateway" rel="noopener noreferrer"&gt;bedrock-access-gateway&lt;/a&gt; (or my fork, &lt;a href="https://github.com/gabrielkoo/bedrock-access-gateway-function-url" rel="noopener noreferrer"&gt;bedrock-access-gateway-function-url&lt;/a&gt;) gives you an OpenAI-compatible proxy backed by all Bedrock models — including Claude, Nova, and Llama.&lt;/p&gt;

&lt;p&gt;Deploy it as a Lambda Function URL or on ECS, and you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://your-lambda-url.lambda-url.us-west-2.on.aws/api/v1"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-gateway-api-key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tradeoff: you maintain infrastructure. But you get access to every Bedrock model through a single OpenAI-compatible endpoint.&lt;/p&gt;
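&lt;p&gt;Under the hood it's just the OpenAI wire format with your gateway key as the bearer token. A sketch of the raw request (the Function URL, key, and model ID are placeholders, not real values):&lt;/p&gt;

```python
import json

def build_chat_request(base_url, api_key, model, prompt):
    """Build an OpenAI-style chat completion request for the gateway.
    Returns (url, headers, body) ready to hand to any HTTP client."""
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {
        "Authorization": "Bearer " + api_key,
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return url, headers, body

# Placeholder values: substitute your own Function URL and gateway key.
url, headers, body = build_chat_request(
    "https://your-lambda-url.lambda-url.us-west-2.on.aws/api/v1",
    "your-gateway-api-key",
    "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "Hello",
)
```

&lt;p&gt;Pass the three values to whatever HTTP client you prefer; the gateway translates the call to Bedrock on your behalf.&lt;/p&gt;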

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Bedrock on-demand pricing + Lambda/ECS compute costs (minimal for Lambda Function URLs — you pay per invocation).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; When you need Claude or Nova through OpenAI-compatible tools, or want full control over routing, caching, and logging.&lt;/p&gt;

&lt;h3&gt;3. LiteLLM — the universal translator&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; is the Swiss Army knife. It translates between API schemas — OpenAI, Anthropic, Bedrock native, and more. It's the only option that gives you Anthropic Messages API (&lt;code&gt;/v1/messages&lt;/code&gt;) compatibility with Bedrock models.&lt;/p&gt;

&lt;p&gt;This matters because &lt;strong&gt;Claude Code uses the Anthropic API schema&lt;/strong&gt;, not OpenAI's. If you want to run Claude Code against Bedrock, LiteLLM is your best (and arguably only) option. I tested this end-to-end: Claude Code CLI → LiteLLM → Bedrock Converse API — it works, including streaming responses.&lt;/p&gt;
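&lt;p&gt;As a sketch, a minimal LiteLLM &lt;code&gt;config.yaml&lt;/code&gt; for this path might look like the following (the model ID and region are illustrative; check LiteLLM's Bedrock docs for the models enabled in your account):&lt;/p&gt;

```yaml
model_list:
  - model_name: claude-sonnet   # the alias Claude Code will request
    litellm_params:
      model: bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0
      aws_region_name: us-west-2
```

&lt;p&gt;Run &lt;code&gt;litellm --config config.yaml&lt;/code&gt; and point Claude Code at the proxy (for example via &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;).&lt;/p&gt;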

&lt;p&gt;Is it perfect? No. Setup is more complex (Python process or Docker container, optional PostgreSQL for analytics), and you're adding another layer of abstraction. But it's the most flexible gateway available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Bedrock on-demand pricing + your compute costs for hosting LiteLLM. No per-call markup from LiteLLM itself (open source).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Claude Code, or when you need both OpenAI and Anthropic API compatibility from a single proxy.&lt;/p&gt;

&lt;h2&gt;Tool compatibility matrix&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;API Schema&lt;/th&gt;
&lt;th&gt;Mantle&lt;/th&gt;
&lt;th&gt;bedrock-access-gateway&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenCode&lt;/td&gt;
&lt;td&gt;OpenAI Chat&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex CLI&lt;/td&gt;
&lt;td&gt;OpenAI Responses&lt;/td&gt;
&lt;td&gt;❌ Auth issues&lt;/td&gt;
&lt;td&gt;❌ No Responses API&lt;/td&gt;
&lt;td&gt;⚠️ Tool bug&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Anthropic Messages&lt;/td&gt;
&lt;td&gt;❌ No support&lt;/td&gt;
&lt;td&gt;❌ Wrong schema&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note: Anthropic-native tools like Kiro CLI also work through LiteLLM's Anthropic Messages API translation.&lt;/p&gt;

&lt;h3&gt;A note on Codex CLI&lt;/h3&gt;

&lt;p&gt;Codex CLI requires the Responses API (&lt;code&gt;/v1/responses&lt;/code&gt;), which limits your options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mantle:&lt;/strong&gt; Only the 4 OpenAI gpt-oss models support Responses API. Even with those, I hit 401 auth errors (Bearer token not passed correctly through Codex's HTTPS transport) and tool type rejections (&lt;code&gt;web_search&lt;/code&gt; type not supported — only &lt;code&gt;function&lt;/code&gt; and &lt;code&gt;mcp&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;bedrock-access-gateway:&lt;/strong&gt; No Responses API at all — &lt;code&gt;/v1/responses&lt;/code&gt; returns 404. The gateway only implements Chat Completions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM:&lt;/strong&gt; Supports Responses API (&lt;a href="https://github.com/BerriAI/litellm/releases" rel="noopener noreferrer"&gt;v1.66.3+&lt;/a&gt;) and has an &lt;a href="https://docs.litellm.ai/docs/tutorials/openai_codex" rel="noopener noreferrer"&gt;official Codex CLI tutorial&lt;/a&gt;. However, as of v1.82.5, there's a tool translation bug: Codex CLI sends built-in tool types that LiteLLM converts to Bedrock Converse format with empty &lt;code&gt;toolSpec.name&lt;/code&gt; fields, causing Bedrock validation errors. The Responses API itself works fine when tested with standard function tools. This should be fixable on the LiteLLM side.&lt;/li&gt;
&lt;/ul&gt;
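&lt;p&gt;For reference, the shape that does work: a standard Responses API function tool carries a top-level &lt;code&gt;name&lt;/code&gt;, which is exactly the field that ends up empty after the buggy translation. A sketch (the tool itself is hypothetical):&lt;/p&gt;

```python
def responses_function_tool(name, description, parameters):
    """Build a standard Responses-API function tool definition.
    A non-empty name is what Bedrock's toolSpec validation requires."""
    assert name, "tool name must be non-empty"
    return {
        "type": "function",
        "name": name,
        "description": description,
        "parameters": parameters,
    }

# Hypothetical example tool.
tool = responses_function_tool(
    "read_file",
    "Read a file from the workspace",
    {"type": "object", "properties": {"path": {"type": "string"}}},
)
```

&lt;p&gt;Tools built like this went through LiteLLM to Bedrock without issue in my testing; it's only Codex CLI's built-in tool types that trip the translation.&lt;/p&gt;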

&lt;h2&gt;Quick setup: OpenCode + Mantle&lt;/h2&gt;

&lt;p&gt;If you just want to burn AWS credits on a coding CLI today, here's the fastest path. &lt;a href="https://opencode.ai" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt; (v1.2.27+) works with Mantle out of the box:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://opencode.ai/config.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bedrock-mantle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@ai-sdk/openai-compatible"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bedrock Mantle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://bedrock-mantle.us-east-1.api.aws/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{env:BEDROCK_API_KEY}"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"openai.gpt-oss-120b"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GPT OSS 120B"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"zai.glm-5"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GLM 5 (744B/40B MoE)"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"qwen.qwen3-coder-480b-a35b-instruct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Qwen3 Coder 480B"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"deepseek.v3.2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DeepSeek V3.2"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mistral.mistral-large-3-675b-instruct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mistral Large 3"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bedrock-mantle/openai.gpt-oss-120b"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save to &lt;code&gt;~/.config/opencode/opencode.json&lt;/code&gt;, set &lt;code&gt;BEDROCK_API_KEY&lt;/code&gt;, and you're coding.&lt;/p&gt;

&lt;h2&gt;The bottom line&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Mantle&lt;/th&gt;
&lt;th&gt;bedrock-access-gateway&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Infra to maintain&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Lambda/ECS&lt;/td&gt;
&lt;td&gt;Container/process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Models available&lt;/td&gt;
&lt;td&gt;38 (open-weight)&lt;/td&gt;
&lt;td&gt;All Bedrock&lt;/td&gt;
&lt;td&gt;All Bedrock&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Chat API&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Responses API&lt;/td&gt;
&lt;td&gt;⚠️ gpt-oss only&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Messages API&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenCode&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex CLI&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️ Tool bug&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extra cost&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;~$0 (Lambda)&lt;/td&gt;
&lt;td&gt;Your compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;5 min&lt;/td&gt;
&lt;td&gt;30 min&lt;/td&gt;
&lt;td&gt;1 hr&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Track available Mantle models&lt;/h2&gt;

&lt;p&gt;I maintain &lt;a href="https://amazonbedrockmodels.github.io" rel="noopener noreferrer"&gt;amazonbedrockmodels.github.io&lt;/a&gt; — a catalog of every Bedrock model with API support badges and endpoint support (Mantle vs Runtime), scraped from the AWS documentation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Burning AWS credits on something interesting? I'd love to hear what tools and models you're using — drop a comment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bedrock</category>
      <category>ai</category>
      <category>openai</category>
    </item>
    <item>
      <title>From 3-Minute Cold Starts to ~20 Seconds: Whisper on AWS Lambda + EFS for OpenClaw</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Fri, 13 Mar 2026 01:27:21 +0000</pubDate>
      <link>https://dev.to/aws-builders/from-3-minute-cold-starts-to-20-seconds-whisper-on-aws-lambda-efs-for-openclaw-9c5</link>
      <guid>https://dev.to/aws-builders/from-3-minute-cold-starts-to-20-seconds-whisper-on-aws-lambda-efs-for-openclaw-9c5</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 3 of my series on building a low-cost personal AI stack on AWS.&lt;/em&gt; &lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/aws-builders/i-squeezed-my-1k-monthly-openclaw-api-bill-with-20month-in-aws-credits-heres-the-exact-setup-3gj4"&gt;Part 1 — Squeezing my $1k/month API bill to $20/month with AWS Credits&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/aws-builders/drop-in-perplexity-sonar-replacement-with-aws-bedrock-nova-grounding-35o9"&gt;Part 2 — Drop-in Perplexity Sonar replacement with AWS Bedrock Nova Grounding&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;TL;DR&lt;/h2&gt;

&lt;p&gt;I built a self-hosted speech-to-text API on AWS Lambda using &lt;a href="https://github.com/SYSTRAN/faster-whisper" rel="noopener noreferrer"&gt;faster-whisper&lt;/a&gt;. After trying Amazon Transcribe, SageMaker Serverless, and Lambda with a bundled model, I landed on a &lt;strong&gt;Lambda + EFS + S3&lt;/strong&gt; architecture that achieves ~20-30 second cold starts (once the model is cached on EFS) for ~$0.21/month in storage costs. Once warm, specifying the language drops response time to ~10s.&lt;/p&gt;

&lt;p&gt;Open source: &lt;a href="https://github.com/gabrielkoo/aws-lambda-whisper-adaptor" rel="noopener noreferrer"&gt;gabrielkoo/aws-lambda-whisper-adaptor&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;I wanted to automatically transcribe Telegram voice messages. The requirements were simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: Good enough for Cantonese&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Pay-per-use, scales to zero when idle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Cold start under 60 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's a fourth constraint that's easy to overlook outside Hong Kong: &lt;strong&gt;most managed STT APIs simply aren't available here&lt;/strong&gt;. OpenAI's Whisper API is unavailable in Hong Kong, which falls under the same notorious regional restrictions OpenAI applies to mainland China. Google's Gemini models are available and actually competitive on both accuracy and price — Gemini 3 Flash achieves 3.1% WER at ~$1.92/1000 minutes (&lt;a href="https://artificialanalysis.ai/speech-to-text" rel="noopener noreferrer"&gt;Artificial Analysis STT leaderboard&lt;/a&gt;), cheaper than OpenAI's Whisper API and competitive with Lambda at low volume. The real reason I went with Lambda: AWS Credits from the Community Builder program (same theme as the rest of this series) make it effectively free.&lt;/p&gt;

&lt;p&gt;Simple enough. Except it took four attempts to get there.&lt;/p&gt;


&lt;h2&gt;What I Tried (and Why It Didn't Work)&lt;/h2&gt;
&lt;h3&gt;Option 1: Amazon Transcribe&lt;/h3&gt;

&lt;p&gt;The obvious first choice — fully managed, pay-per-use, native AWS integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I rejected it before even trying:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon Transcribe &lt;a href="https://docs.aws.amazon.com/transcribe/latest/dg/supported-languages.html" rel="noopener noreferrer"&gt;supports &lt;code&gt;zh-CN&lt;/code&gt; and &lt;code&gt;zh-TW&lt;/code&gt;, but not &lt;code&gt;yue&lt;/code&gt; (Cantonese)&lt;/a&gt;. Whisper large-v3-turbo handles Cantonese significantly better, and accuracy matters more than convenience here.&lt;/p&gt;


&lt;h3&gt;Option 2: SageMaker Serverless Inference&lt;/h3&gt;

&lt;p&gt;SageMaker Serverless scales to zero and handles model serving — sounds perfect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I deployed a SageMaker Serverless endpoint with faster-whisper. The first invocation after idle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Container provisioning: ~30s&lt;/li&gt;
&lt;li&gt;Model loading: ~45-60s&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total cold start: 60-90 seconds&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a voice message that's 5-10 seconds long, waiting 90 seconds is a terrible experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 6GB memory wall:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SageMaker Serverless &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html" rel="noopener noreferrer"&gt;maxes out at 6144 MB (6 GB) RAM&lt;/a&gt;. Here's why that's a problem for Whisper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/Zoont/faster-whisper-large-v3-turbo-int8-ct2" rel="noopener noreferrer"&gt;&lt;code&gt;whisper-large-v3-turbo&lt;/code&gt; (INT8)&lt;/a&gt;: ~780MB model + ~2GB Python/runtime overhead ≈ 2.8GB minimum&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/Systran/faster-whisper-large-v3" rel="noopener noreferrer"&gt;&lt;code&gt;whisper-large-v3&lt;/code&gt; (FP16)&lt;/a&gt;: ~3GB model alone — barely fits, zero headroom for audio processing&lt;/li&gt;
&lt;li&gt;Any concurrent requests? You're OOM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-memory.html" rel="noopener noreferrer"&gt;Lambda goes up to 10,240 MB&lt;/a&gt;. That headroom matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost comparison:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/sagemaker/pricing/" rel="noopener noreferrer"&gt;SageMaker Serverless bills per GB-second&lt;/a&gt; of inference time. For sporadic voice message transcription (~10s per request, a few times a day), Lambda's per-invocation pricing is significantly cheaper. My Lambda setup costs ~$0.21/month in storage — the compute is essentially free at this volume.&lt;/p&gt;

&lt;p&gt;I deleted the endpoint after testing.&lt;/p&gt;


&lt;h3&gt;Option 2b: Bedrock Marketplace&lt;/h3&gt;

&lt;p&gt;AWS Bedrock Marketplace &lt;a href="https://aws.amazon.com/blogs/machine-learning/build-a-serverless-audio-summarization-solution-with-amazon-bedrock-and-whisper/" rel="noopener noreferrer"&gt;does list Whisper Large V3 Turbo&lt;/a&gt; — but it deploys on a &lt;strong&gt;dedicated endpoint instance&lt;/strong&gt;. Auto-scaling is available (including scale-to-zero), but that creates a different problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep minimum 1 instance&lt;/strong&gt;: always paying for idle time, even at 3am&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to zero&lt;/strong&gt;: cold starts when traffic resumes — SageMaker cold starts are measured in &lt;strong&gt;minutes&lt;/strong&gt;, not seconds&lt;/li&gt;
&lt;li&gt;Not token/usage-based pricing either way&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a Telegram bot that gets a few voice messages a day, you're either burning money on idle instances or waiting minutes for the first message to transcribe. Lambda's ~20-30s cold start looks great by comparison.&lt;/p&gt;


&lt;h3&gt;Option 3: Lambda with Bundled Model&lt;/h3&gt;

&lt;p&gt;Next idea: bundle the model directly into the Docker image. No external dependencies, simple architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download model during build&lt;/span&gt;
&lt;span class="c"&gt;# Note: using openai/whisper-large-v3-turbo converted to int8 via sync-model workflow&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"from faster_whisper import WhisperModel; WhisperModel('openai/whisper-large-v3-turbo')"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Docker image size: &lt;strong&gt;~10GB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;ECR push time: &lt;strong&gt;5+ minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Lambda cold start: &lt;strong&gt;2 minutes 51 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cold start is dominated by Lambda pulling the 10GB image from ECR. AWS Lambda caches images, but any cold start after the cache expires hits this wall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it didn't work:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3-minute cold start is unusable for interactive transcription&lt;/li&gt;
&lt;li&gt;Every code change requires rebuilding and pushing a 10GB image&lt;/li&gt;
&lt;li&gt;ECR storage: ~$1/month just for the image&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;Option 4: Lambda + S3 (No EFS)&lt;/h3&gt;

&lt;p&gt;What if Lambda downloads the model from S3 on cold start, storing it in &lt;code&gt;/tmp&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lambda's &lt;code&gt;/tmp&lt;/code&gt; is ephemeral. Every cold start re-downloads the model from S3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 download for 1.6GB FP16 model: &lt;strong&gt;30-60 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;S3 download for 780MB INT8 model: &lt;strong&gt;15-30 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is better than the bundled model approach, but there's a bigger issue: &lt;strong&gt;no caching between Lambda instances&lt;/strong&gt;. If you have 3 concurrent invocations, all 3 download the model independently. You're paying for S3 transfer on every cold start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about Lambda SnapStart or Durable Functions?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS has since shipped two features that sound relevant here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SnapStart for Python&lt;/strong&gt; (Nov 2024): snapshots the initialized execution environment — sounds perfect for caching a loaded model. The catch: &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/snapstart.html" rel="noopener noreferrer"&gt;SnapStart doesn't support container images&lt;/a&gt;. This adaptor is container-based, so it's off the table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/12/lambda-durable-multi-step-applications-ai-workflows/" rel="noopener noreferrer"&gt;Lambda Durable Functions&lt;/a&gt;&lt;/strong&gt; (re:Invent 2025): enables multi-step workflows with automatic checkpointing, pause/resume for up to one year, and failure recovery. This is workflow orchestration (think Azure Durable Functions) — useful for multi-step AI pipelines, but not for persisting a 780MB model binary between cold starts.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;EFS remains the right solution for model caching.&lt;/p&gt;




&lt;h2&gt;What Actually Worked: Lambda + EFS + S3&lt;/h2&gt;

&lt;p&gt;The solution: use &lt;strong&gt;EFS as a persistent model cache&lt;/strong&gt;, bootstrapped from S3. I've used EFS for &lt;a href="https://dev.to/aws-builders/scale-a-stateful-streamlit-chatbot-with-aws-ecs-and-efs-48gm"&gt;persistent Streamlit state on ECS&lt;/a&gt; before — same pattern, different compute layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request → Lambda Function URL
               ↓
          Lambda (VPC)
               ↓ first cold start only: S3 → EFS
              EFS (model cached here permanently)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwezctzw3zlsjpf19q5f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwezctzw3zlsjpf19q5f1.png" alt="Logic Flow" width="800" height="1846"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;First cold start&lt;/strong&gt; (once per model): Lambda checks for a marker file on EFS. If missing, downloads model from S3 to EFS (~55s for INT8). Writes marker file. Then loads model into RAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subsequent cold starts&lt;/strong&gt; (new container, model already on EFS): Marker file exists → load model from EFS into RAM (~20-30s for INT8).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warm invocations&lt;/strong&gt; (same container reused): Model already in memory → transcription-only time (~10-22s depending on audio length and whether language is specified).
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;HF_MODEL_REPO&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HF_MODEL_REPO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;openai/whisper-large-v3-turbo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;MODEL_SLUG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HF_MODEL_REPO&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;EFS_MODEL_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/mnt/whisper-models/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MODEL_SLUG&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;MODEL_MARKER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/mnt/whisper-models/.ready-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MODEL_SLUG&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bootstrap_model&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL_MARKER&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;WhisperModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EFS_MODEL_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# First run: sync model from S3 to EFS
&lt;/span&gt;    &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;models/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MODEL_SLUG&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EFS_MODEL_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;paginator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_paginator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;list_objects_v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paginator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;paginate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MODEL_S3_BUCKET&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Contents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;local_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EFS_MODEL_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;):])&lt;/span&gt;
            &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MODEL_S3_BUCKET&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL_MARKER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Mark as ready
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;WhisperModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EFS_MODEL_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bootstrap_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Runs at Lambda init time, cached for warm invocations
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why EFS works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EFS persists across Lambda instances — model is downloaded &lt;strong&gt;once&lt;/strong&gt;, reused forever&lt;/li&gt;
&lt;li&gt;EFS is mounted at &lt;code&gt;/mnt/whisper-models&lt;/code&gt; — Lambda reads it like a local filesystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 VPC Gateway Endpoint is free&lt;/strong&gt; — no NAT Gateway needed (saves ~$32/month)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero internet egress&lt;/strong&gt; — Lambda → S3 via VPC Gateway Endpoint, Lambda → EFS within VPC. The Lambda function never reaches the internet. This is a meaningful security benefit when using third-party models from HuggingFace — model weights never leave the AWS network once synced to S3.&lt;/li&gt;
&lt;li&gt;EFS storage: ~$0.19/month for the 780MB INT8 model&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🔒 &lt;strong&gt;Security note:&lt;/strong&gt; The Lambda runs in a VPC with &lt;strong&gt;no internet access&lt;/strong&gt; — no NAT Gateway, no public subnet. It can only reach EFS (VPC-internal) and S3 (via the free VPC Gateway Endpoint). This means even if you're using a third-party HuggingFace model, the model weights and your audio data never leave the AWS network. No data exfiltration risk, no outbound calls to unknown endpoints.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  INT8 vs FP16: The Model Size Trade-off
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;openai/whisper-large-v3-turbo&lt;/code&gt; model on HuggingFace needs conversion to CTranslate2 format. The &lt;code&gt;sync-model&lt;/code&gt; workflow handles this, converting to INT8 and fixing the &lt;code&gt;num_mel_bins&lt;/code&gt; config. Alternatively, use &lt;a href="https://huggingface.co/Zoont/faster-whisper-large-v3-turbo-int8-ct2" rel="noopener noreferrer"&gt;&lt;code&gt;Zoont/faster-whisper-large-v3-turbo-int8-ct2&lt;/code&gt;&lt;/a&gt; — a pre-converted CTranslate2 INT8 model that works out of the box with &lt;code&gt;quantization=none&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size (EFS)&lt;/th&gt;
&lt;th&gt;First Bootstrap&lt;/th&gt;
&lt;th&gt;EFS Cold Start&lt;/th&gt;
&lt;th&gt;Warm (2.5s audio)&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Zoont/faster-whisper-large-v3-turbo-int8-ct2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~780MB&lt;/td&gt;
&lt;td&gt;~55s&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~22s&lt;/strong&gt; ✅&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~10s&lt;/strong&gt; ✅&lt;/td&gt;
&lt;td&gt;~2.8GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;openai/whisper-large-v3-turbo&lt;/code&gt; (INT8, via sync-model)&lt;/td&gt;
&lt;td&gt;~780MB&lt;/td&gt;
&lt;td&gt;~55s&lt;/td&gt;
&lt;td&gt;~22s&lt;/td&gt;
&lt;td&gt;~10s&lt;/td&gt;
&lt;td&gt;~2.8GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;openai/whisper-large-v3-turbo&lt;/code&gt; (FP16)&lt;/td&gt;
&lt;td&gt;~1.5GB&lt;/td&gt;
&lt;td&gt;~126s&lt;/td&gt;
&lt;td&gt;~40s&lt;/td&gt;
&lt;td&gt;~15s&lt;/td&gt;
&lt;td&gt;~4GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Systran/faster-whisper-large-v3&lt;/code&gt; (FP16, loaded as int8)&lt;/td&gt;
&lt;td&gt;~1.6GB&lt;/td&gt;
&lt;td&gt;~54s&lt;/td&gt;
&lt;td&gt;~30s&lt;/td&gt;
&lt;td&gt;~13s&lt;/td&gt;
&lt;td&gt;6GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Recommended:&lt;/strong&gt; &lt;code&gt;Zoont/faster-whisper-large-v3-turbo-int8-ct2&lt;/code&gt; — no conversion step needed, identical performance to the openai model converted to INT8. Use &lt;code&gt;quantization=none&lt;/code&gt; in the sync-model workflow since it's already in CTranslate2 format.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EFS storage (780MB INT8)&lt;/td&gt;
&lt;td&gt;~$0.19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 storage (780MB)&lt;/td&gt;
&lt;td&gt;~$0.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda compute&lt;/td&gt;
&lt;td&gt;~$0.00167/warm invocation*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 VPC Gateway Endpoint&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Not needed ($0)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total (storage only)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.21/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;*10GB × 10s = 100 GB-seconds per warm invocation. The &lt;a href="https://aws.amazon.com/lambda/pricing/" rel="noopener noreferrer"&gt;Lambda free tier&lt;/a&gt; covers &lt;strong&gt;400,000 GB-seconds/month&lt;/strong&gt; — roughly 4,000 warm invocations. For a personal bot, compute cost is effectively &lt;strong&gt;$0&lt;/strong&gt;. Storage dominates.&lt;/em&gt;&lt;/p&gt;
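&lt;p&gt;The arithmetic, spelled out ($0.0000166667/GB-second is the x86 Lambda rate in most regions; check the pricing page for yours):&lt;/p&gt;

```python
PRICE_PER_GB_SECOND = 0.0000166667  # x86 Lambda price, most regions
FREE_TIER_GB_SECONDS = 400_000      # free tier, per month

memory_gb = 10     # configured memory
warm_seconds = 10  # typical warm transcription time

gb_seconds = memory_gb * warm_seconds                   # 100 GB-seconds/invocation
cost_per_invocation = gb_seconds * PRICE_PER_GB_SECOND  # ~$0.00167
free_invocations = FREE_TIER_GB_SECONDS // gb_seconds   # 4,000 per month

print(gb_seconds, round(cost_per_invocation, 5), free_invocations)
```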

&lt;p&gt;Compare to SageMaker Serverless: minimum ~$5-10/month for similar workloads, plus the 60-90s cold start penalty.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why not Provisioned Concurrency?&lt;/strong&gt; PC keeps Lambda permanently warm (no cold starts), but costs ~$0.0000097222/GB-second. For a 10GB function running 24/7: ~$252/month. Even a minimal 4GB setup runs ~$100/month — roughly 500x more than the $0.21 storage approach. For a personal bot with a few voice messages a day, the occasional ~60s cold start is a fine trade-off.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  vs. OpenAI Whisper API
&lt;/h3&gt;

&lt;p&gt;OpenAI's Whisper API costs &lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;&lt;strong&gt;$0.006/minute&lt;/strong&gt;&lt;/a&gt;. Here's how it compares for a bot averaging 15s voice messages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Volume&lt;/th&gt;
&lt;th&gt;OpenAI Whisper API&lt;/th&gt;
&lt;th&gt;Self-hosted Lambda&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50 msgs/month&lt;/td&gt;
&lt;td&gt;$0.08&lt;/td&gt;
&lt;td&gt;$0.21 (storage only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;140 msgs/month&lt;/td&gt;
&lt;td&gt;$0.21&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$0.21&lt;/strong&gt; ← break-even&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500 msgs/month&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;td&gt;$0.21 (storage only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000 msgs/month&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;td&gt;$0.21 (storage only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4,000 msgs/month&lt;/td&gt;
&lt;td&gt;$6.00&lt;/td&gt;
&lt;td&gt;$0.21 (storage only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Lambda compute is free within the free tier (~4,000 warm invocations/month). Beyond that it's ~$0.00167 per invocation — and 4,000 messages a month is already heavy use for a personal bot.&lt;/p&gt;

&lt;p&gt;Break-even: &lt;strong&gt;~140 messages/month&lt;/strong&gt;. Above that, Lambda wins on cost.&lt;/p&gt;
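&lt;p&gt;The break-even point falls straight out of the per-message price:&lt;/p&gt;

```python
OPENAI_PER_MINUTE = 0.006   # Whisper API price, $/minute
AVG_MESSAGE_SECONDS = 15    # typical voice message
MONTHLY_STORAGE = 0.21      # EFS + S3, from the cost table

per_message = OPENAI_PER_MINUTE * AVG_MESSAGE_SECONDS / 60  # $0.0015 per message
break_even_messages = MONTHLY_STORAGE / per_message         # 140 messages/month

print(per_message, round(break_even_messages))
```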

&lt;p&gt;But cost isn't the only reason to self-host:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Geographic availability&lt;/strong&gt;: OpenAI's API is not available in Hong Kong — HK falls under China's regional restriction. Azure OpenAI does offer Whisper, but &lt;a href="https://learn.microsoft.com/en-us/answers/questions/2237575/new-announced-speech-to-text-models-for-realtime" rel="noopener noreferrer"&gt;only &lt;code&gt;whisper-1&lt;/code&gt; (large-v2 based)&lt;/a&gt; — large-v3 and large-v3-turbo are not available. If you're in HK (or other restricted regions), this approach isn't just cheaper — it's the only option for v3-quality transcription.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cantonese accuracy&lt;/strong&gt;: &lt;code&gt;language=yue&lt;/code&gt; with Whisper large-v3-turbo is noticeably better than the managed API for Cantonese&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt;: audio never leaves your infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No rate limits&lt;/strong&gt;: Lambda scales independently&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Telegram voice message
        ↓
   OpenClaw (gateway)
        ↓
Lambda Function URL (auth via token)
        ↓
Lambda (VPC, 10GB RAM, 900s timeout)
        ↓
EFS /mnt/whisper-models/{model-slug}
        ↓
faster-whisper (CTranslate2, INT8)
        ↓
    Transcript
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lambda configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory: 10,240 MB — actual usage is &lt;strong&gt;~2.2GB&lt;/strong&gt; (INT8 model), but &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-memory.html" rel="noopener noreferrer"&gt;Lambda allocates CPU proportional to memory&lt;/a&gt;. 10GB gives ~6 vCPUs vs ~2.3 vCPUs at 4GB, cutting warm transcription from ~16s to ~10s. You're paying for CPU, not RAM.&lt;/li&gt;
&lt;li&gt;Timeout: 900s (handles long audio files)&lt;/li&gt;
&lt;li&gt;VPC: Default VPC (no NAT Gateway)&lt;/li&gt;
&lt;li&gt;EFS: Mounted at &lt;code&gt;/mnt/whisper-models&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
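&lt;p&gt;The vCPU figures follow from Lambda's documented allocation of roughly one full vCPU per 1,769 MB of configured memory:&lt;/p&gt;

```python
FULL_VCPU_MB = 1769  # Lambda grants ~1 full vCPU per 1,769 MB configured

vcpus = {mb: mb / FULL_VCPU_MB for mb in (4096, 6144, 8192, 10240)}
for mb, v in vcpus.items():
    print(f'{mb} MB -> ~{v:.1f} vCPUs')
```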

&lt;p&gt;&lt;strong&gt;Memory vs. cost trade-off (tested, 3 runs each):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Cold Start&lt;/th&gt;
&lt;th&gt;Warm (2.5s audio)&lt;/th&gt;
&lt;th&gt;GB-seconds/invocation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4,096 MB&lt;/td&gt;
&lt;td&gt;~30s&lt;/td&gt;
&lt;td&gt;~21s&lt;/td&gt;
&lt;td&gt;84 (~$0.00140)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6,144 MB&lt;/td&gt;
&lt;td&gt;~25s&lt;/td&gt;
&lt;td&gt;~16s&lt;/td&gt;
&lt;td&gt;96 (~$0.00160)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8,192 MB&lt;/td&gt;
&lt;td&gt;~24s&lt;/td&gt;
&lt;td&gt;~18s&lt;/td&gt;
&lt;td&gt;144 (~$0.00240)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,240 MB&lt;/td&gt;
&lt;td&gt;~22s&lt;/td&gt;
&lt;td&gt;~15s&lt;/td&gt;
&lt;td&gt;150 (~$0.00250)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cold start is ~20-30s across all configs — it's EFS I/O bound, not CPU bound, so more memory doesn't help much here. Warm inference time does scale with memory (more vCPUs = faster CTranslate2 decoding). Interestingly, 4GB is the cheapest per invocation — the warm time savings at higher memory don't offset the extra GB-seconds. Within the free tier, cost differences are negligible regardless.&lt;/p&gt;
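&lt;p&gt;Recomputing the table's last column from the warm timings shows why the smallest config wins on per-invocation cost:&lt;/p&gt;

```python
PRICE_PER_GB_SECOND = 0.0000166667                        # x86 Lambda rate
warm_seconds = {4096: 21, 6144: 16, 8192: 18, 10240: 15}  # MB -> seconds, from the table

costs = {mb: (mb / 1024) * secs * PRICE_PER_GB_SECOND for mb, secs in warm_seconds.items()}
for mb in sorted(costs):
    print(f'{mb} MB: ~${costs[mb]:.5f}/warm invocation')

print('cheapest:', min(costs, key=costs.get), 'MB')
```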




&lt;h2&gt;
  
  
  API Compatibility
&lt;/h2&gt;

&lt;p&gt;The adaptor exposes two endpoints so it works as a drop-in replacement for existing integrations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI compatible&lt;/strong&gt; (&lt;code&gt;/v1/audio/transcriptions&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://&amp;lt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nt"&gt;-url&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/v1/audio/transcriptions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Token &amp;lt;secret&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"file=@audio.ogg"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"language=yue"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transcript here"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deepgram compatible&lt;/strong&gt; (&lt;code&gt;/v1/listen&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://&amp;lt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nt"&gt;-url&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/v1/listen?language&lt;span class="o"&gt;=&lt;/span&gt;yue &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Token &amp;lt;secret&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: audio/ogg"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-binary&lt;/span&gt; @audio.ogg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
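&lt;p&gt;Dispatching between the two endpoints is just a path check on the Function URL event. A sketch (handler names are illustrative; note that query parameters arrive separately in &lt;code&gt;rawQueryString&lt;/code&gt;, not in &lt;code&gt;rawPath&lt;/code&gt;):&lt;/p&gt;

```python
def route(raw_path: str) -> str:
    """Map a Function URL rawPath to a handler name (illustrative)."""
    if raw_path == '/v1/audio/transcriptions':
        return 'openai_transcription'  # multipart form body
    if raw_path == '/v1/listen':
        return 'deepgram_listen'       # raw binary body
    return 'not_found'

print(route('/v1/listen'))  # -> deepgram_listen
```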






&lt;h2&gt;
  
  
  Model Management API
&lt;/h2&gt;

&lt;p&gt;Once you've synced multiple models to EFS, there's no SSH access to inspect the filesystem or clean it up. I added two non-standard endpoints for that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;List models on EFS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://&amp;lt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nt"&gt;-url&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/v1/models &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Token &amp;lt;secret&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"list"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai/whisper-large-v3-turbo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"owned_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Systran/faster-distil-whisper-large-v3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"owned_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Systran"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Delete a model from EFS&lt;/strong&gt; (the currently loaded model returns 409):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; DELETE https://&amp;lt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nt"&gt;-url&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/v1/models/Systran/faster-distil-whisper-large-v3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Token &amp;lt;secret&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Systran/faster-distil-whisper-large-v3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"deleted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slashes in model IDs work naturally — &lt;code&gt;rawPath&lt;/code&gt; preserves the full path, so &lt;code&gt;DELETE /v1/models/openai/whisper-large-v3-turbo&lt;/code&gt; correctly maps to model ID &lt;code&gt;openai/whisper-large-v3-turbo&lt;/code&gt;.&lt;/p&gt;
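&lt;p&gt;A sketch of that mapping (a hypothetical helper, not the adaptor's exact code): strip the route prefix from &lt;code&gt;rawPath&lt;/code&gt; and keep the remainder verbatim, slashes included.&lt;/p&gt;

```python
PREFIX = '/v1/models/'

def model_id_from_path(raw_path: str):
    """Extract a slash-containing model ID from a Function URL rawPath,
    or return None when the path doesn't match the models route."""
    if not raw_path.startswith(PREFIX):
        return None
    return raw_path[len(PREFIX):] or None

print(model_id_from_path('/v1/models/openai/whisper-large-v3-turbo'))
# -> openai/whisper-large-v3-turbo
```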




&lt;h2&gt;
  
  
  Performance Tip: Always Specify Language
&lt;/h2&gt;

&lt;p&gt;When no language is specified, Whisper runs language detection on the first audio chunk — adding noticeable overhead. For a 2.5s voice message:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Request&lt;/th&gt;
&lt;th&gt;Response Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No language (auto-detect)&lt;/td&gt;
&lt;td&gt;~22s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;language=yue&lt;/code&gt; (Cantonese)&lt;/td&gt;
&lt;td&gt;~10s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's a &lt;strong&gt;2x speedup&lt;/strong&gt; just from passing a language hint. Two ways to do it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A — per-request query param&lt;/strong&gt; (recommended, keeps Lambda language-agnostic):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deepgram endpoint&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://&amp;lt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nt"&gt;-url&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/v1/listen?language&lt;span class="o"&gt;=&lt;/span&gt;yue &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Token &amp;lt;secret&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: audio/ogg"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-binary&lt;/span&gt; @audio.ogg

&lt;span class="c"&gt;# OpenAI endpoint&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://&amp;lt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nt"&gt;-url&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/v1/audio/transcriptions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Token &amp;lt;secret&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"file=@audio.ogg"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"language=yue"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option B — Lambda env var&lt;/strong&gt; (simpler if you only ever transcribe one language):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;WHISPER_LANGUAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;yue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I use Option A — the language is set in my OpenClaw config (&lt;code&gt;language: "yue"&lt;/code&gt; in the audio model), which passes it as &lt;code&gt;?language=yue&lt;/code&gt; to the Lambda on every request.&lt;/p&gt;
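&lt;p&gt;The precedence the two options imply fits in one function (a hypothetical helper: per-request value first, env var as the fallback, &lt;code&gt;None&lt;/code&gt; meaning auto-detect):&lt;/p&gt;

```python
import os

def resolve_language(query_params: dict, form_fields: dict):
    """Pick the transcription language, or None to fall back to auto-detect."""
    return (
        query_params.get('language')           # Deepgram-style ?language=yue
        or form_fields.get('language')         # OpenAI-style multipart field
        or os.environ.get('WHISPER_LANGUAGE')  # Option B env var
    )

print(resolve_language({'language': 'yue'}, {}))  # -> yue
```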

&lt;h3&gt;
  
  
  Real-time Factor
&lt;/h3&gt;

&lt;p&gt;Once warm, the Lambda transcribes faster than real-time for typical voice messages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Audio Duration&lt;/th&gt;
&lt;th&gt;Warm Response Time&lt;/th&gt;
&lt;th&gt;Real-time Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2.5s&lt;/td&gt;
&lt;td&gt;~10s&lt;/td&gt;
&lt;td&gt;4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;33s&lt;/td&gt;
&lt;td&gt;~23s&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0.68x&lt;/strong&gt; ✅ faster than real-time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 2.5s result looks slow (4x), but Whisper processes audio in 30-second chunks — the overhead is fixed regardless of audio length. For longer messages, the real-time factor drops well below 1x.&lt;/p&gt;
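&lt;p&gt;A quick back-of-envelope fit of the two data points above makes this concrete: model response time as a fixed overhead plus a per-second processing cost. This is my own illustrative fit, not instrumented from the Lambda itself:&lt;/p&gt;

```python
# Fit response_time = overhead + duration / speed to the two measurements
# in the table above. (Illustrative back-of-envelope, not instrumented.)

def fit_latency_model(points):
    """points: [(audio_seconds, response_seconds), ...] from two runs.
    Solves for the fixed per-invocation overhead and processing speed."""
    (d1, r1), (d2, r2) = points
    speed = (d2 - d1) / (r2 - r1)   # audio seconds transcribed per wall second
    overhead = r1 - d1 / speed      # fixed cost paid on every invocation
    return overhead, speed

overhead, speed = fit_latency_model([(2.5, 10.0), (33.0, 23.0)])
print(f"fixed overhead ~{overhead:.1f}s, processing ~{speed:.1f}x real-time")
```

&lt;p&gt;Under this fit, roughly 9 seconds of each warm invocation is fixed overhead, with the model itself chewing through audio at about 2.3x real-time; the crossover to faster-than-real-time lands around a 15-16 second clip, consistent with the 33s row above.&lt;/p&gt;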




&lt;h2&gt;
  
  
  Open Source
&lt;/h2&gt;

&lt;p&gt;The project is open source at &lt;a href="https://github.com/gabrielkoo/aws-lambda-whisper-adaptor" rel="noopener noreferrer"&gt;gabrielkoo/aws-lambda-whisper-adaptor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Key features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any &lt;a href="https://huggingface.co/models?search=faster-whisper" rel="noopener noreferrer"&gt;faster-whisper&lt;/a&gt; model via &lt;code&gt;HF_MODEL_REPO&lt;/code&gt; env var&lt;/li&gt;
&lt;li&gt;GitHub Actions workflow to sync models from HuggingFace → S3 (&lt;code&gt;quantization=int8&lt;/code&gt; for HF-format models, &lt;code&gt;quantization=none&lt;/code&gt; for pre-converted CTranslate2 models)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /v1/models&lt;/code&gt; — list all models currently on EFS&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DELETE /v1/models/{owner}/{model}&lt;/code&gt; — remove a model from EFS on demand&lt;/li&gt;
&lt;li&gt;Pre-built Docker image: &lt;code&gt;ghcr.io/gabrielkoo/aws-lambda-whisper-adaptor:latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Configurable language detection via &lt;code&gt;WHISPER_LANGUAGE&lt;/code&gt; env var or per-request parameter&lt;/li&gt;
&lt;/ul&gt;
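&lt;p&gt;The two management endpoints are easy to drive from a script. Here's a minimal sketch using only the standard library; the base URL and token are placeholders for your own deployment, and the commented &lt;code&gt;urlopen&lt;/code&gt; call is what actually sends the request:&lt;/p&gt;

```python
# Sketch: driving the adaptor's model-management endpoints from a script.
# FUNCTION_URL and API_SECRET are placeholders for your own deployment.
import urllib.request

FUNCTION_URL = "https://example.lambda-url.us-east-1.on.aws"  # placeholder
API_SECRET = "replace-me"                                     # placeholder

def build_request(method: str, path: str) -> urllib.request.Request:
    """Build an authenticated request against the adaptor's API."""
    return urllib.request.Request(
        FUNCTION_URL + path,
        method=method,
        headers={"Authorization": f"Token {API_SECRET}"},
    )

# GET /v1/models — list models currently cached on EFS
list_req = build_request("GET", "/v1/models")

# DELETE /v1/models/{owner}/{model} — evict a model to reclaim EFS space
evict_req = build_request(
    "DELETE", "/v1/models/Zoont/faster-whisper-large-v3-turbo-int8-ct2"
)

# To actually send one: urllib.request.urlopen(list_req).read()
print(list_req.get_method(), list_req.full_url)
```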




&lt;h2&gt;
  
  
  Pre-warming
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;For OpenClaw voice prompts:&lt;/strong&gt; the ~20-30s cold start is often negligible in practice — if you're asking the agent to run a multi-step job, it'll take a few minutes anyway. Pre-warming only matters if you need the very first transcription to be fast.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cold starts happen when Lambda hasn't been invoked recently. For predictable usage patterns (e.g. a morning standup bot), pre-warm the Lambda before you need it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# prewarm.sh — trigger Lambda init before expected usage&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$WHISPER_LAMBDA_URL&lt;/span&gt;&lt;span class="s2"&gt;/v1/listen?language=yue"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Token &lt;/span&gt;&lt;span class="nv"&gt;$WHISPER_API_SECRET&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: audio/ogg"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-binary&lt;/span&gt; @sample.ogg
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Lambda pre-warmed"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Schedule with cron: &lt;code&gt;0 8 * * * /path/to/prewarm.sh&lt;/code&gt; (runs at 8am daily).&lt;/p&gt;

&lt;p&gt;Alternatively, use an EventBridge rule to ping the Lambda every few minutes — though at that frequency, Provisioned Concurrency starts making more sense cost-wise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Lambda + EFS + S3 architecture achieves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~20-30s cold start&lt;/strong&gt; (INT8 model on EFS); first-ever bootstrap from S3 takes ~55s (one-time only)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~10s warm invocations&lt;/strong&gt; with &lt;code&gt;language=yue&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~$0.21/month&lt;/strong&gt; storage cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero idle cost&lt;/strong&gt; (scales to zero)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deepgram- and OpenAI-compatible&lt;/strong&gt; APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: &lt;strong&gt;EFS is the missing piece&lt;/strong&gt;. It provides persistent, fast storage that Lambda can access without a NAT Gateway (using the free S3 VPC Gateway Endpoint for bootstrapping).&lt;/p&gt;

&lt;p&gt;I couldn't find any existing write-up of Whisper on Lambda using EFS for persistent model caching — most approaches either bundle the model in Docker (3-minute cold starts) or re-download from S3 on every cold start (no caching between instances). If you've seen this done before, I'd love to know.&lt;/p&gt;

&lt;p&gt;Two things worth knowing before you deploy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use &lt;code&gt;Zoont/faster-whisper-large-v3-turbo-int8-ct2&lt;/code&gt; with &lt;code&gt;quantization=none&lt;/code&gt; in the sync-model workflow — it's pre-converted to CTranslate2 INT8 and works out of the box (the &lt;code&gt;openai/whisper-large-v3-turbo&lt;/code&gt; model requires conversion and can hit &lt;code&gt;num_mel_bins&lt;/code&gt; config issues)&lt;/li&gt;
&lt;li&gt;Always pass a &lt;code&gt;language&lt;/code&gt; parameter if you know it — cuts response time roughly in half&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're building voice transcription on AWS and want Whisper-quality accuracy without the SageMaker complexity, give it a try.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Using EFS as a persistent model cache follows the same pattern I used earlier for &lt;a href="https://dev.to/aws-builders/scale-a-stateful-streamlit-chatbot-with-aws-ecs-and-efs-48gm"&gt;scaling a stateful Streamlit chatbot with ECS + EFS&lt;/a&gt; — if you're building other stateful workloads on AWS, that one's worth a look too.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>whisper</category>
      <category>openclaw</category>
      <category>aws</category>
      <category>efs</category>
    </item>
    <item>
      <title>I Squeezed My $1k Monthly OpenClaw API Bill with ~$20/Month in AWS Credits — Here's the Exact Setup</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Sat, 21 Feb 2026 14:44:45 +0000</pubDate>
      <link>https://dev.to/aws-builders/i-squeezed-my-1k-monthly-openclaw-api-bill-with-20month-in-aws-credits-heres-the-exact-setup-3gj4</link>
      <guid>https://dev.to/aws-builders/i-squeezed-my-1k-monthly-openclaw-api-bill-with-20month-in-aws-credits-heres-the-exact-setup-3gj4</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 1 of my series on building a low-cost personal AI stack on AWS.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/aws-builders/drop-in-perplexity-sonar-replacement-with-aws-bedrock-nova-grounding-35o9"&gt;Part 2 — Drop-in Perplexity Sonar replacement with AWS Bedrock Nova Grounding&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/aws-builders/from-3-minute-cold-starts-to-20-seconds-whisper-on-aws-lambda-efs-for-openclaw-9c5"&gt;Part 3 — From 3-Minute Cold Starts to ~20 Seconds: Whisper on AWS Lambda + EFS&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;I've got OpenClaw (formerly MoltBot, formerly ClawdBot) running locally on a Raspberry Pi, where compute is scarce; it's gone unresponsive on me more than a few times. But even on constrained hardware, every chat turn, every memory search, every web lookup hits a paid API. The bill is small at first, then it isn't. After a week or two on &lt;code&gt;qwen3-coder-480b&lt;/code&gt;, my daily cost skyrocketed to as much as $50.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Assumption:&lt;/strong&gt; OpenClaw is running on hardware you already own or pay for separately — a Raspberry Pi, home server, or existing cloud instance. The compute cost of the host itself isn't counted in the ~$20/month figure here.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you've picked up AWS Credits from events, the &lt;a href="https://builder.aws.com/content/32g2lQ7kc3Py8kKIYGS15Pe8VSS/aws-community-builders-program" rel="noopener noreferrer"&gt;AWS Community Builder program&lt;/a&gt; ($500/year), or AWS Activate — or if your company prefers to keep spend within AWS rather than onboarding yet another SaaS API provider — there's a way to run the whole OpenClaw stack on credits.&lt;/p&gt;

&lt;p&gt;This is how I did it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: The crux of this hack relies on Amazon Q Developer Pro's undocumented (and generously high) usage ceiling, while it lasts. If it's eventually deprecated, the fallback is a Kiro plan with overage pricing: still covered by AWS Credits, just at a worse cost-per-token ratio.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;Two very different reasons to care about this setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you have AWS Credits to burn:&lt;/strong&gt;&lt;br&gt;
Credits from re:Invent, AWS Community Builder, AWS Activate, or customer programs come with expiry dates. Running your AI assistant stack on them is one of the most practical ways to put idle credits to work — at ~$20/month, $100 in credits covers 5 months of the full stack. If you're sitting on a few hundred dollars with an end-of-year deadline, this is a productive use before they lapse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're in a company with procurement or compliance requirements:&lt;/strong&gt;&lt;br&gt;
Every new SaaS vendor is a TPRM exercise. OpenAI for embeddings, Perplexity for web search, Anthropic for Claude — each one is a separate vendor assessment, a separate DPA, and a separate conversation with your security team. For FSI and regulated industries, that's not just overhead — it can be a blocker.&lt;/p&gt;

&lt;p&gt;AWS is likely already in your vendor register. Consolidating on Bedrock means single billing, fewer third-party relationships to manage, and data residency you control. For anything touching customer data in banking, insurance, or healthcare, that's the difference between a quick internal approval and a 3-month procurement cycle.&lt;/p&gt;


&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS account&lt;/strong&gt; with Bedrock access enabled in &lt;code&gt;us-east-1&lt;/code&gt; (or another US region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS credentials&lt;/strong&gt; — a &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/api-keys.html" rel="noopener noreferrer"&gt;Bedrock API key&lt;/a&gt; is the simplest option if your account supports it. Otherwise, a long-term IAM access key/secret key pair works fine and is easier to manage than SSO. IAM Identity Center is only required for the Q Developer Pro layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.10+&lt;/strong&gt; — used by kiro-gateway, LiteLLM, and the Nova grounding proxy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Q Developer Pro subscription&lt;/strong&gt; ($19/user/month, credit-eligible) — required for Layer 1 (kiro-gateway). Kiro Pro, Pro+, or Power plans also work but are credit-based with overage charges — Q Developer Pro is the better deal.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  What Actually Costs Money in OpenClaw?
&lt;/h2&gt;

&lt;p&gt;Before reaching for solutions, it helps to know exactly where the spend goes. OpenClaw has five distinct cost centers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Main model (LLM)&lt;/strong&gt;&lt;br&gt;
Every chat turn, every agent action, every tool call — all routed through your primary LLM. This is the biggest variable cost. On a busy day it adds up fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Memory search (embeddings)&lt;/strong&gt;&lt;br&gt;
OpenClaw's &lt;code&gt;memory_search&lt;/code&gt; tool converts your memory files into vector embeddings and queries them semantically. Every search = an embedding API call. Low cost per call, but it runs constantly in the background.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Web search&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;web_search&lt;/code&gt; tool hits Perplexity or Brave APIs. Perplexity charges per query on paid plans; Brave gives you $5/month free then charges beyond that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Browser automation&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;browser&lt;/code&gt; tool spins up a Chromium instance for web scraping, form filling, and screenshots. Running a full browser on a low-compute machine (Raspberry Pi, t4g.small) is heavy — and cloud browser options cost per session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Speech-to-text (STT)&lt;/strong&gt;&lt;br&gt;
Voice messages transcribed via your STT provider. OpenAI Whisper API charges per minute of audio — self-hosting on Lambda eliminates this entirely.&lt;/p&gt;
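&lt;p&gt;For a sense of scale, here's a rough estimate at a per-minute hosted Whisper rate. Both the rate ($0.006/minute is the commonly cited figure; check the provider's current pricing page) and the usage numbers are assumptions:&lt;/p&gt;

```python
# Rough monthly STT cost at a per-minute API rate. The rate and usage
# figures are assumptions — verify against current provider pricing.
RATE_PER_MINUTE = 0.006      # USD/min, hosted Whisper API (assumed)
messages_per_day = 30
avg_message_minutes = 0.5    # ~30-second voice notes

monthly_minutes = messages_per_day * avg_message_minutes * 30
monthly_cost = monthly_minutes * RATE_PER_MINUTE
print(f"{monthly_minutes:.0f} min/month -> ${monthly_cost:.2f}")
```

&lt;p&gt;A few dollars a month, not a budget-breaker on its own — but it's a recurring third-party dependency, and self-hosting drops it to near-zero.&lt;/p&gt;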

&lt;p&gt;That's it. Five layers. The goal: drive variable cost to zero.&lt;/p&gt;


&lt;h2&gt;
  
  
  My Config: All 5 Layers on AWS Credits
&lt;/h2&gt;

&lt;p&gt;Here's the full picture before we go deep:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;th&gt;Credit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Main model&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/jwadow/kiro-gateway" rel="noopener noreferrer"&gt;kiro-gateway&lt;/a&gt; → Amazon Q Developer Pro&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Jwadow" rel="noopener noreferrer"&gt;@Jwadow&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory search&lt;/td&gt;
&lt;td&gt;Native Bedrock embeddings via &lt;a href="https://github.com/openclaw/openclaw/pull/20191" rel="noopener noreferrer"&gt;PR #20191&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gabrielkoo" rel="noopener noreferrer"&gt;@gabrielkoo&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web search&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/gabrielkoo/bedrock-web-search-proxy" rel="noopener noreferrer"&gt;bedrock-web-search-proxy&lt;/a&gt; — Nova Grounding as Perplexity drop-in&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gabrielkoo" rel="noopener noreferrer"&gt;@gabrielkoo&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browser&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/vercel-labs/agent-browser/pull/397" rel="noopener noreferrer"&gt;agent-browser + AgentCore provider&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://x.com/pahudnet" rel="noopener noreferrer"&gt;@pahudnet&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speech-to-text&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/gabrielkoo/aws-lambda-whisper-adaptor" rel="noopener noreferrer"&gt;&lt;code&gt;aws-lambda-whisper-adaptor&lt;/code&gt;&lt;/a&gt; — Whisper on Lambda + EFS&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gabrielkoo" rel="noopener noreferrer"&gt;@gabrielkoo&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three of these I built myself. Two were built by other community members. All five are open source.&lt;/p&gt;


&lt;h2&gt;
  
  
  Layer 1: Main Model + Image Analysis — Kiro CLI — Covered by AWS Credits
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Amazon Q Developer Pro: flat-rate access to Claude
&lt;/h3&gt;

&lt;p&gt;The key difference between Amazon Q Developer Pro and Kiro Pro is the billing model. Kiro Pro is credit-based — 1,000 credits/month, pay more if you exceed them. Amazon Q Developer Pro is a flat monthly subscription: &lt;strong&gt;$19/user/month, no per-token billing, no surprise overages.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kiro Free&lt;/td&gt;
&lt;td&gt;$0/mo&lt;/td&gt;
&lt;td&gt;50 credits/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kiro Pro&lt;/td&gt;
&lt;td&gt;$20/mo&lt;/td&gt;
&lt;td&gt;1,000 credits + $0.04/credit overage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kiro Pro+&lt;/td&gt;
&lt;td&gt;$40/mo&lt;/td&gt;
&lt;td&gt;2,000 credits + $0.04/credit overage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kiro Power&lt;/td&gt;
&lt;td&gt;$200/mo&lt;/td&gt;
&lt;td&gt;10,000 credits + $0.04/credit overage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Q Developer Pro (legacy)&lt;/td&gt;
&lt;td&gt;$19/user/mo&lt;/td&gt;
&lt;td&gt;Flat-rate, not credit-capped&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Amazon Q Developer Pro is now a legacy plan in the Kiro ecosystem. AWS has stopped allowing new Builder ID subscriptions to Q Developer Pro — new users can only subscribe through Kiro plans. The undocumented usage limits on Q Pro are likely part of why AWS made this transition. If you're already on Q Developer Pro, you retain access and it remains the better deal for OpenClaw.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your Q Developer Pro subscription grants access to &lt;code&gt;kiro-cli&lt;/code&gt;. The documented quota is &lt;a href="https://docs.aws.amazon.com/general/latest/gr/amazonqdev.html" rel="noopener noreferrer"&gt;10,000 inference calls/month&lt;/a&gt; — for a personal AI assistant, that's more than enough.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Real-world cost check:&lt;/strong&gt; In 4 days of active OpenClaw usage after switching to kiro-gateway, I consumed ~40M input tokens and ~865K output tokens with Claude Sonnet. OpenClaw loads memory files, system prompts, and tool results into every turn — the context window fills up fast. At standard Bedrock pricing ($3/1M input, $15/1M output), that's ~$135 for 4 days, or roughly &lt;strong&gt;$1,000/month&lt;/strong&gt;. Q Developer Pro covers all of it for $19/month flat.&lt;/p&gt;
&lt;/blockquote&gt;
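&lt;p&gt;The arithmetic behind that estimate checks out — about $133 for the 4 days, or just under $1,000/month extrapolated:&lt;/p&gt;

```python
# Verify the back-of-envelope in the quote above, using the standard
# Bedrock pay-per-token rates for Claude Sonnet stated in the post.
input_tokens = 40_000_000
output_tokens = 865_000
input_rate = 3 / 1_000_000    # $3 per 1M input tokens
output_rate = 15 / 1_000_000  # $15 per 1M output tokens

four_day_cost = input_tokens * input_rate + output_tokens * output_rate
monthly_cost = four_day_cost / 4 * 30
print(f"4 days: ${four_day_cost:.0f}, extrapolated month: ${monthly_cost:.0f}")
```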

&lt;p&gt;In practice, I've been running Kiro CLI with OpenClaw daily and haven't hit any rate limits. One caveat: the &lt;code&gt;/usage&lt;/code&gt; command isn't available under the Q Developer Pro plan, so you have to monitor consumption via the AWS console instead. Oddly, after several days of running OpenClaw through kiro-gateway, the Q Developer usage metrics in the console hadn't moved at all. It's unclear whether Kiro CLI usage counts against the same quota as Q Developer's agentic requests or is tracked separately. The &lt;a href="https://aws.amazon.com/q/developer/pricing/" rel="noopener noreferrer"&gt;Amazon Q Developer pricing page&lt;/a&gt; only states "Included (with limits)" for the Pro tier, with no specifics on what those limits are or how Kiro CLI calls are metered.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Q Developer Pro requires &lt;a href="https://docs.aws.amazon.com/singlesignon/latest/userguide/what-is.html" rel="noopener noreferrer"&gt;AWS IAM Identity Center&lt;/a&gt; (SSO) — you can't use it with a free Builder ID. If you're already set up with Identity Center (common in enterprise teams and AWS Community Builders with corporate accounts), you're good to go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Standard AWS Credits don't cover per-token Claude usage via Anthropic's marketplace agreement. But the Q Developer Pro subscription fee itself &lt;strong&gt;is&lt;/strong&gt; credit-eligible — making the whole stack fundable with AWS credits. Kiro's flat-rate subscription is currently the only practical way to run Claude in OpenClaw without per-token billing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New AWS accounts:&lt;/strong&gt; Even if you'd prefer to pay per-token via direct Bedrock API, new accounts often come with &lt;a href="https://dev.to/aws-builders/ultra-low-bedrock-llm-rate-limits-for-new-aws-accounts-time-to-wake-up-your-inactive-aws-accounts-3no0"&gt;ultra-low default rate limits&lt;/a&gt; that can't reliably serve OpenClaw — even when you're willing to pay. The flat-rate Q Developer Pro route sidesteps this entirely.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  kiro-gateway: the bridge
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/jwadow/kiro-gateway" rel="noopener noreferrer"&gt;kiro-gateway&lt;/a&gt; — built by &lt;a href="https://github.com/Jwadow" rel="noopener noreferrer"&gt;@Jwadow&lt;/a&gt; — wraps Kiro CLI and exposes OpenAI-compatible and Anthropic-compatible API endpoints. OpenClaw talks to it like any other provider.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/jwadow/kiro-gateway
&lt;span class="nb"&gt;cd &lt;/span&gt;kiro-gateway
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edit &lt;code&gt;.env&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PROXY_API_KEY="your-secret-key"
KIRO_CREDS_FILE="~/.aws/sso/cache/kiro-auth-token.json"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;kiro-cli login&lt;/code&gt; once to authenticate — this populates &lt;code&gt;KIRO_CREDS_FILE&lt;/code&gt; automatically. (&lt;code&gt;kiro-cli&lt;/code&gt; is only needed for this initial login; &lt;code&gt;kiro-gateway&lt;/code&gt; reads the token it generates. Re-run if your token expires.) Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py &lt;span class="nt"&gt;--port&lt;/span&gt; 9000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heads up:&lt;/strong&gt; kiro-gateway's hardcoded fallback model list may lag behind new Claude releases. If a model isn't showing up at &lt;code&gt;/v1/models&lt;/code&gt;, add it manually to &lt;code&gt;FALLBACK_MODELS&lt;/code&gt; in &lt;code&gt;kiro/config.py&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Available models via Q Developer Pro:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-sonnet-4.6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;General tasks, coding, writing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-haiku-4.5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fast, lightweight responses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4.6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Complex reasoning, long context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenClaw config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"kiro"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:9000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-secret-key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic-messages"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"defaults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"primary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kiro/claude-sonnet-4.6"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"imageModel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"primary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kiro/claude-sonnet-4.6"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bonus:&lt;/strong&gt; kiro-gateway works with any tool that supports OpenAI or Anthropic APIs — not just OpenClaw. To use it with Claude Code: &lt;code&gt;ANTHROPIC_BASE_URL=http://localhost:9000&lt;/code&gt; and &lt;code&gt;ANTHROPIC_API_KEY=your-secret-key&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Layer 2: Memory Search — Bedrock Embeddings — Covered by AWS Credits
&lt;/h2&gt;

&lt;p&gt;OpenClaw's &lt;code&gt;memory_search&lt;/code&gt; needs an embedding model. &lt;a href="https://docs.aws.amazon.com/nova/latest/userguide/nova-embeddings.html" rel="noopener noreferrer"&gt;Amazon Nova Multimodal Embeddings&lt;/a&gt; costs ~$0.00014 per 1K tokens — fractions of a cent per query, and covered by AWS Credits.&lt;/p&gt;
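&lt;p&gt;At that rate, even heavy memory-search usage amounts to pennies. A back-of-envelope estimate (the per-query token count and search volume are my assumptions, not measured from OpenClaw):&lt;/p&gt;

```python
# Embedding cost at ~$0.00014 per 1K tokens (rate from the Nova docs
# cited above); per-query token counts are illustrative assumptions.
RATE_PER_1K_TOKENS = 0.00014
searches_per_day = 200
tokens_per_search = 500      # query text plus matched memory chunks

monthly_tokens = searches_per_day * tokens_per_search * 30
monthly_cost = monthly_tokens / 1000 * RATE_PER_1K_TOKENS
print(f"{monthly_tokens:,} tokens/month -> ${monthly_cost:.2f}")
```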

&lt;p&gt;OpenClaw's native Bedrock provider doesn't wire up embeddings cleanly yet. &lt;a href="https://github.com/openclaw/openclaw/pull/24892" rel="noopener noreferrer"&gt;PR #24892&lt;/a&gt; is pending merge (it supersedes my earlier &lt;a href="https://github.com/openclaw/openclaw/pull/20191" rel="noopener noreferrer"&gt;PR #20191&lt;/a&gt;, where I made a novice mistake). Until then, you'll need a local OpenAI-compatible proxy in front of Bedrock. Two options:&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A: LiteLLM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# litellm_config.yaml&lt;/span&gt;
&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nova-2-multimodal-embeddings-v1.0&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock/amazon.nova-2-multimodal-embeddings-v1:0&lt;/span&gt;
      &lt;span class="na"&gt;aws_region_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;drop_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;master_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-only"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'litellm[proxy]'&lt;/span&gt;
litellm &lt;span class="nt"&gt;--config&lt;/span&gt; litellm_config.yaml &lt;span class="nt"&gt;--port&lt;/span&gt; 4000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"memorySearch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"remote"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"baseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:4000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"local-only"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nova-2-multimodal-embeddings-v1.0"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option B: bedrock-access-gateway-function-url (serverless, no fixed cost)
&lt;/h3&gt;

&lt;p&gt;My own fork of the original &lt;code&gt;bedrock-access-gateway&lt;/code&gt; — deployed as a Lambda Function URL instead of ALB+Fargate, so there's no $16+/month fixed cost. Full writeup: &lt;a href="https://dev.to/aws-builders/use-amazon-bedrock-models-via-an-openai-api-compatible-serverless-endpoint-now-without-fixed-cost-5hf5"&gt;Use Amazon Bedrock Models with OpenAI SDKs with a Serverless Proxy Endpoint&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; My &lt;a href="https://github.com/aws-samples/bedrock-access-gateway/pull/222" rel="noopener noreferrer"&gt;PR #222&lt;/a&gt; for Nova 2 embedding support against the original &lt;code&gt;bedrock-access-gateway&lt;/code&gt; project has been merged — so my fork pulls from this upstream automatically via &lt;code&gt;prepare_source.sh&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 https://github.com/gabrielkoo/bedrock-access-gateway-function-url
&lt;span class="nb"&gt;cd &lt;/span&gt;bedrock-access-gateway-function-url
./prepare_source.sh
sam build
sam deploy &lt;span class="nt"&gt;--guided&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grab the &lt;code&gt;FunctionUrl&lt;/code&gt; output after deploy, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"memorySearch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"remote"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"baseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;your-function-url&amp;gt;.lambda-url.us-east-1.on.aws"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-api-key"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"amazon.nova-2-multimodal-embeddings-v1:0"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Region note:&lt;/strong&gt; &lt;code&gt;amazon.nova-2-multimodal-embeddings-v1:0&lt;/code&gt; availability varies — check the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/models-regions.html" rel="noopener noreferrer"&gt;Bedrock model availability page&lt;/a&gt;. Make sure your IAM credentials have &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; in your target region.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once &lt;a href="https://github.com/openclaw/openclaw/pull/24892" rel="noopener noreferrer"&gt;PR #24892&lt;/a&gt; merges, no proxy needed — the config simplifies to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"memorySearch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bedrock"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"amazon.nova-2-multimodal-embeddings-v1:0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Layer 3: Web Search — Nova Grounding Proxy — Covered by AWS Credits
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/gabrielkoo/bedrock-web-search-proxy" rel="noopener noreferrer"&gt;&lt;code&gt;bedrock-web-search-proxy&lt;/code&gt;&lt;/a&gt; — a FastAPI wrapper that makes Bedrock Nova Grounding look like the Perplexity Sonar API. No Perplexity or Brave API key needed. Runs entirely on AWS Credits.&lt;/p&gt;

&lt;p&gt;Full writeup: &lt;a href="https://dev.to/aws-builders/drop-in-perplexity-sonar-replacement-with-aws-bedrock-nova-grounding-35o9"&gt;Drop-in Perplexity Sonar Replacement with AWS Bedrock Nova Grounding&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A: Run locally
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/gabrielkoo/bedrock-web-search-proxy
&lt;span class="nb"&gt;cd &lt;/span&gt;bedrock-web-search-proxy
pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn boto3
uvicorn main:app &lt;span class="nt"&gt;--port&lt;/span&gt; 7000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option B: Lambda Function URL (zero idle cost)
&lt;/h3&gt;

&lt;p&gt;See the &lt;a href="https://github.com/gabrielkoo/bedrock-web-search-proxy" rel="noopener noreferrer"&gt;deployment guide in the repo&lt;/a&gt; — SAM-based, arm64, python3.13. Once deployed, you get a persistent HTTPS endpoint with no local process to manage.&lt;/p&gt;

&lt;p&gt;OpenClaw config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"web"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"search"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"perplexity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"perplexity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-proxy-key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"baseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:7000/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sonar-pro"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;All US Nova CRIS (cross-Region inference) profiles support web grounding (&lt;code&gt;us.amazon.nova-premier-v1:0&lt;/code&gt;, &lt;code&gt;us.amazon.nova-pro-v1:0&lt;/code&gt;, etc.). Native model IDs without the &lt;code&gt;us.&lt;/code&gt; prefix do NOT work; you must use the CRIS profiles. Web grounding is available in US regions only (us-east-1, us-east-2, us-west-2).&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Layer 4: Cloud Browser — Bedrock AgentCore — Covered by AWS Credits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/vercel-labs/agent-browser" rel="noopener noreferrer"&gt;&lt;code&gt;agent-browser&lt;/code&gt;&lt;/a&gt; by Vercel Labs, with the AgentCore provider contributed by &lt;a href="https://github.com/pahud" rel="noopener noreferrer"&gt;Pahud Hsieh&lt;/a&gt; (&lt;a href="https://x.com/pahudnet" rel="noopener noreferrer"&gt;@pahudnet&lt;/a&gt;) — &lt;a href="https://github.com/vercel-labs/agent-browser/pull/397" rel="noopener noreferrer"&gt;PR #397&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The browser runs in AWS — no local Chromium needed. Particularly useful on low-compute instances (Pi, t4g.small) where running a local browser would be too heavy. Covered by AWS Credits.&lt;/p&gt;

&lt;p&gt;Node.js and pnpm required. Since &lt;a href="https://github.com/vercel-labs/agent-browser/pull/397" rel="noopener noreferrer"&gt;PR #397&lt;/a&gt; isn't merged yet, check out the branch directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/vercel-labs/agent-browser
&lt;span class="nb"&gt;cd &lt;/span&gt;agent-browser
git fetch origin pull/397/head:agentcore
git checkout agentcore
pnpm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pnpm build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then use it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-browser &lt;span class="nt"&gt;-p&lt;/span&gt; agentcore open https://example.com
agent-browser close
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your AWS identity needs these IAM permissions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;bedrock-agentcore:StartBrowserSession&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bedrock-agentcore:ConnectBrowserAutomationStream&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bedrock-agentcore:StopBrowserSession&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
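&lt;p&gt;Expressed as a minimal identity-based policy, that list looks like this (the &lt;code&gt;Resource&lt;/code&gt; wildcard is illustrative only; scope it down to your account and region in practice):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock-agentcore:StartBrowserSession",
        "bedrock-agentcore:ConnectBrowserAutomationStream",
        "bedrock-agentcore:StopBrowserSession"
      ],
      "Resource": "*"
    }
  ]
}
```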

&lt;blockquote&gt;
&lt;p&gt;On a desktop machine with enough RAM, local CDP (OpenClaw's built-in browser) is free and works fine. AgentCore is the play for headless/low-compute setups.&lt;/p&gt;
&lt;/blockquote&gt;







&lt;h2&gt;
  
  
  Layer 5: Speech-to-Text — Whisper on Lambda — Covered by AWS Credits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/gabrielkoo/aws-lambda-whisper-adaptor" rel="noopener noreferrer"&gt;&lt;code&gt;aws-lambda-whisper-adaptor&lt;/code&gt;&lt;/a&gt; — self-hosted &lt;a href="https://github.com/SYSTRAN/faster-whisper" rel="noopener noreferrer"&gt;faster-whisper&lt;/a&gt; on AWS Lambda, with Deepgram-compatible and OpenAI-compatible transcription endpoints. EFS-backed model storage, pay-per-use, scales to zero.&lt;/p&gt;

&lt;p&gt;Full setup guide: &lt;a href="https://dev.to/aws-builders/from-3-minute-cold-starts-to-20-seconds-whisper-on-aws-lambda-efs-for-openclaw-9c5"&gt;From 3-Minute Cold Starts to ~20 Seconds: Whisper on AWS Lambda + EFS&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;Use the pre-built image — no build step needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull ghcr.io/gabrielkoo/aws-lambda-whisper-adaptor:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use this image URI when creating your Lambda function. The repo includes a SAM template for the full VPC + EFS setup.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lambda runs in VPC for EFS access — no NAT Gateway needed (free S3 VPC Gateway Endpoint). Cold start is ~20–30s on first invocation after a model download; subsequent calls are fast.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Cost Math
&lt;/h2&gt;

&lt;p&gt;Without this setup, Claude Sonnet alone runs ~&lt;strong&gt;$1,000/month&lt;/strong&gt; at standard Bedrock pricing — based on real token usage from my own sessions. OpenClaw's large context window (memory files, system prompts, tool results loaded every turn) means the token bill compounds fast.&lt;/p&gt;

&lt;p&gt;The full stack with this setup runs at &lt;strong&gt;~$20/month&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$19/mo&lt;/strong&gt; — Amazon Q Developer Pro (flat-rate, covers all LLM calls)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;≤$1/mo&lt;/strong&gt; — Bedrock embeddings for memory search (Nova 2 at $0.00014/1K tokens)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Web search, browser automation, and speech-to-text are covered by AWS Credits — no separate line item.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;$100 in AWS Credits&lt;/strong&gt;, you cover roughly &lt;strong&gt;5 months&lt;/strong&gt; of the full stack. Both the Q Developer Pro subscription and Bedrock embeddings are credit-eligible — if you're an AWS Community Builder, that $500/year allocation more than covers it.&lt;/p&gt;
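&lt;p&gt;A quick back-of-envelope check of those numbers (the 5M-token monthly embedding volume is an illustrative assumption, not measured usage):&lt;/p&gt;

```python
# Sanity-check the monthly cost figures quoted above.
# The embedding token volume is an assumed, illustrative number.
q_developer_pro = 19.00              # USD/month, flat rate
embed_price_per_1k_tokens = 0.00014  # USD, Nova 2 embeddings
embed_tokens_per_month = 5_000_000   # assumed memory-search volume

embed_cost = embed_tokens_per_month / 1_000 * embed_price_per_1k_tokens
monthly_total = q_developer_pro + embed_cost
months_covered = 100 / monthly_total

print(f"embeddings: ${embed_cost:.2f}/mo")        # $0.70/mo
print(f"stack total: ${monthly_total:.2f}/mo")    # $19.70/mo
print(f"$100 covers ~{months_covered:.1f} months")
```

&lt;p&gt;Even at a few million embedding tokens a month, the embeddings stay well under the $1 line item, so the flat $19 Q Developer Pro fee dominates.&lt;/p&gt;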

&lt;h3&gt;
  
  
  Where AWS Credits Come From
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS event participant/speaker&lt;/strong&gt; — re:Invent, Summit, local user groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Community Builder&lt;/strong&gt; — $500/year for active builders (&lt;a href="https://builder.aws.com/content/32g2lQ7kc3Py8kKIYGS15Pe8VSS/aws-community-builders-program" rel="noopener noreferrer"&gt;builder.aws.com&lt;/a&gt;). The application opens a few rounds per year — I'm one of the builders in the program.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Customer Council&lt;/strong&gt; — participation typically includes credits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Activate&lt;/strong&gt; (startups) — up to $100K&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Educate / Academy&lt;/strong&gt; — educators and students&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check your balance: &lt;a href="https://console.aws.amazon.com/billing/home#/credits" rel="noopener noreferrer"&gt;console.aws.amazon.com/billing/home#/credits&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Five layers. Two built by community members, three I built myself. All open source, all running on AWS Credits.&lt;/p&gt;

&lt;p&gt;To be clear: &lt;strong&gt;kiro-gateway is the most crucial piece here.&lt;/strong&gt; &lt;a href="https://github.com/Jwadow" rel="noopener noreferrer"&gt;@Jwadow&lt;/a&gt; built the bridge that makes Claude accessible without per-token billing — I built the embedding proxy, web search proxy, and Whisper gateway to fill the remaining gaps. &lt;a href="https://github.com/gabrielkoo/bedrock-web-search-proxy" rel="noopener noreferrer"&gt;Web search&lt;/a&gt; and &lt;a href="https://github.com/vercel-labs/agent-browser" rel="noopener noreferrer"&gt;cloud browser&lt;/a&gt; (Layers 3 and 4) need no subscription at all: their per-token Bedrock usage is well covered by AWS Credits.&lt;/p&gt;

&lt;p&gt;If you're already an AWS Community Builder or have credits sitting in your account, there's no reason to be paying per-token for a personal AI assistant. Wire it up once, and the stack runs itself.&lt;/p&gt;

&lt;p&gt;Put those credits to work.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bedrock</category>
      <category>openclaw</category>
      <category>kiro</category>
    </item>
    <item>
      <title>Drop-in Perplexity Sonar Replacement with AWS Bedrock Nova Grounding</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Fri, 20 Feb 2026 16:26:37 +0000</pubDate>
      <link>https://dev.to/aws-builders/drop-in-perplexity-sonar-replacement-with-aws-bedrock-nova-grounding-35o9</link>
      <guid>https://dev.to/aws-builders/drop-in-perplexity-sonar-replacement-with-aws-bedrock-nova-grounding-35o9</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of my series on building a low-cost personal AI stack on AWS.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/aws-builders/i-squeezed-my-1k-monthly-openclaw-api-bill-with-20month-in-aws-credits-heres-the-exact-setup-3gj4"&gt;Part 1 — I Squeezed My $1k Monthly OpenClaw API Bill with ~$20/Month in AWS Credits&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/aws-builders/from-3-minute-cold-starts-to-20-seconds-whisper-on-aws-lambda-efs-for-openclaw-9c5"&gt;Part 3 — From 3-Minute Cold Starts to ~20 Seconds: Whisper on AWS Lambda + EFS&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;If you're running an AI assistant or agent framework that uses Perplexity's Sonar API for web search, you're paying per query — or burning through your monthly credit allocation faster than you'd like.&lt;/p&gt;

&lt;p&gt;I'm on Perplexity Pro, which comes with $5/month in API credits. Sounds fine until you hit mid-month and realize OpenClaw has quietly burned through all of it. I wanted something uncapped that didn't add another bill. If you're an AWS user with any credits sitting around — that $25 from a workshop, an event promo, or re:Invent swag — there's a better option: route those queries through Amazon Bedrock's Nova Premier grounding instead.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/gabrielkoo/bedrock-web-search-proxy" rel="noopener noreferrer"&gt;&lt;code&gt;bedrock-web-search-proxy&lt;/code&gt;&lt;/a&gt;, a FastAPI proxy that makes Bedrock Nova Premier look exactly like the Perplexity Sonar API. Change one URL, keep everything else the same.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Nova Grounding?
&lt;/h2&gt;

&lt;p&gt;Amazon Nova Premier supports a &lt;code&gt;nova_grounding&lt;/code&gt; system tool that lets the model search the web in real-time and return answers with citations — similar to Perplexity Sonar. The difference: it runs on Bedrock, so it counts against your AWS credits rather than a separate Perplexity subscription.&lt;/p&gt;
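&lt;p&gt;Mechanically, grounding is enabled by declaring the system tool in the Converse request's &lt;code&gt;toolConfig&lt;/code&gt;. A sketch of the request body (the model ID, e.g. &lt;code&gt;us.amazon.nova-premier-v1:0&lt;/code&gt;, is passed separately as the Converse &lt;code&gt;modelId&lt;/code&gt; parameter; check the Nova documentation for the current schema):&lt;/p&gt;

```json
{
  "messages": [
    {
      "role": "user",
      "content": [{ "text": "What is the Bitcoin price right now?" }]
    }
  ],
  "toolConfig": {
    "tools": [
      { "systemTool": { "name": "nova_grounding" } }
    ]
  }
}
```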
&lt;h2&gt;
  
  
  Why Not Just Use Brave Search's Free Tier?
&lt;/h2&gt;

&lt;p&gt;Brave does have an AI Answers API that returns synthesized answers with citations — similar to Perplexity. Two catches though:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Credit card required&lt;/strong&gt; — even the $5/month free tier needs a card on file as an anti-fraud measure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Undocumented model&lt;/strong&gt; — Brave doesn't clearly disclose which LLM powers the answers, so you're trusting a black box&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With Nova grounding, you know exactly what's running (Nova Premier on Bedrock), and it counts against AWS credits you likely already have. No new billing relationship, no mystery model.&lt;/p&gt;
&lt;h2&gt;
  
  
  Apps That Use Perplexity API
&lt;/h2&gt;

&lt;p&gt;The wrapper is a drop-in for any app that supports Perplexity as a provider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw&lt;/strong&gt; — &lt;code&gt;tools.web.search.perplexity.baseUrl&lt;/code&gt; config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI&lt;/strong&gt; — web search integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LibreChat&lt;/strong&gt; — via Perplexity MCP server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cursor&lt;/strong&gt; — Perplexity MCP for web research&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continue.dev&lt;/strong&gt; — Sonar models for codebase context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AnythingLLM&lt;/strong&gt; — Perplexity as cloud LLM provider&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM&lt;/strong&gt; — web search interception&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Proof It's Actually Grounded (Not Hallucinated)
&lt;/h2&gt;

&lt;p&gt;Here's a direct API call asking for the current Bitcoin price:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:7000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "nova-premier-web-grounding",
    "messages": [{"role": "user", "content": "What is the Bitcoin price right now?"}],
    "max_tokens": 200
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The current price of Bitcoin (BTC) is $67,254.57 USD, reflecting a 0.54% increase in the last 24 hours. Last updated February 20, 2026 at 04:34 UTC."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"citations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://www.latestly.com/technology/bitcoin-price-today-february-20-2026-btc-price-at-usd-67243-up-compared-to-yesterdays-usd-66941-mark-7321498.html"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://www.binance.com/en/price/bitcoin"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The citation URL contains today's date in the slug. Not hallucinated — Nova Premier actually fetched and synthesized live web content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Install and run (one line, no cloning needed)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx &lt;span class="nt"&gt;--from&lt;/span&gt; git+https://github.com/gabrielkoo/bedrock-web-search-proxy bedrock-web-search-proxy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or run directly from the raw script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run https://raw.githubusercontent.com/gabrielkoo/bedrock-web-search-proxy/main/main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both require &lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt; and AWS credentials with &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; on &lt;code&gt;us.amazon.nova-premier-v1:0&lt;/code&gt;. Region defaults to &lt;code&gt;us-east-1&lt;/code&gt; — override with &lt;code&gt;AWS_DEFAULT_REGION&lt;/code&gt; if needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Configure your app
&lt;/h3&gt;

&lt;p&gt;For &lt;strong&gt;OpenClaw&lt;/strong&gt;, update &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"web"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"search"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"perplexity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"perplexity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"baseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:7000/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nova-grounding"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nova-premier-web-grounding"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ The &lt;code&gt;apiKey&lt;/code&gt; must &lt;strong&gt;not&lt;/strong&gt; be a real &lt;code&gt;pplx-&lt;/code&gt; key — OpenClaw detects that prefix and overrides &lt;code&gt;baseUrl&lt;/code&gt; back to Perplexity's servers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For other apps, just point the Perplexity base URL to &lt;code&gt;http://your-host:7000/v1&lt;/code&gt; and use any model name — the wrapper routes everything to Nova Premier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Aliases
&lt;/h2&gt;

&lt;p&gt;All standard Perplexity model names are accepted and routed to Nova Premier (the only Nova model that currently supports the grounding tool):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Request model&lt;/th&gt;
&lt;th&gt;Bedrock model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nova-premier-web-grounding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.amazon.nova-premier-v1:0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;sonar-pro&lt;/code&gt;, &lt;code&gt;sonar-pro-online&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.amazon.nova-premier-v1:0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;sonar&lt;/code&gt;, &lt;code&gt;sonar-mini&lt;/code&gt;, &lt;code&gt;sonar-turbo&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us.amazon.nova-premier-v1:0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
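&lt;p&gt;Since every alias lands on the same Bedrock model, the routing boils down to a dictionary lookup with a fallback. A simplified sketch (not the proxy's actual code):&lt;/p&gt;

```python
# Simplified sketch of the alias routing in the table above;
# not the proxy's actual implementation.
NOVA_PREMIER = "us.amazon.nova-premier-v1:0"

ALIASES = {
    "nova-premier-web-grounding": NOVA_PREMIER,
    "sonar-pro": NOVA_PREMIER,
    "sonar-pro-online": NOVA_PREMIER,
    "sonar": NOVA_PREMIER,
    "sonar-mini": NOVA_PREMIER,
    "sonar-turbo": NOVA_PREMIER,
}

def resolve_model(requested: str) -> str:
    # Unknown names also fall through to Nova Premier, matching the
    # "use any model name" behaviour described earlier.
    return ALIASES.get(requested, NOVA_PREMIER)

print(resolve_model("sonar-pro"))  # us.amazon.nova-premier-v1:0
```

&lt;p&gt;The fallback is what lets other apps send any model name and still get Nova Premier behind the scenes.&lt;/p&gt;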

&lt;h2&gt;
  
  
  Cost
&lt;/h2&gt;

&lt;p&gt;Nova Premier usage counts against your AWS credits — so if you have any sitting around (a $25 promo from an AWS event, workshop, or re:Invent swag bag), this is effectively free. Check your &lt;a href="https://console.aws.amazon.com/billing/home#/credits" rel="noopener noreferrer"&gt;Billing console&lt;/a&gt; — you might have more than you think.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Community Builders&lt;/strong&gt;: covered by $500/year credits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Others with AWS credits&lt;/strong&gt;: same deal — credits apply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No credits&lt;/strong&gt;: check the &lt;a href="https://aws.amazon.com/bedrock/pricing/" rel="noopener noreferrer"&gt;Bedrock pricing page&lt;/a&gt; for current Nova Premier rates&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming doesn't return &lt;code&gt;citations[]&lt;/code&gt;&lt;/strong&gt; — Nova limitation. Non-streaming works fine, and OpenClaw's &lt;code&gt;web_search&lt;/code&gt; tool uses non-streaming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MAX_CONCURRENT&lt;/code&gt; semaphore&lt;/strong&gt; defaults to 5 — tune via env var if needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Region&lt;/strong&gt;: Nova Premier grounding requires &lt;code&gt;us-east-1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;If you're already on AWS Bedrock for your LLM workloads, there's no reason to pay Perplexity separately for web-grounded search. The wrapper is ~350 lines of Python, has 44 tests, and is OpenAI SDK-compatible — so it works with anything that speaks the Perplexity or OpenAI chat completions API.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/gabrielkoo/bedrock-web-search-proxy" rel="noopener noreferrer"&gt;github.com/gabrielkoo/bedrock-web-search-proxy&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bedrock</category>
      <category>python</category>
      <category>webdev</category>
    </item>
    <item>
      <title>AWS Silently Releases Kimi K2.5 and GLM 4.7 Models to Bedrock</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Sun, 08 Feb 2026 15:31:31 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-silently-releases-kimi-k25-and-glm-47-models-to-bedrock-1514</link>
      <guid>https://dev.to/aws-builders/aws-silently-releases-kimi-k25-and-glm-47-models-to-bedrock-1514</guid>
      <description>&lt;p&gt;&lt;strong&gt;[UPDATE 10 Feb 2026]&lt;/strong&gt; - It seems it was part of AWS’s rollout plan for rolling out open weight models for Kiro and Kiro CLI! &lt;a href="https://kiro.dev/blog/open-weight-models/" rel="noopener noreferrer"&gt;Open weight models are here: more choice, more speed, less cost&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But interestingly, of the models covered in this article, only DeepSeek v3.2, MiniMax 2.1 and Qwen3 Coder Next made that announcement. Moonshot K2.5 and GLM-4.7 were missing.&lt;br&gt;
————&lt;br&gt;
I was refreshing my &lt;a href="https://amazonbedrockmodels.github.io" rel="noopener noreferrer"&gt;Bedrock model catalog&lt;/a&gt; script out of random curiosity when a few unfamiliar model IDs showed up in us-east-1. No AWS blog post. No tweet thread. Just a few new entries in the API response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you've been waiting for a Claude-adjacent model you could swap in seamlessly via AWS credits — this is it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But there’s a drawback for early adopters, so do read to the end to learn about the flaw!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kimi K2.5&lt;/strong&gt; (by Moonshot AI), &lt;strong&gt;GLM 4.7&lt;/strong&gt; (by Zhipu AI), and several other new models like DeepSeek 3.2 and Qwen3 Coder Next are now live on Bedrock, all with full support for the Converse API, tool calling, and — in Kimi K2.5's case — native image understanding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faecm7jquxdar4sdneiz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faecm7jquxdar4sdneiz2.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbac1lp3pf87i383xmqk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbac1lp3pf87i383xmqk.png" alt=" " width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick note:&lt;/strong&gt; These models aren't listed in the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html" rel="noopener noreferrer"&gt;AWS Bedrock models-supported documentation&lt;/a&gt;, yet they're fully functional via the Converse API. That's the whole "silent release" thing — available in production, just not reflected in the canonical docs yet. Worth bookmarking your region's actual model list from the Bedrock console instead of relying solely on the written guides.&lt;/p&gt;
&lt;h2&gt;
  
  
  The models: what just landed
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbe18txuqhsi9nu8y0o2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbe18txuqhsi9nu8y0o2.png" alt=" " width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kimi K2.5&lt;/strong&gt; (&lt;a href="https://www.kimi.com/blog/kimi-k2.5.html" rel="noopener noreferrer"&gt;Moonshot AI blog&lt;/a&gt;) is the eye-catcher here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling (function calling):&lt;/strong&gt; ✓ Fully supported via Bedrock Converse API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image understanding:&lt;/strong&gt; ✓ Native image inputs (base64 or URL)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code generation:&lt;/strong&gt; In my testing, it held its own against Claude 4.5 Sonnet on typical coding prompts — it handled a multi-file refactor of a FastAPI router cleanly on the first try&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bedrock Model ID:&lt;/strong&gt; &lt;code&gt;moonshotai.kimi-k2.5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability:&lt;/strong&gt; us-east-1, us-west-2 (and expanding)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case fit:&lt;/strong&gt; Drop-in replacement for Claude if you're already on AWS credits&lt;/li&gt;
&lt;/ul&gt;
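&lt;p&gt;To sanity-check the model ID yourself, the Converse request has the same shape as for any other Bedrock model. A minimal sketch (the prompt and inference settings here are illustrative; sending it assumes you have boto3 credentials with Bedrock access in us-east-1):&lt;/p&gt;

```python
# Build a Converse API request for the newly listed Kimi K2.5 model ID.
def build_converse_request(prompt):
    return {
        "modelId": "moonshotai.kimi-k2.5",
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.2},
    }

# Sending it requires AWS credentials with Bedrock access:
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   response = client.converse(**build_converse_request("Summarize this diff."))
#   print(response["output"]["message"]["content"][0]["text"])
```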

&lt;p&gt;&lt;strong&gt;GLM 4.7&lt;/strong&gt; (&lt;a href="https://z.ai/blog/glm-4.7" rel="noopener noreferrer"&gt;Zhipu AI blog&lt;/a&gt;) fills a quieter but useful role:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling:&lt;/strong&gt; ✓ Supported, though less aggressively tested in my flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code generation:&lt;/strong&gt; Strong; competitive with DeepSeek for certain workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bedrock Model ID:&lt;/strong&gt; &lt;code&gt;zai.glm-4.7(-flash)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability:&lt;/strong&gt; us-east-1, us-west-2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case fit:&lt;/strong&gt; Solid all-arounder; good for prompts that don't strictly require image handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The real unlock:&lt;/strong&gt; Both are live on the &lt;code&gt;Converse&lt;/code&gt; API, which means they work seamlessly with Bedrock's function-calling infrastructure.&lt;/p&gt;
&lt;h3&gt;
  
  
  When to pick which
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Need&lt;/th&gt;
&lt;th&gt;Pick&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Image understanding + tool calling&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Kimi K2.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only Bedrock open-weight flagship model with both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text-only tasks, cost-conscious&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GLM 4.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Solid all-arounder, no vision overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maximum reliability &amp;amp; ecosystem&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Claude&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Battle-tested, widest documentation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Why this matters for vibe coding
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;"Vibe coding"&lt;/em&gt; — the practice of rapidly iterating on code with LLM assistance, swapping models mid-session, and optimizing for flow over perfection — lives or dies on how frictionless your model-switching is.&lt;/p&gt;

&lt;p&gt;If you're sitting on expiring AWS credits (I've got ~$700 by July 2026), the bottleneck isn't usually "which model is smartest?" — it's "how fast can I swap without rewriting everything?"&lt;/p&gt;

&lt;p&gt;Kimi K2.5 solves a real pain point: until now, if you wanted image understanding + tool calling + AWS-native billing, you were stuck with Claude. And the only way to get Claude on AWS billing was a Kiro CLI subscription ($20/$40/$200 per month), since Anthropic Claude models are not covered by the typical AWS Credits.&lt;/p&gt;

&lt;p&gt;I initially considered subscribing to the $200 Kiro Power plan, but I wasn't confident I could utilize it fully every month; at the same time, I worried that sticking to the $20/$40 plans would mean paying for Kiro credit overages (roughly double the per-credit price of the plans themselves). A pay-as-you-go option therefore fits my usage pattern perfectly.&lt;/p&gt;

&lt;p&gt;So now you have an option that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bills directly to your AWS account&lt;/strong&gt; — no vendor intermediary, no separate API key, just your existing credits burning down&lt;/li&gt;
&lt;li&gt;Runs on the same Bedrock Converse API &lt;/li&gt;
&lt;li&gt;Calls tools reliably&lt;/li&gt;
&lt;li&gt;Natively understands images&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For experimentation loops (refactors, code generation, visual analysis), that's a genuinely useful escape hatch.&lt;/p&gt;
&lt;h2&gt;
  
  
  The lightweight setup: local LiteLLM gateway
&lt;/h2&gt;

&lt;p&gt;You don't need a complex setup. My entire gateway is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Python venv with &lt;code&gt;litellm&lt;/code&gt; installed&lt;/li&gt;
&lt;li&gt;A single YAML config file (shown in the next section)&lt;/li&gt;
&lt;li&gt;A systemd unit to keep it running on port 4000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No containers, no Kubernetes. One command to install, one service file to manage. Once running, any client on that machine calls &lt;code&gt;http://localhost:4000/chat/completions&lt;/code&gt; with the standard OpenAI format, and LiteLLM translates it to Bedrock Converse API automatically.&lt;/p&gt;
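&lt;p&gt;As an illustration, a client-side call needs nothing beyond the Python standard library (a sketch; the &lt;code&gt;kimi-k2.5&lt;/code&gt; alias and the API key placeholder come from my gateway config and are assumptions about your setup):&lt;/p&gt;

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:4000/chat/completions"  # local LiteLLM proxy

def build_chat_request(model, prompt):
    # Standard OpenAI chat-completions payload; LiteLLM translates it to Converse.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def call_gateway(payload, api_key="sk-your-litellm-key"):
    req = urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the gateway running:
#   result = call_gateway(build_chat_request("kimi-k2.5", "Refactor this function"))
#   print(result["choices"][0]["message"]["content"])
```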

&lt;p&gt;&lt;strong&gt;Performance note:&lt;/strong&gt; In my testing, the LiteLLM translation layer adds negligible latency (~20–50ms overhead). Streaming responses from Kimi K2.5 feel comparable to calling Claude directly — first tokens arrive within 1–2 seconds for typical prompts.&lt;/p&gt;
&lt;h2&gt;
  
  
  Bonus: Claude Code &amp;amp; OpenCode integration
&lt;/h2&gt;

&lt;p&gt;Here's the slightly cheeky part: you can point &lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; or &lt;a href="https://opencode.ai/" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt; at your local LiteLLM gateway and route requests through to Kimi K2.5 or GLM 4.7 on Bedrock — all while staying on your AWS credits.&lt;/p&gt;

&lt;p&gt;LiteLLM supports the Anthropic &lt;code&gt;/v1/messages&lt;/code&gt; API endpoint, so it takes just a few environment variables to set up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:4000
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-your-litellm-key
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kimi-k2.5
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DISABLE_PROMPT_CACHING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;DISABLE_PROMPT_CACHING=true&lt;/code&gt; is essential here (special thanks to my colleague &lt;em&gt;&lt;strong&gt;[at]Marty&lt;/strong&gt;&lt;/em&gt; for troubleshooting and fixing that): by default Claude Code tries to apply prompt caching for speed, but not every model on Amazon Bedrock supports it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cl3hjaw2gf1mlhttp8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cl3hjaw2gf1mlhttp8m.png" alt=" " width="800" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then launch Claude Code or OpenCode as usual. LiteLLM intercepts the Anthropic-format requests and translates them to Bedrock Converse calls. It's not officially blessed by Anthropic, but it works cleanly for local experimentation — and your AWS credits take the hit instead of your Anthropic billing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqqccoiz00icpxzuj6ao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqqccoiz00icpxzuj6ao.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Config: explicit about capabilities
&lt;/h2&gt;

&lt;p&gt;Here's how I route Kimi K2.5 and GLM 4.7:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kimi-k2.5&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock/converse/moonshotai.kimi-k2.5&lt;/span&gt;
      &lt;span class="na"&gt;aws_region_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
      &lt;span class="na"&gt;allowed_openai_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reasoning_effort'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tools'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tool_choice'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;model_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;completion&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;glm-4.7&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock/converse/zai.glm-4.7&lt;/span&gt;
      &lt;span class="na"&gt;aws_region_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
      &lt;span class="na"&gt;allowed_openai_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reasoning_effort'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tools'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tool_choice'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;    
    &lt;span class="na"&gt;model_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;completion&lt;/span&gt;

&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modify_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;log_responses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key patterns here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Friendly names&lt;/strong&gt; (&lt;code&gt;kimi-k2.5&lt;/code&gt;, &lt;code&gt;glm-4.7&lt;/code&gt;) instead of long model IDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit capability flags&lt;/strong&gt; (&lt;code&gt;supports_function_calling&lt;/code&gt;, &lt;code&gt;supports_vision&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;modify_params: true&lt;/code&gt;&lt;/strong&gt; for Bedrock edge-case smoothing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single region&lt;/strong&gt; (us-east-1) since both models are there&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The capability flags matter. They let your orchestration layer (or agent framework) gracefully degrade if a model can't do tools or images. No more "half-attempt to call a function and fail mysteriously."&lt;/p&gt;
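&lt;p&gt;In practice, that graceful degradation can be a one-line gate before each request. A hypothetical sketch (the capability table below simply mirrors the flags client-side; it is not a LiteLLM API):&lt;/p&gt;

```python
# Hypothetical client-side capability table mirroring the model_info flags.
MODEL_CAPS = {
    "kimi-k2.5": {"tools": True, "vision": True},
    "glm-4.7": {"tools": True, "vision": False},
}

def prepare_request(model, messages, tools=None, has_images=False):
    # Degrade gracefully: drop tools the model cannot call, reject images early.
    caps = MODEL_CAPS.get(model, {})
    if has_images and not caps.get("vision"):
        raise ValueError(model + " does not accept image inputs")
    request = {"model": model, "messages": messages}
    if tools and caps.get("tools"):
        request["tools"] = tools  # only attach tools the model can actually call
    return request
```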

&lt;h2&gt;
  
  
  Testing: quick verification
&lt;/h2&gt;

&lt;p&gt;I tested both models with the Bedrock Converse API in us-east-1. Here's what actually happened:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kimi K2.5:&lt;/strong&gt; I threw a "get current weather in Tokyo" tool spec at it — standard JSON Schema function definition, nothing fancy. It correctly structured the function call on the first attempt, including proper argument types in the response. For code generation, I asked it to refactor a Python CLI script into async; the output was clean and ran without edits. Image support is declared in the model schema but I haven't validated it hands-on yet — that's next on my list.&lt;/p&gt;
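&lt;p&gt;For reference, the weather tool spec I threw at it was the textbook JSON Schema shape, expressed here in the standard OpenAI function-calling format (LiteLLM converts this into Bedrock's &lt;code&gt;toolSpec&lt;/code&gt; form; nothing below is Kimi-specific):&lt;/p&gt;

```python
# "Get current weather" tool definition in OpenAI function-calling format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Tokyo"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# Passed as tools=[WEATHER_TOOL] in the chat-completions payload; a model that
# handles it well responds with a tool call whose arguments include "city".
```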

&lt;p&gt;&lt;strong&gt;GLM 4.7:&lt;/strong&gt; Solid on text queries and code generation. Tool calling works, though it was slightly less eager to invoke tools unprompted compared to Kimi — it sometimes answered directly when I expected a function call. No image support, as expected; Zhipu hasn't added vision capabilities to GLM 4.7.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the quiet release?
&lt;/h2&gt;

&lt;p&gt;These aren't show-stopping announcements. They're bread-and-butter additions to Bedrock's model portfolio. AWS likely brought them in as part of an ongoing expansion to reduce vendor lock-in on "you have to use Claude for everything." That's healthy — more options, better pricing pressure, cleaner credit utilization.&lt;/p&gt;

&lt;p&gt;This is a pattern, not an anomaly. Anthropic Claude, AI21 Jamba, and several Mistral variants all appeared on Bedrock before official blog posts or documentation updates. If you're only checking AWS launch announcements, you're always behind.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📌 That's exactly why I built &lt;a href="https://amazonbedrockmodels.github.io" rel="noopener noreferrer"&gt;amazonbedrockmodels.github.io&lt;/a&gt;&lt;/strong&gt; — a living catalog of what's &lt;em&gt;actually&lt;/em&gt; available on Bedrock, in which regions, and what each model can do. Bookmark it. It updates faster than the docs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The model-swapping checklist
&lt;/h2&gt;

&lt;p&gt;If you want to swap models without rewriting your code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use a gateway&lt;/strong&gt; (LiteLLM, LLMProxy, or similar) to normalize requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pin the Bedrock route explicitly&lt;/strong&gt; (&lt;code&gt;bedrock/converse/modelid&lt;/code&gt;) in your config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mark capability per model&lt;/strong&gt; (tool calling, vision, etc.) — don't assume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test the tool spec&lt;/strong&gt; — even "supported" models sometimes have quirky implementations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a catalog&lt;/strong&gt; so you don't rediscover the same model twice&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Kimi K2.5 fits this playbook cleanly. It's a genuine Claude replacement for Bedrock users, not a "wait and see if it works" experiment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If you're on AWS credits:&lt;/strong&gt; Spin up a local LiteLLM instance and try both models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you find more quietly-available models:&lt;/strong&gt; Open a PR against the catalog or message me&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you're in an AWS org:&lt;/strong&gt; Check your Bedrock region — availability is still expanding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AWS credits will expire whether you use them or not. Might as well pick models that fit your workflow instead of forcing your workflow around Kiro CLI only.&lt;/p&gt;




&lt;p&gt;P.S. During my testing I did observe occasionally longer LLM inference times, but I expect AWS is still provisioning more compute capacity for these newer models.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] &lt;a href="https://www.kimi.com/blog/kimi-k2.5.html" rel="noopener noreferrer"&gt;Moonshot AI — Kimi K2.5 Announcement&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://z.ai/blog/glm-4.7" rel="noopener noreferrer"&gt;Zhipu AI — GLM 4.7 Announcement&lt;/a&gt;&lt;br&gt;
[3] &lt;a href="https://docs.aws.amazon.com/cli/latest/reference/bedrock-runtime/converse.html" rel="noopener noreferrer"&gt;AWS Bedrock — Converse API Reference&lt;/a&gt;&lt;br&gt;
[4] &lt;a href="https://docs.litellm.ai/docs/providers/bedrock" rel="noopener noreferrer"&gt;LiteLLM — AWS Bedrock Provider Documentation&lt;/a&gt;&lt;br&gt;
[5] &lt;a href="https://amazonbedrockmodels.github.io" rel="noopener noreferrer"&gt;Unofficial - Amazon Bedrock Model Catalog I Created&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>vibecoding</category>
      <category>ai</category>
    </item>
    <item>
      <title>Ultra Low Bedrock LLM Rate Limits for New AWS Accounts? Time to Wake Up Your Inactive AWS Accounts!</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Tue, 25 Nov 2025 17:00:02 +0000</pubDate>
      <link>https://dev.to/aws-builders/ultra-low-bedrock-llm-rate-limits-for-new-aws-accounts-time-to-wake-up-your-inactive-aws-accounts-3no0</link>
      <guid>https://dev.to/aws-builders/ultra-low-bedrock-llm-rate-limits-for-new-aws-accounts-time-to-wake-up-your-inactive-aws-accounts-3no0</guid>
      <description>&lt;h2&gt;
  
  
  Are You Struggling With Amazon Bedrock’s Ultra-Low Quotas on New AWS Accounts? 🤯
&lt;/h2&gt;

&lt;p&gt;Are you hitting painfully low rate limits when running LLMs on Amazon Bedrock from a newly created AWS account? You’re definitely not alone — many developers are discovering that new accounts often start with &lt;strong&gt;extremely restrictive quotas&lt;/strong&gt;, sometimes as low as &lt;strong&gt;2 requests per minute&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Official guidance usually suggests contacting an account manager to escalate your limits, but for startups, hobby projects, or personal experimentation, that path is far from simple.&lt;/p&gt;

&lt;h2&gt;
  
  
  New Account? Big Ambitions, Tiny Quota 🤏🚧
&lt;/h2&gt;

&lt;p&gt;Sometime in 2024 or 2025, AWS quietly adjusted Bedrock’s default model access for newly created accounts, alongside other lowered account defaults such as a maximum concurrency of 10 for AWS Lambda. Even when using global endpoints, many fresh accounts get &lt;strong&gt;just a few requests per minute&lt;/strong&gt; (e.g., 2 rpm for Claude 4.5 Sonnet). This severely slows down prototyping or early-stage AI development.&lt;/p&gt;

&lt;p&gt;Meanwhile, &lt;strong&gt;older AWS accounts&lt;/strong&gt; — even ones that never touched Bedrock before — often start with dramatically higher limits, approaching &lt;strong&gt;200+ rpm&lt;/strong&gt; for the exact same models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrfvny41yyg8psazvfhu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrfvny41yyg8psazvfhu.png" alt="Sonnet 4.5 Rates" width="800" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This creates a real operational advantage for teams with access to aged accounts.&lt;/p&gt;
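&lt;p&gt;You can compare your own accounts directly via the Service Quotas API. A sketch (the "requests per minute" substring match is a heuristic of mine; quota names vary by model and may change):&lt;/p&gt;

```python
def filter_rpm_quotas(quotas):
    # Keep only per-minute request quotas from a list_service_quotas response.
    return {
        q["QuotaName"]: q["Value"]
        for q in quotas
        if "requests per minute" in q["QuotaName"].lower()
    }

# Against a live account (requires boto3 and credentials):
#   import boto3
#   client = boto3.client("service-quotas", region_name="us-east-1")
#   pages = client.get_paginator("list_service_quotas").paginate(ServiceCode="bedrock")
#   for page in pages:
#       print(filter_rpm_quotas(page["Quotas"]))
```

Running this in both a fresh and an aged account makes the gap described above immediately visible.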

&lt;h2&gt;
  
  
  The Elder Account Advantage 🕰️✨
&lt;/h2&gt;

&lt;p&gt;AWS appears to apply significantly stricter defaults to newer accounts while preserving far more permissive limits for older ones. This aligns with AWS’s long-standing pattern of maintaining stable experiences for long-time customers.&lt;/p&gt;

&lt;p&gt;In practice, a dormant but years-old AWS account can immediately receive much &lt;strong&gt;higher Bedrock limits&lt;/strong&gt; purely due to its age.&lt;/p&gt;

&lt;p&gt;Below is a striking example: &lt;strong&gt;3 rpm&lt;/strong&gt; for new accounts vs &lt;strong&gt;250 rpm&lt;/strong&gt; for older accounts on Claude 4.5 Opus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftseif1z5fgpdlpbrxyej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftseif1z5fgpdlpbrxyej.png" alt="Opus 4.5 Rates" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AWS Is Unlikely to Reduce Older Accounts’ Limits 🔒🏢
&lt;/h2&gt;

&lt;p&gt;Reducing quotas for older accounts would create major risk and break expectations for long-standing customers — especially enterprises.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Many organizations have stable, long-lived workloads.&lt;/li&gt;
&lt;li&gt;Retroactively lowering quotas could break pipelines and violate performance assumptions.&lt;/li&gt;
&lt;li&gt;AWS historically avoids backward-incompatible changes unless absolutely necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of this, it’s unlikely AWS will apply newer, stricter defaults to older accounts. Teams spinning up new AWS accounts face a much steeper ramp for experimenting with Bedrock.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding and Reusing Older AWS Accounts 🔎📦
&lt;/h2&gt;

&lt;p&gt;If your organization has older AWS accounts lying around, they may offer instant scaling advantages. With the new &lt;strong&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/11/aws-organizations-direct-account-transfers/" rel="noopener noreferrer"&gt;AWS Organizations Direct Account Transfer&lt;/a&gt;&lt;/strong&gt; feature, accounts can move between Organizations without removing payment methods or performing the old, painful detachment workflow.&lt;/p&gt;

&lt;p&gt;When moving such accounts, remember to update:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Legal entity name&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root user email&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Addresses and billing contacts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tax information&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your organization uses &lt;a href="https://docs.aws.amazon.com/accounts/latest/reference/using-orgs-trusted-access.html" rel="noopener noreferrer"&gt;Trusted Access for AWS Account Management&lt;/a&gt;, these updates are straightforward. Also make sure to audit the account for any leftover resources before dedicating it to GenAI workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why New Accounts? Why Not Run Everything in One AWS Account? 🧩💼
&lt;/h2&gt;

&lt;p&gt;Putting Bedrock experiments, different production services, and dev workloads into a single account sounds convenient — until it isn’t. Combining everything into one single AWS Account creates unnecessary risk and operational noise.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Isolation Protects You 🔥🧱&lt;br&gt;
Account boundaries are AWS’s strongest safety net. One bad experiment or IAM mistake shouldn’t touch production. Isolation limits accidental data access, cost spikes, and incident blast radius.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Governance Stays Clean 📜✨&lt;br&gt;
Different workloads need different controls. Separate accounts keep audits simpler, give clear ownership, and let you apply SCPs and guardrails without compromise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Costs Stay Transparent 💸📊&lt;br&gt;
Mixing Bedrock prototyping with core services muddies cost reporting. Individual accounts let you track usage, set budgets, and avoid team-to-team disputes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Experiments Move Faster ⚡🔬&lt;br&gt;
A sandbox (ideally an older account with higher limits) lets you test models, tweak IAM, and push boundaries freely — without risking production stability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It Matches AWS Best Practices 🏗️📚&lt;br&gt;
AWS recommends multi-account setups for lifecycle separation and blast-radius control. Using older accounts for Bedrock while keeping core workloads isolated follows this playbook perfectly.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to Break Free From Bedrock’s Slow Lane 🚀💡
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Identify older AWS accounts that haven't been used recently.&lt;/li&gt;
&lt;li&gt;Transfer them into your AWS Organization using the streamlined Direct Account Transfer workflow.&lt;/li&gt;
&lt;li&gt;Update all account metadata for compliance.&lt;/li&gt;
&lt;li&gt;Deploy your Bedrock workloads — and unlock higher default limits instantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach helps teams accelerate their AI development journey despite the strict constraints placed on newly created AWS accounts.&lt;/p&gt;

&lt;p&gt;Have you seen similar quota differences in your environment? Share your experience — more data points help the community understand the pattern! 🙌&lt;/p&gt;

</description>
      <category>bedrock</category>
      <category>ai</category>
      <category>aws</category>
    </item>
    <item>
      <title>Use OpenAI Codex CLI with Amazon Bedrock Models - Pay As You Go</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Wed, 27 Aug 2025 15:39:16 +0000</pubDate>
      <link>https://dev.to/aws-builders/use-openai-codex-cli-with-amazon-bedrock-models-pay-as-you-go-48eb</link>
      <guid>https://dev.to/aws-builders/use-openai-codex-cli-with-amazon-bedrock-models-pay-as-you-go-48eb</guid>
      <description>&lt;p&gt;&lt;strong&gt;NEW&lt;/strong&gt; (2026 Jan) Newer versions of Codex use the &lt;code&gt;/v1/responses&lt;/code&gt; API by default and have dropped support for the chat completions endpoint. You'll need to add &lt;code&gt;wire_api = "responses"&lt;/code&gt; to your existing config to use the new endpoint instead: &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-mantle.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-mantle.html&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI Codex CLI on Amazon Bedrock Models: Why Bother?
&lt;/h2&gt;

&lt;p&gt;Here’s why Codex plus the Amazon Bedrock models makes sense in some cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pay as you go&lt;/strong&gt;: No fixed cost—just pay for Bedrock tokens and Lambda invocations, with no monthly minimum. While Amazon Q Developer CLI has a free-tier quota, you must upgrade to the $19 USD/month paid plan once you exceed it, and even that plan &lt;em&gt;still&lt;/em&gt; enforces usage caps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use your own fine-tuned models&lt;/strong&gt;: Swap model endpoints easily; the gateway can even route to your own Amazon Bedrock fine-tunes (e.g. Nova) without friction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparent logging&lt;/strong&gt;: Codex’s request/response logs give you full visibility — a plus for debugging and cost tracking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No AWS IAM/Identity required - Perfect for Headless Workloads&lt;/strong&gt;: You only need your Bedrock Access Gateway API key; there's no need to log in via an inconvenient console authentication flow with your AWS Identity Center user or Builder ID (great for CI/CD and ephemeral cloud instances).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regional flexibility&lt;/strong&gt;: Yes, you could use &lt;a href="https://docs.anthropic.com/en/docs/claude-code/amazon-bedrock" rel="noopener noreferrer"&gt;Claude Code with Amazon Bedrock&lt;/a&gt;, but I live in Hong Kong, where Claude model usage is not allowed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Nova Micro: Price King&lt;/strong&gt;: For simple text-only LLM tasks, swapping Sonnet 4 for Nova Micro cuts costs by a factor of roughly 85.&lt;/li&gt;
&lt;li&gt;If you have a bunch of AWS Credits from AWS events, your usage of the &lt;code&gt;gpt-oss&lt;/code&gt; / Nova family of models is covered!&lt;/li&gt;
&lt;/ol&gt;
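&lt;p&gt;That factor-of-85 figure follows directly from list prices. A quick sanity check (the per-million-token prices below are my recollection of the on-demand rates at the time; always verify against the current Bedrock pricing page):&lt;/p&gt;

```python
# Approximate on-demand prices in USD per 1M input tokens (assumed values;
# check the current Bedrock pricing page before relying on them).
SONNET_4_INPUT_PER_M = 3.00     # Claude Sonnet 4, input tokens
NOVA_MICRO_INPUT_PER_M = 0.035  # Amazon Nova Micro, input tokens

# Roughly the factor-of-85 savings quoted above (about 85.7x on input tokens).
cost_ratio = SONNET_4_INPUT_PER_M / NOVA_MICRO_INPUT_PER_M
```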

&lt;h2&gt;
  
  
  Setup: Codex CLI + Bedrock Gateway
&lt;/h2&gt;

&lt;p&gt;(UPDATE: Deprecated after Codex v0.80.0 &lt;a href="https://github.com/openai/codex/discussions/7782" rel="noopener noreferrer"&gt;https://github.com/openai/codex/discussions/7782&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Get your Lambda Gateway Function URL and API Key after deployment. (Check my earlier article for a step-by-step guide to get it running on Lambda via AWS SAM: &lt;a href="https://dev.to/aws-builders/use-amazon-bedrock-models-via-an-openai-api-compatible-serverless-endpoint-now-without-fixed-cost-5hf5"&gt;https://dev.to/aws-builders/use-amazon-bedrock-models-via-an-openai-api-compatible-serverless-endpoint-now-without-fixed-cost-5hf5&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Here's a no-brainer if you want to skip my article and deploy it right away:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="nb"&gt;cd&lt;/span&gt; /tmp &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  git clone &lt;span class="nt"&gt;--depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 https://github.com/gabrielkoo/bedrock-access-gateway-function-url &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;cd &lt;/span&gt;bedrock-access-gateway-function-url &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ./prepare_source.sh &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sam build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sam deploy &lt;span class="nt"&gt;--guided&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;a href="https://github.com/openai/codex?tab=readme-ov-file#installing-and-running-codex-cli" rel="noopener noreferrer"&gt;install Codex&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; @openai/codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure Codex like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# ~/.codex/config.toml&lt;/span&gt;
&lt;span class="py"&gt;profile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'bedrock'&lt;/span&gt;

&lt;span class="nn"&gt;[profiles.bedrock]&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'openai.gpt-oss-120b-1:0'&lt;/span&gt;
&lt;span class="c"&gt;# OR&lt;/span&gt;
&lt;span class="c"&gt;# model = 'us.amazon.nova-premier-v1:0'&lt;/span&gt;
&lt;span class="py"&gt;model_provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'bedrock'&lt;/span&gt;
&lt;span class="py"&gt;model_reasoning_effort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"low"&lt;/span&gt;
&lt;span class="c"&gt;# NEW! Newer versions of Codex uses /v1/responses API by default.&lt;/span&gt;
&lt;span class="py"&gt;wire_api&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"chat"&lt;/span&gt;

&lt;span class="nn"&gt;[model_providers.bedrock]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'bedrock'&lt;/span&gt;
&lt;span class="py"&gt;base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'https://RANDOM_HASH_HERE.lambda-url.AWS_REGION.on.aws/api/v1'&lt;/span&gt;
&lt;span class="py"&gt;env_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'CODEX_OPENAI_API_KEY'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
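&lt;p&gt;Before pointing Codex at the gateway, it is worth a quick smoke test that your Function URL and API key actually work. A minimal sketch - the URL and key below are placeholders for your own deployment values:&lt;/p&gt;

```shell
# Placeholders - substitute the Function URL and API key from your own deployment.
export CODEX_OPENAI_API_KEY="your-gateway-api-key"
BASE_URL="https://RANDOM_HASH_HERE.lambda-url.AWS_REGION.on.aws/api/v1"

# The gateway is OpenAI-compatible, so /models should list the Bedrock models it exposes.
curl -s --max-time 15 -H "Authorization: Bearer ${CODEX_OPENAI_API_KEY}" \
  "${BASE_URL}/models" || echo "Gateway not reachable - check the URL and key."
```

&lt;p&gt;If a JSON model list comes back, Codex will work with the same values.&lt;/p&gt;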



&lt;p&gt;Alternatively, if you only need the &lt;code&gt;gpt-oss&lt;/code&gt; models (and not, say, the Claude/Nova families), you can use the latest official OpenAI-compatible endpoint with an Amazon Bedrock API key instead - there is no need to host the Bedrock Access Gateway at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="py"&gt;web_search&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"disabled"&lt;/span&gt;

&lt;span class="nn"&gt;[model_providers.bedrock]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"AmazonBedrock"&lt;/span&gt;
&lt;span class="py"&gt;base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://bedrock-mantle.us-west-1.api.aws/v1"&lt;/span&gt;
&lt;span class="py"&gt;env_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ENV_KEY_FOR_YOUR_BEDROCK_API_KEY"&lt;/span&gt;

&lt;span class="err"&gt;...&lt;/span&gt;

&lt;span class="nn"&gt;[profiles.gpt-oss]&lt;/span&gt;
&lt;span class="c"&gt;# NOTE: The model ID is truncated if you use the responses API.&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai.gpt-oss-120b"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query the LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codex &lt;span class="nt"&gt;--profile&lt;/span&gt; bedrock &lt;span class="s2"&gt;"What is my public IP address?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1iv3f12t7c5g34cmsmti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1iv3f12t7c5g34cmsmti.png" alt="Codex with Bedrock model in action" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Support
&lt;/h2&gt;

&lt;p&gt;Note that not all Bedrock models work over the gateway. Models must support &lt;strong&gt;tool calls&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT OSS (20b/120b)&lt;/strong&gt;: Optimized for Codex.&lt;br&gt;
&lt;strong&gt;Nova family (Premier, Pro, Lite, Micro):&lt;/strong&gt; All tested and working.&lt;br&gt;
&lt;strong&gt;Claude, Llama, Mistral, Command R:&lt;/strong&gt; Working, subject to regional restrictions (e.g. Hong Kong).&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon Q Developer CLI vs Codex CLI on Bedrock
&lt;/h2&gt;

&lt;p&gt;Amazon Q Developer CLI is indeed &lt;strong&gt;officially supported in Hong Kong&lt;/strong&gt; — but after your free usage (&lt;a href="https://aws.amazon.com/q/developer/pricing/" rel="noopener noreferrer"&gt;50 agentic chats/month&lt;/a&gt;), you'll need the $19/month paid plan, and may hit quotas even then.&lt;/p&gt;

&lt;p&gt;Codex CLI via Amazon Bedrock gives &lt;em&gt;unmetered usage&lt;/em&gt; (subject only to whatever quotas you have on Amazon Bedrock itself and Lambda), with no AWS login required - you just need the API key that you defined yourself when you deployed the Bedrock Access Gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Just Use the New OpenAI-Compatible Endpoint?
&lt;/h2&gt;

&lt;p&gt;As covered in my other blog article &lt;a href="https://dev.to/aws-builders/aws-launches-openai-compatible-api-for-bedrock-and-i-did-some-tests-49cd"&gt;AWS Launches OpenAI-Compatible API for Bedrock (and I Did Some Tests!)&lt;/a&gt;, the new OpenAI-compatible Amazon Bedrock API endpoint supports &lt;code&gt;gpt-oss&lt;/code&gt; 20b as well as 120b out of the box, but other models like Nova or Claude are not supported.&lt;/p&gt;

&lt;p&gt;So by wrapping the calls via a Bedrock Access Gateway, you can switch to any other supported model whenever you want.&lt;/p&gt;
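&lt;p&gt;In practice, switching models is just another Codex profile reusing the same gateway provider. A hypothetical example:&lt;/p&gt;

```toml
# ~/.codex/config.toml - add alongside the existing [profiles.bedrock] block.
[profiles.nova-micro]
model = 'us.amazon.nova-micro-v1:0'
model_provider = 'bedrock'
wire_api = 'chat'
```

&lt;p&gt;Then run &lt;code&gt;codex --profile nova-micro "..."&lt;/code&gt; for cheap text-only tasks.&lt;/p&gt;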

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Codex CLI + Amazon Bedrock (via an OpenAI-compatible gateway) gives developers a way to use pay-as-you-go agentic CLI agents, swap fine-tuned models easily, and avoid the region/pricing issues present in other AWS or Anthropic tooling. For minimal cost, Nova Micro is unbeatable for text workloads. And yes, my serverless gateway solution is the backbone — but more about that in &lt;a href="https://dev.to/aws-builders/use-amazon-bedrock-models-via-an-openai-api-compatible-serverless-endpoint-now-without-fixed-cost-5hf5"&gt;my previous blog&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>genai</category>
      <category>aws</category>
    </item>
    <item>
      <title>🚀 AWS Launches OpenAI-Compatible API for Bedrock (and I Did Some Tests!)</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Wed, 06 Aug 2025 16:45:12 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-launches-openai-compatible-api-for-bedrock-and-i-did-some-tests-49cd</link>
      <guid>https://dev.to/aws-builders/aws-launches-openai-compatible-api-for-bedrock-and-i-did-some-tests-49cd</guid>
      <description>&lt;h2&gt;
  
  
  🚀 AWS OpenAI-Compatible API for Bedrock OSS GPT: Real-World Dev Tests &amp;amp; Insights
&lt;/h2&gt;

&lt;p&gt;AWS just dropped a &lt;a href="https://aws.amazon.com/blogs/aws/openai-open-weight-models-now-available-on-aws/" rel="noopener noreferrer"&gt;bombshell&lt;/a&gt; by launching open-weight GPT models (&lt;code&gt;gpt-oss-120b&lt;/code&gt;, &lt;code&gt;gpt-oss-20b&lt;/code&gt;) on Amazon Bedrock as a serverless, pay-as-you-go alternative to self-hosting — this caught a lot of headlines 👀.&lt;/p&gt;

&lt;p&gt;But as someone obsessed with &lt;strong&gt;Developer Experience (DX)&lt;/strong&gt;, I was even more stoked that AWS &lt;strong&gt;finally&lt;/strong&gt; launched &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/inference-chat-completions.html" rel="noopener noreferrer"&gt;an official OpenAI-compatible endpoint&lt;/a&gt;:  &lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://bedrock-runtime.us-west-2.amazonaws.com/openai/v1&lt;/code&gt;  &lt;/p&gt;

&lt;p&gt;This puts AWS right alongside Gemini/VertexAI and Anthropic as companies providing first-party OpenAI SDK compatibility, and finally brings native OpenAI "plug-and-play" infra to AWS 🤩&lt;/p&gt;

&lt;p&gt;Before we dive in, &lt;strong&gt;here’s my previous deep-dive blog for context on using Amazon Bedrock models with OpenAI compatibility (without fixed costs):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://dev.to/aws-builders/use-amazon-bedrock-models-via-an-openai-api-compatible-serverless-endpoint-now-without-fixed-cost-5hf5"&gt;https://dev.to/aws-builders/use-amazon-bedrock-models-via-an-openai-api-compatible-serverless-endpoint-now-without-fixed-cost-5hf5&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  😎 Wait—Does This Make the Bedrock Proxy Gateway Projects Stale?
&lt;/h2&gt;

&lt;p&gt;Nope! While simple OpenAI API calls still run perfectly against these new official OpenAI-compatible Amazon Bedrock Runtime endpoints, &lt;strong&gt;there are critical compatibility gaps&lt;/strong&gt; — so custom proxies/gateways like my &lt;a href="https://github.com/gabrielkoo/bedrock-access-gateway-function-url" rel="noopener noreferrer"&gt;bedrock-access-gateway-function-url&lt;/a&gt; project and the official &lt;a href="https://github.com/aws-samples/bedrock-access-gateway/" rel="noopener noreferrer"&gt;bedrock-access-gateway&lt;/a&gt; project are nowhere near obsolete yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  By the Way, AWS Credits Do Cover the &lt;code&gt;gpt-oss&lt;/code&gt; Models!
&lt;/h2&gt;

&lt;p&gt;It has long been a pity that typical AWS credits do not cover models like Claude or Mistral. It's great that this time, with the official partnership between AWS and OpenAI, usage of the &lt;code&gt;gpt-oss&lt;/code&gt; models finally seems to be covered:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjygmjdd89l6pen8jdot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjygmjdd89l6pen8jdot.png" alt=" " width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🧑‍💻 I Vibe-Coded the Tests with Kiro AI IDE — Here’s What Happened
&lt;/h2&gt;

&lt;p&gt;You can check my &lt;strong&gt;actual repo and test scripts here:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
📦 &lt;a href="https://github.com/gabrielkoo/test-official-amazon-bedrock-openai-compatible-endpoint" rel="noopener noreferrer"&gt;https://github.com/gabrielkoo/test-official-amazon-bedrock-openai-compatible-endpoint&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Official Docs:&lt;/strong&gt; &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/inference-chat-completions.html" rel="noopener noreferrer"&gt;AWS Bedrock Docs: OpenAI Chat Completions API&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;(Note: The docs didn’t mention that support is limited to the latest OpenAI &lt;code&gt;gpt-oss&lt;/code&gt; models only, nor did they directly address support for the &lt;code&gt;tool_call&lt;/code&gt; or &lt;code&gt;response_format&lt;/code&gt; features/parameters, so do read on for my real-world dev findings!)&lt;/em&gt;&lt;/p&gt;
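&lt;p&gt;For reference, the test scripts largely boil down to calls like the following. This sketch assumes you have generated a Bedrock API key and exported it as &lt;code&gt;AWS_BEARER_TOKEN_BEDROCK&lt;/code&gt; - note the use of &lt;code&gt;max_completion_tokens&lt;/code&gt; rather than &lt;code&gt;max_tokens&lt;/code&gt;:&lt;/p&gt;

```shell
# Assumes a Bedrock API key exported as AWS_BEARER_TOKEN_BEDROCK.
payload='{
  "model": "openai.gpt-oss-20b-1:0",
  "messages": [{"role": "user", "content": "Say hello in one word."}],
  "max_completion_tokens": 100
}'

curl -s --max-time 30 \
  "https://bedrock-runtime.us-west-2.amazonaws.com/openai/v1/chat/completions" \
  -H "Authorization: Bearer ${AWS_BEARER_TOKEN_BEDROCK}" \
  -H "Content-Type: application/json" \
  -d "$payload" || echo "Request failed - check your Bedrock API key and region."
```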

&lt;h2&gt;
  
  
  Summary Table:
&lt;/h2&gt;

&lt;p&gt;As a control test, I also ran the same testing script with my local Ollama &lt;code&gt;gpt-oss:20b&lt;/code&gt; model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Bedrock OpenAI Endpoint&lt;/th&gt;
&lt;th&gt;Local Ollama + &lt;code&gt;gpt-oss&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Basic Chat Completion + Reasoning&lt;/td&gt;
&lt;td&gt;✅ Works: Step-wise reasoning included&lt;/td&gt;
&lt;td&gt;✅ Works, chain-of-thought&lt;/td&gt;
&lt;td&gt;Bedrock wraps reasoning in &lt;code&gt;&amp;lt;reasoning&amp;gt;&lt;/code&gt; tags, both are great for math/logic explanations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;reasoning_effort&lt;/code&gt; parameter&lt;/td&gt;
&lt;td&gt;✅ Supported, adjusts explanation depth&lt;/td&gt;
&lt;td&gt;✅ Supported, works&lt;/td&gt;
&lt;td&gt;Fine-tune output detail—rare outside OpenAI OSS models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔧 Tool/Function Calling (&lt;code&gt;tools&lt;/code&gt;/&lt;code&gt;tool_choice&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;❌ Only intent as JSON, &lt;em&gt;not&lt;/em&gt; native &lt;code&gt;tool_calls&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅ True OpenAI tool_call schema&lt;/td&gt;
&lt;td&gt;Just like the initial state of &lt;code&gt;o1-mini&lt;/code&gt;: AWS "gets" the tools, but you can’t delegate natively&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;response_format&lt;/code&gt; (JSON schema)&lt;/td&gt;
&lt;td&gt;❌ Not supported, API error&lt;/td&gt;
&lt;td&gt;✅ Works, emits JSON&lt;/td&gt;
&lt;td&gt;Bedrock doesn’t do response schema yet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token param (&lt;code&gt;max_tokens&lt;/code&gt; vs &lt;code&gt;max_completion_tokens&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;❌ Must use &lt;code&gt;max_completion_tokens&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;✅ Accepts either&lt;/td&gt;
&lt;td&gt;Bedrock is stricter, adjust scripts accordingly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-OSS Model (&lt;code&gt;Nova&lt;/code&gt;, &lt;code&gt;Titan&lt;/code&gt;, etc.)&lt;/td&gt;
&lt;td&gt;✅ "Model not found", as expected&lt;/td&gt;
&lt;td&gt;✅ Same&lt;/td&gt;
&lt;td&gt;Endpoint only exposes OSS GPT models (&lt;code&gt;gpt-oss-*&lt;/code&gt;) for now&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🌟 Key DX Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Parity, But Not Quite Full Compatibility:&lt;/strong&gt;
Any OpenAI SDK code (and most OpenAI agent frameworks) can now point at AWS’s Bedrock Runtime &lt;code&gt;/openai/v1&lt;/code&gt; endpoint with just env var tweaks — no code rewriting. This UX is 🔥 for engineers and innovation teams!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparent, Adjustable Reasoning:&lt;/strong&gt;
Get full chain-of-thought out-of-the-box, plus parametric reasoning level control. Perfect for math, science, coding, and explainability applications!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Calling Support—Almost There:&lt;/strong&gt;
AWS Bedrock OSS GPT models &lt;em&gt;recognize&lt;/em&gt; function call intent (return JSON), but don’t yet emit proper &lt;code&gt;tool_calls&lt;/code&gt; blocks in the response. (While the same test case passed for my local Ollama test on the same &lt;code&gt;gpt-oss&lt;/code&gt; model). This feels just like the early days of the &lt;code&gt;o1-mini&lt;/code&gt; models, so: expect better in future, but for now, your OpenAI tool-calling apps will need adapters or proxies. 🙂&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict OpenAI Param Parsing:&lt;/strong&gt;
If your code uses &lt;code&gt;max_tokens&lt;/code&gt;, it’ll break here — swap to &lt;code&gt;max_completion_tokens&lt;/code&gt;, as &lt;code&gt;gpt-oss&lt;/code&gt; is a reasoning model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Registry/Open Weight Only:&lt;/strong&gt;
The official endpoint supports only &lt;code&gt;gpt-oss&lt;/code&gt; models so far (&lt;code&gt;gpt-oss-120b&lt;/code&gt;, &lt;code&gt;gpt-oss-20b&lt;/code&gt;). Other Amazon Bedrock serverless models like &lt;code&gt;Nova&lt;/code&gt; will get you a fast fail as of now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Still Need Dev Proxy Power:&lt;/strong&gt;
&lt;a href="https://github.com/gabrielkoo/bedrock-access-gateway-function-url" rel="noopener noreferrer"&gt;bedrock-access-gateway-function-url&lt;/a&gt; is &lt;em&gt;not&lt;/em&gt; obsolete! It remains highly useful for compatibility workarounds or for non-&lt;code&gt;gpt-oss&lt;/code&gt; models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a disclaimer, the model documentation for &lt;code&gt;gpt-oss&lt;/code&gt; on AWS (&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-openai.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-openai.html&lt;/a&gt;) notes that tool calls are not directly supported for the Bedrock-hosted version as of now, so it’s entirely possible that once AWS rolls out more model support for the new OpenAI endpoint, these limitations will no longer be a concern!&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 DX Tips for Devs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add backward/forward compatibility adapters&lt;/strong&gt; if you’re building serious agentic stacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch the docs&lt;/strong&gt; (&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/inference-chat-completions.html" rel="noopener noreferrer"&gt;Invoke a model with the OpenAI Chat Completions API&lt;/a&gt;), but always test yourself — AWS hasn't yet clearly listed all the current limitations in the documentation!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port your scripts with care:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Reasoning, logic, coding, and OpenAI-style chat: 👍
&lt;/li&gt;
&lt;li&gt;Tool use, output schemas: ⚠️ adapter or proxy needed today!&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;OSS all the things!&lt;/strong&gt; Making the full open-weight LLM stack &lt;em&gt;API-portable&lt;/em&gt; is a big win for multi-cloud, local/cloud hybrid, and plain old sovereignty nerds. It feels like a new era for "bring your own infra, but keep your SDKs."&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Thanks to &lt;a href="https://kiro.dev/" rel="noopener noreferrer"&gt;Kiro AI IDE&lt;/a&gt; for the vibe-code collab, and to the AWS team for finally launching the OpenAI-compatible endpoint!&lt;/p&gt;

&lt;p&gt;AWS’s new official OpenAI-compatible endpoint for &lt;code&gt;gpt-oss&lt;/code&gt; models is a major leap for cloud/dev portability, even if a few “classic” OpenAI features like native function calling aren’t fully there (yet). If you need adapters, or want total control, proxies like mine remain essential. But for standard chat/reasoning patterns, Bedrock is now ready for your existing OpenAI SDK code.&lt;/p&gt;

&lt;p&gt;Happy building! 🚀🤖&lt;/p&gt;

&lt;p&gt;P.S. This article was written on 2025-08-06, one day after the new endpoint launched, so it's entirely possible that by the time you read it, AWS will have rolled out support for more of the common OpenAI SDK features!&lt;/p&gt;

</description>
      <category>bedrock</category>
      <category>aws</category>
      <category>openai</category>
    </item>
    <item>
      <title>Regain access to Amazon Q CLI in CloudShell with This Simple Trick</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Sat, 02 Aug 2025 16:53:56 +0000</pubDate>
      <link>https://dev.to/aws-builders/regain-access-to-amazon-q-cli-in-cloudshell-with-this-simple-trick-58m3</link>
      <guid>https://dev.to/aws-builders/regain-access-to-amazon-q-cli-in-cloudshell-with-this-simple-trick-58m3</guid>
      <description>&lt;p&gt;If you’ve tried to use the Amazon Q CLI a.k.a. &lt;code&gt;q chat&lt;/code&gt; command in &lt;strong&gt;AWS CloudShell&lt;/strong&gt; recently, you might have seen the frustrating message 🤯:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Q CLI integration is temporarily disabled. For continued access to Q Chat, please use the integrated version in your AWS Console.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is especially annoying for me as a true lover of Amazon Q CLI, which is so far the best GenAI CLI agent for daily AWS users.&lt;/p&gt;

&lt;p&gt;The good news: here’s a guide on how to force-reinstall Amazon Q CLI and log in with your own account. Plus, some perspective on why this happens, and why Amazon Q CLI is still a must-have for me 🛠️ — especially if you’re experimenting with GenAI in cloud environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Happening?
&lt;/h2&gt;

&lt;p&gt;As of time of writing (2025-08-02), AWS has temporarily disabled Amazon Q chat CLI features in CloudShell due to an internal issue (officially documented by AWS &lt;a href="https://docs.aws.amazon.com/cloudshell/latest/userguide/q-cli-features-in-cloudshell.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;). Q-based inline suggestions and the actual chat functionality are both disabled in CloudShell. ⛔️&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Is This an Issue (My Own Speculation)?
&lt;/h2&gt;

&lt;p&gt;I saw a user reporting the issue ⚠️ as early as 2025-05-23: &lt;a href="https://github.com/aws/amazon-q-developer-cli/issues/1944" rel="noopener noreferrer"&gt;https://github.com/aws/amazon-q-developer-cli/issues/1944&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here’s why I think it’s happening 🧠:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudShell&lt;/strong&gt; authenticates you for AWS credentials natively using &lt;strong&gt;AWS IAM users, roles, or Identity Center Access Role sessions&lt;/strong&gt; — so that you can do API calls like &lt;code&gt;aws sts get-caller-identity&lt;/code&gt; within CloudShell without the hassle of configuring the credentials.&lt;/p&gt;

&lt;p&gt;In contrast, &lt;strong&gt;Amazon Q CLI&lt;/strong&gt; expects authentication via &lt;strong&gt;AWS Builder ID&lt;/strong&gt; or &lt;strong&gt;AWS IAM Identity Center&lt;/strong&gt; for Pro plan.&lt;/p&gt;

&lt;p&gt;These authentication models may not be exactly compatible, especially for chat-based workflows requiring licensing or entitlements outside basic AWS IAM integration.&lt;/p&gt;

&lt;p&gt;One last possibility is that it is actually quite hard to apply limits to, or attribute, Amazon Q CLI usage. Imagine one’s Amazon Q CLI usage under a particular IAM user in a CloudShell session exceeding a fair pre-defined limit: one could simply create another IAM user and have the rate limits reset as a separate user. One could even set this up programmatically for literally unlimited Amazon Q CLI usage quota (absent any AWS-account-wide limit). 💸&lt;/p&gt;

&lt;p&gt;Until AWS aligns these authentication models behind the scenes, expect the fix to take the Amazon Q CLI team some more time.&lt;/p&gt;

&lt;h2&gt;
  
  
  But Amazon Q CLI is Still Cool!
&lt;/h2&gt;

&lt;p&gt;Even with this hiccup, Amazon Q CLI remains hugely valuable ⚡️:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-driven CLI suggestions right where you work&lt;/li&gt;
&lt;li&gt;Natural language command generation&lt;/li&gt;
&lt;li&gt;Instant code explanations and diagnostics&lt;/li&gt;
&lt;li&gt;Makes security and automation work faster and less error-prone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you use it outside CloudShell, it truly supercharges your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Force Reinstall &amp;amp; Login Steps
&lt;/h2&gt;

&lt;p&gt;Now to the main topic - how to force a clean install and log in with your own identity, bypassing CloudShell’s built-in integration.&lt;/p&gt;

&lt;p&gt;Easy. Since the AWS CloudShell compute environment is just x86_64 Amazon Linux under the hood, just force-uninstall it and then install it according to the &lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing-ssh-setup-autocomplete.html#command-line-download-install-file" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This removes the existing installation&lt;/span&gt;
&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="sb"&gt;`&lt;/span&gt;which q&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;span class="c"&gt;# Download the latest ZIP-based installer&lt;/span&gt;
curl &lt;span class="nt"&gt;--proto&lt;/span&gt; &lt;span class="s1"&gt;'=https'&lt;/span&gt; &lt;span class="nt"&gt;--tlsv1&lt;/span&gt;.2 &lt;span class="nt"&gt;-sSf&lt;/span&gt; &lt;span class="s2"&gt;"https://desktop-release.q.us-east-1.amazonaws.com/latest/q-x86_64-linux-musl.zip"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"q.zip"&lt;/span&gt;

unzip q.zip
./q/install.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When prompted:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do you want q to modify your shell config (you will have to manually do this otherwise)?&lt;br&gt;
Answer: &lt;strong&gt;Yes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This helps set up the &lt;strong&gt;Amazon Q CLI&lt;/strong&gt; and autocompletion back.&lt;/p&gt;

&lt;p&gt;Next, authenticate with your preferred credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;q login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll be guided through a browser-based process (powered by Builder ID, Identity Center, etc.), enter the code, and your CLI session will be authenticated. Once authorized, you get full access (outside limitations of CloudShell).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgm0ikm5c2x1mm5n8u95.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgm0ikm5c2x1mm5n8u95.jpeg" alt=" " width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s test it out. Try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;q &lt;span class="nb"&gt;help
&lt;/span&gt;q inline &lt;span class="nb"&gt;enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With proper authentication, you’ll see &lt;strong&gt;Amazon Q CLI&lt;/strong&gt; in all its glory (in supported environments) 🚀.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ente2yojzkpgmlmnnio.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ente2yojzkpgmlmnnio.jpeg" alt=" " width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Points to Note:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;With my method, your Amazon Q CLI usage in CloudShell is bound to your own AWS Builder ID account / your Identity Center user. Be aware of usage rate limits. 🧾&lt;/li&gt;
&lt;li&gt;AWS CloudShell environments are automatically deleted after &lt;strong&gt;120 days of inactivity&lt;/strong&gt;. You might need to redo this hack if you haven’t used a particular CloudShell environment for more than 120 days. 🗑️&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Don’t stress if &lt;code&gt;q chat&lt;/code&gt; is broken in CloudShell—it’s a temporary AWS-side limitation.&lt;/li&gt;
&lt;li&gt;Use my method to force-reinstall and log in with your own identity, bypassing the built-in &lt;strong&gt;AWS CloudShell&lt;/strong&gt; integration, for a working setup.&lt;/li&gt;
&lt;li&gt;Integration hurdles are likely caused by the difference in authentication flows: IAM / Identity Center built into CloudShell vs. Builder ID or Pro Plan requirements for Amazon Q CLI.&lt;/li&gt;
&lt;li&gt;Amazon Q CLI is a great tool—just leverage it where AWS supports full features, and watch for future CloudShell improvements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy hacking — enjoy more intelligent, secure, and productive time in your CloudShell terminal! ☁️&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>amazonqcli</category>
    </item>
    <item>
      <title>Building GitHub-Style Contribution Grids for dev.to Articles with Kiro AI IDE</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Sat, 02 Aug 2025 03:04:21 +0000</pubDate>
      <link>https://dev.to/kirodotdev/building-github-style-contribution-grids-for-devto-articles-with-ai-3fpn</link>
      <guid>https://dev.to/kirodotdev/building-github-style-contribution-grids-for-devto-articles-with-ai-3fpn</guid>
      <description>&lt;p&gt;&lt;em&gt;How I created beautiful GitHub-style contribution graphs and traffic analytics for my dev.to articles using Kiro.dev - without writing much code myself&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcnti335gilug6n93fd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcnti335gilug6n93fd9.png" alt=" " width="800" height="707"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Understanding Your Content's Impact
&lt;/h2&gt;

&lt;p&gt;As a technical writer publishing on dev.to, I found myself constantly wondering about my articles' performance. Sure, dev.to provides basic stats, but I wanted something more comprehensive - something that could show me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub-style contribution grids&lt;/strong&gt; for my writing activity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic source analysis&lt;/strong&gt; to understand where readers discover my content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical trends&lt;/strong&gt; to see how articles perform over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beautiful visualizations&lt;/strong&gt; I could embed anywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Working frequently with SEO teams, I was particularly interested in understanding traffic sources. Were my articles getting discovered through Google searches? LinkedIn shares? Twitter threads? Newsletter mentions? This data would be crucial for optimizing my content distribution strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: GitHub-Style Analytics with Traffic Insights
&lt;/h2&gt;

&lt;p&gt;What started as a simple idea turned into a comprehensive analytics system that fetches data from dev.to's API and generates beautiful SVG visualizations. The system creates GitHub-style contribution grids and tracks:&lt;/p&gt;

&lt;h3&gt;
  
  
  📊 Core Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Views&lt;/strong&gt;: Daily article view counts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reactions&lt;/strong&gt;: Likes, unicorns, and bookmarks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comments&lt;/strong&gt;: Reader engagement levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Combined Activity&lt;/strong&gt;: Weighted scoring system&lt;/li&gt;
&lt;/ul&gt;
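&lt;p&gt;The &lt;em&gt;combined activity&lt;/em&gt; score above can be sketched as a simple weighted sum. The weights below are illustrative assumptions, not the project's actual values:&lt;/p&gt;

```python
# Hypothetical weights for the combined activity score; the project's
# actual weighting may differ.
WEIGHTS = {"views": 1, "reactions": 5, "comments": 10}

def activity_score(views: int, reactions: int, comments: int) -> int:
    """Collapse one day's metrics into a single weighted activity value."""
    return (WEIGHTS["views"] * views
            + WEIGHTS["reactions"] * reactions
            + WEIGHTS["comments"] * comments)

print(activity_score(views=120, reactions=4, comments=2))  # 160
```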

&lt;h3&gt;
  
  
  🌐 Traffic Source Analysis
&lt;/h3&gt;

&lt;p&gt;The system tracks where your readers come from and visualizes it in a beautiful pie chart:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct traffic&lt;/strong&gt; (users typing your URL directly)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search engines&lt;/strong&gt; (Google, Bing, DuckDuckGo)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Social media&lt;/strong&gt; (LinkedIn, Twitter, Facebook)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer platforms&lt;/strong&gt; (GitHub, dev.to internal traffic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Newsletters and blogs&lt;/strong&gt; (Substack, personal blogs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Professional tools&lt;/strong&gt; (Slack, Microsoft Office)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8cu4i8oc6msf1a0exy5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8cu4i8oc6msf1a0exy5.png" alt="Traffic sources pie chart showing distribution of 19,096 total views" width="800" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  📈 GitHub-Style Contribution Grids
&lt;/h3&gt;

&lt;p&gt;The system generates six types of beautiful SVG visualizations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Views Activity Grid&lt;/strong&gt; - GitHub-style contribution graph with green color scheme&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reactions Activity Grid&lt;/strong&gt; - Purple-themed grid showing engagement patterns
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Combined Activity Grid&lt;/strong&gt; - Orange-themed weighted activity visualization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top Articles by Views&lt;/strong&gt; - Horizontal bar chart of most-viewed content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top Articles by Reactions&lt;/strong&gt; - Bar chart highlighting most engaging articles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic Sources Pie Chart&lt;/strong&gt; - Visual breakdown of where your readers come from&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Magic: Built with Kiro.dev
&lt;/h2&gt;

&lt;p&gt;Here's where the story gets interesting - &lt;strong&gt;I didn't write most of this code myself&lt;/strong&gt;. The entire project was created using &lt;a href="https://kiro.dev/" rel="noopener noreferrer"&gt;Kiro.dev&lt;/a&gt;, an AI-powered development environment that acts as your coding partner.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Kiro.dev Transformed My Workflow
&lt;/h3&gt;

&lt;p&gt;Instead of spending weeks researching APIs, designing data structures, and debugging visualization code, I simply described what I wanted:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I want to create GitHub-style contribution grids for my dev.to articles, track traffic sources, and generate beautiful SVG visualizations I can embed anywhere."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Kiro.dev understood the requirements and generated the complete system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data fetching&lt;/strong&gt; - &lt;code&gt;fetch_stats.py&lt;/code&gt; handles API integration with error handling and rate limiting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contribution grids&lt;/strong&gt; - &lt;code&gt;generate_advanced_graph.py&lt;/code&gt; creates GitHub-style activity visualizations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top articles charts&lt;/strong&gt; - &lt;code&gt;generate_top_articles.py&lt;/code&gt; generates ranking visualizations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic analysis&lt;/strong&gt; - &lt;code&gt;generate_traffic_pie_chart.py&lt;/code&gt; creates traffic source breakdowns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt; - &lt;code&gt;.github/workflows/&lt;/code&gt; sets up daily updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data management&lt;/strong&gt; - Smart incremental updates and referrer tracking&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Development Experience
&lt;/h3&gt;

&lt;p&gt;Working with Kiro.dev felt like pair programming with an expert developer who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understood context&lt;/strong&gt; - Grasped the full project scope from minimal descriptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Made smart decisions&lt;/strong&gt; - Chose appropriate data structures and algorithms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrote clean code&lt;/strong&gt; - Generated well-documented, maintainable Python scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anticipated needs&lt;/strong&gt; - Added features I hadn't even thought of, like referrer tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI didn't just write code - it architected a complete solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Architecture
&lt;/h3&gt;

&lt;p&gt;The system organizes data efficiently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/
├── articles/           # Individual article analytics
│   └── {id}-{slug}.json
├── account.json        # Aggregated account stats
└── top_articles.json   # Rankings by metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each article file contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic metrics (views, comments, reactions)&lt;/li&gt;
&lt;li&gt;Daily breakdown of activity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Referrer data&lt;/strong&gt; showing traffic sources&lt;/li&gt;
&lt;li&gt;Metadata for URL construction&lt;/li&gt;
&lt;/ul&gt;
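&lt;p&gt;For illustration, a single article file might look like the sketch below. The field names are assumptions based on the description above, not the project's exact schema:&lt;/p&gt;

```python
import json

# Hypothetical contents of a data/articles/{id}-{slug}.json file; the
# real field names in the project may differ.
article = {
    "id": 123456,
    "slug": "my-article",
    "url": "https://dev.to/user/my-article",
    "totals": {"views": 9021, "reactions": 310, "comments": 42},
    "daily": {
        "2026-03-01": {"views": 87, "reactions": 3, "comments": 1},
    },
    "referrers": {"google.com": 450, "t.co": 120, "direct": 900},
}

# Round-trip through JSON, as the fetcher would when persisting to disk.
restored = json.loads(json.dumps(article))
print(restored["totals"]["views"])  # 9021
```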

&lt;h3&gt;
  
  
  API Integration
&lt;/h3&gt;

&lt;p&gt;The system uses three dev.to API endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/analytics/historical&lt;/code&gt; - Historical analytics data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/analytics/referrers&lt;/code&gt; - Traffic source information&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/articles/me/published&lt;/code&gt; - Article listings&lt;/li&gt;
&lt;/ul&gt;
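&lt;p&gt;A minimal client for those endpoints could look like this. The &lt;code&gt;api-key&lt;/code&gt; header and the example parameter names are my assumptions about the dev.to (Forem) API, so check the official API docs before relying on them:&lt;/p&gt;

```python
import json
import os
import urllib.parse
import urllib.request

API_BASE = "https://dev.to/api"

def build_url(endpoint: str, **params) -> str:
    """Compose a request URL for one of the endpoints listed above."""
    query = urllib.parse.urlencode(params)
    return API_BASE + endpoint + ("?" + query if query else "")

def fetch(endpoint: str, **params):
    """GET an endpoint, authenticating with the api-key header."""
    req = urllib.request.Request(
        build_url(endpoint, **params),
        headers={"api-key": os.environ["DEVTO_API_KEY"]},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example calls (parameter names are illustrative):
# fetch("/articles/me/published", per_page=100)
# fetch("/analytics/referrers", article_id=123456)
```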

&lt;h3&gt;
  
  
  Smart Updates
&lt;/h3&gt;

&lt;p&gt;The fetcher implements intelligent incremental updates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only fetches new data since last update&lt;/li&gt;
&lt;li&gt;Refreshes the second-to-last day to catch delayed analytics&lt;/li&gt;
&lt;li&gt;Handles API rate limits gracefully&lt;/li&gt;
&lt;li&gt;Maintains data integrity with backup strategies&lt;/li&gt;
&lt;/ul&gt;
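&lt;p&gt;A simplified reading of that update window: fetch everything after the last stored day, and re-fetch one earlier day to pick up delayed analytics. A sketch:&lt;/p&gt;

```python
from datetime import date, timedelta

def dates_to_fetch(last_fetched: date, today: date) -> list:
    """Days to (re)fetch: one day before the last stored day, through today."""
    start = last_fetched - timedelta(days=1)  # re-fetch to catch delayed stats
    n_days = (today - start).days
    return [start + timedelta(days=i) for i in range(n_days + 1)]

days = dates_to_fetch(date(2026, 3, 20), date(2026, 3, 22))
print([d.isoformat() for d in days])
# ['2026-03-19', '2026-03-20', '2026-03-21', '2026-03-22']
```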

&lt;h3&gt;
  
  
  Visualization Engine
&lt;/h3&gt;

&lt;p&gt;The SVG generation system creates publication-ready graphics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Responsive design&lt;/strong&gt; that works at any size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive tooltips&lt;/strong&gt; with detailed information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clickable elements&lt;/strong&gt; linking to original articles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple color schemes&lt;/strong&gt; for different contexts&lt;/li&gt;
&lt;/ul&gt;
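&lt;p&gt;At its core, a contribution grid maps each day's count to a color bucket and emits one SVG rect per day. A minimal sketch, with an illustrative palette and bucketing rule (not the project's exact values):&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# GitHub-like green palette; colors and thresholds are illustrative.
GREENS = ["#ebedf0", "#9be9a8", "#40c463", "#30a14e", "#216e39"]

def bucket(count: int, max_count: int) -> int:
    """0 = empty cell; 1-4 = intensity relative to the max observed count."""
    if count == 0 or max_count == 0:
        return 0
    return min(4, 1 + (4 * (count - 1)) // max_count)

def cell(x: int, y: int, count: int, max_count: int) -> str:
    """One day's square in the grid, serialized as an SVG rect element."""
    rect = ET.Element("rect", {
        "x": str(x), "y": str(y), "width": "10", "height": "10",
        "rx": "2", "fill": GREENS[bucket(count, max_count)],
    })
    return ET.tostring(rect, encoding="unicode")

print(cell(0, 0, 20, 20))
```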

&lt;h2&gt;
  
  
  Key Insights from Traffic Analysis
&lt;/h2&gt;

&lt;p&gt;After implementing this system, I discovered fascinating patterns in my content's reach:&lt;/p&gt;

&lt;h3&gt;
  
  
  Traffic Source Breakdown
&lt;/h3&gt;

&lt;p&gt;From my 19,096 total views, the top 10 referrers account for 14,974 views (78.4%):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Direct traffic (41.6%)&lt;/strong&gt; - 7,939 views from users directly accessing articles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google.com (24.3%)&lt;/strong&gt; - 4,639 views from organic search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;t.co (3.2%)&lt;/strong&gt; - 619 views from Twitter links&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn.com (3.0%)&lt;/strong&gt; - 582 views from professional network shares&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dev.to (2.4%)&lt;/strong&gt; - 459 views from platform internal discovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;office.net (0.9%)&lt;/strong&gt; - 172 views from Microsoft Office applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;duckduckgo.com (0.9%)&lt;/strong&gt; - 172 views from privacy-focused search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;bing.com (0.8%)&lt;/strong&gt; - 158 views from Microsoft search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;facebook.com (0.6%)&lt;/strong&gt; - 121 views from social media shares&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mrugalski.pl (0.6%)&lt;/strong&gt; - 113 views from a personal blog referral&lt;/li&gt;
&lt;/ol&gt;
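&lt;p&gt;The percentages above follow directly from the raw counts, which is easy to sanity-check:&lt;/p&gt;

```python
# Recomputing the shares quoted above from the raw view counts.
total_views = 19096
top_referrers = {
    "direct": 7939, "google.com": 4639, "t.co": 619, "linkedin.com": 582,
    "dev.to": 459, "office.net": 172, "duckduckgo.com": 172,
    "bing.com": 158, "facebook.com": 121, "mrugalski.pl": 113,
}

top_total = sum(top_referrers.values())
print(top_total)                                              # 14974
print(round(100 * top_total / total_views, 1))                # 78.4
print(round(100 * top_referrers["direct"] / total_views, 1))  # 41.6
```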

&lt;h3&gt;
  
  
  SEO Team Goldmine
&lt;/h3&gt;

&lt;p&gt;This data proved invaluable for SEO strategy discussions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search dominance&lt;/strong&gt; - Google alone drives 24.3% of traffic, with DuckDuckGo (0.9%) and Bing (0.8%) adding to the search total&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Social amplification&lt;/strong&gt; - Twitter (3.2%) and LinkedIn (3.0%) show strong professional sharing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform dynamics&lt;/strong&gt; - Only 2.4% comes from dev.to's internal discovery, showing external reach&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct engagement&lt;/strong&gt; - 41.6% direct traffic indicates strong brand recognition and bookmarking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Content Performance Patterns
&lt;/h3&gt;

&lt;p&gt;The top-performing article "I bought us-east-1.com" generated 9,021 views with fascinating traffic patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;71% direct traffic&lt;/strong&gt; - Indicating strong word-of-mouth sharing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6.6% Twitter traffic&lt;/strong&gt; - Viral social media spread&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5.7% LinkedIn traffic&lt;/strong&gt; - Professional network engagement&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Showcasing Results: Raw GitHub Content Integration
&lt;/h2&gt;

&lt;p&gt;One of the most powerful features is the ability to embed these visualizations anywhere using GitHub's raw content URLs. Here's how:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Generate Your Visualizations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# GitHub-style contribution grids&lt;/span&gt;
python3 generate_advanced_graph.py &lt;span class="nt"&gt;--metric&lt;/span&gt; views &lt;span class="nt"&gt;--color&lt;/span&gt; github
python3 generate_advanced_graph.py &lt;span class="nt"&gt;--metric&lt;/span&gt; reactions &lt;span class="nt"&gt;--color&lt;/span&gt; purple

&lt;span class="c"&gt;# Top articles charts  &lt;/span&gt;
python3 generate_top_articles.py &lt;span class="nt"&gt;--metric&lt;/span&gt; reactions &lt;span class="nt"&gt;--count&lt;/span&gt; 3

&lt;span class="c"&gt;# Traffic sources analysis&lt;/span&gt;
python3 generate_traffic_pie_chart.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Embed in README Files
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;![&lt;/span&gt;&lt;span class="nv"&gt;Views Activity&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://raw.githubusercontent.com/yourusername/yourrepo/main/graphs/devto_views_graph.svg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;![&lt;/span&gt;&lt;span class="nv"&gt;Top Articles&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://raw.githubusercontent.com/yourusername/yourrepo/main/graphs/top_3_reactions.svg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;![&lt;/span&gt;&lt;span class="nv"&gt;Traffic Sources&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://raw.githubusercontent.com/yourusername/yourrepo/main/graphs/traffic_sources_pie.svg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Use in Documentation
&lt;/h3&gt;

&lt;p&gt;The SVGs work perfectly in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub README files&lt;/strong&gt; - Showcase your writing activity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portfolio websites&lt;/strong&gt; - Demonstrate content creation consistency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blog posts&lt;/strong&gt; - Visual proof of engagement metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Social media&lt;/strong&gt; - Eye-catching performance summaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskmwik785v31ggx0ecul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskmwik785v31ggx0ecul.png" alt="README file showing embedded SVG graphs" width="800" height="1070"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pro Tips for Display
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Repository structure&lt;/strong&gt; - Keep graphs in a dedicated &lt;code&gt;/graphs&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Naming convention&lt;/strong&gt; - Use descriptive filenames like &lt;code&gt;devto_views_graph.svg&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update automation&lt;/strong&gt; - GitHub Actions ensures graphs stay current&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple formats&lt;/strong&gt; - Generate different metrics and color schemes for various contexts&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Automation Layer
&lt;/h2&gt;

&lt;p&gt;GitHub Actions keeps everything current with daily updates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Runs daily at midnight UTC&lt;/span&gt;
&lt;span class="c1"&gt;# Fetches latest analytics&lt;/span&gt;
&lt;span class="c1"&gt;# Generates updated visualizations&lt;/span&gt;
&lt;span class="c1"&gt;# Commits changes automatically&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
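&lt;p&gt;In full, the workflow might look something like this. Step names, paths, and the secret name are illustrative; the repository's actual workflow file may differ:&lt;/p&gt;

```yaml
# Illustrative GitHub Actions workflow, not the repo's exact file.
name: Update dev.to stats
on:
  schedule:
    - cron: "0 0 * * *"   # daily at midnight UTC
  workflow_dispatch: {}    # allow manual runs
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch latest analytics
        run: python3 fetch_stats.py
        env:
          DEVTO_API_KEY: ${{ secrets.DEVTO_API_KEY }}
      - name: Regenerate visualizations
        run: python3 generate_advanced_graph.py --metric views --color github
      - name: Commit changes
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add data/ graphs/
          git commit -m "chore: update stats" || echo "No changes"
          git push
```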



&lt;p&gt;This means your analytics dashboard stays fresh without any manual intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. AI-Assisted Development is Transformative
&lt;/h3&gt;

&lt;p&gt;Kiro.dev didn't just speed up development - it elevated the entire solution. The AI suggested features and optimizations I wouldn't have considered, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logarithmic scaling for better visualization distribution&lt;/li&gt;
&lt;li&gt;Referrer data aggregation for traffic source analysis&lt;/li&gt;
&lt;li&gt;Incremental update strategies for API efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Data Visualization Drives Insights
&lt;/h3&gt;

&lt;p&gt;Having visual representations of my writing activity revealed patterns invisible in raw numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistency gaps in publishing schedule&lt;/li&gt;
&lt;li&gt;Correlation between article topics and engagement&lt;/li&gt;
&lt;li&gt;Traffic source diversity indicating content reach&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. SEO Integration is Crucial
&lt;/h3&gt;

&lt;p&gt;The referrer tracking proved invaluable for SEO discussions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identifying high-performing traffic sources&lt;/li&gt;
&lt;li&gt;Understanding content discovery patterns&lt;/li&gt;
&lt;li&gt;Optimizing distribution strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Want to build your own analytics dashboard? The complete system is available as an open-source project. Here's how to get started:&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Setup
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Fork the repository - 
&lt;strong&gt;&lt;a href="https://github.com/gabrielkoo/devto-stats-github-action" rel="noopener noreferrer"&gt;📊 devto-stats-github-action&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Add your dev.to API key to GitHub Actions Secrets&lt;/li&gt;
&lt;li&gt;Enable GitHub Actions for automatic updates&lt;/li&gt;
&lt;li&gt;Customize visualizations to match your brand&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Customization Options
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Color schemes&lt;/strong&gt; - GitHub green, purple, blue, or orange themes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics focus&lt;/strong&gt; - Emphasize views, reactions, or combined activity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time ranges&lt;/strong&gt; - Adjust historical data collection periods&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph types&lt;/strong&gt; - Modify visualization styles and layouts&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building this analytics dashboard taught me that the future of development isn't about replacing developers - it's about amplifying our capabilities. With Kiro.dev as my coding partner, I created a sophisticated system that would have taken weeks to build manually.&lt;/p&gt;

&lt;p&gt;The insights gained from traffic source analysis have already improved my content strategy, and the beautiful visualizations provide compelling evidence of writing consistency and engagement.&lt;/p&gt;

&lt;p&gt;Most importantly, this project demonstrates how AI-assisted development can help creators focus on what matters most - creating great content - while still building the tools needed to understand and optimize their impact.&lt;/p&gt;

&lt;p&gt;Whether you're a technical writer, developer advocate, or content creator, having detailed analytics about your work's reach and impact is invaluable. And with tools like Kiro.dev, building these systems is more accessible than ever.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ready to build your own analytics dashboard? Check out the complete source code and start tracking your content's impact today.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/gabrielkoo/devto-stats-github-action" rel="noopener noreferrer"&gt;📊 devto-stats-github-action&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcaprspsujgieu9vx7tuo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcaprspsujgieu9vx7tuo.png" alt="Sample" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devto</category>
      <category>kirodev</category>
      <category>seo</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Why I Built a Web UI for Amazon Q Developer CLI (And How I Vibe-Coded It)</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Thu, 26 Jun 2025 00:52:11 +0000</pubDate>
      <link>https://dev.to/aws-builders/why-i-built-a-web-ui-for-amazon-q-developer-cli-and-how-i-vibe-coded-it-54d6</link>
      <guid>https://dev.to/aws-builders/why-i-built-a-web-ui-for-amazon-q-developer-cli-and-how-i-vibe-coded-it-54d6</guid>
      <description>&lt;h2&gt;
  
  
  🚀 &lt;strong&gt;Beyond the Console: Why I Built a Web UI for the Amazon Q Developer CLI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/q/" rel="noopener noreferrer"&gt;Amazon Q&lt;/a&gt; is a &lt;strong&gt;game-changer for developers&lt;/strong&gt;, but if you've only used it inside the &lt;strong&gt;AWS Console&lt;/strong&gt;, you're only seeing part of the picture. Using Q in the Console is great for general AWS questions, but when it comes to &lt;strong&gt;interacting with your own code and environment&lt;/strong&gt;, it hits fundamental limits.&lt;/p&gt;

&lt;p&gt;My project started not from a corporate need, but from a personal one familiar to many developers: the &lt;strong&gt;homelab&lt;/strong&gt;. I wanted a way to make quick fixes on my server from anywhere, but was constantly frustrated by the &lt;strong&gt;clunky experience of mobile terminals&lt;/strong&gt;. This is the story of why I built the &lt;strong&gt;Amazon Q Developer CLI WebUI&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  📱 &lt;strong&gt;The Real Problem: The Mobile Terminal Nightmare&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Picture this: you're on the couch with your phone and have an idea for a quick change on your homelab server. You log in through a mobile web terminal and immediately hit a wall. Have you ever tried to &lt;strong&gt;paste a multiline command&lt;/strong&gt; into one? Or &lt;strong&gt;send a &lt;code&gt;Ctrl+C&lt;/code&gt; to stop a runaway process&lt;/strong&gt;? It's a nightmare of awkward UIs, missed keystrokes, and immense frustration.&lt;/p&gt;

&lt;h3&gt;
  
  
  🧱 &lt;strong&gt;Example 1&lt;/strong&gt;: The Multi-Line &lt;code&gt;docker run&lt;/code&gt; Command
&lt;/h3&gt;

&lt;p&gt;A common task in a homelab is running a new Docker container with specific ports and volumes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/data:/app/data &lt;span class="se"&gt;\&lt;/span&gt;
  nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mobile Frustrations&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Line Breaks&lt;/strong&gt;: Typing a backslash (&lt;code&gt;\&lt;/code&gt;) then hitting Enter is clumsy and error-prone on mobile keyboards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Special Characters&lt;/strong&gt;: Juggling &lt;code&gt;-&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;=&lt;/code&gt;, and &lt;code&gt;$&lt;/code&gt; requires constant switching between keyboard symbol layouts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Editing&lt;/strong&gt;: Fixing a typo in the middle of this command on a touchscreen is &lt;strong&gt;incredibly difficult&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔍 &lt;strong&gt;Example 2&lt;/strong&gt;: The Piped &lt;code&gt;find&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt; Command
&lt;/h3&gt;

&lt;p&gt;You need to search for a specific configuration line within all &lt;code&gt;.conf&lt;/code&gt; files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"*.conf"&lt;/span&gt; | xargs &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"listen_port"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mobile Frustrations&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keyboard Gymnastics&lt;/strong&gt;: This short command is packed with hard-to-access symbols on mobile keyboards: &lt;code&gt;.&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt;, &lt;code&gt;"&lt;/code&gt;, and especially &lt;code&gt;|&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Pipe (&lt;code&gt;|&lt;/code&gt;)&lt;/strong&gt;: Often deeply hidden, requiring multiple taps to find and type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;Ctrl&lt;/code&gt; Key&lt;/strong&gt;: If a command hangs, the lack of an easy &lt;code&gt;Ctrl+C&lt;/code&gt; makes stopping it a nightmare.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This frustration becomes a &lt;strong&gt;major barrier&lt;/strong&gt; when using powerful tools like the &lt;strong&gt;Amazon Q Developer CLI&lt;/strong&gt;. The CLI is the perfect assistant for the job—but it's &lt;strong&gt;shackled to a terminal&lt;/strong&gt; that mobile browsers just can't handle.&lt;/p&gt;

&lt;h2&gt;
  
  
  🤯 Other Reasons for Building This Project
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You cannot install the great Amazon Q Developer CLI on &lt;strong&gt;Windows&lt;/strong&gt; 🪟 unless you set up Windows Subsystem for Linux (WSL).&lt;/li&gt;
&lt;li&gt;Other AI agent alternatives usually fall into one of the cases below:

&lt;ul&gt;
&lt;li&gt;Pure server-side AI agents - like Gemini/ChatGPT, which have access to tools/canvas and can even run code for you, but are not connected to your computer or servers&lt;/li&gt;
&lt;li&gt;CLI AI agents - like Codex CLI, which require a bring-your-own-key setup, so you have to watch your context length carefully and worry about token usage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  🤖 &lt;strong&gt;Q Developer CLI: An Agent with Better Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Think of the &lt;a href="https://github.com/aws/amazon-q-developer-cli" rel="noopener noreferrer"&gt;Amazon Q Developer CLI&lt;/a&gt; as a &lt;strong&gt;specialized AI agent&lt;/strong&gt; with more powerful "tools" than the AWS Console version. Its biggest advantage? &lt;strong&gt;File system access&lt;/strong&gt;, giving it the &lt;strong&gt;context it needs to be a true coding partner&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Also look at how I vibe coded a Star Wars inspired lightsaber duel game: &lt;a href="https://dev.to/aws-builders/building-a-plasma-sword-fighter-game-with-amazon-q-cli-279g"&gt;Building A Plasma Sword Fighter Game with Amazon Q CLI&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🔍 &lt;strong&gt;Access Comparison&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Q in AWS Console&lt;/th&gt;
&lt;th&gt;Amazon Q Developer CLI&lt;/th&gt;
&lt;th&gt;Amazon Q CLI via WebUI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Use Case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;General AWS service Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;Deep, contextual code development&lt;/td&gt;
&lt;td&gt;Contextual development on any device&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Awareness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ No File System Access&lt;/td&gt;
&lt;td&gt;✅ Full Project Context&lt;/td&gt;
&lt;td&gt;✅ Full Project Context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interaction Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GUI Chat Window&lt;/td&gt;
&lt;td&gt;Command-Line Interface&lt;/td&gt;
&lt;td&gt;Browser-Based Terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accessibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires AWS login&lt;/td&gt;
&lt;td&gt;Requires desktop terminal&lt;/td&gt;
&lt;td&gt;Access from any browser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mobile Usability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Serviceable UI&lt;/td&gt;
&lt;td&gt;❌ Very Difficult&lt;/td&gt;
&lt;td&gt;✅ Designed for Web/Mobile&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The table shows the gap: the &lt;strong&gt;most powerful version&lt;/strong&gt; (CLI) is the &lt;strong&gt;least accessible&lt;/strong&gt;—unless we bridge that gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  🛠️ &lt;strong&gt;The Solution: A Self-Hosted Web Interface for the Q CLI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyz4n5083wa83h1knhse.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyz4n5083wa83h1knhse.jpg" alt=" " width="800" height="921"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See it in action: &lt;a href="https://github.com/user-attachments/assets/99053791-17c5-4f09-bddb-d5b9ecd61cc0" rel="noopener noreferrer"&gt;https://github.com/user-attachments/assets/99053791-17c5-4f09-bddb-d5b9ecd61cc0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My vibe-coded solution: &lt;strong&gt;Amazon Q Developer CLI WebUI&lt;/strong&gt;—a bridge to the full CLI that runs in &lt;strong&gt;any modern browser&lt;/strong&gt;. It wraps around the native &lt;code&gt;q&lt;/code&gt; command, exposing its full feature set without needing a traditional terminal.&lt;/p&gt;

&lt;p&gt;I started by telling the Amazon Q Developer CLI &lt;em&gt;"I want to re-create the Amazon Q Developer CLI experience in a web UI"&lt;/em&gt;. Q came up with the plan and the initial tech stack, and after a few rounds of iteration and testing, the final solution emerged in less than an hour:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend: &lt;code&gt;Node.js&lt;/code&gt;, &lt;code&gt;Express&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Real-Time Communication: &lt;code&gt;socket.io&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Terminal Emulation: &lt;code&gt;node-pty&lt;/code&gt; for spawning a real &lt;code&gt;pty&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
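&lt;p&gt;The key idea - spawning the CLI inside a real pseudo-terminal and streaming its bytes elsewhere - can be sketched with Python's &lt;code&gt;pty&lt;/code&gt; module (a Unix-only sketch; the project itself does this in Node.js with &lt;code&gt;node-pty&lt;/code&gt; and &lt;code&gt;socket.io&lt;/code&gt;):&lt;/p&gt;

```python
import os
import pty
import subprocess

# Allocate a pseudo-terminal pair so the child believes it is attached
# to a real terminal (colors, prompts, and Ctrl+C handling all work).
master, slave = pty.openpty()
proc = subprocess.Popen(
    ["echo", "hello from a pty"],  # stand-in for the real `q chat` process
    stdin=slave, stdout=slave, stderr=slave, close_fds=True,
)
os.close(slave)

# In the WebUI these bytes would be pushed to the browser over a WebSocket,
# and keystrokes from the browser would be written back to `master`.
output = os.read(master, 1024)
proc.wait()
os.close(master)
print(output.decode(errors="replace").strip())
```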

&lt;p&gt;The result? A &lt;strong&gt;fluid, interactive CLI experience&lt;/strong&gt; from a browser—even on mobile.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 &lt;strong&gt;Getting Started&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Make sure you have &lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing.html" rel="noopener noreferrer"&gt;&lt;code&gt;Amazon Q Developer CLI&lt;/code&gt; installed&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 https://github.com/gabrielkoo/amazon-q-developer-cli-webui
&lt;span class="nb"&gt;cd &lt;/span&gt;amazon-q-developer-cli-webui
npm &lt;span class="nb"&gt;install
&lt;/span&gt;q login &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm start  &lt;span class="c"&gt;# --host 0.0.0.0&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# &amp;gt; amazon-q-cli-webui@0.0.1 start&lt;/span&gt;
&lt;span class="c"&gt;# &amp;gt; node server.js&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# Server running on http://localhost:3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🌐 &lt;strong&gt;Making Web UI Available On Your Mobile&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Use any of the typical homelab methods to expose &lt;code&gt;localhost:3000&lt;/code&gt; online:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🌍 &lt;strong&gt;Tailscale&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;☁️ &lt;strong&gt;Cloudflare Tunnel&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🚪 &lt;strong&gt;ngrok&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🔐 &lt;strong&gt;Open port 3000 + IP Whitelisting&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first two options are preferred if security is a priority, since they avoid exposing an open port directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  📲 &lt;strong&gt;Unlocking a Truly Mobile Workflow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s revisit the homelab scenario. You’re reviewing a Python script that needs a fix.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before&lt;/strong&gt;: You’d struggle with a mobile SSH app, fighting to type commands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After&lt;/strong&gt;: You just open a browser tab to your WebUI. Type: &lt;em&gt;"Review &lt;code&gt;fix_script.py&lt;/code&gt;, identify bugs, and suggest improvements."&lt;/em&gt; Copy, paste, and interact—&lt;strong&gt;touch-friendly and frustration-free&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc45nhndm23zkdrravnbb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc45nhndm23zkdrravnbb.png" alt=" " width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnlsjwqj812vlmpzs9zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnlsjwqj812vlmpzs9zr.png" alt=" " width="800" height="1738"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiehysd2feowmczdkigx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiehysd2feowmczdkigx.png" alt=" " width="800" height="1738"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Live demo: &lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2z9oxc4bm8waeor2nriy.gif" rel="noopener noreferrer"&gt;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2z9oxc4bm8waeor2nriy.gif&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ✨ &lt;strong&gt;More Than a Wrapper: An Enhanced User Experience&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I focused on &lt;strong&gt;authenticity and comfort&lt;/strong&gt;, solving the mobile pain points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📝 &lt;strong&gt;Proper Input&lt;/strong&gt;: A real multiline &lt;code&gt;&amp;lt;textarea&amp;gt;&lt;/code&gt; makes command editing natural.&lt;/li&gt;
&lt;li&gt;💬 &lt;strong&gt;Streaming Token Display&lt;/strong&gt;: Outputs appear with a "typing" effect, just like the CLI.&lt;/li&gt;
&lt;li&gt;🎨 &lt;strong&gt;Full ANSI Support&lt;/strong&gt;: Colors, bolding, and formatting are preserved for readability.&lt;/li&gt;
&lt;li&gt;🤖 &lt;strong&gt;Let Q Work for You&lt;/strong&gt;: Don’t sweat every keystroke—&lt;strong&gt;Amazon Q can fix and run code for you&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project was born from a simple need: &lt;strong&gt;use my favorite tools anywhere&lt;/strong&gt;. It’s about &lt;strong&gt;unlocking the full power of Amazon Q&lt;/strong&gt;, even far from a real keyboard.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;Check it out here&lt;/strong&gt;: &lt;a href="https://github.com/gabrielkoo/amazon-q-developer-cli-webui" rel="noopener noreferrer"&gt;https://github.com/gabrielkoo/amazon-q-developer-cli-webui&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>vibecoding</category>
      <category>devex</category>
    </item>
    <item>
      <title>Building A Plasma Sword Fighter Game with Amazon Q CLI</title>
      <dc:creator>Gabriel Koo</dc:creator>
      <pubDate>Wed, 18 Jun 2025 15:35:30 +0000</pubDate>
      <link>https://dev.to/aws-builders/building-a-plasma-sword-fighter-game-with-amazon-q-cli-279g</link>
      <guid>https://dev.to/aws-builders/building-a-plasma-sword-fighter-game-with-amazon-q-cli-279g</guid>
      <description>&lt;p&gt;As a &lt;strong&gt;DevSecOps engineer&lt;/strong&gt;, my daily grind usually involves &lt;strong&gt;CI/CD pipelines&lt;/strong&gt;, &lt;strong&gt;security audits&lt;/strong&gt;, and &lt;strong&gt;infrastructure as code&lt;/strong&gt;. So, when the "Build Games with Amazon Q CLI" campaign popped up, it was a refreshing detour from the usual. The idea of conjuring a game with just &lt;strong&gt;conversational prompts&lt;/strong&gt;, powered by &lt;strong&gt;Amazon Q CLI's Claude 4 large language model&lt;/strong&gt;, was too intriguing to pass up. This isn't the usual realm of an "enthusiast" for me, but more of an exploration into how &lt;strong&gt;AI can augment a developer's toolkit&lt;/strong&gt;, even outside their primary domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Game: A "Space Trilogy" Inspired Plasma Sword Fighter ✨
&lt;/h2&gt;

&lt;p&gt;My concept for the game was heavily inspired by the classic &lt;strong&gt;"Space Trilogy" narratives&lt;/strong&gt; (you know the ones 😉), where the good guys wield &lt;strong&gt;blue "light swords"&lt;/strong&gt; and the antagonists opt for menacing &lt;strong&gt;red ones&lt;/strong&gt;. I wanted to capture that classic duel vibe, with players having &lt;strong&gt;supernatural "force push" abilities&lt;/strong&gt; to add another layer to the combat. The result is a &lt;strong&gt;"Plasma Sword Fighter" game&lt;/strong&gt; – a &lt;strong&gt;two-player combat experience&lt;/strong&gt; featuring &lt;strong&gt;real-time sword mechanics&lt;/strong&gt; and &lt;strong&gt;tactical force pushes&lt;/strong&gt;. It's designed to be &lt;strong&gt;intuitive, visually engaging&lt;/strong&gt;, and, importantly, &lt;strong&gt;free from copyright entanglements&lt;/strong&gt; by using generic names.&lt;/p&gt;

&lt;h2&gt;
  
  
  Effective Prompting: Speaking the AI's Language 🗣️
&lt;/h2&gt;

&lt;p&gt;My interaction with Amazon Q CLI was a rapid learning curve in &lt;strong&gt;effective prompting&lt;/strong&gt;. Here’s what I found worked best:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context is King&lt;/strong&gt;: I started by setting the scene: "write a pygame on a streetfight style of star war light sword game, but don't use real names to avoid copyright." This broad stroke gave the AI its initial direction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c9nrj4sfohctkqhmqa4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c9nrj4sfohctkqhmqa4.png" alt="Initial Prompt" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature by Feature&lt;/strong&gt;: Instead of overwhelming the AI with one massive request, I let it design an initial version first, then followed up with bug reports and feature requests. This allowed the AI to build the game incrementally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l3xgvoxo6o2yjz5mhnm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l3xgvoxo6o2yjz5mhnm.png" alt="Initial Features" width="800" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leveraging Error Messages&lt;/strong&gt;: When things inevitably went sideways (as they do in development 🐛), I found Amazon Q CLI doing a good job on &lt;strong&gt;auto-identifying the errors&lt;/strong&gt; from the command outputs in its initial runs, and it was able to &lt;strong&gt;resolve the errors by itself&lt;/strong&gt; without my intervention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbqx43uw9aot8a82fvyb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbqx43uw9aot8a82fvyb.png" alt="Auto resolved package error" width="800" height="982"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Refining Game Logic&lt;/strong&gt;: One of the more nuanced challenges was ensuring &lt;strong&gt;continuous damage&lt;/strong&gt; when an opponent remained in the sword's active area. My prompt, &lt;em&gt;"now there's a problem, the opponent's HP doesn't decrease for a 2nd time if the opponent stayed in the attack area of the plasma sword"&lt;/em&gt;, drew the reply &lt;em&gt;"You're right! The issue is that the combat detection only triggers once per attack due to the &lt;code&gt;last_hit_time&lt;/code&gt; check. When a player holds down the attack button and the opponent stays in range, it should continue dealing damage. Let me fix this:"&lt;/em&gt; and guided Amazon Q CLI to implement &lt;strong&gt;time-based hit detection&lt;/strong&gt;, allowing sustained damage while preventing hit-spamming with &lt;strong&gt;invulnerability frames&lt;/strong&gt;.&lt;/p&gt;
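
&lt;p&gt;The fix boils down to a simple pattern: record when each fighter was last hit, and only register a new hit once a short invulnerability window has elapsed. A minimal sketch of that pattern (the class name and cooldown value are hypothetical, not taken verbatim from the generated game):&lt;/p&gt;

```python
import time

HIT_COOLDOWN = 0.5  # seconds of invulnerability after a hit (illustrative value)

class Fighter:
    def __init__(self, hp=100):
        self.hp = hp
        self.last_hit_time = 0.0

    def take_hit(self, damage, now=None):
        """Apply damage only if the invulnerability window has elapsed."""
        now = time.monotonic() if now is None else now
        if now - self.last_hit_time >= HIT_COOLDOWN:
            self.hp = max(0, self.hp - damage)
            self.last_hit_time = now
            return True   # hit landed
        return False      # still invulnerable, no damage
```

&lt;p&gt;Calling &lt;code&gt;take_hit&lt;/code&gt; every frame while the opponent stays inside the sword's area then yields sustained damage at a controlled rate, rather than a single hit per attack.&lt;/p&gt;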

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi4tpj6zyro2xm96ssnv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi4tpj6zyro2xm96ssnv.png" alt="Fixing the " width="800" height="726"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AI as a Development Accelerator / Quick Prototype Generator ⚡
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon Q CLI&lt;/strong&gt;, recently powered by &lt;strong&gt;Claude 4&lt;/strong&gt;, proved to be an &lt;strong&gt;invaluable development partner&lt;/strong&gt;. It automated much of the heavy lifting, significantly reducing my &lt;strong&gt;development time&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate Generation&lt;/strong&gt;: The initial &lt;code&gt;pygame&lt;/code&gt; setup, including window creation, basic event loops, and constant definitions, was generated almost instantly. This freed me from the mundane setup tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Game Mechanics&lt;/strong&gt;: From player movement and sword activation to force push mechanics and health management, the AI took my &lt;strong&gt;high-level descriptions&lt;/strong&gt; and translated them into &lt;strong&gt;functional code&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart Debugging&lt;/strong&gt;: The AI's ability to not only identify errors but also &lt;strong&gt;suggest and implement fixes&lt;/strong&gt;, like installing missing libraries or correcting logical flaws in combat detection, was a &lt;strong&gt;major time-saver&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterative Refinement&lt;/strong&gt;: The &lt;strong&gt;back-and-forth process of prompting, testing, and refining&lt;/strong&gt; allowed for quick iterations and continuous improvement of the game's mechanics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  About the Code 💻
&lt;/h2&gt;

&lt;p&gt;The Python code for the "Plasma Sword Fighter" game is straightforward and relies solely on the &lt;strong&gt;&lt;code&gt;pygame&lt;/code&gt; library&lt;/strong&gt;. While some of the combat and AI logic might appear "raw" to a seasoned game developer, offering room for more sophisticated refactoring (e.g., using state machines for AI), the current structure is &lt;strong&gt;remarkably readable&lt;/strong&gt;. This clarity is a testament to the AI's ability to produce &lt;strong&gt;understandable code&lt;/strong&gt;, even when generating complex interactions.&lt;/p&gt;

&lt;p&gt;The full code is hosted on GitHub here: &lt;a href="https://github.com/gabrielkoo/amazonq-plasma-sword-fighter-game/" rel="noopener noreferrer"&gt;https://github.com/gabrielkoo/amazonq-plasma-sword-fighter-game/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Screenshots and Gameplay 🎮
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6eih31o2u6onogy0nk7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6eih31o2u6onogy0nk7r.png" alt=" " width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5b6tjgotzhclogqukcq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5b6tjgotzhclogqukcq.png" alt=" " width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy2waqf2tg91ryc5uujl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy2waqf2tg91ryc5uujl.png" alt=" " width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6oboqfe1wz9wslm24uld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6oboqfe1wz9wslm24uld.png" alt=" " width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some snapshots from the "Plasma Sword Fighter" battles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ready for Battle&lt;/strong&gt;: The game's initial screen, featuring two fighters against a cosmic backdrop, their health bars poised for action.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mid-Combat&lt;/strong&gt;: A dynamic shot showing the glowing plasma swords in action, with players engaged in a fierce duel.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Game Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two-player combat&lt;/strong&gt; with &lt;strong&gt;glowing plasma swords&lt;/strong&gt; (avoiding copyright)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time combat system&lt;/strong&gt; with sword swinging and blocking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supernatural "force push" ability&lt;/strong&gt; with cooldown mechanics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health system&lt;/strong&gt; with visual health bars&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invulnerability frames&lt;/strong&gt; after taking damage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual effects&lt;/strong&gt; including sword glow and hit flashes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starfield background&lt;/strong&gt; for an immersive space combat feel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI opponent&lt;/strong&gt; with adjustable difficulty (Easy, Medium, Hard)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to Play:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Player 1 (Blue)&lt;/strong&gt;: WASD to move, SPACE to activate sword, SHIFT to attack, Q for force push, T to toggle targeting mode (mouse vs. auto-target).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Player 2 (Red)&lt;/strong&gt;: Arrow keys to move, Right CTRL to activate sword, Right SHIFT to attack, ENTER for force push, P to toggle targeting mode (mouse vs. auto-target).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Difficulty&lt;/strong&gt;: Press 1 for Easy, 2 for Medium, 3 for Hard.&lt;/li&gt;
&lt;/ul&gt;
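
&lt;p&gt;Two control schemes like these typically reduce to a single movement routine driven by per-player key maps. A tiny, hypothetical illustration of that structure, using plain strings in place of &lt;code&gt;pygame&lt;/code&gt; key constants so it runs anywhere:&lt;/p&gt;

```python
# Hypothetical key maps mirroring the two control schemes above.
PLAYER1_KEYS = {"up": "w", "down": "s", "left": "a", "right": "d"}
PLAYER2_KEYS = {"up": "UP", "down": "DOWN", "left": "LEFT", "right": "RIGHT"}

def read_movement(pressed, keys, speed=5):
    """Turn the set of currently pressed keys into a (dx, dy) vector.

    In the real game, `pressed` would come from pygame.key.get_pressed();
    here it is just a set of key names.
    """
    dx = speed * ((keys["right"] in pressed) - (keys["left"] in pressed))
    dy = speed * ((keys["down"] in pressed) - (keys["up"] in pressed))
    return dx, dy
```

&lt;p&gt;The same routine serves both players, so adding a third control scheme would just mean adding another key map.&lt;/p&gt;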

&lt;h3&gt;
  
  
  Combat Mechanics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Activate your plasma sword and maneuver close to your opponent.&lt;/li&gt;
&lt;li&gt;Swing your sword to deal damage (&lt;strong&gt;10 HP per hit&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Utilize &lt;strong&gt;force push&lt;/strong&gt; to knock back enemies and inflict minor damage (&lt;strong&gt;5 HP + knockback&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Each player starts with &lt;strong&gt;100 HP&lt;/strong&gt;; the first to reach 0 loses.&lt;/li&gt;
&lt;li&gt;Brief invulnerability periods after taking damage prevent spam attacks.&lt;/li&gt;
&lt;/ul&gt;
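
&lt;p&gt;The force push mechanic, for instance, reduces to a direction vector from attacker to target, scaled to a fixed knockback distance, plus the minor damage. A hypothetical sketch of that calculation (the constants are illustrative, not the game's actual tuning):&lt;/p&gt;

```python
import math

FORCE_PUSH_DAMAGE = 5
FORCE_PUSH_KNOCKBACK = 120  # pixels of knockback (illustrative)

def apply_force_push(attacker_pos, target_pos, target_hp):
    """Push the target directly away from the attacker and deal minor damage."""
    dx = target_pos[0] - attacker_pos[0]
    dy = target_pos[1] - attacker_pos[1]
    dist = math.hypot(dx, dy) or 1.0  # avoid division by zero when overlapping
    new_pos = (target_pos[0] + dx / dist * FORCE_PUSH_KNOCKBACK,
               target_pos[1] + dy / dist * FORCE_PUSH_KNOCKBACK)
    return new_pos, max(0, target_hp - FORCE_PUSH_DAMAGE)
```

&lt;p&gt;Normalizing the vector before scaling keeps the knockback distance constant regardless of how far apart the fighters are.&lt;/p&gt;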

&lt;h3&gt;
  
  
  Game Controls:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Press &lt;strong&gt;R to restart&lt;/strong&gt; after a game over.&lt;/li&gt;
&lt;li&gt;Press &lt;strong&gt;ESC to quit&lt;/strong&gt; anytime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The "Plasma Sword Fighter" game captures the essence of classic space duels without infringing on any existing intellectual property. The visual effects create that iconic glowing sword aesthetic, offering a fun and engaging combat experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts 💡
&lt;/h2&gt;

&lt;p&gt;This experience with Amazon Q CLI wasn't just about building a game; it was about understanding the practical applications of &lt;strong&gt;GenAI in accelerating software development&lt;/strong&gt;. &lt;strong&gt;Amazon Q CLI&lt;/strong&gt;, recently leveraging &lt;strong&gt;Claude 4&lt;/strong&gt;, is a &lt;strong&gt;powerful tool&lt;/strong&gt; that can significantly enhance productivity, even for those working outside traditional software development domains. It's a clear example of how &lt;strong&gt;GenAI can democratize development&lt;/strong&gt;, allowing anyone with an idea to bring it to life with guided assistance. I'm genuinely impressed and encourage others to experiment with Amazon Q CLI to discover its potential firsthand.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awschallenge</category>
      <category>genai</category>
    </item>
  </channel>
</rss>
