DEV Community: Amit

Which Tier Does Your Vector Workload Live On?

Amit — Thu, 09 Jul 2026 22:33:51 +0000

TL;DR

Since Bedrock Knowledge Bases launched in late 2023, the default vector store was OpenSearch Serverless with a ~$700/month floor you paid whether you queried it or not. Amazon S3 Vectors added a true pay-per-use tier, and the question changed to which tier of the cost/latency curve your workload belongs on.
The tiers sort cleanly: hot (OpenSearch, sub-50ms, always-on), warm (Aurora + pgvector, beside relational data), cold (S3 Vectors, ~100ms, pay-per-use). Picking a store is picking a tier, not picking a winner.
The cold tier is a category the market already validated. Turbopuffer built a ~$100M business storing vectors on object storage for Cursor, Notion, and Anthropic; Cursor cut its vector-database cost 95% moving to it. S3 Vectors is AWS making that same architecture a managed primitive.
Agent workloads belong on the cold tier, and the reason is structural: an agent's retrieval is one step inside a multi-second reasoning loop, so 100ms versus 10ms is noise. The latency that disqualifies a customer-facing search box is invisible to an agent.
S3 Vectors is not a database replacement. It is semantic-only, has no hybrid search, and trades single-digit-ms latency for cost. Know the three limits before you build on it.

When Amazon Bedrock Knowledge Bases launched at the end of 2023, the default vector store was OpenSearch Serverless, and it came with a bill before you stored anything. The floor was roughly $700 a month — four OpenSearch Compute Units (two for indexing, two for search with standby) at about $0.24 per unit-hour, running whether you queried once a day or a thousand times a second. For a production search system, that floor is rounding error. For a personal knowledge base or an internal document search, it is the reason the project never ships.
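
The floor is plain arithmetic, and a few lines reproduce it. This is a back-of-envelope sketch: the unit count and hourly price come from the paragraph above, while the 730-hour month is my own assumption (8,760 hours divided by 12).

```python
# Back-of-envelope check of the always-on floor described above.
OCU_COUNT = 4               # two for indexing, two for search with standby
PRICE_PER_OCU_HOUR = 0.24   # USD, the approximate figure quoted in the text
HOURS_PER_MONTH = 730       # assumption: average month (8,760 h / 12)


def monthly_floor(ocus: int = OCU_COUNT,
                  price: float = PRICE_PER_OCU_HOUR,
                  hours: int = HOURS_PER_MONTH) -> float:
    """Return the idle monthly cost in USD of an always-on OCU allocation."""
    return ocus * price * hours


print(f"${monthly_floor():.2f} per month")
```

Nothing in the formula depends on query volume, which is the whole complaint: the cost is the same at one query a day or a thousand a second.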

Amazon S3 Vectors, generally available since December 2025, prices the way the rest of S3 does: no minimum charge, pay for what you store and what you query, nothing while idle. One 30-project analysis puts the entry point at roughly $0.60 for a first month against OpenSearch Serverless's hundreds, and finds S3 Vectors 15 to 66 times cheaper across small and medium workloads. OpenSearch has since narrowed its own gap — fractional compute units and a scale-to-zero tier cut the idle cost — but the arrival of a true pay-per-use tier is what reset the default. The question stopped being "which vector database wins" and became "which tier of the cost and latency curve does this workload belong on."

The Tiers

Every AWS-native vector option now sorts onto one curve that trades latency for cost. Picking a store is picking a point on that curve.

Tier	Latency	Fits	AWS option
Hot	Sub-50ms, always-on	High QPS, customer-facing search	OpenSearch Service
Warm	50–200ms	Vectors beside relational data, ACID	Aurora PostgreSQL + pgvector
Cold	~100ms warm, sub-second cold	Large or long-lived, infrequent or bursty	S3 Vectors
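
The table can be read as a decision rule. Here is a minimal sketch of that rule in Python. The `Workload` fields and the 50ms threshold are illustrative assumptions drawn from the table, not AWS guidance.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    latency_budget_ms: int        # slowest retrieval the caller can tolerate
    customer_facing: bool         # a human is waiting on the result
    needs_relational_data: bool   # vectors must sit beside ACID relational rows


def pick_tier(w: Workload) -> str:
    """Map a workload's shape onto the hot / warm / cold curve."""
    # Assumption: anything interactive or under 50 ms belongs on the hot tier.
    if w.customer_facing or w.latency_budget_ms < 50:
        return "hot: OpenSearch Service"
    if w.needs_relational_data:
        return "warm: Aurora PostgreSQL + pgvector"
    return "cold: S3 Vectors"


agent_memory = Workload(latency_budget_ms=500, customer_facing=False,
                        needs_relational_data=False)
print(pick_tier(agent_memory))
```

A real decision would also weigh query rate and corpus size. The point of the sketch is only that the inputs are properties of the workload, not of the database.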

Amazon Bedrock Knowledge Bases sits across the whole curve as the managed RAG layer, and it supports every store as a backend — OpenSearch Serverless and Managed Cluster, Aurora, Neptune Analytics, Pinecone, MongoDB Atlas, Redis, and S3 Vectors. You choose the tier; Bedrock runs the ingestion, chunking, embedding, and retrieval on top of it. The store is a decision about cost and latency, not a decision about capability.

The shift since S3 Vectors arrived is that the top and bottom of the curve now combine instead of compete. You keep hot vectors in OpenSearch for the queries that need speed and spill the long tail into S3 Vectors for cost, and AWS supports importing directly from one to the other. "Pick one database" was the wrong frame; "which tier for which slice of the data" is the right one.

The Cold Tier Is Already a Business

S3 Vectors can read like an AWS experiment if you have not been watching the vector-database market. It is AWS's managed entry into a pattern that a startup has already turned into a company.

The storage math behind the cold tier is not subtle. When you embed text, the vectors are much larger than the source: Turbopuffer measures 1KB of text expanding to roughly 16KB of vector data after chunking and embedding. Keeping all of that in memory or on SSD is where the bill comes from. Object storage runs about $0.02 per GB per month against roughly 16 times that for SSD, and most vectors in a real corpus are read rarely. Storing the cold majority on object storage and caching only the hot slice is the whole idea.
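
To make that math concrete, here is a rough sketch that compares an all-SSD layout with a tiered one. The expansion factor and the two prices are the figures quoted above. The 100 GB corpus and the 5% hot slice are hypothetical inputs I picked for illustration.

```python
TEXT_TO_VECTOR_EXPANSION = 16              # ~1 KB of text -> ~16 KB of vectors
OBJECT_USD_PER_GB = 0.02                   # object storage, per GB per month
SSD_USD_PER_GB = OBJECT_USD_PER_GB * 16    # "roughly 16 times that"


def storage_cost(text_gb: float, hot_fraction: float) -> dict:
    """Compare all-SSD storage with object storage plus an SSD cache."""
    vector_gb = text_gb * TEXT_TO_VECTOR_EXPANSION
    all_ssd = vector_gb * SSD_USD_PER_GB
    # Tiered: every vector on object storage, only the hot slice cached on SSD.
    tiered = (vector_gb * OBJECT_USD_PER_GB
              + vector_gb * hot_fraction * SSD_USD_PER_GB)
    return {"vector_gb": vector_gb, "all_ssd": all_ssd, "tiered": tiered}


costs = storage_cost(text_gb=100, hot_fraction=0.05)
print(costs)
```

The gap grows as the hot fraction shrinks, which is why the pattern pays off most on corpora that are mostly dormant.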

Turbopuffer built its company on exactly that idea. It stores vectors on S3, GCS, or Azure Blob and caches hot data on SSD and RAM by access pattern. Sacra estimates it reached about $100M in annualized revenue by March 2026, up 2,400% year over year, serving Cursor, Notion, Anthropic, Linear, and Superhuman. When Cursor moved its codebase index to this architecture, it reported cutting vector-database cost by 95% — because most codebases are queried infrequently, and paying in-memory rates to keep them all live was the expensive mistake.

Turbopuffer is one vendor of several. LanceDB is an open-source, file-based vector store that runs directly on any S3-compatible backend, and AWS itself published an architecture for 1B+ vectors on LanceDB, S3, and Lambda where the only fixed cost is the S3 storage footprint and each query costs fractions of a cent. Spice.ai integrated S3 Vectors into its open-source engine. The industry even converged on shared vocabulary for it — hot, warm, and cold vector tiers — because enough teams are building this way that the pattern needed names.

S3 Vectors is AWS making that architecture a managed primitive: the cold tier without running the caching layer yourself. The reason to trust the tier is that a $100M business, a marquee set of AI customers, and AWS's own reference architecture all point at the same economics.

Why Agents Belong on the Cold Tier

Here is the insight that makes S3 Vectors the default for agent work, and it is easy to miss if you only look at the latency number.

A customer-facing search box lives or dies on latency. A user typed, a box must fill, and 100 milliseconds versus 10 is the difference between snappy and sluggish. That workload belongs on the hot tier, and the OpenSearch bill is the cost of the experience.

An agent is a different shape of workload. When an agent retrieves, the retrieval is one step inside a reasoning loop that already takes seconds — the model is thinking, calling tools, reading results, thinking again. As one analysis puts it, adding a vector lookup that takes 100ms instead of 10ms is rarely the bottleneck when the model-side latency of a single tool call is already 500ms to several seconds. The 90 milliseconds you save on the hot tier vanish inside a loop that was never going to be fast. You paid the always-on floor to optimize a step that is not on the critical path.
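
A quick calculation shows how small the retrieval share is. This is a sketch under one assumption: I picked 2,000ms of model-side latency per step, a value inside the 500ms-to-several-seconds range quoted above.

```python
def retrieval_share(retrieval_ms: float, model_ms: float) -> float:
    """Fraction of one agent step spent waiting on vector retrieval."""
    return retrieval_ms / (retrieval_ms + model_ms)


MODEL_MS = 2000  # assumption: model-side latency of a single tool-calling step

for tier, latency_ms in (("hot", 10), ("cold", 100)):
    share = retrieval_share(latency_ms, MODEL_MS)
    print(f"{tier}: {share:.1%} of the step")
```

Either way the step is dominated by the model. Moving to the hot tier shaves a few percent off a loop the user already experiences as slow.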

This is why AWS positions S3 Vectors as purpose-built for agent memory: agent memory grows continuously, is queried in bursts, and tolerates hundreds of milliseconds because it lives inside a slow loop. The cold tier fits the shape of the work. The hot tier would be paying a premium for speed the agent cannot use.

The access pattern is the same one that made the cold tier a business. Cursor's codebases sit dormant most of the time and get queried in bursts; agent memory has the identical profile — mostly idle, occasionally read, never on a human's critical path. The workload that justified an object-storage-native database for code search is the workload most agents already have.

Where S3 Vectors Fits

The official use cases share one profile: large or growing corpora, moderate or bursty query rates, latency budgets measured in hundreds of milliseconds, and a preference for zero infrastructure. That profile shows up in five recurring patterns:

Semantic search over large collections — documents, media, medical images, video archives where cost per vector dominates.
RAG long tail — the bulk of a knowledge base that is queried occasionally, with a hot tier in front only if some subset gets constant traffic.
Agent memory — persistent, growing, bursty, latency-tolerant. The canonical fit.
Batch evaluation corpora — embeddings you query in batches, not in real time.
Cold tier of a tiered design — the durable, cheap floor under an OpenSearch hot tier.

My own case is the RAG long tail. I put my blog — 70-plus posts — into a Bedrock Knowledge Base on S3 Vectors. Small corpus, queried when I am writing and want to know what I already said, entirely latency-tolerant. A retrieval returns the relevant posts with strong semantic scores:

query: "do agent memories fade while skills persist"
  0.825  memories-fade-skills-persist.md
  0.753  skills-are-git-native-distribution.md
  0.725  agent-sprawl-is-a-skills-problem.md
  0.714  what-is-a-skill.md
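
For reference, a listing like the one above can be produced with the Bedrock `Retrieve` API. The following is a minimal sketch, not the exact script I ran: the helper names are hypothetical, and the knowledge base ID is whatever yours is. The `boto3` import is deferred so the two pure helpers run without AWS access.

```python
def build_retrieve_request(kb_id: str, query: str, top_k: int = 5) -> dict:
    """Build keyword arguments for bedrock-agent-runtime's retrieve call."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }


def format_results(response: dict) -> list[str]:
    """Render each result as 'score  filename', like the listing above."""
    lines = []
    for result in response.get("retrievalResults", []):
        uri = result.get("location", {}).get("s3Location", {}).get("uri", "?")
        filename = uri.rsplit("/", 1)[-1]
        lines.append(f"{result.get('score', 0.0):.3f}  {filename}")
    return lines


def retrieve(kb_id: str, query: str, top_k: int = 5) -> list[str]:
    """Query the knowledge base; needs AWS credentials and boto3 installed."""
    import boto3  # imported lazily so the helpers above work offline
    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve(**build_retrieve_request(kb_id, query, top_k))
    return format_results(response)
```

Note that results are chunks, not documents, so the same filename can appear more than once in the output.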

I reach that knowledge base from my agents through one tool on a gateway — a pattern worth its own post, How AgentCore Gateway Turns Any API Into an Agent Tool. The point here is the storage decision underneath it: this workload is cold-tier, and the cold tier no longer costs $700 to enter.

The Three Limits

S3 Vectors buys its cost advantage with tradeoffs, and a design that ignores them will disappoint.

Semantic-only, no hybrid search. AWS documents that S3 Vectors does not support hybrid search — the combination of vector similarity with keyword matching. For content full of exact tokens like part numbers or version strings, pure semantic recall scores poorly, and OpenSearch with lexical scoring remains the right tier. My prose has no such tokens, so the limit does not bite, but a corpus of technical specs would feel it on the first query.

Latency is a floor, not a target. Warm queries land around 100 milliseconds, cold ones sub-second — fine inside an agent loop, disqualifying for an interactive search box. When part of the workload needs speed, the answer is not to abandon S3 Vectors but to tier: keep the bulk cold, promote the hot subset to OpenSearch, which AWS supports importing into directly.

Metadata is capped. S3 Vectors limits custom metadata per vector, which constrains how much filtering you can push into the store. For rich per-document filtering at scale, that cap is a real design input, not a footnote. I take apart exactly what that cap does to ingestion and filtering in What Actually Gets Stored When You Put a Vector in S3 Vectors.

So What

Stop asking which vector database is best. Start asking which tier your workload sits on, because the answer is usually obvious once you look at the shape of the queries. Customer-facing and latency-critical goes hot. Sitting next to relational data goes warm. Large, growing, bursty, and patient — which describes most internal RAG and nearly all agent memory — goes cold.

The reason this matters now is that the cold tier stopped being expensive. The $700 floor was quietly deciding architectures — pushing people toward one always-on database because standing up anything felt like the same fixed cost. Remove the floor and the honest answer for most agent and internal workloads is the cheap tier, running at pennies, tolerating a latency the agent never notices. The premium tier is still there when you need speed. Most of the time, you do not.

What Actually Gets Stored When You Put a Vector in S3 Vectors

Amit — Thu, 09 Jul 2026 22:33:15 +0000

TL;DR

A vector store does not have a "document" field. Each record is three parts: a key, the vector (an array of floats), and metadata. Your readable text is stored as metadata, not as a first-class body.
One document becomes many records. A knowledge base splits each file into chunks, embeds each chunk into its own vector, and writes one record per chunk. Document-level facts like tags get copied onto every chunk of that document.
S3 Vectors splits metadata into filterable and non-filterable, with a hard 2 KB cap on the filterable part. Chunk text is far larger than 2 KB, so it must be stored non-filterable — miss that and every write fails with a 400.
Filterable metadata is what lets you narrow the corpus before the vector math runs. The same query, filtered to one tag, searches a different slice and returns different results.

A vector store keeps three things per record. The readable document is split across many records, and the text sits inside a metadata field. I learned the exact shape of this by loading a 74-post blog into an S3 Vectors index and then reading back what was actually written. Once you see how the record is really built, the common ingestion failures explain themselves.

Three Components, No Document

Pull a single record back out of an S3 Vectors index and it has exactly three parts:

{
  "key":      "cf4f6331-8869-4d4b-acf1-e1bdf7660fb4",
  "data":     { "float32": [-0.1117, 0.004, -0.007, ...] },
  "metadata": { ... }
}

The key is a unique ID for the record. The data is the vector — in this index, an array of 512 floats, because the embedding model produces 512 dimensions. This array is the only thing similarity search compares. It is also not human-readable: you cannot recover the sentence from those 512 numbers.

So where is the sentence? In the metadata. When a knowledge base writes a record, it stores the chunk's readable text in a metadata key. There is no separate document body in the store — the text is metadata riding alongside the vector:

  one record
    ├─ key   : unique ID
    ├─ data  : 512 floats  (the vector — searchable, not readable)
    └─ metadata : key-value bag
         ├─ chunk text          (the readable sentence)
         └─ tags, year, date    (filter fields)

That single fact reframes everything else: a vector store is a similarity index over float arrays, plus a bag of key-value pairs per record, and your content lives in the bag.
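
The three-part shape is easy to build by hand. Below is a small sketch that constructs one record and, optionally, writes it with the `s3vectors` client's `put_vectors` call. `make_record` and `write_records` are hypothetical helper names. The `AMAZON_BEDROCK_TEXT` key is the one a Bedrock knowledge base uses for chunk text. The bucket and index names are yours to supply.

```python
import uuid


def make_record(vector: list[float], text: str, **filter_fields) -> dict:
    """Build one record: a key, the vector, and the metadata bag."""
    return {
        "key": str(uuid.uuid4()),
        "data": {"float32": [float(x) for x in vector]},
        # The readable text is just another metadata entry.
        "metadata": {"AMAZON_BEDROCK_TEXT": text, **filter_fields},
    }


def write_records(bucket: str, index: str, records: list[dict]) -> None:
    """Write records to an S3 Vectors index; needs AWS credentials."""
    import boto3  # imported lazily so make_record works offline
    boto3.client("s3vectors").put_vectors(
        vectorBucketName=bucket, indexName=index, vectors=records
    )


record = make_record([0.0] * 512, "The readable sentence.",
                     tags=["agents"], year=2026)
print(sorted(record))
```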

One Document Becomes Many Records

Here is what happens under the hood. I loaded 74 posts. The index holds 488 records. One post became 18 separate records on its own.

The pipeline splits each file into chunks — roughly 300 tokens each — embeds every chunk into its own vector, and writes one record per chunk:

  one post (.md file)
    ├─ chunk 1  (~300 tokens) ──▶ record: key + 512 floats + metadata
    ├─ chunk 2             ────▶ record: key + 512 floats + metadata
    ├─ chunk 3             ────▶ record: key + 512 floats + metadata
    └─ ... chunk 18         ────▶ record: key + 512 floats + metadata

Retrieval searches across all 488 chunk-vectors and returns individual chunks, which is why a single query can return two different passages from the same post: they are two different records that both scored well.

This changes how document-level information has to be stored. A post's tags belong to the whole post, but there is no "post" record to attach them to — only 18 chunk records. So document-level metadata gets copied onto every chunk of that document. All 18 chunks carry the same tags, the same year, the same source URI. The redundancy is deliberate. It is what makes the next part work.
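
The split-and-copy step can be sketched in a few lines. This is not the knowledge base's actual chunker: it approximates tokens with whitespace-separated words (an assumption), and `chunk_document` is a hypothetical name. What it does show is the document-level metadata being stamped onto every chunk.

```python
def chunk_document(text: str, doc_metadata: dict,
                   chunk_tokens: int = 300) -> list[dict]:
    """Split a document into fixed-size chunks, each carrying the doc's metadata."""
    words = text.split()  # assumption: one word stands in for one token
    chunks = []
    for start in range(0, len(words), chunk_tokens):
        chunks.append({
            "text": " ".join(words[start:start + chunk_tokens]),
            # Copy the document-level fields onto this chunk.
            "metadata": dict(doc_metadata),
        })
    return chunks


post = "word " * 650
chunks = chunk_document(post, {"tags": ["skills"], "year": 2026})
print(len(chunks), chunks[0]["metadata"] == chunks[-1]["metadata"])
```

Each chunk would then be embedded and written as its own record, so every one of them has to carry its own copy of the tags.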

Filterable and Non-Filterable Metadata

S3 Vectors splits metadata into two classes, and the split is the part that breaks builds.

Filterable metadata can be used in query filters. You narrow a search with it: only records where tags contains agents, only records where year equals 2026. It is capped at 2 KB per record, because filtering has to stay fast.

Non-filterable metadata cannot be filtered on, but it is returned with results and has room — it shares a 40 KB-per-record total. It is meant for the large payloads: the chunk text, long descriptions.

The trap is the default. Every metadata key is filterable unless you explicitly mark it non-filterable at index creation, and that choice is immutable — you cannot change it later without recreating the index. A knowledge base stores chunk text in a metadata key, and a 300-token chunk is well over 2 KB. Left filterable, it blows the budget, and every write fails:

Invalid record for key '...': Filterable metadata must have at
most 2048 bytes (Service: S3Vectors, Status Code: 400)

The fix is to declare the text key non-filterable when you create the index:

aws s3vectors create-index \
  --vector-bucket-name my-vectors \
  --index-name blog-index \
  --data-type float32 \
  --dimension 512 \
  --distance-metric cosine \
  --metadata-configuration \
    '{"nonFilterableMetadataKeys":["AMAZON_BEDROCK_TEXT","AMAZON_BEDROCK_METADATA"]}'

That moves the text out of the 2 KB filterable budget and into the 40 KB total. Because the setting cannot be changed after creation, getting it wrong means deleting the index and starting over. It is the one parameter worth reading twice.

The Budget Is the Design

The metadata limits are the design constraints you build against (the S3 Vectors documentation has the full list):

Limit	Value
Total metadata per record	40 KB
Filterable metadata per record	2 KB
Metadata keys per record	50
Non-filterable keys per index	10
Dimensions per vector	1–4096

The 2 KB filterable cap is the one that shapes behavior. Small structured fields — tags, year, date, author — belong in the filterable budget. The big text belongs outside it. Since document-level fields get copied onto every chunk, they need to stay small: multiply a fat filterable field across every chunk of every document and the budget disappears fast.
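
Because a violation only surfaces as a 400 at write time, a pre-flight check is cheap insurance. The sketch below applies the limits from the table. It assumes the service's byte accounting is close to the UTF-8 size of compact JSON, which I have not verified, so treat it as an early warning and not as an exact replica of the server-side check.

```python
import json

TOTAL_LIMIT_BYTES = 40 * 1024        # total metadata per record
FILTERABLE_LIMIT_BYTES = 2 * 1024    # 2048, matching the error message
MAX_KEYS = 50                        # metadata keys per record


def _json_bytes(d: dict) -> int:
    # Assumption: compact UTF-8 JSON approximates the service's accounting.
    raw = json.dumps(d, separators=(",", ":"), ensure_ascii=False)
    return len(raw.encode("utf-8"))


def check_metadata(metadata: dict, non_filterable_keys: set[str]) -> list[str]:
    """Return a list of budget problems; an empty list means the record fits."""
    problems = []
    filterable = {k: v for k, v in metadata.items()
                  if k not in non_filterable_keys}
    if len(metadata) > MAX_KEYS:
        problems.append("too many metadata keys")
    if _json_bytes(filterable) > FILTERABLE_LIMIT_BYTES:
        problems.append("filterable metadata exceeds 2048 bytes")
    if _json_bytes(metadata) > TOTAL_LIMIT_BYTES:
        problems.append("total metadata exceeds 40 KB")
    return problems
```

Running it once with an empty `non_filterable_keys` set and once with the text key declared shows the same record failing and then passing, which is the whole lesson of the index-creation flag.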

Filtering Narrows the Corpus Before the Math

The payoff of putting tags in filterable metadata is that a query can narrow the corpus before the vector search runs. Same query, different slice.

Unfiltered, a search for agent cost and pricing returns the pricing posts:

query: "what did I learn about agent cost and pricing"
  0.687  how-to-optimize-agent-subscriptions.md
  0.682  ai-subscriptions-are-secretly-usage-models.md
  0.682  harness-is-where-the-margin-lives.md

Add a filter for tags containing skills, and the identical query searches only the chunks from skills-tagged posts — a different subset of the 488 records — and returns different results:

filter: tags listContains "skills"
  0.607  agent-sprawl-is-a-skills-problem.md
  0.603  skills-as-institutional-memory.md

The filter runs first. It selects which chunk-vectors are even eligible, then similarity search ranks within that subset. This is why the tag lives on every chunk: the filter has to be able to include or exclude each record on its own, without knowing which document it came from. Filters compose, too — year equals 2026 AND tags contains pricing narrows on both fields before a single distance is computed.

What's Missing

Two limits are worth knowing before you lean on this.

The filterable budget is tight. 2 KB per record is enough for tags and dates, not for rich structured metadata. A workload that wants to filter on many fields, or on long field values, will feel the ceiling — and because filterable keys are duplicated across every chunk, the ceiling arrives sooner than the per-record number suggests.

And the immutability is real. Dimension, distance metric, and the non-filterable key list are all fixed at index creation. There is no migration path — only delete and rebuild. For a personal blog that is a minor annoyance. For a large index it is a reason to model the metadata schema carefully before the first write.

So What

A vector store is a similarity index over float arrays, with a metadata bag per record where your text and your filter fields live together under a hard budget. Once that model is clear, the failures stop being mysterious: ingestion breaks because the text was left filterable, filtering returns nothing because the field was never declared, and the same query returns different results because a filter changed which records were eligible. The store rewards understanding its actual shape — three components per record, one document shredded across many of them, and a 2 KB line that decides where your text is allowed to live.

This is the storage layer underneath the tier decision. Why an agent's retrieval workload belongs on this cold tier at all is the subject of Which Tier Does Your Vector Workload Live On?

How AgentCore Gateway Turns Any API Into an Agent Tool

Amit — Thu, 09 Jul 2026 22:32:39 +0000

TL;DR

AgentCore Gateway is a managed front door for agent traffic. Agents connect to one endpoint; the gateway presents every registered tool as a single virtual MCP server.
It solves the two hard parts of exposing a capability to agents: inbound auth (who may call the gateway) and outbound auth (how the gateway reaches the backend). Both are configured on the gateway, not in each tool.
A tool is a target. Register a Lambda or an API as a target, and the gateway translates the agent's MCP call into an invocation of that backend, then translates the result back. No protocol code in the tool.
The gateway collapses the M×N wiring problem to M×1: each of the M agents sees one endpoint, so the total is M+N connections instead of M×N. M agents each connect once; N tools each register once; neither side tracks the other's count. AWS calls the MCP behavior aggregation mode.

AgentCore Gateway is a managed front door between your agents and the tools they call. An agent opens one connection to the gateway and sees a menu of tools; behind the gateway, each tool is a Lambda, an API, or another service that never had to know the Model Context Protocol existed. The gateway does the translation, the authentication on both sides, and the fan-out to every connected agent. I run one with three tools on it, and adding the third took an afternoon because the gateway had already solved everything that is usually hard.

This post is about how that front door actually works — the target model, the two authentication boundaries, and the path a single tool call travels.

The Shape of It

A gateway sits between two populations that should not have to know about each other's size. On one side, agents. On the other, backends — Lambdas, APIs, services. The gateway is the only thing both sides connect to.

flowchart LR
    A1["agent: terminal"] --> GW
    A2["agent: editor"] --> GW
    A3["agent: CLI"] --> GW
    GW["AgentCore Gateway (one MCP endpoint)"] --> T1["target: kb-retrieve"]
    GW --> T2["target: web-search"]
    GW --> T3["target: publish"]

Every agent connects to the same endpoint. Every tool is a target behind it. The agent asks the gateway what tools exist and gets the combined list of all targets. AWS calls this aggregation mode: for MCP targets, the gateway "acts as an MCP server that combines the capabilities of all its MCP targets into a unified virtual MCP server." One endpoint, many tools, and the agent cannot tell that the tools live in separate Lambdas.

Targets: How a Backend Becomes a Tool

A target is the declaration that turns a backend into a tool. It names the backend, the tool it exposes, and the input schema the agent should send. The gateway supports three categories of target:

Target type	What it connects	Behavior
MCP	APIs, Lambda functions, existing MCP servers	Aggregated into the unified virtual MCP server
HTTP	Other agents, A2A services, HTTP endpoints	Proxied directly, path-based, no aggregation
Inference	Model providers	Routed by requested model

For a Lambda or an API you use an MCP target, and the gateway takes on the translation job: it "converts agent requests using protocols like Model Context Protocol into API requests and Lambda invocations." The tool author writes a plain handler that takes a query and returns a result. The handler contains no MCP code, no auth code, no protocol version handling. The gateway wraps all of that around it.

Registering the target is the entire integration. Point the gateway at the Lambda, give the tool a name and an input schema, and the tool appears in every agent already connected to that gateway on their next tool listing. There is no per-agent change.

Here is the target that turns my knowledge-base Lambda into a tool. It is a Lambda target with one tool definition — a name, a description the agent reads to decide when to use it, and the input schema:

{
  "name": "kb-retrieve",
  "targetType": "lambda",
  "lambdaArn": "<my-retrieval-lambda-arn>",
  "toolDefinitions": [
    {
      "name": "kb_retrieve",
      "description": "Semantic search over my blog knowledge base. Returns relevant source chunks with their document source and relevance score.",
      "inputSchema": {
        "type": "object",
        "properties": {
          "query": { "type": "string", "description": "The search query or question" },
          "numberOfResults": { "type": "integer", "description": "Number of chunks to return (1-20). Default: 5." }
        },
        "required": ["query"]
      }
    }
  ]
}

That declaration is the whole contract. The gateway now knows how to present the tool to agents and how to invoke the Lambda behind it. The description matters more than it looks: it is what the agent reads to decide whether this tool fits the task.
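
On the other side of that contract, the handler can stay this plain. The sketch below is not my production Lambda. It assumes the gateway delivers the tool arguments as the Lambda event, and `make_handler` and the injected `retriever` are hypothetical names that keep the example testable without AWS.

```python
from typing import Callable


def make_handler(retriever: Callable[[str, int], list]) -> Callable:
    """Build a Lambda handler around any retrieval function."""

    def handler(event: dict, context: object) -> dict:
        # Assumption: the event carries the arguments named in inputSchema.
        query = event["query"]
        count = int(event.get("numberOfResults", 5))
        count = max(1, min(count, 20))  # schema documents a 1-20 range
        return {"results": retriever(query, count)}

    return handler


# Stub retriever standing in for the real knowledge-base call.
def fake_retriever(query: str, count: int) -> list:
    return [f"chunk about {query}"] * count


handler = make_handler(fake_retriever)
print(handler({"query": "agent memory", "numberOfResults": 2}, None))
```

There is no token check and no protocol code here, which is the division of labor the target declaration sets up.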

Two Authentication Boundaries

A gateway authenticates on both sides, and both boundaries are configured on the gateway itself. That dual authentication is what makes it a security boundary, and it is the reason the gateway can be trusted to sit in front of every tool.

          inbound auth                 outbound auth
          (JWT / IAM)          (service role / credential)
   agent  ──────────▶  Gateway  ────────────────▶  backend (Lambda / API)
     (holds a token             (holds credentials
      for the gateway)           for the backend)

Inbound controls who may call the gateway. The gateway requires an inbound authorizer and supports four types: OAuth (JWT), IAM (SigV4), authenticate-only (validate the token, delegate the decision to the target), and none (development only). Mine uses a JWT authorizer, so an agent presents a bearer token, the gateway validates it, and only then is the tool list even visible.

Outbound controls how the gateway reaches the backend. For a Lambda target, the gateway uses an attached execution role to invoke the function. For an API or external MCP server, you attach a credential provider that stores an API key or OAuth credentials, or configure SigV4 signing. The agent never holds the backend's credentials. It holds a token for the gateway; the gateway holds credentials for the backends.

That split is the point. A tool author does not implement authentication. They register a target and the gateway applies the inbound rule that already exists and the outbound credential the target declares. This is what the docs mean by calling Gateway the only managed service providing both ingress and egress authentication.

The Path of One Call

Follow a single tool call from an agent asking "what have I written about this" to an answer:

sequenceDiagram
    participant AG as Agent
    participant GW as AgentCore Gateway
    participant L as Lambda target
    participant API as Backend API
    AG->>GW: MCP call (tool + args) + bearer token
    Note over GW: validate token (inbound auth)
    Note over GW: translate MCP call to Lambda invoke
    GW->>L: invoke with service role (outbound auth)
    L->>API: one API call
    API-->>L: result
    L-->>GW: return payload
    Note over GW: translate result to MCP response
    GW-->>AG: tool result

The agent sends an MCP request naming the tool and its arguments, carrying its bearer token. The gateway validates the token, finds the target, translates the MCP call into a Lambda invocation, and calls the Lambda using its own service role. The Lambda makes the single backend API call it exists to make, returns the result, and the gateway translates that back into an MCP response. The agent gets a tool result and never saw the Lambda, the role, or the API.

Every boundary the tool author would normally build — the protocol layer, the token check, the backend credential — is handled at the gateway. The Lambda is left with only its one job.
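
The same path can be modeled as a toy in-memory gateway. To be clear, `ToyGateway` is an invention for illustration and not the AgentCore API: it checks tokens against a set where the real gateway validates a JWT, and it calls a Python function where the real one invokes a Lambda with its service role. The result shape follows the MCP tool-result convention of a `content` list of text items.

```python
from typing import Callable


class ToyGateway:
    """Toy model of the call path: auth in, dispatch, translate out."""

    def __init__(self, valid_tokens: set[str]) -> None:
        self.valid_tokens = set(valid_tokens)
        self.targets: dict[str, Callable[[dict], object]] = {}

    def register(self, tool_name: str, backend: Callable[[dict], object]) -> None:
        """Registering a target is the only integration step for a tool."""
        self.targets[tool_name] = backend

    def _check(self, token: str) -> None:
        if token not in self.valid_tokens:  # inbound auth
            raise PermissionError("invalid bearer token")

    def list_tools(self, token: str) -> list[str]:
        self._check(token)
        return sorted(self.targets)

    def call(self, token: str, tool_name: str, arguments: dict) -> dict:
        self._check(token)                   # 1. validate the caller
        backend = self.targets[tool_name]    # 2. find the target
        payload = backend(arguments)         # 3. invoke the backend
        # 4. translate the raw payload into an MCP-style tool result
        return {"content": [{"type": "text", "text": str(payload)}],
                "isError": False}


gateway = ToyGateway({"agent-token"})
gateway.register("kb_retrieve", lambda args: f"results for {args['query']}")
print(gateway.call("agent-token", "kb_retrieve", {"query": "tiers"}))
```

The backend function never sees the token, and the caller never sees the backend. That separation is the property worth keeping when the toy is replaced by the managed service.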

Why the Count Stops Mattering

The reason this architecture is worth the setup is what happens as tools and agents multiply.

Wire each tool directly into each agent and you have an M×N problem: every new tool must be added to every agent, and every new agent must be given every tool. AWS names this exactly — the gateway "eliminates the M×N integration problem."

  direct wiring (M×N edges)          gateway (M+N edges)

  agent 1 ──┐ ┌─ tool A            agent 1 ─┐          ┌─ tool A
          ├─┼─                          ├─▶ gateway ─┼─
  agent 2 ──┘ └─ tool B            agent 2 ─┘          └─ tool B
  (every agent wired               (each agent connects once,
   to every tool)                   each tool registers once)

With a gateway, each agent connects once and each tool registers once. The two sides never track each other's number. I felt this concretely: I run several agents — a terminal agent, two editors, a CLI — and adding a knowledge-base tool meant editing none of them. The tool was there the next time each one connected, because it was registered to the gateway they already trusted.
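
Counting the edges makes the difference plain. A minimal sketch, using my own setup of four agents and three tools as the example:

```python
def direct_edges(agents: int, tools: int) -> int:
    """Every agent wired to every tool."""
    return agents * tools


def gateway_edges(agents: int, tools: int) -> int:
    """Each agent connects once and each tool registers once."""
    return agents + tools


print(direct_edges(4, 3), gateway_edges(4, 3))  # prints "12 7"
```

Adding a fourth tool costs four new edges with direct wiring and one with the gateway, and the gap widens with every agent added after that.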

Finding the Right Tool Among Many

Aggregation creates a second-order problem: as the tool list grows, handing an agent every tool description on every request bloats the prompt and makes the model pick worse. The gateway has an answer built in. Enable semantic search on the gateway and it exposes a meta-tool — x_amz_bedrock_agentcore_search — that an agent calls with a natural-language query to find the most relevant tools instead of loading all of them upfront.

My gateway turns this on with one line of configuration:

"protocolConfiguration": { "mcp": { "searchType": "SEMANTIC" } }

This is retrieval applied to tools — the same pattern as retrieval over documents, pointed at the tool catalog. AWS notes it is "particularly useful when you have many tools". With three tools it changes little. It is the mechanism that lets the M×1 model keep scaling past the point where a flat tool list would start hurting the agent.
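
From the agent's side, the search is an ordinary MCP `tools/call` request aimed at the meta-tool. The sketch below builds that JSON-RPC payload. The tool name comes from the text above. The `query` argument name is my assumption about the meta-tool's input schema, so check it against the tool listing your gateway returns.

```python
import json

SEARCH_TOOL = "x_amz_bedrock_agentcore_search"


def build_tool_search_request(query: str, request_id: int = 1) -> str:
    """Build an MCP tools/call request that asks the gateway to find tools."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": SEARCH_TOOL,
            # Assumption: the meta-tool takes a natural-language "query".
            "arguments": {"query": query},
        },
    })


print(build_tool_search_request("search my blog posts"))
```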

What's Missing

The afternoon is real, but it assumes the gateway already exists. Standing up the first gateway — the inbound authorizer, the identity flow, the service role that lets targets reach cloud resources safely — is the actual work, and it is not an afternoon. Wiring your very first tool is genuinely simpler as a direct connection. The gateway earns its place once the second and third tool are in view.

The other open edge is blast radius. A gateway that fans one tool out to every agent also fans out every mistake. A target with too broad an outbound role, or one that returns more than it should, now reaches every connected agent at once. Centralizing the tool surface centralizes the risk, and the controls that answer it — per-target scoping, fine-grained authorization at the gateway boundary — are their own build. For a handful of read-only tools it is fine. I would not put a write-capable tool behind a shared gateway without more control than a basic setup gives you.

So What

AgentCore Gateway moves the hard parts of tool integration — protocol translation, inbound auth, outbound auth, fan-out — out of every tool and every agent and into one managed layer. A tool becomes a plain handler plus a target registration. An agent becomes one trusted connection. The capability I wanted behind it was a Bedrock Knowledge Base on the cheap tier of vector storage, reachable now from every agent I run.

The distance between "this is an API" and "my agents can use this" used to be a project. With the gateway in place it is a target registration, and the plumbing you paid for once carries every tool after it.

I Pointed an AI Code-Security Tool at My Own Control Plane

Amit — Thu, 09 Jul 2026 07:40:14 +0000

TL;DR

Security is moving into the code-generation loop itself — guardrails fed to the agent as it writes, plus an AI review on every pull request. Corridor is the sharpest version of this I've tested, and it's the pattern, not just the product, that matters.
To test it honestly I planted a real authorization bug (IDOR) in my own AWS control plane and opened a pull request. Corridor caught it in under two minutes: CWE-639, high severity, the correct fix handed back, all for $0.14.
My existing scanners — Semgrep, Bandit, CodeQL — missed the same bug. They match known patterns, and broken access control is missing logic, not a pattern. Reasoning about the code is the capability Corridor adds and rule-based tools lack.
The caveats that keep me honest: on the individual tier your code is processed on Corridor's servers with no zero-retention guarantee, and Corridor caught my bug while running inside Claude Code — the same agent whose vendor now ships this capability natively. Corridor runs on the platforms most likely to compete with it.

AI coding tools broke the assumption that most security tooling was built on: that a human writes the code, and a scanner checks it later. Veracode tested more than 100 LLMs across 80 coding tasks and found that when a model could choose a secure or insecure implementation, it chose the insecure one 45% of the time — OWASP Top 10 vulnerabilities, introduced silently, at the speed code is now generated. The scan that runs after commit is reviewing a decision that shipped three prompts ago.

Corridor is one answer to that. The thesis behind it: move security into the generation loop, not after it. The company raised a $25M Series A at a $200M valuation led by Felicis, with angels from Anthropic, OpenAI, and Cursor, and founders who led Secure by Design at CISA. They call the category Agentic Coding Security Management. I wanted to know whether the category is real or whether it's a scanner with a coat of paint.

So I imported one of my own repositories — a serverless AWS control plane with a Cognito auth layer, DynamoDB persistence, and a set of Lambda handlers — and tested it the only way that produces an honest answer: I planted a real bug and watched.

What It Actually Does

Corridor works in two places, and the difference between them is the most important thing to understand about the product.

The first is prevention. Corridor runs a server that your coding agent — Claude Code, Cursor, Codex — calls before it writes code. The agent sends its plan, Corridor returns security guidance scoped to your codebase, and the agent writes code that already accounts for it. Your model still writes the code. Corridor supplies the security context your model would otherwise lack. It does not replace your coding model; it informs it.

The second is detection. When you open a pull request, Corridor receives it, reviews the change on its own servers with its own models, and posts findings back on the pull request. This half runs entirely on Corridor's infrastructure.

Those two places are the whole product. Prevention improves the code your agent writes. Detection reviews the code after it is written and flags what the guidance missed.

YOUR MACHINE (your model writes)          CORRIDOR BACKEND (their model reviews)
─────────────────────────────────        ──────────────────────────────────────
  coding agent
  (Claude Code / Cursor / Codex)
        │
        │  1. plan  ──────────────►  MCP: analyzePlan
        │                                  │
        │  2. ◄───────────────────  guardrail context
        │                           (scoped to your codebase)
        ▼
  agent writes safer code
        │
        │  3. git push + open PR
        ▼
  GitHub  ───── webhook ──────────►  4. review diff (hosted model)
        │                                  │
        │  ◄───── finding on PR ─────  IDOR · CWE-639 · high
                                           (posted back to the PR)

  PREVENTION: your model generates,        DETECTION: their model reviews,
  Corridor supplies the rules              entirely on their infrastructure

The left column is your machine, where your model writes the code. The right column is Corridor's infrastructure, where its model reviews it. Steps 1 and 2 happen before the code exists; steps 3 and 4 happen after you open the pull request. The dividing line also answers the question that matters most for sensitive code: where your code goes. Everything on the right side leaves your machine.

It Read the Codebase Before It Judged It

Setup is a GitHub connection and a repo pick. Corridor scanned the repository and produced ten security guardrails in a few minutes. The guardrails were specific to my code, citing the actual files and functions in each one.

Two examples show the depth. On authentication, Corridor identified that every route enforced the same auth check and that the health-check endpoint was the one intentional exception. On privilege escalation, it traced how the system issues tokens and flagged the exact component that decides what permissions a token carries.

The more telling result was the guardrails Corridor chose not to raise. It examined the code for SQL injection and concluded the bug cannot occur here, because the database layer does not build queries from strings. It reached the same conclusion for cross-site scripting and CSRF: the application returns JSON, has no web frontend, and uses bearer tokens, so those attack classes do not apply. Corridor explained each decision instead of listing the vulnerability as a warning to dismiss.

This is the difference that matters. A rule-based scanner sees a database call and raises SQL injection. Corridor read how the database is actually used and ruled the class of bug out. A tool that reasons about the code earns attention. A tool that fires on every pattern gets muted.

Planting a Real Bug

Clean-repo guardrails prove little. The real test is whether Corridor catches a bug that a developer, or a hurried agent, would actually introduce.

My usage API already handled this correctly. One route returned a team's usage, and it verified that the caller was allowed to read that team's data before returning anything.

if not principal.is_admin and principal.team != team:
    raise errors.ForbiddenError("Cannot read usage for another team")

I broke it on purpose. I added a second route that took the team name straight from the URL and returned that team's usage without the ownership check.

if path.startswith("/usage/") and method == "GET":
    team_id = path[len("/usage/"):].strip("/")
    if team_id:
        return _handle_usage(team_id, history=0, models=False)

This is a textbook Insecure Direct Object Reference (IDOR): the route checks that the caller is logged in but never checks that they own the data they asked for. Any team can read any other team's spend by changing the team name in the URL. It is the kind of bug that appears when someone adds a convenient shortcut and forgets that the original route earned its safety with a check the shortcut skipped. I committed it, pushed the branch, and opened a pull request.

What Came Back

The review posted in under two minutes, as an inline comment on the exact lines:

Because this endpoint only requires INVOKE_SCOPE, a team client can request /usage/other-team-id and retrieve another team's usage data. This is an IDOR / cross-tenant authorization bypass.

The remediation it proposed was correct: check that the caller owns the team before returning data, and allow reading another team's data only for an admin. This is the same protection the original route already had. Corridor reconstructed my own security pattern from the codebase and handed it back as the fix.
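
A minimal sketch of that remediation, assuming a simplified caller type and an in-memory usage table. Only is_admin, team, the ForbiddenError message, and the /usage/ prefix come from the snippets above; the Principal dataclass, the USAGE table, and get_team_usage are hypothetical stand-ins, not the control plane's real code.

```python
from dataclasses import dataclass


class ForbiddenError(Exception):
    """Raised when a caller asks for data they do not own."""


@dataclass(frozen=True)
class Principal:
    # Hypothetical stand-in for the authenticated caller.
    team: str
    is_admin: bool = False


# Hypothetical in-memory usage table standing in for the real data store.
USAGE = {"team-a": 120, "team-b": 45}


def get_team_usage(principal: Principal, path: str) -> int:
    """Handle GET /usage/<team_id> with the ownership check restored."""
    team_id = path[len("/usage/"):].strip("/")
    if not team_id:
        raise ValueError("team id is required")
    # The check the shortcut route skipped: owner or admin only.
    if not principal.is_admin and principal.team != team_id:
        raise ForbiddenError("Cannot read usage for another team")
    return USAGE[team_id]
```

With this shape, a team-a caller requesting /usage/team-b gets a ForbiddenError, while an admin can still read any team's usage.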

The finding came with everything needed to act on it: the vulnerability class (CWE-639), a high severity rating, the exact code, and the file and line. The entire review cost $0.14.

Was the Finding Real?

Yes. It is a genuine cross-tenant data exposure, classified correctly, with the right fix. My existing scanners ran on the same pull request — Semgrep, Bandit, and CodeQL — and none of them flagged it. They look for known bad patterns, and this bug is not a pattern. It is missing logic. Judging whether an authorization check is present requires understanding what the code is supposed to do, and that is the capability Corridor adds.

Two caveats keep this honest.

Corridor marked the finding as not proven reachable. It identified the vulnerable code but did not claim a working exploit path, because reachability depends on deployment details it cannot see. The finding is a strong signal, and confirming exploitability is still the reader's job.

The finding also did not appear in the default results when I queried Corridor's data through its API. It existed and I could retrieve it directly, but the default view shows only the main branch, and my bug lived on a feature branch. This is a configurable default rather than a failure, though it is the kind of default that hides branch findings from anyone who does not know to change it.

What's Missing

Corridor caught the bug. The open question is what happens to your code to make that possible.

On the individual tier, both halves of Corridor send your code off your machine: the guidance step sends your plan and surrounding context to Corridor's servers, and the review step sends the full pull request diff. Zero data retention is available only on the enterprise tier. The finding record confirmed that Corridor processed my diff on its own servers with a hosted model. For a personal repository, that is a fair trade. For code under a real data-classification obligation, a security tool that retains your source is itself a risk to weigh, and removing that risk requires the enterprise tier and a sales conversation.

The larger question is convergence. In February 2026 Anthropic shipped Claude Code Security, and its own description matches Corridor's pitch closely: rule-based static analysis "catches common issues but often misses more complex vulnerabilities, like flaws in business logic or broken access control." Broken access control is the exact bug Corridor caught for me, and it caught it while running inside Claude Code. Corridor's advantage is that it works across every agent, the same way whether you use Cursor, Codex, or Claude Code. Its risk is that the agent vendors are building the same capability into their own products, and Corridor depends on those products to run. The company is building on top of the platforms most likely to compete with it.

The Verdict

The category is real, and Corridor is a credible version of it. I have run enough security scanners to recognize the difference between one that matches patterns and one that reasons about code, and the IDOR catch is the second kind. It read my code, understood the missing check, and handed back the fix my own scanners could not produce. Security that runs inside the generation loop is the right answer to code that ships faster than anyone can review it.

One test does not settle the question. I ran the prevention path and a single pull request. The number that decides whether a tool like this survives daily use is the false-positive rate across dozens of pull requests over months, and I have not measured that yet. Corridor earned the next test.

Skills Are a Git-Native Distribution Primitive

Amit — Tue, 07 Jul 2026 00:47:53 +0000

TL;DR

Most tools solve discover-and-install for skills but leave drift as the default. The missing operation is contribute-back: a versioned, CI-gated publish that treats a skill update like a software release.
The industry converged on git-native distribution as the pattern. The gap was a tool that built the complete loop — not just install, but version, review, publish, propagate.
One registry now exists with that full lifecycle, built by the Snyk founder on a $125M bet that skills need the governance code always needed.
Publishing is unilateral by default — nobody reviews unless a team wires a CI gate in. The infrastructure exists; enforcing it is still a choice.

Institutional knowledge that lives in a wiki page dies when the person who wrote it changes teams. Institutional knowledge that lives in a SKILL.md — instructions an agent actually reads before acting — survives, because the artifact that carries the knowledge is the same artifact that does the work. That's the thesis. The interesting part isn't the thesis anymore. It's that the industry has been quietly building the infrastructure for it, and the complete lifecycle for it now exists.

I've been circling this from the writing side for months, not only the tooling side. Your Agent's Behavior Is Code — Start Versioning It made the same claim in April: institutional knowledge locked in people's heads or unread wikis is now versionable, diffable, executable, and should be treated with the same discipline as code. Skills as Institutional Memory went further and named the specific gap: "Skills don't yet have a standard for communicating breaking changes to methodology." That gap is the one Tessl closed.

The Convergence Was Already Real

Every signal pointed the same direction, independently, over the past several months:

Anthropic's skills repo — a public collection of shareable Agent Skills, now past 158,000 GitHub stars. Fork it, adapt it, PR your improvement back.
gh skill install — GitHub's own CLI installs a skill straight from any repo into your local agent, across Claude Code, Goose, Junie, OpenCode, Windsurf, and more.
mdskill.dev — a live, security-audited directory sitting at 10,000+ skills across 270+ repos.

The pattern underneath all three is the same three-operation model:

REPO (source of truth)
  └── docs/skills/ or .claude/skills/
        └── SKILL.md + scripts + resources
        ↓ discover / install / adapt
LOCAL HARNESS (runtime)
  └── vendored or symlinked copy
        ↓ contribute back (PR)
REPO (updated with field improvements)

Discover, install, contribute back. Every tool above nails the first two operations. Almost none of them close the third.

Rajiv Pant named the gap precisely: "Both the installed copy and the source had improved independently over the week." That's drift — the exact failure mode you'd expect once discover-and-install is solved but contribute-back isn't. If you've been managing your own skills for a while, you've probably already built half the fix yourself: a symlink from your agent's skill directory into a personal repo, so there's one copy, not two, and nothing to drift. That solves replication. It does nothing for versioning, review, or distributing an update to a different machine.
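
The symlink half-fix can be sketched in a few lines. The function name and any directory layout passed to it are hypothetical; the point is only that the harness directory holds a link to the repo copy, not a second copy.

```python
from pathlib import Path


def link_skill(source_dir: Path, harness_skills_dir: Path) -> Path:
    """Symlink a skill directory from a personal repo into a harness directory.

    Both paths are hypothetical examples; adjust them to your own layout.
    """
    harness_skills_dir.mkdir(parents=True, exist_ok=True)
    link = harness_skills_dir / source_dir.name
    if link.is_symlink():
        link.unlink()  # replace a stale link
    elif link.exists():
        raise FileExistsError(f"{link} is a real copy; remove it before linking")
    link.symlink_to(source_dir.resolve(), target_is_directory=True)
    return link
```

An edit to the repo copy is then visible through the link immediately, which is exactly why this solves replication and nothing else: there is no version, no review, and no way to reach another machine.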

What the Complete Loop Actually Looks Like

I tested one registry — Tessl — directly rather than reading about it. The practitioner's version of what this feels like day-to-day — the skill sprawl, the update loop, the autonomous-agent angle — is in Agent Sprawl Is a Skills Problem. The mechanism, confirmed hands-on:

A skill lives as a plain SKILL.md plus a manifest in an ordinary git repo — nothing proprietary about the storage. Tessl's own reference tile runs a real CI pipeline on every pull request: a check that fails the merge if the manifest's version field wasn't bumped, a structural lint, and an automated quality review — before a publish step ever fires on merge to main. That's the "contribute back" operation, finally closed with the same rigor as a software release, not a manual sync.
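
A version-bump gate of that kind can be approximated as below. This is a sketch under assumptions, not Tessl's actual pipeline: it assumes the manifest is JSON with a top-level "version" field in plain MAJOR.MINOR.PATCH form, and both function names are made up for illustration.

```python
import json


def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a comparable tuple (no pre-release tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def version_was_bumped(base_manifest: str, head_manifest: str) -> bool:
    """Return True when the PR's manifest version is strictly newer than main's.

    Assumes each manifest is JSON with a top-level "version" field; the real
    manifest format may differ.
    """
    base = parse_semver(json.loads(base_manifest)["version"])
    head = parse_semver(json.loads(head_manifest)["version"])
    return head > base
```

A CI step could read the base manifest from the main branch, read the head manifest from the pull request, and exit non-zero when this returns False. Comparing integer tuples rather than strings matters: it is what makes 1.10.0 newer than 1.9.0.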

Publishing doesn't hand a live git checkout to consumers. It vendors an immutable, versioned snapshot into a local cache, pinned to the exact version published — so editing the source does nothing to anyone's install until you deliberately publish again. I edited a skill, bumped its version, republished, and ran the update command elsewhere: the exact edit landed in the installed copy, and nowhere else, until I asked it to.
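
The pinning behavior can be illustrated with a small copy-on-publish cache. The cache layout (cache/name/version) and the function name are assumptions made for this sketch, not the registry's real on-disk format.

```python
import shutil
from pathlib import Path


def vendor_snapshot(source_dir: Path, cache_dir: Path, name: str, version: str) -> Path:
    """Copy a skill into cache_dir/<name>/<version>, refusing to overwrite.

    The cache layout is a hypothetical illustration, not the registry's format.
    """
    target = cache_dir / name / version
    if target.exists():
        raise FileExistsError(f"{name}@{version} is already published and immutable")
    shutil.copytree(source_dir, target)
    return target
```

Because each publish copies the files instead of linking to them, editing the source after publishing leaves the existing snapshot untouched. The edit only reaches consumers when a new version is published and they update to it.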

The --bump patch|minor|major flag on the publish command is the standard for communicating breaking changes, the one that didn't exist when I wrote about the gap in April. It's semantic versioning, borrowed wholesale from software and applied to a skill's methodology instead of its code: the exact mechanism that was missing.

Governance is a five-role ladder — Consumer, Member, Publisher, Manager, Owner — each with a defined permission set, confirmed against Tessl's own documentation. Publisher can create, publish, and unpublish (within a two-day window, an npm-style undo guard, not indefinite retraction). Manager and Owner control who holds that role. None of it requires code execution inside the platform — agents read the resulting files the same way they'd read anything else in a repo.
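
The ladder can be modeled as an ordered enum. This sketch encodes only what is stated above (who can publish, the two-day unpublish window, who assigns the Publisher role). Treating the ladder as strictly cumulative, and the exact boundary of the window, are assumptions of the sketch.

```python
from datetime import datetime, timedelta
from enum import IntEnum


class Role(IntEnum):
    # Order follows the ladder named above; treating it as strictly
    # cumulative is an assumption of this sketch.
    CONSUMER = 1
    MEMBER = 2
    PUBLISHER = 3
    MANAGER = 4
    OWNER = 5


UNPUBLISH_WINDOW = timedelta(days=2)


def can_publish(role: Role) -> bool:
    """Publisher and above may create and publish."""
    return role >= Role.PUBLISHER


def can_unpublish(role: Role, published_at: datetime, now: datetime) -> bool:
    """A version may be retracted only inside the two-day window."""
    return can_publish(role) and now - published_at <= UNPUBLISH_WINDOW


def can_assign_publisher(role: Role) -> bool:
    """Manager and Owner control who holds the Publisher role."""
    return role >= Role.MANAGER
```

Note what the model does not contain: any notion of a reviewer. Nothing in the permission set itself requires a second person before a publish.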

The Market Bet Behind This

The registry that implements this — Tessl — doesn't lead with distribution on its homepage. It leads with "Skills are the new code. Treat them that way," and three questions aimed at governance risk: would you know if a risky skill ran in your environment, what are duplicate and outdated skills costing your team, are your agents actually using the skills they have. Distribution is the proof point. The claim being sold is bigger — that skills are about to need the same enterprise discipline code has always needed.

The founder is Guy Podjarny, previously of Snyk — the developer-security business built on the same instinct: take a category everyone was quietly ignoring and make it a governance line item. The company has raised $125 million from Index Ventures, Accel, GV, and boldstart. That's a well-capitalized bet that the shift from code-centric to skill-centric development is real. Worth remembering: "shift security left" took Snyk years to become assumed vocabulary rather than a pitch. Whether "skills are the new code" follows the same slow curve — or never lands — is a question the funding round answers with confidence and the market hasn't answered at all yet.

What's Still a Choice, Not a Guarantee

Here's the open thread the industry hasn't answered yet, and Tessl doesn't answer it either: who reviews the skill when the author and the contributor are both agents?

The honest answer, confirmed by testing it: nobody, by default. Publishing is unilateral for anyone with Publisher role. The lint-and-review-as-CI-gate pattern is real and it works — but it exists because one team chose to require it in their pull request checks, not because the registry enforces it as a platform guarantee. Strip that CI config out, and a Publisher ships straight to the registry with no review step at all. The infrastructure for rigorous, git-native skill governance now exists start to finish. Whether any given team actually wires the review gate in, or grants Publisher broadly and skips it, is still a social decision, not a technical one.

So what: skills distribution stopped being a hypothetical pattern the moment a tool built the complete lifecycle — discover, install, version, review-if-you-choose-to, publish, propagate — on infrastructure (git, CI, a registry) that already knows how to ship software. The part that was missing is built now. The part that was always going to be a choice — whether anyone actually enforces review before a skill reaches production — still is.

Agent Sprawl Is a Skills Problem

Amit — Tue, 07 Jul 2026 00:47:17 +0000

TL;DR

Agent sprawl is real — but the harder problem underneath it is skill sprawl. Skills that exist but aren't wired in. Skills that are wired in but stale. No single view of what you have.
Skills and MCP servers are now the shared infrastructure layer of any agent setup. They need the same lifecycle code has always had: versioning, distribution, update propagation.
For a solo builder: a registry with an update loop means fixing a skill once reaches every harness. That loop didn't exist before.
For teams: the question isn't just "do we have skills" — it's "which version is each developer on, who reviewed it, and is it safe to run."
Autonomous agents running headless in CI still need skills. They still drift. The manual sync model breaks when there's no human at the keyboard.

Have you been running into agent sprawl over the past year?

I have. Not in an abstract way — concretely. I have over 80 skills built up across a year of working with AI agents. They live in .claude/skills/, .agents/skills/, repo-specific directories, content drafts folders, and at least a dozen places I'd have to grep to find. Some are wired into Claude Code. Some into Cursor. Some into neither. Some were written, used once, and never touched again. A few I rebuilt from scratch because I forgot they already existed.

That's skill sprawl. And it compounds the agent sprawl problem instead of solving it.

The Thing Nobody Warned You About

When people talk about agent sprawl, they usually mean too many agents, too many tools, ungoverned access. That's real. But the harder problem is one level below it.

Skills are the shared context layer that makes agents useful for your work, not just capable in general. I wrote about this last year: the difference between an agent that can do a thing and an agent that does the thing the way you need it done is the skill. The methodology, the heuristics, the failure modes, the output format. That's the skill.

The problem is that skills don't have a lifecycle. You write one. It lives somewhere. You fix a bug in it. The fix lives in one place. Every other harness, every other machine, every other developer on your team is still running the old version. Nobody knows.

This is the same problem software had with dependencies before package managers. You'd copy a library into your project. It worked. Then upstream fixed a bug. You never heard about it. You're still running the broken version.

Skills are there now. And MCP servers are getting there fast.

What My Workspace Actually Looked Like

Here's the concrete version. Across four repos on my main machine, I found 48 SKILL.md files. Some were installed properly — vendored into the right harness directories, showing up when agents loaded. Most weren't. They existed as files. They weren't doing anything.

The skills that were wired in were split: some in .claude/skills/, some in .agents/skills/, a few symlinked from a personal repo. When I fixed a skill — added a missing step, corrected a failure mode I'd hit — I updated one copy. The others stayed stale. I had no way to see which harness had which version without going directory by directory.
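
The directory-by-directory check can be automated with a short script. This is a minimal sketch, assuming a skill's name is the name of the directory that holds its SKILL.md; the function name is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_skill_drift(roots: list[Path]) -> dict[str, dict[str, list[Path]]]:
    """Group SKILL.md files by skill name, then by content hash.

    A skill whose name maps to more than one hash has drifted copies.
    The skill name is assumed to be the parent directory's name.
    """
    by_skill: dict[str, dict[str, list[Path]]] = defaultdict(lambda: defaultdict(list))
    for root in roots:
        for skill_file in root.rglob("SKILL.md"):
            digest = hashlib.sha256(skill_file.read_bytes()).hexdigest()[:12]
            by_skill[skill_file.parent.name][digest].append(skill_file)
    # Keep only skills that exist in more than one distinct version.
    return {name: dict(hashes) for name, hashes in by_skill.items() if len(hashes) > 1}
```

Pointing it at the repo roots returns only the skills with diverging copies, with the file paths grouped by hash. It finds drift; it does not say which copy is the authoritative one.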

I wasn't managing skills. I was accumulating them.

The market is hitting the same wall at scale. Microsoft shipped APM in April — positioning it as the package.json for AI agent configuration, where you declare skills, prompts, instructions, and MCP servers once in an apm.yml and every harness gets the same setup. Portkey shipped a skills registry the same month. "Agent harness sprawl" is appearing in enterprise risk frameworks. The average enterprise now runs 12 agents, according to Salesforce's 2026 MuleSoft Connectivity Benchmark — and 50% of those agents are operating in isolated silos with no enterprise-level governance.

We spent 2025 building agents. We're spending 2026 figuring out how to govern what we built.

Skills and MCP Servers Are Infrastructure Now

The frame that clarifies this: skills and MCP servers are no longer optional context. They're the infrastructure layer of any serious agent setup. Every harness needs them. Every autonomous agent needs them. They need to be versioned, distributed, and kept current — the same way any other shared dependency does.

That's what was missing. Not better skills. Not more skills. A lifecycle for the ones that already exist.

When I publish a skill update today, tessl update propagates it to every harness that has it installed — Claude Code, Cursor, Codex, .agents/skills/, all of it. One command. I don't touch each directory manually. I don't remember to sync anything. The fix reaches everywhere or it reaches nowhere, by design.

That loop is the entire value. It's small and it's precise and it didn't exist before.

What This Means for Teams

Multiply my problem by a team of ten.

Now every developer has their own version of the skill. Some have the one from last month. Some have a fork they modified locally. Nobody knows whose is authoritative. Someone hits a failure mode and fixes it in their copy. The fix never propagates.

The questions a team needs to answer — which skills are approved to run, who reviewed them, what version is each developer on — are the same questions they already answer about npm packages. Dependency governance is a solved problem in software. It just hasn't been applied to skills yet.

Tessl applies that registry model to skills: role-based access, semantic versioning, and a CI gate that fails the PR if the version wasn't bumped. For a solo builder, most of it is overhead, and the value is just the update loop. For a team, the governance layer is the point.

Autonomous Agents Are the Part Nobody's Talking About Yet

Here's where this gets sharper.

I run autonomous agents — headless sessions, CI pipelines, scheduled workflows with no human watching. They use tools. They follow workflows. They need skills to do that work the way I need it done, not just in a generic capable way.

Those agents still drift. When I update a skill, a headless agent running in CI has no way to know. It's still loading the old version. There's no person at the keyboard to notice.

tessl launch skill targets OpenHands, Codex CLI, and other autonomous runtimes directly. The same registry, the same versioned snapshot, the same update command — but the consumer is a running agent instead of a developer's local harness. The skill reaches the autonomous agent the same way it reaches Claude Code.

That closes the last gap. The manual sync model works when a human is at the keyboard. It breaks when the agent is running on its own.

What's Still on You

The inventory scanner (tessl inventory import) is built for exactly the problem I described — finding the skills you already have scattered across repos. It didn't work for me: it only reads public GitHub org repos, and my skills live in private personal-account repos. Zero results. The migration from sprawl to managed is still a manual job if your setup looks like mine.

Publishing is unilateral by default. Anyone with Publisher role can push to the registry with no review step. The CI gate exists — you have to choose to wire it in. The infrastructure is there. Whether teams actually enforce it is still a social decision.

Auto-generated evals only work for skills with a checkable output. Conversational workflows don't generate scenarios. You have to write those by hand.

These are real gaps. They don't undermine the core loop — publish, install, update — but they're worth knowing before you build on top of it.

The Open Question

The industry is converging on this fast. Microsoft APM, Portkey, and Tessl are three different bets on the same gap, made within months of each other. The format already converged — SKILL.md is readable by 30+ agent platforms. What was missing was the lifecycle. That's what's being built now.

For me, the immediate value was simple: fix a skill once and the change reaches every harness. That's it. That's the thing that didn't exist before.

Whether any of these tools becomes the standard the way npm did — or whether the whole category stays infrastructure for teams already thinking carefully about this — is the question the market hasn't answered yet.

For the broader argument — why the industry converged on git-native distribution and what it means that one company built the complete lifecycle — see Skills Are a Git-Native Distribution Primitive.

Zed on Bedrock Makes the Agent Harness Portable

Amit — Mon, 06 Jul 2026 18:33:42 +0000

TL;DR

Wire Zed's open-source agent harness to Amazon Bedrock and the editor inherits your AWS account's governance boundary — auth, audit, region, billing — instead of one provider integration per tool.
The harness stays fixed; the model plane swaps underneath it. Claude, OpenAI GPT, xAI Grok, and open-weight models become config-change choices, not migrations.
Native Bedrock (Converse) covers Claude. OpenAI and Grok live on a separate OpenAI-shaped endpoint and need a small local translation proxy — included in full below.
Zed's Business plan is a per-seat governance license that stacks on top; inference still runs in your own Bedrock account, model-agnostic either way.

Zed is not another VS Code skin with chat bolted on. It is a code editor from Zed Industries, the team that previously created Atom, Electron, and Tree-sitter. That history matters because Zed's center of gravity is still the editor: low-latency text, collaboration, project context, and code-adjacent workflows that stay close to the files.

The AI layer follows the same shape. Zed's AI docs separate agent paths from model access. The native Zed Agent is one way to run agentic work inside the editor. External agents and terminal-backed sessions are other ways. Model access is a separate decision: Zed-hosted models, provider APIs, gateways, subscriptions, local models, or Bedrock.

That makes Bedrock support interesting for a practical reason: Zed's agent harness can run across the models your Bedrock account can reach. The harness surface is Zed. The model plane is Bedrock. The bill for inference is still the model bill, but the editor-side agent experience is no longer tied to a single hosted model plan.

That is the real unlock, and it is bigger than a personal convenience. The coding agent is now the highest-volume AI surface in most engineering orgs — in a16z's 2025 survey of 100 enterprise CIOs, software development is called out as the killer use case, with one CTO reporting nearly 90% of new code generated through AI coding tools. Point that surface at a governed model plane and every request inherits the account's auth boundary, audit trail, region controls, and billing — instead of scattering source code across one direct provider integration per tool. Claude can be the default today. OpenAI and xAI/Grok can become the same kind of choice when the account and endpoint expose them. The user keeps one editor harness and swaps the model plane underneath it.

The distinction still matters. Zed provides the native editor harness: agent panel, project context, file edits, terminal-adjacent work, and model selection. Bedrock provides the model boundary: account access, model availability, region behavior, inference profiles, auditability, and billing. Wiring Zed to Bedrock is powerful because those two layers line up cleanly — and because that boundary is exactly the thing enterprises are now scrambling to put in front of every model call.

The Clean Shape

Zed exposes three useful surfaces for agentic work. They can look similar from the sidebar. They are not the same system.

Path	What Zed Owns	What The Agent Owns
Native Zed Agent on Bedrock	Editor harness, tools, model picker, thread surface	Bedrock owns inference for the selected model
ACP external agent	Editor integration and protocol boundary	Agent harness, model routing, tool policy, memory behavior
Terminal thread	Terminal surface inside Zed	The CLI's native behavior

This is where ACP belongs in the post: as boundary context. Zed Industries helped create the Agent Client Protocol, and ACP matters when Zed is hosting an external coding agent. But this setup is not an ACP setup. It is the simpler path: run the native Zed Agent with Bedrock as the model provider.

That table is the whole decision. If the goal is "use Zed's agent harness across Bedrock models," configure the native provider. If the goal is "preserve the behavior of a different coding harness," run that harness directly or through ACP and accept that Zed is now a host, not the agent brain.

Why This Is A Control-Plane Decision

The wiring is a config file. The reason to do it is a governance decision that the rest of the industry is converging on from the other direction.

Enterprises no longer run one model. In a16z's CIO survey, 37% of enterprises now run five or more models in production, and the driver is differentiation by use case, not just lock-in avoidance — one model wins on code completion, another on system design, another purely on cost. That same report notes procurement now looks like traditional software buying, where security and price often outweigh raw accuracy: "for most tasks, all the models perform well enough now — so pricing has become a much more important factor." Model choice is a business gate, not a preference.

The catch is that choice is getting harder to keep. As agentic workflows mature, a16z finds switching costs are rising: prompts, tools, and guardrails get tuned to one model's quirks, so "changing models is now a task that can take a lot of engineering time." The defense against that lock-in is a stable harness over a swappable plane. If the model is a config value behind one API, swapping it is a config change. If the model is wired directly into each tool, swapping it is a migration.
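A minimal sketch of "the model is a config value behind one API", assuming boto3 and the Bedrock Converse API. The names CONFIG, build_converse_request, and ask are hypothetical; the model IDs are the ones used later in this post and may not be visible in your account.

```python
# Sketch: the model ID lives in one config value, so swapping models touches one line.
CONFIG = {
    "region": "us-east-1",
    "model_id": "global.anthropic.claude-sonnet-5",  # assumed visible in your account
}

def build_converse_request(config, prompt):
    """Return keyword arguments for the bedrock-runtime converse() call."""
    return {
        "modelId": config["model_id"],
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
    }

def ask(config, prompt):
    """Send one prompt through Converse. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it
    client = boto3.client("bedrock-runtime", region_name=config["region"])
    response = client.converse(**build_converse_request(config, prompt))
    # Assumes the first content block of the reply is plain text.
    return response["output"]["message"]["content"][0]["text"]
```

Swapping the model is then dict(CONFIG, model_id="global.anthropic.claude-opus-4-8") and nothing else. Prompt and tool tuning can still be model-specific; the sketch only shows that the call site does not have to be.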

That is why a category has formed around this exact shape. Analysts and vendors now call it the AI control plane or agent gateway — one layer that owns identity, policy, audit, cost attribution, and data residency for every model call, explicitly including terminal-based coding agents. The market is consolidating fast enough that gateways are being acquired and donated to foundations. Most teams reach for a third-party gateway to get this. The point of this post is narrower and cheaper: if you are already on AWS, Bedrock is that plane. One API (Converse), one IAM boundary, one audit surface, inference profiles for capacity and region strategy, and data that stays inside the account. Pointing the coding harness at it means the developer gets the governed boundary without a new vendor in the request path.

Zed's editor and agent harness are open source (GPL-3.0-or-later, with Apache-2.0 components), from Zed Industries. Zed also offers a Business plan — a per-seat license fee that buys the enterprise governance features on top: org-wide model policies, data-governance controls, role-based access, and unified spend visibility. The fee buys the controls, not the models. Inference still runs through your own Amazon Bedrock endpoint, in your own AWS account. So the layers stack cleanly: pay Zed for the governance layer, keep Bedrock as the model-agnostic plane underneath it.

The Bedrock Configuration

The stable setup uses Zed's native Bedrock provider with an AWS profile and region:

{
  "agent": {
    "default_model": {
      "provider": "bedrock",
      "model": "global.anthropic.claude-sonnet-5",
      "effort": "MEDIUM",
      "enable_thinking": true
    },
    "favorite_models": [
      {
        "provider": "bedrock",
        "model": "global.anthropic.claude-sonnet-5"
      },
      {
        "provider": "bedrock",
        "model": "global.anthropic.claude-opus-4-8"
      }
    ]
  },
  "language_models": {
    "bedrock": {
      "authentication_method": "named_profile",
      "profile": "<aws-profile>",
      "region": "us-east-1"
    }
  }
}

The important choice is the global.* model ID. Amazon Bedrock inference profiles can route model invocation across Regions and are the right fit when the account exposes global profiles for the model family. In the account I checked, the current useful pair was:

global.anthropic.claude-sonnet-5
global.anthropic.claude-opus-4-8

Sonnet 5 is the default because it is the latest Anthropic model visible in the account. Opus 4.8 stays as a favorite because some work still wants the premium reasoning path. That is enough for the Anthropic slice. Adding every older model to the picker turns model selection into inventory management.

The CLI Is The Source Of Truth

The UI can lie by omission. Docs can lag. Limited previews can exist before a model appears in every account. The CLI is the first verification layer:

aws bedrock list-foundation-models \
  --region us-east-1 \
  --query 'modelSummaries[].[providerName,modelId,modelName]' \
  --output table

aws bedrock list-inference-profiles \
  --region us-east-1 \
  --query 'inferenceProfileSummaries[].[inferenceProfileId,inferenceProfileName,status]' \
  --output table

For US-only validation, run the same check in us-east-1, us-east-2, and us-west-2. The result is not a universal statement about Bedrock availability. It is the truth for that account at that moment.
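The three-region check can be scripted. This is a sketch that wraps the same two AWS CLI commands shown above; build_checks and run_checks are hypothetical helper names, and running them assumes the AWS CLI is installed with valid credentials.

```python
import subprocess

REGIONS = ["us-east-1", "us-east-2", "us-west-2"]

def build_checks(region):
    """Return the two AWS CLI invocations from the text, as argv lists, for one Region."""
    return [
        ["aws", "bedrock", "list-foundation-models", "--region", region,
         "--query", "modelSummaries[].[providerName,modelId,modelName]",
         "--output", "table"],
        ["aws", "bedrock", "list-inference-profiles", "--region", region,
         "--query",
         "inferenceProfileSummaries[].[inferenceProfileId,inferenceProfileName,status]",
         "--output", "table"],
    ]

def run_checks(regions=REGIONS):
    """Run both checks in every Region and print the CLI's own table output."""
    for region in regions:
        for argv in build_checks(region):
            print(f"== {region}: {argv[2]}")
            # check=False: a Region without access should not stop the loop.
            subprocess.run(argv, check=False)
```

Call run_checks() from a shell where the intended AWS profile is active. The output is still only the truth for that account at that moment.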

That account-specific read matters for OpenAI and Grok. The point of the setup is one free Zed harness across Claude, OpenAI, and xAI frontier models through Bedrock, plus the open-weight catalog. The catch is that the proprietary frontier models do not appear in list-foundation-models or list-inference-profiles at all — those commands cover the native Converse catalog, and OpenAI GPT and Grok live behind the separate OpenAI-shaped endpoint. The CLI check above will not show them even when the account can reach them. The account I checked exposed the full frontier and open-weight set through that endpoint's own model list, much of which never surfaced in the standard catalog.

That is the trap: the native catalog being empty of these models is not evidence they are unavailable. It is evidence you are looking at the wrong plane.

The OpenAI And Grok Path

OpenAI GPT and xAI Grok models do not come through the native Bedrock Converse API. They come through a separate Bedrock endpoint that speaks the OpenAI API shape: bedrock-mantle.<region>.api.aws. The catalog and inference profiles used by the native provider never surface them, which is why they look absent even when the account can reach them.

The proprietary frontier models are the headline, but the same endpoint is where the open-weight catalog lives too. Bedrock carries a wide set of open-weight models — families like DeepSeek, Qwen, GLM, Kimi, Mistral, Gemma, and Nemotron — and many answer on this OpenAI-shaped endpoint rather than the Converse API. The catalog is broad and refreshes often, though it trails the absolute open-weight frontier by roughly one to three months: the newest top-of-leaderboard open weights usually land on Bedrock a wave or two after their public release. The point holds regardless of which model is on top this month — the harness stays put, the model plane underneath it moves.

The endpoint splits by model family. GPT 5.x answers only on the Responses API (/openai/v1/responses). Grok and the open-weight models answer on Chat Completions (/openai/v1/chat/completions). The path prefix is /openai/v1, not /v1 — the bare /v1/models route lists models but the inference routes live under /openai.
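The split can be written down as one function. This is a sketch under an assumption: it picks the rail by model ID prefix, which is a shortcut of mine; the full proxy later in this post uses an explicit per-model table instead. The name inference_url is hypothetical.

```python
def inference_url(model, region):
    """Return the upstream inference URL for a model on the OpenAI-shaped endpoint."""
    # Inference routes live under /openai/v1, not the bare /v1 prefix.
    base = f"https://bedrock-mantle.{region}.api.aws/openai/v1"
    # Assumption: GPT 5.x IDs start with "openai.gpt-5"; everything else uses Chat Completions.
    if model.startswith("openai.gpt-5"):
        return f"{base}/responses"
    return f"{base}/chat/completions"
```

A prefix rule is easy to read but brittle when a new family lands, which is why a table keyed by exact model ID is the safer shape for anything long-lived.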

Zed's openai_compatible provider can represent either shape, but it cannot reach the endpoint directly. Two mismatches block it. First, the endpoint authenticates with an AWS bearer token, not a static API key, and the token is region-scoped and short-lived. Second, GPT 5.x and Grok use different tool-call and message formats, and GPT 5.x's Responses API streaming events do not match the Chat Completions stream shape Zed expects.

A small local proxy closes both gaps. It mints a bearer token per region from the AWS session, routes each model to the correct rail, translates tool and message formats for the Responses API, and adapts the Responses streaming events back into Chat Completions chunks. Zed points at http://127.0.0.1:<port> as an openai_compatible provider and treats the models as first-class — streaming and tool use included.

{
  "language_models": {
    "openai_compatible": {
      "bedrock-mantle": {
        "api_url": "http://127.0.0.1:4327",
        "available_models": [
          {
            "name": "openai.gpt-5.5",
            "display_name": "GPT 5.5 (Bedrock)",
            "max_tokens": 128000,
            "capabilities": {
              "tools": true,
              "images": false,
              "parallel_tool_calls": false,
              "prompt_cache_key": false,
              "chat_completions": true
            }
          },
          {
            "name": "xai.grok-4.3",
            "display_name": "Grok 4.3 (Bedrock)",
            "max_tokens": 1000000,
            "capabilities": {
              "tools": true,
              "images": false,
              "parallel_tool_calls": false,
              "prompt_cache_key": false,
              "chat_completions": true
            }
          }
        ]
      }
    }
  }
}

The proxy carries the model-family logic. Zed sees one endpoint. The right sequence still holds: model visible in the endpoint's model list, API call succeeds against the correct rail, proxy translates the formats, editor config lands, model becomes a favorite. Reversing that sequence creates a polished model picker full of broken doors.
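The first step of that sequence, checking the endpoint's own model list, can be sketched as below. The /v1/models route and bearer-token auth come from the text above; the response shape (an OpenAI-style list with a data array of objects carrying id) is an assumption, and models_request, visible_model_ids, and list_models are hypothetical names.

```python
import json
import urllib.request

def models_request(region, token):
    """Build a GET request for the endpoint's bare /v1/models route."""
    return urllib.request.Request(
        f"https://bedrock-mantle.{region}.api.aws/v1/models",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )

def visible_model_ids(payload):
    """Extract model IDs, assuming an OpenAI-style {"data": [{"id": ...}]} payload."""
    return sorted(item["id"] for item in payload.get("data", []) if "id" in item)

def list_models(region, token):
    """Fetch and parse the model list. Needs network access and a valid token."""
    with urllib.request.urlopen(models_request(region, token), timeout=30) as resp:
        return visible_model_ids(json.load(resp))
```

Only add a model to the proxy table and the Zed config once its ID shows up here for the region you intend to call.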

The Proxy, In Full

The proxy is one Python file with no dependencies beyond aws-bedrock-token-generator. It holds a model table, mints tokens per region, and translates in both directions. Swap the profile name and model IDs for what your account exposes.

#!/usr/bin/env python3
"""Bedrock Mantle proxy for an OpenAI-compatible client (e.g. Zed).
Streaming + tool calls, both rails. pip install aws-bedrock-token-generator
"""
import json, os, sys, time, urllib.error, urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from aws_bedrock_token_generator import provide_token

HOST = os.environ.get("MANTLE_PROXY_HOST", "127.0.0.1")
PORT = int(os.environ.get("MANTLE_PROXY_PORT", "4327"))
AWS_PROFILE = os.environ.get("AWS_PROFILE", "default")

# rail: "openai_responses" (GPT 5.x) or "openai_chat" (Grok, open-weight models)
MODELS = {
    "openai.gpt-5.5": {"region": "us-east-1", "rail": "openai_responses"},
    "xai.grok-4.3":   {"region": "us-west-2", "rail": "openai_chat"},
}

_token_cache = {}

def log(msg): print(f"[proxy] {msg}", file=sys.stderr, flush=True)

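def get_token(region):
    """Return a cached bearer token for this region, minting a new one near expiry."""
    c = _token_cache.get(region)
    if c and c["exp"] > time.time() + 60:
        return c["token"]
    # Point the environment at the right profile and region for the duration of the
    # provide_token() call, then restore whatever was there before.
    saved = {k: os.environ.get(k) for k in ("AWS_REGION", "AWS_DEFAULT_REGION", "AWS_PROFILE")}
    os.environ.update({"AWS_PROFILE": AWS_PROFILE, "AWS_REGION": region, "AWS_DEFAULT_REGION": region})
    try:
        token = provide_token()
    finally:
        for k, v in saved.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v
    # 3000 s = 50 minutes before the cached token is treated as stale.
    _token_cache[region] = {"token": token, "exp": time.time() + 3000}
    return token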

def to_text(content):
    if isinstance(content, str): return content
    if isinstance(content, list):
        return "\n".join(p.get("text", "") for p in content if isinstance(p, dict))
    return ""

def build_chat(model, cfg, messages, tools, max_tokens, stream):
    url = f"https://bedrock-mantle.{cfg['region']}.api.aws/openai/v1/chat/completions"
    body = {"model": model, "messages": messages,
            "max_completion_tokens": max_tokens, "stream": stream}
    if tools:
        body["tools"] = tools
        body["tool_choice"] = "auto"
    return url, body

def build_responses(model, cfg, messages, tools, max_tokens, stream):
    url = f"https://bedrock-mantle.{cfg['region']}.api.aws/openai/v1/responses"
    system, items = None, []
    for m in messages:
        role = m.get("role")
        if role == "system":
            system = to_text(m.get("content", "")); continue
        if role == "tool":
            items.append({"type": "function_call_output",
                          "call_id": m.get("tool_call_id", ""),
                          "output": to_text(m.get("content", ""))}); continue
        if role == "assistant" and m.get("tool_calls"):
            for tc in m["tool_calls"]:
                fn = tc.get("function", {})
                items.append({"type": "function_call", "call_id": tc.get("id", ""),
                              "name": fn.get("name", ""), "arguments": fn.get("arguments", "{}")})
            t = to_text(m.get("content", ""))
            if t: items.append({"role": "assistant", "content": t})
            continue
        ttype = "input_text" if role == "user" else "output_text"
        items.append({"role": role,
                      "content": [{"type": ttype, "text": to_text(m.get("content", ""))}]})
    body = {"model": model, "input": items, "max_output_tokens": max_tokens,
            "stream": stream, "store": False}
    if system: body["instructions"] = system
    if tools:
        body["tools"] = [{"type": "function", "name": t["function"]["name"],
                          "description": t["function"].get("description", ""),
                          "parameters": t["function"].get("parameters", {})} for t in tools]
        body["tool_choice"] = "auto"
    return url, body

def upstream(url, body, token):
    return urllib.request.Request(url, data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json",
                 "Accept": "text/event-stream" if body.get("stream") else "application/json"},
        method="POST")

def responses_stream_to_chat(resp):
    mid, created, tools_seen = f"c-{int(time.time()*1000)}", int(time.time()), False
    for raw in resp:
        line = raw.decode("utf-8", "replace").rstrip()
        if not line.startswith("data: "): continue
        data = line[6:]
        if data == "[DONE]":
            yield b"data: [DONE]\n\n"; return
        try: ev = json.loads(data)
        except json.JSONDecodeError: continue
        t = ev.get("type", "")
        def chunk(delta, finish=None):
            return ("data: " + json.dumps({"id": mid, "object": "chat.completion.chunk",
                "created": created, "model": ev.get("model", ""),
                "choices": [{"index": 0, "delta": delta, "finish_reason": finish}]}) + "\n\n").encode()
        if t == "response.output_text.delta":
            yield chunk({"content": ev.get("delta", "")})
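        elif t == "response.output_item.added" and ev.get("item", {}).get("type") == "function_call":
            tools_seen = True; it = ev["item"]; idx = ev.get("output_index", 0)
            # Use call_id as the tool-call id: build_responses() sends the client's
            # tool_call_id back upstream as call_id, so the two must match.
            yield chunk({"tool_calls": [{"index": idx, "id": it.get("call_id") or it.get("id", ""),
                "type": "function",
                "function": {"name": it.get("name", ""), "arguments": ""}}]})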
        elif t == "response.function_call_arguments.delta":
            yield chunk({"tool_calls": [{"index": ev.get("output_index", 0),
                "function": {"arguments": ev.get("delta", "")}}]})
        elif t == "response.completed":
            yield chunk({}, "tool_calls" if tools_seen else "stop")
            yield b"data: [DONE]\n\n"; return

def responses_json_to_chat(payload, model):
    text, tools = [], []
    for i, item in enumerate(payload.get("output", [])):
        if item.get("type") == "message":
            text += [c.get("text", "") for c in item.get("content", []) if c.get("type") == "output_text"]
        elif item.get("type") == "function_call":
            # Prefer call_id so the follow-up function_call_output can reference it.
            tools.append({"id": item.get("call_id") or item.get("id", f"call_{i}"), "type": "function",
                "function": {"name": item.get("name", ""), "arguments": item.get("arguments", "{}")}})
    msg = {"role": "assistant", "content": "".join(text) or (None if tools else "")}
    if tools: msg["tool_calls"] = tools
    return {"id": f"c-{int(time.time()*1000)}", "object": "chat.completion",
            "created": int(time.time()), "model": model,
            "choices": [{"index": 0, "message": msg,
                         "finish_reason": "tool_calls" if tools else "stop"}]}

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"
    def log_message(self, *a): pass
    def _json(self, status, payload):
        d = json.dumps(payload).encode()
        self.send_response(status); self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(d))); self.end_headers(); self.wfile.write(d)
    def do_GET(self):
        if self.path in ("/v1/models", "/models"):
            self._json(200, {"object": "list", "data": [{"id": k, "object": "model",
                "created": 0, "owned_by": "bedrock-mantle"} for k in MODELS]})
        else: self._json(404, {"error": {"message": "not found"}})
    def do_POST(self):
        if self.path not in ("/v1/chat/completions", "/chat/completions"):
            return self._json(404, {"error": {"message": "not found"}})
        req = json.loads(self.rfile.read(int(self.headers.get("Content-Length", "0"))))
        model = req.get("model", "")
        if model not in MODELS:
            return self._json(400, {"error": {"message": f"unsupported model: {model}"}})
        cfg = MODELS[model]
        messages, tools = req.get("messages", []), req.get("tools") or []
        max_tokens = req.get("max_completion_tokens") or req.get("max_tokens") or 4096
        stream = bool(req.get("stream", False))
        token = get_token(cfg["region"])
        builder = build_responses if cfg["rail"] == "openai_responses" else build_chat
        url, body = builder(model, cfg, messages, tools, max_tokens, stream)
        try:
            if stream:
                # Open the upstream first: if it rejects the call, the HTTPError handler
                # below can still send a JSON error because no headers have gone out yet.
                with urllib.request.urlopen(upstream(url, body, token), timeout=300) as up:
                    self.send_response(200)
                    self.send_header("Content-Type", "text/event-stream")
                    self.send_header("Cache-Control", "no-cache")
                    self.send_header("Connection", "close"); self.end_headers()
                    gen = responses_stream_to_chat(up) if cfg["rail"] == "openai_responses" else up
                    for line in gen:
                        self.wfile.write(line); self.wfile.flush()
            else:
                with urllib.request.urlopen(upstream(url, body, token), timeout=120) as up:
                    payload = json.loads(up.read())
                result = responses_json_to_chat(payload, model) if cfg["rail"] == "openai_responses" else payload
                self._json(200, result)
        except urllib.error.HTTPError as e:
            self._json(e.code, {"error": {"message": e.read().decode("utf-8", "replace")}})

if __name__ == "__main__":
    log(f"listening on http://{HOST}:{PORT}/v1  models: {', '.join(MODELS)}")
    ThreadingHTTPServer((HOST, PORT), Handler).serve_forever()

Start it with an AWS profile that can call Bedrock, then point Zed's openai_compatible provider at the port. Two details are load-bearing. The stream uses Connection: close, not Transfer-Encoding: chunked — a plain HTTPServer does not chunk-encode, and a client expecting chunked framing fails with a decode error. And the token must be minted for the same region as the model's endpoint, or the call returns a region-scope error.
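Before wiring Zed to the port, a small client can confirm the stream framing works end to end. This is a sketch, not part of the proxy: content_from_sse and smoke_test are hypothetical names, and smoke_test assumes the proxy is already running locally on port 4327.

```python
import json

def content_from_sse(lines):
    """Join the text deltas from an iterable of Chat Completions SSE lines (bytes)."""
    parts = []
    for raw in lines:
        line = raw.decode("utf-8", "replace").strip()
        if not line.startswith("data: "):
            continue  # blank separators and anything that is not an SSE data line
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or [{}]
        parts.append(choices[0].get("delta", {}).get("content") or "")
    return "".join(parts)

def smoke_test(model, port=4327):
    """Stream one short reply through the local proxy and return the joined text."""
    import urllib.request
    body = json.dumps({"model": model, "stream": True,
                       "messages": [{"role": "user", "content": "Say ok."}]}).encode()
    req = urllib.request.Request(f"http://127.0.0.1:{port}/v1/chat/completions",
                                 data=body, headers={"Content-Type": "application/json"},
                                 method="POST")
    # The proxy ends the stream by closing the connection, so reading to EOF is enough.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return content_from_sse(resp)
```

Run smoke_test("xai.grok-4.3") and smoke_test("openai.gpt-5.5") separately: one exercises the pass-through rail, the other the Responses translation.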

What Free Does Not Mean

The Zed harness can be free as the editor-side agent surface. Bedrock inference is still metered, and frontier models still carry frontier-model pricing. The win is that the harness is not another model subscription gate.

The other missing piece is automatic model routing inside the harness.

Some coding agents use a premium model for the visible reasoning path and a faster smaller model for internal classification, summarization, sub-agent work, or cheap tool-adjacent tasks. That is not merely a model setting. It is harness behavior.

Zed's native agent model setting does not appear to add that internal routing layer. If the selected model is Sonnet 5, assume the native Zed Agent is using Sonnet 5 for the agent session unless Zed documents otherwise. If an ACP-backed agent does internal routing, that behavior belongs to the external agent process, not to Zed.

This is the trade:

Choice	Benefit	Cost
Zed native Agent on Bedrock	Free editor harness across reachable Bedrock models	Bedrock usage is still billed; internal routing is less visible
Mature CLI agent directly	Full native harness behavior	Separate surface from the editor
Mature CLI agent through ACP	Better editor integration	Protocol layer may hide or flatten parts of the native experience

The wrong expectation is "same model, same behavior." The better expectation is "same model plane, different harness."

So What

Wiring Zed to Bedrock is worth doing because it turns Zed's agent harness into a portable surface across the models the account can reach — and it does it by riding a boundary the enterprise already trusts. The coding agent inherits one auth boundary, one audit surface, and one place inference is metered. Model choice becomes a config change instead of a per-tool migration. Region and capacity strategy — inference profiles for cross-region failover, provisioned throughput for guaranteed capacity — is inherited, not rebuilt per developer. And there is no third-party gateway added to the path to get any of it. For a regulated org distributing one inference budget across many teams and use cases, that is the difference between governed model access and a fleet of ungoverned direct integrations.

But the configuration is not the agent architecture. The real architecture has three layers:

Model plane: Bedrock native Converse provider, global inference profiles, and the separate Mantle endpoint for proprietary frontier models and open-weight models
Editor plane: Zed Agent, Agent Panel, tools, and model picker
Harness plane: native Zed behavior, a local translation proxy for non-Converse models, or an external ACP/CLI agent's own routing logic

Keep those layers separate and the setup stays legible. Collapse them and every failure becomes confusing: is the model unavailable, the provider misconfigured, the account not allowlisted, the proxy translating a format wrong, the ACP bridge flattening state, or the harness choosing a different internal path?

The open thread: the OpenAI and Grok path runs through a local translation proxy, not native Zed support. I have an open issue with the Zed team asking for native Mantle support in the Bedrock provider. Until that lands, this is a working workaround, not a clean integration.

Zed Is Open Source and Can Drive Every Frontier Model on Bedrock

Amit — Mon, 06 Jul 2026 05:17:07 +0000

TL;DR

Zed is GPL v3 open source — the harness is free. Wired to Bedrock, the only bill is inference.
Native config covers the Claude family (Fable 5, Sonnet 5, Opus 4.8, Sonnet 4.6, Haiku 4.5) with a handful of JSON lines and an AWS profile.
GPT 5.5, GPT 5.4, and Grok 4.3 require a local proxy for two reasons: Bedrock issues short-lived credentials (~50 min) that Zed's static settings cannot auto-rotate; and GPT 5.x additionally needs the Responses API, which Zed does not speak natively.
A macOS LaunchAgent keeps the proxy alive at login — once set up, it is invisible.

Zed is open source under GPL v3. That matters more now than it did a year ago, because the editor is no longer just a text surface — it is the harness that runs your agentic coding work. A GPL-licensed harness that you can inspect, build from source, and extend is a different proposition than a closed one that gates model access behind its own subscription.

Wired to Amazon Bedrock, Zed becomes something specific: a single open source editor that drives frontier models from Anthropic, OpenAI, and xAI through one model picker. Claude Fable 5, GPT 5.5, Grok 4.3 — all reachable from the same agent panel, billed through the same Bedrock account, with no per-model subscription on the editor side.

That is the point. Here is the working setup.

The Native Bedrock Config

Zed's native Bedrock provider takes an AWS profile and region:

{
  "agent": {
    "default_model": {
      "provider": "amazon-bedrock",
      "model": "claude-sonnet-4-6",
      "effort": "high",
      "enable_thinking": true
    },
    "favorite_models": [
      { "provider": "amazon-bedrock", "model": "us.anthropic.claude-fable-5" },
      { "provider": "amazon-bedrock", "model": "us.anthropic.claude-sonnet-5" },
      { "provider": "amazon-bedrock", "model": "claude-opus-4-8" },
      { "provider": "amazon-bedrock", "model": "claude-sonnet-4-6" },
      { "provider": "amazon-bedrock", "model": "claude-haiku-4-5" }
    ]
  },
  "language_models": {
    "bedrock": {
      "authentication_method": "named_profile",
      "profile": "<your-aws-profile>",
      "region": "us-west-2"
    }
  }
}

Three choices worth explaining:

enable_thinking: true — Zed passes extended thinking parameters when the model supports them. For Claude Sonnet 4.5 and later the effective difference on hard reasoning tasks is measurable. The cost is latency. Leave it on for the default model, add it explicitly to favorites where you want it.

effort: "high" — Controls how much reasoning budget Zed allocates for thinking-capable models. The default is already high for Claude Opus 4.8. Set it explicitly so the config is readable.

Model IDs — Zed's native Bedrock provider uses short-form IDs for models in its internal catalog (claude-sonnet-4-6, claude-opus-4-8, claude-haiku-4-5). For newer models not yet in Zed's catalog, use the Bedrock inference profile ID directly: us.anthropic.claude-fable-5, us.anthropic.claude-sonnet-5. Both formats work in favorite_models.

The Multi-Provider Expansion

Some Bedrock accounts expose a separate OpenAI-compatible endpoint that carries frontier models beyond the standard Bedrock catalog: OpenAI GPT 5.5, GPT 5.4, and xAI Grok 4.3. These are not in the regular list-foundation-models output. They require an endpoint specifically designed for OpenAI-wire-format clients, and the API surface splits further: GPT 5.x uses the Responses API (/openai/v1/responses), while Grok uses Chat Completions (/openai/v1/chat/completions). These are different wire formats.

The proxy is needed for two reasons, not one. The first is credential rotation: Bedrock issues short-lived tokens (roughly 50 minutes) and Zed's openai_compatible provider takes a static API key in settings — it cannot auto-rotate expiring credentials. The proxy regenerates the token transparently on expiry, for all three models. The second reason is specific to GPT 5.5 and 5.4: they require the Responses API (/openai/v1/responses), which Zed does not speak natively. Grok 4.3 actually uses Chat Completions — the same wire format Zed already understands — so for Grok the proxy is purely a credential rotation bridge.

The proxy runs on localhost:4327, presents a single /v1/chat/completions endpoint to Zed, and routes each model ID to the right upstream surface with the right format. It generates short-lived credentials from the AWS profile rather than storing static keys. Grok gets reasoning_effort: "none" added to avoid spending the token budget entirely on internal reasoning before producing output.
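The per-model tweak for Grok can be isolated in one small function. This is a sketch of the idea, not the proxy's actual code: apply_overrides is a hypothetical name, matching on the xai.grok prefix is an assumption, and using setdefault (so a value sent by the client wins) is a design choice of this sketch.

```python
def apply_overrides(model, body):
    """Return a copy of the upstream request body with per-model tweaks applied."""
    out = dict(body)
    if model.startswith("xai.grok"):
        # Keep Grok from spending the whole token budget on internal reasoning.
        out.setdefault("reasoning_effort", "none")
    return out
```

Keeping overrides in one place means the routing code stays generic and each model quirk is a single visible line.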

{
  "language_models": {
    "openai_compatible": {
      "bedrock_mantle": {
        "api_url": "http://127.0.0.1:4327/v1",
        "available_models": [
          {
            "name": "openai.gpt-5.5",
            "display_name": "OpenAI GPT 5.5 (Bedrock)",
            "max_tokens": 128000,
            "supports_tools": true
          },
          {
            "name": "openai.gpt-5.4",
            "display_name": "OpenAI GPT 5.4 (Bedrock)",
            "max_tokens": 128000,
            "supports_tools": true
          },
          {
            "name": "xai.grok-4.3",
            "display_name": "xAI Grok 4.3 (Bedrock)",
            "max_tokens": 1000000,
            "supports_tools": true
          }
        ]
      }
    }
  }
}

Then add the proxy models to favorites:

{ "provider": "openai_compatible", "model": "openai.gpt-5.5" },
{ "provider": "openai_compatible", "model": "openai.gpt-5.4" },
{ "provider": "openai_compatible", "model": "xai.grok-4.3" }

Tested results: GPT 5.5 and 5.4 respond in under 5 seconds. Grok 4.3 is slower: it is a reasoning model and still spends time on internal reasoning even with the proxy sending reasoning_effort: "none". The OSS models (GPT OSS 120B, GPT OSS 20B) also work through the same proxy, but they are less interesting as favorites since Claude Haiku 4.5 is in the same speed tier and available natively.

Keeping The Proxy Alive

A proxy that dies when the terminal closes is not a usable setup. The right pattern on macOS is a LaunchAgent: a plist in ~/Library/LaunchAgents/ that starts the proxy at login and restarts it if it crashes.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.yourname.zed-bedrock-proxy</string>
  <key>ProgramArguments</key>
  <array>
    <string>/path/to/venv/bin/python</string>
    <string>/path/to/proxy.py</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>AWS_PROFILE</key>
    <string>your-bedrock-profile</string>
    <key>AWS_REGION</key>
    <string>us-east-1</string>
  </dict>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/path/to/proxy.log</string>
  <key>StandardErrorPath</key>
  <string>/path/to/proxy.err.log</string>
</dict>
</plist>

Load it with launchctl load ~/Library/LaunchAgents/<label>.plist. Verify with launchctl list | grep <label>: a numeric PID in the first column means it is running, while a "-" there means the job is loaded but not currently running.

The Zed settings then point api_url at http://127.0.0.1:4327/v1 and the proxy is just infrastructure that stays out of the way.
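If hand-editing XML is error-prone, the same LaunchAgent can be generated with Python's standard plistlib. This is a sketch: launch_agent_plist is a hypothetical helper, and every path and label passed to it is a placeholder you supply.

```python
import plistlib

def launch_agent_plist(label, python_path, script_path, profile, region, log_dir):
    """Return LaunchAgent plist bytes mirroring the XML shown above."""
    agent = {
        "Label": label,
        "ProgramArguments": [python_path, script_path],
        "EnvironmentVariables": {"AWS_PROFILE": profile, "AWS_REGION": region},
        "RunAtLoad": True,   # start at login
        "KeepAlive": True,   # restart if the proxy exits
        "StandardOutPath": f"{log_dir}/proxy.log",
        "StandardErrorPath": f"{log_dir}/proxy.err.log",
    }
    return plistlib.dumps(agent)
```

Write the returned bytes to ~/Library/LaunchAgents/<label>.plist and load it as described above.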

What This Architecture Does Not Solve

Internal model routing. Some coding agents use a fast cheap model for classification and a premium model for the visible reasoning path. Zed's native agent uses whichever model you select in the picker for the session. The proxy does not change that. If you want tiered routing, it needs to be built into the proxy or handled by an external agent through ACP.

Automatic model selection. There is no logic in this setup that chooses the right model for the task. That is a deliberate choice — the picker forces explicit selection and the favorites list keeps the set small. A picker full of models becomes a decision you make on every session. Curate the favorites to three to five models maximum.

Streaming. The proxy currently buffers the full response and returns it as a single chunk. Zed handles this gracefully, but you lose the token-by-token streaming that native models provide. For fast models (GPT 5.5, Claude) this is barely noticeable. For slower reasoning models like Grok, it means the interface shows nothing until the full response lands.
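To make "returns it as a single chunk" concrete, here is a sketch of how a finished Chat Completions response can be wrapped as a one-chunk SSE stream. It is an assumption about the shape, not the proxy's actual code, and completion_as_single_chunk is a hypothetical name.

```python
import json

def completion_as_single_chunk(completion):
    """Wrap a finished chat.completion payload as a one-chunk SSE stream (bytes)."""
    choice = completion["choices"][0]
    chunk = {
        "id": completion.get("id", ""),
        "object": "chat.completion.chunk",
        "created": completion.get("created", 0),
        "model": completion.get("model", ""),
        "choices": [{
            "index": 0,
            # The whole reply arrives as one delta instead of many small ones.
            "delta": {"role": "assistant",
                      "content": choice["message"].get("content") or ""},
            "finish_reason": choice.get("finish_reason", "stop"),
        }],
    }
    return b"data: " + json.dumps(chunk).encode() + b"\n\ndata: [DONE]\n\n"
```

The client sees a valid stream, but nothing is written until the upstream call has fully completed, which is why slow reasoning models appear to hang.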

The Three-Layer Model

The setup becomes legible when you keep the three layers separate:

Layer	What It Is	Who Owns It
Model plane	Bedrock: account access, region, billing, model availability	AWS + your account configuration
Editor plane	Zed: harness, tools, file context, model picker, thread surface	Zed Industries
Adapter layer	Local proxy: API format translation, credential rotation, model routing	You

The adapter layer exists because of two concrete gaps: Bedrock issues short-lived credentials that Zed's static config cannot rotate, and GPT 5.x requires the Responses API while Zed speaks Chat Completions. The proxy bridges both.

So What

One editor, eight frontier models, no separate model subscriptions, credentials from the profile already configured for everything else. The cost is Bedrock inference pricing, which is the same whether you reach the model through a UI or through Zed.

The open thread: the model is available. Whether Zed's harness gets the best out of GPT 5.5 or Grok — tool behavior, context management, multi-turn coherence — is not something I have measured yet. Having the model in the picker is not the same as knowing the harness is well-matched to it.

Your Agent Doesn't Need the Secret. It Needs the Result.

Amit — Mon, 06 Jul 2026 02:38:07 +0000

TL;DR

The standard agent-credential pattern — vault stores the secret, agent reads it via op read or env injection — still puts the secret value inside the agent's process, even if only for milliseconds.
That's the actual threat surface: not whether the vault is encrypted at rest, but what a single manipulated tool call can exfiltrate while the agent is running.
Three patterns are converging to close that gap: scope vaults to the task instead of the person, issue tokens that expire with the task instead of the session, and broker credentials at the network layer so the agent never possesses what it authenticates with.
Open-source proxies that do exactly the third thing — sit between the agent and the internet, inject the right credential per destination, return only the response — already exist and are self-hostable.
The question worth asking shifts from "where do my secrets live" to "does my agent ever need to hold one."

If you've wired vault-backed credentials into an agent setup, you've made a call that feels obviously correct: stop putting API keys in .env files and shell profiles, store them in a vault, pull them at runtime. op read, op run, scoped service account tokens — the mechanics are solved and well-documented.

Here's the question that call doesn't answer: what happens in the half-second after the agent reads that secret into its own process?

The pattern — vault → op read → environment variable → agent subprocess — is a real improvement over plaintext files checked into a repo. The fourth post in this series covered exactly this shape: an M2M secret resolved into memory just long enough to acquire a token, then discarded. That resolution window is still a window. The secret ends up inside the agent's memory, in its environment, reachable by anything the agent can be talked into doing. A prompt injection that gets an agent to print $API_KEY or send it to an attacker-controlled endpoint doesn't care that the key came from a vault five minutes earlier, or that it was only there for milliseconds. It cares that the key is sitting in os.environ right now.

This isn't hypothetical. SANS frames it precisely: an agent holding valid credentials with a manipulable prompt is a confused deputy — an entity with the right to act, tricked into acting on someone else's behalf. The vault did its job. The leak happens one layer up, in the part of the system the vault was never built to protect.

What's converging to close the gap

Three patterns are showing up across security research and tooling released this year, and they share one property: each one reduces what the agent can hold — not where the secret is stored.

Scope the vault to the task, not to the person. One vault holding everything an agent might ever touch is one stolen token away from everything. The fix mirrors what cloud teams already do with IAM roles: a deploy role that can deploy and nothing else, a read role that can read and nothing else. Apply the same split here — a narrow vault per trust boundary, a separate scoped token per vault, read-only by default. Compromise one, and an attacker gets one bounded set of secrets, not the keys to everything you own.

Make the token's lifetime match the task's lifetime, not the session's. Several 2026 writeups land on the same arithmetic: a two-minute agent run backed by a sixty-minute token has an exposure window thirty times longer than the work that justified it. NIST's NCCoE published a concept paper in February naming SPIFFE/SPIRE, OAuth 2.0, and zero-trust identity as the standards under consideration specifically for agent identity — all three assume a credential should expire with the task it was issued for, not the human's coffee break.
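The arithmetic above is simple enough to check directly. A minimal sketch, where `exposure_ratio` is a hypothetical helper name and the two durations are the ones quoted in the paragraph:

```python
from datetime import timedelta


def exposure_ratio(task_duration: timedelta, token_ttl: timedelta) -> float:
    """Return how many times longer the token stays valid than the task runs."""
    if task_duration <= timedelta(0):
        raise ValueError("task_duration must be positive")
    # Dividing one timedelta by another yields a plain float.
    return token_ttl / task_duration


# The example from the text: a two-minute run backed by a sixty-minute token.
ratio = exposure_ratio(timedelta(minutes=2), timedelta(minutes=60))
print(ratio)  # 30.0
```

A ratio of 1.0 is the target the paragraph describes: the credential expires when the task does.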

Stop handing the secret to the agent at all. This is the one that changes the architecture, not just the timer. Agent Vault, an open-source project from Infisical, puts a proxy between the agent and the internet. The agent's outbound traffic routes through it; the proxy terminates the connection, recognizes the destination host, injects the matching credential, and forwards the request. The agent receives the API response. It never receives the API key. Route every outbound call through a broker that authenticates on the agent's behalf, and a prompt injection that convinces the agent to dump its credentials finds nothing to dump.
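To make the brokering step concrete, here is a minimal sketch of the injection logic only. It is not Agent Vault's implementation or API: `OutboundRequest`, `BROKER_CREDENTIALS`, `inject_credential`, the host name, and the placeholder key are all hypothetical, and no network I/O happens. The point it illustrates is that the credential table lives in the broker, and the request the agent built never contains the secret.

```python
from dataclasses import dataclass, field


@dataclass
class OutboundRequest:
    host: str
    path: str
    headers: dict[str, str] = field(default_factory=dict)


# Hypothetical broker-side table: destination host -> credential.
# In this sketch it exists only in the broker, never in the agent.
BROKER_CREDENTIALS = {"api.payments.example": "placeholder-key-held-by-broker"}


def inject_credential(
    request: OutboundRequest, credentials: dict[str, str]
) -> OutboundRequest:
    """Return a copy of the request with the credential for its host attached."""
    secret = credentials.get(request.host)
    if secret is None:
        # Unknown destinations are refused rather than forwarded unauthenticated.
        raise PermissionError(f"no credential brokered for {request.host}")
    headers = {**request.headers, "Authorization": f"Bearer {secret}"}
    return OutboundRequest(request.host, request.path, headers)


# The agent builds a request without any credential...
agent_request = OutboundRequest("api.payments.example", "/v1/charges")
# ...and the broker attaches one on the way out.
forwarded = inject_credential(agent_request, BROKER_CREDENTIALS)
```

Because `inject_credential` returns a copy, the agent's own request object stays credential-free even after forwarding. A real broker would also need TLS termination and response handling, which this sketch leaves out.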

What's missing from the current default

The pattern most builders land on still resolves the secret into the agent's own runtime. op read puts the value in a variable. op run puts it in the environment. Both are real improvements over a plaintext file in a repo. Neither removes the secret from the blast radius of a single bad tool call.

That's not an argument against vault-backed credentials. It's the floor, not the ceiling — and it's an argument for asking the next question: of the credentials your agent currently holds, how many does it actually need to hold, versus simply needing the result of using one?

An agent that calls a payment API doesn't need the API key. It needs "charge succeeded" or "charge failed." An agent that queries a database doesn't need the connection string. It needs the rows back. In both cases the credential is a means to an end the agent doesn't care about — and every means it's carrying is something it can be talked into handing over.

So what

The question worth sitting with isn't "where do my secrets live." Vaults answer that one well. The question is: does my agent, at any point in its execution, possess a value that — if pulled out through one manipulated tool call — would matter?

If the answer is yes, the fix isn't a better vault. It's removing the agent from the chain of custody — narrower scopes so a leak is bounded, shorter-lived tokens so a leak expires fast, and brokered access so there's nothing to leak in the first place. The agent gets the outcome it asked for. It never gets the thing that produced it.

This is the eighth post in a series on 1Password as infrastructure. It extends the fourth, which treated a secret resolved into agent memory as the trust boundary, and the seventh, which scoped that resolution to a narrower vault and token. This post argues both are still the floor: the fix that removes the exposure window entirely is brokering, not scoping. Start from the first post.

One Token for Everything Is Still One Token

Amit — Mon, 06 Jul 2026 02:37:32 +0000

TL;DR

The third post in this series made the case for scoping a service account token to a workflow's actual purpose instead of "the project" or "everything I have." Easy to agree with. The test is putting it into practice on a box that runs agents continuously — not a CI runner that exits when the job ends.
That means a narrower vault for each purpose, with its own token, so a leak is bounded to exactly what that purpose touches, not "everything this box happens to do."
The concrete case: a flat secrets file holding five mixed-purpose values (including a search API key, an email API key, and two internal dispatch secrets), plus config that isn't secret at all, moved into a vault scoped to exactly those five, behind a read-only token that can see nothing else.
First attempt to populate that vault from the scoped token failed with a permission error. That's not friction to route around. That's separation of duties working — the box that runs agents consumes secrets, a separate session with a human behind it authors them.
op run wraps the tool's launch command and resolves vault references into its environment at start time — the tool reads process.env.API_KEY exactly as before and never imports op.

The third post in this series made the case for scoping a service account token to the workflow's actual purpose rather than to "the project" or "everything I've got" — a ci-deploy token that can see two items, not a master grant that happens to be wearing a service-account costume. The argument is right. Here's what it looks like to actually live inside it — on a box that runs agents, not a CI runner that exits when the job finishes.

The difference matters because the failure mode is different. A CI token's exposure window is the length of a job. An agent box's service account sits there, live, for as long as the box is up — reachable by anything the agent can be talked into doing, on every run, indefinitely. The scoping argument isn't theoretical there. It's the only thing standing between "an agent got confused once" and "an agent got confused once, near a token that could read everything."

What scoping actually requires you to give up

The instinct, even after you've accepted the scoping argument in principle, is to scope generously "to be safe" — one vault per box rather than one per purpose, on the theory that fewer vaults means less to manage. That's the same mistake at one remove: a box that runs five different kinds of agent work ends up with one token that can read all five kinds of secret, and the blast radius is back to "everything this box touches," just relabeled.

The harder, correct version: scope to the purpose, even when the purpose is small and the setup feels like overkill for it.

Scope the vault, then scope the token

Putting the principle into practice is structural, not procedural. Don't build a token with broad reach and trust it to behave — build a vault with narrow contents, and a token that can only ever see that vault. The token stops being something you have to trust, and becomes something that's simply incapable of the wider thing.

The concrete version of this: a flat secrets file on a remote box held five mixed-purpose values (among them a search API key, an email API key, and two internal dispatch secrets) sitting alongside config values that weren't secrets at all. Anything that could read that file could read all five, regardless of which one a given task actually needed. The file was the vault, and it had no concept of scope.

The fix: a new 1Password vault holding exactly those five items and nothing else. A service account scoped to read-only access on that one vault — not "all vaults," not "everything I have." The flat file reduced to a single line: the token that opens that one narrow door, with a comment explaining what it can and can't do and where the actual secrets live now.

Compromise that token, and an attacker gets five bounded values scoped to one box's workflows — not the keys to everything the vault's owner has ever stored.

Testing the boundary — on purpose

The first attempt to populate the new vault came from the scoped token itself, and it failed:

[ERROR] You do not have permission to perform this action

That's the system working exactly as configured. Read-only means read-only — including for the person who set it up, if they're using the scoped credential to do it. Population had to happen from a separate session, on a separate machine, with a human signed into the full account behind it.

That friction is the point, not a bug to engineer around. It draws a hard line between two roles that are easy to blur when one person is doing both: the place that runs agents, which only ever needs to consume credentials, and the place that authors credentials, which needs write access and a human in the loop. Collapse those into the same session and you've quietly rebuilt the single point of failure you just finished scoping away. Keep them separate, and a compromised agent runtime literally cannot rewrite its own permissions — there's no code path for it.

Making the tool vault-transparent

None of this required touching the tool that actually uses the search API key. It already checked an environment variable first and fell back to a local config file second — written months before any of this existed. The sixth post in this series covers exactly this shape: wrap the launch command in op run --env-file=..., and the vault reference resolves into the process environment at the moment the tool starts. The tool reads process.env.API_KEY exactly like it always did. The only thing that changed is where that value originates — and the tool has no way to know, or care.
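The lookup order that makes a tool vault-transparent can be sketched in a few lines. This is an illustration in Python rather than the tool's own code (the original reads `process.env.API_KEY`); `load_api_key`, the `api_key` config field, and the JSON config format are assumptions made for the example.

```python
import json
import os
from pathlib import Path


def load_api_key(config_path: Path, env_var: str = "API_KEY") -> str:
    """Return the key from the environment if set, else from a local config file."""
    # 1. Environment first: this is where a wrapper such as op run places the value.
    value = os.environ.get(env_var)
    if value:
        return value
    # 2. Fall back to a local JSON config file (hypothetical format).
    if config_path.exists():
        config = json.loads(config_path.read_text())
        if "api_key" in config:
            return config["api_key"]
    raise RuntimeError(f"{env_var} is not set and {config_path} has no api_key")
```

The function has no knowledge of the vault. Whether the environment variable was exported by hand or resolved by a wrapper at launch, the tool takes the same code path.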

Wrap what you didn't write. Don't fork it to teach it about the vault.

What's still true after all of this

Scoping the vault and the token shrinks the blast radius from "everything" to "exactly what this one workflow needs." It's a real reduction — five bounded values instead of an open door to the whole account.

It does not make the secret disappear. It still lands in the tool's process environment for the life of that process, the same as it always did — just for a narrower set of values, reachable by a narrower set of things. Smaller blast radius. Not zero. That's the next question worth sitting with, and one worth coming back to.

So what

Agreeing that a token should be scoped to its purpose costs nothing. Building the vault that makes the wider grant impossible — rather than merely discouraged — costs an afternoon, a permission error you have to sit with instead of route around, and the discipline to keep "the box's vault" from quietly becoming "the project's vault" the next time something new needs a credential.

That's what the principle looks like once it has to survive contact with a box that's up at 3am running something you didn't watch it start. Five bounded values, one read-only token, a permission error that fired exactly when it should have. Boring. Which is the point — boring is what it looks like when the blast radius was decided in advance instead of discovered during incident response.

This post extends the third and sixth posts in the eight-post series on 1Password as infrastructure — applying the service-account and op run patterns with a narrower scope than "one token for everything." The eighth post goes further: a scoped, read-only token is still a value that lands in the tool's process. Brokering removes it from that process entirely. Start from the first post.

Go-to-Market Was a Relay Function. AI Makes It a Decision Function.

Amit — Mon, 22 Jun 2026 19:59:45 +0000

If you work in go-to-market — sales, partnerships, field strategy, enablement — you've probably noticed that your AI-augmented output now spans territory that used to belong to 4-5 separate functions. You can synthesize market signal across dozens of accounts, generate competitive positioning, model acquisition economics, and draft partnership frameworks — all in the same week, all by yourself.

The problem isn't capability. The problem is that your role was designed before any of this was possible.

The Relay Architecture

Here's how go-to-market has historically worked inside most technology companies:

Engineering builds → Product defines → Go-to-market carries it to customers.

The go-to-market function sits downstream. It doesn't decide what gets built, how it's priced, or which markets to enter. It executes against those decisions. The value was in translation and distribution — making complex things legible to buyers, then relaying buyer feedback back upstream.

In practice, this means:

Strategy teams synthesize market signal (not you)
Product marketing owns positioning (not you)
Finance models acquisition economics (not you)
Corporate development structures partnerships (not you)
Product management decides what goes on the roadmap (not you)

Your job? Cover your accounts. Generate pipeline. Close deals. Relay feedback. Repeat.

The a16z framework for scaling go-to-market orgs describes this architecture clearly: go-to-market leaders "operate alongside a small team — as both a player and a coach — to do whatever it takes to close deals or reach customers." The scope is execution, not strategy. The decisions happen elsewhere.

Why It Was Designed This Way

This wasn't arbitrary. Pre-AI, the relay model made sense because human bandwidth was the binding constraint.

One person couldn't simultaneously track 50 accounts, generate positioning variations, model unit economics, negotiate partnership terms, AND synthesize field evidence for product roadmap decisions. That workload required specialization. You needed a marketing team, a strategy team, a rev ops team, a partner team, and a product team — each owning their slice.

The relay architecture was a rational response to bandwidth scarcity. Every layer between an idea and its execution existed because no single person could span the full loop.

The Collapse

AI removes the bandwidth constraint. Not entirely — but enough to break the relay model's assumptions.

A Norwest Venture Partners analysis of the emerging "GTM Engineer" role documents this collapse: one person now "owns the knowledge base, the systems layer, how agents interact with each other, and the entire lead flow process." Their benchmark data shows companies using AI tools are 3x more likely to have raised revenue targets, while SDR and BDR hiring is declining.

Clay, the data enrichment platform, frames their version more sharply: the GTM Engineer "collapses SDR, AE, and SE roles into one." Three specialized relay nodes become one decision-maker with AI agents handling execution.

What one AI-native person in a go-to-market function can now span:

Previously Required	Now Possible Solo
Strategy team synthesizes market signal	AI scans 50+ accounts, surfaces patterns in real-time
Marketing + agencies generate positioning	One person generates, tests, iterates positioning daily
RevOps/Finance models acquisition economics	AI models unit economics on the fly
BD + legal drafts partnership frameworks	One person structures, negotiates, iterates frameworks
PM synthesizes field feedback into roadmap	Quantified evidence delivered directly to product leadership

The relay function collapses into a decision loop: sense market → synthesize → decide → execute → measure → iterate. One person. One loop. AI handles the execution layer.

The Structural Gap

Here's what's actually happening: the person evolved, but the container didn't.

Microsoft's 2026 Work Trend Index calls this phenomenon "blocked agency" — employees whose AI-augmented capabilities exceed what their role definition allows them to act on. Among AI users surveyed, 58% say they're producing work they couldn't have done a year ago. But only 13% say they're rewarded for reinventing their work.

Jared Spataro, Microsoft's CMO for AI at Work, puts it directly: "When the system itself has a governor on the speed that it can go, it doesn't matter how fast an individual can run."

A Workday study from January 2026 frames the same gap: "Employees are using 2025 tools inside 2015 job structures." Less than half of roles have been updated to reflect AI capabilities.

BCG's June 2026 research quantifies the tension: 67% of regular AI users report improved job satisfaction, but 41% simultaneously report increased cognitive load. They call it the "joy paradox" — AI makes the work better while making the role container feel more constraining.

For someone in go-to-market, this lands as a specific frustration: you're operating across product influence, market strategy, customer acquisition, and partnerships — but your decision rights, metrics, and scope are still drawn around the relay function. You're measured on pipeline generated, not on strategic decisions made.

What Needs to Change

The question isn't whether go-to-market teams should adopt AI tools. Gartner projects 95% of seller research workflows will begin with AI by 2027, up from less than 20% in 2024. Adoption is happening.

The question is whether organizations will redesign go-to-market roles to match the expanded agency those tools unlock.

The World Economic Forum frames the destination clearly: "Human value moves to the work around execution — context, responsibility, trust, and decision-making authority."

For go-to-market specifically, that means transitioning from:

Relay (carry decisions downstream, relay feedback upstream) → Decision (synthesize signal, make strategic calls, own outcomes across product, market, and customer dimensions)

The expanded scope isn't "do the same work faster." It's a category shift from execution against someone else's decisions to making the decisions yourself — across product influence, market strategy, customer acquisition, and partnerships.

BCG warns that 50-55% of US jobs will be reshaped by AI in the next 2-3 years. For go-to-market roles, "reshaped" doesn't mean automated away. It means the scope explodes while the org chart stays frozen.

The Unresolved Thing

I don't have a clean answer for how to navigate this transition inside an existing organization. The research is clear on what needs to happen — role redesign, expanded decision rights, new metrics. But the practical path from "I'm operating at expanded scope" to "my org acknowledges and structures for that scope" remains unsolved.

The Augment Code survey found that only 19 of 219 engineering leaders have updated role definitions to match jobs that have fundamentally changed. The words respondents used to describe how they feel: "excited, anxious, invigorated" — all at the same time. That tracks.

What I know: the AI-native practitioners who are furthest ahead on the adoption curve — Microsoft's data puts this group at roughly 16% of the workforce — are the ones feeling this tension most acutely. The capability is real. The organizational permission isn't.

The relay model served its purpose when bandwidth was the constraint. Bandwidth isn't the constraint anymore. The constraint is organizational imagination.

Reset Windows Are Product Design

Amit — Sat, 06 Jun 2026 22:21:20 +0000

TL;DR

AI subscriptions do not primarily differ by model quality anymore. They differ by reset window design.
Claude uses a five-hour reset after you hit the included limit. Perplexity restores each Pro Search credit exactly 24 hours after you use it. Devin mixes daily and weekly quota. Copilot and Cursor are closer to monthly credit accounting.
These windows shape user behavior. Burst windows reward sprints. Rolling restore windows reward daily pacing. Weekly quotas reward batching. Monthly credit buckets reward explicit budget management.
The useful comparison is not unlimited versus limited. It is whether the reset logic matches the actual rhythm of the work.
The market still hides this layer too often. Reset mechanics should be front-page product information, not something users reverse-engineer after they stall out mid-task.

The most important design choice in an AI subscription is usually not the model. It is the reset window.

That sounds like billing trivia until you use these products heavily. Then it becomes obvious that a five-hour reset, a rolling 24-hour restore, a daily cap, and a monthly credit bucket do not feel remotely the same. They create different working habits, different failure modes, and different emotional contracts between the user and the vendor.

This is why the current AI subscription market is so confusing. The checkout pages look similar. The behavior design underneath is not.

The Real UX Layer

Here is the layer that matters.

Product	Publicly documented reset design	What it teaches the user
Claude	Included usage resets every five hours once you hit the limit	Work in hard sprints, then stop or pay
Perplexity Pro	Each Pro Search credit returns 24 hours after use	Spend steadily; each search has its own timer
Devin Pro / Max	Pro: daily and weekly quota. Max: weekly quota, no daily cap	Scope jobs and batch autonomous work deliberately
GitHub Copilot	Monthly AI credit allowance	Treat agents as metered features inside a seat
Cursor	Monthly billing cycle with included usage and paid overage	Treat the subscription like a budget starter pack
Gemini	Mostly daily feature limits that can change	Use broadly, but within daily feature lanes
ChatGPT + Codex	Reset exists, but no single universal public window across plans	Learn the system by usage, not by policy page

That table is product design, not accounting.

The reset logic tells users how much the vendor wants usage to spike, how much it wants usage to smooth out, and how much operational unpredictability it is willing to surface.

Burst Windows Produce Sprint Behavior

Claude's paid-plan usage docs are unusually explicit. When you hit the included limit, the plan resets every five hours. If you enable usage credits, you can keep going at standard API rates instead of waiting.

That produces a very specific user rhythm.

You do not casually idle in Claude the way you idle in a chat app. The optimal behavior is to line up work, enter with a bounded task, push hard, and exit when the work is done. The window trains you toward burst discipline because stray conversation and long transcript drift are no longer harmless. They cannibalize the same five-hour allowance you need for the next real task.

This is one reason Anthropic's own guidance emphasizes shorter conversations, lighter tool usage, and fresh threads for new topics. The reset window and the usage advice are the same design choice seen from two angles.

Five hours is not just a limit. It is a behavior shaper.

Rolling Restore Windows Produce Pacing

Perplexity Pro uses a different philosophy. Users get at least 300 Pro Searches per day, and each credit is restored exactly 24 hours after it is used.

That is cleaner than the classic midnight reset model because it maps the budget to actual activity instead of the calendar. If I spend 40 searches at 2:15 PM, those 40 searches come back at 2:15 PM tomorrow. The system is local to my behavior.
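A rolling restore window is easy to model: every spent credit carries its own timestamp and returns once that timestamp is 24 hours old. The sketch below is an illustration of the mechanism as described, not Perplexity's implementation; `RollingCreditPool` is a hypothetical name, and the capacity of 300 is taken from the "at least 300" figure above.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingCreditPool:
    """Credits that each come back a fixed interval after they were spent."""

    def __init__(self, capacity: int, restore_after: timedelta = timedelta(hours=24)):
        self.capacity = capacity
        self.restore_after = restore_after
        # Spend timestamps of credits not yet restored, oldest first.
        # Assumes calls arrive in non-decreasing time order.
        self._spent_at: deque[datetime] = deque()

    def available(self, now: datetime) -> int:
        # Drop every spend whose restore time has passed.
        while self._spent_at and now - self._spent_at[0] >= self.restore_after:
            self._spent_at.popleft()
        return self.capacity - len(self._spent_at)

    def spend(self, now: datetime, count: int = 1) -> None:
        if count > self.available(now):
            raise RuntimeError("not enough credits")
        self._spent_at.extend([now] * count)


pool = RollingCreditPool(capacity=300)
spent_time = datetime(2026, 6, 6, 14, 15)  # 2:15 PM
pool.spend(spent_time, 40)
print(pool.available(spent_time))                        # 260
print(pool.available(spent_time + timedelta(hours=24)))  # 300
```

One minute before the 24-hour mark the 40 credits are still out; at 2:15 PM the next day they are back. There is no midnight boundary anywhere in the model.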

That design quietly encourages pacing over bingeing. It discourages the feeling that you should burn everything before midnight because tomorrow is a fresh bucket anyway. It also makes the product easier to reason about when the unit of work is discrete, as search requests generally are.

This is why research subscriptions often feel calmer than coding subscriptions. Search is naturally chunked. Autonomous coding work is not.

Daily and Weekly Quotas Produce Portfolio Thinking

Devin's self-serve billing docs are the clearest example of an agent-native reset model. Pro combines daily and weekly quota. Max removes the daily cap and keeps the weekly quota. The usage docs add two important details: idle sleep does not materially consume usage, and there is no limit on simultaneous sessions.

Those policies change the operational mental model completely.

You stop thinking in terms of one conversation. You start thinking like a portfolio manager. Which jobs deserve to run today? Which ones can wait until tomorrow's daily refresh? Which ones are worth spending from the weekly pool because they may unlock downstream work? Which tasks should be split into smaller parallel sessions because that improves throughput without wasting idle time?

That is not a chat product mentality. That is queue design.

The weekly layer matters because autonomous agents often have spiky value. Some days you want ten scoped runs. Some days you want zero. A weekly envelope absorbs that variation better than a strict daily wall, but only if the daily cap does not pinch too early. Devin Max is effectively selling that flexibility.
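The difference between a daily-plus-weekly plan and a weekly-only plan comes down to one extra check. A minimal sketch of that admission logic follows; `QuotaPlan`, `can_run`, and every number in it are hypothetical, since the actual quota units and values are not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPlan:
    weekly_cap: int
    daily_cap: Optional[int] = None  # None models a plan with no daily cap


def can_run(plan: QuotaPlan, cost: int, used_today: int, used_this_week: int) -> bool:
    """Return True if a job of the given cost fits inside every applicable cap."""
    if used_this_week + cost > plan.weekly_cap:
        return False
    if plan.daily_cap is not None and used_today + cost > plan.daily_cap:
        return False
    return True


# Illustrative numbers only: same weekly envelope, with and without a daily wall.
daily_and_weekly = QuotaPlan(weekly_cap=100, daily_cap=20)
weekly_only = QuotaPlan(weekly_cap=100)

# A heavy day: 15 units used today, 40 this week, and a new job costing 10.
print(can_run(daily_and_weekly, cost=10, used_today=15, used_this_week=40))  # False
print(can_run(weekly_only, cost=10, used_today=15, used_this_week=40))       # True
```

The same job is refused under the daily cap and admitted under the weekly-only plan, which is the "pinch" the paragraph describes.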

Monthly Buckets Produce Budget Awareness

Cursor's pricing docs say Pro includes $20 of API agent usage plus bonus usage, and its billing docs tie the reset to the billing cycle. GitHub Copilot's pricing and billing docs do the same with AI credits layered on top of unlimited completions.

Monthly buckets produce a different behavior again.

They do not tell you when to work during the day. They tell you how honest to be about your budget. This is where Cursor's documentation is strong: it explicitly says daily agent users often land above the sticker price. The user is invited to think in monthly spend, not monthly entitlement.

Monthly resets fit better when the work itself is already budgeted monthly. Teams buy seats. Individuals expense subscriptions. Managers reconcile spend at the end of the month. The product behavior lines up with the purchasing system.

The downside is that the system can hide waste longer. A bad five-hour window hurts immediately. A sloppy monthly bucket can drift for three weeks before anyone notices the overage logic.

Opaque Windows Create Learned Helplessness

This is where ChatGPT + Codex becomes interesting.

OpenAI is becoming more transparent under the hood. Codex moved to token-based credit pricing on April 2, 2026. Flexible credits for Plus and Pro make the overflow path explicit. But the reset layer for the bundled consumer experience is still harder to reason about than Claude, Perplexity, Devin, or Cursor.

That opacity matters more than people admit. When users cannot predict when capacity will come back, they start self-throttling in ways the product did not intend. They save prompts. They avoid ambitious tasks. Or they push until failure and then experience the cutoff as arbitrary punishment. None of that is good design.

This is the hidden cost of soft boundaries. They feel friendly at signup. They become fuzzy and stressful under load.

Why This Matters More For Agents

Reset windows mattered less when these tools were mostly chat.

They matter much more when the product is expected to run autonomous loops, use tools, inspect repos, or produce long artifacts. "How Do AI Agents Spend Your Money?" found agentic coding tasks can consume roughly 1000 times more tokens than simpler coding interactions, with up to 30 times variance on the same task. That means the reset system is no longer a background billing detail. It directly governs whether users trust the product enough to hand over real work.

The harder the product leans into autonomous behavior, the more the reset policy becomes part of the product surface.

This is why AI subscriptions are drifting away from classic SaaS psychology. A seat that can burn unpredictable compute in bursts does not behave like email, storage, or project management software. It behaves more like a gateway to a volatile infrastructure budget with a UX wrapper on top.

So What

The best way to compare these products is not by asking which one has the highest headline limit.

The better question is: which reset window matches the natural rhythm of the work?

If the work comes in intense bursts, a five-hour system like Claude makes sense. If the work is steady daily research, Perplexity's rolling restore model is better. If the work is a queue of scoped autonomous jobs, weekly quota plus idle sleep semantics, like Devin's, is much closer to the actual workload. If the work is fundamentally monthly budget management, Cursor and Copilot are more honest contracts.

The missing piece is standardization. Vendors still talk about models, features, and price more clearly than they talk about reset mechanics, even though reset mechanics do more to determine how the subscription actually feels.

The open thread I am still stuck on: do these products eventually standardize around explicit reset-language the way cloud products standardized around pricing primitives, or do vendors keep treating reset logic as a soft, semi-hidden UX lever because ambiguity sells better than clarity?

Part 2 of the Agent Economics series.
← Part 1: AI Subscriptions Are Secretly Usage Models · Part 3: Autonomous Agents Break Flat-Rate Pricing →