DEV Community: David Moores

How Many Endpoints Does It Take to Ask 'How Was Your Experience?'

David Moores — Mon, 20 Jul 2026 07:50:00 +0000

Cross-posted from carrick.tools.

Every developer has looked at a survey widget, the little "How was your experience? 😞 😐 😊" popover, and thought: I could build that in a weekend. And you could. A form, a POST /responses, a GET /surveys, somewhere to look at the answers. Six endpoints, maybe eight if you're feeling enterprise about it.

Formbricks is that product, built properly, in the open, over several years. I recently indexed its entire system as part of a series of scans I've been running against open-source projects: the main app, its two SDKs and its n8n integration, four repositories in total, with every API endpoint and every outgoing call extracted straight from the source.

The survey is six endpoints. The product is 167.

This post is about where the other 161 come from. Not because they shouldn't exist. Spoiler: every single one of them earns its place. But the gap between "the thing" and "the product around the thing" is the most underestimated number in software, and for once we get to look at it directly.

The weekend version

Here is the part of Formbricks that matches the product on the landing page:

GET   /api/v1/client/:workspaceId/environment     # what surveys should I show?
POST  /api/v1/client/:workspaceId/displays        # I showed one
POST  /api/v1/client/:workspaceId/responses       # here's an answer
PUT   /api/v1/client/:workspaceId/responses/:id   # here's the rest of the answer
POST  /api/v1/client/:workspaceId/user            # who's answering
POST  /api/v1/client/:workspaceId/storage         # they uploaded a file

Six endpoints. Fetch the surveys, show one, collect the answer. If you squint, it's the weekend project.

Now let's leave the landing page.

Browsers happen

Those six endpoints get called from other people's websites, and browsers have rules about that. Before a browser lets a page on someone else's domain call your API, it sends a preflight request to check it's allowed. So nearly every client endpoint needs a matching OPTIONS handler just to say yes.
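To make the preflight concrete, here is a minimal sketch of what one of those OPTIONS handlers has to decide. It is not Formbricks' code: the function name, the allowed-method list, the wildcard origin and the 403 for refused methods are all assumptions made for the example.

```typescript
// Minimal sketch of a CORS preflight responder. All names are illustrative.
interface PreflightRequest {
  method: string;
  headers: Record<string, string | undefined>; // header names lower-cased
}

interface SimpleResponse {
  status: number;
  headers: Record<string, string>;
}

// Hypothetical policy: the methods used by the six client endpoints.
const ALLOWED_METHODS = ["GET", "POST", "PUT"];

function handlePreflight(req: PreflightRequest): SimpleResponse | null {
  // A preflight is an OPTIONS request that names the method it wants to use.
  const requested = req.headers["access-control-request-method"];
  if (req.method !== "OPTIONS" || requested === undefined) {
    return null; // not a preflight; let the real handler run
  }
  if (ALLOWED_METHODS.indexOf(requested.toUpperCase()) === -1) {
    // Design choice for this sketch: refuse explicitly rather than
    // answering without CORS headers.
    return { status: 403, headers: {} };
  }
  return {
    status: 204,
    headers: {
      // An SDK endpoint is called from arbitrary customer websites.
      "Access-Control-Allow-Origin": "*",
      "Access-Control-Allow-Methods": ALLOWED_METHODS.join(", "),
      "Access-Control-Allow-Headers":
        req.headers["access-control-request-headers"] ?? "Content-Type",
      "Access-Control-Max-Age": "86400", // let the browser cache the answer for a day
    },
  };
}
```

Every route that a browser calls cross-origin needs some version of this answer, which is why the handler count grows without the feature set growing.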

Then there's versioning. The Formbricks SDK is installed on other people's websites, and you can't force the whole internet to upgrade on your release schedule. So the client API exists twice, as v1 and v2, running side by side, indefinitely.

Six endpoints are now 27. Nothing happened. No feature was added. The survey got zero better. The count more than quadrupled because browsers and old SDK versions are the physics you operate in.

(A detail from the scan that shows versioning in real life: Formbricks' two SDKs currently call different versions of the same endpoint. The React Native SDK moved to v2, the JavaScript SDK is still on v1. Both work fine. That's what a migration looks like while it's happening.)

Scripts happen

The moment real teams use your product, someone wants to do everything the UI does from code. So: a management API. Surveys, responses, contacts, contact attributes, webhooks, file storage, all with create, read, update and delete, spread across v1 and v2 like everything else.

That's 46 endpoints. It's the least surprising section of the codebase, and it's also more than a quarter of the entire API, for a capability no screenshot will ever show.

Then there's a newer v3 API, 20 more endpoints, where you can see the product's future arriving: POST /api/v3/surveys/generate, survey templates, validation, a whole feedback-classification pipeline. The AI-era surface, growing alongside the two generations it hasn't replaced.

Big companies happen

Login in the weekend version is a session cookie. Login in the product is 29 endpoints, and every line in the list below is something a customer once refused to sign without:

GET|POST  /api/auth/**                        # the auth framework's catch-all
GET       /api/auth/saml/authorize            # enterprise single sign-on
POST      /api/auth/saml/callback
POST      /api/auth/saml/token
GET       /api/auth/saml/userinfo
GET       /api/auth/sso/recovery/complete     # for when SSO locks everyone out
GET       /.well-known/openid-configuration/**
GET       /.well-known/oauth-authorization-server/**
GET       /.well-known/oauth-protected-resource/**
ALL       /api/envoy-auth/**                  # 7 methods
ALL       /api/traefik-auth/**                # 7 methods
POST      /api/mcp                            # your AI agent wants in too

SAML is the login system large companies insist on, and it takes four endpoints plus a recovery route for the day it locks everybody out.

My favourite two lines are envoy-auth and traefik-auth. People who self-host Formbricks put it behind a gateway (a server that sits in front of your app and filters traffic), and the two popular gateways each have their own way of asking "is this request allowed?" So the app answers that question twice, in two dialects, across fourteen endpoint-and-method combinations. That's not login for users. That's login for the infrastructure standing in front of the app.

And those .well-known addresses? They make the survey tool a full OAuth provider, the same kind of "Sign in with..." system Google or GitHub runs. That surface exists for AI. The metadata endpoint recognises exactly one resource, /api/mcp, the route where AI assistants connect, and the configuration lives in a file called mcp-oauth-provider-options.ts with self-service client registration switched on.

The MCP specification requires OAuth for HTTP servers, so none of this is a quirk of Formbricks — it is what conformance looks like, and it is spreading. Two years ago this layer did not exist for anybody. Now a survey tool runs its own login provider so that AI agents can register themselves and sign in to ask it questions. The fringe grew a fringe.

Other products happen

Nobody wants survey answers to live in the survey tool. They want them in Slack, Notion, Airtable and Google Sheets. Connecting to each of those means an OAuth handshake: you get sent to their consent screen, click Allow, and get redirected back. The app hosts 9 endpoints whose only job is to catch those redirects:

GET  /api/v1/integrations/slack      + /callback
GET  /api/v1/integrations/notion     + /callback
GET  /api/v1/integrations/airtable   + /callback  + /tables
GET  /api/google-sheet               + /callback
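For readers who have not written one, this is roughly what each of those callback routes has to check before it can swap the code for a token. It is a generic OAuth 2.0 authorization-code sketch, not the Formbricks handlers; the type and function names are hypothetical.

```typescript
// Query parameters an OAuth 2.0 provider appends when redirecting back.
interface CallbackQuery {
  code?: string;
  state?: string;
  error?: string;
}

type CallbackResult =
  | { ok: true; code: string }
  | { ok: false; reason: string };

// expectedState is whatever the app stored before sending the user to the
// provider's consent screen (hypothetical storage, e.g. the session).
function checkOAuthCallback(query: CallbackQuery, expectedState: string): CallbackResult {
  if (query.error !== undefined) {
    return { ok: false, reason: "provider returned error: " + query.error };
  }
  if (query.state === undefined || query.state !== expectedState) {
    // Rejecting mismatched state is what stops forged redirects.
    return { ok: false, reason: "state mismatch" };
  }
  if (query.code === undefined || query.code === "") {
    return { ok: false, reason: "missing code" };
  }
  // The caller would now exchange this code for tokens with the provider.
  return { ok: true, code: query.code };
}
```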

And that's just the inbound half. The scan also captures every outgoing call the app makes, and the traffic flows the other way too: posting messages to Slack, creating pages in Notion, five different Airtable calls, Google token exchanges. The product isn't just an API. It's a customer of half a dozen other products' APIs, and each of those relationships is more code than the original survey form.

Add 4 webhook endpoints so other tools can subscribe to your events, the same way you subscribe to Slack's. (Pleasingly, the scan matched these against Formbricks' n8n integration, which was built in January 2024 and still lines up with the live API today.)

The business happens

The remaining 32 endpoints are the product becoming a company: organisations, teams, roles, member management, a Stripe webhook for billing, file storage, health checks, a CSV importer. There's even a GET /legacy-organization-settings/:id/** redirect, a small archaeological record of a data model that no longer exists but whose URLs still arrive.

And the product phones out

Here's the part the endpoint count doesn't even cover. The same scan lists everything the app calls, and self-hosted Formbricks talks to roughly a dozen outside services that have nothing to do with collecting survey answers:

Two bot-detection services, Google's reCAPTCHA and Cloudflare's Turnstile, because abuse happens and customers disagree about whose checker they trust
Unsplash, because surveys have background images and someone will want to search for one
GitHub's releases API, because the app checks its own latest release to tell you when you're out of date
The Formbricks cloud, for licence checks and usage reporting on the enterprise edition
An email and CRM platform, for the vendor's own lifecycle emails

Every arrow points at a reasonable decision. There is no villain in this codebase.

It's not just surveys

Maybe feedback tools are unusually complicated? They're not. I ran the same scan across Documenso, an open-source document-signing product, another one-sentence idea. Its system came out at 116 endpoints and 148 outgoing calls, including a 13-message protocol between its embeddable signing widget and the app, because "sign a PDF" eventually means "sign a PDF inside someone else's product, in an iframe, with events firing back to the host page".

The pattern generalises, and every developer already knows it from the inside. Your side project was 90% done, and the remaining 10% was login, billing, integrations, browser rules and gateway configs, forever. What the scans add is the actual ratio. In Formbricks, the client API behind the thing on the landing page (27 of the 167 endpoints, once preflights and versions are counted) is about 16% of the API surface. The other 84% is the membrane between the product and the world: browsers, big companies, gateways, other products, abuse, money, and time, in the form of API versions that can never die because someone's SDK from 2024 still calls them.
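The arithmetic behind that ratio can be checked from the counts quoted in each section. The sketch below assumes the 16% refers to the 27 client-facing endpoints (the six survey endpoints plus their OPTIONS handlers and second version); the group labels are my own shorthand.

```typescript
// Endpoint counts as quoted in this post, one entry per section.
const endpointGroups: Array<[string, number]> = [
  ["client API, incl. OPTIONS and v1/v2", 27],
  ["management API", 46],
  ["v3 API", 20],
  ["auth", 29],
  ["integration callbacks", 9],
  ["webhooks", 4],
  ["organisations, billing, storage, misc", 32],
];

// Sum of all groups: should match the headline total of 167.
const totalEndpoints = endpointGroups.reduce((sum, group) => sum + group[1], 0);

// Share of the surface that is the survey itself, as a whole percentage.
const clientShare = Math.round((27 / totalEndpoints) * 100); // 16
const restShare = 100 - clientShare; // 84
```

If you count only the six bare endpoints instead, the share drops to under 4%, so 16% is the generous reading.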

So the next time someone asks how long the product will take, here's a number to reach for. The feature is 16% of it. The world is the rest.

A note on method

The numbers come from static analysis at pinned commits (Formbricks d1c3edf and its satellite repos; Documenso 3ff7f70), run with Carrick, the context layer for polyrepo TypeScript teams. For this post it parsed each repository's source and extracted endpoints, including wrapped handlers and re-exported route files, plus outgoing calls even when they're built through wrapper functions and environment variables, then matched callers to endpoints across repositories.

The raw counts contained three obvious extraction artifacts, which I excluded; 167 is the honest figure. The commits are pinned and the repositories are public, so every specific claim in this post can be checked against source, and I re-verified the ones I quote directly before publishing.

Endpoints are one part of what Carrick indexes. It also captures every function with its intent description, the dependencies between services, and the real request and response types on every endpoint. Agents query that index over MCP to find what already exists across an organisation, code against real contracts instead of guessing at JSON shapes, and reuse functions instead of rewriting them. In CI it checks every pull request against the same index for contract drift, duplicate functions and version conflicts. Carrick is free during pre-release at carrick.tools, and the full extraction data behind these counts is available on request from david@carrick.tools.

Introducing Carrick

David Moores — Fri, 17 Jul 2026 09:17:37 +0000

Cross-posted from carrick.tools.

Today we're launching Carrick, the context layer for polyrepo TypeScript teams: one index of your whole system, every service, type and contract, served to your AI coding agents over MCP and checked in CI. It lets agents search code by what it does, validate against real types across service boundaries, and reuse functions instead of rewriting them. Your agent reads one structured answer per question instead of grepping and reading source, so a question costs the same handful of tokens whether your system is three repos or thirty.

Carrick works by parsing your deployed source code in CI, so every merge updates your agent's context. It first builds an AST (abstract syntax tree) using SWC (Speedy Web Compiler) and extracts function closures, API routes, protocol contracts and dependencies. Alongside that, a TypeScript compiler sidecar resolves explicit and implicit types, which lets Carrick know function arguments, return types and request/response types with compiler accuracy. The AST data goes through a multi-stage inference process which allows Carrick to know not only what your code does, but what it is intended to do. We then vectorise these intent descriptions and expose them to agents, letting your agents grep by meaning, not by name. Agents can ask natural language questions and get back the code they need, instead of hunting for it.

For example, if an agent needs to enrich user information with subscription data, it might grep "subscriptions" or "users", serving up hundreds of matches. The agent then needs to identify likely source files and load them into context. If these are misses, it tries again. Once the code is found, it might be some logic on a route path /subs hoisted onto a route tree that looks like api -> v1 -> users. The agent then needs to traverse this tree, import by import, file by file, to surface api/v1/users/subs, assuming it manages to do so correctly. All of that context gathering repeats every time you build a new feature. Carrick lets agents take a different path. They can surface the correct code by asking for precisely what they need:

"I need to enrich user data with subscription data"

Carrick builds vector embeddings and constructs call graphs in CI to surface exactly what the agent needs in one go during development. Even anonymous functions are found easily, as the search runs on intended behaviour, not naming. This is also what lets agents reuse your existing utilities instead of writing near-duplicates of them. If the agent needs to call an endpoint it has the mount graph fully resolved, and should it need to construct the correct request body, it has complete type level access to that too.
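To illustrate what a resolved mount graph means, here is a toy version of the /subs example from above. It is a sketch of the idea only, not how Carrick does it internally; the RouterNode shape and the resolveFullPath function are hypothetical.

```typescript
// Hypothetical shape: each router records where its parent mounted it.
interface RouterNode {
  id: string;
  mountPath: string;       // path segment this router is mounted at
  parentId: string | null; // null for the application root
}

// Walk from the router that declares the route up to the root,
// prepending each mount path, then normalise the slashes.
function resolveFullPath(
  routers: Record<string, RouterNode>,
  routerId: string,
  routePath: string
): string {
  const segments: string[] = [routePath];
  const seen: Record<string, boolean> = {};
  let current: RouterNode | undefined = routers[routerId];
  while (current !== undefined) {
    if (seen[current.id]) throw new Error("cycle in mount graph at " + current.id);
    seen[current.id] = true;
    segments.unshift(current.mountPath);
    current = current.parentId === null ? undefined : routers[current.parentId];
  }
  const parts = segments.join("/").split("/").filter((s) => s.length > 0);
  return "/" + parts.join("/");
}

// The route tree from the example: api -> v1 -> users, with /subs on users.
const routerTable: Record<string, RouterNode> = {
  api: { id: "api", mountPath: "/api", parentId: null },
  v1: { id: "v1", mountPath: "/v1", parentId: "api" },
  users: { id: "users", mountPath: "/users", parentId: "v1" },
};

const fullPath = resolveFullPath(routerTable, "users", "/subs"); // "/api/v1/users/subs"
```

The walk is trivial once the parent links exist. The expensive part, for an agent reading raw source, is discovering those links import by import.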

Carrick is accessible to your agents via MCP at https://api.carrick.tools/mcp. Because it integrates tightly with the TypeScript compiler, it exposes real type information to your agents, so they can validate against real types across service boundaries instead of hallucinating API shapes. It constructs accurate REST routes across your applications using a branch to leaf node routing process that follows mounting patterns and path identifiers across API libraries and frameworks, and it does the same across protocols. This combination provides powerful context to agents, allowing them to build and maintain contracts across your services by only reading source code. The same index also works for you in CI: the Carrick GitHub App catches contract drift in the pull request, flagging duplicate functions, version conflicts and contract mismatches before they merge.

The context problem compounds when working cross repo, which is why Carrick has been developed for single repositories, monorepos and polyrepo setups. If you are vibe coding on a single repo, serving a React frontend with a monorepo backend, or dealing with multiple repositories in a legacy system, you can use Carrick today to give your agents the whole system's context, build confidently and ship rapidly.

Carrick is free during pre-release. No card, no application. Every scan, MCP call and PR check runs on us while we figure out pricing, and polyrepo teams can apply for three months of free inference. When paid plans arrive there will be a generous free tier for solo developers, and team pricing designed so that deploying Carrick pays for itself, but more importantly, delivers better agentic output and fewer bugs in production.

You can get started now at app.carrick.tools: sign up with GitHub, add the Carrick GitHub Action to your TypeScript repositories, and connect your agent:

claude mcp add --scope user --transport http carrick https://api.carrick.tools/mcp

Authentication is keyless end to end: the Action authenticates via GitHub Actions OIDC, and your agent signs in with Carrick through the browser, so no API key changes hands. You can find the system requirements and documentation at docs.carrick.tools.


Benchmarking LLM Structured Outputs

David Moores — Mon, 25 May 2026 18:33:04 +0000

Cross-posted from carrick.tools.

When you read the API documentation for OpenAI, Anthropic, or Google Gemini, the feature called "structured outputs" looks like a solved problem: pass a JSON schema, get back JSON that conforms to it.

In production, it is not a contract. It is a well-typed, best-effort suggestion.

At Carrick, the code-analysis scanner I work on, our post-LLM pipeline relies on a four-stage fallback parser. We attempt a direct parse, strip markdown fences, scan for array bounds inside surrounding garbage text, and finally apply regex cleanup. If all four fail, we drop the payload and proceed. If structured outputs worked as advertised, this would be a single serde_json::from_str(response).
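The shape of that fallback chain is easy to sketch. What follows is a TypeScript illustration of the four stages as described, not the production parser (which, as the serde_json reference suggests, is not written in TypeScript). The function names and the specific cleanup rule (dropping trailing commas) are assumptions for the example.

```typescript
type ParseOutcome =
  | { ok: true; stage: string; value: unknown }
  | { ok: false };

// Three backticks, built indirectly so the listing does not contain a literal fence.
const FENCE = "`".repeat(3);

function tryParse(text: string): { value: unknown } | null {
  try {
    return { value: JSON.parse(text) };
  } catch {
    return null;
  }
}

// Stage 2 helper: drop a leading fence line (with optional language tag)
// and a trailing fence.
function stripFences(text: string): string {
  let body = text.trim();
  if (body.startsWith(FENCE)) {
    const firstNewline = body.indexOf("\n");
    body = firstNewline === -1 ? "" : body.slice(firstNewline + 1);
  }
  if (body.endsWith(FENCE)) {
    body = body.slice(0, body.length - FENCE.length);
  }
  return body.trim();
}

// Stage 3 helper: take everything from the first "[" to the last "]".
function sliceArray(text: string): string | null {
  const start = text.indexOf("[");
  const end = text.lastIndexOf("]");
  return start !== -1 && end > start ? text.slice(start, end + 1) : null;
}

function parseLlmArray(raw: string): ParseOutcome {
  const direct = tryParse(raw); // stage 1: direct parse
  if (direct !== null) return { ok: true, stage: "direct", value: direct.value };

  const unfenced = stripFences(raw); // stage 2: strip markdown fences
  const fenced = tryParse(unfenced);
  if (fenced !== null) return { ok: true, stage: "fences", value: fenced.value };

  const sliced = sliceArray(unfenced); // stage 3: array bounds inside garbage text
  if (sliced !== null) {
    const bounded = tryParse(sliced);
    if (bounded !== null) return { ok: true, stage: "bounds", value: bounded.value };

    // Stage 4: regex cleanup. Assumed rule: remove trailing commas.
    const cleaned = tryParse(sliced.replace(/,\s*([\]}])/g, "$1"));
    if (cleaned !== null) return { ok: true, stage: "cleanup", value: cleaned.value };
  }
  return { ok: false }; // all four failed: drop the payload
}
```

Note that a successful parse here only means the text was JSON. Whether it matches the schema is a separate check, which is the subject of the rest of this post.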

To isolate why this defensive parsing is necessary, I built a benchmark testing 8 synthetic schemas against six models (the flagship and cheaper tiers from each provider). The schemas isolate one structural stressor each: a flat baseline, a 3-level nested object, a 7-level nested chain, a long enum, a oneOf tagged union, nullable + format fields, a 20-item array, and a closed object with additionalProperties: false. Every response is validated against the original schema using two independent validators (ajv and hyperjump). A response only counts as strict adherence when both agree.

Here is how the implementations actually behave.

At a glance

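Of the 8 stressor schemas, here is how many each model handled with full strict adherence on every run, and how many tripped a specific failure mode:

Model               Strict on every run   Rejected before the model ran   Silent wrong shape
gpt-5.4-mini        2                     6                               0
gpt-5.5             2                     6                               0
Claude Sonnet 4.6   7                     0                               1
Claude Opus 4.7     7                     0                               1
Gemini Pro 3.1      6                     2                               0
Gemini Flash 3.5    6                     2                               0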

Three patterns emerge. OpenAI rejects most schemas at submit time and then conforms perfectly on what is left. Anthropic accepts every schema but silently corrupts one specific structure. Gemini rejects a narrow set of features and conforms perfectly on the rest. Each pattern is the symmetric mirror of the others.

1. Anthropic accepts complex schemas, then silently returns the wrong shape

Anthropic's tool-use API is the most permissive of the three. It accepts almost any standard JSON schema as the input_schema for a tool, and on 7 of the 8 schemas in this bench, both Claude Sonnet 4.6 and Claude Opus 4.7 produce strict-conforming output 100% of the time. The failure mode is concentrated on one schema: a 7-level nested object chain (S3).

On S3 at n=20 runs per model:

Claude Sonnet 4.6: 20 of 20 runs silent-failed. Strict adherence 0% (95% CI: 0%–16.1%).
Claude Opus 4.7: 7 of 20 runs silent-failed. Strict adherence 65% (95% CI: 43.2%–82.3%).
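For anyone who wants to recompute intervals like these from the raw counts, here is a small Wilson score interval function. I am assuming the Wilson method here because it is the one named for the Gemini cells later in the post; the function name and the default z of 1.96 are choices made for this sketch.

```typescript
// Wilson score interval for a binomial proportion.
// Returns [lower, upper] as fractions between 0 and 1.
function wilsonInterval(successes: number, trials: number, z: number = 1.96): [number, number] {
  if (trials <= 0) throw new Error("trials must be positive");
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const centre = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return [Math.max(0, centre - half), Math.min(1, centre + half)];
}

// 0 successes in 20 runs: upper bound of roughly 0.161.
const sonnetS3 = wilsonInterval(0, 20);

// 5 successes in 5 runs: lower bound of roughly 0.566.
const fiveOfFive = wilsonInterval(5, 5);
```

The point of the interval on the 0-of-20 cell is that even a clean sweep of failures at n=20 only bounds the true success rate below about 16%, not at zero.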

The failure mode is unusual. Instead of returning a 7-level nested object, the model emits the entire nested structure as a single JSON-encoded string assigned to the root level1 field. Here is one of the Opus failures verbatim:

{"level1":"{\"name\":\"system\",\"child\":{\"name\":\"ingest_pipeline\",
\"child\":{\"name\":\"batch_24a17\",\"child\":{\"name\":\"parse_stage\",
\"child\":{\"name\":\"error_handling\",\"child\":{\"name\":\"dlq_promotion\",
\"leaf\":{\"value\":\"2 rows failed JSON parsing and were promoted to dlq
.ingest.parse-errors; weekly cleanup later inspected 412 items, removed
312, returned 100 for reprocessing\",\"kind\":\"outcome_summary\",
\"count\":2}}}}}}}}"}

The schema declares level1 as type: object. The model returned type: string containing a JSON serialisation of what the object should have been. ajv's diagnostic:

/level1 must be object {"type":"object"}

This is the most dangerous failure mode in the benchmark because:

The transport layer says success. The API returns HTTP 200 with no error field and no refusal signal.
The SDK does not validate. The Anthropic client passes tool_use.input back to your application without checking whether it conforms to the input_schema you sent.
The output parses cleanly. JSON.parse(response) succeeds, returning { level1: "{\"name\": ..." }. Only an explicit schema validator catches the type drift.

The mechanism is consistent across all 27 silent failures in the dataset (20 Sonnet plus 7 Opus): the model wraps the entire nested payload in a single string value. Run-to-run variance is in where the string boundary sits, not in whether the wrapping happens.
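Because the wrapping is so consistent, it can also be detected specifically, which is useful if you want to count it separately from other schema violations. A minimal sketch, assuming you already know which top-level fields the schema declares as objects (the function name is hypothetical):

```typescript
// Returns the fields that should be objects but arrived as strings
// containing a serialised JSON object.
function findStringWrappedObjects(
  payload: Record<string, unknown>,
  objectFields: string[]
): string[] {
  const wrapped: string[] = [];
  for (const field of objectFields) {
    const value = payload[field];
    if (typeof value !== "string") continue;
    try {
      const inner: unknown = JSON.parse(value);
      if (typeof inner === "object" && inner !== null && !Array.isArray(inner)) {
        wrapped.push(field);
      }
    } catch {
      // A plain string that is not JSON is still a type error,
      // but it is not this particular failure mode.
    }
  }
  return wrapped;
}
```

A full schema validator remains the real guard. This check only labels one failure class after the validator has flagged it.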

2. OpenAI enforces adherence by rejecting standard schemas

OpenAI's strict: true mode is the symmetric mirror of Anthropic. Where it accepts a schema, it produces strict-conforming output. Where the schema does not meet strict mode's narrow dialect, the request never reaches the model.

Of the 8 bench schemas, only 2 pass OpenAI's strict-mode rules (S1 baseline, which I deliberately shaped to be strict-compliant, and S8 closed object). The other 6 are rejected before the call is sent.

OpenAI strict mode requires:

Every object must explicitly declare additionalProperties: false.
Every property must be listed in the required array.
Type-arrays (e.g., type: ["string", "null"]) and oneOf unions are unsupported.

The bench performs the same schema validation OpenAI's API would perform, locally, before submission. A representative rejection (for the 7-level schema):

OpenAI strict mode violations:
  $: object missing additionalProperties: false;
  $.level1: object missing additionalProperties: false;
  $.level1.child: object missing additionalProperties: false
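A pre-flight check of this kind is short to write. The sketch below covers only the three rules listed above and imitates the message format of the rejection shown. It is not the bench's actual validator, and the interface and function names are hypothetical.

```typescript
interface JsonSchema {
  type?: string | string[];
  properties?: Record<string, JsonSchema>;
  required?: string[];
  items?: JsonSchema;
  oneOf?: JsonSchema[];
  additionalProperties?: boolean | JsonSchema;
  [keyword: string]: unknown;
}

// Collects every place where a schema breaks the three strict-mode rules.
function strictModeViolations(schema: JsonSchema, path: string = "$"): string[] {
  const out: string[] = [];
  if (Array.isArray(schema.type)) out.push(path + ": type-array is unsupported");
  if (schema.oneOf !== undefined) out.push(path + ": oneOf is unsupported");

  if (schema.type === "object") {
    if (schema.additionalProperties !== false) {
      out.push(path + ": object missing additionalProperties: false");
    }
    const props = schema.properties ?? {};
    const required = schema.required ?? [];
    for (const key of Object.keys(props)) {
      if (required.indexOf(key) === -1) {
        out.push(path + "." + key + ": property not listed in required");
      }
      // Recurse so nested objects are held to the same rules.
      out.push(...strictModeViolations(props[key], path + "." + key));
    }
  }
  if (schema.items !== undefined) {
    out.push(...strictModeViolations(schema.items, path + "[]"));
  }
  return out;
}
```

Running a check like this locally gives the same answer for every model tier, because it looks only at the schema.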

The rejection rate is identical between gpt-5.4-mini and gpt-5.5. Strict mode's rules are applied to the schema before any model is invoked (in this bench, by a local pre-flight validator that mirrors them), so flagship intelligence does not change the outcome.

If you pull a schema from an OpenAPI spec or package.json, it will likely fail. Your options are to rewrite the schema to the strict dialect, or disable strict mode and inherit Anthropic's silent-failure problem.

3. Gemini is the rigid middle ground

Gemini's schema validator rejects modern JSON Schema features that OpenAI strict also bans (oneOf, type-arrays, $ref) but accepts the looser shapes OpenAI strict refuses. On the 6 of 8 bench schemas that clear Gemini's pre-flight, both Gemini Pro 3.1 and Gemini Flash 3.5 maintain 100% strict adherence at n=5 each (Wilson 95% CI for 5/5: 56.6%–100%; tight enough across 6 schemas to support the pattern).

The two rejected schemas are S5 (uses oneOf) and S6 (uses type: ["string", "null"] plus format: date-time). Gemini surfaces the rejection at submission time with a clear error naming the unsupported feature.

Notably, Gemini reached 100% strict adherence on every run of the same 7-level nested schema that tripped up Anthropic. Where Gemini accepts a schema, it conforms.

The outcome matrix

The full pilot, condensed to one grid. S3 and S7 ran at n=20 for Anthropic; all other cells ran at n=5.

Defensive implementation patterns

The provider feature called "structured output" cannot be trusted as an application boundary. To handle the realities of the current APIs, your pipeline needs explicit guardrails. Here is the implementation priority:

Run an independent validation step. An HTTP 200 from the provider means nothing. Validate every single response payload against your schema using ajv, hyperjump, or a custom walker in your own codebase before passing the data to your application logic.
Redefine success criteria. Treat a standard parse error, a schema violation, and a refusal as equal failure modes. Trigger the same retry/fallback logic for all of them.
Flatten Anthropic schemas. Deep nesting triggers silent corruption in Claude, including at the flagship tier. Flatten structures into top-level arrays of sibling objects wherever possible. If a schema exceeds three or four levels of depth, consider refactoring it.
Compile schemas to the OpenAI dialect. If you are targeting OpenAI strict mode, author your schemas from the start with additionalProperties: false propagated to every sub-level and no optional fields.
Strip unions for Gemini. Avoid oneOf and ["string", "null"]. Use anyOf for unions and rely on a single nullable type constraint.
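As an example of the fourth item, compiling an existing schema towards the strict dialect can be mostly mechanical. This is a minimal sketch under stated assumptions: it only closes objects and marks every property required, it does not rewrite oneOf or type-arrays (those need a redesign, not a transform), and the names are hypothetical.

```typescript
interface SchemaNode {
  type?: string | string[];
  properties?: Record<string, SchemaNode>;
  required?: string[];
  items?: SchemaNode;
  additionalProperties?: boolean | SchemaNode;
  [keyword: string]: unknown;
}

// Returns a copy of the schema with additionalProperties: false on every
// object and every property listed as required. The input is not mutated.
function toStrictDialect(schema: SchemaNode): SchemaNode {
  const out: SchemaNode = { ...schema };
  if (schema.type === "object") {
    const props = schema.properties ?? {};
    const strictProps: Record<string, SchemaNode> = {};
    for (const key of Object.keys(props)) {
      strictProps[key] = toStrictDialect(props[key]);
    }
    out.properties = strictProps;
    out.required = Object.keys(strictProps); // optional fields become required
    out.additionalProperties = false;
  }
  if (schema.items !== undefined) {
    out.items = toStrictDialect(schema.items);
  }
  return out;
}
```

The trade-off is in the required line: a field that used to be optional is now mandatory, so the model will always produce something for it. Decide per field whether that is acceptable before applying a transform like this wholesale.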

What this bench does and does not measure

Three caveats worth surfacing explicitly:

OpenAI rejection is bench-side, server-rule-mirrored. The 6 of 8 schemas reported as rejected by OpenAI are rejected by a pre-flight validator inside the bench that implements the documented strict-mode rules (additionalProperties: false, every property required, no type-arrays, no oneOf). I did not separately submit each schema to the OpenAI API and observe the server's 400 response, so the rejection rate reported here is the rate at which OpenAI's documented strict-mode rules disqualify normal JSON Schema, not the rate at which OpenAI's server returns an error. If OpenAI relaxed strict mode tomorrow, the bench would not notice.

Gemini schemas are normalised before submission. Gemini's structured-output API supports a narrower keyword set than OpenAPI / draft-2020-12 JSON Schema. The bench's convertSchemaToGemini function passes through the keywords Gemini's docs list as supported (type, enum, format, min/max, required, properties, items) and drops the rest before submission. The validator still checks Gemini's output against the original schema, so any constraint the converter drops is implicitly given a free pass on the Gemini side. For the current corpus this only affects S5 and S6 (already rejected at pre-flight), but it would matter for any future schema relying on const, pattern, or additionalProperties as a real constraint.
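One way to keep that free pass visible is to have the converter report what it dropped. The sketch below is not the bench's convertSchemaToGemini: the function name is hypothetical, and reading "min/max" as minimum, maximum, minItems and maxItems is my assumption.

```typescript
type Schema = { [keyword: string]: unknown };

// Allow-list from the caveat above; the expansion of "min/max" is assumed.
const SUPPORTED_KEYWORDS = [
  "type", "enum", "format", "minimum", "maximum",
  "minItems", "maxItems", "required", "properties", "items",
];

function isSchemaObject(value: unknown): value is Schema {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Copies only supported keywords and records the path of everything dropped.
function keepSupportedKeywords(
  schema: Schema,
  dropped: string[] = [],
  path: string = "$"
): { schema: Schema; dropped: string[] } {
  const out: Schema = {};
  for (const key of Object.keys(schema)) {
    const value = schema[key];
    if (SUPPORTED_KEYWORDS.indexOf(key) === -1) {
      dropped.push(path + "." + key); // a constraint the provider never saw
      continue;
    }
    if (key === "properties" && isSchemaObject(value)) {
      const props: Schema = {};
      for (const propName of Object.keys(value)) {
        const child = value[propName];
        props[propName] = isSchemaObject(child)
          ? keepSupportedKeywords(child, dropped, path + "." + propName).schema
          : child;
      }
      out[key] = props;
    } else if (key === "items" && isSchemaObject(value)) {
      out[key] = keepSupportedKeywords(value, dropped, path + "[]").schema;
    } else {
      out[key] = value;
    }
  }
  return { schema: out, dropped };
}
```

With the dropped list in hand, a report can mark any cell where the original-schema validator passed on a constraint that was never sent to the provider.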

Sample sizes are uneven. The two cells the article quotes specifically (Anthropic Sonnet and Opus on S3 deep nesting) ran at n=20 each. The S7 long-array cells also ran at n=20 after an initial pilot revealed the Anthropic adapter was hard-capped at max_tokens: 4096, which was inflating the truncation rate; raising the cap to 8192 brought both Anthropic tiers to 100% strict adherence on S7. Everywhere else the bench ran at n=5 per cell, which is enough to see the dominant outcome but not enough to claim sharp rates.

Methodology, raw JSONL, schemas, and reproducible scripts are available at carrick-llm-structured-bench. The full re-run that backs the figures above cost roughly $8 in API credits and took about an hour of wall time.

Why Coding Agents Are Getting More Expensive (And How To Fix It)

David Moores — Sat, 23 May 2026 19:33:24 +0000

Cross-posted from carrick.tools.

Coding agents like Claude Code and Cursor now have context windows that support up to a million tokens. While larger contexts are useful, they are also the reason your API costs are increasing and you are hitting usage limits faster than before.

If your $20 Pro subscription feels like it covers less ground lately, or you are running into rate limits early in the day, it comes down to how these tools manage context under the hood.

Why prompt caching matters so much

The economics of long-context models rely heavily on prompt caching. Providers like Anthropic discount cached input tokens by about 90 percent [2]. This discount is what makes a million-token window financially viable.

However, caching requires exact prefix matching. As Simon Willison has noted, if your prompt is 99 percent identical to the previous one, but the very first token has changed, the cache breaks [3]. Anthropic's own documentation confirms that caching reads sequentially—any change before a cache breakpoint invalidates everything that follows.
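A toy model makes the prefix rule easier to see. The sketch below compares two prompts token by token and reports how much of the old prompt could still be served from cache. Real providers cache at defined breakpoints, not at arbitrary tokens, so treat this as a simplification; the function name is my own.

```typescript
// Number of leading tokens two prompts have in common.
// Under exact prefix matching, only this leading run is reusable.
function sharedPrefixLength(previous: string[], next: string[]): number {
  const limit = Math.min(previous.length, next.length);
  let i = 0;
  while (i < limit && previous[i] === next[i]) i++;
  return i;
}

const earlier = ["system", "fileA", "fileB", "question1"];

// Appending to the end keeps the whole earlier prompt reusable.
const appended = sharedPrefixLength(earlier, [...earlier, "question2"]); // 4

// Changing the first token leaves nothing reusable, even though
// three of the four tokens are unchanged.
const edited = sharedPrefixLength(earlier, ["system-v2", "fileA", "fileB", "question1"]); // 0
```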

This becomes an issue when agents use naive keyword searches to dump dozens of raw source files into the context window. It creates a volatile prompt. Editing a single line in any of those files changes the prefix. Agents also periodically summarize conversation history to manage context limits, which shifts the prefix again. Every time this happens, you get a full cache miss.

What an idle session costs

The impact of these cache misses adds up quickly. Boris Cherny from the Claude Code team at Anthropic recently explained this on Hacker News:

Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache... The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss... In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users [1].

If you step away for an hour, your first prompt when you return will burn a massive chunk of your daily allowance before the response even comes back.
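To put rough numbers on that, here is the comparison in units of "tokens billed at the full input price", using the roughly 90 percent cached-read discount mentioned earlier. It ignores any premium charged for writing to the cache, so the cold figure is a floor. The constant and function names are assumptions for the sketch.

```typescript
// Cached input tokens cost about a tenth of uncached ones ("about 90 percent" off).
const CACHED_READ_FACTOR = 0.1;

// Input cost of one request, in full-price token equivalents.
function inputCostUnits(promptTokens: number, cachedPrefixTokens: number): number {
  const cached = Math.min(Math.max(cachedPrefixTokens, 0), promptTokens);
  const uncached = promptTokens - cached;
  return uncached + cached * CACHED_READ_FACTOR;
}

// The 900k-token session from the quote above.
const warmTurn = inputCostUnits(900000, 900000); // whole prompt served from cache
const coldTurn = inputCostUnits(900000, 0);      // full miss after the idle hour

// How many warm turns one cold turn is worth (about ten).
const coldToWarmRatio = coldTurn / warmTurn;
```

Under these assumptions, a single message after the idle hour costs about as much input as ten messages sent while the cache was warm.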

This isn't theoretical

Developers are already tracking this issue. In Claude Code issue #46829, "Cache TTL silently regressed... causing quota and cost inflation," users analyzed their session logs and found a 20 to 32 percent increase in cache creation costs, alongside a spike in quota consumption for users who rarely hit limits before [4].

When the cache drops, you pay full price for hundreds of thousands of input tokens on every request. Relying on an agent to churn through raw, un-cached source code to find an answer will drain a daily compute budget in hours.

What Carrick does differently

This is why we built Carrick. The solution is not to load thousands of lines of source code just to find a single route or type definition.

Instead of dumping files into the context window, Carrick provides a pre-computed context layer via MCP. When an agent needs to know how to construct a request body for a specific endpoint, it doesn't need to load the router tree and its dependencies. It queries Carrick.

Carrick returns the resolved mount graph and compiler-grade types. What normally takes 50,000 tokens of raw source code is condensed into about 500 tokens of structured data.

Keeping the prompt small keeps the prefix stable, which preserves the cache. For some workflows we have seen token savings of up to 95 percent*, allowing your usage limits to actually last throughout the day. By shifting the heavy lifting from the agent's context window to a dedicated cache, you stop wasting tokens on raw codebase traversal.

* Measured on semantic lookups across three TypeScript microservices, then extrapolated to a 50-source-file baseline. Keyword-friendly queries sit toward the low end of the range; the gap widens with codebase size and the number of repos searched.

References

[1] Boris Cherny (Anthropic), comment on An update on recent Claude Code quality reports, Hacker News. https://news.ycombinator.com/item?id=47880089
[2] Anthropic, Prompt caching, Anthropic Documentation. https://docs.anthropic.com
[3] Simon Willison, writing on prompt caching mechanics. https://simonwillison.net
[4] Claude Code issue #46829, Cache TTL silently regressed... causing quota and cost inflation. https://github.com/anthropics/claude-code/issues/46829

Carrick is the missing context layer for AI coding agents — a semantic index of function intents, routing graphs, and compiler-grade types, served over MCP. Join the beta.

The Agentic Bottleneck: Why We Need to Rethink CI

David Moores — Sat, 23 May 2026 19:33:22 +0000

Cross-posted from carrick.tools.

The agentic development cycle has completely upended traditional software engineering practices. Teams are looking to ship faster than ever and are being enabled by massive improvements in the capabilities of coding agents. Engineers now run multiple agent sessions in parallel and can ship complex features without ever reviewing the output code, trusting that test coverage and CI checks will prevent broken code from hitting production.

If you have been an engineer long enough to appreciate code aesthetics, you might agree with me that agent-written software is verbose. Agents will happily write out logic to cover dozens of edge cases when a single, well-considered solution would provide the same functionality with far less complexity. For me, the question then becomes: should we care?

Whether you care or not, human oversight is dwindling as we ship and deploy faster. We trust the models more as they become more capable, and the replacement for human review becomes the checks we put in place in CI.

The Bottleneck

As the quantity of generated code increases, the test coverage increases right alongside it. Agents will happily write tests for extremely unlikely scenarios to try to provide guarantees for themselves that the code will work once deployed.

These tests run quickly in isolated environments during the development process, but the entire bloated suite eventually needs to run in CI. The net result is a slowdown in velocity. As a codebase grows, teams trying to ship into production must wait for thousands of tests to pass, with the code and the tests barely reviewed by human eyes.

What Are We Waiting For

So what are the guarantees here? When we only care about the outcome, why wait for these guarantees at all? Could there be a way that works in tandem with the agentic development process, rather than one that acts as a blocker to productivity at each stage of the development lifecycle?

I believe this is where we might be headed: a shift left, out of CI and into the agentic coding process itself. I picture an ephemeral test process that runs alongside the agent while it builds.

In this scenario, an agent writes a function, immediately generates a minimal unit test in a sandboxed container, and validates it. Once validated, the test code is scrapped. Instead of hoarding the tests, minified metadata is shipped with the code that reflects the exhaustive test cases the agent felt were necessary. This metadata is essentially a structured attestation, hashed with the commit and the location of the code, stating that the function satisfied its type contracts and produced specific outputs.

If the code is modified, the attestation breaks and the agent runs the ephemeral loop again. Codebases could shrink by hundreds of thousands of lines, drastically reducing the required context for any agent to complete a task. CI speeds up. Velocity accelerates.
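To make the idea concrete, here is a minimal sketch of what such an attestation might look like in TypeScript. Everything in it is an assumption made for illustration rather than an existing format: the Attestation shape, the attest and isStillValid helpers, and the choice of SHA-256 from Node's crypto module.

```typescript
import { createHash } from "node:crypto";

// Hypothetical attestation shape: the commit, the code location, a hash of
// the function source, and the input/output cases the agent checked.
interface Attestation {
  commit: string;
  location: string; // e.g. "src/math.ts#double"
  sourceHash: string;
  cases: Array<{ input: unknown; output: unknown }>;
  digest: string; // hash over all of the fields above
}

function sha256(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// Hash the fields in a fixed order so the digest is reproducible.
function digestOf(a: Omit<Attestation, "digest">): string {
  return sha256(JSON.stringify([a.commit, a.location, a.sourceHash, a.cases]));
}

function attest(
  commit: string,
  location: string,
  source: string,
  cases: Attestation["cases"]
): Attestation {
  const fields = { commit, location, sourceHash: sha256(source), cases };
  return { ...fields, digest: digestOf(fields) };
}

// The attestation "breaks" when the metadata was altered or the source changed.
function isStillValid(attestation: Attestation, currentSource: string): boolean {
  const untampered = attestation.digest === digestOf(attestation);
  return untampered && attestation.sourceHash === sha256(currentSource);
}
```

In this sketch, any edit to the function source changes its hash, so isStillValid returns false and the agent would know to run the ephemeral loop again.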

The Circular Verification Trap

There is a catch to this idea, though. If an AI writes the application code and then writes the test code, you risk a circular validation loop. The test might just confirm that the code does what the code does, checking for internal consistency rather than actual correctness.

If we aren't careful, line coverage just becomes a trust-washing metric. To avoid this, these attestations would need to encode intent and strict type contracts, not just execution paths. The agent needs an accurate map of what the system is supposed to do before it starts writing tests to prove it did it.

Moving Validation Upstream

I think it will be a while before this future becomes a reality. We are only just learning to take our hands off the steering wheel with agents, and an entirely new paradigm for providing certainty still feels distant.

But as agents get faster, the pipeline has to evolve. We cannot keep writing code at machine speed and testing it at human speed. Whether it looks exactly like ephemeral testing or something else entirely, the next major bottleneck in software engineering is CI. And we are going to have to shift left to fix it.


The Multi-Repository TypeScript Problem

David Moores — Thu, 17 Jul 2025 00:00:00 +0000

Cross-posted from carrick.tools.

Work on a large enough TypeScript codebase with distributed teams and you're likely working within either a monorepo or a polyrepo architecture. The choice depends on a number of factors, ranging from architectural (isolated services, independent deployments) to business (self-organising teams, devops maturity, multi-language services). The developer community can be polarised on the merits of each, but when it comes to TypeScript, monorepos have profound benefits. With little additional tooling you can give all your services access to a single shared TypeScript package. Dig a little deeper into modern tooling and you might use tRPC or Nx workspaces to share types.

Unfortunately the story in a polyrepo architecture isn't so simple, but there are options:

Perhaps your APIs are bound quite strictly to your database schemas. You could fire up these databases and use introspection and codegen tools to generate types from the schemas.
You could publish a shared NPM types package on a private registry and get all your TypeScript projects to consume it.
You could go "Contract-first" — write the contracts, make each service consume the schemas and generate the types.

With all of the above, there is tooling that has to exist in each repository, and for each team that means maintenance:

If the shared types are updated, then each service needs to know about the version change. You could use a product like Dependabot to alert you on a cadence, but with private registries this isn't trivial. And if you are deploying frequently, is that cadence frequent enough without being noisy?
If the business isn't "contract-first", then APIs can be updated but the ticket to update the contract sits on the backlog.
If the business is "contract-first", then each team/service needs to commit an update to their service to access new versions of the schemas.

This problem — what at Carrick we like to call TypeScript's project boundary problem — is what we're going to try and solve today. Put on your waders as we're going deep in the weeds. Let's go!

The Dream

For the sake of (mild) simplicity, let's limit this discussion to APIs. What if we could look at a Producer and a Consumer in different repositories and compare their request and response types as if they were inside a monorepo? Better yet, what if we could do this in CI, so that we get this type-checking goodness at the same point where we would typically run tsc?

TypeScript needs to understand the full project context to perform type checking. It parses each file into an AST (Abstract Syntax Tree) and follows imports and exports across files, resolving each type reference to its complete definition. We would therefore need both the producer and the consumer, from different repos, inside a single TypeScript project to perform type checks. Extracting the code for either one isn't ideal. Do we add the producer to the consumer's repo or vice versa? Do we create a third project? And if so, what dependencies would we need for the code to be valid?

What if we just extract the types? That seems more straightforward: we can somehow take the request and response types, store them somewhere and reference them in an isolated TypeScript project at CI time. Let's give that a go.

The Recursive Type Discovery Problem

First we need to get the types. Carrick utilises a great library called ts-morph which provides an API on top of the TypeScript compiler that allows us to perform a surgical extraction of the type. Assume we can extract the type at a position in the source file for both the consumer and producer repositories…

// PRODUCER SIDE (user-service)
export type GetUsersResponse = Response<User[]>;

// CONSUMER SIDE (comment-service)
const users: User[] = await fetch('/api/users').then(r => r.json());

// Copy Response type:
export type Response<T> = {
  // ... wait, what properties does Express Response actually have?
};

// Copy User type:
export type User = {
  // ... wait, what properties does User actually have?
};

OK, we've run into a problem. The types are composites of other types. If we're going to compare these two types we need their dependencies. Let's fetch them!

// Looking up Express Response<T>:
export type Response<T> = {
  status(code: number): this;
  json(obj: any): this;
  send(body?: any): this;
  cookie(name: string, val: string, options?: CookieOptions): this;
  locals: Record<string, any>;
  app: Application;
  req: Request;
  // ... 47 more properties
} & ServerResponse;

// ============= THE NAMING CONFLICTS =============
// Meanwhile, consumer service has its own types:

export type Response<T> = {     // Name clash with Express Response!
  success: boolean;
  data: T;
  message?: string;
};

export type User = {            // Name clash with producer User!
  userId: string;               // Different structure entirely!
  displayName: string;
};

OK this has exploded in complexity. What we wanted to do was compare User against User, but we're now at:

Types manually copied so far: 47
Types still needed: ~200+
Naming conflicts: 12
Circular dependencies discovered: 8

Let's find a new approach. Ideally what we want is to recursively find the types if they are defined in the project, and if they are imports we want to preserve the import and add it to our TypeScript project.

ts-morph: TypeScript's Compiler as a Library

ts-morph provides a wrapper around the compiler APIs and allows us to traverse the type graph intelligently. To do that we need the source file and the position (offset) of the type within it. For Carrick we use SWC to traverse nodes in a TypeScript file and extract these positions. Now we can implement something like this:

import { Project, Node } from 'ts-morph';

// Create a TypeScript project programmatically
const project = new Project({
  tsConfigFilePath: './tsconfig.json'
});

extractTypeAtPosition('src/handlers.ts', 1247);

with extractTypeAtPosition roughly implemented as:

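import { SyntaxKind, TypeReferenceNode } from 'ts-morph';

function extractTypeAtPosition(filePath: string, position: number) {
  const sourceFile = project.getSourceFileOrThrow(filePath);
  const node = sourceFile.getDescendantAtPos(position);
  if (!node) return;

  // getDescendantAtPos returns the deepest node (usually an identifier),
  // so step up to the enclosing type reference if we are not already on one
  const typeRef = Node.isTypeReference(node)
    ? node
    : node.getFirstAncestorByKind(SyntaxKind.TypeReference);
  if (!typeRef) return;

  console.log(`Found type reference: ${typeRef.getText()}`);
  processTypeReference(typeRef);

  // Follow references nested in the type arguments too,
  // e.g. the User inside Response<User[]>
  for (const nested of typeRef.getDescendantsOfKind(SyntaxKind.TypeReference)) {
    processTypeReference(nested);
  }
}

function processTypeReference(typeRef: TypeReferenceNode) {
  const symbol = typeRef.getTypeName().getSymbol();
  // An imported name resolves to an alias first; follow it to the real declaration
  const resolved = symbol?.getAliasedSymbol() ?? symbol;
  if (!resolved) return;

  for (const declaration of resolved.getDeclarations()) {
    const filePath = declaration.getSourceFile().getFilePath();

    if (filePath.includes('node_modules')) {
      // External dependency - preserve as import
      // (addToImports is a helper not shown in this article)
      addToImports(declaration);
    } else {
      // Local type - recursively collect its definition
      // (collectDeclarationsRecursively is also not shown here)
      collectDeclarationsRecursively(declaration);
    }
  }
}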

So now we have:

Local types recursively extracted with full definitions
External types preserved as clean imports
We only follow the types we actually need

This gives us the type resolution including dependencies, but how are we going to make these work across service boundaries?

Creating Portable Type Packages

To keep the scope of this article manageable, let's make some assumptions from here on out so that we have a clear mental model of where we are and what we need to achieve to address the dream of running type checks across service boundaries.

We have two services — User Service repository and Comment Service repository.
We have the above ts-morph program running in a CI process for each repo within GitHub.
This process is running on pushes to main.

…which means we have a few more problems to address:

How do we associate the producer and consumer?
How do we store and retrieve these types and their dependencies in each service that requires them?
How do we output the type check results?

Associating the Producer and Consumer

Because the producer and consumer likely declare similarly named types, there is a high chance of name collisions if we were to build the type files as-is. Different services can also be built by different teams, so we can't rely on naming conventions. We can be fairly certain, though, that the producer and consumer use the same route. We can use that to associate the types and create type aliases with unique names:

// For PRODUCERS (API endpoints):
function generateProducerTypeName(endpoint: ApiEndpoint): string {
  const method = endpoint.method.toLowerCase();
  const normalizedRoute = normalizeRoute(endpoint.route);
  return `${capitalize(method)}${normalizedRoute}ResponseProducer`;
  // Result: "GetApiUsersResponseProducer"
}

// For CONSUMERS (API calls):
function generateConsumerTypeName(call: ApiCall): string {
  const method = call.method.toLowerCase();
  const normalizedRoute = normalizeRoute(call.route);
  const callId = call.call_id || generateCallId();
  return `${capitalize(method)}${normalizedRoute}ResponseConsumer${callId}`;
  // Result: "GetApiUsersResponseConsumerCall1"
}
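The two functions above lean on normalizeRoute and capitalize, which aren't shown. A minimal sketch of how they might work follows. The implementations are assumptions for illustration, as is the choice to collapse every path parameter into the same Param placeholder so that /users/:id on the producer side and /users/${userId} on the consumer side produce the same name.

```typescript
// Hypothetical helpers: one possible way to turn a route into a type-name fragment.
function capitalize(word: string): string {
  return word.charAt(0).toUpperCase() + word.slice(1);
}

function normalizeRoute(route: string): string {
  return route
    .split("/")
    .filter((segment) => segment.length > 0)
    .map((segment) => {
      // Treat ":id", "{id}" and "${id}" style parameters as the same placeholder,
      // because the two repositories may name the parameter differently
      if (/^(:|\{|\$\{)/.test(segment)) return "Param";
      // Split "user-profiles" or "user_profiles" into words and PascalCase them
      return segment
        .split(/[^A-Za-z0-9]+/)
        .filter((word) => word.length > 0)
        .map(capitalize)
        .join("");
    })
    .join("");
}

// Same naming scheme as the producer function above, taking plain strings
function producerTypeName(method: string, route: string): string {
  return `${capitalize(method.toLowerCase())}${normalizeRoute(route)}ResponseProducer`;
}
```

With these helpers, producerTypeName("GET", "/api/users") yields the GetApiUsersResponseProducer name used in the rest of the article.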

Storing Types and Their Dependencies

Each CI process needs to create a self-contained package that can be shared with other repositories. This requires two key artifacts:

1. The TypeScript definitions file:

// user-service_types.ts
import { Response } from 'express';
import { ObjectId } from 'mongodb';

export interface User {
  id: string;
  name: string;
  email: string;
  preferences: UserPreferences;
}

export interface UserPreferences {
  theme: 'light' | 'dark';
  notifications: boolean;
}

export type GetApiUsersResponseProducer = Response<User[]>;
export type PostApiUsersRequestProducer = User;

2. The dependency manifest (package.json):

{
  "name": "user-service-types",
  "version": "1.0.0",
  "dependencies": {
    "express": "4.18.0",
    "mongodb": "5.1.0",
    "@types/node": "18.15.0"
  }
}

These artifacts get uploaded to shared storage (S3, DynamoDB, etc.) where other CI processes can download them.

Performing the Type Validation

Now we have all the pieces, but how do we actually use them to validate compatibility?

When a repository's CI process runs, it downloads the type packages from all its related services and creates a temporary TypeScript project specifically for validation. In this isolated environment, we can:

Reconstruct the Project — We create source files from the downloaded definitions.
Unify Dependencies — We merge the dependencies from each package's manifest into a single package.json.
Install — We run npm install to ensure all external types are available to the compiler.

We're programmatically constructing a valid TypeScript project where types from completely separate repositories can coexist and be compared.
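The Unify Dependencies step could be sketched roughly like this. The TypeManifest shape mirrors the package.json shown earlier, but mergeManifests and its first-one-wins handling of version conflicts are assumptions for illustration. A real tool would need a deliberate policy for mismatched versions.

```typescript
// Hypothetical manifest shape, mirroring the package.json artifact above
interface TypeManifest {
  name: string;
  dependencies: Record<string, string>;
}

interface MergeResult {
  dependencies: Record<string, string>;
  conflicts: string[];
}

function mergeManifests(manifests: TypeManifest[]): MergeResult {
  const dependencies: Record<string, string> = {};
  const conflicts: string[] = [];

  for (const manifest of manifests) {
    for (const [pkg, version] of Object.entries(manifest.dependencies)) {
      const existing = dependencies[pkg];
      if (existing === undefined) {
        dependencies[pkg] = version;
      } else if (existing !== version) {
        // Keep the first version seen and record the disagreement for reporting
        conflicts.push(`${pkg}: ${existing} vs ${version} (from ${manifest.name})`);
      }
    }
  }

  return { dependencies, conflicts };
}
```

The merged dependencies object can then be written into the temporary project's package.json before the install step. Recording conflicts rather than silently resolving them keeps any version drift between services visible in the CI output.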

Compiling Within the Compiler

The beauty of this approach is that we can simply let TypeScript's own type checker determine compatibility. Instead of writing custom validation logic to manually traverse and compare type structures, we can leverage simple assignability rules.

// Create a type compatibility checker from the temporary project
const typeChecker = validationProject.getTypeChecker();

// Find the aliased producer and consumer types
const producerType = findType('GetApiUsersResponseProducer').getType();
const consumerType = findType('GetApiUsersResponseConsumerCall1').getType();

// Let TypeScript decide compatibility
const isCompatible = producerType.isAssignableTo(consumerType);

if (!isCompatible) {
  // Surface TypeScript's own explanation of the mismatch
  console.error(getTypeCompatibilityError(producerType, consumerType));
}

If it fails, we create a fake assignment that's guaranteed to fail, then extract TypeScript's own error message:

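import { ts, Type } from 'ts-morph';

function getTypeCompatibilityError(
  producerType: Type,
  consumerType: Type
): string {
  // Assign a producer-typed value to a consumer-typed variable so the
  // compiler reports exactly why the two types do not line up
  const testCode = `
    declare const producer: ${producerType.getText()};
    const test: ${consumerType.getText()} = producer;
  `;

  const tempFile = validationProject.createSourceFile('temp.ts', testCode, {
    overwrite: true
  });

  // A diagnostic message can be a chain rather than a plain string,
  // so flatten each one to text before searching it
  const messages = tempFile
    .getPreEmitDiagnostics()
    .map(d => ts.flattenDiagnosticMessageText(d.compilerObject.messageText, '\n'));

  // Drop the scratch file from the project; it never needs to touch disk
  tempFile.forget();

  return messages.find(m => m.includes('not assignable')) ?? 'Types are incompatible';
}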

This approach is powerful because TypeScript already knows about the nuances of its own system. The validation feels seamless because it uses standard TypeScript compilation — we're just operating it across repository boundaries in a way it wasn't originally designed for.
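One of those nuances is structural typing: a producer that returns extra fields is still compatible with a consumer that only reads a subset of them, while renamed fields are not. The sketch below shows this with plain TypeScript. The three interfaces are illustrative only, and the IsAssignable helper is an assumption used to approximate the assignability check at the type level.

```typescript
// Illustrative types only; they are not taken from a real service
interface ProducerUser {
  id: string;
  name: string;
  email: string;
}

interface ConsumerUser {
  id: string;
  name: string;
}

interface RenamedUser {
  userId: string;
  displayName: string;
}

// Resolves to true when P can be used where C is expected,
// roughly the same question as `const c: C = p`
type IsAssignable<P, C> = [P] extends [C] ? true : false;

// Extra producer fields are fine: the consumer simply ignores email
const extraFieldsAreFine: IsAssignable<ProducerUser[], ConsumerUser[]> = true;

// Renamed fields are not: the consumer expects userId and displayName
const renamedFieldsBreak: IsAssignable<ProducerUser[], RenamedUser[]> = false;
```

Both constants only compile with the values shown, so the compiler itself confirms which pairing is compatible.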

The Engineering Reality

Complexity explosion: What started as a simple idea of extracting and comparing types became a deep dive into TypeScript's symbol resolution, module loading, and type instantiation systems.
Performance at scale: Building a GitHub Action and keeping it snappy in CI runs means finding ways to shave critical seconds off compile times. Creating an isolated TypeScript environment for cross-repo checking means we're not running the TypeScript compiler across hundreds of files.
TypeScript's boundaries are artificial: The compiler already has all the machinery needed for cross-project type checking, but it isn't exposed in a way that makes it easy. Most of our engineering was about building a workaround for those artificial boundaries.

This approach scales because we're leveraging TypeScript's existing infrastructure rather than building a parallel system. Every improvement to the TypeScript compiler automatically improves our validation accuracy.

The dream of monorepo-style type safety in a polyrepo architecture is possible. You just need to convince TypeScript to look beyond its own project boundaries.
