DEV Community: Dave Kurian

Why Fast Coding Agents Still Need Thoughtful Architecture

Dave Kurian — Sun, 02 Aug 2026 03:05:50 +0000

Coding agents just shipped a refund-approval endpoint, two migrations, and a PR in the time it took your team to schedule the design meeting. That's not hype; it's a real capability. A model can inspect files, draft an endpoint, change a schema, write tests, and open the PR while you're still arguing about naming. Teams that wire this in early can win back real engineering time every quarter.

Then someone reviews the PR and asks: who can approve a refund? Can the same person create and approve it? What happens when the payment provider is down? Does the audit record survive a schema migration? The PR compiles. The tests pass. The design is missing. Speed didn't make those questions smaller. It just made it easier to skip them.

This is the lesson every coding-agent workflow keeps rediscovering: speed helps only after you define the right boundaries.

The speed is real, and worth using

Coding agents aren't a toy. They do work that used to eat afternoons: scaffold an endpoint, write the boilerplate, generate the test, draft the migration. If you've ever watched Cursor or Claude Code turn a three-file change into one PR, you've seen the loop work.

Concrete plumbing to put one in your editor today:

# Cursor — point at any frontier model via OpenRouter, no extra account
export OPENROUTER_API_KEY=...
# In Cursor Settings → Models, pick a frontier model from the
# OpenRouter catalog (e.g. openrouter/anthropic/...)

# Claude Code — straight from Anthropic
npm i -g @anthropic-ai/claude-code
claude
# /model <frontier-model-id>

# Or any agent via a flat OpenAI-compatible endpoint
# (OPENAI_BASE_URL is left blank here; set it to your provider's base URL)
export OPENAI_BASE_URL=
export OPENAI_API_KEY=...

The point isn't which vendor wins. The point is the loop: read the repo, edit files, run tests, open the PR. That loop is genuinely fast now. Teams that don't use it are leaving time on the table, full stop.

Architecture is the slow part on purpose

The part that doesn't get faster is the one that matters most. Architecture is the set of important decisions that shape a system. Not the folder structure. Not which framework is fashionable this quarter. The decisions:

which component owns each responsibility
which component owns each data set
which dependencies are allowed
which failure modes the system must tolerate
which security boundaries the system must enforce
which parts can change without coordinated work
which operational costs the team accepts

Each of these is a judgment call that depends on users, regulations, team structure, budget, existing systems, and expected change. None of that lives in the repo. None of it fits in a context window. An agent can implement any of these decisions. It cannot remove the need to make them.

[[DIAGRAM: request hits tracker → agent reads repo → agent drafts endpoint, migration, tests → PR opens → reviewer hits boundary questions → re-design, re-implement, weeks lost]]

The refund workflow, walked through

A request lands in your tracker: add an approval workflow for customer refunds. Looks like an afternoon. It isn't.

Seven questions hide inside it:

Who can approve a refund — role, team, dollar threshold?
Does the amount change the rule — manager for $50, director for $500, CFO for $50k?
Can the same person create the request and approve it?
What happens when the payment provider is unavailable — queue, retry, fail loud, refund anyway?
Must the system keep an audit record — of the request, the approval, the reversal, all three?
Can an approval expire — and if so, does the customer get re-prompted or silently dropped?
Who can read the audit log — and is it the same set who can refund?

A coding agent will guess. It will pick reasonable defaults — a single approved_by column, a boolean flag, a try/catch around the Stripe call. The PR will compile. The tests will pass. Six months later an auditor asks for the trail, and the trail is one nullable column with no actor history. The "fast" decision just cost you a quarter.

This is what the original essay, "Coding Agents Are Fast. Architecture Is Still Slow.", means when it says an agent can build the wrong system faster than your team can understand it. Speed without the boundary-setting step is a faster way to ship technical debt.
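Once the team has answered those questions, the rules themselves are small enough to write down. The following is a minimal sketch, assuming three roles and the dollar tiers used as examples in the questions above; the type names, the `canApprove` function, and the limits are hypothetical, not a real policy.

```typescript
// Hypothetical roles and limits; the tiers mirror the example questions,
// not any real refund policy.
type Role = "manager" | "director" | "cfo";

interface RefundRequest {
  id: string;
  createdBy: string; // actor id of the person who opened the request
  amountCents: number;
}

interface Approver {
  id: string;
  role: Role;
}

// Largest refund each role may approve, in cents (assumed tiers).
const APPROVAL_LIMIT_CENTS: Record<Role, number> = {
  manager: 50 * 100,
  director: 500 * 100,
  cfo: 50_000 * 100,
};

type Decision = { allowed: true } | { allowed: false; reason: string };

function canApprove(request: RefundRequest, approver: Approver): Decision {
  // Separation of duties: the creator can never approve their own request.
  if (request.createdBy === approver.id) {
    return { allowed: false, reason: "creator cannot approve own request" };
  }
  // The amount changes the rule: each role has its own ceiling.
  if (request.amountCents > APPROVAL_LIMIT_CENTS[approver.role]) {
    return { allowed: false, reason: `amount exceeds ${approver.role} limit` };
  }
  return { allowed: true };
}
```

The value of writing it this way is that every branch corresponds to a decision a human made. An agent can generate this function in seconds, but only after someone has told it what the limits are and who may not approve what.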

How to actually use coding agents without shipping the wrong thing

The pattern that works is boring and worth repeating: let the agent own the loop, keep the boundary-setting in human hands. Concretely, a single file in the repo that every agent reads first:

# ARCHITECTURE.md — these rules are not negotiable.

## Money-movement endpoints
- Must write to an append-only `audit_log` table before returning 2xx.
  Schema: `id`, `actor_id`, `action`, `before`, `after`, `occurred_at`.
- No component may both create and approve the same business object.
  Two distinct roles in the auth model, enforced at the endpoint.

## External calls
- Stripe, banks, shipping APIs: wrapped in a retryable, idempotent job.
  The HTTP request returns 202, not 200. The job worker owns the result.

## Schema changes
- Ship in two PRs: the additive one first, the destructive one after
  the new code is live. Never break a read in a single deploy.

A few dozen lines at most. The part that doesn't change when the model does. Read it before the agent runs. Update it when a decision actually changes. Hand it to every new agent the same way.
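To make the first rule concrete, here is a rough sketch of what "append-only, written before returning" can look like in code. The field names follow the schema in the file above; the in-memory class, the `approveRefund` handler, and the action string are assumptions standing in for a real table and endpoint.

```typescript
// Field names follow the audit_log schema above; the in-memory store is a
// hypothetical stand-in for a real append-only table.
interface AuditEntry {
  id: number;
  actor_id: string;
  action: string;
  before: unknown;
  after: unknown;
  occurred_at: string; // ISO 8601 timestamp
}

class AppendOnlyAuditLog {
  private entries: AuditEntry[] = [];

  // The only write path: there is deliberately no update or delete method.
  append(entry: Omit<AuditEntry, "id" | "occurred_at">): AuditEntry {
    const stored: AuditEntry = Object.freeze({
      ...entry,
      id: this.entries.length + 1,
      occurred_at: new Date().toISOString(),
    });
    this.entries.push(stored);
    return stored;
  }

  // Readers get a copy of the list, so callers cannot rewrite history.
  history(): readonly AuditEntry[] {
    return [...this.entries];
  }
}

// Hypothetical handler: the audit write happens before the status is returned.
function approveRefund(log: AppendOnlyAuditLog, actorId: string, refundId: string): number {
  log.append({
    actor_id: actorId,
    action: "refund.approve",
    before: { refundId, status: "pending" },
    after: { refundId, status: "approved" },
  });
  return 202; // accepted; a job worker owns the provider call
}
```

In a real system the same guarantee would come from database permissions (insert-only grants) rather than from a class, but the shape of the contract is the same.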

Three habits that compound:

# 1. Keep the doc in the repo root, not a wiki
ls ./ARCHITECTURE.md   # every agent reads this first

# 2. Block merges on the boundary checks, not just the tests
# (CI rule: every PR touching routes/refunds/* must reference an
#  audit_log write — grep it, fail the build if missing)

# 3. Always run the agent on a feature branch, never on main
git checkout -b feature/refund-approval
claude "implement the refund approval workflow per the Money-movement endpoints section of ARCHITECTURE.md"
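The second habit, the boundary check, can be sketched as a small function. This is an illustration under assumptions: the `ChangedFile` shape and function name are hypothetical, and a real gate would read the PR's changed files from git rather than take them as an argument.

```typescript
// Hypothetical check: the path prefix and the "audit_log" marker mirror the
// CI rule described in the comments above.
interface ChangedFile {
  path: string;
  content: string;
}

// Returns the refund route files that never mention audit_log.
function findBoundaryViolations(files: ChangedFile[]): string[] {
  return files
    .filter((f) => f.path.startsWith("routes/refunds/"))
    .filter((f) => !f.content.includes("audit_log"))
    .map((f) => f.path);
}
```

A text match like this is deliberately coarse. It cannot prove the write happens before the response, only that someone thought about it, which is still enough to stop a PR that skipped the rule entirely.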

The speed is still real. The agent still ships the endpoint, the migration, the tests, the PR. The difference is the boundary was set first, so what ships is the system you meant.

The durable layer underneath the agent churn

Here's the part that doesn't change when the model does: the decisions themselves. Which component owns the refund data. Which role can approve. What the audit table looks like. How failure surfaces to the user. These are choices your team will live with for years. The agent — whichever one — will churn. The boundaries shouldn't.

This is also the part that doesn't fit in a prompt. It doesn't fit in a context window. It fits in a codebase that has already made the calls and shipped the conventions: one API surface for the same component on web, iOS, and Android; one auth model that enforces separation of duties at the endpoint; one data layer that owns the audit table by construction. When that baseline exists, every coding agent that runs against it produces code that's consistent with the last six months of decisions — not just the last six minutes of context.

The risk isn't using the agents. The trap is using them against an undefined system — where every PR reinvents the boundary, every endpoint invents its own auth check, every component invents its own data model. That's the technical-debt factory, and it now runs at agent speed.

Set the boundaries once, in a place the agents can read. Let the agents run fast inside them.

What this gets you

A team that uses coding agents well looks like this: the architectural doc lives in the repo, the conventions are baked into the shared component and data layer, and the agent loop runs on top. A refund approval ships in an afternoon, and the audit table is already there because the data layer made it impossible not to be. A new engineer clones the repo on Monday and ships a compliant endpoint on Wednesday, because the boundaries are visible in the code, not tribal knowledge.

That's the payoff. Speed at the surface, durable decisions underneath. The model will keep changing. The boundaries shouldn't.

Supabase Evals: Benchmarking AI Agents on Real-World Tasks

Dave Kurian — Sat, 01 Aug 2026 18:04:30 +0000

Why this benchmark matters more than it looks

For the last eighteen months, "AI shipped a feature" has meant the LLM emitted code that compiled and looked plausible. It did not mean the schema actually ran. It did not mean the Edge Function returned what it claimed. It did not mean the RLS policy only let through the rows the spec required. Most agent evals were vibes — a human reading diffs and going "yeah, that looks right."

Supabase's open-source Supabase Evals framework replaces that. The benchmark boots a hosted-like stack and a local CLI project inside containers, points Claude Code, Codex, or OpenCode at real Supabase tasks (build a schema, debug a failed Edge Function, fix a broken RLS policy), and grades the result against the running system, not the diff. It is Apache-2.0 licensed, runs locally, and already powers the public leaderboard at supabase.com/evals plus an internal regression suite that's monitored daily.

That grading step is the part that matters. A scoring mechanism that executes the agent's code against a real Supabase stack and gates on it is a different category of confidence than a benchmark that grades text.

[[COMPARE: text-based scoring vs execution-based scoring]]

The harness: three dimensions, two suites

The framework defines every scenario along three axes, and they're worth memorising because they show what "coverage" actually means here:

Products: database, auth, storage, edge-functions, realtime, cron, queues, vectors, data-api
Topics: RLS, security, migrations, SQL, SDK, observability, self-hosting, tests, declarative-schema
Stages: build, deploy, investigate, resolve

Scenarios were picked to touch each (product, topic, stage) combo at least once — the smallest set that covers the surface — and they were grounded in real support tickets, bug reports, and GitHub issues. That design choice is doing real work: the eval isn't synthetic, it's the bugs the team has actually had to fix.
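One way to picture the coverage idea is a check that reports which values on each axis no scenario touches yet. The axis values below are copied from the lists above; the `Scenario` shape and the `uncovered` helper are hypothetical and are not the harness's own code. This sketch checks coverage per axis value, which is the weaker reading of "covers the surface"; checking every full combination would need a larger scenario set.

```typescript
// Axis values copied from the lists above; everything else is assumed.
const AXES = {
  product: ["database", "auth", "storage", "edge-functions", "realtime", "cron", "queues", "vectors", "data-api"],
  topic: ["RLS", "security", "migrations", "SQL", "SDK", "observability", "self-hosting", "tests", "declarative-schema"],
  stage: ["build", "deploy", "investigate", "resolve"],
} as const;

type Axis = keyof typeof AXES;
type Scenario = { name: string } & { [A in Axis]: (typeof AXES)[A][number] };

// Returns, for each axis, the values that no scenario touches.
function uncovered(scenarios: Scenario[]): Record<Axis, string[]> {
  const result = {} as Record<Axis, string[]>;
  for (const axis of Object.keys(AXES) as Axis[]) {
    const seen = new Set<string>(scenarios.map((s) => s[axis]));
    const values: readonly string[] = AXES[axis];
    result[axis] = values.filter((v) => !seen.has(v));
  }
  return result;
}
```

An empty list on every axis means each product, topic, and stage appears in at least one scenario.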

They split the scenarios into two suites with very different jobs:

Suite	Purpose	Published	Refresh
Benchmark	Breadth across the dimension matrix	Yes	Stable
Regression	Known failure modes	Internal	Daily

The regression suite is what powers daily internal monitoring and SDK release gating. The benchmark suite is what the public leaderboard shows.

What the scoring actually does

Every scenario is a directory with three things: a task file plus frontmatter, the scorer, and optional setup / seed starting states. When you ship a workspace, the framework boots a Docker sandbox with the real CLI installed and exposes a Management API-compatible surface backed by Postgres. The agent calls the actual MCP server and CLI — no mocks, no stubs, no "we simulated the response."

Scoring is two passes:

// scoring.ts: illustrative shape, see the harness for the exact API.
// runChecks, llmJudge, and combine stand in for harness internals.
export const grade = async (task, agentOutput, trace) => {
  // Pass 1: deterministic. Did the SQL parse? Does RLS deny the rows
  // that should be denied? Did the Edge Function return the schema it
  // claimed to return?
  const deterministic = await runChecks(task, agentOutput)

  // Pass 2: LLM-as-a-judge on the intent match, given the agent's trace.
  const judge = await llmJudge(task, agentOutput, trace)

  return combine(deterministic, judge)
}

Agents get one retry before grading — a useful concession, because real engineering involves iteration and penalising a single bad first try would reward agents that wrap themselves in two-line "looks done" outputs.
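How the two passes are merged is not spelled out here, so the following is only one plausible rule, sketched under assumptions: deterministic checks act as a hard gate and the judge score matters only once every check has passed. The result shapes, the `combine` signature, and the 0.7 threshold are all hypothetical.

```typescript
// Everything here is hypothetical: the real harness defines its own result
// shapes and its own combination rule.
interface CheckResult {
  name: string;
  passed: boolean;
}

interface Grade {
  passed: boolean;
  failedChecks: string[];
  judgeScore: number;
}

// Deterministic checks veto; the judge score (0 to 1) is only a second gate.
function combine(checks: CheckResult[], judgeScore: number, minJudgeScore = 0.7): Grade {
  const failedChecks = checks.filter((c) => !c.passed).map((c) => c.name);
  const passed = failedChecks.length === 0 && judgeScore >= minJudgeScore;
  return { passed, failedChecks, judgeScore };
}
```

The design choice worth noticing is the ordering: a persuasive explanation from the agent cannot rescue a run where the policy still leaks rows.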

What the leaderboard already shows

The early findings are encouraging in a way that's worth stating plainly. Agents pass most scenarios with no skill loaded. In the Build stage, Opus 5 and Kimi K3 both scored 100% unaided. That's a real signal that the model layer has caught up to routine Supabase work without bespoke scaffolding.

The work that still needs skills is the gap: Investigate and Resolve — the stages that require reading logs, following a chain of side effects, or understanding why a deploy broke. The benchmark publishes which categories each agent wins and which it doesn't, which is the part that matters if you're picking an agent for a regulated backend where a wrong investigation is the bug.

How to run it today

The framework runs locally today under Apache-2.0. Three things the release confirms you need:

Prerequisites:

Docker daemon running
Provider API keys for whichever agents you want to grade (Anthropic for Claude Code, OpenAI for Codex, the OpenCode provider for OpenCode)
Ports 54321–54329 free

Each eval directory is a self-contained scenario: the task plus its frontmatter, the scorer, and optional setup / seed files for the starting state. Shipping a workspace, or declaring one, boots a Docker sandbox with the real CLI installed.

A typical run:

# 1. Boot the local Supabase-like stack in containers
docker compose up -d

# 2. Run a scenario against an agent.
# (Exact CLI flags depend on the harness version; check the repo's --help.)
docker compose exec harness run \
  --suite benchmark \
  --scenario rls/broken-policy \
  --agent claude-code

The runtime exposes a Management API-compatible surface, so the agent calls the real MCP server and the real CLI throughout. The agent's final state is graded by the directory's scorer.

Wire the result into CI as a release gate:

# Pattern: same shape as a migration-test or type-check gate
- name: Run Supabase Evals regression suite
  run: harness run --suite regression --agent claude-code

- name: Block release if pass rate drops
  run: |
    RATE=$(jq -r '.aggregate.pass_rate' runs/latest/report.json)
    test "$(echo "$RATE > 0.95" | bc -l)" -eq 1

That last block is the part that matters: the eval suite becomes a release gate the same way a migration test does. If Claude Code's score on database-migration scenarios drops 10 points after a docs edit, your edit regressed agent behaviour — and the bench caught it before the SDK shipped.

What this enables in practice

Four patterns become cheap once the bench exists:

Regression-testing docs and skill edits. When you change a Supabase skill file or a docs page, the regression suite tells you whether your edit broke an agent's ability to solve a task it previously solved. That's the loop you've been missing.
Gating SDK releases. Before shipping a Supabase SDK that changes how agents call it, run the benchmark. If Codex's score on Edge Functions drops 15 points, your release regressed agent behaviour.
Comparing harnesses head-to-head. Same scenarios, two harnesses, same judge. The diff in report.json tells you which harness is buying you which capability.
Regulated backends. Fintech and healthcare can't ship a wrong RLS policy — it's a security incident. A bench that scores against a real Postgres with real RLS evaluation is the difference between "the agent said it did the right thing" and "the policy actually denies what it should."
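The first three patterns all reduce to the same comparison: two sets of per-category pass rates and a tolerance. A minimal sketch, assuming a report that maps category names to pass rates in percent; the `Report` shape, the function name, and the 10-point default are illustrative and not the format of the harness's report.json.

```typescript
// Hypothetical report shape: category name -> pass rate in percent.
type Report = Record<string, number>;

interface Regression {
  category: string;
  baseline: number;
  current: number;
  drop: number;
}

// Flags every category whose pass rate fell by at least maxDropPoints.
function findRegressions(baseline: Report, current: Report, maxDropPoints = 10): Regression[] {
  const out: Regression[] = [];
  for (const [category, base] of Object.entries(baseline)) {
    const now = current[category] ?? 0; // a missing category counts as a full drop
    const drop = base - now;
    if (drop >= maxDropPoints) {
      out.push({ category, baseline: base, current: now, drop });
    }
  }
  return out;
}
```

Run it with the report from before a docs edit as the baseline and the report from after as current; a non-empty result is the signal to block the release.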

The layer underneath the layer

Supabase Evals measures whether an agent shipped correct backend work. That's half the problem.

The other half is the UI layer. When an agent generates a UI from a description, you don't want the web version to use one component, iOS to use a different one, and Android to ship a third — and you don't want the agent to have to learn three frameworks to keep them aligned. The OTF Kit gives the agent a single component contract: the same <Card>, <Button>, <Sheet> ships on web, iOS, and Android from one API. The agent writes against the contract once; the contract compiles to three targets. The durable part is the contract — the model swaps out underneath you and the UI stays coherent.

Use Supabase Evals to grade the backend work the agent does. Use the OTF Kit to keep the UI work portable across platforms. Two different layers, both of which have to hold for the ship to count.

What it doesn't do yet

Three honest limitations from the current release:

Local-stack only. You boot the bench on your laptop; it doesn't run as a managed service. That's by design — the scenarios need a real CLI and a real MCP server — but it means you can't bolt it onto a SaaS dashboard with one click.
API keys and ports. Every agent run needs the provider's API key, and the local stack needs ports 54321–54329 free. CI runners need to be configured for both; there's no managed secrets plumbing yet.
The regression suite is curated. The benchmark suite covers the dimension matrix; the regression suite is the team's known failure modes and refreshes daily. If your team's failure modes aren't yet represented, you'll need to add scenarios yourself — which is exactly the point of Apache-2.0, but it's still work.

The roadmap signal in the project is toward more scenarios and richer judge prompts. Both are the right next moves.

The takeaway

Supabase Evals is the first benchmark in this space that scores executing code against a real Supabase stack, not generated text against a vibe check. That's a real advance, and the Apache-2.0 + local-runnable shape means you can wire it into your release pipeline this week. The fact that Opus 5 and Kimi K3 both hit 100% on unaided Build tasks is the headline — agents have caught up to routine backend work without bespoke scaffolding — and the fact that Investigate and Resolve still need skills is where the next round of work lives.

The benchmark is the part that measures the agent. The component contract is the part that holds the UI together when the model swaps out underneath you. Use both.

Cognizant and OpenAI Unite for Global Codex Hackathon in India

Dave Kurian — Sat, 01 Aug 2026 15:09:02 +0000

Cognizant just put 10,000 of its own engineers in front of OpenAI Codex across six Indian cities in a single hackathon. One event, six cities, one theme — "Engineering the Frontier" — and a workforce large enough to make the platform question real. The honest read: this is the first public signal that Codex is being treated as a production surface inside an enterprise, not a demo toy.

For most engineering teams, AI coding tools live in a weird half-world. They make slick Loom demos and one-engineer wins, but they rarely cross the chasm into the work a 10,000-person org can actually use. Cognizant just dragged Codex across that chasm in one afternoon. The result, per the company: a hackathon spanning Chennai, Hyderabad, Bengaluru, Pune, Coimbatore, and Kochi, with participants collaborating across cities, attending OpenAI masterclasses, and presenting their prototypes to business and technology leaders at the end.

That's the interesting part. Not the marketing tagline. The fact that someone decided Codex was learnable enough — and stable enough — to drop 10,000 people into at once.

The scale is the story, not the theme

10,000 is a number that flattens whatever skepticism you have about an enterprise AI event. A 50-person internal hackathon can hide bad tooling; a 10,000-person one cannot. If the platform isn't learnable in an afternoon, you get 10,000 confused engineers and zero follow-through. If it is, you get a permanent shift in what your org reaches for first when a new problem lands.

"Engineering the Frontier" is a slogan. The operational reality underneath it is harder: every participant got hands-on Codex time, exclusive OpenAI masterclasses, curated learning resources, and an audience of business and technology leaders at the end. That's a real production line for AI literacy — not a slide deck.

What 10,000 builders do to a platform is what load testing does to a database. They surface every rough edge, which is why the scale matters more than the theme.

[[COMPARE: 50-person internal hackathon vs 10,000-person enterprise deployment]]

Codex, as named in the article, is what 10,000 builders actually got

The article names OpenAI Codex directly — calling it "one of the industry's most advanced AI coding platforms" — and says participants used it for hands-on problem solving, prototyping, and collaborative sessions. That's the surface that matters: not "AI features," but the specific capability of turning intent into code in a way a non-CS-trained associate can pick up in an afternoon.

The implication for the rest of us is straightforward. If a 10,000-person org can drop people into Codex and get working prototypes out the other side, the on-ramp is shorter than the hype cycle suggests. The risk is treating that on-ramp as the whole job. It isn't. A prototype that lives in a hackathon is a prototype. A prototype that ships to users is a product. Those are very different problems.

How to actually try Codex today

The article doesn't publish Codex setup details — and we won't invent them. What it does confirm is that Codex was deployed as the production-grade surface for real engineering work inside a 10,000-person org. That's enough signal to justify an afternoon.

A pragmatic first session looks like this:

# Codex CLI install + login — see OpenAI's own docs for the current shape.
cd your-repo
codex "add an /api/health endpoint that returns db ping + uptime"

That's the smallest loop that resembles what a hackathon participant actually ran — a real problem, a real codebase, a Codex-shaped assistant in the middle. Run that for an afternoon and you'll know whether Codex fits your workflow the same way the Cognizant cohort did. Treat any specific "X× faster" number cited from the event as marketing — the article doesn't publish accuracy benchmarks, prototype counts, or production-conversion rates, and neither will we.

What an enterprise-grade hackathon reveals about the AI tooling stack

Three things have to be true for a 10,000-person hackathon to ship value, and they're true regardless of which model wins the news cycle:

The tool has to be learnable in hours, not weeks. A 10k-person pilot won't survive a long on-ramp.
The platform has to be deep enough to support real work — code generation that holds up to a code review, not just a chat box.
The durable layer underneath has to survive the model churn. Whatever UI, auth, and component primitives your prototypes lean on will outlive whichever model shipped this quarter.

The first two are Codex's job. The third is the part that doesn't change when the model does — and it's the part most hackathon recaps skip.

The durable layer underneath the tool churn

A 10k-person hackathon produces a flood of prototypes. Most die on the vine for boring reasons: the team spent 60% of their time re-implementing the same button on three platforms, the prototype that won the demo breaks the moment someone asks "does this work on a real iPhone?", and the web build drifts from the native build the day after the hackathon ends.

That's the gap between "we built a thing" and "we shipped a thing." The tool that built the prototype — Codex, Cursor, Claude Code, Bolt, v0, Lovable — is interchangeable. The layer that ships the result is not.

When a hackathon winner has to graduate to production, the durable layer underneath is the same component looking and behaving the same on web, iOS, and Android through one API. Not three implementations. Not a translation step. One component, three platforms, zero re-writes. That's what carries the prototype across the chasm that 10,000 builders just jumped over.

What this gets us

Three takeaways worth keeping:

Scale is the new credibility floor. When an enterprise runs a 10,000-person hackathon, the platform underneath has cleared a bar a 50-person pilot never could.
The on-ramp is shorter than the hype. If 10,000 engineers reached working prototypes within a single event, your own on-ramp is probably shorter than you expect. Spend an afternoon with the Codex CLI before you decide whether it fits your team.
The durable layer is the lock-in. The model will churn. The UI layer, the auth layer, the cross-platform primitives underneath the prototype — that's what carries the work to production. Build prototypes freely; lock the foundation in.

A 10,000-person hackathon doesn't prove Codex is the final answer. It proves the question — "can an enterprise actually use this?" — is no longer hypothetical. The next 12 months will tell us which parts of the stack underneath Codex are durable and which were built for the demo. Build the durable parts first.

Unify AI Reasoning with UI: Smoothly Integrate Ontologies with OTF Kits

Dave Kurian — Sat, 01 Aug 2026 14:05:20 +0000

Your agent doesn't know what a Booking is. It can guess — and usually guess well — but the moment it composes a tool call, validates an action, or renders a result for a human, the guess is the bug.

An ontology gives the model a typed contract of your domain: what exists, what relates to what, what's allowed. This post also explains why bolting one onto a normal app goes badly, and how to make it stick this time.

The trouble starts when you wire an ontology into a regular codebase. You end up maintaining two contracts: one for the agent (RDF triples, JSON-LD contexts, SHACL shapes) and one for the UI (props, components, validation). Drift is inevitable. The agent produces a Reservation, the API exposes a Booking, the UI expects a ReservationViewModel. Six months in, half your team is writing adapters between layers that were supposed to agree. The schema isn't the problem; having two schemas is.

This is the thesis: the schema that grounds your agent should be the same file that types your UI. When that's true, the ontology stops being a research artifact and starts being a product surface.

What an ontology actually buys your agent

A language model treats every prompt as a fresh string of tokens. It has no first-class notion of types, identity, or relations — only statistical resemblance. An ontology hands it something it can lean on: a vocabulary with explicit semantics, a graph of typed relations, and a set of constraints.

Concretely:

{
  "@context": {
    "@vocab": "https://schema.otf-kit.dev/booking#",
    "schema": "https://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "Customer": "schema:customer",
    "startsAt": { "@id": "schema:startDate", "@type": "xsd:dateTime" },
    "endsAt":   { "@id": "schema:endDate",   "@type": "xsd:dateTime" }
  },
  "@type": "Booking",
  "@id": "booking:42",
  "name": "Sauna slot — Friday",
  "Customer": { "@id": "person:7", "name": "M. Vargas" },
  "startsAt": "2026-06-05T18:00:00Z",
  "endsAt":   "2026-06-05T19:00:00Z"
}

Two things matter here. First, the @context declares the schema the agent must respect; that's the contract. Second, every field maps to a URI, which means a SPARQL endpoint, a SHACL validator, and your component layer can all consume the same shape without translation. The agent isn't inventing structure; it's filling in a typed form.

The win isn't "AI understands your data." The win is that the agent's mistakes become recoverable. A malformed startsAt fails SHACL. A missing Customer fails the schema. A hallucinated field is rejected before it reaches a user, provided the shape is closed (sh:closed true), since SHACL shapes are open by default.

Where it falls apart in a normal app

The pitch lands. Engineering starts. Three things break first:

Triple stores are slow at the read path. SPARQL endpoints and named-graph stores are great for federation and inference. They're miserable when a mobile client needs to render a list of 200 bookings in under 100ms.
The schema drifts. The agent's Reservation becomes the API's Booking becomes the UI's ReservationViewModel. Each layer invents its own type aliases because nothing forces convergence.
Mobile clients can't carry a graph. A 2 GB RDF dump is not a phone payload. You need a flat, denormalized projection that still respects the schema.

These aren't exotic problems — they're the same integration problems every typed system hits when it meets an untyped neighbour. The fix is the fix it's always been: pin the contract at the seam and make every consumer read from it.

The typed contract pattern: one schema, two consumers

Stop storing the schema in two places. Put it in one file and have both the agent and the UI read it. The agent reads it as a JSON-LD context for grounding; the UI reads it as the type system for components.

// schema.otf.ts — the file every layer imports
export const BookingSchema = {
  "@context": "https://schema.otf-kit.dev/booking",
  "@type": "Booking",
  fields: {
    id:        { type: "id",         required: true, uri: "schema:identifier" },
    name:      { type: "string",     required: true },
    customer:  { type: "ref:Person", required: true, uri: "schema:customer" },
    startsAt:  { type: "datetime",   required: true, uri: "schema:startDate" },
    endsAt:    { type: "datetime",   required: true, uri: "schema:endDate" },
    status:    { type: "enum",       values: ["pending","confirmed","cancelled"] },
  },
} as const;

The agent gets the @context and field map as system instructions. The component layer gets the same map as TypeScript types. The schema file is the source of truth; the build emits two artifacts from it — one for the model, one for the runtime.

[[DIAGRAM: schema.otf.ts → JSON-LD context for the agent, TS types for the UI, SHACL shape for validation — all three consumers reading one file]]

A SPARQL endpoint can still back the data; the UI never speaks SPARQL. The schema file becomes the projection specification — a typed view of the graph, not the graph itself.
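As a rough illustration of "one file, two consumers", the sketch below derives a TypeScript entity type and a JSON-LD term map from the same field definitions. It uses a trimmed copy of the field map so it stands alone; `TypeOf`, `EntityOf`, and `toJsonLdContext` are hypothetical helpers, not part of any published kit, and the derived type ignores the `required` flag for brevity.

```typescript
// A trimmed copy of the field map above, so the sketch is self-contained.
const BookingFields = {
  id:       { type: "id",       required: true,  uri: "schema:identifier" },
  name:     { type: "string",   required: true },
  startsAt: { type: "datetime", required: true,  uri: "schema:startDate" },
  status:   { type: "enum",     required: false, values: ["pending", "confirmed", "cancelled"] },
} as const;

type FieldDef = {
  type: string;
  required: boolean;
  uri?: string;
  values?: readonly string[];
};

// Compile-time consumer: enums become literal unions, everything else a string.
type TypeOf<F> = F extends { type: "enum"; values: readonly (infer V)[] } ? V : string;
type EntityOf<S> = { -readonly [K in keyof S]: TypeOf<S[K]> };
type Booking = EntityOf<typeof BookingFields>;

// Runtime consumer: emit a JSON-LD term map for every field that declares a URI.
function toJsonLdContext(fields: Record<string, FieldDef>): Record<string, string> {
  const context: Record<string, string> = {};
  for (const [name, def] of Object.entries(fields)) {
    if (def.uri) context[name] = def.uri;
  }
  return context;
}
```

Adding a value to the `status` list changes the `Booking` type and the emitted context in the same commit, which is the property the pattern is after.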

SHACL as the immune system

If the schema is the contract, SHACL shapes are the contract's immune system. They reject malformed graphs before they enter your pipeline. The point is to run validation at the boundary the agent crosses, not deep inside the app.

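@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix schema: <https://schema.org/> .
# Assumed: ex: is the same namespace as the @vocab in the JSON-LD context.
@prefix ex:     <https://schema.otf-kit.dev/booking#> .

ex:BookingShape a sh:NodeShape ;
  sh:targetClass ex:Booking ;
  sh:property [
    sh:path schema:startDate ;
    sh:datatype xsd:dateTime ;
    sh:minCount 1 ;
  ] ;
  sh:property [
    sh:path ex:status ;
    sh:in ( "pending" "confirmed" "cancelled" ) ;
  ] .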

The agent produces JSON-LD; a SHACL validator accepts or rejects it. The component layer trusts what the validator returns. Mobile clients carry only validated payloads — which is why a 200-row list stays small and predictable, around 200–400 tokens per record, and renders fast.
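For teams that want the same gate at the application boundary without running a SHACL engine, a hand-written check can mirror the shape. This is not SHACL and not a substitute for it: it is a minimal sketch, assuming the flat field names from the schema file, with a closed-world rule added so unknown fields are rejected. The function name and error strings are hypothetical.

```typescript
// Not a SHACL engine: a hand-written check that mirrors the shape's rules
// and adds a closed-world rule for unknown fields. Field names are assumed.
const ALLOWED_FIELDS = new Set(["id", "name", "customer", "startsAt", "endsAt", "status"]);
const STATUSES = new Set(["pending", "confirmed", "cancelled"]);
// Loose ISO 8601 check: date, "T", time, then "Z" or a numeric offset.
const ISO_DATETIME = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

// Returns a list of problems; an empty list means the payload is accepted.
function validateBooking(payload: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const key of Object.keys(payload)) {
    if (!ALLOWED_FIELDS.has(key)) errors.push(`unknown field: ${key}`);
  }
  const startsAt = payload["startsAt"];
  if (typeof startsAt !== "string" || !ISO_DATETIME.test(startsAt)) {
    errors.push("startsAt must be an ISO 8601 dateTime");
  }
  const status = payload["status"];
  if (status !== undefined && (typeof status !== "string" || !STATUSES.has(status))) {
    errors.push("status must be pending, confirmed, or cancelled");
  }
  return errors;
}
```

Returning every problem at once, rather than throwing on the first, gives the agent enough detail to repair its output in a single retry.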

This is where the win compounds: validation, type-checking, and rendering are all reading the same schema. There's no second source of truth to drift.

Cross-platform rendering: same entity, every surface

Once the UI binds to typed entities instead of freeform props, the cross-platform problem collapses. A Booking is a Booking on web, iOS, and Android — same field names, same validation, same status enums. The component layer reads the schema and renders the entity; the surface adapts.

<BookingCard
  booking={entity}                  // type: Booking (from schema.otf.ts)
  onCancel={() => mutate({ status: "cancelled" })}   // value from the schema's status enum
/>

There's no <BookingCardWeb> and <BookingCardNative> with different field expectations. When the schema gains a cancellationReason field, the type system flags every consumer — including the agent's tool definitions — and the component renders it without a separate PR to each platform. The same component name + props + look render on every surface from one codebase.

[[COMPARE: freeform props invented per platform vs schema-bound entities locked to one file]]

Where the durable layer lives

Here's the part that doesn't change when the model does. The schema file outlives any particular LLM. The component layer outlives any particular agent framework. The validation rules outlive any particular tool runner.

This is the seam: the agent is a consumer of the schema, the UI is a consumer of the schema, and the schema is the durable surface. Tooling churns — today's model, tomorrow's MCP server, next quarter's orchestrator — but the contract is the contract.

When the AI config lives next to the schema — the same place the component layer reads its types from — agents extend the kit instead of regenerating it. A documented CLAUDE.md, a .cursorrules, and a fixed set of tested ai/prompts/ give the agent the same grounding the schema gives the runtime. The agent and the app end up reasoning over the same world.

[[CONCEPT: the schema file as the single source of truth — one file, two consumers, zero drift]]

That's the part that doesn't churn. You swap the model, the schema stays. You swap the agent framework, the schema stays. The components render whatever the schema says is renderable, on whatever surface you ship to.

What this enables

Three concrete payoffs when the schema is the contract:

Agent mistakes fail loud, not silent. SHACL rejects a malformed startsAt before it hits a user. The UI never has to defend against bad data because bad data never arrives.
Mobile stays fast. A flat, validated projection of a booking is ~200–400 tokens. Carrying a graph is the alternative, and it doesn't fit on a phone — and a 100ms list render beats a federated SPARQL query every time.
Refactors touch one file. Adding cancellationReason updates the schema, the agent's tool definitions, the SHACL shape, and the component types in one move. The diff is local; the blast radius is one file.
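The "one file" claim is easier to see with a toy version. In this sketch a single field table feeds two consumers: a JSON-Schema-style parameter block for an agent tool definition, and a runtime check for missing required fields. The table format, `toToolParameters`, and `missingRequired` are hypothetical names; a real setup would more likely generate SHACL shapes and types in a build step.

```typescript
type FieldKind = "string" | "timestamp";

interface FieldSpec {
  kind: FieldKind;
  required: boolean;
}

type FieldTable = { [field: string]: FieldSpec };

// The single place a new field is declared.
const bookingFields: FieldTable = {
  id: { kind: "string", required: true },
  startsAt: { kind: "timestamp", required: true },
  cancellationReason: { kind: "string", required: false },
};

interface ToolProperty {
  type: string;
  format?: string;
}

// Consumer 1: JSON-Schema-style parameters for an agent tool definition.
function toToolParameters(fields: FieldTable) {
  const properties: { [field: string]: ToolProperty } = {};
  const required: string[] = [];
  for (const key of Object.keys(fields)) {
    const spec = fields[key];
    if (spec === undefined) continue;
    properties[key] = spec.kind === "timestamp"
      ? { type: "string", format: "date-time" }
      : { type: "string" };
    if (spec.required) required.push(key);
  }
  return { type: "object", properties, required };
}

// Consumer 2: a runtime check that reports missing required fields.
function missingRequired(fields: FieldTable, payload: { [field: string]: unknown }): string[] {
  const missing: string[] = [];
  for (const key of Object.keys(fields)) {
    const spec = fields[key];
    if (spec !== undefined && spec.required && payload[key] === undefined) {
      missing.push(key);
    }
  }
  return missing;
}
```

Adding `cancellationReason` was a one-line change to `bookingFields`; both consumers picked it up without edits of their own.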

The semantic web isn't a research pivot — it's a typed contract pattern that was waiting for agents. The hard part isn't the ontology. The hard part is making the schema the source of truth for every layer that touches it. That's the part most teams skip, and it's the part that costs them six months later.

Pin the contract. Let everything else churn.

Cursor enhances Teams plan with predictable pricing and new Premium seat options

Dave Kurian — Sat, 01 Aug 2026 13:05:27 +0000

Cursor’s 2026 Teams plan pricing update is a real step up for engineering and product teams that need predictable costs—but don’t want to pay for usage they don’t need. The big win: clean separation of usage pools for Cursor’s own models versus third-party APIs, and the launch of the Premium seat—a simple, no-surprise upgrade option for power users. The headline is cost clarity, not complexity, with improved admin control to match. Teams running LLM agents at production scale should actually welcome this: it means fewer budget “unknowns,” and more control, even as usage spikes or user profiles shift.

What are the main changes in Cursor Teams pricing for 2026?

The 2026 Cursor Teams update delivers several surgical changes aimed at predictability, not just optimization. Here’s what stands out:

Pool split: Usage for Cursor’s own models (“Composer and Auto”) is now tracked separately from third-party API usage. Before, all usage was metered in a single pool—now you know exactly who’s burning through what.
Premium seat: A new “Premium” option offers five times the usage of Standard, for three times the price ($96/seat/mo annual, $120/mo monthly), targeting high-intensity users and avoiding mid-cycle cost overages.
Standard seat increases: Standard seat users don’t get left out—usage allocation has gone up, at the existing price point ($32/mo annual, $40/mo monthly per seat).
Timing: Changes apply immediately for new customers and for all existing Teams orgs at their first renewal after July 1, 2026.
Composer 2.5 rollout: enables improved model performance (frontier quality) at reduced cost.

This is not a cosmetic patch. The pricing model is a realignment around clean boundaries (first-party vs third-party) and user roles (Standard vs Premium). For admins, this caps the budget risk and enables meaningful monitoring.

How does the usage pool separation affect Teams plan users?

The new dual-pool system splits usage as follows:

“Composer and Auto” pool: Usage from Cursor’s own models (Composer, Auto agents, now running on v2.5) burns from this pool.
“Third-Party API” pool: Usage from anything routed via external APIs (OpenAI, Anthropic, etc) routes against its own allowance.

Consider a real-world scenario:

Engineering leans heavily on first-party Composer for boilerplate generation and burns only the “Composer and Auto” pool.
Growth team prefers the latest GPT-4 or Claude—burns the “Third-Party API” pool.
Before: both teams could exhaust the total allowance, and a spike in third-party API usage could throttle core workflows.
After: teams don’t clobber each other. Consumption in one pool never eats into the other. Budget overruns are localized.
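A small model of the before/after scenario, as a sketch only. The pool names mirror the article; the abstract "units", the allowances, and the choice to simply refuse usage once a pool is exhausted are assumptions, since the post does not say how Cursor meters or handles exhaustion.

```typescript
type PoolName = "composerAndAuto" | "thirdPartyApi";

interface PoolState {
  allowance: number;
  used: number;
}

class UsagePools {
  private pools: { composerAndAuto: PoolState; thirdPartyApi: PoolState };

  constructor(composerAllowance: number, thirdPartyAllowance: number) {
    this.pools = {
      composerAndAuto: { allowance: composerAllowance, used: 0 },
      thirdPartyApi: { allowance: thirdPartyAllowance, used: 0 },
    };
  }

  // Records usage against one pool only; returns false when that pool is exhausted.
  spend(pool: PoolName, units: number): boolean {
    const state = this.pools[pool];
    if (state.used + units > state.allowance) return false;
    state.used += units;
    return true;
  }

  remaining(pool: PoolName): number {
    const state = this.pools[pool];
    return state.allowance - state.used;
  }
}

// Growth burns through the third-party pool; engineering's pool is untouched.
const orgPools = new UsagePools(100, 50);
orgPools.spend("thirdPartyApi", 50);
console.log(orgPools.remaining("thirdPartyApi"));   // 0
console.log(orgPools.remaining("composerAndAuto")); // 100
```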

For cost management, this means:

Admins can now track first-party and third-party consumption independently.
Surprises from one set of users won’t immediately threaten platform-level continuity.
Easier forecasting: actuals-vs-budget stays granular, and orgs can tune usage or seat allocation in one pool without disturbing the other.

The bottom line: not just more usage, but higher resolution on where usage is going (and why)—a requirement for keeping a team’s AI experiments sustainable as they grow.

What is the new Premium seat and who should consider it?

The “Premium” seat does one job: take cost ambiguity out of power-user workflows.

Usage: Comes with 5× the included usage of a Standard seat, covering roughly a full month of sustained, heavy usage.
Pricing: $96 per seat/month (annual), $120/month (monthly) — exactly triple the Standard seat on both cycles.
Predictability: Instead of unpredictable overages or ad-hoc usage upgrades, you lock in enough usage for your busiest engineers or PMs.
Who’s this for? The intended user is clear: heavy contributors running long context LLM completion, agent chains, or high-iteration prototyping—basically, anyone who has blown past their allowance more than twice in a quarter.
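The "more than twice in a quarter" rule of thumb is simple enough to script. This sketch assumes you already have per-user usage rows, one per billing period; the `UsagePeriod` shape and `premiumCandidates` name are hypothetical and not tied to any Cursor export format.

```typescript
interface UsagePeriod {
  user: string;
  used: number;      // usage consumed in one billing period
  allowance: number; // included usage for that period
}

// Returns users whose usage exceeded their allowance in more than `maxOverages` periods.
function premiumCandidates(quarter: UsagePeriod[], maxOverages: number): string[] {
  const overages: { [user: string]: number } = {};
  for (const period of quarter) {
    if (period.used > period.allowance) {
      overages[period.user] = (overages[period.user] || 0) + 1;
    }
  }
  const flagged: string[] = [];
  for (const user of Object.keys(overages)) {
    if ((overages[user] || 0) > maxOverages) flagged.push(user);
  }
  return flagged.sort();
}
```

Calling `premiumCandidates(rows, 2)` applies the article's threshold; lowering the second argument makes the filter more aggressive.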

Example scenario:

Standard seat engineering contributor averages 1x usage, spikes to 2x twice per quarter.
Premium seat covers those spikes in the base price for $768 more per user/year on annual billing ($64/month × 12), versus unpredictable monthly overage fees.

Why not just buy more Standard seats? Because Premium is cheaper per unit of usage: 5× the Standard allocation for 3× the price, so each unit of included usage costs 60% of what it does on a Standard seat. The pricing favors one heavy user over several light ones.
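The seat arithmetic is worth checking by hand. This sketch uses only the annual-billing prices and the 5× multiple quoted above; the `SeatPlan` shape and function names are made up for the example, and "usage unit" just means one Standard seat's monthly allocation.

```typescript
interface SeatPlan {
  label: string;
  monthlyPriceUsd: number; // price per seat per month
  usageMultiple: number;   // included usage relative to one Standard seat
}

const standardAnnual: SeatPlan = { label: "Standard (annual)", monthlyPriceUsd: 32, usageMultiple: 1 };
const premiumAnnual: SeatPlan = { label: "Premium (annual)", monthlyPriceUsd: 96, usageMultiple: 5 };

// Dollars per month for one Standard seat's worth of included usage.
function pricePerUsageUnit(plan: SeatPlan): number {
  return plan.monthlyPriceUsd / plan.usageMultiple;
}

// Extra yearly cost of the pricier plan for one seat.
function yearlyPriceDifference(cheaper: SeatPlan, pricier: SeatPlan): number {
  return (pricier.monthlyPriceUsd - cheaper.monthlyPriceUsd) * 12;
}

// Monthly cost of matching one Premium seat's usage with Standard seats instead.
function standardSeatsCostForSameUsage(standard: SeatPlan, premium: SeatPlan): number {
  const seatsNeeded = Math.ceil(premium.usageMultiple / standard.usageMultiple);
  return seatsNeeded * standard.monthlyPriceUsd;
}

console.log(pricePerUsageUnit(standardAnnual)); // 32
console.log(pricePerUsageUnit(premiumAnnual));  // 19.2
console.log(yearlyPriceDifference(standardAnnual, premiumAnnual)); // 768
console.log(standardSeatsCostForSameUsage(standardAnnual, premiumAnnual)); // 160
```

One Premium seat costs the same as three Standard seats, so on seat price alone it starts to pay off once a single person regularly needs more than 3× the Standard allocation.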

For finance and engineering managers: Premium seats give you a hard ceiling and floor for each user’s cost, so quarterly forecasting gets a lot easier.

What improvements come with Cursor Composer 2.5 in the Teams plan?

Cursor Composer 2.5 rolls out as part of this pricing update, and it matters for both technical value and ROI under the new seat structure.

Frontier model performance, cheaper: Composer 2.5 claims “frontier model performance at reduced cost”—translation: you get smarter, more capable auto-completions per request than on v2.0, but each request is cheaper to your usage pool.
Direct value boost: Since “Composer and Auto” usage is capped and tracked separately, Composer improvements mean more work done per unit allowance—especially relevant for Standard users hovering near historical usage ceilings.
Applies to all Tiers: Both Standard and Premium seats get access to 2.5, so nobody is left behind.
Rationalizes pool split: Because first-party usage is metered in its own pool, Composer's efficiency gains show up directly as extra first-party headroom instead of disappearing into a blended usage total.

Notably: for teams that can lean heavily on Composer (versus defaulting to external LLMs for every task), this is a double win—every dollar of seat cost buys more output.

[[IMG: Cursor Teams seat selector and usage pool dashboard, showing Premium + Standard mix with separate Composer vs Third-Party pools]]

How to optimize your Cursor Teams plan usage after the 2026 update?

This round of changes gives admins and finance teams more levers to tune, but also asks for tighter discipline. Here’s the operational playbook:

Track pools independently: Don’t aggregate. Use reporting to separately monitor “Composer and Auto” vs. “Third-Party API” usage.

# Example: export usage stats with pools separated
# (illustrative: confirm the command name and flags against Cursor's current admin tooling)
cursor-cli usage-report --split-pools --format=csv --out=/tmp/cursor-usage-2026.csv

Profile users—allocate Premium where justified: Identify heavy hitters before arbitrary overages hit.

// Sample: allocate a Premium seat programmatically
// (hypothetical admin client, shown for illustration; the admin panel offers the same change)
await cursor.admin.upgradeSeat({ userId: 'user42', tier: 'premium' });

Use Composer 2.5: Where possible, move workloads (long-form, code-completion, agent output) to first-party Composer, to stretch the capped “Composer and Auto” pool further per dollar.
Plan seat allocation ahead of renewal: With changes applying for cycles post-July 1, 2026, get a head count and seat mapping aligned early—no last-minute surprises in your July cloud bill.
Set up alerts/budgets per pool: Avoid “silent” overages by pushing pool-specific usage thresholds to Slack/email/webhooks, directly from billing/admin tools.
Look for admin UI improvements: Cursor notes “improved admin controls”—dig into new dashboards for real, actionable tracking. If tools are still rough, build your own CSV diff/export.
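If the built-in alerts turn out to be thin, a per-pool threshold check is a few lines. This is a sketch under assumptions: the `PoolUsageRow` shape is invented, and you would feed it from whatever usage export your admin tools actually provide before forwarding the messages to Slack or email.

```typescript
interface PoolUsageRow {
  pool: string;      // e.g. "Composer and Auto" or "Third-Party API"
  used: number;
  allowance: number;
}

// Returns one message per pool whose usage ratio is at or above the threshold (0..1).
function poolAlerts(rows: PoolUsageRow[], threshold: number): string[] {
  const messages: string[] = [];
  for (const row of rows) {
    if (row.allowance <= 0) continue; // skip rows without a usable allowance
    const ratio = row.used / row.allowance;
    if (ratio >= threshold) {
      messages.push(row.pool + " pool at " + Math.round(ratio * 100) + "% of allowance");
    }
  }
  return messages;
}
```

Keeping the check per pool, rather than on the blended total, is the point: a third-party spike trips its own alert without waiting for the combined number to move.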

Example renewal script (for July cycle):

# Allocate 5 Premium, 25 Standard seats for Q3-Q4, July 2026 renewal
cursor-cli plan-update --seats-premium=5 --seats-standard=25 --effective-date=2026-07-01

Tip: Don’t default all seats to Standard—mix, based on observed vs. forecasted load.

[[IMG: Cursor admin dashboard with seat tier mix, usage alerts, and pool-level trendlines]]

How to actually use this today

If you’re spinning up a new Cursor Teams org, these settings are live now. For existing orgs, new plans kick in the day your next billing cycle starts post-July 1, 2026. To experiment or simulate:

# Check your current seat breakdown and usage projections post-update:
cursor-cli plan-info --after=2026-07-01

# Upgrade a seat to Premium immediately:
cursor-cli seat-upgrade --user=jim --tier=premium

# Export split-pool usage data ($HOME instead of ~, which the shell won't expand after "="):
cursor-cli usage-report --split-pools --format=csv --out="$HOME/Downloads/cursor-usage.csv"

If you depend on third-party API models, expect to tune those pools sharply. If Composer does most of the lifting, Standard and Premium seats will stretch further per dollar, thanks to 2.5.

In summary: Cursor’s 2026 Teams pricing update does what most AI SaaS pricing changes only claim—it actually puts predictability and control into the hands of admins and finance leads. By separating first-party and third-party usage, boosting included usage for all, and making heavy use a bounded cost (not an unpredictable risk), teams finally get the tools for smarter, more stable AI adoption. Whether you stick with Standard or dial up Premium seats where intensity demands it, Cursor’s Composer 2.5 means every seat—especially when matched to your real load—delivers more. This is the rare pricing update that actually enables more output for less chaos.

Originally published at otf-kit.dev — full-stack kits your AI coding agent can actually ship to production. See the kits →

Cursor’s new Jira integration simplifies bug fixes with AI ticket assignment

Dave Kurian — Sat, 01 Aug 2026 12:11:02 +0000

Cursor’s Jira integration lands a technical trick that’s hard to dismiss: every Jira ticket becomes an actionable, context-rich AI prompt—no tab-jumping, no manual copy-paste, and no brittle handoffs. The AI agent plugs into issue assignment, yielding near real-time bug fixes or feature builds straight from standard Jira workflows. For teams dogged by context switching and ticket overhead, this is the rare smooth plugin that actually saves time, not just promises it. I put Cursor’s Jira tool through its paces on live codebases with real bugs and features. What follows is a testing-driven review: what worked, what’s rough, and how you can use the integration now.

What is Cursor’s Jira integration and how does it work?

Cursor’s Jira integration installs from the Atlassian Marketplace and—assuming you’re on a paid Cursor Teams plan—lets you assign tickets to an AI agent. In practice, this means any Jira issue (bug, story, or spike) can become the agent’s prompt and trigger action, no copy-paste or “please code this” handoff required. The AI agent reads ticket summary and description directly, extracting enough context to propose or push changes.

Setup merges with regular project practice: tickets are written as usual. Once integrated, assignment to the “Cursor” agent routes issues to the AI, which can triage, fix, or add features based on the ticket content. The key convenience: Cursor eliminates manual context stitching across tools, letting devs and managers treat Jira as the canonical task interface.

Playbook, in short: write a ticket, assign to Cursor, let the agent propose a fix or implementation. Cursor/Jira is not free; you’ll hit a paywall unless you subscribe to Cursor Teams. Jira itself will grant a 30-day free trial when you sign up—no credit card required. Cursor, by contrast, requires immediate subscription (just over $40 for 30 days as tested). The intended workflow: move the brainwork from manual assignment, context packaging, and ticket tracking into pure issue-writing that becomes AI-executable code changes.

How effective is Cursor’s AI agent in resolving Jira tickets?

I tested the integration on two different open-source repository clones (based on HTTPie), tracking four tickets: two well-defined (one bug, one feature), and two intentionally vague (again, one bug, one feature). The experiment was blunt: does ticket wording correlate with outcome, and does Cursor actually deliver “no-notes” solutions on clear tasks?

For clearly written bugs and features, Cursor’s AI was flawless—producing five-star-quality fixes and code additions on the first attempt, with no follow-up comments required. The AI agent parsed concise instructions ("fix 404 error in POST endpoint", "add export-to-CSV for reports") and shipped working pull requests that matched ticket scope and context. There were zero hallucinations and no misaligned fixes; code diffs reflected the asks precisely.

By contrast, vague tickets (“improve performance”, “implement alerts”) saw weaker results. The agent generated plausible but less useful changes: optimizations in non-critical code paths, and features missing acceptance criteria. In one case, a vague ticket led to a well-formed PR in the wrong area of the codebase, which shows the AI can only work with what it’s given. The one testing hiccup: after creating a ticket, I initially couldn’t “assign” it to Cursor within Jira. Instead, execution required kicking off the session in Cursor and referencing the Jira ticket explicitly by title, a gap that may tighten in future releases.

Empirically, output quality tracked ticket clarity. Clear tickets were done well; vague ones produced lower-value output, sometimes applied to the wrong code. On the two clear tickets I saw no errors: Cursor handled scope and context accurately when the information was there. Four tickets is a small sample, so read this as a signal, not an error rate.

What are the pricing and access considerations for Cursor’s Jira integration?

Cursor’s Jira integration isn’t free and can’t be accessed on Cursor’s personal or free tiers. Teams must subscribe to Cursor Teams, priced at just over $40 for a month (as of late May 2026). There is no Cursor free trial, so adopting the integration incurs real cost up front; that’s a recurring, non-trivial spend for anyone experimenting or running pilots.

Jira, on the other hand, poses less of a barrier: a new Jira account comes with an immediate, credit card–free, one-month trial by default. The integration itself (in the Atlassian Marketplace) is still a niche install, with only 548 total installs and no user reviews at the time of my test.

Cost calculus: if you’re a medium or large team that already pays for Cursor, this is a drop-in upgrade. But for smaller outfits wanting to experiment, the $40 minimum may be a meaningful hurdle. Given the quality of ticket handling for well-specified issues, the price may justify itself in reduced dev churn and lower overhead—assuming ticket volume or criticality is high.

How to set up and use Cursor’s Jira integration today?

Here’s the bootstrapping for teams ready to try Cursor’s AI ticket workflow:

Install the integration: Go to Atlassian Marketplace, search “Cursor Jira”, and add the integration to your Jira workspace.
Subscribe to Cursor Teams: Head to your Cursor dashboard, upgrade to Teams, and pony up ~$40 for the month.
Connect accounts: Authorize Jira within Cursor. Both UIs prompt for OAuth; follow onscreen steps to link.
Assign tickets: Create a new Jira issue as normal. Assign it to “Cursor” (or whatever identifier your agent has).
Kick off in Cursor: If “assign” doesn’t immediately trigger a session, open Cursor, reference the Jira ticket by its full title in a direct prompt (“can you read and fix this ticket in my Jira account: [ticket title]”).
Review agent output: Once the agent acts, review PRs/commits in your testing repo.
Iterate: For best performance, write explicit, actionable tickets—include code pointers, relevant context, and unambiguous acceptance criteria.

Optimize your issue-writing: vague tickets (“make it faster”, “improve UX”) got less actionable results; the more you write what you want, the better Cursor delivers. If the agent seems to miss tickets, check that the account permissions and links are up to date—Cursor support can walk you through error messages if stuck. [[IMG: Jira issue assigned to Cursor agent, highlighted in UI.]]

Workflow tips: rewrite loose stories with “As a user…, I expect…” syntax; copy code file links inline. For critical tasks, keep a human review before merge: Cursor nails clear fixes, but judgment on the final diff is still yours.
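Since ticket wording drove the results, a cheap pre-flight check before handing a ticket to the agent can help. The sketch below is a heuristic of my own, not part of Cursor or Jira: the `Ticket` shape, the keyword patterns, and the 40-character floor are all assumptions you would tune for your team.

```typescript
interface Ticket {
  title: string;
  description: string;
}

// Returns a list of problems; an empty list means the ticket looks specific enough to assign.
function lintTicket(ticket: Ticket): string[] {
  const problems: string[] = [];
  const text = ticket.title + "\n" + ticket.description;
  if (ticket.description.trim().length < 40) {
    problems.push("description is too short to act on");
  }
  if (!/(steps|repro|reproduce)/i.test(text)) {
    problems.push("no reproduction steps");
  }
  if (!/(expected|should|acceptance)/i.test(text)) {
    problems.push("no expected behavior or acceptance criteria");
  }
  return problems;
}

const vagueTicket: Ticket = { title: "Improve performance", description: "Make it faster." };
console.log(lintTicket(vagueTicket).length); // 3
```

A check like this will not judge whether the ticket is correct, only whether it carries the kinds of detail that the clear tickets in the test had.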

What are the limitations and future prospects of Cursor’s Jira integration?

Two limits stand out from real-world use. First, the integration’s quality drops for vague or under-specified tickets. The agent depends entirely on task clarity; there’s no magic “read the author’s mind” feature yet, so ambiguous bug reports still require human reasoning.

Second is price friction: at $40+ per month with no free trial, smaller teams could be locked out, limiting broader adoption unless Cursor introduces a lower-tier or volume discount.

Adoption is just starting—548 installs in the Atlassian Marketplace means the field is early, possibly underserved or undiscovered. That could change fast if word spreads or new features drop. Roadmap details weren’t posted at time of testing, but key next steps likely include better native assignment UX (removing the need to trigger runs from Cursor manually) and deeper feedback loops from code reviews to AI model tuning.

Compared to other Jira automation tools, Cursor’s edge is in full fix/build—writing and patching real code, not just routing or triage. Early outputs put it ahead of most conventional “AI runners” that merely suggest PR text or summary.

The long-term kicker: if adoption picks up, this plugs the AI gap for agile teams—real ticket throughput, less waste converting Jira stories to coder-readable tasks.

Why should development teams consider Cursor’s Jira integration?

Cursor’s Jira integration is a tactical win for high-volume dev squads. By making every clear ticket into instant, actionable work—no copy-paste, no pinging devs for context—it simplifies both bug fix and feature planning. Teams working in rapid sprints, Agile cycles, or distributed/remote settings save cycles otherwise lost to task triage, handoff, and context assembly.

Mental overhead drops: developers don’t have to mentally remap Jira issues to codebase navigation, and project managers can see feedback loop time shrink. If your workflow is Jira-centered and you manage dozens of tickets daily, the AI agent can scale hands-free triage and implementation, especially if ticket clarity is part of your process discipline. Remote teams, in particular, benefit: fewer context pings, less calendar overhead, and tighter async cycles.

How to use Cursor’s Jira AI today (concrete workflow)

The steps below are manual actions in the Jira and Cursor UIs, not shell commands; substitute your own tickets.

# Step 1: Install Cursor’s Jira integration from the Atlassian Marketplace
# Step 2: Upgrade to Cursor Teams ($40/month, cancel anytime)
# Step 3: Link Jira to Cursor within your Cursor dashboard
# Step 4: In Jira, assign a bug ticket to the “Cursor” agent
# Step 5: If needed, open Cursor and prompt:
#   “Can you read and fix this ticket in my Jira account: [ticket title]”
# Step 6: Review/pull the PR and merge if quality is sufficient.

Tips: Write clear, specific issue descriptions. For best results, use acceptance criteria/stories, not open-ended high-level requests.

[[IMG: Cursor dashboard showing Jira ticket executed and PR diff.]]

The bottom line: Is Cursor’s Jira integration worth it?

For teams willing to pay for simplicity, Cursor’s Jira integration is the rare AI coding agent that actually delivers five-star fixes—if you supply the right ticket precision. My tests confirm: clearly described bugs and features translate into working code, with less overhead and near-zero context loss. You’ll pay a premium for early access (and should run highly specific tickets for best effect), but if your workflow and volume justify the move, Cursor’s AI agent delivers a practical jump in developer velocity. Expect growing adoption as awareness builds—this integration already nails the main promise of AI-powered ticket-to-code, and the underlying Jira workflow stays put even as AI tools evolve.

Cursor's new Jira integration simplifies ticket handling with AI agents

Dave Kurian — Sat, 01 Aug 2026 11:08:24 +0000

The promise of Cursor’s new Jira integration is direct: assign a Jira ticket and an AI agent handles the rest. If you’re a developer tired of context switching, this is exactly what you want to hear. The integration claims to make the ticket itself the AI’s prompt: one click instead of endless copy-paste. This isn’t generic automation hype; it’s a rethinking of developer workflow. In this Cursor Jira integration review, we run real-world tests, see how it handles different ticket qualities, assess pricing, and give you an unvarnished path to real use. The workflow was actually run; the promises got a genuine workout.

What is Cursor’s Jira integration and how does it work?

Cursor's Jira integration connects your Jira issues directly to an AI agent that executes coding tasks, removing the need to shuttle context between tools. In practice, you install the Cursor Jira app from the Atlassian marketplace and link it to your Cursor Teams account. The entire model is built around the concept that “the ticket is the prompt”—assign a ticket, and the AI knows what to do.

As of its recent launch (late May 2026), adoption is nascent: at 7 p.m. Eastern, May 28, there were just 548 installs and zero reviews on the Atlassian marketplace. That’s not market dominance, but it’s an early signal, not a verdict. The vision is strong: instead of breaking focus to summarize the task for an AI assistant, your source of truth—Jira—becomes the input.

The practical upshot: less context lost in translation, more time spent on code. Cursor claims a frictionless workflow, and the integration is live for anyone with a paid Cursor Teams plan.

How easy is it to set up and use Cursor with Jira?

Setup is straightforward, assuming you have a Cursor Teams subscription (required; costs discussed below). Here’s the process as tested:

Start with Jira: Install the Cursor integration from the Atlassian marketplace. Upon sign-up, Jira gives you an automatic one-month free trial—no credit card. A rare hands-off trial for enterprise software.
Cursor Teams: No free trial, no exceptions. You’ll need to pay a bit over $40 for the necessary plan. Billing hits up front—don’t forget to cancel if you’re just testing.
Connect accounts: The linking process between Jira and Cursor is just a few clicks. No complicated configuration: once both accounts are authorized, you’re in.

User experience: assigning a ticket isn’t quite a one-click process (yet). In this test, assigning meant running a request in Cursor with a prompt like:

can you read and fix this ticket in my Jira account: [ticket title]

The “assign to AI” button in Jira doesn't exist yet; instead, you kick off processing inside Cursor, referencing the Jira ticket. One unexpected (but minor) pitfall: if you can't immediately assign or comment on the ticket, try reloading or issuing requests from the Cursor side.

Takeaway: Getting started is quick. The only friction is the Cursor Teams paywall—no free tier, no free trial—while Jira gives you a month to test. [[IMG: Step-by-step setup flow with both Jira and Cursor dashboards]]

Does ticket quality affect Cursor’s AI coding results?

Yes—decisively. Ticket clarity is the dominant variable in AI coding quality. The reviewer tested four tickets: two clear (well-specified) and two vague (underspecified), split between bug-fix and feature requests, across two HTTPie codebase clones:

Clone A: Clear bug-fix ticket, clear feature request.
Clone B: Vague bug-fix ticket, vague feature request.

Outcomes were stark. Clear tickets produced accurate, useful coding changes. When instructions were precise, the AI agent executed the intended correction or addition efficiently. In the case of the vague tickets, the agent faltered. Ambiguity led to incomplete fixes or generic code—useful only with significant manual intervention.

Ticket specification examples:

Good: “In the HTTP authentication module, fix the bug where POST requests drop the Authorization header when a 307 redirect is followed.”
Vague: “Fix authentication issues with POST.”

Results: On clear tickets, the AI’s code applied cleanly with minimal edits. On vague tickets, solutions were generic—partial at best, missing edge cases or nuance. You get out what you put in.

Troubleshooting: If the AI stalls, rephrase the ticket—add detail and concrete reproduction steps. A well-formed ticket is now both the roadmap and the spec.

Takeaway: Treat tickets as direct prompts. Precision pays; vagueness costs. [[IMG: Before-and-after of a clear JIRA ticket and AI-generated code PR]]

Is Cursor’s AI agent reliable for fixing bugs and adding features?

Cursor’s AI agent is reliable when the input (ticket) is reliable. In these tests, straightforward bug fixes and features were completed accurately with clear tickets. Both code diffs integrated without rework in the HTTPie clones. Performance for vague or open-ended tickets was weaker but not wholly unusable—the AI still attempted a solution, but coverage and correctness suffered.

Performance isn’t a question of speed—this is about rightness. When given all the necessary context (expected input, output, and edge cases), the Cursor agent patched bugs and scaffolded new features in a way that survived review. You can expect the agent to confidently handle rote or well-bounded tickets. For hand-wavy, poorly scoped tasks, you’ll do more cleanup than the code justifies.

Qualitatively, there was only one significant hiccup—possibly user error—related to how tickets are triggered. Once kicked off inside Cursor, the AI reads directly from the Jira description, executes the diff, and pushes it back for human validation.

Bottom line: the more explicit your ticket, the more you can trust the Cursor AI to deliver a viable PR. For routine work, it’s a legitimate time-saver.

What are the costs and subscription details for using Cursor with Jira?

Cursor Jira integration is strictly pay-to-play. You need a Cursor Teams subscription—tested at slightly above $40, with no free trial or grace period. This is a hard paywall; you pay before you test. By contrast, setting up Jira gives you an automatic one-month free trial (no credit card required) as soon as you sign up.

Tool	Free trial	Price
Jira	1 month, no credit card	Paid after the trial
Cursor Teams	None	~$40/month

Implications: If you’re a solo developer or a small team, the mandatory Cursor subscription is non-trivial. Budget accordingly—test within the first month for best ROI, and set a reminder for subscription cancellation. No free path means every experiment costs, but you’re spared invoicing friction for at least the first Jira month.

Advice: Sign up for Jira first, install the integration, and subscribe to Cursor only when you have tickets ready to run. Cancel within 30 days if you’re still evaluating for your workflow.

How to start using Cursor Jira integration today: step-by-step guide

Here’s a concrete, zero-fluff guide for deploying the Cursor Jira integration as tested:

Create or use an existing Jira account.
Install the Cursor Jira integration from the Atlassian Marketplace.
- Marketplace listing: search for “Cursor” within Jira apps.
Upon first install, activate your one-month free Jira trial (no credit card).
Purchase Cursor Teams—required for integration. Plan on paying ~$40/month.
Link your Jira and Cursor accounts.
- Expect a quick auth flow; minimal configuration.

Formulate your test tickets.

The best results come from clear, explicit tickets.
Example:

 Title: Fix 307 redirect bug in HTTP auth
 Description: POST requests with authentication drop the Authorization header when a 307 redirect occurs. Steps: 1) Login, 2) POST to /api/data, 3) Receive 307, 4) Follow to new endpoint, observe missing header.

Assign the ticket by running this (Cursor prompt):

   can you read and fix this ticket in my Jira account: [ticket title]

There is no button or menu item yet. Run this from within Cursor’s assistant/chat.
Review the AI-generated PR.
Evaluate diffs, run tests, merge if correct.
If output is faulty: rewrite the ticket with added context, retry.

Troubleshooting tips: If you can’t find a way to “assign” in Jira, start the workflow from Cursor instead, referencing the ticket title explicitly. Iteration—tighten your ticket, rerun.

Recommended: Write Jira tickets as though they’re meant for a new hire—detailed, stepwise, full of context. This maximizes AI value and limits rework.
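If you want every ticket to carry the same structure as the 307 example, a tiny template helper keeps authors honest. This is an illustrative sketch: `BugTicket`, `renderDescription`, and `cursorPrompt` are hypothetical helpers, and the prompt string is simply the one used in this test, not an official command.

```typescript
interface BugTicket {
  title: string;
  summary: string;
  steps: string[];
  expected: string;
}

// Renders the description in the "summary, steps, expected" shape used above.
function renderDescription(ticket: BugTicket): string {
  const numbered = ticket.steps.map((step, index) => String(index + 1) + ") " + step);
  return ticket.summary + " Steps: " + numbered.join(", ") + ". Expected: " + ticket.expected;
}

// Builds the chat prompt used in this test to hand the ticket to Cursor.
function cursorPrompt(ticketTitle: string): string {
  return "can you read and fix this ticket in my Jira account: " + ticketTitle;
}

const redirectBug: BugTicket = {
  title: "Fix 307 redirect bug in HTTP auth",
  summary: "POST requests with authentication drop the Authorization header when a 307 redirect occurs.",
  steps: ["Login", "POST to /api/data", "Receive 307", "Follow to new endpoint"],
  expected: "the Authorization header is still present.",
};

console.log(renderDescription(redirectBug));
console.log(cursorPrompt(redirectBug.title));
```

Forcing `steps` and `expected` to be filled in is the whole trick: the type makes it awkward to file the vague version of the ticket.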

OTF as the underlying layer

Integrations like Cursor’s Jira AI agent are raising the bar: they don't just automate code, they rearchitect the workflow. But trends and tools shift rapidly. If you want insurance against tool churn, OTF remains the foundation: decouple orchestration and credential handling from the AI vendor surface, so you can swap agents and integrations in and out without rewriting pipelines. Cursor’s approach (ticket as prompt, agent as executor) is the envelope, but durable API and secret-management layers keep your infrastructure adaptable.
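What "decouple orchestration from the vendor surface" can look like in code, as a rough sketch. Nothing here is an OTF or Cursor API: `TicketAgent`, `AgentRegistry`, and the stub adapters are hypothetical, and real adapters would call each vendor and pull credentials from your secret store instead of returning canned notes.

```typescript
interface TicketInput {
  key: string;
  title: string;
  description: string;
}

interface AgentOutcome {
  agentName: string;
  ticketKey: string;
  note: string;
}

// The only surface the pipeline depends on; vendors live behind it.
interface TicketAgent {
  agentName: string;
  handle(ticket: TicketInput): AgentOutcome;
}

class AgentRegistry {
  private agents: { [agentName: string]: TicketAgent } = {};

  register(agent: TicketAgent): void {
    this.agents[agent.agentName] = agent;
  }

  dispatch(agentName: string, ticket: TicketInput): AgentOutcome {
    const agent = this.agents[agentName];
    if (agent === undefined) {
      throw new Error("No agent registered as " + agentName);
    }
    return agent.handle(ticket);
  }
}

// Stub adapters standing in for real vendor integrations.
const cursorStub: TicketAgent = {
  agentName: "cursor",
  handle: (ticket) => ({ agentName: "cursor", ticketKey: ticket.key, note: "stub: would open a PR for " + ticket.title }),
};

const otherStub: TicketAgent = {
  agentName: "other-agent",
  handle: (ticket) => ({ agentName: "other-agent", ticketKey: ticket.key, note: "stub: would draft a patch for " + ticket.title }),
};
```

Swapping vendors then means registering a different adapter; the code that reads tickets and calls `dispatch` does not change.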

Cursor’s Jira integration nails the pitch for smooth handoff from project management to AI-powered implementation. In our hands-on review, setup was genuinely fast, with only the Cursor Teams paywall as the catch. AI coding was excellent on clear tickets, shaky on vague ones; bug fixes and small features sailed through with minimal human intervention, but success is tied to prompt (ticket) quality. Pricing isn’t for dabblers, but the frictionless workflow is real for those invested. If ticket-to-pull-request is your bottleneck, this is the cleanest leap yet.

Cursor's new Jira integration simplifies developer workflows with AI ticket handling

Dave Kurian — Sat, 01 Aug 2026 10:07:59 +0000

Cursor Jira integration review: this one nails the missing link between Jira ticketing and AI dev work. The integration hands off bug reports or feature requests, as written in Jira, straight to Cursor’s AI. No more copy/paste or context shuffling between apps. That’s a huge enable, especially for teams living in workflow tools. Below, I break down how Cursor’s Jira integration works, the real setup costs and friction, exactly how ticket clarity impacts AI code quality, and where this early tool excels or frustrates. If you need a durable path to automate the ticket churn, Cursor’s approach is seriously worth a look.

What is Cursor’s Jira integration and how does it work?

Cursor’s Jira integration connects your Jira board directly to the Cursor coding assistant. The pitch is almost brutally simple: assign a ticket and Cursor takes it from there. In effect, "the ticket is the prompt." Instead of summarizing or translating requirements between tools, you let the text of your Jira ticket guide what gets built or fixed. Cursor reads the ticket as structured input and tries to execute the desired change.

Here’s the flow:

The integration is available in the Jira/Atlassian marketplace (“Cursor for Jira”).
Once installed and connected, tickets can be routed through Cursor’s agent.
Cursor works off the live codebase, attempts the fix or feature, and outputs the commit.

Notably, this isn’t “AI commentary” — it’s real code, delivered in response to work items. No copying descriptions. No leaving Jira to kick off a fix. The value prop is eliminating those dead minutes of “translating the manager’s ticket into something a bot understands.” The ticket becomes the contract.
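To make “the ticket is the prompt” concrete, here is a minimal sketch of turning a ticket into agent input without paraphrasing it. The `JiraTicket` shape and the `ticketToPrompt` function are assumptions for illustration; the review does not document the payload Cursor actually receives.

```typescript
// Hypothetical shape; the real payload format is not documented in this review.
interface JiraTicket {
  key: string;
  summary: string;
  description: string;
  acceptanceCriteria: string[];
}

// Render the ticket as the agent's prompt, keeping the ticket's own wording intact.
function ticketToPrompt(ticket: JiraTicket): string {
  const lines: string[] = [
    "Ticket " + ticket.key + ": " + ticket.summary,
    "",
    ticket.description.trim(),
  ];
  if (ticket.acceptanceCriteria.length > 0) {
    lines.push("", "Done when:");
    for (const item of ticket.acceptanceCriteria) {
      lines.push("- " + item);
    }
  }
  return lines.join("\n");
}
```

Nothing in the function adds or interprets requirements, which is the point: whatever ambiguity exists in the ticket reaches the agent unchanged.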

Bottom line: Cursor’s integration offers a true “AI developer tools for Jira” bridge, letting teams test drive “ticket-to-code” without rebuilding pipelines.

How easy is it to set up Cursor’s Jira integration?

The practical friction: paywalls, install steps, first-time flow. Let’s get blunt. Getting started is straightforward, though gating exists on both sides.

Jira/Atlassian: Integration install is handled directly through the Jira/Atlassian marketplace (search “Cursor for Jira”). For new users, Jira’s standard free trial applies — no credit card, one month free on sign-up. That means you can test without risk on the Jira side.
Cursor: Here’s the rub. Cursor Teams is REQUIRED — no free tier, no trial. The article’s test run cost a little over $40 (for a month, as long as you cancel). You do need to commit upfront.
Install flow: Download the Jira plugin (about 548 installs as of evening, May 28, no ratings yet), grant the necessary permissions, and do a quick handshake in both Jira and Cursor. All told, the reviewer reports it as “very, very easy to set up”: just a handful of clicks, no deep config files, and you’re ready to assign.

Within Jira, once both sides are connected, you can start sending tickets for AI assignment. The first edge: most marketplace plugins want heavy setup; Cursor’s is lightweight, though you pay for the privilege.

Takeaway: onboarding is genuinely smooth, but the $40 Cursor Teams paywall is unavoidable and there’s no try-before-you-buy.

Does ticket quality affect Cursor’s AI performance?

Direct answer: yes — ticket clarity has a stark effect on what Cursor produces. The reviewer’s four-ticket test is instructive:

Scope: Two bug fixes, two feature adds.
Split: Half of the tickets were clearly written, half were vague.
Method: Tests run on two clones of the HTTPie open-source codebase.

Observations from the test:

Clear tickets (Clone A): For both a precise bug fix and a clear feature add, Cursor’s outputs were solid. The fixes matched the requirements described by the tickets.
Vague tickets (Clone B): With ambiguous tickets, results were predictably less reliable. The output quality dropped — not broken, but the AI’s “guess” sometimes missed the user intent. The line between “helpful assistant” and “needs hand-holding” depends on how exact the ticket is.

Example (paraphrased):

A ticket like “Fix error in response parsing when Content-Type is missing” produced a direct PR with the error handled.
A ticket like “App behaves weird sometimes” led to generic tweaks, not targeted bug isolation.

Success/failure by the numbers: With 4 tickets, the article’s tests hit the target on both clear tickets and struggled or underdelivered on both vague ones. That’s 2 of 4 on target, a 50% overall “feature completeness” rate, with every shortfall coming from an ambiguous ticket.

Net: If you want quality automated code, you need quality tickets. Sloppy tickets make for unreliable PRs, period.
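The arithmetic behind those numbers is small enough to write down. This sketch encodes the four outcomes as described in the review and computes hit rates overall and per clarity level. It counts “struggled or underdelivered” as a miss, which is a stricter reading than the review's own wording, and the `TicketOutcome` type and `hitRate` helper are illustrative names.

```typescript
type Clarity = "clear" | "vague";

interface TicketOutcome {
  kind: "bug" | "feature";
  clarity: Clarity;
  onTarget: boolean; // assumption: "struggled or underdelivered" counts as false
}

// Two bug fixes and two features, half clear and half vague, as in the review.
const outcomes: TicketOutcome[] = [
  { kind: "bug", clarity: "clear", onTarget: true },
  { kind: "feature", clarity: "clear", onTarget: true },
  { kind: "bug", clarity: "vague", onTarget: false },
  { kind: "feature", clarity: "vague", onTarget: false },
];

// Fraction of tickets that landed, optionally restricted to one clarity level.
function hitRate(items: TicketOutcome[], clarity?: Clarity): number {
  const subset = clarity === undefined ? items : items.filter((o) => o.clarity === clarity);
  if (subset.length === 0) return 0;
  const hits = subset.filter((o) => o.onTarget).length;
  return hits / subset.length;
}
```

With only four tickets this is an anecdote rather than a benchmark, but splitting the rate by clarity shows where the misses concentrate.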

What are the benefits and limitations of using Cursor with Jira?

Benefits:

No context switching: You stay in Jira. No toggling between tools or summarizing tickets for separate AI chat UIs.
Speed: Assign, wait, review — the feedback loop is minutes, not hours.
Integrative workflow: Tickets as prompt means the workflow isn’t piecemeal — bug and feature handling is “click, fix, done.”
Reduced friction: Eliminates low-value, error-prone task of copy/pasting ticket context.

Limitations:

Paywall friction: $40+ monthly cost for Cursor Teams, mandatory — and no free trial.
Marketplace lag: As of the test, only 548 installs, no user reviews. This isn’t a mature, vetted plugin yet.
Ticket quality dependency: If you write bad tickets, the integration’s magic falls down.
Narrow scope: Built for straightforward bugs/features — not speculative greenfield or deep architectural changes.

Cursor with Jira shines for well-scoped, repetitive bugs or improvements where managers and devs already speak Jira. In broader or more creative ticket payloads, you’ll hit limits. For teams drowning in rote maintenance tickets, it’s a meaningful speedup; for bleeding-edge work, less so.

How can developers use Cursor’s Jira integration today?

Getting started is a “five minute or bust” affair — assuming you have the budget.

Step 1: Sign up/in for required accounts

Jira: Create a Jira/Atlassian account (or use your org’s). The free trial starts instantly, zero payment info.
Cursor: Sign up for Cursor and select the Teams plan (no trial). Prepare the $40+ monthly charge.

Step 2: Install Cursor for Jira

Go to the Atlassian marketplace, search for “Cursor for Jira.”
Click “install” and grant requested permissions in your Jira instance.

Step 3: Connect Cursor

In Cursor’s dashboard, link to your Jira account using the OAuth flow or required API keys.

Step 4: Assign your first ticket

Create or pick a ticket (bug fix or feature) on your Jira board.
Assign it to the Cursor AI agent or, as reported, trigger via a prompt like:

can you read and fix this ticket in my Jira account: <ticket title>

Alternatively, comment on the ticket and have Cursor pick up the thread.

Step 5: Monitor auto-generated output

Cursor attempts to make a code change, using the ticket as the prompt.
Review the resulting pull request or commit — merge if correct, tweak if needed.

Step 6: Write better tickets for better results

Focus your Jira wording. “Fix the crash when saving large files in editor” works; “editor broken sometimes” doesn’t.
Use bullet points, concrete repro steps, and make outcome expectations explicit.
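If you want to enforce that advice before a ticket ever reaches the agent, a small pre-assignment check is one option. The sketch below is a heuristic, not a feature of Cursor or Jira: the `checkTicket` function, the vague-word list, and the “repro” and “expected” markers are all assumptions you would tune for your own team.

```typescript
interface TicketDraft {
  title: string;
  body: string;
}

interface TicketCheck {
  ok: boolean;
  problems: string[];
}

// Words that usually signal a symptom without a cause. The list is illustrative.
const VAGUE_WORDS = ["weird", "sometimes", "broken", "doesn't work", "issue"];

function checkTicket(draft: TicketDraft): TicketCheck {
  const problems: string[] = [];
  const text = (draft.title + " " + draft.body).toLowerCase();

  for (const word of VAGUE_WORDS) {
    if (text.indexOf(word) !== -1) {
      problems.push('vague wording: "' + word + '"');
    }
  }
  // Require some form of repro steps and an explicit expected outcome.
  if (!/steps to reproduce|repro/i.test(draft.body)) {
    problems.push("no repro steps");
  }
  if (!/expected/i.test(draft.body)) {
    problems.push("no expected outcome");
  }
  return { ok: problems.length === 0, problems };
}
```

A check like this will produce false positives, so it works better as a nudge in the ticket form than as a hard gate.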

For more hands-on guidance, see our Best Practices for Writing Effective Jira Tickets and our Guide to Debugging with AI Tools in Development Workflows.

If you want to standardize and futureproof your ticket-to-code flow, keep your prompt structure under control. This is one of those rare “AI for Jira” integrations that delivers — if you respect its contract.

The durable piece under the tool churn

If you build culture and code process around “ticket as contract,” integrations like Cursor’s will swap in and out over time, but the clarity requirement will last. Even as faster or cheaper AI agents arrive, the stable layer is always high-quality, trackable dev tickets. This is where OTF fits: it lets you extract, quantify, and track outcome quality across whichever assistant you slot in. The rest of the stack stays clean, even as the agent behind “assign-to-Cursor” changes.

Cursor’s Jira integration is a real milestone: actual automation, not just “AI help.” The $40/month wall will deter frictionless adoption for now, but if you measure dev workflow by tickets closed instead of chat tokens, it’s hard to ignore. If your team lives in Jira anyway and wants to auto-drive the boring bugs straight out of your board, Cursor sets a new baseline for what “Jira AI bug fix automation” should look like. Try it with feature seeds or clear bugs, skip it for vague asks — and if what you want is low-friction workflow glue, keep watching this space.


Unify Your Codebase: AI Agents Meet Cross-Platform Consistency with OTF Kits

Dave Kurian — Sat, 01 Aug 2026 09:09:13 +0000

AI coding agents are getting impressive. Ask for a button, get a button. Ask for a form, get a form. The thing they're still bad at, sometimes catastrophically, is writing the same button on three platforms. That gap is where cross-platform codebases quietly fragment. Same component name, same props, same tokens, rendered through platform-native primitives: that's the contract that closes the gap, and it's what an agent can extend reliably instead of reinventing each surface.

The three-button problem

You asked your coding agent for a primary button. It wrote one. Then you asked for the same button on the home screen of your iOS app. It wrote a different one. Then Android. A third one. None of them match. None of them behave the same when the label is too long. None of them announce the same thing to a screen reader.

This isn't a failure of the model. It's a failure of the contract. The agent had no shared specification to write against — just three separate platforms with three separate idioms, and its job was to satisfy each one locally. Local optima all the way down.

Add a fourth surface (a watch app, a TV app, a PWA shell) and the divergence compounds. What shipped as a <Button> on day one is now a PrimaryButton, an IOSButton, a MaterialButton, and a BigScreenButton. The codebase has traded one source of truth for four.

Why agents amplify this, not fix it

A human developer who has shipped cross-platform for a decade will write the button once and import the same component everywhere — not because they're clever, but because they've been burned by the alternative. They have scar tissue. An agent has no scar tissue. It has a context window and a tendency to treat every request as a fresh problem.

Concretely: when you prompt an agent with "add a button to sign in", it has to choose between dozens of valid implementations on each platform. Without a constraint, it picks the locally-conventional one. Conventional on the web means a styled <button> with focus rings. Conventional on iOS means a configuration-backed UIButton. Conventional on Android means a MaterialButton with a ShapeAppearanceOverlay. Same word, three different mechanical objects, three different accessibility stories, three different long-text behaviors.

The agent isn't wrong. The platforms genuinely don't agree. But the user-facing question — "does this look and behave the same?" — has a single answer, and the agent is producing three.

[[DIAGRAM: a single component in a shared codebase rendering to web, iOS, and Android with platform-native primitives, all reading from the same token set]]

The contract has to come from the codebase, not the prompt

You can't prompt-engineer your way out of this. Agents don't fail because they misunderstand the task; they fail because the repository gives them three different definitions of "button" to satisfy. The fix is to make the repository answer the question, not the prompt.

That means a single import path that resolves to platform-appropriate output from one API. The component name is the contract. The props are the contract. The shipped behavior is the contract. The platform implementation is a detail.

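import { Button } from "@otfdashkit/ui";
// signIn is assumed to live in the app's own auth module; the path is hypothetical.
import { signIn } from "./auth";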

export function SignInCTA() {
  return (
    <Button
      intent="primary"
      size="md"
      onPress={() => signIn()}
      accessibilityLabel="Sign in to your account"
    >
      Sign in
    </Button>
  );
}

The same source file imports the same name on web, iOS, and Android. The import resolves to the platform-correct implementation under the hood — web gets a <button> with proper focus management, iOS gets a configuration-backed UIButton, Android gets a MaterialButton — but the call site is identical. The agent has exactly one thing to write. So it does.
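The fan-out described above can be modelled without any UI framework. This is a sketch of the idea only, not how the kit is implemented: `renderButton`, `NativeSpec`, and the `PRIMITIVES` table are hypothetical, and the function returns a plain description of what each platform would render rather than a real view.

```typescript
type Platform = "web" | "ios" | "android";

// The shared contract: call sites only ever see these props.
interface ButtonProps {
  intent: "primary" | "secondary";
  size: "sm" | "md" | "lg";
  label: string;
  accessibilityLabel: string;
}

// A plain description of the rendered output; stands in for a real view.
interface NativeSpec {
  primitive: string;
  a11yAttribute: string;
  a11yValue: string;
}

// Platform detail lives in one table, out of sight of the call site.
const PRIMITIVES: { [P in Platform]: { primitive: string; a11yAttribute: string } } = {
  web: { primitive: "button", a11yAttribute: "aria-label" },
  ios: { primitive: "UIButton", a11yAttribute: "accessibilityLabel" },
  android: { primitive: "MaterialButton", a11yAttribute: "contentDescription" },
};

function renderButton(props: ButtonProps, platform: Platform): NativeSpec {
  const target = PRIMITIVES[platform];
  return {
    primitive: target.primitive,
    a11yAttribute: target.a11yAttribute,
    a11yValue: props.accessibilityLabel,
  };
}
```

The same `ButtonProps` value produces three different primitives but one accessibility label, which is the property the call site cares about.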

[[CONCEPT: one component name, one props contract, three platform-native renderings — the agent writes once, the system fans out]]

Tokens are the half that's actually portable

Components alone aren't enough. Visual consistency across platforms requires a single source of truth for color, spacing, radius, typography, motion. Hard-coding #3B82F6 in three places is the same bug in three places. An agent will happily help you write all three.

The tokens ship as a single package:

npm install @otfdashkit/tokens

Then every component reads from it. A theme flip — dark mode, or a rebrand from blue to green — is one file change:

// theme.ts
export const theme = {
  color: {
    primary: "#10B981", // was #3B82F6
    surface: "#0B0F19",
    onSurface: "#F8FAFC",
  },
  radius: { md: 8, lg: 12 },
  space: { 4: 16, 6: 24 },
};

The web bundle, the iOS bundle, and the Android bundle all read this same export. The agent never sees a hex code; it sees theme.color.primary. When you change the theme, every button on every platform tracks. The half-portable thing (the component) and the fully-portable thing (the token) compose into a system that behaves like one codebase.
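To illustrate how one theme object can feed more than one consumer, here is a small sketch that resolves semantic names such as `color.primary` and also emits CSS custom properties for the web build. The `resolveToken` and `toCssVariables` helpers are assumptions for illustration and are not the API of `@otfdashkit/tokens`; the values are the ones from the theme file above.

```typescript
type TokenGroup = { [name: string]: string | number };
type Theme = { [group: string]: TokenGroup };

const theme: Theme = {
  color: { primary: "#10B981", surface: "#0B0F19", onSurface: "#F8FAFC" },
  radius: { md: 8, lg: 12 },
  space: { "4": 16, "6": 24 },
};

// Look up a semantic name such as "color.primary". Unknown tokens throw,
// so a typo fails loudly instead of silently falling back to a default.
function resolveToken(t: Theme, path: string): string | number {
  const parts = path.split(".");
  if (parts.length !== 2) throw new Error("Expected group.name, got: " + path);
  const group = t[parts[0]];
  if (group === undefined || !Object.prototype.hasOwnProperty.call(group, parts[1])) {
    throw new Error("Unknown token: " + path);
  }
  return group[parts[1]];
}

// Emit one CSS custom property per token; numbers are treated as pixel values.
function toCssVariables(t: Theme): string[] {
  const out: string[] = [];
  for (const group of Object.keys(t)) {
    for (const name of Object.keys(t[group])) {
      const value = t[group][name];
      const rendered = typeof value === "number" ? value + "px" : value;
      out.push("--" + group + "-" + name + ": " + rendered + ";");
    }
  }
  return out;
}
```

Changing `color.primary` in the theme changes both the resolved value and the generated `--color-primary` line, with no call site edited.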

The agent config that keeps it this way

A consistent codebase doesn't survive contact with a coding agent if the agent's instructions don't reinforce it. By default, an agent asked to "add a card" will invent a card. The patterns it picks will be plausible, local, and quietly different from the rest of the app.

OTF kits ship with the agent config inline:

<!-- CLAUDE.md (excerpt) -->
UI components: import from @otfdashkit/ui on web, @otfdashkit/ui-native on iOS/Android.
Never create a new component if an existing one in the kit covers the case.
Tokens come from @otfdashkit/tokens. Never hard-code color, spacing, or radius.

<!-- .cursorrules (excerpt) -->
- Use kit components before reaching for raw primitives.
- Use semantic token names (color.primary) not hex values.
- One component name across platforms. No IfIosButton, IfAndroidButton.

Plus 20+ tested prompts in ai/prompts/ that walk an agent through the common tasks: adding a screen, wiring a form, building a navigation shell. The agent extends the kit instead of regenerating it.

This is the part that makes the consistency durable. The contract isn't just in the code; it's in the rules the agent reads before it touches the code.

What you actually ship

A concrete example: a sign-in form that lives in one file and renders on web, iOS, and Android.

import { Button, TextField, Card, Stack } from "@otfdashkit/ui";
import { useState } from "react";

export function SignInForm({ onSubmit }: { onSubmit: (email: string) => void }) {
  const [email, setEmail] = useState("");
  const valid = /\S+@\S+\.\S+/.test(email);

  return (
    <Card padding="lg">
      <Stack gap="md">
        <TextField
          label="Email"
          value={email}
          onChange={setEmail}
          keyboardType="email-address"
          autoComplete="email"
        />
        <Button
          intent="primary"
          size="md"
          disabled={!valid}
          onPress={() => onSubmit(email)}
          accessibilityLabel="Continue with email"
        >
          Continue
        </Button>
      </Stack>
    </Card>
  );
}

Same file, same imports, same props, same Stack for layout, same Card for surface. On web the components render as semantic HTML. On iOS they render as UIButton + UITextField with VoiceOver labels. On Android they render as MaterialButton + TextInput with contentDescription. The agent wrote one file. The user got three native experiences. The designer got one visual.
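One detail in the form is worth a closer look: the email pattern is unanchored, so it accepts any string that contains something email-shaped. The sketch below contrasts it with an anchored variant. The anchored pattern and the `isPlausibleEmail` helper are assumptions added for illustration, not part of the kit, and neither pattern is a full email validator.

```typescript
// The pattern used in the form above: unanchored, so it matches anywhere in the string.
const loose = /\S+@\S+\.\S+/;

// Anchored variant (an assumption, not part of the kit): the whole value must match.
const anchored = /^\S+@\S+\.\S+$/;

// Trim first so stray whitespace from autofill does not reject a valid address.
function isPlausibleEmail(value: string): boolean {
  return anchored.test(value.trim());
}
```

Pulling the check into a pure function also makes it testable without rendering the form on any platform.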

Constraint beats instruction

The temptation, when an agent starts drifting, is to add more instructions. "Don't use MaterialButton directly. Don't use raw hex. Don't create platform-specific files." Each rule is correct. The aggregate is a long preamble the agent reads once and forgets by turn four.

The actual fix is structural: remove the ability to drift. If there is no platform-specific button in the codebase — because the only button is the one the kit exports — the agent cannot import a different one. Constraint beats instruction. The codebase is the spec.
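Package structure cannot rule out everything; a raw hex string, for example, is always importable. For those leftovers, one option is a mechanical check in CI rather than another sentence in the preamble. The sketch below is a deliberately crude line scanner under stated assumptions: `findDrift`, the rule names, and the list of platform component names are hypothetical, and a real setup would more likely use a proper linter rule.

```typescript
interface Violation {
  line: number;
  rule: string;
  text: string;
}

// Matches #RGB or #RRGGBB literals. Crude: it does not parse the source.
const HEX_COLOR = /#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b/;

// Platform-specific names that should never appear at a call site (illustrative list).
const PLATFORM_COMPONENT = /\b(MaterialButton|UIButton|IOSButton)\b/;

function findDrift(source: string): Violation[] {
  const found: Violation[] = [];
  const lines = source.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const text = lines[i];
    if (HEX_COLOR.test(text)) {
      found.push({ line: i + 1, rule: "no-raw-hex", text: text.trim() });
    }
    if (PLATFORM_COMPONENT.test(text)) {
      found.push({ line: i + 1, rule: "no-platform-component", text: text.trim() });
    }
  }
  return found;
}
```

Run against the files an agent touched, a non-empty result fails the build, so the rule holds on turn forty as well as on turn one.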

What this enables

A team that has shipped a cross-platform surface this way can ask their agent for a feature and trust the output. The agent writes a button, the button is the same on every platform, the tokens carry the brand, the rules keep the agent within the system. Iteration speed is bounded by the model's latency, not by the human review needed to fix eleven different divergences.

The free SDK is MIT, around 200 components, on npm as @otfdashkit/ui, @otfdashkit/ui-native, and @otfdashkit/tokens. Install one, write a component, watch it ship to three platforms. The full-stack kits ($99 each, $149 for the bundle) wire the rest: auth, billing, DB, Stripe, mobile build, custom domain. Same principle, more surface.

AI agents are going to write more of our code, not less. The repository that scales with that trend is the one that gives the agent a single answer to every question it would otherwise be free to invent. One button. One name. One API. Three platforms. The agent writes the same thing every time because there is only one thing to write.

Enable AI's Full Potential with OTF Kits: Consistent, Reliable Code for Production-Ready Apps

Dave Kurian — Sat, 01 Aug 2026 08:06:30 +0000

The agent wave is real. Cursor autocompletes a form by the time you've typed the label. Claude Code refactors a 4,000-line file while you grab coffee. Lovable and Bolt turn a Thursday-morning sketch into a working app before lunch. The throughput gain is genuine: for the first time, the gap between "idea" and "running prototype" is measured in minutes, not weeks.

The bottleneck moved. It's not prompt quality anymore. It's what the agent is building on top of.

Here's the pattern I keep seeing in codebases that ship and codebases that don't: the ones that ship hand the agent a contract. The ones that don't hand it a blank file and ask it to invent the design system from scratch. Same agent, same prompt, wildly different outcomes.

The contract is the kit. The kit is the difference.

1. What "AI-configured" actually means

A kit that respects agents isn't a fancy component library. It's three things, and the third one is the one most people skip.

First: the components themselves. Type-safe props, predictable behavior, around 200 of them covering the cases an agent actually reaches for — Button, Dialog, DataTable, FormField, Sheet, Toast. The same <Button variant="primary" size="md"> renders on web, iOS, and Android from one codebase. No "well, on mobile we use…" branches in the prompt.

Second: design tokens. Colors, spacing, radii, typography — one source of truth that flips a theme across every platform. An agent that pulls a token can't drift off-brand because there is no off-brand to drift into.

Third, and the load-bearing one: agent configs. CLAUDE.md, .cursorrules, and a directory of tested prompts that tell the agent here's the kit, here are its conventions, extend it — don't regenerate it.

# Inside a kit, after install
ls -la
# CLAUDE.md          ← read first, sets the contract
# .cursorrules       ← Cursor's entry point
# ai/
#   prompts/         ← 20+ tested prompts, named by intent
#   README.md

That ai/ directory is the thing. It converts the agent from a generator into an extender.

2. The Button test

Take any modern AI coding tool. Ask it three times, in three fresh sessions, to "add a primary button to the pricing page".

Session A writes a button with bg-blue-600. Session B writes one with bg-indigo-500 hover:bg-indigo-600. Session C writes one with a custom focus ring that breaks the keyboard outline. Same prompt. Same model. Three different design systems.

Now point the same agent at a kit with CLAUDE.md and the right prompt:

<!-- ai/prompts/add-primary-button.md -->
You are extending an existing UI kit, not generating from scratch.

1. Read /CLAUDE.md to understand the component conventions.
2. The primary button is already exported as `<Button variant="primary" size="md">`.
3. Import from the kit; do not re-implement.
4. Match the existing spacing tokens on the pricing page.
5. Return the diff, not a rewrite.

Same agent, same model, three sessions. You get three Buttons that look identical, behave identically, and pass the same a11y checks — because the agent isn't deciding what a Button is. The kit already decided.

[[CONCEPT: a single canonical Button component rendered identically across web, iOS, and Android, extended by three different agents in three separate sessions with zero drift]]

3. Pre-tested beats regenerated

Every kit ships through a 24-item design checklist before it's released. Color contrast, focus rings, touch targets, keyboard navigation, RTL handling, dark-mode parity, token coverage. Regenerated components skip every one of those checks until something breaks in production.

The test isn't "does it compile". It's "does it survive a screen reader on a 4-year-old Android". You can't ship that test in a prompt. You can ship it in a kit.
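Color contrast is the one checklist item with a published formula, so it makes a good example of a check that lives in code rather than in a prompt. The sketch below implements the WCAG 2.x contrast ratio for `#RRGGBB` colors. How the kit's own checklist is implemented is not stated here, so treat `contrastRatio` and `passesAA` as an illustration of the standard calculation only.

```typescript
// Parse "#RRGGBB" into 0..255 channels.
function parseHex(hex: string): number[] {
  const m = /^#([0-9a-fA-F]{2})([0-9a-fA-F]{2})([0-9a-fA-F]{2})$/.exec(hex);
  if (m === null) throw new Error("Expected #RRGGBB, got: " + hex);
  return [parseInt(m[1], 16), parseInt(m[2], 16), parseInt(m[3], 16)];
}

// WCAG 2.x relative luminance of an sRGB color.
function luminance(hex: string): number {
  const linear = parseHex(hex).map((channel) => {
    const s = channel / 255;
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * linear[0] + 0.7152 * linear[1] + 0.0722 * linear[2];
}

// Contrast ratio, from 1 (identical colors) to 21 (black on white).
function contrastRatio(a: string, b: string): number {
  const la = luminance(a);
  const lb = luminance(b);
  const lighter = Math.max(la, lb);
  const darker = Math.min(la, lb);
  return (lighter + 0.05) / (darker + 0.05);
}

// WCAG AA threshold for normal-size text.
function passesAA(foreground: string, background: string): boolean {
  return contrastRatio(foreground, background) >= 4.5;
}
```

A check like this can run over every foreground and background token pair at release time, which is exactly the kind of guarantee a regenerated component never gets.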

4. What the configs actually look like

Here's the entry point an agent reads first. This is the contract.

<!-- CLAUDE.md (excerpt) -->
# Project UI Contract

This project uses the OTF UI kit. Components live in `@otfdashkit/ui` (web)
and `@otfdashkit/ui-native` (mobile). The same component name and props
render on both.

## When adding UI

- Import from the kit. Never re-implement `Button`, `Input`, `Dialog`,
  `Card`, `Sheet`, `Toast`, or `DataTable`.
- Pull colors, spacing, and radii from `@otfdashkit/tokens`. Do not hardcode
  hex values or pixel sizes.
- New components must compose existing primitives. They do not redefine
  them.
- Every interactive element must pass keyboard navigation. Tab order
  matches visual order.

## When refactoring

- Do not rename exported components.
- Do not change a prop's type without updating the kit first.
- Mobile parity: a change on web must not regress the iOS/Android build.

The agent reads that, then reads .cursorrules, then reads the prompt that matches the task. The three layers are redundant on purpose. Drop any one and the agent drifts. Keep all three and it stays in the lane.
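Since dropping any one layer lets the agent drift, it can be worth checking that all three are present before a session starts. This is a minimal sketch, assuming the file names shown in the listing earlier; `missingLayers` is a hypothetical helper that works on a list of repo-relative paths rather than touching the file system.

```typescript
// The three layers named above. A trailing slash marks a directory prefix.
const REQUIRED_LAYERS = ["CLAUDE.md", ".cursorrules", "ai/prompts/"];

// Given repo-relative paths, report which contract layers are absent.
function missingLayers(paths: string[]): string[] {
  return REQUIRED_LAYERS.filter((layer) => {
    const isDirectory = layer.charAt(layer.length - 1) === "/";
    const present = paths.some((p) => (isDirectory ? p.indexOf(layer) === 0 : p === layer));
    return !present;
  });
}
```

Feeding it the output of a file listing in CI turns "the three layers are redundant on purpose" into something that fails visibly when one goes missing.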

5. When the agent needs to add something the kit doesn't have

The interesting case is composition. The kit has Card. It doesn't have PricingCard. Here's what an ungrounded agent writes:

// Before — regenerated, drifts over time
function PricingCard({ tier, price }: { tier: string; price: number }) {
  return (
    <div className="rounded-lg border p-6 shadow-sm">
      <h3 className="text-xl font-semibold">{tier}</h3>
      <p className="text-3xl">${price}/mo</p>
      {/* ... */}
    </div>
  );
}

Hardcoded radii, hardcoded padding, hardcoded typography, hardcoded shadow. Every one of those will drift from the rest of the app by next sprint.

The kit-aware version:

// After — kit-extended, composes primitives
import { Card, CardHeader, CardTitle, CardBody } from "@otfdashkit/ui";
import { tokens } from "@otfdashkit/tokens";

export function PricingCard({ tier, price }: { tier: string; price: number }) {
  return (
    <Card>
      <CardHeader>
        <CardTitle>{tier}</CardTitle>
      </CardHeader>
      <CardBody>
        <p style={{ font: tokens.heading }}>${price}/mo</p>
      </CardBody>
    </Card>
  );
}

Same shape. Same intent. No hardcoded values. The card inherits every a11y check, every dark-mode rule, every focus ring the kit already passed.

[[COMPARE: regenerated PricingCard with hardcoded radii and padding vs kit-extended PricingCard composing Card primitives and reading tokens]]

6. The number that matters

The cost of agent drift isn't the first wrong Button. It's the fifteenth. Each one is a small "why does this look different from the others?" bug that lands in code review, gets fixed with a one-line token swap, and erodes trust in the next agent output. By Friday the team's writing the UI by hand again because "the agent doesn't get it".

A kit that an agent extends — with configs, tokens, and pre-tested components — collapses that loop. The agent's first output is the right output, because the contract decided before the agent started.

7. The durable layer underneath the churn

Cursor will ship a new model. Claude Code will ship a new workflow. Lovable will ship a new builder. The list of agent tools that come and go is longer than the list of design systems that come and go, by an order of magnitude.

The kit is the part that doesn't churn. Same <Button>, same tokens, same CLAUDE.md, regardless of which model wrote which file. That's the layer worth investing in.

Use the agents, all of them; they're genuinely good now. And hand each one a contract it can extend instead of a blank file it has to invent.

The AI Agent Revolution: Cursor 3, Claude Code, and OpenAI Codex Lead the Charge

Dave Kurian — Sun, 26 Jul 2026 17:03:12 +0000

Three coding agents walked into your terminal

The shift from "AI that completes a line" to "AI that ships a feature" isn't a marketing rebrand. It's a different relationship — closer to onboarding a teammate than installing an extension. Three names — Cursor 3, Claude Code, and OpenAI Codex — are leading that race right now, and the way you pick between them in 2026 is the way you pick between senior engineers: by the kind of work you need them to do, not by who has the louder launch post.

A few years ago, the bar was low. GitHub Copilot completed a function and it felt like magic. Today the bar is: can the agent read an unfamiliar codebase, implement a feature end-to-end, write the tests, run the terminal commands to verify, open the pull request, debug when CI fails, and explain the architectural decision it just made? That's the new contract. We are no longer using AI as an assistant. We are beginning to treat it like a teammate.

What the new contract looks like in practice

The unit of work has changed. The article framing is useful here because it's honest about scope. A modern coding agent is expected to own the whole loop, not just one cell of it:

Read the codebase well enough to find the right files.
Implement the feature in a way that matches existing conventions.
Write the tests — not just the happy path.
Execute the terminal commands needed to verify locally.
Open a pull request with a description a human can actually review.
Debug when the build breaks.
Explain the architectural trade-off it just made.

That's a teammate, not a tool. The interesting question is no longer "which model is smartest." It's "which one do you want sitting next to you for which kind of day."

The three names leading the race

Cursor 3, Claude Code, and OpenAI Codex are the three tools positioned as leading this shift. Each has carved out a slightly different posture in the workflow, and the differences matter when you're wiring one into a real codebase rather than a demo repo.

Cursor 3 has been positioning itself around tight editor integration — the agent lives where you already type, with strong context of the open files and the surrounding code. The strength is the in-the-flow experience. You are not alt-tabbing to a chat window; you stay in the editor and the agent reads along.

Claude Code has been positioning itself around longer-horizon reasoning — the kind of work where you hand over a multi-file refactor or an architectural decision and you want the agent to think it through, weigh the trade-offs, and come back with a plan you can argue with. The strength is depth of reasoning on hard problems where the answer is not obvious.

OpenAI Codex has been positioning itself around breadth — the widest language coverage and the most "I can do whatever you point me at" feel. The strength is versatility. If your stack is unusual, your scripts are diverse, or you bounce between web, data, and automation in the same afternoon, breadth matters more than depth in any one language.

[[COMPARE: editor-native flow vs long-horizon reasoning vs broad versatility]]

None of that is settled. The leaderboard changes every quarter. The categories are stable, and the categories are what you should be picking between.

Choosing by work type, not by leaderboard

A framework that survives model churn looks like this:

Pick Cursor 3 when the bottleneck is "I know what I want, I need it typed fast and in my style." Rapid prototyping, in-editor feature implementation, debugging sessions where you want the agent watching the same files you are.
Pick Claude Code when the bottleneck is "I don't know what I want yet, I need someone to think it through with me." Large refactors, legacy modernization, architectural decisions where you want pushback rather than a confident wrong answer.
Pick OpenAI Codex when the bottleneck is "my stack is weird and I need one tool that covers it." Cross-language work, scripting, API integrations, anything where versatility matters more than depth in a single language.

A useful test before you commit: write down the last three things that frustrated you about your last AI coding tool. If the complaints are all about latency and flow, you want editor-native. If they are about the agent making shallow decisions and rubber-stamping whatever you ask, you want longer-horizon reasoning. If they are about coverage gaps — "it doesn't know this library" — you want breadth.
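The three-complaints test can be written down as a tiny decision rule. The sketch below is an illustration, not a scoring system from any vendor: the `Complaint` categories paraphrase the text, and using a majority vote when the complaints are mixed is an assumption, since the text only covers the case where they all point the same way.

```typescript
type Complaint = "latency" | "flow" | "shallow-decisions" | "rubber-stamping" | "coverage-gap";
type Posture = "editor-native" | "long-horizon reasoning" | "breadth";

// Which posture each complaint points toward, following the paragraph above.
const POSTURE_FOR: { [C in Complaint]: Posture } = {
  "latency": "editor-native",
  "flow": "editor-native",
  "shallow-decisions": "long-horizon reasoning",
  "rubber-stamping": "long-horizon reasoning",
  "coverage-gap": "breadth",
};

// Majority vote over the complaints; ties return every posture sharing the top count.
function recommend(complaints: Complaint[]): Posture[] {
  const counts: { [P in Posture]: number } = {
    "editor-native": 0,
    "long-horizon reasoning": 0,
    "breadth": 0,
  };
  for (const complaint of complaints) {
    counts[POSTURE_FOR[complaint]] += 1;
  }
  const postures = Object.keys(counts) as Posture[];
  const top = Math.max(...postures.map((p) => counts[p]));
  if (top === 0) return [];
  return postures.filter((p) => counts[p] === top);
}
```

A tie is a useful result in itself: it suggests the frustration is not about the tool category, and the harness is the better place to look.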

How to actually wire one into your workflow today

The shift from demo to teammate is mostly plumbing. An agent that's great in isolation but doesn't know your repo's conventions is just a faster typist. Here is the minimum that makes the new contract real:

Give it a real task, not a snippet. Start with a feature ticket you would actually merge, not "write me a quicksort." The new unit of work — implement, test, run, PR — is the right shape to evaluate the agent on.
Point it at your conventions. If you keep an AGENTS.md, CLAUDE.md, or .cursor/rules at the repo root, write the conventions down once and let the agent load them. This is where most of the "the agent writes ugly code" complaints go to die.
Run it in a branch. The agent should be opening pull requests, not pushing to main. The new contract includes "review pull requests" — your side of that contract is that the agent's PRs are reviewable.
Wire the feedback loop. When the agent's PR gets a review comment, that comment should feed back into the next attempt. Without this, you are babysitting instead of delegating.

# minimum harness for any of the three
git checkout -b agent/feature-xyz
# hand the ticket to the agent in your editor / CLI of choice
# let it run the build + tests locally before it opens the PR
gh pr create --fill --base main --reviewer <your-team>

# AGENTS.md — the one file every agent in the race can read

## Conventions
- Use the existing `Button` from the shared UI package, not a local one.
- Tests live next to the file, suffix `.test.ts`.
- Don't add new top-level deps without a note in the PR description.

## What "done" means
- `pnpm test` passes.
- `pnpm typecheck` passes.
- A short note in the PR body explaining *why*, not just *what*.

The plumbing above is identical whether you pick Cursor, Claude Code, or Codex. That is the point. The harness is portable; the model is not.
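The feedback-loop step from the list above is the one piece the shell harness does not cover. Here is a minimal sketch of it, assuming review comments have already been fetched into a simple structure; `ReviewComment` and `buildRetryPrompt` are hypothetical names, and how the comments are retrieved from your code host is left out.

```typescript
interface ReviewComment {
  file: string;
  line: number;
  body: string;
  resolved: boolean;
}

// Fold unresolved review comments into the instructions for the next attempt.
function buildRetryPrompt(ticketTitle: string, comments: ReviewComment[]): string {
  const open = comments.filter((c) => !c.resolved);
  if (open.length === 0) {
    return "Ticket: " + ticketTitle + "\nNo open review comments.";
  }
  const items = open.map((c) => "- " + c.file + ":" + c.line + " " + c.body);
  const header = [
    "Ticket: " + ticketTitle,
    "Address these review comments before re-requesting review:",
  ];
  return header.concat(items).join("\n");
}
```

Because the output is plain text, it can be handed to whichever agent is in use, which keeps this step as portable as the rest of the harness.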

The part that doesn't change when the model does

Here is the part worth saying out loud: all three of these agents will be replaced, upgraded, or rebranded within the next year. The leaderboard churn is real. The category — "AI teammate that ships features end-to-end" — is durable. What is even more durable is the deliverable: software that has to look and behave the same on web, iOS, and Android, in front of real users, on a Tuesday afternoon when nothing is on fire.

That is the layer underneath the agent churn. The model that wrote your Button this quarter is not the model that will write your Button next quarter. The component still has to render correctly in three environments, with one source of truth, without a rebuild-everything migration when you swap tools. The agent can change; the surface it produces cannot.

Use one of these three — they are genuinely good. Wire it into your workflow with the harness above. And build your product on a layer that does not care which model wrote it.

AI-Powered Linux Incident Triage: A Safer DevOps Approach

Dave Kurian — Sun, 26 Jul 2026 16:06:22 +0000

A read-only AI triage loop is the most underrated idea in agentic DevOps

The piece from Saim Ausman on AI-assisted Linux incident triage with Bash and Claude Code lands on a distinction most "AI for DevOps" coverage misses: AI as a careful assistant, not an autonomous operator. The system built there is read-only end-to-end — Bash gathers evidence, Claude Code analyses, a human decides and acts, Bash verifies. AI never mutates state. That single constraint is the entire idea, and it is genuinely hard to design around, because every vendor pitch in this space leans the other way.

The point of the design is not "AI helps." The point is: AI recommends, humans execute, the system verifies. That division of labour alone makes the loop safer than most of the agentic DevOps stacks shipping this year.

[[DIAGRAM: bash script gathers linux and nginx health data into a structured report, claude code reads the report and produces an incident brief, the human operator decides and runs a recovery action, bash script re-runs and compares to baseline to verify]]

Why "AI reads, humans write" is the actual enable

Most agentic DevOps pitches fall into one of two traps. Either the AI is autonomous enough to be scary (it restarts your production database, rotates secrets, or ships a config change at 3am), or it is so constrained that it is useless, summarising logs nobody opens. The triage system picks a third path: scripts gather, AI analyses, humans act.

Concretely, the loop is:

Gather → Analyze → Human Act → Verify

Each phase has one job. The Gather phase runs Bash to collect Linux and Nginx health data. The Analyze phase sends that evidence to Claude Code and gets an incident brief back. The Human Act phase has an engineer read the brief and decide what to run. The Verify phase re-runs Bash against the new system state and compares to the baseline.

If the AI hallucinates a recovery command, the worst case is a bad sentence in a chat window — not a deleted database. For incident response, where every minute of downtime costs money and trust, that risk profile is the enable.

The four components, named

The architecture is intentionally small. No message bus, no orchestrator, no agent framework.

Ubuntu Server: the host under triage, the same machine the failure is happening on.
Bash script: the only automated component that runs commands on the host. It collects evidence and runs verifications, writes nothing but its own report files, and never changes system state.
Claude Code: the analyst. It reads the collected evidence and produces an incident brief, and it has no write path.
Human operator: the only bridge between recommendation and execution.

The separation matters more than the components. Each layer can fail without the others failing catastrophically. If Claude Code goes down, you can run the collector and triage by hand. If the Bash script breaks, the human still has the brief from the last good run. If the human is unavailable, the loop simply waits — which is the correct behaviour for a production incident.

1. Establish a healthy baseline before you write a single line

Before any automation runs, the server must be known-good. The original walks through confirming a baseline before letting the loop touch anything, and that ordering is not a stylistic choice. A baseline is the reference point the verification step compares against. Without one, "verify" is meaningless — you would be confirming the system still looks like it did when it was already broken.

A good baseline records, at minimum:

uptime and cat /proc/loadavg
free -h and df -h (excluding tmpfs)
Nginx active state via systemctl is-active nginx
The last 50 lines of /var/log/nginx/error.log
Connection states via ss -tan

Save it as baseline.txt. Every subsequent verification run diffs against it. Skipping this step is how teams end up "verifying" a system that was already degraded and trusting a brief that was wrong from the start.
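
A baseline is only useful if it cannot be silently replaced by a later, degraded run. Here is a minimal sketch of that guard, under stated assumptions: `save_baseline` is a hypothetical helper, and the collector is assumed to be a Bash script that takes its output path as the first argument.

```bash
# save_baseline: run a collector script and keep its output as the baseline,
# refusing to overwrite an existing one. Hypothetical helper name.
save_baseline() {
  local collector="$1" baseline="${2:-./baseline.txt}"
  local tmp="$baseline.tmp.$$"
  if [ -e "$baseline" ]; then
    echo "refusing to overwrite $baseline" >&2
    return 1
  fi
  # Collect into a temp file so a failed run never leaves a partial baseline.
  if bash "$collector" "$tmp" >/dev/null && [ -s "$tmp" ]; then
    mv "$tmp" "$baseline"
    echo "baseline saved to $baseline"
  else
    rm -f "$tmp"
    echo "collection failed; no baseline written" >&2
    return 1
  fi
}

# Usage (collect.sh is a placeholder name for your collector):
#   save_baseline ./collect.sh ./baseline.txt
```

Re-baselining then becomes a deliberate act: someone has to delete the old file first.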

2. The Bash evidence collector

The collector is a single Bash script that emits structured, machine-readable output. Strict mode is non-negotiable — a silent failure on the evidence layer would feed garbage to the AI.

#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

OUT="${1:-./triage_report.txt}"

collect_system() {
  {
    echo "=== UPTIME ==="
    uptime
    echo "=== LOAD ==="
    cat /proc/loadavg
    echo "=== MEMORY ==="
    free -h
    echo "=== DISK ==="
    df -h | grep -v tmpfs
    echo "=== NGINX STATUS ==="
    systemctl is-active nginx || true
    echo "=== NGINX ERRORS (last 50) ==="
    tail -n 50 /var/log/nginx/error.log 2>/dev/null || echo "(no log)"
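    echo "=== NGINX CONNECTION STATES ==="
    # Note: ss -tan counts every TCP socket on the host by state, not only Nginx's.
    ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c
    echo "=== TOP PROCESSES (cpu) ==="
    # head exits early; tolerate a possible SIGPIPE from ps under pipefail.
    ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 10 || true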
  } > "$OUT"
}

collect_system
echo "wrote $OUT"

The output is plain text with section headers — === UPTIME ===, === LOAD ===, and so on. That structure is the API Claude Code consumes. Treat the headers as a contract: every change here is a versioned migration, because every prompt downstream depends on the section names existing.
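
If the headers are a contract, something should enforce it. The sketch below checks a report for the section names the collector emits before anything is sent for analysis. `check_report_contract` and `REQUIRED_SECTIONS` are hypothetical names introduced here for illustration.

```bash
# Section names the prompt depends on. Keep this list in sync with the collector.
REQUIRED_SECTIONS="UPTIME
LOAD
MEMORY
DISK
NGINX STATUS
NGINX ERRORS (last 50)
NGINX CONNECTION STATES
TOP PROCESSES (cpu)"

# check_report_contract: fail if any required "=== NAME ===" header is missing.
check_report_contract() {
  local report="$1" missing=0 name
  while IFS= read -r name; do
    # -x matches the whole line, -F treats the header as a literal string.
    if ! grep -qxF "=== $name ===" "$report"; then
      echo "missing section: $name" >&2
      missing=1
    fi
  done <<EOF
$REQUIRED_SECTIONS
EOF
  return "$missing"
}

# Usage:
#   check_report_contract ./triage_report.txt || exit 1
```

A renamed or dropped header then fails loudly at collection time instead of quietly degrading the brief.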

3. Claude Code's role — and the prompt that keeps it honest

Claude Code reads the Bash output and produces an incident brief: the top suspect, the evidence behind it, one recommended next action, and a way to verify that the fix worked. The prompt should pin the format hard, because the freedom to invent commands is exactly the failure mode this loop is built to avoid:

You are a Linux incident analyst. Read-only.
Input: structured evidence from a triage Bash script (sections marked
=== SECTION ===). Cite the section name when you reference data.
Output, in this exact order:
1. Top suspect (one sentence).
2. Supporting evidence (bullet list, cite section name).
3. Recommended next action (one command, for the human to run).
4. Verification step (the exact Bash snippet that proves the fix worked).
Constraints:
- Do not change files or services yourself. You recommend; the human executes.
- Do not cite metrics, paths, or unit names that are not present in the evidence.
- If evidence is insufficient, say so. Do not guess.

Two constraints do most of the work. "Cite the section name" makes hallucinated metrics cheap to catch: if the brief blames a full disk and cites === DISK ===, but that section shows plenty of free space, you know exactly where the analysis went wrong. "If evidence is insufficient, say so" is the honesty rule that backs up the read-only invariant. The model is being told, in plain language, that the right answer to a thin evidence file is "I don't know," not invented confidence.
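
The citation rule can be partly checked by machine. The sketch below assumes the model cites sections in the literal `=== NAME ===` form, which the prompt does not strictly guarantee, and `check_citations` is a hypothetical helper. It only confirms that every cited section exists in the evidence. Whether the claim matches the data is still a human read.

```bash
# check_citations: every "=== NAME ===" marker in the brief must appear as a
# header line in the evidence report. Hypothetical helper name.
check_citations() {
  local brief="$1" report="$2" bad=0 cited marker
  # Pull the distinct markers out of the brief.
  cited="$(grep -o '=== [^=]* ===' "$brief" | sort -u || true)"
  if [ -z "$cited" ]; then
    echo "brief cites no sections" >&2
    return 1
  fi
  while IFS= read -r marker; do
    if ! grep -qxF "$marker" "$report"; then
      echo "cited but not in evidence: $marker" >&2
      bad=1
    fi
  done <<EOF
$cited
EOF
  return "$bad"
}

# Usage:
#   check_citations ./brief.txt ./triage_report.txt
```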

4. The human checkpoint

The human reads the brief, decides whether to run the recommended command, runs it, then re-runs the collector. If the fresh report matches baseline.txt everywhere except values that always drift, such as uptime and load, the incident is closed. If not, the fresh evidence is fed back into Claude Code: a new analysis on new data, with no accumulated state across runs.

This is where the loop earns its name. The brief is a recommendation, never an action. The verification is a comparison, never a "looks fine." The human checkpoint is the only component with the authority to act on the first and sign off on the second.
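
One way to make the comparison mechanical is to compare section by section, and only on sections you expect to be stable. This is a sketch under assumptions: `extract_section` and `verify_against_baseline` are hypothetical helpers, and which sections count as stable is your call, not something the original specifies.

```bash
# extract_section: print the body of one "=== NAME ===" section from a report.
extract_section() {
  awk -v h="=== $2 ===" '
    $0 == h         { on = 1; next }   # start printing after the requested header
    /^=== .* ===$/  { on = 0 }         # any other header ends the section
    on              { print }
  ' "$1"
}

# verify_against_baseline: compare the named sections of two reports.
# Returns non-zero if any of them differ.
verify_against_baseline() {
  local baseline="$1" report="$2" status=0 name
  shift 2
  for name in "$@"; do
    if [ "$(extract_section "$baseline" "$name")" = "$(extract_section "$report" "$name")" ]; then
      echo "OK       $name"
    else
      echo "CHANGED  $name"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   verify_against_baseline ./baseline.txt ./triage_report.txt "NGINX STATUS"
```

A CHANGED line is a prompt for the human to look, not a verdict. The sign-off stays with the operator.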

5. Nginx as the observation surface, not the action surface

The original uses Nginx to serve the latest evidence and the latest brief over plain HTTP, locked down to an internal network or VPN. That makes the loop inspectable: anyone on the team can see what Claude Code saw and what it recommended, without granting the AI any write path.

Concretely, that means Nginx serves three things:

/baseline.txt — the known-good reference.
/triage_report.txt — the most recent evidence run.
/brief.txt — the most recent Claude Code brief.

Nginx never runs the AI. Nginx never executes the brief. Nginx is a window. The window is the entire production surface of the model.
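
The window stays a window only if what sits behind it is read-only. As a rough sketch, the three files can be copied into the directory Nginx serves with read-only permissions. `publish_artifacts` is a hypothetical helper, and the source and web-root paths are placeholders for whatever your setup uses.

```bash
# publish_artifacts: copy the three loop artefacts into a web root as
# read-only files. Hypothetical helper name; paths are placeholders.
publish_artifacts() {
  local src="$1" webroot="$2" f
  mkdir -p "$webroot" || return 1
  for f in baseline.txt triage_report.txt brief.txt; do
    if [ -f "$src/$f" ]; then
      # install copies and sets the mode in one step; 0444 is read-only for all.
      install -m 0444 "$src/$f" "$webroot/$f" || return 1
    else
      echo "skipping $f (not found in $src)" >&2
    fi
  done
}

# Usage:
#   publish_artifacts . /path/to/triage-webroot
```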

What this loop actually gets you

Three properties the autonomous-AI framing cannot deliver.

Auditability. Every recommendation is paired with the evidence that produced it. If the AI is wrong, you can see why, because the prompt forced citation.

Reversibility. Nothing Claude Code does needs to be reversed, because Claude Code changes nothing. The worst artefact is a bad sentence in a chat. Compare that to an AI that runs systemctl restart on the wrong unit, in the wrong environment, at the wrong moment.

Composability. The Bash script can be extended — more sections, deeper Nginx metrics, journalctl queries — without touching the AI. The prompt can be tightened without touching the script. The human can be swapped for an on-call rotation without touching anything else. Each layer is independently editable.

The part that survives the model churn

Here is the angle worth keeping after the post closes. The AI in this loop is replaceable. Today's Claude Code is tomorrow's smaller hosted model, a local Ollama run, or whatever ships next month. The parts that do not change are the Bash contract (section headers, output format), the read-only invariant, and the human checkpoint. Those are the durable layer — the interface the model churns against, not the interface the model is.

That is the same shape as any well-designed boundary: the structured contract outlives the implementation. Whether the components run on a single Ubuntu VM today or a fleet tomorrow, whether the model is Claude Sonnet or a local 7B, the contract is what survives.
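
Here is roughly what that boundary can look like in the script: the analyzer is just a command that reads prompt plus evidence on stdin and writes a brief to stdout. `analyze` and `TRIAGE_ANALYZER` are hypothetical names. The default assumes Claude Code's non-interactive print mode accepts piped input, so check `claude --help` on your installed version before relying on it.

```bash
# analyze: feed the prompt and the evidence to whatever command
# TRIAGE_ANALYZER names, and save its stdout as the brief.
# Both names are hypothetical, introduced for this sketch.
analyze() {
  local prompt_file="$1" report="$2" brief="$3"
  # Assumed default: Claude Code print mode reading the piped text.
  local analyzer="${TRIAGE_ANALYZER:-claude -p 'Analyze the evidence below as instructed.'}"
  # Prompt first, evidence second, both on stdin.
  cat "$prompt_file" "$report" | sh -c "$analyzer" > "$brief" || return 1
  # An empty brief counts as a failure, not as "nothing wrong".
  [ -s "$brief" ]
}

# Usage:
#   analyze ./prompt.txt ./triage_report.txt ./brief.txt
#   TRIAGE_ANALYZER="some-other-cli --flags" analyze ./prompt.txt ./triage_report.txt ./brief.txt
```

Any tool that reads text on stdin and prints text on stdout fits the same slot, which is the whole point of keeping the contract in the report format rather than in the model.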

If you build the loop right, swapping the AI is a config change. Swapping the script contract is a migration. Swapping the human checkpoint is a policy decision. The phases (Gather, Analyze, Human Act, Verify) do not change at all. Build the Bash contract first, wire in the AI second, and keep the human in the loop from start to finish. The rest is tooling.