DEV Community: Tessl

Analyzing your agent sessions with Tessl

Tessl — Sat, 11 Jul 2026 06:10:16 +0000

With Tessl, evaluations serve a very specific purpose: using an agent and a provided context, they show how well a set of tasks can be done with and without that context. An evaluation can also be used to compare models. This is great during the development phase of a skill, but during actual usage a lot can happen: things you did not anticipate or, worse, things you expected to happen that did not.

Agent sessions are a tremendous source of information: they help you understand what happened in a session and whether expected events failed to occur. Tessl will examine:

Friction points that the agent may have had while performing certain tasks. For example, were there errors, or things that it thrashed on? These may not even be related to the skill. Reviewing friction points may reveal other areas that are candidates for a new skill.
Certain events (signals, so to speak) in the sessions that verifiers expect to find. For example, did certain actions that you expected actually happen? By definition, verifiers are structured pass/fail checklists that track any aspect of agent behavior you care about.

With Tessl and the skills in the try-tessl/agent-quality plugin, verifiers are created from:
- Skills, docs, and rules
- User input: the user describes what they care about, and that description is turned into verifiers.

Each verifier captures one instruction with a checklist of binary checks that an LLM judge evaluates against session transcripts.
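To make the idea concrete, here is a minimal sketch of a verifier as a pass/fail checklist. The class names, fields, and the example instruction are hypothetical and are not Tessl's actual verifier file format; the LLM judge is represented only by a dictionary of verdicts it might return.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """One binary question the judge answers about a session transcript."""
    name: str
    question: str


@dataclass
class Verifier:
    """One instruction from a skill, doc, or rule, plus its checklist."""
    instruction: str
    checks: list[Check] = field(default_factory=list)

    def summarize(self, verdicts: dict[str, bool]) -> dict[str, object]:
        """Combine per-check verdicts (True = pass) into a summary.

        A check with no verdict is treated as a failure.
        """
        failed = [c.name for c in self.checks if not verdicts.get(c.name, False)]
        return {"instruction": self.instruction, "passed": not failed, "failed": failed}


# Hypothetical instruction and checks, for illustration only.
verifier = Verifier(
    instruction="Run the documentation linter before proposing edits",
    checks=[
        Check("linter_ran", "Did the agent run the linter?"),
        Check("ran_before_edit", "Did the linter run before the first file edit?"),
    ],
)
result = verifier.summarize({"linter_ran": True, "ran_before_edit": False})
print(result)
```

Keeping each check binary is what lets results be compared across sessions: a verifier either passed in a session or lists exactly which checks failed.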

Session analysis can help you optimize your skill by examining what happened during real world usage in your agents!

Prerequisites

You have Tessl installed, and configured for your agent.
Claude Code must be installed. Note that while this feature can be used with sessions from Cursor, Claude Code, Codex and Gemini, Tessl requires Claude Code to be installed on the user’s machine (and logged in) to run the judging.

About try-tessl/agent-quality

The Tessl plugin try-tessl/agent-quality is made up of three skills, which perform the following actions:

Identifies sessions across any agent you've used in a project. Security being important to Tessl, Tessl redacts credentials from the transcripts and also treats all content as untrusted data.
Identifies friction points.
Examines the skill and identifies what it should look for in a session (aka Verifiers).
The verifiers you create are added to a tile, so one of two scenarios applies:
1. The tile you are creating verifiers for is source-controlled in the repository you are in: verifiers are added as part of the tile.
2. The tile you are creating verifiers for is checked out from the registry in .tessl: a new tile is created just for the verifiers, and the verifiers are added to it. This is because new content added to a checked-out tile will be overwritten.
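The two scenarios amount to a simple rule, sketched below. This is only an illustration of the decision described above, not the plugin's code: the function name, the return strings, and the example paths are all hypothetical.

```python
from pathlib import PurePosixPath


def verifier_destination(tile_path: str) -> str:
    """Return where new verifiers would go for a tile at `tile_path`.

    `tile_path` is assumed to be relative to the repository root.
    """
    if ".tessl" in PurePosixPath(tile_path).parts:
        # Registry checkout: added files would be overwritten, so use a new tile.
        return "new verifier-only tile"
    # Source-controlled tile: verifiers live alongside the tile itself.
    return "inside the existing tile"


# Hypothetical paths, for illustration only.
print(verifier_destination("tiles/tessl-docs-creator"))
print(verifier_destination(".tessl/tiles/mycompany/tessl-docs-creator"))
```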

try-tessl/agent-quality in action!

For the following example, a private skill called mycompany/tessl-docs-creator was used to review a set of documentation and ensure certain standards are maintained. Our goal in using try-tessl/agent-quality is to understand whether the skill was used properly and where friction occurred during that normal usage.

As we walk through try-tessl/agent-quality, it’s important to point out that it follows this flow: phase 1, get feedback, phase 2, get feedback and so on. There is a human in the loop and the human can make changes to the skill and verifiers with each step.

Step 1 - Install plugin

In a project that has Tessl initialized, ask your agent:

I need you to install try-tessl/agent-quality

Step 2 - Start the process

Ask your agent to review your agent sessions with:

Analyze my sessions

If you already have verifiers for your skills, skip straight to Step 5.

Step 3 - Create verifiers

The agent will identify your sessions and ask whether you want to create verifiers.

Remember, verifiers are created from Skills, Docs, Rules, and user input, generating checklists that the LLM will judge against. While Tessl automates it, similar to how Tessl generates scenarios in evaluations, it’s recommended you review what’s generated to determine the intent of the skill vs what verifiers are being created.

If you've not already done so, indicate you wish to proceed with generating the verifiers.

If there are no verifiers, ask your agent:

Create the verifiers

You might be asked whether you want to create verifiers, review friction, or both. In this step, focus on creating verifiers so that you can review them. You will run both in Step 5, which focuses on generating results.

Step 4 - Review verifiers

Your agent will create the verifiers and present a summary. Review them to determine whether they match the intended purpose of the skill.

Note that if the skill is not in a workspace you have permissions for, Tessl will create a new tile that you can edit.

A verifiers folder will be created with related files.

Once generation is complete, the set of verifiers is ready for review.

Step 5 - Review agent session

Generate the analysis by asking your agent:

Review the agent sessions

You may be asked whether you want to run the verifier review, the friction review, or both. Tessl recommends running both.

Step 6 - Results review

Once analysis is complete, a summary will be presented.

Review and accept, or modify, any guidance that is provided.

Step 7 - Loop

So now you’ve seen how to create verifiers, and run an analysis, but over time it’s natural to improve your skill, or want to update your verifiers as you observe things while troubleshooting issues. The following guidance will help determine what you should update or which steps to skip to above, when you make these changes:

a) You implement the guidance from Step 6: Start an agent session using your skill and demonstrate normal use over a few sessions, or over a day or two. After enough data is collected, rerun the analysis to see whether the changes had an impact. No verifier changes are required, because the guidance was generated from the existing verifiers.

b) You ran an analysis and found that verifiers are not performing as expected: Your verifiers may be too broad, flagging too many things, or too narrow, missing issues you're aware of. In such cases, return to Step 4 to modify the verifiers, and then rerun the subsequent steps.

c) You update your skill during your normal workflow: Return to Step 4 to update your verifiers so they match the new expectations of your skill.

Summary

Ultimately, this is trying to get you data on how your agent is actually doing vs just vibes so that you can iteratively improve it! And when your skill is working well, you’ll have the data to confirm it!

Reviewing agent sessions is a very powerful way to see what happened in a session, identify friction points, and verify whether what you expected to occur actually occurred when using the skill. Tessl is building out a powerful toolkit that lets you evaluate your skill from its packaging, against scenarios, and across different models, and that now provides data on what actually happened during use of a skill.

I Spent a Week Fixing the Wrong Skill (And Other Lessons from Evaluating an AI PR Reviewer)

Tessl — Wed, 08 Jul 2026 08:19:17 +0000

TLDR

The baseline model (Claude Opus, no guidance) already catches ~65% of textbook bugs. The plugin's value comes from false positive suppression and risk classification, because the baseline already catches most bugs on its own.
The plugin had been classifying risk correctly all along. I just wasn't measuring it. One eval weight change, zero code changes, and the gap widened 9 percentage points.
I spent four versions rewriting the reviewer's prompt to fix a false positive. The actual fix was one line in a completely different skill, upstream.

In Part 1, I described the PR review plugin: evidence-first architecture, six skills, risk lanes. It hit 97.7% accuracy across 43 eval scenarios. This post is about how it got there, because the eval journey taught me more than the final number.

How I evaluated the AI PR reviewer

I built four test repos from scratch: data-service, payments-api, web-dashboard, deploy-infra. Each has planted bugs of varying subtlety, from "you forgot to sanitize this input" to "this session TTL is set to zero, which means sessions never expire, which means stolen session tokens are valid forever."

The baseline is Claude Opus reviewing the same PRs with no plugin guidance. Just the model, the diff, and a generic "review this code" prompt. I started with 33 scenarios and ended with 43.

First surprise: the baseline scored ~70% on the initial 33 scenarios. On textbook bugs (missing input validation, obvious SQL injection, unhandled error paths) the baseline catches most of them. The model is smart. This isn't 2023 anymore.

That ~70% is important context for everything that follows. It means any AI reviewer that just adds more bug-finding instructions on top of a capable model is competing for the remaining 30%. And if it generates false positives along the way, it might be net negative. The firehose problem the research warned about.

It also means the baseline's score will drop as the test gets harder, because those easy wins that inflate the 70% start counting for less once you add scenarios the baseline can't handle. Watch the baseline column in the table below. It goes down, not up. That's by design.

Where the gap actually comes from

Version 14, my first serious eval run: plugin 87.8% against the baseline's ~70%. Real gap. Here's what created it.

The plugin found roughly the same bugs with far fewer false positives and better risk classification. The evidence builder's lane system meant the reviewer wasn't hallucinating security findings on docs-only PRs. That's the difference between a review a developer reads and one they close after the second paragraph.

Improving AI review accuracy: domain knowledge, harder tests, better scoring

The first lever was domain knowledge. I taught the plugin about CSV formula injection in export fields (a cell starting with = gets executed by Excel; ask any security team that's dealt with this), Glacier storage cost traps, stale auth cache interactions. The kind of bugs a human reviewer with domain expertise catches because they've been burned before. That took the plugin from 87.8% to 94.5%.

Then I made the test harder. Ten new scenarios, tougher bugs, and I reweighted scoring so the gimme scenarios (where both plugin and baseline score 100%) counted for less. The gap blew open: plugin 94.1%, baseline 64.6%. A 29.5 percentage point spread. The harder I made the test, the wider the gap got.
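The effect of down-weighting the gimme scenarios can be sketched with a weighted mean. The per-scenario scores and weights below are made up for illustration and are not taken from the eval; the function name is likewise hypothetical.

```python
def weighted_accuracy(scores: list[float], weights: list[float]) -> float:
    """Weighted mean of per-scenario scores (each score is in [0, 1])."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical data: three "gimme" scenarios, then two hard ones.
plugin = [1.0, 1.0, 1.0, 0.9, 0.9]
baseline = [1.0, 1.0, 1.0, 0.3, 0.3]

equal = [1.0] * 5
reweighted = [0.5, 0.5, 0.5, 1.0, 1.0]  # gimmes count for half

for label, weights in [("equal", equal), ("reweighted", reweighted)]:
    p = weighted_accuracy(plugin, weights)
    b = weighted_accuracy(baseline, weights)
    print(f"{label}: plugin {p:.1%}, baseline {b:.1%}, gap {100 * (p - b):.1f}pp")
```

With these invented numbers both scores fall once the easy scenarios count for less, but the baseline falls further, so the gap grows. That is the same shape as the v15 to v17 change.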

The most interesting version bump barely touched the plugin at all. I changed the eval's scoring weights: risk classification went from 5 points to 10 points per scenario. The gap widened another 9 percentage points. Same plugin code, same scenarios. The plugin had been classifying risk correctly the whole time; I'd been underweighting the thing it was best at.
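A small sketch shows why a weight change alone can move the gap. The 15, 5, and 10 point values match the rubric in this post, but the judge results are hypothetical: both reviewers are assumed to find the bug equally well, and only the plugin classifies the risk correctly.

```python
def rubric_score(fractions: dict[str, float], max_scores: dict[str, int]) -> float:
    """Points earned divided by points available for one scenario."""
    earned = sum(fractions[name] * max_scores[name] for name in max_scores)
    return earned / sum(max_scores.values())


# Hypothetical judge results (fraction of each check's points awarded).
plugin = {"bug": 0.8, "memory": 0.8, "risk": 1.0}
baseline = {"bug": 0.8, "memory": 0.8, "risk": 0.0}

old_weights = {"bug": 15, "memory": 5, "risk": 5}
new_weights = {"bug": 15, "memory": 5, "risk": 10}

old_gap = rubric_score(plugin, old_weights) - rubric_score(baseline, old_weights)
new_gap = rubric_score(plugin, new_weights) - rubric_score(baseline, new_weights)
print(f"gap with risk=5: {100 * old_gap:.1f}pp, with risk=10: {100 * new_gap:.1f}pp")
```

Nothing about either reviewer changes between the two lines of arithmetic; the rubric simply starts paying more for the check where they differ.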

Final run, version 21: plugin 97.7%, baseline 66.6%.

`Version  What changed                         Plugin  Baseline Gap
──────── ──────────────────────────────────── ─────── ──────── ─────
v14      First serious eval (33 scenarios)    87.8%   ~70%     ~18pp
v15      Domain-specific hotspots             94.5%   ~70%     ~25pp
v17      +10 harder scenarios, reweighted     94.1%   64.6%    29.5pp
v20      Risk classification weight 5→10      ----    ----     +9pp wider
v21      Evidence builder fix (route guards)  97.7%   66.6%    31.1pp`

Here's what a scenario looks like. This is the session TTL zero eval (one of the "high subtlety" bugs I expected to stump the baseline):

`Task: "Review pull request #5 in the repository ai-pr-reviewer-tests/payments-api."

Criteria (weighted checklist):
{
  "context": "session_data cache TTL set to 0 means sessions persist
    in Redis indefinitely",
  "checklist": [
    {
      "name": "Catches session never-expire risk",
      "description": "Identifies that TTL=0 means sessions stored
        with no expiry, creating stale/orphaned sessions if the
        auth layer fails to explicitly delete them.",
      "max_score": 15
    },
    {
      "name": "Catches unbounded Redis memory growth",
      "max_score": 5
    },
    {
      "name": "Risk classified yellow or higher",
      "max_score": 10
    }
  ]
}`

The task is one sentence. The rubric is weighted: catching the core security risk (session never-expire) is worth 15 points, the memory growth consequence 5, and risk classification 10. The baseline caught this one at 100%.

When fixing the reviewer prompt doesn't work

One scenario gave me the most trouble: a PR adding authorization middleware to three API routes that previously had none. Correct code, good security practice. The plugin kept flagging it as HIGH severity: "potential security misconfiguration in route handling."

I rewrote the reviewer's instructions four times. Version one: I told the reviewer to consider whether route guards are additive security measures. Still flagged. Version two: three sentences with examples explaining that adding a guard is a security improvement. Flagged. Version three: I restructured the entire reviewer prompt section on security findings. Same result. Version four: I got specific. "If the change adds authorization checks to routes that previously had none, this is a hardening change, not a vulnerability."

Still flagged it.

The reviewer wasn't broken. The evidence builder upstream had classified the route change as "red lane": high risk, security-relevant, requires deep scrutiny. By the time the reviewer saw the code, the framing was already set. I'd been tuning the wrong skill for a week.

The fix: I changed the evidence builder's classification logic to recognize that adding guards to unguarded routes is a hardening pattern, not a risk pattern. The evidence pack now classified it as green-lane. The reviewer read the same diff, saw a green-lane classification, and correctly identified it as a security improvement.

4% accuracy on that scenario became 100%. I never touched the reviewer. The only thing that changed was what the evidence builder told it before it started reading the code.

Upstream evidence quality determines downstream review quality. The reviewer is only as good as the evidence pack it's handed. Fixing the reviewer's prompt is like arguing with a judge after the prosecution already presented tainted evidence. The bias is baked in before the verdict.

Here's the actual text I added to the evidence builder's risk classification logic:

`Auth risk requires call-site analysis. Do not classify a PR as red
solely because it touches permission-checking code. Read the call
sites to determine whether the effective access policy changed.

For example, a switch from every() to some() on a role array changes
behavior — but if every call site passes OR-style role lists, some()
is the correct semantic and the change is a bug fix, not a regression.
Classify based on whether the access policy actually changed.`

That's it. Two short paragraphs of guidance in the evidence builder, telling it to check call sites before panicking about auth changes. The reviewer's prompt didn't change at all.
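The every() versus some() case in that guidance can be made concrete. The sketch below uses Python's all() and any() as stand-ins for the JavaScript methods; the routes, roles, and function names are hypothetical.

```python
def allowed_all(user_roles: set[str], required: list[str]) -> bool:
    """Old behavior: the user must hold every listed role (like every())."""
    return all(role in user_roles for role in required)


def allowed_any(user_roles: set[str], required: list[str]) -> bool:
    """New behavior: the user must hold at least one listed role (like some())."""
    return any(role in user_roles for role in required)


# Hypothetical call sites, each passing an OR-style role list.
call_sites = {
    "/reports": ["admin", "analyst"],  # intent: admin OR analyst
    "/billing": ["admin", "finance"],  # intent: admin OR finance
}

analyst = {"analyst"}
for route, roles in call_sites.items():
    print(route, "all:", allowed_all(analyst, roles), "any:", allowed_any(analyst, roles))
```

Read in isolation, the diff loosens a permission check. Read against the call sites, the old code was wrongly denying an analyst access to /reports, and the new code still denies them /billing. The effective policy only becomes visible once the call sites are in view.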

AI catches more bugs than the research predicted

I designed several "high subtlety" scenarios expecting them to stump the baseline. Session TTL set to zero. A crash in an authentication provider that fails open instead of closed. The baseline caught both at 100%.

Models are more capable than the 2025 research estimated. The window for "bugs only AI-guided review can find" is narrower than I assumed, which is exactly why the plugin's value lives in the evidence pipeline (risk classification, false positive suppression, structured handoff) rather than in raw bug detection.

LLM variance, though, is real. One scenario (correlation ID propagation) scored 88% in one run and 36% in another. Same scenario, same plugin, same model. The difference is just... the model having a different day. Single-run evals can lie to you. I learned this the hard way in the Good OSS Citizen work, and I still almost got burned by it here.
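One hedge against this is to report a mean and spread over repeated runs instead of a single number. The sketch below uses the two scores quoted above; the function name is hypothetical, and two runs is only the minimum needed to compute a spread at all.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Mean and sample standard deviation of repeated runs of one scenario."""
    return {"runs": len(scores), "mean": mean(scores), "stdev": stdev(scores)}


# The two correlation-ID scores reported above.
summary = summarize_runs([88, 36])
print(summary)
```

A spread this large relative to the mean is a signal to run the scenario more times before reading anything into a version-to-version change.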

The gap we haven't closed: developer trust

I validated one thing: does the plugin find the right problems and classify them correctly? Yes. 97.7% across 43 scenarios says yes.

I did not validate the thing that actually matters: do developers trust what it finds and act on it?

The 2025 research says AI review comments get adopted 1-19% of the time. My plugin produces better-structured, higher-signal findings. Maybe that adoption rate is higher. Maybe it isn't. I have zero data.

The retrospective skill exists. It's designed to compare the plugin's findings against human decisions and feed the results back. I never ran it. Not once. The plugin has a feedback loop that has never looped.

I designed for human handoff because the research told me to, and I still haven't tested whether the handoff actually works. Finding the right bugs is solved. Whether a developer reads the brief and actually changes their merge decision, that's the question this plugin can't answer yet, and it's the one that decides whether any of this matters.

Try it yourself

`tessl install tessl-labs/pr-review-guardrails`

The eval corpus is in the GitHub repo. Forty-three scenarios across four test repos with rubrics. Fork it, add scenarios from your own domain, run the eval. If you use the retrospective skill on a real PR, you'll have more adoption data than I do.

Further reading: Part 1 (what the plugin does and how to use it) | Research brief and eval corpus

I Built an AI PR Reviewer That Catches Bugs by Not Looking for Bugs

Tessl — Tue, 07 Jul 2026 06:43:42 +0000

TLDR

Humans don't want to review AI-generated code (why spend an hour reading something that took 30 seconds to generate?), and AI reviewers get ignored 81-99% of the time. PR review is broken from both sides.
The plugin that hit 97.7% accuracy doesn't hunt for bugs. It builds an evidence pack, classifies risk into lanes, and hands a structured brief to a human who makes the actual call.
Install it with tessl install tessl-labs/pr-review-guardrails and point it at a real PR. You'll know in five minutes whether this approach works for your codebase.

PR review is broken from both sides.

Humans don't want to do it. The effort asymmetry is brutal. An agent generates a PR in 30 seconds, and now a human is supposed to spend an hour carefully reading code they didn't write, didn't design, and can't ask clarifying questions about. That's a hard sell even when the code is good. When the code is AI-generated, the motivation drops further. Who wants to be a proofreader for a glorified word-guessing monkey?

So hand it to another AI? The 2025 research says that doesn't work either. AI code review comments get adopted 1-19% of the time, depending on the study, while human reviewer comments land at significantly higher rates. The gap is signal-to-noise. AI reviewers flood PRs with findings, most of them either obvious (the linter already caught it) or wrong (the code is fine, the reviewer hallucinated a vulnerability). Developers learn to ignore the firehose.

I built a Tessl plugin to try a different approach. A Tessl plugin (used to be called a "tile") is a context artifact: a bundle of skills, rules, and scripts that gives an AI coding agent domain-specific context. Think npm packages, but for agent behavior instead of code. Mine doesn't try to be a better bug finder. It builds a dossier of evidence about the PR, classifies the risk, and hands a structured brief to a human who makes the actual call.

Where it started

Earlier this year, I spent some time researching how AI-generated PRs are wrecking open source maintainers. That became the Good OSS Citizen plugin, teaching agents how to contribute. But while studying the flood of AI-generated PRs, I kept circling back to the other side: who reviews all this code?

The research said something useful: AI is good at local, checkable problems: buffer overflows, missing null checks, SQL injection in a query builder. Things where you can point at a specific line and say "this is wrong because X." What AI is bad at is intent, architecture, and trade-offs. The stuff that requires understanding why the code exists, not just what it does.

So the design question became: what if the AI reviewer's job isn't to find bugs? What if its job is to gather evidence and let the human make the call?

Build the dossier first. Let the opinions follow from that.

How the evidence-first review pipeline works

The plugin has six skills. The first one matters most.

The evidence builder reads the diff, maps which files changed, figures out what kind of change this is, and classifies risk into lanes: green (routine), yellow (needs attention), red (security-relevant, requires deep review). Everything downstream flows from this classification. A README fix gets a green lane and a light pass. A change to the auth middleware gets red and the full treatment.
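As a rough mental model of lanes, here is a deliberately naive sketch that assigns a lane from file paths alone. The patterns and function names are hypothetical and this is not the plugin's logic, which is driven by written guidance and looks at what the change actually does.

```python
from fnmatch import fnmatch

# Hypothetical path patterns, for illustration only.
RED_PATTERNS = ["*auth*", "*session*", "*secrets*"]
GREEN_PATTERNS = ["*.md", "docs/*"]
ORDER = {"green": 0, "yellow": 1, "red": 2}


def lane_for_file(path: str) -> str:
    """Red patterns win over green ones; anything unmatched is yellow."""
    if any(fnmatch(path, p) for p in RED_PATTERNS):
        return "red"
    if any(fnmatch(path, p) for p in GREEN_PATTERNS):
        return "green"
    return "yellow"


def lane_for_pr(changed_files: list[str]) -> str:
    """A PR takes the highest-risk lane among its changed files.

    Raises ValueError if `changed_files` is empty.
    """
    return max((lane_for_file(f) for f in changed_files), key=ORDER.__getitem__)


print(lane_for_pr(["README.md"]))
print(lane_for_pr(["README.md", "src/auth/middleware.py"]))
```

The useful property is the escalation rule: one security-relevant file is enough to put the whole PR in the red lane, and everything downstream adjusts its depth to match.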

Then the fresh-eyes reviewer gets the evidence pack and the code. It hunts for problems, but only problems the evidence supports. If the evidence builder classified a PR as green-lane, the reviewer isn't going to invent an exotic attack vector in a README change. If I enabled the optional challenger (a second model checking the first reviewer's work), that runs next. The research says cross-model review works as a verification layer, and I wanted to test that claim.

After the review, a synthesizer compresses everything into a single brief with findings, evidence, confidence levels, and a recommendation for what a human should focus on. The human handoff formats that brief for the person who actually decides whether to merge.

There's also a retrospective skill that's supposed to run after the human makes their call, comparing the plugin's findings against the human's decision. A feedback loop that's supposed to improve the plugin over time.

What an AI code review brief looks like

When you run the plugin on a PR, the human reviewer gets a brief.

The brief starts with risk classification (green, yellow, or red) so you know immediately how much attention this PR needs. A green-lane config change gets a one-paragraph summary. A red-lane auth change gets the full breakdown: which files are security-relevant, what data flows through them, what the specific risks are, and what to look for when you read the code.

Each finding comes with evidence: the specific lines, why the plugin flagged them, and a confidence level. A finding that says "this user input reaches the SQL query on line 47 without sanitization" is something a developer acts on. A finding that says "potential security concern in this module" gets ignored before the developer finishes reading it. The plugin is built to produce the first kind.
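One way to picture a finding of the first kind is as a small record that cannot be created without its evidence. The field names, the rendering layout, and the example values below are assumptions for illustration, not the plugin's schema.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str    # e.g. "HIGH", "MEDIUM"
    action: str      # e.g. "fix", "verify", "discuss"
    title: str
    file: str
    line: int
    evidence: str
    confidence: str  # e.g. "high", "medium", "low"

    def render(self, index: int) -> str:
        """Format the finding with its location and evidence attached."""
        return (
            f"Finding {index} [{self.severity} / {self.action}]: {self.title}\n"
            f"  File: {self.file}:{self.line}\n"
            f"  Evidence: {self.evidence}\n"
            f"  Confidence: {self.confidence}"
        )


finding = Finding(
    severity="HIGH",
    action="fix",
    title="User input reaches SQL query without sanitization",
    file="src/api/search.py",  # hypothetical path
    line=47,
    evidence="Request parameter is concatenated into the query string.",
    confidence="high",
)
print(finding.render(1))
```

Because every field is required, a vague finding with no file, line, or evidence has nowhere to live in this structure.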

The brief also tells you what it didn't check. If the PR touches areas outside the plugin's domain knowledge, it says so instead of pretending it reviewed everything.

Here's what the plugin produced for a real PR that changes Redis cache TTL configuration in a payments API:

`PR: #5 — Update Redis cache TTL configuration
Risk lane: RED
  - Cache invalidation logic changes with auth-adjacent session_data prefix
  - TTL=0 introduces keys that never expire (memory and security implications)
  - Mandatory human review required (auth/security, cache invalidation)

Finding 1 [HIGH / verify]: session_data TTL config entry has no consumer
  File: src/cache/cache_layer.py:17
  "session_data": 0,  # sessions managed by auth layer, no TTL needed
  Evidence: grep for session_data across src/ returns zero results.
  src/auth/sessions.py manages its own Redis keys with 24h TTL,
  bypassing the cache layer entirely.

Finding 2 [HIGH / fix]: Zero-TTL cache entries persist forever
  File: src/cache/cache_layer.py:47
  if ttl == 0: r.set(key, json.dumps(value))
  Evidence: No background cleanup, no maxmemory-policy safeguard,
  no monitoring for key count growth. Gradual Redis memory leak.

Finding 3 [MEDIUM / discuss]: payment_details staleness window widened to 5min
Finding 4 [MEDIUM / fix]: New ttl==0 branch in set_cached is untested

Questions for human reviewer:
1. Is there a planned follow-up PR that routes session data through cache?
2. Are Stripe webhooks invalidating cached payment details on status changes?
3. What is the Redis maxmemory-policy in production?`

Four findings, each with the specific file, line, code, and evidence trail. The human reviewer knows exactly what to focus on and why.
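Finding 2 is easy to reproduce in miniature. The sketch below is a reconstruction under assumptions: only the `if ttl == 0` line comes from the brief, the rest of `set_cached` is guessed, and `FakeRedis` is a tiny in-memory stand-in that mirrors how Redis reports TTLs (-1 for a key with no expiry, -2 for a missing key).

```python
import json
import time


class FakeRedis:
    """Tiny in-memory stand-in for a Redis client, for illustration only."""

    def __init__(self):
        self._store = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value):
        self._store[key] = (value, None)  # stored with no expiry

    def setex(self, key, seconds, value):
        self._store[key] = (value, time.monotonic() + seconds)

    def ttl(self, key):
        """Mirror Redis TTL: -2 if missing, -1 if no expiry, else seconds left."""
        if key not in self._store:
            return -2
        _, expires_at = self._store[key]
        if expires_at is None:
            return -1
        return max(0, round(expires_at - time.monotonic()))


def set_cached(r, key, value, ttl):
    """Assumed shape of the function named in Finding 4."""
    payload = json.dumps(value)
    if ttl == 0:
        r.set(key, payload)  # the branch flagged in Finding 2
    else:
        r.setex(key, ttl, payload)


r = FakeRedis()
set_cached(r, "session_data:abc", {"user": 1}, ttl=0)
set_cached(r, "payment_details:42", {"status": "paid"}, ttl=300)
print(r.ttl("session_data:abc"))  # -1: the key never expires
```

A test along these lines would also address Finding 4, since it exercises the ttl==0 branch directly.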

What AI code review still can't do

The plugin doesn't replace the human reviewer. The research is clear on this: AI review catches local, checkable problems. Intent, architecture, trade-offs: those are still yours. The plugin's job is to do the tedious forensic work (trace this data flow, check this input path, verify this config isn't exposed) so the human can focus on the questions only a human can answer: should this feature exist? Does this design make sense? Is this the right trade-off?

It also doesn't have real-world adoption data yet. I validated that it finds the right problems across 43 eval scenarios (97.7% accuracy against a 66.6% baseline). I did not validate whether developers trust what it finds and act on it. That's the honest gap. If you run the retrospective skill after a real review, you'll have more data than I do.

In Part 2, I'll show how I built the eval, what I learned from eight rounds of iteration, and the debugging story where I spent a week fixing the wrong skill.

Try it

`tessl install tessl-labs/pr-review-guardrails`

The plugin, the eval corpus, and the research brief are all in the GitHub repo. Point it at a PR you've already reviewed and compare its brief against what you found. That's the fastest way to know if the evidence-first approach works for your codebase.

Further reading: Good OSS Citizen Part 1 (the research that started this) | Research brief and eval corpus

Tessl Academy is live (in preview) — and there are two ways in

Tessl — Sat, 04 Jul 2026 07:13:53 +0000

We just shipped the first version of Tessl Academy, a hands-on curriculum for building, evaluating, and running skills for coding agents. It's early. Two courses are up — Skill Foundations and Tuning Your Agent — with more on the way. We'd rather get it in front of you now and shape it with your feedback than polish it in private for another month.

Here's the idea. Most of us are already using coding agents, but the results swing between magic and mess. The Academy is about closing that gap: moving from one-off AI coding experiments to workflows you can repeat and trust. Skills are the thread running through every lesson — small, reusable instructions your agent loads on demand.

Two ways to take it

We built the Academy so you can learn whichever way suits you right now:

Read it. Every lesson works as a plain read on the site. No install, no setup — open a lesson and go. Good for a commute, a coffee, or deciding whether the hands-on version is worth your time.
Run it. Install a course once, then ask your agent — Claude Code, Cursor, Codex, or Tessl Agent — to walk you through a lesson. It guides you one step at a time, waits while you work, and hands off to the next lesson when you're done. You learn skills by building one.

Same content, two speeds. Start by reading and switch to hands-on whenever you like — the Quickstart gets you running in about four steps.

It's a preview, and your feedback shapes it

This is genuinely a first cut. Some lessons will land, some won't, and the roadmap past these two courses is still open. That's where you come in: tell us what's confusing, what's missing, and what you'd want to learn next.

Join the conversation in our Discord
Or email me directly: alan@tessl.io

I'll be reading everything. Expect the Academy to move quickly over the coming weeks, and the fastest way to influence where it goes is to try it and tell me what you think.

Start with the Quickstart →

Your agents keep making the same mistakes. Nobody has time to fix it.

Tessl — Wed, 01 Jul 2026 07:18:31 +0000

AI coding agents are getting better at the tasks you give them direct feedback on. Everything else stays broken.

You leave the same comment in code review three sprints in a row. There's a recurring task that could run as an automation but it's on the backlog because no one has time to stop and systematize it. The context your agents need to do better work — updated conventions, patterns from past PRs, recurring fixes — exists in your commit history and session logs. Nobody has time to extract it and package it up.

Agent enablement is real work. It just never gets done.

What teams usually do

Most teams handle this one of three ways: rely on PR review to catch the same errors week after week, schedule occasional cleanup sprints to update skills and conventions (that never actually get scheduled), or accept that their agents plateau.

All three require engineers to stop building to maintain the thing that's supposed to help them build faster.

Introducing Tessl Agent — open beta

Today we're launching Tessl Agent.

Point it at a repo. It scans your PRs, coding agent session logs, and tickets continuously. When it spots a recurring error pattern, it creates a skill to address it and opens a PR. When it finds a task your team runs manually every week, it turns it into a GitHub Actions workflow. Then it asks if you want it to keep doing that automatically: daily, weekly, or on a schedule you set.

Tessl Agent is built to get you to stop using it interactively. You work with it, and at the end of each session it says: I could set some of these up as recurring actions. I could create a CI/CD check for this. The goal is that most of the recurring work — finding optimizations, catching agent mistakes, updating context — runs on a trigger and files issues without you having to ask.

What it looks like in practice

The use case we use most at Tessl: setting up an agentic code review harness.

You type something like set up agentic code review or I want to spend less time reviewing code. Tessl Agent scans your PRs, your issue tracker, and your coding agent session logs. It surfaces what's there: your style guide, common agent failure patterns, comments your team leaves repeatedly in review. Then it walks you through building on that.
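To give a feel for what "comments your team leaves repeatedly" means in practice, here is a toy sketch that counts near-duplicate review comments. It is not how Tessl Agent works internally; the function names, the threshold, and the sample comments are all hypothetical.

```python
import re
from collections import Counter


def normalize(comment: str) -> str:
    """Lowercase and collapse punctuation and whitespace so near-duplicates match."""
    return re.sub(r"[^a-z0-9]+", " ", comment.lower()).strip()


def recurring_comments(comments: list[str], min_count: int = 3) -> list[tuple[str, int]]:
    """Return normalized comments seen at least `min_count` times, most common first."""
    counts = Counter(normalize(c) for c in comments)
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


# Hypothetical review comments.
comments = [
    "Please add a test for the error path.",
    "please add a test for the error path",
    "Please add a test for the error path!",
    "Nit: rename this variable.",
]
print(recurring_comments(comments))
```

Anything that clears the threshold is a candidate for a skill or rule, so the agent stops making the mistake instead of a reviewer pointing it out again.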

First, it creates a code review skill that maps to your team's best practices. Unlike a one-click tool you forget about, this is a skill you own; you can update it, augment it, share it across workflows. From that point, every PR gets agentic review automatically. Then it sets up a recurring loop that optimises that review over time, so the quality of automated review improves as your codebase evolves.

You spend time reviewing code and shipping features, knowing the routine work is handled.

It works alongside your coding agent, not instead of it

Tessl Agent is not a replacement for Claude Code, Codex, or whatever you're using. It runs in the background. You don't context-switch to it mid-session.

It's also provider-agnostic — it works with CodeRabbit, GitHub Actions, and your existing stack. It's not tied to any one coding agent, which matters when you want something that works across your whole development workflow, not just within one tool.

The compounding effect

This is what loop engineering looks like in practice. Each automated improvement creates the conditions for the next one.

A skill that encodes a common pattern means your agent makes that class of error less often. An automated workflow that runs weekly means recurring tasks get systematised instead of repeated. At some point you look up and 40 or 50% of your PRs don't have a human looking at them. You never had to run a big initiative to make that happen. You got started, kept building, and over time delegated more to the agent.

That's the path toward a software factory. Not a big-bang platform migration, but incremental agent enablement that compounds week over week.

Try it

Tessl Agent is in open beta and free to try. Download the Tessl CLI, run tessl, and open a session. A good starting point: pull up the last month of your team's coding agent sessions and ask what's broken, what's taking a lot of your time. The findings tend to be immediately useful.

Try Tessl Agent for free or book a demo.

Why Warp is betting engineering leaders are done picking a favourite coding agent

Tessl — Mon, 29 Jun 2026 06:48:20 +0000

Engineering leaders have spent the past year trying to get their teams to adopt AI coding tools as quickly as possible. Now, a new set of questions has taken over: how do you measure whether any of it is worth the money, and how do you stop agents from running unchecked on production systems?

Developer tooling company Warp, an open agentic development environment built from the terminal up, thinks the answer isn't picking a single agent and standardising on it — it's giving teams a way to run several at once, compare them, and govern all of them from a single control plane.

As Tessl wrote back in February, orchestration has emerged as a discipline in its own right: a dedicated layer of tooling for coordinating, supervising and directing multiple agents running in parallel. That same month, Warp launched Oz as a cloud platform for running and managing coding agents at scale.

Now, Warp is taking things a step further. In May, the company expanded Oz into what it's calling the first multi-harness control plane — meaning teams can now run Claude Code, Codex and Warp Agent simultaneously through a single interface, rather than committing to any one of them.

Tessl caught up with Warp CEO Zach Lloyd to discuss how engineering leaders are thinking about agent fleets, what the harness layer actually changes, and where the lines between autonomy and human oversight are really being drawn.

"The wild west": how the agent gold rush became a budget problem

Zach spent several years at Google, leading engineering on Docs and Sheets before co-founding photo-editing startup SelfMade. He later served as interim CTO at Time, before founding Warp in 2020, raising north of $70 million in funding from the likes of Sequoia, Google Ventures, Figma co-founder Dylan Field, and Salesforce’s co-founder Marc Benioff.

That background — building collaborative tools at Google scale, then navigating the startup world — gives Zach a particular vantage point on how quickly the engineering tooling landscape has moved. A year and a half ago, he says, most companies were still trying to get developers to use AI autocomplete tools. Then, about a year ago, the conversation moved to interactive agents — Claude Code, Codex, Warp — where engineers were directing tools to build features and fix issues end to end.

Now, he says, that phase too has largely passed — and the CFO's arrival in the conversation is perhaps the clearest sign of it.

"Companies right now have moved from a 'can we get people to adopt' mindset to a 'how do you measure ROI' mindset," Zach explained. "They're paying a lot of money for these tools, and the CFO has gotten involved. All these costs are showing up, and so they are thinking through how to go from the wild west, where every engineer is just spending as much as they can on different agents, to a world where they're still creating as much productivity as possible. But they want to measure it, they want to put quotas and budgets in place, and they also want to use different agents for different types of tasks."

That last point is central to Warp's multi-harness bet. Rather than standardising on a single agent, Zach argues that engineering teams want the ability to route different tasks to different agents depending on what each does best, while keeping the governance layer consistent across all of them.

"The biggest trend that we see is: can you use open-weight models for some tasks when you have to be at the frontier?" Zach said. "The way that we're positioning Oz is that you can basically not lock into one source of intelligence. You can use Claude Code, you can use Codex, you can use open-weight models, but you can still confidently invest in a layer of infrastructure for governance that is not tightly coupled to any one particular agent."

The economics driving that are already visible. Open-weight models — DeepSeek, Kimi, Qwen — have gone from lagging well behind the frontier to matching it on many tasks, and at a fraction of the inference cost. Tessl also recently switched its default eval model from Claude Sonnet 4.6 to GLM 5.1 for exactly this reason — finding that for skill evaluation work, a cheaper open-weight model produced near-identical signal at meaningfully lower cost.

Elsewhere, AI agent startup Lindy recently moved 100% of its traffic from Anthropic to DeepSeek v4, with CEO Flo Crivello claiming the company would be saving millions in the process.

It's worth noting that Warp has been doubling down on openness more broadly, open-sourcing its client earlier this year and using Oz itself to manage the repo — agents handle the implementation, community contributors handle direction and verification.

“We now have a lot of confidence in code that is generated by Oz with our rules, context and verification, so anyone contributing should have a high chance of success coding a feature correctly,” Zach said at the time.

The move also serves as a live test of Warp's own thesis — if the orchestration layer is good enough to run a public repo at scale, it's good enough for enterprise teams to trust with their own.

“Leaning on agents creates pressure for us to nail orchestration, memory, handoff, and all of the other parts of agentic engineering that are core to our business,” Zach continued. “There’s a virtuous loop here.”

That loop extends to customers too. The things that matter most — context management, memory, audit logs — can all be separated from the agent itself, Zach argues. That's the point of Oz: a container layer for all of it, so that when the best model or harness changes — and Zach is clear that it will, every few months — teams aren't starting from scratch.

The model isn't enough: why the harness and context matter just as much

The natural question is whether multi-harness is a solution in search of a problem. If Claude Code and Warp Agent can both run on Anthropic models, what is the harness actually changing?

Zach's answer is that performance is a function of three things working together: the model, the harness, and the context.

"The harness is what feeds the context in," Zach said. "You want a harness that is good at managing the context window — when do you take different sources of external context and put them in? If you put too much context in, the model has to summarise and it loses information on the current task. How you manage that context window is really important. Different harnesses excel at different things — Claude Code is a great harness, Codex is a really good harness, Warp's agent harness is [also] really good."

The model and the harness are table stakes. The third element — organisational context — is where Warp is investing most heavily right now, through what it calls cross-harness memory. The idea is that as agents complete tasks, the system captures what worked and surfaces it automatically in future runs, across whichever harness is being used.

"Every time one of these agents runs, it does some task, and maybe in the course of figuring out some problem, with the guidance of a human, they arrive at some solution," Zach said. "What you don't want to do is throw that away and start from scratch next time. If you have a memory system, think of it as a layer that is observing what all of your agents are doing and being like: this seems like an important thing to remember."

Cross-harness agent memory is currently in research preview with a small number of pilot customers.

More autonomy, more controls: Warp's answer to an uncomfortable balancing act

The tension at the heart of Oz's pitch is one that Zach doesn't try to resolve so much as manage. On the one hand, the platform promises agents that can handle complex, long-running tasks — migrations, production deployments — with less human oversight. On the other, the same release adds approval gates, per-user authentication, and least-privilege permissions.

Those two things pull in opposite directions.

"I think there's a fundamental tension, but I think it's necessary," Zach said. "From talking to our customers, I don’t think companies are ready to be fully hands off. The ideal system at this moment looks like a factory floor, where you want to put stuff that can be automated through an automation process, but then you want a human to step in and say: ‘was this done right’?"

The logic Zach applies is essentially risk-tiering. The parts of the stack where errors are cheapest get automated first; the parts where they are most costly stay human-supervised longest.

"The parts that can be most automated are the parts where the risks are lowest — this is common sense," Zach said. "Making changes to our website is way lower risk than making changes to our data. So you'll see more and more of the guardrails go away on the low risk things before they go on the high risk things."

As for who inside an enterprise actually draws those lines, Zach says it's rarely one team. Platform teams or dedicated AI developer productivity functions tend to lead, with security always involved and finance increasingly so.

"The security team is always involved — probably the team that's most scared," Zach said. "Increasingly there is a cost management component. What's the budget for this? What's the token budget per engineer? What's the way that you see ROI? It's starting to become a significant line item for all of these customers."

Evals: measuring the factory floor

Which brings the conversation to evals — how teams actually know whether any of this is working. Zach's framing here draws again on the factory floor analogy: what you want, ultimately, is a bird's eye view of how work flows from idea to shipped product.

Warp has built a live version of this for its own open-source repository at build.warp.dev, where anyone can pull up a view of how issues move through the agent pipeline. Zach uses it as a reference point for what enterprise teams should be aiming for.

"The things you can measure are throughput of code as one basic measurement," Zach said. "Ideally, in a more sophisticated world, you would go all the way from measuring throughput of code to throughput of user or customer impact — be able to tie back: ‘a ticket came in asking for this feature, an agent was able to build it, it cost this number of dollars or tokens, and in production it was used by XYZ customers’. That's the dream loop. The code part is not that hard — that's where we can just deliver."

Token efficiency per PR is the baseline metric Warp currently offers. The harder problem — tying agent output to business outcomes — remains what Zach calls the “holy grail.”

The agent builder: a new role that doesn't require an engineering background

One of the more striking parts of the conversation is what Zach describes happening to engineering teams themselves as agent fleets become the norm — at Warp and at the companies it works with.

The background profile of engineers Warp hires hasn't changed much, he says. What has changed is what they do.

"The day to day of a software engineer now is not about writing code," Zach said. "It's about: can you accurately specify a user requirement to an agent? Can you make sure that the technical plan an agent comes up with makes sense? Is it building in the right part of the codebase? Is it repeating a bunch of code? Is it using the same quality of abstraction that a human would use?"

Beyond that shift in existing roles, Warp has also introduced a new function it calls the agent builder — a full-time role focused on building internal automations using agents. Notably, the people filling it don't come from engineering backgrounds.

"The people who are in this role are people with product and design backgrounds," Zach said. "They are not engineers by training, and I don't think you need that. For internal tooling use cases you can hire people who are more generic builders. One of the cool things that's come out of all this new technology is a democratisation of who gets to build stuff."

The caveat is that this only holds where the stakes are low — customer-facing product, he implies, is a different matter. "As long as it's not customer-facing, I think it's pretty much fine for that to work that way," Zach said.

Among the companies Warp works with, Zach sees two distinct camps emerging. Larger organisations with dedicated developer productivity teams are building their own internal software factories from scratch — the complexity is manageable if you have the headcount. Smaller ones are buying, because the build cost simply doesn't justify the investment. What they share, he says, is the destination: a centralised system where agents handle the routine work and humans focus on the exceptions.

What that means in practice for engineering leaders is less about which agent to pick and more about building the layer around it — the governance, the memory, the measurement — that makes any agent trustworthy enough to run at scale.

For all the variation in how companies are approaching this (different tools, different team structures, different risk tolerances), Zach sees them all heading toward the same place.

"The goal of most companies right now is to get to what I would call an internal software factory: a centralised system where agents are taking in issues, judging, building, verifying, pushing," Zach said. "They don't want to do that for 100% of the issues, and they don't want to take humans out of the loop. But they're all trying to stand up this same kind of machine. And different companies are further along on this journey than others."

See You at AI Engineering World's Fair 2026

Tessl — Sun, 28 Jun 2026 07:51:55 +0000

Next week, the Tessl team is heading to AI Engineering World's Fair 2026, and we couldn't be more excited to spend a few days with the community talking about the future of AI engineering.

If you're attending, come and find us at Booth L-G48. We'll be demoing our latest product, sharing what we've been building, and talking all things agentic development with engineering teams from around the world.

Come and meet the team

At Tessl, we believe skills are the new code. Treat them that way.

Tessl enables development teams to continuously build, test, distribute and optimize agent skills with the security and governance of enterprise software.

Throughout the event, our technical team will be running live demos at the booth and chatting with attendees about everything from coding agents and agent workflows to evaluation, context management and harness engineering. Whether you're just getting started or already deploying agents in production, we'd love to hear what you're building.

We're also running a competition throughout the conference, with prizes including:

🎁 Ray-Ban Meta Smart Glasses
🎟️ A ticket to AI DevCon in New York this November

Unveiling Tessl Agent

AI agents shouldn't just write software—they should continuously improve how software gets built.

At AI Engineering World's Fair, we'll be unveiling Tessl Agent.

Build your software factory, one workflow at a time.

Tessl Agent makes your agents more autonomous over time. It continuously scans your pull requests, session logs and tickets for recurring mistakes and opportunities, automatically opens improvement PRs, turns repeated patterns into automated workflows, and ships them through GitHub Actions—creating a software factory that compounds week after week without slowing feature delivery.

If you'd like to see it in action, stop by the booth for a live demo.

The conversation we're most excited about: Harness Engineering

Every conference has a theme. This year, we think it'll be Harness Engineering.

AI models are getting smarter every month. The challenge is everything around them.

Agents need context. They need evaluation, testing, guardrails, observability and workflows that help them operate reliably in production. In short, they need a harness.

We believe Harness Engineering is becoming one of the defining disciplines of modern AI engineering, and we're looking forward to hearing how the community is tackling these challenges.

Catch our talks

We're delighted to have two Tessl speakers presenting on Thursday, July 2.

Coding Agents Don't Scale Themselves. Neither Do Your Teams: The Rise of Agent Enablement

🕜 1:30–1:50 PM

Patrick Debois, AI Product Engineer

Coding agents are transforming software development, but the context that drives them is still managed with ad hoc prompts, copied rule files and undocumented practices.

Patrick introduces the Context Development Lifecycle—a framework for treating context with the same engineering discipline we've spent decades applying to code—and explores how teams can build a feedback loop that continuously improves agent performance over time.

Harness Engineering: The New Core Skill for Agentic Developers

🕝 2:50–3:10 PM

Dru Knox, Head of Product & Design

As coding agents become more capable, success depends less on writing code and more on upgrading your codebase so agents can reliably succeed.

Dru introduces the core loop of Harness Engineering, the common improvements teams are making today, and how Tessl's Harness Engineering Agent helps developers scale those improvements across their software factory.

Join our community event

We're also hosting an evening fireside discussion:

Harness Engineering: Building Reliable AI Systems

📅 Wednesday, July 1 | 6:00 PM

Featuring Steve Yegge and Dru Knox, this conversation explores the emerging discipline of Harness Engineering and what it takes to move AI systems beyond experimentation into reliable production software.

Together they'll discuss the systems surrounding AI models—from context and evaluation to testing, observability and guardrails—followed by audience Q&A and networking with the AI engineering community.

👉 Reserve your place: https://luma.com/7f31tcht

Leadership dinner

Alongside the conference, we're also hosting an invite-only leadership dinner, bringing together engineering leaders and AI practitioners for an evening of conversation about the future of agentic development.

We're looking forward to sharing ideas with some of the people helping define where this industry goes next.

See you next week

AI Engineering World's Fair has become one of the best places to connect with the people shaping the future of software engineering, and we can't wait to be part of it.

Whether you want to see Tessl Agent in action, chat about Harness Engineering, attend one of our talks, or simply swap ideas about building reliable AI systems, we'd love to meet you.

Come and see us at Booth L-G48.

Or, if you'd like to guarantee some time with the team, book a meeting with us through the AI Engineering World's Fair app.

The new Tessl review: now you decide what "good" looks like

Tessl — Wed, 24 Jun 2026 06:41:25 +0000

For a while now Tessl has been able to review the quality of your skills straight out of the box. By simply running tessl skill review you get a score against Anthropic's best practices with no setup required. That is a sensible default and it has served most people well, but a default is still somebody else's opinion that you or your organisation might look at and disagree with.

Today we are launching a new version of Tessl’s review functionality. It does three new things: it reviews your skills agentically with greater accuracy, lets you define what good actually means for your skills, and keeps a shareable history of your skill review runs.

The problem with one definition of good

On one of my skills, the current review provides a quality score of 82%. The description review scores a perfect 100%, but the content section drops to 55%, with conciseness at 1 out of 3 and progressive disclosure at 1 out of 3.

In some people’s view, nothing is wrong with the skill; the judge is marking it down for keeping one tight, self-contained skill rather than spreading it across five files. That is a reasonable position, and it is Anthropic's position. But if your org prefers larger, consolidated skills, an 82 is punishing me for doing exactly what we want. We may even have further constraints that my skill misses and the review overlooks entirely, which gives me a false sense of quality.

Here’s a video of the new Tessl review in action:

Watch on YouTube

Offering a more accurate review

The new Tessl review is invoked using tessl review run from the CLI or via the agent (but make sure it’s calling the new version!) and you need to pass a workspace name where your review results will be stored.

One of the bigger changes is under the hood. Whereas the previous review used an LLM as a judge in a single pass, the new version uses an agent. It takes more turns, gathers more information about the skill and associated files, and reaches a better, more grounded verdict. You will still see some variation between runs, since an LLM judge is non-deterministic by its very nature, but the results are more accurate.

Defining what good skills look like for your organization

This is the exciting part that changes how reviews determine what’s right, as the new review allows you to pass your own rubric, as a plugin, and review against it.

We’ve made a plugin called review-plugin-creator that walks you through building a custom review plugin. This allows you to fork the Anthropic best practices if you only wish to change a few things, so everything sensible stays in place by default and you only change what you disagree with. In my case I flipped a single rule, the one that punishes consolidated skills.

The creator produces a plugin holding your guidelines and rubric. To use it in a tessl review run, you can point to a local copy on the file system, or link to a private or public plugin on the Tessl Registry.

Run the same skill again, this time with your rules, and you’ll see updated scores. In my case, the consolidated skill now scores full marks on conciseness and progressive disclosure, and the content section reflects what my org actually values rather than what a generic default assumes.

Seeing your reviews

Everything you see at the CLI is also on the Tessl Registry. Head to your workspace and you will find your review plugin alongside a full history of review runs. Each run shows the same breakdown you get in the terminal, plus the plugin that produced it, so you always know which definition of good a score was measured against.

In your workspace settings you can set a default review plugin. From then on every review run from that workspace uses it automatically. You can still override it per run with the --review-plugin flag whenever you need to.
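The precedence described above (a per-run flag beats the workspace default) can be sketched in a few lines. This is only an illustration of the rule, not Tessl's implementation: the function name, the `BUILT_IN_RUBRIC` placeholder, and the plugin names are all hypothetical.

```python
from typing import Optional

# Hypothetical placeholder for whatever rubric applies when nothing is configured.
BUILT_IN_RUBRIC = "built-in-default"


def resolve_review_plugin(per_run: Optional[str], workspace_default: Optional[str]) -> str:
    """Pick the rubric for one review run: per-run override first, then workspace default."""
    if per_run:
        return per_run
    if workspace_default:
        return workspace_default
    return BUILT_IN_RUBRIC


# Hypothetical plugin names, used only to show the ordering.
print(resolve_review_plugin(None, "my-org/review-rules"))
print(resolve_review_plugin("my-org/experimental-rules", "my-org/review-rules"))
```

Recording the resolved plugin next to each score is what lets the run history say which definition of good a score was measured against.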

The rest of the toolkit

A few more commands worth knowing:

tessl review list --workspace <workspace-name> lists every review run against a workspace
tessl review view <review-id> opens a single run and shows its full output.
tessl review fix is the new home for the --optimize behaviour you already know from our previous review. It agentically applies fixes to the skill based on a review outcome and can update your SKILL.md directly.

What does this mean for the old command?

tessl skill review is not going anywhere yet. We have deliberately left it in place so nothing breaks for anyone relying on it today, although you may see a deprecation message. That said, tessl review run is where all the work is going from here, so please move across and start using it, so you’re not caught out when we do turn off the older review feature. We’ll also be releasing updates to our GitHub Actions soon to make use of the new tessl review functionality.

Try it now

The new Tessl review is live and you can use it today. Do note that you’ll need a free account in order to use the Tessl review command (you can check the full documentation here). There is plenty more to come and we will keep you posted as it lands. For now, run it against your own skills, write a rubric that matches how your team actually thinks about quality, then tell us how it performs in your environment. Your feedback shapes what we build next.

Customise Tessl review: https://tessl.io/registry/tessl/review-plugin-creator

Learn more about Tessl: https://tessl.io

Claude Fable 5 vs Opus 4.8: The Mythos Hype Meets Reality

Tessl — Sun, 14 Jun 2026 06:39:18 +0000

For months, the most interesting model at Anthropic was one we could not use. Mythos was the internal system the company said was too capable to release, the one that found software vulnerabilities at a level that tripped its own safety thresholds. On June 9, 2026, that tier went public for the first time, as Claude Fable 5. Opus 4.8, the model anchoring production coding agents, suddenly had a successor that's a full capability class above it.

This raises two questions for anyone running coding agents. The practical one is whether you should move your fleet from Opus 4.8 to Fable 5. The bigger one is whether a Mythos-class model, the tier Anthropic held back as too capable to ship, lives up to what the name promised. This article answers both, and the numbers tell a more interesting story than the announcement did.

We ran both models through the same evaluation, close to 1000 shared scenarios scored twice each, once with no skill supplied and once with the relevant skill in context. The short answer, as of mid-2026, is that Opus 4.8 is still the better value for most agent fleets, and the gap between the Mythos hype and the measured reality is the real story in the data.

A Mythos-class model is a tier of Claude that sits above the Opus class in capability. It reaches a threshold Anthropic considers high-risk, particularly at discovering and exploiting software vulnerabilities. Fable 5 and Mythos 5 are the same underlying model with the same capabilities. What separates them is the safeguards: Fable 5 is the public version that ships with safety classifiers, while Mythos 5, restricted to approved partners, runs without them.

What the industry expected from a Mythos-class model

Before launch, the speculation was not subtle. Across Reddit, X, and a run of explainer posts, Mythos was framed as the model that would change how agents work, not just how well they answer. The recurring predictions clustered around four capabilities:

Restructuring a large codebase in one coherent pass.
Spotting security flaws that experienced engineers miss.
Working unsupervised for hours on a single hard problem.
Acting like a collaborator, not an assistant you steer turn by turn.

Of the four, the cybersecurity claim was the one with hard evidence behind it. Through Project Glasswing, roughly 50 early partners with Mythos Preview access reported finding more than 10,000 high or critical severity vulnerabilities, and the program has since expanded past 150 organizations. Anthropic's CPO Mike Krieger called it "the most capable class of systems we've built." That is the dream the name sold: a model so powerful it stayed in the lab.

What reached the public is narrower, and deliberately so. The model you can actually use is Fable 5, the Mythos-class system wrapped in safety classifiers. Whether it delivers comes down to the gap between that promise and what was released.

The headline numbers: Claude Fable 5 vs Opus 4.8

Every scenario in the evaluation is a real agent task tied to a published skill, scored on two axes: instruction-following (does the agent do what it was told, in the way it was told) and task-completion (does it reach the goal). The overall score weights instruction-following at 4 and task-completion at 3, then divides by 7. Each task runs with and without the skill, so the lift from the skill is visible directly. The tasks and skills are public, in the task-evals-for-skills dataset, so you can inspect any scenario yourself.

This design is deliberate. The tasks come from published skills, so they mirror the real work teams write skills for, not frontier puzzles meant to find a model's ceiling. That is why task-completion runs high for both models and why the signal that separates them is instruction-following: doing the work the specific way the skill asks.
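As a quick check on the weighting described above, the sketch below recomputes the overall score from the with-skill instruction-following and task-completion figures reported in the results table. The function name is ours; only the 4/3 weights and the component scores come from the article.

```python
def overall_score(instruction_following: float, task_completion: float) -> float:
    """Weighted mean: instruction-following counts 4, task-completion counts 3."""
    return (4 * instruction_following + 3 * task_completion) / 7


# With-skill component scores as reported in the results table.
fable = overall_score(89.3, 97.8)
opus = overall_score(88.0, 97.4)

print(round(fable, 1))  # 92.9
print(round(opus, 1))   # 92.0
```

Because instruction-following carries the larger weight, a model that reaches the goal but ignores the skill's conventions loses more than one that follows them and falls slightly short.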

Dimension (with skill)	Fable 5	Opus 4.8
Overall score	92.9	92.0
Overall score (no skill, baseline)	75.7	74.5
Overall lift from the skill	+17.2	+17.5
Instruction-following	89.3	88.0
Task-completion	97.8	97.4
Turns to complete	16.9	16.2
Output tokens per task	9,025	10,687
List price (input / output, per MTok)	$10 / $50	$5 / $25
Cost per task (average)	$1.25	$0.74
Points per dollar	74	125

On the 917 scenarios both models ran, Fable 5 leads on overall score by 0.9 points (92.9 to 92.0). Scenario by scenario, the two tie on 61% of tasks, Fable wins 24%, and Opus wins 16%, at a two-point threshold. A capability class above Opus, and on everyday agent skill tasks the quality difference is inside the noise.
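A scenario-by-scenario tally like the one above can be sketched as follows. The per-scenario scores here are made up for illustration, and treating a difference of exactly two points as a win (rather than a tie) is an assumption; the article does not say how the boundary is handled.

```python
from collections import Counter


def tally(scores_a, scores_b, threshold=2.0):
    """Classify paired scenario scores as a win for A, a win for B, or a tie."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same scenarios")
    result = Counter()
    for a, b in zip(scores_a, scores_b):
        diff = a - b
        if diff >= threshold:        # assumption: the boundary counts as a win
            result["a_wins"] += 1
        elif diff <= -threshold:
            result["b_wins"] += 1
        else:
            result["tie"] += 1
    return result


# Hypothetical per-scenario overall scores, not taken from the dataset.
model_a = [95.0, 88.0, 100.0, 70.0, 91.5]
model_b = [94.0, 92.0, 100.0, 75.0, 88.0]
counts = tally(model_a, model_b)
print(dict(counts))
```

Counting ties with a threshold keeps run-to-run judge variation from being read as a real difference between models.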

One caveat sits underneath that number. The 917 are the tasks both models completed and scored. Fable 5 refused 26 that Opus 4.8 finished, and we excluded them, so the near-tie is measured only on the tasks Fable agreed to do. That exclusion turns out to be the most revealing part of the comparison, and we return to it below.

Why agent skill evaluation matters more than the model upgrade

Here is the number that reframes the comparison. The skill adds about 17 overall points to both models: +17.2 for Fable 5 and +17.5 for Opus 4.8. The model upgrade from Opus 4.8 to Fable 5 adds less than 1 point on shared tasks. The context you supply moves the agent far more than the frontier tier you pick.

The lift concentrates in instruction-following, where both models gain more than 27 points from the skill, while task-completion gains under 5. Both models can usually reach the goal on their own. What they cannot do reliably without a skill is follow the specific conventions, constraints, and steps a real task demands. That is what a good skill encodes.

Skill receptivity is how much an agent's output improves when you supply a relevant skill. It shows up mostly as better instruction-following. It matters because it can outweigh the model choice, which is the practical case for investing in agent skills before chasing the newest tier. Running the same task with and without the skill, then measuring the difference, is a task eval. It is also the only way to know whether a model upgrade earns its price on your workload, which is what agent skill evaluation is for.
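The comparison above reduces to two subtractions: the lift from the skill (same model, with and without the skill) and the gain from the model upgrade (same condition, different model). A minimal sketch using the overall scores from the results table; the helper name is ours.

```python
def lift(with_skill: float, baseline: float) -> float:
    """Score gained by supplying the skill, in points."""
    return with_skill - baseline


# (with skill, no skill) overall scores from the results table.
overall = {"Fable 5": (92.9, 75.7), "Opus 4.8": (92.0, 74.5)}

for model, (with_skill, baseline) in overall.items():
    print(f"{model}: skill lift {lift(with_skill, baseline):+.1f}")

# Gain from swapping the model while keeping the skill in context.
upgrade_gain = overall["Fable 5"][0] - overall["Opus 4.8"][0]
print(f"model upgrade: {upgrade_gain:+.1f}")
```

Running both subtractions on your own workload is the quickest way to see whether the context or the model tier is doing the work.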

The price gap is the deciding factor for most teams

On the agent skill tasks we measured, the trade comes down to paying a steep premium for a marginal gain. Fable 5 lists at $10 per million input tokens and $50 per million output tokens against Opus 4.8's $5 and $25, exactly twice across every token category, including cache reads and writes. For that, across our 917 shared scenarios, you get an overall score of 92.9 versus 92.0, a 0.9-point edge that sits well inside the range where the two are interchangeable. This is the everyday-agent-work picture, not a verdict on the marquee Mythos capabilities our eval does not test.

Token behavior softens the unit price but does not close it. Across the 917 shared scenarios Fable 5 generated about 16% fewer output tokens per task (9,025 versus 10,687), so the real cost per task lands at $1.25 against $0.74, a 73% premium rather than a clean 2x. The value gap is the number to remember: Opus 4.8 returns 125 points per dollar to Fable 5's 74, about 69% more quality for every dollar spent.

For a single session the difference is cents. For a fleet running thousands of agent tasks a day, it is the line item your finance team will ask about, and twice the price for under a point of quality on the tasks most teams actually run is not an easy answer to give them.

Fable refuses work Opus completes without issues

The most consequential difference between Fable 5 and Opus 4.8 is not on the scoreboard. It is the safety layer that defines the Mythos class.

Fable 5 ships with safeguards covering four domains: cybersecurity, biology and chemistry, distillation, and frontier LLM development. For the first three, a triggered request comes back as a refusal. Anthropic's design hands the request to Opus 4.8 and informs the user, but that fallback is opt-in rather than a default, so in a stock harness like ours the blocked requests were simply refused.

The fourth domain worked differently during this run. By Anthropic's own documentation, requests touching frontier AI development were not refused or even flagged. The model quietly steered or fine-tuned its answer instead, with no notice to the user. That silent manipulation drew the sharpest backlash, and on June 11, the day after this run, Anthropic switched it to a visible classifier like the other three while conceding the restrictions had been "overly conservative." Because it never produced a refusal, that domain leaves no mark in our numbers; any effect would surface only as quietly weaker answers.

A Mythos-class model routes some requests to a weaker model by design, so your harness needs to detect the fallback rather than trust that every response came from Fable. And the affected domains are exactly the ones you most want to check yourself, which is the practical edge of context governance and security: catch the regression in an eval, not in production.

Our run shows how that plays out, and it is not flattering. Fable 5 refused 26 of the roughly 940 tasks it attempted, returning a usage-policy block with a refusal stop reason instead of doing the work, while Opus 4.8 completed and scored every one of them. What Fable refused is the revealing part. Four were defensive security reviews, including "review this Flask application for security vulnerabilities before deploying it," blocked as "violative cyber content." Five were routine bioinformatics tasks, such as running quality control on a single-cell RNA-seq file. One was a literature review on the landscape of AI-assisted drug discovery. A model from the class Anthropic markets for finding vulnerabilities in critical software declined to audit a Flask app for the developer who owns it. Anthropic's own "overly conservative" admission lands hardest here.
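One way a harness might separate these cases is to inspect each response before scoring it. The sketch below is illustrative only: the response is modeled as a plain dict, and the `stop_reason` and `model` keys, the `"refusal"` value, and the model identifiers are assumptions about what your client returns, not a documented schema.

```python
from collections import Counter


def classify_response(response: dict, expected_model: str) -> str:
    """Label a response as 'refused', 'fallback', or 'ok'.

    Assumes (hypothetically) that the response carries a `stop_reason`
    and the name of the `model` that actually produced the answer.
    """
    if response.get("stop_reason") == "refusal":
        return "refused"
    if response.get("model") != expected_model:
        return "fallback"
    return "ok"


def summarize(responses: list[dict], expected_model: str) -> Counter:
    """Count outcomes across a run so refusals show up in the report."""
    return Counter(classify_response(r, expected_model) for r in responses)


# Hypothetical run: model names and payloads are placeholders.
run = [
    {"model": "fable-5", "stop_reason": "end_turn"},
    {"model": "fable-5", "stop_reason": "refusal"},
    {"model": "opus-4.8", "stop_reason": "end_turn"},
]
print(summarize(run, expected_model="fable-5"))
```

Counting refusals and fallbacks separately from scored tasks keeps them visible in the report instead of letting them drop out of the comparison.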

On the security tasks Fable did complete, it was competitive. Across 51 authentication and security skill scenarios, from Auth0, Better Auth, and Bitwarden, Fable 5 averaged 95.0 with the skill against Opus 4.8's 96.6, a near-tie. The lesson is not that one model is safe and the other is not. It is that a Mythos-class model will sometimes refuse the defensive work you most need done, and only an eval on your own tasks will tell you where.

Did Fable deliver on the Mythos promise?

Our evaluation answers the question that matters for a deployment decision: how both models handle hundreds of real, skill-driven agent tasks across dozens of tool ecosystems, which is the work most teams actually run coding agents on. The marquee Mythos feats sit outside this eval, but the day-to-day behavior it captures is exactly what you are buying when you point a fleet at a model.

What the data does show is where Fable's extra capability surfaces in normal use. Grouped by the organization that owns the skill, Fable 5 pulls ahead on web-research and scraping workloads: Apify (+7.8 overall), Google Gemini (+4.6), Tavily (+3.4), and Firecrawl (+2.7). If your agents fetch, map, and extract from the open web, Fable 5 is the stronger pick. Opus 4.8 holds its ground where Fable regresses: Mastra (-7.3), Auth0 (-4.5), and Axiom (-2.5).

So the Mythos dream of an autonomous collaborator is not what most teams will buy on day one. What they will buy is a model that is marginally better at instruction-following, meaningfully better at web research, twice the price, and gated by classifiers that occasionally hand the job to Opus 4.8 anyway.

When to use each

Choose Opus 4.8 if you run a coding-agent fleet at scale and care about cost per task. The quality difference is inside the noise for most workloads, Opus returns far more points per dollar, and it has no fallback layer to design around.

Choose Fable 5 if your agents do heavy web research and scraping, if you need its reasoning depth on long-horizon tasks, or if you have a workload that genuinely benefits from the capability class above Opus. Budget for the roughly 73% per-task premium, and build fallback detection into your harness from day one. If your work touches the classifier domains, confirm the model is not silently routing to Opus 4.8 before you depend on it.

Fable's edge shows up when you build around it, not when you swap it into an Opus 4.8 pipeline unchanged. Fable is the more autonomous model, but that edge only pays off in flows built for it: longer unsupervised runs, larger units of work, less step-by-step steering.

For almost everyone, the larger lever is neither model. The skill adds about 17 points; the model upgrade adds less than 1. Standardize the model in your tessl.json, prove the switch with an eval before you roll it to the fleet, and watch for the tasks a Mythos-class model quietly declines to do.

Want to see how a skill changes your own agent's behavior, on your own tasks, across both models? Start with the Tessl Registry and run the eval before you switch.

Same quality, a quarter of the cost: Should DeepSeek Flash be your model of choice?

Tessl — Thu, 11 Jun 2026 06:59:02 +0000

$0.0236 is how much DeepSeek V4 Flash costs to run a complete agentic task, skill included, on the Fireworks price sheet. Claude Haiku 4.5 costs $0.10 for the same task. Sonnet 4.6 costs $0.30.

In terms of quality, Flash scores 82.3 in our evals and Haiku scores 82.9. With skills applied, the evals point to the two being comparable, but one costs four times as much.

In our eval we ran 19 model configurations through the same benchmark harness. The tasks were real agentic tasks, and we measured total token counts and looked at what providers actually charge. To be honest, the value story we expected to find was "cheap models are a trap." What we found instead was more interesting, and particularly useful if you're running agents at any kind of scale.

First, the Pro comparison

DeepSeek V4 ships two tiers: Pro and Flash. In our eval runs, Pro costs $0.183/task and Flash costs $0.0236/task. That's a 7.7× price gap within the same model family.

What you get for the extra spend is only three points. On the eval results, Pro scores 85.3 and Flash scores 82.3. At scale, 10,000 tasks/month costs you an extra $19,000/year, and 100,000 tasks/month costs an extra $190,000/year. That is a lot to pay for three points that may be barely visible in output quality.
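The scaling arithmetic is easy to rerun for your own volumes. This is a minimal sketch using the two per-task costs quoted above; the function name is ours.

```python
def extra_annual_cost(cost_a: float, cost_b: float, tasks_per_month: int) -> float:
    """Extra yearly spend from choosing the pricier model for every task."""
    return (cost_a - cost_b) * tasks_per_month * 12


PRO_COST = 0.183     # $/task, from the eval runs
FLASH_COST = 0.0236  # $/task, from the eval runs

for volume in (10_000, 100_000):
    extra = extra_annual_cost(PRO_COST, FLASH_COST, volume)
    print(f"{volume:>7,} tasks/month: about ${extra:,.0f} extra per year")
```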

Points-per-dollar

Dividing the eval score by the cost per task gives points per dollar, a ratio between quality and cost. It is a useful ranking, so long as the overall quality of the model satisfies your needs.

Model	Score (w/ skill)	$/task	pts/$
DeepSeek V4 Flash	82.3	$0.024	3,482
Haiku 4.5	82.9	$0.097	829
DeepSeek V4 Pro	85.3	$0.183	467
GLM 5.1	90.4	$0.200	451
Sonnet 4.6	90.8	$0.296	303
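The ranking can be sketched in a few lines, with a quality floor applied before sorting. This is an illustration, not our harness code: the `min_score` value of 85 is an arbitrary example floor, and because the $/task inputs here are the rounded table values, the ratios will not exactly match the pts/$ column.

```python
def rank_by_value(models: dict[str, tuple[float, float]], min_score: float = 0.0):
    """Return (name, points_per_dollar) pairs, best value first.

    `models` maps a name to (score, dollars_per_task). Models scoring
    below `min_score` are dropped before ranking.
    """
    ranked = [
        (name, score / cost)
        for name, (score, cost) in models.items()
        if score >= min_score
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Scores and rounded $/task from the table above.
table = {
    "DeepSeek V4 Flash": (82.3, 0.024),
    "Haiku 4.5": (82.9, 0.097),
    "DeepSeek V4 Pro": (85.3, 0.183),
    "GLM 5.1": (90.4, 0.200),
    "Sonnet 4.6": (90.8, 0.296),
}

# Example: only consider models that clear a floor of 85 points.
for name, value in rank_by_value(table, min_score=85.0):
    print(f"{name}: {value:.0f} pts/$")
```

Applying the floor first means a cheap model that misses your quality bar never reaches the top of the list on price alone.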

The number your cost model is probably missing

Cost-per-token is the number everyone tends to quote and often mistakenly use as the most important factor in making a decision. It's also the number that will quietly blow your budget if you're not watching turns per solve as well.

Flash averages around 20 turns per task, which is pretty manageable. But the single worst-case runs in our dataset hit roughly 10× that. This isn’t unusual for models in this class, but in dollar terms, that's a single task costing as much as 10 average tasks. Multiply that across thousands of concurrent agent runs and you may start to have a budget problem that didn't show up in your per-token estimate.

The reason most teams don't catch this is that agent frameworks surface token counts by default. Turn count, the variable that actually drives fat-tail cost explosions, often needs to be logged explicitly.

Instrument your agents for turns, not just tokens. Know your median and your 95th percentile. Set your timeout policies against the 95th, not the median, or you're either killing valid runs or absorbing surprise bills.
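Once turns are logged, the two numbers take only a few lines to compute. The sketch below uses a nearest-rank percentile and made-up turn counts; the data is hypothetical and only meant to show how one fat-tail run separates the 95th percentile from the median.

```python
import math
import statistics


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical turn counts logged from 20 agent runs; one run is a fat-tail outlier.
turns = [18, 20, 19, 22, 17, 21, 20, 19, 23, 18,
         20, 21, 19, 22, 20, 18, 24, 21, 35, 200]

print("median turns:", statistics.median(turns))
print("p95 turns:", percentile(turns, 95))
print("max turns:", max(turns))
```

A timeout set near the median here would kill many valid runs, while one set near the p95 would let them finish and still stop the 200-turn outlier.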

The skill is doing half the work

One thing worth being very direct about here is that Flash's 82.3 score is a skill-augmented score. Without a skill, Flash scores 64.1. The skill adds +18.2 points.

That lift is real, but very conditional on the skill being precise, well-scoped, and actually relevant to the task. A vague skill will drag you back down closer to the 64.1 baseline, whereas a sharp one gets you 82.3.
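The lift itself is just a paired difference: the same scenarios scored with and without the skill. A minimal sketch, where the per-scenario scores are invented for illustration and only the 82.3 and 64.1 aggregates come from our results:

```python
from statistics import mean


def skill_lift(paired_scores: list[tuple[float, float]]) -> float:
    """Mean score with the skill minus mean score without it.

    `paired_scores` holds one (without_skill, with_skill) pair per scenario.
    """
    without = mean(w for w, _ in paired_scores)
    with_skill = mean(s for _, s in paired_scores)
    return with_skill - without


# Hypothetical per-scenario scores for two tasks.
print(skill_lift([(60, 80), (70, 85)]))

# The aggregate Flash figures quoted above.
print(round(82.3 - 64.1, 1))
```

Keeping the scores paired by scenario also lets you see which tasks the skill helps and which it leaves unchanged.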

This matters more than most model evaluations acknowledge, because the model you test in a playground usually runs without a skill or relevant context, so you are seeing only its raw capability.

Going further: find cheaper models and test them yourself

The analysis above shows the cheapest hosted options we measured. But there are two obvious next steps if you want to push it further, and both are more accessible than you might think.

Every model in this benchmark that isn't from OpenAI, Anthropic, or Google has publicly available weights. You can run DeepSeek V4 Flash, GLM 5.1, and the rest yourself. When you do, the marginal token cost drops to near zero. You're paying for compute (GPU rental or owned infra), not per-call pricing.

The maths of self-hosting only makes sense above a certain volume threshold, since the ops overhead and GPU costs aren't free. But if you're running tens of thousands of agentic tasks per month, the crossover point is lower than you'd expect.

The skill in this benchmark is doing +18.2 points of work. The question worth asking is: where did that skill come from, and how do you know it's any good?

The Tessl registry is a good place to start and look at the quality, impact and security posture of your skill. Before you write a skill from scratch, check whether one already exists and has eval data behind it.

Evaluate your skills properly. You can run two types of evaluation: reviews (automated quality assessment of whether your skill is well-structured) and task evals (end-to-end runs that measure whether the skill actually improves agent performance on real tasks). The task eval output is exactly the kind of "with skill / without skill" delta that the Flash benchmark is built on.

Use skill quality as a model selection input. The 18-point lift Flash gets from a well-scoped skill isn't a fixed number; it depends on the skill and the tasks. A skill that has been evaluated by Tessl with a high task eval score gives you confidence that the lift is real and reproducible. A skill that's never been evaluated is a variable you can't account for in your cost modelling.

Your own workload, not someone else's benchmark. The task eval system lets you define scenarios from your actual codebase and run them. That's the self-evaluation framework described above.

The takeaways, flat out

DeepSeek V4 Flash at $0.0236/task is the value pick. Haiku costs 4× more for 0.6 points. Pro costs 7.7× more for 3 points.
Set a quality floor before you rank by cost. pts/$ flatters cheap-and-weak models. Above 80 points, it's a real signal.
Instrument for turns, not just tokens. Your 95th percentile turn count is the budget variable nobody's logging.
The skill is doing half the work. A bad skill collapses your score back to baseline. Evaluate your skills — with task evals, not vibes.
You can run this yourself. 20-30 tasks, turn logging, a spreadsheet, and Tessl's eval system.
Self-hosting open source models is a real option. The weights are public, the ops trade-off is real. You should run your own evals with your models to see if they can be substituted in.

The tier name told you Flash was cheap; the data says it's also good. Now you have the tools to find out whether that holds for what you're building.

AI Coding Agent Accuracy: Opus 4.7 vs 4.8

Tessl — Tue, 09 Jun 2026 07:23:08 +0000

You are deciding whether to roll your default agent model from Opus 4.7 to 4.8. The release notes promise improvements, the leaderboard moves a fraction of a point, so you shrug, schedule the upgrade for a quiet Friday, and move on.

We ran both versions through the same skills evaluation, roughly 850 scenarios solved twice each, and on the headline metric they finished level. Underneath the tie, though, 4.8 reached the same answers in four fewer turns and for measurably less money, so the upgrade that looks like a non-event on the scoreboard turns out to be a real efficiency gain in the place that actually bills you: the agent loop.

AI agent evaluation measures how an agent behaves on real tasks rather than only scoring its final answer, tracking cost, turns, and reliability across paired runs. The reason to bother is that two models can post the same score while spending very different amounts of work to reach it.

Two versions, one eval harness

Both models ran the identical setup. Every scenario is solved twice, once with no help and once with the relevant skill installed, so we can isolate what the skill contributes from what the base model already knows. We score three things: instruction following (did the agent do what the skill tells it to do), task completion (did it reach the goal), and an overall blend weighted toward instruction following. We also flag integrity issues, like an agent peeking at the grading rubric instead of solving the task.

Opus 4.7 is the incumbent. In our runs it is a strong agent that leans heavily on skills to reach its ceiling, and it explores a lot of paths to get there.

Opus 4.8 is the point release. It posts the same ceiling with a skill installed, but it starts from a higher floor without one, and it gets to the answer with noticeably less wandering.

Where AI coding agent accuracy stops being the story

Here is the head-to-head on the shared scenario set, all with the relevant skill installed unless noted.

Dimension	Opus 4.7	Opus 4.8
Overall score	91.9	92.1
Baseline score, no skill	71.4	74.1
Task completion	97.1	97.4
Instruction following	88.1	88.1
Turns per task	19.2	15.0
Output tokens per task	7,820	9,763
Cost per task, API pricing	baseline	about 5% lower
Integrity flags raised	10.2%	7.9%

The overall accuracy gap is 0.2 points. If you stopped reading the row labeled "overall score," you would conclude nothing changed. Three other rows complicate that picture.

The first is the baseline. Without any skill, 4.8 scores 74.1 against 4.7's 71.4, a 2.7 point gain, and its no-skill instruction following climbed from the high 50s into the low 60s. The ceiling is shared because the skill pulls both versions up to roughly the same place. The floor is where 4.8 actually improved, and that has a practical consequence: 4.8 depends on the skill slightly less to do good work. This suggests some of the knowledge previously only present in skills has been trained into the model weights.

The second is turns. 4.8 finishes the average task in 15.0 turns versus 19.2 for 4.7, a 21% reduction. In an agent loop, a turn is a full round trip of context, reasoning, and tool use. Cutting four turns off the average task lowers latency, reduces the chances for an agent to talk itself into a wrong path, and, as we will see, lowers cost.

The third is integrity. The eval flags runs where the agent took a shortcut, like reading the grading rubric or reaching outside its workspace. Those flags dropped from 10.2% of shared runs to 7.9%. 4.8 is modestly more disciplined about how it reaches an answer. This matches Anthropic’s claims about 4.8 being more honest.

Reading the cost: turns, not tokens

Look again at two rows that seem to contradict each other. 4.8 produces more output per task, 9,763 tokens against 7,820, yet it costs about 5% less.

This is because output volume does not dominate agentic cost. The dominant term is the context replayed on every turn. Each turn re-sends the accumulated conversation and tool results, and in long agent runs that cached input swamps the fresh output the model writes. Fewer turns means fewer replays, so 4.8 can be more verbose inside each turn and still come out ahead, because it takes four fewer turns to converge.
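A toy model makes the mechanism concrete. Everything here is an assumption for illustration: the context growth per turn and both prices are placeholders, the turn counts are rounded, and the model is not meant to reproduce the roughly 5% difference we measured. It only shows why replayed context can outweigh a larger output.

```python
def run_cost(turns: int, output_tokens: int, context_growth: int,
             replay_price: float, output_price: float) -> float:
    """Toy cost of one agent run, in dollars.

    Each turn re-sends everything accumulated so far, so turn t replays
    t * context_growth tokens. Prices are dollars per million tokens.
    """
    replayed = sum(context_growth * t for t in range(1, turns + 1))
    return (replayed * replay_price + output_tokens * output_price) / 1_000_000


# Placeholder assumptions, not measured values.
GROWTH = 4_000       # tokens added to the context per turn
REPLAY_PRICE = 0.50  # $/M tokens for replayed (cached) input
OUTPUT_PRICE = 25.0  # $/M tokens for fresh output

longer = run_cost(19, 7_820, GROWTH, REPLAY_PRICE, OUTPUT_PRICE)
shorter = run_cost(15, 9_763, GROWTH, REPLAY_PRICE, OUTPUT_PRICE)
print(f"19 turns: ${longer:.2f}  15 turns: ${shorter:.2f}")
```

Because replayed tokens grow with the square of the turn count in this model, trimming turns cuts cost faster than extra output adds to it.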

Model cards only show the per-token rate that sets the price of a unit of work, while turn count sets how many units the model decides to spend. A point release that holds accuracy flat while spending 21% fewer turns is working on that second term, which is the one that scales with your usage.

The same dynamic shows up in how each version absorbs a skill. Adding the relevant skill is not free: it pulls in instructions and reference material the agent has to process, and the question is how efficiently the model turns that overhead into a result.

Effect of installing the skill	Opus 4.7	Opus 4.8
Overall score gain	+20.5	+18.0
Cost increase	+38%	+12%
Turn increase	+41%	+14%

On 4.7, switching on a skill added 41% more turns to cash in a 20 point accuracy gain. On 4.8, the same class of skill buys nearly the same gain for much less turn and cost overhead. 4.8 treats a skill more like a shortcut and less like an invitation to explore. If you run agent skills at scale, that lower skill tax compounds across every task you ship.

The one place 4.8 regressed

A fair comparison reports where the new version loses ground. Per scenario, the record is close to a wash: 4.8 scored higher on 23% of shared tasks, tied on 61%, and scored lower on 17%, using a two point threshold. The interesting part is that the losses cluster.
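A per-scenario record like this can be built by classifying each paired score change against the threshold. In this sketch the scores are hypothetical, and treating a change of exactly two points as a win or loss rather than a tie is our assumption.

```python
from collections import Counter


def classify(delta: float, threshold: float = 2.0) -> str:
    """Label a per-scenario score change as higher, lower, or tied.

    Changes smaller than `threshold` points in either direction count as ties.
    """
    if delta >= threshold:
        return "higher"
    if delta <= -threshold:
        return "lower"
    return "tied"


# Hypothetical (old_score, new_score) pairs for five shared scenarios.
pairs = [(90.0, 95.0), (88.0, 88.5), (92.0, 85.0), (75.0, 76.0), (80.0, 82.0)]
record = Counter(classify(new - old) for old, new in pairs)
print(record)
```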

4.8 regressed on web research and scraping skill families. Firecrawl tasks dropped 3.3 points on average across 72 scenarios. LangChain dropped 2.9 points across 48. Smaller families like Tavily and Apify fell further, 10.4 and 7.6 points, though on fewer tasks. Meanwhile 4.8 improved on infrastructure, auth, and code tooling: Cloudflare gained 4.5 points across 38 scenarios, Auth0 gained 4.3 across 18, and Mastra gained 10.1 across 10.

The aggregate hid this completely, because the gains and losses nearly cancel. Only a per domain breakdown surfaces it. That is the whole argument for paired skill evals over a single leaderboard number: the headline can be a tie while two coherent shifts run in opposite directions underneath it.
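The cancellation is easy to reproduce with a small grouped average. The rows below are hypothetical, chosen so that two families shift in opposite directions; the family names are placeholders rather than our measured data.

```python
from collections import defaultdict
from statistics import mean


def delta_by_family(rows):
    """Average score change per skill family.

    `rows` is an iterable of (family, old_score, new_score) tuples.
    """
    grouped = defaultdict(list)
    for family, old, new in rows:
        grouped[family].append(new - old)
    return {family: mean(deltas) for family, deltas in grouped.items()}


# Hypothetical scores, chosen so the family shifts cancel in aggregate.
rows = [
    ("scraping", 90.0, 86.0),
    ("scraping", 88.0, 84.0),
    ("auth", 85.0, 89.0),
    ("auth", 87.0, 91.0),
]

overall = mean(new - old for _, old, new in rows)
print("overall delta:", overall)
print("by family:", delta_by_family(rows))
```

The overall delta is zero while each family moves four points, which is the pattern a single leaderboard number cannot show.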

When to roll forward to 4.8

Roll forward to 4.8 if your agents run long, multi turn tasks where turn count, latency, and cost matter, which is most production agent work. You get the same accuracy ceiling, a higher floor before skills, a 21% turn reduction, a cheaper skill tax, and fewer integrity flags. If your workloads lean on infrastructure, auth, or general code tooling, 4.8 is flat to clearly better.

Test before you roll forward if your agents live in the scrape, crawl, and summarize world. The web research regression is small in absolute terms but consistent across the families we measured. Run your own A/B on your top scraping workflows first.

The takeaway: measure behavior, not the changelog

A skeptic has two reasonable objections. The first: a flat score is just no improvement, so why care? Because two models can tie on accuracy while one needs 21% fewer turns and about 5% less budget to get there. The second: these are our eval harness costs, not yours. That is true, but the relative differences in turns, tokens, and cost reflect model behavior, which does generalize.

Make sure you’re measuring each release on behavior, on your own tasks, with skills installed and stripped out, and look at the per domain breakdown before you trust the average.

Want to see how your own stack behaves across a model upgrade? Browse the Tessl Registry to find the skills your agents depend on, then run the same paired evaluations we used here to measure what actually changed.

AI Native DevCon Day 2: From Agent Demos to Operating Models

Rohan Sharma — Wed, 03 Jun 2026 06:40:09 +0000

TL;DR

Day 2 of AI Native DevCon shifted from agent capability to operating discipline. The strongest sessions focused on how teams can run AI-native delivery with clearer context pipelines, measurable agent behavior, safer execution boundaries, and better organizational ownership.

The scale showed up in the numbers too. Across the two days, DevCon brought together 650+ in-person registrations, around 2,000 online registrations, and a packed mix of sessions, workshops, hallway conversations, and practical lessons.

Day 2 leaned into workshops. That shift mattered because the second day was less about proving agents can do useful work and more about showing how teams can make that work repeatable.

Hey there, welcome back. Rohan Sharma here again, continuing the DevCon series.

Day 1 gave us the framing, including Guy Podjarny’s core point that skills should be treated like real software assets. Day 2 picked up from there and moved into the operating details. Once agents are inside daily engineering work, platform and product teams need to decide what changes first, who owns those changes, and how the results are measured.

Talks that shaped Day 2

Harness engineering beyond code

Marc Sloan from Tessl focused on the next gap many teams are hitting. Code context is increasingly structured, but product and design context still lives in external systems such as Figma, Notion, and Linear. Pulling that context live can reduce staleness, but it introduces drift in evals, versioning, and reproducibility.

The practical lesson was to stop treating external product and design context as random reference material. Teams need a defined layer between the repository and those external systems, with clear versioning so evaluations can be replayed against known context snapshots.
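One simple way to make a context snapshot replayable is to give it a content-derived identifier and store that with each eval run. This is a sketch of the idea, not something shown in the talk: the function name, the dict layout, and the example values are all hypothetical.

```python
import hashlib
import json


def snapshot_id(context: dict) -> str:
    """Stable identifier for a context snapshot.

    Serializes with sorted keys so the same content always hashes the same.
    """
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical exported context; the keys and values are placeholders.
context = {
    "design": {"source": "figma", "exported_at": "2026-06-01"},
    "tickets": ["PROJ-1", "PROJ-2"],
}
eval_record = {"context_snapshot": snapshot_id(context), "scenario": "checkout-flow"}
print(eval_record)
```

If the identifier changes between two eval runs, you know the external context moved, and a score change may not be the agent's doing.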

Without that, agents can produce work that looks technically correct while missing the product constraint that actually mattered. That is a very expensive kind of almost-right.

From vibes to metrics

Simon Obstbaum and Rob Willoughby from Tessl delivered a session focused on a challenge many engineering leaders are currently facing. Their distinction between output evals and trajectory evals is operationally important. A good answer is not enough if the agent used risky tools, skipped required checks, or ignored policy steps.

The useful measurement model came down to activation, trajectory, and outcome. Did the right skill trigger? Did the agent follow the right steps? Was the final result actually useful and correct?

The good part was the emphasis on partial compliance. Pass or fail is too blunt for agent workflows. If a workflow degrades halfway through, teams need to know where it happened, not just that something felt off.
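Partial compliance can be recorded as a score plus the point where the workflow first broke. The sketch below is our own illustration of that idea; the `Check` structure and the example checks are hypothetical, not the format used in the session.

```python
from dataclasses import dataclass


@dataclass
class Check:
    stage: str      # "activation", "trajectory", or "outcome"
    name: str
    passed: bool


def compliance(checks: list[Check]) -> dict:
    """Fraction of checks passed, plus the first check that failed (if any)."""
    passed = sum(c.passed for c in checks)
    first_failure = next((c for c in checks if not c.passed), None)
    return {
        "score": passed / len(checks),
        "first_failure": first_failure.name if first_failure else None,
    }


# Hypothetical checks for one session, in the order the steps should occur.
session = [
    Check("activation", "skill triggered", True),
    Check("trajectory", "ran tests before commit", True),
    Check("trajectory", "requested review", False),
    Check("outcome", "change merged cleanly", True),
]
print(compliance(session))
```

A single pass/fail would mark this session as failed, while the partial score and first failure show that it broke at one trajectory step.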

Benchmarking beyond the model

Amit Kushwaha highlighted why many current benchmarks miss real agent behavior. Agent systems run long traces with tool calls, context accumulation, and latency bottlenecks that one-shot benchmark numbers do not capture.

For teams choosing infrastructure, the warning was clear. Do not optimize only for model speed. Real agent workloads involve tools, memory, caches, retries, and long-running traces.

The better benchmark is closer to production reality, with multi-turn tasks, tool latency, tail latency, and cache behavior over time. Otherwise teams risk picking systems that look great in a chart and struggle in the actual workflow.

Safe execution boundaries for agents

Oleg Šelajev from Docker covered a problem every platform team eventually sees. An unconstrained agent can make high-impact changes in the wrong environment. Sandboxing is not optional once agents are allowed to execute.

The practical takeaway was to treat environment policy as part of the harness. Filesystem access, network access, secrets, and permissions all need clear boundaries before agents are given the ability to act.
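As a rough illustration of policy living in the harness, the boundaries can be written down as data and checked before any action runs. This sketch is hypothetical and deliberately small: the class, its fields, and the default paths are our assumptions, it only normalizes paths lexically (it does not resolve symlinks), and it is no substitute for real sandboxing.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentPolicy:
    writable_roots: tuple[str, ...] = ("/workspace",)
    allowed_hosts: frozenset[str] = frozenset()

    def can_write(self, path: str) -> bool:
        """Allow writes only inside a writable root, after collapsing '..' segments."""
        normalized = posixpath.normpath(path)
        return any(
            normalized == root or normalized.startswith(root + "/")
            for root in self.writable_roots
        )

    def can_connect(self, host: str) -> bool:
        """Allow network access only to explicitly listed hosts."""
        return host in self.allowed_hosts


# Hypothetical policy for one agent run.
policy = EnvironmentPolicy(allowed_hosts=frozenset({"registry.example.com"}))
print(policy.can_write("/workspace/src/app.py"))
print(policy.can_write("/workspace/../etc/passwd"))
print(policy.can_connect("example.org"))
```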

This is how teams lower blast radius. Not by hoping the agent behaves nicely, but by designing the room it is allowed to move around in.

Do not write prompts, write software

Baruch Sadogursky and Macey Baker from Tessl reinforced an idea that keeps proving useful in production. Break behavior into modular skills instead of maintaining one giant prompt. This makes agent behavior easier to test, review, and reuse.

The message was not “write a better mega prompt.” It was to turn repeatable behavior into composable skills that match real workflow stages. That gives teams something they can review, test, improve, and share across repos.

If you try one thing from this workshop, use the materials and skill templates as a starting point. Prototype one small skill pipeline in your own environment before trying to scale the pattern across every repo.

What kept coming up across the day

1. Context quality is now a platform responsibility

Marc Sloan, Shaun Smith, and John Groetzinger approached this from different angles, but the operational message was consistent. Context delivery is becoming an engineering system, not documentation hygiene. Teams need predictable context pipelines for both humans and agents.

The next step is ownership. Teams need to know who maintains context sources, how often they refresh, and how changes are versioned. Context also needs observability so teams can trace which inputs shaped an agent decision.

2. Agent performance needs production-grade telemetry

The sessions from Simon Obstbaum and Rob Willoughby from Tessl, plus Amit Kushwaha from NVIDIA and Justin Cormack, former CTO at Docker, made this very concrete. Teams need to measure how agents worked, not only what they returned.

Trajectory metrics belong next to existing quality signals. If your dashboards already show test health, release health, or incident trends, agent workflow quality should sit in the same operational view.

The benchmark scenarios should also look like real work. Multi-turn, tool-heavy, slightly messy, and full of the same constraints your teams face every day. Justin’s observability point connected neatly here too. Teams need runtime signals that can reveal agent-induced drift before it becomes a bigger production problem.

3. Adoption is an organizational design problem, not a tooling checkbox

Talks from Tammuz Dubnov and Birgitta Böckeler from Thoughtworks showed that adoption succeeds when review structures, ownership boundaries, and team rituals evolve with the tooling.

That means setting explicit contribution boundaries for AI-assisted changes and updating review criteria. The diff still matters, but so does the path the agent took to produce it. Birgitta’s adoption data made this especially grounded by showing where hidden costs appear, including review load, technical debt, and maintainability when speed becomes the only metric.

4. Workshops made the ideas practical

Baruch Sadogursky and Macey Baker from Tessl, along with Alfonso Graziano from Nearform, helped turn the bigger Day 2 ideas into something teams could actually try. The workshop-heavy format made the day feel less like theory and more like practice.

Derek Ashmore’s packed workshop, “The AI Agent Testing Pyramid,” focused on the different levels of testing agent systems need. For those following from home, you can attempt it on your own by following this repo.

Aashrey Tiku from Anthropic worked through a hands-on session on shipping a managed agent. It was a useful bridge between agent concepts and the practical work of packaging, managing, and operating an agent with the right boundaries.

That mattered because AI-native development is still new enough that people need patterns they can test, not just concepts they can nod along to. Alfonso’s spec-driven angle fit well here because prompts become far more useful when they are turned into testable, production-ready specifications.

5. Agent enablement needs real ownership

Ian Thomas from Meta and Katie Roberts from Nearform made the enablement side feel practical. Rollouts work better when platform safeguards are paired with updated team rituals, clear ownership, and realistic guidance for brownfield systems.

Katie’s legacy advice was especially useful. AI should help teams modernize incrementally, not generate another fragile layer on top of systems that are already hard to maintain.

If you missed Day 1, start here

Day 2 was workshop-heavy. If you missed the Day 1 virtual stream, start with these talks before digging into the workshop themes.

Guy Podjarny, Tessl - Skills are the new Code
Dana Lawson, Netlify - Built for Humans. Now Agents Are Here.
James Moss, Tessl - Using skills to pay the bills
Liran Tal, Snyk - Your AI Agent Installed Malware Because a SKILL.md Told It To
Ryan Lopopolo, OpenAI - Harness Engineering
Patrick Debois, Tessl - The Rise of Agent Enablement
Shachar Azriel, Baz - Executable Specs
May Walter, Hud - Runtime Intelligence for Continuous Agentic Performance Optimization
Dave Farley - Vibe Coding: Is this really the best we can do?

That set gives the right foundation for Day 2 across skills, context, verification, security, harnesses, runtime feedback, and team enablement.

AI Native DevCon is not over yet!

We are already working on the next AI DevCon, and yes, we are very excited to say that AI DevCon NYC is officially on the way.

If Day 1 gave the frame and Day 2 showed the operating model, NYC is where the conversation gets even more practical. Expect more on skills, harnesses, agent safety, context systems, benchmarking, product workflows, and what it really takes to make AI-native delivery work inside teams.

Super-early-bird seats are available now. If you want to be in the room for the next round of conversations, this is the time to grab a spot.

In the meantime, register for the AI DevCon newsletter. We will release the content shared over the conference, including selected highlights, session clips, notes, slide decks, and workshop materials as they are published.