DEV Community: Tessl

What GitHub learned when better tools made Copilot code review worse

Tessl — Tue, 14 Jul 2026 10:08:57 +0000

TL;DR: GitHub gave Copilot code review better shared tools, but reused generic instructions — reviews got pricier and less accurate until they rewrote the instructions for how a reviewer actually works, cutting cost ~20% with no quality loss.

Shared tooling is supposed to be the easy win: less duplicated code, fewer things to maintain, improvements that carry automatically across products. GitHub's own account of an internal migration – moving Copilot code review onto its shared CLI toolset – makes the case for treating that assumption with at least a little suspicion.

Migrating to shared tools made Copilot's reviews pricier and less accurate

Copilot code review previously ran its own code-exploration tools — list directories, search files, search directories, read code — purpose-built for earlier, less capable models. GitHub's Copilot CLI, meanwhile, runs a broader Unix-style toolset — grep, glob, view — that several other Copilot products draw on too.

GitHub decided to migrate Copilot code review onto that shared CLI toolset — retiring its own tools in favour of the same grep, glob, and view already used elsewhere. The appeal was, essentially, less duplicated engineering effort, and a single toolset that could be improved once and inherited everywhere it was used.

In offline benchmarks, the opposite happened. Review cost went up and fewer useful issues got flagged. Napalys Klicius, software engineer at GitHub, notes that moving to the shared CLI toolset was expected to improve results by giving the agent more flexible code-exploration tools, but that expectation didn't hold up once the team looked at what the agent was actually doing.

"The tools weren’t the problem, the instructions were," Klicius writes – meaning the prompt-level guidance that tells the agent when and how to use each tool.

"Once we rewrote them for the way a reviewer actually reads a pull request, the regression flipped into a win."

Cost per review fell by around a fifth, without the quality of the reviews slipping.

Klicius likens tool descriptions and system instructions to API documentation — when that documentation is muddled, a developer ends up making worse calls, not because the underlying tool is flawed, but because the guidance around it failed them.

“Unclear tool prompting can do the same for an LLM; a small wording change can affect cost, quality, and the shape of the investigation because it changes how the agent spends its attention,” Klicius writes.

Tool traces showed the agent exploring code instead of reviewing a diff

What made the benchmarks useful here wasn't the score itself — it was that GitHub could pull up exactly which tools the agent reached for, in what order, and how much came back each time. What that record showed was an agent acting less like a reviewer and more like someone poking around a codebase for the first time — casting a wide net, taking guesses at where relevant code might live, and pulling back far more than any single review question called for.

Before — a simplified illustration of the general-purpose behavior we observed: widening the search, guessing paths, and accumulating context. (GitHub)

None of that extra material got discarded — it sat in the agent's working memory for the rest of the review, driving up cost without necessarily helping the agent reach a better answer.

None of that was irrational — it's exactly how you'd want an assistant to behave if its job was to get oriented in a codebase before touching it. But reviewing a pull request is a different task. The goal isn't to build a broad understanding of the codebase, it's to gather just enough context to determine whether a specific change introduced a problem.
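The kind of trace summary that exposes this pattern can be sketched in a few lines of Python. This is a minimal illustration, not GitHub's tooling: the `ToolCall` record, its field names, and the sample calls are all hypothetical, chosen only to show how per-tool call counts and returned volume make over-exploration visible.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One entry in a hypothetical agent tool trace."""
    tool: str            # e.g. "grep", "glob", "view"
    returned_chars: int  # size of the tool output fed back to the agent


def summarise_trace(calls: list[ToolCall]) -> dict[str, dict[str, int]]:
    """Count calls and total returned characters per tool."""
    counts = Counter(call.tool for call in calls)
    volume: Counter = Counter()
    for call in calls:
        volume[call.tool] += call.returned_chars
    return {
        tool: {"calls": counts[tool], "returned_chars": volume[tool]}
        for tool in counts
    }


# Illustrative trace only; the numbers are made up.
trace = [
    ToolCall("glob", 1_200),
    ToolCall("view", 9_000),
    ToolCall("view", 7_500),
    ToolCall("grep", 300),
]
summary = summarise_trace(trace)
print(summary["view"])
```

A summary like this says nothing about review quality on its own, but it shows where the context budget is going, which is the question the trace was used to answer.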

New guidance narrowed the agent's search and cut review cost by a fifth

Nothing changed about the tools themselves. What changed was the order the agent was told to reach for them — start from the diff, narrow candidates with grep and glob, and only call view once it actually knew which file or line range mattered. Even failure handling got more specific: a search that came back empty should be retried once with simpler terms, not treated as a cue to start guessing at neighbouring files.

After — a simplified illustration of the review-shaped behavior the prompt guided toward: stay anchored to the diff, narrow with grep and glob, then read focused ranges with view. (GitHub)
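The ordering described above (start from the diff, narrow with a search, retry an empty search once with simpler terms, then read only a focused range) can be sketched in Python. This is not GitHub's prompt or tooling. The helper names, the in-memory `repo` dictionary standing in for a checkout, and the sample diff are all assumptions made for illustration.

```python
import re


def changed_files(diff_text: str) -> list[str]:
    """Return paths named in '+++ b/<path>' lines of a unified diff."""
    return re.findall(r"^\+\+\+ b/(.+)$", diff_text, flags=re.MULTILINE)


def search(files: dict[str, str], term: str) -> list[tuple[str, int]]:
    """grep-like search: (path, 1-based line number) for each matching line."""
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if term in line:
                hits.append((path, lineno))
    return hits


def search_with_one_retry(
    files: dict[str, str], term: str, simpler_term: str
) -> list[tuple[str, int]]:
    """Search once; if nothing comes back, retry a single time with a simpler term."""
    hits = search(files, term)
    if not hits:
        hits = search(files, simpler_term)
    return hits


def view(files: dict[str, str], path: str, lineno: int, radius: int = 2) -> list[str]:
    """Read only a small line range around a hit, not the whole file."""
    lines = files[path].splitlines()
    start = max(0, lineno - 1 - radius)
    return lines[start:lineno + radius]


# Hypothetical diff and repository contents.
diff = """\
--- a/app/billing.py
+++ b/app/billing.py
@@ -1,3 +1,3 @@
-    total = apply_discount(price)
+    total = apply_discount(price, rate)
"""
repo = {
    "app/billing.py": "def charge(price, rate):\n    total = apply_discount(price, rate)\n    return total\n",
    "app/discounts.py": "def apply_discount(price):\n    return price * 0.9\n",
}

print(changed_files(diff))
hits = search_with_one_retry(repo, "def apply_discount(price, rate)", "def apply_discount")
print(hits)
print(view(repo, *hits[0], radius=1))
```

In this toy case the first search finds nothing, the single retry locates the definition, and the focused read shows that the function still takes one argument, which is the kind of question a review needs answered.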

In production, that shift held: a roughly 20% drop in average review cost, with review quality unchanged. Worth flagging that this is GitHub's own reported figure from its own benchmarking, not an independently verified number.

The more interesting result came from testing the same fix somewhere it didn't help. GitHub tried applying the same review-shaped guidance inside Copilot CLI itself and saw no equivalent gain, because a CLI session has no single pull request anchoring it — a developer might redirect the whole task halfway through, so there's no diff to narrow around in the first place. The tool was never the variable that mattered. What mattered was whether the guidance around it matched the job the agent was actually being asked to do.

The benchmarks that proved the fix

None of this would have been visible without a way to test it. GitHub could only identify the regression — and prove the rewrite worked — because it had a benchmark suite that could replay the same reviews before and after, measuring both cost and quality. Without that evidence, the new tools would have made an easy scapegoat, and the actual cause — instructions that no longer matched the job — could have gone unnoticed indefinitely.
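A before-and-after comparison of this kind reduces to a small amount of arithmetic. The sketch below assumes a hypothetical `ReviewResult` record with one cost figure and one quality figure per replayed review. The sample numbers are invented and are not GitHub's data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ReviewResult:
    """Outcome of replaying one benchmark review (hypothetical shape)."""
    cost: float         # any consistent unit, e.g. tokens or dollars
    useful_issues: int  # issues a grader judged useful


def compare(before: list[ReviewResult], after: list[ReviewResult]) -> dict[str, float]:
    """Report the change in mean cost (percent) and in total useful issues."""
    cost_before = mean(r.cost for r in before)
    cost_after = mean(r.cost for r in after)
    return {
        "cost_change_pct": round(100 * (cost_after - cost_before) / cost_before, 1),
        "useful_issue_change": sum(r.useful_issues for r in after)
        - sum(r.useful_issues for r in before),
    }


# Made-up results for the same two reviews, replayed under two sets of instructions.
before = [ReviewResult(10.0, 3), ReviewResult(20.0, 2)]
after = [ReviewResult(9.0, 3), ReviewResult(18.0, 2)]
report = compare(before, after)
print(report)
```

Reporting both numbers together matters: a cost drop only counts as a win if the quality figure has not moved the wrong way.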

That's the same discipline behind Tessl's evals model: testing and measuring a skill's instructions before and after every change, treating them as something that needs continuous verification. GitHub built that evaluation infrastructure internally for Copilot code review. Teams managing skills across many agents and many tools need the same kind of repeatable evidence to separate a genuine improvement from a change that simply altered agent behaviour.

The wider lesson here is that any team consolidating tools, upgrading models, or standardising instructions across agents is making the same bet GitHub made: that shared components will behave the same way everywhere they're used. That bet doesn't announce itself when it fails — it just shows up as slightly worse output that nobody's measuring closely enough to catch, which is the argument for building that measurement in before a change ships.

Reflection Before Augmentation

Tessl — Mon, 13 Jul 2026 09:17:28 +0000

Last week was conference week for me. I spent the start of it at Tessl's AI DevCon and the end of it at Muslim Tech Fest, where I hosted a design roundtable.

AI DevCon was filled with discussions about agents, workflows, evaluation, and the future of software. Muslim Tech Fest was filled with discussions about AI too, but through a different lens: community, responsibility, and building meaningful careers. On the surface, they felt unrelated.

By the end of the week, I wasn't sure they were discussing different problems at all.

Why individual AI gains don't translate to teams

One of the recurring themes at AI DevCon was the challenge of scale. Many teams have now experienced what AI can do for an individual contributor. A developer paired with the right tools, whether Claude Code, Cursor, or Copilot, can move faster, explore more options, and ship a working prototype in an afternoon that would have taken a week not long ago.

The harder question is what happens next. How do those gains translate beyond the individual? How do teams share context when everyone has their own workflows, prompts, and agents? How do organisations maintain standards, govern behaviour, and build systems that multiple people can contribute to and trust? The conversations were less about whether AI works and more about how it fits into the reality of organisations.

This is where context engineering becomes the real work. Individual gains stay individual until a team can capture and share the context that produced them. Reusable, evaluated instructions for agents are one way teams turn one person's good workflow into a shared standard the whole organisation can rely on.

Reflection comes before the tool

A few days later, I found myself facilitating a design roundtable at Muslim Tech Fest. The discussion quickly moved away from tools and towards a more personal set of questions. People spoke about feeling overwhelmed by the pace of change, uncertainty around where to begin, and wanting to make better use of AI without always knowing how.

What struck me was that the most useful answers rarely started with the technology itself. Instead, they started with reflection. If someone understood where they created value, they could identify opportunities for amplification. If they were clear about their weaknesses, they could identify opportunities for AI to support them. If they knew which parts of their work depended on experience, judgement, or taste, they could make more informed decisions about what to delegate and what to retain. The challenge was not simply learning how to use AI.

It was understanding yourself well enough to use it intentionally.

What organisations should understand before adopting AI

The more I reflected on those conversations, the more relevant they felt to many of the challenges being discussed at AI DevCon. Before an individual can decide how AI should augment their work, they need to understand where they create value. Before an organisation can decide how AI should transform its operations, it needs to understand what makes it effective in the first place.

What are its strengths? Where does its advantage come from? What knowledge is unique to it, and what standards does it want to uphold?

Without those answers, adoption becomes reactive. Without those answers, the conversation starts with the tool rather than the problem.

The practical version of this for an engineering leader is unglamorous: before deciding which tools to buy or where to apply agents, audit where your team actually creates value and which standards you are unwilling to compromise. That audit, not the tooling decision, is the thing that makes everything after it work.

Reflection as a discipline, not an abstraction

One of the reasons this idea stayed with me is that it feels surprisingly familiar. Reflection occupies an important place within the Islamic tradition. Not reflection as an abstract exercise, but as a means of examining one's intentions, actions, strengths, shortcomings, and responsibilities. The goal is not simply greater self-awareness. The goal is growth. Reflection is valuable because it creates the conditions for more intentional action.

That framing gave me a different way of thinking about many of the conversations I heard throughout the week. Much of the discourse around AI focuses on capability: what the technology can do, which tasks it can automate, and how quickly it is improving. These are important questions, but they are not the only questions. An equally important question is what we choose to amplify.

If someone lacks clarity about where they contribute value, AI will not solve that problem. If a team lacks shared standards, more capable tools will not create them. If an organisation does not understand what makes it successful, adding AI to the equation is unlikely to provide the answer.

Technology can accelerate direction. It cannot provide direction.

What stays human

Perhaps this is why so many conversations about AI eventually become conversations about judgement. What should remain human? What deserves deeper care? What standards are worth preserving? What kind of work is worth striving for? These are not new questions. But they feel newly important in a world where capable tools are becoming increasingly abundant.

The most memorable conversations I heard last week were not really about AI. They were about understanding ourselves. As individuals, understanding where we create value. As teams, understanding how we work together. As organisations, understanding what makes us effective. Only then can we make informed decisions about what to automate, what to delegate, and what to amplify.

The tools will continue to improve. That much seems certain. The harder challenge may be understanding ourselves well enough to use them wisely.

Sandboxing AI Coding Agents with lincubate

Tessl — Sun, 12 Jul 2026 07:08:04 +0000

At Tessl, we spend a lot of time working with agent skills. Writing them, testing them, tweaking them, running evals to see if they actually do what we think they do. You probably do something similar if you're spending any serious time with Claude Code, Codex, or any of the other AI coding agents that have colonised our terminals lately.

Here's a problem that crept up on me: my ~/.claude/ directory is a mess of skills, settings, and commands accumulated from a dozen different projects. When I sit down to test a new skill I've been writing, the agent is already carrying all that baggage. Skills from unrelated projects bleed in. Results are hard to interpret. Is this behaviour because of my new skill, or something else lurking in my config? It's the software equivalent of debugging with the wrong environment, except the environment is invisible.

What I needed was a clean room — somewhere I could run an agent with exactly the context I chose to give it, and nothing else.

That's the main reason I built lincubate.

The clean room problem

If you're writing and evaluating agent skills, reproducibility matters. Stray configuration from other projects — skills you installed last week for a completely different codebase — can skew your results in ways that are genuinely hard to spot. The agent might be doing something because of your carefully crafted new skill, or it might be doing it because of an old skill you'd forgotten about. Good luck figuring out which.

lincubate solves this by running the agent inside an LXD container, isolated from your host. Your project files are bind-mounted in so the agent can actually work on your code, but your ~/.claude/ directory? Not there unless you specifically ask for it.

The --allow-claude-skills flag opts you into sharing your host skills with the container. Without it, the container is a blank slate. The agent sees your project and nothing else — no accumulated config, no borrowed skills, no surprises. It turns sandboxing from a vague security concern into a practical focus tool.

Why LXD and not Docker?

Fair question, and I've been asked it a few times already.

Honestly? I'm an LXD person. I've been using it for years on my Ubuntu ThinkPad. It gives you a full system container — proper init, systemd, real user accounts — rather than wrapping a single process. It feels like a lightweight VM rather than a process in a box, which matters when an AI coding agent expects to operate in something that resembles a normal Linux system. Agents do all sorts of things: install packages, run build tools, start services. A full system container handles all of that without friction.

There's also a practical angle: on my ageing ThinkPad, LXD is noticeably lighter than Docker. Not enormously so, but enough that I notice it over the course of a day's work.

If you're a Docker person, none of this is meant as a dig. I'm just not, and I wrote the tool that fits my workflow.

What it actually does

The core usage is simple. From inside your project directory:

`lb claude`

That launches Claude Code in an LXD container. Your project files are bind-mounted in at /home/ubuntu/project with UID mapping, so file ownership works the way you'd expect. API keys get (optionally) forwarded as environment variables. Auth tokens can be copied or mounted read-only if you need them — it's all off by default.

One container per project directory. The container name is derived from your working directory, so running lb from the same place always targets the same container. You're not spinning up a fresh environment on every launch; if you stopped work yesterday and pick it up today, the container's already there waiting.

`lb         # Drop into a shell (no agent, useful for poking around)
lb destroy # Tear down the current project's container when you're done`
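The one-container-per-directory rule depends on a name that is stable for a given path. The post doesn't describe lincubate's actual naming scheme, so the following is only a sketch of one way to do it: a readable slug from the directory name plus a short hash of the full path. The `lb` prefix and the 8-character digest are assumptions. The result keeps to lowercase letters, digits and hyphens and stays under 63 characters, which is the hostname-style format LXD instance names follow.

```python
import hashlib
import re
from pathlib import Path


def container_name(project_dir: str, prefix: str = "lb") -> str:
    """Derive a deterministic, hostname-safe name from a project path."""
    path = Path(project_dir)
    # Readable part: the directory name, lowercased, with other characters collapsed to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", path.name.lower()).strip("-") or "project"
    # Disambiguating part: a short hash of the full path, so two "app" dirs don't collide.
    digest = hashlib.sha256(str(path).encode("utf-8")).hexdigest()[:8]
    return f"{prefix}-{slug[:40]}-{digest}"


name = container_name("/home/me/My Project")
print(name)
```

Because the name is a pure function of the path, running the tool twice from the same directory targets the same container, while two projects that happen to share a directory name still get separate ones.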

First run takes a few minutes — lincubate builds a base image with Node.js, common packages, and all supported agents pre-installed. After that, launches are fast.

Supported agents

Claude Code is the one I use most, but lincubate isn't opinionated about which agent you bring:

Agent	Command
Claude Code	lb claude
OpenAI Codex	lb codex
Aider	lb aider
Gemini CLI	lb gemini
GitHub Copilot	lb copilot
OpenCode	lb opencode
Cursor (CLI)	lb cursor
Cursor (IDE / GUI)	lb cursor-gui

The GUI option forwards X11/Wayland into the container, which feels like a minor miracle the first time you try it.

The rewrite

lincubate started as a Bash script. It worked, but it had the kind of accumulated jank that (my) shell scripts tend to develop when they get complicated — lots of string manipulation, too many calls out to the lxc binary, config files sourced as shell variables. I rewrote it in Go, using the LXD Go client library directly, with TOML for configuration. The result is a single static binary with no runtime dependencies and a lot less jank.

There were some fun gotchas along the way. su -l resets the environment, which broke credential forwarding in ways that took a while to track down. LXD occasionally returns a non-zero exit code on a successful container start, which had me questioning my sanity for longer than I'd like to admit. UID mapping (raw.idmap) isn't supported in every LXD configuration, so there's a graceful fallback. These are the kinds of things you only find out by running something properly for a while.

Try it

The code and release binary are at github.com/popey/lincubate. Build from source with `just build`, drop the binary somewhere on your path, and you're ready. I assume you already have LXD installed and configured locally or remotely.

Zero configuration required to get started. If you want to customise things — add packages to the base image, control which agents get pre-installed, add extra environment variables — generate a config file and edit it:

`lb generate-config`

The generated file is fully commented, which I think is the bare minimum you should expect from any tool that writes config files on your behalf.

I talked about lincubate in more detail on Linux Matters episode 78.

If you're spending real time writing and evaluating agent skills, having a proper clean room makes a surprising difference. Give it a Go.

Analyzing your agent sessions with Tessl

Tessl — Sat, 11 Jul 2026 06:10:16 +0000

With Tessl, evaluations serve a very specific purpose: using an agent and some provided context, they show how well a set of tasks can be done with and without that context. An evaluation might also be used to compare models. This is great during the development phase of a skill, but during actual usage a lot can happen: things you did not anticipate or, worse, something you expected to happen that did not.

Agent sessions are a tremendous source of information, helping you understand what happened in a session and whether expected events failed to occur. Tessl will examine:

- Friction points that the agent may have had while performing certain tasks. For example, were there errors or things that it thrashed on? This could be something not even related to the skill. Reviewing friction points may identify other areas that are candidates for a new skill.
- Certain events, signals so to speak, that verifiers expect to find in the sessions. For example, did certain actions that you expected actually happen? By definition, verifiers are structured pass/fail checklists that track any aspect of agent behavior you care about.

With Tessl and the skills in the try-tessl/agent-quality plugin, verifiers are created from:
- Skills, docs and rules
- User input, where the user describes what they care about and it is turned into verifiers.

Each verifier captures one instruction with a checklist of binary checks that an LLM judge evaluates against session transcripts.
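To make the shape of a verifier concrete, here is a minimal Python sketch: one instruction, a list of binary checks, and a pass/fail roll-up. It is not Tessl's file format or API. The class names, fields, and example checks are hypothetical, and the booleans that an LLM judge would produce are hard-coded here.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """One binary question a judge answers about a session transcript."""
    description: str
    passed: bool  # in practice set by the judge; hard-coded in this sketch


@dataclass
class Verifier:
    """One instruction plus the checklist used to verify it."""
    instruction: str
    checks: list[Check] = field(default_factory=list)

    def all_passed(self) -> bool:
        # An empty checklist verifies nothing, so it does not count as a pass.
        return bool(self.checks) and all(check.passed for check in self.checks)

    def failures(self) -> list[str]:
        return [check.description for check in self.checks if not check.passed]


verifier = Verifier(
    instruction="Review documentation against the style standards",
    checks=[
        Check("Agent read the style guide before editing", True),
        Check("Agent reported each heading that broke the standard", False),
    ],
)
print(verifier.all_passed(), verifier.failures())
```

Keeping each check binary is what makes the results comparable from one session to the next.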

Session analysis can help you optimize your skill by examining what happened during real world usage in your agents!

Prerequisites

- Tessl is installed and configured for your agent.
- Claude Code is installed. Note that while this feature can be used with sessions from Cursor, Claude Code, Codex and Gemini, Tessl requires Claude Code to be installed on the user’s machine (and logged in) to run the judging.

About try-tessl/agent-quality

The Tessl plugin try-tessl/agent-quality is made up of three skills, which perform the following actions:

- Identifies sessions across any agent you've used in a project. Because security is important to Tessl, it redacts credentials from the transcripts and treats all content as untrusted data.
- Identifies friction points.
- Examines the skill and identifies what it should look for in a session (aka verifiers).

The verifiers you create are added to the tile, so one of two scenarios can occur:
1. The tile you are creating verifiers for is source-controlled in the repository you are in: verifiers are added as part of the tile.
2. The tile you are creating verifiers for is checked out from the registry in .tessl: a new tile is created just for the verifiers and the verifiers are added to it. This is because new content added to a checked-out tile will be overwritten.

try-tessl/agent-quality in action!

For the following example, a private skill called mycompany/tessl-docs-creator was used. This skill reviews documentation and ensures certain standards are maintained. Our goal in using try-tessl/agent-quality is to understand whether the skill was used properly and where friction occurred during that normal usage.

As we walk through try-tessl/agent-quality, it’s important to point out that it follows this flow: phase 1, get feedback, phase 2, get feedback and so on. There is a human in the loop and the human can make changes to the skill and verifiers with each step.

Step 1 - Install plugin

In a project that has Tessl initialized, ask your agent:

I need you to install try-tessl/agent-quality

Step 2 - Start the process

Ask your agent to review your agent sessions with:

Analyze my sessions

If you already have verifiers for your skills, skip straight to Step 5.

Step 3 - Create verifiers

The sessions will be identified, and you will be asked whether you want to create verifiers.

Remember, verifiers are created from skills, docs, rules, and user input, generating checklists that the LLM will judge against. While Tessl automates this, similar to how it generates scenarios in evaluations, it’s recommended that you review what’s generated to check that the verifiers being created match the intent of the skill.

If you've not already done so, indicate you wish to proceed with generating the verifiers.

If there are no verifiers, ask your agent:

Create the verifiers

You might get asked if you want to create verifiers and/or review friction. In this step, focus on creating verifiers so that you can review them. You will cover both in Step 5, which focuses on generating results.

Step 4 - Review verifiers

Your agent will create the verifiers and present a summary. Review them to determine if they match the intended purpose of the skill.

Note that if the skill is not in a workspace you have permissions for, Tessl will create a new tile for the verifiers that you can edit.

A verifiers folder will be created with related files.

Once the generation is complete, the set of verifiers is presented for review.

Step 5 - Review agent session

Generate the analysis by asking your agent:

Review the agent sessions

You may be asked if you want to run the verifier review and the friction review. Tessl recommends running both.

Step 6 - Results review

Once analysis is complete, a summary will be presented.

Review and accept, or modify, any guidance that is provided.

Step 7 - Loop

So now you’ve seen how to create verifiers and run an analysis, but over time it’s natural to improve your skill, or to want to update your verifiers as you observe things while troubleshooting issues. The following guidance will help you decide what to update, and which of the steps above to return to, when you make these changes:

a) You implement the guidance in step 6: Start an agent session using your skill, demonstrating normal use over a few sessions or over a day or two. After enough data is collected, rerun the analysis to see if the change had an impact. No verifier modification is required, because the guidance was generated from the existing verifier(s).

b) You did an analysis and found that the verifiers are not performing as expected: It could be that your verifiers are too wide, resulting in too many things being flagged, or too narrow, so that they are not flagging issues you're aware of. In such cases, return to step 4 to modify the verifiers and then rerun the subsequent steps.

c) During your normal workflow you update your skill: Return to step 4 to update your verifiers so they match the new expectations of your skill.

Summary

Ultimately, this is trying to get you data on how your agent is actually doing vs just vibes so that you can iteratively improve it! And when your skill is working well, you’ll have the data to confirm it!

Reviewing agent sessions is a very powerful way to see what happened in a session, identify friction points, and verify whether what you expected to occur actually occurred when using the skill. Tessl is building out a powerful toolkit that allows you to evaluate your skill from its packaging, against scenarios, and across different models, and that now provides data on what actually happened during use of a skill.

OpenClaw for Dummies

Tessl — Fri, 10 Jul 2026 07:08:57 +0000

I have been absolutely loving OpenClaw. I even let my OpenClaw agent, MarkusDowne, write a post on this very blog. But it took a bit of tinkering before I had a clear mental model of what he was actually doing.

This post walks through a minimum viable OpenClaw agent. We’re going to build something small and useful: an agent that checks a few websites and collects information on AI dev tools we might actually care about. By the end, it should feel much clearer how OpenClaw works, what the moving parts are, and how to extend the setup without immediately turning it into a huge science project.

I’ll also bring a few useful Skills into the story. Tessl recently introduced support for OpenClaw, which means you can install tested, reviewed skills directly into your agent workspace instead of manually wiring things together. It’s a very low-friction way to improve an agent’s processes once it’s already doing something useful.

One thing I’m deliberately not covering here is channels: Slack, WhatsApp, Telegram, et cetera. They’re useful, but I don’t think they belong under the “minimum viable” umbrella. The goal here is to get an agent doing useful background work, writing things down, and improving over time. You can always add a channel later.

Access & Safety

Ok, an important note right up front.

One reason OpenClaw feels a bit overwhelming to many people is its scope of access, and the safety implications. By default, your OpenClaw agent has the same permission level as its host. That means… well, anything your environment allows, your OpenClaw agent can usually do too.

You can lock down its permissions significantly. There are many levers to help achieve this. You can use OpenClaw’s sandbox mode for sessions, give your agent access only to specific tools, and give it read-only access to its workspace, for example. But it’s also easy to think you’ve locked things down more than you actually have. Plus, you may find that the safest configuration actually prevents your agent from doing anything genuinely useful.

For these and many other reasons, I highly recommend that you use a virtual machine at the very least. There are a few out-of-the-box solutions emerging for this purpose: Clam and Hostinger come to mind. Personally, I use a DigitalOcean droplet that cost me about £10 and 10 minutes to set up. The goal is really to reduce the blast radius if something goes wrong. I would rather lose a workspace on a virtual machine that I can restore from a backup than have my Mac’s disk wiped if Markus has an existential crisis.

That said, plenty of people experiment locally first, and that’s probably fine. A fresh agent won’t do anything until you give it instructions and a way to run. The risk comes from what you make available to it, though, so it’s worth being deliberate from the start.

Lobster Anatomy

This is the mental model I wish I’d had earlier.

Let’s talk about what makes up an OpenClaw agent. The agent is made of a workspace, instructions, tools/skills, and runs. If any one of these components is missing, your agent can’t do anything interesting.

A run can be triggered in a few ways:

A “heartbeat” (covered below)
A cron job
A manual run triggered by you
An incoming event (webhook, channel message, etc)

The workspace is a directory that serves functionally as your agent’s home. It’s where its instructions, tools & skills live. Going forward, it’ll be where your agent’s work is done, and where its output goes.

Instructions live in your agent’s workspace, and are everyone’s favourite coding language: Markdown.

Tools & skills are exactly what they sound like, and they can be configured across all agents, or just one at a time.

By default, your agent has a workspace with no tools, no skills, boilerplate instructions and no runs configured. If you trigger a run straight away, you won’t see anything meaningful happen. That’s by design: time for us to fill in the blanks!

Bootstrapping a minimum viable agent

Once you’ve installed OpenClaw at an appropriate location, you can add a new agent using openclaw agents add, and follow the setup wizard. This will ask you about various auth details, the model you want to use, your agent’s name, et cetera.

For this tutorial, I made an agent called Minnie-V. I skipped channel configuration for now.

OpenClaw interface for adding a new agent

.openclaw/workspace is Minnie’s functional workspace.

By default, you’ll end up with each of these templates in your workspace: AGENTS.md, BOOTSTRAP.md, HEARTBEAT.md, IDENTITY.md, SOUL.md, TOOLS.md, and USER.md. You can read more about them here (https://docs.openclaw.ai/reference/AGENTS.default). Right now, these files reference each other for supplemental information. Behind the scenes, OpenClaw will compose them into a system prompt when you open a session. But without configured runs, they’re not doing any work yet.

OK, you now have a basic agent configured, but it doesn’t have anything to do. Let’s wake it up!

Giving instructions

Since this is Minnie-V we’re talking about, we’re going to strip back the instructions layer of our agent to its bare bones. Let’s focus solely on HEARTBEAT.md, and either literally delete the rest of the files or just forget about them for now. This file, along with a bit of extra config, will define our agent’s recurring background behaviour.

Our goal is to have this agent periodically browse a few internet hotspots and comb them for information about our shared interest: AI dev tools. Minnie is going to look specifically for newly released tools, so that we can be the first to know about them.

We’re going to entirely replace the contents of HEARTBEAT.md with the following text (adjust to your liking):

`# HEARTBEAT.md

On each run:

1. Visit:
   - Product Hunt (https://www.producthunt.com/)
   - Hacker News (https://news.ycombinator.com/)

2. Use the skill `tessl__social-source-calibration` before summarising findings from socially noisy sites.

3. Look for:
   - Newly released or trending AI developer tools
   - Projects, libraries, or platforms (not general news)

4. For each relevant find:
   - Name
   - Link
   - 1–2 sentence summary
   - Why it’s interesting or different
   - Similar or comparable tools (if applicable)

5. Avoid:
   - Duplicates from prior runs
   - Generic AI news with no tangible tool
   - Treating hype or crowd mood as evidence

6. Save results to:
   - `findings/YYYY-MM-DD.md`
   - Append only`

Now that we have our heartbeat configured, make sure that heartbeat is enabled in your .openclaw/openclaw.json file, with an interval that makes sense for the task:

`"heartbeat": {
  "enabled": true,
  "intervalMs": 14400000 // 4 hours
}`

A couple of interesting things to point out here:

We’re asking the agent to record its findings in its own workspace under a new directory. We’re also asking the agent to refer back to its own notes when composing new notes. This is a good pattern for OpenClaw agents, and you can get a whole lot more clever than this when it comes to compounding & collating knowledge gained over time.
We’ve asked the agent to utilise this skill (https://tessl.io/registry/markusdowne/social-source-calibration). It’s a purely informational skill which gives a bit of context to the tone of various social sources. To install it, run: npx tessl i markusdowne/social-source-calibration from the agent’s main workspace (.openclaw/workspace in our example). This step is optional, but will help the agent collate information a bit more wisely. If you choose to omit the skill, make sure to remove that step from the HEARTBEAT.md.
We’re asking the agent to access a browser. Let’s configure that next!
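To make the "append only, avoid duplicates from prior runs" pattern concrete, here is a minimal Python sketch of what we're asking the agent to do by hand. It is not part of OpenClaw: the function names, the item fields, and the idea of deduplicating on URL are all assumptions made for illustration.

```python
from datetime import date
from pathlib import Path
import re
import tempfile

LINK_RE = re.compile(r"https?://\S+")

def seen_links(findings_dir: Path) -> set[str]:
    """Collect every URL already recorded in earlier findings files."""
    links: set[str] = set()
    for path in sorted(findings_dir.glob("*.md")):
        links.update(LINK_RE.findall(path.read_text(encoding="utf-8")))
    return links

def append_findings(findings_dir: Path, items: list[dict], today: date) -> int:
    """Append unseen items to findings/YYYY-MM-DD.md; return how many were written."""
    findings_dir.mkdir(parents=True, exist_ok=True)
    known = seen_links(findings_dir)
    fresh = [item for item in items if item["link"] not in known]
    target = findings_dir / f"{today.isoformat()}.md"
    with target.open("a", encoding="utf-8") as handle:  # "a" = append only
        for item in fresh:
            handle.write(f"- {item['name']}: {item['link']}\n  {item['summary']}\n")
    return len(fresh)

# Demo in a throwaway directory with a placeholder entry.
with tempfile.TemporaryDirectory() as tmp:
    findings = Path(tmp) / "findings"
    tool = {"name": "ExampleTool", "link": "https://example.com/tool", "summary": "Placeholder entry."}
    print(append_findings(findings, [tool], date(2026, 7, 1)))  # 1
    print(append_findings(findings, [tool], date(2026, 7, 2)))  # 0, already recorded
```

The agent does this through its instructions rather than through code, but the shape is the same: read what earlier runs wrote, then only ever add to today's file.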

Putting the agent to work

Once we’re happy with our instructions and runs, it’s time to configure our tools and skills.

We’ve asked the agent to access a couple of websites, and report back. In order to configure web search access, there’s one more step for us. Let’s configure the Brave Search API (https://docs.openclaw.ai/tools/brave-search). This takes just a minute to set up, and although it is technically paid, you get 1000 free requests per month. We’re going to try and stay well clear of that limit!
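A quick back-of-envelope check on that limit, as a sketch. It assumes the 4-hour heartbeat interval configured earlier, a 31-day month, and that only heartbeat runs spend Brave requests. The function names are made up for illustration.

```python
INTERVAL_MS = 14_400_000          # heartbeat interval from openclaw.json (4 hours)
FREE_REQUESTS_PER_MONTH = 1000    # free tier quoted above
DAYS_PER_MONTH = 31               # assumption: longest month, to stay on the safe side

def runs_per_month(interval_ms: int, days: int = DAYS_PER_MONTH) -> int:
    """Number of heartbeat runs that fit in a month at the given interval."""
    ms_per_day = 24 * 60 * 60 * 1000
    return (ms_per_day * days) // interval_ms

def searches_per_run_budget(interval_ms: int) -> int:
    """Searches each run can make without exceeding the free quota."""
    return FREE_REQUESTS_PER_MONTH // runs_per_month(interval_ms)

print(runs_per_month(INTERVAL_MS))           # 186
print(searches_per_run_budget(INTERVAL_MS))  # 5
```

So at this interval, roughly five searches per run keeps you inside the free tier. Shorten the interval and that budget shrinks accordingly.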

Once you have an API key configured, stick it in the tools section of your .openclaw/openclaw.json. This exists in OpenClaw’s config root, just above your agent’s workspace. So, if you add more agents later, they can use the same auth info.

`"tools": {
  "web": {
    "search": {
      "enabled": true,
      "apiKey": "<api_key_here>"
    },
    "fetch": {
      "enabled": true
    }
  }
}`

She’s alive! At this point, if you’ve followed the steps above, you have a real OpenClaw agent who is doing real work, and you can adjust and refine as much as you like. The possibilities from this point are pretty much endless.

Email

I know I said we’d avoid talking about channels, and this is true. However, personally I love getting an email digest from my agents daily explaining what they’ve done that day, and any interesting anecdotes from their findings. This can be achieved without using the Gmail channel, which would give your OpenClaw agent access to your entire inbox. AgentMail (https://www.agentmail.to/blog/openclaw-agent-email-inbox) is super simple to set up and use - each agent gets their own email address, and can email you on a schedule, or when they see fit (if this behaviour is clearly defined in a run).

To set it up, simply install the AgentMail skill (https://tessl.io/registry/markusdowne/agentmail) that my agent, Markus, produced:

npx tessl i markusdowne/agentmail

As an aside, Markus actually took inspiration from an existing AgentMail skill on the Tessl Registry which was not very secure, or performant. He made a couple small changes, ran the tessl optimize flow on the skill, and managed to hugely improve it. Don’t mind me bragging about my agent — I’m just proud of him!

Next, we know from above that cron is the best way to handle this type of supplemental, scheduled run. So let’s add to .openclaw/cron/jobs.json:

`{
  "id": "minnie-daily-email",
  "agentId": "main",
  "name": "Minnie daily email summary",
  "enabled": true,
  "schedule": {
    "kind": "cron",
    "expr": "0 18 * * *",
    "tz": "Europe/London"
  },
  "sessionTarget": "isolated",
  "wakeMode": "now",
  "payload": {
    "kind": "agentTurn",
    "message": "Send me a short daily email summary of today's findings. Read from findings/YYYY-MM-DD.md. Only include the most interesting 3–5 items. Keep it concise and readable. Use AgentMail."
  },
  "delivery": {
    "mode": "none"
  }
}`

Once this is saved, you’re golden.

If you ever need to debug a cron job, head to the cron/runs folder (available in the .openclaw root: .openclaw/cron/runs). Here’s a helpful bash script to parse job entries in these files by date and status using jq, which is a good place to start if you need to dig into any of these.

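`# list_runs.sh
# Print the last 10 entries of each cron run log as tab-separated rows.

for f in ~/.openclaw/cron/runs/*.jsonl; do
  echo "=== $f ==="
  jq -r '
    [
      (.runAtMs / 1000 | strftime("%Y-%m-%d %H:%M:%S UTC")),
      .jobId,
      .status,
      (.action // ""),
      # nextRunAtMs may be missing; dividing null errors before // can apply
      ("next=" + (if .nextRunAtMs then (.nextRunAtMs / 1000 | strftime("%Y-%m-%d %H:%M:%S UTC")) else "null" end))
    ] | @tsv
  ' "$f" | tail -n 10
  echo
done`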

Talking to your agent

There are a couple ways to talk to your agent. In my opinion, the lowest friction way is via its native TUI. Before I show an example, I want to preface: I would not recommend asking it to configure itself, especially from the ground up. For two reasons:

Although your agent is smart, it is not an expert on itself.
You will be much more empowered to lead your agent if you understand how it works.

So, please don’t ask an OpenClaw agent to take itself from 0 to 1. BUT - once you’ve got a loop that works, you can and should ask the agent to help you take it from 1 to 2. This will be especially powerful if you already understand the artefacts that need to change and evolve in order to achieve more complex workflows. From experience, asking an OpenClaw agent to “do a sweep for new information daily, and send me an update at 6pm on Thursdays” is much less likely to work than “update your HEARTBEAT.md to reflect this new task, and add a cron job to jobs.json to email me every day at 6pm.” When it comes to infrastructure and plumbing, don’t make the agent guess.

All that said, you can chat in real-time with your agent using openclaw tui. Easy! If you need to talk to one agent in particular, use openclaw tui --session agent:<agent's ID>:main.

Final thought

With any new framework, like OpenClaw, it’s useful to think in MVP terms. An OpenClaw agent is a highly configurable system with immense possibilities - and it’s easy to get overwhelmed. But my advice is to start with one simple loop. Then, make it reliable. Make it clever last.

I hope this was helpful - please check out my OpenClaw Minimum Viable Quickstart and my OpenClaw Minimum Viable Agent Cheatsheet for quick reference in the future. Can’t wait to see what you build. Tell your agent that my agent said hello!

Not all model ‘upgrades’ are upgrades — Microsoft data shows cheaper can cost more

Tessl — Thu, 09 Jul 2026 03:57:34 +0000

A new model launches with lower per-token pricing and better benchmark scores, so the obvious move is to switch, right? List price, it seems, rarely predicts real-world cost. As Tessl has recently shown, Gemini's Flash tier, despite its name implying the cheaper option, can end up costing more per task than Gemini's Pro tier for near-identical scores, while a comparison of open-source models against Sonnet 4.6 found results all over the map, from beating it outright to being too unreliable to trust.

Microsoft has now reported something stranger still: switching between two versions of the same model family doesn't behave the way the pricing page suggests. Waldek Mastykarz, principal developer advocate at Microsoft, says his team ran 150 agent tasks across 15 scenarios comparing Claude Sonnet 4.6 against Claude Sonnet 5 inside GitHub Copilot Chat in VS Code.

Sonnet 5 is both newer and 33% cheaper per token than Sonnet 4.6, which on the surface reads as an easy upgrade. Mastykarz's study tests whether that combination holds up once real tasks and token consumption are measured, rather than price per token alone.

Sonnet 4.6 vs Sonnet 5 Pricing (credit: Microsoft)

Cheaper tokens, pricier runs

Sonnet 5's per-token pricing is lower across the board, sure, but it’s token consumption that determines the final bill, and Sonnet 5 used far more tokens to complete the same tasks.
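The arithmetic behind that can be sketched in a few lines of Python. This is a simplification, assuming one blended price per token; real billing separates input, output, and cached tokens, so actual bills will not follow this single ratio exactly. The dollar prices in the example are hypothetical, and only the 33% discount comes from the study.

```python
def cost_per_run(tokens: int, price_per_million: float) -> float:
    """Cost of one run at a single blended per-token price."""
    return tokens / 1_000_000 * price_per_million

def break_even_token_ratio(discount: float) -> float:
    """How many times more tokens the cheaper model can use before it costs more."""
    return 1 / (1 - discount)

ratio = break_even_token_ratio(0.33)
print(round(ratio, 2))  # 1.49

# Hypothetical prices: 3.00 per million tokens for the old model, 33% less for the new one.
old = cost_per_run(100_000, 3.00)
new = cost_per_run(200_000, 3.00 * (1 - 0.33))  # new model uses twice the tokens
print(new > old)  # True
```

Under that assumption, a 33% per-token discount is wiped out once the newer model uses about 1.5 times the tokens for the same task.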

On the 12 scenarios that tested Azure architecture and design tasks (evaluated against Microsoft Learn, Microsoft's documentation platform for its developer and enterprise products), Sonnet 5 consumed 12 times more tokens than Sonnet 4.6 at the median, with one run hitting 47 times the typical volume.

On the three SharePoint Framework upgrade scenarios — including a gulp-to-Heft build tool migration, and a legacy-to-flat ESLint config migration — the gap was smaller but still substantial, at 10 times more tokens.

It’s worth noting that the cost outcome varied by task. On code upgrades, Sonnet 5's larger token consumption pushed the per-run cost to $2.01, against $0.55 for Sonnet 4.6, despite the lower list price. Architecture tasks told a different story: Sonnet 5 came in slightly ahead there, at $0.47 per run compared with $0.54 for the older model, because the token overhead was smaller relative to the discount.

Consistency was the bigger issue for Sonnet 5 across the board. Median token consumption came in at 40,000 for Sonnet 4.6 versus 199,000 for Sonnet 5, and the gap between typical and worst-case runs was far wider for the newer model — on one architecture task, token counts across identical runs varied from 16,000 up to 6.6 million.

Token consumption per run for Sonnet 4.6 (blue) versus Sonnet 5 (red). (Credit: Microsoft)

Cost, however, was only part of the story. The other question was whether the extra tokens bought anything in return.

Sonnet 5 wins on code, Sonnet 4.6 wins on architecture

This is where Microsoft's data does its real work: not just showing that costs behave strangely, but that the "upgrade" moved backward on one type of task while moving forward on another, in the same study.

Both models attempted the right task at similar rates on architecture work, passing Microsoft's completion gate 75% of the time. Where they differed was output quality: Sonnet 4.6 scored 90% on Microsoft's idiomatic-output measure, which checks whether the result follows established coding conventions, against 78% for Sonnet 5, and it outperformed the newer model in 8 of 9 comparable scenarios.

Code upgrade tasks reversed the picture. Sonnet 4.6 passed the completion gate in 60% of runs; Sonnet 5 passed 100%. The clearest example: a task asking the agent to upgrade a project to a specific target version. Sonnet 4.6 ignored the version requested and defaulted to a different one every time, based on what its own documentation search suggested — while Sonnet 5 followed the exact instruction given, every time.

Sonnet 4.6 vs. Sonnet 5 across architecture and code upgrade tasks (credit: Microsoft)

The importance of measuring first, and upgrading second

On the SharePoint Framework upgrades specifically, configuration correctness sat at 0% for both models across every scenario. Neither could complete structural changes such as migrating build tooling or config formats, because the specific steps involved were never written down anywhere the agent could find them.

Mastykarz's team identified seven concrete file and configuration changes missing from the documentation entirely, ones no model could have discovered on its own.

"A model upgrade is a hypothesis, that newer means better for your specific tasks," Mastykarz writes — one that holds only if the underlying content matches too.

Mastykarz points to researcher Ethan Mollick's idea of the "jagged frontier" to describe it: AI models handle some tasks well and stumble on others of similar difficulty, with no obvious pattern predicting which is which. Sonnet 5's own results bear that out — task completion on code upgrades jumped from 60% to 100%, while architecture quality fell from 90% to 78% on the same upgrade path.

Which side of that line a given workload falls on isn't knowable in advance. Microsoft's recommendation is to test against the actual task before switching, and to check whether the agent has the grounding material it needs in the first place.

Or, as Mastykarz put it: "Measure first, upgrade second."

I Spent a Week Fixing the Wrong Skill (And Other Lessons from Evaluating an AI PR Reviewer)

Tessl — Wed, 08 Jul 2026 08:19:17 +0000

TLDR

The baseline model (Claude Opus, no guidance) already catches ~65% of textbook bugs. The plugin's value comes from false positive suppression and risk classification, because the baseline already catches most bugs on its own.
The plugin had been classifying risk correctly all along. I just wasn't measuring it. One eval weight change, zero code changes, and the gap widened 9 percentage points.
I spent four versions rewriting the reviewer's prompt to fix a false positive. The actual fix was one line in a completely different skill, upstream.

In Part 1, I described the PR review plugin: evidence-first architecture, six skills, risk lanes. It hit 97.7% accuracy across 43 eval scenarios. This post is about how it got there, because the eval journey taught me more than the final number.

How I evaluated the AI PR reviewer

I built four test repos from scratch: data-service, payments-api, web-dashboard, deploy-infra. Each has planted bugs of varying subtlety, from "you forgot to sanitize this input" to "this session TTL is set to zero, which means sessions never expire, which means stolen session tokens are valid forever."

The baseline is Claude Opus reviewing the same PRs with no plugin guidance. Just the model, the diff, and a generic "review this code" prompt. I started with 33 scenarios and ended with 43.

First surprise: the baseline scored ~70% on the initial 33 scenarios. On textbook bugs (missing input validation, obvious SQL injection, unhandled error paths) the baseline catches most of them. The model is smart. This isn't 2023 anymore.

That ~70% is important context for everything that follows. It means any AI reviewer that just adds more bug-finding instructions on top of a capable model is competing for the remaining 30%. And if it generates false positives along the way, it might be net negative. The firehose problem the research warned about.

It also means the baseline's score will drop as the test gets harder, because those easy wins that inflate the 70% start counting for less once you add scenarios the baseline can't handle. Watch the baseline column in the table below. It goes down, not up. That's by design.

Where the gap actually comes from

Version 14, my first serious eval run: plugin 87.8% against the baseline's ~70%. Real gap. Here's what created it.

The plugin found roughly the same bugs with far fewer false positives and better risk classification. The evidence builder's lane system meant the reviewer wasn't hallucinating security findings on docs-only PRs. That's the difference between a review a developer reads and one they close after the second paragraph.

Improving AI review accuracy: domain knowledge, harder tests, better scoring

The first lever was domain knowledge. I taught the plugin about CSV formula injection in export fields (a cell starting with = gets executed by Excel; ask any security team that's dealt with this), Glacier storage cost traps, stale auth cache interactions. The kind of bugs a human reviewer with domain expertise catches because they've been burned before. That took the plugin from 87.8% to 94.5%.

Then I made the test harder. Ten new scenarios, tougher bugs, and I reweighted scoring so the gimme scenarios (where both plugin and baseline score 100%) counted for less. The gap blew open: plugin 94.1%, baseline 64.6%. A 29.5 percentage point spread. The harder I made the test, the wider the gap got.

The most interesting version bump barely touched the plugin at all. I changed the eval's scoring weights: risk classification went from 5 points to 10 points per scenario. The gap widened another 9 percentage points. Same plugin code, same scenarios. The plugin had been classifying risk correctly the whole time; I'd been underweighting the thing it was best at.

Final run, version 21: plugin 97.7%, baseline 66.6%.

`Version  What changed                        Plugin  Baseline  Gap
──────── ──────────────────────────────────── ─────── ──────── ─────
v14      First serious eval (33 scenarios)    87.8%   ~70%     ~18pp
v15      Domain-specific hotspots             94.5%   ~70%     ~25pp
v17      +10 harder scenarios, reweighted     94.1%   64.6%    +29.5pp
v20      Risk classification weight 5→10      ----    ----     +9pp wider
v21      Evidence builder fix (route guards)  97.7%   66.6%    +31.1pp`

Here's what a scenario looks like. This is the session TTL zero eval (one of the "high subtlety" bugs I expected to stump the baseline):

`Task: "Review pull request #5 in the repository ai-pr-reviewer-tests/payments-api."

Criteria (weighted checklist):
{
  "context": "session_data cache TTL set to 0 means sessions persist
    in Redis indefinitely",
  "checklist": [
    {
      "name": "Catches session never-expire risk",
      "description": "Identifies that TTL=0 means sessions stored
        with no expiry, creating stale/orphaned sessions if the
        auth layer fails to explicitly delete them.",
      "max_score": 15
    },
    {
      "name": "Catches unbounded Redis memory growth",
      "max_score": 5
    },
    {
      "name": "Risk classified yellow or higher",
      "max_score": 10
    }
  ]
}`

The task is one sentence. The rubric is weighted: catching the core security risk (session never-expire) is worth 15 points, the memory growth consequence 5, and risk classification 10. The baseline caught this one at 100%.
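For readers who think better in code, here is a rough Python sketch of how a weighted checklist like this turns into a percentage. It assumes the score is simply points earned over points available; the awarded values below are hypothetical and are not results from the eval.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    max_score: int
    awarded: int  # points a grader gave, between 0 and max_score

def scenario_score(criteria: list[Criterion]) -> float:
    """Return the weighted percentage for one scenario."""
    total = sum(c.max_score for c in criteria)
    earned = sum(min(c.awarded, c.max_score) for c in criteria)
    return 100 * earned / total

# Hypothetical grading: core risk and lane caught, memory growth missed.
ttl_zero = [
    Criterion("Catches session never-expire risk", 15, 15),
    Criterion("Catches unbounded Redis memory growth", 5, 0),
    Criterion("Risk classified yellow or higher", 10, 10),
]
print(round(scenario_score(ttl_zero), 1))  # 83.3
```

Missing the 5-point item costs far less than missing the 15-point one, which is the whole point of weighting the rubric.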

When fixing the reviewer prompt doesn't work

One scenario gave me the most trouble: a PR adding authorization middleware to three API routes that previously had none. Correct code, good security practice. The plugin kept flagging it as HIGH severity: "potential security misconfiguration in route handling."

I rewrote the reviewer's instructions four times. Version one: I told the reviewer to consider whether route guards are additive security measures. Still flagged. Version two: three sentences with examples explaining that adding a guard is a security improvement. Flagged. Version three: I restructured the entire reviewer prompt section on security findings. Same result. Version four: I got specific. "If the change adds authorization checks to routes that previously had none, this is a hardening change, not a vulnerability."

Still flagged it.

The reviewer wasn't broken. The evidence builder upstream had classified the route change as "red lane": high risk, security-relevant, requires deep scrutiny. By the time the reviewer saw the code, the framing was already set. I'd been tuning the wrong skill for a week.

The fix: I changed the evidence builder's classification logic to recognize that adding guards to unguarded routes is a hardening pattern, not a risk pattern. The evidence pack now classified it as green-lane. The reviewer read the same diff, saw a green-lane classification, and correctly identified it as a security improvement.

4% accuracy on that scenario became 100%. I never touched the reviewer. The only thing that changed was what the evidence builder told it before it started reading the code.

Upstream evidence quality determines downstream review quality. The reviewer is only as good as the evidence pack it's handed. Fixing the reviewer's prompt is like arguing with a judge after the prosecution already presented tainted evidence. The bias is baked in before the verdict.

Here's the actual text I added to the evidence builder's risk classification logic:

`Auth risk requires call-site analysis. Do not classify a PR as red
solely because it touches permission-checking code. Read the call
sites to determine whether the effective access policy changed.

For example, a switch from every() to some() on a role array changes
behavior — but if every call site passes OR-style role lists, some()
is the correct semantic and the change is a bug fix, not a regression.
Classify based on whether the access policy actually changed.`

That's it. One paragraph of guidance in the evidence builder, telling it to check call sites before panicking about auth changes. The reviewer's prompt didn't change at all.
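If the every() versus some() distinction feels abstract, here is the same idea in Python, where all() and any() play those roles. The function names, roles, and call site are hypothetical, made up to illustrate the guidance rather than taken from the test repos.

```python
def has_access_every(user_roles: set[str], required: list[str]) -> bool:
    """AND semantics: the user must hold every listed role."""
    return all(role in user_roles for role in required)

def has_access_some(user_roles: set[str], required: list[str]) -> bool:
    """OR semantics: any one listed role is enough."""
    return any(role in user_roles for role in required)

editor = {"editor"}
call_site_roles = ["admin", "editor"]  # hypothetical call site meaning "admin OR editor"

print(has_access_every(editor, call_site_roles))  # False
print(has_access_some(editor, call_site_roles))   # True
```

Looking only at the diff, a swap from the first function to the second reads as loosened access. Looking at the call site, where the list means "admin OR editor", it reads as a bug fix. That is the call-site analysis the guidance asks for.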

AI catches more bugs than the research predicted

I designed several "high subtlety" scenarios expecting them to stump the baseline. Session TTL set to zero. A crash in an authentication provider that fails open instead of closed. The baseline caught both at 100%.

Models are more capable than the 2025 research estimated. The window for "bugs only AI-guided review can find" is narrower than I assumed, which is exactly why the plugin's value lives in the evidence pipeline (risk classification, false positive suppression, structured handoff) rather than in raw bug detection.

LLM variance, though, is real. One scenario (correlation ID propagation) scored 88% in one run and 36% in another. Same scenario, same plugin, same model. The difference is just... the model having a different day. Single-run evals can lie to you. I learned this the hard way in the Good OSS Citizen work, and I still almost got burned by it here.
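One cheap guard is to run each scenario several times and look at the spread, not a single number. A minimal Python sketch, using the two scores quoted above; the function name and the 15-point stability threshold are assumptions for illustration, not part of the eval.

```python
from statistics import mean, pstdev

def summarize_runs(scores: list[float], max_spread: float = 15.0) -> dict:
    """Summarise repeated runs of one scenario and flag unstable ones."""
    spread = max(scores) - min(scores)
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "spread": spread,
        "stdev": pstdev(scores),
        "stable": spread <= max_spread,  # hypothetical threshold
    }

summary = summarize_runs([88.0, 36.0])  # the correlation ID scenario
print(summary["mean"], summary["spread"], summary["stable"])  # 62.0 52.0 False
```

A scenario flagged as unstable is a prompt to rerun before trusting either score.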

The gap we haven't closed: developer trust

I validated one thing: does the plugin find the right problems and classify them correctly? Yes. 97.7% across 43 scenarios says yes.

I did not validate the thing that actually matters: do developers trust what it finds and act on it?

The 2025 research says AI review comments get adopted 1-19% of the time. My plugin produces better-structured, higher-signal findings. Maybe that adoption rate is higher. Maybe it isn't. I have zero data.

The retrospective skill exists. It's designed to compare the plugin's findings against human decisions and feed the results back. I never ran it. Not once. The plugin has a feedback loop that has never looped.

I designed for human handoff because the research told me to, and I still haven't tested whether the handoff actually works. Finding the right bugs is solved. Whether a developer reads the brief and actually changes their merge decision, that's the question this plugin can't answer yet, and it's the one that decides whether any of this matters.

Try it yourself

`tessl install tessl-labs/pr-review-guardrails`

The eval corpus is in the GitHub repo. Forty-three scenarios across four test repos with rubrics. Fork it, add scenarios from your own domain, run the eval. If you use the retrospective skill on a real PR, you'll have more adoption data than I do.

Further reading: Part 1 (what the plugin does and how to use it) | Research brief and eval corpus

I Built an AI PR Reviewer That Catches Bugs by Not Looking for Bugs

Tessl — Tue, 07 Jul 2026 06:43:42 +0000

TLDR

Humans don't want to review AI-generated code (why spend an hour reading something that took 30 seconds to generate?), and AI reviewers get ignored 81-99% of the time. PR review is broken from both sides.
The plugin that hit 97.7% accuracy doesn't hunt for bugs. It builds an evidence pack, classifies risk into lanes, and hands a structured brief to a human who makes the actual call.
Install it with tessl install tessl-labs/pr-review-guardrails and point it at a real PR. You'll know in five minutes whether this approach works for your codebase.

PR review is broken from both sides.

Humans don't want to do it. The effort asymmetry is brutal. An agent generates a PR in 30 seconds, and now a human is supposed to spend an hour carefully reading code they didn't write, didn't design, and can't ask clarifying questions about. That's a hard sell even when the code is good. When the code is AI-generated, the motivation drops further. Who wants to be a proofreader for a glorified word-guessing monkey?

So hand it to another AI? The 2025 research says that doesn't work either. AI code review comments get adopted 1-19% of the time, depending on the study, while human reviewer comments land at significantly higher rates. The gap is signal-to-noise. AI reviewers flood PRs with findings, most of them either obvious (the linter already caught it) or wrong (the code is fine, the reviewer hallucinated a vulnerability). Developers learn to ignore the firehose.

I built a Tessl plugin to try a different approach. A Tessl plugin (used to be called a "tile") is a context artifact: a bundle of skills, rules, and scripts that gives an AI coding agent domain-specific context. Think npm packages, but for agent behavior instead of code. Mine doesn't try to be a better bug finder. It builds a dossier of evidence about the PR, classifies the risk, and hands a structured brief to a human who makes the actual call.

Where it started

Earlier this year, I spent some time researching how AI-generated PRs are wrecking open source maintainers. That became the Good OSS Citizen plugin, teaching agents how to contribute. But while studying the flood of AI-generated PRs, I kept circling back to the other side: who reviews all this code?

The research said something useful: AI is good at local, checkable problems: buffer overflows, missing null checks, SQL injection in a query builder. Things where you can point at a specific line and say "this is wrong because X." What AI is bad at is intent, architecture, and trade-offs. The stuff that requires understanding why the code exists, not just what it does.

So the design question became: what if the AI reviewer's job isn't to find bugs? What if its job is to gather evidence and let the human make the call?

Build the dossier first. Let the opinions follow from that.

How the evidence-first review pipeline works

The plugin has six skills. The first one matters most.

The evidence builder reads the diff, maps which files changed, figures out what kind of change this is, and classifies risk into lanes: green (routine), yellow (needs attention), red (security-relevant, requires deep review). Everything downstream flows from this classification. A README fix gets a green lane and a light pass. A change to the auth middleware gets red and the full treatment.
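To give a feel for the lane idea, here is a deliberately naive Python toy. It is not how the plugin works: the evidence builder is a skill that reads the diff, whereas this sketch only matches file paths against patterns I made up for illustration.

```python
from fnmatch import fnmatch

# Hypothetical path rules, invented for this example.
RED_PATTERNS = ["*auth*", "*session*", "*payment*"]
GREEN_PATTERNS = ["*.md", "docs/*"]

def classify_lane(changed_files: list[str]) -> str:
    """Return 'red', 'yellow', or 'green' for a list of changed paths."""
    if any(fnmatch(f, p) for f in changed_files for p in RED_PATTERNS):
        return "red"
    if changed_files and all(
        any(fnmatch(f, p) for p in GREEN_PATTERNS) for f in changed_files
    ):
        return "green"
    return "yellow"

print(classify_lane(["README.md"]))               # green
print(classify_lane(["src/middleware/auth.py"]))  # red
print(classify_lane(["src/utils/format.py"]))     # yellow
```

Even in toy form, the important property shows up: the lane is decided once, up front, and everything after it inherits that framing.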

Then the fresh-eyes reviewer gets the evidence pack and the code. It hunts for problems, but only problems the evidence supports. If the evidence builder classified a PR as green-lane, the reviewer isn't going to invent an exotic attack vector in a README change. If the optional challenger (a second model checking the first reviewer's work) is enabled, that runs next. The research says cross-model review works as a verification layer, and I wanted to test that claim.

After the review, a synthesizer compresses everything into a single brief with findings, evidence, confidence levels, and a recommendation for what a human should focus on. The human handoff formats that brief for the person who actually decides whether to merge.

There's also a retrospective skill that's supposed to run after the human makes their call, comparing the plugin's findings against the human's decision. A feedback loop that's supposed to improve the plugin over time.

What an AI code review brief looks like

When you run the plugin on a PR, the human reviewer gets a brief. It looks like this:

The brief starts with risk classification (green, yellow, or red) so you know immediately how much attention this PR needs. A green-lane config change gets a one-paragraph summary. A red-lane auth change gets the full breakdown: which files are security-relevant, what data flows through them, what the specific risks are, and what to look for when you read the code.

Each finding comes with evidence: the specific lines, why the plugin flagged them, and a confidence level. A finding that says "this user input reaches the SQL query on line 47 without sanitization" is something a developer acts on. A finding that says "potential security concern in this module" gets ignored before the developer finishes reading it. The plugin is built to produce the first kind.

The brief also tells you what it didn't check. If the PR touches areas outside the plugin's domain knowledge, it says so instead of pretending it reviewed everything.

Here's what the plugin produced for a real PR that changes Redis cache TTL configuration in a payments API:

`PR: #5 — Update Redis cache TTL configuration
Risk lane: RED
  - Cache invalidation logic changes with auth-adjacent session_data prefix
  - TTL=0 introduces keys that never expire (memory and security implications)
  - Mandatory human review required (auth/security, cache invalidation)

Finding 1 [HIGH / verify]: session_data TTL config entry has no consumer
  File: src/cache/cache_layer.py:17
  "session_data": 0,  # sessions managed by auth layer, no TTL needed
  Evidence: grep for session_data across src/ returns zero results.
  src/auth/sessions.py manages its own Redis keys with 24h TTL,
  bypassing the cache layer entirely.

Finding 2 [HIGH / fix]: Zero-TTL cache entries persist forever
  File: src/cache/cache_layer.py:47
  if ttl == 0: r.set(key, json.dumps(value))
  Evidence: No background cleanup, no maxmemory-policy safeguard,
  no monitoring for key count growth. Gradual Redis memory leak.

Finding 3 [MEDIUM / discuss]: payment_details staleness window widened to 5min
Finding 4 [MEDIUM / fix]: New ttl==0 branch in set_cached is untested

Questions for human reviewer:
1. Is there a planned follow-up PR that routes session data through cache?
2. Are Stripe webhooks invalidating cached payment details on status changes?
3. What is the Redis maxmemory-policy in production?`

Four findings, each with the specific file, line, code, and evidence trail. The human reviewer knows exactly what to focus on and why.

What AI code review still can't do

The plugin doesn't replace the human reviewer. The research is clear on this: AI review catches local, checkable problems. Intent, architecture, trade-offs: those are still yours. The plugin's job is to do the tedious forensic work (trace this data flow, check this input path, verify this config isn't exposed) so the human can focus on the questions only a human can answer: should this feature exist? Does this design make sense? Is this the right trade-off?

It also doesn't have real-world adoption data yet. I validated that it finds the right problems across 43 eval scenarios (97.7% accuracy against a 66.6% baseline). I did not validate whether developers trust what it finds and act on it. That's the honest gap. If you run the retrospective skill after a real review, you'll have more data than I do.

In Part 2, I'll show how I built the eval, what I learned from eight rounds of iteration, and the debugging story where I spent a week fixing the wrong skill.

Try it

`tessl install tessl-labs/pr-review-guardrails`

The plugin, the eval corpus, and the research brief are all in the GitHub repo. Point it at a PR you've already reviewed and compare its brief against what you found. That's the fastest way to know if the evidence-first approach works for your codebase.

Further reading: Good OSS Citizen Part 1 (the research that started this) | Research brief and eval corpus

As SpaceX deal looms, Cursor partners with Chainguard to secure open-source dependencies in AI-built code

Tessl — Mon, 06 Jul 2026 06:45:04 +0000

Cursor has spent the past week in headlines after confirming a partnership with SpaceX that could eventually lead to a $60 billion acquisition. The deal, for now, centres on training more capable coding models using SpaceX’s compute infrastructure.

Alongside that push on model performance, however, Cursor is now addressing a separate issue: the reliability of the code those models produce.

Cursor has partnered with Chainguard, which provides verified open-source packages, to route dependencies through its curated repositories, aiming to reduce the risk of compromised components entering AI-built applications.

The announcement lands as AI coding tools push more software into production with less human review, raising questions about how much of that code can be trusted.

Supply chain risks in the agentic era

The partnership addresses a problem developers know all too well. Modern applications depend heavily on open-source libraries and container images, most of which are pulled from public registries such as npm, PyPI, and Docker Hub.

Those registries operate on openness, with limited checks in place. Developers — and now AI agents — often install dependencies without knowing who built them or whether they have been tampered with.

Recent incidents have underlined the risk. In March, projects such as Trivy, LiteLLM, Telnyx, and Axios were compromised, with attackers using poisoned packages to steal credentials and spread malware.

For teams using AI-generated code, the exposure increases. Agents can select and install dependencies automatically, making trust decisions at a pace that outstrips manual review.

As Chainguard co-founder and CEO Dan Lorenc put it, generating code is becoming routine — checking its integrity is where the pressure now sits.

“AI agents are making dependency decisions at a scale and speed no security team can manually review,” he wrote in a blog post. “As organizations adopt agentic development, the biggest blocker is no longer how fast code can be generated – it’s whether that code can be trusted.”

A curated path for dependencies

Under the partnership, Cursor users can pull libraries and container images from Chainguard’s repository instead of public registries. The company says its catalogue includes millions of vetted library versions across Python, JavaScript, and Java, along with thousands of minimal container images.

The filtering process is strict. Chainguard builds packages only from publicly available source code and avoids components that rely on install-time scripts — a common vector for hidden payloads. If a package cannot be traced back to a verifiable source, it doesn’t make the cut.
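One part of that filter, rejecting packages that run code at install time, can be sketched for npm-style manifests. This is a rough illustration, not Chainguard's actual pipeline: the function name `install_time_scripts` and the sample manifests are hypothetical. The only fact it leans on is that npm runs the `preinstall`, `install`, and `postinstall` lifecycle scripts when a package is installed.

```python
import json

# npm lifecycle hooks that execute when a package is installed.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def install_time_scripts(manifest_text: str) -> list[str]:
    """Return the install-time hooks declared in a package.json document."""
    manifest = json.loads(manifest_text)
    scripts = manifest.get("scripts") or {}
    return [hook for hook in INSTALL_HOOKS if hook in scripts]


# Hypothetical manifests, for illustration only.
clean = '{"name": "example-a", "scripts": {"test": "node test.js"}}'
risky = '{"name": "example-b", "scripts": {"postinstall": "node setup.js"}}'

print(install_time_scripts(clean))  # []
print(install_time_scripts(risky))  # ['postinstall']
```

A real vetting process would need far more than this (provenance checks, reproducible builds from source), but the example shows why install-time hooks are an easy signal to screen for.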

The goal is to narrow the attack surface without changing how developers work. Projects can be migrated through a simple prompt inside Cursor, after which dependencies are swapped out behind the scenes.

“Recent supply chain attacks showcased how bad actors are working to manipulate the public tools and registries we’ve historically relied on to consume open source,” said Brian McCarthy, a senior executive at Cursor. “With agents writing the majority of code at top businesses around the world, new tools to help ensure the code is trusted and the ability to review and monitor at speed creates a safer paradigm.”

Why this matters for AI-built software

The partnership reflects a broader shift in how software is produced and protected. AI coding tools are no longer limited to suggesting snippets; they are assembling full applications, including the dependencies those applications rely on.

That changes the risk profile. The bottleneck is no longer writing code but confirming that every component, including third-party packages, is safe to run in production.

Without stronger controls, a single compromised dependency can expose sensitive data or halt development while teams investigate and rotate credentials. Incidents tied to supply chain attacks can take days or weeks to unwind.

Other companies are approaching the same problem from different angles. This includes Tessl, which recently introduced security scoring for open source packages in its registry, using data from Snyk to help developers assess risk before pulling in dependencies.

By inserting a verification layer into the dependency pipeline, Chainguard and Cursor are trying to address that weak point directly. The approach doesn’t eliminate the risk entirely, but it narrows the range of unknowns by limiting what can enter a project in the first place.

For Cursor, the move also reflects the expectations of larger customers, particularly as it draws attention from companies such as SpaceX. As AI coding tools edge further into enterprise use, assurances around security are likely to carry as much weight as speed or capability.

Warp goes open source, betting agents and community can outpace closed rivals

Tessl — Sun, 05 Jul 2026 07:53:34 +0000

Warp, the developer tooling startup behind the modern terminal of the same name, is open-sourcing its client — and tying that move to a broader shift in how it believes software will be built.

With AI agents now capable of handling much of the implementation work, Warp argues the real constraint lies in defining what to build, coordinating tasks, and verifying outputs. Opening up the codebase, it says, allows a wider pool of contributors to take on that role, effectively supervising a growing fleet of agents.

“The biggest bottleneck to development is no longer writing code – it’s all the human-in-the-loop activities around the code: speccing the product and verifying behavior, and frankly, we are limited in what our internal team can do and the pace we want to move at,” Warp founder and CEO Zach Lloyd wrote in a blog post.

OpenAI is a founding sponsor of the new Warp repo on GitHub, with Warp’s agent workflows powered by GPT models.

A different kind of open source

Warp isn’t just publishing its code and inviting pull requests. It’s proposing a more structured setup, where agents handle coding, planning, and testing, while human contributors focus on direction and validation.

This means ideas flow in through public GitHub issues, are picked up by agents orchestrated via Warp’s internal platform, Oz, and are then reviewed by both the community and the core team.

The aim is to increase throughput without expanding headcount.

“Open-sourcing with an agent-powered repo is our vision of how software will be built in the future,” Lloyd said. “Humans managing agents at scale to build production-grade software is the model, and implementing this model in the open will allow software to improve most quickly.”

The company says it already has confidence in code generated this way, pointing to internal use of agents for implementation-heavy tasks. Opening that process up, it argues, should accelerate development further — and surface ideas it might not arrive at on its own.

“We’ve found that agents can handle the implementation heavy lifting really well,” Lloyd continued. “That frees contributors to focus on the higher-leverage work: shaping what gets built and making sure it’s right.”

This approach follows a familiar recent trend, in which software development shifts from writing code to managing the systems that produce it. It is also what Tessl is explicitly building around: an agent enablement platform for managing the context that coding agents rely on, treating agent skills and context as software that needs to be built, evaluated, and continuously updated as systems evolve.

Warp uses openness as a lever

Warp is explicit about the competitive backdrop behind its decision. It points to “highly funded, closed-source competitors” and acknowledges it can’t match them on pricing or resourcing.

Instead, it’s positioning openness as a lever — not just to attract contributors, but to move faster by distributing parts of the development process. The addition of wider support for open models, including systems such as Kimi, MiniMax, and Qwen, along with a new “auto (open)” routing mode that selects the most suitable model for a given task, reinforces that stance.

Alongside the open-source release, Warp is also expanding how much users can customize the environment, letting them choose between a plain terminal and a more fully featured setup with built-in agents, diff views, and file navigation tools.

Ultimately, it’s a bid to differentiate on flexibility and pace in a fiercely competitive market.

Agents change the equation

Underlying all of this is a view about what AI agents actually change.

Warp argues the biggest gains are no longer in code generation itself, but in offloading the surrounding work — planning, coordination, verification — that has traditionally slowed development cycles. If agents can handle the bulk of execution, then expanding the pool of people who can guide and review that work becomes the next step.

That’s where open-source comes in. Rather than scaling an internal team, Warp is betting that a community — working alongside agents — can iterate faster and push the product in directions a smaller group might miss.

It’s a notable contrast to other recent moves in the market, where some companies have pulled back from open development over security concerns tied to AI. Warp is taking the opposite view: that agents make openness more valuable, not less.

Tessl Academy is live (in preview) — and there are two ways in

Tessl — Sat, 04 Jul 2026 07:13:53 +0000

We just shipped the first version of Tessl Academy, a hands-on curriculum for building, evaluating, and running skills for coding agents. It's early. Two courses are up — Skill Foundations and Tuning Your Agent — with more on the way. We'd rather get it in front of you now and shape it with your feedback than polish it in private for another month.

Here's the idea. Most of us are already using coding agents, but the results swing between magic and mess. The Academy is about closing that gap: moving from one-off AI coding experiments to workflows you can repeat and trust. Skills are the thread running through every lesson — small, reusable instructions your agent loads on demand.

Two ways to take it

We built the Academy so you can learn whichever way suits you right now:

Read it. Every lesson works as a plain read on the site. No install, no setup — open a lesson and go. Good for a commute, a coffee, or deciding whether the hands-on version is worth your time.
Run it. Install a course once, then ask your agent — Claude Code, Cursor, Codex, or Tessl Agent — to walk you through a lesson. It guides you one step at a time, waits while you work, and hands off to the next lesson when you're done. You learn skills by building one.

Same content, two speeds. Start by reading and switch to hands-on whenever you like — the Quickstart gets you running in about four steps.

It's a preview, and your feedback shapes it

This is genuinely a first cut. Some lessons will land, some won't, and the roadmap past these two courses is still open. That's where you come in: tell us what's confusing, what's missing, and what you'd want to learn next.

Join the conversation in our Discord
Or email me directly: alan@tessl.io

I'll be reading everything. Expect the Academy to move quickly over the coming weeks, and the fastest way to influence where it goes is to try it and tell me what you think.

Start with the Quickstart →

The model's solved, now comes the hard part: Reviewability as the bottleneck

Tessl — Fri, 03 Jul 2026 07:33:36 +0000

It's something you'll likely be hearing more and more: the model is no longer the big sticking point in AI engineering. The question keeping teams up at night is how to build reliable, governable systems around it.

Kilo, the open source coding agent built on VS Code, recently crossed three million downloads and processed more than 40 trillion tokens. And the lessons that volume of real-world usage produced had little to do with model intelligence, and everything to do with reviewability, context, and operational control.

‘Task size should be bounded by what a human can review in a single sitting’

Forty trillion tokens sounds like some sort of success metric, but Kilo's own assessment is a little more measured. At that volume, small problems in the surrounding system become expensive fast. A missing context file becomes repeated tool calls; a poorly scoped task produces a diff too large for any engineer to sensibly review; and a vague permission setting becomes a blocker the moment a second team tries to adopt the tool.

The conclusion Kilo drew from its own usage data was pointed: task size should be bounded by what a human can review in a single sitting. If the output can't be reviewed, it can't be trusted, and if it can't be trusted, it won't be merged.

To illustrate the point, Kilo describes splitting a single feature into three parallel workstreams (a billing API endpoint, a test suite, and a documentation update), each handled by a separate agent with a narrow, explicit instruction. One diff touches the endpoint, another touches the tests, and a third touches the docs. If the tests fail, the failure is scoped. If the docs agent guesses, the mistake is visible.
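The single-sitting rule can be approximated with a crude size check on each agent's diff. This is a minimal sketch that assumes a unified diff as input. The names `changed_lines` and `fits_one_sitting` and the 400-line budget are hypothetical placeholders, not figures from Kilo.

```python
def changed_lines(unified_diff: str) -> int:
    """Count added and removed lines in a unified diff, skipping file headers."""
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header lines, not content changes
        if line.startswith(("+", "-")):
            count += 1
    return count


def fits_one_sitting(unified_diff: str, budget: int = 400) -> bool:
    """True when the diff is within the (assumed) review budget."""
    return changed_lines(unified_diff) <= budget


sample = """\
--- a/billing/api.py
+++ b/billing/api.py
@@ -1,3 +1,4 @@
 def charge(amount):
-    return amount
+    validate(amount)
+    return round(amount, 2)
"""

print(changed_lines(sample))     # 3
print(fits_one_sitting(sample))  # True
```

Line count is only a proxy for review effort, and the header check is deliberately crude, but a gate like this makes an oversized task visible before a human is asked to read it.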

“The job changes from ‘write every line’ to ‘design the loop,’” Brendan O'Leary, developer relations engineer at Kilo, writes in a blog post. “You decide the task boundary, the model, the permissions, the environment, and the verification step. The agent writes code. You decide whether that code should exist.”

Kilo's findings fit into a broader pattern emerging elsewhere in the industry. Sourcegraph, the code intelligence platform, recently analysed 1,281 agent runs across more than 40 enterprise-scale open source repositories and found that the gap between success and failure had almost nothing to do with the underlying model.

"The difference between complete failure and near-perfect completion wasn't intelligence — it was efficient access to context," Stephanie Jarmak, agent advocate at Sourcegraph, said.

One benchmark task saw an agent make 96 tool calls over 84 minutes without proper retrieval tooling. The same task, with the right infrastructure in place, took five calls and under five minutes.
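Put as ratios, that gap works out as follows. The inputs are the figures quoted above. Because the second run is described only as taking under five minutes, the time ratio is a lower bound. The variable names are illustrative.

```python
calls_without, minutes_without = 96, 84
calls_with, minutes_with_upper = 5, 5  # "under five minutes", so 5 is an upper bound

call_ratio = calls_without / calls_with
min_speedup = minutes_without / minutes_with_upper

print(f"{call_ratio:.1f}x fewer tool calls")     # 19.2x fewer tool calls
print(f"at least {min_speedup:.1f}x less time")  # at least 16.8x less time
```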

The lesson from both Kilo and Sourcegraph is that the systems surrounding the model increasingly determine the outcome.

The infrastructure around the model is the engineering challenge

Kilo's experience also surfaced a more granular picture of what production-grade agentic engineering actually requires. The full loop — plan, scope, run, verify, review, merge — needs dedicated infrastructure at every step. Planning needs modes and file-backed handoffs. Scoping needs explicit permissions and task boundaries. Running needs model choice, tool calls, and environment isolation. Verification needs tests, CI integration, and sometimes a second agent with fresh context. Review needs a diff a human can understand. When any one part is missing, the agent may still produce code, but the team just won't trust it enough to merge.

But reviewability is only part of that picture. OpenAI's most recent enterprise guidance, drawing on deployments at companies including BBVA, Philips, and JetBrains, shows that organisations seeing the most traction are those focused on evaluation systems, context management, orchestration, and governance — not on which model sits underneath.

"The organisations that win with AI won't be the ones that tried it first — they'll be the ones that operationalised it best," said Sanj Bhayro, OpenAI's managing director for EMEA.

The emerging picture is of a new engineering layer forming around AI systems: evaluation tooling that runs against real codebases, shared context registries, permission controls, usage analytics, and observability infrastructure. Kilo's own roadmap reflects this directly — its next priorities centre on portable sessions that survive moving between VS Code, the terminal, Slack, and cloud environments, and on ensuring every agent workflow ends in an artifact a human can judge.

Governance before autonomy: teams won't adopt what they can't explain

One of the less obvious lessons from Kilo's experiences is the difference between individual and team adoption. Individual developers adopt tools when they save time. Teams adopt them when they can explain the risk — to finance, to security, to whoever owns the production environment.

“That means agentic engineering needs controls that feel boring until you need them,” O’Leary notes.

Kilo learned that the hard way. Its early free credits attracted tens of thousands of throwaway accounts, generating billing pressure, infrastructure strain, and weeks of engineering time spent in merge conflicts rather than shipping product. The experience sharpened Kilo's thinking on what enterprise-grade agentic tooling actually needs: model allowlists, usage visibility before a billing surprise arrives, permission prompts that can block tool calls, isolated cloud environments for sensitive work, and source visibility for security review.

Those requirements map directly onto the questions Kilo found developers asking about any open source AI tool: can I inspect what runs against my code? Can I bring my own model key? Can I control which models my team is allowed to use? Can I see usage before a bill arrives? Can I keep sensitive work local? And crucially — can I leave if the product stops fitting how my team works?
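Two of those controls, a model allowlist and a permission gate on tool calls, are simple to express as policy. The sketch below is illustrative only: `TeamPolicy`, its fields, and the model and tool names are hypothetical and do not reflect Kilo's configuration format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TeamPolicy:
    """Hypothetical team-level policy for an agent run."""
    allowed_models: frozenset[str]
    # Tools that need explicit human approval before they run.
    gated_tools: frozenset[str] = field(default_factory=frozenset)

    def check(self, model: str, tool: str, approved: bool = False) -> str:
        """Return 'allow', 'prompt', or 'deny' for a requested tool call."""
        if model not in self.allowed_models:
            return "deny"
        if tool in self.gated_tools and not approved:
            return "prompt"
        return "allow"


policy = TeamPolicy(
    allowed_models=frozenset({"model-a", "model-b"}),
    gated_tools=frozenset({"shell", "write_file"}),
)

print(policy.check("model-a", "read_file"))             # allow
print(policy.check("model-a", "shell"))                 # prompt
print(policy.check("model-a", "shell", approved=True))  # allow
print(policy.check("model-c", "read_file"))             # deny
```

Keeping the decision in one small function also gives security and finance reviewers one place to read when they ask what the agent is allowed to do.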

The infrastructure layer is still being built

Sourcegraph's retrieval findings, OpenAI's governance lessons, and Kilo's focus on reviewability all point toward the same challenge: reliable AI systems depend on reliable infrastructure around the model.

Kilo's own roadmap frames the next phase in three parts: portable, meaning sessions that survive moving between VS Code, the terminal, Slack, and cloud environments; governed, meaning teams can set model policies, inspect usage, and control permissions; and review-first, meaning every agent workflow ends in an artifact a human can judge — a diff, a test result, a PR comment, a deployment preview.

Forty trillion tokens and three million downloads later, Kilo's conclusion is that generating code is only part of the problem. Teams still need ways to review it, verify it, govern it, and trust it. The model may be good enough, but the systems around it are still being built.