oh-my-agent v2: Nine New Skills, First-Class Cursor, and an 80/100 Benchmark

If you have watched an AI coding agent install a package version that does not exist in your lockfile, or ship a function that fails your own lint config on the first commit, you already understand the gap oh-my-agent v2 is built to close. The framework's second major release adds nine new skills, promotes Cursor to a first-class vendor, and ships a benchmark that scores the toolkit 80 out of 100.

Here is what v2 changes, and how to decide whether the additions target real failure modes or just expand the surface area.

What oh-my-agent does

oh-my-agent is a skill layer that sits between you and whatever AI coding agent you run. The name borrows from oh-my-zsh, and the analogy holds: instead of configuring shell behavior, you configure agent behavior with reusable, composable instruction modules the project calls skills.

The problem it targets is consistency. A raw coding agent keeps no durable memory of your project's conventions. Ask it to add a dependency and it may guess a version that is not in your lockfile. Ask it to write a component and it may ignore the lint config sitting in your repo root. These are not edge cases — they are the default behavior of an agent that treats every request as a fresh context.

A skill in oh-my-agent is a packaged set of instructions and checks the agent loads when a task matches. One skill might force the agent to read your package.json and lockfile before proposing a version. Another might surface your linter rules before any code is written. The pitch is that you stop re-explaining the same constraints in every prompt.
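To make the lockfile idea concrete, here is a minimal sketch of the kind of check such a skill could run before an agent proposes a dependency version. It is not oh-my-agent's implementation: the function names and the sample lockfile are hypothetical, and the only thing assumed from the real world is npm's lockfile layout (versions 2 and 3), where installed packages sit under a "packages" map keyed by "node_modules/<name>".

```python
import json
from typing import Optional

# Hypothetical sample in npm's lockfileVersion 3 layout; a real check would
# load package-lock.json from the repo root instead.
SAMPLE_LOCK = json.loads("""
{
  "name": "demo-app",
  "lockfileVersion": 3,
  "packages": {
    "": {"name": "demo-app", "version": "1.0.0"},
    "node_modules/left-pad": {"version": "1.3.0"}
  }
}
""")


def locked_version(lock: dict, name: str) -> Optional[str]:
    """Return the version pinned for a top-level package, or None if absent."""
    entry = lock.get("packages", {}).get(f"node_modules/{name}")
    return entry.get("version") if entry else None


def check_proposal(lock: dict, name: str, proposed: str) -> str:
    """Compare an agent's proposed version with what the lockfile pins."""
    pinned = locked_version(lock, name)
    if pinned is None:
        return f"{name}: not in lockfile, needs an explicit install"
    if pinned != proposed:
        return f"{name}: proposed {proposed} but lockfile pins {pinned}"
    return f"{name}: ok ({pinned})"


if __name__ == "__main__":
    print(check_proposal(SAMPLE_LOCK, "left-pad", "2.0.0"))
```

The sketch only looks at top-level entries and exact versions. Nested dependencies and semver ranges would need more logic, which is the sort of detail a packaged skill is supposed to carry so you do not have to restate it in every prompt.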

The nine new skills in v2

The v2 release adds nine skills. Three are worth calling out, because they map to problems most teams hit within a week of adopting an agent.

deepsec handles security review. Instead of trusting the agent to remember secure patterns, the skill runs a structured pass over generated code, checking for the injection, secret-handling, and trust-boundary mistakes agents introduce when they optimize for making something work.
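For a sense of what one slice of a structured pass can look like, the sketch below scans generated code for hardcoded secrets. It is an illustration only, not how deepsec works: the pattern names and the scan function are hypothetical, and the two regexes cover only a narrow corner of secret handling (AWS-style access key IDs and quoted values assigned to credential-like names).

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately small pattern set; a real review would need
# far more than two regexes.
SECRET_PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A quoted literal of 8+ characters assigned to a credential-like name.
    "hardcoded_credential": re.compile(
        r"(password|secret|api_key|token)\s*=\s*['\"][^'\"]{8,}['\"]",
        re.IGNORECASE,
    ),
}


def scan(code: str) -> List[Tuple[int, str]]:
    """Return (line number, pattern label) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


if __name__ == "__main__":
    generated = 'db_password = "hunter2hunter2"\nregion = "eu-west-1"\n'
    for lineno, label in scan(generated):
        print(f"line {lineno}: {label}")
```

Pattern matching like this catches only the obvious cases. Injection and trust-boundary mistakes need data-flow reasoning that a line-by-line scan cannot provide.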

observability pushes the agent to add logging, metrics, and tracing as it writes code, rather than leaving instrumentation as a follow-up task that never happens.

docs drift detection is the one most teams underrate. When an agent changes a function signature or a config option, the matching documentation usually goes stale without anyone noticing. This skill flags the gap so docs and code stay in sync.
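A rough sketch of the idea, under stated assumptions: the code below compares function signatures in Python source against call examples written in backticks in a documentation string. The helper names and the backtick convention are hypothetical, chosen for illustration, and nothing here reflects how the oh-my-agent skill is built.

```python
import ast
import re
from typing import Dict, List

# Assumed docs convention: signatures appear inline as `name(arg1, arg2)`.
DOC_CALL = re.compile(r"`(\w+)\(([^)]*)\)`")


def code_signatures(source: str) -> Dict[str, List[str]]:
    """Map each function defined in the source to its positional parameter names."""
    tree = ast.parse(source)
    return {
        node.name: [arg.arg for arg in node.args.args]
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }


def doc_signatures(doc_text: str) -> Dict[str, List[str]]:
    """Map each documented call to the parameter names the docs list for it."""
    return {
        name: [p.strip() for p in params.split(",") if p.strip()]
        for name, params in DOC_CALL.findall(doc_text)
    }


def find_drift(source: str, doc_text: str) -> List[str]:
    """Report documented functions whose parameters no longer match the code."""
    code = code_signatures(source)
    drift = []
    for name, documented in doc_signatures(doc_text).items():
        if name not in code:
            drift.append(f"{name}: documented but not found in code")
        elif code[name] != documented:
            drift.append(
                f"{name}: docs list ({', '.join(documented)}), "
                f"code has ({', '.join(code[name])})"
            )
    return drift


if __name__ == "__main__":
    src = "def resize(image, width, height, keep_ratio):\n    pass\n"
    docs = "Call `resize(image, width, height)` to scale an image."
    print(find_drift(src, docs))
```

The sketch ignores default values, keyword-only parameters, and config options, so treat it as the shape of the check rather than a usable detector.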

If you adopt only one skill from v2, start with docs drift detection. Stale documentation is the failure you notice last and pay for longest: every new teammate and every future agent run inherits the wrong mental model from it.

The remaining six skills round out areas like testing and project conventions. The pattern across all nine is the same: take a step a developer is supposed to do, and make it a non-optional part of the agent's workflow instead of a hope.

Cursor becomes a first-class vendor

Earlier oh-my-agent releases were built around one agent and treated the rest as second-class. v2 changes the model. A vendor is the underlying agent that executes skills, and Cursor is now a first-class vendor, which means skills are tested against it and ship with Cursor-specific wiring rather than a generic fallback.

In practice, you can keep oh-my-agent's skill definitions in one place and run them through Cursor's agent without rewriting instructions per tool. For teams that have standardized on Cursor as their editor, that removes the main reason to maintain a separate, hand-rolled set of project rules.

First-class status is a maintenance commitment, not a one-time feature. The thing to watch over the next few releases is whether Cursor support keeps pace with the primary vendor or quietly drifts behind it — the usual failure pattern for multi-vendor tools.

What the 80/100 benchmark does and doesn't tell you

v2 ships with a benchmark that scores the toolkit 80 out of 100. A published, repeatable number is useful on its own: it gives you a baseline to compare future releases against, and it signals the project is willing to measure itself instead of leaning on adjectives.

Treat the number as a starting point, not a verdict. A benchmark reflects the tasks its authors chose. An 80 on the project's own suite tells you the skills behave as designed on that suite. It does not tell you how they perform on your codebase, your stack, or your conventions.

Do not adopt oh-my-agent on the strength of the 80/100 score alone. Run the skills against a real branch in your own repo and measure something you care about — failed lint checks, wrong dependency versions, broken builds — before and after. A framework's self-reported benchmark is a sales sheet until you reproduce it.
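One way to structure that before-and-after comparison is sketched below. The record type, its field names, and the helper functions are all hypothetical, and you would fill in the counts yourself from CI or local runs on the same set of tasks with the skills off and then on.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class TaskOutcome:
    """Failure counts observed for one agent task (hypothetical fields)."""
    lint_failures: int
    wrong_versions: int
    broken_builds: int


def mean_failures(outcomes: List[TaskOutcome]) -> Dict[str, float]:
    """Average each failure count over a batch of tasks."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return {
        f.name: sum(getattr(o, f.name) for o in outcomes) / len(outcomes)
        for f in fields(TaskOutcome)
    }


def compare(before: List[TaskOutcome], after: List[TaskOutcome]) -> Dict[str, float]:
    """Change in mean failures per task; negative means fewer failures after."""
    b, a = mean_failures(before), mean_failures(after)
    return {name: a[name] - b[name] for name in b}
```

Run the same tasks in both batches and keep the batches large enough that one bad run does not dominate the average. A per-metric delta from your own repo says more about fit than a single aggregate score from someone else's suite.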

The honest read on v2: the release aims squarely at the most common, least glamorous agent failures — wrong versions, ignored configs, stale docs — rather than chasing a flashier capability. That is the right target. The open question is operational. Nine new skills is a lot of surface to keep working across two first-class vendors, and the real proof will be whether release three holds the line.

Originally published at pickuma.com.