DEV Community

gracefullight

oh-my-agent: skills now measure and optimize their own utility

Most skill libraries grow by accretion. You add a SKILL.md, it sounds useful, and it lives forever because nobody can prove whether it helps or hurts. This week oh-my-agent closed that gap: oma skills eval measures whether loading a skill actually improves held-out task outcomes, and oma skills opt rewrites the skill to push that number up. 194 commits landed and the CLI is at 8.41.0, but the eval-to-opt loop is the part worth your attention.

What's new

  • oma skills eval: measures utilityLift (treatment vs baseline) on held-out tasks. --mock replays recorded rollouts deterministically, --live spawns two read-only agentic arms per task, --record captures the rollouts. Default checker is judge (an LLM grades output against a rubric); assert and regex are opt-in deterministic checks.
  • oma skills opt: an optimizer LLM proposes bounded add/delete/replace edits to a SKILL.md, re-scores each candidate through eval, and accepts only when held-out validation lift strictly improves with no negative-transfer regression (SkillOpt, arXiv:2605.23904). --dry-run is the default; --apply writes through atomic temp+rename with a .bak backup.
  • Negative-transfer sampling: --neg-transfer checks whether loading one skill regresses unrelated same-domain tasks from other skills' eval sets.
  • Scaling-law audit checks: oma skills audit now flags black-hole skills (overly generic routing hijackers) and warns past a calibrated library-size routing-decay threshold (Chen et al., arXiv:2605.16508).
  • oma-video skill and /video workflow: key-optional 3-tier generation (9:16 shorts, 16:9 explainer, demo capture of any URL) composing narration, visuals, captions, and a vendored Remotion compositor. Every provider degrades to a deterministic fallback, so a run completes with zero API keys.
  • Swift native iOS in oma-mobile: a swift-ios variant (SwiftUI + @Observable, Apple swift-openapi-generator, App/Core/Features/Shared layout). /stack-set now detects Swift, Flutter, and React Native and routes to the resolved skill; oma verify mobile runs swift build / swift test by stack manifest.
  • oma intel: a local-first product intelligence pipeline that collects GitHub README, releases, and issues, runs an adversarial multi-lens review gate, and splits output into a PRD and a gap report.
  • Three new runtimes: Kiro CLI, Pi (Earendil, via in-process .pi/extensions), and full Antigravity (agy) hook integration through .agents/hooks.json.
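
The accept rule behind the eval-to-opt loop can be sketched in a few lines. This is a minimal illustration, not oh-my-agent's code: it assumes utilityLift is the mean treatment score minus the mean baseline score, and that "no negative-transfer regression" means the candidate's lift on unrelated tasks is not lower than the current skill's. The names ArmScores, SkillEval, utilityLift, and shouldAccept are hypothetical.

```typescript
// Hypothetical shapes for illustration; not taken from the oh-my-agent codebase.
interface ArmScores {
  baseline: number[];  // per-task scores without the skill loaded
  treatment: number[]; // per-task scores with the skill loaded
}

interface SkillEval {
  validation: ArmScores;  // held-out tasks for the skill being edited
  negTransfer: ArmScores; // unrelated same-domain tasks from other skills
}

function mean(xs: number[]): number {
  if (xs.length === 0) throw new Error("no scores to average");
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Assumption: lift is the difference between the mean scores of the two arms.
function utilityLift(scores: ArmScores): number {
  return mean(scores.treatment) - mean(scores.baseline);
}

// Accept a candidate edit only if held-out lift strictly improves
// and lift on unrelated tasks does not drop (assumed regression test).
function shouldAccept(current: SkillEval, candidate: SkillEval): boolean {
  const improves =
    utilityLift(candidate.validation) > utilityLift(current.validation);
  const noRegression =
    utilityLift(candidate.negTransfer) >= utilityLift(current.negTransfer);
  return improves && noRegression;
}
```

Under these assumptions a candidate that ties the current skill is rejected, because the comparison on held-out lift is strict.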

What's fixed

  • runAction clobbered positional operands: it overwrote args[0] with the merged options object, so oma state:emit decision.made '{...}' recorded the kind as {category:"main",...} and state:verify always reported the decision missing. Options are now replaced by position so operands survive.
  • --yes never reached handlers: the wrapper passed command.opts() (which drops globally-parsed flags), so oma skills eval --live --yes still blocked at the cost-preview prompt. Switched to optsWithGlobals(), making live skill-eval runnable in CI.
  • AgentMemory leaked into project dirs: the iii engine wrote a cwd-relative ./data/ store into whatever project launched it. The daemon cwd is now pinned to ~/.agentmemory, and daemon stop invokes agentmemory stop so no orphaned engine keeps port 3111.
  • Keyless market sources silently 403'd: anonymous reddit search.json and bluesky's public search endpoint both returned 403, dropping two of the default sources. Reddit now routes through pullpush.io, bluesky through api.bsky.app, taking keyless default coverage from 2/4 to 4/4.
  • agy headless stdout was empty: Antigravity emits nothing on stdout under --print against a non-TTY, so spawned subagent capture was blank. Subagents now run under a PTY (script(1)) so their output is captured.
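
The runAction fix is easier to see in miniature. The sketch below assumes a Commander-style action call, where positional operands come first, followed by the options object and then the command object. Both function names and the sample payload are hypothetical, and the real wrapper may differ.

```typescript
// Hypothetical model of a Commander-style action call:
// [...operands, options, command]
type Options = Record<string, unknown>;
type ActionArgs = unknown[];

// Buggy variant: writes the merged options into slot 0,
// which clobbers the first positional operand.
function mergeOptionsBuggy(args: ActionArgs, globals: Options): ActionArgs {
  const out = [...args];
  const optsIndex = out.length - 2;
  if (optsIndex < 0) return out;
  out[0] = { ...globals, ...(out[optsIndex] as Options) };
  return out;
}

// Fixed variant: replaces the options object at its own position,
// so operands before it are left untouched.
function mergeOptionsByPosition(args: ActionArgs, globals: Options): ActionArgs {
  const out = [...args];
  const optsIndex = out.length - 2;
  if (optsIndex < 0) return out;
  out[optsIndex] = { ...globals, ...(out[optsIndex] as Options) };
  return out;
}

const command = { name: "state:emit" }; // stand-in for the command object
const args: ActionArgs = ["decision.made", '{"id":1}', { category: "main" }, command];

const buggy = mergeOptionsBuggy(args, { yes: true });
const fixed = mergeOptionsByPosition(args, { yes: true });
```

In the buggy result the first element is an object rather than the string decision.made, which matches the symptom of the kind being recorded as an options object.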

What's better

  • Workflows are symlinked directly: each workflow file carries its own name + disable-model-invocation frontmatter and is exposed through a symlink that points straight at .agents/workflows/<wf>.md. This removed 18 committed wrapper skills and fixed a pdf/oma-pdf audit false positive; the real skill count is now 30.
  • harvest.ts split: the 1.4k-line market harvest file became endpoints / normalizers / sources modules, and a ~400-line fetchSource conditional became a source-handler registry. The public facade is unchanged.
  • Print stylesheet stopped fighting the cascade: the slide PDF export dropped its avoidable !important overrides by fixing the source of the conflict (scoped #slide-NN resets emitted after author styles) instead of forcing the win. The only remaining !important is the prefers-reduced-motion a11y reset.
  • Centralized paths and hashing: install, state, and recap now share .agents path constants and agree on full SHA-256 for manifest checksums.
  • Default effort lowered: install-time Claude settings and the Anthropic auto-default now use high instead of xhigh; existing higher settings are preserved.
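
Replacing a long conditional with a source-handler registry, as in the harvest.ts split, might look roughly like the sketch below. The source names, placeholder hosts, and function names are all hypothetical; the point is only that adding a source becomes a registration instead of another branch, while the single entry point stays the same.

```typescript
// Hypothetical handler shape: build a request URL for a query.
type SourceHandler = (query: string) => string;

// Registry keyed by source name. Hosts are placeholders, not real endpoints.
const registry = new Map<string, SourceHandler>([
  ["alpha", (q) => `https://alpha.example/search?q=${encodeURIComponent(q)}`],
  ["beta", (q) => `https://beta.example/find?term=${encodeURIComponent(q)}`],
]);

// New sources are registered rather than added as another if/else branch.
function registerSource(name: string, handler: SourceHandler): void {
  registry.set(name, handler);
}

// Unchanged facade: callers still pass a source name and a query.
function resolveRequest(source: string, query: string): string {
  const handler = registry.get(source);
  if (!handler) throw new Error(`unknown source: ${source}`);
  return handler(query);
}

registerSource("gamma", (q) => `https://gamma.example/q/${encodeURIComponent(q)}`);
```

An unknown source fails loudly at lookup time, which is one way to avoid the silent-drop behavior a long conditional can hide.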

Installation

# macOS / Linux
curl -fsSL https://raw.githubusercontent.com/first-fluke/oh-my-agent/main/cli/install.sh | bash

# Windows (PowerShell)
irm https://raw.githubusercontent.com/first-fluke/oh-my-agent/main/cli/install.ps1 | iex

oh-my-agent is built for teams who treat a skill library as a measured asset, not a junk drawer. Next up: feeding the edits that oma skills opt accepts back through the eval fixtures so the library self-tunes on every release.

https://github.com/first-fluke/oh-my-agent
