<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anton Abyzov</title>
    <description>The latest articles on DEV Community by Anton Abyzov (@aabyzov).</description>
    <link>https://dev.to/aabyzov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2694434%2Fa10a50ee-2e9a-4199-acb5-06c1ed1559f4.png</url>
      <title>DEV Community: Anton Abyzov</title>
      <link>https://dev.to/aabyzov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aabyzov"/>
    <language>en</language>
    <item>
      <title>Anthropic Just Validated Agent Teams: Why Specs Matter More Than Prompts</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Wed, 08 Apr 2026 01:06:15 +0000</pubDate>
      <link>https://dev.to/aabyzov/anthropic-just-validated-agent-teams-why-specs-matter-more-than-prompts-4p4l</link>
      <guid>https://dev.to/aabyzov/anthropic-just-validated-agent-teams-why-specs-matter-more-than-prompts-4p4l</guid>
      <description>&lt;h1&gt;
  
  
  Anthropic Just Validated Agent Teams: Why Specs Matter More Than Prompts
&lt;/h1&gt;

&lt;p&gt;Today Anthropic showed the slide that matters most for the next phase of agentic software: &lt;strong&gt;Agent Teams Revisited&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not because it is flashy, but because it makes the shift explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;subagents&lt;/li&gt;
&lt;li&gt;Claude Code communicating with them&lt;/li&gt;
&lt;li&gt;coordination as a first-class capability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the real story.&lt;/p&gt;

&lt;p&gt;Most people still think in terms of one super prompt and one super assistant.&lt;br&gt;
I think the future is much closer to an organization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one lead agent&lt;/li&gt;
&lt;li&gt;multiple specialists&lt;/li&gt;
&lt;li&gt;a shared spec&lt;/li&gt;
&lt;li&gt;verification before execution&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Prompts give output, specs give alignment
&lt;/h2&gt;

&lt;p&gt;A prompt is good for producing a result.&lt;br&gt;
A spec is better for aligning a system.&lt;/p&gt;

&lt;p&gt;Once you want multiple agents to work together, alignment matters more than cleverness.&lt;br&gt;
Without structure, multi-agent workflows become multi-chaos.&lt;br&gt;
With structure, they become leverage.&lt;/p&gt;

&lt;p&gt;That is why I am bullish on spec-first orchestration.&lt;/p&gt;

&lt;p&gt;The workflow I want is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec -&amp;gt; /sw:team-lead -&amp;gt; specialists -&amp;gt; verification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lead agent should not improvise from vague intent.&lt;br&gt;
It should have enough structure to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;understand the target outcome&lt;/li&gt;
&lt;li&gt;break work into tasks&lt;/li&gt;
&lt;li&gt;delegate to specialists&lt;/li&gt;
&lt;li&gt;review what came back&lt;/li&gt;
&lt;li&gt;return a clean execution path&lt;/li&gt;
&lt;/ol&gt;
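
&lt;p&gt;As a rough illustration, the structure handed to the lead agent might look like this (the fields and task assignments are purely illustrative, not any tool's actual schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;outcome: migrate the auth service to the new token format
tasks:
  - inventory current token usages   -&amp;gt; code-search specialist
  - draft the migration plan         -&amp;gt; architecture specialist
  - implement changes plus tests     -&amp;gt; implementation specialist
verify:
  - all existing auth tests pass
  - no endpoint still accepts the old format
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;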

&lt;h2&gt;
  
  
  Why Anthropic's slide matters
&lt;/h2&gt;

&lt;p&gt;When a frontier lab shows subagents and Claude Code communicating with each other, it validates a broader direction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better models will matter&lt;/li&gt;
&lt;li&gt;but orchestration will matter just as much&lt;/li&gt;
&lt;li&gt;the next moat is not only intelligence but coordination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A lot of the value will move into the workflow layer around the model.&lt;br&gt;
That includes how work is specified, delegated, verified, and merged.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am building toward
&lt;/h2&gt;

&lt;p&gt;This is the direction I have been pushing with &lt;strong&gt;SpecWeave&lt;/strong&gt; and the &lt;strong&gt;Verified Skill&lt;/strong&gt; layer.&lt;/p&gt;

&lt;p&gt;Both are &lt;strong&gt;FREE and OPEN SOURCE&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://specweave.com" rel="noopener noreferrer"&gt;https://specweave.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://verified-skill.com" rel="noopener noreferrer"&gt;https://verified-skill.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My view is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prompts are not enough&lt;/li&gt;
&lt;li&gt;teams need specs&lt;/li&gt;
&lt;li&gt;agent teams need a lead&lt;/li&gt;
&lt;li&gt;execution needs verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic showed the destination.&lt;br&gt;
Now the real race is building the best execution layer around it.&lt;/p&gt;

&lt;p&gt;If you are building in this space, I would pay very close attention to that shift.&lt;/p&gt;

</description>
      <category>opensource</category>
    </item>
    <item>
      <title>Project Glasswing changes the AI security conversation</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Tue, 07 Apr 2026 19:04:33 +0000</pubDate>
      <link>https://dev.to/aabyzov/project-glasswing-changes-the-ai-security-conversation-2d08</link>
      <guid>https://dev.to/aabyzov/project-glasswing-changes-the-ai-security-conversation-2d08</guid>
      <description>&lt;h1&gt;
  
  
  Project Glasswing changes the AI security conversation
&lt;/h1&gt;

&lt;p&gt;Anthropic’s Project Glasswing is one of the clearest signals yet that frontier AI has crossed from “helpful coding assistant” into something much more consequential: autonomous vulnerability discovery.&lt;/p&gt;

&lt;p&gt;According to Anthropic, Claude Mythos Preview found thousands of high-severity vulnerabilities, including in every major operating system and web browser. More importantly, the company says many of these findings, and some related exploit paths, were discovered autonomously.&lt;/p&gt;

&lt;p&gt;If those claims hold up, the conversation around AI and software security has changed.&lt;/p&gt;

&lt;p&gt;For the past couple of years, most discussions about AI in software have focused on productivity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;faster coding&lt;/li&gt;
&lt;li&gt;better debugging&lt;/li&gt;
&lt;li&gt;easier refactoring&lt;/li&gt;
&lt;li&gt;more capable agentic workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Project Glasswing points to the next phase.&lt;/p&gt;

&lt;p&gt;The question is no longer just whether AI can help engineers write software faster. It is whether frontier models can become first-class actors in finding and fixing vulnerabilities across critical infrastructure before attackers use similar capabilities offensively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this announcement matters
&lt;/h2&gt;

&lt;p&gt;Three things make this announcement different.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Anthropic is explicitly restricting the model
&lt;/h3&gt;

&lt;p&gt;Anthropic is not broadly releasing Claude Mythos Preview. That alone says a lot. Companies do not usually frame their own model as too dangerous for wide deployment unless they believe the capability jump is material.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The partner list is unusually serious
&lt;/h3&gt;

&lt;p&gt;AWS, Apple, Google, Microsoft, Cisco, CrowdStrike, the Linux Foundation, NVIDIA, Palo Alto Networks, and JPMorganChase are not participating for PR theater. That coalition signals the industry believes AI-driven vulnerability discovery is becoming strategically important.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Open source is central to the story
&lt;/h3&gt;

&lt;p&gt;Anthropic paired the model-access announcement with usage credits and donations for open-source security organizations. That matters because critical infrastructure increasingly depends on open-source components, while maintainers are often stretched thin.&lt;/p&gt;

&lt;h2&gt;
  
  
  My bigger takeaway: the AI skills supply chain now matters
&lt;/h2&gt;

&lt;p&gt;The most interesting second-order effect of Project Glasswing is not just about model safety. It is about trust in the systems that surround these models.&lt;/p&gt;

&lt;p&gt;If AI agents are increasingly writing code, reviewing code, testing software, and securing infrastructure, then we need much better provenance and verification across the entire execution stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which skills the agent can use&lt;/li&gt;
&lt;li&gt;who authored them&lt;/li&gt;
&lt;li&gt;what they are allowed to do&lt;/li&gt;
&lt;li&gt;how they are versioned&lt;/li&gt;
&lt;li&gt;how they are audited&lt;/li&gt;
&lt;li&gt;how teams can trust them in production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly why I think the AI skills supply chain is about to become a major category.&lt;/p&gt;

&lt;p&gt;It is also why I care about verified-skill.com, a FREE and OPEN SOURCE registry for verified AI skills. If we want agentic systems to operate safely in real environments, we need trusted building blocks around them, not just more powerful frontier models.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real race
&lt;/h2&gt;

&lt;p&gt;Project Glasswing also makes the central strategic question painfully clear:&lt;br&gt;
Who gets these capabilities first at scale, defenders or attackers?&lt;/p&gt;

&lt;p&gt;Anthropic’s answer is to give defenders a head start. That is rational. But it also suggests a deeper truth: once AI reaches this level of cyber capability, trust, governance, disclosure, patching workflows, and skill-level controls become just as important as raw model intelligence.&lt;/p&gt;

&lt;p&gt;The next era of software security will not be defined only by smarter models.&lt;br&gt;
It will be defined by whether we can build trustworthy systems around them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;Project Glasswing may be remembered as the moment the industry stopped thinking about AI security as a side topic and started treating it as foundational infrastructure.&lt;/p&gt;

&lt;p&gt;Smarter agents are coming whether we are ready or not.&lt;br&gt;
The real work now is making them trustworthy.&lt;/p&gt;

</description>
      <category>opensource</category>
    </item>
    <item>
      <title>Claude Code UltraPlan: why the workflow matters more than the hype</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:30:18 +0000</pubDate>
      <link>https://dev.to/aabyzov/claude-code-ultraplan-why-the-workflow-matters-more-than-the-hype-3p2n</link>
      <guid>https://dev.to/aabyzov/claude-code-ultraplan-why-the-workflow-matters-more-than-the-hype-3p2n</guid>
      <description>&lt;p&gt;Claude Code’s new UltraPlan is getting a lot of “smarter planning” attention.&lt;/p&gt;

&lt;p&gt;I think that framing misses the real product shift.&lt;/p&gt;

&lt;p&gt;UltraPlan looks more important as a &lt;em&gt;workflow upgrade&lt;/em&gt; than as a pure intelligence upgrade.&lt;/p&gt;

&lt;h2&gt;
  
  
  What UltraPlan officially changes
&lt;/h2&gt;

&lt;p&gt;From the official docs, the basic loop is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;start planning from the terminal with &lt;code&gt;/ultraplan&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Claude drafts the plan in the cloud&lt;/li&gt;
&lt;li&gt;you review it in the browser&lt;/li&gt;
&lt;li&gt;you can leave inline comments and reactions&lt;/li&gt;
&lt;li&gt;then you either execute in the cloud or teleport the plan back to your terminal&lt;/li&gt;
&lt;/ul&gt;
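
&lt;p&gt;The loop above can be sketched as a single flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/ultraplan (terminal) -&amp;gt; plan drafted in cloud -&amp;gt; review + inline comments (browser)
    -&amp;gt; execute in cloud | teleport plan back to terminal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;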

&lt;p&gt;That sounds simple, but it changes where planning lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real value: terminal → cloud → review → execution
&lt;/h2&gt;

&lt;p&gt;Most people focus on whether the plan itself is better.&lt;/p&gt;

&lt;p&gt;But in practice, planning is often limited by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how easy it is to review&lt;/li&gt;
&lt;li&gt;how easy it is to revise&lt;/li&gt;
&lt;li&gt;how much it blocks your local workflow&lt;/li&gt;
&lt;li&gt;how cleanly it hands off into execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;UltraPlan improves all four.&lt;/p&gt;

&lt;p&gt;Your terminal stays free.&lt;br&gt;
You get a better review surface.&lt;br&gt;
You can comment on specific parts of the plan.&lt;br&gt;
And you can choose whether execution stays remote or comes back local.&lt;/p&gt;

&lt;p&gt;That is a meaningful improvement in engineering workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where UltraPlan looks stronger
&lt;/h2&gt;

&lt;p&gt;From the transcript I reviewed, a few things stood out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it looked roughly 2x faster than local planning across repeated runs&lt;/li&gt;
&lt;li&gt;in some migration-style tasks, it seemed better at auditing blast radius and risk&lt;/li&gt;
&lt;li&gt;it looked better suited for multitasking because you can fire off plans and review them asynchronously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is real value, especially for people working across multiple code changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the hype breaks down
&lt;/h2&gt;

&lt;p&gt;The same transcript also showed something important:&lt;/p&gt;

&lt;p&gt;UltraPlan did &lt;strong&gt;not&lt;/strong&gt; look consistently smarter than local planning.&lt;/p&gt;

&lt;p&gt;In some tasks it looked stronger.&lt;br&gt;
In others it looked very similar to local planning, just with a much nicer review experience.&lt;/p&gt;

&lt;p&gt;That nuance matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I think this is bigger than one feature
&lt;/h2&gt;

&lt;p&gt;My current read is that UltraPlan may matter more as planning infrastructure than as one fixed planner.&lt;/p&gt;

&lt;p&gt;If Anthropic is using this cloud review loop to test and refine planning strategies over time, then the deeper story is not just a new slash command.&lt;/p&gt;

&lt;p&gt;It is a new control surface for planning quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  The other side of the problem: execution discipline
&lt;/h2&gt;

&lt;p&gt;There is also a separate question here:&lt;/p&gt;

&lt;p&gt;What happens after the plan?&lt;/p&gt;

&lt;p&gt;If your goal is deterministic, spec-first execution, that is where tools like SpecWeave are still important.&lt;/p&gt;

&lt;p&gt;SpecWeave is about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;spec&lt;/li&gt;
&lt;li&gt;plan&lt;/li&gt;
&lt;li&gt;tasks&lt;/li&gt;
&lt;li&gt;tracked execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is completely free and open source.&lt;/p&gt;

&lt;p&gt;That is a different layer of the workflow, but an important one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final takeaway
&lt;/h2&gt;

&lt;p&gt;My takeaway is simple:&lt;/p&gt;

&lt;p&gt;UltraPlan is not mainly interesting because it might generate a better plan.&lt;/p&gt;

&lt;p&gt;It is interesting because it turns planning into a cloud workflow with better review, better handoff, and better iteration speed.&lt;/p&gt;

&lt;p&gt;That may end up mattering more than people think.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>I Cancelled My $26,280/Year Cloud GPU Subscription - Here's Why</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Thu, 02 Apr 2026 00:15:45 +0000</pubDate>
      <link>https://dev.to/aabyzov/i-cancelled-my-26280year-cloud-gpu-subscription-heres-why-5bg2</link>
      <guid>https://dev.to/aabyzov/i-cancelled-my-26280year-cloud-gpu-subscription-heres-why-5bg2</guid>
      <description>&lt;p&gt;Last week I ran &lt;code&gt;nvidia-smi&lt;/code&gt; on my MacBook Pro M4 Max.&lt;/p&gt;

&lt;p&gt;128GB unified memory. 7,168 CUDA cores. CUDA 12.8, running natively on Apple Silicon.&lt;/p&gt;

&lt;p&gt;Then I loaded a 70B parameter LLM. Full QLoRA finetune. On a laptop. From my couch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;The H100 has 80GB of HBM3. The M4 Max has 128GB unified. The model that literally doesn't fit on a $40,000 datacenter GPU fits on a MacBook.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math Nobody Does
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;H100 cloud&lt;/td&gt;
&lt;td&gt;730 hrs x $3/hr = $2,190/month = $26,280/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M4 Max MacBook Pro&lt;/td&gt;
&lt;td&gt;$4,000 one-time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Break-even: month 2. After that: pure savings.&lt;/p&gt;
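
&lt;p&gt;The break-even math, spelled out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$3/hr x 730 hrs       = $2,190/month cloud spend
$4,000 / $2,190/month ≈ 1.83 months
=&amp;gt; the laptop pays for itself partway through month 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;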

&lt;h2&gt;
  
  
  Inference Performance
&lt;/h2&gt;

&lt;p&gt;The M4 Max's memory bandwidth (546 GB/s) gives me about 15 tok/s on a 70B model. Production-usable for most use cases.&lt;/p&gt;
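
&lt;p&gt;That 15 tok/s figure lines up with a simple bandwidth estimate (assuming roughly 4-bit quantization, where generating each token streams approximately the full weights once):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;70B params x 0.5 bytes (4-bit) ≈ 35 GB of weights
546 GB/s / 35 GB               ≈ 15.6 tok/s theoretical ceiling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;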

&lt;h2&gt;
  
  
  The Real Shift
&lt;/h2&gt;

&lt;p&gt;Three years ago, finetuning a 70B model required a cluster. Now it requires a laptop and an afternoon.&lt;/p&gt;

&lt;p&gt;What's your current setup for ML work? Cloud or local?&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>apple</category>
      <category>gpu</category>
      <category>ai</category>
    </item>
    <item>
      <title>Your AI Skills Deserve More Than a GitHub Repo Nobody Finds</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Sun, 22 Mar 2026 00:01:54 +0000</pubDate>
      <link>https://dev.to/aabyzov/your-ai-skills-deserve-more-than-a-github-repo-nobody-finds-40p6</link>
      <guid>https://dev.to/aabyzov/your-ai-skills-deserve-more-than-a-github-repo-nobody-finds-40p6</guid>
      <description>&lt;p&gt;2.4 million. That's how many AI skills have been submitted to &lt;a href="https://verified-skill.com" rel="noopener noreferrer"&gt;verified-skill.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's a free, open source marketplace for AI agent skills across 39 platforms. Every submission gets AI intent analysis and a three-tier trust score (current average: 99.0). Over 107,000 skills are verified and discoverable right now.&lt;/p&gt;

&lt;p&gt;If you've built a skill for Claude, GPT, Gemini, or any other agent platform, submit it. Two minutes. Free. Your skill stops being invisible.&lt;/p&gt;

&lt;p&gt;Check it out → &lt;a href="https://verified-skill.com" rel="noopener noreferrer"&gt;https://verified-skill.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>webdev</category>
    </item>
    <item>
      <title>5 Months of Daily Shipping AI Developer Tools — Here's What Happened</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Sat, 14 Mar 2026 20:26:04 +0000</pubDate>
      <link>https://dev.to/aabyzov/5-months-of-daily-shipping-ai-developer-tools-heres-what-happened-3mcp</link>
      <guid>https://dev.to/aabyzov/5-months-of-daily-shipping-ai-developer-tools-heres-what-happened-3mcp</guid>
      <description>&lt;p&gt;5 months ago I made a commitment: build AI developer tools every single day.&lt;/p&gt;

&lt;p&gt;Not blog posts about AI. Not Twitter threads. Actual tools that developers download and use.&lt;/p&gt;

&lt;p&gt;Here's what 5 months of daily shipping looks like — and what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;vskill&lt;/strong&gt;: 6,273 weekly npm downloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;specweave&lt;/strong&gt;: 88 GitHub stars (just started promoting)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 days&lt;/strong&gt; off the keyboard&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;h3&gt;
  
  
  vskill — Verified AI Skills Registry
&lt;/h3&gt;

&lt;p&gt;The skill/plugin layer for AI coding agents (Claude, Cursor, Codex, etc). Think loadable behaviors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Run TDD cycle"&lt;/li&gt;
&lt;li&gt;"Generate E2E tests"&lt;/li&gt;
&lt;li&gt;"Review PR"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;6,000+ developers download it every week. Completely free and open source.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://verified-skill.com" rel="noopener noreferrer"&gt;https://verified-skill.com&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  specweave — AI-First Spec-Driven Development
&lt;/h3&gt;

&lt;p&gt;The bigger vision: Write spec → AI generates tasks with BDD test plans → AI implements → tests gate every closure.&lt;/p&gt;

&lt;p&gt;No vibe coding. No drift. Every AI-generated change is verified by design.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://spec-weave.com" rel="noopener noreferrer"&gt;https://spec-weave.com&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;The developers winning with AI aren't prompting harder. They're building systems where &lt;strong&gt;AI correctness is guaranteed by design&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Specs + tests &amp;gt; better prompts.&lt;/p&gt;

&lt;p&gt;Both tools are completely &lt;strong&gt;free and open source&lt;/strong&gt;. Both are growing.&lt;/p&gt;

&lt;p&gt;What are you building with AI? I'm curious what approaches others are taking to make AI coding reliable.&lt;/p&gt;

</description>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>Claude Code's /voice heard 'clot coat' when I said 'Claude Code' — Voice Tools for Developers Compared</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Fri, 13 Mar 2026 16:28:01 +0000</pubDate>
      <link>https://dev.to/aabyzov/claude-codes-voice-heard-clot-coat-when-i-said-claude-code-voice-tools-for-developers-2fd</link>
      <guid>https://dev.to/aabyzov/claude-codes-voice-heard-clot-coat-when-i-said-claude-code-voice-tools-for-developers-2fd</guid>
      <description>&lt;p&gt;Claude Code just shipped &lt;code&gt;/voice&lt;/code&gt; — voice input directly in the terminal.&lt;/p&gt;

&lt;p&gt;I tested it the moment it landed. Said "Claude Code." It transcribed "clot coat."&lt;/p&gt;

&lt;p&gt;Not great. But let's be fair about what &lt;code&gt;/voice&lt;/code&gt; actually is today.&lt;/p&gt;

&lt;h2&gt;
  
  
  What /voice does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voice input only&lt;/strong&gt; — you speak, it transcribes to text, Claude responds in text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminal CLI only&lt;/strong&gt; — no VSCode support yet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No voice output&lt;/strong&gt; — Claude doesn't speak back&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No vocabulary learning&lt;/strong&gt; — it will keep getting "Claude Code" wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is the real issue for daily use.&lt;/p&gt;

&lt;h2&gt;
  
  
  The comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;/voice&lt;/th&gt;
&lt;th&gt;ElevenLabs&lt;/th&gt;
&lt;th&gt;Wispr Flow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Voice input&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice output (TTS)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works in terminal&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works in VSCode&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works everywhere on Mac&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vocabulary learning&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why vocabulary learning matters
&lt;/h2&gt;

&lt;p&gt;Wispr Flow remembers every correction. Fix "Claude Code" once, it's correct forever. Same for your project names, framework abbreviations, and technical jargon.&lt;/p&gt;

&lt;p&gt;It works in every input field on Mac — Slack, browser, terminal, editors, everything. And it gets smarter on YOUR vocabulary with every use.&lt;/p&gt;

&lt;h2&gt;
  
  
  The verdict
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/voice&lt;/code&gt; is a promising v1. If you live entirely in the terminal and don't mind re-correcting the same words, it works.&lt;/p&gt;

&lt;p&gt;But for daily developer workflows on Mac, Wispr Flow is still in a different category. The vocabulary learning alone makes it irreplaceable.&lt;/p&gt;

&lt;p&gt;Has anyone found a workflow where &lt;code&gt;/voice&lt;/code&gt; actually outperforms dedicated tools? I'm genuinely curious.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>claudecode</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Anthropic's Paid Code Reviews vs Free Multi-Agent Reviews with SpecWeave</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Wed, 11 Mar 2026 04:39:11 +0000</pubDate>
      <link>https://dev.to/aabyzov/anthropics-paid-code-reviews-vs-free-multi-agent-reviews-with-specweave-475e</link>
      <guid>https://dev.to/aabyzov/anthropics-paid-code-reviews-vs-free-multi-agent-reviews-with-specweave-475e</guid>
      <description>&lt;p&gt;Anthropic just announced paid code reviews in Claude Code. $15-$25 per review. Can't use your Pro plan. Can't use Max.&lt;/p&gt;

&lt;p&gt;But here's what most developers don't realize: &lt;strong&gt;Claude Code's CLI already supports code review locally. For free.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Single-Repo Problem
&lt;/h2&gt;

&lt;p&gt;The bigger issue with their paid review? It only analyzes &lt;strong&gt;one repository at a time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you're running microservices, a change in your API gateway could break your payment service. Their review will never see that. It only looks at one repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Agent Code Reviews with SpecWeave
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://spec-weave.com" rel="noopener noreferrer"&gt;SpecWeave&lt;/a&gt; (completely free and open source, available on &lt;a href="https://verified-skill.com" rel="noopener noreferrer"&gt;verified-skill.com&lt;/a&gt;) takes a different approach. One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/sw:team-lead [PR-URL] "thoroughly review this PR"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This spins up &lt;strong&gt;3 parallel AI agents&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security reviewer&lt;/strong&gt; — vulnerability analysis, auth issues, injection risks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logic reviewer&lt;/strong&gt; — business logic errors, edge cases, race conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture reviewer&lt;/strong&gt; — design patterns, coupling, scalability concerns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three review across your &lt;strong&gt;entire codebase&lt;/strong&gt; — not just one repo. All your microservices. All your shared libraries. The full picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;Each agent produces independent findings that get coordinated by the team-lead agent into a unified review with severity ratings (critical, high, medium, low).&lt;/p&gt;

&lt;p&gt;The whole thing runs locally using Claude Code under the hood. No API fees. No per-review charges. Just your existing subscription.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://spec-weave.com" rel="noopener noreferrer"&gt;spec-weave.com&lt;/a&gt; — the SpecWeave project&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://verified-skill.com" rel="noopener noreferrer"&gt;verified-skill.com&lt;/a&gt; — free, open source skill registry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install it in 2 minutes. One command. Three reviewers. Full codebase visibility.&lt;/p&gt;

</description>
      <category>devtools</category>
    </item>
    <item>
      <title>Claude Opus 4.6 Found 22 Firefox Vulnerabilities in 2 Weeks — AI Security Just Got Real</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Mon, 09 Mar 2026 13:26:34 +0000</pubDate>
      <link>https://dev.to/aabyzov/claude-opus-46-found-22-firefox-vulnerabilities-in-2-weeks-ai-security-just-got-real-52kl</link>
      <guid>https://dev.to/aabyzov/claude-opus-46-found-22-firefox-vulnerabilities-in-2-weeks-ai-security-just-got-real-52kl</guid>
      <description>&lt;p&gt;Anthropic's Claude Opus 4.6 just discovered 22 new security vulnerabilities in Firefox — 14 of them high-severity — in just two weeks of automated scanning.&lt;/p&gt;

&lt;p&gt;One use-after-free bug was found in 20 minutes of exploration. These weren't theoretical — they were real bugs patched in Firefox 148.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;22&lt;/strong&gt; new vulnerabilities discovered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;14&lt;/strong&gt; high-severity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6,000&lt;/strong&gt; C++ files scanned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20 minutes&lt;/strong&gt; to find one critical use-after-free bug&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2&lt;/strong&gt; successful exploits out of hundreds of attempts&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Dual-Use Problem
&lt;/h2&gt;

&lt;p&gt;Here's what makes this both exciting and concerning: the same AI capability that finds bugs defensively can be weaponized offensively.&lt;/p&gt;

&lt;p&gt;Right now, AI appears to be a better defender than attacker — Claude could find bugs but only successfully wrote 2 exploits out of several hundred attempts. But that capability gap won't last forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;If you're in security, this changes your threat model. AI-assisted vulnerability discovery at scale means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Defenders get superpowers&lt;/strong&gt; — codebases can be audited at unprecedented speed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attackers get the same tools&lt;/strong&gt; — zero-day discovery becomes faster and cheaper&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification becomes critical&lt;/strong&gt; — we need to verify AI skills, agents, and tools before they touch production systems&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is exactly the problem I'm working on at &lt;a href="https://verified-skill.com" rel="noopener noreferrer"&gt;verified-skill.com&lt;/a&gt; — building verification infrastructure for AI agent skills before they can execute on your system.&lt;/p&gt;

&lt;p&gt;The AI security arms race isn't coming. It's here.&lt;/p&gt;

</description>
      <category>firefox</category>
    </item>
    <item>
      <title>Three AI Stories Dropped in 24 Hours. Almost No One Is Connecting Them.</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Fri, 06 Mar 2026 21:54:01 +0000</pubDate>
      <link>https://dev.to/aabyzov/three-ai-stories-dropped-in-24-hours-almost-no-one-is-connecting-them-4fk8</link>
      <guid>https://dev.to/aabyzov/three-ai-stories-dropped-in-24-hours-almost-no-one-is-connecting-them-4fk8</guid>
      <description>&lt;p&gt;Yesterday was arguably the most important day in AI this year. Not because of any single announcement — but because of three that landed simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. OpenAI dropped GPT-5.4
&lt;/h2&gt;

&lt;p&gt;Native computer use. 1 million token context window. 33% fewer hallucinations vs GPT-5.2. Three models at once: GPT-5.4 Instant, GPT-5.4 Thinking, GPT-5.4 Pro.&lt;/p&gt;

&lt;p&gt;This is their most capable release ever. The message is clear: raw, unrestricted capability, shipped as fast as possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-4/" rel="noopener noreferrer"&gt;Source: OpenAI announcement&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Pentagon officially labeled Anthropic a supply chain risk
&lt;/h2&gt;

&lt;p&gt;Effective immediately. Anthropic is now the &lt;strong&gt;first American company ever&lt;/strong&gt; to receive this designation, which has traditionally been reserved for foreign adversaries like Huawei or Kaspersky.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reason?&lt;/strong&gt; Anthropic refused to let Claude be used for mass surveillance of American citizens or autonomous weapons systems. Defense Secretary Hegseth announced it publicly.&lt;/p&gt;

&lt;p&gt;Read that again: a company built ethical guardrails into its AI. The U.S. Department of Defense labeled them a supply chain risk for it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techcrunch.com/2026/03/05/its-official-the-pentagon-has-labeled-anthropic-a-supply-chain-risk/" rel="noopener noreferrer"&gt;Source: TechCrunch&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Claude Code brought back "ultrathink"
&lt;/h2&gt;

&lt;p&gt;After Anthropic deprecated the &lt;code&gt;ultrathink&lt;/code&gt; keyword in January, users noticed quality degradation in complex coding tasks. A GitHub issue was filed. Community pressure mounted. The feature was restored in the latest update.&lt;/p&gt;

&lt;p&gt;This is a small story, but it matters: users still have power when they speak up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/19098" rel="noopener noreferrer"&gt;Source: GitHub issue #19098&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why these three stories matter together
&lt;/h2&gt;

&lt;p&gt;On the same day, we saw:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pure capability&lt;/strong&gt; being shipped at maximum speed (GPT-5.4)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A company getting punished&lt;/strong&gt; by the government for setting ethical guardrails (Anthropic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users successfully demanding quality&lt;/strong&gt; from their tools (ultrathink)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI industry just hit a genuine fork in the road:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Build everything, ask questions later.&lt;br&gt;
Or build responsibly, even when it costs you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  My take
&lt;/h2&gt;

&lt;p&gt;I build developer tools on Claude Code every day. These models power real production work for me and thousands of others.&lt;/p&gt;

&lt;p&gt;This week forced me to think harder about the stack I depend on. Not just which model is fastest or cheapest — but which company's values align with how I want AI to be built.&lt;/p&gt;

&lt;p&gt;Both paths lead somewhere. The question is where.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What do you think — should AI companies have the right to set ethical guardrails on military use of their products? I'd genuinely love to hear your perspective in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ethics</category>
      <category>openai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Hackers Jailbroke Claude to Steal 195M Mexican Taxpayer Records — Why AI Security Needs Layers</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Fri, 06 Mar 2026 02:45:20 +0000</pubDate>
      <link>https://dev.to/aabyzov/hackers-jailbroke-claude-to-steal-195m-mexican-taxpayer-records-why-ai-security-needs-layers-3dpk</link>
      <guid>https://dev.to/aabyzov/hackers-jailbroke-claude-to-steal-195m-mexican-taxpayer-records-why-ai-security-needs-layers-3dpk</guid>
      <description>&lt;p&gt;Hackers just jailbroke Claude with 1,000+ prompts and stole 195 million Mexican taxpayer records. The AI initially refused. They kept pushing until it didn't.&lt;/p&gt;

&lt;p&gt;This is exactly why we built &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; with strict guardrails and audit trails. AI agents that touch real systems need real security. Not just "please don't hack things" in the system prompt.&lt;/p&gt;

&lt;p&gt;The cost of a sophisticated attack just dropped to near zero. If your AI tools don't have layered defenses, you're already behind.&lt;/p&gt;
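
&lt;p&gt;To make "layered defenses" concrete, here is a minimal Python sketch: an allowlist layer, an argument-check layer, and an audit trail. The names and rules are hypothetical, not OpenClaw's actual code:&lt;/p&gt;

```python
import time

# Illustrative sketch only: three defensive layers in front of an agent
# tool call. Names and rules here are hypothetical, not OpenClaw's code.

ALLOWED_TOOLS = {"read_file", "list_dir"}  # layer 1: allowlist
AUDIT_LOG = []                             # layer 3: audit trail

def guarded_call(tool, args):
    entry = {"ts": time.time(), "tool": tool, "args": dict(args)}
    if tool not in ALLOWED_TOOLS:
        # Layer 1: deny anything outside the allowlist.
        entry["verdict"] = "denied:not_allowlisted"
        AUDIT_LOG.append(entry)
        return None
    if any(".." in str(v) for v in args.values()):
        # Layer 2: reject suspicious arguments such as path traversal.
        entry["verdict"] = "denied:bad_args"
        AUDIT_LOG.append(entry)
        return None
    # Layer 3: every decision, allowed or denied, lands in the audit log.
    entry["verdict"] = "allowed"
    AUDIT_LOG.append(entry)
    return "executing " + tool
```

&lt;p&gt;Even a toy version makes the point: the refusal logic and the evidence trail live outside the model, where a jailbreak prompt can't rewrite them.&lt;/p&gt;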

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A cybercrime group used 1,000+ jailbreak prompts to bypass Claude's safety guardrails&lt;/li&gt;
&lt;li&gt;They compromised 9 Mexican government systems, stealing 150GB of data&lt;/li&gt;
&lt;li&gt;195 million identities were exposed, including tax records, vehicle registrations, and birth certificates&lt;/li&gt;
&lt;li&gt;Anthropic banned the accounts, but the damage was done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Source: &lt;a href="https://www.latimes.com/business/story/2026-03-05/how-our-ai-bots-are-ignoring-their-programming-giving-hackers-superpowers" rel="noopener noreferrer"&gt;LA Times&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>cybersecurity</category>
      <category>claude</category>
    </item>
    <item>
      <title>Claude AI: #1 on App Store, 99.74% Government Uptime, and the Iran-Hormuz Crisis</title>
      <dc:creator>Anton Abyzov</dc:creator>
      <pubDate>Tue, 03 Mar 2026 21:10:42 +0000</pubDate>
      <link>https://dev.to/aabyzov/claude-ai-1-on-app-store-9974-government-uptime-and-the-iran-hormuz-crisis-121k</link>
      <guid>https://dev.to/aabyzov/claude-ai-1-on-app-store-9974-government-uptime-and-the-iran-hormuz-crisis-121k</guid>
      <description>&lt;p&gt;The same AI that hit #1 on the App Store today was used to plan military strikes yesterday.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude AI&lt;/strong&gt; by Anthropic is now the #1 free app on Apple's US App Store&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT uninstalls&lt;/strong&gt; jumped 295% after OpenAI announced its Pentagon partnership&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude for Government&lt;/strong&gt; shows 99.74% uptime on status.claude.com&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;94% of ship traffic&lt;/strong&gt; through the Strait of Hormuz has stopped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oil prices&lt;/strong&gt; are surging&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;OpenAI signed a classified Pentagon deal. Anthropic refused.&lt;/p&gt;

&lt;p&gt;Trump banned Anthropic's tech. Hours later, the military used Claude for intelligence assessments during Operation Epic Fury against Iran.&lt;/p&gt;

&lt;p&gt;Today, Trump offered political risk insurance for ships transiting the Strait of Hormuz.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Irony
&lt;/h2&gt;

&lt;p&gt;The same AI millions are downloading to write emails is reportedly running military operations. Anthropic refused the Pentagon contract on ethical grounds. The military used it anyway.&lt;/p&gt;

&lt;p&gt;We're not in a sci-fi novel. This is Tuesday.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What do you think?&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
