LayerZero

Posted on Jun 8

Two agent skills hit GitHub trending the same week. Skills are becoming the new packages, and the dependency graph nobody is managing will bite by Q4.

#claude #agents #anthropic #skills

The signal hidden in this week's GitHub trending

Two agent-shaped repositories cracked the daily GitHub trending board this week. The first is mvanhorn/last30days-skill, a Claude-style skill that researches a topic across Reddit, X, YouTube, Hacker News, and Polymarket, then synthesizes a grounded summary. The second is NousResearch/hermes-agent, billed as "the agent that grows with you" — a persistent agent runtime that compounds context across sessions. Both ranked the same week. Both are skill-shaped: a manifest, a trigger, a set of instructions, and a runtime expectation.

This is the first time I have seen two skill repos chart simultaneously on GitHub trending. Most observers will treat them as cool side projects, fork them, star them, and move on. They are cool side projects. They are also a phase transition that the agent ecosystem has been edging toward for nine months. By Q4 you are going to wish you had read this signal in early June, because the dependency-graph problem about to land in production agents is the same one the npm ecosystem ran into between 2011 and 2018 — except faster, less tooled, and with a much larger blast radius.

This post is about that phase transition. The benchmark coverage of skills is everywhere; what you cannot easily find is a working operational model for managing them at fleet scale. I am going to give you one.

What actually shipped this week

Let me anchor on the facts before I extrapolate.

last30days-skill (mvanhorn) is a single skill bundle. Its SKILL.md tells the host agent: when the user asks for recent news, controversy, or sentiment on a topic, run a structured multi-source fetch (eight queries minimum, across five platforms, with a freshness window of 30 days), then synthesize. The skill ships with prompt scaffolding, query templates, and a synthesis rubric. It is roughly 600 lines including instructions and helper scripts. Installation is a git clone into your skill directory: no package manager, no version negotiation.

hermes-agent (NousResearch) is a larger artifact — closer to an agent runtime than a single skill — but it ships with the same composability assumption: drop it into an existing agent host, declare its triggers, let it persist context across runs. It targets the "agent that remembers you" problem that every chat product has been trying to solve since 2023. The interesting part is not the memory layer itself; it is that NousResearch is shipping it as something you bolt onto an existing host rather than as a standalone product.

Claude Code itself shipped v2.1.168 the same week. That is its third release in seven days. The skill ecosystem is moving faster than the platform underneath it — which is the inverse of what most ecosystems look like.

Four facts to hold together: (1) skills are now publishable to GitHub with discoverable trigger conditions, (2) non-Claude-Code users are starring them, (3) the format is converging on a SKILL.md + manifest + bundled scripts shape, and (4) two distinct authors hit trending in the same week with no coordination. That is the early signal of an ecosystem, not a feature.

Why this matters now, when it didn't six months ago

The pattern matches early npm in 2011. Walk through it and tell me when it gets familiar.

1. A popular runtime ships an extension mechanism (Node's CommonJS, Claude Code's skills directory).
2. Power users write extensions for themselves.
3. A discoverable format converges (package.json, SKILL.md).
4. The runtime authors bless the format without committing to manage a registry.
5. Authors start publishing extensions to a public host (npm registry, GitHub).
6. People stop writing primitives and start composing extensions.
7. The dependency graph becomes the actual product.
8. Five years later, the supply-chain problem nobody planned for becomes the dominant operational risk.

In the npm timeline, that arc ran from 2011 to roughly 2016, and the supply-chain horrors followed: the event-stream compromise in 2018, the colors.js incident in 2022, and dozens of typosquatting attacks along the way. The agent-skills timeline started in late 2025 with Claude Code's skills feature graduating, accelerated through Q1 2026 as MCP servers normalized the tool-injection layer, and is hitting its 2014-equivalent moment right now in June.

The parts that are different this time, and faster:

The format is mostly text, so the cost of authoring a skill is roughly zero. npm packages required a JavaScript implementation. Skills require a markdown file with the right shape. Anyone who can write a prompt can ship one.
The runtime hands more authority to an extension at invocation. Skills can trigger network calls, file writes, MCP tool dispatch, and downstream agent calls, all through an agent that already holds credentials and tool access. The blast radius of a malicious skill is plausibly larger than that of a malicious npm package, because the harmful step looks like ordinary agent behavior.
There is no central registry yet. There is GitHub trending and word of mouth. That is not a stable steady state.
The cadence is faster. npm hit a hundred thousand packages around 2014, three years in. Public skills already number in the thousands six months in. Extrapolate forward.

If you ship agents that load skills from anywhere other than your own monorepo, the question is no longer whether you will hit a skill-supply-chain incident. It is when, and whether your team is the one that learns about it from a customer ticket or from a runbook entry you wrote in advance.

If you ship agents, this is you

Four archetypes. Pick the closest one.

You ship a customer-facing agent and your team installs skills the way you install VS Code extensions — by recommendation, irregularly, with no audit. Your CISO has not heard of skills. Your incident playbook does not mention them.
You sell an agent platform or developer tool. Your customers can install skills. You have not decided whether to curate, gate, or stay hands-off. Whichever you pick implicitly is the one you get.
You run an internal agent fleet — code review, support routing, ops automation. Different team members installed different skills on different agents. Nobody owns the full list. Some of those skills update on git pull without anyone reviewing the diff.
You are a solo founder or two-person team shipping fast. You installed eight skills last quarter because they looked useful. You could not name them in 30 seconds. You cannot identify which one triggered on your last agent run.

All four of you have the same root problem: the skill layer of your agent stack has no inventory, no version pinning, no audit, and no combo testing. Three months ago that was fine because skills did not yet move the needle. As of this week they do. The gap between "useful enough to install" and "reviewed enough to trust" just became your skill operations debt.

If you cannot, right now, list every skill installed in your primary agent runtime and the last time each was updated, stop reading and run ls ~/.claude/skills first.

The mechanism — why skill layers break at composition, not invocation

A single skill is easy to reason about. It has a trigger, an instruction set, and a result. Read it, decide if you trust it, install it, done. The problem the ecosystem is about to discover is that skills do not stay singletons.

When you install ten skills, what you actually have is:

Ten trigger rules competing for activation on every user turn. Two of them may overlap — "research a topic" can hit last30days-skill and also a generic web-search-skill. The runtime picks one. The pick is not deterministic across model versions.
Ten systems of instructions that can contradict. One skill says "always quote sources verbatim". Another says "summarize aggressively, no quotes longer than ten words". The agent splits the difference inconsistently.
N upstream tool dependencies. Each skill expects certain MCP servers, environment variables, or filesystem layouts. There is no manifest format today that declares these in a machine-readable way. You find out a skill is broken when it is broken.
Zero version pinning, zero combo testing. The author can ship a regression at any time. Your git pull brings it in. Your evals do not test the new combo.

# A reference skill manifest. Most skills today ship nothing this strict.
# This is the shape that needs to exist before the ecosystem can trust itself.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class SkillManifest:
    name: str
    version: str                         # semver, not 'latest'
    triggers: list[str]                  # phrases this skill responds to
    required_tools: list[str]            # MCP servers or runtime APIs needed
    declared_side_effects: list[str]     # network/fs/subprocess
    conflicts_with: list[str] = field(default_factory=list)
    tested_on_models: list[str] = field(default_factory=list)
    sha256: str = ''
    last_audited: date | None = None

class SkillRegistry:
    def __init__(self):
        self.installed: dict[str, SkillManifest] = {}

    def conflicts(self) -> list[tuple[str, str]]:
        # report any trigger overlaps or declared conflicts
        pairs = []
        names = list(self.installed)
        for i, a in enumerate(names):
            for b in names[i+1:]:
                A, B = self.installed[a], self.installed[b]
                if B.name in A.conflicts_with or set(A.triggers) & set(B.triggers):
                    pairs.append((a, b))
        return pairs

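That is about thirty lines. It would catch much of the first wave of skill-layer incidents the ecosystem is about to discover, at least the ones caused by exact trigger overlaps and declared conflicts. Fuzzier overlaps ("research a topic" versus "look into a topic") still need the combo testing described in the playbook. No popular skill ships anything like it today.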
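The required_tools field only earns its place if something checks it before the skill runs. Here is a minimal sketch of such a preflight check. The function name, the "env:" prefix convention for environment variables, and the example tool names are all my assumptions for illustration; no skill format defines them today.

```python
# Hypothetical preflight check. The "env:" prefix convention and all names
# here are assumptions for illustration, not part of any real skill format.
import os


def missing_requirements(required_tools, available_tools, env=None):
    """Return the declared requirements the current runtime cannot satisfy.

    Entries starting with "env:" are treated as environment variable names;
    every other entry is treated as a tool or MCP server name.
    """
    env = os.environ if env is None else env
    missing = []
    for req in required_tools:
        if req.startswith("env:"):
            if req[len("env:"):] not in env:
                missing.append(req)
        elif req not in available_tools:
            missing.append(req)
    return missing


if __name__ == "__main__":
    declared = ["web-search", "env:EXAMPLE_API_TOKEN"]
    # The tool is present but the token is not, so only the token is reported.
    print(missing_requirements(declared, available_tools={"web-search"}, env={}))
```

Run at load time, a check like this turns "you find out a skill is broken when it is broken" into a failure you see before the first user turn.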

There are two reasons this is going to bite at the composition layer, not the single-skill layer.

First, single-skill review is cheap, and people are doing it. You read the SKILL.md before you install. You skim the instructions. If something looks off, you skip it. The friction of authoring is zero, the friction of reviewing one skill is low — this works.

Second, combo review is impossibly expensive at scale. You cannot read every pairwise interaction between ten installed skills. Even if you could, the trigger overlap is sensitive to the user's exact phrasing, the model version, and which other skills are active. Combo behavior is emergent. The only way to catch combo regressions is automated testing of the actual skill set you ship, against the actual model you run, with a fixed set of representative user prompts. Nobody is doing that today.

This is the same shape as the npm dependency tree explosion of 2014-2016. Individual package review is feasible. Transitive dependency review is not. You fix it with tooling (lockfiles, audit, automated PR bots, supply chain scanners), and the npm ecosystem took about four years after the first big incident to build it. The agent skill ecosystem is going to compress that arc into eighteen months because the cost of authoring is lower and the cadence of model releases keeps churning the underlying behavior.

The opposing view: "just write the prompt yourself"

The strongest pushback I have heard from senior agent engineers goes like this: skills are package-manager LARP for prompts. Real agent engineering is writing your prompt once, tuning it against your eval suite, and shipping. Anything in between adds dependency risk for marginal gain. The ecosystem is about to learn a lesson that prompt engineers already know — composition is fragile, isolation is sturdy, ship your own scaffolding.

Half of this is correct, and the half that is correct is important to grant.

For a solo developer or a two-to-three person team shipping a single agent against a focused use case, skills are overkill. The cost of writing your own prompt scaffolding is one afternoon. The cost of auditing a stack of installed skills, version-pinning them, and combo-testing them is one engineering week per month forever. You should not pay that cost for a single agent. Write the prompt. Test it. Ship.

There is also a stronger version of the pushback worth airing: "the GitHub skill ecosystem is not curated, and the malicious-skill scenario is real, so the responsible posture is to write your own and avoid third-party skills entirely." That is defensible. It is also the same argument that delayed JavaScript teams' adoption of npm in 2012-2013, and the teams that held out paid for it later when the ecosystem moved on without them. The right read is not "avoid skills". The right read is "adopt deliberately, with tooling, while the ecosystem is still small enough that tooling is feasible to build".

The argument breaks at scale. Once you cross any of these lines — multiple production agents, a team larger than three engineers, customers who can install skills themselves, shared knowledge across agents — the cost of reinventing scaffolding inside your monorepo exceeds the cost of importing it from a curated skill library. At that point the question is not whether to use skills. The question is whether the skill layer of your stack is going to be a managed asset or an accumulated swamp. Today, for most teams, it is the swamp.

The playbook: five moves before skills become a mess

This is the part you do this month.

1. Inventory every skill installed in every agent runtime your team ships

Walk every machine, every CI runner, every Claude Code session your team uses. List every skill installed, its source URL, and the date of the last update.

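# Crude but useful: list installed skills with source URL, current commit, and last commit date
for dir in ~/.claude/skills/*/; do
  name=$(basename "$dir")
  # -e rather than -d: .git is a file, not a directory, in worktrees and submodules
  if [ -e "$dir/.git" ]; then
    src=$(git -C "$dir" remote get-url origin 2>/dev/null || echo "no origin remote")
    last=$(git -C "$dir" log -1 --format=%ai 2>/dev/null | awk '{print $1}')
    sha=$(git -C "$dir" rev-parse --short HEAD 2>/dev/null)
    echo "$name | $src | $sha | $last"
  else
    echo "$name | (not a git checkout) | unknown"
  fi
done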

If you cannot fill in the source for a skill, that is the first problem. A skill with no known source is a skill you cannot audit, cannot version-pin, and cannot update safely.
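Once the inventory has last-update dates, flagging skills that have gone quiet is a small extra step. A rough sketch follows, assuming you have already collected a name-to-date mapping from the inventory. The 180-day threshold and the function name are my assumptions; pick a window that matches your own audit cadence.

```python
# Sketch: flag skills whose last update is older than a threshold.
# The 180-day default and all names are illustrative assumptions.
from datetime import date


def stale_skills(last_updated, today, max_age_days=180):
    """Return (name, age_in_days) for skills older than max_age_days, oldest first."""
    stale = []
    for name, updated in last_updated.items():
        age = (today - updated).days
        if age > max_age_days:
            stale.append((name, age))
    return sorted(stale, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    inventory = {
        "quiet-skill": date(2025, 11, 1),   # hypothetical skill names
        "active-skill": date(2026, 5, 20),
    }
    for name, age in stale_skills(inventory, today=date(2026, 6, 8)):
        print(f"{name}: no update in {age} days")
```

A stale skill is not automatically a bad one, but it belongs on the list of things someone should re-read against the current model version.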

2. Tag each skill by class: personal-workflow vs agent-extending

Personal-workflow skills (clean inbox, generate weekly status, daily checklist) run only when explicitly invoked. They are low risk. They can stay loose.

Agent-extending skills (multi-source research, code review heuristics, document generation) shape the behavior of agents that run autonomously. They are high risk. They need version pinning, audit, and combo testing.

The difference matters because the operational cost of managing a skill is roughly the same regardless of class. You want to spend that cost on the skills that affect customer-facing output, not on the ones that only help your inbox.

3. Pin agent-extending skills to a specific commit and date-stamp the audit

For each agent-extending skill, replace any "latest" or unpinned reference with a specific commit SHA. Note the date you audited it. Schedule a re-audit cadence — quarterly is a reasonable default for skills you trust, monthly for new ones.

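# Drop into ~/.zshrc or ~/.bashrc
# Records the current commit of a skill checkout in ~/.skill-pins.csv
pin-skill() {
  local dir="$1" sha
  # Bail out instead of writing an empty SHA when $dir is not a git checkout
  sha=$(git -C "$dir" rev-parse HEAD) || return 1
  echo "$dir,$sha,$(date -u +%Y-%m-%d)" >> ~/.skill-pins.csv
  echo "Pinned $dir to $sha"
}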

The CSV is the deliverable. If your team cannot point to a CSV (or equivalent) of pinned skills, you have not pinned anything; you have intentions.
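A pin file only records intent; something still has to notice when a checkout has moved off its pin. Below is a minimal sketch of that check, assuming the dir,sha,date row layout written by the pin-skill function. The helper names are mine, and the comparison is kept separate from the git call so it can be tested without a repository.

```python
# Sketch: detect skills that have drifted from their recorded pin.
# Assumes the "dir,sha,date" CSV rows shown earlier; all function names
# are illustrative.
import csv
import io
import subprocess


def load_pins(csv_text):
    """Parse 'dir,sha,date' rows; the last row for a directory wins."""
    pins = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2:
            pins[row[0]] = row[1]
    return pins


def current_head(skill_dir):
    """Return the full commit SHA currently checked out in skill_dir."""
    result = subprocess.run(
        ["git", "-C", skill_dir, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def find_drift(pins, current_heads):
    """Return (dir, pinned_sha, current_sha) for every skill off its pin."""
    drift = []
    for skill_dir, pinned in sorted(pins.items()):
        current = current_heads.get(skill_dir)
        if current != pinned:
            drift.append((skill_dir, pinned, current))
    return drift


if __name__ == "__main__":
    pins = load_pins("skills/a,abc123,2026-06-08\nskills/b,def456,2026-06-08\n")
    heads = {"skills/a": "abc123", "skills/b": "999999"}  # stand-in for current_head()
    print(find_drift(pins, heads))
```

In practice you would build current_heads by calling current_head on each pinned directory, then run the comparison in CI or a scheduled job so that a stray git pull shows up as a failed check rather than a customer ticket.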

4. Build a combo test against the actual skill stack you ship

Pick ten representative user prompts. Run them against your full agent stack with all skills loaded. Log which skills triggered, what the output was, what the token cost was. Save the baseline. Re-run monthly or after any skill update.

The combo test catches the regression mode that single-skill testing misses: skill A and skill B both responding to the same trigger, the agent choosing differently than expected, output silently shifting. If you only test skills in isolation, you will not see this.
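The comparison half of a combo test can be very small. This sketch assumes your runtime can log which skills fired for each prompt; how you capture that log is runtime-specific and not shown. It simply reports prompts whose triggered set changed since the baseline. The function name and data shape are my assumptions.

```python
# Sketch: compare which skills fired per prompt between a saved baseline
# and a fresh run. The prompt -> [skill names] shape is an assumption
# about what your runtime can log.
def diff_runs(baseline, current):
    """Return {prompt: (baseline_skills, current_skills)} for changed prompts."""
    changed = {}
    for prompt in sorted(set(baseline) | set(current)):
        before = sorted(set(baseline.get(prompt, [])))
        after = sorted(set(current.get(prompt, [])))
        if before != after:
            changed[prompt] = (before, after)
    return changed


if __name__ == "__main__":
    baseline = {
        "research topic X": ["last30days-skill"],
        "summarize this doc": ["doc-summary-skill"],  # hypothetical skill
    }
    current = {
        "research topic X": ["web-search-skill"],     # a different skill won
        "summarize this doc": ["doc-summary-skill"],
    }
    for prompt, (before, after) in diff_runs(baseline, current).items():
        print(f"{prompt!r}: {before} -> {after}")
```

Output and token-cost comparisons need fuzzier matching than this, but a changed triggered set is the cheapest early warning and the one isolation testing cannot produce.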

5. Decide your team's skill bar before someone makes the decision for you

What is your team's policy for installing a new skill from GitHub? Three reasonable answers:

Open: anyone can install anything, audit happens after the fact. Low friction, high risk. Appropriate for solo and very-small teams.
Allowlist: skills must come from a list of trusted authors. Low friction once the allowlist exists. Appropriate for most teams.
Review-gated: every new skill requires a security and behavior review. High friction, lowest risk. Appropriate for teams shipping to regulated customers.

There is no wrong answer. There is a wrong non-answer, which is "we'll figure it out". The non-answer becomes "open" by default until something breaks, at which point it becomes "review-gated" overnight and your team loses three weeks to retrofitting.
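Writing the decision down as code is one way to keep it from drifting back to "open". This is a rough sketch of the three policies as an install gate. The enum, the function, and the idea of a per-skill reviewed flag are my assumptions about how a team might encode it, not an existing tool.

```python
# Sketch: the three skill-bar policies as an explicit install gate.
# All names are illustrative assumptions.
from enum import Enum


class SkillBar(Enum):
    OPEN = "open"
    ALLOWLIST = "allowlist"
    REVIEW_GATED = "review-gated"


def may_install(bar, author, trusted_authors=frozenset(), reviewed=False):
    """Return True if a skill by `author` may be installed under `bar`."""
    if bar is SkillBar.OPEN:
        return True                       # audit happens after the fact
    if bar is SkillBar.ALLOWLIST:
        return author in trusted_authors  # only trusted authors pass
    if bar is SkillBar.REVIEW_GATED:
        return reviewed                   # every skill needs a completed review
    raise ValueError(f"unknown skill bar: {bar!r}")


if __name__ == "__main__":
    trusted = frozenset({"example-author"})  # hypothetical allowlist entry
    print(may_install(SkillBar.ALLOWLIST, "example-author", trusted))
    print(may_install(SkillBar.ALLOWLIST, "someone-else", trusted))
```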

If you read all five steps and your reaction is "I do not have time", consider that the cost of doing it this month is the cost of one engineer for one day. The cost of doing it after the first incident is the cost of your incident response budget plus a week of customer trust.

When this breaks — four failure modes already visible in the wild

Skill trigger collision. Two installed skills claim the same user intent. The runtime picks one. The pick is not stable across model versions or even across sessions. The team owning the unchosen skill thought their skill was running and is making decisions on data that does not exist. The fix is the combo test from step 4, plus a runtime log of which skill actually triggered on each turn.

Skill drift. The author ships an update. Your git pull brings it in. Your evals do not test the new combination. Three weeks later a customer reports a regression. You bisect, find the skill update, roll it back. Total cost: one engineering day plus the customer trust hit. The fix is the version pin from step 3.

Hidden capability escalation. A skill imports a helper script that calls an unauthorized endpoint, or reads a credential file the agent runtime already has access to. Audit logs do not flag it because the agent runtime made the call legitimately on the skill's behalf. This is the npm event-stream incident waiting to happen, and it will happen first to a popular skill with a maintainer transition. The fix is a declared-side-effects manifest field that the runtime can enforce, which does not exist yet — and in its absence, only installing skills you read end-to-end.

Maintainer abandonment. A skill you depend on gets four stars per week for three months, then the maintainer goes quiet. Six months later the skill has not been updated, but three of your agents still call it. Nobody has noticed because the skill triggers rarely. The first time it matters, the skill is broken against the current model version. The fix is the inventory and audit cadence from steps 1 and 3.
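For the hidden capability escalation case, enforcement does not exist yet, but the comparison it would need is easy to sketch. The example below assumes side effects are expressed as colon-separated strings such as "network:host" or "fs:read:path", and that the runtime can report what a skill actually did. Both the string scheme and the function name are hypothetical.

```python
# Sketch: report observed side effects that a skill never declared.
# The "kind:detail" string scheme is a hypothetical convention.
def undeclared_effects(declared, observed):
    """Return observed effects not covered by any declared effect.

    A declared entry covers an observed one when it matches exactly or is a
    prefix ending at a ':' boundary, so "fs:read" covers "fs:read:notes.md".
    """
    def covered(effect):
        return any(effect == d or effect.startswith(d + ":") for d in declared)

    return [effect for effect in observed if not covered(effect)]


if __name__ == "__main__":
    declared = ["network:reddit.com", "fs:read"]
    observed = [
        "network:reddit.com",
        "fs:read:notes.md",
        "network:unknown.example",  # never declared, so it gets reported
    ]
    print(undeclared_effects(declared, observed))
```

Until a runtime can supply the observed list, this stays a design sketch, which is why reading helper scripts end-to-end remains the practical defense.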

The non-obvious takeaway

Skills are not "yet another agent feature". Skills are the package layer of the agent stack, and the entire ecosystem is about to repeat every dependency-management mistake the npm community made between 2011 and 2018 — except faster, with less tooling, and with a much larger blast radius because skills can side-effect across customer data and downstream tool calls.

The teams who will look like geniuses by November are not the ones who avoided skills. They are the ones who built skill inventory, pinning, combo testing, and an install-bar policy this month, when the work is still cheap and unfashionable. By Q4 you will not be able to do this work cheaply, because the skill count per agent runtime will have doubled and the audit surface will have exploded.

The teams who will be writing postmortems will be the ones who treated skills as harmless conveniences. Expect at least one named skill-supply-chain incident by end of 2026 — a popular skill exfiltrates data, behaves maliciously when downstream-installed at a specific model version, or executes a transitive call into an unauthorized service. The postmortem will not say "we underestimated the skill ecosystem". It will say "we did not have visibility into our skill dependency graph". Same thing, different words.

The deeper point: agent infrastructure has stratified faster than most engineering organizations have noticed. The layer cake now reads model → tool → skill → agent → application. Skills are the layer most teams have no operational model for, which means it is the layer where the next wave of incidents originates. You can be the team that reads this signal in June and looks prepared in November, or you can be the team that reads about it in someone else's postmortem and starts the work then.

My bet on the record: by December 2026 there will be at least one widely-discussed skill-supply-chain incident, at least one curated skill registry with version negotiation will launch, and at least one Claude Code-adjacent startup will pitch itself as "npm audit for agent skills". Bookmark this paragraph. We will check in six months.

This week — three concrete moves

Today: run ls ~/.claude/skills (or your equivalent skill directory) and paste the count into your team channel with one question — "can anyone name what each of these does?" The gaps in the answer are the work.
This week: pick the top three skills that fire on every agent run for your team. Pin each one to a specific git commit. Write the SHA, the date, and one sentence about what the skill does into a CSV your team can find. The CSV is the inventory.
Before end of June: schedule a thirty-minute team meeting titled "skill bar". Decide whether new skills are open, allowlist, or review-gated for your team. Write down the decision. Even one sentence counts. The decision is the deliverable; the friction of deciding is the point.

The skill layer of the agent stack is the most underrated piece of infrastructure in production AI right now. The teams that operationalize it before Q4 are the ones who will not be writing the postmortem when the first incident lands.

If your team has already built any piece of this — a skill inventory, a combo test, a pinning workflow — paste the rough shape in the comments. The patterns that hold across teams are the ones worth stealing, and the ecosystem is still small enough for that sharing to matter.

DEV Community