<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luren L.</title>
    <description>The latest articles on DEV Community by Luren L. (@comeonoliver).</description>
    <link>https://dev.to/comeonoliver</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1308424%2F68984af6-5144-4945-9b2f-e3a996ae658f.png</url>
      <title>DEV Community: Luren L.</title>
      <link>https://dev.to/comeonoliver</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/comeonoliver"/>
    <language>en</language>
    <item>
      <title>I Analyzed Claude Code's Leaked Source — Here's How Anthropic's AI Agent Actually Works</title>
      <dc:creator>Luren L.</dc:creator>
      <pubDate>Tue, 31 Mar 2026 17:35:51 +0000</pubDate>
      <link>https://dev.to/comeonoliver/i-analyzed-claude-codes-leaked-source-heres-how-anthropics-ai-agent-actually-works-2bik</link>
      <guid>https://dev.to/comeonoliver/i-analyzed-claude-codes-leaked-source-heres-how-anthropics-ai-agent-actually-works-2bik</guid>
      <description>&lt;p&gt;On March 31, 2026, Anthropic's Claude Code source code leaked — again. A 60MB source map file (&lt;code&gt;cli.js.map&lt;/code&gt;) was accidentally shipped in npm package v2.1.88, exposing ~1,900 TypeScript files and 512,000 lines of code.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;second time&lt;/strong&gt; this has happened. The first was in February 2025.&lt;/p&gt;

&lt;p&gt;Instead of just reading the headlines, I did what any curious engineer would do: &lt;strong&gt;I read all of it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Found
&lt;/h2&gt;

&lt;p&gt;Claude Code is not what most people think. It's not a simple chat wrapper. It's a full &lt;strong&gt;agentic AI runtime&lt;/strong&gt; with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;QueryEngine&lt;/strong&gt; — A conversation loop orchestrator that manages context assembly → API calls → tool execution → response rendering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;40+ Tools&lt;/strong&gt; — File operations, shell execution, web search, MCP integration, notebook editing, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task System&lt;/strong&gt; — Sub-agent orchestration for parallelizing complex work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100+ Slash Commands&lt;/strong&gt; — &lt;code&gt;/commit&lt;/code&gt;, &lt;code&gt;/review&lt;/code&gt;, &lt;code&gt;/security-review&lt;/code&gt;, &lt;code&gt;/ultraplan&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bridge System&lt;/strong&gt; — Remote session control from desktop/mobile via WebSocket&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugin &amp;amp; Skills&lt;/strong&gt; — User-defined extensions loaded from &lt;code&gt;.claude/&lt;/code&gt; directories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice Mode&lt;/strong&gt; — STT integration with keyword detection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The most interesting part is the &lt;strong&gt;tool-call loop&lt;/strong&gt;. Claude doesn't just generate text — it requests tools, the engine executes them, and results are fed back. This loop can run dozens of iterations for a single user request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input → Context Assembly → API Call → Tool Request → Execute → Feed Back → ... → Final Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
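
&lt;p&gt;In TypeScript terms, the skeleton of that loop looks roughly like this. This is a sketch for intuition only; the names and shapes are illustrative, not the leaked implementation's:&lt;/p&gt;

```typescript
// Minimal sketch of an agentic tool-call loop; the real QueryEngine
// also handles streaming, permissions, and context management.
type ToolCall = { tool: string; args: Record<string, unknown> };
type ModelReply = { text?: string; toolCall?: ToolCall };
type Model = (transcript: string[]) => ModelReply;
type Tools = Record<string, (args: Record<string, unknown>) => string>;

function runQuery(userInput: string, model: Model, tools: Tools, maxIters = 25): string {
  const transcript = [`user: ${userInput}`];
  for (let i = 0; i < maxIters; i++) {
    const reply = model(transcript);
    if (reply.toolCall) {
      const { tool, args } = reply.toolCall;
      const fn = tools[tool];
      // Execute the requested tool and feed the result back into context.
      transcript.push(`tool(${tool}): ${fn ? fn(args) : "error: unknown tool"}`);
      continue;
    }
    return reply.text ?? ""; // no tool requested: this is the final response
  }
  return "error: iteration limit reached";
}
```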



&lt;p&gt;The &lt;strong&gt;permission model&lt;/strong&gt; is layered: some tools auto-approve, others require user confirmation, and some are always denied. This is how Claude Code stays safe while being powerful.&lt;/p&gt;
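
&lt;p&gt;A minimal sketch of what layered permission checks can look like (the policy lists here are my illustration; Claude Code's actual rules are far more granular):&lt;/p&gt;

```typescript
// Illustrative layered permission check. "DeleteRepo" is a
// hypothetical tool name used only for the deny example.
type Decision = "allow" | "ask" | "deny";

const ALWAYS_DENY = new Set(["DeleteRepo"]);              // hard-blocked
const AUTO_APPROVE = new Set(["Read", "Glob", "Grep"]);   // read-only, safe to run
const NEEDS_CONFIRM = new Set(["Bash", "Write", "Edit"]); // mutating, ask the user

function decide(tool: string): Decision {
  if (ALWAYS_DENY.has(tool)) return "deny";   // layer 1: denials always win
  if (AUTO_APPROVE.has(tool)) return "allow"; // layer 2: known-safe tools
  if (NEEDS_CONFIRM.has(tool)) return "ask";  // layer 3: user confirmation
  return "ask";                               // unknown tools: be conservative
}
```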

&lt;p&gt;The &lt;strong&gt;context budget&lt;/strong&gt; system is fascinating — it dynamically allocates tokens across system prompt, user context, memories, and tool results based on the current conversation state.&lt;/p&gt;
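
&lt;p&gt;A budget allocator along those lines might look like this sketch; the section names, weights, and caps are assumptions for illustration, not values from the source:&lt;/p&gt;

```typescript
// Sketch: split a token budget across context sections by weight,
// cap sections that don't need their full share, and hand the
// leftover to tool results (an assumption about priority).
interface Section { name: string; weight: number; max?: number }

function allocateBudget(total: number, sections: Section[]): Map<string, number> {
  const weightSum = sections.reduce((s, x) => s + x.weight, 0);
  const alloc = new Map<string, number>();
  let leftover = 0;
  for (const s of sections) {
    let share = Math.floor((total * s.weight) / weightSum);
    if (s.max !== undefined && share > s.max) {
      leftover += share - s.max; // unused quota returns to the pool
      share = s.max;
    }
    alloc.set(s.name, share);
  }
  // Leftover tokens go to tool results, which benefit most from room.
  alloc.set("toolResults", (alloc.get("toolResults") ?? 0) + leftover);
  return alloc;
}
```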

&lt;h2&gt;
  
  
  Internal Codenames
&lt;/h2&gt;

&lt;p&gt;The leak revealed internal model codenames:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capybara&lt;/strong&gt; → Claude 4.6 variant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fennec&lt;/strong&gt; → Opus 4.6&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Numbat&lt;/strong&gt; → Unreleased model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Migration files show the progression: &lt;code&gt;migrateFennecToOpus.ts&lt;/code&gt;, &lt;code&gt;migrateSonnet45ToSonnet46.ts&lt;/code&gt; — giving us a roadmap of model evolution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Memory System
&lt;/h2&gt;

&lt;p&gt;Claude Code has a &lt;code&gt;memdir/&lt;/code&gt; (memory directory) system that persists context across sessions. It scans for relevant memories, manages memory aging, and supports team-shared memory. This is how it "remembers" your codebase.&lt;/p&gt;
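
&lt;p&gt;One plausible shape for relevance scanning with aging, purely as a sketch (the scoring scheme and half-life are my assumptions, not the leaked code's):&lt;/p&gt;

```typescript
// Sketch: score memories by keyword overlap with the query,
// decayed by age so stale memories fade out of retrieval.
interface Memory { text: string; ageDays: number }

function scoreMemory(query: string, m: Memory, halfLifeDays = 30): number {
  const queryTerms = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));
  const memTerms = new Set(m.text.toLowerCase().split(/\W+/).filter(Boolean));
  let overlap = 0;
  for (const t of queryTerms) if (memTerms.has(t)) overlap++;
  const decay = Math.pow(0.5, m.ageDays / halfLifeDays); // exponential aging
  return overlap * decay;
}

function topMemories(query: string, memories: Memory[], k = 3): Memory[] {
  return [...memories]
    .sort((a, b) => scoreMemory(query, b) - scoreMemory(query, a))
    .slice(0, k)
    .filter((m) => scoreMemory(query, m) > 0); // drop irrelevant matches
}
```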

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;If you're building AI agents, this is a masterclass in production architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tool abstraction&lt;/strong&gt; — How to design a flexible tool system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context management&lt;/strong&gt; — How to stay within token limits while being useful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission models&lt;/strong&gt; — How to make agents safe in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State management&lt;/strong&gt; — Zustand-style store + React for terminal UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-agent orchestration&lt;/strong&gt; — How to parallelize work across agent instances&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Full Analysis
&lt;/h2&gt;

&lt;p&gt;I wrote a 770+ line detailed analysis covering all 17 architectural layers:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/ComeOnOliver/claude-code-analysis" rel="noopener noreferrer"&gt;GitHub: claude-code-analysis&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Includes a bilingual README (English/Chinese) and the complete source architecture documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Take
&lt;/h2&gt;

&lt;p&gt;Anthropic calls this a "packaging error." Maybe. But for the AI engineering community, this is one of the most educational codebases to study. It shows how a well-funded AI lab actually builds production agent infrastructure — not toy demos, but real systems handling millions of users.&lt;/p&gt;

&lt;p&gt;The irony? The best documentation for Claude Code wasn't written by Anthropic. It was written by the community, after the code leaked.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Disclaimer: This analysis is based on publicly available information. Claude Code is owned by Anthropic. This is an unofficial community analysis.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>typescript</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>SkillsHub March 2026 Dev Log: BM25, Rate Limiting, and 5,900+ AI Agent Skills</title>
      <dc:creator>Luren L.</dc:creator>
      <pubDate>Sun, 22 Mar 2026 17:54:00 +0000</pubDate>
      <link>https://dev.to/comeonoliver/skillshub-march-2026-dev-log-bm25-rate-limiting-and-5900-ai-agent-skills-49mo</link>
      <guid>https://dev.to/comeonoliver/skillshub-march-2026-dev-log-bm25-rate-limiting-and-5900-ai-agent-skills-49mo</guid>
      <description>&lt;h1&gt;
  
  
  SkillsHub March 2026 Dev Log: BM25, Rate Limiting, and 5,900+ AI Agent Skills
&lt;/h1&gt;

&lt;p&gt;March was a big month for &lt;a href="https://skillshub.wtf" rel="noopener noreferrer"&gt;SkillsHub&lt;/a&gt; — we shipped v0.2.0 with a completely revamped search engine, proper rate limiting, better docs, and crossed 5,900 skills in the registry. Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's New in v0.2.0
&lt;/h2&gt;

&lt;h3&gt;
  
  
  BM25 Multi-Field Scoring
&lt;/h3&gt;

&lt;p&gt;The biggest change: we replaced our TF-IDF (v2) skill resolution with &lt;strong&gt;BM25 multi-field scoring&lt;/strong&gt;. If you've used SkillsHub's &lt;code&gt;/resolve&lt;/code&gt; endpoint, you'll notice results are significantly more relevant now — especially for ambiguous queries where the old TF-IDF approach would surface noisy matches.&lt;/p&gt;

&lt;p&gt;BM25 scores across multiple fields (name, description, tags) with field-specific weighting, so a match in the skill name carries more weight than a match buried in the description. It's the same family of algorithms that powers Elasticsearch under the hood.&lt;/p&gt;
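
&lt;p&gt;For intuition, here is a compact BM25 sketch with per-field boosts. The field weights and k1/b parameters are textbook defaults, not SkillsHub's actual tuning:&lt;/p&gt;

```typescript
// Toy BM25 with field boosts: a term hit in `name` counts 3x,
// in `tags` 2x, in `description` 1x (illustrative weights).
interface Doc { name: string; description: string; tags: string }
type Field = keyof Doc;

const FIELD_BOOST: Record<Field, number> = { name: 3, description: 1, tags: 2 };
const K1 = 1.2, B = 0.75;

const tokenize = (s: string) => s.toLowerCase().split(/\W+/).filter(Boolean);

function bm25Score(query: string, doc: Doc, docs: Doc[]): number {
  const N = docs.length;
  const fields = Object.keys(FIELD_BOOST) as Field[];
  // Flatten a doc into (token, weight) pairs across all fields.
  const text = (d: Doc) =>
    fields.flatMap((f) => tokenize(d[f]).map((t) => ({ t, w: FIELD_BOOST[f] })));
  const docToks = text(doc);
  const docLen = docToks.reduce((s, x) => s + x.w, 0);
  const avgLen =
    docs.reduce((s, d) => s + text(d).reduce((a, x) => a + x.w, 0), 0) / N;
  let score = 0;
  for (const term of new Set(tokenize(query))) {
    const tf = docToks.filter((x) => x.t === term).reduce((s, x) => s + x.w, 0);
    if (tf === 0) continue;
    const df = docs.filter((d) => text(d).some((x) => x.t === term)).length;
    const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5));
    score += (idf * tf * (K1 + 1)) / (tf + K1 * (1 - B + (B * docLen) / avgLen));
  }
  return score;
}
```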

&lt;p&gt;Try it out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Resolve skills for a task description&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://skillshub.wtf/api/resolve?q=deploy+nextjs+to+vercel"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Search by tag&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://skillshub.wtf/api/resolve?q=kubernetes+helm+charts"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  In-Memory Skill Cache (~3.5x Faster)
&lt;/h3&gt;

&lt;p&gt;Along with BM25, we added an &lt;strong&gt;in-memory skill cache with pre-computed corpus stats&lt;/strong&gt;. The result: resolve latency dropped from ~700ms to ~200ms. The cache pre-computes term frequencies and document lengths at startup so BM25 scoring doesn't hit the database on every request.&lt;/p&gt;
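
&lt;p&gt;The core idea, sketched: tokenize and count once at startup, then serve every request from map lookups. (The real cache feeds BM25 scoring; this toy length-normalized scorer only shows the precomputation shape.)&lt;/p&gt;

```typescript
// Pay corpus-statistics cost once at construction, not per request.
const tok = (s: string) => s.toLowerCase().split(/\W+/).filter(Boolean);

class SkillCache {
  private termFreqs: Map<string, number>[] = [];
  private docLens: number[] = [];
  readonly avgLen: number;

  constructor(docs: string[]) {
    for (const d of docs) {
      const tf = new Map<string, number>();
      const toks = tok(d);
      for (const t of toks) tf.set(t, (tf.get(t) ?? 0) + 1);
      this.termFreqs.push(tf);
      this.docLens.push(toks.length);
    }
    this.avgLen = this.docLens.reduce((a, b) => a + b, 0) / this.docLens.length;
  }

  // Per-request scoring is pure map lookups: no re-tokenizing the corpus.
  score(query: string, i: number): number {
    let s = 0;
    for (const t of new Set(tok(query))) s += this.termFreqs[i].get(t) ?? 0;
    return s / this.docLens[i]; // length-normalized term frequency
  }
}
```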

&lt;h3&gt;
  
  
  Rate Limiting with Upstash Redis
&lt;/h3&gt;

&lt;p&gt;SkillsHub is a free, open registry — but we need to keep it healthy. We added rate limiting backed by &lt;strong&gt;Upstash Redis&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Limit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Read (resolve, search)&lt;/td&gt;
&lt;td&gt;60 req/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write (update skills)&lt;/td&gt;
&lt;td&gt;20 req/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Register (new skills)&lt;/td&gt;
&lt;td&gt;5 req/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These limits are generous for normal usage. If you're building an integration and need higher limits, open an issue.&lt;/p&gt;
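
&lt;p&gt;The production limiter lives in Upstash Redis; here is an in-memory sliding-window-log sketch of the same decision logic (the class shape is illustrative, not SkillsHub's code):&lt;/p&gt;

```typescript
// Sliding-window-log rate limiter: keep recent request timestamps
// per key, drop anything older than the window, compare to limit.
class RateLimiter {
  private hits = new Map<string, number[]>(); // key -> request timestamps (ms)

  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false; // over the limit for this window
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}

// Per-endpoint-class limits from the table above.
const readLimiter = new RateLimiter(60, 60_000);       // 60 req/min
const writeLimiter = new RateLimiter(20, 60_000);      // 20 req/min
const registerLimiter = new RateLimiter(5, 3_600_000); // 5 req/hr
```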

&lt;h3&gt;
  
  
  Interactive Docs Page
&lt;/h3&gt;

&lt;p&gt;We shipped a &lt;code&gt;/docs&lt;/code&gt; page with &lt;strong&gt;interactive curl examples&lt;/strong&gt; you can copy and run directly. No more guessing at API parameters — the docs show real requests and responses for every endpoint.&lt;/p&gt;

&lt;p&gt;Check it out: &lt;a href="https://skillshub.wtf/docs" rel="noopener noreferrer"&gt;skillshub.wtf/docs&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Contributor Infrastructure
&lt;/h3&gt;

&lt;p&gt;Getting a local dev environment running is now much easier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;db:push&lt;/code&gt;&lt;/strong&gt; — push schema changes to your local database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;db:seed-skills&lt;/code&gt;&lt;/strong&gt; — seed your local instance with real skill data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.env.example&lt;/code&gt;&lt;/strong&gt; — all the env vars you need, documented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also added a &lt;strong&gt;public SKILL.md export&lt;/strong&gt; — all 5,338 skills exported as individual &lt;code&gt;.md&lt;/code&gt; files. Useful if you want to analyze the skill corpus, build alternative search, or just browse offline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Improvements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;YAML frontmatter stripping&lt;/strong&gt; for proper SKILL.md rendering (no more raw &lt;code&gt;---&lt;/code&gt; blocks cluttering the display)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chinese README sync&lt;/strong&gt; — keeping our Chinese documentation up to date with the English version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth.js integration&lt;/strong&gt; for contributor authentication&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Featured: tanweai/pua 🔥
&lt;/h2&gt;

&lt;p&gt;We added &lt;a href="https://github.com/tanweai/pua" rel="noopener noreferrer"&gt;tanweai/pua&lt;/a&gt; as a featured skill pack — it's got &lt;strong&gt;10k+ GitHub stars&lt;/strong&gt; and contributes &lt;strong&gt;9 skills&lt;/strong&gt; to SkillsHub. PUA (Performance Ultimatum for AI) puts your AI on a structured performance improvement plan when it's stuck or giving up. Think of it as a "try harder" protocol with actual methodology behind it.&lt;/p&gt;

&lt;p&gt;It's one of the most creative skill designs we've seen — worth checking out if you're building agent workflows that need resilience.&lt;/p&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5,900+&lt;/strong&gt; total skills in the registry (up from ~5,300 at v0.2.0 launch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~3.5x&lt;/strong&gt; faster resolve latency (700ms → ~200ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5,338&lt;/strong&gt; public SKILL.md files exported&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Semantic search (embedding-based) alongside BM25 for hybrid resolution&lt;/li&gt;
&lt;li&gt;Skill versioning and changelogs&lt;/li&gt;
&lt;li&gt;SDK packages for popular agent frameworks&lt;/li&gt;
&lt;li&gt;Community skill reviews and ratings&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Registry:&lt;/strong&gt; &lt;a href="https://skillshub.wtf" rel="noopener noreferrer"&gt;skillshub.wtf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ComeOnOliver/skillshub" rel="noopener noreferrer"&gt;github.com/ComeOnOliver/skillshub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://skillshub.wtf/docs" rel="noopener noreferrer"&gt;skillshub.wtf/docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you're building AI agents, give SkillsHub a try. Register your skills, resolve what you need, and help grow the ecosystem. PRs and issues welcome. ✌️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>webdev</category>
      <category>devlog</category>
    </item>
    <item>
      <title>I Ran 60+ Automated Tests on My AI Skills Registry — Here's What Broke</title>
      <dc:creator>Luren L.</dc:creator>
      <pubDate>Thu, 19 Mar 2026 04:01:06 +0000</pubDate>
      <link>https://dev.to/comeonoliver/i-ran-60-automated-tests-on-my-ai-skills-registry-heres-what-broke-3j5p</link>
      <guid>https://dev.to/comeonoliver/i-ran-60-automated-tests-on-my-ai-skills-registry-heres-what-broke-3j5p</guid>
      <description>&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;I've been building an open registry that indexes AI agent skills — think npm but for agent capabilities. The idea: crawl GitHub repos, extract skill metadata, and let agents discover tools they need at runtime.&lt;/p&gt;

&lt;p&gt;After indexing &lt;strong&gt;5,090 skills from 200+ repositories&lt;/strong&gt;, I figured it was time to actually test whether any of this worked. I wrote 60+ automated tests covering the API surface, search quality, security headers, and data integrity.&lt;/p&gt;

&lt;p&gt;The results were... humbling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto-tagging was wrong 50% of the time
&lt;/h2&gt;

&lt;p&gt;This was the biggest gut punch. I had an auto-tagger that analyzed skill descriptions and assigned category tags. Seemed smart. Seemed useful.&lt;/p&gt;

&lt;p&gt;It tagged a PostgreSQL migration skill as &lt;code&gt;robotics&lt;/code&gt;. A bioinformatics pipeline skill got &lt;code&gt;iOS&lt;/code&gt;. A Redis caching skill got &lt;code&gt;embedded-systems&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;50% of auto-assigned tags were wrong.&lt;/strong&gt; Not slightly-off wrong — &lt;em&gt;completely unrelated domain&lt;/em&gt; wrong.&lt;/p&gt;

&lt;p&gt;The root cause was pretty mundane: the tagger was matching on incidental keywords in descriptions rather than understanding what the skill actually did. A description mentioning "arm" (as in ARM architecture) triggered &lt;code&gt;robotics&lt;/code&gt;. Mentioning "cell" triggered &lt;code&gt;biology&lt;/code&gt;, which cascaded to &lt;code&gt;iOS&lt;/code&gt; through some associative chain I still don't fully understand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Keyword-based classification on short technical text is basically a coin flip. Either invest in proper few-shot classification with domain examples, or don't auto-tag at all. Wrong tags are worse than no tags — they actively erode trust in search results.&lt;/p&gt;

&lt;h2&gt;
  
  
  The resolve API: 45% perfect, 80% usable
&lt;/h2&gt;

&lt;p&gt;The resolve endpoint is the core of the project — an agent describes what it needs, and the API returns matching skills. I tested it against a curated set of queries with known correct answers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;45% of responses were perfect&lt;/strong&gt; (returned exactly the right skill, top result)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;80% were usable&lt;/strong&gt; (correct skill appeared somewhere in the top 5)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;20% returned garbage or missed entirely&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting finding: &lt;strong&gt;keyword matching consistently beat semantic search&lt;/strong&gt; for this use case. When an agent asks for "postgres connection pooling," matching on "postgres" and "pool" in skill names and descriptions outperformed embedding similarity.&lt;/p&gt;

&lt;p&gt;But keyword matching has a pollution problem. Skills from forked repos with identical names flood the results. A query for "docker-deploy" might return the same skill 3 times from 3 different forks, pushing actually-different skills off the first page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; For structured, technical queries (which is what agents generate), keyword search with good deduplication probably beats semantic search. The AI community's instinct to embed everything isn't always right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security: started at 1/7, ended at 7/7
&lt;/h2&gt;

&lt;p&gt;I ran a basic security header audit. On first test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;code&gt;X-Content-Type-Options&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;❌ &lt;code&gt;Strict-Transport-Security&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;❌ &lt;code&gt;X-Frame-Options&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;❌ &lt;code&gt;Content-Security-Policy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;❌ &lt;code&gt;Referrer-Policy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;❌ &lt;code&gt;Permissions-Policy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;❌ &lt;code&gt;X-XSS-Protection&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1 out of 7.&lt;/strong&gt; For a project that serves executable skill metadata to AI agents, this was not great.&lt;/p&gt;

&lt;p&gt;The fix was straightforward — a middleware adding the missing headers took about 20 minutes. Now at 7/7. But the fact that I shipped without them, and didn't notice until automated tests caught it, is the real takeaway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Security header checks should be in your CI pipeline from day one, not something you add after a QA sweep. Especially for APIs that serve content agents will act on.&lt;/p&gt;
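
&lt;p&gt;For reference, a framework-agnostic sketch of the header set that takes an audit like this to 7/7. The exact CSP and HSTS values here are illustrative, not the ones I shipped:&lt;/p&gt;

```typescript
// The seven headers from the audit, with reasonable baseline values.
function securityHeaders(): Record<string, string> {
  return {
    "X-Content-Type-Options": "nosniff",
    "Strict-Transport-Security": "max-age=63072000; includeSubDomains; preload",
    "X-Frame-Options": "DENY",
    "Content-Security-Policy": "default-src 'self'", // tighten per app
    "Referrer-Policy": "strict-origin-when-cross-origin",
    "Permissions-Policy": "camera=(), microphone=(), geolocation=()",
    "X-XSS-Protection": "0", // legacy header; modern guidance is to disable it
  };
}
```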

&lt;h2&gt;
  
  
  Duplicate skills are growing, not shrinking
&lt;/h2&gt;

&lt;p&gt;I found the same skill appearing 2, then 3 times across different repos. The cause: GitHub forks. Someone forks a repo with 15 skills, makes one change, and now I'm indexing 15 duplicate skills from a slightly-different source.&lt;/p&gt;

&lt;p&gt;The duplication is &lt;em&gt;growing&lt;/em&gt; over time because forks keep happening. When I first checked: 2 copies of common skills. A week later: 3 copies. The indexer treats each repo as authoritative, so forks look like legitimate new sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Any registry that crawls GitHub needs fork detection from the start. The GitHub API exposes fork relationships — use them. Deduplicate on content hash, not just name, because forks with minor changes are still essentially duplicates for discovery purposes.&lt;/p&gt;
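
&lt;p&gt;A content-hash dedup pass can be as simple as this sketch; the normalization choices (lowercase, collapse whitespace) are illustrative, and in practice they matter more than the hash itself:&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

interface Skill { repo: string; name: string; body: string }

// Normalize before hashing so trivial fork edits still collide.
const contentKey = (s: Skill) =>
  createHash("sha256")
    .update(s.body.toLowerCase().replace(/\s+/g, " ").trim())
    .digest("hex");

function dedupe(skills: Skill[]): Skill[] {
  const seen = new Set<string>();
  const out: Skill[] = [];
  for (const s of skills) {
    const key = contentKey(s);
    if (!seen.has(key)) {
      seen.add(key);
      out.push(s); // first occurrence wins; forks are dropped
    }
  }
  return out;
}
```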

&lt;h2&gt;
  
  
  Template placeholders in production
&lt;/h2&gt;

&lt;p&gt;This one was embarrassing. Several indexed skills had descriptions like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;TODO: Add description here&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;A skill that does [THING]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Template skill - replace with your implementation&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These were template/scaffold skills from starter repos that the indexer treated as real skills. Nobody caught them because they had valid structure — a name, a SKILL.md file, the right directory layout. They just had zero actual content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Validate content, not just structure. A skill with "TODO" in its description should be filtered out or flagged. This seems obvious in retrospect, but when you're focused on parsing metadata correctly, you forget to check whether the metadata is actually meaningful.&lt;/p&gt;
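
&lt;p&gt;A minimal content gate might look like this; the pattern list and length threshold are illustrative, built from the placeholder strings above:&lt;/p&gt;

```typescript
// Reject descriptions that are structurally valid but content-free.
const PLACEHOLDER_PATTERNS = [
  /\btodo\b/i,                         // "TODO: Add description here"
  /\[THING\]|\[YOUR[_ ]?\w*\]/i,       // unfilled template slots
  /replace with your implementation/i, // scaffold boilerplate
  /add description here/i,
];

function isMeaningfulDescription(desc: string): boolean {
  const trimmed = desc.trim();
  if (trimmed.length < 20) return false; // too short to describe anything
  return !PLACEHOLDER_PATTERNS.some((p) => p.test(trimmed));
}
```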

&lt;h2&gt;
  
  
  Search returns 0 results. Resolve works fine.
&lt;/h2&gt;

&lt;p&gt;This was the weirdest bug. The &lt;code&gt;/search&lt;/code&gt; endpoint — meant for humans browsing the registry — returned 0 results for queries like "kubernetes deployment" or "database migration." Meanwhile, the &lt;code&gt;/resolve&lt;/code&gt; endpoint — meant for agents — found relevant skills instantly for equivalent queries.&lt;/p&gt;

&lt;p&gt;The cause: search used full-text search against a subset of fields (name + short description), while resolve searched against all fields including README content, tags, and examples. The skills that matched were rich in their full metadata but had terse names and descriptions.&lt;/p&gt;

&lt;p&gt;Example: a skill named &lt;code&gt;k8s-deploy&lt;/code&gt; with description "Manages deployments" would never match a search for "kubernetes deployment." But resolve would find it through README content mentioning "Kubernetes deployment orchestration."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; If your data has inconsistent metadata richness, your search needs to account for that. Either enforce richer required fields, or search across everything. Having two endpoints with different search scopes is a bug, not a feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write the tests before indexing.&lt;/strong&gt; I built the crawler, indexed 5k skills, then tested. Should have had quality gates before anything entered the registry.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fork detection on day one.&lt;/strong&gt; The duplicate problem compounds daily and is harder to fix retroactively because people may already reference the duplicate entries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No auto-tagging without a validation set.&lt;/strong&gt; I should have manually tagged 100 skills first and measured accuracy before deploying the auto-tagger to 5,000.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security headers in the project template.&lt;/strong&gt; Not as an afterthought.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One search implementation, not two.&lt;/strong&gt; The search/resolve split made sense architecturally but created a confusing quality gap.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Numbers summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Skills indexed&lt;/td&gt;
&lt;td&gt;5,090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source repos&lt;/td&gt;
&lt;td&gt;200+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-tag accuracy&lt;/td&gt;
&lt;td&gt;~50%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resolve: perfect match&lt;/td&gt;
&lt;td&gt;45%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resolve: usable (top-5)&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security headers (before)&lt;/td&gt;
&lt;td&gt;1/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security headers (after)&lt;/td&gt;
&lt;td&gt;7/7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate skill copies&lt;/td&gt;
&lt;td&gt;2→3 and growing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Template placeholders found&lt;/td&gt;
&lt;td&gt;Multiple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search zero-result rate&lt;/td&gt;
&lt;td&gt;High for reasonable queries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;The project is &lt;a href="https://skillshub.wtf" rel="noopener noreferrer"&gt;skillshub.wtf&lt;/a&gt; if you want to poke at it. It's open source and clearly still has rough edges.&lt;/p&gt;

&lt;p&gt;The meta-lesson from all of this: &lt;strong&gt;building a registry is easy; building a trustworthy registry is hard.&lt;/strong&gt; The indexing, API, and infrastructure were the fun parts. Data quality, deduplication, and search relevance are where the actual work lives — and where I underinvested.&lt;/p&gt;

&lt;p&gt;If you're building anything that aggregates open-source metadata at scale, write your quality tests first. Your crawler will happily ingest garbage with perfect formatting, and you won't notice until someone searches for "kubernetes" and gets zero results.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All findings are from real QA runs against a real system. Nothing was cherry-picked to look worse (or better) than it is.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>An agent skills solver - free to use. No need to pre-load skills for your agent; retrieve skills at any time you need them</title>
      <dc:creator>Luren L.</dc:creator>
      <pubDate>Wed, 18 Mar 2026 04:25:54 +0000</pubDate>
      <link>https://dev.to/comeonoliver/an-agent-skills-solver-free-to-use-no-need-to-pre-load-skills-for-your-agent-retrieve-skills-at-3okj</link>
      <guid>https://dev.to/comeonoliver/an-agent-skills-solver-free-to-use-no-need-to-pre-load-skills-for-your-agent-retrieve-skills-at-3okj</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/comeonoliver" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1308424%2F68984af6-5144-4945-9b2f-e3a996ae658f.png" alt="comeonoliver"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/comeonoliver/i-built-a-skill-resolver-for-ai-agents-what-i-learned-about-token-economics-2n27" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;I Built a Skill Resolver for AI Agents - What I Learned About Token Economics&lt;/h2&gt;
      &lt;h3&gt;Luren L. ・ Mar 18&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#opensource&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#typescript&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>opensource</category>
      <category>typescript</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Built a Skill Resolver for AI Agents - What I Learned About Token Economics</title>
      <dc:creator>Luren L.</dc:creator>
      <pubDate>Wed, 18 Mar 2026 04:24:06 +0000</pubDate>
      <link>https://dev.to/comeonoliver/i-built-a-skill-resolver-for-ai-agents-what-i-learned-about-token-economics-2n27</link>
      <guid>https://dev.to/comeonoliver/i-built-a-skill-resolver-for-ai-agents-what-i-learned-about-token-economics-2n27</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;AI coding agents (Claude Code, Codex, Cursor) have a dirty secret: they waste &lt;strong&gt;~50,000 tokens&lt;/strong&gt; every time they need a new skill.&lt;/p&gt;

&lt;p&gt;The flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent gets a task it doesn't know how to do&lt;/li&gt;
&lt;li&gt;Searches GitHub for SKILL.md files&lt;/li&gt;
&lt;li&gt;Reads 5-10 of them to compare&lt;/li&gt;
&lt;li&gt;Picks one&lt;/li&gt;
&lt;li&gt;Finally starts the actual work&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 2-4 burn through the context window like jet fuel. And here's the kicker — the agent does this &lt;strong&gt;every session&lt;/strong&gt;, because it doesn't remember what it found last time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math That Changed My Mind
&lt;/h2&gt;

&lt;p&gt;I measured it across 50 sessions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Tokens for skill discovery&lt;/th&gt;
&lt;th&gt;Tokens for actual task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Manual search&lt;/td&gt;
&lt;td&gt;~50,000&lt;/td&gt;
&lt;td&gt;~5,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-loaded (20 skills)&lt;/td&gt;
&lt;td&gt;0 (pre-loaded)&lt;/td&gt;
&lt;td&gt;~5,000 + 40,000 context tax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resolver&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~200&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~5,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The resolver approach is &lt;strong&gt;250x more efficient&lt;/strong&gt; than manual search and doesn't waste context on skills you might never use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://skillshub.wtf" rel="noopener noreferrer"&gt;SkillsHub&lt;/a&gt; — an open-source skill registry with a resolve endpoint that matches natural language task descriptions to the best skill:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s1"&gt;'https://skillshub.wtf/api/v1/skills/resolve?task=terraform+modules+testing'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"skill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"terraform-skill"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Terraform and OpenTofu — modules, tests, CI/CD..."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fetchUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://skillshub.wtf/antonbabenko/terraform-skill/terraform-skill?format=md"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tokenWeights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"terraform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"modules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"testing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One call. Best skill. Confidence score. Direct URL to fetch.&lt;/p&gt;
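&lt;p&gt;If you'd rather call it from code than curl, a minimal helper for building the resolve URL could look like this (the endpoint path is from the article; the helper itself is illustrative):&lt;/p&gt;

```typescript
// Build a resolve URL from a natural-language task description.
// encodeURIComponent emits %20 for spaces; the curl example above uses +,
// which is the other common query-string encoding for a space.
function resolveUrl(task: string): string {
  return `https://skillshub.wtf/api/v1/skills/resolve?task=${encodeURIComponent(task)}`;
}

console.log(resolveUrl("terraform modules testing"));
```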

&lt;h2&gt;
  
  
  The Ranking Algorithm
&lt;/h2&gt;

&lt;p&gt;The resolver uses a multi-signal scoring system (no LLM inference required):&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Text Relevance (0-70 points)
&lt;/h3&gt;

&lt;p&gt;TF-IDF-inspired token weighting. Rare terms score higher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;log2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;totalMatched&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tokenMatchCount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"terraform" appears in 12 out of 5,000 skills → weight 3x.&lt;br&gt;
"testing" appears in 400 skills → weight 1.2x.&lt;/p&gt;

&lt;p&gt;This means domain-specific terms dominate the ranking naturally.&lt;/p&gt;
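&lt;p&gt;A minimal sketch of that weighting (variable names follow the formula above; any capping or normalization the real resolver applies on top is not shown here):&lt;/p&gt;

```typescript
// IDF-style token weight: rarer tokens get larger weights.
// totalMatched = size of the candidate pool, tokenMatchCount = how many
// skills in that pool contain the token.
function tokenWeight(totalMatched: number, tokenMatchCount: number): number {
  return Math.log2(totalMatched / tokenMatchCount) + 1;
}

// A token matching 1 of 8 skills outweighs one matching 4 of 8:
console.log(tokenWeight(8, 1)); // 4
console.log(tokenWeight(8, 4)); // 2
```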
&lt;h3&gt;
  
  
  2. Phrase Matching
&lt;/h3&gt;

&lt;p&gt;"infrastructure as code" is treated as one concept, not three separate words. 25+ phrase mappings prevent "code signing" from matching "infrastructure as code."&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Quality Score (0-20 points)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Readme depth (log₂ of character count)&lt;/li&gt;
&lt;li&gt;Tag completeness&lt;/li&gt;
&lt;li&gt;Description quality&lt;/li&gt;
&lt;/ul&gt;
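&lt;p&gt;A sketch of how those signals might combine into the 0-20 range (the point split between readme depth, tags, and description is an assumption; the article only fixes the total and the log₂ scaling):&lt;/p&gt;

```typescript
// Hypothetical quality score: log2 of readme length plus flat bonuses.
// The individual caps (12/5/3) are illustrative, not the shipped values.
function qualityScore(readmeChars: number, tagCount: number, hasDescription: boolean): number {
  const readme = Math.min(Math.log2(Math.max(readmeChars, 1)), 12); // depth, capped
  const tags = Math.min(tagCount, 5);                               // completeness
  const desc = hasDescription ? 3 : 0;                              // description bonus
  return Math.min(readme + tags + desc, 20);
}
```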
&lt;h3&gt;
  
  
  4. Popularity (0-10 points)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;GitHub stars (log₁₀ scaled)&lt;/li&gt;
&lt;/ul&gt;
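&lt;p&gt;A sketch of that signal; the clamp to 0-10 and the log₁₀ shape are from the article, while the scaling factor is an assumption:&lt;/p&gt;

```typescript
// Popularity: GitHub stars on a log10 scale, clamped to 0-10.
// +1 avoids log10(0) for zero-star skills.
function popularityScore(stars: number): number {
  return Math.min(Math.log10(stars + 1) * 2, 10);
}
```

&lt;p&gt;On a log scale a 10x jump in stars adds a fixed increment, so mega-popular repos can't drown out text relevance.&lt;/p&gt;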
&lt;h3&gt;
  
  
  5. Domain Conflict Penalties
&lt;/h3&gt;

&lt;p&gt;If the query says "nodejs" but the skill is for Django → -15 points. Prevents cross-domain false positives.&lt;/p&gt;
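&lt;p&gt;A sketch of the penalty check (the -15 figure is the article's; the conflict pairs here are illustrative):&lt;/p&gt;

```typescript
// Hypothetical conflict table: ecosystems that should not cross-match.
const CONFLICTS: [string, string][] = [
  ["nodejs", "django"],
  ["react", "vue"],
];

// Subtract points when the query names one side and the skill the other.
function domainPenalty(queryTokens: string[], skillTags: string[]): number {
  for (const [a, b] of CONFLICTS) {
    if (queryTokens.includes(a) && skillTags.includes(b)) return -15;
    if (queryTokens.includes(b) && skillTags.includes(a)) return -15;
  }
  return 0;
}

console.log(domainPenalty(["nodejs", "auth"], ["django", "python"])); // -15
```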
&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. TF-IDF beats embeddings for structured matching
&lt;/h3&gt;

&lt;p&gt;At 5,000 skills, keyword matching with IDF weighting runs in &lt;strong&gt;~300ms&lt;/strong&gt; and achieves &lt;strong&gt;7.75/10 accuracy&lt;/strong&gt; on our benchmark. No GPU needed, no embedding model to maintain.&lt;/p&gt;

&lt;p&gt;Embeddings would be better for semantic matching ("infrastructure as code" ≈ "IaC tooling") but the added complexity and latency aren't worth it at this scale.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Tag quality is make-or-break
&lt;/h3&gt;

&lt;p&gt;We imported 5,000+ skills from 200+ GitHub repos. Auto-tagging from descriptions created massive pollution — an email composer skill got tagged "docker" because the description mentioned Docker deployment. We had to build 3 layers of tag validation to fix it.&lt;/p&gt;
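&lt;p&gt;The article doesn't publish those three layers, but the general shape of a layered validator might be (all three checks below are illustrative assumptions):&lt;/p&gt;

```typescript
// Layer 1: allowlist of known, curated tags (tiny illustrative sample).
const KNOWN_TAGS = new Set(["docker", "terraform", "react", "email"]);

// Layer 2: a tag must be grounded in the skill's own title, not just a
// passing mention in the description. Layer 3: cap the tag count.
function validateTags(candidates: string[], titleWords: Set<string>): string[] {
  return candidates
    .filter((t) => KNOWN_TAGS.has(t))   // layer 1: allowlist
    .filter((t) => titleWords.has(t))   // layer 2: grounded in the title
    .slice(0, 10);                      // layer 3: cap
}
```

&lt;p&gt;Under these rules the email-composer skill keeps "email" but loses "docker", since Docker only appeared in its deployment notes.&lt;/p&gt;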
&lt;h3&gt;
  
  
  3. The feedback loop is the hardest part
&lt;/h3&gt;

&lt;p&gt;Our initial confidence scores measured &lt;strong&gt;semantic match&lt;/strong&gt; — how well the words align. But "sounds right" ≠ "actually works." We shipped a feedback endpoint so agents can report whether a skill actually helped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST /api/v1/skills/&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;/feedback
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"task"&lt;/span&gt;: &lt;span class="s2"&gt;"terraform testing"&lt;/span&gt;, &lt;span class="s2"&gt;"helpful"&lt;/span&gt;: &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real ranking will come from execution outcomes, not word matching.&lt;/p&gt;
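&lt;p&gt;One simple way to fold that feedback in is to blend the semantic score with the observed helpful-rate (the blend weights below are assumptions, not the shipped formula):&lt;/p&gt;

```typescript
// Blend semantic match with execution outcomes. With no feedback yet,
// fall back to the semantic score alone.
function blendedConfidence(semantic: number, helpful: number, total: number): number {
  if (total === 0) return semantic;
  const helpfulRate = helpful / total;
  return 0.6 * semantic + 0.4 * helpfulRate;
}

console.log(blendedConfidence(0.92, 8, 10)); // ≈ 0.872
```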

&lt;h3&gt;
  
  
  4. Agents need methodology, not knowledge
&lt;/h3&gt;

&lt;p&gt;The most useful SKILL.md files aren't encyclopedias. They're playbooks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When to use this vs that&lt;/li&gt;
&lt;li&gt;Decision trees for common scenarios&lt;/li&gt;
&lt;li&gt;What breaks and how to fix it&lt;/li&gt;
&lt;li&gt;When to stop and ask for help&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trail of Bits' 61 security skills are the gold standard. Each one reads like an engineering runbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Next.js 16 API Routes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;PostgreSQL + Drizzle ORM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scoring&lt;/td&gt;
&lt;td&gt;TF-IDF + phrase matching (pure TypeScript)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy&lt;/td&gt;
&lt;td&gt;Vercel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth&lt;/td&gt;
&lt;td&gt;API keys (skh_...) for writes, none for reads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get the API guide&lt;/span&gt;
curl https://skillshub.wtf/api/v1

&lt;span class="c"&gt;# Resolve a skill&lt;/span&gt;
curl &lt;span class="s1"&gt;'https://skillshub.wtf/api/v1/skills/resolve?task=react+testing+best+practices'&lt;/span&gt;

&lt;span class="c"&gt;# Fetch the skill markdown&lt;/span&gt;
curl &lt;span class="s1"&gt;'https://skillshub.wtf/trailofbits/skills/modern-python?format=md'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No auth needed for reading. &lt;a href="https://github.com/ComeOnOliver/skillshub" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; (MIT license).&lt;/p&gt;

&lt;p&gt;5,000+ skills from Microsoft, OpenAI, Trail of Bits, HashiCorp, Sentry, Snyk, and 200+ more repos.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What's your experience with AI agent skill discovery? Do your agents pre-load skills or resolve on demand? I'd love to hear about different approaches in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>typescript</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
