How Commit Scores npm Packages: The Methodology Behind getcommit.dev/audit
On April 1st, 2026, axios was compromised. 101 million downloads per week. npm audit showed zero issues. Behavioral commitment scoring had it flagged as CRITICAL months before anyone filed a CVE. This article explains exactly how that works.
When I published the axios postmortem, the most common question was: "How does your scoring actually work? Show me the math."
Fair question. If you're going to trust a tool with your dependency decisions, you should be able to inspect, debate, and reject specific choices in the methodology. This is that article.
The Problem: npm audit Answers the Wrong Question
npm audit is a CVE scanner. It checks a package's version against a database of known vulnerabilities. When a CVE is filed, catalogued, and propagated, your tool will catch it.
That's useful. But it answers the wrong question for a specific class of supply chain risk.
The question that matters is: what is the structural likelihood that this package becomes a future attack vector?
Known CVEs are the output of an attack. What we can observe before the attack is the conditions that made it possible:
- Single person controlling the publish credentials for a package with 100M weekly downloads
- No corporate backing — one compromised GitHub account is a supply chain event
- High download trend attracting attacker attention
- Long project age with accumulated legacy access and inertia
On March 31st, 2026 — the day before the axios attack — running npm audit on a project that depended on axios returned:
```
found 0 vulnerabilities
```
The behavioral commitment score returned:
```
axios score=89 1 maintainer 101M downloads/week 🔴 CRITICAL
```
The difference isn't that one tool was smarter. It's that they answer different questions.
The Five Scoring Dimensions
Every package gets scored on five behavioral dimensions. All inputs are public data from the npm registry and GitHub API — no scraping, no proprietary data sources.
1. Longevity (25 points)
What it measures: Project age, weighted by consistency.
Why it matters: Older projects have accumulated more dependents, more integration depth, and more attack interest. A 12-year-old package embedded in thousands of production systems is a different risk profile than a 6-month-old experimental library. Longevity also rewards durability — a package that has survived for years is likely to continue being maintained.
Scoring: Full marks (25/25) for packages with 10+ years of consistent maintenance. Scales down for younger projects or projects with significant inactive periods.
axios in practice: 11.6 years old → 25/25
Note: high longevity is not inherently risky. It's the combination of longevity + single maintainer + high downloads that creates the dangerous profile.
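The article doesn't publish the exact longevity curve, so here's a minimal sketch of what a rule like "full marks at 10+ years, scaled down for younger projects or projects with inactive periods" could look like. The linear ramp and the `inactive_fraction` parameter are illustrative assumptions, not the published implementation:

```python
def longevity_score(age_years: float, inactive_fraction: float = 0.0) -> float:
    """Longevity (max 25): full marks at 10+ years of consistent maintenance.

    inactive_fraction (0.0-1.0) is a hypothetical input: the share of the
    project's lifetime with no releases or activity. The linear ramp below
    is an assumption; only the 10-year full-marks rule is published.
    """
    base = min(age_years / 10.0, 1.0) * 25.0  # linear ramp, capped at 10 years
    return round(base * (1.0 - inactive_fraction), 1)

print(longevity_score(11.6))  # axios: 25.0
```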
2. Download Momentum (25 points)
What it measures: Download trend direction, not raw count.
Why it matters: A package with 100M weekly downloads and a declining trend is a different risk than one with 100M and a growing trend. Growing packages are attracting more attention — from users and attackers both. The trend also reflects whether the ecosystem still depends on this package actively or whether it's coasting on legacy installs.
Scoring: Full marks for packages with growing or stable trends at high volume. Adjustments for declining or erratic patterns. The raw download count matters (it sets the "blast radius"), but trend direction matters more for predictive scoring.
axios in practice: 101M/week, growing → 25/25
3. Release Consistency (20 points)
What it measures: Regularity of releases over time, recency of last publish.
Why it matters: Packages with consistent release cadences signal active, engaged maintainers. Packages that haven't released in 12+ months while maintaining high traffic are "zombie" packages — still widely depended on, but potentially unmaintained, with old access still live.
Scoring: Full marks for packages releasing regularly (monthly or better). Scaled down for packages with 90+ day gaps. Heavy deductions for packages with 12+ months of inactivity that still see substantial traffic.
axios in practice: Last published 6 days ago, consistent history → 20/20
Contrast — chalk: Last published 171 days ago → 13/20
This is why two packages can both score CRITICAL but for different reasons. Axios is actively maintained but structurally exposed. Chalk has the same structural exposure plus release inactivity.
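The exact breakpoints for this dimension aren't published. The piecewise sketch below is an assumption, chosen only to be consistent with the stated rules and the two data points in this section (axios at 6 days → 20/20, chalk at 171 days → 13/20):

```python
def release_consistency_score(days_since_last_publish: int) -> int:
    """Release Consistency (max 20) -- an illustrative piecewise sketch.

    Published rules: full marks for monthly-or-better cadence, scaled down
    past 90-day gaps, heavy deductions past 12 months. The specific
    breakpoints here are assumptions fitted to the two published data
    points (axios: 6 days -> 20, chalk: 171 days -> 13).
    """
    if days_since_last_publish <= 30:
        return 20
    if days_since_last_publish <= 90:
        return 17
    if days_since_last_publish <= 180:
        return 13
    if days_since_last_publish <= 365:
        return 9
    return 4

print(release_consistency_score(6))    # axios -> 20
print(release_consistency_score(171))  # chalk -> 13
```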
4. Maintainer Depth (15 points)
What it measures: The number of maintainers with publish access. This is the key signal for the CRITICAL risk flag.
Why it matters: A sole maintainer controlling a package with massive download volume creates a single point of failure. One compromised npm token, one phished GitHub account, one person's bad day — and 100M weekly downloads receive a malicious update. The LiteLLM attack (March 2026) and the axios attack (April 2026) both followed this pattern exactly.
Scoring:
| Maintainers | Score |
|---|---|
| 1 (sole) | 4/15 |
| 2 | 7/15 |
| 3–4 | 10/15 |
| 5–9 | 12/15 |
| 10–14 | 14/15 |
| 15+ | 15/15 |
Single maintainer scores 4/15 — the lowest non-zero score. It's intentionally low because the credential-compromise risk is structural, not speculative.
axios in practice: 1 maintainer → 4/15
express in practice: 5 maintainers → 12/15
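The table above translates directly into code. A minimal reproduction of the maintainerDepth mapping:

```python
def maintainer_depth_score(maintainers: int) -> int:
    """Maintainer Depth (max 15), mapped straight from the published table."""
    if maintainers <= 1:   # sole maintainer: the structural-risk floor
        return 4
    if maintainers == 2:
        return 7
    if maintainers <= 4:
        return 10
    if maintainers <= 9:
        return 12
    if maintainers <= 14:
        return 14
    return 15

print(maintainer_depth_score(1))  # axios -> 4
print(maintainer_depth_score(5))  # express -> 12
```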
5. GitHub Backing (15 points)
What it measures: Organizational backing, community engagement, repository health signals.
Why it matters: Packages maintained under a corporate GitHub organization have different risk profiles than personal repos. An organization means multiple people have access, there are usually internal security practices, and there's institutional continuity if the primary maintainer leaves. Community engagement (stars, forks, issue response rate) signals ongoing attention.
Scoring: Organization-backed repos score higher. Personal repos with high engagement score mid-range. Personal repos with declining engagement score lower.
axios in practice: Strong engagement, organization-adjacent → 15/15
chalk in practice: Personal repo, declining relative engagement → 11/15
The CRITICAL Flag
A package is flagged CRITICAL when both conditions are true:
- Single maintainer (maintainerDepth = 4/15)
- >10M weekly downloads
Both conditions must hold. The threshold is explicit and deterministic — you can reproduce the flag yourself from npm registry data.
The reasoning: >10M weekly downloads is the point where a compromised package becomes a supply chain event. Below that threshold, the blast radius may be significant but is bounded. Above it, a single-maintainer package with no corporate oversight is an asymmetric risk: the attacker needs to compromise one set of credentials to affect tens or hundreds of millions of installs.
16 of the 41 npm packages with >10M weekly downloads have a single maintainer. Together: 2.82 billion downloads per week.
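Because the flag is deterministic, it fits in one line of code. A sketch, using the axios and react numbers from this article:

```python
def is_critical(maintainers: int, weekly_downloads: int) -> bool:
    """CRITICAL flag: sole maintainer AND more than 10M weekly downloads."""
    return maintainers == 1 and weekly_downloads > 10_000_000

print(is_critical(1, 100_837_905))  # axios -> True
print(is_critical(2, 123_000_000))  # react (2 maintainers) -> False
print(is_critical(1, 5_000_000))    # sole maintainer, below threshold -> False
```

Both inputs come straight from the npm registry, which is what makes the flag reproducible.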
Walking Through a Real Scoring: axios
```shell
curl -X POST https://poc-backend.amdal-dev.workers.dev/api/audit \
  -H "Content-Type: application/json" \
  -d '{"packages": ["axios"]}'
```
Response (April 2026):
```json
{
  "name": "axios",
  "ecosystem": "npm",
  "score": 89,
  "maintainers": 1,
  "weeklyDownloads": 100837905,
  "ageYears": 11.6,
  "trend": "growing",
  "daysSinceLastPublish": 6,
  "riskFlags": ["CRITICAL"],
  "scoreBreakdown": {
    "longevity": 25,
    "downloadMomentum": 25,
    "releaseConsistency": 20,
    "maintainerDepth": 4,
    "githubBacking": 15
  }
}
```
Score interpretation: 89/100 looks healthy. Most "package health" tools would pass this with flying colors. The project is 11.6 years old (full longevity), actively downloaded with growing trend (full momentum), consistently releasing (full consistency), well-backed on GitHub (full backing).
The CRITICAL flag comes entirely from one number: maintainerDepth: 4/15.
Everything else about axios is exemplary. That's precisely what makes the risk insidious — the package looks like a model of open source health. One person's credentials stand between 100M weekly installs and a malicious update.
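One quick sanity check you can run on any response: the overall score is simply the sum of the five scoreBreakdown components. Using the axios response above, trimmed to the relevant fields:

```python
# The axios response from above, trimmed to the relevant fields.
response = {
    "score": 89,
    "scoreBreakdown": {
        "longevity": 25,
        "downloadMomentum": 25,
        "releaseConsistency": 20,
        "maintainerDepth": 4,
        "githubBacking": 15,
    },
}

# The overall score is the plain sum of the five components: 25+25+20+4+15 = 89.
assert sum(response["scoreBreakdown"].values()) == response["score"]
```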
Comparison Table: Same Download Volume, Different Risk Profiles
| Package | Score | Maintainers | Weekly Downloads | Risk |
|---|---|---|---|---|
| axios | 89 | 1 | 101M | 🔴 CRITICAL |
| zod | 83 | 1 | 159M | 🔴 CRITICAL |
| chalk | 75 | 1 | 411M | 🔴 CRITICAL |
| react | 91 | 2 | 123M | ✅ No flag |
| express | 97 | 5 | 93M | ✅ No flag |
React with 2 maintainers doesn't flag CRITICAL. Express with 5 maintainers scores 12/15 on maintainerDepth and still totals 97. The difference isn't download volume; it's credential concentration.
Validation: How the Scores Performed Before the Attacks
The axios Attack (April 1, 2026)
Behavioral score (months before attack): CRITICAL — maintainerDepth 4/15, 1 sole maintainer, 101M downloads/week
npm audit (day before attack): found 0 vulnerabilities
The attack followed the exact pattern the score predicted: credential compromise, malicious version published. The behavioral score didn't predict when the attack would happen. It identified the structural conditions that made the attack possible and worth doing.
The LiteLLM Attack (March 2026)
Same profile: sole maintainer, 10M+ weekly downloads on the PyPI side. CRITICAL by behavioral scoring. pip-audit and other standard vulnerability scanners came back clean.
The Pattern
Neither score predicted the attack. Both identified the structural exposure. The question isn't whether every CRITICAL package gets attacked — most won't. The question is: among your dependencies, which ones have the thinnest defensive perimeter?
What the Scores Don't Tell You
This section matters. HN readers who've worked in security will already be thinking these objections — they're correct.
CRITICAL packages that never get attacked will always outnumber the ones that do. The score identifies exposure, not certainty. Most sole-maintained packages are run by talented, security-conscious people who never become targets. The score is a structural characterization, not a prediction.
A low overall score with a CRITICAL flag can be misleading. Chalk scores 75/100 — below average for the ecosystem. But the 75 reflects declining release activity and engagement. The CRITICAL flag is triggered by maintainer depth, not the score itself. A package scoring 90/100 with a CRITICAL flag (like axios at 89) is in some ways more dangerous, because it passes every "healthy package" heuristic except the one that matters.
The methodology weights are a first pass, not ground truth. I weighted maintainerDepth at 15 points total, with a sole-maintainer floor of 4. A reasonable argument exists for weighting it differently — 20% vs. 15%, or changing the download threshold for CRITICAL from 10M to 25M or 5M. The weights are published, the logic is open, the API returns full breakdown. If you'd weight things differently, that's a meaningful technical discussion and I want to have it.
The score doesn't cover behavioral changes over time. A package that was maintained by a 5-person team for 10 years but just lost 4 of those maintainers gets the same maintainerDepth score as one that's always been sole-maintained. The current implementation is a snapshot, not a trajectory.
Download count is blast radius, not risk. A sole-maintained package with 5M weekly downloads isn't flagged CRITICAL. It's still risky — just below the threshold where a credential compromise becomes a systematic supply chain event. The threshold is somewhat arbitrary.
Inspecting Everything: The Full scoreBreakdown
The API returns complete scoring details for every package:
```shell
# Single package
curl -X POST https://poc-backend.amdal-dev.workers.dev/api/audit \
  -H "Content-Type: application/json" \
  -d '{"packages": ["chalk"]}'

# Batch audit
curl -X POST https://poc-backend.amdal-dev.workers.dev/api/audit \
  -H "Content-Type: application/json" \
  -d '{"packages": ["chalk", "zod", "axios", "react", "express"]}'

# Your project's direct dependencies
npx proof-of-commitment --file package.json

# Transitive dependencies (depth 2)
curl -X POST https://poc-backend.amdal-dev.workers.dev/api/graph/npm \
  -H "Content-Type: application/json" \
  -d '{"package": "@anthropic-ai/sdk", "depth": 2}'
```
Every response includes scoreBreakdown with the raw component scores. You can verify the weights, confirm the CRITICAL logic, and audit any package against what the npm registry actually contains.
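For example, a few lines of Python can replay both published invariants against a saved response: the score equals the sum of the breakdown components, and CRITICAL equals sole maintainer with >10M weekly downloads. The sample data below is the axios example from this article; in practice you'd load the API response body instead:

```python
import json

# Sample fields copied from the article's axios example; in practice,
# load this from the /api/audit response body.
raw = '''[
  {"name": "axios", "score": 89, "maintainers": 1,
   "weeklyDownloads": 100837905, "riskFlags": ["CRITICAL"],
   "scoreBreakdown": {"longevity": 25, "downloadMomentum": 25,
                      "releaseConsistency": 20, "maintainerDepth": 4,
                      "githubBacking": 15}}
]'''

for pkg in json.loads(raw):
    # Invariant 1: the overall score is the sum of the five components.
    assert pkg["score"] == sum(pkg["scoreBreakdown"].values())
    # Invariant 2: CRITICAL is sole maintainer AND >10M weekly downloads.
    expected_critical = pkg["maintainers"] == 1 and pkg["weeklyDownloads"] > 10_000_000
    assert ("CRITICAL" in pkg["riskFlags"]) == expected_critical
    print(pkg["name"], "OK")
```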
The source code is at github.com/piiiico/proof-of-commitment. The CRITICAL flag logic is deterministic: if you can query the npm registry, you can reproduce it.
Frequently Asked Questions About the Methodology
Q: Why is maintainerDepth only 15 points if it's the most important signal?
Because the score and the CRITICAL flag serve different purposes. The score measures overall package health across five dimensions — a high score is genuinely informative about project vitality. The CRITICAL flag is a binary structural alert. A sole-maintained package with 200M weekly downloads scores 4/15 on maintainerDepth and trips the CRITICAL flag regardless of its overall score. Weighting maintainerDepth higher would make scores less informative about health; the flag handles the structural risk independently.
Q: Why 10M weekly downloads as the CRITICAL threshold?
It's the point where a credential compromise becomes a plausible supply chain event. The npm ecosystem has roughly 40 packages above this threshold — it's a small number of packages that collectively represent several billion weekly installs. Below 10M, the blast radius is significant but bounded to a more defined group of downstream projects. Above 10M, you're talking about infrastructure-level exposure.
Q: Can packages game the score?
The behavioral signals require real sustained cost to fake. Release consistency requires actual releases over years. Maintainer depth requires actually having multiple maintainers. You can't retroactively manufacture 12 years of consistent shipping. This is the same reason behavioral commitment signals are harder to fake than stars or README quality — optimization requires real effort, not one-time investment.
Q: What about packages that look fine now but degrade over time?
The watchlist at getcommit.dev/watchlist monitors the top npm packages in real time against the npm registry. If a package's maintainer count drops, its release activity slows, or its download trend shifts, the score updates. The API is live — scores are computed from current registry data, not cached snapshots.
The Short Version
Five dimensions. All public data. Weights are documented. CRITICAL is deterministic.
Longevity 25 pts — project age + consistency
Download Momentum 25 pts — trend direction at current volume
Release Consistency 20 pts — release cadence + recency
Maintainer Depth 15 pts — credential concentration risk
GitHub Backing 15 pts — organizational support + engagement
CRITICAL = sole maintainer + >10M weekly downloads.
The axios and LiteLLM attacks both hit packages that met this definition months before the attack. npm audit showed zero issues for both until after the compromise.
Try it on your own stack:
```shell
npx proof-of-commitment --file package.json
```
Or in the browser: getcommit.dev/audit
If you'd weight things differently — I want to know. The methodology is a first pass. The point is having the conversation before the attack, not after.
Source: github.com/piiiico/proof-of-commitment
Web audit: getcommit.dev/audit
Live watchlist: getcommit.dev/watchlist