When you're choosing an AI framework, what do you actually look at? Usually: stars, documentation quality, whether the README looks maintained.
All of that is stated signal. Easy to manufacture, doesn't tell you if the project will exist in 18 months.
I built a tool that scores repos on behavioral commitment — signals that cost real time and money to fake. Here's what I found when I ran 14 of the most popular AI frameworks through it.
## The methodology
Five behavioral signals, weighted by how hard they are to fake:
| Signal | Weight | Logic |
|---|---|---|
| Longevity | 30% | Years of consistent operation |
| Recent activity | 25% | Commits in the last 30 days |
| Community | 20% | Number of contributors |
| Release cadence | 15% | Stable versioned releases |
| Social proof | 10% | Stars (real people starring costs attention) |
Archived repos or projects with no pushes in 2+ years are penalized 50%.
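Put together, the scoring reduces to a weighted sum plus the staleness penalty. Here is a minimal sketch using the weights from the table above; how each raw signal is normalized into a 0–1 sub-score is my assumption, not the tool's actual formula.

```python
# Weights from the methodology table. Sub-score normalization is assumed.
WEIGHTS = {
    "longevity": 0.30,
    "recent_activity": 0.25,
    "community": 0.20,
    "release_cadence": 0.15,
    "social_proof": 0.10,
}

def commitment_score(subscores: dict, archived_or_stale: bool = False) -> int:
    """subscores: each signal normalized to [0, 1]."""
    score = 100 * sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)
    if archived_or_stale:   # archived, or no pushes in 2+ years
        score *= 0.5        # the 50% penalty
    return round(score)

# Example: strong everywhere except release cadence
print(commitment_score({
    "longevity": 0.9, "recent_activity": 1.0, "community": 0.8,
    "release_cadence": 0.3, "social_proof": 0.7,
}))
```

Note how the penalty is multiplicative: a stale project can never score above 50, no matter how strong its history looks.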
## The results
| Framework | Score | Age | 30d Commits | Stars |
|---|---|---|---|---|
| 🥇 openai/openai-python | 95/100 | 5.4 yr | 28 | 30k |
| 🥇 deepset-ai/haystack | 95/100 | 6.4 yr | 100 | 25k |
| 🥈 langchain-ai/langchain | 92/100 | 3.5 yr | 100 | 132k |
| 🥈 run-llama/llama_index | 92/100 | 3.4 yr | 100 | 48k |
| 🥈 agno-agi/agno | 92/100 | 3.9 yr | 100 | 39k |
| 🥉 anthropics/anthropic-sdk-python | 90/100 | 3.2 yr | 54 | 3k |
| microsoft/semantic-kernel | 87/100 | 3.1 yr | 32 | 28k |
| huggingface/transformers | 85/100 | 7.4 yr | 100 | 159k |
| BerriAI/litellm | 84/100 | 2.7 yr | 100 | 42k |
| pydantic/pydantic-ai | 84/100 | 1.8 yr | 93 | 16k |
| stanfordnlp/dspy | 82/100 | 3.2 yr | 32 | 33k |
| google/adk-python | 79/100 | 1.0 yr | 100 | 19k |
| crewAIInc/crewAI | 74/100 | 2.4 yr | 100 | 48k |
| microsoft/autogen | 67/100 | 2.6 yr | 2 | 57k |
## What stands out
microsoft/autogen is the interesting outlier. 57k stars — more than most on this list — but only 2 commits in the last 30 days and a commitment score of 67. That's not a dead project, but it's the kind of divergence (high stars, low activity) that a purely social-proof metric misses entirely.
Stars reflect the past. Commit frequency reflects now.
huggingface/transformers scores 85 despite being the oldest project here (7.4 years). The weighting keeps this from being a pure longevity reward — age alone caps out at 30% of the score, and recent behavior carries the rest.
pydantic/pydantic-ai at 1.8 years scores 84 — the highest score among projects under 2 years old. 93 commits in 30 days from the Pydantic team, which already has a track record: their commitment record does the work that documentation claims can't.
crewAIInc/crewAI has 48k stars and 100 commits/month but scores 74. Reason: release cadence. The release scoring penalizes projects that iterate fast but don't tag stable versioned releases. Debatable design choice — but the behavior is: they ship a lot, don't version it cleanly.
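A release-cadence sub-score along these lines would produce exactly that penalty. This is a hypothetical sketch — the thresholds (12 tagged releases per year saturating the signal) are my assumptions, not the tool's:

```python
# Hypothetical release-cadence sub-score: steady tagged releases score
# high; projects that ship constantly but never tag score zero.
from datetime import datetime, timedelta, timezone

def cadence_subscore(release_dates: list, now=None) -> float:
    """release_dates: datetimes of tagged releases, any order."""
    if not release_dates:
        return 0.0  # ships code, never versions it
    now = now or datetime.now(timezone.utc)
    recent = [d for d in release_dates if now - d < timedelta(days=365)]
    # Assumed threshold: 12+ releases in the last year saturates the signal.
    return min(len(recent) / 12, 1.0)
```

Under this sketch, a repo with 100 commits a month and zero tags gets a 0.0 here, which at a 15% weight is enough to drop an otherwise strong score into the 70s.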
## The deeper point
Stars and documentation are content. Easy to create, hard to interpret.
Commit history, release cadence, and contributor growth are commitments. They require real time from real people. They're harder to fake at scale.
This is what I'm building at Proof of Commitment — a behavioral trust layer for AI agents and humans making decisions about who or what to trust.
## Try it yourself
The scoring tool is available as an MCP server. Zero install:
```json
{
  "mcpServers": {
    "proof-of-commitment": {
      "type": "streamable-http",
      "url": "https://poc-backend.amdal-dev.workers.dev/mcp"
    }
  }
}
```
Then ask Claude, Cursor, or any MCP client:
"Score these dependencies for me: langchain-ai/langchain, BerriAI/litellm, run-llama/llama_index"
The lookup_github_repo tool works on any public GitHub repo. Source: github.com/piiiico/proof-of-commitment
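If you'd rather pull the raw signals yourself, everything behind the score is public. This sketch reads a repo's metadata from the GitHub REST API (`GET /repos/{owner}/{repo}`); the field names are the real API's, but the signal extraction is my reconstruction, not the tool's code.

```python
import json
import urllib.request
from datetime import datetime, timezone

def fetch_repo(full_name: str) -> dict:
    """Fetch public metadata for a repo like 'langchain-ai/langchain'."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{full_name}",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def raw_signals(repo: dict) -> dict:
    """Extract the behavioral signals from the API response."""
    created = datetime.fromisoformat(repo["created_at"].replace("Z", "+00:00"))
    pushed = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    now = datetime.now(timezone.utc)
    return {
        "age_years": round((now - created).days / 365.25, 1),
        "years_since_push": round((now - pushed).days / 365.25, 1),
        "stars": repo["stargazers_count"],
        "archived": repo["archived"],
    }
```

Commit counts and contributor numbers need separate endpoints (`/commits`, `/contributors`), but this single call is enough to reproduce the longevity, staleness, and social-proof inputs.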
What would you add to a repo commitment score? I'm thinking about: issue response time, semantic versioning adherence, security advisory response. What matters to you when evaluating a dependency?