When you're choosing an AI framework, what do you actually look at? Usually: stars, documentation quality, whether the README looks maintained.
All of that is stated signal. Easy to manufacture, doesn't tell you if the project will exist in 18 months.
I built a tool that scores repos on behavioral commitment — signals that cost real time and money to fake. Here's what I found when I ran 14 of the most popular AI frameworks through it.
## The methodology
Five behavioral signals, weighted by how hard they are to fake:
| Signal | Weight | Logic |
|---|---|---|
| Longevity | 30% | Years of consistent operation |
| Recent activity | 25% | Commits in the last 30 days |
| Community | 20% | Number of contributors |
| Release cadence | 15% | Stable versioned releases |
| Social proof | 10% | Stars (real people starring costs attention) |
Archived repos or projects with no pushes in 2+ years are penalized 50%.
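Put together, the scoring reduces to a weighted sum plus the staleness penalty. Here is a minimal sketch using the weights from the table above; how each raw signal is normalized into a 0–1 sub-score is my assumption, not the tool's actual formula.

```python
# Weights from the methodology table. Sub-score normalization is assumed.
WEIGHTS = {
    "longevity": 0.30,
    "recent_activity": 0.25,
    "community": 0.20,
    "release_cadence": 0.15,
    "social_proof": 0.10,
}

def commitment_score(subscores: dict, archived_or_stale: bool = False) -> int:
    """subscores: each signal normalized to [0, 1]."""
    score = 100 * sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)
    if archived_or_stale:   # archived, or no pushes in 2+ years
        score *= 0.5        # the 50% penalty
    return round(score)

# Example: strong everywhere except release cadence
print(commitment_score({
    "longevity": 0.9, "recent_activity": 1.0, "community": 0.8,
    "release_cadence": 0.3, "social_proof": 0.7,
}))
```

Note how the penalty is multiplicative: a stale project can never score above 50, no matter how strong its history looks.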
## The results
| Framework | Score | Age | 30d Commits | Stars |
|---|---|---|---|---|
| 🥇 openai/openai-python | 95/100 | 5.4 yr | 28 | 30k |
| 🥇 deepset-ai/haystack | 95/100 | 6.4 yr | 100 | 25k |
| 🥈 langchain-ai/langchain | 92/100 | 3.5 yr | 100 | 132k |
| 🥈 run-llama/llama_index | 92/100 | 3.4 yr | 100 | 48k |
| 🥈 agno-agi/agno | 92/100 | 3.9 yr | 100 | 39k |
| 🥉 anthropics/anthropic-sdk-python | 90/100 | 3.2 yr | 54 | 3k |
| microsoft/semantic-kernel | 87/100 | 3.1 yr | 32 | 28k |
| huggingface/transformers | 85/100 | 7.4 yr | 100 | 159k |
| BerriAI/litellm | 84/100 | 2.7 yr | 100 | 42k |
| pydantic/pydantic-ai | 84/100 | 1.8 yr | 93 | 16k |
| stanfordnlp/dspy | 82/100 | 3.2 yr | 32 | 33k |
| google/adk-python | 79/100 | 1.0 yr | 100 | 19k |
| crewAIInc/crewAI | 74/100 | 2.4 yr | 100 | 48k |
| microsoft/autogen | 67/100 | 2.6 yr | 2 | 57k |
## What stands out
microsoft/autogen is the interesting outlier. 57k stars — more than most on this list — but only 2 commits in the last 30 days and a commitment score of 67. That's not a dead project, but it's the kind of divergence (high stars, low activity) that a purely social-proof metric misses entirely.
Stars reflect the past. Commit frequency reflects now.
huggingface/transformers scores 85 despite being the oldest project here (7.4 years). The weighting keeps this from being a pure longevity reward — age alone caps out at 30% of the score, and recent behavior carries the rest.
pydantic/pydantic-ai at 1.8 years scores 84 — the highest score among projects under 2 years old. 93 commits in 30 days from the Pydantic team, which already has a track record: their commitment record does the work that documentation claims can't.
crewAIInc/crewAI has 48k stars and 100 commits/month but scores 74. Reason: release cadence. The release scoring penalizes projects that iterate fast but don't tag stable versioned releases. Debatable design choice — but the behavior is: they ship a lot, don't version it cleanly.
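A release-cadence sub-score along these lines would produce exactly that penalty. This is a hypothetical sketch — the thresholds (12 tagged releases per year saturating the signal) are my assumptions, not the tool's:

```python
# Hypothetical release-cadence sub-score: steady tagged releases score
# high; projects that ship constantly but never tag score zero.
from datetime import datetime, timedelta, timezone

def cadence_subscore(release_dates: list, now=None) -> float:
    """release_dates: datetimes of tagged releases, any order."""
    if not release_dates:
        return 0.0  # ships code, never versions it
    now = now or datetime.now(timezone.utc)
    recent = [d for d in release_dates if now - d < timedelta(days=365)]
    # Assumed threshold: 12+ releases in the last year saturates the signal.
    return min(len(recent) / 12, 1.0)
```

Under this sketch, a repo with 100 commits a month and zero tags gets a 0.0 here, which at a 15% weight is enough to drop an otherwise strong score into the 70s.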
## The deeper point
Stars and documentation are content. Easy to create, hard to interpret.
Commit history, release cadence, and contributor growth are commitments. They require real time from real people. They're harder to fake at scale.
This is what I'm building at Proof of Commitment — a behavioral trust layer for AI agents and humans making decisions about who or what to trust.
## Try it yourself
The scoring tool is available as an MCP server. Zero install:
```json
{
  "mcpServers": {
    "proof-of-commitment": {
      "type": "streamable-http",
      "url": "https://poc-backend.amdal-dev.workers.dev/mcp"
    }
  }
}
```
Then ask Claude, Cursor, or any MCP client:
"Score these dependencies for me: langchain-ai/langchain, BerriAI/litellm, run-llama/llama_index"
The lookup_github_repo tool works on any public GitHub repo. Source: github.com/piiiico/proof-of-commitment
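If you'd rather pull the raw signals yourself, everything behind the score is public. This sketch reads a repo's metadata from the GitHub REST API (`GET /repos/{owner}/{repo}`); the field names are the real API's, but the signal extraction is my reconstruction, not the tool's code.

```python
import json
import urllib.request
from datetime import datetime, timezone

def fetch_repo(full_name: str) -> dict:
    """Fetch public metadata for a repo like 'langchain-ai/langchain'."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{full_name}",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def raw_signals(repo: dict) -> dict:
    """Extract the behavioral signals from the API response."""
    created = datetime.fromisoformat(repo["created_at"].replace("Z", "+00:00"))
    pushed = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    now = datetime.now(timezone.utc)
    return {
        "age_years": round((now - created).days / 365.25, 1),
        "years_since_push": round((now - pushed).days / 365.25, 1),
        "stars": repo["stargazers_count"],
        "archived": repo["archived"],
    }
```

Commit counts and contributor numbers need separate endpoints (`/commits`, `/contributors`), but this single call is enough to reproduce the longevity, staleness, and social-proof inputs.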
What would you add to a repo commitment score? I'm thinking about: issue response time, semantic versioning adherence, security advisory response. What matters to you when evaluating a dependency?