DEV Community

Pico

I scored 14 popular AI frameworks on behavioral commitment — here's the data

When you're choosing an AI framework, what do you actually look at? Usually: stars, documentation quality, whether the README looks maintained.

All of that is stated signal: easy to manufacture, and it tells you nothing about whether the project will still exist in 18 months.

I built a tool that scores repos on behavioral commitment — signals that cost real time and money to fake. Here's what I found when I ran 14 of the most popular AI frameworks through it.

The methodology

Five behavioral signals, weighted by how hard they are to fake:

| Signal | Weight | Logic |
|---|---|---|
| Longevity | 30% | Years of consistent operation |
| Recent activity | 25% | Commits in the last 30 days |
| Community | 20% | Number of contributors |
| Release cadence | 15% | Stable versioned releases |
| Social proof | 10% | Stars (real people starring costs attention) |

Archived repos or projects with no pushes in 2+ years are penalized 50%.
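To make the weighting concrete, here's a minimal sketch of the scoring logic. The normalization (each signal pre-scaled to 0..1) and the function names are my own assumptions for illustration — the real tool's scaling may differ:

```python
# Hedged sketch of the weighted commitment score described above.
# Assumption: each signal is already normalized to the 0..1 range.
WEIGHTS = {
    "longevity": 0.30,   # years of consistent operation
    "activity":  0.25,   # commits in the last 30 days
    "community": 0.20,   # number of contributors
    "releases":  0.15,   # stable versioned releases
    "stars":     0.10,   # social proof
}

def commitment_score(signals: dict, stale: bool = False) -> int:
    """Weighted sum of normalized signals; stale repos lose 50%."""
    raw = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    if stale:  # archived, or no pushes in 2+ years
        raw *= 0.5
    return round(raw * 100)

# A repo maxing every signal scores 100; the same repo, stale, scores 50.
print(commitment_score({k: 1.0 for k in WEIGHTS}))              # 100
print(commitment_score({k: 1.0 for k in WEIGHTS}, stale=True))  # 50
```

The point of the weighting order: the harder a signal is to fake, the more it counts.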

The results

| Framework | Score | Age | 30d Commits | Stars |
|---|---|---|---|---|
| 🥇 openai/openai-python | 95/100 | 5.4 yr | 28 | 30k |
| 🥇 deepset-ai/haystack | 95/100 | 6.4 yr | 100 | 25k |
| 🥈 langchain-ai/langchain | 92/100 | 3.5 yr | 100 | 132k |
| 🥈 run-llama/llama_index | 92/100 | 3.4 yr | 100 | 48k |
| 🥈 agno-agi/agno | 92/100 | 3.9 yr | 100 | 39k |
| 🥉 anthropics/anthropic-sdk-python | 90/100 | 3.2 yr | 54 | 3k |
| microsoft/semantic-kernel | 87/100 | 3.1 yr | 32 | 28k |
| huggingface/transformers | 85/100 | 7.4 yr | 100 | 159k |
| BerriAI/litellm | 84/100 | 2.7 yr | 100 | 42k |
| pydantic/pydantic-ai | 84/100 | 1.8 yr | 93 | 16k |
| stanfordnlp/dspy | 82/100 | 3.2 yr | 32 | 33k |
| google/adk-python | 79/100 | 1.0 yr | 100 | 19k |
| crewAIInc/crewAI | 74/100 | 2.4 yr | 100 | 48k |
| microsoft/autogen | 67/100 | 2.6 yr | 2 | 57k |

What stands out

microsoft/autogen is the interesting outlier. 57k stars — more than most on this list — but only 2 commits in the last 30 days and a commitment score of 67. That's not a dead project, but it's the kind of divergence (high stars, low activity) that a purely social-proof metric misses entirely.

Stars reflect the past. Commit frequency reflects now.

huggingface/transformers scores 85 despite being the oldest project here (7.4 years). Longevity is only 30% of the score, so age alone can never carry a repo — old activity matters less than recent activity.

pydantic/pydantic-ai at 1.8 years scores 84 — the highest score among projects under 2 years old. 93 commits in 30 days from a Pydantic team with a track record: their commitment record does the work that documentation claims can't.

crewAIInc/crewAI has 48k stars and 100 commits/month but scores 74. The reason is release cadence: the scoring penalizes projects that iterate fast but don't tag stable versioned releases. That's a debatable design choice — but the underlying behavior is real: they ship a lot and don't version it cleanly.

The deeper point

Stars and documentation are content. Easy to create, hard to interpret.

Commit history, release cadence, and contributor growth are commitments. They require real time from real people. They're harder to fake at scale.

This is what I'm building at Proof of Commitment — a behavioral trust layer for AI agents and humans making decisions about who or what to trust.

Try it yourself

The scoring tool is available as an MCP server. Zero install:

```json
{
  "mcpServers": {
    "proof-of-commitment": {
      "type": "streamable-http",
      "url": "https://poc-backend.amdal-dev.workers.dev/mcp"
    }
  }
}
```

Then ask Claude, Cursor, or any MCP client:

"Score these dependencies for me: langchain-ai/langchain, BerriAI/litellm, run-llama/llama_index"

The lookup_github_repo tool works on any public GitHub repo. Source: github.com/piiiico/proof-of-commitment
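If you'd rather pull the raw signals yourself, the public GitHub REST API exposes most of them on the `/repos/{owner}/{name}` endpoint. A minimal sketch — the `summarize` helper and field selection are my own, not part of the tool, and unauthenticated requests are rate-limited:

```python
import json
import urllib.request
from datetime import datetime, timezone

def summarize(repo: dict) -> dict:
    """Reduce a GitHub /repos/{owner}/{name} payload to commitment signals."""
    created = datetime.fromisoformat(repo["created_at"].replace("Z", "+00:00"))
    age_years = (datetime.now(timezone.utc) - created).days / 365.25
    return {
        "age_years": round(age_years, 1),
        "stars": repo["stargazers_count"],
        "archived": repo["archived"],       # triggers the 50% penalty
        "last_push": repo["pushed_at"],     # staleness check: 2+ years idle
    }

def repo_signals(full_name: str) -> dict:
    """Fetch and summarize any public repo, e.g. 'BerriAI/litellm'."""
    url = f"https://api.github.com/repos/{full_name}"
    with urllib.request.urlopen(url) as resp:
        return summarize(json.load(resp))
```

Contributor counts and commits-in-the-last-30-days need the `/contributors` and `/commits?since=...` endpoints respectively, which paginate — that's where a dedicated tool earns its keep.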


What would you add to a repo commitment score? I'm thinking about: issue response time, semantic versioning adherence, security advisory response. What matters to you when evaluating a dependency?
