SlopCodeBench: New HuggingFace Paper Shows AI Coding Agents Degrade — Here's How to Track It via API
A new paper just dropped on HuggingFace that every developer building with AI coding tools needs to read: SlopCodeBench (2603.24755).
The headline finding: AI coding agents produce code that gets progressively worse with each iteration. Verbosity rises in 89.8% of trajectories. Structural erosion rises in 80%. No agent tested solved any problem end-to-end.
But here's the thing — this is actually a huge opportunity for developers who understand what's happening.
## What SlopCodeBench Measures
Traditional coding benchmarks test single-shot solutions. SlopCodeBench does something harder: it forces agents to extend their own prior code as specifications evolve — exactly what happens in real software development.
The researchers tracked two quality signals:
- Verbosity: redundant/duplicated code (rises in 89.8% of agent trajectories)
- Structural Erosion: complexity concentrated in few functions (rises in 80%)
Measured against 48 open-source Python repos, agent-written code is 2.2× more verbose than human code. The highest checkpoint solve rate across 11 models? Just 17.2%.
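The paper's exact verbosity metric isn't reproduced here, but a rough local proxy — comparing non-blank token counts between two versions of the same function — can flag the same trend without any API call. Everything below (the function name, the token-count heuristic) is illustrative, not from the paper:

```python
def verbosity_ratio(code_a: str, code_b: str) -> float:
    """Ratio of non-blank tokens in code_b vs code_a (>1 means b is more verbose)."""
    def count(src: str) -> int:
        return sum(len(line.split()) for line in src.splitlines() if line.strip())
    return count(code_b) / max(count(code_a), 1)

v1 = "def f(data):\n    return [x for x in data if x > 0]"
v2 = ("def f(data):\n    result = []\n    for x in data:\n"
      "        if x > 0:\n            result.append(x)\n    return result")
print(verbosity_ratio(v1, v2))  # > 1: the rewrite is more verbose
```

A ratio creeping upward across iterations is exactly the 89.8%-of-trajectories pattern the benchmark reports.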
## Building Quality Tracking Tools with NexaAPI
The solution isn't to avoid AI coding tools — it's to build smarter ones that track quality over time. Here's how to do it with NexaAPI (50+ AI models, one API key, 5× cheaper than official pricing):
### Python Example
````python
# pip install nexaapi | https://pypi.org/project/nexaapi/
from nexaapi import NexaAPI
import json

client = NexaAPI(api_key='YOUR_API_KEY')

def analyze_code_quality(code: str, iteration: int) -> dict:
    """Track verbosity and erosion — the metrics SlopCodeBench uses."""
    result = client.chat.completions.create(
        model='claude-3-5-sonnet',  # 50+ models via NexaAPI
        messages=[
            {
                "role": "system",
                "content": (
                    "Analyze code for SlopCodeBench metrics. "
                    "Return JSON: {\"verbosity\": 0-10, \"erosion\": 0-10, "
                    "\"extensibility\": 0-10, \"issues\": [...]}"
                )
            },
            {
                "role": "user",
                "content": f"Iteration {iteration}:\n```python\n{code}\n```"
            }
        ]
    )
    return json.loads(result.choices[0].message.content)

# Track degradation across iterations
v1 = "def filter_positive(data):\n    return [x for x in data if x > 0]"
v2 = (
    "def filter_positive(data):\n"
    "    result = []\n"
    "    for x in data:\n"
    "        if x is not None:\n"
    "            if x > 0:\n"
    "                result.append(x)\n"
    "    return result"
)

q1 = analyze_code_quality(v1, 1)
q2 = analyze_code_quality(v2, 2)
print(f"v1 verbosity: {q1['verbosity']}/10")
print(f"v2 verbosity: {q2['verbosity']}/10")
# Cost: ~$0.002/analysis via NexaAPI
````
### JavaScript Example
```javascript
// npm install nexaapi | https://npmjs.com/package/nexaapi
import NexaAPI from 'nexaapi';

const client = new NexaAPI({ apiKey: 'YOUR_API_KEY' });

async function analyzeCodeQuality(code, iteration) {
  const result = await client.chat.completions.create({
    model: 'claude-3-5-sonnet', // Available via NexaAPI
    messages: [
      {
        role: 'system',
        content: 'Analyze code for SlopCodeBench metrics. Return JSON: {"verbosity": 0-10, "erosion": 0-10, "extensibility": 0-10}'
      },
      { role: 'user', content: `Iteration ${iteration}:\n\`\`\`\n${code}\n\`\`\`` }
    ]
  });
  return JSON.parse(result.choices[0].message.content);
}

// Use in your CI pipeline to catch degradation early
const quality = await analyzeCodeQuality(yourCode, iterationNumber);
if (quality.verbosity > 7) {
  console.warn('⚠️ High verbosity detected — refactor before next iteration');
}
```
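Whichever helper produces the scores, the CI gate itself can be a few lines of plain Python. This sketch blocks a pipeline when any score crosses a ceiling or verbosity worsens three iterations in a row — the threshold of 7 and the three-iteration trend rule are illustrative choices, not from the paper:

```python
THRESHOLD = 7  # scores above this (0-10 scale) block the pipeline

def should_block(history: list[dict]) -> bool:
    """Block when the latest scores exceed THRESHOLD, or when verbosity
    has worsened for three consecutive iterations."""
    latest = history[-1]
    if latest["verbosity"] > THRESHOLD or latest["erosion"] > THRESHOLD:
        return True
    tail = [h["verbosity"] for h in history[-3:]]
    return len(tail) == 3 and tail[0] < tail[1] < tail[2]

scores = [
    {"verbosity": 3, "erosion": 2},
    {"verbosity": 5, "erosion": 3},
    {"verbosity": 6, "erosion": 4},
]
if should_block(scores):
    print("⚠️ Quality trending down — refactor before the next iteration")
```

In a real pipeline you would persist `history` between runs (a JSON artifact or build metadata) and call `sys.exit(1)` instead of printing.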
## Why NexaAPI?
Instead of managing separate API keys for Claude, GPT-4o, Gemini, and others:
| | NexaAPI | Direct APIs |
|---|---|---|
| Models | 50+ | 1 per provider |
| Price | 5× cheaper | Full price |
| Setup | 5 minutes | 30+ minutes |
| SDK | OpenAI-compatible | Provider-specific |
- 🌐 Website: https://nexa-api.com
- 🔌 RapidAPI: https://rapidapi.com/user/nexaquency
- 📦 Python: `pip install nexaapi` | [PyPI](https://pypi.org/project/nexaapi/)
- 📦 Node.js: `npm install nexaapi` | [npm](https://npmjs.com/package/nexaapi)
- 📄 Paper: SlopCodeBench (2603.24755)
Information sourced from https://huggingface.co/papers/2603.24755 | Fetched: 2026-03-27