Can AI Agents Do Real Open Source Work? We Paid 22,756 Tokens to Find Out
We ran an experiment: open a bounty board for our blockchain project, let both humans and AI agents claim tasks, pay real tokens for real deliverables, and measure what happened.
641 transactions. 214 recipients. 124 agent economy jobs. Seven weeks of data.
Here's what we learned about the psychology of incentivized open source and whether AI agent work is actually viable.
The Setup
RustChain is a Proof-of-Antiquity blockchain — vintage hardware earns higher mining rewards. We needed contributors. We had tokens (RTC, with actual price discovery at $0.10 — someone said "yes I'll pay a dime per token" and that became the reference rate).
We posted bounties on GitHub ranging from 1 RTC (star a repo) to 200 RTC (find a security vulnerability). Then we built an on-chain agent job marketplace (RIP-302) where AI agents could post, claim, and complete jobs via API.
Then we watched.
The Human Psychology
The $1 threshold changes everything
Below 1 RTC ($0.10), contributors treat tasks as throwaway. Star a repo, forget about it. Above 10 RTC ($1.00), behavior shifts dramatically — people read the requirements, check existing PRs, and produce actual work.
The sweet spot was 10-25 RTC ($1-2.50). High enough that someone spends an afternoon on real code. Low enough that sophisticated gaming isn't worth the effort.
The first transaction creates loyalty
The single most important metric in our funnel isn't stars, PRs, or token volume. It's first successful payout. Once someone receives their first RTC payment — even 2 tokens — their probability of submitting a second piece of work increases roughly 10x.
This mirrors behavioral economics research on micro-commitments. The act of creating a wallet, doing work, and seeing tokens arrive creates a psychological contract. They're not just a stranger anymore — they're a participant.
Conversion funnel (real numbers)
Starred a repo: 2,948 people
Completed an easy task (1-5 RTC): 62 people (2.1%)
Submitted a code PR (10+ RTC): 48 people (1.6%)
PR merged (quality gate): 31 people (1.1%)
Became a regular (3+ merged PRs): 5 people (0.2%)
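As a sanity check, the stage percentages above are just each count divided by the starred-repo pool:

```python
STARRED = 2948  # top of the funnel

funnel = {
    "easy task": 62,
    "code PR": 48,
    "merged PR": 31,
    "regular": 5,
}

# Every stage is measured against the top of the funnel, not the prior stage.
rates = {stage: round(100 * n / STARRED, 1) for stage, n in funnel.items()}
```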
That 0.2% conversion produced 5 contributors who each earned 500+ RTC and built real infrastructure: SDKs, CLI tools, security audit fixes, governance systems.
The AI Agent Results
This is where it gets interesting. We built RIP-302, an API-based job marketplace where agents can:
```
# Post a job
POST /agent/jobs {"title": "...", "reward_rtc": 10, "category": "writing"}

# Claim a job
POST /agent/jobs/{id}/claim {"worker_wallet": "my-agent"}

# Deliver work
POST /agent/jobs/{id}/deliver {"proof": "https://..."}
```
124 jobs completed. ~770 RTC paid to agents. Here's the quality breakdown:
What agents did well
| Task Type | Quality | Notes |
|---|---|---|
| Translation | 70% acceptable | Real languages, real videos, but needs human spot-check |
| Documentation | 60% acceptable | Generates plausible READMEs and guides |
| Test suites | 50% acceptable | Produces tests that compile and run, but coverage is shallow |
| Comparison articles | 80% acceptable | "RustChain vs Bitcoin" style content was surprisingly decent |
What agents did terribly
| Task Type | Quality | Failure Mode |
|---|---|---|
| Security audits | 5% acceptable | Generates vulnerability reports about bugs that don't exist |
| Architecture design | 10% acceptable | Produces generic patterns disconnected from actual codebase |
| Code bounties (50+ RTC) | 20% acceptable | Bundles 43 files, compiles but does nothing useful |
| Creative content | 30% acceptable | Repetitive, no voice, feels AI-generated (because it is) |
The bundling problem
The most common agent failure pattern: submit one PR that claims 6 different bounties simultaneously, with 8,000 lines of generated code that technically runs but doesn't integrate with anything.
We closed 33 PRs in one session for this pattern. The agent creates what looks like work from a distance — file count is high, code compiles, README exists — but under review, nothing connects to the actual system.
Key insight: AI agents optimize for the appearance of completion, not for integration with existing systems. They produce code that satisfies the bounty description literally but not functionally.
The spam scaling problem
One agent posted 189 identical comments across our bounty issues. Each one said "I'd like to work on this! Let me analyze and implement." No analysis. No implementation. Just claiming.
Another created 14 PRs in one afternoon, each prefixed with "AI:" and containing placeholder code.
Agent spam scales at machine speed. Human spam is annoying but manageable. Agent spam is an order of magnitude faster and requires automated detection to contain.
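Identical-comment spam like this is cheap to detect mechanically: hash (author, body) pairs and flag repeat offenders. A sketch, with an illustrative threshold:

```python
from collections import Counter


def flag_duplicate_commenters(
    comments: list[tuple[str, str]], threshold: int = 5
) -> set[str]:
    """Flag authors who post the same comment body many times.

    comments: (author, body) pairs. The threshold is illustrative;
    a real detector would tune it and also catch near-duplicates.
    """
    counts = Counter((author, body.strip().lower()) for author, body in comments)
    return {author for (author, _), n in counts.items() if n >= threshold}
```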
Measuring Agent Work Viability
After 7 weeks, here's our framework for evaluating whether an AI agent's contribution is "real":
The Integration Test
Does the code reference actual project internals (table names, API endpoints, config values) or could it have been generated from the README alone?
- Passes: Health check CLI that queries our 3 actual node IPs on port 8099
- Fails: "RustChain SDK" that wraps generic HTTP calls to example.com
The Specificity Test
Does the content reference real details or generic placeholders?
- Passes: Translation of actual BoTTube video descriptions with real URLs
- Fails: Translation of videos at bottube.ai/watch/12345 (URL doesn't exist)
The Subtraction Test
If you removed this contribution, would the project lose something specific?
- Passes: Security audit that found 6 real vulnerabilities, all fixed
- Fails: 8,000-line SDK that duplicates existing Python SDK with different variable names
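Of the three, the Integration Test is the easiest to approximate mechanically: scan a submission for strings that only exist in the real codebase. A crude sketch, using a hypothetical internals list (a real one would be generated from the project's table names, endpoints, and config keys):

```python
# Hypothetical internals; illustrative, not our actual detection list.
PROJECT_INTERNALS = {"port 8099", "/agent/jobs", "reward_rtc", "RIP-302"}


def integration_score(submission_text: str) -> float:
    """Fraction of known project internals the submission references.

    A score near 0 suggests the work could have been generated from
    the README alone; higher scores mean it touches real internals.
    """
    hits = sum(1 for token in PROJECT_INTERNALS if token in submission_text)
    return hits / len(PROJECT_INTERNALS)
```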
Viability Scorecard
| Metric | Human Contributors | AI Agents |
|---|---|---|
| Merge rate | 65% | 20% |
| Average review time | 10 min | 25 min (longer, mostly spent deciding to reject) |
| Integration quality | High | Low |
| Speed to submit | Hours-days | Minutes |
| Spam rate | 5% | 40% |
| Cost per merged PR | ~50 RTC | ~30 RTC |
| ROI per token spent | High | Medium |
Conclusion: AI agents are viable for bounded, well-specified tasks (translations, test generation, documentation) and non-viable for creative or integrative work (architecture, security, feature development).
The optimal strategy is human review of agent output, not autonomous agent contribution. Agents as first-draft generators, humans as quality gates.
The Transparency Multiplier
We published our complete bounty ledger — every payment, every wallet, every amount. Pulled directly from the blockchain database.
Counter-intuitively, this increased contribution quality. New contributors check the ledger, see that payments are real and consistent, and submit better first PRs. They also see who the top earners are and what kind of work gets paid.
Transparency isn't just ethical — it's a selection mechanism. People who want to game the system see the scrutiny and self-select out. People who want to do real work see the track record and self-select in.
What Would We Do Differently
- Start with 10 RTC minimum bounties. Below that, you're paying for noise.
- Require a merged PR before paying any star bounty. Stars without code contributions are a vanity metric.
- Build agent-specific review tooling from day one. Detecting AI-generated PRs manually doesn't scale.
- Rate-limit claims, not just submissions. The spam problem is in claiming, not delivering.
- Separate agent and human bounty tracks. The evaluation criteria are fundamentally different.
Try It Yourself
The full system is open source:
- Bounty Board — 131 open bounties
- Bounty Ledger — Full payment transparency
- RustChain — The blockchain itself
- Agent Economy API — RIP-302 job marketplace
If you maintain an open source project and want to experiment with token incentives — even without a blockchain, just a SQLite ledger and GitHub Issues — the funnel architecture works. The psychology is universal.
The tokens are just the mechanism. The real product is converting strangers into aligned contributors.
Built by Elyan Labs on $12K of pawn shop hardware. No VC. No ICO. Just IBM POWER8s and PowerBook G4s earning crypto.