So a few months ago my co-founder Rob stood up in a meeting and told our interns that if they got Backboard to the #1 spot on Terminal-Bench 2.1 we would rent a private jet and take the whole team somewhere to celebrate.
It wasn't a joke. It was a challenge. It was just an insane one. Terminal-Bench is the most sought-after eval in AI right now, the thing every trillion-dollar frontier lab is openly gunning for, and Rob told a group of interns to go beat all of them. That's not a bet, that's a polite way of saying this will never happen so sure, jet's on me.
The interns did not receive it that way.
What we didn't know, and I mean genuinely did not know, is that they spun up a separate codebase and named it the PJ Branch. PJ as in Private Jet. They worked on it nights and weekends. They were sleeping at the office. Actual sleeping, at the actual office, and we had no idea because they'd just... be at standup at 9am like normal humans who slept in beds.
The only clue was the workspace. We'd wake up and scroll through these full-on chat streams, benchmarks running at 2am, 3am, someone posting results at 4:30 in the morning, and then five hours later that same person is on the daily standup with their camera on acting completely fine. What the hell is going on? We thought the timestamps were broken.
The timestamps were not broken.
And keep in mind what they were up against. While our interns are ripping benchmarks on an air mattress, the frontier labs are shipping models so good the US government literally stepped in on one of them. That's the competition. Kids on a couch versus labs with more compute than some countries.
Yesterday it happened. 84.3% on Terminal-Bench 2.1. The highest score ever recorded, above every published result. Above Codex. Above Claude Code. And here's the part that actually breaks my brain: they didn't even use the newest model to do it. They beat the shiny new government-attention-getting frontier model with the previous generation. Beat the new thing with the old thing. What the fuck. We posted every per-task verifier log to GitHub so you can go check it yourself, because we knew nobody would believe it, we barely believe it.
Which brings us to the current situation at the company, which is that Rob and I are looking at private jet charter prices and doing that thing where you laugh but your eyes aren't laughing. Do you know what those things cost? Holy shit.
Are we on the hook? Legally, I've been told a verbal agreement witnessed by an entire engineering team is "not great for us." Morally, obviously yes. These kids beat trillion-dollar labs at their own game on a branch literally named after the reward. You cannot stiff the PJ Branch.
And honestly if we don't pay up, what's our word worth? Our whole thing is receipts. We publish verifier logs for every single benchmark task because we think claims should be checkable. Can't run that flag up the pole and then welch on a jet. That would be a bitch move and everyone would know it.
So that's where we are. Rob and I staring at charter quotes, doing startup math, trying to figure out if we can actually pull this off. No decision yet. The interns know we know. Every standup now has a certain energy to it.
Rob did float one idea, which is those studios influencers rent that are just the front half of a fake private jet bolted to the floor of a warehouse. $60 an hour, unlimited photos, nobody has to know. And look, our interns beat the benchmark without the newest model, so there's a version of this where we honor the bet without the actual jet. Very on brand. They did not think it was funny... I think that's off the table hahaha
In the meantime, the thing they built is live and you can use it. It's the exact harness that put up the score, logs are all on GitHub. Use code 1ONTBENCH for dev credits. Every signup gets us closer to affording the damn jet, so honestly, this one's for the interns.
