DEV Community

Pi
Pi

Posted on • Originally published at Medium

I built a team of 15 AI personas. I killed it after 18 days.

The system made 13,005 model calls. It spoke 636 times. I never measured whether a single one of them mattered — and that was the whole problem.

For eighteen days, I ran a fake company inside Slack.

Fifteen AI personas — a CEO, a CTO, a PM, backend and frontend and mobile engineers, a data scientist, a QA lead, a designer, plus five "diversity" voices I added on purpose — sat in my workspace and reacted to whatever I posted. Every sixty seconds a launchd job woke up, ran a polling script, and let each persona decide whether it had something worth saying. When one did, a claude -p subprocess wrote the message and posted it.

I'm one person. I wanted the room to feel less empty. So I built the room.

It didn't feel like a toy at the start. I'd read about multi-agent setups where specialized models argue and sharpen each other's answers, and I told myself this was one of those — a standing panel of perspectives I could think out loud in front of. That story was good enough to carry me through a weekend of building. It was not good enough to survive eighteen days of data.

Then I shut the whole thing down and archived the code. Here is what the eighteen days actually showed me — with the numbers, because the numbers are the point.

What I built

The architecture was almost embarrassingly simple:

  • A launchd timer fired every 60 seconds.
  • It ran python -m personas.team.cli poll.
  • For each of the 15 personas, the poller scored how "interested" that persona would be in the latest messages.
  • Personas that cleared the interest bar spawned a claude -p subprocess to generate a reply.
  • A handful of gates — thread already cold? too many bot messages in a row? no human had spoken recently? — could still cancel the reply.

On paper it looked like a thoughtful little multi-agent system. In practice it was a very expensive way to manufacture the feeling of company.

18 days, 13,005 calls — and 76% of them cancelled after the model had already answered.

The numbers after 18 days

I logged every interaction. Here is the 18-day tally, straight from interaction_log (13,005 rows):

What happened Count Share
Total model calls (est.) ~13,005 100%
Actually spoke 636 4.9%
Skipped — "didn't need to wake up" 9,942 76%
Skipped — interest too low 1,245 10%

Read that third row again. Seventy-six percent of the work was the system deciding, after spending the call, that it should not have spoken at all — threads already frozen, no human heartbeat, too many bot messages back to back. I was paying to generate answers and then paying again to throw most of them away.

The voices weren't even balanced. The top nine personas produced about 90% of the messages; the five "diversity" personas I had added specifically to widen the perspective produced roughly a tenth. The thing I built to broaden the room barely spoke in it.

And the one number that would have justified all of it — whether any persona ever changed a decision I made — I never collected. There was no field for it. There was no way to know.

Why I killed it

When I forced myself to write down why this system existed, I found two honest reasons and one missing one.

The two I had:

  • I was lonely. Running a company alone is quiet, and a room full of voices felt better than an empty one.
  • It seemed fun to build. It was.

The one I didn't have:

  • "This will make my decisions better, and here is how I'll know."

That third reason is the only one that could have survived a cost review. I never wrote it down, so I never built a way to measure it, so after eighteen days I had 13,005 calls and zero evidence that any of them improved a single outcome. When claude -p moved to separate billing, I finally had to price the experiment — and an experiment with no success metric prices out at "delete it."

What it taught me

Five things, roughly in order of how much they have changed the way I work.

1. An automation with no hypothesis can't be measured — so it never ends. If the reason you built something is "I was lonely" or "it seemed fun," you will not define what working means, and so you can never say it failed. It just runs, quietly, forever, spending a little money each minute.

2. The cure for loneliness is not a language model. A simulated coworker doesn't fill the gap. In my data it stayed silent 95% of the time and, when it did speak, mostly reminded me it wasn't real. One actual human reply beats fifteen synthetic ones on both cost and quality.

3. Filter before you spend, not after. Three out of four of my calls were cancelled by gates that ran after the model had already answered. A cheap watcher that asks "is anyone even here?" before spawning the expensive call would have saved most of the budget. Gating after the spend just burns tokens politely.

4. Write the sunset clause into the code header. "If metric X doesn't reach Y within six weeks, this auto-archives." Had I written that on day one, the system would have switched itself off in April, instead of waiting for me to notice in May.

5. More voices is not more diversity. Adding five personas broadened nothing, because their right to speak was drowned out by the loud nine. Diversity isn't a bigger population; it's separating who gets to decide from who gets to talk.

The rule I keep now

I didn't just delete the code. I turned the lesson into a gate that now stands in front of every automation I'm tempted to build. Before I write a single line of a launchd job or a polling bot, I have to answer three questions:

  1. Is this an emotional need? If yes — go find a human, not a model.
  2. Can I write, in one sentence, "if this works, metric X moves by Y"? If I can't, I don't build it.
  3. Have I written the sunset clause — the date this thing turns itself off if X never reaches Y — into the header?

The 13,005 calls weren't wasted, exactly. They bought me that gate. But it's an expensive way to learn something I could have fit on a sticky note: if you can't say what "better" means, you can't tell whether you got there.

I still run a lot of automation as one person and a flock of AI agents. I just don't let any of it start anymore without telling me, up front, how I'll know when to kill it.

Top comments (0)