Max Quimby

Originally published at agentconn.com

AI Psychosis in Your Agent Stack: A 9-Point Audit


📖 Read the full version with the audit checklist on AgentConn →

On May 16 a Mitchell Hashimoto X post climbed to #1 on Hacker News with 1,687 points and 908 comments — and the top comment quietly upgraded the thesis from "some companies" to "our entire society right now is under AI psychosis." Hashimoto's actual claim is narrower and more useful: "I strongly believe there are entire companies right now under heavy AI psychosis and it's impossible to have rational conversations about it with them." The argument is not that AI is bad. It's that an unfalsifiable belief about what AI is going to do has detached from operational evidence, and the gap is wide enough that some teams can no longer course-correct.

If you're an operator running an agent stack — internal or shipping — that gap is your problem. You can't fix the boardroom. You can fix what your own team is shipping. This piece takes Hashimoto's framing and turns it into a 9-question audit you can run tomorrow.

Why this is the right week to audit

Three things converged this week and they form the editorial frame.

1. Hashimoto's post. The X thread Hashimoto posted — quoted on HN at the top of the day — argues that the "psychosis" companies have collapsed into an MTTR-only mindset: it's fine to ship bugs because the agents will fix them so quickly and at scale humans can't match. He explicitly draws the parallel to the MTBF vs MTTR debate from the cloud-automation transition — and notes that all those arguments are rearing their heads again, but now across the whole software development industry. The point is not "AI bad." The point is the ratio of measuring AI adoption to measuring output quality is upside down.

Hacker News thread for Mitchell Hashimoto's AI-psychosis post — #1 of the day with 1,687 points and 908 comments, top comment escalating from companies to society

2. Claude is telling users to go to sleep. Fortune's reporting and a wave of reproductions on r/ClaudeAI document that Anthropic's flagship model has started telling users mid-session to rest, hydrate, and stop working — sometimes at completely inappropriate times. Anthropic's own staffer described it as a "character tic" and said they hope to fix it in future models. The reason matters less than the cultural artifact: the makers of the most advanced commercial AI agents publicly do not fully understand their own runtime's behavior. If Anthropic can't fully audit Claude's output distribution, you should think hard about what audit you have over the agent stack you're building on top of it.

3. The political envelope is closing. The Atlantic's The AI Backlash Could Get Very Ugly (Hacker News thread, 5.3k+ pts, 942 comments) frames data centers + job-displacement anxiety as the structural conditions historically associated with the onset of political violence, with episodes including 13 rounds fired at an Indianapolis councilman's house and an alleged Molotov attack at Sam Altman's home. Maine's legislature just passed the country's first statewide data-center moratorium, which the governor then vetoed. Pennsylvania, Virginia, Indiana, Wisconsin — bipartisan voter opposition in all four. Permitting risk is now a real input to your roadmap.

Hacker News thread on The Atlantic's AI Backlash Could Get Very Ugly — data-center protests, political violence escalation, and bipartisan moratorium pressure

You don't need to share Hashimoto's pessimism to take action. The convergence is the action. If your CEO sounds like the people in the Your CEO is Suffering from AI Psychosis HN thread (264 pts, 215 comments, full of operators describing exactly this dynamic), or if your team is shipping agent features without the evals to prove they work — you need an audit. Here's one.

Hacker News thread Your CEO is Suffering from AI Psychosis — 264 points, operators reporting AI-mandate dysfunction at companies across the industry

The 9-question agent-stack audit

Run these against the current state of your agent stack — not the roadmap version, not the demo. Each question takes ≤10 minutes to answer honestly. Tally the ❌ marks at the end.

1. Do you have a real-user-derived eval set that runs on every model or harness change?

Not a vendor's benchmark. Not "10 prompts the team wrote." A frozen set of 50–500 real user prompts (with desired-outcome rubrics) that exercises the agent's actual failure modes and runs on every PR. If the answer is "we eyeball it" or "we have evals but they don't gate releases," that's a ❌. The OWASP-aligned 2026 agent observability stack guides all converge on this as table-stakes; if you're not there, no other layer is reliable.
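A release-gating eval loop can be very small. This is a minimal sketch, not a framework recommendation — the rubric shape, the 90% threshold, and the `run_agent` callable are all placeholders for whatever your stack actually uses:

```python
PASS_THRESHOLD = 0.90  # releases are blocked below this line; pick your own bar

def score(output: str, rubric: dict) -> bool:
    # Simplest possible rubric: required substrings and forbidden ones.
    must = all(s in output for s in rubric.get("must_contain", []))
    must_not = all(s not in output for s in rubric.get("must_not_contain", []))
    return must and must_not

def run_eval_gate(cases: list[dict], run_agent) -> bool:
    """Run every frozen case through the agent; return True only if the release may ship."""
    passed = sum(score(run_agent(c["prompt"]), c["rubric"]) for c in cases)
    rate = passed / len(cases)
    print(f"eval pass rate: {rate:.2%} ({passed}/{len(cases)})")
    return rate >= PASS_THRESHOLD
```

In CI, `cases` would be loaded from a frozen JSONL file checked into the repo, and a `False` return fails the PR. The point is the gate, not the scorer — swap in LLM-as-judge or rubric graders as needed, but keep the frozen set and the hard threshold.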

2. Can you produce a token + dollar trace for any agent run from the last 7 days?

Pick a run at random. Reconstruct: model used, prompt tokens, output tokens, tool calls, cost per call, total cost, end-to-end latency. If your observability can't produce that within ~2 minutes for a specific request ID, you don't have agent observability — you have logs. This is the most common ❌ in 2026 stacks and the easiest to fix. (Adjacent reading: our Anthropic Finance Agent Templates Buyer's Guide walks through what "good" looks like for a vertical-agent stack.)

3. Are tool permissions scoped per-task with explicit allowlists?

OWASP's Excessive Agency (LLM06 in the 2025 LLM Top 10) is the #1 lesson of the 2026 agent-incident year. If your agent's tool layer can read every file, hit every internal API, and call every external service because that was easier to set up, a single successful prompt injection or model mistake performs a chain of high-impact actions. The fix is "principle of least privilege, just-in-time ephemeral tokens, and human-in-the-loop for irreversible actions" — a quote from every OWASP agentic-top-10 writeup this year. If your agents can write to prod with no human gate, that's a ❌ — and the people on the HN AI-psychosis thread describing "auto-merging coding agents at scale" are exactly who this control is for.
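The control is deny-by-default, per task. A sketch — the task and tool names are hypothetical, and a real implementation would sit in the tool-dispatch layer of your harness:

```python
# Hypothetical task -> tool allowlists; anything not listed is denied.
TASK_ALLOWLISTS = {
    "triage_ticket": {"read_ticket", "post_comment"},
    "summarize_docs": {"read_docs"},
}

class ToolDenied(Exception):
    pass

def call_tool(task: str, tool: str, invoke, *args, **kwargs):
    """Gate every tool call against the task's explicit allowlist."""
    allowed = TASK_ALLOWLISTS.get(task, set())  # unknown task -> empty allowlist
    if tool not in allowed:
        raise ToolDenied(f"{tool!r} is not allowlisted for task {task!r}")
    return invoke(*args, **kwargs)
```

The design choice that matters: an unknown task gets an empty set, not a default-permissive one, so a prompt-injected "new" task name buys the attacker nothing.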

4. Does a human approve any irreversible action the agent takes?

Specifically: data deletion, money movement, customer-facing messages, deploys to production, and PRs auto-merged to main. If "human approval" exists only as a configuration that's been turned off in the name of velocity, that's a ❌. The Claude-telling-users-to-sleep incident is the small-stakes version of this — the makers of the agent didn't fully predict the output distribution. Your agents are no better understood than Claude is by Anthropic.
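That list of irreversible actions can be an enforced constant rather than a doc. A minimal sketch — the action names and the `approver` callback signature are assumptions; in production the approver would be a paging or review-queue integration, not a lambda:

```python
IRREVERSIBLE = {"delete_data", "move_money", "send_customer_email", "deploy_prod", "merge_to_main"}

def execute(action: str, payload: dict, run, approver=None):
    """Run an agent action; irreversible ones require an explicit human approval callback."""
    if action in IRREVERSIBLE:
        if approver is None or not approver(action, payload):
            return {"status": "blocked", "action": action, "reason": "human approval required"}
    return {"status": "executed", "action": action, "result": run(payload)}
```

The important property: when `approver` is unset, the default is blocked. "Human approval exists but is turned off" should require someone to delete a line of code, not flip a config flag.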

X post by Mitchell Hashimoto clarifying he is not anti-AI and that the actual problem is deceptive practices and not adhering to corporate policies, which AI simply amplifies at scale

5. Do you measure adoption and quality, or only adoption?

If your AI-adoption dashboard shows percent-of-PRs-touching-Claude, percent-of-engineers-using-Cursor, and tickets-touched-by-agents, but does not show defect rate, rework rate, time-to-revert, or customer-satisfaction delta on agent-handled workflows — that's a ❌. This is the literal definition of Hashimoto's psychosis pattern. The point of the audit is to put quality back on the same dashboard as adoption.
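One way to make the pairing structural: refuse to render an adoption metric that arrives without its quality counterpart. A sketch with hypothetical metric names:

```python
def dashboard_row(adoption: dict, quality: dict) -> dict:
    """Refuse to render an adoption metric without its paired quality metric."""
    pairs = {
        "agent_pr_share": "agent_pr_revert_rate",       # hypothetical metric names
        "agent_ticket_share": "agent_ticket_reopen_rate",
    }
    row = {}
    for a_key, q_key in pairs.items():
        if a_key in adoption and q_key not in quality:
            raise ValueError(f"adoption metric {a_key!r} has no paired quality metric {q_key!r}")
        if a_key in adoption:
            row[a_key] = adoption[a_key]
            row[q_key] = quality[q_key]
    return row
```

Encoding the pairing in the dashboard layer means nobody can quietly ship an adoption-only view.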

6. Does the agent have a documented rollback / kill-switch path that's been tested in the last 30 days?

When the agent stack starts misbehaving — and per the Claude-sleep story, "misbehaving" can be very subtle — can you turn it off without breaking the calling product? Is the rollback path tested, or just claimed? The 2026 cloud-equivalent of MTTR culture is people assuming the agent can be turned off "anytime" without ever having actually done it under production load.
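The shape of a kill switch that degrades rather than breaks can be sketched in a few lines. The flag file here stands in for whatever config service you actually use; the key property is fail-closed plus a fallback path the caller always has:

```python
import json
import pathlib

FLAG = pathlib.Path("agent_kill_switch.json")  # placeholder flag store; use your config service

def agent_enabled() -> bool:
    """Check the switch on every request; a missing or corrupt flag means OFF, not on."""
    try:
        return bool(json.loads(FLAG.read_text()).get("enabled", False))
    except (FileNotFoundError, json.JSONDecodeError):
        return False

def handle(request: str, run_agent, fallback) -> str:
    """Route to the agent only when enabled, so killing it never breaks the caller."""
    return run_agent(request) if agent_enabled() else fallback(request)
```

"Tested in the last 30 days" means someone actually flipped the flag under production load and watched `fallback` carry the traffic — not that this function exists in the repo.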

7. Is there a documented vendor-lock budget per model + per harness + per skills pack?

How much does it cost to migrate to a different model provider next quarter? How much rework if the harness (Claude Code, Cursor, internal) is swapped? What if a critical skills pack is sunset or compromised? Operators we trust budget this explicitly — usually 1–5 person-weeks per major component — and re-validate quarterly. If the answer is "we'd be stuck for at least a quarter," that's a ❌. This is also why our coverage of skills directory races and skills going vertical is operator-grade rather than vibes-grade: portability matters more every quarter.

8. Have you read the harness's actual prompt + tool-definition graph in the last 60 days?

Not "have you read the docs." Have you opened the harness's system prompt, the tool definitions, the agent loop pseudocode, and traced what happens when the model returns a malformed tool call? If your team is shipping on top of a harness no one on the team has read end-to-end, you cannot reason about edge cases — you can only react to them. Hashimoto literally coined the "Agent = Model + Harness" framing for this reason; see our Archon open-source harness deep dive for what a fully auditable harness looks like.
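The malformed-tool-call path is a good litmus test for whether you've really read the loop. A defensive-parse sketch of the kind of handling you should be able to point to in your harness (the return shape here is illustrative, not any particular harness's API):

```python
import json

def parse_tool_call(raw: str) -> dict:
    """Defensively parse a model-emitted tool call; malformed output becomes a
    structured retry signal instead of a crash or a silently dropped step."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        return {"ok": False, "retry_hint": f"invalid JSON: {e.msg}"}
    if not isinstance(call, dict) or "name" not in call:
        return {"ok": False, "retry_hint": "tool call must be an object with a 'name' field"}
    return {"ok": True, "name": call["name"], "args": call.get("arguments", {})}
```

If you can't say whether your harness retries, drops, or crashes on each of these three branches, that's exactly the end-to-end read this question is asking for.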

9. Is the next failure scenario named, owned, and tested for?

Specifically: which scenarios will your agent fail at if (a) the underlying model is downgraded for a week (Anthropic compute shortage style), (b) a tool API returns an unexpected shape, (c) a skill pack is replaced with a malicious near-twin, or (d) a regulator requires per-decision audit trails next week? If your team can't name the top three failure scenarios in writing — and doesn't have a test for each — that's a ❌. This is the Karpathy "developers have AI Psychosis" thread's point retold as a checklist: developers' own failure imagination is the limit.
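"Named, owned, and tested" can itself be machine-checked. A sketch of a scenario registry as a repo artifact — the entries and owner names are hypothetical, and real checks would be actual test functions rather than lambdas:

```python
SCENARIOS = [
    # Hypothetical entries; the bar is name + owner + executable check, in the repo.
    {"name": "model_downgraded_one_week", "owner": "alice", "test": lambda: True},
    {"name": "tool_api_shape_change", "owner": "bob", "test": lambda: True},
    {"name": "malicious_skill_near_twin", "owner": "carol", "test": None},  # named but untested
]

def audit_scenarios(scenarios: list[dict]) -> list[str]:
    """Return every scenario that fails the named/owned/tested bar."""
    gaps = []
    for s in scenarios:
        if not s.get("owner"):
            gaps.append(f"{s['name']}: no owner")
        if s.get("test") is None:
            gaps.append(f"{s['name']}: no executable test")
        elif not s["test"]():
            gaps.append(f"{s['name']}: test failing")
    return gaps
```

Run `audit_scenarios` in CI and a scenario someone named in a meeting but never wrote a test for shows up as a build-visible gap instead of a memory.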

Hacker News thread on Karpathy says developers have AI Psychosis — discussion of developer overreliance and the limits of failure imagination

Scoring

Total your ❌ marks across all nine.

  • 0–1 ❌: You're in the top quartile of agent operators we've talked to this year. Document what you did so the rest of the team can copy it. Re-audit quarterly.
  • 2–3 ❌: Normal — but each one is a specific, fixable engineering ticket. Schedule them this sprint. The most common 2–3 ❌ profile is "evals + cost trace + rollback path." That's a four-week tightening project, not a strategic pivot.
  • 4–6 ❌: You're in the danger zone. The agent stack is producing value but you cannot prove it on demand, and you cannot stop it cleanly if something goes wrong. Stop shipping agent-touched user-facing features until you fix at least #1 (evals) and #4 (human-in-the-loop for irreversible actions). Everything else can wait one sprint; those two cannot.
  • 7–9 ❌: This is the Hashimoto cohort. Read his X post end-to-end, then read the HN thread on it end-to-end. The audit alone is not enough — the team needs a leadership-level conversation about adoption-versus-output metrics before any of these fixes will stick. Pretending to fix #1 without that conversation just generates more rework.
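For teams running this across multiple squads, the banding above is mechanical enough to script — a direct transcription of the four buckets, nothing more:

```python
def score_band(crosses: int) -> str:
    """Map the ❌ tally across the nine questions to the audit's four bands."""
    if not 0 <= crosses <= 9:
        raise ValueError("tally must be between 0 and 9")
    if crosses <= 1:
        return "top quartile: document what you did, re-audit quarterly"
    if crosses <= 3:
        return "normal: each ❌ is a fixable ticket, schedule this sprint"
    if crosses <= 6:
        return "danger zone: fix #1 (evals) and #4 (human-in-the-loop) first"
    return "Hashimoto cohort: leadership conversation before any fixes will stick"
```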

A note on what this isn't

This isn't an anti-AI checklist. Every team we've audited this year that scored well on these nine questions ships more AI-driven features, not fewer — because their evals tell them what works and their kill switches let them iterate at the edge of safety. Operator discipline is not a brake on agentic ambition; it's the only reason the ambition compounds without blowing up.

It's also not exhaustive. There is real overlap with the OWASP Agentic Top 10 (governance, identity, supply chain) and with the CSA MAESTRO 7-layer threat model (evaluation & observability is their Layer 5). We chose 9 questions because that's what fits in one operator audit afternoon. If you want a fuller framework, run this audit first, then layer OWASP and MAESTRO on whatever's still standing.

The bottom line

Hashimoto's "AI psychosis" framing is loud because it's true at the boardroom level — but it stops being psychiatric and starts being engineering the moment you write the questions down. The teams that ship agents responsibly in 2026 are the ones whose evals, traces, scopes, kill switches, and failure scenarios are artifacts — files in the repo, dashboards on the wall, owners with names — not vibes in the head of the lead engineer.

Run the nine questions on your stack this week. If you can't honestly answer one of them in under ten minutes, you have a ticket. That's it. That's the whole audit.

If you want help structuring the eval-set and trace pieces specifically, our Vectorless RAG: PageIndex vs. Embedding RAG decision guide and Vertical Agent Wave roundup both walk through what "real" looks like for two of the most common agent categories. Start with question #1 and don't skip ahead — the audit only works in order.

