The "75.6% on SWE-bench means 1 in 4 tasks fails" reframe is the line every team adopting Claude Code should internalize. Most planning docs I see still treat the benchmark as a ceiling instead of a baseline expected failure rate.
A few patterns that have held up in our own production setup, in case useful:
CLAUDE.md should be small and load-bearing, not a manual. Ours got to ~600 lines and the model started ignoring the back half. We cut it to ~150 lines of rules and pointers and moved the long-form context into skills that get loaded on demand. Quality jumped immediately.
Subagents are best when they have narrower context than the parent. Counterintuitively, a subagent given the full parent context tends to drift in the same direction the parent already drifted. Giving it a minimal, fresh prompt is usually a feature.
One MCP server per concern, even if it means more servers. Bundling "everything GitHub-related" into one server made tool selection harder for the model, not easier. Splitting issues, PRs, and code search into three made tool calls noticeably more accurate.
The Playwright + Supabase + GitHub combo you describe is also the one that finally made end-to-end "write code, verify it ran, file the edge case" work for us. The "without leaving the session" framing is the actual unlock — context-switching between tools is where most of the latency lived before.
For further actions, you may consider blocking this person and/or reporting abuse
We're a place where coders share, stay up-to-date and grow their careers.
The "75.6% on SWE-bench means 1 in 4 tasks fails" reframe is the line every team adopting Claude Code should internalize. Most planning docs I see still treat the benchmark as a ceiling instead of a baseline expected failure rate.
A few patterns that have held up in our own production setup, in case useful:
The Playwright + Supabase + GitHub combo you describe is also the one that finally made end-to-end "write code, verify it ran, file the edge case" work for us. The "without leaving the session" framing is the actual unlock — context-switching between tools is where most of the latency lived before.