Getting AI agents to write code is not new anymore. The real problem is not how smart the model is. It is that agents lack good environments to work in over long sessions.
Harness Engineering is the field that works on this problem. In November 2025, Anthropic published a blog post about it (https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents). In February 2026, OpenAI followed (https://openai.com/index/harness-engineering/). OpenAI reported that a team of 7 people produced about 1 million lines of code across 1,500 PRs over 5 months, writing zero lines by hand (self-reported).
On X, a post calling Evaluation Engineering "the 10x skill of 2026" went viral. The engineer's job is shifting from "writing code" to "building environments where agents write good code."
Two Parts of Harness Engineering
Agent Harness is the execution side: the setup that lets agents work well over long sessions. It automates environment setup, carries progress between sessions through progress files and Git, builds one feature at a time, and runs E2E tests automatically.
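That loop can be sketched as a small driver that picks the next unfinished feature, runs its tests, and records progress only on a pass. This is a minimal sketch of the pattern, not code from either post; the file name `progress.json` and the feature schema are my own assumptions.

```python
import json
import pathlib

PROGRESS = pathlib.Path("progress.json")  # hypothetical progress file

def load_progress():
    """Read the set of completed feature ids, or start fresh."""
    if PROGRESS.exists():
        return json.loads(PROGRESS.read_text())
    return {"done": []}

def next_feature(features, state):
    """One feature at a time: return the first feature not yet marked done."""
    for f in features:
        if f["id"] not in state["done"]:
            return f
    return None

def run_e2e_tests(feature):
    """Placeholder for a real E2E test run (e.g. a pytest subprocess)."""
    return True  # assume passing, for the sketch

features = [
    {"id": "login-form", "description": "Add login form"},
    {"id": "logout-button", "description": "Add logout button"},
]

state = load_progress()
feature = next_feature(features, state)
if feature and run_e2e_tests(feature):
    # Only passing work is recorded, so the next session resumes cleanly.
    state["done"].append(feature["id"])
    PROGRESS.write_text(json.dumps(state, indent=2))
```

Because state lives in a file rather than in the model's context, a fresh session can pick up exactly where the last one stopped.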
Evaluation Harness is the quality side. How you score AI output with numbers, not feelings. EleutherAI has 60+ benchmarks. Inspect AI has 100+ pre-built evaluations. LLM-as-a-judge lets AI grade AI. These connect to CI/CD gates and safety testing (MLCommons AILuminate has 59,624 test prompts).
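As a minimal illustration of "numbers, not feelings" (my own sketch, not taken from any of the frameworks named above), an eval can be just a scoring function plus a pass threshold that a CI gate checks:

```python
def score_exact(prediction, reference):
    """1.0 if the normalized strings match, else 0.0 (simplest graded metric)."""
    return float(prediction.strip().lower() == reference.strip().lower())

def run_eval(samples, score_fn, threshold=0.8):
    """Score every sample; the boolean result is what a CI/CD gate would check."""
    scores = [score_fn(s["prediction"], s["reference"]) for s in samples]
    accuracy = sum(scores) / len(scores)
    return accuracy, accuracy >= threshold

samples = [
    {"prediction": "Paris", "reference": "paris"},
    {"prediction": "4", "reference": "4"},
    {"prediction": "blue", "reference": "red"},
]
accuracy, passed = run_eval(samples, score_exact, threshold=0.6)
# accuracy is 2/3; passed is True at this threshold
```

LLM-as-a-judge replaces `score_exact` with a model call that grades open-ended output against a rubric, but the gate logic stays the same.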
Anthropic's Approach: Session Handoff
Anthropic uses a two-step system. First, a setup agent makes init.sh and a feature list (JSON). Then a coding agent builds one feature at a time: code, test, commit, repeat. Between sessions, claude-progress.txt and Git history carry the work forward.
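Anthropic's post names the artifacts (init.sh, a JSON feature list, claude-progress.txt) but not their exact contents, so the schema below is my guess. The idea is that the handoff file summarizes session state in plain text so the next session can resume without re-reading everything:

```python
import datetime
import pathlib

# Hypothetical shape of the setup agent's feature list (schema assumed).
features = [
    {"id": 1, "name": "user-auth", "status": "done"},
    {"id": 2, "name": "billing", "status": "in_progress"},
    {"id": 3, "name": "admin-panel", "status": "todo"},
]

def write_handoff(features, path="claude-progress.txt"):
    """Write a plain-text summary of session state for the next session."""
    done = [f["name"] for f in features if f["status"] == "done"]
    todo = [f["name"] for f in features if f["status"] != "done"]
    lines = [
        f"Last session: {datetime.date.today().isoformat()}",
        f"Completed: {', '.join(done) or 'none'}",
        f"Remaining: {', '.join(todo) or 'none'}",
        "Next step: continue with the first remaining feature.",
    ]
    pathlib.Path(path).write_text("\n".join(lines))
    return lines

notes = write_handoff(features)
```

Together with Git history (each feature is its own commit), this file is the "memory" that survives between sessions.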
OpenAI's Approach: Repo-Wide Environment
AGENTS.md (about 100 lines) sets the rules for the whole repo. Custom linters and CI enforce those rules automatically. Instead of asking the AI nicely in a prompt, they make the tools force the rules.
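A custom linter in this spirit can be very small. Assuming a hypothetical AGENTS.md rule like "no TODO comments may be merged" (my example, not one of OpenAI's actual rules), a CI check might look like:

```python
import pathlib
import re
import sys

# The rule the hypothetical AGENTS.md states in prose; the regex enforces it.
RULE = re.compile(r"\bTODO\b")

def check_file(text):
    """Return the 1-based line numbers that violate the rule."""
    return [i for i, line in enumerate(text.splitlines(), start=1)
            if RULE.search(line)]

def check_repo(root="."):
    """Scan every Python file under root; map path -> violating lines."""
    violations = {}
    for path in pathlib.Path(root).rglob("*.py"):
        hits = check_file(path.read_text(errors="ignore"))
        if hits:
            violations[str(path)] = hits
    return violations

if __name__ == "__main__":
    bad = check_repo()
    for path, lines in bad.items():
        print(f"{path}: TODO found on lines {lines}")
    sys.exit(1 if bad else 0)  # nonzero exit fails the CI gate
```

The point is the nonzero exit code: the rule is no longer a request in a prompt but a hard gate the agent cannot merge past.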
The methods differ, but both companies reached the same conclusion: put knowledge in the repo, enforce rules with tools, and break work into small steps that leave a trail. Both approaches have limits, though. Anthropic's method is optimized for full-stack web development and has not been tested on scientific research or financial modeling. OpenAI's environment is highly customized to one repo and cannot be copied directly to other projects.
Models will keep getting smarter. But even the smartest model cannot sustain long-running development without a well-designed environment. The difference is not which model you pick. It is how you build the harness.
I cover AI agent designs, skills, and context engineering from the perspective of bringing AI into real teams and workflows. Analysis grounded in primary sources. Follow for more.
https://x.com/n_asuy


