Sequoia Capital recently published a blog post arguing that AGI has been achieved because "Long-horizon agents are functionally AGI". Around the same time, the Cursor team published their experiments with long-running agents that coded a web browser from scratch.
My recent reflections on the past year made me realize what a huge stride AI coding has made over the course of just one year.
In the spirit of agentic coding and long-horizon execution, here's my recent experiment using OpenCode and GPT-5.2 Codex (predominantly at the high reasoning level, occasionally switching to medium and xhigh)...
Approach: the main dialogue (or session, in OpenCode terms) acts as an orchestrator agent; you explicitly ask it to delegate individual tasks to sub-agents (OpenCode uses its built-in task tool for that), verify them, and integrate the results. Why? Because we don't want to hit the model's context window limit. Though it could be an interesting experiment to rely on a single long thread with compaction kicking in from time to time.
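To make the pattern concrete, here's a minimal Python sketch of the idea. This is not OpenCode's actual implementation; `run_subagent` is a hypothetical stand-in for its built-in task tool. The point is that each task runs in a fresh sub-agent context, and only a short result summary flows back into the orchestrator's thread.

```python
from dataclasses import dataclass, field


def run_subagent(task: str) -> str:
    # Stub for illustration only; in reality this would be a separate
    # LLM session with its own (disposable) context window.
    return f"done: {task}"


@dataclass
class Orchestrator:
    """Keeps a compact main thread; sub-agents burn their own context."""
    history: list[str] = field(default_factory=list)  # orchestrator-visible context only

    def delegate(self, task: str) -> str:
        summary = run_subagent(task)   # the sub-agent's full transcript is discarded
        self.history.append(summary)   # only the summary enters the main thread
        return summary


if __name__ == "__main__":
    orch = Orchestrator()
    for t in ["rewrite the provider", "add tests", "tidy up the docs"]:
        orch.delegate(t)
    print(orch.history)  # summaries, not sub-agent token traces
```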
Task: rewrite a previously vibe-coded litellm provider that implements a cascade of requests to several LLMs (following strategies such as Mixture-of-Agents or LLM Council) before returning a final response.
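For context, a custom litellm provider is essentially a `CustomLLM` subclass registered via `litellm.custom_provider_map`. Below is a minimal sketch of the cascade idea, not my actual implementation: the model names are placeholders, the "collect drafts, then synthesize" aggregation is the simplest possible Mixture-of-Agents variant, and it assumes litellm passes the chat messages to the handler as a keyword argument.

```python
import litellm
from litellm import CustomLLM, ModelResponse


class CascadeLLM(CustomLLM):
    """Sketch of a Mixture-of-Agents-style cascade: query several worker
    models, then ask an aggregator model to synthesize a final answer."""

    workers = ["gpt-4o-mini", "claude-3-5-haiku-20241022"]  # placeholder model names
    aggregator = "gpt-4o"                                   # placeholder model name

    def completion(self, *args, **kwargs) -> ModelResponse:
        messages = kwargs["messages"]  # assumption: messages arrive as a kwarg
        drafts = [
            litellm.completion(model=m, messages=messages).choices[0].message.content
            for m in self.workers
        ]
        synthesis = messages + [{
            "role": "user",
            "content": "Candidate answers:\n---\n" + "\n---\n".join(drafts)
                       + "\n---\nSynthesize a single best answer.",
        }]
        return litellm.completion(model=self.aggregator, messages=synthesis)


# Register the provider so it can be addressed as "cascade/<model-name>".
litellm.custom_provider_map = [{"provider": "cascade", "custom_handler": CascadeLLM()}]

response = litellm.completion(
    model="cascade/default",
    messages=[{"role": "user", "content": "Explain backpressure in one paragraph."}],
)
print(response.choices[0].message.content)
```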
Results:
- About 4 hours of pure agent work time
- Orchestrator session — $4.13, 157k tokens of dialogue length by the end of the task
- 16 sub-agent sessions — $9.73
- Total spend: $13.86, about 2M tokens
- 26 files changed in Git
- Only 5 tests written, all green (some Kiro+Sonnet/Opus combo would probably have gone wild and generated a hundred tests doing no real work)
- The app works: the provider executes multiple LLM queries and aggregates a final response, and the Streamlit dashboard shows the recorded traces (a sketch of the dashboard follows below)
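As an aside, the trace dashboard is roughly this shape. The sketch below is illustrative only, with an invented file layout and invented field names rather than the actual trace schema.

```python
import json
from pathlib import Path

import pandas as pd
import streamlit as st

TRACE_DIR = Path("traces")  # hypothetical location: one JSON file per provider call

st.title("Cascade provider traces")

records = [json.loads(p.read_text()) for p in sorted(TRACE_DIR.glob("*.json"))]
if not records:
    st.info("No traces recorded yet.")
else:
    st.dataframe(pd.DataFrame(records))    # assumed flat fields: model, latency, tokens
    idx = st.selectbox("Inspect a trace", range(len(records)))
    st.json(records[idx])                  # raw request/response payload for one call
```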
While doing the work, the agents made plenty of tool calls, crawled the codebase, edited files and, most importantly, tested the changes they made (often the changes didn't work and the agents had to fix what was broken).
Those ~4 hours of agent time took about half an hour of human effort and ~10 user messages, with 6 major human-in-the-loop touchpoints:
- Discuss the scope, formulate a requirements .md
- Kick off the work by explicitly asking it to delegate to sub-agents and to make sure the tests are green
- Ask to run a real case with actual LLM interaction
- At the xhigh reasoning level, ask it to analyze the failure of the real LLM interaction test case and propose a fix plan
- Run the fix loop with real LLM interactions
- Finishing touches: ask it to fix the failing tests and tidy up the docs
The orchestrator/sub-agents approach effectively made it possible to fit 2 million tokens' worth of work into a 157K-token main thread with the orchestrator; there's still room to spare, given that GPT-5.2 Codex has a 400K context window.
P.S. I liked OpenCode a lot, more than I liked Codex.



