
Florian Liao

Originally published at github.com

I got tired of telling Codex “continue”, so I built Long Long Run

I built a Codex skill called Long Long Run (LLR):

https://github.com/huahuadeliaoliao/long-long-run

The short version: I got tired of Codex stopping in the middle of long tasks when it clearly already knew the next step.

This is not meant as a complaint about Codex. It is more of a dev-log about a pattern I kept running into, and the small harness I built around Codex skills and hooks to make long-running work feel less fragile.

The problem

A pattern I kept seeing:

  1. I spend time clarifying a project with Codex.
  2. We agree on a detailed plan and acceptance criteria.
  3. Codex starts implementing.
  4. It finishes one useful chunk.
  5. It summarizes the next step itself.
  6. Then it still asks me whether to continue.

At some point, typing "continue" over and over started to feel absurd.

The strange part is that Codex usually knows what to do next. It can describe the next step. It can explain what is still unfinished. But it still stops after finishing a local piece of work.

For short tasks, this is fine. For long tasks, it breaks the rhythm.

I wanted Codex to keep moving when:

  • the objective is not complete
  • the work is not blocked
  • the user did not ask it to stop
  • the next action is clear
  • the next action is still covered by the agreed task

So I built Long Long Run.

What Long Long Run is

LLR is a plug-and-play Codex skill for long-running agent work.

It is not a project management system.

It is not a memory archive.

It is not trying to make Codex run forever.

It is a small agent-first harness built around two modes:

  • INC
  • ACTIVE

INC: Intent Noise Cancellation

INC means Intent Noise Cancellation.

The name comes from ANC, like noise-canceling headphones.

The idea is simple: many bad agent results do not come from the model being weak. They come from the user prompt being noisy.

Sometimes I do not understand the domain well enough.

Sometimes I cannot describe what I actually want.

Sometimes I forget hidden constraints.

Sometimes I do not know what a good acceptance criterion should look like.

If Codex starts implementing immediately, it may complete the task I literally described, while missing the thing I actually needed.

INC asks Codex to slow down first.

In INC mode, Codex can:

  • inspect the repo
  • explore the domain
  • identify hidden requirements
  • surface assumptions
  • identify risks
  • discover expert framing
  • propose hard acceptance criteria
  • build an evidence-backed contract before implementation

The goal is to reduce intent noise before execution.

This has become one of the most useful parts of LLR for me. I now spend more effort in the INC phase, because the better the contract is, the less surprising ACTIVE becomes.
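
To make that concrete, here is a rough sketch of what an INC-style contract could capture. This is my own illustration in Python, not the skill's actual format, and the field names are made up:

```python
from dataclasses import dataclass, field

@dataclass
class Contract:
    """Hypothetical shape of an INC contract: the agreed objective,
    surfaced assumptions and risks, and hard acceptance criteria."""
    objective: str
    assumptions: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)

# Example contract built during INC, before the user authorizes ACTIVE.
contract = Contract(
    objective="Add retry logic to the upload client",
    assumptions=["Uploads are idempotent on the server side"],
    risks=["Retries could double-submit if idempotency breaks"],
    acceptance_criteria=["Existing tests still pass", "The retry path has its own test"],
)
```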

ACTIVE: authorized mainline continuation

ACTIVE means the user has explicitly approved the contract and authorized Codex to pursue it as the mainline.

This is where LLR uses hooks.

When Codex tries to stop during ACTIVE, the stop guard reminds it to check:

  • Is the objective complete?
  • Is the work blocked?
  • Did the user ask to stop?
  • Did new evidence change the contract?
  • Is there still a clear next action covered by the current contract?

If the next step is clear, Codex continues.

This is not about endless automation. It is about avoiding premature stopping.
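
Distilled into a sketch (my own pseudocode in Python, not the actual hook implementation), the stop guard is basically one decision:

```python
def should_continue(objective_complete: bool,
                    blocked: bool,
                    user_asked_to_stop: bool,
                    new_evidence_changed_contract: bool,
                    next_action_in_contract: bool) -> bool:
    """Hypothetical distillation of the ACTIVE stop-guard checklist."""
    if objective_complete or blocked or user_asked_to_stop:
        return False
    if new_evidence_changed_contract:
        # The contract needs renegotiation; drop back to INC instead of pushing on.
        return False
    # Keep going only if the next step is clear and still covered by the contract.
    return next_action_in_contract
```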

The most common failure mode I wanted to fix was:

Codex states the next step itself, but still stops and asks me whether to continue.

LLR turns that into:

If the next step is clear and authorized, continue.

Evidence chain instead of only checkpoints

Long-running agent tasks produce a lot of artifacts:

  • code
  • tests
  • scripts
  • docs
  • temporary probes
  • reports
  • intermediate experiments
  • explanations that sounded reasonable at the time

Later, review becomes difficult. Which artifacts matter? Which assumptions were overturned? Which facts still support the current plan?

This is why LLR emphasizes an evidence chain.

A checkpoint says what happened.

An evidence chain says what still matters.

If old evidence is overturned, it should be removed or replaced in the current evidence chain. The history can remain in a checkpoint, but stale evidence should not keep steering the task.

That makes long-running Codex work easier to review and easier to resume.
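
A minimal sketch of the difference, with names I made up for illustration: checkpoints are append-only history, the evidence chain is the current set of facts, and overturned evidence gets replaced instead of accumulating.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    claim: str    # what we currently believe
    source: str   # the artifact that supports it (test run, doc, probe)

class EvidenceChain:
    """Hypothetical evidence chain: only facts that still support the plan."""

    def __init__(self) -> None:
        self.current: dict[str, Evidence] = {}
        self.checkpoints: list[str] = []  # append-only record of what happened

    def record(self, key: str, evidence: Evidence) -> None:
        # Overturned evidence is replaced, not kept alongside the new fact.
        if key in self.current:
            self.checkpoints.append(f"overturned: {self.current[key].claim}")
        self.current[key] = evidence
        self.checkpoints.append(f"recorded: {evidence.claim}")
```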

Side-thread tolerance

During long tasks, I often remember something late.

Maybe I forgot to mention a constraint.

Maybe I need Codex to answer an urgent side question.

Maybe a premise changes while ACTIVE is running.

Normal conversation can pull the model away from the mainline. LLR cannot magically solve the underlying attention problem, but it can help Codex tolerate it.

In ACTIVE mode, Codex is reminded to:

  1. answer the user's latest message first
  2. decide whether it changes the contract
  3. if not, resume the authorized mainline

It feels less like forcing Codex down a rigid path, and more like giving it a compass.

It can walk onto a side path, handle the interruption, and return to the road.
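
Sketched as a tiny routine (again my own illustration, not the skill's internals):

```python
def handle_side_thread(answer: str, changes_contract: bool) -> str:
    """Hypothetical side-thread rule: answer first, then either renegotiate
    the contract or return to the authorized mainline."""
    if changes_contract:
        return f"{answer} -> contract changed, go back to INC"
    return f"{answer} -> resume the ACTIVE mainline"
```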

Exploration before validation

Another thing I noticed: Codex search often behaves like answer validation.

If you ask for the latest instance segmentation models, it may immediately search for a keyword it already knows, like SAM2.

That is not necessarily wrong. But it is not how a person explores an unfamiliar field.

A person usually starts from task-level keywords, discovers the vocabulary of the field, and only then validates candidate methods.

LLR's INC guidance nudges Codex to derive discovery keywords from:

  • the user's wording
  • repo vocabulary
  • file names
  • data labels
  • metrics
  • failure symptoms
  • quality bar
  • tools and ecosystem terms

This helps Codex explore before it validates.
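
As a toy sketch (hypothetical helper, not LLR code), the point is that discovery queries come from task-level context, not from an answer the model already has in mind:

```python
def discovery_keywords(user_wording: list[str],
                       repo_vocab: list[str],
                       metrics: list[str],
                       failure_symptoms: list[str]) -> list[str]:
    """Hypothetical: derive exploration queries from the task's own vocabulary
    instead of jumping straight to a known model name like SAM2."""
    seeds = user_wording + repo_vocab + metrics + failure_symptoms
    seen: set[str] = set()
    # Deduplicate while preserving order; these seed exploration, not validation.
    return [k for k in seeds if not (k in seen or seen.add(k))]

# e.g. discovery_keywords(["instance segmentation"], ["mask", "annotation"],
#                         ["mAP"], ["small objects get merged"])
```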

Current state

LLR is still early. I released v0.1.0 recently.

Repo:

https://github.com/huahuadeliaoliao/long-long-run

Origin story:

https://github.com/huahuadeliaoliao/long-long-run/blob/main/docs/why-long-long-run.md

The skill has already changed how I use Codex:

  • I spend more time clarifying goals in INC.
  • I spend less time babysitting execution.
  • I can ask side questions without losing the mainline.
  • I can review long tasks through evidence instead of only history.
  • Codex stops less often when it already knows what to do next.

Final thought

A good agent harness should not think for the agent.

It should help the agent stay clear-headed.

That is what Long Long Run tries to do: give Codex a clearer contract, a current map, an evidence chain, and a guard against stopping too early.

Curious if others are running into the same long-running Codex problems, and whether you solve them through prompts, hooks, /goal, custom skills, or something else.
