DEV Community: KingGyu

I open-sourced Codex Spark: traceable UI delegation for Codex

KingGyu — Wed, 06 May 2026 16:49:29 +0000

I open-sourced Codex Spark, a Codex plugin for delegating concrete Computer Use and Browser Use tasks to GPT-5.3 Codex Spark subagents.

Repo: https://github.com/KingGyuSuh/awesome-codex-spark

The Problem

As Codex sessions get better at long-horizon reasoning, the bottleneck is not always "can the model click the button?"

Often the better question is:

Should the most reasoning-heavy model spend its context and tokens on mechanical UI work?

If the parent session is doing architecture, code review, release verification, or product reasoning, I want it focused there. A visible-world task like opening a page, reading UI state, pasting approved content, or filling one approved form should be delegated as bounded execution.

The Pattern

Codex Spark uses one skill: $codex-spark-delegate.

The parent session remains responsible for:

understanding the user request;
choosing exactly one surface: Computer Use or Browser Use;
confirming exact side effects;
setting the target, content, limits, and verification criteria;
reading the returned trace and deciding recovery.

The Spark child is not a planner. It is an executor.

The Trace Is The Interface

The child must return:

status;
trace id;
tool surface;
target;
model config;
steps;
observations;
verification;
artifacts;
blockers;
next step.

That trace matters because UI work fails in partial ways. A form might submit but not visibly persist. A rich-text editor might accept pasted text but corrupt non-ASCII characters. A browser tool might be unavailable. The parent needs evidence, not a vague "done."
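To make the trace concrete, here is a minimal sketch of it as a Python dataclass. The field names mirror the list above, but the types, the status values other than "blocked", and the `parent_accepts` helper are my assumptions for illustration, not the plugin's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical status values; the post itself only names "blocked".
VALID_STATUSES = ("done", "partial", "blocked")


@dataclass
class SparkTrace:
    status: str
    trace_id: str
    tool_surface: str          # "computer_use" or "browser_use" (assumed spellings)
    target: str
    model_config: Dict[str, str]
    steps: List[str] = field(default_factory=list)
    observations: List[str] = field(default_factory=list)
    verification: str = ""
    artifacts: List[str] = field(default_factory=list)
    blockers: List[str] = field(default_factory=list)
    next_step: str = ""

    def __post_init__(self):
        # Reject anything outside the small, known status vocabulary.
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status}")


def parent_accepts(trace: SparkTrace) -> bool:
    """Parent-side check: accept only a finished trace that carries
    verification evidence and reports no blockers."""
    return trace.status == "done" and bool(trace.verification) and not trace.blockers
```

With a shape like this, a bare "done" without verification text is not enough for the parent to move on, which is the point of treating the trace as the interface.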

What It Does Not Do

Codex Spark is intentionally narrow.

It does not ship X, Reddit, Gmail, or other domain-specific executors. Those belong in separate plugins. It also does not silently replace Browser Use with HTTP scraping or another automation surface. If the requested surface is unavailable, the child reports blocked.

Why I Built It

The useful split is:

reasoning-heavy parent model, for example GPT-5.5 xhigh, handles judgment;
Codex Spark handles bounded visible-world execution;
the trace is the join point.

That lets the strongest reasoning stay focused on design, code, and verification while Spark handles the mechanical UI/browser work.

Repo: https://github.com/KingGyuSuh/awesome-codex-spark

Autarch: AI Strategy Evolution, Deterministic Trade Execution

KingGyu — Sat, 02 May 2026 18:48:12 +0000

Disclaimer: this is research and architecture software, not financial advice. The bundled strategies are not a profitability claim. Cryptocurrency trading involves significant risk.

TL;DR

Autarch is an open-source Bybit USDT perpetual trading workbench built around one boundary:

"LLM trading" does not have to mean an LLM presses Buy or Sell. It can mean an LLM evolves future strategy while deterministic code owns live execution.

Claude/Codex agents generate, review, backtest, and rank strategy candidates. A Python asyncio runner executes selected strategy files with no LLM calls in the live loop. The handoff is visible through strategy manifests, signal code, leaderboards, active/next pointers, cached data, and append-only evidence logs.

GitHub:

https://github.com/KingGyuSuh/autarch

Why I Built It

Most "AI trading bot" framing collapses two very different jobs into one box.

In older discretionary trading, a trader might personally decide and execute each order. In modern quant trading, that is usually not the shape of the work. Humans design policies, define constraints, validate assumptions, monitor behavior, and let execution systems place trades under those rules.

That distinction matters for LLMs.

If an LLM participates in a trading system, the interesting question is not "can the model press the Buy button?" The more useful question is: can the model help evolve future policy without sitting inside the live execution path?

Research work benefits from generative systems. They can inspect evidence, form hypotheses, critique candidate strategies, compare backtests, and revise code.

Live execution has a different job. It should be explicit, bounded, inspectable, and boring.

Autarch is my attempt to preserve both truths in one architecture.

The project started from a simple design rule:

Let AI improve the strategy, but do not let generative uncertainty directly own irreversible execution.

That rule shaped the whole repository. Autarch is not "an LLM that trades for you." It is a workbench where AI agents can propose, review, and backtest strategies, while live execution stays deterministic, inspectable, and bounded by explicit risk controls.

System paper:

https://github.com/KingGyuSuh/autarch/blob/main/docs/AUTARCH.md

The Problem

Generative models are useful in research loops. They can search through hypotheses, explain tradeoffs, inspect logs, write candidate code, compare results, and critique their own work.

Live execution has different needs. It benefits from narrow responsibility, explicit state, deterministic behavior, and clear authority boundaries.

Those two sets of requirements should not be forced into the same runtime path.

In a trading system, that distinction matters. A model can be useful for strategy evolution without being allowed to improvise in the live order path.

The Autarch Split

Autarch is organized into two planes with an evidence boundary between them.

Evolution Plane

The Evolution Plane is where Claude/Codex harness agents work.

In the current implementation, the harness runs producer/reviewer pairs:

trade-strategy creates or revises strategy candidates.
backtest evaluates the strategy pool against cached market data.
strategy-run compares the top leaderboard candidate against the currently active strategy and writes a proposed next strategy pointer when appropriate.

This side is allowed to be creative and iterative because it does not place live orders. It produces artifacts that can be inspected.

Evidence Boundary

The handoff is deliberately plain:

strategy/pool/<id>/manifest.toml
strategy/pool/<id>/signal.py
strategy/leaderboard.toml
strategy-script/active/<pair>.toml
strategy-script/next/<pair>.toml
config/trade.toml
data/*.jsonl
raw-data/*.csv

These files answer the questions that matter:

What strategy exists?
Which strategy is active?
Which strategy is proposed next?
Why was it ranked highly?
What evidence has the runner recorded?
What risk posture is currently configured?

The boundary could become a database, a queue, a signed manifest, or a dashboard later. The important part is not the medium. The important part is that the handoff is visible and accountable.
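As an illustration of the append-only evidence idea behind data/*.jsonl, here is a minimal sketch of a JSON Lines log. The function names and the record fields are hypothetical; the real files in the repository may use a different layout.

```python
import json
from pathlib import Path


def append_evidence(path, record):
    """Append one record as a single JSON line. Existing lines are never rewritten."""
    line = json.dumps(record, sort_keys=True)
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(line + "\n")


def read_evidence(path):
    """Read every record back in the order it was written."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Opening the file in "a" mode means each write only adds to the end, so earlier evidence stays intact and a later evolution cycle can replay the whole history.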

Execution Plane

The Execution Plane is deterministic Python.

strategy-script/runner.py runs one asyncio coroutine per configured pair. Each coroutine:

Checks current positions.
Waits for native TP/SL closure if a position is already open.
Records closure evidence.
Applies a pending next/<pair>.toml strategy pointer only at this boundary, once no position is open.
Loads the active strategy manifest and signal.py.
Fetches Bybit klines.
Evaluates entry_signal(candles, params, context).
Routes any entry through bybit-script/place_order.py.

The runner does not call an LLM.

After entry, the position is managed by Bybit native TP/SL. The runner polls, records evidence, and continues.
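The loop above can be sketched roughly as follows. This is a simplified illustration, not the code in runner.py: the callback names (`has_open_position`, `fetch_candles`, `place_order`) and the `cycles` parameter are assumptions, and closure evidence, pointer swaps, and manifest loading are left out.

```python
import asyncio


async def run_pair(pair, has_open_position, fetch_candles, entry_signal,
                   place_order, cycles=1):
    """One coroutine per pair. Nothing in this loop calls an LLM."""
    entries = []
    for _ in range(cycles):
        if await has_open_position(pair):
            # A real runner would poll here until native TP/SL closes the position.
            await asyncio.sleep(0)
            continue
        candles = await fetch_candles(pair)
        signal = entry_signal(candles, {}, {"pair": pair})
        if signal is not None:
            # Every entry goes through the single order gate.
            entries.append(await place_order(pair, signal))
    return entries


async def run_all(pairs, **deps):
    """Run one coroutine per configured pair, concurrently."""
    results = await asyncio.gather(*(run_pair(p, **deps) for p in pairs))
    return dict(zip(pairs, results))
```

Passing the exchange calls in as callbacks keeps the loop itself free of side effects, so it can be exercised with stubs before it ever touches an exchange.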

Strategy Format

Each strategy has a manifest:

id = "ema_cross_v1"
description = "..."
pairs = ["BTCUSDT", "ETHUSDT"]
leverage = 5
tp_pct = 0.012
sl_pct = 0.008
timeframe = "5"
kline_limit = 120

[params]
# strategy-specific parameters

And a deterministic signal function:

def entry_signal(candles, params, context=None):
    # Return None for no entry.
    # Return {"side": "Buy" or "Sell", "rationale": "..."} for an entry.
    ...
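For illustration, here is a minimal sketch of what a signal function in this shape could look like, using a plain EMA crossover. The candle format (a list of dicts with a "close" key) and the "fast" and "slow" parameter names are assumptions, and this is not one of the bundled strategies.

```python
def ema(values, period):
    """Exponential moving average series, seeded with the first value."""
    alpha = 2.0 / (period + 1)
    out = [values[0]]
    for v in values[1:]:
        # Incremental form: move the previous EMA toward the new value.
        out.append(out[-1] + alpha * (v - out[-1]))
    return out


def entry_signal(candles, params, context=None):
    """Return None for no entry, or a dict with "side" and "rationale"."""
    closes = [float(c["close"]) for c in candles]
    fast_n, slow_n = params["fast"], params["slow"]
    if len(closes) < slow_n + 1:
        return None  # not enough history to compare two bars
    fast, slow = ema(closes, fast_n), ema(closes, slow_n)
    prev_diff = fast[-2] - slow[-2]
    curr_diff = fast[-1] - slow[-1]
    if prev_diff <= 0 < curr_diff:
        return {"side": "Buy", "rationale": "fast EMA crossed above slow EMA"}
    if prev_diff >= 0 > curr_diff:
        return {"side": "Sell", "rationale": "fast EMA crossed below slow EMA"}
    return None
```

The function reads only its arguments: no clock, no network, no files. Identical candles and params therefore always produce the same answer.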

The signal code is constrained. It should be deterministic for identical inputs. It should not make network calls, perform external IO, or depend on the current time. The live runner should evaluate strategy logic, not host a hidden research session.

Risk Gates

The project keeps safety posture in config/trade.toml.

The default configuration includes:

armed = false, so live order placement is rejected until explicitly enabled
mandatory TP/SL
leverage caps
minimum TP and SL distance floors
minimum reward/risk ratio
fixed margin fraction
global maximum concurrent positions
active pair list

The harness never calls place_order.py. Only the execution runner places entries, and only through the configured order gate.

This does not make trading safe. It makes the authority boundary explicit.
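As a rough sketch of how such a gate could be checked before an entry, consider the function below. Only `armed` is a key named in this post; every other config key (`active_pairs`, `max_leverage`, `min_tp_pct`, `min_sl_pct`, `min_reward_risk`, `max_concurrent_positions`) is a hypothetical name for the corresponding item in the list above.

```python
def check_order_gate(config, manifest, pair, open_position_count):
    """Return a list of rejection reasons. An empty list means the entry may proceed."""
    reasons = []
    if not config.get("armed", False):
        reasons.append("not armed")
    if pair not in config.get("active_pairs", []):
        reasons.append("pair not active")
    tp, sl = manifest.get("tp_pct"), manifest.get("sl_pct")
    if tp is None or sl is None:
        # TP/SL are mandatory; the remaining checks need both values.
        reasons.append("missing TP/SL")
        return reasons
    if manifest["leverage"] > config["max_leverage"]:
        reasons.append("leverage above cap")
    if tp < config["min_tp_pct"] or sl < config["min_sl_pct"]:
        reasons.append("TP/SL distance below floor")
    if tp < config["min_reward_risk"] * sl:
        reasons.append("reward/risk below minimum")
    if open_position_count >= config["max_concurrent_positions"]:
        reasons.append("too many open positions")
    return reasons
```

Returning reasons instead of a bare boolean keeps a rejection inspectable, which fits the evidence-first design: the runner can log why an entry was refused.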

Why This Architecture Matters

The point of Autarch is not that one strategy, exchange, or scoring formula is correct.

The point is the shape of the system:

Let the creative layer mutate future policy.
Make the policy handoff inspectable.
Keep the action layer narrow and deterministic.
Record evidence so the next evolution cycle can learn from what happened.

That pattern applies beyond trading. Any agentic system that separates "thinking about future behavior" from "taking irreversible action" can benefit from a similar boundary.

What I Want Feedback On

I am especially interested in critique around:

whether the Evolution Plane / Execution Plane split is clear enough
whether file-based handoffs are a good first boundary
whether strategy adoption should require stronger review or signatures
how to score a changing strategy pool without overfitting recent data
where human approval should sit in the loop
what should be made more formally verifiable

Links

GitHub:

https://github.com/KingGyuSuh/autarch

Architecture note:

https://github.com/KingGyuSuh/autarch/blob/main/docs/AUTARCH.md

Autarch is research software. It is not financial advice, not a profitability claim, and not something anyone should run blindly. The useful idea is the boundary: evolve freely, hand off explicitly, execute accountably.

Bridging Codex’s image_gen tool into Claude Code as /codex-image:* skills

KingGyu — Sat, 02 May 2026 18:15:36 +0000

Claude Code has no first-party image generation. Codex CLI does. It ships a headless image_gen tool (gpt-image-2) that runs against whatever auth you already have: a ChatGPT subscription (Free tier included) or your existing OpenAI API key. There is no extra OPENAI_API_KEY to manage.

I built a thin Claude Code plugin that bridges the two. Three slash commands:

/codex-image:generate "5 logo variations of a brass compass on white, save under images/logos/"
/codex-image:edit input.png "Replace background with a clean white studio backdrop"
/codex-image:status

The full slash-command argument is passed verbatim to Codex's imagegen skill. Output paths, sizes, quality, transparency, multi-image count — all expressed in natural language inside the prompt. No --out / --size / --quality flags to memorize; imagegen handles them.

Architecture: each SKILL.md is a one-line invocation of node script.mjs <subcmd> "$ARGUMENTS". The Node wrapper (~375 lines) only splits arguments and spawns codex exec with a ~6-line minimal instruction prefix. Image-generation intelligence lives entirely in Codex's bundled imagegen skill; this plugin is a pure dispatcher. One non-obvious finding documented along the way: the model does not always execute SKILL.md bash verbatim (it pre-evaluates $(...) substitutions in its head), so all parsing must live in the Node script. Details are in docs/ARCHITECTURE.md if you're building plugins yourself.

Trade-off worth knowing: agent tokens count against your Codex usage limit. A typical single-image low-quality turn is around 30k agent tokens on top of the image-gen cost itself.

Repo: https://github.com/KingGyuSuh/codex-image-in-cc

Install:

claude plugin marketplace add KingGyuSuh/codex-image-in-cc
claude plugin install codex-image@codex-image-in-cc

Apache-2.0. Orthogonal to and complementary with openai/codex-plugin-cc (code review / task delegation under the /codex: namespace) — install both.

Happy to take feedback or contributions. The architecture decisions are documented openly so you can disagree concretely.

Open-sourcing my personal AI Agent Harness for Production (harness-loom)

KingGyu — Mon, 20 Apr 2026 13:11:03 +0000

I’ve been poking at a bunch of AI agent frameworks and coding tools this past year. For personal projects, I often just use Hermes Agent or something similar because it's fast and saves tokens.

But honestly? When I actually have to ship something for production, I can't just use those raw agent setups. Between security compliance, instability, and the sheer complexity of real-world codebases, it’s just too risky.

For production, I keep going back to CLI tools like Claude Code, Codex, or Gemini CLI.

Why? Because in production:

Perfect > Fast: I'd rather it take longer but be absolutely correct and secure.
Traceability & Long Plans: I need to track the exact progress of long-running plans without having to babysit them or intervene constantly.
Consistent Quality: No matter which team member kicks off the task, the output quality and adherence to our repo's standards need to be exactly the same.

And I realized the way to achieve this isn't by finding a magical new model. It's by tuning the harness.

These CLIs (Claude, Codex, Gemini) already give you a pretty solid baseline harness for free (planners, hooks, auto mode, skills). But that baseline has no idea what my specific repo cares about. It doesn't know my team's review rules, what "Done" looks like for us, or what artifacts we need to persist.

So, I started focusing on Harness Fine-Tuning—writing my team's specific review rules, producer/reviewer pairs, and task shapes into actual version-controlled files, rather than trying to re-explain them in a prompt every single session.

I've finally open-sourced my personal harness setup: harness-loom.

It’s not another agent framework. It sits on top of whatever harness your CLI already ships and lets you shape it to fit your production repo. You define your rules in one canonical place (.harness/loom/), and it derives the specific configs for Claude, Codex, or Gemini.

I’m still in the process of porting over all the specific features from my private setup into the open-source repo, but the core factory is there and ready to use. I'll be updating it quickly!

If you are trying to use AI assistants for serious production work and want them to act more like a predictable system rather than a one-off chat, I'd love for you to poke at it.

🔗 GitHub Repo: harness-loom

Has anyone else felt the need to shift from "prompt engineering" to "harness engineering" for production work?