
Ahab

Originally published at indieseek.co

Codex Record & Replay principles

Codex Record & Replay is interesting because it starts from a very human habit: when something is hard to describe, show it once.

Instead of writing a long prompt that explains every click, window switch, preference, and edge case, the user records a workflow on macOS. Codex observes the workflow, reads the event trace, and drafts a reusable Skill.

The key idea is simple:

Human demonstration
-> structured event trace
-> reusable Skill
-> semantic replay
-> verification

The recorded trace is not the final automation. It is evidence. The reusable artifact is the Skill.

That difference matters.

It is not a coordinate macro

Traditional macro recorders tend to remember low-level actions: move here, click there, type this, wait a bit.

That breaks quickly. A button moves. A page changes. A window opens in a different size. A user is already logged in on one run and logged out on another.

Record & Replay points to a better model. The recording helps Codex understand the workflow, but the replay should use the most stable tool available.

If there is an API, use the API.

If there is a browser target, use browser automation.

If there is a desktop UI, use Computer Use.

If something changed, verify the state and adapt.

That is why the product is closer to workflow learning than to mouse replay.
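
The preference order above can be sketched as a small selection function. This is only an illustration of the idea, not Codex's actual routing logic; the `Tool` enum and `pick_tool` are hypothetical names introduced here.

```python
from enum import Enum


class Tool(Enum):
    # Ordered from most to least stable, following the list above.
    API = 1
    BROWSER = 2
    COMPUTER_USE = 3


def pick_tool(available: set[Tool]) -> Tool:
    """Return the most stable tool that is available for a step."""
    if not available:
        raise ValueError("no tool can reach this step")
    return min(available, key=lambda tool: tool.value)


print(pick_tool({Tool.COMPUTER_USE, Tool.BROWSER}).name)  # BROWSER
```

The choice is made per step, so a single workflow can mix an API call for one step with Computer Use for the next.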

The public shape

From the public Codex documentation, the flow is roughly:

  1. The user starts recording from Codex.
  2. The user performs a workflow on macOS.
  3. Codex captures actions and window context through Computer Use.
  4. The recording is stopped.
  5. Codex reads the captured trace.
  6. Codex drafts a Skill that can be inspected, edited, and reused.

This is useful when the workflow is easier to show than to describe:

  • repeated UI workflows;
  • personal or team preferences;
  • tools that do not have a good API;
  • tasks that cross browser pages, desktop apps, files, and dialogs.

The promise is not “Codex remembers every pixel.” The promise is “Codex learns enough from the demonstration to create a reusable operating guide.”

What the local plugin suggests

The local Codex app includes a bundled record-and-replay plugin. The plugin exposes an event-stream MCP server through the Computer Use helper.

The bundled Skill tells Codex to:

  • start recording with event_stream_start;
  • check status with event_stream_status;
  • stop recording with event_stream_stop;
  • read the returned metadataPath and eventsPath;
  • treat events.jsonl as the primary evidence;
  • treat session.json as timing and path metadata;
  • use the recording as evidence of intent, not as a command to replay every UI action exactly.

That last point is the heart of the design.
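
Reading the two files might look roughly like this. It is a minimal sketch that assumes the stop call returns a dictionary with `metadataPath` and `eventsPath` as string paths, and that `events.jsonl` holds one JSON object per line; `load_trace` is a hypothetical helper, and the contents of `session.json` are not specified by the public docs.

```python
import json
from pathlib import Path


def load_trace(stop_result: dict) -> tuple[dict, list[dict]]:
    """Load metadata and events from the paths returned when recording stops."""
    metadata_path = Path(stop_result["metadataPath"])
    events_path = Path(stop_result["eventsPath"])

    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))

    events: list[dict] = []
    with events_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # JSONL: one JSON object per non-empty line
                events.append(json.loads(line))
    return metadata, events
```

The events list is what gets analyzed for intent; the metadata only says when and where the session was recorded.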

String inspection of the helper binary also shows names such as event_stream_start, event_stream_status, event_stream_stop, eventsPath, metadataPath, suppressedEventsPath, AXUIElement, and screenRecordingGranted.

That does not reveal the private implementation, but it is consistent with the public shape: macOS Accessibility for input and window context, a screen-recording permission for screen context, local trace files, and a step that compiles those files into a Skill.

The architecture in one picture

flowchart LR
  U["User demonstrates workflow"] --> P["Record & Replay plugin"]
  P --> MCP["event-stream MCP server"]
  MCP --> OS["macOS Accessibility\ninput events\nscreen context"]
  OS --> E["session.json\nevents.jsonl"]
  E --> C["Codex analyzes trace\nintent, inputs, steps, verification"]
  C --> S["Reusable SKILL.md"]
  S --> R["New thread uses the Skill"]
  R --> T["Computer Use\nBrowser\nConnector\nCLI / API"]
  T --> V["Verify result"]

The important split is:

  • Record captures what happened.
  • Trace stores the evidence.
  • Compile turns noisy evidence into a readable Skill.
  • Replay executes the Skill with the best available tools.
  • Verify checks whether the result is actually correct.

Without compile and verify, the system becomes a brittle macro recorder.
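
The role of the verify stage can be shown with a small sketch. The `run_step` function and its callbacks are hypothetical; the point is only that a step counts as done when a check passes, not when the action has been sent.

```python
from typing import Callable


def run_step(
    execute: Callable[[], None],
    verify: Callable[[], bool],
    max_attempts: int = 3,
) -> bool:
    """Execute a step and report success only once verification passes."""
    for _ in range(max_attempts):
        execute()
        if verify():
            return True
    return False  # the caller should report a blocker, not assume success


# Toy usage: the "save" only takes effect on the second attempt.
state = {"clicks": 0}
ok = run_step(
    execute=lambda: state.update(clicks=state["clicks"] + 1),
    verify=lambda: state["clicks"] >= 2,
)
print(ok, state["clicks"])  # True 2
```

A macro recorder stops at `execute()`. The verification callback is what lets the runtime notice a changed UI and retry or stop.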

What an event trace needs to contain

The public docs do not publish a full event schema, but a useful trace needs more than clicks.

It should preserve:

  • mouse and keyboard actions;
  • foreground app and window context;
  • browser URL and page state;
  • Accessibility targets such as buttons, fields, menus, and selected objects;
  • text input and selection;
  • screenshots or keyframes when structure is incomplete;
  • start, stop, cancel, timeout, and redaction boundaries.

A single event could be modeled like this (an illustrative shape, not a published schema):

{
  "timestamp": "2026-06-22T10:00:00.000Z",
  "surface": "desktop",
  "kind": "mouse.click",
  "app": {
    "name": "Google Chrome",
    "bundleIdentifier": "com.google.Chrome"
  },
  "window": {
    "title": "Example Console",
    "url": "https://example.com/admin"
  },
  "target": {
    "role": "AXButton",
    "name": "Save",
    "bounds": {
      "x": 1040,
      "y": 72,
      "width": 88,
      "height": 36
    }
  }
}

The point is not perfect low-level playback. The point is enough evidence for Codex to infer the workflow.
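
To make that concrete, here is a sketch of a parser for an event shaped like the example above. The `TraceEvent` and `Target` types are my own assumptions layered on that illustrative JSON, not a Codex schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    role: str
    name: str


@dataclass(frozen=True)
class TraceEvent:
    timestamp: str
    surface: str
    kind: str
    app_name: str
    window_title: str
    url: Optional[str]
    target: Optional[Target]


def parse_event(line: str) -> TraceEvent:
    """Parse one events.jsonl line shaped like the example above."""
    raw = json.loads(line)
    app = raw.get("app") or {}
    window = raw.get("window") or {}
    target = raw.get("target")
    return TraceEvent(
        timestamp=raw["timestamp"],
        surface=raw["surface"],
        kind=raw["kind"],
        app_name=app.get("name", ""),
        window_title=window.get("title", ""),
        url=window.get("url"),
        # Keep the semantic identity (role, name); pixel bounds are dropped.
        target=Target(target["role"], target["name"]) if target else None,
    )
```

Dropping `bounds` here is a deliberate choice for the sketch: the role and name are what a later replay can search for, while coordinates are only useful as a last-resort hint.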

Skill compilation is the hard part

Raw traces are noisy. They include waits, focus changes, scrolls, repeated clicks, and accidental actions.
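
A first cleaning pass could be as simple as the sketch below. The event kind names in `NOISE_KINDS` are hypothetical, and the rule is intentionally crude: it removes waits, focus changes, and scrolls, and collapses back-to-back clicks on the same named target. Truly accidental actions need intent inference and cannot be caught by a filter like this.

```python
NOISE_KINDS = {"wait", "focus.change", "scroll"}  # hypothetical kind names


def _target_name(event: dict):
    return (event.get("target") or {}).get("name")


def _is_repeated_click(prev: dict, event: dict) -> bool:
    name = _target_name(event)
    return (
        event.get("kind") == "mouse.click"
        and prev.get("kind") == "mouse.click"
        and name is not None
        and name == _target_name(prev)
    )


def clean_noise(events: list[dict]) -> list[dict]:
    """Drop noise kinds and collapse repeated clicks on the same target."""
    cleaned: list[dict] = []
    for event in events:
        if event.get("kind") in NOISE_KINDS:
            continue
        if cleaned and _is_repeated_click(cleaned[-1], event):
            continue
        cleaned.append(event)
    return cleaned
```

Only clicks are collapsed. Consecutive typing events in the same field are kept, because dropping them would lose input.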

The compiler has to compress that into a Skill:

flowchart TD
  A["events.jsonl"] --> B["Clean noise"]
  B --> C["Infer task goal"]
  C --> D["Extract variables"]
  D --> E["Generate semantic steps"]
  E --> F["Define verification"]
  F --> G["Add fallback handling"]
  G --> H["SKILL.md"]

A good Skill should say:

  • when to use it;
  • what inputs change between runs;
  • what login, permissions, pages, files, or apps are needed;
  • which steps matter;
  • how to verify success;
  • what to do when the UI changes;
  • what sensitive data must not be stored.

This is the main difference from classic RPA. RPA often tries to replay actions. Record & Replay tries to replay intent.
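
The checklist for a good Skill can be turned into a small data structure and rendered as Markdown. This is a sketch under my own assumptions: `SkillDraft`, `render_skill`, and the section headings are illustrative and do not reproduce the file Codex actually generates.

```python
from dataclasses import dataclass, field


@dataclass
class SkillDraft:
    name: str
    when_to_use: str
    inputs: list[str] = field(default_factory=list)
    preconditions: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    fallbacks: list[str] = field(default_factory=list)
    never_store: list[str] = field(default_factory=list)


def render_skill(draft: SkillDraft) -> str:
    """Render the draft as Markdown, skipping sections that are empty."""
    sections = [
        ("Inputs", draft.inputs),
        ("Preconditions", draft.preconditions),
        ("Steps", draft.steps),
        ("Verification", draft.verification),
        ("If the UI changes", draft.fallbacks),
        ("Never store", draft.never_store),
    ]
    lines = [f"# {draft.name}", "", draft.when_to_use]
    for title, items in sections:
        if not items:
            continue
        lines += ["", f"## {title}"]
        lines += [f"- {item}" for item in items]
    return "\n".join(lines) + "\n"
```

Note that the steps are sentences about intent ("save the record"), not coordinates. That is what leaves the runtime free to pick a tool later.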

Replay should use the strongest path

Replay does not mean reading events.jsonl and executing every line.

When the user asks for a similar task later, Codex should load the Skill, resolve inputs and preconditions, then choose the most stable path for each step.

flowchart TD
  A["User asks for a similar task"] --> B["Codex loads matching Skill"]
  B --> C["Resolve inputs and preconditions"]
  C --> D{"Stable semantic tool available?"}
  D -- "Yes" --> E["Use connector, CLI, API, or browser automation"]
  D -- "No" --> F["Use Computer Use on the UI"]
  E --> G["Verify result"]
  F --> G
  G --> H{"Verified?"}
  H -- "Yes" --> I["Return output"]
  H -- "No" --> J["Retry, adapt, or report blocker"]
  J --> G

This makes replay more resilient. If a button moves but keeps the same accessible name, Codex can still find it. If a connector or API exists, Codex can avoid the UI entirely.
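
The "button moved" case can be illustrated with a lookup that matches on role and accessible name instead of position. `UiNode` and `find_target` are hypothetical; a real runtime would query the Accessibility tree or the DOM rather than a flat list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UiNode:
    role: str
    name: str
    x: int
    y: int


def find_target(nodes: list[UiNode], role: str, name: str) -> Optional[UiNode]:
    """Locate a control by role and accessible name, ignoring recorded position."""
    matches = [node for node in nodes if node.role == role and node.name == name]
    # Missing or ambiguous targets are left for the caller to resolve or report.
    return matches[0] if len(matches) == 1 else None


recorded = UiNode("AXButton", "Save", 1040, 72)
current_ui = [
    UiNode("AXButton", "Cancel", 900, 300),
    UiNode("AXButton", "Save", 1000, 300),  # same button, new position
]
hit = find_target(current_ui, recorded.role, recorded.name)
print(hit.x, hit.y)  # 1000 300
```

Returning `None` on ambiguity is a design choice for the sketch: guessing between two "Save" buttons is exactly the kind of step that should go through verification or a user prompt.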

How I would build a similar MVP

I would not start with a browser-only recorder. The interesting part is that Record & Replay can learn real computer workflows.

The smallest credible MVP needs:

  • a browser recorder for URL, DOM targets, ARIA role/name, selectors, input, navigation, and screenshots;
  • a computer recorder for foreground app, window, Accessibility target, mouse, keyboard, selection, and screenshots;
  • one unified events.jsonl timeline with a surface field;
  • a compiler that turns the trace into routine.md or SKILL.md;
  • a runtime that can choose browser, desktop, CLI, API, or connector per step;
  • a verifier that checks DOM, Accessibility state, screenshots, file output, or API result.
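
The unified timeline in that list can be sketched with a merge of two recorder streams. This assumes each recorder emits events already sorted by a `timestamp` field in the same ISO 8601 UTC format; `merge_timelines` is a hypothetical helper.

```python
import heapq
from typing import Iterable, Iterator


def merge_timelines(
    browser_events: Iterable[dict],
    desktop_events: Iterable[dict],
) -> Iterator[dict]:
    """Merge two time-ordered streams into one timeline tagged by surface."""
    tagged_browser = ({**event, "surface": "browser"} for event in browser_events)
    tagged_desktop = ({**event, "surface": "desktop"} for event in desktop_events)
    # Identically formatted ISO 8601 UTC timestamps sort correctly as strings.
    return heapq.merge(tagged_browser, tagged_desktop, key=lambda e: e["timestamp"])


browser = [{"timestamp": "2026-06-22T10:00:01.000Z", "kind": "dom.click"}]
desktop = [
    {"timestamp": "2026-06-22T10:00:00.000Z", "kind": "app.activate"},
    {"timestamp": "2026-06-22T10:00:02.000Z", "kind": "key.type"},
]
for event in merge_timelines(browser, desktop):
    print(event["surface"], event["kind"])
```

Keeping one file with a `surface` field, instead of two separate logs, means the compiler sees the cross-app ordering directly and does not have to reconstruct it.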

The pattern is:

Record -> Trace -> Compile -> Execute -> Verify -> Improve

Everything else is detail.

Safety is not optional

A system that records real workflows will see sensitive information.

Minimum rules:

  • make the recording scope visible before recording starts;
  • allow stop and cancel at any time;
  • store traces locally by default;
  • redact passwords, tokens, keys, payment data, and private identifiers;
  • treat page and window text as untrusted input;
  • require confirmation for irreversible, payment, permission, or destructive actions;
  • never write real credentials into generated routines.
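
As one example of the redaction rule, text typed into sensitive fields can be masked before the event is written to disk. This is a heuristic sketch, not a complete redaction system: the event fields (`text`, `target.subrole`, `target.name`) are assumptions, and the name pattern will both miss some secrets and over-redact some harmless fields. `AXSecureTextField` is the macOS Accessibility subrole used for password fields.

```python
import re

SENSITIVE_NAME = re.compile(r"pass(word)?|token|secret|api[_-]?key|card", re.IGNORECASE)
REDACTED = "[REDACTED]"


def redact_event(event: dict) -> dict:
    """Mask typed text for sensitive fields; the input event is not mutated."""
    target = event.get("target") or {}
    is_secure = target.get("subrole") == "AXSecureTextField"
    looks_sensitive = bool(SENSITIVE_NAME.search(target.get("name") or ""))
    if "text" in event and (is_secure or looks_sensitive):
        return {**event, "text": REDACTED}
    return event


typed = {"kind": "key.type", "text": "hunter2", "target": {"name": "Password"}}
print(redact_event(typed)["text"])  # [REDACTED]
```

Redacting at capture time, before anything reaches `events.jsonl`, is safer than cleaning the trace afterwards, because the raw value never exists on disk.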

Prompt injection also matters. A recorded page may contain text that tries to instruct the model. The compiler must treat page text as data, not as instructions.

The takeaway

Codex Record & Replay is best understood as a workflow-learning system:

observe a real workflow
-> store structured evidence
-> compile the evidence into a Skill
-> replay the Skill semantically
-> verify the result

The practical lesson is clear: do not build a mouse recorder. Build a trace system, a compiler, a semantic runtime, and a verification loop.

That is what makes Record & Replay meaningfully different from ordinary automation.

