Ahab

Posted on Jul 5 • Originally published at indieseek.co

Building a shippable Record & Replay app for macOS

#ai #automation #webdev #tooling

Codex Record & Replay points to a product idea I find very interesting: sometimes users do not want to explain a workflow from scratch. They already know how to do the work. They just want to show it once and turn that demonstration into something reusable.

That is a strong idea. But a standalone product cannot be a simple browser recorder or a mouse-coordinate macro. If people are going to install it, trust it, and use it repeatedly, it has to record the real browser and the real desktop, protect sensitive data, and replay the intent of the workflow instead of blindly replaying pixels.

The product I would build is a local-first Mac app:

record a browser + desktop workflow
-> review what was captured
-> compile it into an editable routine
-> replay it with verification

The advantage over Codex is not “a better general agent.” The advantage is product ownership: local storage, bring-your-own-model settings, a routine library, run history, scheduling, logs, and exportable workflow files.

The core product bet

A useful Record & Replay app should turn one demonstration into a routine that can run again.

Demonstrate
-> capture evidence
-> compile a routine
-> run the routine
-> verify the result

The important word is “compile.” Raw events are evidence. They are not the final automation.

If the user clicks a button, the product should not remember only x=1040, y=72. It should remember the app, the window, the accessible target, the surrounding text, the browser DOM, a screenshot if needed, and the reason that step mattered.

That is the difference between a useful workflow system and a fragile macro recorder.

The first version must cover browser and desktop

Browser-only automation is useful, but it is not enough for this product.

Real workflows often cross boundaries:

open a page in Chrome;
download a file;
find it in Finder;
upload it somewhere else;
copy a value from Notes or Slack;
confirm a macOS dialog;
return to the browser and submit.

If the product can only see the browser, it misses the part that makes Record & Replay different.

So the first credible version needs two surfaces:

flowchart LR
  User["User demonstrates workflow"] --> App["Mac app"]
  App --> Browser["Browser surface\nextension + native messaging"]
  App --> Desktop["Desktop surface\nAccessibility + input events + screenshots"]
  Browser --> Trace["Local trace\nsession.json\nevents.jsonl\nkeyframes"]
  Desktop --> Trace
  Trace --> Review["Review and redaction"]
  Review --> Compile["Routine compiler"]
  Compile --> Routine["workflow.json\nroutine.md\nassets"]
  Routine --> Runtime["Replay runtime"]
  Runtime --> Verify["Verification"]

The browser surface should capture URL, title, DOM target, ARIA role/name, selector candidates, input changes, navigation, and key screenshots.

The desktop surface should capture foreground app, window title, Accessibility role/name/value, focus, selection, input events, and screenshots.

Both should write into one timeline. The routine compiler should see the workflow as one story, not as two disconnected logs.
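As a rough illustration of the single-timeline idea, here is a minimal sketch that merges two capture streams by timestamp. It assumes every event carries a numeric `ts` field and that each surface already emits events in time order; neither the field name nor the `merge_timeline` helper comes from the design above.

```python
import heapq
from typing import Iterable, Iterator


def merge_timeline(browser: Iterable[dict], desktop: Iterable[dict]) -> Iterator[dict]:
    """Merge two time-ordered event streams into one timeline.

    Assumption: every event has a numeric "ts" field (hypothetical name)
    and each input stream is already sorted by it.
    """
    return heapq.merge(browser, desktop, key=lambda event: event["ts"])


# Illustrative events only; real traces would carry far more context.
browser_events = [
    {"ts": 1.0, "surface": "browser", "type": "click"},
    {"ts": 4.0, "surface": "browser", "type": "input"},
]
desktop_events = [
    {"ts": 2.5, "surface": "desktop", "type": "focus"},
]

timeline = list(merge_timeline(browser_events, desktop_events))
for event in timeline:
    print(event["ts"], event["surface"], event["type"])
```

Because `heapq.merge` is lazy, the same approach would work on streamed `events.jsonl` lines without loading both logs into memory first.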

Trace is evidence, not the product

The trace format should be boring and explicit.

Each event needs to say:

what surface it came from;
what app and window were active;
what the user did;
what target was involved;
what context was captured;
which fields are redacted;
which screenshot, DOM, or Accessibility snapshot supports the event.

A simple event might look like this:

{
  "eventId": "evt_001",
  "surface": "browser",
  "app": "Chrome",
  "windowTitle": "Dashboard",
  "type": "input",
  "target": {
    "role": "textbox",
    "label": "Search",
    "selector": "[aria-label='Search']"
  },
  "value": {
    "kind": "text",
    "redacted": false,
    "preview": "invoice 2026"
  },
  "contextRefs": ["dom_001", "frame_001"]
}

The product should not store everything as plain text forever. Raw content, summaries, screenshots, and model-bound context should be separated. Before anything leaves the device, the user should be able to see it.
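One way to make "see it before it leaves the device" concrete is to build the model-bound copy as a separate object from the raw trace. The sketch below reuses the `value.redacted` and `value.preview` fields from the event example; the `model_bound_preview` function and the `[REDACTED]` placeholder are assumptions for illustration.

```python
import copy


def model_bound_preview(events: list[dict]) -> list[dict]:
    """Return a copy of the events as they would be sent to a model.

    Values flagged as redacted are replaced by a placeholder. The raw
    events are deep-copied first, so the local trace is never modified.
    """
    preview = []
    for event in events:
        outbound = copy.deepcopy(event)
        value = outbound.get("value")
        if value and value.get("redacted"):
            value["preview"] = "[REDACTED]"  # hypothetical placeholder text
        preview.append(outbound)
    return preview


# Illustrative events; the second one is marked as sensitive.
events = [
    {"eventId": "evt_001", "type": "input",
     "value": {"kind": "text", "redacted": False, "preview": "invoice 2026"}},
    {"eventId": "evt_002", "type": "input",
     "value": {"kind": "text", "redacted": True, "preview": "hunter2"}},
    {"eventId": "evt_003", "type": "click"},
]

preview = model_bound_preview(events)
for item in preview:
    print(item["eventId"], item.get("value", {}).get("preview"))
```

The review screen would then render `preview`, not `events`, so what the user approves is exactly what gets sent.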

The compiler is the real product

The compiler turns noisy evidence into an editable routine.

flowchart TD
  Raw["Raw events"] --> Clean["Clean noise"]
  Clean --> Segment["Segment steps"]
  Segment --> Anchor["Build stable anchors"]
  Anchor --> Params["Detect variables"]
  Params --> Verify["Write verification"]
  Verify --> Routine["routine.md + workflow.json"]

Good compilation means:

merge typing into one input step;
remove accidental clicks and idle time;
detect which values change between runs;
prefer semantic targets over coordinates;
write verification for important steps;
mark risks and sensitive fields;
ask the user only when intent is unclear.
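The first item in that list, merging typing into one input step, can be sketched in a few lines. This is only one possible cleaning pass, and it assumes each input event's `value.preview` holds the full text of the field after that keystroke, so the last event in a run is the one worth keeping. The `coalesce_inputs` name is hypothetical.

```python
def coalesce_inputs(events: list[dict]) -> list[dict]:
    """Merge consecutive "input" events on the same target into one step.

    Assumption: each input event records the field's full text so far,
    so only the final event of a run needs to survive.
    """
    steps: list[dict] = []
    for event in events:
        previous = steps[-1] if steps else None
        same_field = (
            previous is not None
            and previous["type"] == "input"
            and event["type"] == "input"
            and previous.get("target") == event.get("target")
        )
        if same_field:
            steps[-1] = event  # keep only the final state of the field
        else:
            steps.append(event)
    return steps


# Illustrative trace: one click, three keystroke-level inputs, one click.
raw = [
    {"type": "click", "target": {"label": "Search"}},
    {"type": "input", "target": {"label": "Search"}, "value": {"preview": "i"}},
    {"type": "input", "target": {"label": "Search"}, "value": {"preview": "in"}},
    {"type": "input", "target": {"label": "Search"}, "value": {"preview": "invoice 2026"}},
    {"type": "click", "target": {"label": "Go"}},
]

steps = coalesce_inputs(raw)
print([s["type"] for s in steps])
```

Removing idle time and accidental clicks would be separate passes over the same list, which keeps each cleaning rule easy to test on its own.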

A good routine should read like this:

Open the report page.
Choose the date range.
Upload the selected file.
Submit the draft.
Verify that the success message appears.

Not like this:

Move mouse to 1040,72.
Click.
Wait 500ms.
Press Tab.

Replay needs a verification loop

Replay should be a state machine, not a script that blindly runs line by line.

stateDiagram-v2
  [*] --> LoadRoutine
  LoadRoutine --> CollectInputs
  CollectInputs --> Preflight
  Preflight --> ExecuteStep
  ExecuteStep --> VerifyStep
  VerifyStep --> ExecuteStep: Pass and more steps remain
  VerifyStep --> Recover: Fail
  Recover --> ExecuteStep: Recovered
  Recover --> HumanTakeover: Needs help
  HumanTakeover --> ExecuteStep: User resumes
  VerifyStep --> Done: All steps pass
  Done --> [*]
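The execute, verify, recover part of that diagram could be sketched roughly as below. All four callbacks are hypothetical hooks a real runtime would supply, and the `max_attempts` bound is an added assumption so that a step which never verifies cannot loop forever.

```python
from typing import Callable


def run_routine(
    steps: list[str],
    execute: Callable[[str], None],
    verify: Callable[[str], bool],
    recover: Callable[[str], bool],
    ask_human: Callable[[str], None],
    max_attempts: int = 3,  # assumption: not part of the diagram
) -> list[tuple[str, str, int]]:
    """Run each step through execute -> verify, retrying after recovery.

    Returns a log of (step, outcome, attempts) tuples.
    """
    log = []
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            execute(step)
            if verify(step):
                log.append((step, "pass", attempt))
                break
            # Verification failed: try automatic recovery first,
            # and hand over to the user only if that does not work.
            if not recover(step):
                ask_human(step)
        else:
            raise RuntimeError(f"step failed after {max_attempts} attempts: {step}")
    return log


# Simulated run: "upload" fails verification on its first attempt.
attempts: dict[str, int] = {}


def fake_execute(step: str) -> None:
    attempts[step] = attempts.get(step, 0) + 1


def fake_verify(step: str) -> bool:
    return step != "upload" or attempts[step] >= 2


log = run_routine(
    ["open", "upload", "submit"],
    execute=fake_execute,
    verify=fake_verify,
    recover=lambda step: True,
    ask_human=lambda step: None,
)
print(log)
```

The log doubles as run history: each entry records how many attempts a step needed, which is the kind of signal that shows a routine is getting brittle.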

The runtime should try tools in this order:

Use the most stable semantic path first: API, MCP, Apple Events, browser DOM action, or Accessibility action.
Use structure next: selector, ARIA label, visible text, Accessibility role, or window hierarchy.
Use visual matching when structure is incomplete.
Use coordinates only as a last resort, tied to a known window and screenshot.
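That ordering maps naturally onto a chain of resolvers tried from most to least stable. The sketch below is illustrative only: the step fields (`action_id`, `selector`, `image_match`, `coords`, `window`, `screenshot`) and the resolver names are assumptions, not a defined schema.

```python
from typing import Callable, Optional

Resolver = Callable[[dict], Optional[str]]


def by_semantic_action(step: dict) -> Optional[str]:
    # API, MCP, Apple Events, DOM action, or Accessibility action.
    return step.get("action_id")


def by_structure(step: dict) -> Optional[str]:
    # Selector, ARIA label, visible text, or Accessibility role.
    return step.get("selector")


def by_visual_match(step: dict) -> Optional[str]:
    return step.get("image_match")


def by_coordinates(step: dict) -> Optional[str]:
    # Last resort: only usable when tied to a known window and screenshot.
    if step.get("coords") and step.get("window") and step.get("screenshot"):
        return f"{step['window']}@{step['coords']}"
    return None


RESOLVERS: list[tuple[str, Resolver]] = [
    ("semantic", by_semantic_action),
    ("structure", by_structure),
    ("visual", by_visual_match),
    ("coordinates", by_coordinates),
]


def resolve_target(step: dict, resolvers: list[tuple[str, Resolver]]) -> tuple[str, str]:
    """Try resolvers in priority order and return (strategy, handle)."""
    for name, resolver in resolvers:
        handle = resolver(step)
        if handle is not None:
            return name, handle
    raise LookupError(f"no strategy could resolve step {step.get('id')}")


step = {"id": "step_3", "selector": "[aria-label='Search']", "coords": (1040, 72)}
strategy, handle = resolve_target(step, RESOLVERS)
print(strategy, handle)
```

Returning the strategy name alongside the handle lets the run log show when a routine has silently degraded from semantic targets to visual or coordinate fallbacks.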

Dangerous actions need explicit confirmation. Deleting data, submitting payments, changing passwords, uploading personal files, installing software, or sending sensitive information should not run unattended.
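A confirmation gate for those actions can be very small. In this sketch the `kind` labels and the `may_run` function are hypothetical; the point is only that a risky step needs both an attended run and an explicit confirmation.

```python
# Hypothetical labels for the dangerous action types listed above.
RISKY_KINDS = {
    "delete_data",
    "submit_payment",
    "change_password",
    "upload_personal_file",
    "install_software",
    "send_sensitive_info",
}


def may_run(step: dict, unattended: bool, confirmed: bool) -> bool:
    """Decide whether a step is allowed to execute right now.

    Risky steps never run unattended and always need explicit
    confirmation; everything else runs freely.
    """
    if step.get("kind") in RISKY_KINDS:
        return confirmed and not unattended
    return True


payment = {"id": "step_7", "kind": "submit_payment"}
print(may_run(payment, unattended=True, confirmed=True))   # scheduled run
print(may_run(payment, unattended=False, confirmed=True))  # user is present
```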

BYOK is a real reason to exist

Bring-your-own-model support, where users plug in their own API key and model (BYOK), is not just a settings feature. It is a product advantage.

It lets users choose cost, privacy, and model quality. The app should support at least:

OpenAI-compatible base URL + API key + model;
Anthropic;
Gemini;
OpenRouter;
local models through Ollama or a compatible server.

Keys should live in Keychain. The settings screen should include connection tests, model capability hints, context preview, and rough cost estimates.

Cheap models can clean text and summarize traces. Stronger multimodal models can handle ambiguous screens or visual recovery.
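That split suggests a simple routing rule: pick the cheapest configured model that is capable of the task. The sketch below assumes a user-maintained list of model profiles; the profile names, task labels, and cost figures are placeholders, not real models or prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str                  # hypothetical label from the settings screen
    multimodal: bool           # can it read screenshots?
    cost_per_1k_tokens: float  # rough, user-supplied estimate


# Task labels are assumptions for illustration.
VISION_TASKS = {"ambiguous_screen", "visual_recovery"}


def pick_model(task: str, profiles: list[ModelProfile]) -> ModelProfile:
    """Return the cheapest configured model that can handle the task."""
    needs_vision = task in VISION_TASKS
    candidates = [p for p in profiles if p.multimodal or not needs_vision]
    if not candidates:
        raise LookupError(f"no configured model can handle task: {task}")
    return min(candidates, key=lambda p: p.cost_per_1k_tokens)


profiles = [
    ModelProfile("small-text-model", multimodal=False, cost_per_1k_tokens=0.1),
    ModelProfile("large-vision-model", multimodal=True, cost_per_1k_tokens=2.0),
]

print(pick_model("summarize_trace", profiles).name)
print(pick_model("visual_recovery", profiles).name)
```

The same profile list could feed the rough cost estimates in the settings screen, since each routed call already knows which price applies.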

Distribution is part of the plan

This kind of app should start outside the Mac App Store.

It needs Accessibility, Screen Recording, and Input Monitoring permissions, Apple Events in some cases, helper processes, and browser extension setup. That combination fits poorly with the sandboxing the Mac App Store requires, so Developer ID signing and notarization are the more realistic first route.

The user experience also has to explain permissions clearly. A workflow recorder sees a lot. If the app cannot explain what it records, where it stores data, and what it sends to models, it will not deserve trust.

A realistic v1

The first sellable version should be narrow, but complete.

It should support:

Chrome or Brave workflow recording;
macOS app/window tracking and Accessibility targets;
one local trace timeline;
a trace review screen with deletion and redaction;
routine compilation into routine.md and workflow.json;
browser + desktop replay with verification;
human takeover when the app is unsure;
local-first storage and context preview;
signed and notarized distribution.

Good first workflows are boring on purpose: back-office browser forms, browser + Finder upload/download flows, and simple native app steps in Finder, Notes, Mail, Calendar, or Slack.

I would avoid banking, payments, government, medical, games, complex design tools, and fully unattended sensitive actions in v1.

My final take

This is a viable product direction if the positioning is honest.

It is not “automate every app.”

It is not “record pixels and replay them.”

It is not “browser automation with a nicer UI.”

The stronger promise is:

> Record a browser + Mac workflow once, turn it into an editable AI routine, and replay it with your own model.

That gives the product a clear reason to exist next to Codex: local-first control, model choice, portable routines, and a UI built for repeated real-world runs.