Ahab

Posted on Jul 5 • Originally published at indieseek.co

How to locate the input position on macOS, and what it took to get it right

#programming #debugging #productivity #tools

Paste Switch has a small overlay that appears when I press the shortcut to cycle through recent clipboard items. The overlay should appear near the place where text is being inserted, not near the mouse pointer, not at the center of the screen, and not on the wrong display.

That sounds like a small UI detail. It was not.

The hard part is that macOS does not give every app the same accessibility shape. A native text field, a browser text area, and an Electron app can all look like “the current input,” but they may expose different Accessibility attributes. Some give an exact caret rectangle. Some only expose the frame of the focused text element. Some return stale or incomplete values depending on focus timing.

The final solution is not a single magic API. It is a small decision tree, backed by logs, with a clear boundary between “exact caret” and “good enough input box anchor.”

The behavior I wanted

The product behavior is simple:

Press Command+Shift+V
  -> paste the latest clipboard item
  -> show a compact overlay near the active typing position
Keep Command+Shift held
  -> press V again to cycle to the next item
Release Command or Shift
  -> end the cycle and hide the overlay

The overlay is feedback, not a picker. It should help me see which clipboard item is active while I stay inside the app where I am typing.

That means the anchor matters. If the overlay appears next to the mouse pointer, it feels random. If it appears on another monitor, it feels broken. If it appears over a system permission dialog, it becomes hostile. For this kind of utility, positioning is part of the core interaction.

The first principle: ask Accessibility for the focused input

On macOS, the right starting point is the Accessibility API.

The flow is:

AXUIElementCreateSystemWide()
  -> AXFocusedUIElement
  -> AXSelectedTextRange
  -> AXBoundsForRange
  -> caret rectangle in screen coordinates

When it works, this is the best answer. AXSelectedTextRange tells you the current selection or insertion point. AXBoundsForRange is a parameterized attribute that turns that range into a rectangle. If the range length is zero, the rectangle is the caret. If the API cannot return a rectangle for a zero-length range, you can infer the caret from the bounds of an adjacent character.
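The adjacent-character inference can be sketched as pure geometry. This is a minimal sketch, not the Paste Switch code: `Rect`, `caret_from_neighbors`, and the rule of preferring the preceding character are assumptions for illustration, and it assumes left-to-right text in top-left-origin screen coordinates.

```rust
/// Rectangle in screen coordinates (top-left origin, y grows downward).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// Hypothetical helper: derive a zero-width caret rectangle from the bounds
/// of the characters around the insertion point.
/// `prev` is the character just before the caret, `next` the one just after.
fn caret_from_neighbors(prev: Option<Rect>, next: Option<Rect>) -> Option<Rect> {
    match (prev, next) {
        // The caret sits at the trailing edge of the previous character.
        (Some(p), _) => Some(Rect { x: p.x + p.width, y: p.y, width: 0.0, height: p.height }),
        // At the start of the text, use the leading edge of the next character.
        (None, Some(n)) => Some(Rect { x: n.x, y: n.y, width: 0.0, height: n.height }),
        // Empty field: there is no character geometry to infer from.
        (None, None) => None,
    }
}
```

One case this sketch does not handle: when the preceding character is a line break, the caret is really at the start of the next line, so a real implementation would probably prefer the following character there.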

In Paste Switch, this is the first source attempted. The current code asks the system-wide focused element first, then falls back through the focused application and focused window. The logs record which source returned a rectangle, so I can tell whether a position came from direct_caret, text_rect, a descendant, or nothing.

That logging turned out to be more important than I expected.

The second principle: not every app gives a caret

The first implementation assumed: if there is a focused text input, there should be a caret rectangle.

That assumption was wrong.

In native macOS fields, AXSelectedTextRange plus AXBoundsForRange can work very well. In some browser and Electron surfaces, the focused element may report that it is a text input, but the parameterized bounds call returns nothing useful. Codex was the clearest example during testing: the app had a real focused input, but the exact caret bounds path was not reliable.

If the code treats “no caret bounds” as “no input position,” the overlay falls back to the wrong place. In one broken version, it drifted toward the previous input location or the current screen center. That made the whole feature feel worse than before.

The fix was to keep a second source:

If exact caret bounds exist:
  use the caret rectangle
Else if the focused element is a text input:
  use AXFrame
  or AXPosition + AXSize
Else:
  keep searching descendants / windows

This is not pretending the input box frame is the caret. It is a named fallback with a different meaning. For a compact overlay, anchoring near the focused text field is still much better than anchoring near the mouse or the last known position.
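One way to keep the two meanings apart in code is to carry the precision in the type. The following is a sketch under assumptions: `FocusedElement`, `Anchor`, and `anchor_for` are hypothetical names, and the struct stands in for values that have already been read from the Accessibility API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// The anchor together with how precise it is (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Anchor {
    ExactCaret(Rect),
    InputBox(Rect),
}

/// Values already read from the focused element.
/// `None` means the attribute was missing or unusable.
struct FocusedElement {
    is_text_input: bool,
    caret_bounds: Option<Rect>,   // result of AXBoundsForRange
    frame: Option<Rect>,          // result of AXFrame
    position: Option<(f64, f64)>, // result of AXPosition
    size: Option<(f64, f64)>,     // result of AXSize
}

fn anchor_for(el: &FocusedElement) -> Option<Anchor> {
    // Best case: the app exposes exact caret geometry.
    if let Some(caret) = el.caret_bounds {
        return Some(Anchor::ExactCaret(caret));
    }
    // Named fallback: the frame of the focused text input.
    if el.is_text_input {
        let from_parts = el
            .position
            .zip(el.size)
            .map(|((x, y), (width, height))| Rect { x, y, width, height });
        if let Some(frame) = el.frame.or(from_parts) {
            return Some(Anchor::InputBox(frame));
        }
    }
    // Nothing usable here: the caller keeps searching descendants / windows.
    None
}
```

Because the result is `ExactCaret` or `InputBox` rather than a bare rectangle, logging and placement code can treat the two precision levels differently without guessing.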

That distinction mattered. The bug was not “Codex has no input position.” The real issue was that Codex did not expose the exact caret through the same path as native fields, and the implementation had become too strict.

The third principle: do not mix coordinate systems casually

Another bad turn came from trying to position the overlay with an AppKit window path.

The Accessibility API reports positions in global screen coordinates, with the origin at the top-left corner of the primary display and y increasing downward. AppKit puts the origin at the bottom-left, with y increasing upward. Tauri adds its own window positioning API, monitor work areas, physical pixels, logical pixels, and scale factors. On a single screen, a mismatch can look almost right. On multiple monitors, especially with different scale factors, it becomes visibly wrong.

The version that tried to push the overlay through an AppKit setFrameTopLeftPoint-style path made Codex look better for a moment, then broke other apps. The overlay was no longer reliably near the input position in normal browser and native app scenarios.

The stable version went back to a simpler rule:

Use Accessibility only to get the anchor rectangle.
Use the Tauri window API to position the overlay.
Convert through the target monitor's work area and scale factor.
Clamp the overlay inside that monitor.

That separation made the code easier to reason about. Accessibility answers “where is the input?” Tauri answers “where can this webview window be placed on this monitor?”
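The placement half of that rule can be sketched without any windowing code. This is an illustrative sketch, not the shipped implementation: `Monitor`, `monitor_for`, `place_overlay`, and `to_physical` are hypothetical names, and it assumes the anchor and every work area are already expressed in the same logical, top-left-origin coordinate space. In a real app the monitor values would come from the Tauri monitor API, and mixed scale factors need extra care when mapping between spaces.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// Hypothetical monitor description (logical work area plus scale factor).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Monitor {
    work_area: Rect,
    scale_factor: f64,
}

/// Pick the monitor whose work area contains the center of the anchor.
fn monitor_for(anchor: Rect, monitors: &[Monitor]) -> Option<Monitor> {
    let cx = anchor.x + anchor.width / 2.0;
    let cy = anchor.y + anchor.height / 2.0;
    monitors.iter().copied().find(|m| {
        let w = m.work_area;
        cx >= w.x && cx < w.x + w.width && cy >= w.y && cy < w.y + w.height
    })
}

/// Place the overlay just below the anchor, flip it above the anchor if it
/// would overflow the bottom edge, then clamp it inside the work area.
fn place_overlay(anchor: Rect, overlay: (f64, f64), monitor: Monitor, gap: f64) -> (f64, f64) {
    let (overlay_w, overlay_h) = overlay;
    let work = monitor.work_area;
    let mut y = anchor.y + anchor.height + gap;
    if y + overlay_h > work.y + work.height {
        y = anchor.y - gap - overlay_h;
    }
    // `.max(...)` keeps the clamp range valid even if the overlay is larger
    // than the work area.
    let max_x = (work.x + work.width - overlay_w).max(work.x);
    let max_y = (work.y + work.height - overlay_h).max(work.y);
    (anchor.x.clamp(work.x, max_x), y.clamp(work.y, max_y))
}

/// Convert a logical point to physical pixels for the target monitor.
fn to_physical(point: (f64, f64), scale_factor: f64) -> (i32, i32) {
    (
        (point.0 * scale_factor).round() as i32,
        (point.1 * scale_factor).round() as i32,
    )
}
```

Keeping these three steps in one module means there is exactly one place where a coordinate-space mistake can hide.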

The current decision tree

The current implementation is roughly:

On the first trigger of a new cycle:
  clear the locked overlay position
  read the focused input anchor on a short background lookup
  log every Accessibility source that was tried
  position the overlay once
  lock that overlay position for the active cycle

While Command+Shift remains held:
  pressing V cycles items
  the overlay content updates
  the overlay position does not jump

When Command or Shift is released:
  end the cycle
  hide the overlay
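The locking behavior above can be modeled as a small state machine. This is a sketch with hypothetical names (`Cycle`, `trigger`, `release`); the `lookup` closure stands in for the anchor lookup and placement, and runs only on the first trigger of a cycle.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

/// State for one Command+Shift cycle (illustrative model).
#[derive(Debug, Default)]
struct Cycle {
    active: bool,
    index: usize,
    locked_position: Option<Point>,
}

impl Cycle {
    /// Called on every Command+Shift+V press. Returns the overlay position.
    fn trigger<F: FnOnce() -> Point>(&mut self, item_count: usize, lookup: F) -> Point {
        if !self.active {
            // First trigger of a new cycle: start fresh and clear the lock.
            self.active = true;
            self.index = 0;
            self.locked_position = None;
        } else if item_count > 0 {
            // Later triggers only advance the item; the position stays locked.
            self.index = (self.index + 1) % item_count;
        }
        *self.locked_position.get_or_insert_with(lookup)
    }

    /// Called when Command or Shift is released.
    fn release(&mut self) {
        self.active = false;
        self.locked_position = None;
    }
}
```

Because `get_or_insert_with` only invokes the closure when no position is stored, repeated presses within one cycle cannot move the overlay, even if focus geometry changes after the first paste.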

The anchor lookup has a timeout and an in-flight guard. That matters because Accessibility calls cross process boundaries, so a slow or stuck target app can stall the caller. A clipboard utility should never let that block the whole machine.
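A timeout plus in-flight guard can be built from the standard library alone. The sketch below is an assumption about shape, not the actual Paste Switch code: `AnchorLookup` and `run` are hypothetical names, and the `work` closure stands in for the Accessibility calls.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Runs a lookup on a background thread with a deadline (illustrative).
struct AnchorLookup {
    in_flight: Arc<AtomicBool>,
}

impl AnchorLookup {
    fn new() -> Self {
        Self { in_flight: Arc::new(AtomicBool::new(false)) }
    }

    /// Returns `None` if a previous lookup is still running or if `work`
    /// does not finish within `timeout`.
    fn run<T, F>(&self, timeout: Duration, work: F) -> Option<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        // In-flight guard: do not stack a new lookup on top of a stuck one.
        if self.in_flight.swap(true, Ordering::SeqCst) {
            return None;
        }
        let (tx, rx) = mpsc::channel();
        let flag = Arc::clone(&self.in_flight);
        thread::spawn(move || {
            let result = work();
            // Clear the guard before sending so the caller sees it released.
            flag.store(false, Ordering::SeqCst);
            // The receiver may already have timed out; ignore that error.
            let _ = tx.send(result);
        });
        rx.recv_timeout(timeout).ok()
    }
}
```

Note that the timeout does not cancel the worker thread. A stuck call keeps running in the background, which is exactly why the guard exists: it stops later shortcut presses from piling more threads onto an unresponsive app.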

The actual anchor priority is:

1. focused element direct caret
2. focused text element frame
3. focused descendants
4. focused app window paths
5. controlled fallback only when no usable anchor exists

The most important part is that each source is visible in logs. When a real app behaves differently, I want to see the reason instead of guessing.
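The priority list and the "visible in logs" requirement fit naturally into one loop. This is a minimal sketch with hypothetical names (`Source`, `first_usable`); each source is a closure so that lower-priority lookups are not executed once a higher one succeeds.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// A named anchor source: label for the log plus a lazy lookup.
type Source<'a> = (&'a str, &'a dyn Fn() -> Option<Rect>);

/// Try sources in priority order. Returns the chosen source (if any) and a
/// log line for every source that was actually tried.
fn first_usable(sources: &[Source<'_>]) -> (Option<(String, Rect)>, Vec<String>) {
    let mut tried = Vec::new();
    for (name, read) in sources {
        match (*read)() {
            Some(rect) => {
                tried.push(format!("{}: hit", name));
                return (Some((name.to_string(), rect)), tried);
            }
            None => tried.push(format!("{}: miss", name)),
        }
    }
    // No usable anchor: the caller applies the controlled fallback.
    (None, tried)
}
```

The returned log shows not just which source won but which ones were asked first and came back empty, which is the information needed when a real app behaves differently.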

The bugs that made the root cause clearer

Several wrong fixes were useful because they exposed the real shape of the problem.

One wrong fix removed or de-emphasized the focused text element frame fallback. That made Electron-like apps worse. The code was demanding exact caret geometry from apps that did not provide it.

Another wrong fix moved the overlay positioning into an AppKit path. That created a broader regression: other apps that used to position correctly now became wrong. The issue was not Accessibility anymore. It was coordinate conversion.

A third mistake was treating “use the input field frame” as an ugly fallback to avoid. In practice, it is a valid source with a different precision level. The product needs to know which one it got, but it should not reject a usable input box just because the exact caret is unavailable.

The final stable behavior came from restoring that layered model:

exact caret when available
focused text field frame when exact caret is unavailable
Tauri monitor positioning for the overlay window
logs that say which path was used

What working with an AI agent changed

This feature was implemented with an AI coding agent in the loop. That helped a lot, but it also created a few sharp edges.

The helpful part was speed. The agent could inspect the Rust and TypeScript paths, add logging, change the overlay positioning code, rebuild the app, and wire the UI quickly. For a macOS utility with a lot of small edge cases, that speed is valuable.

The dangerous part was confidence without enough evidence.

At one point, the agent concluded a fix was correct because the code path looked plausible. I tested it in Codex, and it was still wrong. Then another fix improved Codex but broke other apps. That is the exact failure mode you have to watch for when using an AI agent on UI and system integration work: it can optimize for the latest failing example and accidentally destroy a previously working path.

The process got better after I forced a stricter loop:

Do not guess from the code shape alone.
Add logs at the boundary where the system returns data.
Name every fallback source.
Compare native apps, browser apps, and Electron apps separately.
Do not claim success until the behavior is manually verified.
When a fix broadens the blast radius, stop and restore the known-good part.

The most useful logs were not verbose debug dumps. They were small structured traces:

focused app pid / role / window state
focused element role / subrole
AXSelectedTextRange availability
AXBoundsForRange result
AXFrame result
AXPosition + AXSize result
chosen source
final anchor rectangle

Those logs changed the conversation with the agent. Instead of arguing about theories, we could ask: did Codex return a caret? Did it return a text frame? Did the overlay anchor get converted to the correct monitor?
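A trace like that can be one struct with a single-line `Display` implementation. The sketch below is illustrative only: `AnchorTrace` and its field names are assumptions, it covers a subset of the fields listed above, and the key=value layout is one possible format rather than the one Paste Switch uses.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// One anchor lookup, captured at the Accessibility boundary (hypothetical).
struct AnchorTrace {
    pid: i32,
    role: String,
    subrole: Option<String>,
    has_selected_range: bool,
    bounds_for_range: Option<Rect>,
    frame: Option<Rect>,
    chosen_source: &'static str,
    final_anchor: Option<Rect>,
}

fn fmt_rect(rect: Option<Rect>) -> String {
    match rect {
        Some(r) => format!("{},{},{}x{}", r.x, r.y, r.width, r.height),
        None => "none".to_string(),
    }
}

impl fmt::Display for AnchorTrace {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // One line per lookup keeps traces easy to grep and to diff per app.
        write!(
            f,
            "pid={} role={} subrole={} range={} bounds={} frame={} source={} anchor={}",
            self.pid,
            self.role,
            self.subrole.as_deref().unwrap_or("none"),
            self.has_selected_range,
            fmt_rect(self.bounds_for_range),
            fmt_rect(self.frame),
            self.chosen_source,
            fmt_rect(self.final_anchor),
        )
    }
}
```

With one line per lookup, questions such as "did this app return a caret?" become a matter of reading the `bounds=` and `source=` fields for that app's pid.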

That is where the AI agent became useful again. Once the unknowns were observable, the agent could modify the right layer instead of patching around symptoms.

My practical rule for AI-assisted system work

AI agents are strongest when the task has a tight feedback loop and clear evidence. They are weakest when a bug is visible only in a real desktop interaction and the agent has to infer too much from code.

For this kind of work, I now use a stricter rule:

First make the hidden system state observable.
Then change the smallest layer that explains the observed state.

For Paste Switch, that meant:

log Accessibility results before changing positioning;
separate caret lookup from overlay window placement;
keep fallback behavior explicit instead of hiding it behind generic retry logic;
verify in at least one native app, one browser surface, one Electron surface, and a multi-monitor setup;
treat “it works here” as a local observation, not a global conclusion.

The final implementation is not fancy. It is careful.

It asks macOS for the most precise caret position it can get. When macOS or the target app cannot provide that, it uses the focused input box as the anchor. It keeps coordinate conversion in one place. It logs the source. It locks the position for one shortcut cycle so the overlay does not jump while I keep Command+Shift held.

That is the difference between an overlay that feels random and one that feels attached to where I am actually typing.

The takeaway

Precise input positioning on macOS is not one API call. It is a layered strategy:

Accessibility caret bounds
  -> focused text element frame
  -> descendant/window search
  -> monitor-aware overlay placement
  -> logs for every real app that disagrees with your assumptions

The AI agent helped move quickly, but the key was not letting it keep guessing. The real breakthrough came from making the OS boundary visible and forcing the implementation to respect what the target app actually exposed.

That is probably the broader lesson too. AI can write code fast. For system-level product details, correctness still comes from patient observation, tight feedback, and a human who refuses to accept a plausible explanation until the product actually behaves correctly.
