KimSejun

when your agent fails, does it just... stop?
I wrote this post as an entry for the Gemini Live Agent Challenge. But this particular problem — what happens when an AI action fails — is one every agent builder has to solve.

Most desktop automation tools have a dirty secret: they're fragile. Click the wrong pixel, target an element that moved, or encounter an unexpected dialog — and the whole sequence collapses. The user sees "Error" and reaches for the keyboard.

VibeCat's self-healing engine was built because we got tired of watching our cat give up.

the failure taxonomy

After running hundreds of test sequences across three apps (Antigravity IDE, Terminal, Chrome), we cataloged the failure modes:

AX target not found — The Accessibility API says the element doesn't exist. Usually because the app hasn't finished rendering, or because the element is inside a canvas/WebGL surface. Frequency: ~15% of first attempts on Chrome.

AX target found but wrong — The element exists but it's the wrong one. A "Play" button that's actually in a different panel, or a text field that looks right but belongs to a different component. Frequency: ~5%.

Click landed but nothing happened — The coordinates were correct, the click fired, but the UI didn't respond. Common with YouTube's debounced event handlers. Frequency: ~10% on YouTube Music.

Action succeeded but verification failed — VibeCat typed the text and it appeared, but the post-action screenshot shows an error dialog or unexpected state. Frequency: ~3%.

max 2 retries, alternative grounding

The self-healing engine is deliberately simple. No complex state machines, no machine learning. Just two rules:

  1. Max 2 retries per step. If it fails three times, stop and tell the user.
  2. Each retry uses a different grounding source. Don't repeat what already failed.

```
Attempt 1: AX targeting
  → Failed: element not in AX tree

Attempt 2: CDP targeting (chromedp)
  → Failed: Chrome DevTools can't find matching DOM node

Attempt 3: Vision coordinates (Gemini screenshot analysis)
  → Success: clicked at (847, 423), verification passed
```

The grounding source priority chain is AX → CDP → Vision. But the engine is smart enough to skip sources that don't apply — if you're in Terminal (no browser), CDP is skipped entirely.

Here's the core logic in handler.go:

```go
func (h *Handler) executeWithHealing(ctx context.Context, step *Step) error {
    sources := []GroundingSource{AX, CDP, Vision}

    for attempt := 0; attempt <= maxRetries; attempt++ {
        source := sources[min(attempt, len(sources)-1)]

        err := h.executeStep(ctx, step, source)
        if err == nil {
            verified, verifyErr := h.verifyStep(ctx, step)
            if verifyErr == nil && verified {
                return nil
            }
        }

        if attempt < maxRetries { // don't announce a retry that won't happen
            h.emitProcessingState("retrying_step", step, attempt+1)
            slog.Info("self-healing retry",
                "step", step.ID,
                "attempt", attempt+1,
                "failed_source", source,
                "next_source", sources[min(attempt+1, len(sources)-1)])
        }
    }

    return fmt.Errorf("step %s failed after %d attempts", step.ID, maxRetries+1)
}
```

vision verification: the trust layer

Every action — whether it's typing text, clicking a button, or opening a URL — ends with a verification step. VibeCat captures a fresh screenshot and sends it to the ADK Orchestrator with a specific question: "Did the action succeed?"

This isn't just "did the click register?" It's semantic verification:

  • After typing "go vet ./..." in Terminal → verify the command output shows "no issues"
  • After clicking Play on YouTube Music → verify the video element is no longer paused
  • After opening a URL → verify the expected page content is visible

The ADK Orchestrator uses Gemini's vision model for this analysis. It returns a confidence score and a natural-language explanation. If confidence is below the threshold, the step is marked as failed and healing kicks in.

```json
{
  "success": false,
  "confidence": 0.3,
  "explanation": "The play button appears unchanged. The video progress bar has not moved."
}
```

Low confidence → trigger retry with CDP grounding.
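Decoding that response and applying the threshold is a few lines of Go. The struct fields match the JSON above; the 0.7 cutoff is an assumption for illustration — the post doesn't state the actual threshold:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Verification mirrors the JSON the ADK Orchestrator returns.
type Verification struct {
	Success     bool    `json:"success"`
	Confidence  float64 `json:"confidence"`
	Explanation string  `json:"explanation"`
}

const confidenceThreshold = 0.7 // hypothetical cutoff, not confirmed by the post

// passed reports whether a step should be accepted or handed to healing.
func (v Verification) passed() bool {
	return v.Success && v.Confidence >= confidenceThreshold
}

func main() {
	raw := `{"success": false, "confidence": 0.3, "explanation": "The play button appears unchanged."}`
	var v Verification
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		panic(err)
	}
	fmt.Println(v.passed()) // false → step marked failed, healing kicks in
}
```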

the pendingFC queue: no racing allowed

One subtle failure mode we discovered: Gemini sometimes issues multiple function calls in rapid succession. "Focus Terminal, then type go vet ./..., then press Enter." If these execute in parallel, go vet might get typed into the wrong window because focus_app hasn't completed yet.

The pendingFC mechanism solves this with strict sequential execution:

  1. Gemini sends FC calls → queued in pendingFC
  2. Gateway sends step 1 to client
  3. Client executes, captures verification screenshot
  4. Gateway confirms step 1 → sends step 2
  5. Repeat until queue is empty

No step starts until the previous step's verification passes. This adds latency (~200ms per step for verification) but eliminates an entire class of race condition bugs.

transparent narration: failures feel collaborative

The most impactful design decision wasn't technical — it was UX. VibeCat narrates every step through the overlay panel:

```
🔍 Reading screen...
📋 Planning 3 steps
▶️ Step 1/3: Focusing Terminal [AX]
⚠️ Retrying Step 1 — switching to CDP
✅ Step 1/3: Terminal focused
▶️ Step 2/3: Typing command...
```

Users who watched VibeCat fail silently reported it as "broken." Users who watched the same failure with narration reported it as "working through a problem." Same outcome, completely different perception.

The seven processing stages (analyzing_command, planning_steps, executing_step, verifying_result, retrying_step, completing, observing_screen) each have localized labels in English, Korean, and Japanese. The overlay shows a grounding source badge (AX / Vision / Hotkey / System) so you always know how VibeCat is interacting with your screen.
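A stage-label lookup with an English fallback is enough for this. The stage keys come from the post; the translations below are illustrative, not VibeCat's actual strings:

```go
package main

import "fmt"

// stageLabels maps processing stage → locale → overlay text.
// Translations are hypothetical examples, not the shipped strings.
var stageLabels = map[string]map[string]string{
	"retrying_step": {
		"en": "Retrying step...",
		"ko": "단계 재시도 중...",
		"ja": "ステップを再試行中...",
	},
}

// label returns the localized overlay text, falling back to English
// when the requested locale is missing.
func label(stage, locale string) string {
	if s, ok := stageLabels[stage][locale]; ok {
		return s
	}
	return stageLabels[stage]["en"]
}

func main() {
	fmt.Println(label("retrying_step", "ko"))
	fmt.Println(label("retrying_step", "fr")) // no French → English fallback
}
```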

numbers that matter

After implementing self-healing, our end-to-end success rates across 50 test runs:

| Scenario | Without healing | With healing |
| --- | --- | --- |
| YouTube Music play | 62% | 94% |
| Code comment enhancement | 88% | 100% |
| Terminal `go vet` | 91% | 100% |

The remaining 6% failure on YouTube Music is almost entirely due to network latency — the page hasn't finished loading when VibeCat tries to click. A simple "wait for page ready" check would probably push it to 98%+.

what I learned

Self-healing isn't about being clever. It's about being systematic. Catalog your failures, build a fallback chain, verify every step, and tell the user what's happening. The hard part isn't the retry logic — it's the verification. Without reliable post-action verification, you're just clicking blindly and hoping.

And narrate everything. Always narrate everything. Silent AI feels broken. Transparent AI feels collaborative.
