web4browser

Posted on Jun 20

When a Browser Profile Should Be Quarantined After Automation Failure

#playwright #automation #webdev #debugging

A browser profile lease solves a concurrency problem.

It answers:

Which worker is allowed to use this profile right now?

But after an automation failure, another question appears:

Is this profile still safe to reuse?

Those are not the same question.

A worker can release a lease correctly while still leaving the browser profile in an unknown state. The profile may contain changed cookies, refreshed tokens, modified local storage, a pending verification screen, an unfinished form, or a task result that nobody recorded.

If the scheduler immediately gives that profile to the next worker, the next run may not start from a clean account environment.

It may inherit damage from the previous failure.

That is why some failed profiles should not go directly back into the available pool.

They should be quarantined.

A lease is not a trust decision

A profile lease is about ownership.

It prevents two workers from opening the same profile at the same time. That matters, but it does not prove the profile is healthy after a failed run.

Quarantine is different. It is a post-failure state.

It means:

This profile was touched by a failed run, and we do not yet trust its account, session, environment, or task state.

In browser automation, a profile is not just a folder on disk. It can carry cookies, local storage, login state, permissions, extensions, proxy assumptions, language settings, and recent task history.

That is why debugging automation failures often requires looking beyond the script. You need to understand the account context the run operated in, not just the exception message.

What profile quarantine means

Quarantine does not mean deleting the profile.

It means temporarily removing the profile from normal scheduling until a review or recovery process decides what should happen next.

A quarantined profile should not be picked by:

a normal retry
another worker
a background status check
a headless task
an AI agent run
a teammate looking for any available profile

Instead, the profile should carry a reason, evidence, and a required next action.

For example:

{
  "profile_id": "profile_042",
  "status": "quarantined",
  "reason": "worker_crash_after_authenticated_action",
  "failed_run_id": "run_2026_06_20_018",
  "last_step": "submit_account_update",
  "last_known_url": "https://example.com/dashboard",
  "session_state": "unknown",
  "evidence": {
    "screenshot": true,
    "console_log": true,
    "network_summary": true,
    "worker_exit_code": 137
  },
  "review_required": true
}

The schema can vary.

The important part is that your system has a state between available and broken.

That state is quarantined.
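As a rough sketch of what that extra state buys you, here is a scheduler-side filter. The `PoolRecord` shape and the `pickSchedulable` helper are hypothetical names for illustration, loosely mirroring the JSON record above; your profile store will look different.

```typescript
// Hypothetical record shape, loosely mirroring the JSON example above.
type PoolStatus = "available" | "leased" | "quarantined";

interface PoolRecord {
  profileId: string;
  status: PoolStatus;
  reason?: string;
  reviewRequired?: boolean;
}

// Allow-list: only explicitly available profiles are schedulable.
// Anything quarantined is skipped by every caller that goes through here.
function pickSchedulable(records: PoolRecord[]): PoolRecord[] {
  return records.filter((r) => r.status === "available");
}

const examplePool: PoolRecord[] = [
  { profileId: "profile_041", status: "available" },
  {
    profileId: "profile_042",
    status: "quarantined",
    reason: "worker_crash_after_authenticated_action",
    reviewRequired: true
  },
  { profileId: "profile_043", status: "leased" }
];

console.log(pickSchedulable(examplePool).map((r) => r.profileId));
```

Filtering on `status === "available"` rather than on "not leased" is the design choice that matters: a profile has to be positively trusted to be picked, so retries, background checks, and agent runs all skip quarantined profiles by default.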

When a failure does not need quarantine

Not every failure should quarantine a profile.

Quarantine is usually unnecessary if:

the browser never launched
the profile was never opened
the worker failed before acquiring the lease
the run only touched public pages
the failure was clearly outside the browser profile
the browser closed cleanly and evidence is complete

The key question is not:

Did the task fail?

The better question is:

Did the failed run leave the profile in a state that the next run cannot safely explain?

If the profile was never touched, it usually does not need review.

If account state, session state, or profile storage may have changed, it should not be reused blindly.

Quarantine after authenticated state is reached

Quarantine becomes more important after login.

Before login, many failures are just infrastructure failures: DNS errors, launch errors, dependency problems, proxy connection failures, or selector misses on public pages.

After login, the browser is interacting with account state.

Quarantine is usually appropriate when failure happens after:

the account dashboard loaded
a session refresh occurred
a form was filled
a file was uploaded
an account setting changed
a message was drafted
a checkout or submission flow started
an agent interpreted an account-specific page
a headless task continued without visual review

The risk is not only that the task failed.

The risk is that the next run may continue from a state the scheduler does not understand.

Quarantine when session state may have changed

Some failures are dangerous because they make session state uncertain.

Examples:

the page asked for login again
a verification prompt appeared
the account was redirected to a warning page
the session expired halfway through the task
a token refresh failed
local storage changed during the run
cookies changed right before the crash
the automation cleared or modified browser data
the task appeared to switch accounts unexpectedly

A profile can still open successfully while being unsafe for the next task.

That is the trap.

“Profile opens” does not mean “profile is trusted.”
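One way to detect that kind of drift is to compare cookie snapshots taken at the start and end of a run. The sketch below assumes you already have both snapshots (for example from Playwright's `context.cookies()`, which returns objects that include these fields among others). `CookieSnapshot` and `changedCookieKeys` are illustrative names, not part of any library.

```typescript
// Subset of cookie fields needed for comparison (assumed shape).
interface CookieSnapshot {
  name: string;
  domain: string;
  value: string;
}

// Returns the "domain|name" keys of cookies that were added, removed,
// or had their value changed between the two snapshots.
function changedCookieKeys(
  before: CookieSnapshot[],
  after: CookieSnapshot[]
): string[] {
  const keyOf = (c: CookieSnapshot) => `${c.domain}|${c.name}`;
  const beforeMap = new Map<string, string>();
  const afterMap = new Map<string, string>();
  before.forEach((c) => beforeMap.set(keyOf(c), c.value));
  after.forEach((c) => afterMap.set(keyOf(c), c.value));

  const changed = new Set<string>();
  afterMap.forEach((value, key) => {
    if (beforeMap.get(key) !== value) changed.add(key); // added or modified
  });
  beforeMap.forEach((_value, key) => {
    if (!afterMap.has(key)) changed.add(key); // removed
  });
  return Array.from(changed).sort();
}

const cookiesAtStart: CookieSnapshot[] = [
  { name: "sid", domain: "example.com", value: "a" },
  { name: "theme", domain: "example.com", value: "dark" }
];
const cookiesAtFailure: CookieSnapshot[] = [
  { name: "sid", domain: "example.com", value: "b" },
  { name: "csrf", domain: "example.com", value: "x" }
];

console.log(changedCookieKeys(cookiesAtStart, cookiesAtFailure));
```

A non-empty result does not prove the profile is damaged. It only tells the scheduler that session state moved during a failed run, which is enough to set a flag like `sessionStateChanged`. In a real system you would probably compare hashes instead of raw cookie values.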

Quarantine when the environment changed mid-run

Browser automation often assumes that the profile, proxy, timezone, language, and region are consistent.

If those assumptions changed during the failed run, the profile should not return directly to the pool.

Quarantine is appropriate when:

the proxy failed mid-run
the worker retried through a different proxy
the exit region did not match the expected profile region
timezone or language settings changed
a mismatch was detected but the workflow continued
the task ran with an environment different from the assigned account context

A scheduler that only checks locked = false will miss this.

A better scheduler asks whether the profile environment is still coherent.
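A coherence check can be as small as comparing the environment assigned to the profile with the one the worker actually observed. This is a minimal sketch: the `ProfileEnvironment` fields and the `environmentMismatches` helper are assumptions for illustration, and how you observe the real exit region or timezone is up to your stack.

```typescript
// Assumed environment fields; adjust to whatever your profile store tracks.
interface ProfileEnvironment {
  proxyId: string;
  region: string;
  timezone: string;
  locale: string;
}

// Returns the names of fields where the observed environment
// differs from the one assigned to the profile.
function environmentMismatches(
  expected: ProfileEnvironment,
  observed: ProfileEnvironment
): (keyof ProfileEnvironment)[] {
  const fields: (keyof ProfileEnvironment)[] = [
    "proxyId",
    "region",
    "timezone",
    "locale"
  ];
  return fields.filter((field) => expected[field] !== observed[field]);
}

const assignedEnv: ProfileEnvironment = {
  proxyId: "proxy_7",
  region: "DE",
  timezone: "Europe/Berlin",
  locale: "de-DE"
};

// Example: the worker retried through a different proxy in another region.
const observedEnv: ProfileEnvironment = {
  ...assignedEnv,
  proxyId: "proxy_9",
  region: "US"
};

console.log(environmentMismatches(assignedEnv, observedEnv));
```

Any non-empty result after a failed run is a reason to set something like `environmentChanged` and keep the profile out of the pool until the mismatch is explained.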

Quarantine when the worker crashed while the browser was open

A worker crash is one of the clearest quarantine signals.

A crash means the automation may not have executed its cleanup path.

The worker may have failed to:

close the page correctly
save final evidence
mark the task result
record the last visible URL
detect a stop screen
restore temporary settings
write the final session state

A normal retry assumes the previous run ended in a known state.

A worker crash means the final state is unknown.

Unknown does not always mean corrupted. But unknown is enough to quarantine until someone or something checks it.
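On the supervisor side, a crash is often visible in how the worker process exited. The sketch below assumes the supervisor receives an exit code and an optional signal name (Node's child process `exit` event provides both). The `classifyWorkerExit` helper and its three categories are hypothetical, and treating small non-zero codes as ordinary task errors is an assumption about how your workers report failures.

```typescript
type ExitClass = "clean" | "task_error" | "crashed";

// Classifies a worker exit from the supervisor's point of view.
function classifyWorkerExit(
  code: number | null,
  signal: string | null
): ExitClass {
  // Killed by a signal, or no exit code at all: cleanup cannot be assumed.
  if (signal !== null || code === null) return "crashed";
  if (code === 0) return "clean";
  // Shell convention: 128 + signal number, e.g. 137 = 128 + 9 (SIGKILL).
  if (code > 128) return "crashed";
  // Assumption: other non-zero codes mean the task failed but the worker
  // exited on its own and had a chance to run its cleanup path.
  return "task_error";
}

console.log(classifyWorkerExit(137, null));
console.log(classifyWorkerExit(1, null));
console.log(classifyWorkerExit(null, "SIGKILL"));
```

An exit code of 137, like the one in the quarantine record earlier, usually means the process was killed with SIGKILL, often by an out-of-memory killer. No `finally` block runs in that case, so the final profile state is unknown by definition.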

Quarantine near non-idempotent actions

Some browser actions are safe to repeat.

Opening a page is usually safe. Reading a dashboard is usually safe. Taking a screenshot is usually safe.

Other actions are not safe to repeat without context.

Examples:

submitting a form
sending a message
changing account settings
uploading a file
deleting or archiving something
approving a workflow
confirming a payment-related step
accepting a prompt that changes account state

If automation fails near a non-idempotent action, do not immediately retry with the same profile.

Ask first:

Did the action already happen?
Did the website accept the request before the crash?
Did the UI fail to show confirmation even though the backend accepted it?
Would a retry duplicate the action?
Would the next worker understand the current page?

If the answer is unknown, quarantine the profile.
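One way to make "did the action already happen?" answerable is to record intent before each non-idempotent action and confirmation after it. The `ActionJournal` class below is a hypothetical in-memory sketch; a real system would persist these entries outside the worker process so they survive a crash.

```typescript
type IntentStatus = "intended" | "confirmed";

class ActionJournal {
  private entries = new Map<string, IntentStatus>();

  // Call BEFORE triggering the non-idempotent action.
  intend(actionId: string): void {
    this.entries.set(actionId, "intended");
  }

  // Call only after the site has visibly confirmed the action.
  confirm(actionId: string): void {
    if (!this.entries.has(actionId)) {
      throw new Error(`Unknown action: ${actionId}`);
    }
    this.entries.set(actionId, "confirmed");
  }

  // Actions that were started but never confirmed.
  unresolved(): string[] {
    const pending: string[] = [];
    this.entries.forEach((status, actionId) => {
      if (status === "intended") pending.push(actionId);
    });
    return pending;
  }
}

const journal = new ActionJournal();
journal.intend("upload_avatar");
journal.confirm("upload_avatar");
journal.intend("submit_account_update"); // run fails before confirmation

console.log(journal.unresolved());
```

If `unresolved()` is non-empty after a failed run, nobody knows whether the backend accepted the request. That is exactly the case where a blind retry may duplicate the action, so the profile should be quarantined.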

Quarantine when evidence is incomplete

Sometimes the failure is not the biggest problem.

The missing evidence is.

A failed run should leave enough information for the next operator to understand what happened.

At minimum, capture:

profile ID
run ID
worker ID
lease start and end time
last step name
last URL
screenshot
visible account state
proxy ID and region
relevant console errors
network failure summary
retry count
stop reason
browser close status

If the evidence bundle is incomplete, quarantine is safer than reuse.

Without evidence, the next run is not debugging.

It is guessing.
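Evidence completeness can be checked mechanically before any reuse decision. The sketch below covers a subset of the list above. The `EvidenceBundle` field names and the `missingEvidence` helper are assumptions for illustration, not a fixed schema.

```typescript
// Assumed bundle shape covering part of the list above.
interface EvidenceBundle {
  profileId?: string;
  runId?: string;
  workerId?: string;
  lastStep?: string;
  lastUrl?: string;
  screenshotPath?: string;
  proxyId?: string;
  stopReason?: string;
  browserClosedCleanly?: boolean;
}

const REQUIRED_EVIDENCE: (keyof EvidenceBundle)[] = [
  "profileId",
  "runId",
  "workerId",
  "lastStep",
  "lastUrl",
  "screenshotPath",
  "proxyId",
  "stopReason",
  "browserClosedCleanly"
];

// Returns the required fields that are absent or empty.
function missingEvidence(bundle: EvidenceBundle): (keyof EvidenceBundle)[] {
  return REQUIRED_EVIDENCE.filter(
    (key) => bundle[key] === undefined || bundle[key] === ""
  );
}

const partialBundle: EvidenceBundle = {
  profileId: "profile_042",
  runId: "run_2026_06_20_018",
  workerId: "worker_3",
  lastStep: "submit_account_update",
  lastUrl: "https://example.com/dashboard",
  proxyId: "proxy_7",
  browserClosedCleanly: false
};

console.log(missingEvidence(partialBundle));
```

Note that `browserClosedCleanly: false` counts as present. The check is about whether a value was recorded, not whether it is good news. Here the screenshot and stop reason are missing, so the run result should report incomplete evidence.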

A Playwright example

With Playwright, this problem often appears when using a persistent profile directory.

For example:

import { chromium, type BrowserContext } from "playwright";

type ProfileRunResult = {
  openedProfile: boolean;
  reachedAuthenticatedArea: boolean;
  workerCrashed: boolean;
  browserClosedCleanly: boolean;
  evidenceComplete: boolean;
  sessionStateChanged: boolean;
  environmentChanged: boolean;
  nearNonIdempotentAction: boolean;
};

function shouldQuarantine(result: ProfileRunResult): boolean {
  if (!result.openedProfile) return false;

  if (result.workerCrashed) return true;
  if (!result.browserClosedCleanly) return true;
  if (!result.evidenceComplete) return true;

  if (!result.reachedAuthenticatedArea) {
    return false;
  }

  return (
    result.sessionStateChanged ||
    result.environmentChanged ||
    result.nearNonIdempotentAction
  );
}

async function runWithPersistentProfile(profilePath: string) {
  let context: BrowserContext | undefined;
  let browserClosedCleanly = false;
  // Tracks whether anything went wrong, so a successful run is not quarantined.
  let runFailed = false;

  const result: ProfileRunResult = {
    openedProfile: false,
    reachedAuthenticatedArea: false,
    workerCrashed: false,
    browserClosedCleanly: false,
    evidenceComplete: false,
    sessionStateChanged: false,
    environmentChanged: false,
    nearNonIdempotentAction: false
  };

  try {
    context = await chromium.launchPersistentContext(profilePath, {
      headless: false
    });

    result.openedProfile = true;

    const page = await context.newPage();

    await page.goto("https://example.com/dashboard");
    result.reachedAuthenticatedArea = true;

    // After this point, the run is touching account state.
    // Set the flag before clicking: if the click throws partway through,
    // the failure still happened near a non-idempotent action.
    result.nearNonIdempotentAction = true;
    await page.getByRole("button", { name: "Update settings" }).click();

    await page.screenshot({ path: "evidence/final.png", fullPage: true });
    result.evidenceComplete = true;
  } catch (error) {
    runFailed = true;
    await saveFailureEvidence(error);
    result.evidenceComplete = true;
    throw error;
  } finally {
    if (context) {
      try {
        await context.close();
        browserClosedCleanly = true;
      } catch (closeError) {
        // A failed close leaves the profile in an unknown state.
        runFailed = true;
        await saveFailureEvidence(closeError);
      }
    }

    result.browserClosedCleanly = browserClosedCleanly;

    // Only a failed run can change the trust level of the profile.
    if (runFailed && shouldQuarantine(result)) {
      await markProfileAsQuarantined(profilePath, result);
    } else {
      await markProfileAsAvailable(profilePath);
    }
  }
}

async function saveFailureEvidence(error: unknown) {
  console.error("Saving failure evidence", error);
}

async function markProfileAsQuarantined(
  profilePath: string,
  result: ProfileRunResult
) {
  console.log("Quarantine profile", profilePath, result);
}

async function markProfileAsAvailable(profilePath: string) {
  console.log("Release profile", profilePath);
}

The exact implementation will depend on your scheduler, queue, and profile store.

The main point is the placement.

Do not release the profile unconditionally in finally.

First decide whether the failed run changed the trust level of that profile.

In a real worker pool, workerCrashed is usually set by the supervisor process, not by the page task itself. A page task can report what it observed, but the supervisor is usually the component that knows whether the process crashed, timed out, or was killed.

Use profile states, not just locks

A simple scheduler may treat profiles as either locked or unlocked.

That model is too small for multi-account automation.

A more useful model looks like this:

available
leased
cooldown
quarantined
needs_review
repairing
retired

Each state has a different meaning.

available means a worker can use the profile.

leased means a worker is currently using it.

cooldown means the profile is probably fine, but should not be reused immediately.

quarantined means a failure created uncertainty.

needs_review means a human or recovery workflow must inspect the evidence.

repairing means the system is actively restoring or validating the profile.

retired means the profile should no longer be used.

The goal is to avoid a common mistake:

treating every released lock as a reusable profile.
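A small transition table makes that mistake hard to commit by accident. The state names below come from the list above, but the allowed transitions are my assumption about one reasonable policy, not a standard. Adjust them to your own workflow.

```typescript
type ProfileState =
  | "available"
  | "leased"
  | "cooldown"
  | "quarantined"
  | "needs_review"
  | "repairing"
  | "retired";

// Assumed policy: a quarantined profile can never jump straight
// back to "available" or "leased".
const ALLOWED_TRANSITIONS: Record<ProfileState, ProfileState[]> = {
  available: ["leased", "retired"],
  leased: ["available", "cooldown", "quarantined"],
  cooldown: ["available", "quarantined"],
  quarantined: ["needs_review", "repairing", "retired"],
  needs_review: ["repairing", "available", "retired"],
  repairing: ["available", "quarantined", "retired"],
  retired: []
};

// Returns the new state, or throws if the move is not allowed.
function transitionProfile(from: ProfileState, to: ProfileState): ProfileState {
  if (ALLOWED_TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`Illegal profile transition: ${from} -> ${to}`);
  }
  return to;
}

console.log(transitionProfile("leased", "quarantined"));

try {
  transitionProfile("quarantined", "available");
} catch (error) {
  console.log((error as Error).message);
}
```

With this table, releasing a lease is just one of several exits from `leased`, and the only paths from `quarantined` back to `available` run through review or repair.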

What review should check

A quarantined profile needs a review path.

That review can be manual, automated, or hybrid.

A practical review checklist:

Confirm the profile identity.
Confirm which account the profile belongs to.
Check the failed run ID and last step.
Review the final screenshot and last known URL.
Verify whether the account is logged in, logged out, challenged, or partially loaded.
Confirm proxy, region, timezone, and language.
Check whether the last action may have completed.
Decide whether to resume, repair, reset, or retire.
Write the decision back to the profile record.
Only then return the profile to the available pool.

Quarantine should produce a decision.

It should not become a forgotten holding area.
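To keep quarantine from turning into that holding area, a periodic job can flag profiles that have waited too long without a decision. This is a sketch under assumed names: `QuarantineEntry`, `overdueReviews`, and the 24-hour threshold in the example are all illustrative.

```typescript
type ReviewDecision = "resume" | "repair" | "reset" | "retire";

interface QuarantineEntry {
  profileId: string;
  quarantinedAt: Date;
  decision?: ReviewDecision; // undefined until a review writes it back
}

// Returns IDs of profiles that have been quarantined longer than
// maxAgeHours without any recorded decision.
function overdueReviews(
  entries: QuarantineEntry[],
  now: Date,
  maxAgeHours: number
): string[] {
  const maxAgeMs = maxAgeHours * 60 * 60 * 1000;
  return entries
    .filter(
      (entry) =>
        entry.decision === undefined &&
        now.getTime() - entry.quarantinedAt.getTime() > maxAgeMs
    )
    .map((entry) => entry.profileId);
}

const quarantineList: QuarantineEntry[] = [
  { profileId: "profile_a", quarantinedAt: new Date("2026-06-20T00:00:00Z") },
  { profileId: "profile_b", quarantinedAt: new Date("2026-06-21T06:00:00Z") },
  {
    profileId: "profile_c",
    quarantinedAt: new Date("2026-06-19T00:00:00Z"),
    decision: "retire"
  }
];

const checkTime = new Date("2026-06-21T12:00:00Z");
console.log(overdueReviews(quarantineList, checkTime, 24));
```

In this example only `profile_a` is reported: it has waited 36 hours with no decision, while `profile_b` is still within the window and `profile_c` already has one.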

The retry mistake

Many failures look temporary.

Timeouts look temporary. Proxy errors look temporary. Selector misses look temporary. Headless crashes look temporary.

Some of them are temporary.

But the exception class is not enough.

A timeout before login is different from a timeout after submitting an authenticated form.

A crash before browser launch is different from a crash after session refresh.

A proxy error before opening the profile is different from a proxy error after the account dashboard loads.

The quarantine decision should depend on what the profile may have experienced, not just the final error message.
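In code, that means the retry policy should key on how far the run got, not on the error class. The phases and the `decideAfterFailure` helper below are hypothetical, and sending every authenticated-phase failure to quarantine is a deliberately conservative assumption. You may want finer conditions there.

```typescript
type FailurePhase =
  | "before_launch"
  | "public_pages_only"
  | "authenticated"
  | "after_account_action";

type FailureDecision = "retry" | "quarantine";

interface FailureReport {
  errorName: string; // recorded as evidence, not used for the decision
  phase: FailurePhase;
}

// Decides the next step from what the profile may have experienced.
function decideAfterFailure(report: FailureReport): FailureDecision {
  switch (report.phase) {
    case "before_launch":
    case "public_pages_only":
      return "retry";
    case "authenticated":
    case "after_account_action":
      return "quarantine";
  }
}

// Same error class, different phases, different outcomes.
console.log(
  decideAfterFailure({ errorName: "TimeoutError", phase: "before_launch" })
);
console.log(
  decideAfterFailure({ errorName: "TimeoutError", phase: "after_account_action" })
);
```

The error name is still worth recording as evidence. It just should not be the only input to the decision.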

The cleanup mistake

Another common mistake is cleaning the profile too quickly.

Clearing cookies or local storage might be the right recovery action, but doing it immediately can destroy useful evidence.

Before cleaning anything, capture the state.

Ask:

What was the last visible account state?
Which storage values changed recently?
Did the site still recognize the account?
Was the proxy consistent with the profile?
Did the failure repeat from the same step?
What evidence would be lost if the profile was reset now?

Cleaning may repair the profile.

Cleaning too early can erase the reason it failed.

Why AI browser agents make this stricter

AI browser agents make quarantine more important, not less.

A rigid script may stop when the page changes. An agent may adapt, continue, and find another path.

That flexibility can be useful. It also means the agent may move through states the original workflow did not explicitly model.

After a failed agent run, the team needs to know:

what the agent saw
what it changed
what it skipped
what it assumed
why it stopped
whether the profile is still safe

A profile used by an agent should be quarantined when the action trace is incomplete, the final page state is unclear, or the agent made a judgment call near an account boundary.

The question is not only whether the agent failed.

The question is whether the profile remains trustworthy after the agent failed.

The practical rule

Here is the rule I use:

A failed browser profile can be reused only when the team can explain the last account state, the last environment state, the last task state, and the last evidence state.

If one of those is unknown, quarantine the profile.
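Written as a predicate, the rule is short. The `LastKnownState` shape and `canReuseProfile` name are illustrative. How each field gets set to "known" is the hard part and depends on your evidence pipeline.

```typescript
type Knowledge = "known" | "unknown";

interface LastKnownState {
  account: Knowledge;
  environment: Knowledge;
  task: Knowledge;
  evidence: Knowledge;
}

// A failed profile is reusable only if all four states can be explained.
function canReuseProfile(state: LastKnownState): boolean {
  return (
    state.account === "known" &&
    state.environment === "known" &&
    state.task === "known" &&
    state.evidence === "known"
  );
}

const afterCrash: LastKnownState = {
  account: "known",
  environment: "known",
  task: "unknown", // nobody recorded whether the last action completed
  evidence: "known"
};

console.log(canReuseProfile(afterCrash) ? "reuse" : "quarantine");
```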

That may feel conservative in a small automation project.

In a multi-worker system, it becomes operational hygiene.

A browser profile is not just a launch target. It is the account environment that the next worker will inherit.

Treat it that way.

Final checklist

Quarantine the profile if:

the worker crashed after opening the profile
the run failed after reaching an authenticated area
session state may have changed
cookies or local storage may have changed
proxy or region changed during the run
the failure happened near a non-idempotent action
the account showed login, verification, warning, or redirect screens
evidence is incomplete
the browser did not close cleanly
the next operator cannot explain what happened

Do not quarantine automatically if:

the profile was never opened
the failure happened before browser launch
the run only touched public pages
no account state was loaded
the browser closed cleanly and evidence is complete
the failure is clearly outside the profile

The goal is not to quarantine everything.

The goal is to stop treating failed account environments as clean inputs.

A lease prevents two workers from colliding.

A quarantine state prevents one failed run from becoming the starting point for the next failure.

DEV Community