web4browser

Posted on Jun 18

Recovering Stale Browser Profile Locks Without Corrupting Account State

#testing #automation #webdev #javascript

A browser profile lock is supposed to prevent one dangerous situation: two workers opening the same account environment at the same time.

The previous post in this series covered prevention: do not let two workers acquire the same browser profile.

This one covers recovery.

What happens when the first worker crashes, times out, or loses network access after it has already opened a logged-in browser profile?

That is where many automation systems make the wrong decision. They see an old lock, delete it, and let another worker continue.

That may make the queue move again, but it can quietly corrupt account state.

A stale lock should not be treated as a file to remove. It should be treated as an unfinished account operation that needs a controlled recovery path.

For a broader team workflow model, this connects to the idea of a profile ownership and handoff layer.

TL;DR

A stale browser profile lock should move through recovery states, not disappear.

Use this model:

held
  -> suspected_stale
  -> quarantined
  -> inspected
  -> available | resume_pending | manual_review

The lock is only metadata. The real thing you are protecting is the browser profile behind it: cookies, local storage, proxy mapping, extension state, task history, and the last known page state.

The lock is not the account state

A lock only tells your system that a worker claimed the right to use a profile.

It does not tell you whether the browser process is still alive.

It does not tell you whether the profile closed cleanly.

It does not tell you whether the last task stopped on a safe page.

It does not tell you whether cookies, local storage, IndexedDB, extension storage, or profile metadata were being modified when the failure happened.

That is why this rule matters:

Never recover a browser profile lock by deleting the lock alone.

The lock is metadata. The account state lives somewhere else.

A safe recovery model has to protect both.

What makes a profile lock stale?

A profile lock becomes stale when the system can no longer trust that the owner worker is still actively using the profile.

Common causes include:

the worker process crashed
the host restarted
the job exceeded its maximum runtime
heartbeat updates stopped
the queue visibility timeout expired
the browser process is still alive but detached from the worker
the worker completed the task but failed before releasing the lock
a network partition made the worker unreachable

These cases do not all mean the same thing.

A crashed worker may be gone.

A network-partitioned worker may still be running.

A browser process may still hold the profile directory even if the job owner disappeared.

A task may have failed after submitting a form, changing settings, or landing on a verification page.

So a check like lock age > timeout is a useful trigger, but on its own it is not enough to safely reassign the profile.
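One way to keep these cases apart is to record why a lock looks stale instead of only that it expired. The sketch below is illustrative: `staleReason`, the `signals` object, and every reason string except `heartbeat_timeout` are assumptions, not part of any particular lock store.

```javascript
// Hypothetical helper: turns liveness signals into a recovery reason.
// Returns null while the lease is still valid.
function staleReason(lock, signals = {}, now = new Date()) {
  if (new Date(lock.expires_at) > now) {
    return null
  }
  // An expired lease does not mean the profile is idle.
  if (signals.browserProcessAlive) {
    return "browser_still_running"
  }
  if (signals.workerReachable) {
    return "worker_alive_lease_expired"
  }
  return "heartbeat_timeout"
}

const lock = { profile_id: "profile_us_018", expires_at: "2026-06-18T09:38:40Z" }
const now = new Date("2026-06-18T09:39:05Z")

console.log(staleReason(lock, {}, now))
console.log(staleReason(lock, { browserProcessAlive: true }, now))
```

Whatever reason comes back, the result is still only a suspicion. It decides what the recovery path should look at first, not whether the profile can be reassigned.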

Store locks as leases, not permanent ownership

A browser profile lock should behave like a lease.

A lease has an owner, an expiry time, a heartbeat, and a version.

The version is important because it prevents an old worker from writing after a new worker has recovered the profile.

A simple lock record can look like this:

{
  "profile_id": "profile_us_018",
  "owner_worker_id": "worker-7",
  "task_id": "task_20260618_0932",
  "lease_token": 42,
  "acquired_at": "2026-06-18T09:32:10Z",
  "heartbeat_at": "2026-06-18T09:36:40Z",
  "expires_at": "2026-06-18T09:38:40Z",
  "state": "held"
}

The lease_token is a fencing token.

Every write related to this profile should include the latest token. That includes task logs, status updates, screenshots, profile metadata updates, and release attempts.

If worker-7 comes back after the lock has already been recovered by worker-12, its old token should be rejected.

That one rule prevents many silent corruption cases.
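The same token check applies to heartbeats. Here is a minimal sketch of lease renewal, assuming a two-minute lease (the gap between `heartbeat_at` and `expires_at` in the record above). `renewLease` and `LEASE_TTL_MS` are hypothetical names, and rejecting renewal after expiry is a design choice, not a requirement.

```javascript
// Assumption: a 2-minute lease, matching the sample record.
const LEASE_TTL_MS = 120 * 1000

// Returns a renewed copy of the lock, or a rejection with a reason.
function renewLease(lock, workerId, leaseToken, now = new Date()) {
  if (lock.state !== "held") {
    return { ok: false, reason: "not_held" }
  }
  if (lock.owner_worker_id !== workerId) {
    return { ok: false, reason: "not_owner" }
  }
  if (lock.lease_token !== leaseToken) {
    return { ok: false, reason: "stale_token" }
  }
  // Once the lease has expired, recovery may already be looking at it.
  if (new Date(lock.expires_at) <= now) {
    return { ok: false, reason: "lease_expired" }
  }
  return {
    ok: true,
    lock: {
      ...lock,
      heartbeat_at: now.toISOString(),
      expires_at: new Date(now.getTime() + LEASE_TTL_MS).toISOString()
    }
  }
}

const held = {
  profile_id: "profile_us_018",
  owner_worker_id: "worker-7",
  lease_token: 42,
  heartbeat_at: "2026-06-18T09:36:40Z",
  expires_at: "2026-06-18T09:38:40Z",
  state: "held"
}

const renewed = renewLease(held, "worker-7", 42, new Date("2026-06-18T09:37:00Z"))
const tooLate = renewLease(held, "worker-7", 42, new Date("2026-06-18T09:40:00Z"))
```

A worker that receives `lease_expired` or `stale_token` should stop touching the browser and exit, not try to reacquire the profile on its own.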

Do not jump from stale to available

This is the common mistake:

if (lock.expires_at < now) {
  await deleteLock(profileId)
  await enqueue(profileId)
}

It is simple, but too dangerous.

The lock may be stale, but the profile may not be safe.

A safer system uses intermediate states:

held
  -> suspected_stale
  -> quarantined
  -> inspected
  -> available | resume_pending | manual_review

The point of quarantined is to prevent a second worker from immediately opening the profile while the system is still deciding what happened.

The recovery worker is not taking over the account yet.

It is taking over the investigation.
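The recovery path can be enforced with a small transition table, so no code path can skip quarantine by accident. This is a sketch of the recovery states only. `TRANSITIONS`, `canTransition`, and `transition` are hypothetical names, and a real system would also need the normal acquire and release transitions.

```javascript
// Allowed moves along the recovery path described above.
const TRANSITIONS = new Map([
  ["held", ["suspected_stale"]],
  ["suspected_stale", ["quarantined"]],
  ["quarantined", ["inspected"]],
  ["inspected", ["available", "resume_pending", "manual_review"]]
])

function canTransition(from, to) {
  return (TRANSITIONS.get(from) ?? []).includes(to)
}

// Returns a copy of the lock in the new state, or throws on an illegal move.
function transition(lock, to) {
  if (!canTransition(lock.state, to)) {
    throw new Error(`illegal transition: ${lock.state} -> ${to}`)
  }
  return { ...lock, state: to }
}

console.log(canTransition("held", "available"))
console.log(canTransition("inspected", "manual_review"))
```

With this table, the "delete and enqueue" shortcut shows up as an illegal `held -> available` move instead of a silent success.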

Step 1: mark the lock as suspected stale

When a lease expires, update the lock only if the old value still matches what you observed.

This is a compare-and-swap operation.

async function markSuspectedStale(profileId, observedToken) {
  const lock = await lockStore.get(profileId)

  if (!lock) {
    return { ok: false, reason: "missing_lock" }
  }

  if (lock.lease_token !== observedToken) {
    return { ok: false, reason: "lock_changed" }
  }

  if (new Date(lock.expires_at) > new Date()) {
    return { ok: false, reason: "lease_still_active" }
  }

  return lockStore.compareAndSet(profileId, observedToken, {
    ...lock,
    state: "suspected_stale",
    suspected_at: new Date().toISOString()
  })
}

This protects you from racing with another worker that already renewed or recovered the lock.
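The function above assumes a `compareAndSet` that checks the token and writes in one atomic step. For local testing, an in-memory stand-in can look like the sketch below. `InMemoryLockStore` and its `put` method are hypothetical, and the `{ ok, reason }` return shape is an assumption chosen to match how the result is used in this post.

```javascript
// In-memory stand-in for a lock store, for tests only.
class InMemoryLockStore {
  constructor() {
    this.locks = new Map()
  }

  get(profileId) {
    const lock = this.locks.get(profileId)
    return lock ? { ...lock } : null
  }

  put(profileId, lock) {
    this.locks.set(profileId, { ...lock })
  }

  // Writes `next` only if the stored token still equals `expectedToken`.
  compareAndSet(profileId, expectedToken, next) {
    const current = this.locks.get(profileId)
    if (!current) {
      return { ok: false, reason: "missing_lock" }
    }
    if (current.lease_token !== expectedToken) {
      return { ok: false, reason: "lock_changed" }
    }
    this.locks.set(profileId, { ...next })
    return { ok: true }
  }
}

const store = new InMemoryLockStore()
store.put("profile_us_018", { profile_id: "profile_us_018", lease_token: 42, state: "held" })

const next = { profile_id: "profile_us_018", lease_token: 43, state: "quarantined" }
const won = store.compareAndSet("profile_us_018", 42, next)
const lost = store.compareAndSet("profile_us_018", 42, next)
```

The methods are synchronous, which makes the check-and-write atomic inside a single Node.js process, and `await` still works on their return values. In a shared store, the same guarantee has to come from the database, for example a conditional update that only matches the expected token.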

Step 2: quarantine the profile

Once a lock is suspected stale, the profile should not go back into the normal worker pool.

Move it to quarantine.

A quarantined profile is temporarily blocked from normal automation until the system checks:

whether an old browser process is still running
whether the old worker can still send heartbeats
whether the profile directory is still held by the OS
whether the last task stopped at a safe checkpoint
whether the account session requires human review
whether the task can be resumed or must restart

This step is not about being slow.

It is about avoiding two workers writing to the same browser state.
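One way to make quarantine concrete is to track the checks listed above as named items and keep the profile blocked until none are pending. The check names and `pendingQuarantineChecks` below are hypothetical. They simply mirror the list in this section.

```javascript
// One entry per question the quarantine step has to answer.
const QUARANTINE_CHECKS = [
  "browser_process_checked",
  "old_worker_heartbeat_checked",
  "profile_dir_checked",
  "last_checkpoint_checked",
  "human_review_checked",
  "resume_or_restart_decided"
]

// Returns the checks that have not been completed yet, in order.
function pendingQuarantineChecks(completed) {
  const done = new Set(completed)
  return QUARANTINE_CHECKS.filter((name) => !done.has(name))
}

const pending = pendingQuarantineChecks([
  "browser_process_checked",
  "profile_dir_checked"
])

console.log(pending)
```

A profile would leave quarantine only when this list is empty, which keeps "we forgot to check" from looking the same as "we checked and it was fine".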

Step 3: capture recovery evidence before cleanup

Before killing processes, deleting temporary files, or opening the profile again, capture evidence.

At minimum, store something like this:

{
  "profile_id": "profile_us_018",
  "old_worker_id": "worker-7",
  "old_task_id": "task_20260618_0932",
  "lease_token": 42,
  "last_heartbeat_at": "2026-06-18T09:36:40Z",
  "detected_stale_at": "2026-06-18T09:39:05Z",
  "last_known_step": "submit_budget_change",
  "last_known_url": "https://example.com/account/campaigns",
  "last_screenshot": "s3://evidence/task_20260618_0932/last.png",
  "proxy_id": "proxy-us-dallas-03",
  "profile_dir": "/profiles/profile_us_018",
  "recovery_reason": "heartbeat_timeout"
}

This evidence helps you answer the question later:

Did the profile fail because the worker died, the website changed, the proxy shifted, the browser crashed, or the task should have stopped for human review?

Without recovery evidence, stale lock recovery becomes guesswork.
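A minimal sketch of building that record from the lock and the last known task state might look like this. `buildRecoveryEvidence` and the shape of `lastTaskState` are assumptions. The output fields follow the example above.

```javascript
// Builds the evidence record without touching the browser or the profile.
function buildRecoveryEvidence(lock, lastTaskState, reason, now = new Date()) {
  return {
    profile_id: lock.profile_id,
    old_worker_id: lock.owner_worker_id,
    old_task_id: lock.task_id,
    lease_token: lock.lease_token,
    last_heartbeat_at: lock.heartbeat_at,
    detected_stale_at: now.toISOString(),
    // Missing values are stored as null instead of being guessed.
    last_known_step: lastTaskState.step ?? null,
    last_known_url: lastTaskState.url ?? null,
    last_screenshot: lastTaskState.screenshot ?? null,
    proxy_id: lastTaskState.proxy_id ?? null,
    profile_dir: lastTaskState.profile_dir ?? null,
    recovery_reason: reason
  }
}

const evidence = buildRecoveryEvidence(
  {
    profile_id: "profile_us_018",
    owner_worker_id: "worker-7",
    task_id: "task_20260618_0932",
    lease_token: 42,
    heartbeat_at: "2026-06-18T09:36:40Z"
  },
  { step: "submit_budget_change", url: "https://example.com/account/campaigns" },
  "heartbeat_timeout",
  new Date("2026-06-18T09:39:05Z")
)

console.log(evidence.last_known_step, evidence.last_screenshot)
```

Recording a missing screenshot as `null` is deliberate. An absent value is itself evidence, and it should push the decision toward manual review.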

Step 4: reject late writes

The dangerous case is not only “the worker crashed.”

The more dangerous case is this:

A worker looked dead, another worker took over, and then the first worker came back.

This can happen with network partitions, overloaded or slow hosts, or delayed cleanup.

That is why every write should check the current token.

async function writeTaskEvent(profileId, leaseToken, event) {
  const lock = await lockStore.get(profileId)

  if (!lock) {
    throw new Error("profile lock missing")
  }

  if (lock.lease_token !== leaseToken) {
    throw new Error("stale lease token rejected")
  }

  await taskLog.append({
    profile_id: profileId,
    lease_token: leaseToken,
    ...event
  })
}

If an old worker tries to mark the task as complete after recovery started, the system rejects it.

In browser automation, this protects account state. The old worker may have a stale view of the page, proxy, session, or task plan.
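One caveat: a check followed by a separate append leaves a small window in which the token can change between the two calls. The sketch below performs the check and the write in a single synchronous step to show the intended behavior. `createFencedLog` is a hypothetical helper, and in a real store the same effect would need a transaction or a conditional write.

```javascript
// An append-only log that rejects writes carrying an outdated lease token.
function createFencedLog(lockState) {
  const events = []
  return {
    events,
    append(leaseToken, event) {
      // Check and append happen together, with no await in between.
      if (leaseToken !== lockState.lease_token) {
        return { ok: false, reason: "stale_lease_token" }
      }
      events.push({ ...event, lease_token: leaseToken })
      return { ok: true }
    }
  }
}

const lockState = { profile_id: "profile_us_018", lease_token: 42 }
const log = createFencedLog(lockState)

// worker-7 writes while it still owns the lease.
const first = log.append(42, { worker: "worker-7", type: "step_started" })

// Recovery starts and advances the token.
lockState.lease_token = 43

// worker-7 comes back and tries to finish the task with its old token.
const late = log.append(42, { worker: "worker-7", type: "task_complete" })

console.log(first.ok, late.ok, log.events.length)
```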

Step 5: inspect before reuse

After quarantine, the recovery worker should run a health check.

This should be separate from the original business task.

Do not immediately continue clicking buttons on the target website. First inspect whether the browser environment is safe to reuse.

A simple inspection output can look like this:

Profile directory exists: yes
Profile directory held by active browser process: no
Last task ended at safe checkpoint: unknown
Last screenshot available: yes
Last URL belongs to expected domain: yes
Proxy mapping unchanged: yes
Session state present: yes
Manual review flag: no
Lease token advanced: yes

If the last task stopped during a sensitive operation, do not auto-resume.

Examples include:

after clicking submit
during checkout
while changing account settings
on a verification page
after a failed login attempt
during password, MFA, or recovery flows
while editing campaign, billing, wallet, or payout settings

For those states, the recovery result should be manual_review, not available.
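That rule can be sketched as a small classifier over the inspection result. Everything here is an assumption for illustration: the field names, the reason strings, and especially the idea that step names contain words like "submit" or "checkout". A real system would more likely tag sensitive steps explicitly.

```javascript
// Assumption: step names hint at what the step does.
const SENSITIVE_STEP = /submit|checkout|settings|verif|login|password|mfa|recovery|billing|wallet|payout/

function classifyInspection(inspection) {
  if (inspection.manual_review_flag) {
    return { requires_manual_review: true, reason: "flagged_for_review" }
  }
  if (SENSITIVE_STEP.test(inspection.last_known_step ?? "")) {
    return { requires_manual_review: true, reason: "sensitive_step" }
  }
  // "unknown" is not "safe": only an explicit true passes.
  if (inspection.last_checkpoint_safe !== true) {
    return { requires_manual_review: true, reason: "checkpoint_not_confirmed" }
  }
  return { requires_manual_review: false, reason: null }
}

console.log(classifyInspection({
  last_known_step: "submit_budget_change",
  last_checkpoint_safe: true
}))

console.log(classifyInspection({
  last_known_step: "open_campaign_list",
  last_checkpoint_safe: "unknown"
}))
```

The ordering is intentional. An explicit flag and a sensitive step both win over a checkpoint that claims to be safe.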

Step 6: choose release, resume, retry, or review

A stale lock recovery should end in one of four outcomes.

Release

Use this when the task never started or failed before touching account state.

suspected_stale -> quarantined -> inspected -> available

This is the cleanest path.

The profile can return to the worker pool after a short cooldown.

Resume

Use this only when your task has explicit checkpoints and idempotent steps.

suspected_stale -> quarantined -> inspected -> resume_pending

A resumable task should know the last safe step:

{
  "task_id": "task_20260618_0932",
  "resume_from": "open_campaign_list",
  "last_safe_checkpoint": "campaign_list_loaded"
}

If you cannot name the last safe checkpoint, do not resume.
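In code, "can name the last safe checkpoint" can be a lookup that fails closed. The table below is hypothetical. Only `campaign_list_loaded` and `open_campaign_list` come from the example above, and `budget_form_filled` is made up to show a non-idempotent entry.

```javascript
// Hypothetical table: which step to re-enter from each safe checkpoint.
const RESUME_TABLE = new Map([
  ["campaign_list_loaded", { resume_from: "open_campaign_list", idempotent: true }],
  ["budget_form_filled", { resume_from: "submit_budget_change", idempotent: false }]
])

function planResume(taskId, lastSafeCheckpoint) {
  const entry = lastSafeCheckpoint ? RESUME_TABLE.get(lastSafeCheckpoint) : undefined
  // Unknown checkpoint or non-idempotent step: do not resume.
  if (!entry || !entry.idempotent) {
    return { can_resume: false }
  }
  return {
    can_resume: true,
    task_id: taskId,
    resume_from: entry.resume_from,
    last_safe_checkpoint: lastSafeCheckpoint
  }
}

console.log(planResume("task_20260618_0932", "campaign_list_loaded"))
console.log(planResume("task_20260618_0932", undefined))
```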

Retry

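Use this when the profile is safe, but the task should restart from the first step. In the state model above, that means the profile goes back to available and the task is queued again as a new run instead of being resumed.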

Even then, be careful.

Restarting a browser task is not the same as retrying an API request. The website may already have seen part of the previous action.

Manual review

Use this when the account state may have changed or the last page requires judgment.

This should be the default for ambiguous states.

A manual review flag is not a failure of automation. It is a boundary that prevents automation from making a bad recovery decision.

A compact recovery function

The exact storage layer can be Redis, Postgres, DynamoDB, or something else.

The important part is the state machine.

async function recoverStaleProfileLock(profileId) {
  const lock = await lockStore.get(profileId)

  if (!lock) {
    return { action: "none", reason: "no_lock" }
  }

  if (new Date(lock.expires_at) > new Date()) {
    return { action: "none", reason: "lease_active" }
  }

  const marked = await markSuspectedStale(
    profileId,
    lock.lease_token
  )

  if (!marked.ok) {
    return { action: "none", reason: marked.reason }
  }

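  const recoveryToken = lock.lease_token + 1

  // Advance the fencing token with compare-and-set instead of a blind update,
  // so only one recovery worker can move the lock into quarantine.
  const quarantined = await lockStore.compareAndSet(profileId, lock.lease_token, {
    ...lock,
    state: "quarantined",
    lease_token: recoveryToken,
    recovery_started_at: new Date().toISOString()
  })

  if (!quarantined.ok) {
    return { action: "none", reason: "lock_changed" }
  }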

  const evidence = await collectRecoveryEvidence(profileId, lock)
  await recoveryLog.write(evidence)

  const inspection = await inspectProfileHealth(profileId)

  if (inspection.requires_manual_review) {
    await lockStore.update(profileId, {
      state: "manual_review",
      lease_token: recoveryToken,
      reason: inspection.reason
    })

    return {
      action: "manual_review",
      reason: inspection.reason
    }
  }

  if (inspection.can_resume) {
    await lockStore.update(profileId, {
      state: "resume_pending",
      lease_token: recoveryToken,
      resume_from: inspection.resume_from
    })

    return {
      action: "resume",
      from: inspection.resume_from
    }
  }

  await lockStore.update(profileId, {
    state: "available",
    lease_token: recoveryToken,
    released_at: new Date().toISOString()
  })

  return { action: "released" }
}

The function is intentionally boring.

That is a good sign.

Recovery logic should not be clever. It should be explicit, auditable, and hard to race.

Add a short cooldown before reuse

After a stale lock is released, avoid immediate reuse if the failure mode is unclear.

A short cooldown can prevent accidental overlap with delayed cleanup.

available_after = recovered_at + cooldown_seconds

This is especially useful when workers run on remote machines, containers, or headless environments where process cleanup can lag behind queue state.

The cooldown should not replace fencing tokens.

It is only an additional safety margin.
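A minimal sketch of that rule, assuming a 60-second cooldown. `COOLDOWN_SECONDS`, `releaseWithCooldown`, and `isClaimable` are hypothetical names, and the right cooldown depends on how long process cleanup can lag in your environment.

```javascript
// Assumption: 60 seconds is enough for delayed cleanup to finish.
const COOLDOWN_SECONDS = 60

// Marks the lock as available, but only claimable after the cooldown.
function releaseWithCooldown(lock, recoveredAt = new Date(), cooldownSeconds = COOLDOWN_SECONDS) {
  return {
    ...lock,
    state: "available",
    released_at: recoveredAt.toISOString(),
    available_after: new Date(recoveredAt.getTime() + cooldownSeconds * 1000).toISOString()
  }
}

// Workers should call this before trying to acquire a profile.
function isClaimable(lock, now = new Date()) {
  if (lock.state !== "available") {
    return false
  }
  return !lock.available_after || new Date(lock.available_after) <= now
}

const released = releaseWithCooldown(
  { profile_id: "profile_us_018", state: "inspected", lease_token: 43 },
  new Date("2026-06-18T09:45:00Z")
)

console.log(isClaimable(released, new Date("2026-06-18T09:45:30Z")))
console.log(isClaimable(released, new Date("2026-06-18T09:46:00Z")))
```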

Common failure patterns

Here are the stale-lock mistakes that show up repeatedly in browser automation systems.

Blind lock deletion

Deleting the lock makes the profile available, but it does not prove the profile is safe.

No fencing token

Without a token, an old worker can come back and write stale task results after recovery has already started.

No quarantine state

If the only states are held and available, the system has no place to inspect ambiguous failures.

No recovery evidence

Without the last URL, step name, screenshot, proxy mapping, and task state, the recovery decision becomes a guess.

Automatic retry after sensitive steps

A retry after a submit, checkout, settings update, or login challenge can duplicate actions or move the account into a worse state.

Changing proxy during recovery

A stale lock is already an unstable state. Changing the proxy at the same time makes debugging harder and can change account context.

What should never happen automatically

These actions should not be part of automatic stale lock recovery:

deleting the whole profile directory
clearing cookies to “fix” the session
changing the proxy during recovery
opening the same profile in two browsers to compare state
marking the original task as successful without evidence
retrying a form submission without a checkpoint
assigning the profile to a different account
accepting late writes without checking the lease token

These shortcuts may make the queue green.

They can also damage the account context your automation depends on.

A practical stale lock checklist

Before returning a browser profile to the worker pool, answer these questions:

Lock:
- Was the lease expired?
- Was the lock changed with compare-and-swap?
- Was the lease token advanced?

Worker:
- Is the old worker still alive?
- Did the old worker send a late heartbeat?
- Can old writes be rejected by token?

Browser:
- Is the browser process closed?
- Is the profile directory still locked?
- Was the browser closed cleanly?

Task:
- What was the last known step?
- What was the last known URL?
- Is there a screenshot or trace?
- Was the task in a sensitive operation?

Account state:
- Did cookies or local storage change?
- Is the proxy mapping unchanged?
- Is the session still valid?
- Does the page require human review?

Decision:
- release, resume, retry, or manual review?

The goal is not to make recovery complicated.

The goal is to make it explicit.

The right abstraction is profile recovery

A stale browser profile lock is not only a concurrency problem.

It is an account-state recovery problem.

That is why the recovery model should sit near profile ownership, task logs, proxy mapping, screenshots, and human review rules.

If those systems are separate, stale lock recovery will always depend on guessing.

For long-running browser automation, the safer pattern is:

one profile
one owner lease
one active task
one evidence trail
one recovery decision

A browser workflow layer should not only run tasks. It should record enough context to recover them.

That is the real lesson of stale locks:

The lock is only the symptom.

The system has to protect the account environment behind it.

Editorial note: This article was drafted with AI assistance and reviewed by the author before publication.

DEV Community