Teske Systemtechnik

Originally published at teske-systemtechnik.de

Multi-Process Browser Automation Framework

17k LOC Python framework for parallel, fault-tolerant browser workflows: race-safe worker coordination, a cross-process crash bridge, per-phase timeouts, and full operator UX through Streamlit.

The challenge

A private client needed a permanently operational backend for browser-based automation workflows. The requirements were engineering-first from day one, not feature-first:

  • Multiple parallel browser sessions, cleanly isolated from each other.
  • Subprocess architecture: a crash in one session must not take others down with it, and a hung workflow must not block the entire process tree.
  • Full observability: every phase transition logs its status; every crash carries a unique phase marker.
  • Operator UX through a dashboard rather than the CLI: the end user is the client, not the developer.
  • 100 % type hinting, clear layer separation, full pytest setup with GitHub Actions CI from day one.

That's not trivial on Windows: parallel Chrome instances are a minefield of race conditions (port collisions on dynamic CDP allocation, profile locks in the user-data-dir, zombie processes on Streamlit restart). And without a protection mechanism, a subprocess that dies before its own crash handler can even run means a silently lost error report: exactly the class of bug that stays undetected in production for six weeks.

The approach

The result is a 17,461 LOC Python codebase across 25 cleanly modularized files, fully type-annotated, organized into three clearly decoupled layers:

Presentation → Streamlit UI
Control      → Scheduler + CLI orchestrator
Execution    → Browser workers (as subprocesses)

Cross-layer communication runs exclusively through SQLite and atomically written JSON files; no worker ever imports another.
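The "atomically written JSON files" part of this contract can be illustrated with a minimal sketch. The function name `write_json_atomic` and the example payload are hypothetical and not taken from the codebase; the sketch only assumes the general pattern of writing to a temporary file and then renaming it over the target.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def write_json_atomic(target: Path, payload: dict[str, Any]) -> None:
    """Write payload so that readers see either the old file or the new one."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory as the target, so the final
    # rename stays on one filesystem (a cross-device rename is not atomic).
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites an existing destination on POSIX and Windows.
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader in another process that opens the target at any moment gets a complete JSON document, because the only operation that ever touches the target path is the final rename.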

  • Race-safe multi-worker coordination. With N parallel asyncio tasks, workers share an asyncio.Lock-based round-robin: only one worker at a time runs the expensive discovery step; the others wait at the lock and pick up the result from a shared dict. This halves outgoing traffic without losing speed and keeps all workers from duplicating the same operation in parallel.
  • Best-result aggregation with coordinated cancellation. As soon as one worker hits the target result, an asyncio.Event fires and a watchdog task calls task.cancel() on all sibling tasks. Clean CancelledError propagation instead of polling. Additionally, a class-global _completed_jobs set suppresses late reports from the cancelled tasks: no notification spam, even when 10 siblings simultaneously walk their cleanup paths.
  • Subprocess isolation at Windows level. Each worker gets a fully isolated Chrome instance: race-safe port allocation via socket bind (_PortLock holds the port reserved until Chrome takes it over, no TOCTOU race), its own user-data-dir (chrome_<uuid>), its own crash dump path, its own CDP session. No shared resources, no lock conflicts between parallel sessions, no leaking browser state.
  • CDP-based auth configuration. Instead of a classic Manifest V2 browser extension, auth configuration runs directly through Chrome DevTools Protocol via Fetch.authRequired. Auth events propagate automatically onto popup pages via a context.on("page", …) handler. A lean class replaces the traditional extension workaround with significantly less surface area.
  • Cross-process crash file bridge. Workers run as subprocesses, spawned by the scheduler via subprocess.Popen. On a crash, the subprocess writes a structured JSON file to data/crashes/job_<id>_<ts>.json AND attempts a direct Telegram notification in parallel. The scheduler reads the file back after subprocess exit, deduplicates against the already-sent notification, fills in missing reports, or quietly cleans up the file when everything was already reported. A global sys.excepthook as last line of defence guarantees: no crash gets lost, even if the subprocess dies so early that its own crash handler never runs.
  • Per-phase timeouts with live phase tracking. Every workflow phase runs inside a dedicated asyncio.timeout() block; every phase updates a central context object with its current sub-step. On a crash, the Telegram report says exactly which phase of which worker failed, not "somewhere in main()" but "4/6 add_step: concrete UI element X". Debugging time drops from "first scan the logs" to "jump straight to the function".
  • Typed error hierarchy + swarm deduplication. A dedicated exception class per failure mode (ProxyError, NavigateError, SessionExpiredError, …), each with its own recovery policy (close the browser vs. leave it open, retry with a different proxy, hard fail). When 10 workers crash in parallel with the same root cause, report_grouped_errors groups the messages by (exception type, first stack frame) and sends a single aggregated Telegram message with worker IDs and affected phases instead of 10 redundant pings.
  • SQLite with WAL + BEGIN IMMEDIATE for race safety. Counters and state in the tables are updated in read-modify-write transactions. With N parallel workers incrementing a counter simultaneously, naive read-modify-write logic would increment by 1 instead of N. BEGIN IMMEDIATE takes the write lock at the start of the transaction, so the updates are serialized at the SQLite level before the race ever reaches Python. Plus WAL mode + 64 MB cache + 256 MB mmap for read performance under parallel write pressure. Auto-migrations on first connect, with pytest tests verifying every migration individually.
  • Atomic file IPC. All inter-process state files (job status snapshots, live state, run reports) are written atomically: write to .tmp, then rename over the target. No worker ever reads a half-written JSON file, even under concurrent access from multiple subprocesses. This relies on POSIX rename semantics and works on Windows too via Path.replace(), which overwrites an existing destination where os.rename() would raise.
  • Date-versioned logs. Under logs/<DD-MM-YYYY>/{chrome,traces,screenshots,…}/, every phase of every worker produces clearly attributed artefacts (Chrome stdout, Playwright trace, failure screenshot). On a production issue, ls logs/23-03-2026/screenshots/ finds the exact failure phase of every affected worker in five seconds, plus the full Playwright trace ready to replay in the browser trace viewer.
  • Streamlit operations UI with Windows hard cleanup. Multi-page dashboard with service lifecycle (start/stop/restart of all subsystems), DB CRUD, live logs, EN/IT localization across 300+ string pairs. Streamlit is finicky on Windows: process-tree cleanup is not guaranteed on shutdown, and children become zombies. Solved via a Windows Job Object with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE: every subprocess gets assigned to the job handle on spawn, and on Streamlit exit the OS automatically terminates the entire child tree. Works even on hard-kill via Task Manager, with no orphaned browser processes left behind.
  • Centralized Telegram reporter. A single ErrorReporter class serves as the only entry point for all notifications. Fire-and-forget by contract: never throws an exception, never blocks longer than the HTTP roundtrip, fails silently on connection errors (Windows 10054 ConnectionReset, …) and retries with a fresh session. Direct connection without proxy (session.trust_env = False), so system proxy vars don't silently kill reports, plus global suppression logic against notification storms.
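The coordinated-cancellation pattern from the list above (an asyncio.Event plus a watchdog that cancels sibling tasks) could look roughly like the following sketch. All names here (`worker`, `watchdog`, `run_swarm`) are hypothetical, `asyncio.sleep` stands in for the real browser workflow, and the `_completed_jobs` suppression set mentioned in the text is not reproduced.

```python
import asyncio


async def worker(
    worker_id: int,
    delay: float,
    found: asyncio.Event,
    results: dict[int, str],
    cancelled: set[int],
) -> None:
    try:
        await asyncio.sleep(delay)  # placeholder for the actual workflow
    except asyncio.CancelledError:
        cancelled.add(worker_id)  # cleanup path of a cancelled sibling
        raise
    results[worker_id] = "target reached"
    found.set()  # wakes the watchdog


async def watchdog(found: asyncio.Event, siblings: list[asyncio.Task]) -> None:
    # Blocks on the event instead of polling, then cancels unfinished tasks.
    await found.wait()
    for task in siblings:
        if not task.done():
            task.cancel()


async def run_swarm(delays: list[float]) -> tuple[dict[int, str], set[int]]:
    found = asyncio.Event()
    results: dict[int, str] = {}
    cancelled: set[int] = set()
    tasks = [
        asyncio.create_task(worker(i, d, found, results, cancelled))
        for i, d in enumerate(delays)
    ]
    dog = asyncio.create_task(watchdog(found, tasks))
    # return_exceptions=True collects each CancelledError instead of raising.
    await asyncio.gather(*tasks, return_exceptions=True)
    dog.cancel()  # no-op if the watchdog already finished
    await asyncio.gather(dog, return_exceptions=True)
    return results, cancelled
```

Calling `asyncio.run(run_swarm([0.5, 0.01, 0.5]))` lets the second worker win; the other two are cancelled while still sleeping and never report a result.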
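For the SQLite point in the list above, a minimal sketch of a race-safe counter increment with WAL and BEGIN IMMEDIATE follows. The table name `counters`, its schema, and the helper names are assumptions made for illustration; the cache and mmap pragmas from the text are omitted.

```python
import sqlite3
from pathlib import Path


def connect(db_path: Path) -> sqlite3.Connection:
    # isolation_level=None disables implicit transactions, so BEGIN IMMEDIATE
    # and COMMIT are issued explicitly. timeout makes a blocked writer wait.
    conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counters ("
        "name TEXT PRIMARY KEY, value INTEGER NOT NULL)"
    )
    return conn


def increment(conn: sqlite3.Connection, name: str) -> int:
    # BEGIN IMMEDIATE acquires the write lock before the read, so no other
    # writer can slip in between the SELECT and the write below.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT value FROM counters WHERE name = ?", (name,)
        ).fetchone()
        new_value = (row[0] if row else 0) + 1
        conn.execute(
            "INSERT OR REPLACE INTO counters(name, value) VALUES (?, ?)",
            (name, new_value),
        )
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    return new_value
```

Each worker process would open its own connection via `connect()` and call `increment()`. With a plain deferred `BEGIN`, two writers could both read the same old value before either writes; taking the write lock up front rules that interleaving out.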

Engineering highlights & fail-safe architecture

Reliability was non-negotiable: the system runs unattended 24/7 and the end user is not a developer.

  • 9 pytest test suites with GitHub Actions CI. Database migrations (idempotent, runnable multiple times), error reporter (260+ tests including suppression logic and crash-file roundtrip), coordination patterns, proxy layer, shared helpers, config resolution, all validated automatically on every push against Ubuntu Python 3.13. Migration bugs get caught before deploy, not at runtime.
  • Strict layer separation without circular imports. Presentation layer imports only the control layer; control layer imports only the execution layer + utils. Every subprocess can be brought up standalone, without Streamlit even being installed, which matters for CI runs and debugging sessions without UI overhead.
  • Singleton path resolution. A PATHS singleton class with auto-root detection (walks up looking for marker files like .git, .env, requirements.txt) and automatic directory creation on property access. Code called from any working directory consistently finds the same absolute paths; there is not a single os.path.join(os.path.dirname(__file__), …) in the entire codebase.
  • Dataclass-first domain model. All workflow inputs and outputs are dataclasses with type hints, validation logic, and clean from_dict/to_dict roundtrips. Clean interfaces between layers, IDE autocomplete works, and refactorings surface as type-checker errors instead of runtime AttributeErrors.
  • Test-first for IPC-critical components. Crash file bridge, suppression gate, and aggregator logic are the riskiest spots: they run exactly when everything else is broken. The test suite is correspondingly dense: a custom _clear_* fixture pattern resets class-global state between tests, and every edge case (subprocess crashed BEFORE crash-file write, crash file without Telegram flag, Telegram after crash file, both in parallel) has an explicit test case.
  • Dependency-injection layer for tests. Every external dependency (DB, Telegram, proxy provider, filesystem) sits behind a thin interface that can be swapped for an in-memory equivalent in test mode. The SQLite tests, however, run against a real SQLite DB in pytest's tmp_path, not a mock: mocks would systematically miss migration bugs.
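The auto-root detection and create-on-access behaviour described for the path singleton above might be sketched as follows. The marker names and the data/crashes and logs directories come from the text; `find_project_root`, `ProjectPaths`, and the property names are hypothetical stand-ins, not the actual PATHS class.

```python
from pathlib import Path

# Marker files named in the article; the first directory containing one wins.
ROOT_MARKERS = (".git", ".env", "requirements.txt")


def find_project_root(start: Path, markers: tuple[str, ...] = ROOT_MARKERS) -> Path:
    """Walk upwards from start until a directory contains a marker file."""
    current = start.resolve()
    for candidate in (current, *current.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"no project root marker found above {current}")


class ProjectPaths:
    """Resolves absolute paths from the root and creates them on access."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _ensure(self, *parts: str) -> Path:
        path = self.root.joinpath(*parts)
        path.mkdir(parents=True, exist_ok=True)
        return path

    @property
    def crashes(self) -> Path:
        return self._ensure("data", "crashes")

    @property
    def logs(self) -> Path:
        return self._ensure("logs")
```

A module-level instance such as `PATHS = ProjectPaths(find_project_root(Path(__file__)))` would then give every caller the same absolute paths regardless of the current working directory.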

Figure: codebase treemap of the Multi-Process Browser Automation Framework

Volume breakdown of the entire codebase across the Presentation, Control, Execution Core, and Utility layers. The Execution Core (8,228 LOC) dominates visually as the largest block, while the utility layer is split across 11 small helper modules.

The result

  • 17,461 LOC of production Python across 25 cleanly modularized files: clean layer separation, no circular imports, every stage runnable standalone.
  • 260+ pytest tests with GitHub Actions CI on every push: migration bugs, IPC race conditions, and suppression-logic regressions all get caught before deploy.
  • Cross-process crash bridge: no crash gets lost, even when a subprocess dies before its own crash handler runs.
  • Race-safe coordination across N parallel browser workers on Windows: no port collisions, no profile locks, no zombie processes on shutdown.
  • Full operator UX through Streamlit: the end client toggles services with one click and sees live status and live logs without ever touching the CLI.
  • Modularly extensible: a new workflow type is a new module built against the existing coordination and reporting infrastructure, without touching the core.