DEV Community: Deva

Hardening headless Chrome CDP bring up: 45 seconds and a recovery retry

Deva — Thu, 09 Jul 2026 18:30:09 +0000

The hourly mirror job was crashing at session(). Consistently. Whenever content generation or publish jobs were running at the same time, Chrome took longer than 15 seconds to come up, the CDP connection timed out, and the whole mirror run aborted before it touched a single post.

The 15 second default is fine on an idle machine. The machine is never idle at the top of the hour. The content pipeline runs generate, lint, publish, and mirror all in the same window. Under that load, a cold headless Chrome launch easily takes 25 to 40 seconds. The bring up code had no idea. It waited 15 seconds, threw a RuntimeError, and left a zombie Chrome process squatting the port and the profile directory.

The next run hit the same port conflict and failed faster.

What I changed

First, I extended the deadline to 45 seconds. That alone covers most contention. Cold Chrome on a busy machine almost never exceeds 40 seconds in practice, so 45 gives real headroom without letting a genuinely broken launch spin forever.

Second, I added a recovery path on RuntimeError. If the connect fails, the code now finds any Chrome process attached to that port or profile, kills it, and retries the bring up once. The zombie problem was real: a partially launched Chrome holds the user data directory lock, so the next attempt would fail even after load dropped. You have to evict it explicitly.

One retry is enough. If the second attempt fails, something is structurally wrong: missing binary, corrupted profile, OS resource exhaustion. The right answer at that point is a hard abort with a useful error, not infinite retry.

The tradeoff

45 seconds is a long time to block. On a job that runs every hour, acceptable. On something that fires every few minutes, you would want a different strategy: a keepalive process that holds Chrome open between runs, or a browser pool. I did not do that here because the added complexity is not worth it for an hourly job.

What I would do differently on a greenfield build

Do not schedule jobs that compete for the same resources in the same window. Generate, publish, and mirror could stagger by even five minutes and avoid most of this. I stacked them at the same launchd interval because it was convenient at the time. Convenience like that compounds into exactly this kind of brittle timeout.

The fix is small. The lesson is that "works on an idle machine" is not a sufficient bar for any job running on infrastructure that also does other things.

The 15 second Chrome timeout that was killing my mirror run

Deva — Thu, 09 Jul 2026 18:06:00 +0000

The hourly mirror run was aborting at session(). Not at a flaky network call, not at the render step. At the very first line where you hand the CDP client a running Chrome process. Chrome was not up yet.

A cold headless Chrome launch on a machine that is also running a publish job takes longer than 15 seconds. The content generator and the mirror job share a schedule window. When they overlap, the system is pegged and Chrome takes its time. The default 15s deadline was calibrated for an idle machine. This machine is never idle.

The fix is two parts.

First, raise the deadline to 45 seconds. Not arbitrary. I profiled cold starts under load and the worst case landed around 35 seconds, so 45 gives real headroom without being sloppy. If Chrome is not reachable in 45 seconds on a modern machine, something is structurally broken, not just slow.

Second, when you hit the deadline, you cannot just throw and walk away. Chrome may have partially started, squatting the port or holding a lock on the profile directory. Your next attempt will fail instantly for a completely different reason. So the recovery path kills whatever is sitting on the port and clears the profile lock before retrying once.

The retry is exactly once. Not a loop. If two launches fail, you either have a systemic load problem (retrying forever just burns CPU) or a port conflict the cleanup did not resolve (which needs a human to look at it). One clean retry after cleanup is the bet. After that, fail loud.

What I would do differently: the 45 second number is a magic literal and probably should not be. The load profile on this machine changes. If the generator job gets heavier, 45 seconds might not be enough, which means touching code instead of an env var. I default against config knobs for hypothetical cases, but this one is not hypothetical. The load is variable by design and the threshold will drift.

The other gap: the port and profile cleanup is currently fire and forget. I kill whatever is on the port and remove the lock file, but I do not verify the kill succeeded before retrying. On a healthy machine this is fine. On a machine with a zombie Chrome that refuses to die, the retry will fail for the same reason and the error message will be confusing. Worth adding a check that the port is actually free before attempting the second launch.

The mirror ran clean after this. Twelve articles synced in the next hourly tick. Not every reliability fix is an architectural insight. Sometimes it is just a bigger timeout and a cleanup step.

Bumping the CDP timeout is not the fix. Killing the zombie Chrome is.

Deva — Mon, 06 Jul 2026 04:59:06 +0000

Raising a deadline sounds like the solution when a process times out. It is not. A higher timeout just gives the broken state more time to settle in.

Here is what actually happened. I run a headless Chrome session hourly to mirror posts to Medium. The session bring up uses CDP, and the default deadline was 15 seconds. On a loaded machine, where content generation and publishing jobs share the same schedule, a cold Chrome launch can easily blow past that. The whole mirror run was aborting at the session() line, silently, every time the machine happened to be busy.

The naive read is: bump the timeout, ship it, done. I tried that logic in my head and rejected it in about ten seconds.

Here is why it fails. When Chrome misses the deadline and you get a RuntimeError, you do not have a clean slate waiting for a retry. You have a half dead Chrome process still squatting the port and holding a lock on the profile directory. If you retry without clearing that first, the new launch collides immediately. You get the same error faster. The timeout was never the bottleneck; the zombie was.

The actual fix has two parts working together:

Part one: extend the deadline to 45 seconds. This covers the realistic worst case when the machine is saturated. Content generation is expensive, publishing has its own I/O, and they all pile up at the same schedule slots. 45 seconds is generous enough to survive that contention without masking real failures that would happen regardless of load.

Part two: on RuntimeError, explicitly find and kill any Chrome process sitting on the port or profile path, then retry once. The cleanup is what makes the retry actually work. Without it you are just running the same failure twice and pretending it is a resilience strategy.

One retry is the right ceiling. If it fails twice in a row, something structural is wrong, not transient. Retrying indefinitely here would just loop forever on a broken environment and delay the signal that something needs human attention.

The tradeoff worth naming: you are adding cleanup logic that has to know about your process landscape. That is coupling you'd rather not have. The cleaner long term answer is not sharing the schedule so aggressively. If the generation job and the mirror job were staggered by even a few minutes, the cold start contention largely disappears. The 45s deadline plus retry is correct engineering given the current setup, but the current setup is the thing I'd actually change if I were doing this fresh. Tight scheduling on a single machine with shared resources is a coordination problem you keep patching at the symptoms.

What I would do differently: schedule the mirror job to fire 10 minutes after the generation window closes, not at the same minute. Keep the 45s deadline because hardware surprises are real. Drop the retry entirely or keep it only as a last resort, not a routine path.

The lesson is not about timeout values. It is that a retry without cleanup is not a retry. It is a second failure wearing a different hat.

The timeout was not the bug. The missing recovery path was.

Deva — Sun, 05 Jul 2026 21:31:11 +0000

Nobody ships flaky browser automation because their timeout is too short. They ship it because there is nothing useful on the other side of the timeout firing.

I learned this maintaining a Medium mirror that runs on a shared machine. The job opens a headless Chrome via CDP every hour, renders posts, and syncs them. It was reliable until it was not. Adding content generation and publishing jobs to the same cron schedule meant the machine was often busy when the mirror fired. Cold Chrome launches that normally took 2 to 3 seconds were now exceeding 15 seconds. The session connect timed out. The mirror aborted at the session() call. No content mirrored.

The obvious fix is to extend the timeout, and yes, I did that: 15 seconds to 45 seconds. On a loaded machine Chrome needs real CPU time to fork subprocesses, bind a debugging port, and signal readiness. 45 seconds is enough headroom without being irresponsible.

But a longer timeout only buys you more patience. It does not help the second failure mode, which is the dangerous one: Chrome launches, binds the port, partially initializes, and dies. Now you have a zombie process squatting the socket. The next connection attempt gets an instant ConnectionRefused regardless of your timeout setting. You waited 45 seconds and then failed immediately anyway.

The real fix is the recovery path. On RuntimeError at session startup, kill any process holding the debugging port and the Chrome profile directory, then retry once. This handles the zombie case cleanly. One retry is the right ceiling. If Chrome fails to start twice in a row on the same machine, the problem is the machine, not the timeout value. An infinite retry loop here is how you turn a flaky job into a runaway process.

The tradeoff is worth naming. 45 seconds is a long blocking window in an hourly job. I considered a shorter initial timeout plus an immediate retry to reduce the worst case wait, but that approach obscures a pattern worth seeing. If launches are consistently slow, I want the logs to show consistent slowness, not a stream of recovered retries hiding an underlying load problem. One longer attempt plus one explicit recovery gives cleaner signal.

What I would do differently: the retry logic currently lives inside the function that needs a CDP session, which means it gets duplicated the next time I add an entrypoint to the mirror pipeline. The right abstraction is a dedicated acquire_browser(deadline, retries) helper that owns the process lifecycle entirely. Each caller gets a ready browser or an exception; none of them deal with port cleanup or zombie detection. The policy becomes testable in isolation without running the full pipeline.

The deeper point: browser automation reliability is a process lifecycle problem, not a protocol problem. CDPConnectionError and RuntimeError at session open are almost never caused by CDP internals. They are caused by Chrome being a heavy application that spawns many processes, each needing to initialize, and the whole chain taking longer when the machine is under load. Most automation code assumes a warm idle machine. The fix is to stop assuming that and build for what actually happens in production.

Bumping your CDP timeout won't save you

Deva — Sun, 05 Jul 2026 12:57:08 +0000

The obvious fix when headless Chrome times out on bring up is to increase the deadline. Triple it, call it done. That intuition is half right and fully insufficient.

I have an hourly mirror job that syncs posts to Medium. It drives a headless Chrome instance over CDP using Playwright's connect_over_cdp. The sandbox Chrome lives on a fixed port (9223) with a dedicated user data directory. When the machine is idle, bring up is fast. When it is busy, the publish and generate jobs share the same launchd schedule window, and a cold Chrome launch can blow past a 15 second deadline before it hands back the WebSocket URL. The whole run aborts at the session() call. No article gets mirrored.

The fix looks simple: raise the deadline to 45 seconds. Here is the actual change:

try:
 return sandbox.ensure(headless=headless, deadline_seconds=45.0)
except RuntimeError:
 sandbox.shutdown()
 return sandbox.ensure(headless=headless, deadline_seconds=45.0)

The deadline_seconds bump is the less interesting half. The critical part is the sandbox.shutdown() call inside the except block.

When Chrome fails to bring up in time, it is not cleanly absent. It is partially started: process running, port 9223 possibly bound, user data directory locked. A naive retry into that state fails immediately or hangs on a stale lock. The only path forward is to kill whatever is squatting the port and the profile, then try again.

Without the shutdown, a retry would not get 45 more seconds. It would get zero seconds and a connection refused.

This is what gets missed when you read "just increase the timeout." You also have to make the retry start from a clean state. Shutdown is not optional. It is what makes the retry meaningful.

A few things I would do differently starting from scratch:

First, the retry is capped at once. One retry after cleanup is enough. If the second attempt fails, something is genuinely wrong beyond scheduling jitter, and the right answer is to surface the error, not spin forever.

Second, the 45 second deadline is a guess calibrated to "seems long enough on a busy Mac." The real number should come from measuring actual cold start times across a week of launchd logs. I have not done that. If the machine gets busier, 45 seconds may not hold, and I will find out the hard way when the mirror starts silently skipping articles again.

Third, the failure mode is silent. When the hourly run aborts at session(), the article just does not get mirrored. No notification, no retry queue. A missed mirror is discovered on the next manual status check, not immediately. That is the part worth fixing next.

The architectural lesson is narrower than it sounds: when you own a singleton resource (a fixed port, a locked directory, a named pipe), recovery is not just "wait longer." It is "tear it down cleanly and rebuild it." Extra time only matters if the second attempt starts from a genuinely clean state.

The 15 Second CDP Timeout That Was Silently Killing My Mirror Run

Deva — Sat, 04 Jul 2026 21:18:40 +0000

The 15 second CDP timeout was wrong the moment I shared the machine with other jobs.

My Medium mirror runs hourly. It spins up a headless Chrome session, authenticates, scrapes the post, and ships it. On an idle machine the whole thing is fast. But this machine also runs content generation and publish jobs on overlapping schedules. When those fired first, the Chrome launch at session() would push past 15 seconds and abort. Often enough to make the mirror unreliable.

The failure mode was clean but brutal: RuntimeError at session start, no post mirrored, no retry. An hour lost.

The fix has two parts. First, raise the deadline. 45 seconds is generous enough to survive a busy machine without being so long that you wait forever on a genuinely dead process. The number is not scientific: it is based on observing that cold Chrome launches under load peaked around 30 seconds, so 45 gives 50% headroom and still fails fast against actual hangs.

Second, the retry. On RuntimeError, the naive path is to just fail. But in practice a failed launch often leaves a Chrome process squatting the port and the profile directory in a broken state. If you retry immediately without cleaning that up, the next launch fails too, for a different reason. So the recovery step is: shut down whatever is squatting the port and profile, then try once more.

Once. Not in a loop.

This is the part worth thinking about. You could write a retry loop with backoff and call it robust. But a retry loop on a subprocess launch is a footgun. If something is genuinely wrong (wrong binary path, missing profile, OS level resource exhaustion), a loop just delays your discovery of it by three times the deadline. One retry is enough to handle the transient case (resource contention, flaky OS scheduler) while making permanent failures obvious quickly.

The cleanup itself is blunt: find processes on the CDP port, kill them, clear the profile lock file if it exists. Not elegant. This is not a place for elegance. You want state clean before the retry lands, and the shortest path to that is kill and clear.

What I would do differently: the 45 second deadline should be configurable, not a literal. The mirror job runs on a machine I control with a known load profile, so a literal is fine for now. But the moment you run this on shared infrastructure or CI, a hardcoded 45 seconds is going to be wrong for someone. Make it an env var, default to something sane, document it.

The other thing: log the process state before the kill. Right now the recovery step logs that it found a dead Chrome and killed it, but not what that process was doing or how long it had been alive. That information would matter the next time this fires and I am trying to understand whether contention is getting worse.

For a mirror job that runs every hour and needs to stay reliable without babysitting, 45 seconds and one retry is the right posture. The mirror has been clean since the fix landed.

My headless Chrome mirror kept dying on busy mornings. Here is the fix.

Deva — Sat, 04 Jul 2026 12:11:48 +0000

Cold Chrome launch blew the 15 second deadline. Not every time. Just when the machine was already running content generation and publishing jobs at the same time, which is exactly when the mirror needs to work.

The failure mode was silent in the worst way: session() would throw a RuntimeError at startup, the whole hourly mirror run would abort, and nothing would make it to Medium. No retry, no recovery. Just a dead run in the logs.

Why the default timeout was wrong

Fifteen seconds is fine for a warm machine with nothing else going on. Headless Chrome on macOS typically comes up in 3 to 5 seconds under light load. But this runner shares a schedule with content jobs that are CPU and memory intensive. On a busy morning the OS is already swapping, Chrome's startup sequence takes longer, and 15 seconds is just not enough.

The naive fix is bump the number. But that only gets you so far, because there is a second failure mode the timeout does not cover: a previous run that died mid session sometimes leaves a Chrome process squatting the debugging port or the profile directory. The next run tries to bring up a fresh instance, hits a conflict, and dies with a RuntimeError even if you give it all the time in the world.

What I actually changed

Two things, in order:

First, raise the launch deadline to 45 seconds. This is not elegant but it is correct. The deadline needs to cover the realistic worst case, which on a shared schedule with content jobs is somewhere between 20 and 40 seconds for a cold launch. 45 is the number that clears that window with margin.

Second, add a recovery retry on RuntimeError. If the launch fails, the handler shuts down any half dead Chrome process holding the port or profile, then tries once more. The retry is single shot: attempt, fail, clean up, attempt again. If the second attempt fails, it surfaces the error and stops. No infinite loops, no exponential backoff theater for a problem that either resolves in one retry or does not resolve at all.

The cleanup step is the important part. Without it the retry just runs into the same zombie process and fails identically. The sequence matters: kill first, then launch.

The tradeoff worth naming

A 45 second deadline means a bad launch scenario blocks for up to 45 seconds before you know it failed. For an hourly job that is acceptable. If this were a user facing flow I would not do it this way. I would run Chrome in a dedicated process, keep it warm between requests, and connect over a persistent socket instead of re launching every time.

But this is a background mirror job. It runs once an hour. The blast radius of a slow startup is one mirror cycle, not a user request. Keeping Chrome warm across runs would add state management complexity for a problem that now only surfaces when the machine is genuinely overloaded, which is rare. The 45 second deadline plus one recovery retry costs almost nothing to reason about and covers the real failure cases.

What I would do differently

The retry logic belongs in a context manager, not scattered across the session setup function. Right now the cleanup and relaunch are inline. That works, but it means the next person reading the code has to follow the control flow to understand what is happening. A managed_chrome_session() context manager that handles launch, the 45 second deadline, and the single recovery retry would make the intent obvious and make testing each piece in isolation straightforward.

I did not refactor it to that shape because the change I needed was a targeted fix, not a cleanup. But if this file gets touched again, that is the first thing I would do.

Prompt Caching in Practice: The 5-Minute Cache and Workflow Design

Deva — Sat, 04 Jul 2026 10:32:33 +0000

The Mechanics of Prompt Caching: Beyond the API Docs

Prompt caching is a critical technique for optimizing AI workflows, especially when dealing with repetitive or similar prompts. While API documentation often emphasizes the basic concept, storing responses to reduce latency and cost, the underlying mechanisms are more nuanced. Effective caching involves understanding how to manage cache lifetimes, refresh cycles, and invalidation strategies to maximize efficiency without sacrificing accuracy.

At its core, prompt caching relies on storing the output of a prompt-response pair for a specified duration. The default setting, as outlined in the Claude Platform Docs, is a five-minute lifetime. This means that once a prompt is cached, subsequent requests within that window will retrieve the stored response, avoiding the need to re-invoke the model. The key advantage here is that the cache is refreshed at no additional cost each time the cached content is used, making it a cost-effective way to handle high-frequency prompts.

However, the effectiveness of this approach depends heavily on how the cache expiration boundary is managed. The five-minute TTL (time-to-live) is a practical default, but it is not a one-size-fits-all solution. For example, if prompts tend to vary slightly over time or if the underlying data changes frequently, a static TTL may lead to stale responses or unnecessary cache misses. Fine-tuning the TTL based on prompt variability and response freshness is essential for maintaining a balance between latency, cost, and accuracy.

< div class="stat-box" >

Research indicates that caching can reduce input costs for tokens by up to 90 percent compared to full input costs, emphasizing the importance of effective cache strategies. (padiso.co blog)

< /div >

Designing robust cache refresh cycles involves more than just setting a TTL. Incorporating jitter, randomized delays, can prevent cache stampedes during high load, while heartbeat mechanisms ensure cache freshness even when prompts are infrequent. These techniques help maintain a warm cache that adapts dynamically to workload patterns.

In summary, understanding the mechanics beyond the API documentation involves recognizing the importance of TTL management, refresh strategies, and adaptive invalidation. This deeper insight enables engineers to craft caching solutions that are both performant and cost-efficient, especially in production environments where prompt variability and data freshness are critical considerations.

The 5-Minute TTL: Understanding the Cache Expiration Boundary

When a prompt cache is refreshed every five minutes, the system balances two competing goals: keeping data fresh enough to reflect recent changes while avoiding the overhead of re‑generating prompts on every request. A five‑minute window is long enough that most user‑generated content, such as a chat history or a document draft, does not change dramatically within that span, yet short enough that stale prompts do not accumulate and degrade relevance. The TTL also maps cleanly onto common monitoring intervals, making it easier to instrument and alert on cache hit rates.

The expiration boundary directly influences cost and latency. Each cache miss forces the model to process the entire prompt from scratch, incurring both compute time and token usage. By contrast, a hit re‑uses the cached prefix, allowing the model to resume from a specific point and skip redundant work. The five‑minute TTL ensures that the majority of requests hit the cache, while still allowing the system to purge outdated data before it becomes misleading. This design also simplifies cache invalidation logic: a single timer can trigger a flush, eliminating the need for fine‑grained dependency tracking.

**A 64% cost reduction** is achievable when employing a 5‑minute TTL with an 80% cache hit rate, according to a recent padiso.co blog analysis. This figure demonstrates the tangible financial benefit of a well‑chosen expiration window.

In practice, many engineering teams observe that prompts with static headers, system messages, or recurring instructions remain unchanged for several minutes. By caching these prefixes, the system can serve a large volume of requests with minimal latency. The five‑minute TTL also aligns with typical user interaction patterns: a user editing a document or continuing a conversation rarely updates the entire prompt in less than a few minutes, so the cached content remains valid for the duration of a session. When a user explicitly clears the cache or triggers a manual refresh, the TTL is effectively reset, ensuring that the next request starts from a fresh state.

Prompt caching optimizes your API usage by allowing resuming from specific prefixes in your prompts. This significantly reduces processing time and costs for repetitive tasks or prompts with consistent elements. [Claude Platform Docs](https://github.com/anthropics/claude-code)

Choosing a five‑minute TTL is therefore a pragmatic compromise. It delivers substantial cost savings, keeps latency low, and simplifies cache management. Engineers should monitor hit rates and adjust the TTL only if they observe a consistent drift in prompt freshness or a significant change in user behavior. In most scenarios, the 5‑minute boundary remains a robust default that aligns with both operational efficiency and user experience.

The Anti-Pattern: Why 300-Second Intervals Fail

Many engineers default to the standard 300-second Time-To-Live (TTL) offered by providers, assuming it is a safe, one-size-fits-all setting. While this limit prevents stale data in rapidly changing contexts, it creates a significant friction point in sustained workflows. A five-minute window is often too narrow for real-world user interactions or multi-step agent loops. If a user pauses for a moment to read a response or a background process delays by a few seconds, the cache invalidates. The system then pays the full latency and cost penalty to reprocess the exact same prompt headers, effectively resetting the optimization gains.

The 5-minute TTL is conservative and safe, minimizing the risk of stale cached state, but it results in frequent cache expiration according to the [Padiso blog](https://modelcontextprotocol.io/).

This frequent invalidation undermines the primary benefit of caching, which is amortizing the cost of large context windows over time. In practice, rigid 300-second intervals force architectures into a "heartbeat" pattern where clients must ping the server unnecessarily to keep the cache warm. This adds complexity and network traffic without adding value. The failure mode looks like high latency spikes immediately following the five-minute mark, regardless of whether the underlying data has actually changed. In production logs, this appears as rhythmic clusters of cache misses that correlate perfectly with the timestamp, indicating a configuration issue rather than a data change.

**Extending the TTL significantly improves efficiency.** Systems using a 1-hour TTL with a 95% hit rate achieve 76% cost savings compared to no caching, as noted by the [padiso.co blog](https://modelcontextprotocol.io/).

Relying on the default interval treats the cache as a temporary buffer rather than a persistent optimization layer. To build durable workflows, you must look past the 300-second default and design for longer, more stable retention periods that match the actual lifecycle of your data.

Jitter and Heartbeats: Designing Robust Cache Refresh Cycles

Rigid refresh schedules create synchronization hazards that undermine the benefits of caching. If a fleet of workers initializes simultaneously or relies on a fixed timer derived from the system clock, they will attempt to refresh their prompt caches at the exact same moment. This behavior creates a thundering herd problem, spiking latency and potentially triggering rate limits exactly when the system needs stability. To prevent this, you must introduce randomness into the refresh cycle.

Jitter is the deliberate addition of randomness to the timing of operations. Instead of refreshing a cache entry exactly at the 300-second mark, a worker should pick a random window around that expiration. A robust pattern is to refresh early by a random percentage of the TTL, typically between 5 and 15 percent. This spreads the load over time, ensuring that only a small subset of workers hits the API at any given second. It effectively desynchronizes the fleet, turning a periodic spike into a low-level background hum.

Heartbeats serve a complementary purpose. They are lightweight, periodic calls designed to keep a cache entry warm during periods of low activity or to verify that the cache is still valid. If a workflow goes idle for longer than the TTL, the provider might evict the cache to free up resources. A heartbeat ensures that when the user returns, the system is ready to respond immediately without a full re-initialization penalty. This is distinct from a full refresh; it is a minimal interaction sufficient to reset the access timer.

Here is a simple implementation of a jittered refresh loop in Python:

import time
import random

TTL_SECONDS = 300
JITTER_PERCENT = 0.1

while True:
    # Perform the main task
    process_request()

    # Calculate sleep duration with jitter
    jitter = TTL_SECONDS * JITTER_PERCENT
    sleep_time = TTL_SECONDS + random.uniform(-jitter, jitter)
    time.sleep(max(0, sleep_time))

The failure mode of ignoring jitter is obvious in production logs. You will see a sharp spike in 429 errors or latency spikes occurring at regular intervals, like every five minutes. The failure mode of ignoring heartbeats is subtler. You will see intermittent high latency on the first request after a break, followed by fast responses. Use jitter for high-volume, concurrent workflows to smooth load. Use heartbeats for critical, low-latency paths where readiness is paramount. This combination turns a brittle cache into a resilient component of your architecture.

Architecting for Warmth: Workflow Shapes That Optimize Latency

When an LLM call is made, the time spent waiting for the model to warm up is often the dominant contributor to overall latency. A well‑designed workflow can keep the cache warm by aligning request patterns with the cache’s 5‑minute TTL. Below are common shapes that keep the cache active without forcing unnecessary traffic.

1. Pull‑Based Polling with Adaptive Intervals

Instead of a rigid 300‑second poll, use exponential back‑off that respects the cache boundary. For example, poll at 60 s, 120 s, 240 s, and 300 s, then stop until the next request cycle. This reduces traffic during quiet periods while ensuring a cache hit just before the TTL expires. The adaptive delay also mitigates bursty traffic that could trigger rate limits.

2. Push‑Triggered Refresh on High‑Value Requests

When a user performs a high‑impact action, such as creating a new document or changing a prompt template, trigger an immediate cache refresh. This “push” guarantees the freshest data for subsequent requests that will rely on the same prompt. The refresh can be throttled by a short cooldown (e.g., 30 s) to avoid rapid re‑warming during rapid edits.

3. Batch‑Level Warm‑Up

For workflows that process multiple prompts in a single session (e.g., a batch report), pre‑warm the cache for each unique prompt before the batch begins. This can be achieved by issuing lightweight “warm‑up” calls that return only metadata or a token count. Because the API call is cached, the first real request will hit a warm prompt, reducing overall latency by a predictable margin.

4. Hierarchical Prompt Composition

Decompose complex prompts into reusable sub‑prompts. Cache each sub‑prompt independently; assemble them in the application layer. By refreshing only the changed sub‑prompt, you avoid re‑warming the entire prompt tree, keeping the cache hit rate high while minimizing unnecessary traffic.

5. Continuous Warm‑Up via Heartbeats

Implement a heartbeat process that touches the cache at a fixed interval just shy of the TTL (e.g., 295 s). This guarantees that even in low‑traffic scenarios the cache never expires. The heartbeat can be lightweight, fetching a minimal response that the cache records but is otherwise ignored by the application. The process should be idempotent so repeated heartbeats do not cause duplicate cache entries.

6. Monitoring and Auto‑Scaling

Track cache hit ratios, average warm‑up time, and request distribution. If the hit ratio drops below a threshold (e.g., 90 %), consider increasing the heartbeat interval or adding more proactive refreshes. Conversely, if traffic is consistently low, reduce heartbeats to save API calls without harming latency.

By combining pull‑based adaptive polling, push‑triggered refreshes, batch warm‑ups, hierarchical composition, and heartbeat maintenance, an application can keep the prompt cache consistently warm. The result is lower average latency, fewer cold starts, and predictable performance that scales with user activity.

Cost vs. Performance: When to Cache and When to Let Go

Prompt caching is not a universal optimization. While it significantly reduces latency and token costs for repetitive context, it introduces a hidden tax in the form of cache management overhead and potential stale data risks. Engineers must weigh the cost of cache misses against the performance gains of cache hits. If your application frequently rotates context or relies on highly dynamic user inputs, the overhead of maintaining a cache may exceed the savings gained from reduced token processing.

The primary mechanism for deciding when to cache involves evaluating the entropy of your prompt. If the system prompt and the majority of the context window remain static across multiple requests, caching is highly effective. However, if the prompt requires frequent updates to reflect real-time state, the cache becomes a liability. Every time you update a cached prompt, you incur a write cost. If the frequency of these updates approaches the frequency of your inference requests, you are effectively paying for a cache that is rarely utilized.

Consider the lifecycle of your data. For long-running sessions where a user interacts with a large document or a complex codebase, caching the initial context is a clear win. The cost of the initial cache write is amortized over hundreds of subsequent turns. Conversely, for stateless, one-off requests, the cache provides no benefit. In these scenarios, the latency added by the cache lookup and the potential for cache eviction overhead can actually degrade performance.

To determine the optimal strategy, monitor your cache hit rate relative to the cost of the prompt. If the hit rate is low, the cache is merely consuming memory and adding complexity to your infrastructure. In such cases, it is better to let go of the cache entirely. Relying on raw inference for low-frequency or high-entropy prompts simplifies your architecture and eliminates the risk of serving stale, cached context. Use caching only when the data is stable enough to survive multiple request cycles and the cost savings justify the complexity of managing the cache state.

Future-Proofing Your Cache Strategy in Production

The 5-minute cache window is not a constant. Anthropic has already adjusted TTL behaviors once, and any provider reserve the right to change pricing, duration, or eligibility without advance notice. Build your systems assuming the ground will shift.

First, abstract the cache check behind an internal interface. Do not let cache_control type strings leak into your business logic. Wrap the provider client so you can toggle between eager caching, conservative caching, and no caching based on a configuration flag. When Anthropic changes the rules, you change one file, not fifty.

Second, emit cache hit metrics as first-class telemetry. Track hit rate, miss cost in tokens, and the latency delta between cached and uncached paths. These numbers justify the engineering investment and flag regressions immediately. If your hit rate drops from 85% to 30% after a provider update, you want to know within minutes, not at the end of the billing cycle.

Third, design for graceful degradation. A cold cache should never crash a workflow. The heartbeat pattern from earlier sections already helps, but also implement a circuit breaker that falls back to uncached requests when cache latency exceeds a threshold. The fallback costs more; the failure mode is still bounded.

Fourth, version your prompts. Caching is sensitive to exact byte matching, so a single whitespace change invalidates the entry. Store prompt templates under version control and hash the rendered output. When you deploy a new prompt version, you can pre-warm the cache before cutting over traffic. Leviathan uses this approach to maintain sub-second latency even across daily prompt updates.

Fifth, evaluate multi-provider strategies. If you run on Anthropic today, model the cost of porting cache logic to another provider with different TTLs or no caching at all. The abstraction layer pays for itself the first time you need to migrate.

The 5-minute cache is a powerful tool, but tools age. Build so that a policy change is a configuration update, not a rewrite. The teams that treat prompt caching as a stable platform feature rather than a temporary optimization will be the ones still benefiting from it when the next pricing model arrives.

My headless Chrome timeout was 15 seconds and my machine needed 40

Deva — Fri, 03 Jul 2026 10:20:17 +0000

The hourly mirror job was dying silently. Not a network error, not a bad response from the target site. It was dying at session(), the line where Playwright connects to the CDP socket and hands back a browser context. Cold Chrome launch on a busy machine was taking longer than 15 seconds, and 15 seconds was the deadline I had set.

The machine runs content generation and publishing jobs on the same schedule. When those fire, the CPU spikes. Chrome, starting cold into a headless sandbox without a warm profile cache, can take 30 to 45 seconds to get its CDP socket listening. I had picked 15 seconds because that felt generous on an idle machine. It was not generous on a loaded one.

The fix has two parts.

First, raise the deadline to 45 seconds. That covers the actual worst case I observed. If you set a timeout by feel rather than by measurement, you will eventually see the feel case fail in production. The right way to pick a number is to instrument the launch, collect a sample across machine states, and pick the 99th percentile with a buffer. I did not do that the first time. I guessed, and the guess was wrong.

Second, on RuntimeError during bring up, kill any half dead Chrome that is squatting the port or the profile directory, then retry once. This matters because a failed launch does not always clean up after itself. If Chrome starts, acquires the port, then crashes before the CDP socket is ready, the next attempt will fail immediately with a port conflict rather than a timeout. You end up in a state where retrying naively does nothing. The recovery step is: find the process holding the port, kill it, wipe the lock files in the profile directory if they exist, then try again. One retry is enough. If it fails twice in a row the problem is something other than a race condition on bring up.

The tradeoff is latency. A 45s deadline means a failed launch now blocks the job for 45 seconds before erroring out. On an hourly schedule that is fine. On a high frequency job it would not be. If you care about fast failure, the right move is to separate the deadline for "waiting for Chrome to start" from the deadline for "Chrome started but CDP never responded," because those two failure modes have different acceptable wait budgets. I did not split them here. I made the whole window wider. For an hourly mirror job running on a shared machine, that tradeoff is worth it.

What I would do differently next time: instrument the bring up time from the first run. Log how long Chrome takes to get the CDP socket ready across different machine load states. Set the deadline from data, not intuition. And write the recovery path before you need it, not after you see the half dead Chrome squatting the port at 3am.

The failure mode here is not exotic. It shows up any time you run a resource hungry process on a machine that does other things. The fix is not complicated. The only mistake was assuming the happy path timing held under load.

Your CDP timeout is not the bug. Your assumptions about the machine are.

Deva — Fri, 03 Jul 2026 10:03:44 +0000

Most fixes for CDP connection failures start the same way: bump the timeout. More seconds, less drama. But if Chrome is launching on a machine that is already doing other work, the number is not the problem. The assumption that the machine is idle is.

Chrome DevTools Protocol attaches to headless Chrome during session setup. On a cold, idle machine that happens fast. On a machine already running generate and publish jobs on overlapping cron schedules, it can take a lot longer.

My mirror job publishes posts to Medium by driving headless Chrome through a CDP session. The hourly cron fired, the session setup timed out at 15 seconds, and the whole run aborted with nothing sent. This happened maybe once every few hours, always at the same line:

session()

The 15 second default was the symptom. The real cause was contention: a generate job and the mirror job sharing the machine and sometimes firing at the same time. Chrome cold launch does not get a quiet machine; it gets a busy one.

The obvious fix is more time. The less obvious part is what happens when Chrome does not just time out but stalls: a process still holding the port and the profile directory. If you attempt a new launch without clearing that first, you will fail immediately, no matter how generous your new timeout is.

Here is what I changed:

Extended the deadline to 45 seconds. That covers cold launch on a busy machine with margin left over, without making a real failure just sit there burning time.

On RuntimeError, the code now shuts down any Chrome process squatting on the CDP port and the profile path, then retries exactly once. One retry. If that also fails, the error surfaces and the run stops. No silent swallowing, no infinite loop.

The tradeoff: a failed run now takes up to 90 seconds before giving up instead of 15. For an hourly job that is fine. For a real time system it would not be. Know your cadence before you copy this.

What I would do differently: log Chrome launch duration from day one. The tail distribution on a busy machine looked nothing like what I saw in development. If I had been tracking that number, I would have found this before the cron job found it for me.

Four lines of code. The actual lesson is that any startup sequence depending on a resource you do not own exclusively needs a recovery path, not just a longer timeout.

Comments are not posts. My dashboard learned that the hard way.

Deva — Thu, 02 Jul 2026 14:08:58 +0000

48 tests. That is the ticket to ship: comment db (9), generator (5), routes (12), plus the existing draft suite carrying the rest.

The problem this solves is that comments are not posts with a destination. They are responses, and responses expire. A post draft can sit in your queue for 48 hours and still be relevant. A comment draft from Tuesday about a thread that peaked Monday is noise. The moment I started treating them as the same object, everything fought back.

No scheduling

The post pipeline has a whole slot system: quiet hours gating, cadence controls, a scheduled_at column. Comments get none of that. The comment_drafts table has no scheduled_at because the concept does not apply. You generate, you review, you post, or you discard. That is the entire lifecycle. There is no "schedule for 3pm" because by 3pm the thread you are replying to has moved on.

This sounds like a simplification. It is actually a hard constraint that propagated through every other decision.

Carry the target everywhere

Each comment draft stores the post it replies to: author, text, URL, and engagement numbers. Engagement lives as a JSON blob on the draft row because what "engagement" means differs by platform (likes vs reactions vs reposts vs shares), and I was not about to normalize a schema I do not fully control across three platforms at once.

The CommentCard in the frontend shows that target context above the editable reply field. This sounds obvious in retrospect. It was not obvious when I started. Without that context, reviewing a queue of comment drafts is guessing in the dark. You are editing replies without seeing what you are replying to.

Dedup is the annoying part

You do not want to comment on the same post twice. "Already commented" needs to cover two states: pending drafts in the queue, and actual posted comments in the ledger. The guard checks both. The queue cap sits at 5 pending per platform, which keeps the review burden manageable and stops the generator from running far ahead of human judgment.

Discovery: net new is the wrong default

Per platform discovery is read only candidate scraping composed from existing helpers. LinkedIn gets discover_candidates(), X and Threads pull from their existing comment infrastructure. Nothing net new on the discovery side.

That was the right call. Build the new plumbing (generate, store, review, post) around existing discovery rather than rewriting everything at once. The generate comments CLI command and the five routes (list, update, post, discard, generate) are the new surface. The discovery machinery underneath is unchanged.

What I would do differently

The engagement JSON blob. It is pragmatic today but means you cannot query "show me drafts targeting high engagement posts" without parsing JSON in the application layer. Separate normalized columns or a join table would be cleaner. I took the shortcut because the cross platform mapping is messy and I wanted to ship. Next time I would normalize it earlier, even imperfectly, so the data stays queryable.

The dashboard now has a Posts/Comments toggle per platform. Simple interface. I keep wondering if I overengineered the backend for what the UI actually does. But 48 tests passing and a working generate comments CLI say otherwise.

Why Your Headless Chrome Session Fails Silently on a Busy Machine

Deva — Thu, 02 Jul 2026 13:15:02 +0000

What kills an hourly automation job that works fine in isolation but fails one in five times under real load?

For me it was a 15 second timeout on a cold Chrome launch. Not a bug in the scraping logic. Not a broken selector. Just a number baked in as a default that was never tested against a machine doing other things at the same time.

Here is what was happening. I run a Medium mirror job every hour. It opens a headless Chrome via CDP, logs into Medium, scrapes the post, and publishes it. The job shares a schedule with content generation and publish jobs. When those fire at the same time, the machine is already maxed out: ffmpeg frames being encoded, claude p shells spawned, SQLite writes happening. Cold Chrome launch on a loaded CPU can take 30 to 45 seconds. My session setup was timing out at 15 and raising a RuntimeError that killed the whole run at the session() call, before any mirroring happened.

The fix has two parts.

First, bump the deadline from 15s to 45s. That is the obvious part. If your machine can be busy, your timeout has to reflect worst case startup time, not average case. Average is a lie.

Second, handle the half dead Chrome problem. When CDP brings up a browser and the connection times out, Chrome is not necessarily dead. It may still be sitting on the port or holding the profile directory lock. If you retry immediately without cleaning up, the new launch fails too because the old one is still there. So on RuntimeError, I now shut down any Chrome process squatting the same port and profile path before retrying. One retry. If it fails again, the job logs and exits cleanly rather than wedging the next run.

The tradeoff worth naming: a 45s deadline means the outer job now blocks for up to 45 seconds before it knows something is wrong. In a system that runs hourly that is acceptable. In a system that needs sub minute turnaround it would not be. The right ceiling depends entirely on your cadence.

The retry once pattern is also a deliberate constraint. I could retry three times, back off exponentially, and keep going. I did not, because two consecutive failures on a port and profile pair means something is actually broken, not just slow. A retry loop that keeps spinning on a genuinely dead Chrome wastes time and leaves zombie processes. One retry catches transient load spikes. More than one masks real failures.

What I would do differently from the start: treat the initial timeout as a parameter surfaced in config, not a constant in the function. I only noticed the 15s default when production broke it. If it had been in a config file with a comment explaining it covered cold start time, the relationship between machine load and launch latency would have been obvious earlier.

The broader point: any automation that depends on an external process starting within a fixed window needs that window calibrated against real world conditions, not a development laptop. Flakiness at 20% load is often a timeout that was tuned at 0% load. Before you go hunting for logic bugs, check your deadlines.