<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Zhdankov</title>
    <description>The latest articles on DEV Community by Alex Zhdankov (@alex_zhdankov).</description>
    <link>https://dev.to/alex_zhdankov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3929959%2Fb8e74cb6-a829-45e7-820e-5ae2678fb69a.png</url>
      <title>DEV Community: Alex Zhdankov</title>
      <link>https://dev.to/alex_zhdankov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alex_zhdankov"/>
    <language>en</language>
    <item>
      <title>Why we built a durable task runtime without a broker</title>
      <dc:creator>Alex Zhdankov</dc:creator>
      <pubDate>Tue, 09 Jun 2026 13:26:45 +0000</pubDate>
      <link>https://dev.to/alex_zhdankov/why-we-built-a-durable-task-runtime-without-a-broker-580o</link>
      <guid>https://dev.to/alex_zhdankov/why-we-built-a-durable-task-runtime-without-a-broker-580o</guid>
      <description>&lt;p&gt;Background tasks look trivial - until they must survive process restarts, support delayed execution, and run without external infrastructure.&lt;/p&gt;

&lt;p&gt;We needed scheduled maintenance inside a PostgreSQL management agent: &lt;code&gt;VACUUM&lt;/code&gt; jobs, health checks, heartbeat tasks, retention cleanup, delayed operations.&lt;/p&gt;

&lt;p&gt;Celery was the obvious answer. We didn't use it.&lt;/p&gt;

&lt;p&gt;Not because Celery is bad. Because the operational model was wrong for our constraints.&lt;/p&gt;

&lt;p&gt;The agent is installed directly on database hosts. Sometimes inside isolated customer environments. Without Kubernetes. Without managed infrastructure. Without external services we control.&lt;/p&gt;

&lt;p&gt;At that point, background jobs stopped being a helper library problem. They became &lt;strong&gt;persistence, supervision, and lifecycle management inside a single Python process&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The constraint that changed everything
&lt;/h2&gt;

&lt;p&gt;The agent is a single Python process running on the database host.&lt;/p&gt;

&lt;p&gt;We already used Redis elsewhere in the platform for the browser‑based psql terminal. Using it as a task broker sounded convenient.&lt;/p&gt;

&lt;p&gt;Operationally, it was the wrong fit.&lt;/p&gt;

&lt;p&gt;Terminal traffic optimises for low‑latency streaming. Task scheduling optimises for durable persistence. Mixing them would create contention.&lt;br&gt;
Worse: Redis persistence is optional. If Redis restarts, queued tasks disappear. That trade‑off is acceptable for transient terminal streams. It is much harder to justify for scheduled maintenance.&lt;/p&gt;

&lt;p&gt;RabbitMQ was even harder to justify - a dedicated distributed broker whose only purpose was &lt;em&gt;“run tasks on the same machine that already runs the agent”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The requirements became clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;runs fully in‑process&lt;/li&gt;
&lt;li&gt;survives restarts with task state intact&lt;/li&gt;
&lt;li&gt;supports scheduled future + periodic execution&lt;/li&gt;
&lt;li&gt;supports cancellation and recovery&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;zero external infrastructure&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The incident that changed the design
&lt;/h2&gt;

&lt;p&gt;The original implementation used an in‑memory queue. It worked perfectly in development.&lt;/p&gt;

&lt;p&gt;Then one agent restarted during a maintenance window. The worker process disappeared. Queued follow‑up tasks disappeared with it. A &lt;code&gt;VACUUM&lt;/code&gt; scheduled for 3 AM simply never ran. No error. No log. Just silence.&lt;/p&gt;

&lt;p&gt;At that point: task execution and task state cannot live only in memory.&lt;/p&gt;

&lt;p&gt;Scheduled operations needed to survive agent restarts, package upgrades, partial failures, unexpected termination. The scheduler became &lt;strong&gt;persistence‑first&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why not cron? Why not systemd timers?
&lt;/h2&gt;

&lt;p&gt;Cron solves static scheduling. We needed runtime orchestration: tasks created dynamically via API, deduplication, cancellation, rescheduling, execution status tracking, reattachment after restart. Cron has no concept of task ownership. Coordinating &lt;em&gt;“run &lt;code&gt;VACUUM&lt;/code&gt; every 6 hours unless one is already running”&lt;/em&gt; becomes awkward fast.&lt;/p&gt;

&lt;p&gt;Systemd timers solve some lifecycle problems but split orchestration across two independent runtimes (the application and the OS). The scheduler needed direct access to database metadata, agent runtime state, worker registration, internal APIs. One ownership boundary: the agent owns the tasks.&lt;/p&gt;
&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;Three independent processes. Two queues. One Unix socket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API caller
    │
    │ Unix socket
    ▼
Scheduler process
    │
    ├── task_queue ──► WorkerPool process
    │
    └── event_queue ◄─ WorkerPool process
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important framing line is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The scheduler is really two independent systems:&lt;br&gt;
a durable state machine and a subprocess supervisor.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Task state and task execution are intentionally separated.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt; owns persistence, timing, deduplication, lifecycle state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WorkerPool&lt;/strong&gt; owns subprocess execution, signals, supervision, cancellation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They never share memory - only queues and socket messages.&lt;/p&gt;

&lt;p&gt;That separation turned out to matter operationally: if the WorkerPool crashes, the Scheduler continues running. If the Scheduler restarts, tasks survive on disk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Task lifecycle
&lt;/h2&gt;

&lt;p&gt;The state machine is explicit:&lt;br&gt;
&lt;code&gt;DEFAULT → SCHEDULED → QUEUED → DOING → DONE&lt;/code&gt;, with branches to &lt;code&gt;FAILED, ABORTED, CANCELED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Status is a bitmask. Recovery logic collapsed into a single SQLite predicate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOING&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;QUEUED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One integer column. No joins. No OR chains. That’s not clever implementation — that’s operational simplicity when you have to reconstruct state after a crash.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the scheduler actually works
&lt;/h2&gt;

&lt;p&gt;The Scheduler runs a &lt;code&gt;select()&lt;/code&gt; loop over two file descriptors: the Unix socket (incoming requests) and the WorkerPool event queue (status updates). Every second, it also runs &lt;code&gt;schedule()&lt;/code&gt;: find runnable tasks, reschedule periodic ones, purge expired history.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;the Scheduler never executes work&lt;/strong&gt;. It only orchestrates lifecycle transitions. Execution happens inside isolated subprocesses.&lt;/p&gt;

&lt;p&gt;Process isolation gave us a clean guarantee: if a worker misbehaves (hangs, leaks memory, deadlocks), the scheduler can terminate it unconditionally. Threads would have shared the failure domain. That guarantee mattered more than raw efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Persistence and recovery (the SQLite moment)
&lt;/h2&gt;

&lt;p&gt;Task state lives entirely in SQLite. On startup, recovery runs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Any task that was &lt;code&gt;DOING&lt;/code&gt; or &lt;code&gt;QUEUED&lt;/code&gt; during shutdown becomes &lt;code&gt;ABORTED&lt;/code&gt;. Future scheduled tasks remain intact.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A &lt;code&gt;VACUUM&lt;/code&gt; scheduled for 3 AM still runs after a midnight restart.&lt;/p&gt;

&lt;p&gt;SQLite here is not a database. It is a &lt;strong&gt;durable state machine log&lt;/strong&gt; - exactly the same pattern we used for our metrics buffer. That realisation unified our mental model across the entire agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deterministic deduplication
&lt;/h2&gt;

&lt;p&gt;Duplicate submissions (retries, double‑clicks, race conditions) should not create duplicate execution.&lt;/p&gt;

&lt;p&gt;Task IDs are generated deterministically from the operation &lt;code&gt;signature: database:schema:table:hour:operation_type&lt;/code&gt;. The second submission resolves to the same identifier and is rejected immediately. One simple mechanism eliminated an entire class of coordination bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this system intentionally does not have
&lt;/h2&gt;

&lt;p&gt;This is not Celery. No distributed routing, no broker federation, no result backend, no retry orchestration, no monitoring UI, no horizontal scaling.&lt;/p&gt;

&lt;p&gt;For a single‑host agent, what we needed instead was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;durable local scheduling&lt;/li&gt;
&lt;li&gt;subprocess isolation&lt;/li&gt;
&lt;li&gt;restart recovery&lt;/li&gt;
&lt;li&gt;periodic execution&lt;/li&gt;
&lt;li&gt;deterministic deduplication&lt;/li&gt;
&lt;li&gt;bounded orchestration complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire scheduler fits in roughly 400 lines of standard library Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mental model
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unix socket&lt;/strong&gt; = scheduler API boundary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt; = durable task state machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WorkerPool&lt;/strong&gt; = subprocess supervisor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQLite&lt;/strong&gt; = local orchestration persistence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You could call it a local control plane - a miniature of what the larger platform does, but inside a single host boundary.&lt;/p&gt;

&lt;p&gt;The interesting realisation was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Once tasks must survive crashes and restarts, background jobs stop being a threading problem. They become a persistence problem, a lifecycle problem, and a process supervision problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Celery solves distributed execution. We needed crash‑safe orchestration inside a single host boundary.&lt;/p&gt;

&lt;p&gt;Those are very different systems.&lt;/p&gt;

</description>
      <category>python</category>
      <category>postgres</category>
      <category>distributedsystems</category>
      <category>devops</category>
    </item>
    <item>
      <title>SQLite as an offline metrics buffer in a monitoring agent</title>
      <dc:creator>Alex Zhdankov</dc:creator>
      <pubDate>Thu, 28 May 2026 14:53:32 +0000</pubDate>
      <link>https://dev.to/alex_zhdankov/sqlite-as-an-offline-metrics-buffer-in-a-monitoring-agent-5c3k</link>
      <guid>https://dev.to/alex_zhdankov/sqlite-as-an-offline-metrics-buffer-in-a-monitoring-agent-5c3k</guid>
      <description>&lt;p&gt;&lt;em&gt;Reliable monitoring becomes surprisingly hard once you assume the network will eventually fail.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Monitoring systems quietly depend on one dangerous assumption:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the monitoring server will always be reachable&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In production, that assumption breaks constantly.&lt;/p&gt;

&lt;p&gt;And once it breaks,&lt;br&gt;
you discover that metrics pipelines are really distributed systems.&lt;/p&gt;
&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Our PostgreSQL monitoring agent periodically collects metrics and sends them to the Control Plane.&lt;/p&gt;

&lt;p&gt;Under normal conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent
   │
   └──► metrics ───► Control Plane ───► storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple enough.&lt;/p&gt;

&lt;p&gt;But then production networks happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPN reconnects&lt;/li&gt;
&lt;li&gt;firewall reloads&lt;/li&gt;
&lt;li&gt;DNS failures&lt;/li&gt;
&lt;li&gt;routing instability&lt;/li&gt;
&lt;li&gt;Control Plane maintenance windows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once connectivity drops,&lt;br&gt;
the entire delivery pipeline becomes unreliable.&lt;/p&gt;

&lt;p&gt;The obvious answer sounds simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“just retry later”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That idea turned out to be much harder than expected.&lt;/p&gt;

&lt;p&gt;Questions immediately appeared:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how long should metrics stay in memory?&lt;/li&gt;
&lt;li&gt;what happens during process restart?&lt;/li&gt;
&lt;li&gt;how do we avoid unbounded memory growth?&lt;/li&gt;
&lt;li&gt;how do we recover after long outages?&lt;/li&gt;
&lt;li&gt;how do we know delivery is degraded?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point,&lt;br&gt;
this stopped being “metrics collection”.&lt;/p&gt;

&lt;p&gt;It became a buffering and reliability problem.&lt;/p&gt;
&lt;h2&gt;
  
  
  The incident that changed the design
&lt;/h2&gt;

&lt;p&gt;One outage lasted almost six hours after a network partition between monitoring agents and the Control Plane.&lt;/p&gt;

&lt;p&gt;The agents themselves stayed healthy.&lt;br&gt;
PostgreSQL stayed healthy.&lt;/p&gt;

&lt;p&gt;Only the network path disappeared.&lt;/p&gt;

&lt;p&gt;The original implementation buffered metrics in memory.&lt;/p&gt;

&lt;p&gt;At 4:12 AM, one agent restarted during a package update.&lt;/p&gt;

&lt;p&gt;Six hours of buffered monitoring data vanished instantly.&lt;/p&gt;

&lt;p&gt;The dashboards looked like this:&lt;br&gt;
&lt;code&gt;timeline ─────── gap ─────── timeline&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;No visibility.&lt;br&gt;
No recovery.&lt;br&gt;
No idea what happened during the outage.&lt;/p&gt;

&lt;p&gt;That incident permanently killed the idea of in-memory buffering.&lt;/p&gt;

&lt;p&gt;Metrics had to survive process restarts.&lt;/p&gt;
&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;The final design became intentionally simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                PostgreSQL
                     │
                     ▼
              metrics collector
                     │
                     ▼
          +----------------------+
          |      SQLite WAL      |
          |----------------------|
          | durable local queue  |
          | bounded retention    |
          +----------------------+
                     │
                     ▼
             batch delivery loop
                     │
                     ▼
               Control Plane
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key architectural shift:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the Control Plane is no longer in the critical path of collection&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Metrics are always written locally first.&lt;/p&gt;

&lt;p&gt;Network delivery became asynchronous.&lt;/p&gt;

&lt;p&gt;That single separation changed the reliability model completely.&lt;/p&gt;

&lt;p&gt;The network could fail for hours,&lt;br&gt;
and the agent would continue operating normally.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why SQLite
&lt;/h2&gt;

&lt;p&gt;The interesting part here is not SQL.&lt;/p&gt;

&lt;p&gt;It's what SQLite fundamentally is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a single file&lt;/li&gt;
&lt;li&gt;atomic writes&lt;/li&gt;
&lt;li&gt;crash-safe persistence&lt;/li&gt;
&lt;li&gt;zero infrastructure&lt;/li&gt;
&lt;li&gt;included in Python itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We evaluated alternatives.&lt;/p&gt;
&lt;h3&gt;
  
  
  In-memory queue
&lt;/h3&gt;

&lt;p&gt;Failed because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;metrics disappear on restart&lt;/li&gt;
&lt;li&gt;memory usage becomes unbounded during long outages&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Plain append-only files
&lt;/h3&gt;

&lt;p&gt;Failed because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;manual locking&lt;/li&gt;
&lt;li&gt;manual serialization&lt;/li&gt;
&lt;li&gt;corruption handling&lt;/li&gt;
&lt;li&gt;cleanup complexity&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Redis
&lt;/h3&gt;

&lt;p&gt;We already use Redis elsewhere in the platform (for the psql terminal).&lt;br&gt;&lt;br&gt;
But for the metrics buffer, Redis was the wrong fit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis is in-memory by default. Persistence (RDB/AOF) is optional and adds latency.&lt;/li&gt;
&lt;li&gt;If Redis restarts, the buffer disappears — even if the agent is healthy.&lt;/li&gt;
&lt;li&gt;Mixing low-latency terminal traffic with background batch writes would create contention.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQLite gives us durability on disk, survives agent restarts, and has zero dependencies outside the agent process.&lt;br&gt;&lt;br&gt;
Redis is great for real-time coordination. SQLite is better for durable local buffering.&lt;/p&gt;
&lt;h2&gt;
  
  
  The schema is intentionally tiny
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;metrics_queue&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;          &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="n"&gt;AUTOINCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instance_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_type&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;data&lt;/span&gt;        &lt;span class="nb"&gt;TEXT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;One row per metric type per collection cycle.&lt;/p&gt;

&lt;p&gt;Payloads are JSON-encoded.&lt;/p&gt;

&lt;p&gt;This is not relational modeling.&lt;/p&gt;

&lt;p&gt;This is durable buffering.&lt;/p&gt;
&lt;h2&gt;
  
  
  Concurrency: why SQLite didn't become a bottleneck
&lt;/h2&gt;

&lt;p&gt;One obvious question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“doesn't SQLite lock the whole database?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yes.&lt;br&gt;
And that matters.&lt;/p&gt;

&lt;p&gt;The reason it worked here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exactly one writer exists&lt;/li&gt;
&lt;li&gt;the collector loop owns all inserts&lt;/li&gt;
&lt;li&gt;delivery workers only read/delete in batches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also enabled WAL mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;PRAGMA&lt;/span&gt; &lt;span class="n"&gt;journal_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WAL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;PRAGMA&lt;/span&gt; &lt;span class="n"&gt;synchronous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NORMAL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;WAL mode mattered for one specific reason:&lt;/p&gt;

&lt;p&gt;the collector continuously appends metrics,&lt;br&gt;
while the delivery loop simultaneously reads and deletes batches.&lt;/p&gt;

&lt;p&gt;Without WAL,&lt;br&gt;
those operations would frequently block each other.&lt;/p&gt;

&lt;p&gt;WAL allows readers to continue while writes are happening.&lt;/p&gt;

&lt;p&gt;We also use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;PRAGMA&lt;/span&gt; &lt;span class="n"&gt;synchronous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NORMAL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reduces fsync frequency while still preserving crash consistency,&lt;br&gt;
which was a better latency/durability tradeoff for monitoring data.&lt;/p&gt;

&lt;p&gt;For our workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;collection every 10–30 seconds&lt;/li&gt;
&lt;li&gt;batches of 100–500 rows&lt;/li&gt;
&lt;li&gt;occasional degraded-network bursts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQLite contention was negligible.&lt;/p&gt;

&lt;p&gt;If we needed hundreds of concurrent writers,&lt;br&gt;
we would likely move to PostgreSQL instead.&lt;/p&gt;

&lt;p&gt;For this system,&lt;br&gt;
operational simplicity mattered more than horizontal scalability.&lt;/p&gt;
&lt;h2&gt;
  
  
  The buffer state machine
&lt;/h2&gt;

&lt;p&gt;The buffer itself behaves like a bounded queue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COLLECT
   │
   ├── buffer healthy ──► INSERT
   │
   └── limits exceeded
              │
              ▼
         DROP + WARN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two limits control growth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maximum file size&lt;/li&gt;
&lt;li&gt;maximum record age&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those constraints turned out to be operationally critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why age-based cleanup matters
&lt;/h2&gt;

&lt;p&gt;Imagine a six-hour outage.&lt;/p&gt;

&lt;p&gt;Without retention limits,&lt;br&gt;
the moment connectivity returns,&lt;br&gt;
the agent floods the Control Plane with six hours of historical metrics.&lt;/p&gt;

&lt;p&gt;On dashboards, that looks like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;timeline ─── gap ─── MASSIVE SPIKE ─── normal&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Operationally, that's worse than losing data.&lt;/p&gt;

&lt;p&gt;The spike is misleading.&lt;/p&gt;

&lt;p&gt;So we intentionally discard stale records.&lt;/p&gt;

&lt;p&gt;If buffered metrics exceed max_age_hours,&lt;br&gt;
they are silently removed during cleanup.&lt;/p&gt;

&lt;p&gt;The result becomes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;timeline ─── gap ─── normal&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The outage remains visible.&lt;br&gt;
But the monitoring data stays trustworthy.&lt;/p&gt;

&lt;p&gt;That tradeoff mattered more than perfect retention.&lt;/p&gt;
&lt;h2&gt;
  
  
  Batch delivery
&lt;/h2&gt;

&lt;p&gt;The delivery loop periodically reconstructs batches from SQLite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;metrics_queue&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rows are grouped back into metric payloads and sent to the Control Plane.&lt;br&gt;
Importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;rows are deleted only after successful delivery&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;send_to_control_plane&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        DELETE FROM metrics_queue
        WHERE id IN (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a very simple delivery guarantee:&lt;br&gt;
&lt;code&gt;INSERT → SEND → DELETE&lt;/code&gt;&lt;br&gt;
If the process crashes before DELETE,&lt;br&gt;
the metrics remain in SQLite and are retried later.&lt;/p&gt;

&lt;p&gt;That makes delivery effectively at-least-once.&lt;/p&gt;

&lt;p&gt;This effectively turns SQLite into a durable local write-ahead queue.&lt;/p&gt;
&lt;h2&gt;
  
  
  The second production incident
&lt;/h2&gt;

&lt;p&gt;Several months later,&lt;br&gt;
another issue appeared.&lt;/p&gt;

&lt;p&gt;Monitoring dashboards showed healthy agents,&lt;br&gt;
but metrics were arriving hours late.&lt;/p&gt;

&lt;p&gt;The agents were buffering correctly.&lt;/p&gt;

&lt;p&gt;The problem was:&lt;br&gt;
the SQLite files had silently grown to hundreds of megabytes after a prolonged outage.&lt;/p&gt;

&lt;p&gt;Delivery throughput could not drain the backlog fast enough.&lt;/p&gt;

&lt;p&gt;At that point we realized:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the buffer itself needed observability&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Buffer observability
&lt;/h2&gt;

&lt;p&gt;Every heartbeat sent to the Control Plane now includes buffer state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;'buffer_size_bytes':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;'oldest_record_age':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;'buffer_utilization':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;utilization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Operationally, the most useful metrics became:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metric             | Meaning                  | Operational action
───────────────────+──────────────────────────+────────────────────────
buffer_size_bytes  | Current SQLite file size | Alert if &amp;gt;100 MB
oldest_record_age  | Delivery lag             | Alert if &amp;gt;1 hour
buffer_utilization | Current size / max size  | Scale retention policy
delivery_failures  | Consecutive failed sends | Investigate connectivity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without these metrics,&lt;br&gt;
network degradation looked identical to healthy operation.&lt;/p&gt;

&lt;p&gt;The buffer itself became another production subsystem&lt;br&gt;
that required monitoring.&lt;/p&gt;
&lt;h2&gt;
  
  
  Production numbers
&lt;/h2&gt;

&lt;p&gt;Typical production values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;collection interval: 10–30 seconds&lt;/li&gt;
&lt;li&gt;average batch size: 100–500 rows&lt;/li&gt;
&lt;li&gt;SQLite buffer limit: 128 MB&lt;/li&gt;
&lt;li&gt;longest outage survived: ~6 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under normal degraded-network conditions,&lt;br&gt;
the SQLite file rarely exceeded 10–15 MB.&lt;/p&gt;

&lt;p&gt;CPU overhead was effectively negligible.&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure modes we actually cared about
&lt;/h2&gt;
&lt;h3&gt;
  
  
  A. Agent restart during outage
&lt;/h3&gt;

&lt;p&gt;Metrics survive because the buffer lives on disk.&lt;br&gt;
Delivery resumes automatically after restart.&lt;/p&gt;
&lt;h3&gt;
  
  
  B. Control Plane recovery after long outage
&lt;/h3&gt;

&lt;p&gt;Fresh metrics resume immediately.&lt;br&gt;
Stale buffered records expire automatically.&lt;br&gt;
No misleading spikes.&lt;/p&gt;
&lt;h3&gt;
  
  
  C. Disk pressure
&lt;/h3&gt;

&lt;p&gt;Once the buffer exceeds the configured size limit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;new writes stop&lt;/li&gt;
&lt;li&gt;warnings are emitted&lt;/li&gt;
&lt;li&gt;collection pauses temporarily&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Delivery resumes automatically once backlog drains.&lt;/p&gt;
&lt;h3&gt;
  
  
  D. SQLite corruption
&lt;/h3&gt;

&lt;p&gt;Rare, but possible.&lt;/p&gt;

&lt;p&gt;We intentionally kept handling simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;log the error&lt;/li&gt;
&lt;li&gt;recreate the buffer&lt;/li&gt;
&lt;li&gt;continue operation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a monitoring system,&lt;br&gt;
temporary metric loss is acceptable.&lt;/p&gt;

&lt;p&gt;Operational complexity was not worth stronger guarantees.&lt;/p&gt;
&lt;h2&gt;
  
  
  Bounded loss over unbounded growth
&lt;/h2&gt;

&lt;p&gt;One important design decision:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the system intentionally prefers bounded data loss over unbounded resource growth&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Old metrics eventually expire.&lt;/p&gt;

&lt;p&gt;The SQLite file has a hard size limit.&lt;/p&gt;

&lt;p&gt;That tradeoff was deliberate.&lt;/p&gt;

&lt;p&gt;For operational monitoring,&lt;br&gt;
partial visibility is better than:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a dead agent&lt;/li&gt;
&lt;li&gt;an exhausted disk&lt;/li&gt;
&lt;li&gt;or an overloaded recovery pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Perfect durability was never the goal.&lt;/p&gt;

&lt;p&gt;Operational survivability was.&lt;/p&gt;
&lt;h2&gt;
  
  
  The mental model
&lt;/h2&gt;

&lt;p&gt;If you understand only one thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect
   ▼
SQLite
   ▼
deliver
   ▼
delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SQLite here is not acting as a database.&lt;/p&gt;

&lt;p&gt;It's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a durable queue&lt;/li&gt;
&lt;li&gt;with bounded retention&lt;/li&gt;
&lt;li&gt;bounded size&lt;/li&gt;
&lt;li&gt;atomic writes&lt;/li&gt;
&lt;li&gt;and zero infrastructure dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the entire design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The interesting realization was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;monitoring reliability problems are mostly buffering problems&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once network delivery became unreliable,&lt;br&gt;
the architecture naturally evolved toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local durability&lt;/li&gt;
&lt;li&gt;asynchronous delivery&lt;/li&gt;
&lt;li&gt;bounded retention&lt;/li&gt;
&lt;li&gt;explicit degradation handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not because SQLite is magical.&lt;/p&gt;

&lt;p&gt;Because a single durable file solved the exact failure mode we actually had.&lt;/p&gt;

</description>
      <category>python</category>
      <category>postgres</category>
      <category>sqlite</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why your SSH scripts will fail in production</title>
      <dc:creator>Alex Zhdankov</dc:creator>
      <pubDate>Mon, 18 May 2026 15:41:40 +0000</pubDate>
      <link>https://dev.to/alex_zhdankov/why-your-ssh-scripts-will-fail-in-production-4cb8</link>
      <guid>https://dev.to/alex_zhdankov/why-your-ssh-scripts-will-fail-in-production-4cb8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Remote command execution looks trivial — until unstable networks, retries, long-running commands, and half-open connections turn it into a reliability problem.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We use Paramiko with a thin supervision layer on top.&lt;br&gt;
The same operational problems apply to AsyncSSH, Fabric, or plain OpenSSH subprocesses.&lt;/p&gt;

&lt;p&gt;At first, the implementation looked completely straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;paramiko&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SSHClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stderr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;systemctl restart postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In development, this worked perfectly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then production happened&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hundreds of hosts.&lt;/li&gt;
&lt;li&gt;Unstable networks.&lt;/li&gt;
&lt;li&gt;Long-running commands.&lt;/li&gt;
&lt;li&gt;Frozen sessions.&lt;/li&gt;
&lt;li&gt;Half-open connections.&lt;/li&gt;
&lt;li&gt;Retries.&lt;/li&gt;
&lt;li&gt;Partial execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point this stopped being &lt;em&gt;“SSH scripting”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It became a distributed systems problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  SSH is deceptively simple
&lt;/h2&gt;

&lt;p&gt;Most developers intuitively model SSH like this:&lt;br&gt;
&lt;code&gt;local subprocess, but remote&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But production SSH execution is actually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;network transport
+ stateful session
+ interactive channel
+ remote process lifecycle
+ unreliable infrastructure
+ partial execution visibility
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And failures can happen independently at every layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application
    ↓
SSH Client
    ↓
TCP transport        ← packets can vanish
    ↓
SSH session          ← can hang without closing
    ↓
Remote shell         ← can ignore commands
    ↓
Process execution    ← may continue after disconnect
    ↓
stdout/stderr        ← can block forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This distinction changes everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure mode #1 — execution uncertainty
&lt;/h2&gt;

&lt;p&gt;This was the first major production lesson.&lt;/p&gt;

&lt;p&gt;If the SSH transport dies, you do not know whether the command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;succeeded&lt;/li&gt;
&lt;li&gt;failed&lt;/li&gt;
&lt;li&gt;partially executed&lt;/li&gt;
&lt;li&gt;is still running remotely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That uncertainty completely changes retry semantics.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl restart postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the connection drops immediately after sending the command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;did restart begin?&lt;/li&gt;
&lt;li&gt;is postgres still restarting?&lt;/li&gt;
&lt;li&gt;did it already succeed?&lt;/li&gt;
&lt;li&gt;is the service now dead?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You no longer have execution certainty.&lt;/p&gt;

&lt;p&gt;This is not a &lt;em&gt;“Paramiko problem”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is a distributed systems problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry is dangerous
&lt;/h2&gt;

&lt;p&gt;Retries sound harmless until commands become stateful.&lt;/p&gt;

&lt;p&gt;Some operations are naturally &lt;em&gt;idempotent&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/meminfo
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /etc
systemctl status postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Others are not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;useradd deploy
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /some/path
systemctl restart postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A failed transport does not imply failed execution.&lt;/p&gt;

&lt;p&gt;That means naive retry logic can create destructive side effects.&lt;/p&gt;

&lt;p&gt;This forced us to separate failures into two categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transport uncertainty&lt;/li&gt;
&lt;li&gt;command failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are fundamentally different operational states.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timeouts are not one thing
&lt;/h2&gt;

&lt;p&gt;One of the most common mistakes in SSH automation is treating timeout as a single concept.&lt;/p&gt;

&lt;p&gt;Production systems usually need several independent timeout layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TCP connect timeout&lt;/li&gt;
&lt;li&gt;SSH handshake timeout&lt;/li&gt;
&lt;li&gt;authentication timeout&lt;/li&gt;
&lt;li&gt;command execution timeout&lt;/li&gt;
&lt;li&gt;idle/read timeout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each failure means something different operationally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;banner_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But even that is insufficient.&lt;/p&gt;

&lt;p&gt;A command may still hang forever while the socket technically remains alive.&lt;/p&gt;

&lt;p&gt;That distinction matters a lot in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Half-open connections are nasty
&lt;/h2&gt;

&lt;p&gt;This became one of the hardest reliability problems.&lt;/p&gt;

&lt;p&gt;Sometimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TCP stays alive&lt;/li&gt;
&lt;li&gt;SSH transport stays alive&lt;/li&gt;
&lt;li&gt;but the remote process is effectively dead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;packets silently disappear&lt;/li&gt;
&lt;li&gt;the remote kernel freezes&lt;/li&gt;
&lt;li&gt;stdout stops forever&lt;/li&gt;
&lt;li&gt;but the socket never closes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the application perspective:&lt;br&gt;
&lt;code&gt;everything looks connected&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;while the operation is permanently stalled.&lt;/p&gt;

&lt;p&gt;This is the classic half-open connection problem.&lt;/p&gt;
&lt;h2&gt;
  
  
  Blocking reads break automation
&lt;/h2&gt;

&lt;p&gt;This code looks innocent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But under real workloads it becomes dangerous.&lt;/p&gt;

&lt;p&gt;If:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the command hangs&lt;/li&gt;
&lt;li&gt;stdout stops producing data&lt;/li&gt;
&lt;li&gt;the socket remains alive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then:&lt;br&gt;
&lt;code&gt;the thread blocks forever&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We eventually moved to &lt;em&gt;streaming execution&lt;/em&gt; instead of buffered reads.&lt;/p&gt;
&lt;h2&gt;
  
  
  Streaming changes the execution model
&lt;/h2&gt;

&lt;p&gt;Long-running commands fundamentally change how remote execution must be handled.&lt;/p&gt;

&lt;p&gt;Operations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pg_dump&lt;/li&gt;
&lt;li&gt;VACUUM&lt;/li&gt;
&lt;li&gt;package upgrades&lt;/li&gt;
&lt;li&gt;log exports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;can run for minutes or hours.&lt;/p&gt;

&lt;p&gt;Buffering all output in memory is unreliable.&lt;br&gt;
Blocking until completion destroys &lt;em&gt;observability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Instead we switched to chunked streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit_status_ready&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv_ready&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This solved several production problems simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;realtime progress visibility&lt;/li&gt;
&lt;li&gt;lower memory usage&lt;/li&gt;
&lt;li&gt;cancellation support&lt;/li&gt;
&lt;li&gt;dead session detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streaming ended up being much more operationally stable than buffered execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security becomes infrastructure, not validation
&lt;/h2&gt;

&lt;p&gt;Another important lesson:&lt;/p&gt;

&lt;p&gt;SSH automation is remote code execution infrastructure.&lt;/p&gt;

&lt;p&gt;That means command construction rules matter enormously.&lt;/p&gt;

&lt;p&gt;This is catastrophic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rm -rf &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because eventually someone passes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/home/user; rm -rf /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We ended up treating all remote commands as infrastructure-sensitive operations.&lt;/p&gt;

&lt;p&gt;Input validation alone was insufficient.&lt;/p&gt;

&lt;p&gt;Every dynamic argument had to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validated&lt;/li&gt;
&lt;li&gt;escaped&lt;/li&gt;
&lt;li&gt;constrained
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;safe_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shlex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even simple automation eventually becomes security-critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resource cleanup matters more than expected
&lt;/h2&gt;

&lt;p&gt;SSH resources leak surprisingly easily.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Channels.&lt;/li&gt;
&lt;li&gt;Sockets.&lt;/li&gt;
&lt;li&gt;Transports.&lt;/li&gt;
&lt;li&gt;PTY buffers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under load, forgotten cleanup accumulates fast.&lt;/p&gt;

&lt;p&gt;We eventually standardized all operations around explicit lifecycle management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;ssh_operation&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ssh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ssh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part was not aesthetics.&lt;/p&gt;

&lt;p&gt;It was guaranteeing cleanup under:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exceptions&lt;/li&gt;
&lt;li&gt;timeouts&lt;/li&gt;
&lt;li&gt;partial failures&lt;/li&gt;
&lt;li&gt;interrupted execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production automation lives or dies on &lt;em&gt;cleanup guarantees&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture we ended up with
&lt;/h2&gt;

&lt;p&gt;Over time the system evolved into several independent layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Connection management
    ↓
Retry classification
    ↓
Execution supervision
    ↓
Streaming transport
    ↓
Resource cleanup
    ↓
Observability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important realization was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;remote execution is not a helper function&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final insight
&lt;/h2&gt;

&lt;p&gt;The happy path is trivial.&lt;/p&gt;

&lt;p&gt;Production architecture begins where execution certainty ends.&lt;/p&gt;

&lt;p&gt;SSH automation fails when treated like scripting.&lt;/p&gt;

&lt;p&gt;Because it is not scripting.&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;remote process orchestration&lt;/li&gt;
&lt;li&gt;over unreliable transport&lt;/li&gt;
&lt;li&gt;with partial execution visibility&lt;/li&gt;
&lt;li&gt;inside a distributed system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And once you accept that,&lt;br&gt;
the architecture changes completely.&lt;/p&gt;

</description>
      <category>ssh</category>
      <category>python</category>
      <category>distributedsystems</category>
      <category>devops</category>
    </item>
    <item>
      <title>We built a real psql terminal in the browser. Here’s what made it unexpectedly hard.</title>
      <dc:creator>Alex Zhdankov</dc:creator>
      <pubDate>Wed, 13 May 2026 20:53:07 +0000</pubDate>
      <link>https://dev.to/alex_zhdankov/we-built-a-real-psql-terminal-in-the-browser-heres-what-made-it-unexpectedly-hard-57a1</link>
      <guid>https://dev.to/alex_zhdankov/we-built-a-real-psql-terminal-in-the-browser-heres-what-made-it-unexpectedly-hard-57a1</guid>
      <description>&lt;p&gt;&lt;strong&gt;A PTY-backed PostgreSQL console running in the browser using reverse WebSockets, Redis Streams, and xterm.js — designed around centralized control-plane constraints and production failure modes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We needed a real PostgreSQL terminal inside the browser.&lt;/p&gt;

&lt;p&gt;Not a SQL editor.&lt;br&gt;
Not a query API.&lt;br&gt;
A real psql session with full terminal semantics.&lt;/p&gt;

&lt;p&gt;That requirement immediately forced several architectural constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a real PTY&lt;/li&gt;
&lt;li&gt;a long-lived stateful process&lt;/li&gt;
&lt;li&gt;bidirectional streaming&lt;/li&gt;
&lt;li&gt;terminal resize handling&lt;/li&gt;
&lt;li&gt;signal forwarding (&lt;code&gt;Ctrl+C&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;native &lt;code&gt;psql&lt;/code&gt; behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And then the infrastructure constraints made things significantly more interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;agents live in internal networks&lt;/li&gt;
&lt;li&gt;all traffic must go through the Control Plane&lt;/li&gt;
&lt;li&gt;xterm.js only supports WebSocket transport&lt;/li&gt;
&lt;li&gt;we could not emulate &lt;code&gt;psql&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, this stopped being a “web feature”.&lt;br&gt;
It became a distributed terminal runtime problem.&lt;/p&gt;
&lt;h2&gt;
  
  
  High-level architecture
&lt;/h2&gt;

&lt;p&gt;This system only makes sense if you read it as a &lt;em&gt;dataflow graph&lt;/em&gt;, not as isolated services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser (xterm.js)
    │
    │ WebSocket (terminal I/O)
    ▼
Control Plane
    │
    │ session management + auth
    ▼
Redis Streams (output buffer)
    │
    │ coordination + async delivery
    ▼
Agent WebSocket channel
    │
    │ PTY stdin/stdout bridge
    ▼
PTY → real psql process
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical architectural decision:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the browser never connects to the agent directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Control Plane is the only public entrypoint in the entire system.&lt;br&gt;
Everything flows through it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why the architecture looks “backwards”
&lt;/h2&gt;

&lt;p&gt;The surprising part is that the agent initiates the terminal transport.&lt;/p&gt;

&lt;p&gt;Not because NAT traversal was impossible.&lt;/p&gt;

&lt;p&gt;But because the system was intentionally designed around a centralized Control Plane.&lt;/p&gt;

&lt;p&gt;Agents sit in internal networks.&lt;br&gt;
The browser has no direct visibility into them.&lt;/p&gt;

&lt;p&gt;So instead of:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Browser → Agent&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;the architecture becomes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Browser → Control Plane ← Agent&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The Control Plane acts as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;session coordinator&lt;/li&gt;
&lt;li&gt;auth boundary&lt;/li&gt;
&lt;li&gt;transport router&lt;/li&gt;
&lt;li&gt;lifecycle owner&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once that decision is made, reverse WebSockets become the natural transport model.&lt;/p&gt;
&lt;h2&gt;
  
  
  Session establishment
&lt;/h2&gt;

&lt;p&gt;The session lifecycle happens in multiple stages.&lt;/p&gt;

&lt;p&gt;Importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the PTY process does not exist when the browser first connects&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Only a logical session exists.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1 — Browser creates a logical session
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser
  │
  │ WebSocket connect
  ▼
Control Plane
  ├── creates session_id
  ├── registers browser handler
  └── starts auth timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;At this point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no PTY exists&lt;/li&gt;
&lt;li&gt;no psql exists&lt;/li&gt;
&lt;li&gt;no database connection exists&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Control Plane only knows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“a browser wants a terminal session”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Step 2 — Control Plane signals the agent
&lt;/h2&gt;

&lt;p&gt;The Control Plane sends a lightweight HTTP request:&lt;br&gt;
&lt;code&gt;POST /terminal?session_id=&amp;lt;uuid&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is intentionally the only HTTP hop in the entire terminal lifecycle.&lt;/p&gt;

&lt;p&gt;The request does not carry terminal traffic.&lt;/p&gt;

&lt;p&gt;It only means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“establish terminal transport for this session”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Step 3 — Agent opens reverse WebSocket
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent
  │
  │ outbound WebSocket
  ▼
Control Plane
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now the system has two independent transport channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser WS → Control Plane&lt;/li&gt;
&lt;li&gt;Agent WS   → Control Plane&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they are still disconnected.&lt;/p&gt;

&lt;p&gt;The system is in a &lt;em&gt;half-connected state&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Session stitching
&lt;/h2&gt;

&lt;p&gt;This is the moment where the architecture becomes interesting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser Handler ───────┐
                       ├── session binding
Agent Handler ─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the Control Plane stops being a transport endpoint and becomes a message router&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It now forwards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;browser input → agent&lt;/li&gt;
&lt;li&gt;agent output → browser&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But critically:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;not directly&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All terminal output passes through an asynchronous buffering layer.&lt;/p&gt;

&lt;p&gt;That layer ended up being one of the most important production decisions in the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  PTY process creation
&lt;/h2&gt;

&lt;p&gt;Once the session is fully initialized, the agent forks a real PTY:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child_pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pty&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;child_pid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;psql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-U&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dbname&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point the architecture fundamentally changes.&lt;/p&gt;

&lt;p&gt;This is no longer “web infrastructure”.&lt;/p&gt;

&lt;p&gt;It becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PTY supervision&lt;/li&gt;
&lt;li&gt;file descriptor management&lt;/li&gt;
&lt;li&gt;process lifecycle handling&lt;/li&gt;
&lt;li&gt;signal propagation&lt;/li&gt;
&lt;li&gt;backpressure management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most complexity appeared after this step.&lt;/p&gt;

&lt;p&gt;Not before it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real data pipeline
&lt;/h2&gt;

&lt;p&gt;This is the most important flow in the system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser
  │
  │ keystroke
  ▼
Control Plane
  │
  ▼
Agent WS handler
  │
  │ write(fd)
  ▼
PTY → psql
  │
  │ stdout
  ▼
PTY reader thread
  │
  │ Redis XADD
  ▼
Redis Streams
  │
  │ async consumer
  ▼
Control Plane
  │
  │ WS push
  ▼
Browser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important line in the entire architecture is this:&lt;br&gt;
&lt;code&gt;PTY reader → Redis XADD → async consumer → WebSocket&lt;/code&gt;&lt;br&gt;
That line is the system’s stability boundary.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Redis Streams became mandatory
&lt;/h2&gt;

&lt;p&gt;The original implementation directly forwarded PTY output into WebSocket writes:&lt;br&gt;
&lt;code&gt;PTY → WebSocket&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It worked in development.&lt;/p&gt;

&lt;p&gt;It failed in production.&lt;br&gt;
The issue was subtle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PTY reads are synchronous&lt;/li&gt;
&lt;li&gt;WebSocket writes can block&lt;/li&gt;
&lt;li&gt;backpressure propagates backwards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The resulting failure mode was catastrophic for terminal UX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slow network
    ↓
blocked WS writes
    ↓
frozen PTY reader
    ↓
terminal stalls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The terminal looked dead while psql was still running underneath.&lt;/p&gt;

&lt;p&gt;Redis Streams solved this by introducing a decoupling boundary.&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PTY reads stay non-blocking&lt;/li&gt;
&lt;li&gt;network latency becomes isolated&lt;/li&gt;
&lt;li&gt;consumers can temporarily lag&lt;/li&gt;
&lt;li&gt;output survives reconnects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The additional latency was negligible.&lt;/p&gt;

&lt;p&gt;The operational stability improvement was enormous.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture is actually two independent loops
&lt;/h2&gt;

&lt;p&gt;This is the part most terminal architectures hide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input loop&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;Browser → Control Plane → Agent → PTY&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output loop&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;PTY → Redis → Control Plane → Browser&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;These loops are intentionally independent.&lt;/p&gt;

&lt;p&gt;That separation is what allows the system to survive partial failures.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why we split browser and agent handlers
&lt;/h2&gt;

&lt;p&gt;We intentionally kept browser-facing and agent-facing handlers separate.&lt;/p&gt;

&lt;p&gt;Because they solve fundamentally different problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser Handler&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;auth&lt;/li&gt;
&lt;li&gt;user session ownership&lt;/li&gt;
&lt;li&gt;browser disconnect semantics&lt;/li&gt;
&lt;li&gt;user errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Agent Handler&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PTY lifecycle&lt;/li&gt;
&lt;li&gt;process supervision&lt;/li&gt;
&lt;li&gt;reconnect semantics&lt;/li&gt;
&lt;li&gt;infrastructure errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trying to merge them created tightly coupled failure modes and significantly more lifecycle complexity.&lt;/p&gt;

&lt;p&gt;Separating them made the system dramatically easier to reason about.&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure modes that mattered in production
&lt;/h2&gt;

&lt;p&gt;The hardest problems were not PostgreSQL problems.&lt;/p&gt;

&lt;p&gt;They were long-lived process problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A. Redis failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;output pipeline breaks&lt;/li&gt;
&lt;li&gt;PTY continues running&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memory limits&lt;/li&gt;
&lt;li&gt;retention limits&lt;/li&gt;
&lt;li&gt;monitoring&lt;/li&gt;
&lt;li&gt;bounded stream lifetime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;B. Agent disconnect&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transport disappears&lt;/li&gt;
&lt;li&gt;PTY may still be alive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reconnect window&lt;/li&gt;
&lt;li&gt;session reattachment&lt;/li&gt;
&lt;li&gt;delayed teardown&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;C. Process explosion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memory exhaustion&lt;/li&gt;
&lt;li&gt;PostgreSQL connection storms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;br&gt;
&lt;code&gt;BoundedSemaphore(max_sessions=10)&lt;/code&gt;&lt;br&gt;
This was one of the simplest and most effective safeguards in the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;D. xterm resize storms&lt;/strong&gt;&lt;br&gt;
xterm.js emits resize events aggressively during browser resizing.&lt;/p&gt;

&lt;p&gt;Impact:&lt;br&gt;
Each resize triggers:&lt;br&gt;
&lt;code&gt;ioctl(TIOCSWINSZ)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without throttling, the PTY spent significant time processing resize events instead of actual terminal traffic.&lt;/li&gt;
&lt;li&gt;Simple debounce logic completely fixed the issue.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Scaling reality
&lt;/h2&gt;

&lt;p&gt;The system does not scale like a normal WebSocket service.&lt;/p&gt;

&lt;p&gt;Each session includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a real &lt;code&gt;psql&lt;/code&gt; process&lt;/li&gt;
&lt;li&gt;a PTY&lt;/li&gt;
&lt;li&gt;multiple threads&lt;/li&gt;
&lt;li&gt;Redis streams&lt;/li&gt;
&lt;li&gt;two WebSocket channels&lt;/li&gt;
&lt;li&gt;a database connection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scaling bottleneck is not Redis.&lt;/p&gt;

&lt;p&gt;It is not CPU.&lt;/p&gt;

&lt;p&gt;It is not WebSockets.&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how many real PostgreSQL sessions the infrastructure can sustain&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Why HTTP and SSE were rejected
&lt;/h2&gt;

&lt;p&gt;We evaluated both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;br&gt;
Failed because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stateless&lt;/li&gt;
&lt;li&gt;no streaming terminal semantics&lt;/li&gt;
&lt;li&gt;no signal handling&lt;/li&gt;
&lt;li&gt;no persistent shell state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SSE&lt;/strong&gt;&lt;br&gt;
Failed because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one-directional transport&lt;/li&gt;
&lt;li&gt;incompatible with terminal interaction patterns&lt;/li&gt;
&lt;li&gt;xterm.js expects bidirectional communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the end, terminals naturally map onto WebSockets.&lt;/p&gt;

&lt;p&gt;Trying to avoid that only complicates the architecture.&lt;/p&gt;
&lt;h2&gt;
  
  
  What this system actually is
&lt;/h2&gt;

&lt;p&gt;If you remove all abstractions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this is a distributed process supervisor for a PTY running &lt;code&gt;psql&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Everything else is transport, routing, buffering, and failure handling around that core idea.&lt;/p&gt;
&lt;h2&gt;
  
  
  Final architecture insight
&lt;/h2&gt;

&lt;p&gt;The system is ultimately defined by three separations.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection separation&lt;/strong&gt;&lt;br&gt;
The Control Plane isolates browsers from agents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Process separation&lt;/strong&gt;&lt;br&gt;
PTY isolates PostgreSQL from the web layer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flow separation&lt;/strong&gt;&lt;br&gt;
Redis isolates terminal I/O from network I/O.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Final mental model
&lt;/h2&gt;

&lt;p&gt;If you understand only one thing, understand this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser ↔ Control Plane ↔ Agent ↔ PTY ↔ psql
                     ↑
              Redis is the buffer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else is lifecycle management around this chain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;We did not build a “web UI for PostgreSQL”.&lt;/p&gt;

&lt;p&gt;We built a distributed, fault-tolerant runtime for a stateful terminal process.&lt;/p&gt;

&lt;p&gt;PostgreSQL just happened to be the process attached to it.&lt;/p&gt;

</description>
      <category>websockets</category>
      <category>redis</category>
      <category>architecture</category>
      <category>python</category>
    </item>
  </channel>
</rss>
