<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Zhdankov</title>
    <description>The latest articles on DEV Community by Alex Zhdankov (@alex_zhdankov).</description>
    <link>https://dev.to/alex_zhdankov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3929959%2Fb8e74cb6-a829-45e7-820e-5ae2678fb69a.png</url>
      <title>DEV Community: Alex Zhdankov</title>
      <link>https://dev.to/alex_zhdankov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alex_zhdankov"/>
    <language>en</language>
    <item>
      <title>SQLite as an offline metrics buffer in a monitoring agent</title>
      <dc:creator>Alex Zhdankov</dc:creator>
      <pubDate>Thu, 28 May 2026 14:53:32 +0000</pubDate>
      <link>https://dev.to/alex_zhdankov/sqlite-as-an-offline-metrics-buffer-in-a-monitoring-agent-5c3k</link>
      <guid>https://dev.to/alex_zhdankov/sqlite-as-an-offline-metrics-buffer-in-a-monitoring-agent-5c3k</guid>
      <description>&lt;p&gt;&lt;em&gt;Reliable monitoring becomes surprisingly hard once you assume the network will eventually fail.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Monitoring systems quietly depend on one dangerous assumption:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the monitoring server will always be reachable&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In production, that assumption breaks constantly.&lt;/p&gt;

&lt;p&gt;And once it breaks,&lt;br&gt;
you discover that metrics pipelines are really distributed systems.&lt;/p&gt;
&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Our PostgreSQL monitoring agent periodically collects metrics and sends them to the Control Plane.&lt;/p&gt;

&lt;p&gt;Under normal conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent
   │
   └──► metrics ───► Control Plane ───► storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple enough.&lt;/p&gt;

&lt;p&gt;But then production networks happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPN reconnects&lt;/li&gt;
&lt;li&gt;firewall reloads&lt;/li&gt;
&lt;li&gt;DNS failures&lt;/li&gt;
&lt;li&gt;routing instability&lt;/li&gt;
&lt;li&gt;Control Plane maintenance windows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once connectivity drops,&lt;br&gt;
the entire delivery pipeline becomes unreliable.&lt;/p&gt;

&lt;p&gt;The obvious answer sounds simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“just retry later”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That idea turned out to be much harder than expected.&lt;/p&gt;

&lt;p&gt;Questions immediately appeared:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how long should metrics stay in memory?&lt;/li&gt;
&lt;li&gt;what happens during process restart?&lt;/li&gt;
&lt;li&gt;how do we avoid unbounded memory growth?&lt;/li&gt;
&lt;li&gt;how do we recover after long outages?&lt;/li&gt;
&lt;li&gt;how do we know delivery is degraded?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point,&lt;br&gt;
this stopped being “metrics collection”.&lt;/p&gt;

&lt;p&gt;It became a buffering and reliability problem.&lt;/p&gt;
&lt;h2&gt;
  
  
  The incident that changed the design
&lt;/h2&gt;

&lt;p&gt;One outage lasted almost six hours after a network partition between monitoring agents and the Control Plane.&lt;/p&gt;

&lt;p&gt;The agents themselves stayed healthy.&lt;br&gt;
PostgreSQL stayed healthy.&lt;/p&gt;

&lt;p&gt;Only the network path disappeared.&lt;/p&gt;

&lt;p&gt;The original implementation buffered metrics in memory.&lt;/p&gt;

&lt;p&gt;At 4:12 AM, one agent restarted during a package update.&lt;/p&gt;

&lt;p&gt;Six hours of buffered monitoring data vanished instantly.&lt;/p&gt;

&lt;p&gt;The dashboards looked like this:&lt;br&gt;
&lt;code&gt;timeline ─────── gap ─────── timeline&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;No visibility.&lt;br&gt;
No recovery.&lt;br&gt;
No idea what happened during the outage.&lt;/p&gt;

&lt;p&gt;That incident permanently killed the idea of in-memory buffering.&lt;/p&gt;

&lt;p&gt;Metrics had to survive process restarts.&lt;/p&gt;
&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;The final design became intentionally simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                PostgreSQL
                     │
                     ▼
              metrics collector
                     │
                     ▼
          +----------------------+
          |      SQLite WAL      |
          |----------------------|
          | durable local queue  |
          | bounded retention    |
          +----------------------+
                     │
                     ▼
             batch delivery loop
                     │
                     ▼
               Control Plane
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key architectural shift:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the Control Plane is no longer in the critical path of collection&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Metrics are always written locally first.&lt;/p&gt;

&lt;p&gt;Network delivery became asynchronous.&lt;/p&gt;

&lt;p&gt;That single separation changed the reliability model completely.&lt;/p&gt;

&lt;p&gt;The network could fail for hours,&lt;br&gt;
and the agent would continue operating normally.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why SQLite
&lt;/h2&gt;

&lt;p&gt;The interesting part here is not SQL.&lt;/p&gt;

&lt;p&gt;It's what SQLite fundamentally is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a single file&lt;/li&gt;
&lt;li&gt;atomic writes&lt;/li&gt;
&lt;li&gt;crash-safe persistence&lt;/li&gt;
&lt;li&gt;zero infrastructure&lt;/li&gt;
&lt;li&gt;included in Python itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We evaluated alternatives.&lt;/p&gt;
&lt;h3&gt;
  
  
  In-memory queue
&lt;/h3&gt;

&lt;p&gt;Failed because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;metrics disappear on restart&lt;/li&gt;
&lt;li&gt;memory usage becomes unbounded during long outages&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Plain append-only files
&lt;/h3&gt;

&lt;p&gt;Failed because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;manual locking&lt;/li&gt;
&lt;li&gt;manual serialization&lt;/li&gt;
&lt;li&gt;corruption handling&lt;/li&gt;
&lt;li&gt;cleanup complexity&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Redis
&lt;/h3&gt;

&lt;p&gt;We already use Redis elsewhere in the platform (for the psql terminal).&lt;br&gt;&lt;br&gt;
But for the metrics buffer, Redis was the wrong fit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis is in-memory by default. Persistence (RDB/AOF) is optional and adds latency.&lt;/li&gt;
&lt;li&gt;If Redis restarts, the buffer disappears — even if the agent is healthy.&lt;/li&gt;
&lt;li&gt;Mixing low-latency terminal traffic with background batch writes would create contention.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQLite gives us durability on disk, survives agent restarts, and has zero dependencies outside the agent process.&lt;br&gt;&lt;br&gt;
Redis is great for real-time coordination. SQLite is better for durable local buffering.&lt;/p&gt;
&lt;h2&gt;
  
  
  The schema is intentionally tiny
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;metrics_queue&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;          &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="n"&gt;AUTOINCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instance_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_type&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;data&lt;/span&gt;        &lt;span class="nb"&gt;TEXT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;One row per metric type per collection cycle.&lt;/p&gt;

&lt;p&gt;Payloads are JSON-encoded.&lt;/p&gt;

&lt;p&gt;This is not relational modeling.&lt;/p&gt;

&lt;p&gt;This is durable buffering.&lt;/p&gt;
&lt;h2&gt;
  
  
  Concurrency: why SQLite didn't become a bottleneck
&lt;/h2&gt;

&lt;p&gt;One obvious question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“doesn't SQLite lock the whole database?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yes.&lt;br&gt;
And that matters.&lt;/p&gt;

&lt;p&gt;The reason it worked here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exactly one writer exists&lt;/li&gt;
&lt;li&gt;the collector loop owns all inserts&lt;/li&gt;
&lt;li&gt;delivery workers only read/delete in batches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also enabled WAL mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;PRAGMA&lt;/span&gt; &lt;span class="n"&gt;journal_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WAL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;PRAGMA&lt;/span&gt; &lt;span class="n"&gt;synchronous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NORMAL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;WAL mode mattered for one specific reason:&lt;/p&gt;

&lt;p&gt;the collector continuously appends metrics,&lt;br&gt;
while the delivery loop simultaneously reads and deletes batches.&lt;/p&gt;

&lt;p&gt;Without WAL,&lt;br&gt;
those operations would frequently block each other.&lt;/p&gt;

&lt;p&gt;WAL allows readers to continue while writes are happening.&lt;/p&gt;

&lt;p&gt;We also use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;PRAGMA&lt;/span&gt; &lt;span class="n"&gt;synchronous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NORMAL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reduces fsync frequency while still preserving crash consistency,&lt;br&gt;
which was a better latency/durability tradeoff for monitoring data.&lt;/p&gt;

&lt;p&gt;For our workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;collection every 10–30 seconds&lt;/li&gt;
&lt;li&gt;batches of 100–500 rows&lt;/li&gt;
&lt;li&gt;occasional degraded-network bursts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQLite contention was negligible.&lt;/p&gt;

&lt;p&gt;If we needed hundreds of concurrent writers,&lt;br&gt;
we would likely move to PostgreSQL instead.&lt;/p&gt;

&lt;p&gt;For this system,&lt;br&gt;
operational simplicity mattered more than horizontal scalability.&lt;/p&gt;
&lt;h2&gt;
  
  
  The buffer state machine
&lt;/h2&gt;

&lt;p&gt;The buffer itself behaves like a bounded queue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COLLECT
   │
   ├── buffer healthy ──► INSERT
   │
   └── limits exceeded
              │
              ▼
         DROP + WARN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two limits control growth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maximum file size&lt;/li&gt;
&lt;li&gt;maximum record age&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those constraints turned out to be operationally critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why age-based cleanup matters
&lt;/h2&gt;

&lt;p&gt;Imagine a six-hour outage.&lt;/p&gt;

&lt;p&gt;Without retention limits,&lt;br&gt;
the moment connectivity returns,&lt;br&gt;
the agent floods the Control Plane with six hours of historical metrics.&lt;/p&gt;

&lt;p&gt;On dashboards, that looks like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;timeline ─── gap ─── MASSIVE SPIKE ─── normal&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Operationally, that's worse than losing data.&lt;/p&gt;

&lt;p&gt;The spike is misleading.&lt;/p&gt;

&lt;p&gt;So we intentionally discard stale records.&lt;/p&gt;

&lt;p&gt;If buffered metrics exceed max_age_hours,&lt;br&gt;
they are silently removed during cleanup.&lt;/p&gt;

&lt;p&gt;The result becomes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;timeline ─── gap ─── normal&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The outage remains visible.&lt;br&gt;
But the monitoring data stays trustworthy.&lt;/p&gt;

&lt;p&gt;That tradeoff mattered more than perfect retention.&lt;/p&gt;
&lt;h2&gt;
  
  
  Batch delivery
&lt;/h2&gt;

&lt;p&gt;The delivery loop periodically reconstructs batches from SQLite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;metrics_queue&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rows are grouped back into metric payloads and sent to the Control Plane.&lt;br&gt;
Importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;rows are deleted only after successful delivery&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;send_to_control_plane&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        DELETE FROM metrics_queue
        WHERE id IN (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a very simple delivery guarantee:&lt;br&gt;
&lt;code&gt;INSERT → SEND → DELETE&lt;/code&gt;&lt;br&gt;
If the process crashes before DELETE,&lt;br&gt;
the metrics remain in SQLite and are retried later.&lt;/p&gt;

&lt;p&gt;That makes delivery effectively at-least-once.&lt;/p&gt;

&lt;p&gt;This effectively turns SQLite into a durable local write-ahead queue.&lt;/p&gt;
&lt;h2&gt;
  
  
  The second production incident
&lt;/h2&gt;

&lt;p&gt;Several months later,&lt;br&gt;
another issue appeared.&lt;/p&gt;

&lt;p&gt;Monitoring dashboards showed healthy agents,&lt;br&gt;
but metrics were arriving hours late.&lt;/p&gt;

&lt;p&gt;The agents were buffering correctly.&lt;/p&gt;

&lt;p&gt;The problem was:&lt;br&gt;
the SQLite files had silently grown to hundreds of megabytes after a prolonged outage.&lt;/p&gt;

&lt;p&gt;Delivery throughput could not drain the backlog fast enough.&lt;/p&gt;

&lt;p&gt;At that point we realized:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the buffer itself needed observability&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Buffer observability
&lt;/h2&gt;

&lt;p&gt;Every heartbeat sent to the Control Plane now includes buffer state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;'buffer_size_bytes':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;'oldest_record_age':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;'buffer_utilization':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;utilization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Operationally, the most useful metrics became:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metric             | Meaning                  | Operational action
───────────────────+──────────────────────────+────────────────────────
buffer_size_bytes  | Current SQLite file size | Alert if &amp;gt;100 MB
oldest_record_age  | Delivery lag             | Alert if &amp;gt;1 hour
buffer_utilization | Current size / max size  | Scale retention policy
delivery_failures  | Consecutive failed sends | Investigate connectivity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without these metrics,&lt;br&gt;
network degradation looked identical to healthy operation.&lt;/p&gt;

&lt;p&gt;The buffer itself became another production subsystem&lt;br&gt;
that required monitoring.&lt;/p&gt;
&lt;h2&gt;
  
  
  Production numbers
&lt;/h2&gt;

&lt;p&gt;Typical production values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;collection interval: 10–30 seconds&lt;/li&gt;
&lt;li&gt;average batch size: 100–500 rows&lt;/li&gt;
&lt;li&gt;SQLite buffer limit: 128 MB&lt;/li&gt;
&lt;li&gt;longest outage survived: ~6 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under normal degraded-network conditions,&lt;br&gt;
the SQLite file rarely exceeded 10–15 MB.&lt;/p&gt;

&lt;p&gt;CPU overhead was effectively negligible.&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure modes we actually cared about
&lt;/h2&gt;
&lt;h3&gt;
  
  
  A. Agent restart during outage
&lt;/h3&gt;

&lt;p&gt;Metrics survive because the buffer lives on disk.&lt;br&gt;
Delivery resumes automatically after restart.&lt;/p&gt;
&lt;h3&gt;
  
  
  B. Control Plane recovery after long outage
&lt;/h3&gt;

&lt;p&gt;Fresh metrics resume immediately.&lt;br&gt;
Stale buffered records expire automatically.&lt;br&gt;
No misleading spikes.&lt;/p&gt;
&lt;h3&gt;
  
  
  C. Disk pressure
&lt;/h3&gt;

&lt;p&gt;Once the buffer exceeds the configured size limit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;new writes stop&lt;/li&gt;
&lt;li&gt;warnings are emitted&lt;/li&gt;
&lt;li&gt;collection pauses temporarily&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Delivery resumes automatically once backlog drains.&lt;/p&gt;
&lt;h3&gt;
  
  
  D. SQLite corruption
&lt;/h3&gt;

&lt;p&gt;Rare, but possible.&lt;/p&gt;

&lt;p&gt;We intentionally kept handling simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;log the error&lt;/li&gt;
&lt;li&gt;recreate the buffer&lt;/li&gt;
&lt;li&gt;continue operation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a monitoring system,&lt;br&gt;
temporary metric loss is acceptable.&lt;/p&gt;

&lt;p&gt;Operational complexity was not worth stronger guarantees.&lt;/p&gt;
&lt;h2&gt;
  
  
  Bounded loss over unbounded growth
&lt;/h2&gt;

&lt;p&gt;One important design decision:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the system intentionally prefers bounded data loss over unbounded resource growth&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Old metrics eventually expire.&lt;/p&gt;

&lt;p&gt;The SQLite file has a hard size limit.&lt;/p&gt;

&lt;p&gt;That tradeoff was deliberate.&lt;/p&gt;

&lt;p&gt;For operational monitoring,&lt;br&gt;
partial visibility is better than:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a dead agent&lt;/li&gt;
&lt;li&gt;an exhausted disk&lt;/li&gt;
&lt;li&gt;or an overloaded recovery pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Perfect durability was never the goal.&lt;/p&gt;

&lt;p&gt;Operational survivability was.&lt;/p&gt;
&lt;h2&gt;
  
  
  The mental model
&lt;/h2&gt;

&lt;p&gt;If you understand only one thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect
   ▼
SQLite
   ▼
deliver
   ▼
delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SQLite here is not acting as a database.&lt;/p&gt;

&lt;p&gt;It's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a durable queue&lt;/li&gt;
&lt;li&gt;with bounded retention&lt;/li&gt;
&lt;li&gt;bounded size&lt;/li&gt;
&lt;li&gt;atomic writes&lt;/li&gt;
&lt;li&gt;and zero infrastructure dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the entire design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The interesting realization was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;monitoring reliability problems are mostly buffering problems&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once network delivery became unreliable,&lt;br&gt;
the architecture naturally evolved toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local durability&lt;/li&gt;
&lt;li&gt;asynchronous delivery&lt;/li&gt;
&lt;li&gt;bounded retention&lt;/li&gt;
&lt;li&gt;explicit degradation handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not because SQLite is magical.&lt;/p&gt;

&lt;p&gt;Because a single durable file solved the exact failure mode we actually had.&lt;/p&gt;

</description>
      <category>python</category>
      <category>postgres</category>
      <category>sqlite</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why your SSH scripts will fail in production</title>
      <dc:creator>Alex Zhdankov</dc:creator>
      <pubDate>Mon, 18 May 2026 15:41:40 +0000</pubDate>
      <link>https://dev.to/alex_zhdankov/why-your-ssh-scripts-will-fail-in-production-4cb8</link>
      <guid>https://dev.to/alex_zhdankov/why-your-ssh-scripts-will-fail-in-production-4cb8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Remote command execution looks trivial — until unstable networks, retries, long-running commands, and half-open connections turn it into a reliability problem.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We use Paramiko with a thin supervision layer on top.&lt;br&gt;
The same operational problems apply to AsyncSSH, Fabric, or plain OpenSSH subprocesses.&lt;/p&gt;

&lt;p&gt;At first, the implementation looked completely straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;paramiko&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SSHClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stderr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;systemctl restart postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In development, this worked perfectly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then production happened&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hundreds of hosts.&lt;/li&gt;
&lt;li&gt;Unstable networks.&lt;/li&gt;
&lt;li&gt;Long-running commands.&lt;/li&gt;
&lt;li&gt;Frozen sessions.&lt;/li&gt;
&lt;li&gt;Half-open connections.&lt;/li&gt;
&lt;li&gt;Retries.&lt;/li&gt;
&lt;li&gt;Partial execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point this stopped being &lt;em&gt;“SSH scripting”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It became a distributed systems problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  SSH is deceptively simple
&lt;/h2&gt;

&lt;p&gt;Most developers intuitively model SSH like this:&lt;br&gt;
&lt;code&gt;local subprocess, but remote&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But production SSH execution is actually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;network transport
+ stateful session
+ interactive channel
+ remote process lifecycle
+ unreliable infrastructure
+ partial execution visibility
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And failures can happen independently at every layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application
    ↓
SSH Client
    ↓
TCP transport        ← packets can vanish
    ↓
SSH session          ← can hang without closing
    ↓
Remote shell         ← can ignore commands
    ↓
Process execution    ← may continue after disconnect
    ↓
stdout/stderr        ← can block forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This distinction changes everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure mode #1 — execution uncertainty
&lt;/h2&gt;

&lt;p&gt;This was the first major production lesson.&lt;/p&gt;

&lt;p&gt;If the SSH transport dies, you do not know whether the command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;succeeded&lt;/li&gt;
&lt;li&gt;failed&lt;/li&gt;
&lt;li&gt;partially executed&lt;/li&gt;
&lt;li&gt;is still running remotely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That uncertainty completely changes retry semantics.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl restart postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the connection drops immediately after sending the command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;did restart begin?&lt;/li&gt;
&lt;li&gt;is postgres still restarting?&lt;/li&gt;
&lt;li&gt;did it already succeed?&lt;/li&gt;
&lt;li&gt;is the service now dead?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You no longer have execution certainty.&lt;/p&gt;

&lt;p&gt;This is not a &lt;em&gt;“Paramiko problem”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is a distributed systems problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry is dangerous
&lt;/h2&gt;

&lt;p&gt;Retries sound harmless until commands become stateful.&lt;/p&gt;

&lt;p&gt;Some operations are naturally &lt;em&gt;idempotent&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/meminfo
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /etc
systemctl status postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Others are not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;useradd deploy
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /some/path
systemctl restart postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A failed transport does not imply failed execution.&lt;/p&gt;

&lt;p&gt;That means naive retry logic can create destructive side effects.&lt;/p&gt;

&lt;p&gt;This forced us to separate failures into two categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transport uncertainty&lt;/li&gt;
&lt;li&gt;command failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are fundamentally different operational states.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timeouts are not one thing
&lt;/h2&gt;

&lt;p&gt;One of the most common mistakes in SSH automation is treating timeout as a single concept.&lt;/p&gt;

&lt;p&gt;Production systems usually need several independent timeout layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TCP connect timeout&lt;/li&gt;
&lt;li&gt;SSH handshake timeout&lt;/li&gt;
&lt;li&gt;authentication timeout&lt;/li&gt;
&lt;li&gt;command execution timeout&lt;/li&gt;
&lt;li&gt;idle/read timeout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each failure means something different operationally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;banner_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But even that is insufficient.&lt;/p&gt;

&lt;p&gt;A command may still hang forever while the socket technically remains alive.&lt;/p&gt;

&lt;p&gt;That distinction matters a lot in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Half-open connections are nasty
&lt;/h2&gt;

&lt;p&gt;This became one of the hardest reliability problems.&lt;/p&gt;

&lt;p&gt;Sometimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TCP stays alive&lt;/li&gt;
&lt;li&gt;SSH transport stays alive&lt;/li&gt;
&lt;li&gt;but the remote process is effectively dead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;packets silently disappear&lt;/li&gt;
&lt;li&gt;the remote kernel freezes&lt;/li&gt;
&lt;li&gt;stdout stops forever&lt;/li&gt;
&lt;li&gt;but the socket never closes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the application perspective:&lt;br&gt;
&lt;code&gt;everything looks connected&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;while the operation is permanently stalled.&lt;/p&gt;

&lt;p&gt;This is the classic half-open connection problem.&lt;/p&gt;
&lt;h2&gt;
  
  
  Blocking reads break automation
&lt;/h2&gt;

&lt;p&gt;This code looks innocent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But under real workloads it becomes dangerous.&lt;/p&gt;

&lt;p&gt;If:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the command hangs&lt;/li&gt;
&lt;li&gt;stdout stops producing data&lt;/li&gt;
&lt;li&gt;the socket remains alive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then:&lt;br&gt;
&lt;code&gt;the thread blocks forever&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We eventually moved to &lt;em&gt;streaming execution&lt;/em&gt; instead of buffered reads.&lt;/p&gt;
&lt;h2&gt;
  
  
  Streaming changes the execution model
&lt;/h2&gt;

&lt;p&gt;Long-running commands fundamentally change how remote execution must be handled.&lt;/p&gt;

&lt;p&gt;Operations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pg_dump&lt;/li&gt;
&lt;li&gt;VACUUM&lt;/li&gt;
&lt;li&gt;package upgrades&lt;/li&gt;
&lt;li&gt;log exports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;can run for minutes or hours.&lt;/p&gt;

&lt;p&gt;Buffering all output in memory is unreliable.&lt;br&gt;
Blocking until completion destroys &lt;em&gt;observability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Instead we switched to chunked streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit_status_ready&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv_ready&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This solved several production problems simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;realtime progress visibility&lt;/li&gt;
&lt;li&gt;lower memory usage&lt;/li&gt;
&lt;li&gt;cancellation support&lt;/li&gt;
&lt;li&gt;dead session detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streaming ended up being much more operationally stable than buffered execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security becomes infrastructure, not validation
&lt;/h2&gt;

&lt;p&gt;Another important lesson:&lt;/p&gt;

&lt;p&gt;SSH automation is remote code execution infrastructure.&lt;/p&gt;

&lt;p&gt;That means command construction rules matter enormously.&lt;/p&gt;

&lt;p&gt;This is catastrophic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rm -rf &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because eventually someone passes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/home/user; rm -rf /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We ended up treating all remote commands as infrastructure-sensitive operations.&lt;/p&gt;

&lt;p&gt;Input validation alone was insufficient.&lt;/p&gt;

&lt;p&gt;Every dynamic argument had to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validated&lt;/li&gt;
&lt;li&gt;escaped&lt;/li&gt;
&lt;li&gt;constrained
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;safe_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shlex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even simple automation eventually becomes security-critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resource cleanup matters more than expected
&lt;/h2&gt;

&lt;p&gt;SSH resources leak surprisingly easily.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Channels.&lt;/li&gt;
&lt;li&gt;Sockets.&lt;/li&gt;
&lt;li&gt;Transports.&lt;/li&gt;
&lt;li&gt;PTY buffers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under load, forgotten cleanup accumulates fast.&lt;/p&gt;

&lt;p&gt;We eventually standardized all operations around explicit lifecycle management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;ssh_operation&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ssh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ssh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part was not aesthetics.&lt;/p&gt;

&lt;p&gt;It was guaranteeing cleanup under:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exceptions&lt;/li&gt;
&lt;li&gt;timeouts&lt;/li&gt;
&lt;li&gt;partial failures&lt;/li&gt;
&lt;li&gt;interrupted execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production automation lives or dies on &lt;em&gt;cleanup guarantees&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture we ended up with
&lt;/h2&gt;

&lt;p&gt;Over time the system evolved into several independent layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Connection management
    ↓
Retry classification
    ↓
Execution supervision
    ↓
Streaming transport
    ↓
Resource cleanup
    ↓
Observability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important realization was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;remote execution is not a helper function&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final insight
&lt;/h2&gt;

&lt;p&gt;The happy path is trivial.&lt;/p&gt;

&lt;p&gt;Production architecture begins where execution certainty ends.&lt;/p&gt;

&lt;p&gt;SSH automation fails when treated like scripting.&lt;/p&gt;

&lt;p&gt;Because it is not scripting.&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;remote process orchestration&lt;/li&gt;
&lt;li&gt;over unreliable transport&lt;/li&gt;
&lt;li&gt;with partial execution visibility&lt;/li&gt;
&lt;li&gt;inside a distributed system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And once you accept that,&lt;br&gt;
the architecture changes completely.&lt;/p&gt;

</description>
      <category>ssh</category>
      <category>python</category>
      <category>distributedsystems</category>
      <category>devops</category>
    </item>
    <item>
      <title>We built a real psql terminal in the browser. Here’s what made it unexpectedly hard.</title>
      <dc:creator>Alex Zhdankov</dc:creator>
      <pubDate>Wed, 13 May 2026 20:53:07 +0000</pubDate>
      <link>https://dev.to/alex_zhdankov/we-built-a-real-psql-terminal-in-the-browser-heres-what-made-it-unexpectedly-hard-57a1</link>
      <guid>https://dev.to/alex_zhdankov/we-built-a-real-psql-terminal-in-the-browser-heres-what-made-it-unexpectedly-hard-57a1</guid>
      <description>&lt;p&gt;&lt;strong&gt;A PTY-backed PostgreSQL console running in the browser using reverse WebSockets, Redis Streams, and xterm.js — designed around centralized control-plane constraints and production failure modes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We needed a real PostgreSQL terminal inside the browser.&lt;/p&gt;

&lt;p&gt;Not a SQL editor.&lt;br&gt;
Not a query API.&lt;br&gt;
A real psql session with full terminal semantics.&lt;/p&gt;

&lt;p&gt;That requirement immediately forced several architectural constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a real PTY&lt;/li&gt;
&lt;li&gt;a long-lived stateful process&lt;/li&gt;
&lt;li&gt;bidirectional streaming&lt;/li&gt;
&lt;li&gt;terminal resize handling&lt;/li&gt;
&lt;li&gt;signal forwarding (&lt;code&gt;Ctrl+C&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;native &lt;code&gt;psql&lt;/code&gt; behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And then the infrastructure constraints made things significantly more interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;agents live in internal networks&lt;/li&gt;
&lt;li&gt;all traffic must go through the Control Plane&lt;/li&gt;
&lt;li&gt;xterm.js only supports WebSocket transport&lt;/li&gt;
&lt;li&gt;we could not emulate &lt;code&gt;psql&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, this stopped being a “web feature”.&lt;br&gt;
It became a distributed terminal runtime problem.&lt;/p&gt;
&lt;h2&gt;
  
  
  High-level architecture
&lt;/h2&gt;

&lt;p&gt;This system only makes sense if you read it as a &lt;em&gt;dataflow graph&lt;/em&gt;, not as isolated services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser (xterm.js)
    │
    │ WebSocket (terminal I/O)
    ▼
Control Plane
    │
    │ session management + auth
    ▼
Redis Streams (output buffer)
    │
    │ coordination + async delivery
    ▼
Agent WebSocket channel
    │
    │ PTY stdin/stdout bridge
    ▼
PTY → real psql process
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical architectural decision:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the browser never connects to the agent directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Control Plane is the only public entrypoint in the entire system.&lt;br&gt;
Everything flows through it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why the architecture looks “backwards”
&lt;/h2&gt;

&lt;p&gt;The surprising part is that the agent initiates the terminal transport.&lt;/p&gt;

&lt;p&gt;Not because NAT traversal was impossible.&lt;/p&gt;

&lt;p&gt;But because the system was intentionally designed around a centralized Control Plane.&lt;/p&gt;

&lt;p&gt;Agents sit in internal networks.&lt;br&gt;
The browser has no direct visibility into them.&lt;/p&gt;

&lt;p&gt;So instead of:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Browser → Agent&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;the architecture becomes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Browser → Control Plane ← Agent&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The Control Plane acts as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;session coordinator&lt;/li&gt;
&lt;li&gt;auth boundary&lt;/li&gt;
&lt;li&gt;transport router&lt;/li&gt;
&lt;li&gt;lifecycle owner&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once that decision is made, reverse WebSockets become the natural transport model.&lt;/p&gt;
&lt;h2&gt;
  
  
  Session establishment
&lt;/h2&gt;

&lt;p&gt;The session lifecycle happens in multiple stages.&lt;/p&gt;

&lt;p&gt;Importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the PTY process does not exist when the browser first connects&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Only a logical session exists.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1 — Browser creates a logical session
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser
  │
  │ WebSocket connect
  ▼
Control Plane
  ├── creates session_id
  ├── registers browser handler
  └── starts auth timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;At this point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no PTY exists&lt;/li&gt;
&lt;li&gt;no psql exists&lt;/li&gt;
&lt;li&gt;no database connection exists&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Control Plane only knows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“a browser wants a terminal session”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Step 2 — Control Plane signals the agent
&lt;/h2&gt;

&lt;p&gt;The Control Plane sends a lightweight HTTP request:&lt;br&gt;
&lt;code&gt;POST /terminal?session_id=&amp;lt;uuid&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is intentionally the only HTTP hop in the entire terminal lifecycle.&lt;/p&gt;

&lt;p&gt;The request does not carry terminal traffic.&lt;/p&gt;

&lt;p&gt;It only means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“establish terminal transport for this session”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Step 3 — Agent opens reverse WebSocket
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent
  │
  │ outbound WebSocket
  ▼
Control Plane
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now the system has two independent transport channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser WS → Control Plane&lt;/li&gt;
&lt;li&gt;Agent WS   → Control Plane&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they are still disconnected.&lt;/p&gt;

&lt;p&gt;The system is in a &lt;em&gt;half-connected state&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Session stitching
&lt;/h2&gt;

&lt;p&gt;This is the moment where the architecture becomes interesting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser Handler ───────┐
                       ├── session binding
Agent Handler ─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the Control Plane stops being a transport endpoint and becomes a message router&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It now forwards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;browser input → agent&lt;/li&gt;
&lt;li&gt;agent output → browser&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But critically:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;not directly&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All terminal output passes through an asynchronous buffering layer.&lt;/p&gt;

&lt;p&gt;That layer ended up being one of the most important production decisions in the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  PTY process creation
&lt;/h2&gt;

&lt;p&gt;Once the session is fully initialized, the agent forks a real PTY:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child_pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pty&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;child_pid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;psql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-U&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dbname&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point the architecture fundamentally changes.&lt;/p&gt;

&lt;p&gt;This is no longer “web infrastructure”.&lt;/p&gt;

&lt;p&gt;It becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PTY supervision&lt;/li&gt;
&lt;li&gt;file descriptor management&lt;/li&gt;
&lt;li&gt;process lifecycle handling&lt;/li&gt;
&lt;li&gt;signal propagation&lt;/li&gt;
&lt;li&gt;backpressure management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most complexity appeared after this step.&lt;/p&gt;

&lt;p&gt;Not before it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real data pipeline
&lt;/h2&gt;

&lt;p&gt;This is the most important flow in the system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser
  │
  │ keystroke
  ▼
Control Plane
  │
  ▼
Agent WS handler
  │
  │ write(fd)
  ▼
PTY → psql
  │
  │ stdout
  ▼
PTY reader thread
  │
  │ Redis XADD
  ▼
Redis Streams
  │
  │ async consumer
  ▼
Control Plane
  │
  │ WS push
  ▼
Browser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important line in the entire architecture is this:&lt;br&gt;
&lt;code&gt;PTY reader → Redis XADD → async consumer → WebSocket&lt;/code&gt;&lt;br&gt;
That line is the system’s stability boundary.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Redis Streams became mandatory
&lt;/h2&gt;

&lt;p&gt;The original implementation directly forwarded PTY output into WebSocket writes:&lt;br&gt;
&lt;code&gt;PTY → WebSocket&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It worked in development.&lt;/p&gt;

&lt;p&gt;It failed in production.&lt;br&gt;
The issue was subtle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PTY reads are synchronous&lt;/li&gt;
&lt;li&gt;WebSocket writes can block&lt;/li&gt;
&lt;li&gt;backpressure propagates backwards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The resulting failure mode was catastrophic for terminal UX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slow network
    ↓
blocked WS writes
    ↓
frozen PTY reader
    ↓
terminal stalls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The terminal looked dead while psql was still running underneath.&lt;/p&gt;

&lt;p&gt;Redis Streams solved this by introducing a decoupling boundary.&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PTY reads stay non-blocking&lt;/li&gt;
&lt;li&gt;network latency becomes isolated&lt;/li&gt;
&lt;li&gt;consumers can temporarily lag&lt;/li&gt;
&lt;li&gt;output survives reconnects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The additional latency was negligible.&lt;/p&gt;

&lt;p&gt;The operational stability improvement was enormous.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture is actually two independent loops
&lt;/h2&gt;

&lt;p&gt;This is the part most terminal architectures hide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input loop&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;Browser → Control Plane → Agent → PTY&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output loop&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;PTY → Redis → Control Plane → Browser&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;These loops are intentionally independent.&lt;/p&gt;

&lt;p&gt;That separation is what allows the system to survive partial failures.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why we split browser and agent handlers
&lt;/h2&gt;

&lt;p&gt;We intentionally kept browser-facing and agent-facing handlers separate.&lt;/p&gt;

&lt;p&gt;Because they solve fundamentally different problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser Handler&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;auth&lt;/li&gt;
&lt;li&gt;user session ownership&lt;/li&gt;
&lt;li&gt;browser disconnect semantics&lt;/li&gt;
&lt;li&gt;user errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Agent Handler&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PTY lifecycle&lt;/li&gt;
&lt;li&gt;process supervision&lt;/li&gt;
&lt;li&gt;reconnect semantics&lt;/li&gt;
&lt;li&gt;infrastructure errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trying to merge them created tightly coupled failure modes and significantly more lifecycle complexity.&lt;/p&gt;

&lt;p&gt;Separating them made the system dramatically easier to reason about.&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure modes that mattered in production
&lt;/h2&gt;

&lt;p&gt;The hardest problems were not PostgreSQL problems.&lt;/p&gt;

&lt;p&gt;They were long-lived process problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A. Redis failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;output pipeline breaks&lt;/li&gt;
&lt;li&gt;PTY continues running&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memory limits&lt;/li&gt;
&lt;li&gt;retention limits&lt;/li&gt;
&lt;li&gt;monitoring&lt;/li&gt;
&lt;li&gt;bounded stream lifetime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;B. Agent disconnect&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transport disappears&lt;/li&gt;
&lt;li&gt;PTY may still be alive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reconnect window&lt;/li&gt;
&lt;li&gt;session reattachment&lt;/li&gt;
&lt;li&gt;delayed teardown&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;C. Process explosion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memory exhaustion&lt;/li&gt;
&lt;li&gt;PostgreSQL connection storms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;br&gt;
&lt;code&gt;BoundedSemaphore(max_sessions=10)&lt;/code&gt;&lt;br&gt;
This was one of the simplest and most effective safeguards in the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;D. xterm resize storms&lt;/strong&gt;&lt;br&gt;
xterm.js emits resize events aggressively during browser resizing.&lt;/p&gt;

&lt;p&gt;Impact:&lt;br&gt;
Each resize triggers:&lt;br&gt;
&lt;code&gt;ioctl(TIOCSWINSZ)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without throttling, the PTY spent significant time processing resize events instead of actual terminal traffic.&lt;/li&gt;
&lt;li&gt;Simple debounce logic completely fixed the issue.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Scaling reality
&lt;/h2&gt;

&lt;p&gt;The system does not scale like a normal WebSocket service.&lt;/p&gt;

&lt;p&gt;Each session includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a real &lt;code&gt;psql&lt;/code&gt; process&lt;/li&gt;
&lt;li&gt;a PTY&lt;/li&gt;
&lt;li&gt;multiple threads&lt;/li&gt;
&lt;li&gt;Redis streams&lt;/li&gt;
&lt;li&gt;two WebSocket channels&lt;/li&gt;
&lt;li&gt;a database connection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scaling bottleneck is not Redis.&lt;/p&gt;

&lt;p&gt;It is not CPU.&lt;/p&gt;

&lt;p&gt;It is not WebSockets.&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how many real PostgreSQL sessions the infrastructure can sustain&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Why HTTP and SSE were rejected
&lt;/h2&gt;

&lt;p&gt;We evaluated both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;br&gt;
Failed because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stateless&lt;/li&gt;
&lt;li&gt;no streaming terminal semantics&lt;/li&gt;
&lt;li&gt;no signal handling&lt;/li&gt;
&lt;li&gt;no persistent shell state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SSE&lt;/strong&gt;&lt;br&gt;
Failed because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one-directional transport&lt;/li&gt;
&lt;li&gt;incompatible with terminal interaction patterns&lt;/li&gt;
&lt;li&gt;xterm.js expects bidirectional communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the end, terminals naturally map onto WebSockets.&lt;/p&gt;

&lt;p&gt;Trying to avoid that only complicates the architecture.&lt;/p&gt;
&lt;h2&gt;
  
  
  What this system actually is
&lt;/h2&gt;

&lt;p&gt;If you remove all abstractions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this is a distributed process supervisor for a PTY running &lt;code&gt;psql&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Everything else is transport, routing, buffering, and failure handling around that core idea.&lt;/p&gt;
&lt;h2&gt;
  
  
  Final architecture insight
&lt;/h2&gt;

&lt;p&gt;The system is ultimately defined by three separations.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection separation&lt;/strong&gt;&lt;br&gt;
The Control Plane isolates browsers from agents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Process separation&lt;/strong&gt;&lt;br&gt;
PTY isolates PostgreSQL from the web layer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flow separation&lt;/strong&gt;&lt;br&gt;
Redis isolates terminal I/O from network I/O.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Final mental model
&lt;/h2&gt;

&lt;p&gt;If you understand only one thing, understand this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser ↔ Control Plane ↔ Agent ↔ PTY ↔ psql
                     ↑
              Redis is the buffer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else is lifecycle management around this chain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;We did not build a “web UI for PostgreSQL”.&lt;/p&gt;

&lt;p&gt;We built a distributed, fault-tolerant runtime for a stateful terminal process.&lt;/p&gt;

&lt;p&gt;PostgreSQL just happened to be the process attached to it.&lt;/p&gt;

</description>
      <category>websockets</category>
      <category>redis</category>
      <category>architecture</category>
      <category>python</category>
    </item>
  </channel>
</rss>
