Remote command execution looks trivial — until unstable networks, retries, long-running commands, and half-open connections turn it into a reliability problem.
We use Paramiko with a thin supervision layer on top.
The same operational problems apply to AsyncSSH, Fabric, or plain OpenSSH subprocesses.
At first, the implementation looked completely straightforward:
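import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()  # without known host keys, connect() rejects the server
client.connect(hostname=host, username=user)
stdin, stdout, stderr = client.exec_command("systemctl restart postgres")
output = stdout.read().decode()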
In development, this worked perfectly.
Then production happened.
- Hundreds of hosts.
- Unstable networks.
- Long-running commands.
- Frozen sessions.
- Half-open connections.
- Retries.
- Partial execution.
At that point this stopped being “SSH scripting”.
It became a distributed systems problem.
SSH is deceptively simple
Most developers intuitively model SSH like this:
local subprocess, but remote
But production SSH execution is actually:
network transport
+ stateful session
+ interactive channel
+ remote process lifecycle
+ unreliable infrastructure
+ partial execution visibility
And failures can happen independently at every layer.
Application
↓
SSH Client
↓
TCP transport ← packets can vanish
↓
SSH session ← can hang without closing
↓
Remote shell ← can ignore commands
↓
Process execution ← may continue after disconnect
↓
stdout/stderr ← can block forever
This distinction changes everything.
Failure mode #1 — execution uncertainty
This was the first major production lesson.
If the SSH transport dies, you do not know whether the command:
- succeeded
- failed
- partially executed
- is still running remotely
That uncertainty completely changes retry semantics.
For example:
systemctl restart postgres
If the connection drops immediately after sending the command:
- did restart begin?
- is postgres still restarting?
- did it already succeed?
- is the service now dead?
You no longer have execution certainty.
This is not a “Paramiko problem”.
This is a distributed systems problem.
Retry is dangerous
Retries sound harmless until commands become stateful.
Some operations are naturally idempotent:
cat /proc/meminfo
ls -la /etc
systemctl status postgres
Others change state and are not safe to repeat blindly, because the system may have moved on between attempts:
useradd deploy
rm -rf /some/path
systemctl restart postgres
A failed transport does not imply failed execution.
That means naive retry logic can create destructive side effects.
This forced us to separate failures into two categories:
- transport uncertainty
- command failure
Those are fundamentally different operational states.
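One way to make the two categories explicit is to classify every result before any retry decision is made. The following is a minimal sketch, not the system described in this post: the names `Outcome`, `classify`, and `may_retry` are hypothetical, and the policy (retry only when the outcome is uncertain and the caller has marked the command as idempotent) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"                # exit status 0 was received
    COMMAND_FAILED = "command_failed"  # a non-zero exit status was received
    TRANSPORT_UNCERTAIN = "uncertain"  # no exit status ever arrived


@dataclass(frozen=True)
class Result:
    outcome: Outcome
    exit_status: Optional[int] = None


def classify(exit_status: Optional[int]) -> Result:
    """Map the observed exit status (None if the transport died) to an outcome."""
    if exit_status is None:
        return Result(Outcome.TRANSPORT_UNCERTAIN)
    if exit_status == 0:
        return Result(Outcome.SUCCESS, 0)
    return Result(Outcome.COMMAND_FAILED, exit_status)


def may_retry(result: Result, idempotent: bool) -> bool:
    """Assumed policy: retry only uncertain outcomes of idempotent commands."""
    if result.outcome is Outcome.TRANSPORT_UNCERTAIN:
        return idempotent
    # Success needs no retry; a real command failure is reported, not repeated.
    return False
```

With this split, a dropped connection during `systemctl restart postgres` yields `TRANSPORT_UNCERTAIN`, and the caller has to probe the remote state (for example with `systemctl status postgres`) instead of re-sending the command.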
Timeouts are not one thing
One of the most common mistakes in SSH automation is treating timeout as a single concept.
Production systems usually need several independent timeout layers:
- TCP connect timeout
- SSH handshake timeout
- authentication timeout
- command execution timeout
- idle/read timeout
Each failure means something different operationally.
client.connect(
    hostname=host,
    username=username,
    timeout=10,         # TCP connect
    banner_timeout=15,  # SSH banner / handshake
    auth_timeout=15,    # authentication response
)
But even that is insufficient.
A command may still hang forever while the socket technically remains alive.
That distinction matters a lot in production.
Half-open connections are nasty
This became one of the hardest reliability problems.
Sometimes:
- TCP stays alive
- SSH transport stays alive
- but the remote process is effectively dead
Or:
- packets silently disappear
- the remote kernel freezes
- stdout stops forever
- but the socket never closes
From the application perspective:
everything looks connected
while the operation is permanently stalled.
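Strictly speaking, only the second case is the classic half-open connection problem; the first is a stalled session on a healthy link. From the client, the two are indistinguishable.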
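A common partial mitigation is to enable SSH keepalives, so that a dead network path is more likely to surface as a transport error instead of silence. The sketch below assumes `client` is a connected `paramiko.SSHClient`; the helper name `enable_keepalive` and the 30-second default are hypothetical. Keepalives do not prove that the remote command is making progress, so an application-level idle timeout is still needed.

```python
def enable_keepalive(client, interval_seconds=30):
    """Ask the SSH transport to send a keepalive packet every interval.

    `client` is assumed to be a connected paramiko.SSHClient (or any object
    exposing the same get_transport() method).
    """
    transport = client.get_transport()
    if transport is None or not transport.is_active():
        raise RuntimeError("SSH client is not connected")
    # Paramiko sends a small keepalive request on this interval.
    transport.set_keepalive(interval_seconds)
    return transport
```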
Blocking reads break automation
This code looks innocent:
stdout.read()
But under real workloads it becomes dangerous.
If:
- the command hangs
- stdout stops producing data
- the socket remains alive
then:
the thread blocks forever
We eventually moved to streaming execution instead of buffered reads.
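Even before moving to streaming, a blocking read can be bounded. Paramiko's `exec_command` accepts a `timeout` argument that is applied to the channel, so a read that receives no data for that long raises `socket.timeout` instead of blocking forever. The sketch below is an illustration under that assumption; the function name `read_with_idle_timeout` and its return shape are hypothetical, and any partial output is discarded on timeout.

```python
import socket


def read_with_idle_timeout(client, command, idle_timeout=30.0):
    """Run `command` and return (output, timed_out).

    `client` is assumed to be a connected paramiko.SSHClient.
    """
    # The timeout applies to each blocking read on the channel.
    stdin, stdout, stderr = client.exec_command(command, timeout=idle_timeout)
    try:
        return stdout.read().decode(errors="replace"), False
    except socket.timeout:
        # No data arrived in time. The command may still be running
        # remotely, so the outcome is unknown, not failed.
        return "", True
    finally:
        stdout.channel.close()
```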
Streaming changes the execution model
Long-running commands fundamentally change how remote execution must be handled.
Operations like:
- pg_dump
- VACUUM
- package upgrades
- log exports
can run for minutes or hours.
Buffering all output in memory is unreliable.
Blocking until completion destroys observability.
Instead we switched to chunked streaming:
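import time

while not channel.exit_status_ready():
    if channel.recv_ready():
        callback(channel.recv(4096))
    else:
        time.sleep(0.05)  # avoid spinning the CPU while idle
while channel.recv_ready():  # drain output buffered before exit
    callback(channel.recv(4096))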
This solved several production problems simultaneously:
- realtime progress visibility
- lower memory usage
- cancellation support
- dead session detection
Streaming ended up being much more operationally stable than buffered execution.
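Cancellation and dead session detection only work if the streaming loop also tracks time. The following sketch shows one way to supervise a Paramiko-style channel with an idle timeout, a total deadline, and a cancellation hook. It is not the implementation used here: `stream_supervised`, `StreamAborted`, and all parameter names and defaults are hypothetical, and stderr is ignored for brevity.

```python
import time


class StreamAborted(Exception):
    """Raised on idle timeout, total timeout, or cancellation."""


def stream_supervised(channel, callback, idle_timeout=60.0, total_timeout=3600.0,
                      cancelled=lambda: False, poll_interval=0.05,
                      clock=time.monotonic, sleep=time.sleep):
    """Stream stdout chunks to `callback` and return the exit status.

    `channel` is assumed to behave like paramiko.Channel. On abort the
    channel is closed, but the remote process may still be running.
    """
    started = last_data = clock()
    try:
        while True:
            now = clock()
            if cancelled():
                raise StreamAborted("cancelled by caller")
            if now - started > total_timeout:
                raise StreamAborted("total execution time exceeded")
            if channel.recv_ready():
                callback(channel.recv(4096))
                last_data = now
                continue
            if channel.exit_status_ready():
                # Drain anything that arrived just before the exit status.
                while channel.recv_ready():
                    callback(channel.recv(4096))
                return channel.recv_exit_status()
            if now - last_data > idle_timeout:
                raise StreamAborted("no output within idle timeout")
            sleep(poll_interval)
    except StreamAborted:
        channel.close()
        raise
```

Injecting `clock` and `sleep` is a design choice that keeps the timeout logic testable without real waiting. An abort here is a transport-level uncertainty, not a command failure.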
Security becomes infrastructure, not validation
Another important lesson:
SSH automation is remote code execution infrastructure.
That means command construction rules matter enormously.
This is catastrophic:
cmd = f"rm -rf {user_input}"
Because eventually someone passes:
/home/user; rm -rf /
We ended up treating all remote commands as infrastructure-sensitive operations.
Input validation alone was insufficient.
Every dynamic argument had to be:
- validated
- escaped
- constrained
safe_value = shlex.quote(user_input)
Even simple automation eventually becomes security-critical.
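The three requirements (validated, escaped, constrained) can be combined in a single command builder. The sketch below is illustrative only: `build_rm_command`, the `ALLOWED_ROOT` directory, and the character allowlist are hypothetical choices, and the check does not resolve symlinks on the remote host.

```python
import posixpath
import re
import shlex

# Hypothetical policy: deletions are only allowed beneath this directory.
ALLOWED_ROOT = "/srv/app/tmp"
_SAFE_PATH = re.compile(r"[A-Za-z0-9._/-]+")


def build_rm_command(user_path):
    """Validate, constrain, then escape a path before building the command."""
    # Validate: reject anything outside a conservative character set.
    if not _SAFE_PATH.fullmatch(user_path):
        raise ValueError("path contains unexpected characters")
    # Constrain: collapse ".." segments, then require the allowed prefix.
    normalized = posixpath.normpath(user_path)
    if not normalized.startswith(ALLOWED_ROOT + "/"):
        raise ValueError("path is outside the allowed directory")
    # Escape: quote the argument, and use "--" so it is never read as an option.
    return "rm -rf -- " + shlex.quote(normalized)
```

With these rules, `/home/user; rm -rf /` is rejected at the validation step. Even if it were not, `shlex.quote` would turn it into a single quoted argument that the remote shell does not split.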
Resource cleanup matters more than expected
SSH resources leak surprisingly easily.
- Channels.
- Sockets.
- Transports.
- PTY buffers.
Under load, forgotten cleanup accumulates fast.
We eventually standardized all operations around explicit lifecycle management:
with ssh_operation(...) as ssh:
    ssh.execute(...)
The important part was not aesthetics.
It was guaranteeing cleanup under:
- exceptions
- timeouts
- partial failures
- interrupted execution
Production automation lives or dies on cleanup guarantees.
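The post does not show how `ssh_operation` is implemented. A minimal sketch of the cleanup guarantee, assuming a hypothetical zero-argument `connect` factory that returns a connected client with a `close()` method (for example a `paramiko.SSHClient`), could look like this:

```python
from contextlib import contextmanager


@contextmanager
def ssh_operation(connect):
    """Yield a connected client and always close it afterwards.

    `connect` is a hypothetical factory; the real signature of
    ssh_operation in the original system is not known.
    """
    client = connect()
    try:
        yield client
    finally:
        # Runs on normal exit, exceptions, timeouts raised as exceptions,
        # and KeyboardInterrupt alike.
        client.close()
```

The `try`/`finally` around the `yield` is what provides the guarantee: the `with` body can fail in any way and the client is still closed.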
The architecture we ended up with
Over time the system evolved into several independent layers:
Connection management
↓
Retry classification
↓
Execution supervision
↓
Streaming transport
↓
Resource cleanup
↓
Observability
The important realization was:
remote execution is not a helper function
It is infrastructure.
Final insight
The happy path is trivial.
Production architecture begins where execution certainty ends.
SSH automation fails when treated like scripting.
Because it is not scripting.
It is:
- remote process orchestration
- over unreliable transport
- with partial execution visibility
- inside a distributed system
And once you accept that,
the architecture changes completely.