Alex Zhdankov

Posted on May 18

Why your SSH scripts will fail in production

#ssh #python #distributedsystems #devops

Remote command execution looks trivial — until unstable networks, retries, long-running commands, and half-open connections turn it into a reliability problem.

We use Paramiko with a thin supervision layer on top.
The same operational problems apply to AsyncSSH, Fabric, or plain OpenSSH subprocesses.

At first, the implementation looked completely straightforward:

import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()  # connect() rejects unknown host keys by default
client.connect(hostname=host, username=user)

stdin, stdout, stderr = client.exec_command(
    "systemctl restart postgres"
)

output = stdout.read().decode()

In development, this worked perfectly.

Then production happened.

Hundreds of hosts.
Unstable networks.
Long-running commands.
Frozen sessions.
Half-open connections.
Retries.
Partial execution.

At that point this stopped being “SSH scripting”.

It became a distributed systems problem.

SSH is deceptively simple

Most developers intuitively model SSH like this:
local subprocess, but remote

But production SSH execution is actually:

network transport
+ stateful session
+ interactive channel
+ remote process lifecycle
+ unreliable infrastructure
+ partial execution visibility

And failures can happen independently at every layer.

Application
    ↓
SSH Client
    ↓
TCP transport        ← packets can vanish
    ↓
SSH session          ← can hang without closing
    ↓
Remote shell         ← can ignore commands
    ↓
Process execution    ← may continue after disconnect
    ↓
stdout/stderr        ← can block forever

This distinction changes everything.

Failure mode #1 — execution uncertainty

This was the first major production lesson.

If the SSH transport dies, you do not know whether the command:

succeeded
failed
partially executed
is still running remotely

That uncertainty completely changes retry semantics.

For example:

systemctl restart postgres

If the connection drops immediately after sending the command:

did restart begin?
is postgres still restarting?
did it already succeed?
is the service now dead?

You no longer have execution certainty.

This is not a “Paramiko problem”.

This is a distributed systems problem.
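One way to narrow that uncertainty is to have the remote host record the outcome itself, so a later connection can ask what happened instead of guessing. The sketch below is an illustration under assumptions, not a description of the setup in this post: `wrap_with_status_marker`, the `/var/tmp` marker directory, and the operation id format are all hypothetical.

```python
import re
import shlex
from typing import NamedTuple


class MarkedCommand(NamedTuple):
    run: str    # command to send over SSH
    probe: str  # command to run later, on a fresh connection


def wrap_with_status_marker(command: str, op_id: str,
                            marker_dir: str = "/var/tmp") -> MarkedCommand:
    """Wrap `command` so the remote host writes its exit status to a file."""
    # Hypothetical id format: restricting it keeps the marker path predictable.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", op_id):
        raise ValueError(f"unsafe operation id: {op_id!r}")
    marker = f"{marker_dir}/ssh-op-{op_id}.status"
    # The status file is written only after the command has returned.
    script = f"{command}; echo $? > {shlex.quote(marker)}"
    return MarkedCommand(
        run=f"sh -c {shlex.quote(script)}",
        probe=f"cat {shlex.quote(marker)}",
    )
```

If the probe finds the file, the command finished and the file holds its exit status. If the file is missing, the command never started, is still running, or was killed. The marker narrows the uncertainty; it does not remove it.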

Retry is dangerous

Retries sound harmless until commands become stateful.

Some operations are naturally idempotent:

cat /proc/meminfo
ls -la /etc
systemctl status postgres

Others change state and are not safe to repeat blindly:

useradd deploy
rm -rf /some/path
systemctl restart postgres

A failed transport does not imply failed execution.

That means naive retry logic can create destructive side effects.

This forced us to separate failures into two categories:

transport uncertainty
command failure

Those are fundamentally different operational states.
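A minimal sketch of that split, assuming the caller already has either an exit status or a caught exception. The names `Outcome`, `classify`, and `may_retry` are hypothetical, and the list of transport errors is an assumption; with Paramiko you would probably add `paramiko.SSHException` to it.

```python
import socket
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    COMMAND_FAILED = "command_failed"            # remote process reported a non-zero exit
    TRANSPORT_UNCERTAIN = "transport_uncertain"  # the result is unknown


# Assumed set of transport-level errors; extend it for the SSH library in use.
TRANSPORT_ERRORS = (socket.timeout, TimeoutError, ConnectionError, EOFError)


def classify(exit_status=None, error=None) -> Outcome:
    """Map an exit status or a caught exception to one of the two failure kinds."""
    if isinstance(error, TRANSPORT_ERRORS):
        return Outcome.TRANSPORT_UNCERTAIN
    if error is not None:
        raise error  # not a transport problem: let it propagate
    # Paramiko's recv_exit_status() documents -1 when the server sent no status.
    if exit_status is None or exit_status < 0:
        return Outcome.TRANSPORT_UNCERTAIN
    return Outcome.SUCCEEDED if exit_status == 0 else Outcome.COMMAND_FAILED


def may_retry(outcome: Outcome, idempotent: bool) -> bool:
    """Only an uncertain result of an idempotent command is retried automatically."""
    return outcome is Outcome.TRANSPORT_UNCERTAIN and idempotent
```

In this sketch a definite command failure is reported rather than retried, and an uncertain result for a non-idempotent command is escalated instead of repeated. Both are policy choices, not rules.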

Timeouts are not one thing

One of the most common mistakes in SSH automation is treating timeout as a single concept.

Production systems usually need several independent timeout layers:

TCP connect timeout
SSH handshake timeout
authentication timeout
command execution timeout
idle/read timeout

Each failure means something different operationally.

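client.connect(
    hostname=host,
    username=user,
    timeout=10,         # TCP connect timeout, in seconds
    banner_timeout=15,  # wait for the SSH banner (start of the handshake)
    auth_timeout=15,    # wait for an authentication response
)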

But even that is insufficient.

A command may still hang forever while the socket technically remains alive.

That distinction matters a lot in production.
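Keeping the layers as separate, named values makes it harder to collapse them into one number by accident. The sketch below is a hypothetical `TimeoutBudget` with illustrative defaults; only the three keyword names returned by `connect_kwargs()` come from Paramiko's `SSHClient.connect()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutBudget:
    """Seconds allowed per layer. The default values are illustrative only."""
    connect: float = 10.0   # TCP connect
    banner: float = 15.0    # SSH banner, the start of the handshake
    auth: float = 15.0      # authentication response
    command: float = 600.0  # whole command; must be enforced by the caller
    idle: float = 60.0      # longest silence on the channel; enforced by the caller

    def connect_kwargs(self) -> dict:
        """Keyword arguments in the shape paramiko.SSHClient.connect() accepts."""
        return {
            "timeout": self.connect,
            "banner_timeout": self.banner,
            "auth_timeout": self.auth,
        }
```

Usage would look like `client.connect(hostname=host, username=user, **budget.connect_kwargs())`. The `command` and `idle` values are not handled by `connect()` at all; something has to watch the running command and apply them.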

Half-open connections are nasty

This became one of the hardest reliability problems.

Sometimes:

TCP stays alive
SSH transport stays alive
but the remote process is effectively dead

Or:

packets silently disappear
the remote kernel freezes
stdout stops forever
but the socket never closes

From the application perspective:
everything looks connected

while the operation is permanently stalled.

The second case is the classic half-open connection problem. The first looks identical from the client side.
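SSH-level keepalives are one common mitigation. The helper below is a small sketch: `enable_keepalive` is a hypothetical name, while `get_transport()`, `is_active()`, and `set_keepalive()` are Paramiko methods. The 30-second interval is an assumption.

```python
def enable_keepalive(client, interval_seconds: int = 30) -> None:
    """Ask the SSH transport to send a keepalive packet every `interval_seconds`."""
    transport = client.get_transport()
    # get_transport() returns None when the client was never connected.
    if transport is None or not transport.is_active():
        raise ConnectionError("SSH transport is not active")
    transport.set_keepalive(interval_seconds)
```

Keepalives keep traffic flowing so the local TCP stack has a chance to notice a dead peer. They say nothing about whether the remote command is making progress, so they complement an idle timeout rather than replace it.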

Blocking reads break automation

This code looks innocent:

stdout.read()

But under real workloads it becomes dangerous.

If:

the command hangs
stdout stops producing data
the socket remains alive

then:
the thread blocks forever

We eventually moved to streaming execution instead of buffered reads.

Streaming changes the execution model

Long-running commands fundamentally change how remote execution must be handled.

Operations like:

pg_dump
VACUUM
package upgrades
log exports

can run for minutes or hours.

Buffering all output in memory risks exhausting memory on large outputs.
Blocking until completion destroys observability.

Instead we switched to chunked streaming:

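import time

while not channel.exit_status_ready():
    while channel.recv_ready():
        callback(channel.recv(4096))
    time.sleep(0.05)  # poll without spinning the CPU
while channel.recv_ready():  # drain output buffered before exit
    callback(channel.recv(4096))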

This solved several production problems simultaneously:

realtime progress visibility
lower memory usage
cancellation support
dead session detection

Streaming ended up being much more operationally stable than buffered execution.
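Dead session detection and cancellation only fall out of streaming if the loop also tracks time. The sketch below assumes a Paramiko-style channel (`recv_ready`, `recv`, `recv_stderr_ready`, `recv_stderr`, `exit_status_ready`, `recv_exit_status`, `close`); `stream_command`, `CommandStalled`, and the default limits are hypothetical.

```python
import time


class CommandStalled(Exception):
    """The command exceeded its idle or total time limit."""


def stream_command(channel, on_chunk, idle_timeout=60.0, total_timeout=600.0,
                   poll_interval=0.05, clock=time.monotonic, sleep=time.sleep):
    """Stream stdout/stderr chunks to on_chunk(stream_name, data); return the exit status."""
    started = last_activity = clock()

    def drain() -> bool:
        got_data = False
        while channel.recv_ready():
            on_chunk("stdout", channel.recv(4096))
            got_data = True
        while channel.recv_stderr_ready():
            on_chunk("stderr", channel.recv_stderr(4096))
            got_data = True
        return got_data

    try:
        while not channel.exit_status_ready():
            now = clock()
            if now - started > total_timeout:
                raise CommandStalled(f"command exceeded {total_timeout}s in total")
            if drain():
                last_activity = clock()
                continue
            if now - last_activity > idle_timeout:
                raise CommandStalled(f"no output for more than {idle_timeout}s")
            sleep(poll_interval)
        drain()  # collect output that was buffered before the exit status arrived
        return channel.recv_exit_status()
    except CommandStalled:
        # Closing the channel frees local resources; the remote process may keep running.
        channel.close()
        raise
```

Reading stderr alongside stdout matters here: a command that fills an unread stderr buffer can stall even though nothing is wrong with the network.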

Security becomes infrastructure, not validation

Another important lesson:

SSH automation is remote code execution infrastructure.

That means command construction rules matter enormously.

This is catastrophic:

cmd = f"rm -rf {user_input}"

Because eventually someone passes:

/home/user; rm -rf /

We ended up treating all remote commands as infrastructure-sensitive operations.

Input validation alone was insufficient.

Every dynamic argument had to be:

validated
escaped
constrained

safe_value = shlex.quote(user_input)

Even simple automation eventually becomes security-critical.
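The three steps can be combined in one small builder. This is a sketch under assumptions: `build_rm_command` and `ALLOWED_ROOT` are hypothetical, and the policy of allowing deletions only below one directory is an example, not the rule set described here.

```python
import posixpath
import shlex

# Hypothetical policy: deletions are only allowed below this directory.
ALLOWED_ROOT = "/srv/app/tmp"


def build_rm_command(user_path: str) -> str:
    """Validate, constrain, and quote a path before it reaches a remote shell."""
    # Validate: require an absolute path without control characters.
    if not user_path.startswith("/") or any(ord(ch) < 32 for ch in user_path):
        raise ValueError(f"invalid path: {user_path!r}")
    # Constrain: collapse ".." segments, then require a path strictly inside the root.
    normalized = posixpath.normpath(user_path)
    if not normalized.startswith(ALLOWED_ROOT + "/"):
        raise ValueError(f"path outside {ALLOWED_ROOT}: {user_path!r}")
    # Escape: quote every argument; "--" stops rm from reading the path as an option.
    return shlex.join(["rm", "-rf", "--", normalized])
```

With this builder the input `/home/user; rm -rf /` is rejected by the constraint step before quoting is even needed. Normalization happens locally, so it cannot see symlinks on the remote host; that check would have to run remotely.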

Resource cleanup matters more than expected

SSH resources leak surprisingly easily.

Channels.
Sockets.
Transports.
PTY buffers.

Under load, forgotten cleanup accumulates fast.

We eventually standardized all operations around explicit lifecycle management:

with ssh_operation(...) as ssh:
    ssh.execute(...)

The important part was not aesthetics.

It was guaranteeing cleanup under:

exceptions
timeouts
partial failures
interrupted execution

Production automation lives or dies on cleanup guarantees.
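A context manager of this kind can be sketched with `contextlib`. The version below is not the `ssh_operation` used in this post, whose signature is not shown; it is a simplified stand-in that takes a hypothetical `connect` callable and yields the client together with an `ExitStack`, so that channels opened later can be registered for cleanup too.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def ssh_operation(connect):
    """Yield (client, stack); everything registered on the stack is closed on exit.

    `connect` is a hypothetical zero-argument callable that returns a connected
    client object with a close() method, such as a paramiko.SSHClient.
    """
    with ExitStack() as stack:
        client = connect()
        stack.callback(client.close)  # registered first, so it runs last
        yield client, stack
```

Inside the block, a caller would register each channel as it is opened, for example `stack.callback(stdout.channel.close)` after `exec_command`. Callbacks run in reverse order of registration, so channels close before the client, and they run whether the block exits normally or through an exception.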

The architecture we ended up with

Over time the system evolved into several independent layers:

Connection management
    ↓
Retry classification
    ↓
Execution supervision
    ↓
Streaming transport
    ↓
Resource cleanup
    ↓
Observability

The important realization was:

remote execution is not a helper function

It is infrastructure.

Final insight

The happy path is trivial.

Production architecture begins where execution certainty ends.

SSH automation fails when treated like scripting.

Because it is not scripting.

It is:

remote process orchestration
over unreliable transport
with partial execution visibility
inside a distributed system

And once you accept that,
the architecture changes completely.

DEV Community