Alex Zhdankov

Why your SSH scripts will fail in production

Remote command execution looks trivial — until unstable networks, retries, long-running commands, and half-open connections turn it into a reliability problem.

We use Paramiko with a thin supervision layer on top.
The same operational problems apply to AsyncSSH, Fabric, or plain OpenSSH subprocesses.

At first, the implementation looked completely straightforward:

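import paramiko

client = paramiko.SSHClient()
# Without known host keys loaded, connect() rejects every server
client.load_system_host_keys()
client.connect(hostname=host, username=user)

stdin, stdout, stderr = client.exec_command(
    "systemctl restart postgres"
)

# Blocks until the remote side closes stdout
output = stdout.read().decode()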

In development, this worked perfectly.

Then production happened.

  • Hundreds of hosts.
  • Unstable networks.
  • Long-running commands.
  • Frozen sessions.
  • Half-open connections.
  • Retries.
  • Partial execution.

At that point this stopped being “SSH scripting”.

It became a distributed systems problem.

SSH is deceptively simple

Most developers intuitively model SSH like this:

local subprocess, but remote

But production SSH execution is actually:

network transport
+ stateful session
+ interactive channel
+ remote process lifecycle
+ unreliable infrastructure
+ partial execution visibility

And failures can happen independently at every layer.

Application
    ↓
SSH Client
    ↓
TCP transport        ← packets can vanish
    ↓
SSH session          ← can hang without closing
    ↓
Remote shell         ← can ignore commands
    ↓
Process execution    ← may continue after disconnect
    ↓
stdout/stderr        ← can block forever

This distinction changes everything.

Failure mode #1 — execution uncertainty

This was the first major production lesson.

If the SSH transport dies, you do not know whether the command:

  • succeeded
  • failed
  • partially executed
  • is still running remotely

That uncertainty completely changes retry semantics.

For example:

systemctl restart postgres

If the connection drops immediately after sending the command:

  • did restart begin?
  • is postgres still restarting?
  • did it already succeed?
  • is the service now dead?

You no longer have execution certainty.

This is not a “Paramiko problem”.

This is a distributed systems problem.
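
A minimal sketch of how that uncertainty might be made explicit in code, so callers cannot confuse "the command failed" with "we lost sight of the command". The `Outcome` enum, the `classify` helper and the `run` callable are illustrative assumptions, not part of Paramiko; with Paramiko you would probably add `paramiko.SSHException` to the transport error tuple.

```python
import enum
from typing import Callable, Tuple, Type


class Outcome(enum.Enum):
    SUCCEEDED = "succeeded"            # remote exit status 0
    COMMAND_FAILED = "command_failed"  # remote process reported a non-zero status
    UNKNOWN = "unknown"                # transport died; remote state was not observed


# Assumption: these exception types signal a transport-level failure.
TRANSPORT_ERRORS: Tuple[Type[BaseException], ...] = (OSError, EOFError)


def classify(run: Callable[[], int]) -> Outcome:
    """Call `run` (hypothetical: executes the command, returns its exit status)."""
    try:
        status = run()
    except TRANSPORT_ERRORS:
        return Outcome.UNKNOWN
    if status == 0:
        return Outcome.SUCCEEDED
    if status < 0:
        # Paramiko's recv_exit_status() returns -1 when the server sent no status.
        return Outcome.UNKNOWN
    return Outcome.COMMAND_FAILED
```

The point of the third state is that `UNKNOWN` is handled by checking the remote system, not by assuming failure.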

Retry is dangerous

Retries sound harmless until commands become stateful.

Some operations are naturally idempotent:

cat /proc/meminfo
ls -la /etc
systemctl status postgres

Others mutate state, so re-running them after an ambiguous failure is not automatically safe:

useradd deploy
rm -rf /some/path
systemctl restart postgres

A failed transport does not imply failed execution.

That means naive retry logic can create destructive side effects.

This forced us to separate failures into two categories:

  • transport uncertainty
  • command failure

Those are fundamentally different operational states.
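
One way the two categories could drive retry logic, as a sketch under stated assumptions: `TransportUncertain`, `CommandFailed` and `run_with_retry` are hypothetical names, and the caller is assumed to know whether a command is safe to repeat.

```python
import time
from typing import Callable


class TransportUncertain(Exception):
    """Hypothetical: the connection failed and the remote outcome is unknown."""


class CommandFailed(Exception):
    """Hypothetical: the remote process ran and returned a non-zero status."""


def run_with_retry(run: Callable[[], str], *, idempotent: bool,
                   attempts: int = 3, base_delay: float = 0.5,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except TransportUncertain:
            # Re-running is only acceptable when repeating the command is harmless.
            if not idempotent or attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        # CommandFailed is deliberately not caught: the command ran and failed,
        # so it propagates to the caller instead of being retried blindly.
    raise AssertionError("unreachable")
```

With this shape, `systemctl status postgres` would be called with `idempotent=True`, while `systemctl restart postgres` would surface the uncertainty to the caller after the first transport error.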

Timeouts are not one thing

One of the most common mistakes in SSH automation is treating timeout as a single concept.

Production systems usually need several independent timeout layers:

  • TCP connect timeout
  • SSH handshake timeout
  • authentication timeout
  • command execution timeout
  • idle/read timeout

Each failure means something different operationally.

client.connect(
    hostname=host,
    username=username,
    timeout=10,
    banner_timeout=15,
    auth_timeout=15
)

But even that is insufficient: these parameters only bound connection setup.

A command may still hang forever while the socket technically remains alive.

That distinction matters a lot in production.
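
The remaining two layers, command execution timeout and idle timeout, have to be enforced by the caller. A minimal sketch, assuming a Paramiko-style channel that exposes `exit_status_ready()`, `recv_ready()`, `recv()`, `recv_exit_status()` and `close()`; the exception classes and the `supervise` function are hypothetical names.

```python
import time


class CommandTimeout(Exception):
    """Hypothetical base class for supervision timeouts."""


class ExecutionTimeout(CommandTimeout):
    """The command exceeded its total wall-clock budget."""


class IdleTimeout(CommandTimeout):
    """The command produced no output for too long."""


def supervise(channel, *, total_timeout: float, idle_timeout: float,
              poll: float = 0.05, clock=time.monotonic, sleep=time.sleep) -> int:
    """Wait for the remote command under two independent budgets."""
    started = last_output = clock()
    while not channel.exit_status_ready():
        now = clock()
        if now - started > total_timeout:
            channel.close()  # ends the session; the remote process may keep running
            raise ExecutionTimeout("total execution budget exceeded")
        if channel.recv_ready():
            channel.recv(4096)  # output is discarded here; only liveness matters
            last_output = now
            continue
        if now - last_output > idle_timeout:
            channel.close()
            raise IdleTimeout("no output within the idle budget")
        sleep(poll)
    return channel.recv_exit_status()
```

Using separate exception types keeps the operational meaning of each timeout visible to whatever handles the failure.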

Half-open connections are nasty

This became one of the hardest reliability problems.

Sometimes:

  • TCP stays alive
  • SSH transport stays alive
  • but the remote process is effectively dead

Or:

  • packets silently disappear
  • the remote kernel freezes
  • stdout stops forever
  • but the socket never closes

From the application's perspective, everything looks connected while the operation is permanently stalled.

This is the classic half-open connection problem.
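
Keepalives are one common mitigation for the "socket never closes" case. The sketch below enables TCP keepalive on a socket; the function name and default values are assumptions, and the tuning options are platform-specific, so each is applied only if the platform exposes it. Keepalives help detect a dead peer or network path; they do not detect a remote process that has stalled while the transport is healthy.

```python
import socket


def enable_tcp_keepalive(sock: socket.socket, *, idle: int = 30,
                         interval: int = 10, probes: int = 3) -> None:
    """Ask the kernel to probe an idle connection so a dead peer surfaces as an error."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These tuning options exist on Linux but not on every platform.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


# Possible use with Paramiko (not executed here):
#   sock = socket.create_connection((host, 22), timeout=10)
#   enable_tcp_keepalive(sock)
#   client.connect(hostname=host, username=user, sock=sock)
#   client.get_transport().set_keepalive(30)  # SSH-level keepalive packets
```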

Blocking reads break automation

This code looks innocent:

stdout.read()

But under real workloads it becomes dangerous.

If:

  • the command hangs
  • stdout stops producing data
  • the socket remains alive

then the thread blocks forever.

We eventually moved to streaming execution instead of buffered reads.

Streaming changes the execution model

Long-running commands fundamentally change how remote execution must be handled.

Operations like:

  • pg_dump
  • VACUUM
  • package upgrades
  • log exports

can run for minutes or hours.

Buffering all output in memory is unreliable: a large dump can exhaust it.
Blocking until completion destroys observability.

Instead we switched to chunked streaming:

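import time

while not channel.exit_status_ready():
    if channel.recv_ready():
        callback(channel.recv(4096))
    else:
        time.sleep(0.05)  # avoid spinning while no data is pending
while channel.recv_ready():  # forward output still buffered after exit
    callback(channel.recv(4096))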

This solved several production problems simultaneously:

  • realtime progress visibility
  • lower memory usage
  • cancellation support
  • dead session detection

Streaming ended up being much more operationally stable than buffered execution.
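
A slightly fuller sketch of the same idea, showing how cancellation could fit into the loop and how stderr might be consumed alongside stdout so that neither stream backs up. It assumes a Paramiko-style channel (`recv_ready`, `recv`, `recv_stderr_ready`, `recv_stderr`); `stream`, `_pump` and the `cancel` event are hypothetical names.

```python
import threading
import time
from typing import Callable, Optional


def _pump(channel, on_stdout: Callable[[bytes], None],
          on_stderr: Callable[[bytes], None]) -> bool:
    """Forward at most one chunk from each stream; report whether data moved."""
    moved = False
    if channel.recv_ready():
        on_stdout(channel.recv(4096))
        moved = True
    if channel.recv_stderr_ready():
        on_stderr(channel.recv_stderr(4096))
        moved = True
    return moved


def stream(channel, on_stdout: Callable[[bytes], None],
           on_stderr: Callable[[bytes], None],
           cancel: threading.Event, poll: float = 0.05) -> Optional[int]:
    """Stream output until the command exits; return None if cancelled."""
    while not channel.exit_status_ready():
        if cancel.is_set():
            channel.close()  # ends the session; the remote process may keep running
            return None
        if not _pump(channel, on_stdout, on_stderr):
            time.sleep(poll)
    while _pump(channel, on_stdout, on_stderr):
        pass  # forward output that was still buffered when the command exited
    return channel.recv_exit_status()
```

Another thread can call `cancel.set()` to stop supervision; whether the remote process also stops is a separate question that this sketch does not answer.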

Security becomes infrastructure, not validation

Another important lesson:

SSH automation is remote code execution infrastructure.

That means command construction rules matter enormously.

This is catastrophic:

cmd = f"rm -rf {user_input}"

Because eventually someone passes:

/home/user; rm -rf /

We ended up treating all remote commands as infrastructure-sensitive operations.

Input validation alone was insufficient.

Every dynamic argument had to be:

  • validated
  • escaped
  • constrained

import shlex

safe_value = shlex.quote(user_input)

Even simple automation eventually becomes security-critical.
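
A sketch of how validating, escaping and constraining might combine when building a command. `ALLOWED_ROOT`, `checked_path` and `build_command` are hypothetical; `shlex.join` (Python 3.8+) quotes each argument with `shlex.quote`. The path check is purely lexical and does not resolve symlinks on the remote host.

```python
import posixpath
import shlex
from typing import Sequence

ALLOWED_ROOT = "/srv/app/releases"  # hypothetical directory the tool may touch


def checked_path(path: str, root: str = ALLOWED_ROOT) -> str:
    """Constrain: accept only paths that stay strictly inside `root`."""
    normalized = posixpath.normpath(path)
    if not normalized.startswith(root.rstrip("/") + "/"):
        raise ValueError(f"path outside {root!r}: {path!r}")
    return normalized


def build_command(argv: Sequence[str]) -> str:
    """Escape: quote every argument so the remote shell sees it as one word."""
    return shlex.join(argv)


# "--" stops option parsing, so a path starting with "-" is not read as a flag.
cmd = build_command(["rm", "-rf", "--", checked_path("/srv/app/releases/old-build")])
```

Here the input `/home/user; rm -rf /` is rejected by `checked_path` before it reaches the shell, and even without that check, quoting would pass it to `rm` as a single literal argument.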

Resource cleanup matters more than expected

SSH resources leak surprisingly easily.

  • Channels.
  • Sockets.
  • Transports.
  • PTY buffers.

Under load, forgotten cleanup accumulates fast.

We eventually standardized all operations around explicit lifecycle management:

with ssh_operation(...) as ssh:
    ssh.execute(...)

The important part was not aesthetics.

It was guaranteeing cleanup under:

  • exceptions
  • timeouts
  • partial failures
  • interrupted execution

Production automation lives or dies on cleanup guarantees.
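
What such a wrapper could look like, as a minimal sketch: the signature below is an assumption, not the actual `ssh_operation` from our codebase. `connect` is a hypothetical factory returning any object with `close()`, such as a connected `paramiko.SSHClient`.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def ssh_operation(connect: Callable[[], Any]) -> Iterator[Any]:
    """Yield a connected client and always close it afterwards."""
    client = connect()
    try:
        yield client
    finally:
        # Runs on normal exit, on exceptions and on KeyboardInterrupt alike.
        client.close()
```

`paramiko.SSHClient` can also be used directly in a `with` statement; a wrapper like this mainly gives one place to add channel cleanup, logging and timeout handling.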

The architecture we ended up with

Over time the system evolved into several independent layers:

Connection management
    ↓
Retry classification
    ↓
Execution supervision
    ↓
Streaming transport
    ↓
Resource cleanup
    ↓
Observability

The important realization was:

remote execution is not a helper function

It is infrastructure.

Final insight

The happy path is trivial.

Production architecture begins where execution certainty ends.

SSH automation fails when treated like scripting.

Because it is not scripting.

It is:

  • remote process orchestration
  • over unreliable transport
  • with partial execution visibility
  • inside a distributed system

And once you accept that,
the architecture changes completely.