Supun Sriyananda

Posted on Jun 7

Deploying Production Systems on Raspberry Pi: Lessons from the Field

#raspberrypi #deployment #linux #reliability

These are the things I wish I had known before deploying Pis in production.

SD Cards Will Kill You

The first production Pi I deployed used a generic microSD card. It failed after four months. The second one used a "name brand" card. It failed after six months. The pattern was always the same: the filesystem corrupts during a power loss, the Pi boots into read-only mode, and whatever the system was supposed to be doing silently stops working.

SD card corruption under power loss is not a bug you can fix in software. It is a fundamental characteristic of flash storage that was designed for cameras, not servers. The cells wear out, write operations are not atomic, and a sudden power cut mid-write leaves the filesystem in a state that fsck (File System Consistency Check) sometimes cannot recover.

The fix: switch to a high-endurance SD card (SanDisk MAX Endurance, Samsung PRO Endurance) or eliminate the SD card entirely by booting from a USB SSD. USB boot on Pi 4 and Pi 5 is stable, and the endurance difference is enormous: a decent SSD handles orders of magnitude more write cycles than any SD card.

For systems that must use SD, mount the filesystem read-only and put all writable state on a tmpfs or a separate partition with journaling.

tmpfs is a Linux filesystem that keeps its files in volatile memory (RAM) instead of on a persistent drive like an SD card or SSD.

When you mount a directory as tmpfs, files written to it behave like regular files, but they consume RAM, never touch the SD card, and are gone after a reboot or power loss.

# /etc/fstab — mount root read-only, put logs and data elsewhere
/dev/mmcblk0p2  /        ext4  ro,defaults  0  1
tmpfs           /tmp     tmpfs defaults      0  0
tmpfs           /var/log tmpfs defaults      0  0

# Separate partition for application data with journaling
/dev/mmcblk0p3  /var/lib/myapp  ext4  defaults,data=journal  0  2

For a warehouse sensor deployment, the application writes data to SQLite on a separate ext4 partition with journaling enabled. If the Pi loses power mid-write, the journal is replayed on the next mount (or by fsck) and the filesystem comes back in a consistent state. The root partition is read-only and survives power loss cleanly every time.
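A read-only root is only useful if it is actually mounted that way after every boot, so it can be worth checking from the application. The sketch below parses /proc/mounts-style text and reports whether a mount point has the ro option. The function names are illustrative, not part of any library.

```python
def mount_options(mounts_text, mount_point="/"):
    """Return the option list for mount_point from /proc/mounts-style text."""
    options = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later entry overrides an earlier one
    return options


def is_read_only(mounts_text, mount_point="/"):
    """True if mount_point is listed and mounted with the 'ro' option."""
    return "ro" in mount_options(mounts_text, mount_point)
```

On the Pi you would pass it the contents of /proc/mounts, for example `is_read_only(open("/proc/mounts").read())`, and include the result in your health report.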

Thermal Throttling Is Silent and Intermittent

The Pi will not tell you that it is throttling. Your application gets no warning: your video stream just starts dropping frames, your serial latency increases, and your MQTT reconnects take longer. All of these symptoms look like software bugs. But if you check:

vcgencmd get_throttled

vcgencmd is a command line tool unique to the Raspberry Pi. Its get_throttled subcommand returns a set of bit flags showing whether the board has seen under-voltage or has lowered its CPU speed.

0x0 means everything is fine. Anything else means the Pi is throttling now or has throttled since last boot. The common values:

0x50005: The danger zone. The Pi is under-volted (0x1) and throttled (0x4) right now, and both have happened before.
0x50000: The system is running fine right now, but under-voltage (0x10000) and throttling (0x40000) have occurred since boot. Your power supply is dropping voltage under load, making your SD card highly vulnerable to corruption.
0x4: currently throttled
0x8: soft temperature limit active
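The value is a bit field, so it is easier to read once decoded. The sketch below maps each documented bit to a name. The bit positions follow the Raspberry Pi documentation for get_throttled, while the function and dictionary names are my own.

```python
# Bit positions as documented for `vcgencmd get_throttled`.
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output):
    """Decode a string such as 'throttled=0x50005' into a list of flag names."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [name for bit, name in THROTTLE_FLAGS.items() if value & (1 << bit)]
```

Bits 0 to 3 describe the current state, and bits 16 to 19 record that the same condition has occurred at some point since boot.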

I ran into this while building a telepresence robot. Encoding 720p30 H.264 while running the aiohttp server and serial communication pushed a Pi 4 without a heatsink to 80°C, and the encoder started dropping frames at random. Adding a heatsink brought the idle temperature down to 45°C and the load temperature to 62°C, which got the throttling under control.

On a Pi 5, the situation is better but not solved. The Pi 5 has an active cooler as an official accessory and it is worth using in any deployment where the Pi is in an enclosure. Enclosures trap heat: a Pi in a plastic project box with no airflow will throttle faster than a bare board.

Add temperature monitoring to your health check endpoint so you find out before users do:

import subprocess

def get_pi_health():
    temp = subprocess.run(
        ['vcgencmd', 'measure_temp'],
        capture_output=True, text=True
    ).stdout.strip()  # "temp=58.0'C"

    throttled = subprocess.run(
        ['vcgencmd', 'get_throttled'],
        capture_output=True, text=True
    ).stdout.strip()  # "throttled=0x0"

    return {
        'temperature': temp,
        'throttled': throttled,
        'throttled_ok': throttled == 'throttled=0x0'
    }
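The raw string "temp=58.0'C" is awkward to alert on, so a small parser helps. This is a minimal sketch that assumes the output format shown in the comment above. The 75°C warning threshold is my assumption, chosen to sit a little below the 80°C mark where the Pi 4 begins to throttle. Tune it for your enclosure.

```python
import re


def parse_temp_c(output):
    """Parse vcgencmd output such as "temp=58.0'C" into a float (degrees Celsius)."""
    match = re.search(r"temp=([0-9.]+)", output)
    if match is None:
        raise ValueError(f"unexpected measure_temp output: {output!r}")
    return float(match.group(1))


def temp_status(output, warn_at=75.0):
    """Return the numeric temperature plus a warning flag (warn_at is an assumed threshold)."""
    temp = parse_temp_c(output)
    return {"temperature_c": temp, "temperature_warning": temp >= warn_at}
```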

Power Supply Quality Matters More Than Rated Current

A power supply rated for 3A is not always a 3A power supply. Cheap USB-C supplies have poor voltage regulation, so under load they sag well below the nominal 5.1V and trigger the Pi's under-voltage throttling. The symptoms overlap with thermal throttling: random slowdowns, plus occasional reboots and SD card corruption.

The official Raspberry Pi power supply is not a premium product; it is a specification-compliant one. Use it, or use a bench power supply for development and a known-good supply for deployment. The Pi 5 is designed around a 5V/5A supply. On a 3A supply it restricts the current available to USB peripherals and is more likely to hit under-voltage events when the CPU is fully loaded.

For any deployment running off mains power, a UPS hat (Geekworm UPS hat, Waveshare UPS hat) is worth the £20. The Pi gets notified of incoming power loss and can initiate a clean shutdown before the battery dies, which eliminates the entire class of "power cut during SD write" corruption events.

Network Time Synchronization Errors Can Cause Unexpected System Failures

A Pi that has been offline for an extended period will have a wrong system clock when it boots — sometimes wrong by days if the RTC battery is dead or there is no RTC at all. Applications that timestamp log entries, certificate validity checks, and SQLite timestamp comparisons all behave unexpectedly when the system time is wrong.

A specific failure case I encountered: the MQTT client was writing timestamps using datetime.now().isoformat(). After a boot without internet, the system clock was set to 2023-01-01 (the default). All queued messages got timestamps in 2023. When the clock corrected to 2024 via NTP after network connection, the retention policy deleted those messages as being "older than 7 days" — because relative to the current time they appeared to be a year old.
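One cheap defence against this failure is to flag timestamps taken while the clock is obviously wrong, so that a retention policy can treat them separately. The sketch below assumes a hypothetical CLOCK_FLOOR, a date before which no real reading from this deployment could exist. The names and the floor value are illustrative.

```python
from datetime import datetime, timezone

# Assumption: nothing in this deployment can legitimately be dated before this.
CLOCK_FLOOR = datetime(2024, 1, 1, tzinfo=timezone.utc)


def clock_is_plausible(now=None, floor=CLOCK_FLOOR):
    """Return False when the wall clock is earlier than the known floor date."""
    now = now or datetime.now(timezone.utc)
    return now >= floor


def stamp(payload, now=None):
    """Attach a timestamp and mark it suspect if the clock looks unset."""
    now = now or datetime.now(timezone.utc)
    return {
        "payload": payload,
        "timestamp": now.isoformat(),
        "clock_suspect": not clock_is_plausible(now),
    }
```

Messages with clock_suspect set can then be exempted from age-based deletion, or re-stamped once NTP has corrected the clock.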

Fix 1: Use a hardware RTC. The DS3231 costs about £3 and keeps accurate time across power cycles without network. Enable it with a device tree overlay:

# /boot/firmware/config.txt
dtoverlay=i2c-rtc,ds3231

Fix 2: For timestamps that survive offline periods, use monotonic time for intervals and NTP-synced time only for absolute timestamps. Do not mix them.
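As a sketch of what "monotonic time for intervals" can look like, the queue below decides retention from time.monotonic() rather than the wall clock, so an NTP step cannot suddenly make queued messages look old. The class name and API are hypothetical. Note the limitation: monotonic readings only make sense within one running process, so a queue that is persisted across reboots needs a different approach.

```python
import time
from collections import deque


class OfflineQueue:
    """In-memory queue whose retention uses the monotonic clock, not the wall clock."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._items = deque()  # entries are (enqueued_at_monotonic, payload)

    def put(self, payload):
        self._items.append((self._clock(), payload))

    def prune(self):
        """Drop entries older than max_age_seconds and return how many were dropped."""
        cutoff = self._clock() - self.max_age_seconds
        dropped = 0
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()
            dropped += 1
        return dropped

    def __len__(self):
        return len(self._items)
```

The payload can still carry a wall-clock timestamp for display. It just never takes part in the age calculation.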

Fix 3: Configure chrony or systemd-timesyncd to be aggressive about syncing on boot and to accept large time jumps:

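# /etc/chrony/chrony.conf on Raspberry Pi OS (/etc/chrony.conf on some other distributions)
# Step the clock whenever the offset exceeds 1 second, with no limit on how many times
makestep 1 -1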

Watchdog Timers Are Not Optional

If your Pi is in a location where you cannot physically reach it — mounted on a robot, installed in a warehouse, bolted inside a wall panel — a software crash or infinite loop that freezes the application is effectively a permanent failure until someone intervenes.

The Pi's hardware watchdog timer has to be kicked regularly by the running system. If the kernel hangs, the kicks stop, the watchdog expires, and the board is forced to reboot. But the hardware watchdog does not know whether your application is running correctly. For that, you need an application-level watchdog.

systemd's built-in watchdog support requires almost no code:

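# In your main loop, notify systemd you're still alive
import os
import socket
import time

def notify_watchdog():
    """Tell systemd the application is healthy (sd_notify protocol)."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return  # not started by systemd, nothing to notify
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # abstract namespace socket
    # Send from this process: by default systemd only accepts
    # notifications from the service's main PID
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"WATCHDOG=1", addr)

# Call this from your main loop — if you stop calling it,
# systemd will restart the service
while True:
    do_work()  # your application's unit of work
    notify_watchdog()
    time.sleep(5)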

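# /etc/systemd/system/myapp.service
# (unit files do not support trailing comments, so comments get their own line)
[Unit]
# Rate limiting for restarts is configured in the [Unit] section
StartLimitIntervalSec=120
StartLimitBurst=5

[Service]
# Restart if no heartbeat for 30 seconds
WatchdogSec=30
Restart=always
RestartSec=5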

For applications where notify_watchdog() cannot be called from the main loop (async applications, multi-threaded servers), run it from a background thread that monitors the health of the main thread.
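A minimal sketch of that background-thread pattern follows. The thread only forwards the heartbeat while the main loop keeps reporting progress. If progress stalls, the thread goes quiet and lets WatchdogSec expire. The Heartbeat class, its parameters, and the default values are all hypothetical. notify is whatever function sends WATCHDOG=1 to systemd, and max_stall should be shorter than your WatchdogSec.

```python
import threading
import time


class Heartbeat:
    """Send watchdog heartbeats from a thread while the main loop makes progress."""

    def __init__(self, notify, interval=5.0, max_stall=20.0, clock=time.monotonic):
        self._notify = notify          # callable that sends WATCHDOG=1
        self._interval = interval      # seconds between heartbeats
        self._max_stall = max_stall    # longest acceptable gap in main-loop progress
        self._clock = clock
        self._last_progress = clock()
        self._stop = threading.Event()

    def mark_progress(self):
        """Call from the main loop (or an asyncio task) after useful work."""
        self._last_progress = self._clock()

    def is_healthy(self):
        return self._clock() - self._last_progress <= self._max_stall

    def run(self):
        # Event.wait returns True once stop() is called, which ends the loop.
        while not self._stop.wait(self._interval):
            if self.is_healthy():
                self._notify()
            # When unhealthy, stay silent so systemd's watchdog fires.

    def start(self):
        thread = threading.Thread(target=self.run, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

A heartbeat thread that notifies unconditionally would defeat the purpose, because it keeps running even when the main loop is deadlocked. Tying the notification to mark_progress() is what makes it a health check.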

Remote Access Must Work Before Anything Breaks

The time to set up remote access is before you deploy, not after. I use Tailscale for most Pi deployments because it takes five minutes to configure, works through NAT, does not require port forwarding, and uses WireGuard under the hood. Once it is running, you have a reliable backdoor to your hardware. If you need more control, complete data sovereignty, and zero third-party dependencies, use vanilla WireGuard instead: you have to configure routing rules manually and host a central server with an open port that devices behind NAT can reach, but you get total ownership of your network topology without device or account limits. The Tailscale setup looks like this:

# Install
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

# Enable SSH in your tailnet policy and you can reach the Pi from anywhere
ssh user@cyrobot.turkey-trench.ts.net

Tailscale also provides valid TLS certificates for your device's hostname in the tailnet, which is how the robot's WebRTC server serves HTTPS without my owning a domain name or exposing the Pi to the public internet.

Set up mosh alongside SSH for unreliable connections. Regular SSH sessions die when the network hiccups. Mosh sessions survive.

Logs Fill Up the Filesystem

/var/log on a production Pi will fill up over months of continuous operation. When it does, your application cannot write logs, SQLite cannot open its WAL file, and things fail in confusing ways that do not obviously point to "disk full."

Set up log rotation from day one:

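# /etc/logrotate.d/myapp
# Rotate daily, or earlier if a log is over 10M when logrotate runs.
# (A plain "size 10M" would override "daily" and rotate on size alone.)
/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    maxsize 10M
}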

And add disk usage to your health check:

import shutil

def check_disk():
    usage = shutil.disk_usage('/')
    free_percent = (usage.free / usage.total) * 100
    return {
        'free_gb': round(usage.free / 1e9, 2),
        'used_percent': round(100 - free_percent, 1),
        'warning': free_percent < 15
    }
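To expose checks like these over HTTP, something along the following lines would work. It is a sketch using only the standard library's http.server instead of aiohttp. The /health path, port 8080, and the function names are assumptions for illustration. Each check is a zero-argument callable returning a dict, such as the get_pi_health and check_disk functions from this article.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_health(checks):
    """Run each named check; a check that raises is reported as an error entry."""
    report = {}
    for name, check in checks.items():
        try:
            report[name] = check()
        except Exception as exc:  # one broken check must not hide the others
            report[name] = {"error": str(exc)}
    return report


def make_handler(checks):
    """Build a request handler class that serves the report as JSON on /health."""
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            body = json.dumps(collect_health(checks)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
    return HealthHandler


def serve(checks, port=8080):
    """Blocking call: serve the health endpoint until the process exits."""
    HTTPServer(("0.0.0.0", port), make_handler(checks)).serve_forever()
```

Usage would look like `serve({"pi": get_pi_health, "disk": check_disk})`, run in its own thread or process so that it does not block the application.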

The Summary

Before deploying any Pi to a location you cannot easily reach:

Boot storage is an industrial SD card or USB SSD.
The root filesystem is read-only or has journaling on writable partitions.
A hardware RTC is installed.
A heatsink or active cooler is fitted.
A quality power supply is used and a UPS hat is fitted if on mains power.
Tailscale is installed and tested from outside the local network.
The systemd service has Restart=always, WatchdogSec, and StartLimitBurst set.
Log rotation is configured.
A health endpoint exposes temperature, throttle status, and disk usage.
You have confirmed you can SSH in and restart the service from the office before going to the deployment site.

Written by Supun Akalanka | Category: Lessons Learned | Tags: Raspberry Pi, Production, Reliability, Embedded Linux, Hardware
