<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Supun Sriyananda</title>
    <description>The latest articles on DEV Community by Supun Sriyananda (@ranaweerasupun).</description>
    <link>https://dev.to/ranaweerasupun</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3951298%2Fb6d6c160-d027-48c3-aeb0-91a507d5b6a8.jpeg</url>
      <title>DEV Community: Supun Sriyananda</title>
      <link>https://dev.to/ranaweerasupun</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ranaweerasupun"/>
    <language>en</language>
    <item>
      <title>DuckDB vs SQLite: Which one is better?</title>
      <dc:creator>Supun Sriyananda</dc:creator>
      <pubDate>Tue, 09 Jun 2026 09:18:52 +0000</pubDate>
      <link>https://dev.to/ranaweerasupun/duckdb-vs-sqlite-two-tiny-databases-that-dont-actually-compete-39d1</link>
      <guid>https://dev.to/ranaweerasupun/duckdb-vs-sqlite-two-tiny-databases-that-dont-actually-compete-39d1</guid>
      <description>&lt;p&gt;I spend most of my time somewhere between microcontrollers and dashboards. One week I'm squeezing firmware onto a device with barely any memory to spare, the next I'm staring at a few million sensor readings trying to work out why a gateway in the field keeps misbehaving. So when people line up "DuckDB vs SQLite" as a fight to the death, I always want to gently jump in.&lt;/p&gt;

&lt;p&gt;They're both small. They both run inside your own program instead of off on some server. They both let you work with a single file using SQL. On paper that makes them sound like rivals. But the more I've used them — across embedded work, edge devices, and plain old data crunching — the more they feel like two neighbours who happen to do completely different jobs.&lt;/p&gt;

&lt;p&gt;Let me walk through what I mean, starting from the very beginning.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, what "embedded" actually means here
&lt;/h2&gt;

&lt;p&gt;When people hear the word "database," they often picture something big and separate: a program running on its own server somewhere that your app has to connect to over the network, with a username and password, before it can read or write anything. PostgreSQL and MySQL work like that. There's nothing wrong with it — it's how most large web apps run — but it's a lot of moving parts.&lt;/p&gt;

&lt;p&gt;SQLite and DuckDB throw all of that out. There's no separate program to start. Nothing to log into. No server quietly running in the background. The whole database is just an ordinary file sitting on your disk, and the database engine is a small library your code loads in directly. You point it at the file, you run your queries, and that's the whole setup. Nothing else to install or babysit.&lt;/p&gt;

&lt;p&gt;That shared simplicity is the lovely part, and it's exactly why you find these two in places a big server-based database could never go — phones, web browsers, tiny sensors, a quick script on your laptop. But the moment you look at &lt;em&gt;how&lt;/em&gt; each one stores your data and works through it, they head off in opposite directions. That difference is the whole story, so let's go there next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hesh2w71jiy0wu9u12l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hesh2w71jiy0wu9u12l.png" alt="row_vs_column_storage" width="799" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SQLite: the one that's already everywhere
&lt;/h2&gt;

&lt;p&gt;SQLite stores data in rows. Picture a spreadsheet where each row is one complete record, and all the values for that record sit together: the id, the name, the temperature reading, all in one place. That layout is perfect when you're constantly poking at individual records — add this new reading, update that setting, grab the latest entry for device 42. You want the whole record at once, and SQLite hands it to you fast.&lt;/p&gt;

&lt;p&gt;A few things I love about it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's tiny. The whole engine is around a megabyte. On a small device, that matters enormously.&lt;/li&gt;
&lt;li&gt;It's everywhere, and I mean everywhere. There are tens of billions of SQLite files in active use — it's running on basically every phone, and your web browser uses it right now to store your history and settings. It has earned the right to be boring.&lt;/li&gt;
&lt;li&gt;It's astonishingly reliable. It's one of the most thoroughly tested pieces of software on the planet. The US Library of Congress even &lt;a href="https://sqlite.org/locrsf.html" rel="noopener noreferrer"&gt;recommends SQLite as a format for preserving digital files long-term&lt;/a&gt;, because they trust it'll still open decades from now. When you're shipping a device that has to run untouched in a cabinet for years, that track record buys you a lot of peace of mind.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my embedded work, SQLite is the default for anything that holds &lt;em&gt;state&lt;/em&gt; — the current situation the device needs to remember. Its settings. A queue of readings waiting to be uploaded. A small log of recent events. It handles a steady stream of small writes gracefully and it doesn't ask for resources the hardware doesn't have.&lt;/p&gt;
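
&lt;p&gt;To make the "queue of readings waiting to be uploaded" idea concrete, here is a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The table name, column names, and helper functions are all invented for illustration, so treat it as one possible shape rather than a recommended schema.&lt;/p&gt;

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) the database and make sure the queue table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS upload_queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " device_id INTEGER NOT NULL,"
        " temperature REAL NOT NULL,"
        " uploaded INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def enqueue(conn, device_id, temperature):
    """Store one reading; the with-block commits or rolls back as a unit."""
    with conn:
        conn.execute(
            "INSERT INTO upload_queue (device_id, temperature) VALUES (?, ?)",
            (device_id, temperature),
        )


def next_batch(conn, limit=100):
    """Return the oldest readings that have not been uploaded yet."""
    return conn.execute(
        "SELECT id, device_id, temperature FROM upload_queue"
        " WHERE uploaded = 0 ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()


def mark_uploaded(conn, ids):
    """Flag readings as sent once the upstream server has accepted them."""
    with conn:
        conn.executemany(
            "UPDATE upload_queue SET uploaded = 1 WHERE id = ?",
            [(reading_id,) for reading_id in ids],
        )
```

&lt;p&gt;Each &lt;code&gt;with conn:&lt;/code&gt; block is one transaction: it commits when the block finishes and rolls back if an exception is raised. On a real device you would pass a file path instead of the in-memory default.&lt;/p&gt;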

&lt;p&gt;Where it starts to struggle is heavy number-crunching across huge piles of data. Ask SQLite to add up and average ten million rows and it'll get there — but slowly, because it reads through the data row by row, dragging along every column even when your question only touches one of them. That's not a flaw. It simply wasn't built for that job.&lt;/p&gt;

&lt;h2&gt;
  
  
  DuckDB: the one that makes big analysis feel easy
&lt;/h2&gt;

&lt;p&gt;DuckDB flips the storage layout around. Instead of keeping each row together, it keeps each &lt;em&gt;column&lt;/em&gt; together — all the temperatures in one place, all the device names in another. So when you ask "what's the average temperature across ten million readings," it reads just the temperature column and skips everything else. On top of that it processes data in big batches rather than one row at a time, which modern processors are very good at. The result is that the heavy questions that make SQLite sweat come back from DuckDB before you've finished a sip of coffee.&lt;/p&gt;
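
&lt;p&gt;A toy model of the two layouts in plain Python, purely as an illustration: a list of records stands in for row storage and a dictionary of lists stands in for column storage. Neither engine stores data this way internally, and the sample values are invented.&lt;/p&gt;

```python
# Row layout: each record keeps all of its values together.
rows = [
    {"device_id": 1, "name": "sensor-a", "temperature": 20.0},
    {"device_id": 2, "name": "sensor-b", "temperature": 21.0},
    {"device_id": 3, "name": "sensor-c", "temperature": 22.0},
]

# Column layout: each column keeps all of its values together.
columns = {
    "device_id": [1, 2, 3],
    "name": ["sensor-a", "sensor-b", "sensor-c"],
    "temperature": [20.0, 21.0, 22.0],
}


def average_from_rows(records):
    # Every record is visited in full, even though only one field is needed.
    return sum(record["temperature"] for record in records) / len(records)


def average_from_columns(table):
    # Only the temperature list is touched; the other columns are never read.
    temperatures = table["temperature"]
    return sum(temperatures) / len(temperatures)
```

&lt;p&gt;Both functions return the same answer. The difference is how much unrelated data has to be walked past to get there, which is what starts to matter at ten million readings.&lt;/p&gt;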

&lt;p&gt;There's another part that genuinely changed how I work, and it has nothing to do with speed. DuckDB will read your data files directly, right where they sit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'readings/*.parquet'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No loading step. No importing anything into a table first. You point it at a folder of files — CSV, JSON, or Parquet (a compact file format that's common for this kind of data) — and it just reads them. For someone who regularly gets handed a pile of sensor dumps, that's the difference between "give me an afternoon to set up a pipeline" and "give me thirty seconds."&lt;/p&gt;

&lt;p&gt;It also handles data bigger than your computer's memory by spilling the overflow to disk, so a dataset that would crash a normal in-memory tool just... works.&lt;/p&gt;

&lt;p&gt;It's newer than SQLite and the engine is a bit chunkier, but it's still the same idea at heart: a file and a small library, nothing to run separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are these actually used in the real world?
&lt;/h2&gt;

&lt;p&gt;Both, heavily — this isn't a case of betting on something obscure.&lt;/p&gt;

&lt;p&gt;SQLite is &lt;a href="https://www.sqlite.org/mostdeployed.html" rel="noopener noreferrer"&gt;the most widely deployed database engine that exists&lt;/a&gt;, full stop. With tens of billions of copies in daily use, it is likely more common than every other database engine combined. It's not going anywhere.&lt;/p&gt;

&lt;p&gt;DuckDB is younger but its adoption has shot up. It's pulling in around 37 million downloads a month on Python's package index, it's MIT licensed and free, and it has real commercial backing behind it (a company called MotherDuck builds a cloud service on top while keeping the core engine open and free), which answers the usual worry about whether an open-source tool will still be maintained in five years. It also fits neatly with where the industry is heading: open file formats, modern multi-core processors, and even AI coding assistants, which tend to be good at writing SQL and so reach for DuckDB naturally.&lt;/p&gt;

&lt;p&gt;And SQLite isn't standing still either. Newer spin-offs like Turso/libSQL are adding things like replication and edge support on top of the classic engine. Both of these tools are safe bets.&lt;/p&gt;

&lt;h2&gt;
  
  
  So when does each one actually win?
&lt;/h2&gt;

&lt;p&gt;Here's how it shakes out across the three places I work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On small, constrained hardware&lt;/strong&gt;, SQLite, almost every time. DuckDB's appetite for memory and processing power during a big query is more than a tiny chip wants to give. On a larger edge device running Linux it's a different story, but down at the small end, SQLite's tiny size and long history are hard to argue with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At the edge&lt;/strong&gt; — meaning the gateway boxes that sit between your sensors and the cloud — they actually team up. A pattern I keep coming back to: let SQLite handle the incoming readings on the device, quietly buffering them as they arrive, then let DuckDB do the local number-crunching before anything gets sent upstream. You ship neat summaries instead of the raw firehose, which is kinder to both your bandwidth bill and your cloud costs. They're not competing here. They're a relay team.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx61ysj2a0savsfyzlb8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx61ysj2a0savsfyzlb8a.png" alt="edge_pipeline_sqlite_duckdb" width="799" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For sitting down and analysing data&lt;/strong&gt;, DuckDB is the one I reach for. Being able to throw SQL at a folder of files without setting up any heavy machinery is exactly the kind of low-fuss tool the job usually calls for. It has quietly become a go-to for local analysis, and for good reason.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one line I keep in my head
&lt;/h2&gt;

&lt;p&gt;If I had to shrink all of this down to a fridge magnet:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;SQLite is for managing data while it's being created and changed. DuckDB is for analysing it once it's all piled up.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Same small, no-server spirit — opposite ends of the data's life. SQLite looks after the data as it's coming in and changing. DuckDB shows up later to make sense of the whole pile. Once that clicked for me, the "versus" framing kind of fell apart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6gvjrp5fgtn0o3hc3nb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6gvjrp5fgtn0o3hc3nb.png" alt="state_vs_analysis_concept" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So if you're choosing between them, the honest answer is often &lt;em&gt;both, for different things&lt;/em&gt;. Work out whether the job in front of you is about handling data as it changes or making sense of a big pile of it, and the choice mostly makes itself.&lt;/p&gt;

&lt;p&gt;If you've wired these two together in your own projects, I'd love to hear how you split the work. I'm always tinkering with my own setup, and the edge folks always seem to have the most interesting war stories.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>sqlite</category>
      <category>iot</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Deploying Production Systems on Raspberry Pi: Lessons from the Field</title>
      <dc:creator>Supun Sriyananda</dc:creator>
      <pubDate>Sun, 07 Jun 2026 05:38:31 +0000</pubDate>
      <link>https://dev.to/ranaweerasupun/deploying-production-systems-on-raspberry-pi-lessons-from-the-field-1i4k</link>
      <guid>https://dev.to/ranaweerasupun/deploying-production-systems-on-raspberry-pi-lessons-from-the-field-1i4k</guid>
      <description>&lt;h2&gt;
  
  
  Deploying Production Systems on Raspberry Pi: Lessons from the Field
&lt;/h2&gt;

&lt;p&gt;These are the things I wish I had known before deploying Pis in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  SD Cards Will Kill You
&lt;/h2&gt;

&lt;p&gt;The first production Pi I deployed used a generic microSD card. It failed after four months. The second one used a "name brand" card. It failed after six months. The pattern was always the same: the filesystem corrupts during a power loss, the Pi boots into read-only mode, and whatever the system was supposed to be doing silently stops working.&lt;/p&gt;

&lt;p&gt;SD card corruption under power loss is not a bug you can fix in software. It is a fundamental characteristic of flash storage that was designed for cameras, not servers. The cells wear out, write operations are not atomic, and a sudden power cut mid-write leaves the filesystem in a state that &lt;strong&gt;fsck (File System Consistency Check)&lt;/strong&gt; sometimes cannot recover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you choose SD cards:&lt;/strong&gt; Switch to a Pi-rated industrial SD card (SanDisk MAX Endurance, Samsung Pro Endurance) or eliminate the SD card entirely by booting from a USB SSD. USB boot on Pi 4 and Pi 5 is stable and the endurance difference is enormous — a decent SSD handles orders of magnitude more write cycles than any SD card.&lt;/p&gt;

&lt;p&gt;For systems that must use SD, mount the filesystem read-only and put all writable state on a &lt;strong&gt;tmpfs&lt;/strong&gt; or a separate partition with journaling.&lt;/p&gt;

&lt;p&gt;tmpfs is a temporary filesystem on Linux and other Unix-like systems that keeps files in volatile memory (RAM) instead of on a persistent drive like an SD card or SSD.&lt;/p&gt;

&lt;p&gt;When you mount a directory as tmpfs, files written there behave like regular files, but they consume RAM and exist only in memory, so they are gone after a reboot or power cut. That is fine for scratch files, but it also means logs under &lt;code&gt;/var/log&lt;/code&gt; do not survive a restart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/fstab — mount root read-only, put logs and data elsewhere&lt;/span&gt;
/dev/mmcblk0p2  /        ext4  ro,defaults  0  1
tmpfs           /tmp     tmpfs defaults      0  0
tmpfs           /var/log tmpfs defaults      0  0

&lt;span class="c"&gt;# Separate partition for application data with journaling&lt;/span&gt;
/dev/mmcblk0p3  /var/lib/myapp  ext4  defaults,data&lt;span class="o"&gt;=&lt;/span&gt;journal  0  2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a warehouse sensor deployment, the application writes data to SQLite on a separate ext4 partition with journaling enabled. If the Pi loses power mid-write, fsck can recover the journal. The root partition is read-only and survives power loss cleanly every time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Thermal Throttling Is Silent and Intermittent
&lt;/h2&gt;

&lt;p&gt;The Pi will not tell you that it is throttling, and it will not log a warning. Your video stream just starts dropping frames, your serial latency increases, and your MQTT reconnects take longer. All of these symptoms look like software bugs. To find out what is really going on, check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vcgencmd get_throttled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;vcgencmd get_throttled&lt;/code&gt; is a command line tool unique to the Raspberry Pi that checks whether the computer has lowered its CPU speed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;0x0&lt;/code&gt; means everything is fine. Anything else means the Pi is throttling now or has throttled since last boot. The common values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;0x50005&lt;/code&gt;: the danger zone. The Pi is under-volted and throttled right now, and both have also happened earlier since boot.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x50000&lt;/code&gt;: the system is running fine right now, but under-voltage (&lt;code&gt;0x10000&lt;/code&gt;) and throttling (&lt;code&gt;0x40000&lt;/code&gt;) have occurred since boot. Your power supply is dropping voltage under load, which leaves the SD card highly vulnerable to corruption.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0x4&lt;/code&gt;: currently throttled. The soft temperature limit has its own bit, &lt;code&gt;0x8&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
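
&lt;p&gt;The value is a bit field, so other combinations can show up too. As a rough sketch, the flags can be decoded with Python's &lt;code&gt;enum.IntFlag&lt;/code&gt;. The bit positions follow the Raspberry Pi documentation for &lt;code&gt;vcgencmd get_throttled&lt;/code&gt;; the class, member, and function names are invented for this example.&lt;/p&gt;

```python
import enum


class Throttle(enum.IntFlag):
    """Bits reported by `vcgencmd get_throttled` (names are illustrative)."""
    UNDER_VOLTAGE_NOW = 0x1
    FREQ_CAPPED_NOW = 0x2
    THROTTLED_NOW = 0x4
    SOFT_TEMP_LIMIT_NOW = 0x8
    UNDER_VOLTAGE_OCCURRED = 0x10000
    FREQ_CAPPED_OCCURRED = 0x20000
    THROTTLED_OCCURRED = 0x40000
    SOFT_TEMP_LIMIT_OCCURRED = 0x80000


def decode_throttled(output):
    """Turn a line like 'throttled=0x50005' into a list of flag names."""
    value = int(output.strip().split("=", 1)[1], 16)
    flags = Throttle(value)
    return [flag.name for flag in Throttle if flag in flags]
```

&lt;p&gt;For example, &lt;code&gt;decode_throttled("throttled=0x50000")&lt;/code&gt; returns &lt;code&gt;['UNDER_VOLTAGE_OCCURRED', 'THROTTLED_OCCURRED']&lt;/code&gt;, and &lt;code&gt;"throttled=0x0"&lt;/code&gt; gives an empty list.&lt;/p&gt;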

&lt;p&gt;On a telepresence robot I was building, encoding 720p30 H.264 while running the aiohttp server and serial communication pushed the Pi 4 to 80°C without a heatsink, and the encoder started dropping frames at random. Adding a heatsink brought the idle temperature down to 45°C and the load temperature to 62°C, which got the throttling under control.&lt;/p&gt;

&lt;p&gt;On a Pi 5, the situation is better but not solved. The Pi 5 has an active cooler as an official accessory, and it is worth using in any deployment where the Pi sits in an enclosure. Enclosures trap heat: a Pi in a plastic project box with no airflow will throttle sooner than a bare board.&lt;/p&gt;

&lt;p&gt;It is also good practice to add temperature monitoring to your health check endpoint so you find out before your users do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pi_health&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vcgencmd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;measure_temp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# "temp=58.0'C"
&lt;/span&gt;
    &lt;span class="n"&gt;throttled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vcgencmd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;get_throttled&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# "throttled=0x0"
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;throttled&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;throttled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;throttled_ok&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;throttled&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;throttled=0x0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Power Supply Quality Matters More Than Rated Current
&lt;/h2&gt;

&lt;p&gt;A power supply rated for 3A is not always a 3A power supply. Cheap USB-C supplies have poor voltage regulation, so under load they sag below the 5.1V the Pi expects and trigger under-voltage throttling. The symptoms look just like thermal throttling: random slowdowns, occasional reboots, SD card corruption on shutdown.&lt;/p&gt;

&lt;p&gt;The official Raspberry Pi power supply is not a premium product; it is a specification-compliant one. Use it, or use a bench power supply for development and a known-good supply for deployment. The Pi 5 is designed around a 5A supply; on a 3A supply it can hit under-voltage events when the CPU and attached peripherals are fully loaded.&lt;/p&gt;

&lt;p&gt;For any deployment running off mains power, a UPS hat (Geekworm UPS hat, Waveshare UPS hat) is worth the £20. The Pi gets notified of incoming power loss and can initiate a clean shutdown before the battery dies, which eliminates the entire class of "power cut during SD write" corruption events.&lt;/p&gt;




&lt;h2&gt;
  
  
  Network Time Synchronization Errors Can Cause Unexpected System Failures
&lt;/h2&gt;

&lt;p&gt;A Pi that has been offline for an extended period will have a wrong system clock when it boots — sometimes wrong by days if the RTC battery is dead or there is no RTC at all. Applications that timestamp log entries, certificate validity checks, and SQLite timestamp comparisons all behave unexpectedly when the system time is wrong.&lt;/p&gt;

&lt;p&gt;A specific failure case I encountered: the MQTT client was writing timestamps using &lt;code&gt;datetime.now().isoformat()&lt;/code&gt;. After a boot without internet, the system clock was set to &lt;strong&gt;2023-01-01&lt;/strong&gt; (the default). All queued messages got timestamps in 2023. When the clock corrected to 2024 via NTP after network connection, the retention policy deleted those messages as being "older than 7 days" — because relative to the current time they appeared to be a year old.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix 1:&lt;/strong&gt; Use a hardware RTC. The DS3231 costs about £3 and keeps accurate time across power cycles without network. Enable it with a device tree overlay:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# /boot/firmware/config.txt&lt;/span&gt;
&lt;span class="nv"&gt;dtoverlay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;i2c-rtc,ds3231
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fix 2:&lt;/strong&gt; For timestamps that survive offline periods, use monotonic time for intervals and NTP-synced time only for absolute timestamps. Do not mix them.&lt;/p&gt;
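
&lt;p&gt;A minimal sketch of that separation, assuming a hypothetical in-memory message object: the wall-clock timestamp is stored only for display and upload, while the retention decision uses &lt;code&gt;time.monotonic()&lt;/code&gt;, which is unaffected by NTP steps. The class and function names are made up for this example.&lt;/p&gt;

```python
import time
from datetime import datetime, timezone


class QueuedMessage:
    """A message that remembers both kinds of time, kept strictly apart."""

    def __init__(self, payload):
        self.payload = payload
        # Absolute timestamp: meaningful only once the clock has been synced.
        self.wall_time = datetime.now(timezone.utc).isoformat()
        # Interval reference: never jumps when NTP corrects the system clock.
        self._queued_at = time.monotonic()

    def age_seconds(self):
        """Seconds since the message was queued, measured monotonically."""
        return time.monotonic() - self._queued_at


def expired(message, max_age_seconds):
    """Retention check that ignores the wall clock entirely."""
    return message.age_seconds() > max_age_seconds
```

&lt;p&gt;One caveat: the monotonic clock's reference point is not preserved across reboots, so this only works for messages held by a running process. A queue that is persisted to disk needs some other way to track age, such as a boot counter stored alongside each row.&lt;/p&gt;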

&lt;p&gt;&lt;strong&gt;Fix 3:&lt;/strong&gt; Configure &lt;strong&gt;chrony&lt;/strong&gt; or &lt;strong&gt;systemd-timesyncd&lt;/strong&gt; to be aggressive about syncing on boot and to accept large time jumps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/chrony.conf
# Step the clock whenever the offset exceeds 1 second, with no limit on how many times
&lt;/span&gt;&lt;span class="err"&gt;makestep&lt;/span&gt; &lt;span class="err"&gt;1&lt;/span&gt; &lt;span class="err"&gt;-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Watchdog Timers Are Not Optional
&lt;/h2&gt;

&lt;p&gt;If your Pi is in a location where you cannot physically reach it — mounted on a robot, installed in a warehouse, bolted inside a wall panel — a software crash or infinite loop that freezes the application is effectively a permanent failure until someone intervenes.&lt;/p&gt;

&lt;p&gt;The Linux kernel watchdog kicks the hardware watchdog timer while the kernel is running. If the kernel hangs, the watchdog expires and forces a reboot. But it does not know whether your application is running correctly. For that, you need an application-level watchdog.&lt;/p&gt;

&lt;p&gt;systemd's built-in watchdog support requires almost no code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In your main loop, notify systemd you're still alive
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;notify_watchdog&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Tell systemd the application is healthy.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;systemd-notify WATCHDOG=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Call this from your main loop — if you stop calling it,
# systemd will restart the service
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;do_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;notify_watchdog&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
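&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/myapp.service
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="c"&gt;# Start rate limiting lives in [Unit]: give up after 5 starts within 120 seconds
&lt;/span&gt;&lt;span class="py"&gt;StartLimitIntervalSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;120&lt;/span&gt;
&lt;span class="py"&gt;StartLimitBurst&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="c"&gt;# Restart if no heartbeat for 30 seconds (comments must be on their own line)
&lt;/span&gt;&lt;span class="py"&gt;WatchdogSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;30&lt;/span&gt;
&lt;span class="c"&gt;# systemd-notify runs as a child process, so accept pings from the whole service
&lt;/span&gt;&lt;span class="py"&gt;NotifyAccess&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;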
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For applications where &lt;code&gt;notify_watchdog()&lt;/code&gt; cannot be called from the main loop (async applications, multi-threaded servers), run it from a background thread that monitors the health of the main thread.&lt;/p&gt;
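
&lt;p&gt;One way that background thread could look, as a sketch under a few assumptions. It talks to systemd's notification socket directly (the datagram socket named in the &lt;code&gt;NOTIFY_SOCKET&lt;/code&gt; environment variable) instead of spawning &lt;code&gt;systemd-notify&lt;/code&gt;. The &lt;code&gt;Heartbeat&lt;/code&gt; class, &lt;code&gt;watchdog_loop&lt;/code&gt; function, and the timing values are hypothetical names and numbers chosen for this example.&lt;/p&gt;

```python
import os
import socket
import threading
import time


def sd_notify(message):
    """Send one message to systemd's notification socket, if there is one."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not started by systemd, or notifications not enabled
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract socket namespace
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(path)
        sock.sendall(message.encode("ascii"))
    return True


class Heartbeat:
    """Records when the main thread last reported progress."""

    def __init__(self, max_silence=10.0):
        self.max_silence = max_silence
        self._last_beat = time.monotonic()

    def beat(self):
        """Call this from the main thread whenever real work completes."""
        self._last_beat = time.monotonic()

    def is_healthy(self):
        silence = time.monotonic() - self._last_beat
        return self.max_silence >= silence


def watchdog_loop(heartbeat, stop_event, interval=5.0):
    """Ping systemd every `interval` seconds while the main thread is alive."""
    while not stop_event.wait(interval):
        if heartbeat.is_healthy():
            sd_notify("WATCHDOG=1")
```

&lt;p&gt;Start it with &lt;code&gt;threading.Thread(target=watchdog_loop, args=(heartbeat, stop_event), daemon=True).start()&lt;/code&gt; and call &lt;code&gt;heartbeat.beat()&lt;/code&gt; from the main thread. If the main thread stalls for longer than &lt;code&gt;max_silence&lt;/code&gt;, the pings stop and systemd restarts the service once &lt;code&gt;WatchdogSec&lt;/code&gt; runs out.&lt;/p&gt;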




&lt;h2&gt;
  
  
  Remote Access Must Work Before Anything Breaks
&lt;/h2&gt;

&lt;p&gt;The time to set up remote access is before you deploy, not after. I use Tailscale for most Pi deployments because it takes five minutes to configure, works through NAT, does not require port forwarding, and uses WireGuard under the hood. Once it is running, you have a reliable way back into your hardware. If you need more control, complete data sovereignty, and zero third-party dependencies, use vanilla WireGuard instead: you have to configure routing rules manually and host a central server with an open port for NAT traversal, but you get total ownership of your network topology without device or account limits. The Tailscale setup is two commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://tailscale.com/install.sh | sh
&lt;span class="nb"&gt;sudo &lt;/span&gt;tailscale up

&lt;span class="c"&gt;# Enable SSH in your tailnet policy and you can reach the Pi from anywhere&lt;/span&gt;
ssh user@cyrobot.turkey-trench.ts.net
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tailscale also provides valid TLS certificates for your device's hostname in the tailnet, which is how the WebRTC server serves HTTPS without me registering a domain name or managing certificates by hand.&lt;/p&gt;

&lt;p&gt;Set up &lt;strong&gt;mosh&lt;/strong&gt; alongside SSH for unreliable connections. Regular SSH sessions die when the network hiccups. Mosh sessions survive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Logs Fill Up the Filesystem
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/var/log&lt;/code&gt; on a production Pi will fill up over months of continuous operation. When it does, your application cannot write logs, SQLite cannot open its WAL file, and things fail in confusing ways that do not obviously point to "disk full."&lt;/p&gt;

&lt;p&gt;Set up log rotation from day one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/logrotate.d/myapp&lt;/span&gt;
/var/log/myapp/&lt;span class="k"&gt;*&lt;/span&gt;.log &lt;span class="o"&gt;{&lt;/span&gt;
    daily
    rotate 7
    compress
    missingok
    notifempty
    maxsize 10M
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And add disk usage to your health check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_disk&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;disk_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;free_percent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;free&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free_gb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;free&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;used_percent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;free_percent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;warning&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;free_percent&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Summary
&lt;/h2&gt;

&lt;p&gt;Before deploying any Pi to a location you cannot easily reach:&lt;/p&gt;

&lt;p&gt;Boot storage is an industrial SD card or USB SSD. The root filesystem is read-only or has journaling on writable partitions. A hardware RTC is installed. A heatsink or active cooler is fitted. A quality power supply is used and a UPS HAT is fitted if on mains power. Tailscale is installed and tested from outside the local network. The systemd service has &lt;strong&gt;&lt;em&gt;Restart=always&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;WatchdogSec&lt;/em&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;em&gt;StartLimitBurst&lt;/em&gt;&lt;/strong&gt; set. Log rotation is configured. A health endpoint exposes temperature, throttle status, and disk usage. You have confirmed you can SSH in and restart the service from the office before going to the deployment site.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Supun Akalanka | Category: Lessons Learned | Tags: Raspberry Pi, Production, Reliability, Embedded Linux, Hardware&lt;/em&gt;&lt;/p&gt;

</description>
      <category>raspberrypi</category>
      <category>deployment</category>
      <category>linux</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Creating Robust systemd Services for Embedded Applications</title>
      <dc:creator>Supun Sriyananda</dc:creator>
      <pubDate>Wed, 03 Jun 2026 09:51:04 +0000</pubDate>
      <link>https://dev.to/ranaweerasupun/creating-robust-systemd-services-for-embedded-applications-50fm</link>
      <guid>https://dev.to/ranaweerasupun/creating-robust-systemd-services-for-embedded-applications-50fm</guid>
      <description>&lt;p&gt;There is a moment every embedded Linux developer hits eventually. You have spent days building something that works beautifully — a sensor pipeline, a streaming server, an MQTT client — and then you reboot the device and everything is silent. Nothing started. You SSH in, manually run your script, and it all comes back to life. The hardware is fine. Your code is fine. You just have no way of automatically running it.&lt;/p&gt;

&lt;p&gt;That is the gap systemd fills. It is the init system on virtually every modern Linux distribution, and on embedded Linux systems like the Raspberry Pi it is what decides what runs at boot, what gets restarted if it crashes, and where all the logs go. Once you understand how to write a service file, your applications stop being fragile scripts you need to babysit and start being first-class system services that survive reboots, network drops, and unexpected crashes.&lt;/p&gt;

&lt;p&gt;This tutorial builds up from the simplest possible service file to a production-ready configuration, explaining every line along the way. By the end you will have a service running your own Python application, logging to the system journal, and automatically restarting itself after failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See the complete tutorial on GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ranaweerasupun/mini-tech-tutorials/tree/main/systemd-services-tutorial" rel="noopener noreferrer"&gt;Systemd Services Tutorial&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What systemd Actually Does
&lt;/h2&gt;

&lt;p&gt;Before writing any configuration, it helps to understand what problem systemd is solving, because the design of service files makes much more sense once you see the underlying model.&lt;/p&gt;

&lt;p&gt;When your Raspberry Pi boots, the Linux kernel starts and immediately hands control to process ID 1 — the very first user-space process. On modern systems, that process is &lt;em&gt;systemd&lt;/em&gt;. Everything that happens next — mounting filesystems, bringing up the network, starting your application — is orchestrated by systemd. It reads configuration files called &lt;strong&gt;unit files&lt;/strong&gt; that describe what should be started, when, in what order, and what to do if something goes wrong.&lt;/p&gt;

&lt;p&gt;A service file is just one type of unit file (there are also unit files for timers, sockets, mount points, and more, but services are what you will use most). When you tell systemd about your application through a service file, you are essentially saying: "here is my program, here is when I want it to run, and here is how I want you to manage it." systemd takes it from there — starting it at boot, watching it, restarting it if it dies, and capturing everything it prints to stdout and stderr into a structured log called the &lt;strong&gt;journal&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The central tool for interacting with systemd is &lt;em&gt;systemctl&lt;/em&gt;. You use it to start, stop, enable, disable, and inspect services. The companion tool &lt;em&gt;journalctl&lt;/em&gt; gives you access to the journal — the logs that systemd collects from every service it manages.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Minimal Service File
&lt;/h2&gt;

&lt;p&gt;Let us start with the simplest possible service to understand the structure, then build from there. Suppose you have a Python script at &lt;code&gt;/home/pi/mqtt_client/production_client.py&lt;/code&gt; that you want to run automatically at boot.&lt;/p&gt;

&lt;p&gt;Service files live in &lt;code&gt;/etc/systemd/system/&lt;/code&gt;. Create a new one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/systemd/system/mqtt-client.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the minimal version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Production MQTT Edge Client&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/python3 /home/pi/mqtt_client/production_client.py&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even this tiny file has three sections and each one has a specific job. The &lt;strong&gt;&lt;em&gt;[Unit]&lt;/em&gt;&lt;/strong&gt; section contains metadata and dependency declarations — the &lt;strong&gt;&lt;em&gt;Description&lt;/em&gt;&lt;/strong&gt; is just a human-readable label that appears in logs and status output. The &lt;strong&gt;&lt;em&gt;[Service]&lt;/em&gt;&lt;/strong&gt; section is where the actual execution configuration lives — right now we only have &lt;strong&gt;&lt;em&gt;ExecStart&lt;/em&gt;&lt;/strong&gt;, which is the command that launches your program. The &lt;strong&gt;&lt;em&gt;[Install]&lt;/em&gt;&lt;/strong&gt; section controls how the service integrates into the boot process — &lt;code&gt;WantedBy=multi-user.target&lt;/code&gt; means "start this service when the system reaches the normal multi-user state", which is essentially "start this at boot when the system is ready for normal operation."&lt;/p&gt;

&lt;p&gt;To activate it, you need to do two things: &lt;strong&gt;reload&lt;/strong&gt; systemd so it picks up the new file, and &lt;strong&gt;enable&lt;/strong&gt; the service so it starts at boot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tell systemd to re-read all unit files&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload

&lt;span class="c"&gt;# Enable it to start at boot&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;mqtt-client.service

&lt;span class="c"&gt;# Start it right now without waiting for a reboot&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start mqtt-client.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check whether it is running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status mqtt-client.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see output that tells you the service state (&lt;code&gt;active (running)&lt;/code&gt; is what you want), the process ID, when it started, and the last few lines of log output. If it failed, the status output will usually tell you exactly why: an incorrect path, a Python import error, whatever the problem is.&lt;/p&gt;

&lt;p&gt;The difference between &lt;strong&gt;&lt;em&gt;enable&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;start&lt;/em&gt;&lt;/strong&gt; trips people up at first, so it is worth being explicit: &lt;strong&gt;&lt;em&gt;enable&lt;/em&gt;&lt;/strong&gt; creates a symlink that tells systemd to start the service at boot — it does not start it right now. &lt;strong&gt;&lt;em&gt;start&lt;/em&gt;&lt;/strong&gt; starts it immediately — it does not persist across reboots. In practice you almost always want both.&lt;/p&gt;




&lt;h2&gt;
  
  
  Making it Actually Robust: Restart Policies
&lt;/h2&gt;

&lt;p&gt;The minimal service above will start your application at boot, but if the application crashes — which happens, especially in edge environments with unreliable hardware or network — systemd will not do anything. It will just leave the service in a failed state. For an embedded device running unattended in the field, that is not acceptable.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;em&gt;Restart&lt;/em&gt;&lt;/strong&gt; directive in the &lt;strong&gt;&lt;em&gt;[Service]&lt;/em&gt;&lt;/strong&gt; section tells systemd what to do when your application exits. Here is what the options mean in practice. &lt;strong&gt;&lt;em&gt;no&lt;/em&gt;&lt;/strong&gt; means never restart (the default, and not what you want for production). &lt;strong&gt;&lt;em&gt;on-failure&lt;/em&gt;&lt;/strong&gt; means restart if the process exits with a non-zero exit code, is killed by a signal, or hits a timeout or watchdog expiry, but not if it exits cleanly with code 0. &lt;strong&gt;&lt;em&gt;always&lt;/em&gt;&lt;/strong&gt; means restart no matter how the process ended, even if it exits with code 0. &lt;strong&gt;&lt;em&gt;on-abnormal&lt;/em&gt;&lt;/strong&gt; means restart when the process is killed by a signal, times out, or trips the watchdog, but not when it exits on its own with any exit code, zero or non-zero.&lt;/p&gt;

&lt;p&gt;For most embedded applications, &lt;code&gt;on-failure&lt;/code&gt; is the right choice. It means "if something goes wrong and the program dies unexpectedly, bring it back", but it also means "if the program exits cleanly with code 0, leave it stopped." A deliberate &lt;strong&gt;&lt;em&gt;systemctl stop&lt;/em&gt;&lt;/strong&gt; never triggers a restart, whichever &lt;strong&gt;&lt;em&gt;Restart&lt;/em&gt;&lt;/strong&gt; value you choose.&lt;/p&gt;

&lt;p&gt;There is one more important setting to pair with &lt;strong&gt;&lt;em&gt;Restart&lt;/em&gt;&lt;/strong&gt;: &lt;code&gt;RestartSec&lt;/code&gt;, which sets how long to wait before restarting. The default is only 100 ms, which can cause problems if your service is crashing due to a dependency that is not ready yet (like the network). A delay of a few seconds gives things time to settle.&lt;/p&gt;

&lt;p&gt;Here is the updated service with restart policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Production MQTT Edge Client&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/python3 /home/pi/mqtt_client/production_client.py&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;on-failure&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
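
&lt;p&gt;The restart policies differ only in which kinds of exit trigger a restart. As a rough sketch, the table in the &lt;code&gt;systemd.service&lt;/code&gt; manual page can be written as a Python lookup. The exit-cause labels and the &lt;code&gt;will_restart&lt;/code&gt; helper are names made up for this illustration, not systemd identifiers, and the model ignores finer points such as &lt;code&gt;SuccessExitStatus&lt;/code&gt;.&lt;/p&gt;

```python
# Simplified model of the Restart= table in the systemd.service man page.
# The exit-cause labels are illustrative names, not systemd identifiers.
ALL_CAUSES = {"clean-exit", "unclean-exit", "unclean-signal", "timeout", "watchdog"}

RESTART_TRIGGERS = {
    "no": set(),
    "on-success": {"clean-exit"},
    "on-failure": {"unclean-exit", "unclean-signal", "timeout", "watchdog"},
    "on-abnormal": {"unclean-signal", "timeout", "watchdog"},
    "on-abort": {"unclean-signal"},
    "on-watchdog": {"watchdog"},
    "always": set(ALL_CAUSES),
}


def will_restart(policy, exit_cause):
    """Return True if the given Restart= policy restarts after this exit cause."""
    if exit_cause not in ALL_CAUSES:
        raise ValueError(f"unknown exit cause: {exit_cause}")
    return exit_cause in RESTART_TRIGGERS[policy]


if __name__ == "__main__":
    for policy in ("on-failure", "on-abnormal", "always"):
        causes = ", ".join(sorted(RESTART_TRIGGERS[policy]))
        print(f"{policy}: restarts after {causes}")
```

&lt;p&gt;Reading the lookup row by row makes the trade-off visible: &lt;code&gt;on-failure&lt;/code&gt; covers every unhappy ending while leaving a clean exit alone, which is usually what an unattended device needs.&lt;/p&gt;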






&lt;h2&gt;
  
  
  Declaring Dependencies
&lt;/h2&gt;

&lt;p&gt;This is where systemd really starts to earn its keep. Most embedded applications do not run in isolation — they depend on the network being up, a filesystem being mounted, or another service being ready. systemd lets you express these dependencies explicitly so that your service starts at the right time and in the right order.&lt;/p&gt;

&lt;p&gt;The two most commonly confused directives here are &lt;strong&gt;&lt;em&gt;After&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;Requires&lt;/em&gt;&lt;/strong&gt;. Think of them as answering different questions. &lt;code&gt;After=network.target&lt;/code&gt; answers the question "when should I start?": it is purely an ordering rule that tells systemd not to launch this service until the network target has been reached. &lt;code&gt;Requires=network.target&lt;/code&gt; answers the question "what must exist for me to function?": it pulls the other unit in when this service starts, refuses to start this service if that unit fails to start (when combined with &lt;strong&gt;&lt;em&gt;After&lt;/em&gt;&lt;/strong&gt;), and stops this service if that unit is explicitly stopped. Neither directive implies the other, so ordering and requirement have to be declared separately.&lt;/p&gt;

&lt;p&gt;You can use them together, and for network-dependent applications you almost always should. There is also &lt;strong&gt;&lt;em&gt;Wants&lt;/em&gt;&lt;/strong&gt;, which is a softer version of &lt;strong&gt;&lt;em&gt;Requires&lt;/em&gt;&lt;/strong&gt; — it expresses preference rather than hard dependency. If what &lt;strong&gt;&lt;em&gt;Wants&lt;/em&gt;&lt;/strong&gt; declares is not available, the service will still start rather than failing.&lt;/p&gt;

&lt;p&gt;For an MQTT client that needs the network to do anything useful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Production MQTT Edge Client&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;
&lt;span class="py"&gt;Wants&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/python3 /home/pi/mqtt_client/production_client.py&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;on-failure&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Wants&lt;/em&gt;&lt;/strong&gt; rather than &lt;strong&gt;&lt;em&gt;Requires&lt;/em&gt;&lt;/strong&gt; here is deliberate. The MQTT client already handles network loss internally — it has exponential backoff and an offline queue for exactly this scenario. So we do not want systemd to kill the service if the network drops; we just want to make sure we do not start before the network stack is even initialised.&lt;/p&gt;

&lt;p&gt;For cases where your application would be completely broken without a dependency — a database-backed service where the database is also managed by systemd, for example — use &lt;strong&gt;&lt;em&gt;Requires&lt;/em&gt;&lt;/strong&gt;. The distinction matters in practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Working Directory and Environment
&lt;/h2&gt;

&lt;p&gt;Two things that bite almost every developer on their first real systemd deployment are working directory and environment variables. When you run a script manually from your terminal, you have a working directory (usually your home folder) and a full set of environment variables inherited from your shell. systemd does not give you either of those by default: a system service starts with &lt;code&gt;/&lt;/code&gt; as its working directory and only a minimal set of environment variables.&lt;/p&gt;

&lt;p&gt;This matters because Python scripts often use relative paths like &lt;code&gt;./config.json&lt;/code&gt; or &lt;code&gt;./logs/&lt;/code&gt;, and those paths will fail when the working directory is not what you expect. Environment variables like &lt;strong&gt;PATH&lt;/strong&gt;, &lt;strong&gt;HOME&lt;/strong&gt;, and any custom variables your application reads from the environment will also be missing or wrong.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;WorkingDirectory&lt;/em&gt; solves the path problem, and &lt;em&gt;Environment&lt;/em&gt; or &lt;em&gt;EnvironmentFile&lt;/em&gt; solves the variable problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Production MQTT Edge Client&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;
&lt;span class="py"&gt;Wants&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="c"&gt;# Set the working directory so relative paths work correctly
&lt;/span&gt;&lt;span class="py"&gt;WorkingDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/pi/mqtt_client&lt;/span&gt;

&lt;span class="c"&gt;# Pass environment variables directly
&lt;/span&gt;&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;MQTT_BROKER_HOST=192.168.1.100&lt;/span&gt;
&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;MQTT_CLIENT_ID=warehouse_sensor_01&lt;/span&gt;

&lt;span class="c"&gt;# Or load them from a file (better for secrets)
# EnvironmentFile=/etc/mqtt-client/config.env
&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/python3 /home/pi/mqtt_client/production_client.py&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;on-failure&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;&lt;em&gt;EnvironmentFile&lt;/em&gt;&lt;/strong&gt; approach is worth knowing about even if you do not use it immediately. It lets you store configuration in a separate file — including secrets like passwords — outside the service file itself. A file like &lt;code&gt;/etc/mqtt-client/config.env&lt;/code&gt; contains simple &lt;code&gt;KEY=value&lt;/code&gt; lines and is loaded by systemd before starting the service. This keeps credentials out of version control and makes configuration changes possible without touching the service file.&lt;/p&gt;
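
&lt;p&gt;On the application side, reading these values takes only a few lines. The following is a minimal sketch, not the actual client code: &lt;code&gt;load_settings&lt;/code&gt; is a hypothetical helper, the default values are placeholders for local development, and &lt;code&gt;MQTT_BROKER_PORT&lt;/code&gt; is an extra variable assumed purely for illustration.&lt;/p&gt;

```python
import os


def load_settings(environ=os.environ):
    """Read broker settings from the environment, with development defaults."""
    return {
        # These two names match the Environment= lines in the unit file.
        "broker_host": environ.get("MQTT_BROKER_HOST", "localhost"),
        "client_id": environ.get("MQTT_CLIENT_ID", "dev_client"),
        # MQTT_BROKER_PORT is a hypothetical extra variable for illustration.
        "broker_port": int(environ.get("MQTT_BROKER_PORT", "1883")),
    }


if __name__ == "__main__":
    # Under systemd the values come from Environment= or EnvironmentFile=;
    # in a terminal they fall back to the defaults above.
    print(load_settings())
```

&lt;p&gt;Passing the mapping in as an argument keeps the function easy to test without touching the real process environment.&lt;/p&gt;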




&lt;h2&gt;
  
  
  Logging
&lt;/h2&gt;

&lt;p&gt;One of systemd's genuine gifts to embedded developers is centralised logging. Everything your application prints to stdout and stderr is automatically captured by the journal, with no log file configuration and no logrotate rules to write. The journal is structured, queryable in powerful ways, and survives reboots when persistent storage is enabled (&lt;code&gt;Storage=persistent&lt;/code&gt; in &lt;code&gt;journald.conf&lt;/code&gt;, or an existing &lt;code&gt;/var/log/journal&lt;/code&gt; directory).&lt;/p&gt;

&lt;p&gt;The basic command to read logs from your service is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Show all logs from your service&lt;/span&gt;
journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; mqtt-client.service

&lt;span class="c"&gt;# Follow logs in real-time (like tail -f)&lt;/span&gt;
journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; mqtt-client.service &lt;span class="nt"&gt;-f&lt;/span&gt;

&lt;span class="c"&gt;# Show only logs since the last boot&lt;/span&gt;
journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; mqtt-client.service &lt;span class="nt"&gt;-b&lt;/span&gt;

&lt;span class="c"&gt;# Show logs from the last hour&lt;/span&gt;
journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; mqtt-client.service &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"1 hour ago"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
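
&lt;p&gt;For those commands to show anything useful, the application has to write to stdout or stderr promptly. Plain &lt;code&gt;print()&lt;/code&gt; output can sit in a buffer when Python is not attached to a terminal (running with &lt;code&gt;python3 -u&lt;/code&gt; or setting &lt;code&gt;PYTHONUNBUFFERED=1&lt;/code&gt; avoids that), whereas the standard &lt;code&gt;logging&lt;/code&gt; module flushes its stream handler after every record. A minimal sketch, where &lt;code&gt;make_logger&lt;/code&gt; and the log format are assumptions made for illustration:&lt;/p&gt;

```python
import logging
import sys


def make_logger(name="mqtt-client", stream=None):
    """Build a logger that writes one line per record to stdout."""
    if stream is None:
        stream = sys.stdout
    handler = logging.StreamHandler(stream)
    # No timestamp in the format: the journal records its own.
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger = logging.getLogger(name)
    logger.handlers.clear()      # avoid duplicate lines if called twice
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger()
    log.info("connected to broker")
    log.warning("publish queue is filling up")
```

&lt;p&gt;Each line then shows up under &lt;code&gt;journalctl -u mqtt-client.service&lt;/code&gt; with the journal's own timestamp attached.&lt;/p&gt;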



&lt;p&gt;For embedded devices with limited storage, it is worth explicitly configuring how much disk space the journal is allowed to use. You do this not in the service file but in the journal configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/systemd/journald.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add or modify these lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Journal]&lt;/span&gt;
&lt;span class="c"&gt;# Maximum journal size on disk
&lt;/span&gt;&lt;span class="py"&gt;SystemMaxUse&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;50M&lt;/span&gt;

&lt;span class="c"&gt;# Maximum size of a single journal file
&lt;/span&gt;&lt;span class="py"&gt;SystemMaxFileSize&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10M&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then restart the journal daemon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart systemd-journald
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For embedded deployments where the device has a small SD card or eMMC, keeping the journal small is important. Fifty megabytes is a reasonable limit for most edge devices.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Production-Ready Example
&lt;/h2&gt;

&lt;p&gt;Putting all of this together, here is what the service file looks like for a real deployment — the kind of configuration you would use for a long-running embedded application that needs to be reliable in the field. This is modelled after exactly how I deploy the MQTT edge client and the WebRTC streaming service on the telepresence robot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Production MQTT Edge Client with Offline Resilience&lt;/span&gt;
&lt;span class="c"&gt;# Human-readable detail shown in status output and logs
&lt;/span&gt;&lt;span class="py"&gt;Documentation&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;https://github.com/ranaweerasupun/mqtt-production-client&lt;/span&gt;

&lt;span class="c"&gt;# Start after the network is available, but do not require it —
# the application handles network loss internally with its own
# backoff and offline queue
&lt;/span&gt;&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;
&lt;span class="py"&gt;Wants&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="c"&gt;# If the service restarts more than 5 times in 60 seconds, give up
# This prevents an infinite crash-restart-crash loop from overwhelming the system
# (these belong in [Unit] on systemd v230+)
&lt;/span&gt;&lt;span class="py"&gt;StartLimitIntervalSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;60&lt;/span&gt;
&lt;span class="py"&gt;StartLimitBurst&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="c"&gt;# Run as a specific user rather than root — better security practice
&lt;/span&gt;&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;pi&lt;/span&gt;
&lt;span class="py"&gt;Group&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;pi&lt;/span&gt;

&lt;span class="c"&gt;# Set working directory so relative paths in the application work
&lt;/span&gt;&lt;span class="py"&gt;WorkingDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/pi/mqtt_client&lt;/span&gt;

&lt;span class="c"&gt;# Load configuration from a separate file
# This keeps secrets out of the service file
&lt;/span&gt;&lt;span class="py"&gt;EnvironmentFile&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/etc/mqtt-client/config.env&lt;/span&gt;

&lt;span class="c"&gt;# The command that starts the application
&lt;/span&gt;&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/python3 /home/pi/mqtt_client/production_client.py&lt;/span&gt;

&lt;span class="c"&gt;# What to do when the process exits unexpectedly
# on-failure: restart if the process crashes or exits non-zero
# but NOT if it exits cleanly with status 0
&lt;/span&gt;&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;on-failure&lt;/span&gt;

&lt;span class="c"&gt;# Wait 5 seconds before restarting
# This prevents rapid restart loops if something is fundamentally broken
&lt;/span&gt;&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;

&lt;span class="c"&gt;# Give the service up to 30 seconds to stop gracefully before killing it
&lt;/span&gt;&lt;span class="py"&gt;TimeoutStopSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;30&lt;/span&gt;

&lt;span class="c"&gt;# Capture stdout and stderr to the journal
&lt;/span&gt;&lt;span class="py"&gt;StandardOutput&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;journal&lt;/span&gt;
&lt;span class="py"&gt;StandardError&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;journal&lt;/span&gt;

&lt;span class="c"&gt;# Tag journal entries with this identifier for easy filtering
&lt;/span&gt;&lt;span class="py"&gt;SyslogIdentifier&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;mqtt-client&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;&lt;em&gt;StartLimitIntervalSec&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;StartLimitBurst&lt;/em&gt;&lt;/strong&gt; combination is worth understanding because it solves a real problem. Imagine your MQTT client has a bug that makes it crash immediately on startup, perhaps a malformed config file or a missing dependency. Without a limit that matches your restart delay, systemd would restart it after every &lt;code&gt;RestartSec&lt;/code&gt; pause, it would crash again, and this would loop indefinitely, consuming CPU and filling your journal with crash logs. With &lt;code&gt;StartLimitBurst=5&lt;/code&gt; and &lt;code&gt;StartLimitIntervalSec=60&lt;/code&gt;, systemd allows five start attempts within a 60-second window; if the service needs a sixth, systemd marks it as failed and stops trying. At that point &lt;code&gt;systemctl status&lt;/code&gt; will clearly tell you the service has hit its start limit, which prompts you to actually investigate the root cause.&lt;/p&gt;
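
&lt;p&gt;How the start limit interacts with &lt;code&gt;RestartSec&lt;/code&gt; is easier to see with numbers. The sketch below is a simplified fixed-window model of the rate limit, not systemd's actual code: it assumes the service crashes the instant it starts, and &lt;code&gt;starts_before_limit&lt;/code&gt; is a hypothetical helper written for this example.&lt;/p&gt;

```python
def starts_before_limit(restart_sec, interval_sec, burst, max_starts=100):
    """Count the starts allowed before the rate limit refuses one.

    Simplified fixed-window model: the service crashes instantly, so start
    attempts are spaced restart_sec apart. Returns None if the limit never
    trips within max_starts attempts.
    """
    window_start = 0.0
    count = 0
    for attempt in range(max_starts):
        now = attempt * restart_sec
        if count == 0 or now - window_start > interval_sec:
            window_start, count = now, 0   # a fresh window begins
        if count >= burst:
            return attempt                 # this start would be refused
        count += 1
    return None


if __name__ == "__main__":
    # Values from the unit file above: RestartSec=5, 5 starts per 60 seconds.
    print(starts_before_limit(restart_sec=5, interval_sec=60, burst=5))
    # Same restart delay against a 10-second window.
    print(starts_before_limit(restart_sec=5, interval_sec=10, burst=5))
```

&lt;p&gt;Under this model the first call prints 5 and the second prints None. A 10-second window, which is systemd's documented default interval, never sees five starts when they are five seconds apart, so the interval has to be widened to suit the chosen &lt;code&gt;RestartSec&lt;/code&gt;.&lt;/p&gt;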

&lt;p&gt;The &lt;code&gt;User=pi&lt;/code&gt; directive is also important for production embedded deployments. Running services as root is a security risk — if your MQTT client has a vulnerability, an attacker who exploits it gets root access. Running as a non-privileged user limits the damage. The tradeoff is that you need to make sure that user has permission to access the files and ports your application needs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Useful Commands to Know
&lt;/h2&gt;

&lt;p&gt;Once your service is running, a handful of commands will cover most of what you need day-to-day:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check current status, recent logs, and whether it is enabled at boot&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status mqtt-client.service

&lt;span class="c"&gt;# Start the service right now&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start mqtt-client.service

&lt;span class="c"&gt;# Stop the service (systemd never auto-restarts a unit you stop manually, whatever Restart= says)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop mqtt-client.service

&lt;span class="c"&gt;# Restart the service (useful after changing your application code)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart mqtt-client.service

&lt;span class="c"&gt;# Reload the service file after you edit it (then restart to apply changes)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart mqtt-client.service

&lt;span class="c"&gt;# Disable the service from starting at boot&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl disable mqtt-client.service

&lt;span class="c"&gt;# View full log history for this service&lt;/span&gt;
journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; mqtt-client.service

&lt;span class="c"&gt;# View logs in real-time&lt;/span&gt;
journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; mqtt-client.service &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One workflow note: whenever you edit the service file, you must run &lt;code&gt;daemon-reload&lt;/code&gt; before the changes take effect. systemd caches unit file contents and will not notice your edits otherwise. Forgetting this step is a common source of confusion when your changes do not seem to be working.&lt;/p&gt;




&lt;h2&gt;
  
  
  What systemd Gives You for Free
&lt;/h2&gt;

&lt;p&gt;It is worth pausing to appreciate what you get from writing a proper service file, because the alternative — ad hoc startup scripts in &lt;code&gt;/etc/rc.local&lt;/code&gt; or cron &lt;code&gt;@reboot&lt;/code&gt; jobs — gives you almost none of it.&lt;/p&gt;

&lt;p&gt;With a systemd service you get automatic restart on crash, ordered startup relative to other system components, centralised and queryable logging, a clean mechanism to start and stop the application during development, restart rate limiting to prevent crash loops, graceful shutdown handling, and the ability to run as a non-root user easily. All of that from a text file that is, at its core, fewer than twenty lines.&lt;/p&gt;

&lt;p&gt;For an embedded device deployed in a factory, a warehouse, or anywhere else it needs to run unattended for months at a time, that reliability infrastructure is not optional. systemd gives it to you essentially for free, as long as you take the time to describe your service correctly.&lt;/p&gt;

</description>
      <category>systemd</category>
      <category>linux</category>
      <category>embedded</category>
      <category>devops</category>
    </item>
    <item>
      <title>Finishing What I Started — From a TODO List to a Published PyPI Package</title>
      <dc:creator>Supun Sriyananda</dc:creator>
      <pubDate>Thu, 28 May 2026 13:03:03 +0000</pubDate>
      <link>https://dev.to/ranaweerasupun/finishing-what-i-started-from-a-todo-list-to-a-published-pypi-package-38bh</link>
      <guid>https://dev.to/ranaweerasupun/finishing-what-i-started-from-a-todo-list-to-a-published-pypi-package-38bh</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/github-2026-05-21"&gt;GitHub Finish-Up-A-Thon Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;robmqtt&lt;/strong&gt; — a resilient MQTT client for edge IoT devices, now published on PyPI.&lt;/p&gt;

&lt;p&gt;It came out of a problem that cost me real time. I deploy sensor systems on hardware that lives in the field: Raspberry Pi units on 4G cellular, embedded gateways, battery monitoring systems. On those networks, connectivity is never stable. And the standard MQTT client, &lt;code&gt;paho-mqtt&lt;/code&gt;, silently drops messages when the broker is unreachable — no error, no warning, no log entry unless you write one yourself.&lt;/p&gt;

&lt;p&gt;I lost data for a long time before I understood why. And finding the cause was brutal. There are no proper logs unless you explicitly add them, so I spent hours — days — chasing a problem that left no trace. I went through forums, blogs, chat AIs, everywhere, looking for how other people had solved it. What I found was the same thing every time: people asking for a &lt;em&gt;production-resilient&lt;/em&gt; MQTT client, and no real answer. Just scattered advice and code snippets that handled one piece and ignored the rest.&lt;/p&gt;

&lt;p&gt;So I built it myself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Offline queue&lt;/strong&gt; — when the broker is unreachable, messages are written to SQLite instead of being dropped. They survive process restarts and power cycles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inflight tracking&lt;/strong&gt; — messages sent but not yet acknowledged are tracked separately and re-sent on reconnect, closing a gap that even QoS 1 leaves open.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority eviction&lt;/strong&gt; — when the queue fills, low-priority telemetry is evicted before critical alerts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exponential backoff&lt;/strong&gt; — a fleet of devices reconnecting after an outage won't all hammer the broker at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLS support&lt;/strong&gt; — for brokers that require encrypted connections.&lt;/li&gt;
&lt;/ul&gt;
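&lt;p&gt;To make the backoff idea concrete, here is a minimal sketch. It is not robmqtt's actual code: the function name &lt;code&gt;backoff_delay&lt;/code&gt;, the 1-second base, the 60-second cap, and the random jitter are all assumptions chosen for illustration. The jitter matters because plain doubling alone would still have every device in a fleet retrying at the same instants.&lt;/p&gt;

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Return a reconnect delay in seconds for the given attempt (0-based).

    Hypothetical helper: the base, cap and jitter strategy are illustrative choices.
    """
    # Double the ceiling on every failed attempt, but never exceed the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Pick uniformly in [0, ceiling] so devices spread their retries out.
    return rng.uniform(0, ceiling)


rng = random.Random(42)  # seeded only so the demo is repeatable
for attempt in range(8):
    print(attempt, round(backoff_delay(attempt, rng=rng), 2))
```

&lt;p&gt;In a reconnect loop you would sleep for the returned delay after each failed attempt and reset &lt;code&gt;attempt&lt;/code&gt; to zero once a connection succeeds.&lt;/p&gt;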

&lt;p&gt;And then everything a library needs to actually be used in production, not just by me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TLS and mutual TLS&lt;/strong&gt; plus username/password auth, for secured brokers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bidirectional messaging&lt;/strong&gt; — &lt;code&gt;subscribe&lt;/code&gt;/&lt;code&gt;unsubscribe&lt;/code&gt; with full MQTT wildcard support, and subscriptions that survive reconnects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; — a built-in HTTP health check endpoint (&lt;code&gt;/health&lt;/code&gt; returning healthy / degraded / unhealthy) with Docker &lt;code&gt;HEALTHCHECK&lt;/code&gt; and Kubernetes liveness-probe examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured logging&lt;/strong&gt; throughout — so the next person doesn't lose days to a problem with no trace, the way I did.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;73 tests&lt;/strong&gt; across the storage, tracking, topic-matching, and client layers.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;robmqtt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters to me because it's not a toy. It's running on real systems I maintain — battery management monitoring and robotics telemetry — where a gap in the data record has actual consequences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/robmqtt/" rel="noopener noreferrer"&gt;https://pypi.org/project/robmqtt/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ranaweerasupun/resilient-edge-mqtt-client" rel="noopener noreferrer"&gt;https://github.com/ranaweerasupun/resilient-edge-mqtt-client&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most satisfying way to see it work is to watch the offline queue survive a broker outage. The repo includes a simulation script for exactly this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1 — run a simulated device publishing every 5 seconds&lt;/span&gt;
python test_13.py

&lt;span class="c"&gt;# Terminal 2 — kill the broker mid-run&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop mosquitto
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The device keeps publishing. Messages start queuing to SQLite instead of erroring. Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Bring the broker back&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start mosquitto
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The queue drains automatically, in priority order. Every message that piled up during the outage is delivered. Nothing is lost.&lt;/p&gt;
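&lt;p&gt;For readers who want to see the shape of a priority-ordered queue in SQLite, here is a minimal sketch under stated assumptions. The table layout, the &lt;code&gt;enqueue&lt;/code&gt; and &lt;code&gt;drain&lt;/code&gt; helpers, and the rule that a higher number means higher priority are all invented for this example and are not taken from robmqtt's source.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real device would use a file path
conn.execute(
    "CREATE TABLE queue ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " topic TEXT NOT NULL,"
    " payload BLOB NOT NULL,"
    " priority INTEGER NOT NULL)"
)

MAX_QUEUE_SIZE = 3  # tiny on purpose, to force an eviction


def enqueue(topic: str, payload: bytes, priority: int) -> None:
    """Store a message, evicting the least important row if the queue is full."""
    (count,) = conn.execute("SELECT COUNT(*) FROM queue").fetchone()
    if count >= MAX_QUEUE_SIZE:
        # Assumption: lowest number = least important; oldest goes first among equals.
        conn.execute(
            "DELETE FROM queue WHERE id = ("
            " SELECT id FROM queue ORDER BY priority ASC, id ASC LIMIT 1)"
        )
    conn.execute(
        "INSERT INTO queue (topic, payload, priority) VALUES (?, ?, ?)",
        (topic, payload, priority),
    )
    conn.commit()


def drain() -> list:
    """Return queued messages, highest priority first, then oldest first."""
    rows = conn.execute(
        "SELECT topic, payload FROM queue ORDER BY priority DESC, id ASC"
    ).fetchall()
    conn.execute("DELETE FROM queue")
    conn.commit()
    return rows


enqueue("sensors/temperature", b"21.5", priority=1)
enqueue("sensors/temperature", b"21.6", priority=1)
enqueue("alerts/battery", b"LOW", priority=9)
enqueue("sensors/temperature", b"21.7", priority=1)  # queue is full: the 21.5 row is evicted
drained = drain()
print(drained)
# [('alerts/battery', b'LOW'), ('sensors/temperature', b'21.6'), ('sensors/temperature', b'21.7')]
```

&lt;p&gt;A production queue would also need to decide what happens when the incoming message is less important than everything already stored; this sketch simply evicts the current minimum.&lt;/p&gt;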

&lt;p&gt;Basic usage looks like this — the application code never has to know whether the broker is reachable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;robmqtt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProductionMQTTClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProductionMQTTClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;field_device_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;broker_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mqtt.yourdomain.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;broker_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1883&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_queue_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./device.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sensors/temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_sensor&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u9zdfkxtrpi92ce4mpe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u9zdfkxtrpi92ce4mpe.png" alt="robmqtt architecture" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrif4hwjmz99eyy6kgbh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrif4hwjmz99eyy6kgbh.png" alt="messages publish" width="622" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum0srn1ehrnn3pa0c2r0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum0srn1ehrnn3pa0c2r0.png" alt="sqlite queue drained" width="626" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Comeback Story
&lt;/h2&gt;

&lt;p&gt;Here's the honest before-and-after.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before.&lt;/strong&gt; Once I'd finally figured out the code that solved my problem, I used it — and it went into my GitHub repo, where it sat. It worked. It solved the thing that had cost me days. But no one knew it existed.&lt;/p&gt;

&lt;p&gt;And that was the part that nagged at me. I'd keep seeing people online asking for exactly what I'd built — something resilient enough for production — and getting the same non-answers I'd gotten: random advice, half-solutions, snippets. I had the answer sitting in a corner of my repo, and a thought stuck in the corner of my head: &lt;em&gt;it's right here. I need to show this off.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The things I knew I should do but hadn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Package it for PyPI so people could &lt;code&gt;pip install&lt;/code&gt; it instead of cloning the repo&lt;/li&gt;
&lt;li&gt;Add TLS/mTLS support for secure broker connections&lt;/li&gt;
&lt;li&gt;Expose a metrics endpoint for observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was the classic unfinished side project. Functional, but not &lt;em&gt;shipped&lt;/em&gt;. Not something anyone other than me could actually use. The plan was always "publish it once it's polished" — and polishing it was the part that kept not happening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hard part.&lt;/strong&gt; I have a full-time job. Finishing this wasn't something I could do in work hours. So it happened on weekends, late at night, and in the gaps — I drew flowcharts and planned the package structure on my commute, in my head, on paper. Progress came in small pieces, squeezed around everything else. There were plenty of stretches where it would have been easier to just leave it in the repo and move on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After.&lt;/strong&gt; I pushed it over the line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Published to PyPI as &lt;code&gt;robmqtt&lt;/code&gt; v1.0.0.&lt;/strong&gt; This was more work than I expected — restructuring the code into a proper installable package, writing the &lt;code&gt;pyproject.toml&lt;/code&gt;, sorting out the module layout, testing the install in a clean environment.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Added TLS support.&lt;/strong&gt; The client now takes &lt;code&gt;use_tls&lt;/code&gt;, &lt;code&gt;ca_certs&lt;/code&gt;, &lt;code&gt;certfile&lt;/code&gt;, &lt;code&gt;keyfile&lt;/code&gt;, and broker auth parameters, so it works with secured brokers, not just local plaintext ones.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability.&lt;/strong&gt; Built an HTTP health check endpoint with Docker and Kubernetes probe examples, so the library reports its own health instead of failing silently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structured logging,&lt;/strong&gt; so the next person doesn't lose days to a problem with no trace, the way I did.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data integrity.&lt;/strong&gt; Fixed three bugs I'm glad I caught before anyone relied on it: binary payloads being corrupted by string conversion, a resend-tracking gap that could silently lose messages on a second disconnect, and a race condition in &lt;code&gt;publish()&lt;/code&gt; that could drop a message between the connection check and the send.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Started building a real project on top of it.&lt;/strong&gt; To put the library through its paces in a full system, I'm building an IoT fleet analytics platform around it — simulated devices publishing through robmqtt, an MQTT-to-InfluxDB bridge, Grafana dashboards, and statistical anomaly detection. It's still in progress, but it's already doing what mattered most: proving the library holds up as the foundation of a real pipeline, not just in isolated tests.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
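&lt;p&gt;The binary-payload bug in the list above is easy to reproduce in isolation. The snippet below is not the original robmqtt code, just one common way this class of bug shows up: calling &lt;code&gt;str()&lt;/code&gt; on a &lt;code&gt;bytes&lt;/code&gt; object stores its printable representation rather than its content. The table name and payload are made up for the demo.&lt;/p&gt;

```python
import sqlite3

payload = bytes([0x00, 0xFF, 0x10, 0x80])  # e.g. a packed sensor frame

# The lossy path: str() turns bytes into their repr, not their content.
as_text = str(payload)
print(as_text)                      # b'\x00\xff\x10\x80'
print(as_text.encode() == payload)  # False

# The safe path: bind the bytes directly so SQLite stores them as a BLOB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, payload BLOB)")
conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))
(stored,) = conn.execute("SELECT payload FROM queue").fetchone()
print(stored == payload)            # True
```

&lt;p&gt;Text payloads such as JSON hide this problem because they survive a string round trip, which is why it tends to surface only once someone publishes raw binary.&lt;/p&gt;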

&lt;p&gt;When it was finally published and I ran the test scripts and watched the offline queue fill and drain exactly the way it was supposed to — messages held through an outage, then delivered, nothing lost — there was nothing more rewarding. The thing that had cost me days of frustration was now one &lt;code&gt;pip install&lt;/code&gt; away for anyone who hits the same wall I did.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Experience with GitHub Copilot
&lt;/h2&gt;

&lt;p&gt;I'll be honest about my workflow, because the reason I relied on Copilot is specific.&lt;/p&gt;

&lt;p&gt;I use a few AI tools when I build things. The general-purpose chat assistants are useful for sketching out an approach or talking through a design — but when it came to actual code, they kept giving me snippets that didn't quite work. That's not surprising: they can't see my project. They don't know my file structure, my variable names, the exact version of a library I'm using, or how the piece they're suggesting fits the rest of the code. So I'd paste in a snippet, hit an error, paste the error back, get a revised snippet, hit another error. Going back and forth with a chat window that can't see my codebase got slow and frustrating.&lt;/p&gt;

&lt;p&gt;Copilot was different because it lives in my editor and can see everything — all my open files, the actual code around the cursor, the real context. That's why it became the tool I leaned on.&lt;/p&gt;

&lt;p&gt;Two ways I used it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inline completion, always on.&lt;/strong&gt; As I typed, Copilot autocompleted the repetitive, structural parts — the device profile dictionaries, the JSON payload construction, the boilerplate around &lt;code&gt;publish()&lt;/code&gt; calls with their QoS and priority arguments. Because it could see the patterns already in my file, its suggestions actually fit, instead of being generically plausible code I'd have to rewrite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copilot Chat for debugging in context.&lt;/strong&gt; This is where it earned its place. When something broke, I'd ask it directly — &lt;em&gt;"Why am I getting this error, and how can I fix it?"&lt;/em&gt; or &lt;em&gt;"Can you check this code snippet?"&lt;/em&gt; — and because it could read my actual files, the answers were grounded in my real code, not a guess about what my code might look like. That's the difference that made me stop pasting snippets into external chat windows. The tool that can see your codebase gives you answers that apply to your codebase.&lt;/p&gt;

&lt;p&gt;That freed me to spend my real thinking time on the parts that mattered: the sine-wave drift model for realistic sensor data, the priority eviction logic, and the inflight-tracking design that closes the QoS 1 gap. The hard design decisions were mine. Copilot removed the friction around them — the typing, the boilerplate, and the slow debugging loop — so I could stay focused on the actual engineering.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by Supun Sriyananda — R&amp;amp;D Engineer working on embedded and IoT systems. robmqtt is open source on &lt;a href="https://github.com/ranaweerasupun/resilient-edge-mqtt-client" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and &lt;a href="https://pypi.org/project/robmqtt/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>githubchallenge</category>
      <category>python</category>
      <category>iot</category>
    </item>
    <item>
      <title>How to Generate Realistic IoT Sensor Data for Testing Your MQTT Pipeline</title>
      <dc:creator>Supun Sriyananda</dc:creator>
      <pubDate>Thu, 28 May 2026 11:02:48 +0000</pubDate>
      <link>https://dev.to/ranaweerasupun/how-to-generate-realistic-iot-sensor-data-for-testing-your-mqtt-pipeline-1a2h</link>
      <guid>https://dev.to/ranaweerasupun/how-to-generate-realistic-iot-sensor-data-for-testing-your-mqtt-pipeline-1a2h</guid>
      <description>&lt;h2&gt;
  
  
  How to Generate Realistic IoT Sensor Data for Testing Your MQTT Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This is part 2 of a series on building robmqtt. &lt;a href="https://dev.to/ranaweerasupun/why-your-mqtt-client-is-silently-losing-messages-and-how-i-fixed-it-robmqtt-4n4k"&gt;Part 1&lt;/a&gt; covered why paho-mqtt silently drops messages and the library I built to fix it. This part is about testing — how to exercise an MQTT pipeline without deploying physical hardware.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In the last post I wrote about robmqtt, a resilient MQTT client for edge devices. Once I had it working, I hit the obvious next problem: how do I test it properly without setting up a rack of Raspberry Pis?&lt;/p&gt;

&lt;p&gt;I needed data flowing through the pipeline. Lots of it. From many devices. Behaving differently. And ideally surviving a broker outage so I could watch the offline queue do its job.&lt;/p&gt;

&lt;p&gt;So I wrote a device simulator. And in writing it, I learned that generating &lt;em&gt;realistic&lt;/em&gt; fake sensor data is harder than it looks — and that most people get it wrong in the same way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The trap: random data isn't realistic data
&lt;/h2&gt;

&lt;p&gt;The first instinct when simulating a sensor is to reach for &lt;code&gt;random.uniform()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# don't do this
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces data that looks nothing like a real sensor. Real sensor readings don't jump randomly across the whole range every second. A CPU sitting at 8% doesn't suddenly read 94% then drop to 3%. Temperature drifts slowly. Signal strength wobbles around a baseline. There's continuity from one reading to the next.&lt;/p&gt;

&lt;p&gt;If you test your pipeline with pure random noise, your charts look like static, your anomaly detection has nothing meaningful to detect, and your dashboards are useless for spotting whether anything actually works.&lt;/p&gt;

&lt;p&gt;I wanted data that &lt;em&gt;looked&lt;/em&gt; like it came from a real device.&lt;/p&gt;




&lt;h2&gt;
  
  
  Realistic drift with a sine wave
&lt;/h2&gt;

&lt;p&gt;The trick I landed on was a slow sine wave with per-device random phase, plus a little Gaussian noise on top:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_cpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;
    &lt;span class="n"&gt;drift&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_drift_phase&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;noise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gauss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_variance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;drift&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;noise&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;99.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things are happening here:&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;sine wave&lt;/strong&gt; (&lt;code&gt;math.sin(time.time() / 300 ...)&lt;/code&gt;) creates a slow, smooth oscillation of up to ±5 percentage points around the baseline. Dividing the time by 300 gives one full cycle every 2π × 300 ≈ 1,885 seconds, or about 31 minutes. This is the kind of gradual drift you see in real systems as load rises and falls.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;phase offset&lt;/strong&gt; (&lt;code&gt;self._drift_phase&lt;/code&gt;, a random value set once when the device starts) means every device is at a different point in its cycle. Without it, all your simulated devices would drift up and down in perfect unison, which is a dead giveaway that the data is fake.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_drift_phase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;Gaussian noise&lt;/strong&gt; (&lt;code&gt;random.gauss&lt;/code&gt;) adds small reading-to-reading variation on top of the drift. Real sensors are never perfectly smooth — there's always measurement jitter.&lt;/p&gt;

&lt;p&gt;The result is data that drifts, wobbles, and stays within a believable range — and each device has its own personality. When you chart it, it looks like telemetry, not like a random number generator.&lt;/p&gt;
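&lt;p&gt;Here is a standalone version of the same idea that you can run outside the simulator class. It is a sketch with a few changes made for the demo: the time is passed in as a parameter instead of read from &lt;code&gt;time.time()&lt;/code&gt;, and the random generator is seeded so the output is repeatable. The function name &lt;code&gt;cpu_reading&lt;/code&gt; is made up for this example.&lt;/p&gt;

```python
import math
import random


def cpu_reading(t: float, base: float, variance: float, phase: float, rng: random.Random) -> float:
    """CPU percentage at time t (seconds): baseline + slow drift + jitter."""
    drift = 5 * math.sin(t / 300 + phase)  # one full cycle every 2*pi*300 seconds
    noise = rng.gauss(0, variance / 2)     # reading-to-reading jitter
    return round(max(1.0, min(99.0, base + drift + noise)), 1)


rng = random.Random(7)               # seeded so the demo is repeatable
phase = rng.uniform(0, 2 * math.pi)  # one random phase per device

period_s = 2 * math.pi * 300
print(f"drift period: {period_s / 60:.1f} minutes")  # drift period: 31.4 minutes

# Sample one reading a minute for ten minutes, using the sensor profile's numbers.
readings = [cpu_reading(t, base=8, variance=6, phase=phase, rng=rng) for t in range(0, 600, 60)]
print(readings)
```

&lt;p&gt;Passing the time in explicitly also makes the generator easy to unit test, because you can replay any window without waiting for the wall clock.&lt;/p&gt;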




&lt;h2&gt;
  
  
  Device profiles: a camera is not a sensor
&lt;/h2&gt;

&lt;p&gt;A real fleet isn't 15 copies of the same device. A camera runs hot and busy. A simple sensor sips power and idles. A gateway sits in between. If your simulator treats them all identically, your test data doesn't reflect anything real.&lt;/p&gt;

&lt;p&gt;So each device type gets a profile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;DEVICE_PROFILES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sensor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_variance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;42.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp_variance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;telemetry_interval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failure_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;camera&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_variance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;61.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp_variance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;telemetry_interval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failure_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# gateway, controller ...
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A camera baselines at 65% CPU and 61°C, publishing every 5 seconds. A sensor baselines at 8% CPU and 42°C, publishing every 10 seconds. When this data lands in a dashboard, the device types are visibly different — exactly like a real deployment, where you can often guess a device's role just from its resource profile.&lt;/p&gt;
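
&lt;p&gt;As a rough illustration of how a profile could turn into readings, here is a minimal sketch. The &lt;code&gt;sample_reading&lt;/code&gt; helper is hypothetical, and treating each variance as a symmetric uniform band around its base is an assumption, not necessarily how the simulator does it:&lt;/p&gt;

```python
import random

# Field names mirror DEVICE_PROFILES above; the sampling rule is an assumption.
SENSOR_PROFILE = {
    "cpu_base": 8, "cpu_variance": 6,
    "temp_base": 42.0, "temp_variance": 5.0,
}


def sample_reading(profile, rng=random):
    """Return one telemetry reading drawn around the profile's baselines."""
    cpu = profile["cpu_base"] + rng.uniform(
        -profile["cpu_variance"], profile["cpu_variance"]
    )
    temp = profile["temp_base"] + rng.uniform(
        -profile["temp_variance"], profile["temp_variance"]
    )
    return {
        # Clamp CPU so it always stays a valid percentage.
        "cpu_percent": round(min(100.0, max(0.0, cpu)), 1),
        "temperature_c": round(temp, 2),
    }


reading = sample_reading(SENSOR_PROFILE)
print(reading)
```

&lt;p&gt;Passing a seeded &lt;code&gt;random.Random&lt;/code&gt; instance as &lt;code&gt;rng&lt;/code&gt; makes a run repeatable, which helps when you want the same test data twice.&lt;/p&gt;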

&lt;p&gt;The &lt;code&gt;failure_rate&lt;/code&gt; field controls how often the device injects an anomaly — a sudden CPU and temperature spike — so there's something for downstream anomaly detection to actually find:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_is_anomaly&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature_c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anomaly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
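
&lt;p&gt;The body of &lt;code&gt;_is_anomaly()&lt;/code&gt; isn't shown above. One plausible version, assuming &lt;code&gt;failure_rate&lt;/code&gt; is the probability that any single reading is an anomaly, is one random draw per reading. The standalone &lt;code&gt;is_anomaly&lt;/code&gt; function below is a hypothetical stand-in for the method:&lt;/p&gt;

```python
import random


def is_anomaly(failure_rate, rng=random):
    """Return True for roughly `failure_rate` of calls (one draw per reading)."""
    # rng.random() is uniform on [0, 1), so this is True with
    # probability failure_rate.
    return failure_rate > rng.random()


# With the sensor profile's failure_rate of 0.03, about 3 readings in 100 spike.
rng = random.Random(42)
spikes = sum(is_anomaly(0.03, rng) for _ in range(10_000))
print(f"{spikes} anomalies in 10,000 readings")
```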






&lt;h2&gt;
  
  
  Using robmqtt in the simulator
&lt;/h2&gt;

&lt;p&gt;This is where the simulator doubles as a usage example for the library. Each simulated device is a real robmqtt client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;robmqtt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProductionMQTTClient&lt;/span&gt;

&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProductionMQTTClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fleet_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;broker_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;broker_host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;broker_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;broker_port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_queue_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;min_backoff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_backoff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;log_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./logs/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each device publishes three kinds of message, and the QoS and priority differ by importance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Telemetry — frequent, can tolerate eviction under pressure
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fleet/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/telemetry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Status — operational health, higher priority
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fleet/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Boot/alert events — must not be lost, highest priority, QoS 2
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fleet/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the priority system from part 1 in action. If the broker goes down and the offline queue fills, routine telemetry (priority 5) gets evicted before status messages (priority 8), and event messages (priority 10, QoS 2) are protected.&lt;/p&gt;
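
&lt;p&gt;To make the eviction rule concrete, here is a toy bounded queue. It is a sketch of the idea only, not robmqtt's implementation: the &lt;code&gt;BoundedPriorityQueue&lt;/code&gt; class, its table layout, and the tie-breaking rule (evict the oldest row among the lowest priority, and drop the newcomer if everything queued outranks it) are all assumptions made for illustration.&lt;/p&gt;

```python
import sqlite3
import time


class BoundedPriorityQueue:
    """Toy offline queue: when full, evict the lowest-priority, oldest row."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE queue (id INTEGER PRIMARY KEY, topic TEXT, "
            "payload TEXT, qos INTEGER, priority INTEGER, timestamp REAL)"
        )

    def size(self):
        return self.db.execute("SELECT COUNT(*) FROM queue").fetchone()[0]

    def enqueue(self, topic, payload, qos, priority):
        """Store a message; return False if it was dropped instead."""
        if self.size() >= self.max_size:
            victim = self.db.execute(
                "SELECT id, priority FROM queue "
                "ORDER BY priority ASC, timestamp ASC, id ASC LIMIT 1"
            ).fetchone()
            # Never evict something more important than the newcomer.
            if victim is None or victim[1] > priority:
                return False
            self.db.execute("DELETE FROM queue WHERE id = ?", (victim[0],))
        self.db.execute(
            "INSERT INTO queue (topic, payload, qos, priority, timestamp) "
            "VALUES (?, ?, ?, ?, ?)",
            (topic, payload, qos, priority, time.time()),
        )
        self.db.commit()
        return True


q = BoundedPriorityQueue(max_size=2)
q.enqueue("fleet/d1/telemetry", "{}", 1, 5)
q.enqueue("fleet/d1/status", "{}", 1, 8)
q.enqueue("fleet/d1/events", "{}", 2, 10)  # queue is full: telemetry is evicted
topics = [
    row[0]
    for row in q.db.execute("SELECT topic FROM queue ORDER BY priority DESC")
]
print(topics)  # ['fleet/d1/events', 'fleet/d1/status']
```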

&lt;p&gt;The status messages even report the client's own internal state, pulled straight from robmqtt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_statistics&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue_depth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;offline_queue_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inflight_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inflight_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_connected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_connected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reconnect_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reconnect_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the simulated fleet reports on its own connectivity health — which means you can build a dashboard that shows the offline queue filling and draining in real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Watching the offline queue work
&lt;/h2&gt;

&lt;p&gt;This is the part I find satisfying. Start a device:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python device_simulator.py &lt;span class="nt"&gt;--device-id&lt;/span&gt; device_001 &lt;span class="nt"&gt;--device-type&lt;/span&gt; gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[device_001]&lt;/span&gt; &lt;span class="err"&gt;Started&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="py"&gt;type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gateway location=warehouse_a&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now kill the broker. The device keeps publishing — but the messages are now being written to SQLite instead of sent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop mosquitto
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The device doesn't crash. It doesn't error. It just quietly queues. The &lt;code&gt;queue_depth&lt;/code&gt; in the status payload climbs: 5, 12, 28, 45...&lt;/p&gt;

&lt;p&gt;Bring the broker back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start mosquitto
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The queue drains automatically. Every reading that piled up during the outage is delivered, in priority order, and &lt;code&gt;queue_depth&lt;/code&gt; falls back to zero. As long as the outage was short enough that the queue never reached its 500-message cap, nothing was lost.&lt;/p&gt;

&lt;p&gt;That's the whole point of the library, demonstrated in a way you can watch happen.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why a simulator is worth building
&lt;/h2&gt;

&lt;p&gt;Even if you have real hardware, a simulator earns its place:&lt;/p&gt;

&lt;p&gt;It lets you test at a scale you don't have hardware for. You can run 15 simulated devices on your laptop and see how your pipeline, database, and dashboards behave under fleet-level load.&lt;/p&gt;

&lt;p&gt;It gives you reproducible failure scenarios. Killing a broker on demand is a lot easier than waiting for a real 4G connection to drop in the field.&lt;/p&gt;

&lt;p&gt;It produces clean test data with known properties. You injected the anomalies, so you know exactly what your anomaly detection should catch.&lt;/p&gt;

&lt;p&gt;And — as a bonus — it doubles as living documentation for how to use your client library. The simulator &lt;em&gt;is&lt;/em&gt; the example.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;In part 3 I'll cover running this at fleet scale — launching many devices at once and feeding their telemetry through an analytics pipeline into a live dashboard.&lt;/p&gt;

&lt;p&gt;The full simulator code is on GitHub alongside robmqtt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/robmqtt/" rel="noopener noreferrer"&gt;pypi.org/project/robmqtt&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ranaweerasupun/resilient-edge-mqtt-client" rel="noopener noreferrer"&gt;github.com/ranaweerasupun/resilient-edge-mqtt-client&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;How do you test your IoT pipelines — real hardware, simulators, or something else? I'm curious what others do — let me know in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>iot</category>
      <category>python</category>
      <category>mqtt</category>
      <category>testing</category>
    </item>
    <item>
      <title>Why Your MQTT Client Is Silently Losing Messages (And How I Fixed It) - robmqtt</title>
      <dc:creator>Supun Sriyananda</dc:creator>
      <pubDate>Tue, 26 May 2026 14:01:19 +0000</pubDate>
      <link>https://dev.to/ranaweerasupun/why-your-mqtt-client-is-silently-losing-messages-and-how-i-fixed-it-robmqtt-4n4k</link>
      <guid>https://dev.to/ranaweerasupun/why-your-mqtt-client-is-silently-losing-messages-and-how-i-fixed-it-robmqtt-4n4k</guid>
      <description>&lt;h2&gt;
  
  
  Why Your MQTT Client Is Silently Losing Messages (And How I Fixed It)
&lt;/h2&gt;

&lt;p&gt;I learned this the hard way.&lt;/p&gt;

&lt;p&gt;I was building a sensor system for a field deployment — Raspberry Pi units publishing temperature and humidity data over 4G cellular to an MQTT broker. The dashboard looked fine. The graphs looked fine. Then one day I compared the raw sensor logs against what actually made it to the broker.&lt;/p&gt;

&lt;p&gt;Thousands of readings. Gone. No errors. No warnings. Just gone.&lt;/p&gt;

&lt;p&gt;The culprit? &lt;code&gt;paho-mqtt&lt;/code&gt;'s default behaviour when the broker is unreachable: it silently drops your message and moves on.&lt;/p&gt;

&lt;p&gt;After losing enough data I wrote a library to fix it. It's now on PyPI as &lt;strong&gt;robmqtt&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;robmqtt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But before I show you how it works, let me show you exactly what the problem is — because it's subtler than most people realise.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with Standard MQTT Clients
&lt;/h2&gt;

&lt;p&gt;When you call &lt;code&gt;client.publish()&lt;/code&gt; in paho-mqtt and the broker is unreachable, one of two things happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The message is silently discarded (QoS 0)&lt;/li&gt;
&lt;li&gt;The message is queued in memory for QoS 1/2 — but that queue is lost on process restart, and there's a second gap that even QoS 1 doesn't close&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That second gap is the sneaky one. Here's what happens with QoS 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6saayzafjthwujry6j2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6saayzafjthwujry6j2.png" alt="qos1-silent-failure.png" width="618" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The message was "sent" from your perspective. It was never confirmed from the broker's perspective. And paho has no mechanism to track this gap across reconnections.&lt;/p&gt;

&lt;p&gt;On a stable data centre network, this almost never matters. On a Raspberry Pi running on 4G cellular in a field cabinet, it happens constantly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Resilient Edge Client Actually Needs
&lt;/h2&gt;

&lt;p&gt;After losing enough data, I sat down and wrote out what a proper edge MQTT client needs to do:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Persist offline messages to disk&lt;/strong&gt;&lt;br&gt;
If the broker is unreachable when &lt;code&gt;publish()&lt;/code&gt; is called, the message should be written to disk and replayed later. Not held in memory — memory is lost on restart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Track in-flight messages separately&lt;/strong&gt;&lt;br&gt;
Messages that have been handed to the broker but not yet ACK'd need to be tracked. On reconnect, they must be re-sent before any queued messages start draining.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Priority-based eviction&lt;/strong&gt;&lt;br&gt;
When the queue fills up, not all messages are equal. A critical alarm should survive. A routine telemetry reading from 6 hours ago should not block it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Exponential backoff on reconnect&lt;/strong&gt;&lt;br&gt;
A fleet of 50 devices coming back online after a broker restart should not all hammer the broker in the same second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Thread-safe storage&lt;/strong&gt;&lt;br&gt;
The MQTT network thread and your application thread are both touching message state. This needs to be safe without forcing the caller to think about locking.&lt;/p&gt;

&lt;p&gt;None of this is exotic. All of it is missing from the standard &lt;code&gt;paho-mqtt&lt;/code&gt; client when used out of the box.&lt;/p&gt;
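
&lt;p&gt;Point 4 is the easiest of these to sketch. The version below adds random jitter on top of the exponential curve, a common companion technique that the list above does not mention, so treat it as an assumption rather than a description of robmqtt. The function name &lt;code&gt;backoff_delay&lt;/code&gt; and the 2 and 30 second bounds are illustrative.&lt;/p&gt;

```python
import random


def backoff_delay(attempt, min_backoff=2.0, max_backoff=30.0, rng=random):
    """Delay in seconds before reconnect attempt `attempt` (0-based)."""
    # Cap the exponent so very long outages cannot overflow the arithmetic.
    exponent = min(attempt, 16)
    # Double the ceiling on each attempt, but never exceed max_backoff.
    ceiling = min(max_backoff, min_backoff * (2 ** exponent))
    # "Full jitter": pick uniformly below the ceiling so that devices which
    # lost the broker at the same moment do not all retry together.
    return rng.uniform(0, ceiling)


for attempt in range(6):
    print(attempt, round(backoff_delay(attempt), 2))
```

&lt;p&gt;Without the jitter, 50 devices that dropped at the same instant would follow the same schedule and reconnect in lockstep, which is the stampede the backoff was meant to avoid.&lt;/p&gt;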


&lt;h2&gt;
  
  
  How robmqtt Solves It
&lt;/h2&gt;

&lt;p&gt;Here's the architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4922wwjiybq1zdzd7vm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4922wwjiybq1zdzd7vm.png" alt="robmqtt-architecture.png" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Offline Queue
&lt;/h3&gt;

&lt;p&gt;When the client detects it's disconnected, &lt;code&gt;publish()&lt;/code&gt; routes to an &lt;code&gt;OfflineQueue&lt;/code&gt; backed by SQLite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified from offline_queue.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OfflineQueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
                INSERT INTO queue (topic, payload, qos, priority, timestamp)
                VALUES (?, ?, ?, ?, ?)
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dequeue_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Highest priority first, then oldest first within same priority
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
                SELECT id, topic, payload, qos FROM queue
                ORDER BY priority DESC, timestamp ASC
                LIMIT ?
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,)).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SQLite gives you durability without a separate process. It survives power cycles. The threading lock means your application thread and the drain thread never step on each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Inflight Tracker
&lt;/h3&gt;

&lt;p&gt;This closes the gap QoS 1 leaves open:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified from inflight_tracker.py
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;InflightTracker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Call this when you hand a message to paho.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
                INSERT OR REPLACE INTO inflight (mid, topic, payload, qos)
                VALUES (?, ?, ?, ?)
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;acknowledge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mid&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Call this in on_publish callback.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DELETE FROM inflight WHERE mid = ?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mid&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_all_pending&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Call this on reconnect — re-send everything unacknowledged.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT topic, payload, qos FROM inflight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On reconnect, the client replays all inflight messages first, then starts draining the offline queue. Inflight messages were published before anything still sitting in the queue, so sending them first preserves delivery order.&lt;/p&gt;
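
&lt;p&gt;A minimal sketch of that replay order, run against an in-memory SQLite database. The &lt;code&gt;offline_queue&lt;/code&gt; table, its columns, and the &lt;code&gt;replay_order&lt;/code&gt; helper are assumptions for illustration, not robmqtt's actual schema or API. It also assumes message IDs grow over time, which stops holding once they wrap around:&lt;/p&gt;

```python
import sqlite3

def replay_order(db):
    """Return messages in the order a reconnect would resend them."""
    # Inflight first: these were handed to the broker before the link dropped.
    inflight = db.execute(
        "SELECT topic, payload, qos FROM inflight ORDER BY mid"
    ).fetchall()
    # Then the offline queue, oldest first (hypothetical table).
    queued = db.execute(
        "SELECT topic, payload, qos FROM offline_queue ORDER BY id"
    ).fetchall()
    return inflight + queued

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inflight (mid INTEGER PRIMARY KEY, topic TEXT, payload TEXT, qos INTEGER)")
db.execute("CREATE TABLE offline_queue (id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT, qos INTEGER)")

# Two unacknowledged messages, stored out of order on purpose.
db.execute("INSERT INTO inflight VALUES (8, 'sensors/temperature', 'b', 1)")
db.execute("INSERT INTO inflight VALUES (7, 'sensors/temperature', 'a', 1)")
# Two messages published while the broker was unreachable.
db.execute("INSERT INTO offline_queue (topic, payload, qos) VALUES ('sensors/temperature', 'c', 1)")
db.execute("INSERT INTO offline_queue (topic, payload, qos) VALUES ('sensors/temperature', 'd', 1)")

payloads = [payload for _topic, payload, _qos in replay_order(db)]
print(payloads)
```

&lt;p&gt;The explicit &lt;code&gt;ORDER BY&lt;/code&gt; matters here: without it, SQLite makes no promise about the order rows come back in.&lt;/p&gt;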

&lt;h3&gt;
  
  
  Priority Eviction
&lt;/h3&gt;

&lt;p&gt;Each message gets a priority from 1 (lowest) to 10 (highest):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Routine telemetry — can be evicted when queue is full
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sensors/temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 23.5}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Critical alert — survives eviction, displaces old telemetry
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alerts/critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;over_temp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 87.2}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the queue hits capacity, the lowest-priority messages are evicted first. Your critical alerts are never blocked by a backlog of stale routine data.&lt;/p&gt;
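
&lt;p&gt;One way eviction like this could be written against SQLite, as a rough sketch. The table layout, the &lt;code&gt;enqueue&lt;/code&gt; helper, and the oldest-first tie-break are assumptions of mine, not robmqtt's internals:&lt;/p&gt;

```python
import sqlite3

MAX_QUEUE_SIZE = 3  # tiny on purpose so eviction is easy to see

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE offline_queue (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT,
        payload TEXT,
        priority INTEGER
    )
""")

def enqueue(topic, payload, priority):
    """Store a message, then trim the queue back to MAX_QUEUE_SIZE."""
    db.execute(
        "INSERT INTO offline_queue (topic, payload, priority) VALUES (?, ?, ?)",
        (topic, payload, priority),
    )
    count = db.execute("SELECT COUNT(*) FROM offline_queue").fetchone()[0]
    overflow = max(0, count - MAX_QUEUE_SIZE)
    # Evict the lowest priority first; among equals, the oldest goes first.
    db.execute(
        "DELETE FROM offline_queue WHERE id IN ("
        "  SELECT id FROM offline_queue ORDER BY priority ASC, id ASC LIMIT ?"
        ")",
        (overflow,),
    )

# Fill the queue with routine telemetry, then push one critical alert.
for n in range(3):
    enqueue("sensors/temperature", f"reading-{n}", priority=3)
enqueue("alerts/critical", "over_temp", priority=9)

remaining = [row[0] for row in db.execute("SELECT payload FROM offline_queue ORDER BY id")]
print(remaining)
```

&lt;p&gt;The alert pushes out the oldest telemetry reading. Note that in this sketch a new message is itself dropped if it arrives at a full queue with the lowest priority of the lot.&lt;/p&gt;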




&lt;h2&gt;
  
  
  Using robmqtt
&lt;/h2&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;robmqtt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basic usage — this is everything you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;robmqtt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProductionMQTTClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProductionMQTTClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;field_device_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;broker_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mqtt.yourdomain.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;broker_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1883&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_queue_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# holds up to 5000 messages during outages
&lt;/span&gt;    &lt;span class="n"&gt;min_backoff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# start retrying after 2s
&lt;/span&gt;    &lt;span class="n"&gt;max_backoff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# cap retry interval at 60s
&lt;/span&gt;    &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./device.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# SQLite lives here — survives reboots
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# From here just call publish() — routing is handled internally.
# Connected: sends directly and tracks inflight.
# Disconnected: writes to SQLite, drains automatically on reconnect.
# read_sensor() below is a placeholder for your own sensor-reading function.
&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;reading&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_sensor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sensors/temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reading&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The application code doesn't need to know whether the broker is reachable. That's the point.&lt;/p&gt;
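
&lt;p&gt;To give a feel for what &lt;code&gt;min_backoff&lt;/code&gt; and &lt;code&gt;max_backoff&lt;/code&gt; mean in practice, here is a small sketch of a capped retry schedule. The doubling rule and the &lt;code&gt;backoff_delays&lt;/code&gt; helper are my assumptions; the only things taken from the example above are the 2-second start and the 60-second cap:&lt;/p&gt;

```python
def backoff_delays(min_backoff=2, max_backoff=60, attempts=7):
    """Delays in seconds for successive reconnect attempts."""
    # Double the wait each time, but never exceed max_backoff.
    return [min(max_backoff, min_backoff * 2 ** n) for n in range(attempts)]

delays = backoff_delays()
print(delays)
```

&lt;p&gt;With those settings the waits grow quickly and then flatten at the cap, so a long outage costs one reconnect attempt a minute rather than a tight retry loop.&lt;/p&gt;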

&lt;p&gt;Check what's happening at runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_statistics&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# {
#   'is_connected': True,
#   'offline_queue_size': 0,
#   'inflight_count': 2,
#   'reconnect_count': 4,
#   ...
# }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TLS is supported if your broker requires it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProductionMQTTClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secure_device_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;broker_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mqtt.yourdomain.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;broker_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8883&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_tls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ca_certs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/etc/ssl/certs/broker-ca.crt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Seeing It in Action
&lt;/h2&gt;

&lt;p&gt;The repo includes &lt;code&gt;test_13.py&lt;/code&gt;, a simulation designed specifically to demo the offline behaviour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1 — run the simulation (publishes every 5 seconds)&lt;/span&gt;
python test_13.py

&lt;span class="c"&gt;# Terminal 2 — simulate a network outage&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop mosquitto

&lt;span class="c"&gt;# Watch messages queue up in Terminal 1&lt;/span&gt;
&lt;span class="c"&gt;# Queue stats print every 10 readings&lt;/span&gt;

&lt;span class="c"&gt;# Restore connectivity&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start mosquitto

&lt;span class="c"&gt;# Watch the offline queue drain automatically — zero messages lost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The queue drain happens in a background daemon thread. Your application code does nothing. It just works.&lt;/p&gt;
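
&lt;p&gt;The general shape of a background drain can be sketched in a few lines. This is not robmqtt's code: the &lt;code&gt;Drainer&lt;/code&gt; class and its methods are hypothetical, and it uses an in-memory queue where the real client persists to SQLite:&lt;/p&gt;

```python
import queue
import threading

class Drainer:
    """Buffers messages and sends them from a daemon thread while connected."""

    def __init__(self, send):
        self._send = send
        self._pending = queue.Queue()
        self.connected = threading.Event()  # set() when the broker is reachable
        # daemon=True: the thread never keeps the process alive on exit.
        threading.Thread(target=self._run, daemon=True).start()

    def publish(self, message):
        # The caller never checks connectivity; it only enqueues.
        self._pending.put(message)

    def wait_until_empty(self):
        # Blocks until every queued message has been handed to send().
        self._pending.join()

    def _run(self):
        while True:
            self.connected.wait()           # sleep while offline
            message = self._pending.get()   # sleep while there is nothing to send
            self._send(message)
            self._pending.task_done()

sent = []
drainer = Drainer(send=sent.append)
drainer.publish("reading-1")   # broker "down": messages just pile up
drainer.publish("reading-2")
drainer.connected.set()        # broker "back": the thread drains on its own
drainer.wait_until_empty()
print(sent)
```

&lt;p&gt;The calling code only ever touches &lt;code&gt;publish()&lt;/code&gt;, which is the same division of labour described above.&lt;/p&gt;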




&lt;h2&gt;
  
  
  Real-World Context
&lt;/h2&gt;

&lt;p&gt;I've deployed this pattern on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Battery management systems&lt;/strong&gt; — monitoring cell voltages and temperatures in production energy storage systems. A 10-minute broker outage during a network switch should not cause a gap in the battery health record.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robotics telemetry&lt;/strong&gt; — ROS2 robots publishing sensor and status data. Process restarts during OTA updates should not lose the last known state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MQTT edge gateways&lt;/strong&gt; — aggregating data from downstream sensors over serial or CAN and forwarding to a cloud broker over 4G. The gateway may reconnect dozens of times per day.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all of these cases, the pattern is the same: treat disconnection as normal, not exceptional. Design the client to buffer, not to fail.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;robmqtt is specifically for &lt;strong&gt;edge device deployments&lt;/strong&gt; where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network connectivity is unreliable (cellular, Wi-Fi roaming, VPNs)&lt;/li&gt;
&lt;li&gt;Process restarts happen (watchdog resets, power cycles, OTA updates)&lt;/li&gt;
&lt;li&gt;Message loss has real consequences (industrial monitoring, remote sensors, fleet telemetry)&lt;/li&gt;
&lt;li&gt;You don't want to build and maintain this infrastructure yourself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're running MQTT on a stable cloud-to-cloud connection, &lt;code&gt;paho-mqtt&lt;/code&gt; alone is probably fine. If you're deploying on Raspberry Pi, industrial gateways, field sensors, or anything running on 4G/LTE — this is for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The Prometheus metrics endpoint is on the roadmap. The structured logging already writes &lt;code&gt;.jsonl&lt;/code&gt; metrics files — exposing them via HTTP is a small step and would make robmqtt slot naturally into standard observability stacks.&lt;/p&gt;

&lt;p&gt;If you try it and hit an issue, open a GitHub issue. If you want a feature, open a discussion.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/robmqtt/" rel="noopener noreferrer"&gt;pypi.org/project/robmqtt&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ranaweerasupun/resilient-edge-mqtt-client" rel="noopener noreferrer"&gt;github.com/ranaweerasupun/resilient-edge-mqtt-client&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Have you run into MQTT message loss on edge devices? How did you handle it? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>iot</category>
      <category>python</category>
      <category>mqtt</category>
      <category>raspberrypi</category>
    </item>
  </channel>
</rss>
