Graham Trott

Posted on Aug 24, 2024 • Edited on Aug 27, 2024

Chained wi-fi networking for control systems

#esp8266 #control #networking #micropython

This article describes a networking strategy for controlling simple wifi devices distributed over a large physical area, without the need for repeaters. The code is MicroPython and the system was designed for a network of 4MB ESP8266 devices. It was originally developed for a home heating control system but is equally well suited to industrial heating, irrigation or similar systems, where several devices need to be controlled centrally.

Background

Most small-scale WiFi networking systems adopt a “star” topology, with all the control and sensor nodes connected to a single hub, often a standard internet router. This will be located close to where the service provider's cable enters the property, often in a remote corner. Devices such as the ESP8266 have small antennae and operate poorly over anything more than quite a short distance. This shows up as delays in responding, frequent resets and even crashes, but these may not become noticeable until after the installation has been completed and signed off. The system would then require a time-consuming rework.

This problem can be overcome by the use of extra routers or network extenders, to form what is known as a 'mesh' system, but these may be visually intrusive and they add extra cost and complexity to the system.

In the case of the home heating system, there was an additional problem, that if the internet connection went down the local NAT network would also fail and the system controller could not access the relays and other devices. To avoid this, the controller was fitted with a second wifi interface to provide a private LAN, and the entire control network was placed on that. In the event of a WAN failure the system carries on happily by itself.

This works fine up to a point, but small computers like the Raspberry Pi or Orange Pi can only support a limited number of wifi devices, and it doesn't deal with devices being too far away to work reliably unless mesh routers are added.

A secondary issue concerned updates to the firmware inside the devices. When updates were required it was necessary to manually visit each device in turn and perform an update.

This article describes a simple, low-cost approach where the controlled devices themselves take care of all the extended networking. It can handle a large number of devices and can be deployed using very modest hardware such as an ESP8266. The system connects to its controller through a single wifi address and can be completely configured and tested before delivery for installation. Software updates propagate silently and automatically through the system.

Push or pull

In a "push" scenario, a central controller determines the states of each of the controlled devices and "pushes" messages to each of them via REST, MQTT or some other comms mechanism. This is fine while it's working but can cause major timing issues when devices go offline and the controller doesn't get the responses it expected.

This system adopts a "pull" strategy. As before, the controller decides the states of the devices, but it's then up to each of them to poll the controller whenever it wishes and to act on the response it receives back. After several months' testing there have been no signs of timing issues. Occasionally a device will lose contact, suffer a memory overflow or some other problem that causes the local watchdog to force a reset, but this has little to no effect on the behaviour of the overall system.

Outline of the strategy

This strategy employs some of the networked devices as message relays handing data to and from other nearby devices. The network then looks like this:

Implementation

The strategy centres on a packet of information constructed by the system controller. Encoded as JSON, this contains a section for each of the networked devices, keyed by the name of the device. It also contains the current timestamp and the version number of the system firmware. Let's call this packet the “map”. Here’s an example of a simple system with two relay nodes:

{"Room1": {"relay": "off"}, "Room2": {"relay": "on"}, "ts": "1719094849", "v": "90"}

The content of each device section in the map depends on the device. A relay has its state, on or off. A thermometer usually requires no information since its job is to return data, not consume it.

One device is designated as the "entry point" to the network. It connects to the access point (wifi hotspot) published by the hub, which either is the system controller itself or an interface device that connects to it. The entry device knows nothing of its special role; it behaves just like every other node. The reason for having a single entry node is so the hub doesn’t have to support more than one wifi client or combine messages coming from multiple sources. Once connected, the entry device polls the hub at regular intervals such as every 2 seconds, and the hub responds by sending it the map. The device looks in the map for its own name and extracts the attached data to perform whatever tasks are needed. For a relay this will be to turn on or off. For a thermometer, no actions are needed at this stage.
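As a rough illustration of that lookup, here is a minimal sketch in plain Python (MicroPython's json module offers the same loads call). The function names my_section and relay_should_be_on are invented for this example and are not taken from the project code.

```python
import json

def my_section(map_text, my_name):
    """Parse the map and return (own section, timestamp, version)."""
    outgoing = json.loads(map_text)
    # A device the controller has not listed simply gets an empty section
    section = outgoing.get(my_name, {})
    return section, outgoing.get("ts"), outgoing.get("v")

def relay_should_be_on(section):
    """A relay section carries its required state as 'on' or 'off'."""
    return section.get("relay") == "on"

# The example map from the text above
example = ('{"Room1": {"relay": "off"}, "Room2": {"relay": "on"}, '
           '"ts": "1719094849", "v": "90"}')
section, ts, version = my_section(example, "Room2")
```

A thermometer would call my_section in the same way and simply ignore the (empty) section, keeping only the timestamp and version.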

When a device starts up it constructs an empty 'return packet', which it maintains for the lifetime of the running program. The device adds to this packet the timestamp contained in the last outgoing map. For a relay, that's all there is; a thermometer also adds the current temperature. The return packet looks like this:

{"Room1": {"ts": "1719094849"}}
or
{"Thermo1": {"temp": "25.3", "ts": "1719094849"}}

When a device polls its parent it supplies its current return packet, which includes those of its own child devices as explained below. If polling stops or for any other reason the timestamps don't get updated, the controller can tell there's something wrong in the system by checking the ages of the returning timestamps.
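The age check on the controller side could look something like the sketch below. It is written in Python for consistency with the device code, although the prototype controller is not Python; the function name stale_devices and the 60-second threshold are assumptions chosen for illustration.

```python
def stale_devices(return_packet, now, max_age=60):
    """List the devices whose last reported timestamp is older than max_age seconds.

    return_packet maps device names to dicts containing a 'ts' string;
    now is the controller's current Unix time in seconds.
    """
    stale = []
    for name, data in return_packet.items():
        # A device that has never reported a timestamp counts as stale
        age = now - int(data.get("ts", 0))
        if age > max_age:
            stale.append(name)
    return sorted(stale)

packet = {"Room1": {"ts": "1719094849"}, "Room2": {"ts": "1719093995"}}
overdue = stale_devices(packet, now=1719094859)
```

Here Room1 reported 10 seconds ago and Room2 864 seconds ago, so only Room2 is flagged.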

Adding more devices

Devices like the ESP8266 and ESP32 can operate simultaneously in Station and in Access Point mode. That is, they can connect to a wifi hotspot while at the same time publishing one of their own. These local hotspots have limited functionality but can usually support up to 4 connections, which allows us to expand the system by connecting devices to each other in a chain rather than all to a central hub. In this system, each device sets up a hotspot with an SSID based on its own MAC address and a network IP address that's different to the one it’s connected to.

When a device receives a polling request from a child device it extracts the contained return packet and adds it to its own return packet, for onward transmission to its own parent on the next poll. For example, if Room2 is a child of Room1, then following a poll of Room1 by Room2 the return packet from Room1 will expand to look like this:

{"Room1": {"ts": "1719094849"}, "Room2": {"ts": "1719093995"}}

As we move back along the chain towards the entry point device, the return packet grows to include all the nodes before finally being delivered to the hub on the next poll.

After dealing with the return packet, the device responds to the poll request by returning the current outgoing map (that came from the controller) to the child device that polled it. In this way, the outgoing map propagates to all devices.
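The two halves of this exchange can be sketched as a small class. This is only an illustration of the data flow, with the HTTP transport left out; the class and method names (Node, on_map, on_child_poll) are hypothetical and do not come from the repository.

```python
class Node:
    """Holds the latest outgoing map and the accumulated return packet."""

    def __init__(self, name):
        self.name = name
        self.outgoing_map = {}     # last map received from the parent
        self.return_packet = {}    # starts empty and lives as long as the program

    def on_map(self, outgoing_map, **readings):
        """Called with the parent's reply to our poll."""
        self.outgoing_map = outgoing_map
        # Echo the map's timestamp; a thermometer also adds its reading
        entry = {"ts": outgoing_map["ts"]}
        entry.update(readings)
        self.return_packet[self.name] = entry

    def on_child_poll(self, child_packet):
        """Called when a child polls us: absorb its packet, reply with the map."""
        self.return_packet.update(child_packet)
        return self.outgoing_map

room1 = Node("Room1")
room1.on_map({"Room1": {"relay": "off"}, "Room2": {"relay": "on"},
              "ts": "1719094849", "v": "90"})
reply = room1.on_child_poll({"Room2": {"ts": "1719093995"}})
```

After the child's poll, room1.return_packet matches the expanded packet shown above, and reply is the map that Room2 will act on.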

When devices are added to the system they do not connect directly to the system hub as they would in a star topology. Instead, they connect to the nearest already configured device that has remaining capacity for another connection. Since each device publishes a hotspot, it's easy to check on a smartphone which one has the strongest signal, to guarantee reliable system performance.

The behaviour of every device is identical so it’s very easy to configure the system; the device name and its parent SSID are the only items that differ from one device to another. And so the system grows, with devices being connected to any other device that can offer a good wifi signal.

The system also has the welcome ability to operate without the need to manually track IP addresses. Devices are identified by unique names and SSIDs, where the latter are created automatically from the device MAC address. Each device chooses for its own hotspot an IP address range that's different to the one used by its parent. In the prototype systems, these ranges alternate between 172.24.100.x and 172.24.101.x. Each device uses one or the other.
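A minimal sketch of both choices follows. It assumes the MAC address is available as 6 raw bytes (which is what network.WLAN(...).config('mac') returns on MicroPython) and that the SSID uses lowercase hex digits; the function names are invented for this example.

```python
SUBNETS = ("172.24.100", "172.24.101")   # the two ranges used by the prototypes

def ssid_from_mac(mac, prefix="RBR-XR-"):
    """Build an SSID from the product prefix and the last 3 bytes of the MAC."""
    return prefix + "".join("{:02x}".format(b) for b in mac[-3:])

def choose_subnet(parent_ip):
    """Pick whichever of the two ranges the parent is NOT using."""
    parent_subnet = parent_ip.rsplit(".", 1)[0]
    if parent_subnet == SUBNETS[0]:
        return SUBNETS[1]
    return SUBNETS[0]

ssid = ssid_from_mac(bytes([0x5C, 0xCF, 0x7F, 0x1A, 0x2B, 0x3C]))
own_gateway = choose_subnet("172.24.100.2") + ".1"
```

A device that received the address 172.24.100.2 from its parent would therefore publish its own hotspot on 172.24.101.1, and its children would in turn go back to 172.24.100.x.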

It will often be useful to create a simple spreadsheet to record the essential details of the system, as a guide while installing and when doing maintenance. Here's the table for a home heating system:

In this example, Request is the entry device. (Here it's the relay that controls whether the boiler - or in this case heat pump - should run. It's ON when any radiator is ON.)

It's quite simple to check signal strengths at a given location using any computer or smartphone, to decide which devices can connect to which parents. When your computer connects to a device hotspot, the device itself is the gateway (the address whose 4th octet is 1, e.g. 172.24.100.1). It runs a small HTTP server at that address, which returns basic information about the device: its name, SSID, parent and how long it's been up and running.

Over The Air (OTA) updates

The version number handed out by the hub is used to keep the system firmware up to date. Each device keeps a note of its current version. When a message packet (map) arrives with a higher version number, the device requests a list of files from its parent, and then requests each of the files listed, one by one. Once it has finished updating it saves the new version number.
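The version test itself is small, but one detail is worth showing: the map carries the version as a string, so it should be compared as a number. The sketch below assumes versions are plain integers, as in the example map; the function name needs_update is hypothetical.

```python
def needs_update(map_version, saved_version):
    """True if the map advertises a newer firmware version than the one saved."""
    try:
        # Compare as integers: as strings, "100" would sort below "90"
        return int(map_version) > int(saved_version)
    except (TypeError, ValueError):
        # Missing or malformed version: carry on with the current code
        return False

update_due = needs_update("91", "90")
```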

While updating is taking place, normal operation is suspended. Clients of a device will not get responses to their polling and will eventually time out and reset themselves. Once updating has finished, a device will restart normal operation and when it receives the map from its parent will pass it on. Since this now contains an updated version number, each of its clients will start its own update. So the latest version ripples through the system and everyone ends up with the same code.

If the system is a mix of different device types, the code for all of them must be contained in each. So a relay contains thermometer code and vice versa.

Upsides

Ability to handle a large number of devices spread over a large physical area
No wifi black spots
All IP addresses are inferred by the devices themselves
Very low cost
The messaging protocol is flexible. It can be extended to add more complex information exchanges, without any compatibility issues.

Downsides

The main downside of this system is the time it takes for messages to propagate through the network. It's not suitable for use where a rapid response is needed, as response times are measured in seconds or even tens of seconds. There are however many applications for which this is not a problem, such as home or commercial heating systems and large-scale irrigation.
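A back-of-envelope estimate shows where those figures come from. Assume every device polls its parent at the same interval (2 seconds is the example given earlier) and ignore transfer time. A new map can then wait up to one interval at each hop on the way out, and the confirming timestamp can wait the same again on the way back. This model is a simplification for illustration, not a measurement.

```python
def worst_case_delay(depth, poll_interval=2.0):
    """Return (one_way, round_trip) worst-case delays in seconds.

    depth is the number of hops from the hub (the entry device is depth 1).
    """
    one_way = depth * poll_interval   # up to one poll interval per hop
    round_trip = 2 * one_way          # the return packet climbs back hop by hop
    return one_way, round_trip

one_way, round_trip = worst_case_delay(depth=5)
```

For a device five hops from the hub this gives up to 10 seconds for a change to arrive and up to 20 seconds before the controller sees it confirmed.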

The code

The device on which this code runs, typically an ESP8266, must have at least 4MB of flash memory in order to support the full set of MicroPython modules required, most importantly asynchronous functions. If you are using the ESP-01 be sure to get the 4MB version, not the older 1MB.

Since this is an ongoing project, any code I present here will be out of date very rapidly. So instead, here is a link to the GitHub repository, which includes all the source files referred to in the following paragraphs:
(https://github.com/easycoder/rbr/blob/main/roombyroom/Controller/home/orangepi/firmware/XR)

As with all Micropython projects there is a boot file and a main program. First, boot.py:

from machine import Pin

led = Pin(2, Pin.OUT)
led.on()

import esp
esp.osdebug(None)

import gc
gc.collect()

This is pretty standard. Next, main.py, which will always run automatically following boot.py:

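import os

def fileExists(filename):
    # os.stat raises OSError if the file is not present in flash
    try:
        os.stat(filename)
        return True
    except OSError:
        return False

if fileExists('config.json'):
    if fileExists('update'):
        # An update was requested: pass the contents of the flag file to the updater
        import updater
        with open('update', 'r') as f:
            value = f.read()
        updater.run(value)
    else:
        # Normal operation
        import configured
        configured.run()
else:
    # No configuration yet: serve the setup form
    import unconfigured
    unconfigured.run()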

The operating mode is determined by a couple of files kept in the device flash. The file config.json contains information specific to this device. If this file does not exist the device is unconfigured; if it exists then the device is considered to be configured and ready to run. In the latter case, the file update is checked; if this one also exists the device enters update mode, which I’ll cover last.

Note that the main function modules are only called in if they are needed for the chosen mode. This avoids having to load everything up front when much of it will remain unused.

Unconfigured mode

This mode is handled by unconfigured.py. The mode is optional; if you provide a config.json file manually the device will never run unconfigured. It exists so that a batch of devices can be flashed with identical firmware and then each one configured individually using a browser.

The job of the module is to run a web server that offers a configuration form for the user to supply configuration data. At the top are a couple of HTML pages, then some utility functions and finally the main code.

The key feature is the use of asynchronous functions to permit multitasking. The program runs 2 jobs; one waits for a client to connect and the other flashes the LED. Since neither of these can be allowed to block, asynchronous code is needed. The first thing is to create the two tasks, using asyncio, then enter an endless loop while they perform their duties.
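The pattern can be sketched as follows. This version runs on desktop Python so it can be tried without hardware: the LED is replaced by a callback, and an asyncio.Queue stands in for the web server's incoming connections (the queue is an assumption of this sketch and is not part of core MicroPython). On the device, create_task and sleep are used in the same way.

```python
import asyncio

async def flash_led(toggle, period):
    # Toggle the LED forever; each await hands control to the other task
    while True:
        toggle()
        await asyncio.sleep(period)

async def wait_for_client(queue, handle):
    # Stand-in for the web server: deal with each request as it arrives
    while True:
        request = await queue.get()
        handle(request)

async def demo():
    flashes, handled = [], []
    queue = asyncio.Queue()
    tasks = [
        asyncio.create_task(flash_led(lambda: flashes.append(1), 0.01)),
        asyncio.create_task(wait_for_client(queue, handled.append)),
    ]
    await queue.put("/config")
    await asyncio.sleep(0.05)      # on a device this would be the endless loop
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return len(flashes), handled

flash_count, handled = asyncio.run(demo())
```

Neither task blocks the other: the LED keeps flashing while the server task sits waiting for a request.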

The code sets up an access point whose SSID is of the form RBR-xr-xxxxxx, where RBR-xr- is a product descriptor and xxxxxx are the last 6 hex digits of the device MAC address. The IP address of the web server is 192.168.66.1 (or anything else you prefer). Connecting to it at http://192.168.66.1/config brings up the following screen:

The fields are as follows:

Relay Name: Any name that uniquely identifies this relay in the network. Spaces are not allowed.
Host SSID: The SSID of the device this one should connect to.
Host Password: The Wifi password of the host.
My Password: The Wifi password that will allow access to the server on this device once configured.

When Setup is clicked the fields are combined into a JSON structure and saved into the device flash area as config.json. Then the device resets itself. If all is well it should restart in Configured mode.
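A sketch of that step is shown below. The JSON key names (name, hostssid, hostpass, mypass) are placeholders invented for this example; the real keys are whatever unconfigured.py and configured.py agree on, so check the repository before relying on them.

```python
import json

def build_config(name, host_ssid, host_password, my_password):
    """Validate the form fields and return the text to store as config.json."""
    if not name or " " in name:
        raise ValueError("The name must be non-empty and contain no spaces")
    return json.dumps({
        "name": name,                # hypothetical key names throughout
        "hostssid": host_ssid,
        "hostpass": host_password,
        "mypass": my_password,
    })

config_text = build_config("Room1", "RBR-XR-1a2b3c", "host-secret", "my-secret")
```

On the device the returned text would be written to config.json before the reset.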

Configured mode

This mode is handled by configured.py. As with unconfigured mode, the code runs concurrent tasks using the asyncio library. One of these tasks is the main program; the other is a watchdog that monitors activity to ensure polling runs as expected. If for any reason it stops (such as the device’s host going offline), the watchdog forces a reset after a decent interval. This avoids un-needed resets when momentary failures occur.
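A watchdog of this kind can be sketched as below. It is a desktop-Python illustration, not the project's code: the Watchdog class is hypothetical, the timings are shortened so the demo finishes quickly, and time.monotonic would be replaced by time.ticks_ms and time.ticks_diff on MicroPython. On a device the on_expire callback would be machine.reset.

```python
import asyncio
import time

class Watchdog:
    def __init__(self, timeout, on_expire):
        self.timeout = timeout        # seconds of silence tolerated
        self.on_expire = on_expire    # what to do when polling has stalled
        self.last_fed = time.monotonic()

    def feed(self):
        # Call this after every successful poll
        self.last_fed = time.monotonic()

    async def run(self, check_every):
        while True:
            await asyncio.sleep(check_every)
            if time.monotonic() - self.last_fed > self.timeout:
                self.on_expire()
                return

async def demo():
    fired = []
    dog = Watchdog(timeout=0.2, on_expire=lambda: fired.append("reset"))
    task = asyncio.create_task(dog.run(check_every=0.01))
    for _ in range(3):                # three successful polls
        await asyncio.sleep(0.02)
        dog.feed()
    fired_while_polling = list(fired)
    await task                        # polling has stopped; the watchdog expires
    return fired_while_polling, fired

fired_while_polling, fired_after = asyncio.run(demo())
```

Because the timeout is much longer than the polling interval, a single missed poll does not trigger a reset; only a sustained silence does.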

Most of the functions required by the code - other than standard libraries - are kept in two local modules; functions.py and hardware.py. These will be described later.

The main program starts by doing some initialization. Two special dictionary objects hold the data passing around the network: outgoingMap is the map that originates at the controller, and incomingMap is the return packet travelling back towards it. Both of these, once created, last for as long as the program is running.

The code also sets up a local access point (hotspot) with an SSID that is similar but not identical to the one used in unconfigured mode. In this example it will be RBR-XR-xxxxxx, where xxxxxx is as before the last 6 hex digits of the MAC address. Since two of the characters in the product descriptor are now uppercase, the device is easy to find in a list of local access points.

In the main loop, the code polls its parent device at regular intervals, sending it the incomingMap. The reply from the parent - the outgoingMap - is parsed to find if the local relay should be on or off. The version number in outgoingMap is compared with that currently saved, and if there’s been an advance the device creates an update file and forces a reset.

When a client device polls this one, the incoming request is parsed to find the command and its data (if any). The format of the request is one of

http://(my ip address)/command
http://(my ip address)/command?data

The commands recognised are as follows:

reboot forces a restart of the code.
reset erases the configuration file then forces a reboot, causing the device to restart in unconfigured mode.
getFile?{filename} is a request for the contents of the named file to be returned. This is used by the updater.
poll?data={incomingMap} is a poll from a child device, which supplies its incoming map (or the part of it known to that child). The outgoing map is returned to the caller.

Any other command will return a line of general information about the device and its status.
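Parsing a request of this shape takes only a few lines. The sketch below assumes the data after poll?data= is raw JSON rather than percent-encoded, and the function names parse_request and classify are invented for illustration.

```python
import json

def parse_request(path):
    """Split '/command?data' into (command, data); data is None if absent."""
    parts = path.lstrip("/").split("?", 1)
    command = parts[0]
    data = parts[1] if len(parts) > 1 else None
    return command, data

def classify(path):
    """Return (action, argument) for the commands listed above."""
    command, data = parse_request(path)
    if command in ("reboot", "reset"):
        return command, None
    if command == "getFile" and data:
        return "getFile", data                       # data is the filename
    if command == "poll" and data and data.startswith("data="):
        return "poll", json.loads(data[len("data="):])
    return "info", None                              # anything else: status line

action, argument = classify('/poll?data={"Room2": {"ts": "1719093995"}}')
```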

The 'functions' modules

A set of helper functions is kept in functions.py and hardware.py. These are provided as two files to help avoid running out of memory during updates.

A single third-party library, uaiohttpclient, which handles asynchronous HTTP requests, is placed in the lib directory.

Update mode

This mode is handled by updater.py. The updater runs its main task and a watchdog. The main task starts by requesting from its parent device a list of all the files that comprise the code set. This list is saved as files.txt, then each of the files named is requested. When all have successfully been dealt with, the new version number is saved.

The mechanism for processing each file has to be robust and able to recover from errors. The procedure is as follows:

1. Download the requested file and save it as a temporary file.
2. Read the saved file and check it’s the same as the data downloaded. If not, force a reset.
3. If there is no existing file with the name requested, rename the temporary file and return.
4. If there is an existing file, read it and compare it with the downloaded file. If they are the same, no update is needed so delete the temporary file and return.
5. If they are different, delete the current (old) version, rename the temporary file and return.
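The steps above can be sketched as a single function. This is an illustration written for this article rather than the code in updater.py: the name install_file is hypothetical, the download itself is left out (the text arrives as an argument), and a failed verification raises an exception where the device would force a reset. The demo at the end uses tempfile and os.path so that it can run on a desktop.

```python
import os
import tempfile

def install_file(name, downloaded, tmp_name):
    # Step 1: save the downloaded text as a temporary file
    with open(tmp_name, "w") as f:
        f.write(downloaded)
    # Step 2: read it back and verify it
    with open(tmp_name) as f:
        if f.read() != downloaded:
            raise OSError("Verification failed")   # the device would reset here
    # Step 3: no existing file, so the temporary file becomes the real one
    try:
        with open(name) as f:
            current = f.read()
    except OSError:
        os.rename(tmp_name, name)
        return "created"
    # Step 4: identical content, nothing to do
    if current == downloaded:
        os.remove(tmp_name)
        return "unchanged"
    # Step 5: replace the old version
    os.remove(name)
    os.rename(tmp_name, name)
    return "replaced"

# Desktop demo in a scratch folder
folder = tempfile.mkdtemp()
target = os.path.join(folder, "functions.py")
temp = os.path.join(folder, "temp")
results = [install_file(target, text, temp) for text in ("v1", "v1", "v2")]
```

Writing to a temporary file first means a download that fails part-way never damages the working copy of a module.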

Future work

The system is agnostic about the environment in which it operates. In the current prototypes the hub is a small computer such as a Raspberry Pi or Orange Pi, where control is done by a PHP module on an Apache web server. A useful next step would be to create an integration for Home Assistant (HA). A hub module, probably itself an ESP8266, would handle a complete chain of devices and present them to HA as a single item via MQTT. A suitable API would allow any named device in the chain to be addressed and controlled independently via the hub.