<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Simon Aubury</title>
    <description>The latest articles on DEV Community by Simon Aubury (@saubury).</description>
    <link>https://dev.to/saubury</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F366962%2Fc20e27f1-5dc6-4a9a-b509-ef106bf30d32.jpeg</url>
      <title>DEV Community: Simon Aubury</title>
      <link>https://dev.to/saubury</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saubury"/>
    <language>en</language>
    <item>
      <title>When plans change at 500 feet: Complex event processing of ADS-B aviation data with Apache Flink</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Mon, 16 Jun 2025 09:56:24 +0000</pubDate>
      <link>https://dev.to/saubury/when-plans-change-at-500-feet-complex-event-processing-of-ads-b-aviation-data-with-apache-flink-56g6</link>
      <guid>https://dev.to/saubury/when-plans-change-at-500-feet-complex-event-processing-of-ads-b-aviation-data-with-apache-flink-56g6</guid>
      <description>&lt;h1&gt;
  
  
  When plans change at 500 feet: Complex event processing of ADS-B aviation data with Apache Flink
&lt;/h1&gt;

&lt;p&gt;Using open-source Apache Flink stream processing to analyse real-time aviation data and find missed approaches and paired runway landings. With some neat Flink SQL and custom functions, it’s possible to spot those rare times when planes pair in the sky or abort a landing at the last moment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Project code available on &lt;a href="https://github.com/saubury/plane_track" rel="noopener noreferrer"&gt;https://github.com/saubury/plane_track&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpisahc0nyqq80sjqcb9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpisahc0nyqq80sjqcb9q.png" alt="Finding missed landing approaches and paired runway landings" width="800" height="533"&gt;&lt;/a&gt;&lt;em&gt;Finding missed landing approaches and paired runway landings&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Aircraft determine their position using GPS, and periodically transmit that position along with an aircraft identity string, altitude, speed and more as &lt;a href="https://en.wikipedia.org/wiki/Automatic_Dependent_Surveillance%E2%80%93Broadcast" rel="noopener noreferrer"&gt;ADS-B signals&lt;/a&gt;. These signals are transmitted in clear text — and can be readily received with a small radio receiver. The event stream of data around a local airport is a fascinating source of data for complex event processing.&lt;/p&gt;

&lt;p&gt;I wanted to see if I could determine when infrequent but noteworthy aviation situations occur at my local airport:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Missed approach (or go-around) during aircraft landing — an uncommon manoeuvre where a pilot discontinues the final approach to the runway and climbs away from the airport for another attempt at the landing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Paired flight landings where aircraft land (or takeoff) on parallel runways. I was especially interested in the golden photographic moments when the same commercial aircraft type were flying in close formation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Acquiring the flight data
&lt;/h2&gt;

&lt;p&gt;My first attempt at acquiring aircraft transponder messages (ADS-B signals) was with a Raspberry Pi and an &lt;a href="https://www.rtl-sdr.com/about-rtl-sdr/" rel="noopener noreferrer"&gt;RTL2832U&lt;/a&gt; — a USB dongle originally sold to watch digital TV on a computer. This approach (detailed &lt;a href="https://simonaubury.com/posts/201805_usingksqlapachekafkafindtheplanethatwakessnowy/" rel="noopener noreferrer"&gt;here&lt;/a&gt;) was only partially successful. Although I got a rich feed of data as planes flew over my house — I was too far from the airport to receive transmissions when the aircraft were on their final descent into my local airport.&lt;/p&gt;

&lt;p&gt;I then discovered &lt;a href="https://adsb.fi/" rel="noopener noreferrer"&gt;adsb.fi&lt;/a&gt; — a community-driven flight tracker project with a free real-time API for personal projects. Their &lt;a href="https://github.com/adsbfi/opendata/tree/main?tab=readme-ov-file#public-endpoints" rel="noopener noreferrer"&gt;API&lt;/a&gt; returns aircraft transponder messages within a nominated radius of a specified location point. You can get a glimpse of flights over Sydney with a curl command like this&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --silent https://opendata.adsb.fi/api/v2/lat/-33.9401302/lon/151.175371/dist/5 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For me this was an ideal way of receiving live flight location, track and altitude for aircraft within 5 nautical miles of Sydney airport.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9rcy4xrvcdm1mtia90q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9rcy4xrvcdm1mtia90q.png" alt="Sample of aircraft transponder messages (ADS-B signals)" width="246" height="218"&gt;&lt;/a&gt;&lt;em&gt;Sample of aircraft transponder messages (ADS-B signals)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OK, now that I’ve got a feed of data, I need to analyse it to find some interesting flight events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing the flight data stream
&lt;/h2&gt;

&lt;p&gt;I wrote a &lt;a href="https://github.com/saubury/plane_track/blob/main/monitor_opendata.py" rel="noopener noreferrer"&gt;Python-based aircraft monitor&lt;/a&gt; which polls the adsb.fi feed for aircraft transponder messages, and publishes each location update as a new event into an Apache Kafka topic. I used &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; — and more specifically &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/dev/table/sql/overview/" rel="noopener noreferrer"&gt;Flink SQL&lt;/a&gt; — to transform and analyse my flight data. The TL;DR summary: I can write SQL for my real-time data processing queries — and get the scalability, fault tolerance and low latency managed by the Flink runtime.&lt;/p&gt;
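&lt;p&gt;The monitor is roughly the loop below. This is a minimal sketch rather than the real code (which lives in the repo above): it assumes the &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;confluent_kafka&lt;/code&gt; libraries, a local Kafka broker, and that the adsb.fi response carries aircraft records in an &lt;code&gt;ac&lt;/code&gt; array keyed by ICAO &lt;code&gt;hex&lt;/code&gt; code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import time

import requests
from confluent_kafka import Producer

# Assumed settings: adjust the broker and location to your own setup
URL = 'https://opendata.adsb.fi/api/v2/lat/-33.9401302/lon/151.175371/dist/5'
producer = Producer({'bootstrap.servers': 'localhost:9092'})

while True:
    resp = requests.get(URL, timeout=10)
    resp.raise_for_status()
    # 'ac' is assumed to hold one record per aircraft currently in range
    for aircraft in resp.json().get('ac', []):
        # Key by ICAO hex code so updates for one aircraft stay in order
        producer.produce('flight', key=aircraft.get('hex'),
                         value=json.dumps(aircraft))
    producer.flush()
    time.sleep(5)  # be polite to the free community API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;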

&lt;h2&gt;
  
  
  Identifying missed approach landings
&lt;/h2&gt;

&lt;p&gt;A missed approach is a standard procedure where a pilot discontinues the final approach to landing and climbs away from the runway, typically due to poor visibility, an unstable approach or an unsafe runway condition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy25c8psxioe42ayyaq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy25c8psxioe42ayyaq7.png" alt="Google Earth mapping of missed approach landing" width="800" height="403"&gt;&lt;/a&gt;&lt;em&gt;Google Earth mapping of missed approach landing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Missed approaches are a routine, well-practised procedure, but still relatively uncommon in normal operations. I couldn’t find accurate statistics, but at busy airports (especially in poor weather) roughly 1–3% of approaches might end in a missed approach.&lt;/p&gt;

&lt;p&gt;I needed to define a query for missed approach detection using go-around-like patterns in flight altitude data. I wanted a Flink SQL statement to find a sequence where an aircraft &lt;em&gt;descends&lt;/em&gt;, &lt;em&gt;lands or nearly lands&lt;/em&gt;, then &lt;em&gt;climbs again&lt;/em&gt; to reach a minimum safe altitude. An example series of altitude measurements could be graphed over time like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AhA4QiATXxka-k7Ag" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AhA4QiATXxka-k7Ag" alt="Missed approach series of altitude measurements" width="1969" height="980"&gt;&lt;/a&gt;&lt;em&gt;Missed approach series of altitude measurements&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I classified an &lt;em&gt;is_descending&lt;/em&gt; phase as seeing 5 consecutive decreasing altitude values, followed by an &lt;em&gt;is_ground&lt;/em&gt; event of descending below 800 ft, followed by an &lt;em&gt;is_ascending&lt;/em&gt; event.&lt;/p&gt;
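&lt;p&gt;Before reaching for Flink, the rule is easier to see in plain Python. This is only an illustrative reference implementation of the classification above (five consecutive drops, a sub-800 ft reading, then a climb back above 1,000 ft), not the production query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def is_missed_approach(altitudes, ground_alt=800, min_safe_alt=1000):
    """Illustrative check for the descend / near-ground / climb pattern."""
    for i in range(len(altitudes) - 5):
        window = altitudes[i:i + 6]
        if not all(a &amp;gt; b for a, b in zip(window, window[1:])):
            continue  # need 5 consecutive decreasing readings
        rest = altitudes[i + 5:]
        for j, alt in enumerate(rest):
            if alt &amp;lt;= ground_alt:
                # a climb after the near-ground reading, back above the minimum
                climb = rest[j:]
                if any(b &amp;gt; a for a, b in zip(climb, climb[1:])) and max(climb) &amp;gt; min_safe_alt:
                    return True
    return False

print(is_missed_approach([3000, 2400, 1800, 1200, 900, 700, 1500, 2500]))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;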

&lt;p&gt;My final &lt;a href="https://github.com/saubury/plane_track/blob/main/README.md#find-missed-approaches" rel="noopener noreferrer"&gt;Flink SQL&lt;/a&gt; query uses &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/sql/queries/match_recognize/" rel="noopener noreferrer"&gt;&lt;strong&gt;MATCH_RECOGNIZE&lt;/strong&gt;&lt;/a&gt;, a powerful pattern recognition feature in Flink SQL for complex event processing. It identifies specific flight altitude patterns in a stream of aircraft data, partitioned by callsign (i.e., individual aircraft).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM flight
MATCH_RECOGNIZE(
    PARTITION BY callsign
    ORDER BY proc_time
    MEASURES
        IS_DESCENDING.flightts as desc_UTC,
        IS_GROUND.flightts as ground_UTC,
        IS_ASCENDING.flightts AS asc_UTC,
        IS_ABOVE_MIN.flightts AS abvm_UTC,
        IS_GROUND.altitude AS grd_altitude,
        IS_ASCENDING.altitude AS asc_altitude
    ONE ROW PER MATCH
    AFTER MATCH SKIP TO LAST IS_ASCENDING
    PATTERN (IS_DESCENDING{5,} IS_GROUND{1,} IS_ASCENDING IS_ABOVE_MIN)
    DEFINE
        IS_DESCENDING AS (LAST(altitude, 1) IS NULL AND altitude &amp;gt;= 1000) OR altitude &amp;lt; LAST(altitude, 1),
        IS_GROUND AS altitude &amp;lt;= 800,
        IS_ASCENDING AS altitude &amp;gt; last(altitude,1),
        IS_ABOVE_MIN AS altitude &amp;gt; 1000
) AS T
where TIMESTAMPDIFF(second, desc_UTC, asc_UTC) between 0 and 1000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When I originally wrote this query I got a number of false positives where planes would land, then take off a few hours later with the same flight code. To exclude these conditions I added an additional predicate to only return matches where the time between the descent and the subsequent ascent is within 1000 seconds.&lt;/p&gt;

&lt;p&gt;With my query running it actually took a few days to identify the first missed approach. On a particularly stormy morning I managed to identify three occasions when a go-around was performed — and validated the result by looking up each flight’s historic path with &lt;a href="https://www.flightradar24.com/" rel="noopener noreferrer"&gt;FlightRadar24&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr48zy9z0475rrnyyxhj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr48zy9z0475rrnyyxhj.png" alt="A flight identified as missed approach landing" width="677" height="153"&gt;&lt;/a&gt;&lt;em&gt;A flight identified as missed approach landing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With my data acquisition successfully finding missed approaches, I wanted to move on to more complex event processing — this time with multiple aircraft events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying twin landings
&lt;/h2&gt;

&lt;p&gt;Paired flight landings occur when aircraft land on parallel runways.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhhxhqu9u4nalvdymawg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhhxhqu9u4nalvdymawg.gif" width="326" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As I wanted to determine the distance between aircraft I used a &lt;a href="https://docs.confluent.io/cloud/current/flink/how-to-guides/create-udf.html" rel="noopener noreferrer"&gt;user-defined function&lt;/a&gt; (UDF) to extend the capabilities of Apache Flink to implement custom logic beyond what is supported by built-in SQL functions. By adding a &lt;a href="https://github.com/saubury/plane_track/blob/main/java/example/Distance.java" rel="noopener noreferrer"&gt;distance scalar&lt;/a&gt; Java function I could calculate distance between two aircraft.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // Equirectangular approximation to calculate distance in km between two points 
    public float eval(float lat1, float lon1, float lat2, float lon2) {
        float EARTH_RADIUS = 6371;
        float lat1Rad = (float) Math.toRadians(lat1);
        float lat2Rad = (float) Math.toRadians(lat2);
        float lon1Rad = (float) Math.toRadians(lon1);
        float lon2Rad = (float) Math.toRadians(lon2);

        float x = (float) ((lon2Rad - lon1Rad) * Math.cos((lat1Rad + lat2Rad) / 2));
        float y = (lat2Rad - lat1Rad);
        float distance = (float) (Math.sqrt(x * x + y * y) * EARTH_RADIUS);

        return distance;
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Refer to the &lt;a href="https://github.com/saubury/plane_track?tab=readme-ov-file#flink-udf" rel="noopener noreferrer"&gt;readme&lt;/a&gt; for the JAR build steps and the operations to add it to Flink. The short summary: compile the JAR with mvn clean package and then register the UDF in Flink with&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ADD JAR '/target-jars/udf_example-1.0.jar';

CREATE FUNCTION distancekm  AS 'com.example.my.Distance';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;OK — with my Flink distance UDF available I can run a query that finds pairs of flights that were geographically close (within 1.5 km) to each other during overlapping or near-overlapping times (within 20 seconds), and reports their callsigns and distance.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT f1.callsign AS f1, 
f2.callsign AS f2,
CAST(ROUND(distancekm(f1.latitude , f1.longtitude, f2.latitude, f2.longtitude), 1) AS VARCHAR) as km
FROM flight f1, flight f2
WHERE f1.flightts BETWEEN f2.flightts - interval '20' SECOND AND f2.flightts
AND f1.callsign &amp;lt; f2.callsign
AND distancekm(f1.latitude , f1.longtitude, f2.latitude, f2.longtitude) &amp;lt; 1.5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The query identifies flights that came close together in both space and time and reports the distance between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4rw110hdqsi9xnuf27f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4rw110hdqsi9xnuf27f.png" alt="Nearby flights" width="664" height="122"&gt;&lt;/a&gt;&lt;em&gt;Nearby flights&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This works — but it is showing &lt;em&gt;any&lt;/em&gt; paired aircraft movement. What I really wanted to find was the less common occurrence of the same aircraft type (such as two Boeing 737s) flying in formation. I need a bit more data …&lt;/p&gt;

&lt;h2&gt;
  
  
  Annotating ADS-B messages with aircraft type and routes
&lt;/h2&gt;

&lt;p&gt;Peering into the ADS-B messages, I have raw payloads from the aircraft. Each payload comes with an ICAO 24-bit &lt;a href="https://en.wikipedia.org/wiki/Transponder_(aviation)" rel="noopener noreferrer"&gt;transponder&lt;/a&gt; code uniquely assigned to each aircraft (e.g. 7c7a3d) and a flight route code (e.g. 7VOZ518). What I want to do is load a static reference data set to map&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Aircraft ICAO codes such as 7c7a3d to airframes such as a Boeing 737NG 8FE&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flight codes such as 7VOZ518 to a route from the Gold Coast (OOL) to Sydney (SYD)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A very convenient capability of Flink SQL is the ability to create a table directly from a CSV file. So I can populate the aircraft_lookup table with a command like this&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE aircraft_lookup (
    icao24  varchar(100) not null,
    country  varchar(100),
    manufacturerName varchar(100),
    model varchar(100),
    owner varchar(100),
    registration varchar(100),
    typecode varchar(100)
) WITH ( 
    'connector' = 'filesystem',
    'path' = '/data_csv/aircraft_lookup.csv',
    'format' = 'csv'  
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I downloaded aircraft data from the &lt;a href="https://opensky-network.org/datasets/metadata/#metadata/" rel="noopener noreferrer"&gt;Opensky data&lt;/a&gt; archive. With aircraft_lookup and route_lookup data loaded, I created a Flink view to supplement the data coming in from the flight Kafka topic&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE VIEW flight_decorated
AS
SELECT f.*, a.model, a.owner, a.typecode, r.route
FROM flight f 
LEFT JOIN aircraft_lookup a ON (f.icao = a.icao24)
LEFT JOIN route_lookup r ON (f.callsign = r.flight);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With data loaded and the flight feed decorated with aircraft type and route information I can now search for the perfect photographic moment of twin planes landing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Twin landings
&lt;/h2&gt;

&lt;p&gt;I now have a live feed of data with aircraft location, airframe type and route information. Along with my distance function I can query the stream to find the golden photographic moments when the same commercial aircraft type were flying in close formation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT f1.flightts,
f1.callsign || ' ('  || COALESCE(f1.route, '-') ||')' || ' ' || f1.typecode AS f1,
CAST(ROUND(DISTANCEKM(f1.latitude , f1.longtitude, f2.latitude, f2.longtitude), 1) AS VARCHAR) AS km,
f2.callsign || ' ('  || COALESCE(f2.route, '-') ||')' || ' ' || f2.typecode AS f2
FROM flight_decorated f1, flight_decorated f2
WHERE f1.flightts BETWEEN f2.flightts - interval '20' SECOND AND f2.flightts
AND f1.callsign &amp;lt; f2.callsign
AND f1.typecode = f2.typecode
AND DISTANCEKM(f1.latitude , f1.longtitude, f2.latitude, f2.longtitude) &amp;lt; 1.5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Which indeed finds the moment when two similarly typed aircraft land together&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zc2yw2rcbd57hsc5hby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zc2yw2rcbd57hsc5hby.png" alt="Two Boing 737’s landing on parallel runways" width="533" height="407"&gt;&lt;/a&gt;&lt;em&gt;Two Boing 737’s landing on parallel runways&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;This project was a fun exercise — and shows how Apache Flink can turn a stream of aircraft transponder pings into a hunt for interesting aviation moments like go-arounds and perfect twin landings.&lt;/p&gt;

&lt;p&gt;With some neat Flink SQL and custom functions, it’s possible to spot those rare times when planes pair in the sky or wave off a landing at the last second.&lt;/p&gt;

&lt;p&gt;✈️ Project code available on &lt;a href="https://github.com/saubury/plane_track" rel="noopener noreferrer"&gt;https://github.com/saubury/plane_track&lt;/a&gt;&lt;/p&gt;

</description>
      <category>apache</category>
      <category>kafka</category>
      <category>planes</category>
    </item>
    <item>
      <title>Puppy data</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Wed, 19 Feb 2025 02:45:31 +0000</pubDate>
      <link>https://dev.to/saubury/puppy-data-2gin</link>
      <guid>https://dev.to/saubury/puppy-data-2gin</guid>
      <description>&lt;h1&gt;
  
  
  Puppy data
&lt;/h1&gt;

&lt;p&gt;🐾 Puppy Data 🐾 combines a love of data and dogs — a project to track Barney the puppy 🐶! Using load cells and low-power motion tracking sensors I built an automated system to monitor Barney’s weight, sleep habits, and activity 📊. Data is sent to Home Assistant for easy tracking via a slick interface 📱. Cortex AI and a Streamlit application allow me to ask questions like “How much heavier is Barney this month?” or “When was he most active?” 🐕💡. A blend of IoT, DIY tech, and puppy love ❤️📈!&lt;/p&gt;

&lt;p&gt;This is Barney — our (now) 6-month-old “Staffy Cross” — adopted from a local shelter. We love Barney joining our family — and I love data. Barney has agreed (sort of) to help with a bit of local data gathering so we can watch him grow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq847pshnp0zpihgx3fc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq847pshnp0zpihgx3fc6.png" alt="Barney at home — along with sample analytics" width="679" height="365"&gt;&lt;/a&gt;&lt;em&gt;Barney at home — along with sample analytics&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My goal with this project was to passively collect data on Barney’s activity — his sleeping habits, weight gain and movements around our house. The data is fed into a data warehouse, where a conversational interface lets me interact with my Barney data to answer questions like “When was Barney the most active yesterday?”, “How much heavier is Barney this month?” or “Where did Barney put my shoe?” (okay — maybe it didn’t help with that last one).&lt;/p&gt;

&lt;h2&gt;
  
  
  Puppy weight
&lt;/h2&gt;

&lt;p&gt;Puppies grow quickly — and I wanted an automated way of measuring his weight each day. On the underside of Barney’s kennel (the “Barn”!) I installed 4 load cell weighing sensors — one for each corner of Barney’s Barn. By summing the combined weight across the 4 cells and subtracting the &lt;a href="https://en.wikipedia.org/wiki/Tare_weight" rel="noopener noreferrer"&gt;tare weight&lt;/a&gt; (of the kennel itself and any cushions or toys dragged into the Barn overnight) I can determine an accurate daily weight for our puppy.&lt;/p&gt;

&lt;p&gt;Load cells are pretty neat. They measure weight (or, more accurately, directional force). Each load cell has an electrical resistance that changes in response to (and in proportion to) the force applied. As Barney jumps into his Barn, we can instantly weigh him — and as a bonus also measure the time he spends napping in his kennel.&lt;/p&gt;
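&lt;p&gt;The arithmetic itself is tiny. A hedged sketch of the idea with made-up readings (the real calibration happens in ESPHome below, and the tare correction in SQL later):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sum the four corner load cells, then subtract the tare (kennel + bedding)
def puppy_weight_kg(cell_readings_kg, tare_kg):
    return sum(cell_readings_kg) - tare_kg

# Hypothetical per-corner readings while Barney naps in the Barn
corners = [5.1, 4.8, 5.3, 4.9]   # kg measured by each load cell
print(puppy_weight_kg(corners, tare_kg=8.6))  # 11.5 kg of puppy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;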

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sh0klffncq384pwzpvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sh0klffncq384pwzpvs.png" alt="Close up a load cell and 3D printed mounting bracket" width="749" height="621"&gt;&lt;/a&gt;&lt;em&gt;Close up a load cell and 3D printed mounting bracket&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.aliexpress.com/item/32968926628.html" rel="noopener noreferrer"&gt;50kg load cells and HX711&lt;/a&gt; amplifier module were around $5. The peculiar thing about these sensors is that they can’t sit flush against the surface of the barn: the centre of each sensor needs a gap so it can flex when a load is applied. Searching online I found you can 3D print &lt;a href="https://www.thingiverse.com/thing:2274593" rel="noopener noreferrer"&gt;a bracket&lt;/a&gt; to keep the centre clear of the underside and avoid this problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a5xeeqoyye8to2tgvky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a5xeeqoyye8to2tgvky.png" alt="HX711 amplifier module (left) and ESP32 (right)" width="800" height="447"&gt;&lt;/a&gt;&lt;em&gt;HX711 amplifier module (left) and ESP32 (right)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I wired the four load cells into a single circuit with the HX711 amplifier module in a &lt;a href="https://en.wikipedia.org/wiki/Wheatstone_bridge" rel="noopener noreferrer"&gt;Wheatstone bridge&lt;/a&gt; configuration — have a look at &lt;a href="https://circuitjournal.com/50kg-load-cells-with-HX711" rel="noopener noreferrer"&gt;this helpful blog&lt;/a&gt; for the detailed steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tiy3q759imusulgbw34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tiy3q759imusulgbw34.png" alt="Underside of barn — with load cells and circuitry" width="800" height="600"&gt;&lt;/a&gt;&lt;em&gt;Underside of barn — with load cells and circuitry&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finally I was ready to connect the load cells and HX711 amplifier module to an &lt;a href="https://www.aliexpress.com/gcp/300000512/nnmixupdatev3?spm=a2g0o.productlist.main.1.30ec7404w2UU6n&amp;amp;productIds=1005006336964908&amp;amp;pha_manifest=ssr&amp;amp;_immersiveMode=true&amp;amp;disableNav=YES&amp;amp;channelLinkTag=nn_newgcp&amp;amp;sourceName=mainSearchProduct&amp;amp;utparam-url=scene%3Asearch%7Cquery_from%3A" rel="noopener noreferrer"&gt;ESP32&lt;/a&gt; microcontroller with integrated Wi-Fi and Bluetooth. Everything was taped to the underside of the barn — and the delicate wires and circuit boards hidden from the curious pup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmc6aty2ovlcf9nup7evi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmc6aty2ovlcf9nup7evi.png" alt="Barney — passionate data producer" width="800" height="578"&gt;&lt;/a&gt;&lt;em&gt;Barney — passionate data producer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I flashed the ESP32 with &lt;a href="https://esphome.io/" rel="noopener noreferrer"&gt;ESPHome&lt;/a&gt; and added the &lt;a href="https://esphome.io/components/sensor/hx711.html" rel="noopener noreferrer"&gt;HX711&lt;/a&gt; sensor platform configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sensor:
  - platform: hx711
    name: "HX711 Value"
    dout_pin: GPIO14
    clk_pin: GPIO13
    gain: 128
    update_interval: 15s   
    filters: 
    - calibrate_linear:
        - -455742 -&amp;gt; 0
        - -550682 -&amp;gt; 4.404
    unit_of_measurement: kg
    accuracy_decimals: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The weight was captured “as-is” — and I did some SQL post-processing to work out the tare weight. The final sensor was then added to &lt;a href="https://www.home-assistant.io/" rel="noopener noreferrer"&gt;Home Assistant&lt;/a&gt;, giving me monitoring, persistent storage and a nice user interface (and app) which I can use anywhere I’ve got an internet connection. Behind the scenes a local PostgreSQL database stores all the sensor measurements every minute.&lt;/p&gt;
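&lt;p&gt;As a sketch of that post-processing (the table and column names below are invented for illustration; the real Home Assistant recorder schema differs), the daily tare correction could be as simple as each day’s peak reading minus its empty-Barn minimum:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import psycopg2

# Illustrative query: the day's maximum combined reading (Barn + Barney)
# minus the day's minimum (empty Barn) approximates the puppy's weight
SQL = """
SELECT date_trunc('day', measured_at) AS day,
       MAX(reading_kg) - MIN(reading_kg) AS puppy_kg
FROM sensor_weight
GROUP BY 1
ORDER BY 1;
"""

with psycopg2.connect('dbname=homeassistant') as conn:
    with conn.cursor() as cur:
        cur.execute(SQL)
        for day, puppy_kg in cur.fetchall():
            print(day, round(puppy_kg, 1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;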

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F211xvss7qfat67q8l0bg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F211xvss7qfat67q8l0bg.png" alt="Home assistant dashboard" width="520" height="578"&gt;&lt;/a&gt;&lt;em&gt;Home assistant dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With weight measurement sorted, I then set out to track Barney’s activity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Puppy activity
&lt;/h2&gt;

&lt;p&gt;Barney rarely keeps still — so I wanted a way to track his movements during the day and sleep patterns at night. I got him a &lt;a href="https://www.fitbark.com/en-AU/store/fitbark2" rel="noopener noreferrer"&gt;FitBark 2&lt;/a&gt; and placed it on his collar to monitor his everyday activity. This is a small 3D accelerometer and Bluetooth transmitter that weighs only 10 grams! This is a very cool device — and it has a &lt;a href="https://www.fitbark.com/en-AU/dev" rel="noopener noreferrer"&gt;developer API&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kawsk8nzsn0zrspcplt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kawsk8nzsn0zrspcplt.png" alt="Barney with Fitbark on his collar" width="800" height="506"&gt;&lt;/a&gt;&lt;em&gt;Barney with Fitbark on his collar&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I used the Home Assistant &lt;a href="https://www.home-assistant.io/integrations/rest" rel="noopener noreferrer"&gt;rest sensor&lt;/a&gt; platform to consume the Fitbark RESTful API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rest:
  - authentication: digest
    verify_ssl: false
    # update every 1 hour
    scan_interval: 3600
    resource: https://app.fitbark.com/api/v2/activity_series
    method: POST
    payload_template: &amp;gt;-
      {   
        "activity_series":{"slug":"XXxxXXxx",           
          "from":"{{ now().strftime('%Y-%m-%d') }}",           
          "to":"{{ now().strftime('%Y-%m-%d') }}",           
          "resolution":"HOURLY"   
        }
      }
    headers:
      Authorization: !secret fitbark_bearer_token
      Content-Type: application/json
      User-Agent: Mozilla/5.0
    sensor:
      - name: "Fitbark_activityseries_activity_value"
        unique_id: fitbark_activityseries_activity_value
        value_template: "{{ value_json.activity_series.records[-2].activity_value | int }}" 

      - name: "Fitbark_activityseries_min_play"
        unique_id: fitbark_activityseries_min_play
        device_class: duration
        unit_of_measurement: min
        value_template: "{{ value_json.activity_series.records[-2].min_play | int }}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The sensor has support for GET and POST requests, and I used the &lt;a href="https://documenter.getpostman.com/view/238826/2s8ZDbW1Gf#7cde488a-ff03-4401-9658-207f7531a6ee" rel="noopener noreferrer"&gt;Get Activity Series&lt;/a&gt; API to get the recent hourly activity for Barney, giving me a breakdown of his active, play and rest minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdhhu5ez5wovpq7l9p72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdhhu5ez5wovpq7l9p72.png" alt="Activity by hour" width="455" height="368"&gt;&lt;/a&gt;&lt;em&gt;Activity by hour&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Much like the weight measurements, the hourly activity measures are managed by Home Assistant which automatically stores the sensor measurements to a local PostgreSQL database.&lt;/p&gt;

&lt;p&gt;With Barney’s weight and activity data captured, let’s move on to data analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cortex Analyst in Snowflake
&lt;/h2&gt;

&lt;p&gt;Barney had given me a lot of data — and now I wanted to do something with it! I’ve been using &lt;a href="https://docs.snowflake.com/user-guide/snowflake-cortex/cortex-analyst" rel="noopener noreferrer"&gt;Cortex Analyst&lt;/a&gt; in the &lt;a href="https://www.snowflake.com/en/" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt; database in my “day job” and thought it was an ideal way of asking questions about Barney in natural language and receiving direct answers without writing SQL.&lt;/p&gt;

&lt;p&gt;Cortex Analyst is a fully-managed, LLM-powered &lt;a href="https://www.snowflake.com/en/data-cloud/cortex/" rel="noopener noreferrer"&gt;Snowflake Cortex&lt;/a&gt; feature that helps you create applications capable of answering questions based on your data stored in Snowflake. Cortex Analyst will be tasked with answering the “Is Barney playing more this week?” style of question.&lt;/p&gt;
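&lt;p&gt;Behind a conversational UI the interaction boils down to a single REST call. Here is a minimal sketch of posing a question to the Cortex Analyst message endpoint; the account URL, token and semantic model path are all placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

# Placeholders: substitute your own account, auth token and staged model
ACCOUNT_URL = 'https://myaccount.snowflakecomputing.com'
TOKEN = '...'  # e.g. an OAuth or key-pair JWT token

resp = requests.post(
    f'{ACCOUNT_URL}/api/v2/cortex/analyst/message',
    headers={'Authorization': f'Bearer {TOKEN}',
             'Content-Type': 'application/json'},
    json={
        'messages': [{'role': 'user', 'content': [
            {'type': 'text', 'text': 'How much heavier is Barney this month?'}
        ]}],
        'semantic_model_file': '@puppy_db.public.my_stage/barney_model.yaml',
    },
)
print(resp.json())  # the reply includes generated SQL plus a text answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;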

&lt;p&gt;Here’s a look at the Streamlit conversational application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvtusfzec720mzk0fruo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvtusfzec720mzk0fruo.png" alt="Streamlit conversational application." width="693" height="310"&gt;&lt;/a&gt;&lt;em&gt;Streamlit conversational application.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I can start asking questions such as&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What was Barney’s smallest weight and when was that measured?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vd9lp7bzjv3bwpn6ga1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vd9lp7bzjv3bwpn6ga1.png" alt="Barney used to weigh 8.55kg" width="772" height="324"&gt;&lt;/a&gt;&lt;em&gt;Barney used to weigh 8.55kg&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Barney used to be such a tiny puppy! Let’s look at his growth&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Show me the weekly average weight of Barney for the last 3 months.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ues1yq1t4k9v2797bcj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ues1yq1t4k9v2797bcj.png" alt="Barney has grown a lot over 3 months" width="473" height="377"&gt;&lt;/a&gt;&lt;em&gt;Barney has grown a lot over 3 months&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Barney really has grown a lot over the last 3 months. Finally, let’s look at his activity&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pivot the activity type for Barney and summarise the minutes each week for the last month.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sphgqhnacm0x64iommp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sphgqhnacm0x64iommp.png" alt="Lots of play time!" width="475" height="374"&gt;&lt;/a&gt;&lt;em&gt;Lots of play time!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I see Barney enjoys his naps — and loves to run around too&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons &amp;amp; future enhancements
&lt;/h2&gt;

&lt;p&gt;I’ve been happy with the Barney data collected so far — but a few things haven’t worked as expected&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The “food scale” I initially created to weigh the food consumed proved impractical — we kept moving the feeding bowl, and Barney would often chew on the wires&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The FitBark API call from Home Assistant is delicate — it requires me to request specific blocks of time and has no error or retry logic. I’d prefer to rework this as a proper backfill operator&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I originally wanted to use the signal strength (RSSI value) of the FitBark bluetooth as a proxy for location — however this was too imprecise for any meaningful measurements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The data transfer from local PostgreSQL to cloud Snowflake was manual — I’d like to automate this (and perhaps play with &lt;a href="https://dlthub.com/" rel="noopener noreferrer"&gt;dlt&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What did work and I was happy with&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The weight scale worked better than expected — with (what appears to be) a smooth, progressive and reliable weight measurement over the last few months&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;FitBark has an impressive data logging mechanism — and the battery has only been charged once in 6 months!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the end, &lt;em&gt;Puppy Data&lt;/em&gt; showcases how IoT and puppy energy can turn a love for dogs and data into a fun way to track Barney’s growth and adventures, proving that even tech can have a heart ❤️🐾📊!&lt;/p&gt;

&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;

&lt;p&gt;The code and example data is available at&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/saubury/puppy_data" rel="noopener noreferrer"&gt;https://github.com/saubury/puppy_data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://esphome.io/guides/getting_started_hassio.html" rel="noopener noreferrer"&gt;Getting Started with ESPHome and Home Assistant&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://quickstarts.snowflake.com/guide/getting_started_with_cortex_analyst_in_snowflake/index.html" rel="noopener noreferrer"&gt;Getting Started with Cortex Analyst in Snowflake&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cortexai</category>
      <category>homeassistant</category>
      <category>snowflake</category>
    </item>
    <item>
      <title>FridgeBot — GPT-4o shopping list automation</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Tue, 21 May 2024 02:20:02 +0000</pubDate>
      <link>https://dev.to/saubury/fridgebot-gpt-4o-shopping-list-automation-98d</link>
      <guid>https://dev.to/saubury/fridgebot-gpt-4o-shopping-list-automation-98d</guid>
      <description>&lt;h1&gt;
  
  
  FridgeBot — GPT-4o shopping list automation
&lt;/h1&gt;

&lt;p&gt;Monitoring the contents of my fridge and automatically adding grocery items to my shopping list with the new GPT-4o vision API&lt;/p&gt;

&lt;p&gt;OpenAI recently announced their new generative AI model &lt;a href="https://openai.com/index/hello-gpt-4o/" rel="noopener noreferrer"&gt;GPT-4o&lt;/a&gt; — the “o” stands for “omni,” referring to the model’s ability to use mixed modalities including text, speech and video. I wanted to give GPT-4o a real challenge — helping me keep on top of the shopping list by automatically monitoring the contents of my fridge and adding grocery items when I had run out of something.&lt;/p&gt;


&lt;h2&gt;
  
  
  Processing steps
&lt;/h2&gt;

&lt;p&gt;The fridge door light is used as a signal for when to start and stop looking for changes in the fridge. I’m assuming that anything taken from my fridge is removed while the door is open and the door light is illuminated.&lt;/p&gt;

&lt;p&gt;The code does the following&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Open the camera feed — and use the &lt;a href="https://opencv.org/" rel="noopener noreferrer"&gt;OpenCV&lt;/a&gt; library for real-time computer vision processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every 500ms we take an image from the video feed and convert the image to grey-scale&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate the mean brightness of the grey-scale image — and use the average value as an indicator of whether the fridge door is open (a condensed sketch follows this list)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the last image was “dark” and the current image is “light”, we assume the fridge door has just been opened. We wait 500ms for the camera white balance to settle, and save this as &lt;em&gt;image-1&lt;/em&gt; to represent the initial state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We continue to take photos every 500ms, and save the image temporarily as we don’t know when the door is going to close&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the average brightness of the latest image drops significantly, we know the fridge door has just been closed. We discard this image (as it is dark), and use the &lt;em&gt;last&lt;/em&gt; image taken and save this as &lt;em&gt;image-2&lt;/em&gt; to represent the final state.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
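&lt;p&gt;Here is that door-light detection as a condensed sketch; the threshold is illustrative and would need tuning for a real fridge camera:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import cv2

cap = cv2.VideoCapture('media/pikelets.mov')   # or 0 for a live camera
DOOR_OPEN_THRESHOLD = 60    # mean brightness (0-255); tune for your fridge
was_open = False

while True:
    ok, frame = cap.read()
    if not ok:
        break
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    is_open = grey.mean() &amp;gt; DOOR_OPEN_THRESHOLD
    if is_open and not was_open:
        print('door opened: wait for white balance, then save image-1')
    elif was_open and not is_open:
        print('door closed: the previous lit frame becomes image-2')
    was_open = is_open
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;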

&lt;p&gt;Now that we have the before and after images, we use &lt;em&gt;image-1&lt;/em&gt; and &lt;em&gt;image-2&lt;/em&gt; as inputs to the OpenAI API&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We encode both images as &lt;a href="https://en.wikipedia.org/wiki/Base64" rel="noopener noreferrer"&gt;base-64&lt;/a&gt; to transform the binary image data into a sequence of printable characters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We use the OpenAI GPT-4o &lt;a href="https://platform.openai.com/docs/guides/vision" rel="noopener noreferrer"&gt;vision API&lt;/a&gt; along with the prompt &lt;em&gt;‘What item is missing in the second image?’&lt;/em&gt; (sketched after this list)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The response is decoded — and is assumed to be a single word describing what item was in &lt;em&gt;image-1&lt;/em&gt; and is not present in &lt;em&gt;image-2&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
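&lt;p&gt;A sketch of that vision call using the openai Python client (v1-style API); the image file names match the demo files generated later in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode(path):
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode()

images = [{'type': 'image_url',
           'image_url': {'url': f'data:image/jpeg;base64,{encode(p)}'}}
          for p in ('media/pikelets.1.jpg', 'media/pikelets.2.jpg')]

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user',
               'content': [{'type': 'text',
                            'text': 'What item is missing in the second image?'}] + images}],
)
print(response.choices[0].message.content)  # e.g. 'Pikelets'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;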

&lt;p&gt;Knowing the item, we can add to our shopping list&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://developer.todoist.com/guides/#developing-with-todoist" rel="noopener noreferrer"&gt;Todoist Sync API&lt;/a&gt; is used to add the item to our shopping list (a minimal sketch follows)&lt;/li&gt;
&lt;/ul&gt;
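&lt;p&gt;A minimal sketch of that Sync API call; the token variable stands in for the secret kept in config_secrets.py:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import uuid

import requests

TODOIST_TOKEN = '...'   # stands in for the secret in config_secrets.py
item = 'Pikelets'       # the single word returned by GPT-4o

resp = requests.post(
    'https://api.todoist.com/sync/v9/sync',
    headers={'Authorization': f'Bearer {TODOIST_TOKEN}'},
    data={'commands': json.dumps([{
        'type': 'item_add',
        'temp_id': str(uuid.uuid4()),
        'uuid': str(uuid.uuid4()),
        'args': {'content': item},  # add project_id to target the shopping list
    }])},
)
resp.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;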

&lt;h2&gt;
  
  
  Hardware
&lt;/h2&gt;

&lt;p&gt;The final form of this project runs on a &lt;a href="https://www.raspberrypi.com/" rel="noopener noreferrer"&gt;RaspberryPi&lt;/a&gt; with a live video processing stream within my fridge. In reality this was a little impractical as both power and ethernet cables needed to be routed past the fridge seal.&lt;/p&gt;

&lt;p&gt;To demonstrate the steps without the need for specific hardware (or a fridge) you can run this project on almost any machine. The setup steps below use a demonstration video file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkRSzJQ8tezCfVoVp1PF9gQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkRSzJQ8tezCfVoVp1PF9gQ.png" alt="FridgeBot — fridge monitoring with GPT-4o"&gt;&lt;/a&gt;&lt;em&gt;FridgeBot — fridge monitoring with GPT-4o&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup virtual python environment
&lt;/h2&gt;

&lt;p&gt;Create a &lt;a href="https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/" rel="noopener noreferrer"&gt;virtual python&lt;/a&gt; environment to keep dependencies separate. The &lt;em&gt;venv&lt;/em&gt; module is the preferred way to create and manage virtual environments.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m venv .venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Before you can start installing or using packages in your virtual environment you’ll need to activate it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  API setup
&lt;/h2&gt;

&lt;p&gt;Now it’s time to set up the local API secrets.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp -i config_secrets_example.py config_secrets.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Edit config_secrets.py with the &lt;em&gt;OpenAI&lt;/em&gt; token along with the &lt;em&gt;Todoist&lt;/em&gt; API secret token&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://platform.openai.com/api-keys" rel="noopener noreferrer"&gt;OpenAI API keys&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://app.todoist.com/app/settings/integrations/developer" rel="noopener noreferrer"&gt;Todoist API keys&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FridgeBot — video process only
&lt;/h2&gt;

&lt;p&gt;To run FridgeBot against the example video file without calling OpenAI, run the following&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python fridgebot.py --video media/pikelets.mov
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This should generate two files representing the first lit image and the last lit image within the video: media/pikelets.1.jpg and media/pikelets.2.jpg&lt;/p&gt;

&lt;h2&gt;
  
  
  FridgeBot — video process and OpenAI
&lt;/h2&gt;

&lt;p&gt;To run FridgeBot against the example video file, calling OpenAI, run the following&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python fridgebot.py --video media/pikelets.mov --openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This should generate two files, and query the OpenAI API to identify the item removed — Pikelets&lt;/p&gt;

&lt;h2&gt;
  
  
  FridgeBot — video process, OpenAI and Todoist
&lt;/h2&gt;

&lt;p&gt;To run FridgeBot against the example video file, calling OpenAI and adding the item to Todoist, run the following&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python fridgebot.py --video media/pikelets.mov --openai --todoist
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This should generate two files, and query the OpenAI API and add Pikelets to the Todoist shopping list&lt;/p&gt;

&lt;h2&gt;
  
  
  FridgeBot Code
&lt;/h2&gt;

&lt;p&gt;FridgeBot code — &lt;a href="https://github.com/saubury/fridgebot_openai" rel="noopener noreferrer"&gt;https://github.com/saubury/fridgebot_openai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gpt4o</category>
      <category>raspberrypi</category>
    </item>
    <item>
      <title>🍹GinAI - Cocktails mixed with generative AI</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Thu, 19 Oct 2023 10:33:23 +0000</pubDate>
      <link>https://dev.to/saubury/ginai-cocktails-mixed-with-generative-ai-2nda</link>
      <guid>https://dev.to/saubury/ginai-cocktails-mixed-with-generative-ai-2nda</guid>
      <description>&lt;h1&gt;
  
  
  🍹GinAI - Cocktails mixed with generative AI
&lt;/h1&gt;

&lt;p&gt;GinAI — a robotic bartender which can make a nice drink given a random collection of juices, mixers and spirits. Real cocktails created and music chosen by OpenAI — all mixed by a RaspberryPi bartender.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AvpvwxhcsgUIT9zpaFOIRGA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AvpvwxhcsgUIT9zpaFOIRGA.png" alt="GinAI — Cocktails mixed with generative AI and a RaspberryPi."&gt;&lt;/a&gt;&lt;em&gt;GinAI — Cocktails mixed with generative AI and a RaspberryPi.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  I’m bored — can I get a video?
&lt;/h2&gt;

&lt;p&gt;Here’s a quick video of GinAI in action.&lt;/p&gt;


&lt;h2&gt;
  
  
  Starting at the end
&lt;/h2&gt;

&lt;p&gt;Let me describe the finished project — and we can work backwards on how I built 🍹GinAI🍸. The GinAI bartender uses up to four ingredients — and when I press the dispense button, OpenAI &lt;a href="https://chat.openai.com/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt; will “create” a drink, describe the cocktail creation and select an appropriate song 🎵.&lt;/p&gt;

&lt;p&gt;A row of decorative lights looks pretty during the creation and flashes 🚨 once the cocktail is ready. A Google Nest Mini is used as the speaker for both the spoken words 🗣️ and for playing the tunes 🎶.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A2jsMCyqVC0JhNktEY2o-lQ.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A2jsMCyqVC0JhNktEY2o-lQ.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cocktail inspiration from ChatGPT
&lt;/h2&gt;

&lt;p&gt;I started my cocktail mixing adventure by simply asking OpenAI for a cocktail recipe from the random spirits and mixers I had available. For example, I prompted &lt;a href="https://chat.openai.com/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt; to suggest a cocktail with this query&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;create a cocktail from the ingredients gin, tequila, apple juice and tonic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which returns a helpful cocktail mixing recipe along with text instructions as a response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2752%2F1%2A3yNvwplBIXR1jBzFTJ7vYA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2752%2F1%2A3yNvwplBIXR1jBzFTJ7vYA.png" alt="First experiment with ChatGPT console"&gt;&lt;/a&gt;&lt;em&gt;First experiment with ChatGPT console&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I quickly found a few limitations with my initial cocktail creations from the console.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OpenAI would sometimes suggest a cocktail with an ingredient I didn’t specify (and didn’t have). I corrected this with the added instruction to only use ingredients from the provided list. Results were more reliable with the prompt instructions “You do not need to use all of the ingredients. You may only use a maximum of 4 ingredients.” I can now call myself a prompt engineer 😀&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recipe suggestions used a variety of measurement units — such as fluid ounces, “a dash of” or other bizarre imperial measurements. I could coerce the output by simply adding the prompt “Only give quantities in metric units. Only give quantities in whole numbers”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There was no consistency in the total volume of cocktail produced. Some may not consider 2 litres of alcohol a problem — but at the very least it overflowed my available cocktail glassware. A prompt instruction to limit the volume to 250 millilitres reduced spillage and excessive drinking&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the prompt creation giving reasonable results, I moved on to building a reliable interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Predictable OpenAI responses with function calling
&lt;/h2&gt;

&lt;p&gt;The cocktail recipes created by ChatGPT were going to drive the automated drink dispensing — so I needed to ensure the OpenAI API would generate a predictable schema for the JSON responses. OpenAI recently added &lt;a href="https://openai.com/blog/function-calling-and-other-api-updates" rel="noopener noreferrer"&gt;&lt;strong&gt;function calling&lt;/strong&gt;&lt;/a&gt; functionality to their API, which I could use to return a consistent JSON response. Function calling is primarily aimed at connecting GPT’s capabilities with external tools and APIs — and converts queries such as “Email Alice to see if she wants to get coffee next Friday” to a function call like &lt;strong&gt;send_email(to: string, body: string)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I didn’t actually need the function call itself, but I can use the technique (along with a dummy function) to ensure recipes suggested by OpenAI ChatGPT meet my specification. Enforcing a predictable JSON output means the directions can be easily parsed and implemented by the RaspberryPi liquid dispensing pumps to make yummy cocktails based on the response from a &lt;a href="https://platform.openai.com/docs/models/gpt-3-5" rel="noopener noreferrer"&gt;&lt;code&gt;gpt-3.5-turbo&lt;/code&gt;&lt;/a&gt; model.&lt;/p&gt;

&lt;p&gt;The easiest implementation I found was to use a &lt;a href="https://docs.pydantic.dev" rel="noopener noreferrer"&gt;Pydantic&lt;/a&gt; class for my target schema — and use that as a parameter for the method call to &lt;strong&gt;ChatCompletion.create()&lt;/strong&gt;. Here’s a fragment of the &lt;a href="https://github.com/saubury/GinAI/blob/master/ginai_types.py" rel="noopener noreferrer"&gt;GinAI Python classes&lt;/a&gt; used.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create a PyDantic schema for output
class Ingredient(BaseModel):
ingredient_name: str
quantity_ml: int

class Cocktail(BaseModel):
cocktail_name: str
description: str
inventor: str
matching_song: str
instructions: str
ingredients: list[Ingredient]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;These are basic classes: an &lt;strong&gt;Ingredient&lt;/strong&gt; is an ingredient name and a quantity specified in millilitres, and the &lt;strong&gt;Cocktail&lt;/strong&gt; class has a list of Ingredient objects, along with the name, description and an appropriate song to complement the drinking of the cocktail.&lt;/p&gt;

&lt;p&gt;I followed &lt;a href="https://medium.com/dev-bits/a-clear-guide-to-openai-function-calling-with-python-dcbc200c5d70" rel="noopener noreferrer"&gt;this guide&lt;/a&gt; as a great tutorial for using the new function calling feature from OpenAI to enforce a structured output from GPT models.&lt;/p&gt;

&lt;p&gt;The OpenAI Python calling logic looks like this (or see the &lt;a href="https://github.com/saubury/GinAI/blob/master/openai_util.py" rel="noopener noreferrer"&gt;whole openai_util.py module&lt;/a&gt;).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;completion = openai.ChatCompletion.create( 
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'system', 'content': 'You are a helpful bartender.'},
        {'role': 'user', 'content': prompt},
    ],
    functions=[
        {
        'name': 'get_answer_for_user_query',
        'description': 'Get user answer in series of steps',
        'parameters': Cocktail.model_json_schema()
        }
    ],
    function_call={'name': 'get_answer_for_user_query'}
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;OpenAI function calling ensured the created recipes conformed to a strict JSON schema. For example, a typical response would look like this.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "cocktail_name": "Summer Breeze",
    "description": "A refreshing cocktail perfect for a hot day.",
    "inventor": "My Bartender",
    "matching_song": "Summertime by DJ Jazzy Jeff &amp;amp; The Fresh Prince",
    "instructions": "1. Fill a glass with ice.\n2. Combine the ingredients in the glass.\n3. Stir well.\n4. Garnish with a slice of orange.\n5. Enjoy!",
    "ingredients": [
        {
            "ingredient_name": "gin",
            "quantity_ml": 45
        },
        {
            "ingredient_name": "orange juice",
            "quantity_ml": 60
        },
        {
            "ingredient_name": "tonic",
            "quantity_ml": 15
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
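
&lt;p&gt;To turn that response back into typed Python objects, the function call arguments can be validated straight into the Cocktail class. A minimal sketch, assuming Pydantic v2 and the legacy openai 0.x client shown above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The arguments of the (dummy) function call arrive as a JSON string
arguments = completion.choices[0].message.function_call.arguments

# Pydantic v2 parses and validates against the Cocktail schema in one step
cocktail = Cocktail.model_validate_json(arguments)

print(cocktail.cocktail_name)
for item in cocktail.ingredients:
    print(item.ingredient_name, item.quantity_ml, 'ml')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;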

&lt;p&gt;With cocktail recipes consistently created, I could move onto building the drink dispensing hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  GinAI — Hardware build
&lt;/h2&gt;

&lt;p&gt;With the interfaces and software roughed out to imagine cocktails, the next job was to build the pouring hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pumps
&lt;/h2&gt;

&lt;p&gt;I used 4 &lt;a href="https://en.wikipedia.org/wiki/Peristaltic_pump" rel="noopener noreferrer"&gt;peristaltic pumps&lt;/a&gt; to provide a “food safe” way to pump the liquids from the drink bottles. These pumps provide a steady rate of flow of liquids when powered. By carefully timing the “on” time for the pump I can precisely deliver the ideal amount of spirits or mixers for the perfect 🍸&lt;/p&gt;
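&lt;p&gt;As a rough sketch of that timing idea (the calibration figure below is an assumption; measure how long your own pump takes to move a known volume):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical calibration: this pump moved 100 ml in 67 seconds
PUMP_FLOW_ML_PER_SEC = 100 / 67          # roughly 1.5 ml per second

def on_time_seconds(quantity_ml):
    # Convert a recipe quantity into the pump "on" duration
    return quantity_ml / PUMP_FLOW_ML_PER_SEC

print(on_time_seconds(45))               # a 45 ml pour needs about 30 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;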

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2ASD6o91Snh9JpsY5J.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2ASD6o91Snh9JpsY5J.jpeg" alt="A quick search on AliExpress"&gt;&lt;/a&gt;&lt;em&gt;A quick search on AliExpress&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pumps are pretty cheap — and provide an accurate way to dispense precise quantities of liquids.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AronmNuCM2hr09wrV.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AronmNuCM2hr09wrV.jpeg" alt="Pumps — straight from the post"&gt;&lt;/a&gt;&lt;em&gt;Pumps — straight from the post&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pumps are mounted on a basic wooden frame higher than the tallest bottle. My children helped to build the frame, labelling and installing the pumps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2Ab_Ivo2Kkw4YaMUjP.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2Ab_Ivo2Kkw4YaMUjP.jpeg" alt="The important task of adding labels to pumps"&gt;&lt;/a&gt;&lt;em&gt;The important task of adding labels to pumps&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A view from the rear shows the placement of pumps and liquids. A manual electrical switch allows the pumps to be run independently of the Raspberry Pi (helpful for cleaning).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A0mfhEEb0kIinOPuS.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A0mfhEEb0kIinOPuS.jpeg" alt="Rear view of Cocktail maker"&gt;&lt;/a&gt;&lt;em&gt;Rear view of Cocktail maker&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pumps use 12 volt motors. To operate them via the Raspberry Pi I used a &lt;a href="https://www.jaycar.com.au/arduino-compatible-4-channel-12v-relay-module/p/XC4440" rel="noopener noreferrer"&gt;4 Channel 12V Relay Module&lt;/a&gt;. This allows the pumps to be switched on and off independently with the 5 volt signals from the &lt;a href="https://www.raspberrypi.org/documentation/usage/gpio/" rel="noopener noreferrer"&gt;GPIO pins&lt;/a&gt; of the Raspberry Pi.&lt;/p&gt;
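&lt;p&gt;The switching logic is then a matter of holding a GPIO pin in the “on” state for the calculated duration. A minimal sketch with the RPi.GPIO library, assuming an active-low relay board and the hypothetical BCM pin wiring below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import RPi.GPIO as GPIO

PUMP_FLOW_ML_PER_SEC = 1.5    # assumed pump calibration (see earlier sketch)
PUMP_PINS = {'gin': 17, 'orange juice': 27, 'tonic': 22, 'soda': 23}  # hypothetical wiring

GPIO.setmode(GPIO.BCM)
for pin in PUMP_PINS.values():
    GPIO.setup(pin, GPIO.OUT, initial=GPIO.HIGH)   # HIGH = relay off on an active-low board

def dispense(ingredient_name, quantity_ml):
    # Switch the relay on just long enough to pour the requested volume
    pin = PUMP_PINS[ingredient_name]
    GPIO.output(pin, GPIO.LOW)                     # relay on, pump running
    time.sleep(quantity_ml / PUMP_FLOW_ML_PER_SEC)
    GPIO.output(pin, GPIO.HIGH)                    # relay off

# Pour each ingredient from the parsed Cocktail object
for item in cocktail.ingredients:
    dispense(item.ingredient_name, item.quantity_ml)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;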

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AaBc80nhmSp4KfYRU.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AaBc80nhmSp4KfYRU.jpeg" alt="Relay board"&gt;&lt;/a&gt;&lt;em&gt;Relay board&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Raspberry Pi is mounted with the relay board. The relays switch 12 volt power on and off for the pump motors. The signals for the relay board are taken directly from the GPIO header of the Raspberry Pi.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2068%2F1%2Aggvd82pOjv_SewNbWLpspQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2068%2F1%2Aggvd82pOjv_SewNbWLpspQ.png" alt="Board placement"&gt;&lt;/a&gt;&lt;em&gt;Board placement&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finally I added the mandatory RGB LED’s for some colourful lighting effects. I used a row of &lt;a href="https://www.jaycar.com.au/duinotech-arduino-compatible-w2812b-rgb-led-strip-2m/p/XC4390" rel="noopener noreferrer"&gt;WS2812B LED strip&lt;/a&gt;. This was installed behind the final collecting tube with a bit of soldering hidden by white heat shrink tubing.&lt;/p&gt;
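&lt;p&gt;Driving the strip from Python looks something like this (a sketch using the rpi_ws281x library; the LED count and data pin are assumptions from my wiring):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from rpi_ws281x import PixelStrip, Color

LED_COUNT = 30   # assumed strip length
LED_PIN = 18     # GPIO 18 carries the WS2812B data signal

strip = PixelStrip(LED_COUNT, LED_PIN)
strip.begin()

# Wash the whole strip in blue while a drink is pouring
for i in range(strip.numPixels()):
    strip.setPixelColor(i, Color(0, 80, 255))
strip.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;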

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5324%2F1%2AsNogE8G29nINRmw7PPRQwA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5324%2F1%2AsNogE8G29nINRmw7PPRQwA.png" alt="WS2812B RGB LED strip."&gt;&lt;/a&gt;&lt;em&gt;WS2812B RGB LED strip.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A great bartender needs to be able to talk and entertain, so GinAI needed a speaker. I used a Google Nest Mini mounted to the frame as a speaker for spoken words and for playing the music.&lt;/p&gt;
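&lt;p&gt;One way to push announcements and music to a Nest Mini is to cast to it. A sketch with the pychromecast library; the speaker name and media URL are placeholders for my setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pychromecast

# 'GinAI Speaker' and the media URL are placeholders
chromecasts, browser = pychromecast.get_listed_chromecasts(friendly_names=['GinAI Speaker'])
cast = chromecasts[0]
cast.wait()    # block until the device is ready

mc = cast.media_controller
mc.play_media('http://my-pi.local/announce.mp3', 'audio/mp3')
mc.block_until_active()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;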

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5508%2F1%2A1x8Z6HnqWQ_8CzRiHUDl7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5508%2F1%2A1x8Z6HnqWQ_8CzRiHUDl7w.png" alt="Google Nest Mini"&gt;&lt;/a&gt;&lt;em&gt;Google Nest Mini&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🍹 Cheers!
&lt;/h2&gt;

&lt;p&gt;With a bit of coding and a &lt;em&gt;lot&lt;/em&gt; of trust in my GinAI bartender, I’m enjoying the surprising world of generatively created cocktails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aa2ZZ-P4j1-XN54fBlZPgDw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aa2ZZ-P4j1-XN54fBlZPgDw.png" alt="Cheers"&gt;&lt;/a&gt;&lt;em&gt;Cheers&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🛠️ Code
&lt;/h2&gt;

&lt;p&gt;The code for GinAI is available at&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/saubury/GinAI/" rel="noopener noreferrer"&gt;https://github.com/saubury/GinAI/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>genai</category>
      <category>openai</category>
      <category>raspberrypi</category>
    </item>
    <item>
      <title>My data, your LLM — paranoid analysis of iMessage chats with OpenAI, LlamaIndex &amp; DuckDB</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Tue, 12 Sep 2023 10:16:08 +0000</pubDate>
      <link>https://dev.to/saubury/my-data-your-llm-paranoid-analysis-of-imessage-chats-with-openai-llamaindex-duckdb-825</link>
      <guid>https://dev.to/saubury/my-data-your-llm-paranoid-analysis-of-imessage-chats-with-openai-llamaindex-duckdb-825</guid>
      <description>&lt;p&gt;&lt;em&gt;Can I safely combine my local personal data with a public large language model to understand my texting behaviour? A project combining natural language and generative AI models to explore my private data without sharing (too much of) my personal life with the robots.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oHAED7VP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4732/1%2A_cNWKGqKv1RwlODFBvUbSQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oHAED7VP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4732/1%2A_cNWKGqKv1RwlODFBvUbSQ.png" alt="Data architecture — image by author" width="800" height="346"&gt;&lt;/a&gt;&lt;em&gt;Data architecture — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;✍️ A blog about exploratory data analysis of my private iMessage chats — with equal parts of wonder and paranoia.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation 🤔
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/IMessage"&gt;iMessage&lt;/a&gt; is an instant messaging service for text communication between users on Apple devices. Behind the scenes, iMessage uses a local &lt;a href="https://en.wikipedia.org/wiki/SQLite"&gt;SQLite&lt;/a&gt; database to store a copy of message conversations. This means I have a complete local copy of all my messages in a relational database on my Mac laptop.&lt;/p&gt;

&lt;p&gt;With 2 years of iMessage history I wanted to explore the text data to create data visualisations — with natural language prompts. The data however is very personal — so I needed to use a privacy preserving design to ensure my personal communications don’t leave my machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech stack 🔧
&lt;/h3&gt;

&lt;p&gt;To explore my text messages I’m using&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; open-source, embedded, in-process OLAP database. With the &lt;a href="https://duckdb.org/docs/extensions/sqlite_scanner.html"&gt;SQLite extension&lt;/a&gt; to directly read from an iMessage SQLite database file&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.llamaindex.ai/"&gt;LlamaIndex&lt;/a&gt; — framework for connecting custom data sources to large language models, allowing for natural language querying of my data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-3-5"&gt;OpenAI gpt-3.5-turbo&lt;/a&gt; model for code generation to create the python to make the visualisations &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://mitmproxy.org/"&gt;mitmproxy&lt;/a&gt; — an open source interactive HTTPS proxy to view encrypted network traffic&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What am I working towards? 📈
&lt;/h3&gt;

&lt;p&gt;My goal is to explore my messaging behaviour, such as texting frequency and time of day usage. I want to “talk” to my data using generative AI — to create visualisations on top of my private data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CM23ldl---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4136/1%2AUfDh6meQTKjI5Fjx7Ajoew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CM23ldl---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4136/1%2AUfDh6meQTKjI5Fjx7Ajoew.png" alt="Image by author" width="800" height="463"&gt;&lt;/a&gt;&lt;em&gt;Image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🎉 Tada! Creating charts locally on my private data from a natural language prompt. Let’s now see how I built this, breaking it down into three parts&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Wrangling iMessage data — with DuckDB to ingest and pre-process my iMessage history&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inspecting the network traffic — with a “Man in the Middle” proxy to view encrypted request and response traffic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prompted visualisations — with PandasQueryEngine and LlamaIndex connecting my local iMessage data to a public large language model.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🛠️ The complete notebook is available at &lt;a href="https://github.com/saubury/paranoid_text_LLM/"&gt;https://github.com/saubury/paranoid_text_LLM/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrangling iMessage data with DuckDB 🦆
&lt;/h2&gt;

&lt;p&gt;The first task is to extract my iMessages and transform into a sensible form for data analysis.&lt;/p&gt;

&lt;p&gt;⏩ If you’re not interested in the data engineering step you can jump ahead to reading about generative AI with Pandas and LlamaIndex.&lt;/p&gt;

&lt;p&gt;I’ve &lt;a href="https://towardsdatascience.com/my-very-personal-data-warehouse-fitbit-activity-analysis-with-duckdb-8d1193046133"&gt;written before&lt;/a&gt; about &lt;a href="https://duckdb.org/why_duckdb"&gt;DuckDB&lt;/a&gt; — a lightweight, free yet powerful analytical database that runs locally and streamlines data analysis workflows. I’m using 🦆 DuckDB as a quick way to ingest and pre-process my iMessage history. My first task is to load the &lt;a href="https://duckdb.org/docs/extensions/sqlite_scanner"&gt;SQLite Scanner DuckDB extension&lt;/a&gt; which allows DuckDB to directly read data from a SQLite database such as iMessage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSTALL sqlite_scanner;
LOAD sqlite_scanner;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are on a Mac, logged into your iCloud account, you can copy the local iMessage SQLite database and load it. If you receive the error &lt;em&gt;Operation not permitted&lt;/em&gt;, you may need to run the command in a terminal and accept the prompts to interact with privileged files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; ~/Library/Messages/chat.db ./sql/chat.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With DuckDB, we will open the iMessage SQLite database with the &lt;a href="https://duckdb.org/docs/sql/statements/attach.html"&gt;attach&lt;/a&gt; command. This will open the SQLite database file ./sql/chat.db in the schema namespace chat_sqlite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ATTACH './sql/chat.db' as chat_sqlite (TYPE sqlite,  READ_ONLY TRUE);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load messages into table
&lt;/h3&gt;

&lt;p&gt;We create the chat_messages DuckDB table by joining three tables from the iMessage SQLite database. Within the same query I also want to&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;determine the message time by evaluating the interval (the raw date is the number of nanoseconds since the Apple epoch of &lt;code&gt;2001-01-01&lt;/code&gt;, hence the division by 1,000,000,000)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;extract the phone number country calling code (eg, +1, +61)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;redact phone number like &lt;code&gt;+61412341234&lt;/code&gt; to &lt;code&gt;+614...41234&lt;/code&gt; (for screenshots in this blog)&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;chat_messages&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2001-01-01'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;message_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attributedBody&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_from_me&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_identifier&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="s1"&gt;'+1%'&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="k"&gt;SUBSTRING&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_identifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_identifier&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="s1"&gt;'+%'&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="k"&gt;SUBSTRING&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_identifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;phone_country_calling_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_identifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'^(&lt;/span&gt;&lt;span class="se"&gt;\+&lt;/span&gt;&lt;span class="s1"&gt;[0-9][0-9][0-9])([0-9][0-9][0-9])'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\1&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\3&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;phone_number&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chat_sqlite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;chat_sqlite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_message_join&lt;/span&gt; &lt;span class="n"&gt;cmj&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"ROWID"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cmj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;chat_sqlite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;cmj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"ROWID"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I can peek at the chat_messages DuckDB table by querying it (with good old &lt;code&gt;select *&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UL6pov0q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A4wSsKWBgee8vN8jcy9aCKA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UL6pov0q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A4wSsKWBgee8vN8jcy9aCKA.png" alt="Sample of chat_messages — image by author" width="797" height="284"&gt;&lt;/a&gt;&lt;em&gt;Sample of chat_messages — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I can now send the contents of the chat_messages DuckDB table into a chat_messages_df dataframe with the &lt;code&gt;&amp;lt;&amp;lt;&lt;/code&gt; operator within a SQL magic cell.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;span class="n"&gt;chat_messages_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chat_messages&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;message_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Decoding attributedBody 🪄
&lt;/h3&gt;

&lt;p&gt;The iMessage database has a mixture of encoding formats: older messages are stored as plain text in the text field, while newer messages are encoded in the attributedBody field. Sometime around November 2022 the messages started arriving in the new format, which might be related to a message upgrade in the iOS 16 release. I’m thankful to the &lt;a href="https://github.com/my-other-github-account/imessage_tools/"&gt;iMessage-Tools&lt;/a&gt; project, which had the logic to extract the text content hidden within the attributedBody field. The decode_message utility function extracts the text regardless of format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decode_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;msg_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'attributedBody'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Logic from https://github.com/my-other-github-account/imessage_tools
&lt;/span&gt;  &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;msg_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg_text&lt;/span&gt;
  &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'replace'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;AttributeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"NSNumber"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"NSNumber"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"NSString"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"NSString"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"NSDictionary"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"NSDictionary"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
          &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
          &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;

  &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s"&gt;'\n'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inline message extraction
&lt;/h3&gt;

&lt;p&gt;We'll use the &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html"&gt;pandas apply()&lt;/a&gt; method to apply the decode_message function to the DataFrame. In short, we’ll set message_text to something readable, regardless of the format the text came in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chat_messages_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'message_text'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat_messages_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decode_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chat_messages_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the messages decoded, I can peek at the first few records.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rvJZbTUu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AUaD9y151KqVhYQEzeUUIEg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rvJZbTUu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AUaD9y151KqVhYQEzeUUIEg.png" alt="Sample of iMessage texts sent and received — image by author" width="800" height="154"&gt;&lt;/a&gt;&lt;em&gt;Sample of iMessage texts sent and received — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With my several thousand iMessages loaded into the chat_messages_df dataframe, I can move on to some prompted analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Talking to my data — Generative AI on Pandas with LlamaIndex 🗣️
&lt;/h2&gt;

&lt;p&gt;I’ll be using &lt;a href="https://www.llamaindex.ai/"&gt;LlamaIndex&lt;/a&gt; — a flexible data framework for connecting custom data sources to large language models. LlamaIndex uses &lt;a href="https://research.ibm.com/blog/retrieval-augmented-generation-RAG"&gt;Retrieval Augmented Generation (RAG)&lt;/a&gt; systems that combine a large language model (such as those provided by OpenAI or Hugging Face) with a private data set (set as my personal copy of iMessages). &lt;/p&gt;

&lt;p&gt;The RAG pipeline retrieves the most relevant context for my query (such as the shape of my data), and passes that to the LLM to generate a response. The response should be the python code necessary to execute on my local data to visualise a result in response to my query. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oCzArTTm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AFRfgDwOBJb1nvZ5Z-IYJrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oCzArTTm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AFRfgDwOBJb1nvZ5Z-IYJrg.png" alt="LlamaIndex — High-Level Concepts" width="618" height="345"&gt;&lt;/a&gt;&lt;em&gt;LlamaIndex — &lt;a href="https://gpt-index.readthedocs.io/en/latest/getting_started/concepts.html"&gt;High-Level Concepts&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  LlamaIndex
&lt;/h3&gt;

&lt;p&gt;I will be using the paid OpenAI service, and have created a &lt;a href="https://platform.openai.com/account/api-keys"&gt;secret API key&lt;/a&gt; which is saved in the notebook.cfg configuration file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;llama_index.query_engine.pandas_query_engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PandasQueryEngine&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;configparser&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConfigParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'notebook.cfg'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;openai_api_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'openai'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'api_token'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;openai_api_token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I can now take my chat_messages_df data frame — and ask a question like “What is the most frequent phone_number?”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PandasQueryEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chat_messages_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"What is the most frequent phone_number?"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generates a small fragment of Python code which, when applied to my data, gives me the value for the most frequently iMessaged user!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Pandas Instructions:
eval("df['phone_number'].value_counts().idxmax()")

+61 412 321915
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🎉 Whoa — that’s pretty amazing. By simply asking the query engine for a goal, the public LLM has generated the correct Python code that when run locally on my data frame gives me the correct answer. My wife will be happy to know she is the most frequently messaged from my phone 😅!&lt;/p&gt;

&lt;p&gt;Let’s have a peek into the traffic to see what’s happening behind the scenes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspecting the network traffic with mitmproxy 🕵️‍♀️
&lt;/h2&gt;

&lt;p&gt;I was curious to see what kinds of requests the code was making to OpenAI and what kind of responses it is getting back. &lt;/p&gt;

&lt;p&gt;⏩ This is an optional step for the paranoid, and if you’re not interested in the network analysis you can skip to prompted visualisations to analyse data.&lt;/p&gt;

&lt;p&gt;I used a &lt;a href="https://earthly.dev/blog/mitmproxy/"&gt;great guide&lt;/a&gt; to get started with &lt;a href="https://mitmproxy.org/"&gt;mitmproxy&lt;/a&gt; to observe to capture encrypted  requests &amp;amp; responses. The short summary is the mitmproxy proxy sits between the local Python code and the internet, to intercept and inspect SSL/TLS-protected traffic.&lt;/p&gt;

&lt;p&gt;To view the traffic, start the mitmweb proxy and set the following environment variables so network traffic passes through the local proxy, signed with a local certificate (which is conveniently created when you first run mitmproxy).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'http_proxy'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://127.0.0.1:8080"&lt;/span&gt; 
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'https_proxy'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://127.0.0.1:8080"&lt;/span&gt; 
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'REQUESTS_CA_BUNDLE'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/Users/saubury/.mitmproxy/mitmproxy-ca-cert.pem"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the proxy established, both network requests and responses are visible in the web dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UznNWZsx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A7aga0Ck3LO1KEBim3nChcQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UznNWZsx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A7aga0Ck3LO1KEBim3nChcQ.png" alt="“Man in the Middle” HTTPS proxy — mitmproxy web page" width="800" height="478"&gt;&lt;/a&gt;&lt;em&gt;“Man in the Middle” HTTPS proxy — mitmproxy web page&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the proxy in place, I can start running some tests. &lt;/p&gt;

&lt;h2&gt;
  
  
  Prompted visualisations with PandasQueryEngine and LlamaIndex 📊
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When do I send and receive texts throughout the day?
&lt;/h3&gt;

&lt;p&gt;Let’s try a query and see what requests and responses are seen by the proxy. I’ll use the &lt;a href="https://gpt-index.readthedocs.io/en/stable/examples/query_engine/pandas_query_engine.html"&gt;PandasQueryEngine&lt;/a&gt; of LlamaIndex to query my iMessage data, and ask for the following visualisation to be created …&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Extract hour of day from message_date. Visualize a distribution of the hour extracted from message_date. Add a title and label the axis. Use colors and add a gap between bars. Colour the bars with an hour of 5 in red and the rest in blue.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;LlamaIndex will compose a request to OpenAI, and I can capture the outward request&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7J1b9V7c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AokAKJegFpUQ_vfX_S1C4eA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7J1b9V7c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AokAKJegFpUQ_vfX_S1C4eA.png" alt="Request sent to OpenAI" width="786" height="437"&gt;&lt;/a&gt;&lt;em&gt;Request sent to OpenAI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It appears that a sample (5 rows) of my data is sent outwards to describe the data. In this case I’m not all that worried, as these are simply dates — but obviously sending a sample of something more personal would concern me more.&lt;/p&gt;

&lt;p&gt;Within a few seconds, LlamaIndex will relay the response and we can peek at the code returned by OpenAI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TQ8Dtsj9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AjRDOqpu-i1OVF8tMstUIhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TQ8Dtsj9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AjRDOqpu-i1OVF8tMstUIhw.png" alt="Response returned with python embedded" width="625" height="532"&gt;&lt;/a&gt;&lt;em&gt;Response returned with python embedded&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The code is then automatically run against my entire local dataset. I was especially impressed the code to extract the “hour” from the timestamp field worked as expected. The result of the generated python code appears exactly as I had asked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HTUwJjap--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AM8VL4j2BZUguANfmTkxHbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HTUwJjap--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AM8VL4j2BZUguANfmTkxHbg.png" alt="Message time distribution by hour of day — image by author" width="800" height="477"&gt;&lt;/a&gt;&lt;em&gt;Message time distribution by hour of day — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🎉 Voilà — the generated code is correct, and running it renders a distribution showing most of my messages are sent between 6am and 9pm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further experiments 🔬
&lt;/h2&gt;

&lt;p&gt;Let’s see a few more examples of data queries, and the payloads which need to be sent to create working python code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Most frequent contacts
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a plot a bar chart showing the frequency of top eight phone_numbers. The X axis labels should be at a 45 degree angle. Use a different colour for each bar&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vDqPe3X4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2904/1%2Anr3z-UFvaK4f-VZYrQCERw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vDqPe3X4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2904/1%2Anr3z-UFvaK4f-VZYrQCERw.png" alt="Most frequent contacts — image by author" width="800" height="323"&gt;&lt;/a&gt;&lt;em&gt;Most frequent contacts — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To build a bar chart of my most frequent contacts, LlamaIndex sent a sample of 5 phone numbers to OpenAI to describe the datatypes expected in the dataframe. The resulting Python code, executed locally on my entire data set, created the correct bar chart. I was impressed that the request to turn the X-axis labels 45 degrees was honoured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Count the frequency of each phone_number
&lt;/span&gt;&lt;span class="n"&gt;phone_number_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'phone_number'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a bar chart
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone_number_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone_number_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'red'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'blue'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'green'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'yellow'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'orange'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'purple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'pink'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'brown'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Add a title
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Frequency of Top Eight Phone Numbers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Rotate x-axis labels by 45 degrees
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Show the plot
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Message length
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Visualize a distribution of the length of message_text. Use a logarithmic scale. Add a title and label both axis. Add a space between bars.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Pi5HFsOb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3076/1%2AbwARNVLhJo0q01rU74Ma3A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Pi5HFsOb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3076/1%2AbwARNVLhJo0q01rU74Ma3A.png" alt="Distribution of the length of message text — image by author." width="800" height="315"&gt;&lt;/a&gt;&lt;em&gt;Distribution of the length of message text — image by author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To display a distribution of the length of messages, LlamaIndex sent the contents of 5 sample messages to OpenAI. The resulting Python code, executed locally on my entire data set, used a lambda function to determine each message length. A logarithmic scale was created; however, my prompt to add a space between bars was misinterpreted as a call to plt.tight_layout.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate the length of each message_text
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'message_length'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'message_text'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Create a histogram of the message_length with a logarithmic scale
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'message_length'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edgecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'black'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add a title and label both axes
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Distribution of Message Text Length'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Message Length'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Frequency'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add a space between bars
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Show the plot
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inbound vs. outbound messages
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Visualize a pie chart of the proportion of is_from_me. Label the value 0 as ‘inbound’. Add a percentage rounded to 1 decimal places.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RwzevYzr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2972/1%2AP3pmq6wpWB5K3DbSZxZNiA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RwzevYzr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2972/1%2AP3pmq6wpWB5K3DbSZxZNiA.png" alt="Pie chart of the proportion inbound and outbound messages — image by author" width="800" height="328"&gt;&lt;/a&gt;&lt;em&gt;Pie chart of the proportion inbound and outbound messages — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To build a pie chart showing the proportion of inbound to outbound messages, LlamaIndex simply sent a sample of “is_from_me” boolean records to OpenAI. The resulting Python code, executed locally on my entire data set, created the correct pie chart. I was impressed by the label of &lt;em&gt;outbound&lt;/em&gt; for the value &lt;em&gt;1&lt;/em&gt;, which was a clever inference from me describing the value &lt;em&gt;0&lt;/em&gt; as &lt;em&gt;inbound&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Count the number of occurrences of each value in the 'is_from_me' column
&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'is_from_me'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create a pie chart using the value counts
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pie&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'inbound'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'outbound'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;autopct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'%.1f%%'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Display the pie chart
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To answer the question “can I safely combine my local personal data with a public large language model?” — well, kind of, sort of.&lt;/p&gt;

&lt;p&gt;I set a clear boundary between my personal iMessage data (kept private on my machine) and the public generative models used to build the logic for my data analysis. I am satisfied with the compromises I made: only a handful of records left my network, and they were enough to generate high-quality Python code that quickly and effectively addressed my queries.&lt;/p&gt;

&lt;p&gt;Next time I could use dummy data, inspect the generated code, obfuscate the payloads or run the models locally. In fact — that’s what I might do in a future blog.&lt;/p&gt;

&lt;p&gt;For now, I’m happy with my paranoid analysis of iMessage chats.&lt;/p&gt;

&lt;p&gt;🛠️ The complete notebook is available at &lt;a href="https://github.com/saubury/paranoid_text_LLM/"&gt;https://github.com/saubury/paranoid_text_LLM/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llamaindex</category>
      <category>duckdb</category>
      <category>openai</category>
      <category>llm</category>
    </item>
    <item>
      <title>GenPiCam - Generative AI Camera</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Wed, 28 Jun 2023 11:18:23 +0000</pubDate>
      <link>https://dev.to/saubury/genpicam-generative-ai-camera-160</link>
      <guid>https://dev.to/saubury/genpicam-generative-ai-camera-160</guid>
      <description>&lt;h2&gt;
  
  
  GenPiCam - Generative AI Camera
&lt;/h2&gt;

&lt;p&gt;Generative AI (GenAI) is a type of Artificial Intelligence that can create a wide variety of images, video and text. To accelerate the robot uprising I chained two GenAI models together to build a camera which describes the current scene in words, and then uses a second model to create a newly generated, stylised image. Let me introduce GenPiCam — a Raspberry Pi-based camera that reimagines the world with GenAI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--StVziVFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AeZzfeCJggafmHaYGcjqEDA.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--StVziVFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AeZzfeCJggafmHaYGcjqEDA.gif" alt="Before and after images created by GenPiCam" width="634" height="315"&gt;&lt;/a&gt;&lt;em&gt;Before and after images created by GenPiCam&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The heavy processing and true smarts of this project are handled by &lt;a href="https://www.midjourney.com/"&gt;Midjourney&lt;/a&gt; — an external machine-learning image-generation service. GenPiCam makes use of two Midjourney capabilities&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.midjourney.com/docs/describe"&gt;Describe&lt;/a&gt; which starts with an existing photo and creates a text description prompts for the image. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.midjourney.com/docs/quick-start"&gt;Imagine&lt;/a&gt; which converts natural language prompts into images&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Between these two steps I allow for a level of creative input, so the GenPiCam camera has a dial to tweak the style of the final image. This essentially becomes a filter, adding an “anime”, “pop-art” or “futuristic” influence to the generated image.&lt;/p&gt;

&lt;h2&gt;
  
  
  I’m bored — can I get a video?
&lt;/h2&gt;

&lt;p&gt;Sure — here’s the 2-minute summary&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/qqwRXybdNeo"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The “photographic” process
&lt;/h2&gt;

&lt;p&gt;The initial photo is taken with a Raspberry Pi Camera Module. An external camera shutter (a pushbutton connected to the Raspberry Pi GPIO pins) takes a still image when pushed and saves the photo as a JPEG. A minimal sketch of this trigger loop follows.&lt;/p&gt;
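
&lt;p&gt;Here is that trigger loop as a rough sketch (the pin number and file naming are illustrative assumptions; the real code lives in the project repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime
from signal import pause

from gpiozero import Button
from picamera import PiCamera

BUTTON_PIN = 21  # assumption: BCM pin wired to the shutter pushbutton

camera = PiCamera()
shutter = Button(BUTTON_PIN)

def take_photo():
    # Save a timestamped still image as a JPEG
    filename = datetime.now().strftime('photo-%Y%m%d-%H%M%S.jpg')
    camera.capture(filename)

shutter.when_pressed = take_photo
pause()  # block forever, waiting for button presses
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;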

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5vngkkzl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2220/1%2AuCIGwO2l4j-IqjzDYgWHKg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5vngkkzl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2220/1%2AuCIGwO2l4j-IqjzDYgWHKg.png" alt="Taking still images of wildlife in the garden" width="800" height="508"&gt;&lt;/a&gt;&lt;em&gt;Taking still images of wildlife in the garden&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The photo is uploaded to Midjourney, which turns the photo into text prompts describing the image. For the curious, I’m using some very inelegant bot interactions with PyAutoGUI to control the mouse and keyboard (as there’s no API) — let &lt;a href="https://github.com/saubury/GenPiCam/blob/main/midjourney.py"&gt;this&lt;/a&gt; be an example of code you should never write.&lt;/p&gt;
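
&lt;p&gt;For a flavour of just how crude this is, here is a hedged sketch (the screen coordinates are made up; the real interactions are in midjourney.py): the automation boils down to clicking the Discord message box and typing at it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import pyautogui

MESSAGE_BOX = (800, 1000)  # assumption: screen position of the Discord message box

pyautogui.click(*MESSAGE_BOX)               # focus the message box
pyautogui.write('/describe', interval=0.1)  # slowly type the bot command
pyautogui.press('enter')                    # select the slash command
time.sleep(1)                               # give Discord time to open the upload dialog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;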

&lt;p&gt;Midjourney’s describe tool takes an image as input, then generates text prompts. This is a pretty clever service, reversing the usual “text to image” process: it starts with the photo and extracts text to describe the essence of the image. Here is Snowy, but Midjourney has a much more expressive description.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Axg5X5Gq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Acs7FHVNM9fxPCWNtb2VjxQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Axg5X5Gq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Acs7FHVNM9fxPCWNtb2VjxQ.png" alt="Snowy the cat — laying on bed under yellow blanket …" width="580" height="397"&gt;&lt;/a&gt;&lt;em&gt;Snowy the cat — laying on bed under yellow blanket …&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;black cat laying on bed under yellow blanket, in the style of berrypunk, irridescent, glimmering, unpolished, symmetrical, rounded, chinapunk — ar 4:3 &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The describe function actually returns four descriptions based on the image, but GenPiCam arbitrarily selects the first description.&lt;/p&gt;

&lt;p&gt;Now for the fun part. We can take that text prompt and use it to create a brand new image with Generative AI via a new call to Midjourney imagine. Here is an image generated from the previous text prompt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CBj6CUJD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AWgf5DYmYxEaVlBGksP4BSQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CBj6CUJD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AWgf5DYmYxEaVlBGksP4BSQ.png" alt="Midjouney imagine generated image from text prompt " width="416" height="339"&gt;&lt;/a&gt;*Midjouney imagine generated image from text prompt *&lt;/p&gt;

&lt;p&gt;GenPiCam has a selection switch to update the prompt with stylistic instructions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UbwuZaXa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2152/1%2AGTWo9YVRBa9J7Z5tjjbRAg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UbwuZaXa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2152/1%2AGTWo9YVRBa9J7Z5tjjbRAg.png" alt="Scene selector" width="800" height="462"&gt;&lt;/a&gt;&lt;em&gt;Scene selector&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a 12-way rotary switch connected to the Raspberry Pi GPIO pins. By reading the current “artistic selection” GenPiCam will add a prefix such as “&lt;strong&gt;retro pop art-style illustration&lt;/strong&gt;” to the text prompt (see the sketch after this list). A few of the other style prompts include&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Anime style &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hyper Realistic, whimsical with colourful hat and balloons, &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blurry brushstrokes,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Futuristic, in a space station, hyper realistic&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
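
&lt;p&gt;Here is a sketch of how the style dial can be read. The pin numbers, and the assumption that each wired switch position grounds its own GPIO pin, are illustrative; the real mapping lives in the project code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from gpiozero import Button

# Assumption: each wired rotary-switch position pulls its own BCM pin low
STYLE_BY_PIN = {
    5: 'retro pop art-style illustration of ',
    6: 'Anime style ',
    13: 'Hyper Realistic, whimsical with colourful hat and balloons, ',
    19: 'Futuristic, in a space station, hyper realistic ',
}

positions = {pin: Button(pin) for pin in STYLE_BY_PIN}

def style_prefix():
    # Return the style text for whichever dial position is currently selected
    for pin, contact in positions.items():
        if contact.is_pressed:
            return STYLE_BY_PIN[pin]
    return ''  # unwired position: leave the prompt unstyled

describe_text = 'black cat laying on bed under yellow blanket'  # from describe
prompt = style_prefix() + describe_text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;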

&lt;p&gt;Let’s see the before and after “pop-art” images for Snowy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T9IuKSME--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AH7_sXWoV0vkx4nLuWdpPqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T9IuKSME--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AH7_sXWoV0vkx4nLuWdpPqg.png" alt="Final image with before and after photos along with text prompt " width="640" height="380"&gt;&lt;/a&gt;*Final image with before and after photos along with text prompt *&lt;/p&gt;

&lt;p&gt;The final image is created using the &lt;a href="https://github.com/python-pillow/Pillow/"&gt;Pillow&lt;/a&gt; Python imaging library (a compositing sketch follows the list below), and is composed of&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Initial photo taken by the Raspberry Pi camera module, resized on the left&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Final Midjouney image — the first of four images is selected, composited to the right&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Text prompt — against a coloured background and icon signifying the style mode&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
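
&lt;p&gt;A simplified Pillow sketch of that composition, where the file names, sizes and banner text are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from PIL import Image, ImageDraw

# Placeholders: the camera and Midjourney steps produce the real files
photo = Image.open('photo.jpg').resize((640, 480))
generated = Image.open('midjourney.png').resize((640, 480))

# Two images side by side, with a coloured banner underneath for the prompt
canvas = Image.new('RGB', (1280, 560), color='navy')
canvas.paste(photo, (0, 0))
canvas.paste(generated, (640, 0))

draw = ImageDraw.Draw(canvas)
draw.text((10, 500), 'black cat laying on bed under yellow blanket ...', fill='white')

canvas.save('final.jpg')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;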

&lt;p&gt;Here’s the same process, but adding the text &lt;em&gt;“Hyper Realistic, whimsical with colourful hat and balloons”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uItomsc_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ACenkY6lvmq-FyfLWo7rG2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uItomsc_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ACenkY6lvmq-FyfLWo7rG2g.png" alt="" width="635" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even though the image on the right is a creation from Generative AI, there’s still a sense of disappointment coming through Snowy’s judgmental eyes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generative AI Images — Learnings
&lt;/h2&gt;

&lt;p&gt;I had so much fun building the GenPiCam camera — and this was an interesting path for exploring prompt engineering for Generative AI. The better photos were the ones that had a simple composition — essentially images that were easy to put words to. For example, this scene is easy to describe with a colour and definitive objects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SlsU0sYg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A5kpaw5iXpMk2CMFBLMN3gQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SlsU0sYg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A5kpaw5iXpMk2CMFBLMN3gQ.png" alt="A green stuffed animal and white keyboard" width="641" height="347"&gt;&lt;/a&gt;&lt;em&gt;A green stuffed animal and white keyboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, there were some very strange results while describing more unique scenes. I found the description of a classic Australian clothes line created an unusual image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e7t15fq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AOnKYcml6NPA3COYVfekaaA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e7t15fq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AOnKYcml6NPA3COYVfekaaA.png" alt="Australian cloths line" width="638" height="377"&gt;&lt;/a&gt;&lt;em&gt;Australian cloths line&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One of my favourite reimagined images was the identification of my laser mouse. It turns out “laser mouse” has multiple meanings, leading to a striking result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--35Y2XFcJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Aw2hHXhXVBFtYKTZkwD9rLw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--35Y2XFcJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Aw2hHXhXVBFtYKTZkwD9rLw.png" alt="Laser mouse" width="638" height="346"&gt;&lt;/a&gt;&lt;em&gt;Laser mouse&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The hardware
&lt;/h2&gt;

&lt;p&gt;The least stylish part of GenPiCam is the hardware, which I hastily assembled. If you want to build your own reality-distorting camera, you’ll need the following.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.raspberrypi.com/products/raspberry-pi-4-model-b/"&gt;RaspberryPi 4&lt;/a&gt; running &lt;a href="https://www.raspberrypi.com/software/"&gt;Raspberry Pi OS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.raspberrypi.com/products/camera-module-v2/"&gt;Raspberry Pi camera module v2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.amazon.com.au/dp/B0BPP6MFFJ?ref_=pe_19115062_429603572_302_E_DDE_dt_1"&gt;Touchscreen Monitor for Raspberry Pi&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.jaycar.com.au/1-pole-sealed-pcb-rotary/p/SR1210"&gt;12 way PCB rotary switch&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.jaycar.com.au/pushbutton-push-on-momentary-spst-red-actuator/p/SP0716"&gt;Pushbutton momentary&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.jaycar.com.au/sealed-polycarbonate-enclosure-171-x-121-x-55/p/HB6218"&gt;Polycarbonate enclosure&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rechargeable battery pack&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ec2lJG2x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2138/1%2ABDZTQ67nDtfFOc05IjWtFg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ec2lJG2x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2138/1%2ABDZTQ67nDtfFOc05IjWtFg.png" alt="The inner workings of GenPiCam" width="800" height="484"&gt;&lt;/a&gt;&lt;em&gt;The inner workings of GenPiCam&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It isn’t the most beautiful of builds — but I’ll just excuse this as being highly functional.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XuR9k0gh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AHA1htMfd5GjSVYvXvZddhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XuR9k0gh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AHA1htMfd5GjSVYvXvZddhw.png" alt="Boot image for GenPiCam camera" width="800" height="517"&gt;&lt;/a&gt;&lt;em&gt;Boot image for GenPiCam camera&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary, code &amp;amp; credits
&lt;/h2&gt;

&lt;p&gt;The GenPiCam has been a fun way to explore Generative AI, transforming photos into stylised (and sometimes surprising) images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vaSSDHCA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AiWYVYr9B0641ZY5lRG-17w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vaSSDHCA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AiWYVYr9B0641ZY5lRG-17w.png" alt="Photo of author on the left — and a stylised version of Simon on the right" width="639" height="318"&gt;&lt;/a&gt;&lt;em&gt;Photo of author on the left — and a stylised version of Simon on the right&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Credits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://twitter.com/nletcher"&gt;Ned Letcher&lt;/a&gt; — who first got me inspired by showing off the Midjourney describe functionality and provided the concept of recreating images&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@neonforge/how-to-create-a-discord-bot-to-download-midjourney-images-automatically-python-step-by-step-guide-3e76d3282871"&gt;How to Create a Discord Bot to Download Midjourney Images&lt;/a&gt; by Michael King — A great write up showing Python automation for interacting  with Midjourney along with Discord bot configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.midjourney.com/docs/command-list"&gt;Midjourney&lt;/a&gt; — Midjourney command syntax for bot channels&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://discordpy.readthedocs.io/en/stable/"&gt;discord.py&lt;/a&gt; — Python API wrapper for Discord.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/saubury/GenPiCam"&gt;https://github.com/saubury/GenPiCam&lt;/a&gt;&lt;/p&gt;

</description>
      <category>generativeai</category>
      <category>raspberrypi</category>
      <category>python</category>
    </item>
    <item>
      <title>My (very) personal data warehouse — Fitbit activity analysis with DuckDB</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Thu, 01 Jun 2023 04:37:19 +0000</pubDate>
      <link>https://dev.to/saubury/my-very-personal-data-warehouse-fitbit-activity-analysis-with-duckdb-426l</link>
      <guid>https://dev.to/saubury/my-very-personal-data-warehouse-fitbit-activity-analysis-with-duckdb-426l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Wearable fitness trackers have become an integral part of our lives, collecting and tracking data about our daily activities, sleep patterns, location, heart rate, and much more. I’ve been using a Fitbit device for 6 years to monitor my health. However, I have always found the data analysis capabilities lacking — especially when I wanted to track my progress against long term fitness goals. What insights are buried within my archive of personal fitness activity data? To start exploring I needed a good approach for performing data analysis over thousands of poorly documented JSON and CSV files … extra points for analysis that doesn’t require my data to leave my laptop.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Enter &lt;a href="https://duckdb.org/why_duckdb" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; — a lightweight, free yet powerful analytical database designed to streamline data analysis workflows — that runs locally. In this blog post, I want to use DuckDB to explore my Fitbit data achieve and share the approach for analysing a variety of data formats and charting my health and fitness goals with the help of &lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt; data visualisations.&lt;/p&gt;

&lt;h1&gt;
  
  
  Export Fitbit data archive
&lt;/h1&gt;

&lt;p&gt;Firstly, I needed to get hold of all of my historic fitness data. Fitbit make it fairly easy to export your Fitbit data for the lifetime of your account by following the instructions at &lt;a href="https://www.fitbit.com/settings/data/export" rel="noopener noreferrer"&gt;export your account archive&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2ADfb-hKfZm4d0cYzhpTGnjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2ADfb-hKfZm4d0cYzhpTGnjg.png"&gt;&lt;/a&gt;&lt;br&gt;
Instructions for using the export Fitbit data archive — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;You’ll need to confirm your request … and be patient. My archive took over three days to create — but I finally received an email with instructions to download a ZIP file containing my Fitbit data. This file should contain all the personal fitness activity recorded by my Fitbit and associated services. Unzipping the archive reveals a huge collection of files — mine, for example, contained 7,921 files once I unzipped the 79MB archive.&lt;/p&gt;
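
&lt;p&gt;A quick way to confirm the scale of what you’re dealing with (the folder name is whatever your export unzips to):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

# Count the files in the unzipped Fitbit export
export = Path('./MyFitbitData')
print(sum(1 for f in export.rglob('*') if f.is_file()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;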

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2As4F47jMXdtl-paemZi17-g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2As4F47jMXdtl-paemZi17-g.png"&gt;&lt;/a&gt;&lt;br&gt;
A small sample of the thousands of nested files — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;Let’s start looking at the variety of data available in the archive.&lt;/p&gt;
&lt;h1&gt;
  
  
  Why DuckDB?
&lt;/h1&gt;

&lt;p&gt;There are many great blogs (&lt;a href="https://betterprogramming.pub/duckdb-whats-the-hype-about-5d46aaa73196" rel="noopener noreferrer"&gt;1&lt;/a&gt;,&lt;a href="https://mattpalmer.io/posts/whats-the-hype-duckdb/" rel="noopener noreferrer"&gt;2&lt;/a&gt;,&lt;a href="https://towardsdatascience.com/a-serverless-query-engine-from-spare-parts-bd6320f10353" rel="noopener noreferrer"&gt;3&lt;/a&gt;) describing DuckDB — the &lt;a href="https://www.dictionary.com/browse/tl-dr" rel="noopener noreferrer"&gt;TL;DR&lt;/a&gt; summary is that DuckDB is an open-source in-process OLAP database built specifically for analytical queries. It runs locally, has extensive SQL support, and can run queries directly on Pandas dataframes, Parquet files and JSON data. Extra points for its seamless integration with Python and R. The fact that it’s insanely fast and does (mostly) all of its processing in memory makes it a good choice for building my personal data warehouse.&lt;/p&gt;
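
&lt;p&gt;To give a feel for that workflow, here is a toy sketch (the dataframe is invented for illustration): DuckDB can query a local Pandas dataframe by name, with no loading step required.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb
import pandas as pd

df = pd.DataFrame({'activity': ['Walk', 'Run', 'Walk'], 'minutes': [30, 20, 45]})

# DuckDB resolves the name `df` straight from the local Python scope
duckdb.query('select activity, sum(minutes) from df group by activity').show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
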
&lt;h1&gt;
  
  
  Fitbit activity data
&lt;/h1&gt;

&lt;p&gt;The first collection of files I looked at was activity data. Physical Activity and broad exercise information appears to be stored in numbered files such as &lt;code&gt;Physical Activity/exercise-1700.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I couldn’t work out what the file numbering actually meant; my guess is they are just increasing integers for a collection of exercise files. In my data export the earliest files started at 0 and went to file number 1700 over a 6-year period. Inside is an array of records, each with a description of an activity. The record seems to change depending on the activity — here is an example of a “walk”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"activityName"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Walk"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"averageHeartRate"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"calories"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;122&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1280000&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"steps"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1548&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"startTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/06/23 01:08:57"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"elevationGain"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;67.056&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"hasGps"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"activityLevel"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sedentary"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lightly"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fairly"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"very"&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This physical activity data is one of the 7,921 files now on my laptop. Fortunately, DuckDB can read (and auto-detect the schema of) JSON files using the &lt;a href="https://duckdb.org/docs/data/json/overview.html#read_json_auto-function" rel="noopener noreferrer"&gt;read_json&lt;/a&gt; function, allowing me to load all of the exercise files into the &lt;code&gt;physical_activity&lt;/code&gt; table using a single SQL statement. It’s worth noting I needed to specify the date format mask, as the Fitbit export has a very &lt;a href="https://en.wikipedia.org/wiki/Date_and_time_notation_in_the_United_States" rel="noopener noreferrer"&gt;American-style date&lt;/a&gt; format 😕.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;physical_activity&lt;/span&gt;  
&lt;span class="k"&gt;as&lt;/span&gt;  
&lt;span class="k"&gt;SELECT&lt;/span&gt;   
  &lt;span class="n"&gt;startTime&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;activityTime&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activityName&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activityLevel&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;averageHeartRate&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calories&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;duration_minutes&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distanceUnit&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tcxLink&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;  
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'./Physical Activity/exercise-*.json'&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'array'&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestampformat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'%m/%d/%y %H:%M:%S'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This SQL command reads the physical activity data from disk, shifts the start time into my local timezone, converts the duration from milliseconds to minutes, and loads the result into an in-memory DuckDB table.&lt;/p&gt;
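
&lt;p&gt;A couple of sanity queries are worth running at this point. This sketch assumes a &lt;code&gt;con&lt;/code&gt; DuckDB connection object; adapt to however your notebook issues SQL.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

# Assumption: the same in-memory connection that ran the CREATE TABLE above
con = duckdb.connect()
con.sql('select count(*) as row_count from physical_activity').show()
con.sql('select min(activityTime), max(activityTime) from physical_activity').show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;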

&lt;h1&gt;
  
  
  Load Physical Activity data into data frame
&lt;/h1&gt;

&lt;p&gt;I wanted to understand how I was spending my time each month. As the activity data is stored at a very granular level, I used the DuckDB SQL &lt;a href="https://duckdb.org/docs/sql/functions/timestamp.html" rel="noopener noreferrer"&gt;time_bucket&lt;/a&gt; function to truncate the &lt;em&gt;activityTime&lt;/em&gt; timestamp into monthly buckets. Loading the grouped physical activity data into a data frame can be accomplished with this aggregate SQL, with the query results directed into a Pandas dataframe via the &lt;code&gt;&amp;lt;&amp;lt;&lt;/code&gt; operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;activity_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;  
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activityTime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;activity_day&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activityName&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration_minutes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;  
  &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;physical_activity&lt;/span&gt;  
  &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;activityTime&lt;/span&gt; &lt;span class="k"&gt;between&lt;/span&gt; &lt;span class="s1"&gt;'2022-09-01'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="s1"&gt;'2023-05-01'&lt;/span&gt;  
  &lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  
  &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single SQL query groups my activity data (bike, walk, run, etc.) into monthly buckets and allows me to honestly reflect on how much time I was devoting to physical activity.&lt;/p&gt;

&lt;h1&gt;
  
  
  Plot Monthly Activity Minutes
&lt;/h1&gt;

&lt;p&gt;I now want to explore my activity data visually — so let’s take the Fitbit data and produce some statistical graphics. I’m going to use the Python &lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt; data visualisation library to create a bar plot of the monthly activity minutes directly from the &lt;em&gt;activity_df&lt;/em&gt; dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;matplotlib.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DateFormatter&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;activity_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;activity_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;activityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;myplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Month of&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Duration (min)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Monthly Activity Minutes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upper right&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Activity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Executing this against the loaded activity data creates this bar plot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AQ_wg63Ds0LYqfQLpq2VBKQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AQ_wg63Ds0LYqfQLpq2VBKQ.png"&gt;&lt;/a&gt;&lt;br&gt;
Workout activity breakdown — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;It looks like my primary activity continues to be walking, and my New Year’s resolution to run more often in 2023 hasn’t actually happened (yet?).&lt;/p&gt;
&lt;h1&gt;
  
  
  Sleep
&lt;/h1&gt;

&lt;p&gt;About &lt;a href="https://www.health.harvard.edu/heart-health/are-you-getting-enough-sleep" rel="noopener noreferrer"&gt;one in three adults doesn’t get enough sleep&lt;/a&gt;, so I wanted to explore my long term sleeping patterns. In my Fitbit archive sleep data appears to be recorded in dated files such as &lt;code&gt;Sleep/sleep-2022-12-28.json&lt;/code&gt;. Each file holds a months worth of data, but confusingly is dated for the month before the event. For example, the file &lt;code&gt;sleep-2022-12-28.json&lt;/code&gt; appears to have data for January spanning the dates 2023-01-02 to 2023-01-27. Anyway — file naming weirdness aside we can explore the contents of the file. Within the record is an extended “levels” block with a breakdown of sleep type (wake, light, REM, deep)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"logId"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;39958970367&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"startTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2023-01-26T22:47:30.000"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;26040000&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"levels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   
    &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"light"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;275&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"rem"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"wake"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"deep"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If I look at some of the older files (possibly created with my older Fitbit Surge device) there is a different breakdown of sleep type (restless, awake, asleep).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"logId"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18841054316&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"startTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2018-07-12T22:42:00.000"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;25440000&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"levels"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"restless"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"awake"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"asleep"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;399&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Regardless of the schema, we can use the &lt;a href="https://duckdb.org/docs/extensions/json.html" rel="noopener noreferrer"&gt;DuckDB JSON&lt;/a&gt; reader to read the records into a single table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sleep_log&lt;/span&gt;  
&lt;span class="k"&gt;as&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;dateOfSleep&lt;/span&gt;   
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;levels&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'./Sleep/sleep*.json'&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dateOfSleep&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'DATE'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'JSON'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'array'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Schema changes for sleep data
&lt;/h1&gt;

&lt;p&gt;I wanted to process all of my sleep data, and handle the apparent schema change in the way sleep is recorded (most likely as I changed models of Fitbit devices). Some of the records have time recorded against &lt;code&gt;$.awake&lt;/code&gt;, which is similar (but not identical) to &lt;code&gt;$.wake&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I used the SQL &lt;a href="https://duckdb.org/docs/sql/functions/utility.html" rel="noopener noreferrer"&gt;coalesce&lt;/a&gt; function (which returns the first expression that evaluates to a non-NULL value) to combine similar types of sleep stage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;sleep_log_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;  
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;dateOfSleep&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.awake.minutes'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.wake.minutes'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_wake&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.deep.minutes'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.asleep.minutes'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_deep&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.light.minutes'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.restless.minutes'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_light&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.rem.minutes'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_rem&lt;/span&gt;  
  &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sleep_log&lt;/span&gt;  
  &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;dateOfSleep&lt;/span&gt; &lt;span class="k"&gt;between&lt;/span&gt; &lt;span class="s1"&gt;'2023-04-01'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="s1"&gt;'2023-04-30'&lt;/span&gt;  
  &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With DuckDB I can query with &lt;a href="https://duckdb.org/docs/extensions/json.html#json-extraction-functions" rel="noopener noreferrer"&gt;json_extract&lt;/a&gt; to pull the duration of each stage out of the nested JSON, generating a &lt;em&gt;sleep_log_df&lt;/em&gt; dataframe with all of the historic sleep stages combined.&lt;/p&gt;
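
&lt;p&gt;As a quick sanity check on the schema drift, it’s worth confirming the two variants really do come from different eras of data. This is only a sketch using the DuckDB Python API, assuming it runs in the same session where the &lt;em&gt;sleep_log&lt;/em&gt; table was created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

# Sketch: count records per schema variant; json_extract returns NULL
# for a missing path, and count() skips NULLs (assumes the sleep_log
# table exists in the current DuckDB session)
print(duckdb.sql("""
    select count(json_extract(levels, '$.summary.awake.minutes')) as has_awake
    ,      count(json_extract(levels, '$.summary.wake.minutes'))  as has_wake
    from sleep_log
""").fetchall())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;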

&lt;h1&gt;
  
  
  Plot sleep activity
&lt;/h1&gt;

&lt;p&gt;We can now take the daily sleep logs and produce a stacked bar plot showing the nightly breakdown of time spent awake and in light, deep and &lt;a href="https://en.wikipedia.org/wiki/Rapid_eye_movement_sleep" rel="noopener noreferrer"&gt;REM&lt;/a&gt; sleep.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.dates&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mdates&lt;/span&gt;  

&lt;span class="c1"&gt;#create stacked bar chart  
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sleep_log_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dateOfSleep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stacked&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chocolate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;palegreen&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;green&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;darkblue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
&lt;span class="n"&gt;myplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Duration (min)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sleep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xaxis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_major_locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mdates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DayLocator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upper right&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Awake&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Deep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Light&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;REM&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;   
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loading a month of sleep data allows me to create a broader analysis of sleep duration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A-JoSLkEtLlWMQL-005pgsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A-JoSLkEtLlWMQL-005pgsg.png"&gt;&lt;/a&gt;&lt;br&gt;
Sleep cycle duration each night — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;The ability to graph multiple nights of sleep together on a single plot allows me to start understanding how days of the week and cyclic events affect the duration and quality of my sleep.&lt;/p&gt;
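
&lt;p&gt;For example, a rough day-of-week breakdown falls straight out of the dataframe. A minimal sketch, assuming &lt;em&gt;sleep_log_df&lt;/em&gt; is (or has been converted to) a pandas DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Sketch: average minutes per sleep stage for each day of the week
df = sleep_log_df.copy()
df['dateOfSleep'] = pd.to_datetime(df['dateOfSleep'])
df['weekday'] = df['dateOfSleep'].dt.day_name()

stage_cols = ['min_wake', 'min_deep', 'min_light', 'min_rem']
print(df.groupby('weekday')[stage_cols].mean().round(1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;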
&lt;h1&gt;
  
  
  Heart rate
&lt;/h1&gt;

&lt;p&gt;Heart rate is captured very frequently (every 10–15 seconds) in daily files named like &lt;code&gt;Physical Activity/heart_rate-2023-01-26.json&lt;/code&gt;. These files are really big — each day has around 70,000 records — all wrapped in a single JSON array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[{{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/25 13:00:07"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/25 13:00:22"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/25 13:00:37"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/26/26 12:59:57"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My theory here is that the file name represents the local date of the user, while the timestamps inside are GMT. For example, in my timezone (GMT+11) the file named &lt;code&gt;heart_rate-2023-01-26.json&lt;/code&gt; covers the 26th from 00:00 to 23:59 local time - which makes logical sense if the dates within the files are in GMT.&lt;/p&gt;
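
&lt;p&gt;One cheap way to test this theory is to peek at the first and last records in a single file. A sketch, reusing the file name from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Sketch: for a GMT+11 user, the file named for Jan 26 should span
# roughly 13:00 GMT on Jan 25 through 12:59 GMT on Jan 26
path = 'MyFitbitData/SimonAubury/Physical Activity/heart_rate-2023-01-26.json'
with open(path) as f:
    records = json.load(f)

print(records[0]['dateTime'], '...', records[-1]['dateTime'])
# expected output: 01/25/23 13:00:07 ... 01/26/23 12:59:57
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;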

&lt;h1&gt;
  
  
  Transform JSON files
&lt;/h1&gt;

&lt;p&gt;Up to now I’ve managed to process my Fitbit data as-is with DuckDB’s built-in functions. However, I hit a problem when trying to process these enormous heart rate files. DuckDB gave me this error when trying to process a large array of records in a JSON file:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(duckdb.InvalidInputException) “INTERNAL Error: Unexpected yyjson tag in ValTypeToString”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I think this error message is an abrupt way of telling me it’s unreasonable to expect a JSON array to have so many elements. The fix was to pre-process each file so it was no longer one giant array of JSON records, but instead newline-delimited JSON, or &lt;a href="http://ndjson.org/" rel="noopener noreferrer"&gt;ndjson&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/23 13:00:07"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/23 13:00:22"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/23 13:00:37"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/26/23 12:59:57"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To transform heart rate &lt;em&gt;array_of_records&lt;/em&gt; into newline-delimited JSON I used a sneaky bit of Python to convert each file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ndjson&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;json_src_file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MyFitbitData/SimonAubury/Physical Activity/steps-*.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MyFitbitData/SimonAubury/Physical Activity/heart_rate-*.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
  &lt;span class="n"&gt;json_dst_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\.[a-z]*$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.ndjson&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json_src_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json_src_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; --&amp;gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json_dst_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_src_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f_json_src_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f_json_src_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_dst_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;ndjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This finds each &lt;em&gt;.json&lt;/em&gt; file, converts its contents into newline-delimited JSON, and writes a new file with the extension &lt;em&gt;.ndjson&lt;/em&gt;. An array of 70,000 records becomes a file with 70,000 lines, with each JSON record now stored on its own line.&lt;/p&gt;
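
&lt;p&gt;A cheap sanity check on the conversion is to compare record counts before and after. A sketch, using one file name from earlier as the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Sketch: the line count of the .ndjson file should equal the record
# count of the original .json array
src = 'MyFitbitData/SimonAubury/Physical Activity/heart_rate-2023-01-26.json'
dst = src.replace('.json', '.ndjson')

with open(src) as f:
    n_records = len(json.load(f))
with open(dst) as f:
    n_lines = sum(1 for _ in f)

print(n_records, n_lines)  # the two numbers should match
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;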

&lt;h1&gt;
  
  
  Load heart rate data into table
&lt;/h1&gt;

&lt;p&gt;With the newly converted &lt;em&gt;ndjson&lt;/em&gt; files, I’m now ready to load heart rate data into a DuckDB table. Note the use of &lt;code&gt;timestampformat='%m/%d/%y %H:%M:%S'&lt;/code&gt; to describe the leading month in the dates (for example &lt;em&gt;"01/25/23 13:00:07"&lt;/em&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;heart_rate&lt;/span&gt;  
&lt;span class="k"&gt;as&lt;/span&gt;  
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;dateTime&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hr_date_time&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'$.bpm'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bpm&lt;/span&gt;  
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'./Physical Activity/*.ndjson'&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;dateTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'TIMESTAMP'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'JSON'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'newline_delimited'&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestampformat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'%m/%d/%y %H:%M:%S'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can load all the .ndjson files by setting the format to ’newline_delimited’. Note we can extract the BPM (beats per minute) with JSON extraction and cast it to an integer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2A07nRQIvy5Dw6z4RYlarVRg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2A07nRQIvy5Dw6z4RYlarVRg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DuckDB is blazing fast at processing JSON — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;It’s worth highlighting here how insanely fast DuckDB is — it took only 2.8 seconds to load 12 million records!&lt;/p&gt;
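
&lt;p&gt;If you want to reproduce the timing yourself, a rough harness looks like this (a sketch using the DuckDB Python API rather than the notebook SQL magic, reusing the load statement from above; the timing will of course vary by machine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import duckdb

# Sketch: time the heart rate load, reusing the same statement as above
load_statement = """
CREATE OR REPLACE TABLE heart_rate as
SELECT dateTime + INTERVAL 11 hours as hr_date_time
, cast(value-&gt;'$.bpm' as integer) as bpm
FROM read_json('./Physical Activity/*.ndjson'
, columns={dateTime: 'TIMESTAMP', value: 'JSON'}
, format='newline_delimited'
, timestampformat='%m/%d/%y %H:%M:%S');
"""

start = time.perf_counter()
duckdb.sql(load_statement)
elapsed = time.perf_counter() - start

rows = duckdb.sql('select count(*) from heart_rate').fetchone()[0]
print(f'Loaded {rows:,} rows in {elapsed:.1f} seconds')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;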

&lt;h1&gt;
  
  
  Load heart rate into data frame
&lt;/h1&gt;

&lt;p&gt;With 12 million heart rate measurements loaded, let’s load a single day’s worth of data into a data frame for the 21st of May.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;hr_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;   
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 minutes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hr_date_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;created_day&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bpm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bpm_min&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bpm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bpm_avg&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bpm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bpm_max&lt;/span&gt;  
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;heart_rate&lt;/span&gt;  
  &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;hr_date_time&lt;/span&gt; &lt;span class="k"&gt;between&lt;/span&gt; &lt;span class="s1"&gt;'2023-05-21 00:00'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="s1"&gt;'2023-05-21 23:59'&lt;/span&gt;  
  &lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This DuckDB query aggregates the heart rate into 1-minute time buckets, recording the minimum, average and maximum BPM within each period.&lt;/p&gt;

&lt;h1&gt;
  
  
  Plot Heart rate
&lt;/h1&gt;

&lt;p&gt;I can plot the heart rate using a chart like this (and also show off that I actually did go for a run at 6am)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;matplotlib.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DateFormatter&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hr_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bpm_min&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hr_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bpm_avg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hr_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bpm_max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;myFmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DateFormatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%H:%M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xaxis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_major_formatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;myFmt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Time of day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Heart BPM&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Heart rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A7jo_2M-VKrq7MgWVhFBRQg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A7jo_2M-VKrq7MgWVhFBRQg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Heart rate over a day — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;Exploring heart rate with fine granularity allows me to track my fitness goals — especially if I stick with my regular running routine.&lt;/p&gt;
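
&lt;p&gt;A related long-term signal is resting heart rate. As a sketch, treating the daily minimum BPM as a rough proxy for resting heart rate (and assuming the &lt;em&gt;heart_rate&lt;/em&gt; table is available in the current DuckDB session):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

# Sketch: approximate daily resting heart rate as the minimum BPM
# recorded each day
resting_df = duckdb.sql("""
    select cast(time_bucket(interval '1 day', hr_date_time) as date) as day
    ,      min(bpm) as resting_bpm
    from heart_rate
    group by 1
    order by 1
""").df()

print(resting_df.tail())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;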

&lt;h1&gt;
  
  
  Steps
&lt;/h1&gt;

&lt;p&gt;Steps are recorded in daily files named &lt;code&gt;Physical Activity/steps-2023-02-26.json&lt;/code&gt;. This appears to be a fine-grained count of steps during periodic blocks (every 5 to 10 minutes) throughout the day&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"02/25/23 13:17:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;},{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"02/25/23 13:52:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;},{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"02/25/23 14:00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;},{&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;},{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"03/24/23 08:45:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"15"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To aggregate the steps into daily counts I needed to convert GMT into my local timezone (GMT+11)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;steps_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dateTime&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt;  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;activity_day&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'MyFitbitData/SimonAubury/Physical Activity/steps-2023-02-26.ndjson'&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auto_detect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'newline_delimited'&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestampformat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'%m/%d/%y %H:%M:%S'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aggregating the number of daily steps into the &lt;em&gt;steps_df&lt;/em&gt; dataframe allows me to explore the longer term activity trends as I attempt to exceed 10,000 steps to realise the &lt;a href="https://www.10000steps.org.au/articles/healthy-lifestyles/health-check-do-we-really-need-take-10000-steps-day/" rel="noopener noreferrer"&gt;increased health benefits&lt;/a&gt;.&lt;/p&gt;
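
&lt;p&gt;With the daily totals in hand, checking how often I actually reach the target takes a couple of lines. A sketch, assuming &lt;em&gt;steps_df&lt;/em&gt; is a pandas DataFrame with a numeric steps column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: count the days that reached the 10,000-step target
days_hit = (steps_df['steps'] &gt;= 10000).sum()
print(f'{days_hit} of {len(steps_df)} days reached 10,000 steps')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;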

&lt;h1&gt;
  
  
  Plot daily steps
&lt;/h1&gt;

&lt;p&gt;We can now take the dataframe and plot a daily step count&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;matplotlib.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DateFormatter&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;steps_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;activity_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;myplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Steps&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Daily steps&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AEjNqz1eRARy-FVh1ZEIhCw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AEjNqz1eRARy-FVh1ZEIhCw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Daily step count — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;This shows I’ve still got to work at my daily step goal — another strike against my New Year’s fitness resolution.&lt;/p&gt;

&lt;h1&gt;
  
  
  GPS Mapping
&lt;/h1&gt;

&lt;p&gt;Fitbit stores GPS-logged activities as &lt;a href="https://en.wikipedia.org/wiki/GPS_Exchange_Format" rel="noopener noreferrer"&gt;TCX (Training Center XML)&lt;/a&gt; files. These XML files are &lt;em&gt;not&lt;/em&gt; in the downloaded ZIP, but the Physical Activity files hold a reference to their location, which I can query like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;tcxLink&lt;/span&gt;   
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;physical_activity&lt;/span&gt;  
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;tcxLink&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tcxLink field is a URL reference to the TCX file for each GPS-logged activity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A628%2Fformat%3Awebp%2F1%2A_kfZQTI1b6W5tOvYnF0Tfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A628%2Fformat%3Awebp%2F1%2A_kfZQTI1b6W5tOvYnF0Tfg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The URL for each TCX file — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;We can use this URL directly in a browser (once logged onto the Fitbit website) to download the GPS XML file. Looking inside the TCX file, we find low-level GPS locations captured every few seconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AO-AQrO0btjTH-t1M76XgkQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AO-AQrO0btjTH-t1M76XgkQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TCX GPS XML file sample contents - Screenshot by the author.&lt;/p&gt;

&lt;p&gt;The good news is this has some obvious fields like latitude, longitude and time. The not-so-good news is that this is XML, so we need to pre-process these files prior to loading, as DuckDB’s file reader doesn’t presently support XML. We can convert the XML files into JSON with another bit of Python code, looping over each &lt;em&gt;.tcx&lt;/em&gt; file.&lt;/p&gt;

&lt;p&gt;There is a bit of nasty XML nesting going on here, with the location data found under &lt;em&gt;TrainingCenterDatabase/Activities/Activity/Lap&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ndjson&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xmltodict&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;xml_src_file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MyFitbitData/tcx/*.tcx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;json_dst_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\.[a-z]*$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.ndjson&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xml_src_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;xml_src_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; --&amp;gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json_dst_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xml_src_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f_xml_src_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# erase file if it exists
&lt;/span&gt;        &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_dst_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="n"&gt;data_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xmltodict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f_xml_src_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Loop over the "laps" in the file; roughly every 1km
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;lap&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TrainingCenterDatabase&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Activities&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Activity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Lap&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;data_dict_inner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Track&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Trackpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="c1"&gt;# append file
&lt;/span&gt;            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_dst_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;ndjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dict_inner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Loading GPS Geospatial data
&lt;/h2&gt;

&lt;p&gt;We can load the Geospatial data like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;route_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.LatitudeDegrees'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;latitude&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.LongitudeDegrees'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;longitude&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'MyFitbitData/tcx/54939192717.ndjson'&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'TIMESTAMP'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;Position&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'JSON'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AltitudeMeters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'FLOAT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DistanceMeters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'FLOAT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HeartRateBpm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'JSON'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'newline_delimited'&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestampformat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'%Y-%m-%dT%H:%M:%S.%f%z'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This DuckDB query flattens the JSON, converts the latitude, longitude and time into the correct data types, and loads the results into the &lt;em&gt;route_df&lt;/em&gt; dataframe.&lt;/p&gt;

&lt;h1&gt;
  
  
  Visualize GPS Routes with Folium
&lt;/h1&gt;

&lt;p&gt;Having a table of location information isn’t very descriptive, so I wanted to start plotting my running routes on an interactive map. I used this blog post to help: &lt;a href="https://betterdatascience.com/data-science-for-cycling-how-to-visualize-gpx-strava-routes-with-python-and-folium/" rel="noopener noreferrer"&gt;Visualize routes with Folium&lt;/a&gt;. Modifying the code helped me plot my own runs; for example, this is a plot of a recent run while on holiday in &lt;a href="https://en.wikipedia.org/wiki/Canberra" rel="noopener noreferrer"&gt;Canberra&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;folium&lt;/span&gt;

&lt;span class="n"&gt;route_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;folium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;35.275&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;149.129&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;zoom_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tiles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;openstreetmap&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;coordinates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;route_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latitude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;longitude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;to_numpy&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="n"&gt;folium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PolyLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coordinates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;red&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;add_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A38wf2eG-fR2xwrU0W53k3A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A38wf2eG-fR2xwrU0W53k3A.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Folium map plot of a run — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;This generates a plot of my run using &lt;a href="https://openmaptiles.org/" rel="noopener noreferrer"&gt;OpenStreetMap&lt;/a&gt; tiles, giving me a great interactive, detailed map of my route.&lt;/p&gt;
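
&lt;p&gt;As an aside, outside a notebook the same interactive map can be written to a standalone HTML file with folium’s save method (the file name here is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Save the interactive map as a standalone HTML file (file name is illustrative)
route_map.save('route_map.html')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;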

&lt;h1&gt;
  
  
  Data goals and fitness goal summary
&lt;/h1&gt;

&lt;p&gt;Did I get closer to my goal of analysing my Fitbit device data — absolutely! DuckDB proved to be an ideal flexible, lightweight analytical tool for wrangling my extensive and chaotic Fitbit data archive. Blazing through literally millions of records in seconds, with extensive SQL support and flexible file parsing straight into local dataframes, makes DuckDB ideal for building my own personal data warehouse.&lt;/p&gt;

&lt;p&gt;As for my fitness goal — I have some work to do. I think I should leave this blog now, as I’m short of my step goal target for today.&lt;/p&gt;

&lt;h1&gt;
  
  
  Code
&lt;/h1&gt;

&lt;p&gt;🛠️Code for Fitbit activity analysis with DuckDB — &lt;a href="https://github.com/saubury/duckdb-fitbit" rel="noopener noreferrer"&gt;https://github.com/saubury/duckdb-fitbit&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Mastodon usage — counting toots with Kafka, DuckDB &amp; Seaborn 🐘🦆📊</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Thu, 23 Feb 2023 10:29:52 +0000</pubDate>
      <link>https://dev.to/saubury/mastodon-usage-counting-toots-with-kafka-duckdb-seaborn-aok</link>
      <guid>https://dev.to/saubury/mastodon-usage-counting-toots-with-kafka-duckdb-seaborn-aok</guid>
      <description>&lt;h1&gt;
  
  
  Mastodon usage — counting toots with Kafka, DuckDB &amp;amp; Seaborn 🐘🦆📊
&lt;/h1&gt;

&lt;p&gt;Mastodon is a decentralized social networking platform. Users are members of a specific Mastodon instance, and servers are capable of joining other servers to form a federated social network.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I wanted to start exploring Mastodon usage; and perform exploratory data analysis of user activity, server popularity and language usage. I used distributed stream processing tools to collect data from multiple instances to get a glimpse into what’s happening in the &lt;a href="https://en.wikipedia.org/wiki/Fediverse" rel="noopener noreferrer"&gt;fediverse&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This blog covers the tools for data collection and data processing (&lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; stream processing). If this doesn’t interest you, you can jump straight to the data analysis (&lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; and &lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt;). For the enthusiastic, you can &lt;a href="https://github.com/saubury/mastodon-stream" rel="noopener noreferrer"&gt;run the code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c39npgy5br329v1dtuz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c39npgy5br329v1dtuz.png" alt="Collection of open source projects used" width="800" height="450"&gt;&lt;/a&gt;&lt;em&gt;Collection of open source projects used&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://mastodonpy.readthedocs.io/" rel="noopener noreferrer"&gt;Mastodon.py&lt;/a&gt; — Python library for interacting with the Mastodon API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; — distributed event streaming platform&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; — in-process SQL OLAP database and the &lt;a href="https://duckdb.org/docs/extensions/httpfs.html" rel="noopener noreferrer"&gt;HTTPFS DuckDB extension&lt;/a&gt; for reading remote/writing remote files of object storage using the S3 API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://min.io/" rel="noopener noreferrer"&gt;MinIO&lt;/a&gt; — S3 compatible server&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt; — visualization library&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data collection — the Mastodon listener
&lt;/h2&gt;

&lt;p&gt;ℹ️ If you’re not interested in the data collection … jump straight to the data analysis&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3kj857rtcnzj73obi8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3kj857rtcnzj73obi8g.png" alt="Data collection architecture" width="800" height="351"&gt;&lt;/a&gt;&lt;em&gt;Data collection architecture&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There is a &lt;a href="https://joinmastodon.org/servers" rel="noopener noreferrer"&gt;large collection&lt;/a&gt; of Mastodon servers with a wide variety of subjects, topics and interesting communities. Accessing a public stream is generally possible without authenticating, so no account is required to work out what’s happening on each server.&lt;/p&gt;

&lt;p&gt;In the decentralized Mastodon network, not every message is sent to every server. &lt;a href="https://commons.wikimedia.org/wiki/File:Mastodon_timelines.png" rel="noopener noreferrer"&gt;Generally&lt;/a&gt;, public toots from instance-A will only be sent to instance-B if a user from B follows that user from A.&lt;/p&gt;

&lt;p&gt;I wrote a Python application, mastodonlisten, to listen for public posts from a given server. By running multiple listeners I could collect toots from both popular and niche instances. Each listener collects public toots from its server and publishes them to a private Kafka broker. Multiple Mastodon listeners can be run in the background like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python mastodonlisten.py --baseURL https://mastodon.social --enableKafka &amp;amp;

python mastodonlisten.py --baseURL https://universeodon.com --enableKafka &amp;amp;

python mastodonlisten.py --baseURL https://hachyderm.io --enableKafka &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
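
&lt;p&gt;For a feel of what each listener does, here is a minimal sketch, assuming Mastodon.py’s StreamListener and the confluent_kafka Producer. The broker address and record fields are illustrative; the full version lives in the project repo.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal sketch of a Mastodon listener publishing public toots to Kafka.
# Assumes Mastodon.py and confluent-kafka; broker address and field names are illustrative.
import json
from mastodon import Mastodon, StreamListener
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})

class TootListener(StreamListener):
    def on_update(self, status):
        # Each public toot arrives as a status dict; publish a trimmed-down record
        record = {
            'm_id': status['id'],
            'created_at': int(status['created_at'].timestamp()),
            'username': status['account']['username'],
            'language': status['language'],
            'mastodon_text': status['content'],
        }
        producer.produce('mastodon-topic', value=json.dumps(record))
        producer.poll(0)  # serve delivery callbacks

mastodon = Mastodon(api_base_url='https://mastodon.social')
mastodon.stream_public(TootListener())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;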
&lt;h2&gt;
  
  
  Kafka Connect
&lt;/h2&gt;

&lt;p&gt;I’ve now got multiple Mastodon listeners feeding public posts from multiple servers into a single Kafka topic. My next task is to understand what’s going on with all this activity from the decentralised network.&lt;/p&gt;

&lt;p&gt;I decided to incrementally dump the “toots” into Parquet files on an S3 object store. Parquet is a columnar storage format that is optimised for analytical querying. I chose Kafka Connect to stream data from my Kafka topic and land it in S3 using the S3SinkConnector.&lt;/p&gt;

&lt;p&gt;That sounds like a lot of work — but the TL;DR is that with a bit of configuration, I can instruct Kafka Connect to do everything for me. Consuming the mastodon-topic from Kafka and creating a new Parquet file on S3 every 1,000 records is accomplished with this configuration&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "name": "mastodon-sink-s3",
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "mastodon-topic",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "1000",
    "s3.bucket.name": "mastodon",
    "aws.access.key.id": "minio",
    "aws.secret.access.key": "minio123",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "store.url": "http://minio:9000"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
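
&lt;p&gt;Registering the connector is a single REST call to the Kafka Connect worker. Here is a sketch, assuming the worker listens on localhost:8083; the aws.* credentials shown above are omitted.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: register the S3 sink via the Kafka Connect REST API.
# Assumes a Connect worker on localhost:8083; add the aws.* credentials shown above.
import requests

config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "mastodon-topic",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "1000",
    "s3.bucket.name": "mastodon",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "store.url": "http://minio:9000",
}

response = requests.put("http://localhost:8083/connectors/mastodon-sink-s3/config", json=config)
response.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;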

&lt;p&gt;To check this is all working correctly, I can see new files are being regularly created by looking in the MinIO web-based object browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4buq29jfxz9fitw2tgzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4buq29jfxz9fitw2tgzz.png" alt="MinIO web-based object browser" width="800" height="380"&gt;&lt;/a&gt;&lt;em&gt;MinIO web-based object browser&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data analysis
&lt;/h2&gt;

&lt;p&gt;Now we have collected just over a week of Mastodon activity, let’s have a look at some data. These steps are detailed in the &lt;a href="https://github.com/saubury/mastodon-stream/blob/main/notebooks/mastodon-analysis.ipynb" rel="noopener noreferrer"&gt;notebook&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Observations in this section are based on a naive interpretation of a few days’ worth of data. Please don’t rely on any of this analysis, but feel free to use these techniques yourself to explore and learn&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Firstly, some quick statistics from the data collected over 10 days (3 Feb to 12 Feb 2023)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🔢 Number of Mastodon toots seen 1,622,149&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;👤 Number of unique Mastodon users 142,877&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;💻 Number of unique Mastodon instances 8,309&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🌏 Number of languages seen 131&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;✍️ Shortest toot 0 characters, average toot length 151 characters and longest toot 68,991 characters (if you’re curious, the longest toot was a silly comment followed by the same emoji repeated 68,930 times)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;📚 Total of all toots 245,245,677 characters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🦆 DuckDB memory used to hold &lt;strong&gt;1.6 million toots is just 745.5MB&lt;/strong&gt; (which is tiny!)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;⏱ Time it takes to calculate the above statistics in a single SQL query is &lt;strong&gt;0.7 seconds&lt;/strong&gt; (wow — fast! A sketch of such a query follows this list)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
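
&lt;p&gt;As a rough idea of that query, here is a sketch using DuckDB from Python. It assumes the &lt;em&gt;mastodon_toot&lt;/em&gt; table built later in this post, together with its column names.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: the summary statistics in a single DuckDB query
# (assumes the mastodon_toot table created later in this post)
import duckdb

duckdb.sql("""
    select count(*)                   as num_toots
    , count(distinct username)        as num_users
    , count(distinct from_instance)   as num_instances
    , count(distinct language)        as num_languages
    , min(characters)                 as shortest_toot
    , round(avg(characters))          as average_toot_length
    , max(characters)                 as longest_toot
    , sum(characters)                 as total_characters
    from mastodon_toot
""").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;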

&lt;p&gt;DuckDB’s Python client can be used &lt;a href="https://duckdb.org/docs/guides/python/jupyter" rel="noopener noreferrer"&gt;directly in Jupyter notebooks&lt;/a&gt;. The first step is to import the relevant libraries. The DuckDB Python package can run queries directly on Pandas dataframes. With a few &lt;a href="https://www.datacamp.com/tutorial/sql-interface-within-jupyterlab" rel="noopener noreferrer"&gt;SqlMagic&lt;/a&gt; settings it’s possible to configure the notebook to output query results directly to Pandas&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%load_ext sql
%sql duckdb:///:memory:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plus we can use the &lt;a href="https://duckdb.org/docs/extensions/httpfs.html" rel="noopener noreferrer"&gt;HTTPFS DuckDB extension&lt;/a&gt; for reading and writing remote files on object storage using the S3 API&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
INSTALL httpfs;
LOAD httpfs;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Establish s3 endpoint
&lt;/h2&gt;

&lt;p&gt;Here we’re using a local &lt;a href="https://min.io/" rel="noopener noreferrer"&gt;MinIO&lt;/a&gt; as an open source, Amazon S3-compatible server (and no, you shouldn’t share your secret_access_key). Set the S3 endpoint settings like this&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
set s3_endpoint='localhost:9000';
set s3_access_key_id='minio';
set s3_secret_access_key='minio123';
set s3_use_ssl=false;
set s3_region='us-east-1';
set s3_url_style='path';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And I can now query the parquet files directly from s3&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
select *
from read_parquet('s3://mastodon/topics/mastodon-topic/partition=0/*');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb2fywpw8er8psxaknv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb2fywpw8er8psxaknv1.png" alt="Reading parquet from s3 without leaving the notebook" width="612" height="208"&gt;&lt;/a&gt;&lt;em&gt;Reading parquet from s3 without leaving the notebook&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is pretty cool — we can read the parquet data sitting in our S3 bucket directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  DuckDB SQL to process Mastodon activity
&lt;/h2&gt;

&lt;p&gt;Before moving on, I had a bit of data cleanup which I could do within DuckDB, loading remote parquet files (from s3).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Map the ISO 639–1 (two-letter) language code (zh, cy, en) to a language description (Chinese, Welsh, English). We can create a language lookup table and load languages from language.csv.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate created_tz from the created_at integer, which holds the number of seconds since the epoch (1/1/1970)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Determine the originating instance with a regular expression to strip the URL&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create the table mastodon_toot as a join of mastodon_toot_raw to language, as shown below&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE language(lang_iso VARCHAR PRIMARY KEY, language_name VARCHAR);

insert into language
select *
from read_csv('./language.csv', AUTO_DETECT=TRUE, header=True);

create table mastodon_toot_raw as
select m_id
, created_at
, ('EPOCH'::TIMESTAMP + INTERVAL (created_at::INT) seconds)::TIMESTAMPTZ as created_tz
, app
, url
, regexp_replace(regexp_replace(url, '^http[s]://', ''), '/.*$', '') as from_instance
, base_url
, language
, favourites
, username
, bot
, tags
, characters
, mastodon_text
from read_parquet('s3://mastodon/topics/mastodon-topic/partition=0/*');

create table mastodon_toot as
select mr.*, ln.language_name
from mastodon_toot_raw mr
left outer join language ln on (mr.language = ln.lang_iso);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;🪄Being able to do this cleanup and transformation in SQL and have it execute in 0.8 seconds is like magic to me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02v282tm5gb50l35khhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02v282tm5gb50l35khhc.png" alt="Very fast processing — less then a second" width="362" height="144"&gt;&lt;/a&gt;&lt;em&gt;Very fast processing — less then a second&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Daily Mastodon usage
&lt;/h2&gt;

&lt;p&gt;We can query the mastodon_toot table directly to see the number of &lt;em&gt;toots&lt;/em&gt; and &lt;em&gt;users&lt;/em&gt; each day, counting and grouping the activity by day. We can use the &lt;a href="https://duckdb.org/docs/sql/aggregates.html#statistical-aggregates" rel="noopener noreferrer"&gt;mode&lt;/a&gt; aggregate function to find the most frequent “bot” and “not-bot” users, surfacing the most active Mastodon accounts&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
select strftime(created_tz, '%Y/%m/%d %a') as "Created day"
, count(*) as "Num toots"
, count(distinct(username)) as "Num users"
, count(distinct(from_instance)) as "Num urls"
, mode(case when bot='False' then username end) as "Most freq non-bot"
, mode(case when bot='True' then username end) as "Most freq bot"
, mode(base_url) as "Most freq host"
from mastodon_toot
group by 1
order by 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdg8tza61i5gx1fbr6hqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdg8tza61i5gx1fbr6hqx.png" alt="Raw daily counts of activity" width="800" height="226"&gt;&lt;/a&gt;&lt;em&gt;Raw daily counts of activity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ℹ️ The first few days were a bit sporadic as I was playing with the data collection. Once everything was set up I was generally seeing&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;200,000 toots a day from 50,000 users&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mastodon.social was the most popular host&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;news organisations are the biggest generators of content (and they don’t always set the “bot” attribute)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Mastodon app landscape
&lt;/h2&gt;

&lt;p&gt;Which clients are used to access a Mastodon instance? We query the mastodon_toot table, excluding "bots", and load the query results into the mastodon_app_df Pandas dataframe&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
mastodon_app_df &amp;lt;&amp;lt; 
    select *
    from mastodon_toot
    where app is not null 
    and app &amp;lt;&amp;gt; ''
    and bot='False';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt; is a visualization library for statistical graphics in Python, built on the top of &lt;a href="https://matplotlib.org/" rel="noopener noreferrer"&gt;matplotlib&lt;/a&gt;. It also works really well with Panda data structures.&lt;/p&gt;

&lt;p&gt;We can use &lt;a href="https://seaborn.pydata.org/generated/seaborn.countplot.html" rel="noopener noreferrer"&gt;seaborn.countplot&lt;/a&gt; to show the counts of Mastodon app usage observations in each categorical bin using bars. Note, we are limiting this to the 10 highest occurrences&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.countplot(data=mastodon_app_df, y="app", order=mastodon_app_df.app.value_counts().iloc[:10].index)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frr0thpzrfgyljrq2qkx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frr0thpzrfgyljrq2qkx0.png" alt="Web, Ivory and Moa are popular ways of toot’ing" width="800" height="502"&gt;&lt;/a&gt;&lt;em&gt;Web, Ivory and Moa are popular ways of toot’ing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ℹ️ The Mastodon application landscape is rapidly changing. Web usage is the preferred client, followed by mobile apps like Ivory, Moa, Tusky and the Mastodon app&lt;/p&gt;

&lt;p&gt;⚠️ The &lt;a href="https://mastodonpy.readthedocs.io/en/stable/02_return_values.html#toot-status-dicts" rel="noopener noreferrer"&gt;Mastodon API&lt;/a&gt; attempts to report application for the client used to post the toot. Generally this attribute does not federate and is therefore undefined for remote toots.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time of day Mastodon usage
&lt;/h2&gt;

&lt;p&gt;Let’s see when Mastodon is used throughout the day and night. I want a raw count of &lt;em&gt;toots&lt;/em&gt; for each hour of each day. We can load the results of this query into the mastodon_usage_df dataframe&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
mastodon_usage_df &amp;lt;&amp;lt; 
    select strftime(created_tz, '%Y/%m/%d %a') as created_day
    , date_part('hour', created_tz) as created_hour
    , count(*) as num
    from mastodon_toot
    group by 1,2 
    order by 1,2;

sns.lineplot(data=mastodon_usage_df, x="created_hour", y="num", hue="created_day").set_xticks(range(24))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv366h7iwid51h9hpaowe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv366h7iwid51h9hpaowe.png" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⏰ It was interesting to see daily activity follow a very similar usage pattern.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The lowest activity was seen at 3:00pm in Australia (12:00pm in China, 8:00pm in California and 4:00am in London)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The highest activity was seen at 2:00am in Australia (11:00pm in China, 7:00am in California and 3:00pm in London)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Language usage
&lt;/h2&gt;

&lt;p&gt;The language of a toot can be specified by the server or the client — so it’s not always an accurate indicator of the language within the toot. Consider this a wildly inaccurate investigation of language tags.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
mastodon_usage_df &amp;lt;&amp;lt; 
    select *
    from mastodon_toot;

sns.countplot(data=mastodon_usage_df, y="language_name", order=mastodon_usage_df.language_name.value_counts().iloc[:20].index)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ckt4njtiovd706fxw4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ckt4njtiovd706fxw4g.png" alt="English is prominent language, followed by Japanese" width="800" height="549"&gt;&lt;/a&gt;&lt;em&gt;English is prominent language, followed by Japanese&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Toot length by language usage
&lt;/h2&gt;

&lt;p&gt;I was also curious what the length of toots looked like over different languages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
mastodon_lang_df &amp;lt;&amp;lt; 
    select *
    from mastodon_toot
    where language not in ('unknown');

sns.boxplot(data=mastodon_lang_df, x="characters", y="language_name", whis=100, orient="h", order=mastodon_lang_df.language_name.value_counts().iloc[:20].index)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hjzwp6kgg6ego3irgdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hjzwp6kgg6ego3irgdx.png" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What’s interesting to see is that the typical Chinese, Japanese and Korean toot is shorter than English, whereas Galician and Finnish messages are longer. A possible explanation is that logographic languages (like Mandarin) may be able to convey more with fewer characters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.bleepingcomputer.com/news/technology/mastodon-now-has-over-1-million-users-amid-twitter-tensions/" rel="noopener noreferrer"&gt;rise of Mastodon&lt;/a&gt; is something I’ve been really interested in. The open sharing nature has helped with the rapid adoption by communities and new users (&lt;a href="https://data-folks.masto.host/@saubury" rel="noopener noreferrer"&gt;myself included&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;It’s been great to explore the &lt;a href="https://en.wikipedia.org/wiki/Fediverse" rel="noopener noreferrer"&gt;fediverse&lt;/a&gt; with powerful open source distributed stream processing tools. Performing exploratory data analysis in a Jupyter notebook with DuckDB is like pressing a turbo button ⏩. Reading parquet from s3 without leaving the notebook is neat, and DuckDB’s ability to run queries directly on Pandas data without ever importing or copying any data is really snappy.&lt;/p&gt;

&lt;p&gt;I’m going to conclude with my two favourite statistics. DuckDB memory used to hold &lt;strong&gt;1.6 million toots is just 745.5MB&lt;/strong&gt; and to process my results in &lt;strong&gt;0.7 seconds&lt;/strong&gt; is like a super power 🪄&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;⚒️ &lt;a href="https://github.com/saubury/mastodon-stream/" rel="noopener noreferrer"&gt;Code&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🐘 &lt;a href="https://data-folks.masto.host/@saubury" rel="noopener noreferrer"&gt;Simon on Mastodon&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>Real-Time Wildlife Monitoring with Apache Kafka</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Fri, 20 Jan 2023 10:49:52 +0000</pubDate>
      <link>https://dev.to/saubury/real-time-wildlife-monitoring-with-apache-kafka-3pbj</link>
      <guid>https://dev.to/saubury/real-time-wildlife-monitoring-with-apache-kafka-3pbj</guid>
      <description>&lt;p&gt;Wildlife monitoring is critical for keeping track of population changes of vulnerable animals. As part of the Confluent Hackathon ʼ22, I was inspired to investigate if a streaming platform could help with tracking animal movement patterns. The challenge was to examine trends in identified species and demonstrate how animal movement patterns can be observed in the wild using Apache Kafka® and open source dashboarding.&lt;/p&gt;

&lt;p&gt;Note : This article was originally written and published for the &lt;a href="https://www.confluent.io/blog/real-time-detection-monitoring-with-apache-kafka/" rel="noopener noreferrer"&gt;Confluent blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AKDgirObuWyTPcSHH" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AKDgirObuWyTPcSHH" alt="Dashbaord with upding animal counts" width="600" height="350"&gt;&lt;/a&gt;&lt;em&gt;Dashbaord with upding animal counts&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I’ve been using Kafka in my “day job” for many years, building streaming solutions in retail, telemetrics, finance, and energy — but this hackathon challenged me to build something new and novel. The goal was ambitious. Before scaling up to monitor and alert on more exotic creatures, I initially chose to test the viability at a smaller scale by tracking wildlife in my own back garden.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrh0rm1ceb55sdqk4p32.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrh0rm1ceb55sdqk4p32.jpg" alt="Simplified archiecture diagram" width="800" height="315"&gt;&lt;/a&gt;&lt;em&gt;Simplified archiecture diagram&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Backyard animal detection
&lt;/h2&gt;

&lt;p&gt;To test the viability of the project, I built a “backyard monitoring” experiment using a Raspberry Pi along with an attached camera. The images were processed locally to identify and classify animals, with the observation events published to &lt;a href="https://www.confluent.io/confluent-cloud/tryfree/" rel="noopener noreferrer"&gt;Confluent Cloud&lt;/a&gt; for stream processing.&lt;/p&gt;

&lt;p&gt;This project used TensorFlow Lite with Python on a Raspberry Pi 4 to perform real-time object classification using images streamed from the attached Raspberry Pi camera. TensorFlow is an open source platform for machine learning, and TensorFlow Lite is a slimmed-down library suitable for deploying models on low-powered, battery-operated edge devices such as a Raspberry Pi. TensorFlow also has a great number of community resources, so I was able to make use of a detection model already pre-trained to detect numerous animals, including zebras, elephants, cats, dogs and more importantly, teddy bears.&lt;/p&gt;

&lt;p&gt;I deployed a small Python application to run on the Raspberry Pi. The application continuously captures images from the camera — and each detected animal is given an object detection score. To connect the Raspberry Pi to the Kafka cluster, I used the confluent_kafka API, a powerful Python client library for interacting with Kafka. With some basic setup (and some secret tokens for connecting) my Python application acts as a Kafka producer. Whenever an animal is detected, be it an elephant, zebra, kangaroo, or household cat, it is sent as a new record to the objects topic.&lt;/p&gt;
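
&lt;p&gt;In outline, the producer loop looks something like this. It is a sketch only: capture_frame and detect_objects are placeholders standing in for the camera capture and the TensorFlow Lite classification, and the broker settings are illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough sketch of the detection-to-Kafka loop on the Raspberry Pi.
# capture_frame() and detect_objects() are placeholders for the camera capture
# and the TensorFlow Lite classification; broker settings are illustrative.
import json
import time
from collections import Counter
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})  # plus SASL secrets for Confluent Cloud

def capture_frame():
    return None  # placeholder: grab an image from the Pi camera

def detect_objects(frame):
    return ['cat', 'bird']  # placeholder: labels scoring above a detection threshold

while True:
    counts = Counter(detect_objects(capture_frame()))
    if counts:
        event = {'camera_name': 'backyard-pi', 'objects_count': dict(counts)}
        producer.produce('objects', value=json.dumps(event))
        producer.poll(0)
    time.sleep(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;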

&lt;p&gt;Once tested, I deployed the Raspberry Pi, camera, and battery into the “field” (aka my backyard) to monitor the local wildlife. This worked surprisingly well, capturing the cats, dogs, and several birds that appeared during the week of testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Peeking into zoo wildlife
&lt;/h2&gt;

&lt;p&gt;I was happy with the minimal viable product (MVP), but was disappointed I hadn’t spotted any exotic animals in my local garden. To increase the variety of animal encounters, I deployed a second Kafka producer, but this time connected to a webcam at a local zoo. Live animal webcams provide a great source of video feeds with an increased likelihood of spotting giraffes, elephants, and zebras over what may be roaming in my back garden. Similar to the Python application described earlier, the stream of webcam images yielded plenty of interesting animal encounters that could be detected with the TensorFlow object classification. Regardless of the source, animal detection events were sent to a shared Kafka cluster with a payload describing the image source and animals detected.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "camera_name": "zoo-webcam",
  "objects_count": {
    "elephant": 1,
    "zebra": 2
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7n2rdfpz0pq7dq38uit.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7n2rdfpz0pq7dq38uit.jpg" alt="Elephants and zebras with identifed JSON stream" width="800" height="412"&gt;&lt;/a&gt;&lt;em&gt;Elephants and zebras with identifed JSON stream&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Animal crossing — the stream processing edition
&lt;/h2&gt;

&lt;p&gt;So I now had two Kafka producers sending animal detection events into a Kafka cluster, and my next challenge was to process the animal detection payloads. The goal was to understand the population observations over short time periods (such as animals seen this hour) and longer-term trends (such as population changes day on day). I elected to use &lt;a href="https://ksqldb.io/" rel="noopener noreferrer"&gt;ksqlDB&lt;/a&gt; to help transform the raw objects topic into some meaningful observations.&lt;/p&gt;

&lt;p&gt;The first task was to declare an objects stream in ksqlDB—allowing me to author SQL statements against the underlying Kafka objects topic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create stream objects (camera_name VARCHAR, objects_found VARCHAR, objects_count VARCHAR) 
WITH (KAFKA_TOPIC='objects', VALUE_FORMAT='json');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The objects identified by the Raspberry Pi and webcam were transported as JSON records. Understanding the number and variety of animals identified in each Kafka record required digging into the JSON. Fortunately, ksqlDB has a handy EXTRACTJSONFIELD function to retrieve nested field values from a string of JSON. It returns the number of each animal if the field key exists in that message; otherwise, it returns NULL. The following is an example of extracting counts of each animal into appropriately named fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create stream animals as
select camera_name
, objects_count
, cast(extractjsonfield(objects_count, '$.elephant') as bigint) as elephant
, cast(extractjsonfield(objects_count, '$.bear') as bigint) as bear
, cast(extractjsonfield(objects_count, '$.zebra') as bigint) as zebra
, cast(extractjsonfield(objects_count, '$.giraffe') as bigint) as giraffe
, cast(extractjsonfield(objects_count, '$.teddybear') as bigint) as teddybear
from objects;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Something I noticed during testing was that the TensorFlow detection didn’t always detect all the animals in each frame. Two animals in the frame were often only detected as a single animal. I discovered I could “smooth” out the observations from the detection model by finding the highest count of each animal type within a 30-second window. This was achieved by defining a tumbling window that counts the number of each animal observed in the time window. I figured it was appropriate to call this the zoo table.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table zoo as
select camera_name
, max(elephant) as max_elephant
, max(bear) as max_bear
, max(zebra) as max_zebra
, max(giraffe) as max_giraffe
, max(teddybear) as max_teddybear
from animals
window tumbling (size 30 seconds)
group by camera_name
emit changes;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Due to the limited processing on the Kafka producer, it was common to undercount the number of animals seen in consecutive frames of the video. By adding the max ksqlDB function I could create a table to project the highest occurrence count of each animal seen in the time window.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing
&lt;/h2&gt;

&lt;p&gt;With animal observations running and stream transformations deployed I moved on to building the analytics system. To complete the project I wanted to create a live dashboard, with animal counts and a visual image of population trends over time. A Kibana dashboard was ideal, so I just needed to populate the underlying Elasticsearch indexes to create a visual analytics dashboard.&lt;/p&gt;

&lt;p&gt;I wanted a simple way to send the Kafka data onwards to my analytics dashboard. Kafka Connect is a framework for connecting Kafka with external systems such as relational databases, document databases, and key-value stores. I used the Kafka Connect Elasticsearch connector to send both the animals and zoo Kafka topics to Elasticsearch indexes. Once configured, the connector consumes records from the two Kafka topics and writes to a corresponding index in Elasticsearch. With the Elasticsearch index created and populated, a Kibana dashboard provides a great way to visualize the mix of animals, and the trending population of the zoo.&lt;/p&gt;
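
&lt;p&gt;A sketch of the Elasticsearch sink settings, expressed as a Python dict; the connection URL and topic names here are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: Elasticsearch sink configuration (connection URL and topic names are illustrative)
elasticsearch_sink = {
    "name": "animals-sink-elastic",
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "ANIMALS,ZOO",              # one Elasticsearch index per topic
    "connection.url": "http://elasticsearch:9200",
    "key.ignore": "true",                 # let the connector generate document ids
    "schema.ignore": "true",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;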

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnfvw571ons1f33r2pcx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnfvw571ons1f33r2pcx.jpg" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Alerting for the ultra-rare teddy bear sighting
&lt;/h2&gt;

&lt;p&gt;With the detection, transformation, and visualization complete for the wildlife monitoring system, there was still an opportunity for one last feature: rare animal alerting!&lt;/p&gt;

&lt;p&gt;In addition to the more traditional animals, I noticed the detection model I was using had the ability to identify teddy bears. This provided an ideal situation to demonstrate exceptional processing conditions — after all, what’s more rare than the arrival of a teddy bear in the back garden?&lt;/p&gt;

&lt;p&gt;Inspired by &lt;a href="https://dev.to/rmoff/building-a-telegram-bot-with-apache-kafka-go-and-ksqldb-4and"&gt;Building a Telegram Bot with Apache Kafka&lt;/a&gt;, I set up a Telegram bot to alert me when a rare animal is sighted — or at least send a push notification to my phone if a teddy bear is seen in the garden.&lt;/p&gt;

&lt;p&gt;Creating the special notification simply required a new ksqlDB stream. The teddytopic stream contains a record signifying the arrival of a teddy bear.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create stream teddytopic as 
select '📢 Just saw a 🧸 TEDDY BEAR in the garden' as message 
from animals 
where teddybear &amp;gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, I wanted to build the alerting mechanism, triggered whenever the accompanying teddytopic Kafka topic acquired a record signifying that an “endangered” teddy bear was sighted.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://core.telegram.org/bots/api" rel="noopener noreferrer"&gt;Telegram Bot API&lt;/a&gt; allows me to create managed bots for interacting with Telegram — a popular messaging service. To call Telegram I needed to create a Kafka wildlife bot, which provides an HTTP-based interface to let me know instantly when something novel had been observed. With the bot created, I used the Kafka Connect HTTP Sink Connector to make an HTTPS API call for each record in the teddytopic topic. Once configured, the connector consumes records from Kafka topic, sending the record value in the request body to the configured http.api.url. In this case, the endpoint was a preconfigured api.telegram.org.&lt;/p&gt;

&lt;p&gt;With these steps completed, I get instant notification on my phone whenever a rare teddy bear is observed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2254tnt0lexw03tuodhy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2254tnt0lexw03tuodhy.jpg" width="616" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;Building this project as part of the Confluent Hackathon ʼ22 was both a great learning experience and a fun way to mix together some impressive technologies. Combining cheap computing devices and open source machine learning libraries along with the Kafka streaming platform provides a new tool to examine trends in animal population and movement observations.&lt;/p&gt;

&lt;p&gt;Although wildlife watcher is a simple proof of concept, I hope a project like this demonstrates how incredible components can be incorporated quickly. Stream processing with ksqlDB, the flexibility of Confluent Cloud, and the integration of data systems with Kafka Connect allowed me to build a fun project with less than 200 lines of code and perhaps contribute a small part in helping to solve real-world wildlife challenges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links to code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/saubury/wildlife-watch" rel="noopener noreferrer"&gt;GitHub project&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Can ML predict where my cat is now — part 2</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Mon, 04 Jul 2022 21:56:22 +0000</pubDate>
      <link>https://dev.to/saubury/can-ml-predict-where-my-cat-is-now-part-2-4ep0</link>
      <guid>https://dev.to/saubury/can-ml-predict-where-my-cat-is-now-part-2-4ep0</guid>
      <description>&lt;h1&gt;
  
  
  Can ML predict where my cat is now — part 2
&lt;/h1&gt;

&lt;p&gt;Can ML predict where Snowy the cat would go throughout her day? With months of location &amp;amp; temperature data captured, this second blog covers how to train a machine learning (ML) model to make that prediction. For the impatient, you can skip directly to the prediction web-app here.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://simon-aubury.medium.com/can-ml-predict-where-my-cat-is-now-part-1-cfb194b51aab"&gt;Part 1 of this blog&lt;/a&gt; covered the hardware required build a history of which room she used for her favourite sleeping spots.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kKwt8rWS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2ApcBZHKQDm-IwLH-w" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kKwt8rWS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2ApcBZHKQDm-IwLH-w" alt="Cat location prediction using Streamlit web apps" width="600" height="465"&gt;&lt;/a&gt;&lt;em&gt;Cat location prediction using Streamlit web apps&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where are we starting?
&lt;/h2&gt;

&lt;p&gt;This &lt;a href="https://simon-aubury.medium.com/can-ml-predict-where-my-cat-is-now-part-1-cfb194b51aab"&gt;first blog&lt;/a&gt; described the method for locating Snowy and the data collection platform. I had collected over three months of data, with over &lt;strong&gt;12 million&lt;/strong&gt; location, temperature, humidity and rainfall observations (I &lt;em&gt;may&lt;/em&gt; have gone over the top with data collection).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7yLuttLV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3622/1%2AZpHFIHMN58FXuRtF1XA2Bw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7yLuttLV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3622/1%2AZpHFIHMN58FXuRtF1XA2Bw.png" alt="" width="880" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The question I’ve been trying to answer: can I use these historic observations to build a prediction model of where she is likely to go? How confident can a machine be at predicting Snowy’s hiding spot?&lt;/p&gt;

&lt;h2&gt;
  
  
  ML Bootcamp
&lt;/h2&gt;

&lt;p&gt;Supervised learning is the ML task of creating a function that maps an input to an output based on example input-output pairs. In my case, I want to take historic observations about cat location, temperature, time of day etc., as inputs and find patterns … a function (inference) that predicts future cat location.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XWA9SpED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3324/1%2ANncPI9cL6dCDozOgXEdZ3g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XWA9SpED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3324/1%2ANncPI9cL6dCDozOgXEdZ3g.png" alt="Temperature, time and day — can it map to location?" width="880" height="409"&gt;&lt;/a&gt;&lt;em&gt;Temperature, time and day — can it map to location?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My assumption is the problem can be generalised from this data; e.g. future data will follow some common pattern of past cat behaviour (for a cat — this assumption may be questionable).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tfrQDEQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3744/1%2AvBucY3DB1DfM_UJqEeZ7ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tfrQDEQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3744/1%2AvBucY3DB1DfM_UJqEeZ7ug.png" alt="Cat location prediction" width="880" height="338"&gt;&lt;/a&gt;&lt;em&gt;Cat location prediction&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The training uses past information to build a model that is a deployable artefact. Once a candidate model is trained, it can be tested for prediction accuracy and finally deployed. In my case, I wish to create a web application to make predictions on where Snowy is likely to be napping.&lt;/p&gt;

&lt;p&gt;What’s also important is that the model doesn’t have to explicitly output an absolute location; it can give its answer in terms of a confidence. If it outputs P(location:study) near 1.0 it’s confident, while values near 0.5 mean it is unsure of Snowy’s location.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summarising data with dbt
&lt;/h2&gt;

&lt;p&gt;As covered in &lt;a href="https://simon-aubury.medium.com/can-ml-predict-where-my-cat-is-now-part-1-cfb194b51aab"&gt;part 1&lt;/a&gt; — my data platform Home Assistant stores each sensor update in the &lt;a href="https://www.home-assistant.io/docs/backend/database/"&gt;states&lt;/a&gt; table. This is &lt;em&gt;really&lt;/em&gt; fine-grained, with updates added every few seconds from all the sensors (in my case, around 18,000 sensor updates a day). My goal was to summarise the data into hourly updates — essentially a single (most prevalent) location, along with temperature and humidity readings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zSqTmcwY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3394/1%2AlrglkFoIFjc3B9dmB5QMNA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zSqTmcwY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3394/1%2AlrglkFoIFjc3B9dmB5QMNA.png" alt="Summarising lots of data into hourly summaries" width="880" height="209"&gt;&lt;/a&gt;&lt;em&gt;Summarising lots of data into hourly summaries&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Initially I was manually running the data processing with a bunch of SQL statements (like &lt;a href="https://github.com/saubury/cat-predictor/blob/master/sql/extract.sql"&gt;this&lt;/a&gt;). However, I found this fairly cumbersome as I wanted to retrain the model with newer location and environmental conditions. I settled on the trusty data engineering tool &lt;a href="https://www.getdbt.com/"&gt;dbt&lt;/a&gt; to simplify the creation of the SQL transformations in my database and make retraining more effective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oUYq7YNL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3526/1%2AhTuWToji5sce680xpOCgNw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oUYq7YNL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3526/1%2AhTuWToji5sce680xpOCgNw.png" alt="The dbt lineage graph showing the transformation of data" width="880" height="449"&gt;&lt;/a&gt;&lt;em&gt;The dbt lineage graph showing the transformation of data&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;dbt handles turning my select statements into tables and views, transforming the data already inside my postgres data warehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model training &amp;amp; evaluation
&lt;/h2&gt;

&lt;p&gt;I used a Scikit-learn &lt;a href="https://www.datacamp.com/tutorial/random-forests-classifier-python"&gt;random forest decision tree&lt;/a&gt; classifier for my predictive model. A random forest creates decision trees on randomly selected data samples, gets a prediction from each tree, and selects the best solution by voting. It also provides a pretty good indicator of feature importance.&lt;/p&gt;

&lt;p&gt;If you look at the &lt;a href="https://github.com/saubury/cat-predict/tree/master/notebooks"&gt;python notebook&lt;/a&gt; you can see the steps taken to assign a class label to inputs, based on the thousands of past observations of time of day, temperature and location it has been trained on.&lt;/p&gt;
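
&lt;p&gt;In outline, the training and evaluation steps look like this; it is a sketch only, and the feature and label column names are stand-ins for the notebook’s actual columns.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: train and evaluate the location classifier.
# Feature and label column names are illustrative stand-ins for the notebook's columns.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('observations.csv')  # hourly summaries exported from the warehouse
features = ['hour_of_day', 'day_of_week', 'temperature', 'humidity', 'is_raining']

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['location'], test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print('accuracy:', clf.score(X_test, y_test))

# Probabilities rather than hard labels: values near 1.0 are confident,
# values near 0.5 mean the model is unsure
print(dict(zip(clf.classes_, clf.predict_proba(X_test[:1])[0])))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;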

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aqu493ES--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2652/1%2A_YIXxul5hKvKUBF8AO2ovA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aqu493ES--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2652/1%2A_YIXxul5hKvKUBF8AO2ovA.png" alt="Python code segment for visualizing feature importance" width="880" height="307"&gt;&lt;/a&gt;&lt;em&gt;Python code segment for visualizing feature importance&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One really cool thing about the Scikit-learn decision tree models is how easy it is to visualise what’s going on. By visualizing the model features (above) I can see that “hour of the day” is the most significant feature in the model.&lt;/p&gt;
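
&lt;p&gt;Getting that chart takes only a couple of lines — a minimal sketch, assuming the fitted model and features from the training step above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
import pandas as pd

# Pair each feature with its importance score from the fitted forest
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().plot.barh(title='Feature importance')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;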

&lt;p&gt;Intuitively this makes sense — time of day is likely to have the most significant impact on where Snowy is likely to be. The second most significant feature in predicting Snowy’s location is outside air temperature. Again this makes sense — too hot or too cold is likely to change whether she wants to be outside. What I found surprising was that the &lt;em&gt;least significant&lt;/em&gt; feature was the is-raining feature. One possible explanation is that the feature only makes sense during daylight hours; is-raining won’t have an effect on the model when Snowy is sleeping inside at night.&lt;/p&gt;

&lt;p&gt;It’s also possible to &lt;a href="https://towardsdatascience.com/how-to-visualize-a-decision-tree-from-a-random-forest-in-python-using-scikit-learn-38ad2d75f21c"&gt;visualize a decision tree&lt;/a&gt; from a random forest in Python using Scikit-Learn.&lt;/p&gt;
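
&lt;p&gt;One way to do this is with scikit-learn’s built-in plot_tree, picking a single estimator out of the forest — again just a sketch, assuming the fitted model from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 8))
plot_tree(model.estimators_[0],           # the first tree in the forest
          feature_names=list(X.columns),
          class_names=[str(c) for c in model.classes_],
          max_depth=2,                    # keep the drawing readable
          filled=True)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;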

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OH0OVw8V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3414/1%2A7KkpmxEWN0GW1sKkV86YtQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OH0OVw8V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3414/1%2A7KkpmxEWN0GW1sKkV86YtQ.png" alt="A visual decision tree showing the hour and day decision points" width="880" height="359"&gt;&lt;/a&gt;&lt;em&gt;A visual decision tree showing the hour and day decision points&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here in my display tree I can see the hour of the day is the initial decision point in the prediction — with 7:00am an interesting part of the algorithm. This is the time when alarm clocks go off in our household — and the cat is motivated to get up and look for food. Another interesting part of the tree is “day of the week ≤ 5.5”. This equates to the day of the week being Monday through Friday — and again this part of the algorithm makes sense, as we (and the cat) generally get up a bit later on weekends.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cat predictor web-app in Streamlit
&lt;/h2&gt;

&lt;p&gt;With the model created, I now wanted to build a web application to predict Snowy’s location based on a range of inputs. &lt;a href="https://docs.streamlit.io/"&gt;Streamlit&lt;/a&gt; is an open-source Python library that makes it easy to create web apps (without me having to learn a bunch of front-end frameworks). I added sliders and selection boxes to set feature values, such as day and temperature.&lt;/p&gt;
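
&lt;p&gt;A stripped-down sketch of what the app looks like — the saved model filename and feature encoding here are placeholders, not the exact app code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import joblib
import pandas as pd
import streamlit as st

model = joblib.load('cat_model.joblib')   # previously trained forest

st.title('Where is Snowy?')
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
hour = st.slider('Hour of day', 0, 23, 9)
day = st.selectbox('Day of week', days)
temperature = st.slider('Outside temperature (°C)', -5, 45, 20)
is_raining = st.checkbox('Is it raining?')

# Encode the inputs the same way the training data was encoded
features = pd.DataFrame([{'hour': hour,
                          'day_of_week': days.index(day),
                          'temperature': temperature,
                          'is_raining': int(is_raining)}])
st.write('Predicted location:', model.predict(features)[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;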

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6L6Io6BU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AQ6taGyYXoITfcBE6qldU0Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6L6Io6BU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AQ6taGyYXoITfcBE6qldU0Q.png" alt="Web application — with inputs as slider controls" width="880" height="355"&gt;&lt;/a&gt;&lt;em&gt;Web application — with inputs as slider controls&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And voila — with a bit more &lt;a href="https://github.com/saubury/cat-predict-app/blob/master/cat_predictor_app.py"&gt;python code&lt;/a&gt; I’ve created a Cat Prediction App; a web-app that predicts the likely location of Snowy the cat. I found some &lt;a href="https://towardsdatascience.com/a-quick-tutorial-on-how-to-deploy-your-streamlit-app-to-heroku-874e1250dadd"&gt;excellent instructions&lt;/a&gt; to deploy my Streamlit app to Heroku. So I can now &lt;a href="https://cat-predict-app.herokuapp.com/"&gt;share my Cat Predictor app&lt;/a&gt; with the world!&lt;/p&gt;

&lt;h2&gt;
  
  
  Links to code
&lt;/h2&gt;

&lt;p&gt;Hope you find this blog and code helpful for all your pet location prediction needs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data platform and ML prediction: &lt;a href="https://github.com/saubury/cat-predict"&gt;https://github.com/saubury/cat-predict&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streamlit App: &lt;a href="https://github.com/saubury/cat-predict-app"&gt;https://github.com/saubury/cat-predict-app&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Can ML predict where my cat is now — part 1</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Thu, 03 Feb 2022 10:00:04 +0000</pubDate>
      <link>https://dev.to/saubury/can-ml-predict-where-my-cat-is-now-part-1-444g</link>
      <guid>https://dev.to/saubury/can-ml-predict-where-my-cat-is-now-part-1-444g</guid>
      <description>&lt;h1&gt;
  
  
  Can ML predict where my cat is now — part 1
&lt;/h1&gt;

&lt;p&gt;It’s 9am on a rainy Tuesday morning — can a simple ML model predict where my cat will be sleeping? How I used a bluetooth tracker, a dozen microcontrollers plus a bit of Python to predict where Snowy the cat would be napping in the next hour.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QN4-ljut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ADHIT-bmgUc1pJ_Pb-Rev7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QN4-ljut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ADHIT-bmgUc1pJ_Pb-Rev7w.png" alt="Predicting where Snowy the cat is likely to be based on time and weather" width="717" height="608"&gt;&lt;/a&gt;&lt;em&gt;Predicting where Snowy the cat is likely to be based on time and weather&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;With some inexpensive hardware (and a cat ambivalent to data privacy concerns) I wanted to see if I could train a machine learning (ML) model to predict where Snowy would go throughout her day.&lt;/p&gt;

&lt;p&gt;Home based location &amp;amp; temperature tracking allowed me to build up an extensive history of which rooms she used for her favourite sleeping spots. I had a theory that with sufficient data collected, I’d be able to train an ML model to predict where the cat was likely to be.&lt;/p&gt;

&lt;p&gt;This two-part blog describes the hardware and software necessary to collect the data, build a prediction model and test the real-world accuracy of cat behaviour estimation. This first blog describes the hardware and data collection, and part 2 describes building the prediction model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware for room level cat tracking
&lt;/h2&gt;

&lt;p&gt;The first task was to collect a &lt;em&gt;lot&lt;/em&gt; of data on where Snowy historically spent her time — along with environmental factors such as temperature &amp;amp; rainfall. I set an arbitrary target of collecting hourly updates for around three months of movement in and around the house.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0DoRVeOh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AMuUAaWKGzRv-cXhKR8vghQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0DoRVeOh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AMuUAaWKGzRv-cXhKR8vghQ.png" alt="Finding the room Snowy is in relies on a base station in each likely location" width="517" height="281"&gt;&lt;/a&gt;&lt;em&gt;Finding the room Snowy is in relies on a base station in each likely location&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Cats aren’t great at data entry, so I needed an automated way of collecting her location. I asked Snowy to wear a &lt;a href="https://www.thetileapp.com/en-us/products"&gt;Tile&lt;/a&gt; — a small, battery powered bluetooth tracker. This simply transmits a regular and unique BLE signal. I then used eight stationary receivers to listen for the BLE Tile signal. These receiver nodes were &lt;a href="https://en.wikipedia.org/wiki/ESP32"&gt;ESP32&lt;/a&gt;-based presence detection nodes, each running &lt;a href="https://espresense.com/"&gt;ESPresense&lt;/a&gt;. The nodes were placed in named rooms in and around the house (6 inside, 2 outside).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jGj1wolm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A6rkhPumX6uEOfHcp9unK3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jGj1wolm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A6rkhPumX6uEOfHcp9unK3w.png" alt="A collection of ESP32 modules and a BLE Tile (white square)" width="880" height="606"&gt;&lt;/a&gt;&lt;em&gt;A collection of ESP32 modules and a BLE Tile (white square)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each node is constantly looking for the unique BLE signal of Snowy’s tile and measuring the received signal strength indicator (&lt;a href="https://en.wikipedia.org/wiki/Received_signal_strength_indication"&gt;RSSI&lt;/a&gt;). The stronger the signal, the closer Snowy is to that beacon (either that, or she’s messing with the battery). If I got a few seconds of strong signal next to the study sensor, for example, I could assume Snowy was likely very close to that room.&lt;/p&gt;

&lt;p&gt;Each ESP32 module is powered by a micro-USB power supply and communicates back to the base station over the home WiFi network. The networking is important, as multiple receivers can simultaneously receive a signal — and I need to determine which base station heard the “strongest” signal.&lt;/p&gt;
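
&lt;p&gt;Conceptually, the “which room?” decision is just picking the node with the strongest (least negative) RSSI. A toy sketch with made-up readings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# RSSI (in dBm) reported by each room's node for Snowy's tile
readings = {
    'study': -62,
    'kitchen': -78,
    'garden': -90,
}

# The strongest (least negative) signal wins
current_room = max(readings, key=readings.get)
print(current_room)   # 'study'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In practice ESPresense and Home Assistant handle this comparison for me — but it’s essentially this max-over-nodes logic.&lt;/p&gt;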

&lt;h2&gt;
  
  
  Hardware for logging environmental information
&lt;/h2&gt;

&lt;p&gt;Snowy avoids the outside garden when it rains, and tends to fall asleep in the warm (but not hot) rooms of the house. I wanted to collect environmental conditions, as I figured temperature and rainfall would play a significant role in determining where Snowy would hang out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z6CCfHlL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2176/1%2AY8SmOS0v8kIWJwclzCjP0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z6CCfHlL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2176/1%2AY8SmOS0v8kIWJwclzCjP0w.png" alt="Xiaomi Aqara Temperature and Humidity Sensors" width="880" height="663"&gt;&lt;/a&gt;&lt;em&gt;Xiaomi Aqara Temperature and Humidity Sensors&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I selected the &lt;a href="https://www.xiaomiproducts.nl/fr/xiaomi-aqara-temperature-and-humidity-sensor.html"&gt;Xiaomi Temperature and Humidity Sensor&lt;/a&gt; as they run for months on a battery, and communicate over large distances via the &lt;a href="https://en.wikipedia.org/wiki/Zigbee"&gt;Zigbee&lt;/a&gt; wireless mesh network. I placed these sensors throughout the house and in two external locations to capture outside conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration — building a data collection platform
&lt;/h2&gt;

&lt;p&gt;For the data collection platform I used &lt;a href="https://www.home-assistant.io/"&gt;Home Assistant&lt;/a&gt; running on a Raspberry Pi. Home Assistant is a free and open-source software for home automation that is designed to be the central control system for smart home devices. I was able to track Snowy’s location via the &lt;a href="https://espresense.com/home_assistant"&gt;binary sensor&lt;/a&gt; configuration. Essentially the room based beacon receiving the strongest signal from Snowy’s BLE tile updates an MQTT topic with her current location.&lt;/p&gt;
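
&lt;p&gt;To give an idea of that plumbing, here’s a small paho-mqtt sketch that watches a location topic — the broker address and topic name are assumptions for illustration, not my exact setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # Payload is the name of the room with the strongest signal
    print(f'Snowy is now in: {msg.payload.decode()}')

client = mqtt.Client()   # paho-mqtt 1.x style client
client.on_message = on_message
client.connect('homeassistant.local', 1883)
client.subscribe('espresense/devices/snowy-tile/#')
client.loop_forever()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;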

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3Un8Tq0n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AW24hzFd2yHwFAmrtDXB1vQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3Un8Tq0n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AW24hzFd2yHwFAmrtDXB1vQ.png" alt="Home Assistant display of location" width="415" height="301"&gt;&lt;/a&gt;&lt;em&gt;Home Assistant display of location&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For temperature and humidity measurements, I used the Xiaomi integration to get a constant update of room level environment conditions. (Worthy of another blog: TL;DR flash the Xiaomi Zigbee hub with &lt;a href="https://tasmota.github.io/docs/Zigbee/"&gt;Tasmota&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6329VJLr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A6OGE5cb1lgUh68z55t-bPQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6329VJLr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A6OGE5cb1lgUh68z55t-bPQ.png" alt="Home Assistant display of temperature and humidity" width="374" height="494"&gt;&lt;/a&gt;&lt;em&gt;Home Assistant display of temperature and humidity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The TL;DR summary — with the support of the amazing Home Assistant and Tasmota community I was able to gather accurate cat location along with fine grained temperature and humidity readings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data preparation — extracting data from Home Assistant
&lt;/h2&gt;

&lt;p&gt;Home Assistant by default uses a SQLite database with a 10-day retention. I wanted to retain a lot more historic data to train the model, so by modifying the &lt;a href="https://www.home-assistant.io/integrations/recorder/"&gt;recorder integration&lt;/a&gt; I pushed all the data storage into a Postgres database with 6 months of retention.&lt;/p&gt;

&lt;p&gt;Home Assistant stores each sensor update in the &lt;a href="https://www.home-assistant.io/docs/backend/database/"&gt;states&lt;/a&gt; table. This is &lt;em&gt;really&lt;/em&gt; fine-grained, with updates added every few seconds from all the sensors (in my case, around 18,000 sensor updates a day). My goal was to summarise the data into hourly updates — essentially a single (most prevalent) location, along with temperature and humidity readings.&lt;/p&gt;

&lt;p&gt;I extracted the initial three months (SQL &lt;a href="https://github.com/saubury/cat-predictor/blob/master/sql/extract.sql"&gt;here&lt;/a&gt;) of hourly location and environmental conditions to train the model.&lt;/p&gt;
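
&lt;p&gt;For a feel of what that extraction does, here is a rough pandas equivalent of the hourly roll-up — the column names match the Home Assistant states schema, but the entity name and connection string are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@homeassistant/ha_db')
states = pd.read_sql('select entity_id, state, last_updated from states',
                     engine, parse_dates=['last_updated'])

# Keep just the location sensor, then roll up to the most
# prevalent room for each hour
locations = states[states['entity_id'] == 'sensor.snowy_location']
hourly = (locations.set_index('last_updated')
                   .resample('1H')['state']
                   .agg(lambda s: s.mode().iat[0] if not s.mode().empty else None))
print(hourly.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;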

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gf7icthe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2920/1%2AF5dvGQdB4gCUW_8LUkvhsA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gf7icthe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2920/1%2AF5dvGQdB4gCUW_8LUkvhsA.png" alt="Extract of hourly location and environmental readings" width="880" height="294"&gt;&lt;/a&gt;&lt;em&gt;Extract of hourly location and environmental readings&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next in part 2
&lt;/h2&gt;

&lt;p&gt;This first blog described the method for locating Snowy and the data collection platform. The next blog will describe building the prediction model, and how accurate an ML model can be at determining where a cat is likely to be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b6odCiUS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AoWgiVe16mX2E_yAWeW67mw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b6odCiUS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AoWgiVe16mX2E_yAWeW67mw.png" alt="Snowy looking forward to reviewing the confusion matrix" width="601" height="493"&gt;&lt;/a&gt;&lt;em&gt;Snowy looking forward to reviewing the confusion matrix&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Cleaning messy sensor data in Kafka with ksqlDB</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Mon, 26 Jul 2021 04:31:22 +0000</pubDate>
      <link>https://dev.to/saubury/cleaning-messy-sensor-data-in-kafka-with-ksqldb-2181</link>
      <guid>https://dev.to/saubury/cleaning-messy-sensor-data-in-kafka-with-ksqldb-2181</guid>
      <description>&lt;h1&gt;
  
  
  Cleaning messy sensor data in Kafka with ksqlDB
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Collecting streaming data is easy — but understanding it is much harder! With months of 🐱 weight data captured, I discovered sensor data can be very messy ⚡. Let me share some of the real-world data problems I encountered — and how I solved my stream processing &amp;amp; cat dining challenges with ksqlDB.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aIZbgGPF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/1%2ArcHo0yBkc7zV5TQEMWl2lg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aIZbgGPF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/1%2ArcHo0yBkc7zV5TQEMWl2lg.png" alt="Snowy the cat — an expert in streaming data"&gt;&lt;/a&gt;&lt;em&gt;Snowy the cat — an expert in streaming data&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You may have seen me &lt;a href="https://simon-aubury.medium.com/snowys-eating-tweeting-my-cats-weight-dining-habits-with-a-raspberry-pi-3218e340c20c"&gt;tweeting my cat’s weight &amp;amp; dining habits with a Raspberry Pi&lt;/a&gt;. This small project captures the weight of both my cat and her food bowl, and places these measurements into a Kafka topic about every second.&lt;/p&gt;

&lt;p&gt;These weight measurements are pretty messy — and Snowy 😺 doesn’t want to stand still on the scale! She has been causing havoc with the measurement data by interfering with both the food scale and cat scale when eating. The four problems I needed to overcome:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data type casting — forming strings into timestamps&lt;/li&gt;
&lt;li&gt;Windowing — finding the start and end of dining sessions&lt;/li&gt;
&lt;li&gt;Resequencing — handling refilling of food tray during the day&lt;/li&gt;
&lt;li&gt;Discarding — weight mismatch when the cat stands on the food scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fortunately the time series measurements are captured in an infinite event stream in Kafka. This gives me a chance to reprocess the stream to make sense of her weight and dining habits. Let me share some of the sensor data problems I encountered — and how I solved my stream processing cat dining challenges with ksqlDB.&lt;/p&gt;

&lt;h1&gt;
  
  
  Context — getting weight data into Kafka
&lt;/h1&gt;

&lt;p&gt;As described earlier, the Raspberry Pi is constantly measuring the weight of the food and the cat using load cell weight sensors. The &lt;a href="https://github.com/saubury/catfit"&gt;catfit&lt;/a&gt; project is written in Python, so the code to write weight measurements looks something like the code below. TL;DR weight measurements are written to the feed_log Kafka topic roughly every second with cat and food values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from confluent_kafka import Producer, KafkaError
import json

# Kafka Producer
producer_conf = {
'bootstrap.servers': config.bootstrap_servers, 
'sasl.username': config.sasl_username, 
'sasl.password': config.sasl_password,
'security.protocol': 'SASL_SSL', 
'sasl.mechanisms': 'PLAIN'
}

producer = Producer(producer_conf)
# Regularly save the cat &amp;amp; food weight measurement

producer.produce('feed_log',  value=json.dumps({"event_date": event_date.strftime("%d/%m/%Y %H:%M:%S"), "cat_weight": cat_weight, "food_weight": food_weight}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’m using &lt;a href="https://www.confluent.io/confluent-cloud/"&gt;Confluent Cloud&lt;/a&gt; as a fully managed Kafka service; but the code would look almost identical if you were connecting to another Kafka cluster.&lt;/p&gt;

&lt;h1&gt;
  
  
  Kafka Event stream processing with ksqlDB
&lt;/h1&gt;

&lt;p&gt;Now that I have a stream of time-based measurements flowing into Kafka, I want to understand Snowy’s eating habits. I could dump the feeding events into a database and then perform some analysis. A better approach is &lt;a href="https://en.wikipedia.org/wiki/Stream_processing"&gt;stream processing&lt;/a&gt; — which means I can write an application to respond to new data events at the moment they occur. I’m a big fan of &lt;a href="https://ksqldb.io/"&gt;ksqlDB&lt;/a&gt; as a platform to create event streaming applications. I can perform stream processing data clean-up with a few lines of SQL (as opposed to writing a stream processor in Scala or Java).&lt;/p&gt;

&lt;h1&gt;
  
  
  Create Stream
&lt;/h1&gt;

&lt;p&gt;First job is to create a KSQL &lt;a href="https://docs.ksqldb.io/en/latest/concepts/streams/"&gt;stream&lt;/a&gt; describing the Kafka feed_log topic. A KSQL stream is a simple way to describe the contents of a Kafka topic. It’s pretty self-describing and looks something like …&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE STREAM feed_stream_raw 
(event_date varchar, cat_weight double, food_weight double) 
WITH (kafka_topic='feed_log', value_format='json');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I created the project I had (foolishly) placed the timestamp of the events in a text field called event_date. In hindsight this was a bit silly, but I had already collected over a month of data — so I needed to convert it to a timestamp datatype. I got a little fancy and created a second stream called weight_stream with the converted datatype, and set the TIMESTAMP property to the expected value. I also set the serialization format to &lt;a href="https://docs.ksqldb.io/en/latest/reference/serialization/#avro"&gt;AVRO&lt;/a&gt;. Not only does this neatly register everything in a &lt;a href="https://docs.confluent.io/platform/current/schema-registry/index.html"&gt;schema registry&lt;/a&gt;, it also makes me feel like I’m &lt;a href="https://en.wikipedia.org/wiki/Apache_Avro#Logo"&gt;building an airplane&lt;/a&gt; in my garage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create stream weight_stream
with (timestamp='event_ts', value_format='avro') 
as 
select 'snowy' as cat_name
, stringtotimestamp(event_date, 'dd/MM/yyyy HH:mm:ss') as event_ts
, event_date
, cat_weight
, food_weight 
from feed_stream_raw;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Challenge 1 - Find the start and end of dining sessions
&lt;/h1&gt;

&lt;p&gt;I want to find periods of time when Snowy is eating. If I were ambitious I’d train her to press a button at the end of her meal. Instead I’ll settle for using some stream processing magic to work out her dining sessions.&lt;/p&gt;

&lt;p&gt;I’ll define an eating session as when Snowy is on the cat scale. That is, if the cat scale is indicating a 5.8kg fluffy mass is on the scale, I can assume some eating is underway. When the weight (of the cat scale) drops to zero I can assume she has stopped eating. A zero weight means she has wandered off to have a nap somewhere.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Kb1OxNIf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/568/1%2Ak18PyQ3kmSNeYie-OOrDkQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Kb1OxNIf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/568/1%2Ak18PyQ3kmSNeYie-OOrDkQ.png" alt="Eating windows — jumping from one session to another"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Eating windows — jumping from one session to another&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Events (such as periodic weight measurements) are simply a stream of numbers landing in a Kafka topic. I can use a windowing state store to aggregate all the records received so far within the defined window boundary. The window closes when a sufficient gap of inactivity (say 1 minute) has passed with near zero weight measurements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KwERAzKL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/700/1%2Ar4qOHG0_O6uenist300txg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KwERAzKL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/700/1%2Ar4qOHG0_O6uenist300txg.gif" alt="ksqlDB Session Windows"&gt;&lt;/a&gt;&lt;em&gt;ksqlDB Session Windows&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A windowing state store is used to store the aggregation results per window — and allows me to determine how much food is consumed within each eating session. Here’s a quick idea of how to write a &lt;a href="https://docs.ksqldb.io/en/latest/concepts/time-and-windows-in-ksqldb-queries/#session-window"&gt;time session window query&lt;/a&gt; in KSQL. This code is finding a period of activity separated by (at least) a 60 second gap of inactivity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select avg(cat_weight) as cat_weight_avg
, max(food_weight) as food_weight_max
from  weight_stream 
window session (60 seconds) 
group by cat_name
emit changes;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PQQSQElg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/409/1%2AWGjtIrC_uC39BPtVLmfa-Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PQQSQElg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/409/1%2AWGjtIrC_uC39BPtVLmfa-Q.png" alt="Two session windows"&gt;&lt;/a&gt;&lt;em&gt;Two session windows&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Challenge 2 — Ignore refilling of the food
&lt;/h1&gt;

&lt;p&gt;A practical problem with the monitoring station: I need to refill 🍴 the food tray. This means the food weight can increase throughout the day, at fairly random times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CEuuxLJK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/577/1%2Aqje_Jycox-v-VazdZGNT1Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CEuuxLJK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/577/1%2Aqje_Jycox-v-VazdZGNT1Q.png" alt="An eating session spanning almost 3 minutes"&gt;&lt;/a&gt;&lt;em&gt;An eating session spanning almost 3 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Fortunately I can ignore the actual weight of the food, as I really only care about the change of food weight during an eating window. As long as I know the starting food weight and final food weight during an eating window, I know how much Snowy has eaten during that meal. I can find the difference in food weight within a window like this …&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select ...
, max(food_weight) - min(food_weight) as food_eaten
from  weight_stream
window ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Challenge 3— Stepping on the wrong scale
&lt;/h1&gt;

&lt;p&gt;There are two independent scales — but there are times when Snowy leaves her sensor plate and places her weight on the food scale. Suddenly her measured weight will drop (by around 1kg) and the weight of the food will seem to increase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--64eDNCd---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/623/1%2AvJBGFM_DryCSaru0tKeo9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--64eDNCd---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/623/1%2AvJBGFM_DryCSaru0tKeo9g.png" alt="Data mismatch — cat stepping on food plate"&gt;&lt;/a&gt;&lt;em&gt;Data mismatch — cat stepping on food plate&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I initially tried to account for this momentary difference by calculating the shifted mass. But after a bit of tinkering I decided it was easier to simply ignore these spurious events with some boundary checks. To only keep the “sensible” events I can use a predicate like this …&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select ..
from  weight_stream
where food_weight &amp;lt; 1100 and cat_weight &amp;gt; 5800 and cat_weight &amp;lt; 6200
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Putting it all together
&lt;/h1&gt;

&lt;p&gt;I now have a stream of events, grouped into eating session windows. I’ve excluded the rogue data and calculated the food consumed for each meal. I can materialize the result into a KSQL table, which represents a snapshot (a point in time) of how much food was eaten within a window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table cat_weight_table 
as 
select avg(cat_weight) as cat_weight_avg
, max(food_weight) - min(food_weight) as food_eaten
from  weight_stream 
window session (600 seconds) 
where food_weight &amp;lt; 1100 and cat_weight &amp;gt; 5800 and cat_weight &amp;lt; 6200
group by cat_name
having count(*) &amp;gt; 4;

select  timestamptostring(windowstart, 'dd/MM/yyyy HH:mm:ss') 
, timestamptostring(windowend, 'dd/MM/yyyy HH:mm:ss') 
, (windowend-windowstart) / 1000 as eat_seconds
, round(cat_weight_avg) as cat_weight_grams
, round(food_weight_max - food_weight_min) as food_eaten_grams
, cnt 
from cat_weight_table 
where cat_name = 'snowy';
Which results in a projection with clear dining sessions, the duration and weight of feed eaten.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--obApY53z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/1%2A7C88u58UR3Z6YkBRwuoWAQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--obApY53z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/1%2A7C88u58UR3Z6YkBRwuoWAQ.png" alt="Eating sessions — with elapsed time and food eaten"&gt;&lt;/a&gt;&lt;em&gt;Eating sessions — with elapsed time and food eaten&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;An uncooperative cat plus unexpected load cell measuring inaccuracies meant I had a lot of messy data. A bit of stream processing with ksqlDB makes it much easier to work out when your cat needs to go on a diet.&lt;br&gt;
Feel free to download and try out this project yourself (with data) — &lt;a href="https://github.com/saubury/catfit/blob/master/ksqldb/ksqldb.md"&gt;catfit (github.com)&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
