<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Simon Aubury</title>
    <description>The latest articles on DEV Community by Simon Aubury (@saubury).</description>
    <link>https://dev.to/saubury</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F366962%2Fc20e27f1-5dc6-4a9a-b509-ef106bf30d32.jpeg</url>
      <title>DEV Community: Simon Aubury</title>
      <link>https://dev.to/saubury</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saubury"/>
    <language>en</language>
    <item>
      <title>When plans change at 500 feet: Complex event processing of ADS-B aviation data with Apache Flink</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Mon, 16 Jun 2025 09:56:24 +0000</pubDate>
      <link>https://dev.to/saubury/when-plans-change-at-500-feet-complex-event-processing-of-ads-b-aviation-data-with-apache-flink-56g6</link>
      <guid>https://dev.to/saubury/when-plans-change-at-500-feet-complex-event-processing-of-ads-b-aviation-data-with-apache-flink-56g6</guid>
      <description>&lt;h1&gt;
  
  
  When plans change at 500 feet: Complex event processing of ADS-B aviation data with Apache Flink
&lt;/h1&gt;

&lt;p&gt;Using open-source Apache Flink stream processing to analyse real-time aviation data and find missed approaches and paired runway landings. With some neat Flink SQL and custom functions, it’s possible to spot those rare times when planes pair in the sky or abort a landing at the last moment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Project code available on &lt;a href="https://github.com/saubury/plane_track" rel="noopener noreferrer"&gt;https://github.com/saubury/plane_track&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpisahc0nyqq80sjqcb9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpisahc0nyqq80sjqcb9q.png" alt="Finding missed landing approaches and paired runway landings" width="800" height="533"&gt;&lt;/a&gt;&lt;em&gt;Finding missed landing approaches and paired runway landings&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Aircraft determine their position using GPS, and periodically transmit that position along with an aircraft identity string, altitude, speed and more as &lt;a href="https://en.wikipedia.org/wiki/Automatic_Dependent_Surveillance%E2%80%93Broadcast" rel="noopener noreferrer"&gt;ADS-B signals&lt;/a&gt;. These signals are transmitted in clear text — and can be readily received with a small radio receiver. The event stream of data around a local airport is a fascinating source of data for complex event processing.&lt;/p&gt;

&lt;p&gt;I wanted to see if I could determine when infrequent but noteworthy aviation situations occur at my local airport:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Missed approach (or go-around) during aircraft landing — an uncommon manoeuvre where a pilot discontinues the final approach to the runway and climbs away from the airport for another attempt at the landing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Paired flight landings where aircraft land (or takeoff) on parallel runways. I was especially interested in the golden photographic moments when the same commercial aircraft type were flying in close formation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Acquiring the flight data
&lt;/h2&gt;

&lt;p&gt;My first attempt at acquiring aircraft transponder messages (ADS-B signals) was with a Raspberry Pi and an &lt;a href="https://www.rtl-sdr.com/about-rtl-sdr/" rel="noopener noreferrer"&gt;RTL2832U&lt;/a&gt; — a USB dongle originally sold to watch digital TV on a computer. This approach (detailed &lt;a href="https://simonaubury.com/posts/201805_usingksqlapachekafkafindtheplanethatwakessnowy/" rel="noopener noreferrer"&gt;here&lt;/a&gt;) was only partially successful. Although I got a rich feed of data as planes flew over my house — I was too far from the airport to receive transmissions when the aircraft were on their final descent into my local airport.&lt;/p&gt;

&lt;p&gt;I then discovered &lt;a href="https://adsb.fi/" rel="noopener noreferrer"&gt;adsb.fi&lt;/a&gt; — a community-driven flight tracker project with a free real-time API for personal projects. Their &lt;a href="https://github.com/adsbfi/opendata/tree/main?tab=readme-ov-file#public-endpoints" rel="noopener noreferrer"&gt;API&lt;/a&gt; returns aircraft transponder messages within a nominated radius of a specified location point. You can get a glimpse of flights over Sydney with a curl command like this&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --silent https://opendata.adsb.fi/api/v2/lat/-33.9401302/lon/151.175371/dist/5 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For me this was an ideal way of receiving live flight location, track and altitude for aircraft within 5 nautical miles of Sydney airport.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9rcy4xrvcdm1mtia90q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9rcy4xrvcdm1mtia90q.png" alt="Sample of aircraft transponder messages (ADS-B signals)" width="246" height="218"&gt;&lt;/a&gt;&lt;em&gt;Sample of aircraft transponder messages (ADS-B signals)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OK, now that I’ve got a feed of data, I need to analyse it to find some interesting flight events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing the flight data stream
&lt;/h2&gt;

&lt;p&gt;I wrote a &lt;a href="https://github.com/saubury/plane_track/blob/main/monitor_opendata.py" rel="noopener noreferrer"&gt;Python-based aircraft monitor&lt;/a&gt; which polls the adsb.fi feed for aircraft transponder messages, and publishes each location update as a new event into an Apache Kafka topic. I used &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; — and more specifically &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/dev/table/sql/overview/" rel="noopener noreferrer"&gt;Flink SQL&lt;/a&gt; — to transform and analyse my flight data. The TL;DR summary: I can write SQL for my real-time data processing queries — and get the scalability, fault tolerance and low latency managed by the Flink runtime.&lt;/p&gt;
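&lt;p&gt;The monitor is roughly the loop below. This is a minimal sketch rather than the real code (which lives in the repo above): it assumes the &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;confluent_kafka&lt;/code&gt; libraries, a local Kafka broker, and that the adsb.fi response carries aircraft records in an &lt;code&gt;ac&lt;/code&gt; array keyed by ICAO &lt;code&gt;hex&lt;/code&gt; code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import time

import requests
from confluent_kafka import Producer

# Assumed settings: adjust the broker and location to your own setup
URL = 'https://opendata.adsb.fi/api/v2/lat/-33.9401302/lon/151.175371/dist/5'
producer = Producer({'bootstrap.servers': 'localhost:9092'})

while True:
    resp = requests.get(URL, timeout=10)
    resp.raise_for_status()
    # 'ac' is assumed to hold one record per aircraft currently in range
    for aircraft in resp.json().get('ac', []):
        # Key by ICAO hex code so updates for one aircraft stay in order
        producer.produce('flight', key=aircraft.get('hex'),
                         value=json.dumps(aircraft))
    producer.flush()
    time.sleep(5)  # be polite to the free community API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;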

&lt;h2&gt;
  
  
  Identifying missed approach landings
&lt;/h2&gt;

&lt;p&gt;A missed approach is a standard procedure where a pilot discontinues the final approach to landing and climbs away from the runway, typically due to poor visibility, an unstable approach or an unsafe runway condition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy25c8psxioe42ayyaq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiy25c8psxioe42ayyaq7.png" alt="Google Earth mapping of missed approach landing" width="800" height="403"&gt;&lt;/a&gt;&lt;em&gt;Google Earth mapping of missed approach landing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Missed approaches are a routine, well-practised procedure, but still relatively uncommon in normal operations. I couldn’t find accurate statistics, but at busy airports (especially in poor weather) roughly 1–3% of approaches might end in a missed approach.&lt;/p&gt;

&lt;p&gt;I needed to define a query for missed approach detection using go-around-like patterns in flight altitude data. I wanted a Flink SQL statement to find a sequence where an aircraft &lt;em&gt;descends&lt;/em&gt;, &lt;em&gt;lands or nearly lands&lt;/em&gt;, then &lt;em&gt;climbs again&lt;/em&gt; to reach a minimum safe altitude. An example series of altitude measurements could be graphed over time like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AhA4QiATXxka-k7Ag" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2800%2F0%2AhA4QiATXxka-k7Ag" alt="Missed approach series of altitude measurements" width="1969" height="980"&gt;&lt;/a&gt;&lt;em&gt;Missed approach series of altitude measurements&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I classified an &lt;em&gt;is_descending&lt;/em&gt; phase as seeing 5 consecutive decreasing altitude values, followed by an &lt;em&gt;is_ground&lt;/em&gt; event of descending below 800 ft, followed by an &lt;em&gt;is_ascending&lt;/em&gt; event.&lt;/p&gt;
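&lt;p&gt;Before reaching for Flink, the rule is easier to see in plain Python. This is only an illustrative reference implementation of the classification above (five consecutive drops, a sub-800 ft reading, then a climb back above 1,000 ft), not the production query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def is_missed_approach(altitudes, ground_alt=800, min_safe_alt=1000):
    """Illustrative check for the descend / near-ground / climb pattern."""
    for i in range(len(altitudes) - 5):
        window = altitudes[i:i + 6]
        if not all(a &amp;gt; b for a, b in zip(window, window[1:])):
            continue  # need 5 consecutive decreasing readings
        rest = altitudes[i + 5:]
        for j, alt in enumerate(rest):
            if alt &amp;lt;= ground_alt:
                # a climb after the near-ground reading, back above the minimum
                climb = rest[j:]
                if any(b &amp;gt; a for a, b in zip(climb, climb[1:])) and max(climb) &amp;gt; min_safe_alt:
                    return True
    return False

print(is_missed_approach([3000, 2400, 1800, 1200, 900, 700, 1500, 2500]))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;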

&lt;p&gt;My final &lt;a href="https://github.com/saubury/plane_track/blob/main/README.md#find-missed-approaches" rel="noopener noreferrer"&gt;Flink SQL&lt;/a&gt; query uses &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/sql/queries/match_recognize/" rel="noopener noreferrer"&gt;&lt;strong&gt;MATCH_RECOGNIZE&lt;/strong&gt;&lt;/a&gt;, a powerful pattern recognition feature in Flink SQL for complex event processing. It identifies specific flight altitude patterns in a stream of aircraft data, partitioned by callsign (i.e., individual aircraft).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM flight
MATCH_RECOGNIZE(
    PARTITION BY callsign
    ORDER BY proc_time
    MEASURES
        IS_DESCENDING.flightts as desc_UTC,
        IS_GROUND.flightts as ground_UTC,
        IS_ASCENDING.flightts AS asc_UTC,
        IS_ABOVE_MIN.flightts AS abvm_UTC,
        IS_GROUND.altitude AS grd_altitude,
        IS_ASCENDING.altitude AS asc_altitude
    ONE ROW PER MATCH
    AFTER MATCH SKIP TO LAST IS_ASCENDING
    PATTERN (IS_DESCENDING{5,} IS_GROUND{1,} IS_ASCENDING IS_ABOVE_MIN)
    DEFINE
        IS_DESCENDING AS (LAST(altitude, 1) IS NULL AND altitude &amp;gt;= 1000) OR altitude &amp;lt; LAST(altitude, 1),
        IS_GROUND AS altitude &amp;lt;= 800,
        IS_ASCENDING AS altitude &amp;gt; last(altitude,1),
        IS_ABOVE_MIN AS altitude &amp;gt; 1000
) AS T
where TIMESTAMPDIFF(second, desc_UTC, asc_UTC) between 0 and 1000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When I originally wrote this query I got a number of false positives where planes would land, then take off a few hours later with the same flight code. To exclude these conditions I added an additional predicate to only return matches where the time between the descent and the subsequent ascent is within 1000 seconds.&lt;/p&gt;

&lt;p&gt;With my query running it actually took a few days to identify the first missed approach. On a particularly stormy morning I managed to identify three occasions when a go-around was performed — and validated the result by looking up each flight’s historic path with &lt;a href="https://www.flightradar24.com/" rel="noopener noreferrer"&gt;FlightRadar24&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr48zy9z0475rrnyyxhj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr48zy9z0475rrnyyxhj.png" alt="A flight identified as missed approach landing" width="677" height="153"&gt;&lt;/a&gt;&lt;em&gt;A flight identified as missed approach landing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With my data acquisition successfully finding missed approaches, I wanted to move on to more complex event processing — this time with multiple aircraft events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying twin landings
&lt;/h2&gt;

&lt;p&gt;Paired flight landings occur when aircraft land on parallel runways.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhhxhqu9u4nalvdymawg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhhxhqu9u4nalvdymawg.gif" width="326" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As I wanted to determine the distance between aircraft I used a &lt;a href="https://docs.confluent.io/cloud/current/flink/how-to-guides/create-udf.html" rel="noopener noreferrer"&gt;user-defined function&lt;/a&gt; (UDF) to extend the capabilities of Apache Flink to implement custom logic beyond what is supported by built-in SQL functions. By adding a &lt;a href="https://github.com/saubury/plane_track/blob/main/java/example/Distance.java" rel="noopener noreferrer"&gt;distance scalar&lt;/a&gt; Java function I could calculate distance between two aircraft.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // Equirectangular approximation to calculate distance in km between two points 
    public float eval(float lat1, float lon1, float lat2, float lon2) {
        float EARTH_RADIUS = 6371;
        float lat1Rad = (float) Math.toRadians(lat1);
        float lat2Rad = (float) Math.toRadians(lat2);
        float lon1Rad = (float) Math.toRadians(lon1);
        float lon2Rad = (float) Math.toRadians(lon2);

        float x = (float) ((lon2Rad - lon1Rad) * Math.cos((lat1Rad + lat2Rad) / 2));
        float y = (lat2Rad - lat1Rad);
        float distance = (float) (Math.sqrt(x * x + y * y) * EARTH_RADIUS);

        return distance;
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Refer to the &lt;a href="https://github.com/saubury/plane_track?tab=readme-ov-file#flink-udf" rel="noopener noreferrer"&gt;readme&lt;/a&gt; for the JAR build steps and the operations to add it to Flink. The short summary: compile the JAR with mvn clean package and then register the UDF in Flink with&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ADD JAR '/target-jars/udf_example-1.0.jar';

CREATE FUNCTION distancekm  AS 'com.example.my.Distance';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;OK — with my Flink distance UDF available I can run a query that finds pairs of flights that were geographically close (within 1.5 km) to each other during overlapping or near-overlapping times (within 20 seconds), and reports their callsigns and distance.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT f1.callsign AS f1, 
f2.callsign AS f2,
CAST(ROUND(distancekm(f1.latitude , f1.longtitude, f2.latitude, f2.longtitude), 1) AS VARCHAR) as km
FROM flight f1, flight f2
WHERE f1.flightts BETWEEN f2.flightts - interval '20' SECOND AND f2.flightts
AND f1.callsign &amp;lt; f2.callsign
AND distancekm(f1.latitude , f1.longtitude, f2.latitude, f2.longtitude) &amp;lt; 1.5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The query identifies flights that came close together in both space and time and reports the distance between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4rw110hdqsi9xnuf27f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4rw110hdqsi9xnuf27f.png" alt="Nearby flights" width="664" height="122"&gt;&lt;/a&gt;&lt;em&gt;Nearby flights&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This works — but it is showing &lt;em&gt;any&lt;/em&gt; paired aircraft movement. What I really wanted to find was the less common occurrence of the same aircraft type (such as two Boeing 737s) flying in formation. I need a bit more data …&lt;/p&gt;

&lt;h2&gt;
  
  
  Annotating ADS-B messages with aircraft type and routes
&lt;/h2&gt;

&lt;p&gt;Peering into the ADS-B messages, I have raw payloads from the aircraft. Each payload comes with an ICAO 24-bit &lt;a href="https://en.wikipedia.org/wiki/Transponder_(aviation)" rel="noopener noreferrer"&gt;transponder&lt;/a&gt; code uniquely assigned to each aircraft (e.g. 7c7a3d) and a flight route code (e.g. 7VOZ518). What I want to do is load a static reference data set to map&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Aircraft ICAO codes such as 7c7a3d to airframes such as a Boeing 737NG 8FE&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flight codes such as 7VOZ518 to a route from the Gold Coast (OOL) to Sydney (SYD)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A very convenient capability of Flink SQL is the ability to create a table directly from a CSV file. So I can populate the aircraft_lookup table with a command like this&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE aircraft_lookup (
    icao24  varchar(100) not null,
    country  varchar(100),
    manufacturerName varchar(100),
    model varchar(100),
    owner varchar(100),
    registration varchar(100),
    typecode varchar(100)
) WITH ( 
    'connector' = 'filesystem',
    'path' = '/data_csv/aircraft_lookup.csv',
    'format' = 'csv'  
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I downloaded aircraft data from the &lt;a href="https://opensky-network.org/datasets/metadata/#metadata/" rel="noopener noreferrer"&gt;Opensky data&lt;/a&gt; archive. With aircraft_lookup and route_lookup data loaded, I created a Flink view to supplement the data coming in from the flight Kafka topic&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE VIEW flight_decorated
AS
SELECT f.*, a.model, a.owner, a.typecode, r.route
FROM flight f 
LEFT JOIN aircraft_lookup a ON (f.icao = a.icao24)
LEFT JOIN route_lookup r ON (f.callsign = r.flight);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With data loaded and the flight feed decorated with aircraft type and route information I can now search for the perfect photographic moment of twin planes landing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Twin landings
&lt;/h2&gt;

&lt;p&gt;I now have a live feed of data with aircraft location, airframe type and route information. Along with my distance function I can query the stream to find the golden photographic moments when the same commercial aircraft type were flying in close formation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT f1.flightts,
f1.callsign || ' ('  || COALESCE(f1.route, '-') ||')' || ' ' || f1.typecode AS f1,
CAST(ROUND(DISTANCEKM(f1.latitude , f1.longtitude, f2.latitude, f2.longtitude), 1) AS VARCHAR) AS km,
f2.callsign || ' ('  || COALESCE(f2.route, '-') ||')' || ' ' || f2.typecode AS f2
FROM flight_decorated f1, flight_decorated f2
WHERE f1.flightts BETWEEN f2.flightts - interval '20' SECOND AND f2.flightts
AND f1.callsign &amp;lt; f2.callsign
AND f1.typecode = f2.typecode
AND DISTANCEKM(f1.latitude , f1.longtitude, f2.latitude, f2.longtitude) &amp;lt; 1.5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Which indeed finds the moment when two similarly typed aircraft land together&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zc2yw2rcbd57hsc5hby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zc2yw2rcbd57hsc5hby.png" alt="Two Boing 737’s landing on parallel runways" width="533" height="407"&gt;&lt;/a&gt;&lt;em&gt;Two Boing 737’s landing on parallel runways&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;This project was a fun exercise — and shows how Apache Flink can turn a stream of aircraft transponder pings into a hunt for interesting aviation moments like go-arounds and perfect twin landings.&lt;/p&gt;

&lt;p&gt;With some neat Flink SQL and custom functions, it’s possible to spot those rare times when planes pair in the sky or wave off a landing at the last second.&lt;/p&gt;

&lt;p&gt;✈️ Project code available on &lt;a href="https://github.com/saubury/plane_track" rel="noopener noreferrer"&gt;https://github.com/saubury/plane_track&lt;/a&gt;&lt;/p&gt;

</description>
      <category>apache</category>
      <category>kafka</category>
      <category>planes</category>
    </item>
    <item>
      <title>Puppy data</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Wed, 19 Feb 2025 02:45:31 +0000</pubDate>
      <link>https://dev.to/saubury/puppy-data-2gin</link>
      <guid>https://dev.to/saubury/puppy-data-2gin</guid>
      <description>&lt;h1&gt;
  
  
  Puppy data
&lt;/h1&gt;

&lt;p&gt;🐾 Puppy Data 🐾 combines a love of data and dogs — a project to track Barney the puppy 🐶! Using load cells and low-power motion tracking sensors I built an automated system to monitor Barney’s weight, sleep habits, and activity 📊. Data is sent to Home Assistant for easy tracking via a slick interface 📱. Cortex AI and a Streamlit application allow me to ask questions like “How much heavier is Barney this month?” or “When was he most active?” 🐕💡. A blend of IoT, DIY tech, and puppy love ❤️📈!&lt;/p&gt;

&lt;p&gt;This is Barney — our (now) 6-month-old “Staffy Cross” — adopted from a local shelter. We love Barney joining our family — and I love data. Barney has agreed (sort of) to help with a bit of local data gathering so we can watch him grow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq847pshnp0zpihgx3fc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq847pshnp0zpihgx3fc6.png" alt="Barney at home — along with sample analytics" width="679" height="365"&gt;&lt;/a&gt;&lt;em&gt;Barney at home — along with sample analytics&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My goal with this project was to passively collect data on Barney’s activity — his sleeping habits, weight gain and movements around our house. The data is fed into a data warehouse, where a conversational interface lets me interact with my Barney data to answer questions like “When was Barney the most active yesterday?”, “How much heavier is Barney this month?” or “Where did Barney put my shoe?” (okay — maybe it didn’t help with that last one).&lt;/p&gt;

&lt;h2&gt;
  
  
  Puppy weight
&lt;/h2&gt;

&lt;p&gt;Puppies grow quickly — and I wanted an automated way of measuring his weight each day. On the underside of Barney’s kennel (the “Barn”!) I installed 4 load cell weighing sensors — one for each corner of Barney’s Barn. By summing the combined weight across the 4 cells and subtracting the &lt;a href="https://en.wikipedia.org/wiki/Tare_weight" rel="noopener noreferrer"&gt;tare weight&lt;/a&gt; (of the kennel itself and any cushions or toys dragged into the Barn overnight) I can determine an accurate daily weight for our puppy.&lt;/p&gt;

&lt;p&gt;Load cells are pretty neat. They measure weight (or, more accurately, directional force). Each load cell has an electrical resistance that changes in response to (and in proportion to) the force applied. As Barney jumps into his Barn, we can instantly weigh him — and as a bonus also measure the time he spends napping in his kennel.&lt;/p&gt;
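&lt;p&gt;The arithmetic itself is tiny. A hedged sketch of the idea with made-up readings (the real calibration happens in ESPHome below, and the tare correction in SQL later):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sum the four corner load cells, then subtract the tare (kennel + bedding)
def puppy_weight_kg(cell_readings_kg, tare_kg):
    return sum(cell_readings_kg) - tare_kg

# Hypothetical per-corner readings while Barney naps in the Barn
corners = [5.1, 4.8, 5.3, 4.9]   # kg measured by each load cell
print(puppy_weight_kg(corners, tare_kg=8.6))  # 11.5 kg of puppy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;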

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sh0klffncq384pwzpvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sh0klffncq384pwzpvs.png" alt="Close up a load cell and 3D printed mounting bracket" width="749" height="621"&gt;&lt;/a&gt;&lt;em&gt;Close up a load cell and 3D printed mounting bracket&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.aliexpress.com/item/32968926628.html" rel="noopener noreferrer"&gt;50kg load cells and HX711&lt;/a&gt; amplifier module were around $5. The peculiar thing about these sensors is that they can’t sit flush against the surface of the barn: the centre of each sensor needs a gap so it can flex when a load is applied. Searching online I found you can 3D print &lt;a href="https://www.thingiverse.com/thing:2274593" rel="noopener noreferrer"&gt;a bracket&lt;/a&gt; to keep the centre clear of the underside and avoid this problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a5xeeqoyye8to2tgvky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a5xeeqoyye8to2tgvky.png" alt="HX711 amplifier module (left) and ESP32 (right)" width="800" height="447"&gt;&lt;/a&gt;&lt;em&gt;HX711 amplifier module (left) and ESP32 (right)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I wired the four load cells into a single circuit with the HX711 amplifier module in a &lt;a href="https://en.wikipedia.org/wiki/Wheatstone_bridge" rel="noopener noreferrer"&gt;Wheatstone bridge&lt;/a&gt; configuration — have a look at &lt;a href="https://circuitjournal.com/50kg-load-cells-with-HX711" rel="noopener noreferrer"&gt;this helpful blog&lt;/a&gt; for the detailed steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tiy3q759imusulgbw34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tiy3q759imusulgbw34.png" alt="Underside of barn — with load cells and circuitry" width="800" height="600"&gt;&lt;/a&gt;&lt;em&gt;Underside of barn — with load cells and circuitry&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finally I was ready to connect the load cells and HX711 amplifier module to an &lt;a href="https://www.aliexpress.com/gcp/300000512/nnmixupdatev3?spm=a2g0o.productlist.main.1.30ec7404w2UU6n&amp;amp;productIds=1005006336964908&amp;amp;pha_manifest=ssr&amp;amp;_immersiveMode=true&amp;amp;disableNav=YES&amp;amp;channelLinkTag=nn_newgcp&amp;amp;sourceName=mainSearchProduct&amp;amp;utparam-url=scene%3Asearch%7Cquery_from%3A" rel="noopener noreferrer"&gt;ESP32&lt;/a&gt; microcontroller with integrated Wi-Fi and Bluetooth. Everything was taped to the underside of the barn — and the delicate wires and circuit boards hidden from the curious pup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmc6aty2ovlcf9nup7evi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmc6aty2ovlcf9nup7evi.png" alt="Barney — passionate data producer" width="800" height="578"&gt;&lt;/a&gt;&lt;em&gt;Barney — passionate data producer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I flashed the ESP32 with &lt;a href="https://esphome.io/" rel="noopener noreferrer"&gt;ESPHome&lt;/a&gt; and added the &lt;a href="https://esphome.io/components/sensor/hx711.html" rel="noopener noreferrer"&gt;HX711&lt;/a&gt; sensor platform configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sensor:
  - platform: hx711
    name: "HX711 Value"
    dout_pin: GPIO14
    clk_pin: GPIO13
    gain: 128
    update_interval: 15s   
    filters: 
    - calibrate_linear:
        - -455742 -&amp;gt; 0
        - -550682 -&amp;gt; 4.404
    unit_of_measurement: kg
    accuracy_decimals: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The weight was captured “as-is” — and I did some SQL post-processing to work out the tare weight. The final sensor was then added to &lt;a href="https://www.home-assistant.io/" rel="noopener noreferrer"&gt;Home Assistant&lt;/a&gt;, giving me monitoring, persistent storage and a nice user interface (and app) which I can use anywhere I’ve got an internet connection. Behind the scenes a local PostgreSQL database stores all the sensor measurements every minute.&lt;/p&gt;
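&lt;p&gt;As a sketch of that post-processing (the table and column names below are invented for illustration; the real Home Assistant recorder schema differs), the daily tare correction could be as simple as each day’s peak reading minus its empty-Barn minimum:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import psycopg2

# Illustrative query: the day's maximum combined reading (Barn + Barney)
# minus the day's minimum (empty Barn) approximates the puppy's weight
SQL = """
SELECT date_trunc('day', measured_at) AS day,
       MAX(reading_kg) - MIN(reading_kg) AS puppy_kg
FROM sensor_weight
GROUP BY 1
ORDER BY 1;
"""

with psycopg2.connect('dbname=homeassistant') as conn:
    with conn.cursor() as cur:
        cur.execute(SQL)
        for day, puppy_kg in cur.fetchall():
            print(day, round(puppy_kg, 1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;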

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F211xvss7qfat67q8l0bg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F211xvss7qfat67q8l0bg.png" alt="Home assistant dashboard" width="520" height="578"&gt;&lt;/a&gt;&lt;em&gt;Home assistant dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With weight measurement sorted, I then set out to track Barney’s activity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Puppy activity
&lt;/h2&gt;

&lt;p&gt;Barney rarely keeps still — so I wanted a way to track his movements during the day and sleep patterns at night. I got him a &lt;a href="https://www.fitbark.com/en-AU/store/fitbark2" rel="noopener noreferrer"&gt;FitBark 2&lt;/a&gt; and placed it on his collar to monitor his everyday activity. This is a small 3D accelerometer and Bluetooth transmitter that weighs only 10 grams! This is a very cool device — and it has a &lt;a href="https://www.fitbark.com/en-AU/dev" rel="noopener noreferrer"&gt;developer API&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kawsk8nzsn0zrspcplt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kawsk8nzsn0zrspcplt.png" alt="Barney with Fitbark on his collar" width="800" height="506"&gt;&lt;/a&gt;&lt;em&gt;Barney with Fitbark on his collar&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I used the Home Assistant &lt;a href="https://www.home-assistant.io/integrations/rest" rel="noopener noreferrer"&gt;rest sensor&lt;/a&gt; platform to consume the Fitbark RESTful API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rest:
  - authentication: digest
    verify_ssl: false
    # update every 1 hour
    scan_interval: 3600
    resource: https://app.fitbark.com/api/v2/activity_series
    method: POST
    payload_template: &amp;gt;-
      {   
        "activity_series":{"slug":"XXxxXXxx",           
          "from":"{{ now().strftime('%Y-%m-%d') }}",           
          "to":"{{ now().strftime('%Y-%m-%d') }}",           
          "resolution":"HOURLY"   
        }
      }
    headers:
      Authorization: !secret fitbark_bearer_token
      Content-Type: application/json
      User-Agent: Mozilla/5.0
    sensor:
      - name: "Fitbark_activityseries_activity_value"
        unique_id: fitbark_activityseries_activity_value
        value_template: "{{ value_json.activity_series.records[-2].activity_value | int }}" 

      - name: "Fitbark_activityseries_min_play"
        unique_id: fitbark_activityseries_min_play
        device_class: duration
        unit_of_measurement: min
        value_template: "{{ value_json.activity_series.records[-2].min_play | int }}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The sensor has support for GET and POST requests, and I used the &lt;a href="https://documenter.getpostman.com/view/238826/2s8ZDbW1Gf#7cde488a-ff03-4401-9658-207f7531a6ee" rel="noopener noreferrer"&gt;Get Activity Series&lt;/a&gt; API to get the recent hourly activity for Barney, giving me a breakdown of his active, play and rest minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdhhu5ez5wovpq7l9p72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdhhu5ez5wovpq7l9p72.png" alt="Activity by hour" width="455" height="368"&gt;&lt;/a&gt;&lt;em&gt;Activity by hour&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Much like the weight measurements, the hourly activity measures are managed by Home Assistant which automatically stores the sensor measurements to a local PostgreSQL database.&lt;/p&gt;

&lt;p&gt;With Barney’s weight and activity data captured, let’s move on to data analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cortex Analyst in Snowflake
&lt;/h2&gt;

&lt;p&gt;Barney had given me a lot of data — and now I wanted to do something with it! I’ve been using &lt;a href="https://docs.snowflake.com/user-guide/snowflake-cortex/cortex-analyst" rel="noopener noreferrer"&gt;Cortex Analyst&lt;/a&gt; in the &lt;a href="https://www.snowflake.com/en/" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt; database in my “day job” and thought it was an ideal way of asking questions about Barney in natural language and receiving direct answers without writing SQL.&lt;/p&gt;

&lt;p&gt;Cortex Analyst is a fully-managed, LLM-powered &lt;a href="https://www.snowflake.com/en/data-cloud/cortex/" rel="noopener noreferrer"&gt;Snowflake Cortex&lt;/a&gt; feature that helps you create applications capable of answering questions based on your data stored in Snowflake. Cortex Analyst will be tasked with answering the “Is Barney playing more this week?” style of question.&lt;/p&gt;
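&lt;p&gt;Behind a conversational UI the interaction boils down to a single REST call. Here is a minimal sketch of posing a question to the Cortex Analyst message endpoint; the account URL, token and semantic model path are all placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

# Placeholders: substitute your own account, auth token and staged model
ACCOUNT_URL = 'https://myaccount.snowflakecomputing.com'
TOKEN = '...'  # e.g. an OAuth or key-pair JWT token

resp = requests.post(
    f'{ACCOUNT_URL}/api/v2/cortex/analyst/message',
    headers={'Authorization': f'Bearer {TOKEN}',
             'Content-Type': 'application/json'},
    json={
        'messages': [{'role': 'user', 'content': [
            {'type': 'text', 'text': 'How much heavier is Barney this month?'}
        ]}],
        'semantic_model_file': '@puppy_db.public.my_stage/barney_model.yaml',
    },
)
print(resp.json())  # the reply includes generated SQL plus a text answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;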

&lt;p&gt;Here’s a look at the Streamlit conversational application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvtusfzec720mzk0fruo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvtusfzec720mzk0fruo.png" alt="Streamlit conversational application." width="693" height="310"&gt;&lt;/a&gt;&lt;em&gt;Streamlit conversational application.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I can start asking questions such as&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What was Barney’s smallest weight and when was that measured?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vd9lp7bzjv3bwpn6ga1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vd9lp7bzjv3bwpn6ga1.png" alt="Barney used to weigh 8.55kg" width="772" height="324"&gt;&lt;/a&gt;&lt;em&gt;Barney used to weigh 8.55kg&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Barney used to be such a tiny puppy! Let’s look at his growth&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Show me the weekly average weight of Barney for the last 3 months.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ues1yq1t4k9v2797bcj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ues1yq1t4k9v2797bcj.png" alt="Barney has grown a lot over 3 months" width="473" height="377"&gt;&lt;/a&gt;&lt;em&gt;Barney has grown a lot over 3 months&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Barney really has grown a lot over the last 3 months. Finally, let’s look at his activity&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pivot the activity type for Barney and summarise the minutes each week for the last month.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sphgqhnacm0x64iommp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sphgqhnacm0x64iommp.png" alt="Lots of play time!" width="475" height="374"&gt;&lt;/a&gt;&lt;em&gt;Lots of play time!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I see Barney enjoys his naps — and loves to run around too&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons &amp;amp; future enhancements
&lt;/h2&gt;

&lt;p&gt;I’ve been happy with the Barney data collected so far — but a few things haven’t worked as expected&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The “food scale” I initially created to weigh the food consumed proved impractical — we kept moving the feeding bowl, and Barney would often chew on the wires&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The FitBark API call from Home Assistant is delicate — it requires me to request specific blocks of time and has no error or retry logic. I’d prefer to rework this as a proper backfill operator&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I originally wanted to use the signal strength (RSSI value) of the FitBark bluetooth as a proxy for location — however this was too imprecise for any meaningful measurements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The data transfer from local PostgreSQL to cloud Snowflake was manual — I’d like to automate this (and perhaps play with &lt;a href="https://dlthub.com/" rel="noopener noreferrer"&gt;dlt&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What did work and I was happy with&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The weight scale worked better than expected — with (what appears to be) a smooth, progressive and reliable weight measurement over the last few months&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;FitBark has an impressive data logging mechanism — and the battery has only been charged once in 6 months!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the end, &lt;em&gt;Puppy Data&lt;/em&gt; showcases how IoT and puppy energy can turn a love for dogs and data into a fun way to track Barney’s growth and adventures, proving that even tech can have a heart ❤️🐾📊!&lt;/p&gt;

&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;

&lt;p&gt;The code and example data is available at&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/saubury/puppy_data" rel="noopener noreferrer"&gt;https://github.com/saubury/puppy_data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://esphome.io/guides/getting_started_hassio.html" rel="noopener noreferrer"&gt;Getting Started with ESPHome and Home Assistant&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://quickstarts.snowflake.com/guide/getting_started_with_cortex_analyst_in_snowflake/index.html" rel="noopener noreferrer"&gt;Getting Started with Cortex Analyst in Snowflake&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cortexai</category>
      <category>homeassistant</category>
      <category>snowflake</category>
    </item>
    <item>
      <title>FridgeBot — GPT-4o shopping list automation</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Tue, 21 May 2024 02:20:02 +0000</pubDate>
      <link>https://dev.to/saubury/fridgebot-gpt-4o-shopping-list-automation-98d</link>
      <guid>https://dev.to/saubury/fridgebot-gpt-4o-shopping-list-automation-98d</guid>
      <description>&lt;h1&gt;
  
  
  FridgeBot — GPT-4o shopping list automation
&lt;/h1&gt;

&lt;p&gt;Monitoring the contents of my fridge and automatically adding grocery items to my shopping list with the new GPT-4o vision API&lt;/p&gt;

&lt;p&gt;OpenAI recently announced their new generative AI model &lt;a href="https://openai.com/index/hello-gpt-4o/" rel="noopener noreferrer"&gt;GPT-4o&lt;/a&gt; — the “o” stands for “omni,” referring to the model’s ability to use mixed modalities including text, speech and video. I wanted to give GPT-4o a real challenge — helping me keep on top of the shopping list by automatically monitoring the contents of my fridge and adding grocery items when I had run out of something.&lt;/p&gt;


&lt;h2&gt;
  
  
  Processing steps
&lt;/h2&gt;

&lt;p&gt;The fridge door light is used as a signal for when to start and stop looking for changes in the fridge. I’m assuming that anything taken from my fridge is removed while the door is open and the door light is illuminated.&lt;/p&gt;

&lt;p&gt;The code does the following&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Open the camera feed — and use the &lt;a href="https://opencv.org/" rel="noopener noreferrer"&gt;OpenCV&lt;/a&gt; library for real-time computer vision processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every 500ms we take an image from the video feed and convert the image to grey-scale&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate the mean brightness of the grey-scale image — and use the average value as an indicator of whether the fridge door is open (a condensed sketch follows this list)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the last image was “dark” and the current image is “light”, we assume the fridge door has just been opened. We wait 500ms for the camera white balance to settle, and save this as &lt;em&gt;image-1&lt;/em&gt; to represent the initial state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We continue to take photos every 500ms, and save the image temporarily as we don’t know when the door is going to close&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the average brightness of the latest image drops significantly, we know the fridge door has just been closed. We discard this image (as it is dark), and use the &lt;em&gt;last&lt;/em&gt; image taken and save this as &lt;em&gt;image-2&lt;/em&gt; to represent the final state.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
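&lt;p&gt;Here is that door-light detection as a condensed sketch; the threshold is illustrative and would need tuning for a real fridge camera:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import cv2

cap = cv2.VideoCapture('media/pikelets.mov')   # or 0 for a live camera
DOOR_OPEN_THRESHOLD = 60    # mean brightness (0-255); tune for your fridge
was_open = False

while True:
    ok, frame = cap.read()
    if not ok:
        break
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    is_open = grey.mean() &amp;gt; DOOR_OPEN_THRESHOLD
    if is_open and not was_open:
        print('door opened: wait for white balance, then save image-1')
    elif was_open and not is_open:
        print('door closed: the previous lit frame becomes image-2')
    was_open = is_open
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;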

&lt;p&gt;Now that we have the before and after images, we use &lt;em&gt;image-1&lt;/em&gt; and &lt;em&gt;image-2&lt;/em&gt; as inputs to the OpenAI API&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We encode both images as &lt;a href="https://en.wikipedia.org/wiki/Base64" rel="noopener noreferrer"&gt;base-64&lt;/a&gt; to transform the binary image data into a sequence of printable characters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We use the OpenAI GPT-4o &lt;a href="https://platform.openai.com/docs/guides/vision" rel="noopener noreferrer"&gt;vision API&lt;/a&gt; along with the prompt &lt;em&gt;‘What item is missing in the second image?’&lt;/em&gt; (sketched after this list)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The response is decoded — and is assumed to be a single word describing what item was in &lt;em&gt;image-1&lt;/em&gt; and is not present in &lt;em&gt;image-2&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
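&lt;p&gt;A sketch of that vision call using the openai Python client (v1-style API); the image file names match the demo files generated later in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode(path):
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode()

images = [{'type': 'image_url',
           'image_url': {'url': f'data:image/jpeg;base64,{encode(p)}'}}
          for p in ('media/pikelets.1.jpg', 'media/pikelets.2.jpg')]

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user',
               'content': [{'type': 'text',
                            'text': 'What item is missing in the second image?'}] + images}],
)
print(response.choices[0].message.content)  # e.g. 'Pikelets'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;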

&lt;p&gt;Knowing the item, we can add to our shopping list&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://developer.todoist.com/guides/#developing-with-todoist" rel="noopener noreferrer"&gt;Todoist Sync API&lt;/a&gt; is used to add the item to our shopping list (a minimal sketch follows)&lt;/li&gt;
&lt;/ul&gt;
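&lt;p&gt;A minimal sketch of that Sync API call; the token variable stands in for the secret kept in config_secrets.py:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import uuid

import requests

TODOIST_TOKEN = '...'   # stands in for the secret in config_secrets.py
item = 'Pikelets'       # the single word returned by GPT-4o

resp = requests.post(
    'https://api.todoist.com/sync/v9/sync',
    headers={'Authorization': f'Bearer {TODOIST_TOKEN}'},
    data={'commands': json.dumps([{
        'type': 'item_add',
        'temp_id': str(uuid.uuid4()),
        'uuid': str(uuid.uuid4()),
        'args': {'content': item},  # add project_id to target the shopping list
    }])},
)
resp.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;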

&lt;h2&gt;
  
  
  Hardware
&lt;/h2&gt;

&lt;p&gt;The final form of this project runs on a &lt;a href="https://www.raspberrypi.com/" rel="noopener noreferrer"&gt;RaspberryPi&lt;/a&gt; with a live video processing stream within my fridge. In reality this was a little impractical as both power and ethernet cables needed to be routed past the fridge seal.&lt;/p&gt;

&lt;p&gt;To demonstrate the steps without the need for specific hardware (or a fridge) you can run this project on almost any machine. The setup steps below use a demonstration video file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkRSzJQ8tezCfVoVp1PF9gQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AkRSzJQ8tezCfVoVp1PF9gQ.png" alt="FridgeBot — fridge monitoring with GPT-4o"&gt;&lt;/a&gt;&lt;em&gt;FridgeBot — fridge monitoring with GPT-4o&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup virtual python environment
&lt;/h2&gt;

&lt;p&gt;Create a &lt;a href="https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/" rel="noopener noreferrer"&gt;virtual python&lt;/a&gt; environment to keep dependencies separate. The &lt;em&gt;venv&lt;/em&gt; module is the preferred way to create and manage virtual environments.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m venv .venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Before you can start installing or using packages in your virtual environment you’ll need to activate it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  API setup
&lt;/h2&gt;

&lt;p&gt;Now it’s time to set up the local API secrets.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp -i config_secrets_example.py config_secrets.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Edit config_secrets.py with the &lt;em&gt;OpenAI&lt;/em&gt; token along with the &lt;em&gt;Todoist&lt;/em&gt; API secret token&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://platform.openai.com/api-keys" rel="noopener noreferrer"&gt;OpenAI API keys&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://app.todoist.com/app/settings/integrations/developer" rel="noopener noreferrer"&gt;Todoist API keys&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FridgeBot — video process only
&lt;/h2&gt;

&lt;p&gt;To run FridgeBot against the example video file without calling OpenAI, run the following&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python fridgebot.py --video media/pikelets.mov
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This should generate two files representing the first lit image and the last lit image within the video: media/pikelets.1.jpg and media/pikelets.2.jpg&lt;/p&gt;

&lt;h2&gt;
  
  
  FridgeBot — video process and OpenAI
&lt;/h2&gt;

&lt;p&gt;To run FridgeBot against the example video file, calling OpenAI, run the following&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python fridgebot.py --video media/pikelets.mov --openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This should generate two files, and query the OpenAI API to identify the item removed — Pikelets&lt;/p&gt;

&lt;h2&gt;
  
  
  FridgeBot — video process, OpenAI and Todoist
&lt;/h2&gt;

&lt;p&gt;To run FridgeBot against the example video file, calling OpenAI and adding the item to Todoist, run the following&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python fridgebot.py --video media/pikelets.mov --openai --todoist
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This should generate two files, and query the OpenAI API and add Pikelets to the Todoist shopping list&lt;/p&gt;

&lt;h2&gt;
  
  
  FridgeBot Code
&lt;/h2&gt;

&lt;p&gt;FridgeBot code — &lt;a href="https://github.com/saubury/fridgebot_openai" rel="noopener noreferrer"&gt;https://github.com/saubury/fridgebot_openai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gpt4o</category>
      <category>raspberrypi</category>
    </item>
    <item>
      <title>🍹GinAI - Cocktails mixed with generative AI</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Thu, 19 Oct 2023 10:33:23 +0000</pubDate>
      <link>https://dev.to/saubury/ginai-cocktails-mixed-with-generative-ai-2nda</link>
      <guid>https://dev.to/saubury/ginai-cocktails-mixed-with-generative-ai-2nda</guid>
      <description>&lt;h1&gt;
  
  
  🍹GinAI - Cocktails mixed with generative AI
&lt;/h1&gt;

&lt;p&gt;GinAI — a robotic bartender which can make a nice drink given a random collection of juices, mixers and spirits. Real cocktails created and music chosen by OpenAI — all mixed by a RaspberryPi bartender.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AvpvwxhcsgUIT9zpaFOIRGA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AvpvwxhcsgUIT9zpaFOIRGA.png" alt="GinAI — Cocktails mixed with generative AI and a RaspberryPi."&gt;&lt;/a&gt;&lt;em&gt;GinAI — Cocktails mixed with generative AI and a RaspberryPi.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  I’m bored — can I get a video?
&lt;/h2&gt;

&lt;p&gt;Here’s a quick video of GinAI in action.&lt;/p&gt;


&lt;h2&gt;
  
  
  Starting at the end
&lt;/h2&gt;

&lt;p&gt;Let me describe the finished project — and we can work backwards on how I built 🍹GinAI🍸. The GinAI bartender uses up to four ingredients — and when I press the dispense button, OpenAI &lt;a href="https://chat.openai.com/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt; will “create” a drink, describe the cocktail creation and select an appropriate song 🎵.&lt;/p&gt;

&lt;p&gt;A row of decorative lights looks pretty during the creation and flashes 🚨 once the cocktail is ready. A Google Nest Mini is used as the speaker for both the spoken words 🗣️ and for playing the tunes 🎶.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A2jsMCyqVC0JhNktEY2o-lQ.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A2jsMCyqVC0JhNktEY2o-lQ.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cocktail inspiration from ChatGPT
&lt;/h2&gt;

&lt;p&gt;I started my cocktail mixing adventure by simply asking OpenAI for a cocktail recipe from the random spirits and mixers I had available. For example, I prompted &lt;a href="https://chat.openai.com/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt; to suggest a cocktail with this query&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;create a cocktail from the ingredients gin, tequila, apple juice and tonic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which returns a helpful cocktail mixing recipe along with text instructions as a response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2752%2F1%2A3yNvwplBIXR1jBzFTJ7vYA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2752%2F1%2A3yNvwplBIXR1jBzFTJ7vYA.png" alt="First experiment with ChatGPT console"&gt;&lt;/a&gt;&lt;em&gt;First experiment with ChatGPT console&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I quickly found a few limitations with my initial cocktail creations from the console.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OpenAI would sometimes suggest a cocktail with an ingredient I didn’t specify (and didn’t have). I corrected this with the added instruction to only use ingredients from the provided list. Results were more reliable with the prompt instructions “You do not need to use all of the ingredients. You may only use a maximum of 4 ingredients.” I can now call myself a prompt engineer 😀&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recipe suggestions used a variety of measurement units — such as fluid ounces, “a dash of” or other bizarre imperial measurements. I could coerce the output by simply adding the prompt “Only give quantities in metric units. Only give quantities in whole numbers”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There was no consistency in the total volume of cocktail produced. Some may not consider 2 litres of alcohol a problem — but at the very least it overflowed my available cocktail glassware. A prompt instruction to limit the volume to 250 millilitres reduced spillage and excessive drinking&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the prompt creation giving reasonable results, I moved on to building a reliable interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Predictable OpenAI responses with function calling
&lt;/h2&gt;

&lt;p&gt;The cocktail recipes created by ChatGPT were going to drive the automated drink dispensing — so I needed to ensure the OpenAI API would generate a predictable schema for the JSON responses. OpenAI recently added &lt;a href="https://openai.com/blog/function-calling-and-other-api-updates" rel="noopener noreferrer"&gt;&lt;strong&gt;function calling&lt;/strong&gt;&lt;/a&gt; functionality to their API, which I could use to return a consistent JSON response. Function calling is primarily aimed at connecting GPT’s capabilities with external tools and APIs — and converts queries such as “Email Alice to see if she wants to get coffee next Friday” to a function call like &lt;strong&gt;send_email(to: string, body: string)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I didn’t actually need the function call itself, but I can use the technique (along with a dummy function) to ensure recipes suggested by OpenAI ChatGPT meet my specification. Enforcing a predictable JSON output means the directions can be easily parsed and implemented by the RaspberryPi liquid dispensing pumps to make yummy cocktails based on the response from a &lt;a href="https://platform.openai.com/docs/models/gpt-3-5" rel="noopener noreferrer"&gt;&lt;code&gt;gpt-3.5-turbo&lt;/code&gt;&lt;/a&gt; model.&lt;/p&gt;

&lt;p&gt;The easiest implementation I found was to use a &lt;a href="https://docs.pydantic.dev" rel="noopener noreferrer"&gt;Pydantic&lt;/a&gt; class for my target schema — and use that as a parameter for the method call to &lt;strong&gt;ChatCompletion.create()&lt;/strong&gt;. Here’s a fragment of the &lt;a href="https://github.com/saubury/GinAI/blob/master/ginai_types.py" rel="noopener noreferrer"&gt;GinAI Python classes&lt;/a&gt; used.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create a PyDantic schema for output
class Ingredient(BaseModel):
ingredient_name: str
quantity_ml: int

class Cocktail(BaseModel):
cocktail_name: str
description: str
inventor: str
matching_song: str
instructions: str
ingredients: list[Ingredient]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;These are basic classes: an &lt;strong&gt;Ingredient&lt;/strong&gt; is an ingredient name and a quantity specified in millilitres, and the &lt;strong&gt;Cocktail&lt;/strong&gt; class has a list of Ingredient objects, along with the name, description and an appropriate song to complement the drinking of the cocktail.&lt;/p&gt;

&lt;p&gt;I followed &lt;a href="https://medium.com/dev-bits/a-clear-guide-to-openai-function-calling-with-python-dcbc200c5d70" rel="noopener noreferrer"&gt;this guide&lt;/a&gt; as a great tutorial for using the new function calling feature from OpenAI to enforce a structured output from GPT models.&lt;/p&gt;

&lt;p&gt;The OpenAI Python calling logic looks like this (or see the &lt;a href="https://github.com/saubury/GinAI/blob/master/openai_util.py" rel="noopener noreferrer"&gt;whole openai_util.py module&lt;/a&gt;).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;completion = openai.ChatCompletion.create( 
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'system', 'content': 'You are a helpful bartender.'},
        {'role': 'user', 'content': prompt},
    ],
    functions=[
        {
        'name': 'get_answer_for_user_query',
        'description': 'Get user answer in series of steps',
        'parameters': Cocktail.model_json_schema()
        }
    ],
    function_call={'name': 'get_answer_for_user_query'}
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;OpenAI function calling ensured the created recipes conformed to a strict JSON schema. For example, a typical response would look like this.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "cocktail_name": "Summer Breeze",
    "description": "A refreshing cocktail perfect for a hot day.",
    "inventor": "My Bartender",
    "matching_song": "Summertime by DJ Jazzy Jeff &amp;amp; The Fresh Prince",
    "instructions": "1. Fill a glass with ice.\n2. Combine the ingredients in the glass.\n3. Stir well.\n4. Garnish with a slice of orange.\n5. Enjoy!",
    "ingredients": [
        {
            "ingredient_name": "gin",
            "quantity_ml": 45
        },
        {
            "ingredient_name": "orange juice",
            "quantity_ml": 60
        },
        {
            "ingredient_name": "tonic",
            "quantity_ml": 15
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
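
&lt;p&gt;To turn that response back into typed Python objects, the function call arguments can be validated straight into the Cocktail class. A minimal sketch, assuming Pydantic v2 and the legacy openai 0.x client shown above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The arguments of the (dummy) function call arrive as a JSON string
arguments = completion.choices[0].message.function_call.arguments

# Pydantic v2 parses and validates against the Cocktail schema in one step
cocktail = Cocktail.model_validate_json(arguments)

print(cocktail.cocktail_name)
for item in cocktail.ingredients:
    print(item.ingredient_name, item.quantity_ml, 'ml')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;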

&lt;p&gt;With cocktail recipes consistently created, I could move onto building the drink dispensing hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  GinAI — Hardware build
&lt;/h2&gt;

&lt;p&gt;With the interfaces and software roughed out to imagine cocktails, the next job was to build the pouring hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pumps
&lt;/h2&gt;

&lt;p&gt;I used 4 &lt;a href="https://en.wikipedia.org/wiki/Peristaltic_pump" rel="noopener noreferrer"&gt;peristaltic pumps&lt;/a&gt; to provide a “food safe” way to pump the liquids from the drink bottles. These pumps provide a steady rate of flow of liquids when powered. By carefully timing the “on” time for the pump I can precisely deliver the ideal amount of spirits or mixers for the perfect 🍸&lt;/p&gt;
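&lt;p&gt;As a rough sketch of that timing idea (the calibration figure below is an assumption; measure how long your own pump takes to move a known volume):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical calibration: this pump moved 100 ml in 67 seconds
PUMP_FLOW_ML_PER_SEC = 100 / 67          # roughly 1.5 ml per second

def on_time_seconds(quantity_ml):
    # Convert a recipe quantity into the pump "on" duration
    return quantity_ml / PUMP_FLOW_ML_PER_SEC

print(on_time_seconds(45))               # a 45 ml pour needs about 30 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;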

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2ASD6o91Snh9JpsY5J.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2ASD6o91Snh9JpsY5J.jpeg" alt="A quick search on AliExpress"&gt;&lt;/a&gt;&lt;em&gt;A quick search on AliExpress&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pumps are pretty cheap — and provide an accurate way to dispense precise quantities of liquids.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AronmNuCM2hr09wrV.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AronmNuCM2hr09wrV.jpeg" alt="Pumps — straight from the post"&gt;&lt;/a&gt;&lt;em&gt;Pumps — straight from the post&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pumps are mounted on a basic wooden frame higher than the tallest bottle. My children helped to build the frame, labelling and installing the pumps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2Ab_Ivo2Kkw4YaMUjP.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2Ab_Ivo2Kkw4YaMUjP.jpeg" alt="The important task of adding labels to pumps"&gt;&lt;/a&gt;&lt;em&gt;The important task of adding labels to pumps&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A view from the rear shows the placement of pumps and liquids. A manual electrical switch allows the pumps to be run independently of the Raspberry Pi (helpful for cleaning).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A0mfhEEb0kIinOPuS.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A0mfhEEb0kIinOPuS.jpeg" alt="Rear view of Cocktail maker"&gt;&lt;/a&gt;&lt;em&gt;Rear view of Cocktail maker&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pumps use 12 volt motors. To operate them via the Raspberry Pi I used a &lt;a href="https://www.jaycar.com.au/arduino-compatible-4-channel-12v-relay-module/p/XC4440" rel="noopener noreferrer"&gt;4 Channel 12V Relay Module&lt;/a&gt;. This allows the pumps to be switched on and off independently with the 5 volt signals from the &lt;a href="https://www.raspberrypi.org/documentation/usage/gpio/" rel="noopener noreferrer"&gt;GPIO pins&lt;/a&gt; of the Raspberry Pi.&lt;/p&gt;
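&lt;p&gt;The switching logic is then a matter of holding a GPIO pin in the “on” state for the calculated duration. A minimal sketch with the RPi.GPIO library, assuming an active-low relay board and the hypothetical BCM pin wiring below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import RPi.GPIO as GPIO

PUMP_FLOW_ML_PER_SEC = 1.5    # assumed pump calibration (see earlier sketch)
PUMP_PINS = {'gin': 17, 'orange juice': 27, 'tonic': 22, 'soda': 23}  # hypothetical wiring

GPIO.setmode(GPIO.BCM)
for pin in PUMP_PINS.values():
    GPIO.setup(pin, GPIO.OUT, initial=GPIO.HIGH)   # HIGH = relay off on an active-low board

def dispense(ingredient_name, quantity_ml):
    # Switch the relay on just long enough to pour the requested volume
    pin = PUMP_PINS[ingredient_name]
    GPIO.output(pin, GPIO.LOW)                     # relay on, pump running
    time.sleep(quantity_ml / PUMP_FLOW_ML_PER_SEC)
    GPIO.output(pin, GPIO.HIGH)                    # relay off

# Pour each ingredient from the parsed Cocktail object
for item in cocktail.ingredients:
    dispense(item.ingredient_name, item.quantity_ml)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;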

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AaBc80nhmSp4KfYRU.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AaBc80nhmSp4KfYRU.jpeg" alt="Relay board"&gt;&lt;/a&gt;&lt;em&gt;Relay board&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Raspberry Pi is mounted with the relay board. The relays switch 12 volt power on and off for the pump motors. The signals for the relay board are taken directly from the GPIO header of the Raspberry Pi.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2068%2F1%2Aggvd82pOjv_SewNbWLpspQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2068%2F1%2Aggvd82pOjv_SewNbWLpspQ.png" alt="Board placement"&gt;&lt;/a&gt;&lt;em&gt;Board placement&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finally I added the mandatory RGB LED’s for some colourful lighting effects. I used a row of &lt;a href="https://www.jaycar.com.au/duinotech-arduino-compatible-w2812b-rgb-led-strip-2m/p/XC4390" rel="noopener noreferrer"&gt;WS2812B LED strip&lt;/a&gt;. This was installed behind the final collecting tube with a bit of soldering hidden by white heat shrink tubing.&lt;/p&gt;
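&lt;p&gt;Driving the strip from Python looks something like this (a sketch using the rpi_ws281x library; the LED count and data pin are assumptions from my wiring):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from rpi_ws281x import PixelStrip, Color

LED_COUNT = 30   # assumed strip length
LED_PIN = 18     # GPIO 18 carries the WS2812B data signal

strip = PixelStrip(LED_COUNT, LED_PIN)
strip.begin()

# Wash the whole strip in blue while a drink is pouring
for i in range(strip.numPixels()):
    strip.setPixelColor(i, Color(0, 80, 255))
strip.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;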

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5324%2F1%2AsNogE8G29nINRmw7PPRQwA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5324%2F1%2AsNogE8G29nINRmw7PPRQwA.png" alt="WS2812B RGB LED strip."&gt;&lt;/a&gt;&lt;em&gt;WS2812B RGB LED strip.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A great bartender needs to be able to talk and entertain, so GinAI needed a speaker. I used a Google Nest Mini mounted to the frame as a speaker for spoken words and for playing the music.&lt;/p&gt;
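&lt;p&gt;One way to push announcements and music to a Nest Mini is to cast to it. A sketch with the pychromecast library; the speaker name and media URL are placeholders for my setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pychromecast

# 'GinAI Speaker' and the media URL are placeholders
chromecasts, browser = pychromecast.get_listed_chromecasts(friendly_names=['GinAI Speaker'])
cast = chromecasts[0]
cast.wait()    # block until the device is ready

mc = cast.media_controller
mc.play_media('http://my-pi.local/announce.mp3', 'audio/mp3')
mc.block_until_active()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;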

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5508%2F1%2A1x8Z6HnqWQ_8CzRiHUDl7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5508%2F1%2A1x8Z6HnqWQ_8CzRiHUDl7w.png" alt="Google Nest Mini"&gt;&lt;/a&gt;&lt;em&gt;Google Nest Mini&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🍹 Cheers!
&lt;/h2&gt;

&lt;p&gt;With a bit of coding and a &lt;em&gt;lot&lt;/em&gt; of trust in my GinAI bartender, I’m enjoying the surprising world of generatively created cocktails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aa2ZZ-P4j1-XN54fBlZPgDw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2Aa2ZZ-P4j1-XN54fBlZPgDw.png" alt="Cheers"&gt;&lt;/a&gt;&lt;em&gt;Cheers&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🛠️ Code
&lt;/h2&gt;

&lt;p&gt;The code for GinAI is available at&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/saubury/GinAI/" rel="noopener noreferrer"&gt;https://github.com/saubury/GinAI/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>genai</category>
      <category>openai</category>
      <category>raspberrypi</category>
    </item>
    <item>
      <title>My data, your LLM — paranoid analysis of iMessage chats with OpenAI, LlamaIndex &amp; DuckDB</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Tue, 12 Sep 2023 10:16:08 +0000</pubDate>
      <link>https://dev.to/saubury/my-data-your-llm-paranoid-analysis-of-imessage-chats-with-openai-llamaindex-duckdb-825</link>
      <guid>https://dev.to/saubury/my-data-your-llm-paranoid-analysis-of-imessage-chats-with-openai-llamaindex-duckdb-825</guid>
      <description>&lt;p&gt;&lt;em&gt;Can I safely combine my local personal data with a public large language model to understand my texting behaviour? A project combining natural language and generative AI models to explore my private data without sharing (too much of) my personal life with the robots.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oHAED7VP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4732/1%2A_cNWKGqKv1RwlODFBvUbSQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oHAED7VP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4732/1%2A_cNWKGqKv1RwlODFBvUbSQ.png" alt="Data architecture — image by author" width="800" height="346"&gt;&lt;/a&gt;&lt;em&gt;Data architecture — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;✍️ A blog about exploratory data analysis of my private iMessage chats — with equal parts of wonder and paranoia.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation 🤔
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/IMessage"&gt;iMessage&lt;/a&gt; is an instant messaging service for text communication between users on Apple devices. Behind the scenes, iMessage uses a local &lt;a href="https://en.wikipedia.org/wiki/SQLite"&gt;SQLite&lt;/a&gt; database to store a copy of message conversations. This means I have a complete local copy of all my messages in a relational database on my Mac laptop.&lt;/p&gt;

&lt;p&gt;With 2 years of iMessage history I wanted to explore the text data to create data visualisations — with natural language prompts. The data however is very personal — so I needed to use a privacy preserving design to ensure my personal communications don’t leave my machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech stack 🔧
&lt;/h3&gt;

&lt;p&gt;To explore my text messages I’m using&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; open-source, embedded, in-process OLAP database. With the &lt;a href="https://duckdb.org/docs/extensions/sqlite_scanner.html"&gt;SQLite extension&lt;/a&gt; to directly read from an iMessage SQLite database file&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.llamaindex.ai/"&gt;LlamaIndex&lt;/a&gt; — framework for connecting custom data sources to large language models, allowing for natural language querying of my data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://platform.openai.com/docs/models/gpt-3-5"&gt;OpenAI gpt-3.5-turbo&lt;/a&gt; model for code generation to create the python to make the visualisations &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://mitmproxy.org/"&gt;mitmproxy&lt;/a&gt; — an open source interactive HTTPS proxy to view encrypted network traffic&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What am I working towards? 📈
&lt;/h3&gt;

&lt;p&gt;My goal is to explore my messaging behaviour, such as texting frequency and time of day usage. I want to “talk” to my data using generative AI — to create visualisations on top of my private data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CM23ldl---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4136/1%2AUfDh6meQTKjI5Fjx7Ajoew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CM23ldl---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4136/1%2AUfDh6meQTKjI5Fjx7Ajoew.png" alt="Image by author" width="800" height="463"&gt;&lt;/a&gt;&lt;em&gt;Image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🎉 Tada! Creating charts locally on my private data from a natural language prompt. Let’s now see how I built this, breaking it down into three parts&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Wrangling iMessage data — with DuckDB to ingest and pre-process my iMessage history&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inspecting the network traffic — with a “Man in the Middle” proxy to view encrypted request and response traffic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prompted visualisations — with PandasQueryEngine and LlamaIndex connecting my local iMessage data to a public large language model.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🛠️ The complete notebook is available at &lt;a href="https://github.com/saubury/paranoid_text_LLM/"&gt;https://github.com/saubury/paranoid_text_LLM/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrangling iMessage data with DuckDB 🦆
&lt;/h2&gt;

&lt;p&gt;The first task is to extract my iMessages and transform into a sensible form for data analysis.&lt;/p&gt;

&lt;p&gt;⏩ If you’re not interested in the data engineering step you can jump ahead to reading about generative AI with Pandas and LlamaIndex.&lt;/p&gt;

&lt;p&gt;I’ve &lt;a href="https://towardsdatascience.com/my-very-personal-data-warehouse-fitbit-activity-analysis-with-duckdb-8d1193046133"&gt;written before&lt;/a&gt; about &lt;a href="https://duckdb.org/why_duckdb"&gt;DuckDB&lt;/a&gt; — a lightweight, free yet powerful analytical database that runs locally and streamlines data analysis workflows. I’m using 🦆 DuckDB as a quick way to ingest and pre-process my iMessage history. My first task is to load the &lt;a href="https://duckdb.org/docs/extensions/sqlite_scanner"&gt;SQLite Scanner DuckDB extension&lt;/a&gt; which allows DuckDB to directly read data from a SQLite database such as iMessage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSTALL sqlite_scanner;
LOAD sqlite_scanner;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are on a Mac, logged into your iCloud account, you can copy the local iMessage SQLite database and load it. If you receive the error &lt;em&gt;Operation not permitted&lt;/em&gt;, you may need to run the command in a terminal and accept the prompts to interact with privileged files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; ~/Library/Messages/chat.db ./sql/chat.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With DuckDB, we will open the iMessage SQLite database with the &lt;a href="https://duckdb.org/docs/sql/statements/attach.html"&gt;attach&lt;/a&gt; command. This will open the SQLite database file ./sql/chat.db in the schema namespace chat_sqlite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ATTACH './sql/chat.db' as chat_sqlite (TYPE sqlite,  READ_ONLY TRUE);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load messages into table
&lt;/h3&gt;

&lt;p&gt;We create the chat_messages DuckDB table by joining three tables from the iMessage SQLite database. Within the same query I also want to&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;determine the message time by evaluating the interval (the raw date is the number of nanoseconds since the Apple epoch of &lt;code&gt;2001-01-01&lt;/code&gt;, hence the division by 1,000,000,000)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;extract the phone number country calling code (eg, +1, +61)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;redact phone number like &lt;code&gt;+61412341234&lt;/code&gt; to &lt;code&gt;+614...41234&lt;/code&gt; (for screenshots in this blog)&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;chat_messages&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2001-01-01'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;message_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attributedBody&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_from_me&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_identifier&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="s1"&gt;'+1%'&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="k"&gt;SUBSTRING&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_identifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_identifier&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="s1"&gt;'+%'&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="k"&gt;SUBSTRING&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_identifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;phone_country_calling_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_identifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'^(&lt;/span&gt;&lt;span class="se"&gt;\+&lt;/span&gt;&lt;span class="s1"&gt;[0-9][0-9][0-9])([0-9][0-9][0-9])'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\1&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\3&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;phone_number&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chat_sqlite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;chat_sqlite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_message_join&lt;/span&gt; &lt;span class="n"&gt;cmj&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"ROWID"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cmj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;chat_sqlite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;cmj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"ROWID"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I can peek at the chat_messages DuckDB table by querying it (with good old &lt;code&gt;select *&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UL6pov0q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A4wSsKWBgee8vN8jcy9aCKA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UL6pov0q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A4wSsKWBgee8vN8jcy9aCKA.png" alt="Sample of chat_messages — image by author" width="797" height="284"&gt;&lt;/a&gt;&lt;em&gt;Sample of chat_messages — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I can now send the contents of the chat_messages DuckDB table into a chat_messages_df dataframe with the &lt;code&gt;&amp;lt;&amp;lt;&lt;/code&gt; operator within a SQL magic cell.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;span class="n"&gt;chat_messages_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chat_messages&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;message_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Decoding attributedBody 🪄
&lt;/h3&gt;

&lt;p&gt;The iMessage database has a mixture of encoding formats: older messages are stored as plain text in the text field, while newer messages are encoded in the attributedBody field. Sometime around November 2022 the messages started arriving in the new format, which might be related to a message upgrade in the iOS 16 release. I’m thankful to the &lt;a href="https://github.com/my-other-github-account/imessage_tools/"&gt;iMessage-Tools&lt;/a&gt; project, which had the logic to extract the text content hidden within the attributedBody field. The decode_message utility function extracts the text regardless of format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decode_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;msg_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'attributedBody'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Logic from https://github.com/my-other-github-account/imessage_tools
&lt;/span&gt;  &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;msg_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg_text&lt;/span&gt;
  &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'replace'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;AttributeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"NSNumber"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"NSNumber"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"NSString"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"NSString"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"NSDictionary"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"NSDictionary"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
          &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
          &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg_attributed_body&lt;/span&gt;

  &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s"&gt;'\n'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inline message extraction
&lt;/h3&gt;

&lt;p&gt;We'll use the &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html"&gt;pandas apply()&lt;/a&gt; method to apply the decode_message function to the DataFrame. In short, we’ll set message_text to something readable, regardless of the format the text came in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chat_messages_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'message_text'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat_messages_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decode_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chat_messages_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the messages decoded, I can peek at the first few records.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rvJZbTUu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AUaD9y151KqVhYQEzeUUIEg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rvJZbTUu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AUaD9y151KqVhYQEzeUUIEg.png" alt="Sample of iMessage texts sent and received — image by author" width="800" height="154"&gt;&lt;/a&gt;&lt;em&gt;Sample of iMessage texts sent and received — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With my several thousand iMessages loaded into the chat_messages_df dataframe, I can move on to some prompted analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Talking to my data — Generative AI on Pandas with LlamaIndex 🗣️
&lt;/h2&gt;

&lt;p&gt;I’ll be using &lt;a href="https://www.llamaindex.ai/"&gt;LlamaIndex&lt;/a&gt; — a flexible data framework for connecting custom data sources to large language models. LlamaIndex uses &lt;a href="https://research.ibm.com/blog/retrieval-augmented-generation-RAG"&gt;Retrieval Augmented Generation (RAG)&lt;/a&gt; systems that combine a large language model (such as those provided by OpenAI or Hugging Face) with a private data set (set as my personal copy of iMessages). &lt;/p&gt;

&lt;p&gt;The RAG pipeline retrieves the most relevant context for my query (such as the shape of my data), and passes that to the LLM to generate a response. The response should be the python code necessary to execute on my local data to visualise a result in response to my query. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oCzArTTm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AFRfgDwOBJb1nvZ5Z-IYJrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oCzArTTm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AFRfgDwOBJb1nvZ5Z-IYJrg.png" alt="LlamaIndex — High-Level Concepts" width="618" height="345"&gt;&lt;/a&gt;&lt;em&gt;LlamaIndex — &lt;a href="https://gpt-index.readthedocs.io/en/latest/getting_started/concepts.html"&gt;High-Level Concepts&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  LlamaIndex
&lt;/h3&gt;

&lt;p&gt;I will be using the paid OpenAI service, and have created a &lt;a href="https://platform.openai.com/account/api-keys"&gt;secret API key&lt;/a&gt; which is saved in the notebook.cfg configuration file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;llama_index.query_engine.pandas_query_engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PandasQueryEngine&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;configparser&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConfigParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'notebook.cfg'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;openai_api_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'openai'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'api_token'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;openai_api_token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I can now take my chat_messages_df data frame — and ask a question like “What is the most frequent phone_number?”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PandasQueryEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chat_messages_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"What is the most frequent phone_number?"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generates a small fragment of Python code which, when applied to my data, gives me the value for the most frequently iMessaged user!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Pandas Instructions:
eval("df['phone_number'].value_counts().idxmax()")

+61 412 321915
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🎉 Whoa — that’s pretty amazing. By simply asking the query engine for a goal, the public LLM has generated the correct Python code that when run locally on my data frame gives me the correct answer. My wife will be happy to know she is the most frequently messaged from my phone 😅!&lt;/p&gt;

&lt;p&gt;Let’s have a peek into the traffic to see what’s happening behind the scenes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspecting the network traffic with mitmproxy 🕵️‍♀️
&lt;/h2&gt;

&lt;p&gt;I was curious to see what kinds of requests the code was making to OpenAI and what kind of responses it is getting back. &lt;/p&gt;

&lt;p&gt;⏩ This is an optional step for the paranoid, and if you’re not interested in the network analysis you can skip to prompted visualisations to analyse data.&lt;/p&gt;

&lt;p&gt;I used a &lt;a href="https://earthly.dev/blog/mitmproxy/"&gt;great guide&lt;/a&gt; to get started with &lt;a href="https://mitmproxy.org/"&gt;mitmproxy&lt;/a&gt; to observe to capture encrypted  requests &amp;amp; responses. The short summary is the mitmproxy proxy sits between the local Python code and the internet, to intercept and inspect SSL/TLS-protected traffic.&lt;/p&gt;

&lt;p&gt;To view the traffic, start the mitmweb proxy and set the following environment variables so network traffic passes through the local proxy, signed with a local certificate (which is conveniently created when you first run mitmproxy).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'http_proxy'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://127.0.0.1:8080"&lt;/span&gt; 
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'https_proxy'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://127.0.0.1:8080"&lt;/span&gt; 
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'REQUESTS_CA_BUNDLE'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/Users/saubury/.mitmproxy/mitmproxy-ca-cert.pem"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the proxy established, both network requests and responses are visible in the web dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UznNWZsx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A7aga0Ck3LO1KEBim3nChcQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UznNWZsx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A7aga0Ck3LO1KEBim3nChcQ.png" alt="“Man in the Middle” HTTPS proxy — mitmproxy web page" width="800" height="478"&gt;&lt;/a&gt;&lt;em&gt;“Man in the Middle” HTTPS proxy — mitmproxy web page&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the proxy in place, I can start running some tests. &lt;/p&gt;

&lt;h2&gt;
  
  
  Prompted visualisations with PandasQueryEngine and LlamaIndex 📊
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When do I send and receive texts throughout the day?
&lt;/h3&gt;

&lt;p&gt;Let’s try a query and see what requests and responses are seen by the proxy. I’ll use the &lt;a href="https://gpt-index.readthedocs.io/en/stable/examples/query_engine/pandas_query_engine.html"&gt;PandasQueryEngine&lt;/a&gt; of LlamaIndex to query my iMessage data, and ask for the following visualisation to be created …&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Extract hour of day from message_date. Visualize a distribution of the hour extracted from message_date. Add a title and label the axis. Use colors and add a gap between bars. Colour the bars with an hour of 5 in red and the rest in blue.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;LlamaIndex will compose a request to OpenAI, and I can capture the outward request&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7J1b9V7c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AokAKJegFpUQ_vfX_S1C4eA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7J1b9V7c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AokAKJegFpUQ_vfX_S1C4eA.png" alt="Request sent to OpenAI" width="786" height="437"&gt;&lt;/a&gt;&lt;em&gt;Request sent to OpenAI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It appears that a sample (5 rows) of my data is sent outwards to describe the data. In this case I’m not all that worried, as these are simply dates — but obviously sending a sample of something more personal would concern me more.&lt;/p&gt;

&lt;p&gt;Within a few seconds, LlamaIndex will relay the response and we can peek at the code returned by OpenAI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TQ8Dtsj9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AjRDOqpu-i1OVF8tMstUIhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TQ8Dtsj9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AjRDOqpu-i1OVF8tMstUIhw.png" alt="Response returned with python embedded" width="625" height="532"&gt;&lt;/a&gt;&lt;em&gt;Response returned with python embedded&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The code is then automatically run against my entire local dataset. I was especially impressed the code to extract the “hour” from the timestamp field worked as expected. The result of the generated python code appears exactly as I had asked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HTUwJjap--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AM8VL4j2BZUguANfmTkxHbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HTUwJjap--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AM8VL4j2BZUguANfmTkxHbg.png" alt="Message time distribution by hour of day — image by author" width="800" height="477"&gt;&lt;/a&gt;&lt;em&gt;Message time distribution by hour of day — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🎉 Voilà — the generated code is correct, and running it renders a distribution showing most of my messages are sent between 6am and 9pm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further experiments 🔬
&lt;/h2&gt;

&lt;p&gt;Let’s see a few more examples of data queries, and the payloads which need to be sent to create working python code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Most frequent contacts
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a plot a bar chart showing the frequency of top eight phone_numbers. The X axis labels should be at a 45 degree angle. Use a different colour for each bar&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vDqPe3X4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2904/1%2Anr3z-UFvaK4f-VZYrQCERw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vDqPe3X4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2904/1%2Anr3z-UFvaK4f-VZYrQCERw.png" alt="Most frequent contacts — image by author" width="800" height="323"&gt;&lt;/a&gt;&lt;em&gt;Most frequent contacts — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To build a bar chart of my most frequent contacts, LlamaIndex sent a sample of 5 phone numbers to OpenAI to describe the datatypes expected in the dataframe. The resulting Python code, executed locally on my entire data set, created the correct bar chart. I was impressed that the request to turn the X-axis labels 45 degrees was honoured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Count the frequency of each phone_number
&lt;/span&gt;&lt;span class="n"&gt;phone_number_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'phone_number'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a bar chart
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phone_number_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phone_number_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'red'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'blue'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'green'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'yellow'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'orange'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'purple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'pink'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'brown'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Add a title
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Frequency of Top Eight Phone Numbers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Rotate x-axis labels by 45 degrees
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Show the plot
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Message length
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Visualize a distribution of the length of message_text. Use a logarithmic scale. Add a title and label both axis. Add a space between bars.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Pi5HFsOb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3076/1%2AbwARNVLhJo0q01rU74Ma3A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Pi5HFsOb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3076/1%2AbwARNVLhJo0q01rU74Ma3A.png" alt="Distribution of the length of message text — image by author." width="800" height="315"&gt;&lt;/a&gt;&lt;em&gt;Distribution of the length of message text — image by author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To display a distribution of the length of messages, LlamaIndex sent the contents of 5 sample messages to OpenAI. The resulting Python code, executed locally on my entire data set, used a lambda function to determine each message length. A logarithmic scale was created; however, my prompt to add a space between bars was misinterpreted as a call to plt.tight_layout.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate the length of each message_text
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'message_length'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'message_text'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Create a histogram of the message_length with a logarithmic scale
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'message_length'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edgecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'black'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add a title and label both axes
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Distribution of Message Text Length'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Message Length'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Frequency'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add a space between bars
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Show the plot
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inbound vs. outbound messages
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Visualize a pie chart of the proportion of is_from_me. Label the value 0 as ‘inbound’. Add a percentage rounded to 1 decimal places.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RwzevYzr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2972/1%2AP3pmq6wpWB5K3DbSZxZNiA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RwzevYzr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2972/1%2AP3pmq6wpWB5K3DbSZxZNiA.png" alt="Pie chart of the proportion inbound and outbound messages — image by author" width="800" height="328"&gt;&lt;/a&gt;&lt;em&gt;Pie chart of the proportion inbound and outbound messages — image by author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To build a pie chart showing the proportion of inbound to outbound messages, LlamaIndex simply sent a sample of “is_from_me” boolean records to OpenAI. The resulting Python code, executed locally on my entire data set, created the correct pie chart. I was impressed by the label of &lt;em&gt;outbound&lt;/em&gt; for the value &lt;em&gt;1&lt;/em&gt;, which was a clever inference from me describing the value &lt;em&gt;0&lt;/em&gt; as &lt;em&gt;inbound&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Count the number of occurrences of each value in the 'is_from_me' column
&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'is_from_me'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create a pie chart using the value counts
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pie&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'inbound'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'outbound'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;autopct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'%.1f%%'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Display the pie chart
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To answer the question “can I safely combine my local personal data with a public large language model?” — well, kind of, sort of.&lt;/p&gt;

&lt;p&gt;I set a clear boundary between my personal iMessage data (kept private on my machine) and the public generative models used to build the logic for my data analysis. I am satisfied with the compromises I made: only a handful of records left my network, and they were enough to generate high-quality Python code that quickly and effectively addressed my queries.&lt;/p&gt;

&lt;p&gt;Next time I could use dummy data, inspect the generated code, obfuscate the payloads or run the models locally. In fact — that’s what I might do in a future blog.&lt;/p&gt;

&lt;p&gt;For now, I’m happy with my paranoid analysis of iMessage chats.&lt;/p&gt;

&lt;p&gt;🛠️ The complete notebook is available at &lt;a href="https://github.com/saubury/paranoid_text_LLM/"&gt;https://github.com/saubury/paranoid_text_LLM/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llamaindex</category>
      <category>duckdb</category>
      <category>openai</category>
      <category>llm</category>
    </item>
    <item>
      <title>GenPiCam - Generative AI Camera</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Wed, 28 Jun 2023 11:18:23 +0000</pubDate>
      <link>https://dev.to/saubury/genpicam-generative-ai-camera-160</link>
      <guid>https://dev.to/saubury/genpicam-generative-ai-camera-160</guid>
      <description>&lt;h2&gt;
  
  
  GenPiCam - Generative AI Camera
&lt;/h2&gt;

&lt;p&gt;Generative AI (GenAI) is a type of Artificial Intelligence that can create a wide variety of images, video and text. To accelerate the robot uprising I chained two GenAI models together to build a camera which describes the current scene in words, and then uses a second model to create a newly generated, stylised image. Let me introduce GenPiCam — a Raspberry Pi-based camera that reimagines the world with GenAI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--StVziVFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AeZzfeCJggafmHaYGcjqEDA.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--StVziVFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AeZzfeCJggafmHaYGcjqEDA.gif" alt="Before and after images created by GenPiCam" width="634" height="315"&gt;&lt;/a&gt;&lt;em&gt;Before and after images created by GenPiCam&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The heavy processing and true smarts of this project are handled by &lt;a href="https://www.midjourney.com/"&gt;Midjourney&lt;/a&gt; — an external machine-learning image-generation service. GenPiCam makes use of two Midjourney capabilities&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.midjourney.com/docs/describe"&gt;Describe&lt;/a&gt; which starts with an existing photo and creates a text description prompts for the image. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.midjourney.com/docs/quick-start"&gt;Imagine&lt;/a&gt; which converts natural language prompts into images&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Between these two steps I allow for a level of creative input, so the GenPiCam camera has a dial to tweak the style of the final image. This essentially becomes a filter, adding an “anime”, “pop-art” or “futuristic” influence to the generated image.&lt;/p&gt;

&lt;h2&gt;
  
  
  I’m bored — can I get a video?
&lt;/h2&gt;

&lt;p&gt;Sure — here’s the 2-minute summary&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/qqwRXybdNeo"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The “photographic” process
&lt;/h2&gt;

&lt;p&gt;The initial photo is taken with a Raspberry Pi Camera Module. An external camera shutter (a pushbutton connected to the Raspberry Pi GPIO pins) takes a still image when pushed and saves the photo as a JPEG. A minimal sketch of this trigger loop follows.&lt;/p&gt;
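
&lt;p&gt;Here is that trigger loop as a rough sketch (the pin number and file naming are illustrative assumptions; the real code lives in the project repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime
from signal import pause

from gpiozero import Button
from picamera import PiCamera

BUTTON_PIN = 21  # assumption: BCM pin wired to the shutter pushbutton

camera = PiCamera()
shutter = Button(BUTTON_PIN)

def take_photo():
    # Save a timestamped still image as a JPEG
    filename = datetime.now().strftime('photo-%Y%m%d-%H%M%S.jpg')
    camera.capture(filename)

shutter.when_pressed = take_photo
pause()  # block forever, waiting for button presses
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;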

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5vngkkzl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2220/1%2AuCIGwO2l4j-IqjzDYgWHKg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5vngkkzl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2220/1%2AuCIGwO2l4j-IqjzDYgWHKg.png" alt="Taking still images of wildlife in the garden" width="800" height="508"&gt;&lt;/a&gt;&lt;em&gt;Taking still images of wildlife in the garden&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The photo is uploaded to Midjourney, which turns the photo into text prompts describing the image. For the curious, I’m using some very inelegant bot interactions with PyAutoGUI to control the mouse and keyboard (as there’s no API) — let &lt;a href="https://github.com/saubury/GenPiCam/blob/main/midjourney.py"&gt;this&lt;/a&gt; be an example of code you should never write.&lt;/p&gt;
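
&lt;p&gt;For a flavour of just how crude this is, here is a hedged sketch (the screen coordinates are made up; the real interactions are in midjourney.py): the automation boils down to clicking the Discord message box and typing at it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import pyautogui

MESSAGE_BOX = (800, 1000)  # assumption: screen position of the Discord message box

pyautogui.click(*MESSAGE_BOX)               # focus the message box
pyautogui.write('/describe', interval=0.1)  # slowly type the bot command
pyautogui.press('enter')                    # select the slash command
time.sleep(1)                               # give Discord time to open the upload dialog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;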

&lt;p&gt;Midjourney’s describe tool takes an image as input, then generates text prompts. This is a pretty clever service, reversing the usual “text to image” process: it starts with the photo and extracts text to describe the essence of the image. Here is Snowy, but Midjourney has a much more expressive description.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Axg5X5Gq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Acs7FHVNM9fxPCWNtb2VjxQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Axg5X5Gq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Acs7FHVNM9fxPCWNtb2VjxQ.png" alt="Snowy the cat — laying on bed under yellow blanket …" width="580" height="397"&gt;&lt;/a&gt;&lt;em&gt;Snowy the cat — laying on bed under yellow blanket …&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;black cat laying on bed under yellow blanket, in the style of berrypunk, irridescent, glimmering, unpolished, symmetrical, rounded, chinapunk — ar 4:3 &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The describe function actually returns four descriptions based on the image, but GenPiCam arbitrarily selects the first description.&lt;/p&gt;

&lt;p&gt;Now for the fun part. We can take that text prompt and use it to create a brand new image with Generative AI via a new call to Midjourney imagine. Here is an image generated from the previous text prompt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CBj6CUJD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AWgf5DYmYxEaVlBGksP4BSQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CBj6CUJD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AWgf5DYmYxEaVlBGksP4BSQ.png" alt="Midjouney imagine generated image from text prompt " width="416" height="339"&gt;&lt;/a&gt;*Midjouney imagine generated image from text prompt *&lt;/p&gt;

&lt;p&gt;GenPiCam has a selection switch to update the prompt with stylistic instructions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UbwuZaXa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2152/1%2AGTWo9YVRBa9J7Z5tjjbRAg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UbwuZaXa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2152/1%2AGTWo9YVRBa9J7Z5tjjbRAg.png" alt="Scene selector" width="800" height="462"&gt;&lt;/a&gt;&lt;em&gt;Scene selector&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a 12-way rotary switch connected to the Raspberry Pi GPIO pins. By reading the current “artistic selection” GenPiCam will add a prefix such as “&lt;strong&gt;retro pop art-style illustration&lt;/strong&gt;” to the text prompt (see the sketch after this list). A few of the other style prompts include&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Anime style &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hyper Realistic, whimsical with colourful hat and balloons, &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blurry brushstrokes,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Futuristic, in a space station, hyper realistic&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
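
&lt;p&gt;Here is a sketch of how the style dial can be read. The pin numbers, and the assumption that each wired switch position grounds its own GPIO pin, are illustrative; the real mapping lives in the project code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from gpiozero import Button

# Assumption: each wired rotary-switch position pulls its own BCM pin low
STYLE_BY_PIN = {
    5: 'retro pop art-style illustration of ',
    6: 'Anime style ',
    13: 'Hyper Realistic, whimsical with colourful hat and balloons, ',
    19: 'Futuristic, in a space station, hyper realistic ',
}

positions = {pin: Button(pin) for pin in STYLE_BY_PIN}

def style_prefix():
    # Return the style text for whichever dial position is currently selected
    for pin, contact in positions.items():
        if contact.is_pressed:
            return STYLE_BY_PIN[pin]
    return ''  # unwired position: leave the prompt unstyled

describe_text = 'black cat laying on bed under yellow blanket'  # from describe
prompt = style_prefix() + describe_text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;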

&lt;p&gt;Let’s see the before and after “pop-art” images for Snowy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T9IuKSME--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AH7_sXWoV0vkx4nLuWdpPqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T9IuKSME--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AH7_sXWoV0vkx4nLuWdpPqg.png" alt="Final image with before and after photos along with text prompt " width="640" height="380"&gt;&lt;/a&gt;*Final image with before and after photos along with text prompt *&lt;/p&gt;

&lt;p&gt;The final image is created using the &lt;a href="https://github.com/python-pillow/Pillow/"&gt;Pillow&lt;/a&gt; Python imaging library (a compositing sketch follows the list below), and is composed of&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Initial photo taken by the Raspberry Pi camera module, resized on the left&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Final Midjouney image — the first of four images is selected, composited to the right&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Text prompt — against a coloured background and icon signifying the style mode&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
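
&lt;p&gt;A simplified Pillow sketch of that composition, where the file names, sizes and banner text are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from PIL import Image, ImageDraw

# Placeholders: the camera and Midjourney steps produce the real files
photo = Image.open('photo.jpg').resize((640, 480))
generated = Image.open('midjourney.png').resize((640, 480))

# Two images side by side, with a coloured banner underneath for the prompt
canvas = Image.new('RGB', (1280, 560), color='navy')
canvas.paste(photo, (0, 0))
canvas.paste(generated, (640, 0))

draw = ImageDraw.Draw(canvas)
draw.text((10, 500), 'black cat laying on bed under yellow blanket ...', fill='white')

canvas.save('final.jpg')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;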

&lt;p&gt;Here’s the same process, but adding the text &lt;em&gt;“Hyper Realistic, whimsical with colourful hat and balloons”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uItomsc_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ACenkY6lvmq-FyfLWo7rG2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uItomsc_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ACenkY6lvmq-FyfLWo7rG2g.png" alt="" width="635" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even though the image on the right is a creation from Generative AI, there’s still a sense of disappointment coming through Snowy’s judgmental eyes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generative AI Images — Learnings
&lt;/h2&gt;

&lt;p&gt;I had so much fun building the GenPiCam camera — and this was an interesting path for exploring prompt engineering for Generative AI. The better photos were the ones that had a simple composition — essentially images that were easy to put words to. For example, this scene is easy to describe with a colour and definitive objects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SlsU0sYg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A5kpaw5iXpMk2CMFBLMN3gQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SlsU0sYg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A5kpaw5iXpMk2CMFBLMN3gQ.png" alt="A green stuffed animal and white keyboard" width="641" height="347"&gt;&lt;/a&gt;&lt;em&gt;A green stuffed animal and white keyboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, there were some very strange results while describing more unique scenes. I found the description of a classic Australian clothes line created an unusual image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e7t15fq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AOnKYcml6NPA3COYVfekaaA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e7t15fq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AOnKYcml6NPA3COYVfekaaA.png" alt="Australian cloths line" width="638" height="377"&gt;&lt;/a&gt;&lt;em&gt;Australian cloths line&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One of my favourite reimagined images was the identification of my laser mouse. It turns out “laser mouse” has multiple meanings, leading to a striking result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--35Y2XFcJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Aw2hHXhXVBFtYKTZkwD9rLw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--35Y2XFcJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Aw2hHXhXVBFtYKTZkwD9rLw.png" alt="Laser mouse" width="638" height="346"&gt;&lt;/a&gt;&lt;em&gt;Laser mouse&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The hardware
&lt;/h2&gt;

&lt;p&gt;The least stylish part of GenPiCam is the hardware, which I hastily assembled. If you want to build your own reality-distorting camera, you’ll need the following.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.raspberrypi.com/products/raspberry-pi-4-model-b/"&gt;RaspberryPi 4&lt;/a&gt; running &lt;a href="https://www.raspberrypi.com/software/"&gt;Raspberry Pi OS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.raspberrypi.com/products/camera-module-v2/"&gt;Raspberry Pi camera module v2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.amazon.com.au/dp/B0BPP6MFFJ?ref_=pe_19115062_429603572_302_E_DDE_dt_1"&gt;Touchscreen Monitor for Raspberry Pi&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.jaycar.com.au/1-pole-sealed-pcb-rotary/p/SR1210"&gt;12 way PCB rotary switch&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.jaycar.com.au/pushbutton-push-on-momentary-spst-red-actuator/p/SP0716"&gt;Pushbutton momentary&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.jaycar.com.au/sealed-polycarbonate-enclosure-171-x-121-x-55/p/HB6218"&gt;Polycarbonate enclosure&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rechargeable battery pack&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ec2lJG2x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2138/1%2ABDZTQ67nDtfFOc05IjWtFg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ec2lJG2x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2138/1%2ABDZTQ67nDtfFOc05IjWtFg.png" alt="The inner workings of GenPiCam" width="800" height="484"&gt;&lt;/a&gt;&lt;em&gt;The inner workings of GenPiCam&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It isn’t the most beautiful of builds — but I’ll just excuse this as being highly functional.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XuR9k0gh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AHA1htMfd5GjSVYvXvZddhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XuR9k0gh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AHA1htMfd5GjSVYvXvZddhw.png" alt="Boot image for GenPiCam camera" width="800" height="517"&gt;&lt;/a&gt;&lt;em&gt;Boot image for GenPiCam camera&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary, code &amp;amp; credits
&lt;/h2&gt;

&lt;p&gt;The GenPiCam has been a fun way to explore Generative AI, transforming photos into stylised (and sometimes surprising) images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vaSSDHCA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AiWYVYr9B0641ZY5lRG-17w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vaSSDHCA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AiWYVYr9B0641ZY5lRG-17w.png" alt="Photo of author on the left — and a stylised version of Simon on the right" width="639" height="318"&gt;&lt;/a&gt;&lt;em&gt;Photo of author on the left — and a stylised version of Simon on the right&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Credits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://twitter.com/nletcher"&gt;Ned Letcher&lt;/a&gt; — who first got me inspired by showing off the Midjourney describe functionality and provided the concept of recreating images&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://medium.com/@neonforge/how-to-create-a-discord-bot-to-download-midjourney-images-automatically-python-step-by-step-guide-3e76d3282871"&gt;How to Create a Discord Bot to Download Midjourney Images&lt;/a&gt; by Michael King — A great write up showing Python automation for interacting  with Midjourney along with Discord bot configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.midjourney.com/docs/command-list"&gt;Midjourney&lt;/a&gt; — Midjourney command syntax for bot channels&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://discordpy.readthedocs.io/en/stable/"&gt;discord.py&lt;/a&gt; — Python API wrapper for Discord.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/saubury/GenPiCam"&gt;https://github.com/saubury/GenPiCam&lt;/a&gt;&lt;/p&gt;

</description>
      <category>generativeai</category>
      <category>raspberrypi</category>
      <category>python</category>
    </item>
    <item>
      <title>My (very) personal data warehouse — Fitbit activity analysis with DuckDB</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Thu, 01 Jun 2023 04:37:19 +0000</pubDate>
      <link>https://dev.to/saubury/my-very-personal-data-warehouse-fitbit-activity-analysis-with-duckdb-426l</link>
      <guid>https://dev.to/saubury/my-very-personal-data-warehouse-fitbit-activity-analysis-with-duckdb-426l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Wearable fitness trackers have become an integral part of our lives, collecting and tracking data about our daily activities, sleep patterns, location, heart rate, and much more. I’ve been using a Fitbit device for 6 years to monitor my health. However, I have always found the data analysis capabilities lacking — especially when I wanted to track my progress against long term fitness goals. What insights are buried within my archive of personal fitness activity data? To start exploring I needed a good approach for performing data analysis over thousands of poorly documented JSON and CSV files … extra points for analysis that doesn’t require my data to leave my laptop.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Enter &lt;a href="https://duckdb.org/why_duckdb" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; — a lightweight, free yet powerful analytical database designed to streamline data analysis workflows — that runs locally. In this blog post, I want to use DuckDB to explore my Fitbit data achieve and share the approach for analysing a variety of data formats and charting my health and fitness goals with the help of &lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt; data visualisations.&lt;/p&gt;

&lt;h1&gt;
  
  
  Export Fitbit data archive
&lt;/h1&gt;

&lt;p&gt;Firstly, I needed to get hold of all of my historic fitness data. Fitbit make it fairly easy to export your Fitbit data for the lifetime of your account by following the instructions at &lt;a href="https://www.fitbit.com/settings/data/export" rel="noopener noreferrer"&gt;export your account archive&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2ADfb-hKfZm4d0cYzhpTGnjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2ADfb-hKfZm4d0cYzhpTGnjg.png"&gt;&lt;/a&gt;&lt;br&gt;
Instructions for using the export Fitbit data archive — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;You’ll need to confirm your request … and be patient. My archive took over three days to create — but I finally received an email with instructions to download a ZIP file containing my Fitbit data. This file should contain all the personal fitness activity recorded by my Fitbit and associated services. Unzipping the archive reveals a huge collection of files — mine, for example, contained 7,921 files once I unzipped the 79MB archive.&lt;/p&gt;
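
&lt;p&gt;A quick way to confirm the scale of what you’re dealing with (the folder name is whatever your export unzips to):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

# Count the files in the unzipped Fitbit export
export = Path('./MyFitbitData')
print(sum(1 for f in export.rglob('*') if f.is_file()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;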

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2As4F47jMXdtl-paemZi17-g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2As4F47jMXdtl-paemZi17-g.png"&gt;&lt;/a&gt;&lt;br&gt;
A small sample of the thousands of nested files — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;Let’s start looking at the variety of data available in the archive.&lt;/p&gt;
&lt;h1&gt;
  
  
  Why DuckDB?
&lt;/h1&gt;

&lt;p&gt;There are many great blogs (&lt;a href="https://betterprogramming.pub/duckdb-whats-the-hype-about-5d46aaa73196" rel="noopener noreferrer"&gt;1&lt;/a&gt;,&lt;a href="https://mattpalmer.io/posts/whats-the-hype-duckdb/" rel="noopener noreferrer"&gt;2&lt;/a&gt;,&lt;a href="https://towardsdatascience.com/a-serverless-query-engine-from-spare-parts-bd6320f10353" rel="noopener noreferrer"&gt;3&lt;/a&gt;) describing DuckDB — the &lt;a href="https://www.dictionary.com/browse/tl-dr" rel="noopener noreferrer"&gt;TL;DR&lt;/a&gt; summary is that DuckDB is an open-source in-process OLAP database built specifically for analytical queries. It runs locally, has extensive SQL support, and can run queries directly on Pandas dataframes, Parquet files and JSON data. Extra points for its seamless integration with Python and R. The fact that it’s insanely fast and does (mostly) all of its processing in memory makes it a good choice for building my personal data warehouse.&lt;/p&gt;
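
&lt;p&gt;To give a feel for that workflow, here is a toy sketch (the dataframe is invented for illustration): DuckDB can query a local Pandas dataframe by name, with no loading step required.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb
import pandas as pd

df = pd.DataFrame({'activity': ['Walk', 'Run', 'Walk'], 'minutes': [30, 20, 45]})

# DuckDB resolves the name `df` straight from the local Python scope
duckdb.query('select activity, sum(minutes) from df group by activity').show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
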
&lt;h1&gt;
  
  
  Fitbit activity data
&lt;/h1&gt;

&lt;p&gt;The first collection of files I looked at was activity data. Physical Activity and broad exercise information appears to be stored in numbered files such as &lt;code&gt;Physical Activity/exercise-1700.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I couldn’t work out what the file numbering actually meant; my guess is they are just increasing integers for a collection of exercise files. In my data export the earliest files started at 0 and went to file number 1700 over a 6-year period. Inside is an array of records, each with a description of an activity. The record seems to change depending on the activity — here is an example of a “walk”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"activityName"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Walk"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"averageHeartRate"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;79&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"calories"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;122&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1280000&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"steps"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1548&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"startTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/06/23 01:08:57"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"elevationGain"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;67.056&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"hasGps"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"activityLevel"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sedentary"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lightly"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fairly"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"very"&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This physical activity data is one of the 7,921 files now on my laptop. Fortunately, DuckDB can read (and auto-detect the schema of) JSON files using the &lt;a href="https://duckdb.org/docs/data/json/overview.html#read_json_auto-function" rel="noopener noreferrer"&gt;read_json&lt;/a&gt; function, allowing me to load all of the exercise files into the &lt;code&gt;physical_activity&lt;/code&gt; table using a single SQL statement. It’s worth noting I needed to specify the date format mask, as the Fitbit export has a very &lt;a href="https://en.wikipedia.org/wiki/Date_and_time_notation_in_the_United_States" rel="noopener noreferrer"&gt;American-style date&lt;/a&gt; format 😕.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;physical_activity&lt;/span&gt;  
&lt;span class="k"&gt;as&lt;/span&gt;  
&lt;span class="k"&gt;SELECT&lt;/span&gt;   
  &lt;span class="n"&gt;startTime&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;activityTime&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activityName&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activityLevel&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;averageHeartRate&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calories&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;duration_minutes&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distanceUnit&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tcxLink&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;  
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'./Physical Activity/exercise-*.json'&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'array'&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestampformat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'%m/%d/%y %H:%M:%S'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This SQL command reads the physical activity data from disk, shifts the start time into my local timezone, converts the duration from milliseconds to minutes, and loads the result into an in-memory DuckDB table.&lt;/p&gt;
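
&lt;p&gt;A couple of sanity queries are worth running at this point. This sketch assumes a &lt;code&gt;con&lt;/code&gt; DuckDB connection object; adapt to however your notebook issues SQL.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

# Assumption: the same in-memory connection that ran the CREATE TABLE above
con = duckdb.connect()
con.sql('select count(*) as row_count from physical_activity').show()
con.sql('select min(activityTime), max(activityTime) from physical_activity').show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;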

&lt;h1&gt;
  
  
  Load Physical Activity data into data frame
&lt;/h1&gt;

&lt;p&gt;I wanted to understand how I was spending my time each month. As the activity data is stored at a very granular level, I used the DuckDB SQL &lt;a href="https://duckdb.org/docs/sql/functions/timestamp.html" rel="noopener noreferrer"&gt;time_bucket&lt;/a&gt; function to truncate the &lt;em&gt;activityTime&lt;/em&gt; timestamp into monthly buckets. Loading the grouped physical activity data into a data frame can be accomplished with this aggregate SQL, with the query results directed into a Pandas dataframe via the &lt;code&gt;&amp;lt;&amp;lt;&lt;/code&gt; operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;activity_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;  
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activityTime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;activity_day&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activityName&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration_minutes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;  
  &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;physical_activity&lt;/span&gt;  
  &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;activityTime&lt;/span&gt; &lt;span class="k"&gt;between&lt;/span&gt; &lt;span class="s1"&gt;'2022-09-01'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="s1"&gt;'2023-05-01'&lt;/span&gt;  
  &lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  
  &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single SQL query groups my activity data (bike, walk, run, etc.) into monthly buckets and allows me to honestly reflect on how much time I was devoting to physical activity.&lt;/p&gt;

&lt;h1&gt;
  
  
  Plot Monthly Activity Minutes
&lt;/h1&gt;

&lt;p&gt;I now want to explore my activity data visually — so let’s take the Fitbit data and produce some statistical graphics. I’m going to use the Python &lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt; data visualisation library to create a bar plot of the monthly activity minutes directly from the &lt;em&gt;activity_df&lt;/em&gt; dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;matplotlib.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DateFormatter&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;activity_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;activity_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;activityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;myplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Month of&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Duration (min)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Monthly Activity Minutes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upper right&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Activity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Executing this against the loaded activity data creates this bar plot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AQ_wg63Ds0LYqfQLpq2VBKQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AQ_wg63Ds0LYqfQLpq2VBKQ.png"&gt;&lt;/a&gt;&lt;br&gt;
Workout activity breakdown — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;It looks like my primary activity continues to be walking, and my New Year’s resolution to run more often in 2023 hasn’t actually happened (yet?).&lt;/p&gt;
&lt;h1&gt;
  
  
  Sleep
&lt;/h1&gt;

&lt;p&gt;About &lt;a href="https://www.health.harvard.edu/heart-health/are-you-getting-enough-sleep" rel="noopener noreferrer"&gt;one in three adults doesn’t get enough sleep&lt;/a&gt;, so I wanted to explore my long term sleeping patterns. In my Fitbit archive sleep data appears to be recorded in dated files such as &lt;code&gt;Sleep/sleep-2022-12-28.json&lt;/code&gt;. Each file holds a months worth of data, but confusingly is dated for the month before the event. For example, the file &lt;code&gt;sleep-2022-12-28.json&lt;/code&gt; appears to have data for January spanning the dates 2023-01-02 to 2023-01-27. Anyway — file naming weirdness aside we can explore the contents of the file. Within the record is an extended “levels” block with a breakdown of sleep type (wake, light, REM, deep)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"logId"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;39958970367&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"startTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2023-01-26T22:47:30.000"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;26040000&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"levels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   
    &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"light"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;275&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"rem"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"wake"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"deep"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If I look at some of the older files (possibly created with my older Fitbit Surge device) there is a different breakdown of sleep type (restless, awake, asleep).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"logId"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18841054316&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"startTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2018-07-12T22:42:00.000"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;25440000&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"levels"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"restless"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"awake"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"asleep"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nl"&gt;"minutes"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;399&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Regardless of the schema, we can use the &lt;a href="https://duckdb.org/docs/extensions/json.html" rel="noopener noreferrer"&gt;DuckDB JSON&lt;/a&gt; reader to read the records into a single table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sleep_log&lt;/span&gt;  
&lt;span class="k"&gt;as&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;dateOfSleep&lt;/span&gt;   
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;levels&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'./Sleep/sleep*.json'&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dateOfSleep&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'DATE'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'JSON'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'array'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Schema changes for sleep data
&lt;/h1&gt;

&lt;p&gt;I wanted to process all of my sleep data, and handle the apparent schema change in the way sleep is recorded (most likely as I changed models of Fitbit devices). Some of the records have time recorded against &lt;code&gt;$.awake&lt;/code&gt;, which is similar (but not identical) to &lt;code&gt;$.wake&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I used the SQL &lt;a href="https://duckdb.org/docs/sql/functions/utility.html" rel="noopener noreferrer"&gt;coalesce&lt;/a&gt; function (which returns the first expression that evaluates to a non-NULL value) to combine similar types of sleep stage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;sleep_log_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;  
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;dateOfSleep&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.awake.minutes'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.wake.minutes'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_wake&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.deep.minutes'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.asleep.minutes'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_deep&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.light.minutes'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.restless.minutes'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_light&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.summary.rem.minutes'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_rem&lt;/span&gt;  
  &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sleep_log&lt;/span&gt;  
  &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;dateOfSleep&lt;/span&gt; &lt;span class="k"&gt;between&lt;/span&gt; &lt;span class="s1"&gt;'2023-04-01'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="s1"&gt;'2023-04-30'&lt;/span&gt;  
  &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With DuckDB I can query with &lt;a href="https://duckdb.org/docs/extensions/json.html#json-extraction-functions" rel="noopener noreferrer"&gt;json_extract&lt;/a&gt; to pull the duration of each stage out of the nested JSON, generating a &lt;em&gt;sleep_log_df&lt;/em&gt; dataframe with all of the historic sleep stages combined.&lt;/p&gt;
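
&lt;p&gt;As a quick sanity check on the schema drift, it’s worth confirming the two variants really do come from different eras of data. This is only a sketch using the DuckDB Python API, assuming it runs in the same session where the &lt;em&gt;sleep_log&lt;/em&gt; table was created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

# Sketch: count records per schema variant; json_extract returns NULL
# for a missing path, and count() skips NULLs (assumes the sleep_log
# table exists in the current DuckDB session)
print(duckdb.sql("""
    select count(json_extract(levels, '$.summary.awake.minutes')) as has_awake
    ,      count(json_extract(levels, '$.summary.wake.minutes'))  as has_wake
    from sleep_log
""").fetchall())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;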

&lt;h1&gt;
  
  
  Plot sleep activity
&lt;/h1&gt;

&lt;p&gt;We can now take the daily sleep logs and produce a stacked bar plot showing the nightly breakdown of time spent awake and in light, deep and &lt;a href="https://en.wikipedia.org/wiki/Rapid_eye_movement_sleep" rel="noopener noreferrer"&gt;REM&lt;/a&gt; sleep.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.dates&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mdates&lt;/span&gt;  

&lt;span class="c1"&gt;#create stacked bar chart  
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sleep_log_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dateOfSleep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stacked&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chocolate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;palegreen&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;green&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;darkblue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
&lt;span class="n"&gt;myplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Duration (min)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sleep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xaxis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_major_locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mdates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DayLocator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upper right&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Awake&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Deep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Light&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;REM&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;   
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loading a month of sleep data allows me to create a broader analysis of sleep duration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A-JoSLkEtLlWMQL-005pgsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A-JoSLkEtLlWMQL-005pgsg.png"&gt;&lt;/a&gt;&lt;br&gt;
Sleep cycle duration each night — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;The ability to graph multiple nights of sleep together on a single plot allows me to start understanding how days of the week and cyclic events affect the duration and quality of my sleep.&lt;/p&gt;
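
&lt;p&gt;For example, a rough day-of-week breakdown falls straight out of the dataframe. A minimal sketch, assuming &lt;em&gt;sleep_log_df&lt;/em&gt; is (or has been converted to) a pandas DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Sketch: average minutes per sleep stage for each day of the week
df = sleep_log_df.copy()
df['dateOfSleep'] = pd.to_datetime(df['dateOfSleep'])
df['weekday'] = df['dateOfSleep'].dt.day_name()

stage_cols = ['min_wake', 'min_deep', 'min_light', 'min_rem']
print(df.groupby('weekday')[stage_cols].mean().round(1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;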
&lt;h1&gt;
  
  
  Heart rate
&lt;/h1&gt;

&lt;p&gt;Heart rate is captured very frequently (every 10–15 seconds) in daily files named like &lt;code&gt;Physical Activity/heart_rate-2023-01-26.json&lt;/code&gt;. These files are really big — each day has around 70,000 records — all wrapped in a single JSON array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[{{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/25 13:00:07"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/25 13:00:22"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/25 13:00:37"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/26/26 12:59:57"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My theory here is that the file name represents the local date of the user, while the timestamps inside are GMT. For example, in my timezone (GMT+11) the file named &lt;code&gt;heart_rate-2023-01-26.json&lt;/code&gt; covers the 26th from 00:00 to 23:59 local time - which makes logical sense if the dates within the files are in GMT.&lt;/p&gt;
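
&lt;p&gt;One cheap way to test this theory is to peek at the first and last records in a single file. A sketch, reusing the file name from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Sketch: for a GMT+11 user, the file named for Jan 26 should span
# roughly 13:00 GMT on Jan 25 through 12:59 GMT on Jan 26
path = 'MyFitbitData/SimonAubury/Physical Activity/heart_rate-2023-01-26.json'
with open(path) as f:
    records = json.load(f)

print(records[0]['dateTime'], '...', records[-1]['dateTime'])
# expected output: 01/25/23 13:00:07 ... 01/26/23 12:59:57
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;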

&lt;h1&gt;
  
  
  Transform JSON files
&lt;/h1&gt;

&lt;p&gt;Up to now I’ve managed to process my Fitbit data as-is with DuckDB’s built-in functions. However, I hit a problem when trying to process these enormous heart rate files. DuckDB gave me this error when trying to process a large array of records in a JSON file:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(duckdb.InvalidInputException) “INTERNAL Error: Unexpected yyjson tag in ValTypeToString”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I think this error message is an abrupt way of telling me it’s unreasonable to expect a JSON array to have so many elements. The fix was to pre-process each file so it was no longer one giant array of JSON records, but instead newline-delimited JSON, or &lt;a href="http://ndjson.org/" rel="noopener noreferrer"&gt;ndjson&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/23 13:00:07"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/23 13:00:22"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/25/23 13:00:37"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01/26/23 12:59:57"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"bpm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To transform heart rate &lt;em&gt;array_of_records&lt;/em&gt; into newline-delimited JSON I used a sneaky bit of Python to convert each file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ndjson&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;json_src_file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MyFitbitData/SimonAubury/Physical Activity/steps-*.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MyFitbitData/SimonAubury/Physical Activity/heart_rate-*.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
  &lt;span class="n"&gt;json_dst_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\.[a-z]*$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.ndjson&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json_src_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json_src_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; --&amp;gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json_dst_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_src_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f_json_src_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f_json_src_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_dst_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;ndjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This finds each &lt;em&gt;.json&lt;/em&gt; file, converts its contents into newline-delimited JSON, and writes a new file with the extension &lt;em&gt;.ndjson&lt;/em&gt;. An array of 70,000 records becomes a file with 70,000 lines, with each JSON record now stored on its own line.&lt;/p&gt;
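
&lt;p&gt;A cheap sanity check on the conversion is to compare record counts before and after. A sketch, using one file name from earlier as the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Sketch: the line count of the .ndjson file should equal the record
# count of the original .json array
src = 'MyFitbitData/SimonAubury/Physical Activity/heart_rate-2023-01-26.json'
dst = src.replace('.json', '.ndjson')

with open(src) as f:
    n_records = len(json.load(f))
with open(dst) as f:
    n_lines = sum(1 for _ in f)

print(n_records, n_lines)  # the two numbers should match
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;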

&lt;h1&gt;
  
  
  Load heart rate data into table
&lt;/h1&gt;

&lt;p&gt;With the newly converted &lt;em&gt;ndjson&lt;/em&gt; files, I’m now ready to load heart rate data into a DuckDB table. Note the use of &lt;code&gt;timestampformat='%m/%d/%y %H:%M:%S'&lt;/code&gt; to describe the leading month in the dates (for example &lt;em&gt;"01/25/23 13:00:07"&lt;/em&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;heart_rate&lt;/span&gt;  
&lt;span class="k"&gt;as&lt;/span&gt;  
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;dateTime&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hr_date_time&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'$.bpm'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bpm&lt;/span&gt;  
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'./Physical Activity/*.ndjson'&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;dateTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'TIMESTAMP'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'JSON'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'newline_delimited'&lt;/span&gt;  
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestampformat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'%m/%d/%y %H:%M:%S'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can load all the .ndjson files by setting the format to ’newline_delimited’. Note we can extract the BPM (beats per minute) with JSON extraction and cast it to an integer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2A07nRQIvy5Dw6z4RYlarVRg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2A07nRQIvy5Dw6z4RYlarVRg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DuckDB is blazing fast at processing JSON — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;It’s worth highlighting here how insanely fast DuckDB is — it took only 2.8 seconds to load 12 million records!&lt;/p&gt;
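
&lt;p&gt;If you want to reproduce the timing yourself, a rough harness looks like this (a sketch using the DuckDB Python API rather than the notebook SQL magic, reusing the load statement from above; the timing will of course vary by machine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import duckdb

# Sketch: time the heart rate load, reusing the same statement as above
load_statement = """
CREATE OR REPLACE TABLE heart_rate as
SELECT dateTime + INTERVAL 11 hours as hr_date_time
, cast(value-&gt;'$.bpm' as integer) as bpm
FROM read_json('./Physical Activity/*.ndjson'
, columns={dateTime: 'TIMESTAMP', value: 'JSON'}
, format='newline_delimited'
, timestampformat='%m/%d/%y %H:%M:%S');
"""

start = time.perf_counter()
duckdb.sql(load_statement)
elapsed = time.perf_counter() - start

rows = duckdb.sql('select count(*) from heart_rate').fetchone()[0]
print(f'Loaded {rows:,} rows in {elapsed:.1f} seconds')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;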

&lt;h1&gt;
  
  
  Load heart rate into data frame
&lt;/h1&gt;

&lt;p&gt;With 12 million heart rate measurements loaded, let’s load a single day’s worth of data into a data frame for the 21st of May.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;hr_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;   
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 minutes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hr_date_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;created_day&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bpm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bpm_min&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bpm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bpm_avg&lt;/span&gt;  
  &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bpm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bpm_max&lt;/span&gt;  
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;heart_rate&lt;/span&gt;  
  &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;hr_date_time&lt;/span&gt; &lt;span class="k"&gt;between&lt;/span&gt; &lt;span class="s1"&gt;'2023-05-21 00:00'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="s1"&gt;'2023-05-21 23:59'&lt;/span&gt;  
  &lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This DuckDB query aggregates the heart rate into 1-minute time buckets, recording the minimum, average and maximum BPM within each period.&lt;/p&gt;

&lt;h1&gt;
  
  
  Plot Heart rate
&lt;/h1&gt;

&lt;p&gt;I can plot the heart rate using a chart like this (and also show off that I actually did go for a run at 6am)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;matplotlib.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DateFormatter&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hr_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bpm_min&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hr_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bpm_avg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hr_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bpm_max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;myFmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DateFormatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%H:%M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xaxis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_major_formatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;myFmt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Time of day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Heart BPM&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Heart rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A7jo_2M-VKrq7MgWVhFBRQg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A7jo_2M-VKrq7MgWVhFBRQg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Heart rate over a day — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;Exploring heart rate with fine granularity allows me to track my fitness goals — especially if I stick with my regular running routine.&lt;/p&gt;
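
&lt;p&gt;A related long-term signal is resting heart rate. As a sketch, treating the daily minimum BPM as a rough proxy for resting heart rate (and assuming the &lt;em&gt;heart_rate&lt;/em&gt; table is available in the current DuckDB session):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

# Sketch: approximate daily resting heart rate as the minimum BPM
# recorded each day
resting_df = duckdb.sql("""
    select cast(time_bucket(interval '1 day', hr_date_time) as date) as day
    ,      min(bpm) as resting_bpm
    from heart_rate
    group by 1
    order by 1
""").df()

print(resting_df.tail())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;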

&lt;h1&gt;
  
  
  Steps
&lt;/h1&gt;

&lt;p&gt;Steps are recorded in daily files named &lt;code&gt;Physical Activity/steps-2023-02-26.json&lt;/code&gt;. This appears to be a fine-grained count of steps during periodic blocks (every 5 to 10 minutes) throughout the day&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"02/25/23 13:17:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;},{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"02/25/23 13:52:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;},{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"02/25/23 14:00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;},{&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;},{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"dateTime"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"03/24/23 08:45:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"15"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To aggregate the steps into daily counts I needed to convert GMT into my local timezone (GMT+11)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;steps_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dateTime&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt;  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;activity_day&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'MyFitbitData/SimonAubury/Physical Activity/steps-2023-02-26.ndjson'&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auto_detect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'newline_delimited'&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestampformat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'%m/%d/%y %H:%M:%S'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aggregating the number of daily steps into the &lt;em&gt;steps_df&lt;/em&gt; dataframe allows me to explore the longer term activity trends as I attempt to exceed 10,000 steps to realise the &lt;a href="https://www.10000steps.org.au/articles/healthy-lifestyles/health-check-do-we-really-need-take-10000-steps-day/" rel="noopener noreferrer"&gt;increased health benefits&lt;/a&gt;.&lt;/p&gt;
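
&lt;p&gt;With the daily totals in hand, checking how often I actually reach the target takes a couple of lines. A sketch, assuming &lt;em&gt;steps_df&lt;/em&gt; is a pandas DataFrame with a numeric steps column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: count the days that reached the 10,000-step target
days_hit = (steps_df['steps'] &gt;= 10000).sum()
print(f'{days_hit} of {len(steps_df)} days reached 10,000 steps')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;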

&lt;h1&gt;
  
  
  Plot daily steps
&lt;/h1&gt;

&lt;p&gt;We can now take the dataframe and plot a daily step count&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;matplotlib.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DateFormatter&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="n"&gt;myplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;steps_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;activity_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;myplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Steps&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Daily steps&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AEjNqz1eRARy-FVh1ZEIhCw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AEjNqz1eRARy-FVh1ZEIhCw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Daily step count — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;This shows I’ve still got to work at my daily step goal — another strike against my New Year’s fitness resolution.&lt;/p&gt;

&lt;h1&gt;
  
  
  GPS Mapping
&lt;/h1&gt;

&lt;p&gt;Fitbit stores GPS-logged activities as &lt;a href="https://en.wikipedia.org/wiki/GPS_Exchange_Format" rel="noopener noreferrer"&gt;TCX (Training Center XML)&lt;/a&gt; files. These XML files are &lt;em&gt;not&lt;/em&gt; in the downloaded ZIP, but the Physical Activity files hold a reference to their location, which I can query like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;tcxLink&lt;/span&gt;   
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;physical_activity&lt;/span&gt;  
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;tcxLink&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tcxLink field is a URL reference to the TCX file for each GPS-logged activity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A628%2Fformat%3Awebp%2F1%2A_kfZQTI1b6W5tOvYnF0Tfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A628%2Fformat%3Awebp%2F1%2A_kfZQTI1b6W5tOvYnF0Tfg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The URL for each TCX file — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;We can use this URL directly in a browser (once logged onto the Fitbit website) to download the GPS XML file. Looking inside the TCX file, we find low-level GPS locations captured every few seconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AO-AQrO0btjTH-t1M76XgkQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2AO-AQrO0btjTH-t1M76XgkQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TCX GPS XML file sample contents - Screenshot by the author.&lt;/p&gt;

&lt;p&gt;The good news is this has some obvious fields like latitude, longitude and time. The not-so-good news is that this is XML, so we need to pre-process these files prior to loading, as DuckDB’s file reader doesn’t presently support XML. We can convert the XML files into JSON with another bit of Python code, looping over each &lt;em&gt;.tcx&lt;/em&gt; file.&lt;/p&gt;

&lt;p&gt;There is a bit of nasty XML nesting going on here, with the location data found under &lt;em&gt;TrainingCenterDatabase/Activities/Activity/Lap&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ndjson&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xmltodict&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;xml_src_file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MyFitbitData/tcx/*.tcx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;json_dst_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\.[a-z]*$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.ndjson&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xml_src_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;xml_src_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; --&amp;gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json_dst_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xml_src_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f_xml_src_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# erase file if it exists
&lt;/span&gt;        &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_dst_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="n"&gt;data_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xmltodict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f_xml_src_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Loop over the "laps" in the file; roughly every 1km
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;lap&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TrainingCenterDatabase&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Activities&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Activity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Lap&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;data_dict_inner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Track&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Trackpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="c1"&gt;# append file
&lt;/span&gt;            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_dst_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;ndjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dict_inner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Loading GPS Geospatial data
&lt;/h2&gt;

&lt;p&gt;We can load the Geospatial data like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;route_df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.LatitudeDegrees'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;latitude&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_extract_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'$.LongitudeDegrees'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;longitude&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'MyFitbitData/tcx/54939192717.ndjson'&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'TIMESTAMP'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;Position&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'JSON'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AltitudeMeters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'FLOAT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DistanceMeters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'FLOAT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HeartRateBpm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'JSON'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'newline_delimited'&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestampformat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'%Y-%m-%dT%H:%M:%S.%f%z'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This DuckDB query flattens the JSON, converts the latitude, longitude and time into the correct data types, and loads the results into the &lt;em&gt;route_df&lt;/em&gt; dataframe.&lt;/p&gt;

&lt;h1&gt;
  
  
  Visualize GPS Routes with Folium
&lt;/h1&gt;

&lt;p&gt;Having a table of location information isn’t very descriptive, so I wanted to start plotting my running routes on an interactive map. I used this blog post to help: &lt;a href="https://betterdatascience.com/data-science-for-cycling-how-to-visualize-gpx-strava-routes-with-python-and-folium/" rel="noopener noreferrer"&gt;Visualize routes with Folium&lt;/a&gt;. Modifying the code helped me plot my own runs; for example, this is a plot of a recent run while on holiday in &lt;a href="https://en.wikipedia.org/wiki/Canberra" rel="noopener noreferrer"&gt;Canberra&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;folium&lt;/span&gt;

&lt;span class="n"&gt;route_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;folium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;35.275&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;149.129&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;zoom_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tiles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;openstreetmap&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;coordinates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;route_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latitude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;longitude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;to_numpy&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="n"&gt;folium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PolyLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coordinates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;red&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;add_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A38wf2eG-fR2xwrU0W53k3A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A720%2Fformat%3Awebp%2F1%2A38wf2eG-fR2xwrU0W53k3A.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Folium map plot of a run — Screenshot by the author.&lt;/p&gt;

&lt;p&gt;This generates a plot of my run using &lt;a href="https://openmaptiles.org/" rel="noopener noreferrer"&gt;OpenStreetMap&lt;/a&gt; tiles, giving me a great interactive, detailed map of my route.&lt;/p&gt;
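
&lt;p&gt;As an aside, outside a notebook the same interactive map can be written to a standalone HTML file with folium’s save method (the file name here is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Save the interactive map as a standalone HTML file (file name is illustrative)
route_map.save('route_map.html')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;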

&lt;h1&gt;
  
  
  Data goals and fitness goal summary
&lt;/h1&gt;

&lt;p&gt;Did I get closer to my goal of analysing my Fitbit device data — absolutely! DuckDB proved to be an ideal flexible, lightweight analytical tool for wrangling my extensive and chaotic Fitbit data archive. Blazing through literally millions of records in seconds, with extensive SQL support and flexible file parsing straight into local dataframes, makes DuckDB ideal for building my own personal data warehouse.&lt;/p&gt;

&lt;p&gt;As for my fitness goal — I have some work to do. I think I should leave this blog now, as I’m short of my step goal target for today.&lt;/p&gt;

&lt;h1&gt;
  
  
  Code
&lt;/h1&gt;

&lt;p&gt;🛠️Code for Fitbit activity analysis with DuckDB — &lt;a href="https://github.com/saubury/duckdb-fitbit" rel="noopener noreferrer"&gt;https://github.com/saubury/duckdb-fitbit&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Mastodon usage — counting toots with Kafka, DuckDB &amp; Seaborn 🐘🦆📊</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Thu, 23 Feb 2023 10:29:52 +0000</pubDate>
      <link>https://dev.to/saubury/mastodon-usage-counting-toots-with-kafka-duckdb-seaborn-aok</link>
      <guid>https://dev.to/saubury/mastodon-usage-counting-toots-with-kafka-duckdb-seaborn-aok</guid>
      <description>&lt;h1&gt;
  
  
  Mastodon usage — counting toots with Kafka, DuckDB &amp;amp; Seaborn 🐘🦆📊
&lt;/h1&gt;

&lt;p&gt;Mastodon is a decentralized social networking platform. Users are members of a specific Mastodon instance, and servers are capable of joining other servers to form a federated social network.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I wanted to start exploring Mastodon usage; and perform exploratory data analysis of user activity, server popularity and language usage. I used distributed stream processing tools to collect data from multiple instances to get a glimpse into what’s happening in the &lt;a href="https://en.wikipedia.org/wiki/Fediverse" rel="noopener noreferrer"&gt;fediverse&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This blog covers the tools for data collection and data processing (&lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; stream processing). If this doesn’t interest you, you can jump straight to the data analysis (&lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; and &lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt;). For the enthusiastic, you can &lt;a href="https://github.com/saubury/mastodon-stream" rel="noopener noreferrer"&gt;run the code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c39npgy5br329v1dtuz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c39npgy5br329v1dtuz.png" alt="Collection of open source projects used" width="800" height="450"&gt;&lt;/a&gt;&lt;em&gt;Collection of open source projects used&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://mastodonpy.readthedocs.io/" rel="noopener noreferrer"&gt;Mastodon.py&lt;/a&gt; — Python library for interacting with the Mastodon API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; — distributed event streaming platform&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; — in-process SQL OLAP database and the &lt;a href="https://duckdb.org/docs/extensions/httpfs.html" rel="noopener noreferrer"&gt;HTTPFS DuckDB extension&lt;/a&gt; for reading remote/writing remote files of object storage using the S3 API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://min.io/" rel="noopener noreferrer"&gt;MinIO&lt;/a&gt; — S3 compatible server&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt; — visualization library&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data collection — the Mastodon listener
&lt;/h2&gt;

&lt;p&gt;ℹ️ If you’re not interested in the data collection … jump straight to the data analysis&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3kj857rtcnzj73obi8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3kj857rtcnzj73obi8g.png" alt="Data collection architecture" width="800" height="351"&gt;&lt;/a&gt;&lt;em&gt;Data collection architecture&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There is a &lt;a href="https://joinmastodon.org/servers" rel="noopener noreferrer"&gt;large collection&lt;/a&gt; of Mastodon servers with a wide variety of subjects, topics and interesting communities. Accessing a public stream is generally possible without authenticating, so no account is required to work out what’s happening on each server.&lt;/p&gt;

&lt;p&gt;In the decentralized Mastodon network, not every message is sent to every server. &lt;a href="https://commons.wikimedia.org/wiki/File:Mastodon_timelines.png" rel="noopener noreferrer"&gt;Generally&lt;/a&gt;, public toots from instance-A will only be sent to instance-B if a user from B follows that user from A.&lt;/p&gt;

&lt;p&gt;I wrote a Python application, mastodonlisten, to listen for public posts from a given server. By running multiple listeners I could collect toots from both popular and niche instances. Each listener collects public toots from its server and publishes them to a private Kafka broker. Multiple Mastodon listeners can be run in the background like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python mastodonlisten.py --baseURL https://mastodon.social --enableKafka &amp;amp;

python mastodonlisten.py --baseURL https://universeodon.com --enableKafka &amp;amp;

python mastodonlisten.py --baseURL https://hachyderm.io --enableKafka &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
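
&lt;p&gt;For a feel of what each listener does, here is a minimal sketch, assuming Mastodon.py’s StreamListener and the confluent_kafka Producer. The broker address and record fields are illustrative; the full version lives in the project repo.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal sketch of a Mastodon listener publishing public toots to Kafka.
# Assumes Mastodon.py and confluent-kafka; broker address and field names are illustrative.
import json
from mastodon import Mastodon, StreamListener
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})

class TootListener(StreamListener):
    def on_update(self, status):
        # Each public toot arrives as a status dict; publish a trimmed-down record
        record = {
            'm_id': status['id'],
            'created_at': int(status['created_at'].timestamp()),
            'username': status['account']['username'],
            'language': status['language'],
            'mastodon_text': status['content'],
        }
        producer.produce('mastodon-topic', value=json.dumps(record))
        producer.poll(0)  # serve delivery callbacks

mastodon = Mastodon(api_base_url='https://mastodon.social')
mastodon.stream_public(TootListener())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;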
&lt;h2&gt;
  
  
  Kafka Connect
&lt;/h2&gt;

&lt;p&gt;I’ve now got multiple Mastodon listeners feeding public posts from multiple servers into a single Kafka topic. My next task is to understand what’s going on with all this activity from the decentralised network.&lt;/p&gt;

&lt;p&gt;I decided to incrementally dump the “toots” into Parquet files on an S3 object store. Parquet is a columnar storage format that is optimised for analytical querying. I chose Kafka Connect to stream data from my Kafka topic and land it in S3 using the S3SinkConnector.&lt;/p&gt;

&lt;p&gt;That sounds like a lot of work — but the TL;DR is that with a bit of configuration, I can instruct Kafka Connect to do everything for me. Consuming the mastodon-topic from Kafka and creating a new Parquet file on S3 every 1,000 records is accomplished with this configuration&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "name": "mastodon-sink-s3",
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "mastodon-topic",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "1000",
    "s3.bucket.name": "mastodon",
    "aws.access.key.id": "minio",
    "aws.secret.access.key": "minio123",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "store.url": "http://minio:9000"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
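
&lt;p&gt;Registering the connector is a single REST call to the Kafka Connect worker. Here is a sketch, assuming the worker listens on localhost:8083; the aws.* credentials shown above are omitted.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: register the S3 sink via the Kafka Connect REST API.
# Assumes a Connect worker on localhost:8083; add the aws.* credentials shown above.
import requests

config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "mastodon-topic",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "1000",
    "s3.bucket.name": "mastodon",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "store.url": "http://minio:9000",
}

response = requests.put("http://localhost:8083/connectors/mastodon-sink-s3/config", json=config)
response.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;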

&lt;p&gt;To check this is all working correctly, I can see new files are being regularly created by looking in the MinIO web-based object browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4buq29jfxz9fitw2tgzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4buq29jfxz9fitw2tgzz.png" alt="MinIO web-based object browser" width="800" height="380"&gt;&lt;/a&gt;&lt;em&gt;MinIO web-based object browser&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data analysis
&lt;/h2&gt;

&lt;p&gt;Now we have collected just over a week of Mastodon activity, let’s have a look at some data. These steps are detailed in the &lt;a href="https://github.com/saubury/mastodon-stream/blob/main/notebooks/mastodon-analysis.ipynb" rel="noopener noreferrer"&gt;notebook&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Observations in this section are based on a naive interpretation of a few days’ worth of data. Please don’t rely on any of this analysis, but feel free to use these techniques yourself to explore and learn&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Firstly, some quick statistics from the data collected over 10 days (3 Feb to 12 Feb 2023)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🔢 Number of Mastodon toots seen 1,622,149&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;👤 Number of unique Mastodon users 142,877&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;💻 Number of unique Mastodon instances 8,309&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🌏 Number of languages seen 131&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;✍️ Shortest toot 0 characters, average toot length 151 characters and longest toot 68,991 characters (if you’re curious, the longest toot was a silly comment followed by the same emoji repeated 68,930 times)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;📚 Total of all toots 245,245,677 characters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🦆 DuckDB memory used to hold &lt;strong&gt;1.6 million toots is just 745.5MB&lt;/strong&gt; (which is tiny!)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;⏱ Time it takes to calculate the above statistics in a single SQL query is &lt;strong&gt;0.7 seconds&lt;/strong&gt; (wow — fast! A sketch of such a query follows this list)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
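
&lt;p&gt;As a rough idea of that query, here is a sketch using DuckDB from Python. It assumes the &lt;em&gt;mastodon_toot&lt;/em&gt; table built later in this post, together with its column names.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: the summary statistics in a single DuckDB query
# (assumes the mastodon_toot table created later in this post)
import duckdb

duckdb.sql("""
    select count(*)                   as num_toots
    , count(distinct username)        as num_users
    , count(distinct from_instance)   as num_instances
    , count(distinct language)        as num_languages
    , min(characters)                 as shortest_toot
    , round(avg(characters))          as average_toot_length
    , max(characters)                 as longest_toot
    , sum(characters)                 as total_characters
    from mastodon_toot
""").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;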

&lt;p&gt;DuckDB’s Python client can be used &lt;a href="https://duckdb.org/docs/guides/python/jupyter" rel="noopener noreferrer"&gt;directly in Jupyter notebooks&lt;/a&gt;. The first step is to import the relevant libraries. The DuckDB Python package can run queries directly on Pandas dataframes. With a few &lt;a href="https://www.datacamp.com/tutorial/sql-interface-within-jupyterlab" rel="noopener noreferrer"&gt;SqlMagic&lt;/a&gt; settings it’s possible to configure the notebook to output query results directly to Pandas&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%load_ext sql
%sql duckdb:///:memory:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plus we can use the &lt;a href="https://duckdb.org/docs/extensions/httpfs.html" rel="noopener noreferrer"&gt;HTTPFS DuckDB extension&lt;/a&gt; for reading and writing remote files on object storage using the S3 API&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
INSTALL httpfs;
LOAD httpfs;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Establish s3 endpoint
&lt;/h2&gt;

&lt;p&gt;Here we’re using a local &lt;a href="https://min.io/" rel="noopener noreferrer"&gt;MinIO&lt;/a&gt; as an open source, Amazon S3-compatible server (and no, you shouldn’t share your secret_access_key). Set the S3 endpoint settings like this&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
set s3_endpoint='localhost:9000';
set s3_access_key_id='minio';
set s3_secret_access_key='minio123';
set s3_use_ssl=false;
set s3_region='us-east-1';
set s3_url_style='path';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And I can now query the parquet files directly from s3&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
select *
from read_parquet('s3://mastodon/topics/mastodon-topic/partition=0/*');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb2fywpw8er8psxaknv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb2fywpw8er8psxaknv1.png" alt="Reading parquet from s3 without leaving the notebook" width="612" height="208"&gt;&lt;/a&gt;&lt;em&gt;Reading parquet from s3 without leaving the notebook&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is pretty cool — we can read the parquet data sitting in our S3 bucket directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  DuckDB SQL to process Mastodon activity
&lt;/h2&gt;

&lt;p&gt;Before moving on, I had a bit of data cleanup which I could do within DuckDB, loading remote parquet files (from s3).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Map the ISO 639–1 (two-letter) language code (zh, cy, en) to a language description (Chinese, Welsh, English). We can create a language lookup table and load languages from language.csv.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate created_tz from the created_at integer, which holds the number of seconds since the epoch (1/1/1970)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Determine the originating instance with a regular expression to strip the URL&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create the table mastodon_toot as a join of mastodon_toot_raw to language, as shown below&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE language(lang_iso VARCHAR PRIMARY KEY, language_name VARCHAR);

insert into language
select *
from read_csv('./language.csv', AUTO_DETECT=TRUE, header=True);

create table mastodon_toot_raw as
select m_id
, created_at
, ('EPOCH'::TIMESTAMP + INTERVAL (created_at::INT) seconds)::TIMESTAMPTZ as created_tz
, app
, url
, regexp_replace(regexp_replace(url, '^http[s]://', ''), '/.*$', '') as from_instance
, base_url
, language
, favourites
, username
, bot
, tags
, characters
, mastodon_text
from read_parquet('s3://mastodon/topics/mastodon-topic/partition=0/*');

create table mastodon_toot as
select mr.*, ln.language_name
from mastodon_toot_raw mr
left outer join language ln on (mr.language = ln.lang_iso);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;🪄Being able to do this cleanup and transformation in SQL and have it execute in 0.8 seconds is like magic to me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02v282tm5gb50l35khhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02v282tm5gb50l35khhc.png" alt="Very fast processing — less then a second" width="362" height="144"&gt;&lt;/a&gt;&lt;em&gt;Very fast processing — less then a second&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Daily Mastodon usage
&lt;/h2&gt;

&lt;p&gt;We can query the mastodon_toot table directly to see the number of &lt;em&gt;toots&lt;/em&gt; and &lt;em&gt;users&lt;/em&gt; each day, counting and grouping the activity by day. We can use the &lt;a href="https://duckdb.org/docs/sql/aggregates.html#statistical-aggregates" rel="noopener noreferrer"&gt;mode&lt;/a&gt; aggregate function to find the most frequent “bot” and “not-bot” users, surfacing the most active Mastodon accounts&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
select strftime(created_tz, '%Y/%m/%d %a') as "Created day"
, count(*) as "Num toots"
, count(distinct(username)) as "Num users"
, count(distinct(from_instance)) as "Num urls"
, mode(case when bot='False' then username end) as "Most freq non-bot"
, mode(case when bot='True' then username end) as "Most freq bot"
, mode(base_url) as "Most freq host"
from mastodon_toot
group by 1
order by 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdg8tza61i5gx1fbr6hqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdg8tza61i5gx1fbr6hqx.png" alt="Raw daily counts of activity" width="800" height="226"&gt;&lt;/a&gt;&lt;em&gt;Raw daily counts of activity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ℹ️ The first few days were a bit sporadic as I was playing with the data collection. Once everything was set up I was generally seeing&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;200,000 toots a day from 50,000 users&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mastodon.social was the most popular host&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;news organisations are the biggest generators of content (and they don’t always set the “bot” attribute)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Mastodon app landscape
&lt;/h2&gt;

&lt;p&gt;Which clients are used to access a Mastodon instance? We query the mastodon_toot table, excluding "bots", and load the query results into the mastodon_app_df Pandas dataframe&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
mastodon_app_df &amp;lt;&amp;lt; 
    select *
    from mastodon_toot
    where app is not null 
    and app &amp;lt;&amp;gt; ''
    and bot='False';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt; is a visualization library for statistical graphics in Python, built on the top of &lt;a href="https://matplotlib.org/" rel="noopener noreferrer"&gt;matplotlib&lt;/a&gt;. It also works really well with Panda data structures.&lt;/p&gt;

&lt;p&gt;We can use &lt;a href="https://seaborn.pydata.org/generated/seaborn.countplot.html" rel="noopener noreferrer"&gt;seaborn.countplot&lt;/a&gt; to show the counts of Mastodon app usage observations in each categorical bin using bars. Note, we are limiting this to the 10 highest occurrences&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.countplot(data=mastodon_app_df, y="app", order=mastodon_app_df.app.value_counts().iloc[:10].index)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frr0thpzrfgyljrq2qkx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frr0thpzrfgyljrq2qkx0.png" alt="Web, Ivory and Moa are popular ways of toot’ing" width="800" height="502"&gt;&lt;/a&gt;&lt;em&gt;Web, Ivory and Moa are popular ways of toot’ing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ℹ️ The Mastodon application landscape is rapidly changing. Web usage is the preferred client, followed by mobile apps like Ivory, Moa, Tusky and the Mastodon app&lt;/p&gt;

&lt;p&gt;⚠️ The &lt;a href="https://mastodonpy.readthedocs.io/en/stable/02_return_values.html#toot-status-dicts" rel="noopener noreferrer"&gt;Mastodon API&lt;/a&gt; attempts to report application for the client used to post the toot. Generally this attribute does not federate and is therefore undefined for remote toots.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time of day Mastodon usage
&lt;/h2&gt;

&lt;p&gt;Let’s see when Mastodon is used throughout the day and night. I want a raw count of &lt;em&gt;toots&lt;/em&gt; for each hour of each day. We can load the results of this query into the mastodon_usage_df dataframe&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
mastodon_usage_df &amp;lt;&amp;lt; 
    select strftime(created_tz, '%Y/%m/%d %a') as created_day
    , date_part('hour', created_tz) as created_hour
    , count(*) as num
    from mastodon_toot
    group by 1,2 
    order by 1,2;

sns.lineplot(data=mastodon_usage_df, x="created_hour", y="num", hue="created_day").set_xticks(range(24))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv366h7iwid51h9hpaowe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv366h7iwid51h9hpaowe.png" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⏰ It was interesting to see daily activity follow a very similar usage pattern.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The lowest activity was seen at 3:00pm in Australia (12:00pm in China, 8:00pm in California and 4:00am in London)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The highest activity was seen at 2:00am in Australia (11:00pm in China, 7:00am in California and 3:00pm in London)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Language usage
&lt;/h2&gt;

&lt;p&gt;The language of a toot can be specified by the server or the client — so it’s not always an accurate indicator of the language within the toot. Consider this a wildly inaccurate investigation of language tags.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
mastodon_usage_df &amp;lt;&amp;lt; 
    select *
    from mastodon_toot;

sns.countplot(data=mastodon_usage_df, y="language_name", order=mastodon_usage_df.language_name.value_counts().iloc[:20].index)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ckt4njtiovd706fxw4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ckt4njtiovd706fxw4g.png" alt="English is prominent language, followed by Japanese" width="800" height="549"&gt;&lt;/a&gt;&lt;em&gt;English is prominent language, followed by Japanese&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Toot length by language usage
&lt;/h2&gt;

&lt;p&gt;I was also curious what the length of toots looked like over different languages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%sql
mastodon_lang_df &amp;lt;&amp;lt; 
    select *
    from mastodon_toot
    where language not in ('unknown');

sns.boxplot(data=mastodon_lang_df, x="characters", y="language_name", whis=100, orient="h", order=mastodon_lang_df.language_name.value_counts().iloc[:20].index)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hjzwp6kgg6ego3irgdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hjzwp6kgg6ego3irgdx.png" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What’s interesting to see is that the typical Chinese, Japanese and Korean toot is shorter than English, whereas Galician and Finnish messages are longer. A possible explanation is that logographic languages (like Mandarin) may be able to convey more with fewer characters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.bleepingcomputer.com/news/technology/mastodon-now-has-over-1-million-users-amid-twitter-tensions/" rel="noopener noreferrer"&gt;rise of Mastodon&lt;/a&gt; is something I’ve been really interested in. The open sharing nature has helped with the rapid adoption by communities and new users (&lt;a href="https://data-folks.masto.host/@saubury" rel="noopener noreferrer"&gt;myself included&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;It’s been great to explore the &lt;a href="https://en.wikipedia.org/wiki/Fediverse" rel="noopener noreferrer"&gt;fediverse&lt;/a&gt; with powerful open source distributed stream processing tools. Performing exploratory data analysis in a Jupyter notebook with DuckDB is like pressing a turbo button ⏩. Reading parquet from s3 without leaving the notebook is neat, and DuckDB’s ability to run queries directly on Pandas data without ever importing or copying any data is really snappy.&lt;/p&gt;

&lt;p&gt;I’m going to conclude with my two favourite statistics. DuckDB memory used to hold &lt;strong&gt;1.6 million toots is just 745.5MB&lt;/strong&gt; and to process my results in &lt;strong&gt;0.7 seconds&lt;/strong&gt; is like a super power 🪄&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;⚒️ &lt;a href="https://github.com/saubury/mastodon-stream/" rel="noopener noreferrer"&gt;Code&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🐘 &lt;a href="https://data-folks.masto.host/@saubury" rel="noopener noreferrer"&gt;Simon on Mastodon&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>Real-Time Wildlife Monitoring with Apache Kafka</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Fri, 20 Jan 2023 10:49:52 +0000</pubDate>
      <link>https://dev.to/saubury/real-time-wildlife-monitoring-with-apache-kafka-3pbj</link>
      <guid>https://dev.to/saubury/real-time-wildlife-monitoring-with-apache-kafka-3pbj</guid>
      <description>&lt;p&gt;Wildlife monitoring is critical for keeping track of population changes of vulnerable animals. As part of the Confluent Hackathon ʼ22, I was inspired to investigate if a streaming platform could help with tracking animal movement patterns. The challenge was to examine trends in identified species and demonstrate how animal movement patterns can be observed in the wild using Apache Kafka® and open source dashboarding.&lt;/p&gt;

&lt;p&gt;Note : This article was originally written and published for the &lt;a href="https://www.confluent.io/blog/real-time-detection-monitoring-with-apache-kafka/" rel="noopener noreferrer"&gt;Confluent blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AKDgirObuWyTPcSHH" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AKDgirObuWyTPcSHH" alt="Dashbaord with upding animal counts" width="600" height="350"&gt;&lt;/a&gt;&lt;em&gt;Dashbaord with upding animal counts&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I’ve been using Kafka in my “day job” for many years, building streaming solutions in retail, telemetrics, finance, and energy — but this hackathon challenged me to build something new and novel. The goal was ambitious. Before scaling up to monitor and alert on more exotic creatures, I initially chose to test the viability at a smaller scale by tracking wildlife in my own back garden.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrh0rm1ceb55sdqk4p32.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrh0rm1ceb55sdqk4p32.jpg" alt="Simplified archiecture diagram" width="800" height="315"&gt;&lt;/a&gt;&lt;em&gt;Simplified archiecture diagram&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Backyard animal detection
&lt;/h2&gt;

&lt;p&gt;To test the viability of the project, I built a “backyard monitoring” experiment using a Raspberry Pi along with an attached camera. The images were processed locally to identify and classify animals, with the observation events published to &lt;a href="https://www.confluent.io/confluent-cloud/tryfree/" rel="noopener noreferrer"&gt;Confluent Cloud&lt;/a&gt; for stream processing.&lt;/p&gt;

&lt;p&gt;This project used TensorFlow Lite with Python on a Raspberry Pi 4 to perform real-time object classification using images streamed from the attached Raspberry Pi camera. TensorFlow is an open source platform for machine learning, and TensorFlow Lite is a slimmed-down library suitable for deploying models on low-powered, battery-operated edge devices such as a Raspberry Pi. TensorFlow also has a great number of community resources, so I was able to make use of a detection model already pre-trained to detect numerous animals, including zebras, elephants, cats, dogs and more importantly, teddy bears.&lt;/p&gt;

&lt;p&gt;I deployed a small Python application to run on the Raspberry Pi. The application continuously captures images from the camera — and each detected animal is given an object detection score. To connect the Raspberry Pi to the Kafka cluster, I used the confluent_kafka API, a powerful Python client library for interacting with Kafka. With some basic setup (and some secret tokens for connecting) my Python application acts as a Kafka producer. Whenever an animal is detected, be it an elephant, zebra, kangaroo, or household cat, it is sent as a new record to the objects topic.&lt;/p&gt;
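
&lt;p&gt;In outline, the producer loop looks something like this. It is a sketch only: capture_frame and detect_objects are placeholders standing in for the camera capture and the TensorFlow Lite classification, and the broker settings are illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough sketch of the detection-to-Kafka loop on the Raspberry Pi.
# capture_frame() and detect_objects() are placeholders for the camera capture
# and the TensorFlow Lite classification; broker settings are illustrative.
import json
import time
from collections import Counter
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})  # plus SASL secrets for Confluent Cloud

def capture_frame():
    return None  # placeholder: grab an image from the Pi camera

def detect_objects(frame):
    return ['cat', 'bird']  # placeholder: labels scoring above a detection threshold

while True:
    counts = Counter(detect_objects(capture_frame()))
    if counts:
        event = {'camera_name': 'backyard-pi', 'objects_count': dict(counts)}
        producer.produce('objects', value=json.dumps(event))
        producer.poll(0)
    time.sleep(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;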

&lt;p&gt;Once tested, I deployed the Raspberry Pi, camera, and battery into the “field” (aka my backyard) to monitor the local wildlife. This worked surprisingly well, capturing the cats, dogs, and several birds that appeared during the week of testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Peeking into zoo wildlife
&lt;/h2&gt;

&lt;p&gt;I was happy with the minimal viable product (MVP), but was disappointed I hadn’t spotted any exotic animals in my local garden. To increase the variety of animal encounters, I deployed a second Kafka producer, but this time connected to a webcam at a local zoo. Live animal webcams provide a great source of video feeds with an increased likelihood of spotting giraffes, elephants, and zebras over what may be roaming in my back garden. Similar to the Python application described earlier, the stream of webcam images yielded plenty of interesting animal encounters that could be detected with the TensorFlow object classification. Regardless of the source, animal detection events were sent to a shared Kafka cluster with a payload describing the image source and animals detected.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "camera_name": "zoo-webcam",
  "objects_count": {
    "elephant": 1,
    "zebra": 2
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7n2rdfpz0pq7dq38uit.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7n2rdfpz0pq7dq38uit.jpg" alt="Elephants and zebras with identifed JSON stream" width="800" height="412"&gt;&lt;/a&gt;&lt;em&gt;Elephants and zebras with identifed JSON stream&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Animal crossing — the stream processing edition
&lt;/h2&gt;

&lt;p&gt;So I now had two Kafka producers sending animal detection events into a Kafka cluster, and my next challenge was to process the animal detection payloads. The goal was to understand the population observations over short time periods (such as animals seen this hour) and longer-term trends (such as population changes day on day). I elected to use &lt;a href="https://ksqldb.io/" rel="noopener noreferrer"&gt;ksqlDB&lt;/a&gt; to help transform the raw objects topic into some meaningful observations.&lt;/p&gt;

&lt;p&gt;The first task was to declare an objects stream in ksqlDB—allowing me to author SQL statements against the underlying Kafka objects topic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create stream objects (camera_name VARCHAR, objects_found VARCHAR, objects_count VARCHAR) 
WITH (KAFKA_TOPIC='objects', VALUE_FORMAT='json');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The objects identified by the Raspberry Pi and webcam were transported as JSON records. Understanding the number and variety of animals identified in each Kafka record required digging into the JSON. Fortunately, ksqlDB has a handy EXTRACTJSONFIELD function to retrieve nested field values from a string of JSON. It returns the number of each animal if the field key exists in that message; otherwise, it returns NULL. The following is an example of extracting counts of each animal into appropriately named fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create stream animals as
select camera_name
, objects_count
, cast(extractjsonfield(objects_count, '$.elephant') as bigint) as elephant
, cast(extractjsonfield(objects_count, '$.bear') as bigint) as bear
, cast(extractjsonfield(objects_count, '$.zebra') as bigint) as zebra
, cast(extractjsonfield(objects_count, '$.giraffe') as bigint) as giraffe
, cast(extractjsonfield(objects_count, '$.teddybear') as bigint) as teddybear
from objects;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Something I noticed during testing was that the TensorFlow detection didn’t always detect all the animals in each frame. Two animals in the frame were often only detected as a single animal. I discovered I could “smooth” out the observations from the detection model by finding the highest count of each animal type within a 30-second window. This was achieved by defining a tumbling window that counts the number of each animal observed in the time window. I figured it was appropriate to call this the zoo table.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table zoo as
select camera_name
, max(elephant) as max_elephant
, max(bear) as max_bear
, max(zebra) as max_zebra
, max(giraffe) as max_giraffe
, max(teddybear) as max_teddybear
from animals
window tumbling (size 30 seconds)
group by camera_name
emit changes;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Due to the limited processing on the Kafka producer, it was common to undercount the number of animals seen in consecutive frames of the video. By adding the max ksqlDB function I could create a table to project the highest occurrence count of each animal seen in the time window.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing
&lt;/h2&gt;

&lt;p&gt;With animal observations running and stream transformations deployed I moved on to building the analytics system. To complete the project I wanted to create a live dashboard, with animal counts and a visual image of population trends over time. A Kibana dashboard was ideal, so I just needed to populate the underlying Elasticsearch indexes to create a visual analytics dashboard.&lt;/p&gt;

&lt;p&gt;I wanted a simple way to send the Kafka data onwards to my analytics dashboard. Kafka Connect is a framework for connecting Kafka with external systems such as relational databases, document databases, and key-value stores. I used the Kafka Connect Elasticsearch connector to send both the animals and zoo Kafka topics to Elasticsearch indexes. Once configured, the connector consumes records from the two Kafka topics and writes to a corresponding index in Elasticsearch. With the Elasticsearch index created and populated, a Kibana dashboard provides a great way to visualize the mix of animals, and the trending population of the zoo.&lt;/p&gt;
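
&lt;p&gt;A sketch of the Elasticsearch sink settings, expressed as a Python dict; the connection URL and topic names here are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: Elasticsearch sink configuration (connection URL and topic names are illustrative)
elasticsearch_sink = {
    "name": "animals-sink-elastic",
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "ANIMALS,ZOO",              # one Elasticsearch index per topic
    "connection.url": "http://elasticsearch:9200",
    "key.ignore": "true",                 # let the connector generate document ids
    "schema.ignore": "true",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;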

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnfvw571ons1f33r2pcx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnfvw571ons1f33r2pcx.jpg" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Alerting for the ultra-rare teddy bear sighting
&lt;/h2&gt;

&lt;p&gt;With the detection, transformation, and visualization complete for the wildlife monitoring system, there was still an opportunity for one last feature: rare animal alerting!&lt;/p&gt;

&lt;p&gt;In addition to the more traditional animals, I noticed the detection model I was using had the ability to identify teddy bears. This provided an ideal situation to demonstrate exceptional processing conditions — after all, what’s more rare than the arrival of a teddy bear in the back garden?&lt;/p&gt;

&lt;p&gt;Inspired by &lt;a href="https://dev.to/rmoff/building-a-telegram-bot-with-apache-kafka-go-and-ksqldb-4and"&gt;Building a Telegram Bot with Apache Kafka&lt;/a&gt;, I set up a Telegram bot to alert me when a rare animal is sighted — or at least send a push notification to my phone if a teddy bear is seen in the garden.&lt;/p&gt;

&lt;p&gt;Creating the special notification simply required a new ksqlDB stream. The teddytopic stream contains a record signifying the arrival of a teddy bear.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create stream teddytopic as 
select '📢 Just saw a 🧸 TEDDY BEAR in the garden' as message 
from animals 
where teddybear &amp;gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, I wanted to build the alerting mechanism, triggered whenever the accompanying teddytopic Kafka topic acquired a record signifying that an “endangered” teddy bear was sighted.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://core.telegram.org/bots/api" rel="noopener noreferrer"&gt;Telegram Bot API&lt;/a&gt; allows me to create managed bots for interacting with Telegram — a popular messaging service. To call Telegram I needed to create a Kafka wildlife bot, which provides an HTTP-based interface to let me know instantly when something novel had been observed. With the bot created, I used the Kafka Connect HTTP Sink Connector to make an HTTPS API call for each record in the teddytopic topic. Once configured, the connector consumes records from Kafka topic, sending the record value in the request body to the configured http.api.url. In this case, the endpoint was a preconfigured api.telegram.org.&lt;/p&gt;

&lt;p&gt;With these steps completed, I get instant notification on my phone whenever a rare teddy bear is observed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2254tnt0lexw03tuodhy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2254tnt0lexw03tuodhy.jpg" width="616" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;Building this project as part of the Confluent Hackathon ʼ22 was both a great learning experience and a fun way to mix together some impressive technologies. Combining cheap computing devices and open source machine learning libraries along with the Kafka streaming platform provides a new tool to examine trends in animal population and movement observations.&lt;/p&gt;

&lt;p&gt;Although wildlife watcher is a simple proof of concept, I hope a project like this demonstrates how incredible components can be incorporated quickly. Stream processing with ksqlDB, the flexibility of Confluent Cloud, and the integration of data systems with Kafka Connect allowed me to build a fun project with less than 200 lines of code and perhaps contribute a small part in helping to solve real-world wildlife challenges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links to code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/saubury/wildlife-watch" rel="noopener noreferrer"&gt;GitHub project&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Can ML predict where my cat is now — part 2</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Mon, 04 Jul 2022 21:56:22 +0000</pubDate>
      <link>https://dev.to/saubury/can-ml-predict-where-my-cat-is-now-part-2-4ep0</link>
      <guid>https://dev.to/saubury/can-ml-predict-where-my-cat-is-now-part-2-4ep0</guid>
      <description>&lt;h1&gt;
  
  
  Can ML predict where my cat is now — part 2
&lt;/h1&gt;

&lt;p&gt;Can ML predict where Snowy the cat would go throughout her day? With months of location &amp;amp; temperature data captured, this second blog covers how to train a machine learning (ML) model to make that prediction. For the impatient, you can skip directly to the prediction web-app here.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://simon-aubury.medium.com/can-ml-predict-where-my-cat-is-now-part-1-cfb194b51aab"&gt;Part 1 of this blog&lt;/a&gt; covered the hardware required build a history of which room she used for her favourite sleeping spots.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kKwt8rWS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2ApcBZHKQDm-IwLH-w" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kKwt8rWS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2ApcBZHKQDm-IwLH-w" alt="Cat location prediction using Streamlit web apps" width="600" height="465"&gt;&lt;/a&gt;&lt;em&gt;Cat location prediction using Streamlit web apps&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where are we starting?
&lt;/h2&gt;

&lt;p&gt;This &lt;a href="https://simon-aubury.medium.com/can-ml-predict-where-my-cat-is-now-part-1-cfb194b51aab"&gt;first blog&lt;/a&gt; described the method for locating Snowy and the data collection platform. I had collected over three months of data, with over &lt;strong&gt;12 million&lt;/strong&gt; location, temperature, humidity and rainfall observations (I &lt;em&gt;may&lt;/em&gt; have gone over the top with data collection).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7yLuttLV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3622/1%2AZpHFIHMN58FXuRtF1XA2Bw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7yLuttLV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3622/1%2AZpHFIHMN58FXuRtF1XA2Bw.png" alt="" width="880" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The question I’ve been trying to answer: can I use these historic observations to build a prediction model of where she is likely to go? How confident can a machine be at predicting Snowy’s hiding spot?&lt;/p&gt;

&lt;h2&gt;
  
  
  ML Bootcamp
&lt;/h2&gt;

&lt;p&gt;Supervised learning is the ML task of creating a function that maps an input to an output based on example input-output pairs. In my case, I want to take historic observations about cat location, temperature, time of day etc., as inputs and find patterns … a function (inference) that predicts future cat location.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XWA9SpED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3324/1%2ANncPI9cL6dCDozOgXEdZ3g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XWA9SpED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3324/1%2ANncPI9cL6dCDozOgXEdZ3g.png" alt="Temperature, time and day — can it map to location?" width="880" height="409"&gt;&lt;/a&gt;&lt;em&gt;Temperature, time and day — can it map to location?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My assumption is the problem can be generalised from this data; e.g. future data will follow some common pattern of past cat behaviour (for a cat — this assumption may be questionable).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tfrQDEQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3744/1%2AvBucY3DB1DfM_UJqEeZ7ug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tfrQDEQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3744/1%2AvBucY3DB1DfM_UJqEeZ7ug.png" alt="Cat location prediction" width="880" height="338"&gt;&lt;/a&gt;&lt;em&gt;Cat location prediction&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The training uses past information to build a model that is a deployable artefact. Once a candidate model is trained, it can be tested for prediction accuracy and finally deployed. In my case, I wish to create a web application to make predictions on where Snowy is likely to be napping.&lt;/p&gt;

&lt;p&gt;What’s also important is that the model doesn’t have to explicitly output an absolute location; it can give its answer in terms of a confidence. If it outputs P(location:study) near 1.0 it’s confident, while values near 0.5 mean it is unsure of Snowy’s location.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summarising data with dbt
&lt;/h2&gt;

&lt;p&gt;As covered in &lt;a href="https://simon-aubury.medium.com/can-ml-predict-where-my-cat-is-now-part-1-cfb194b51aab"&gt;part 1&lt;/a&gt; — my data platform Home Assistant stores each sensor update in the &lt;a href="https://www.home-assistant.io/docs/backend/database/"&gt;states&lt;/a&gt; table. This is &lt;em&gt;really&lt;/em&gt; fine-grained, with updates added every few seconds from all the sensors (in my case, around 18,000 sensor updates a day). My goal was to summarise the data into hourly updates — essentially a single (most prevalent) location, along with temperature and humidity readings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zSqTmcwY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3394/1%2AlrglkFoIFjc3B9dmB5QMNA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zSqTmcwY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3394/1%2AlrglkFoIFjc3B9dmB5QMNA.png" alt="Summarising lots of data into hourly summaries" width="880" height="209"&gt;&lt;/a&gt;&lt;em&gt;Summarising lots of data into hourly summaries&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Initially I was manually running the data processing with a bunch of SQL statements (like &lt;a href="https://github.com/saubury/cat-predictor/blob/master/sql/extract.sql"&gt;this&lt;/a&gt;). However, I found this fairly cumbersome as I wanted to retrain the model with newer location and environmental conditions. I settled on the trusty data engineering tool &lt;a href="https://www.getdbt.com/"&gt;dbt&lt;/a&gt; to simplify the creation of the SQL transformations in my database and make retraining more effective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oUYq7YNL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3526/1%2AhTuWToji5sce680xpOCgNw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oUYq7YNL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3526/1%2AhTuWToji5sce680xpOCgNw.png" alt="The dbt lineage graph showing the transformation of data" width="880" height="449"&gt;&lt;/a&gt;&lt;em&gt;The dbt lineage graph showing the transformation of data&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;dbt handles turning my select statements into tables and views, transforming the data already inside my postgres data warehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model training &amp;amp; evaluation
&lt;/h2&gt;

&lt;p&gt;I used a Scikit-learn &lt;a href="https://www.datacamp.com/tutorial/random-forests-classifier-python"&gt;random forest decision tree&lt;/a&gt; classifier for my predictive model. A random forest creates decision trees on randomly selected data samples, gets a prediction from each tree, and selects the best solution by voting. It also provides a pretty good indicator of feature importance.&lt;/p&gt;

&lt;p&gt;If you look at the &lt;a href="https://github.com/saubury/cat-predict/tree/master/notebooks"&gt;python notebook&lt;/a&gt; you can see the steps taken to assign a class label to inputs, based on the thousands of past observations of time of day, temperature and location it has been trained on.&lt;/p&gt;
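
&lt;p&gt;In outline, the training and evaluation steps look like this; it is a sketch only, and the feature and label column names are stand-ins for the notebook’s actual columns.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: train and evaluate the location classifier.
# Feature and label column names are illustrative stand-ins for the notebook's columns.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('observations.csv')  # hourly summaries exported from the warehouse
features = ['hour_of_day', 'day_of_week', 'temperature', 'humidity', 'is_raining']

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['location'], test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print('accuracy:', clf.score(X_test, y_test))

# Probabilities rather than hard labels: values near 1.0 are confident,
# values near 0.5 mean the model is unsure
print(dict(zip(clf.classes_, clf.predict_proba(X_test[:1])[0])))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;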

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aqu493ES--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2652/1%2A_YIXxul5hKvKUBF8AO2ovA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aqu493ES--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2652/1%2A_YIXxul5hKvKUBF8AO2ovA.png" alt="Python code segment for visualizing feature importance" width="880" height="307"&gt;&lt;/a&gt;&lt;em&gt;Python code segment for visualizing feature importance&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One really cool thing about the Scikit-learn decision tree models is how easy it is to visualise what’s going on. By visualizing the model features (above) I can see that “hour of the day” is the most significant feature in the model.&lt;/p&gt;
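
&lt;p&gt;Getting that chart takes only a couple of lines — a minimal sketch, assuming the fitted model and features from the training step above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
import pandas as pd

# Pair each feature with its importance score from the fitted forest
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().plot.barh(title='Feature importance')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;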

&lt;p&gt;Intuitively this makes sense — time of day is likely to have the most significant impact on where Snowy is likely to be. The second most significant feature in predicting Snowy’s location is outside air temperature. Again this makes sense — too hot or too cold is likely to change whether she wants to be outside. What I found surprising was that the &lt;em&gt;least significant&lt;/em&gt; feature was the is-raining feature. One possible explanation is that the feature only makes sense during daylight hours; is-raining won’t have an effect on the model when Snowy is sleeping inside at night.&lt;/p&gt;

&lt;p&gt;It’s also possible to &lt;a href="https://towardsdatascience.com/how-to-visualize-a-decision-tree-from-a-random-forest-in-python-using-scikit-learn-38ad2d75f21c"&gt;visualize a decision tree&lt;/a&gt; from a random forest in Python using Scikit-Learn.&lt;/p&gt;
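
&lt;p&gt;One way to do this is with scikit-learn’s built-in plot_tree, picking a single estimator out of the forest — again just a sketch, assuming the fitted model from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 8))
plot_tree(model.estimators_[0],           # the first tree in the forest
          feature_names=list(X.columns),
          class_names=[str(c) for c in model.classes_],
          max_depth=2,                    # keep the drawing readable
          filled=True)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;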

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OH0OVw8V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3414/1%2A7KkpmxEWN0GW1sKkV86YtQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OH0OVw8V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3414/1%2A7KkpmxEWN0GW1sKkV86YtQ.png" alt="A visual decision tree showing the hour and day decision points" width="880" height="359"&gt;&lt;/a&gt;&lt;em&gt;A visual decision tree showing the hour and day decision points&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here in my display tree I can see the hour of the day is the initial decision point in the prediction — with 7:00am an interesting part of the algorithm. This is the time when alarm clocks go off in our household — and the cat is motivated to get up and look for food. Another interesting part of the tree is “day of the week ≤ 5.5”. This equates to the day of the week being Monday through Friday — and again this part of the algorithm makes sense, as we (and the cat) generally get up a bit later on weekends.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cat predictor web-app in Streamlit
&lt;/h2&gt;

&lt;p&gt;With the model created, I now wanted to build a web application to predict Snowy’s location based on a range of inputs. &lt;a href="https://docs.streamlit.io/"&gt;Streamlit&lt;/a&gt; is an open-source Python library that makes it easy to create web apps (without me having to learn a bunch of front-end frameworks). I added sliders and selection boxes to set feature values, such as day and temperature.&lt;/p&gt;
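
&lt;p&gt;A stripped-down sketch of what the app looks like — the saved model filename and feature encoding here are placeholders, not the exact app code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import joblib
import pandas as pd
import streamlit as st

model = joblib.load('cat_model.joblib')   # previously trained forest

st.title('Where is Snowy?')
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
hour = st.slider('Hour of day', 0, 23, 9)
day = st.selectbox('Day of week', days)
temperature = st.slider('Outside temperature (°C)', -5, 45, 20)
is_raining = st.checkbox('Is it raining?')

# Encode the inputs the same way the training data was encoded
features = pd.DataFrame([{'hour': hour,
                          'day_of_week': days.index(day),
                          'temperature': temperature,
                          'is_raining': int(is_raining)}])
st.write('Predicted location:', model.predict(features)[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;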

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6L6Io6BU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AQ6taGyYXoITfcBE6qldU0Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6L6Io6BU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AQ6taGyYXoITfcBE6qldU0Q.png" alt="Web application — with inputs as slider controls" width="880" height="355"&gt;&lt;/a&gt;&lt;em&gt;Web application — with inputs as slider controls&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And voila — with a bit more &lt;a href="https://github.com/saubury/cat-predict-app/blob/master/cat_predictor_app.py"&gt;python code&lt;/a&gt; I’ve created a Cat Prediction App; a web-app that predicts the likely location of Snowy the cat. I found some &lt;a href="https://towardsdatascience.com/a-quick-tutorial-on-how-to-deploy-your-streamlit-app-to-heroku-874e1250dadd"&gt;excellent instructions&lt;/a&gt; to deploy my Streamlit app to Heroku. So I can now &lt;a href="https://cat-predict-app.herokuapp.com/"&gt;share my Cat Predictor app&lt;/a&gt; with the world!&lt;/p&gt;

&lt;h2&gt;
  
  
  Links to code
&lt;/h2&gt;

&lt;p&gt;Hope you find this blog and code helpful for all your pet location prediction needs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data platform and ML prediction: &lt;a href="https://github.com/saubury/cat-predict"&gt;https://github.com/saubury/cat-predict&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streamlit App: &lt;a href="https://github.com/saubury/cat-predict-app"&gt;https://github.com/saubury/cat-predict-app&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Can ML predict where my cat is now — part 1</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Thu, 03 Feb 2022 10:00:04 +0000</pubDate>
      <link>https://dev.to/saubury/can-ml-predict-where-my-cat-is-now-part-1-444g</link>
      <guid>https://dev.to/saubury/can-ml-predict-where-my-cat-is-now-part-1-444g</guid>
      <description>&lt;h1&gt;
  
  
  Can ML predict where my cat is now — part 1
&lt;/h1&gt;

&lt;p&gt;It’s 9am on a rainy Tuesday morning — can a simple ML model predict where my cat will be sleeping? How I used a bluetooth tracker, a dozen microcontrollers plus a bit of Python to predict where Snowy the cat would be napping in the next hour.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QN4-ljut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ADHIT-bmgUc1pJ_Pb-Rev7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QN4-ljut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ADHIT-bmgUc1pJ_Pb-Rev7w.png" alt="Predicting where Snowy the cat is likely to be based on time and weather" width="717" height="608"&gt;&lt;/a&gt;&lt;em&gt;Predicting where Snowy the cat is likely to be based on time and weather&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;With some inexpensive hardware (and a cat ambivalent to data privacy concerns) I wanted to see if I could train a machine learning (ML) model to predict where Snowy would go throughout her day.&lt;/p&gt;

&lt;p&gt;Home based location &amp;amp; temperature tracking allowed me to build up an extensive history of which rooms she used for her favourite sleeping spots. I had a theory that with sufficient data collected, I’d be able to train an ML model to predict where the cat was likely to be.&lt;/p&gt;

&lt;p&gt;This two-part blog describes the hardware and software necessary to collect the data, build a prediction model and test the real-world accuracy of cat behaviour estimation. This first blog describes the hardware and data collection, and part 2 describes building the prediction model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware for room level cat tracking
&lt;/h2&gt;

&lt;p&gt;The first task was to collect a &lt;em&gt;lot&lt;/em&gt; of data on where Snowy historically spent her time — along with environmental factors such as temperature &amp;amp; rainfall. I set an arbitrary target of collecting hourly updates for around three months of movement in and around the house.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0DoRVeOh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AMuUAaWKGzRv-cXhKR8vghQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0DoRVeOh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AMuUAaWKGzRv-cXhKR8vghQ.png" alt="Finding the room Snowy is in relies on a base station in each likely location" width="517" height="281"&gt;&lt;/a&gt;&lt;em&gt;Finding the room Snowy is in relies on a base station in each likely location&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Cats aren’t great at data entry, so I needed an automated way of collecting her location. I asked Snowy to wear a &lt;a href="https://www.thetileapp.com/en-us/products"&gt;Tile&lt;/a&gt; — a small, battery powered bluetooth tracker. This simply transmits a regular and unique BLE signal. I then used eight stationary receivers to listen for the BLE Tile signal. These receiver nodes were &lt;a href="https://en.wikipedia.org/wiki/ESP32"&gt;ESP32&lt;/a&gt;-based presence detection nodes, each running &lt;a href="https://espresense.com/"&gt;ESPresense&lt;/a&gt;. The nodes were placed in named rooms in and around the house (6 inside, 2 outside).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jGj1wolm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A6rkhPumX6uEOfHcp9unK3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jGj1wolm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A6rkhPumX6uEOfHcp9unK3w.png" alt="A collection of ESP32 modules and a BLE Tile (white square)" width="880" height="606"&gt;&lt;/a&gt;&lt;em&gt;A collection of ESP32 modules and a BLE Tile (white square)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each node is constantly looking for the unique BLE signal of Snowy’s tile and measuring the received signal strength indicator (&lt;a href="https://en.wikipedia.org/wiki/Received_signal_strength_indication"&gt;RSSI&lt;/a&gt;). The stronger the signal, the closer Snowy is to that beacon (either that, or she’s messing with the battery). If I got a few seconds of strong signal next to the study sensor, for example, I could assume Snowy was likely very close to that room.&lt;/p&gt;

&lt;p&gt;Each ESP32 module is powered by a micro-USB power supply and communicates back to the base station over the home WiFi network. The networking is important, as multiple receivers can simultaneously receive a signal — and I need to determine which base station heard the “strongest” signal.&lt;/p&gt;
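
&lt;p&gt;Conceptually, the “which room?” decision is just picking the node with the strongest (least negative) RSSI. A toy sketch with made-up readings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# RSSI (in dBm) reported by each room's node for Snowy's tile
readings = {
    'study': -62,
    'kitchen': -78,
    'garden': -90,
}

# The strongest (least negative) signal wins
current_room = max(readings, key=readings.get)
print(current_room)   # 'study'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In practice ESPresense and Home Assistant handle this comparison for me — but it’s essentially this max-over-nodes logic.&lt;/p&gt;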

&lt;h2&gt;
  
  
  Hardware for logging environmental information
&lt;/h2&gt;

&lt;p&gt;Snowy avoids the outside garden when it rains, and tends to fall asleep in the warm (but not hot) rooms of the house. I wanted to collect environmental conditions, as I figured temperature and rainfall would play a significant role in determining where Snowy would hang out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z6CCfHlL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2176/1%2AY8SmOS0v8kIWJwclzCjP0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z6CCfHlL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2176/1%2AY8SmOS0v8kIWJwclzCjP0w.png" alt="Xiaomi Aqara Temperature and Humidity Sensors" width="880" height="663"&gt;&lt;/a&gt;&lt;em&gt;Xiaomi Aqara Temperature and Humidity Sensors&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I selected the &lt;a href="https://www.xiaomiproducts.nl/fr/xiaomi-aqara-temperature-and-humidity-sensor.html"&gt;Xiaomi Temperature and Humidity Sensor&lt;/a&gt; as they run for months on a battery, and communicate over large distances via the &lt;a href="https://en.wikipedia.org/wiki/Zigbee"&gt;Zigbee&lt;/a&gt; wireless mesh network. I placed these sensors throughout the house and in two external locations to capture outside conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration — building a data collection platform
&lt;/h2&gt;

&lt;p&gt;For the data collection platform I used &lt;a href="https://www.home-assistant.io/"&gt;Home Assistant&lt;/a&gt; running on a Raspberry Pi. Home Assistant is a free and open-source software for home automation that is designed to be the central control system for smart home devices. I was able to track Snowy’s location via the &lt;a href="https://espresense.com/home_assistant"&gt;binary sensor&lt;/a&gt; configuration. Essentially the room based beacon receiving the strongest signal from Snowy’s BLE tile updates an MQTT topic with her current location.&lt;/p&gt;
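
&lt;p&gt;To give an idea of that plumbing, here’s a small paho-mqtt sketch that watches a location topic — the broker address and topic name are assumptions for illustration, not my exact setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # Payload is the name of the room with the strongest signal
    print(f'Snowy is now in: {msg.payload.decode()}')

client = mqtt.Client()   # paho-mqtt 1.x style client
client.on_message = on_message
client.connect('homeassistant.local', 1883)
client.subscribe('espresense/devices/snowy-tile/#')
client.loop_forever()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;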

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3Un8Tq0n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AW24hzFd2yHwFAmrtDXB1vQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3Un8Tq0n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AW24hzFd2yHwFAmrtDXB1vQ.png" alt="Home Assistant display of location" width="415" height="301"&gt;&lt;/a&gt;&lt;em&gt;Home Assistant display of location&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For temperature and humidity measurements, I used the Xiaomi integration to get a constant update of room level environment conditions. (Worthy of another blog: TL;DR flash the Xiaomi Zigbee hub with &lt;a href="https://tasmota.github.io/docs/Zigbee/"&gt;Tasmota&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6329VJLr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A6OGE5cb1lgUh68z55t-bPQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6329VJLr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A6OGE5cb1lgUh68z55t-bPQ.png" alt="Home Assistant display of temperature and humidity" width="374" height="494"&gt;&lt;/a&gt;&lt;em&gt;Home Assistant display of temperature and humidity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The TL;DR summary — with the support of the amazing Home Assistant and Tasmota community I was able to gather accurate cat location along with fine grained temperature and humidity readings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data preparation — extracting data from Home Assistant
&lt;/h2&gt;

&lt;p&gt;Home Assistant by default uses a SQLite database with a 10-day retention. I wanted to retain a lot more historic data to train the model, so by modifying the &lt;a href="https://www.home-assistant.io/integrations/recorder/"&gt;recorder integration&lt;/a&gt; I pushed all the data storage into a Postgres database with 6 months of retention.&lt;/p&gt;

&lt;p&gt;Home Assistant stores each sensor update in the &lt;a href="https://www.home-assistant.io/docs/backend/database/"&gt;states&lt;/a&gt; table. This is &lt;em&gt;really&lt;/em&gt; fine-grained, with updates added every few seconds from all the sensors (in my case, around 18,000 sensor updates a day). My goal was to summarise the data into hourly updates — essentially a single (most prevalent) location, along with temperature and humidity readings.&lt;/p&gt;

&lt;p&gt;I extracted the initial three months (SQL &lt;a href="https://github.com/saubury/cat-predictor/blob/master/sql/extract.sql"&gt;here&lt;/a&gt;) of hourly location and environmental conditions to train the model.&lt;/p&gt;
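
&lt;p&gt;For a feel of what that extraction does, here is a rough pandas equivalent of the hourly roll-up — the column names match the Home Assistant states schema, but the entity name and connection string are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@homeassistant/ha_db')
states = pd.read_sql('select entity_id, state, last_updated from states',
                     engine, parse_dates=['last_updated'])

# Keep just the location sensor, then roll up to the most
# prevalent room for each hour
locations = states[states['entity_id'] == 'sensor.snowy_location']
hourly = (locations.set_index('last_updated')
                   .resample('1H')['state']
                   .agg(lambda s: s.mode().iat[0] if not s.mode().empty else None))
print(hourly.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;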

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gf7icthe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2920/1%2AF5dvGQdB4gCUW_8LUkvhsA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gf7icthe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2920/1%2AF5dvGQdB4gCUW_8LUkvhsA.png" alt="Extract of hourly location and environmental readings" width="880" height="294"&gt;&lt;/a&gt;&lt;em&gt;Extract of hourly location and environmental readings&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next in part 2
&lt;/h2&gt;

&lt;p&gt;This first blog described the method for locating Snowy and the data collection platform. The next blog will describe building the prediction model, and how accurate an ML model can be at determining where a cat is likely to be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b6odCiUS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AoWgiVe16mX2E_yAWeW67mw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b6odCiUS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AoWgiVe16mX2E_yAWeW67mw.png" alt="Snowy looking forward to reviewing the confusion matrix" width="601" height="493"&gt;&lt;/a&gt;&lt;em&gt;Snowy looking forward to reviewing the confusion matrix&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Cleaning messy sensor data in Kafka with ksqlDB</title>
      <dc:creator>Simon Aubury</dc:creator>
      <pubDate>Mon, 26 Jul 2021 04:31:22 +0000</pubDate>
      <link>https://dev.to/saubury/cleaning-messy-sensor-data-in-kafka-with-ksqldb-2181</link>
      <guid>https://dev.to/saubury/cleaning-messy-sensor-data-in-kafka-with-ksqldb-2181</guid>
      <description>&lt;h1&gt;
  
  
  Cleaning messy sensor data in Kafka with ksqlDB
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Collecting streaming data is easy — but understanding it is much harder! With months of 🐱 weight data captured, I discovered sensor data can be very messy ⚡. Let me share some of the real-world data problems I encountered — and how I solved my stream processing &amp;amp; cat dining challenges with ksqlDB.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aIZbgGPF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/1%2ArcHo0yBkc7zV5TQEMWl2lg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aIZbgGPF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/1%2ArcHo0yBkc7zV5TQEMWl2lg.png" alt="Snowy the cat — an expert in streaming data"&gt;&lt;/a&gt;&lt;em&gt;Snowy the cat — an expert in streaming data&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You may have seen me &lt;a href="https://simon-aubury.medium.com/snowys-eating-tweeting-my-cats-weight-dining-habits-with-a-raspberry-pi-3218e340c20c"&gt;tweeting my cat’s weight &amp;amp; dining habits with a Raspberry Pi&lt;/a&gt;. This small project captures the weight of both my cat and her food bowl, and places these measurements into a Kafka topic about every second.&lt;/p&gt;

&lt;p&gt;These weight measurements are pretty messy — and Snowy 😺 doesn’t want to stand still on the scale! She has been causing havoc with the measurement data by interfering with both the food scale and cat scale when eating. The four problems I needed to overcome:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data type casting — forming strings into timestamps&lt;/li&gt;
&lt;li&gt;Windowing — finding the start and end of dining sessions&lt;/li&gt;
&lt;li&gt;Resequencing — handling refilling of food tray during the day&lt;/li&gt;
&lt;li&gt;Discarding — weight mismatch when the cat stands on the food scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fortunately the time series measurements are captured in an infinite event stream in Kafka. This gives me a chance to reprocess the stream to make sense of her weight and dining habits. Let me share some of the sensor data problems I encountered — and how I solved my stream processing cat dining challenges with ksqlDB.&lt;/p&gt;

&lt;h1&gt;
  
  
  Context — getting weight data into Kafka
&lt;/h1&gt;

&lt;p&gt;As described earlier, the Raspberry Pi is constantly measuring the weight of the food and the cat using load cell weight sensors. The &lt;a href="https://github.com/saubury/catfit"&gt;catfit&lt;/a&gt; project is written in Python, so the code to write weight measurements looks something like the code below. TL;DR weight measurements are written to the feed_log Kafka topic roughly every second with cat and food values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from confluent_kafka import Producer, KafkaError
import json

# Kafka Producer
producer_conf = {
'bootstrap.servers': config.bootstrap_servers, 
'sasl.username': config.sasl_username, 
'sasl.password': config.sasl_password,
'security.protocol': 'SASL_SSL', 
'sasl.mechanisms': 'PLAIN'
}

producer = Producer(producer_conf)
# Regularly save the cat &amp;amp; food weight measurement

producer.produce('feed_log',  value=json.dumps({"event_date": event_date.strftime("%d/%m/%Y %H:%M:%S"), "cat_weight": cat_weight, "food_weight": food_weight}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’m using &lt;a href="https://www.confluent.io/confluent-cloud/"&gt;Confluent Cloud&lt;/a&gt; as a fully managed Kafka service; but the code would look almost identical if you were connecting to another Kafka cluster.&lt;/p&gt;

&lt;h1&gt;
  
  
  Kafka Event stream processing with ksqlDB
&lt;/h1&gt;

&lt;p&gt;Now that I have a stream of time-based measurements flowing into Kafka, I want to understand Snowy’s eating habits. I could dump the feeding events into a database and then perform some analysis. A better approach is &lt;a href="https://en.wikipedia.org/wiki/Stream_processing"&gt;stream processing&lt;/a&gt; — which means I can write an application to respond to new data events at the moment they occur. I’m a big fan of &lt;a href="https://ksqldb.io/"&gt;ksqlDB&lt;/a&gt; as a platform to create event streaming applications. I can perform stream processing data clean-up with a few lines of SQL (as opposed to writing a stream processor in Scala or Java).&lt;/p&gt;

&lt;h1&gt;
  
  
  Create Stream
&lt;/h1&gt;

&lt;p&gt;First job is to create a KSQL &lt;a href="https://docs.ksqldb.io/en/latest/concepts/streams/"&gt;stream&lt;/a&gt; describing the Kafka feed_log topic. A KSQL stream is a simple way to describe the contents of a Kafka topic. It’s pretty self-describing and looks something like …&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE STREAM feed_stream_raw 
(event_date varchar, cat_weight double, food_weight double) 
WITH (kafka_topic='feed_log', value_format='json');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I created the project I had (foolishly) placed the timestamp of the events in a text field called event_date. In hindsight this was a bit silly, but I had already collected over a month of data — so I needed to convert it to a timestamp datatype. I got a little fancy and created a second stream called weight_stream with the converted datatype, and set the TIMESTAMP property to the expected value. I also set the serialization format to &lt;a href="https://docs.ksqldb.io/en/latest/reference/serialization/#avro"&gt;AVRO&lt;/a&gt;. Not only does this neatly register everything in a &lt;a href="https://docs.confluent.io/platform/current/schema-registry/index.html"&gt;schema registry&lt;/a&gt;, it also makes me feel like I’m &lt;a href="https://en.wikipedia.org/wiki/Apache_Avro#Logo"&gt;building an airplane&lt;/a&gt; in my garage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create stream weight_stream
with (timestamp='event_ts', value_format='avro') 
as 
select 'snowy' as cat_name
, stringtotimestamp(event_date, 'dd/MM/yyyy HH:mm:ss') as event_ts
, event_date
, cat_weight
, food_weight 
from feed_stream_raw;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Challenge 1 - Find the start and end of dining sessions
&lt;/h1&gt;

&lt;p&gt;I want to find periods of time when Snowy is eating. If I were ambitious I’d train her to press a button at the end of her meal. Instead I’ll settle for using some stream processing magic to work out her dining sessions.&lt;/p&gt;

&lt;p&gt;I’ll define an eating session as when Snowy is on the cat scale. That is, if the cat scale is indicating a 5.8kg fluffy mass is on the scale, I can assume some eating is underway. When the weight (of the cat scale) drops to zero I can assume she has stopped eating. A zero weight means she has wandered off to have a nap somewhere.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Kb1OxNIf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/568/1%2Ak18PyQ3kmSNeYie-OOrDkQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Kb1OxNIf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/568/1%2Ak18PyQ3kmSNeYie-OOrDkQ.png" alt="Eating windows — jumping from one session to another"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Eating windows — jumping from one session to another&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Events (such as periodic weight measurements) are simply a stream of numbers landing in a Kafka topic. I can use a windowing state store to aggregate all the records received so far within the defined window boundary. The window closes when a sufficient gap of inactivity (say 1 minute) has passed with near zero weight measurements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KwERAzKL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/700/1%2Ar4qOHG0_O6uenist300txg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KwERAzKL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/700/1%2Ar4qOHG0_O6uenist300txg.gif" alt="ksqlDB Session Windows"&gt;&lt;/a&gt;&lt;em&gt;ksqlDB Session Windows&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A windowing state store is used to store the aggregation results per window — and allows me to determine how much food is consumed within each eating session. Here’s a quick idea of how to write a &lt;a href="https://docs.ksqldb.io/en/latest/concepts/time-and-windows-in-ksqldb-queries/#session-window"&gt;time session window query&lt;/a&gt; in KSQL. This code is finding a period of activity separated by (at least) a 60 second gap of inactivity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select avg(cat_weight) as cat_weight_avg
, max(food_weight) as food_weight_max
from  weight_stream 
window session (60 seconds) 
group by cat_name
emit changes;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PQQSQElg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/409/1%2AWGjtIrC_uC39BPtVLmfa-Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PQQSQElg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/409/1%2AWGjtIrC_uC39BPtVLmfa-Q.png" alt="Two session windows"&gt;&lt;/a&gt;&lt;em&gt;Two session windows&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Challenge 2 — Ignore refilling of the food
&lt;/h1&gt;

&lt;p&gt;A practical problem with the monitoring station: I need to refill 🍴 the food tray. This means the food weight can increase throughout the day, at fairly random times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CEuuxLJK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/577/1%2Aqje_Jycox-v-VazdZGNT1Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CEuuxLJK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/577/1%2Aqje_Jycox-v-VazdZGNT1Q.png" alt="An eating session spanning almost 3 minutes"&gt;&lt;/a&gt;&lt;em&gt;An eating session spanning almost 3 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Fortunately I can ignore the actual weight of the food, as I really only care about the change of food weight during an eating window. As long as I know the starting food weight and final food weight during an eating window, I know how much Snowy has eaten during that meal. I can find the difference in food weight within a window like this …&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select ...
, max(food_weight) - min(food_weight) as food_eaten
from  weight_stream
window ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Challenge 3— Stepping on the wrong scale
&lt;/h1&gt;

&lt;p&gt;There are two independent scales — but there are times when Snowy leaves her sensor plate and places her weight on the food scale. Suddenly her measured weight will drop (by around 1kg) and the weight of the food will seem to increase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--64eDNCd---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/623/1%2AvJBGFM_DryCSaru0tKeo9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--64eDNCd---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/623/1%2AvJBGFM_DryCSaru0tKeo9g.png" alt="Data mismatch — cat stepping on food plate"&gt;&lt;/a&gt;&lt;em&gt;Data mismatch — cat stepping on food plate&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I initially tried to account for this momentary difference by calculating the shifted mass. But after a bit of tinkering I decided it was easier to simply ignore these spurious events with some boundary checks. To only keep the “sensible” events I can use a predicate like this …&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select ..
from  weight_stream
where food_weight &amp;lt; 1100 and cat_weight &amp;gt; 5800 and cat_weight &amp;lt; 6200
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Putting it all together
&lt;/h1&gt;

&lt;p&gt;I now have a stream of events, grouped into eating session windows. I’ve excluded the rogue data and calculated the food consumed for each meal. I can materialize the result into a KSQL table, which represents a snapshot (a point in time) of how much food was eaten within a window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table cat_weight_table 
as 
select avg(cat_weight) as cat_weight_avg
, max(food_weight) - min(food_weight) as food_eaten
from  weight_stream 
window session (600 seconds) 
where food_weight &amp;lt; 1100 and cat_weight &amp;gt; 5800 and cat_weight &amp;lt; 6200
group by cat_name
having count(*) &amp;gt; 4;

select  timestamptostring(windowstart, 'dd/MM/yyyy HH:mm:ss') 
, timestamptostring(windowend, 'dd/MM/yyyy HH:mm:ss') 
, (windowend-windowstart) / 1000 as eat_seconds
, round(cat_weight_avg) as cat_weight_grams
, round(food_weight_max - food_weight_min) as food_eaten_grams
, cnt 
from cat_weight_table 
where cat_name = 'snowy';
Which results in a projection with clear dining sessions, the duration and weight of feed eaten.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--obApY53z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/1%2A7C88u58UR3Z6YkBRwuoWAQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--obApY53z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/700/1%2A7C88u58UR3Z6YkBRwuoWAQ.png" alt="Eating sessions — with elapsed time and food eaten"&gt;&lt;/a&gt;&lt;em&gt;Eating sessions — with elapsed time and food eaten&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;An uncooperative cat plus unexpected load cell measuring inaccuracies meant I had a lot of messy data. A bit of stream processing with ksqlDB makes it much easier to work out when your cat needs to go on a diet.&lt;br&gt;
Feel free to download and try out this project yourself (with data) — &lt;a href="https://github.com/saubury/catfit/blob/master/ksqldb/ksqldb.md"&gt;catfit (github.com)&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
