Pavel P.

Exploring OSM changesets via DuckDB

Is the statement, "DuckDB - A game changer for analyzing OpenStreetMap" true?
Let's find out together!

There have been so many positive posts about this analytical database system, especially for geospatial data, that I decided it was time for some healthy doubt and for testing it myself. Combined with an out-of-curiosity trial of dbt outside a production environment, it turned into a small ETL pet project.

What's the idea?

Let’s take data from OpenStreetMap, load it into DuckDB, run some analytical queries, and show the result. Quite simple, right?

[Image: 20min_journey]

But people do that all the time, so on its own it's already a bit boring.

What about not only the geodata, but the “metadata”? Every time someone draws a house on the map, a changeset is created with additional information: user_id, timestamp, the software used, tags, and a few other fields.
Looks interesting! If I can combine that with the data itself, it can literally “leverage cross-functional expertise to catalyze impactful decisions for key imaginary stakeholders!”

Where should we start?
I started out in my IKEA chair, behind a five-year-old laptop with 8 cores and 7.24 GB of RAM running WSL, downloading the latest 6.8 GB .osm.bz2 changeset dump (XML inside) from planet.openstreetmap.org.
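
If you want to follow along: at the time of writing, the planet-wide changeset dump can be grabbed like this (the filename and size will change as new dumps are published):

# Grab the latest planet-wide changeset dump (~6.8 GB and growing)
wget -c https://planet.openstreetmap.org/planet/changesets-latest.osm.bz2 \
     -O data/in/osm_data/changesets-latest.osm.bz2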

Let the project begin!

Changesets

What do we have?
We have 7 gigs of archived XML, with data starting from 2005-04-09:

osmaboardbt$ bzcat data/in/osm_data/changesets-latest.osm.bz2 | head -n 5
<?xml version="1.0" encoding="UTF-8"?>
<osm license="http://opendatacommons.org/licenses/odbl/1-0/" copyright="OpenStreetMap and contributors" version="0.6" generator="planet-dump-ng 1.2.4" attribution="http://www.openstreetmap.org/copyright" timestamp="2024-11-25T00:59:59Z">
 <bound box="-90,-180,90,180" origin="http://www.openstreetmap.org/api/0.6"/>
 <changeset id="1" created_at="2005-04-09T19:54:13Z" closed_at="2005-04-09T20:54:39Z" open="false" user="Steve" uid="1" min_lat="51.5288506" min_lon="-0.1465242" max_lat="51.5288620" max_lon="-0.1464925" num_changes="2" comments_count="48"/>
 <changeset id="2" created_at="2005-04-17T14:45:48Z" closed_at="2005-04-17T15:51:14Z" open="false" user="nickw" uid="94" min_lat="51.0025063" min_lon="-1.0052705" max_lat="51.0047760" max_lon="-0.9943439" num_changes="11" comments_count="5"/>

What do I want?

I want a folder with Parquet files, each one close to 1 GB (following some best practice I heard somewhere, I don’t remember where), with structured data keeping all columns from the source: id, created_at, closed_at, uid, user, num_changes, bbox, tags.

I couldn’t find a tool that would do this for me, so apparently it isn’t a common thing people do. Maybe my Python script will help someone in the future.

It was also a good idea to prepare test data first. One thing I’ve learned in the past: I will end up with dozens of test runs, and saving minutes by running them on a slice of the data saves hours in the end. So I also wrote a bash script that takes just one month from every year and processes it.

Together with GNU Parallel utilizing almost all of my hardware, I was able to split the source file into 12 .osm.bz2 files and then into 12 Parquet files in reasonable time, and only the latest file needs to be updated/reprocessed next week.
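
The split itself is basically osmium changeset-filter with time bounds, fanned out over the years with Parallel. A rough sketch of the idea, not the exact script from the repo (year buckets and paths are illustrative):

# One osmium job per year; the real script uses different buckets for the early years
seq 2016 2024 | parallel --jobs 6 '
  osmium changeset-filter \
    --after={}-01-01T00:00:00Z \
    --before=$(({} + 1))-01-01T00:00:00Z \
    -o data/in/osm_data/changesets_{}.osm.bz2 \
    data/in/osm_data/changesets-latest.osm.bz2'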

Cool! Well, not 100% cool: there are duplicates across files due to the semantics of the --after and --before filters in the osmium changeset-filter tool. But I decided to follow the Unix philosophy here and handle deduplication at the dbt model level.

D select filename, count(*) cnt from read_parquet('data/out/parquet/changesets_*.parquet', filename = true) group by 1 order by 1;

┌───────────────────────────────────────────────┬──────────┐
│                   filename                    │   cnt    │
│                    varchar                    │  int64   │
├───────────────────────────────────────────────┼──────────┤
│ data/out/parquet/changesets_2005_2012.parquet │ 13777528 │
│ data/out/parquet/changesets_2013_2015.parquet │ 20731429 │
│ data/out/parquet/changesets_2016.parquet      │  8420095 │
│ data/out/parquet/changesets_2017.parquet      │ 10091179 │
│ data/out/parquet/changesets_2018.parquet      │ 10756788 │
│ data/out/parquet/changesets_2019.parquet      │ 13068229 │
│ data/out/parquet/changesets_2020.parquet      │ 17588235 │
│ data/out/parquet/changesets_2021.parquet      │ 18808719 │
│ data/out/parquet/changesets_2022.parquet      │ 15036354 │
│ data/out/parquet/changesets_2023.parquet      │ 14889321 │
│ data/out/parquet/changesets_2024.parquet      │ 12775670 │
│ data/out/parquet/changesets_latest.parquet    │   659191 │
├───────────────────────────────────────────────┴──────────┤
│ 12 rows                                         2 columns │
└───────────────────────────────────────────────────────────┘
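
A quick sanity check shows the overlap (just a sketch; the exact numbers depend on when you run it):

# Total rows vs. distinct changeset ids across all Parquet files
./duckdb -c "select count(*) as total_rows, count(distinct id) as distinct_changesets
             from read_parquet('data/out/parquet/changesets_*.parquet');"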

OSM PBF

To be honest, I thought that processing the changeset data would be the hardest part here. It turned out that loading PBF files into DuckDB has even more issues.

[Image: one_does_not_load_pbf]

Issue 1: Data Sources
There are several sources for these files. The most popular one, Geofabrik, provides two "types" of dumps: the regular one from download.geofabrik.de and a second one from osm-internal.download.geofabrik.de, which includes full personal metadata. I needed this metadata for joining with the changeset data. Honestly, I don’t understand why changeset_id is considered personal information, but it is what it is.

Issue 2: Missing Metadata in DuckDB
Even having the *internal.osm.pbf file doesn’t make things easier because… neither QuackOSM nor loading the file with the Spatial extension shows you that information! A little disappointing for a "game-changer," as far as I’m concerned.
I also encountered some issues documented on a GitHub page:

WSL usage: sometimes the code can break because DuckDB tries to use all available memory, which may already be occupied by Windows.

Temporary files: around 14 GB(!) of temporary files were created during the process.
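
Both can be mitigated a bit by capping DuckDB's memory and pointing its spill files at a disk with enough space. A minimal sketch (limits and paths are illustrative):

# Cap memory and redirect temporary files before running the heavy stuff
./duckdb -c "set memory_limit = '4GB';
             set temp_directory = '/mnt/d/duckdb_tmp';
             -- ...heavy query goes here..."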

Issue 3: Osmium Export and DuckDB
Next, I turned to the good old team player Osmium, where you can update the config file to get everything you need. But yet again…

At first, I thought I could use the simple CSV-like PostgreSQL output and load it into a table like I’ve done dozens of times. Unfortunately, DuckDB couldn’t handle the long lines (the max_line_size=3600000 parameter helped) or the complex escape sequences with things like four slashes and multiple quotes. Sure, I could run sed or chain replace() calls over and over, but I don’t want workarounds. A game-changer was promised to me!
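
For reference, this is roughly what that attempt looked like; the 'pg' output format and the read_csv parameters are taken from the osmium and DuckDB docs, so treat it as a sketch rather than the exact commands from the repo:

# Export to PostgreSQL COPY text format, then try to bulk-load it with DuckDB
osmium export -f pg -c osmium.config.json \
  -o data/out/poland.pg data/in/osm_data/poland-latest-internal.osm.pbf

./duckdb -c "create table poland_pg_raw as
             select * from read_csv('data/out/poland.pg',
                                    header=false,
                                    max_line_size=3600000);"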

Then I tried the GeoJSON (RFC 7946) option: a 15 GB file containing one FeatureCollection with complex tags and so on. Nope. Errors like "maximum_object_size of bla-bla bytes exceeded while reading file," plus some others I can’t even remember.

Okay, maybe GeoJSON with its maximum_object_size limit isn’t ideal. Enter GeoJSON Text Sequence (RFC 8142), also known as geojsonseq or jsonseq. Each record (beginning with an RS character (0x1E, record separator) and ending with a linefeed) contains one GeoJSON object, making the format suitable for streaming. Sounds perfect, right? Nope. DuckDB couldn’t read it.

Believe me, I tried dozens of options and approaches. I don’t know how someone new to geospatial data or even data engineering could find this situation acceptable.

The Solution: shellfs extension
Finally, after hours of googling, I found a gem: the Shellfs extension for DuckDB. This made DuckDB finally useful for my case. Seriously, give it a star on GitHub!

With this, I could load my PBF file, complete with all the data I needed, without creating large temporary files. Just a short one-liner:

./duckdb poland_internal.duckdb -c "load shellfs; set memory_limit = '$(duckdb_max_memory)'; create table poland_internal_raw as select json_extract(json, '$.properties') as properties, json_extract(json, '$.geometry') as geom_geojson from read_json_objects('OSMIUM_POOL_THREADS=$(osmium_pool_threads) osmium export -i sparse_file_array -c osmium.config.json -f geojsonseq data/in/osm_data/poland-latest-internal.osm.pbf | jq -c |', format='auto');"

Make sure both duckdb and osmium have enough RAM to run at the same time.
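
Once the table exists, the attributes and tags sit inside the properties JSON and can be pulled out with DuckDB's JSON operators. A small sketch (the '@changeset' key name depends on how attributes are configured in osmium.config.json):

# Top changesets by number of objects in the extract
./duckdb poland_internal.duckdb -c "
  select properties->>'@changeset' as changeset_id, count(*) as objects
  from poland_internal_raw
  group by 1
  order by objects desc
  limit 10;"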

dbt and export to S3

Moving further. Now we have two .duckdb files with raw OSM data. It's super easy to add both to the profiles.yml file.

osmaboar_dbt:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ../osmaboardbt.duckdb
      attach:
        - path: ../poland_internal.duckdb
      extensions: 
        - spatial
        - json
        - shellfs
      threads: 6
      memory_limit: '4G'

I don’t want to spend much time here; the reader is welcome to check GitHub for the details. I created two models that load the raw OSM data into a more useful form, with some casts and adjustments. One model determines each user’s first changeset, which is impossible to get directly from the source data: the changesets_count field (the number of changesets the user had made before the current one) is only populated by the iD and Rapid editors.
Another model removes the duplicates mentioned above from the Parquet files using a standard row_number() approach.
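
The core of that dedup, expressed as a plain DuckDB query instead of the actual dbt model (the partition and order columns are my assumption of what the model does):

# Keep one row per changeset id across the overlapping Parquet files
./duckdb -c "
  with ranked as (
    select *,
           row_number() over (partition by id order by closed_at desc) as rn
    from read_parquet('data/out/parquet/changesets_*.parquet')
  )
  select * exclude (rn)
  from ranked
  where rn = 1;"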

Three other models are for aggregations and analytics, where DuckDB really shines:

  • Number of changesets per month and editor (sketched below)
  • Number of changesets by host
  • Filtered amenity data, focused on editors, shops, and opening hours
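
As an illustration of the first aggregation, this is roughly what it boils down to, assuming the tags column ended up stored as JSON (the real model also runs the editor name through a cleanup macro):

# Changesets per month and (crudely cleaned) editor name
./duckdb -c "
  select date_trunc('month', created_at::timestamp)                  as month,
         split_part(json_extract_string(tags, 'created_by'), ' ', 1) as editor,
         count(*)                                                    as changesets
  from read_parquet('data/out/parquet/changesets_*.parquet')
  group by 1, 2
  order by 1, 3 desc;"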

What I want to highlight here is that dbt macros are a great feature! I’ve used them before, and they’re a fantastic tool for reusing code and keeping business logic centralized, which is essential for larger teams. For example, I used macros to clean up software versions and retain only editor names.
Once the data is ready, it’s pushed to S3 and is accessible for you!
All links, along with the dbt project, are available on GitHub and in the notebook in the next chapter.

Showing results

The data is processed and the CSV is ready. What can we do now? Graphics!
I've created a notebook in Databricks Community Edition so you can check it out. Nothing special about the code — loading the CSV, casting types, and showing results.
And the results are quite interesting!

For example, I didn’t think StreetComplete was so popular for improving maps. Even the stats for people making their first edits look great. It seems like gamification works!

[Image: StreetComplete]

I also noticed that Maps.me is still a tool for OSM contributors, even though Organic Maps has been on the market for quite some time. Really, why are people still using Maps.me? It’s okay for overall stats, but for first edits? I don’t get it.

[Image: mapsme_vs_organic_maps]

Another interesting one is OsmBiz. It looks like businesses are aware of OSM. I’d love to learn more about them.

When it comes to everyday usage, I care about the working hours of places. Let’s take shops as an example. If we split them into two groups (those with working hours and those without), it’s quite disappointing: 94% of places don’t have working hours. But when shops do have them, StreetComplete, Every Door, and B-Jazz show up again.
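
For the curious, that split can be reproduced straight from the raw table, since osmium export puts the tags directly into properties (a sketch; the published model is a bit more careful):

# Shops with and without an opening_hours tag in the raw extract
./duckdb poland_internal.duckdb -c "
  select (properties->>'opening_hours') is not null as has_opening_hours,
         count(*) as shops
  from poland_internal_raw
  where properties->>'shop' is not null
  group by 1;"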

[Image: working_hours]

You can explore the data yourself. Just clone the notebook! Or leave me a message here or on GitHub issues — it would be great to know how people might use it!

Summary and some thoughts

Is the statement "DuckDB - A game changer for analyzing OpenStreetMap" true? I don't think so, at least not for now.

On one hand, I came to the same conclusion as the author of Should You Ditch Spark for DuckDb or Polars? (a very good post, take a look!).

The issues with OOM errors, the hype surrounding the tool, and its primary use case as an interactive, ad-hoc query tool for data exploration rather than for ETL workflows—all of these seem true to me. But even for OSM data exploration, DuckDB lacks flexibility when it comes to handling metadata, and there are some minor issues with loading geodata into the database.

It seems like PostGIS is still the better, more mature solution here. Even if I can run some queries in three seconds instead of a couple of minutes, PostGIS still offers more functions, better logging, improved observability, and a robust ecosystem and community. There’s also the option to explore pg_parquet. I'm also looking forward to seeing more progress with Iceberg. :)

DuckDB + Spatial is promising, but it's still a bit too young to replace PostgreSQL + PostGIS + Osmium at this stage. Perhaps in a couple of years…

Baits and hot-takes

One more thing I wanted to add to make the post a bit more fun :)
After seeing all the errors and illogical decisions, here are my shout-outs:
[Image: Freedom_of_Speech]

  1. Every tool based on DuckDB should have a max_memory option! Seriously, I want to limit memory usage in my WSL/Docker environment, and I want control over it.

  2. As one of the main rules of business goes, "take the client's money in whatever way they want to pay," every new database should be able to ingest my data in whatever form I provide. Long strings? Handle them! WKT/WKB for geography? No problem, ingest it! Some other kind of JSON? Eat it! I want my data inside the system without any hiccups at that stage.

[Image: duckdb_wtf]

Add your own!
