Jan Tschada

Building an OSM to RDF Pipeline for AI Agents: A Practical Guide

Photo by Patrick Tomasso on Unsplash

You want your AI agent to understand places: ports, roads, buildings, construction sites. OpenStreetMap is the obvious source. But feeding raw OSM data to an agent is like giving someone a raw database dump and asking for insights. It's possible, but painful.

We spent the last few weeks building a pipeline that turns OSM into clean RDF knowledge graphs. It's not a polished product; it's a working prototype that taught us a lot. This post shares the important steps, the mistakes we made, and the code that actually works.

What you'll get from this post:

  • Why bash scripts fail silently (and how to fix it)
  • How to generate RDF from OSM without losing your mind
  • A working Python pipeline you can adapt
  • Real examples from maritime and construction extraction

Let's dive in.

The problem with OSM + bash

Our first version was simple: a bash script that calls osmconvert to clip a region, then osmfilter to extract categories like transportation, buildings, and POIs.

Here's what the filter generation looked like (broken version):

# This is what we thought would work
filter_string="(building=house,apartments) or (building=yes)"
osmfilter input.osm --keep="$filter_string" -o=output.osm

Result: An empty file. No error message. Exit code 0.

Turns out osmfilter is picky about its --keep syntax. It doesn't want parentheses, and it doesn't want the word "and" the way we used it. It expects a flat key=value or key=value list. But the (something) or (something) pattern is exactly what every beginner writes.

We fixed it by manually expanding comma-separated values and removing parentheses:

# Correct syntax (but still fragile)
filter_string="building=house or building=apartments or building=yes"

But maintaining this in bash was a nightmare. Every new filter pattern risked breaking the regex-based parsing. Error handling was non-existent. We couldn't write unit tests. And debugging meant adding echo statements and hoping.

So we made a decision: rewrite the bash part in Python.

Step 1: A proper FilterBuilder (the bug fix that matters)

The heart of the pipeline is filter_builder.py. It takes a list of filter strings from your YAML config and builds a valid osmfilter --keep string.

Here's what it looks like:

# author: Jan Tschada
# SPDX-License-Identifier: Apache-2.0

from typing import List

def build_filter_string(self, filters: List[str]) -> str:
    expanded = []
    for filter_str in filters:
        if '=' in filter_str:
            key, values = filter_str.split('=', 1)
            if ',' in values:
                # Expand "building=house,apartments" into multiple
                for value in values.split(','):
                    expanded.append(f"{key}={value.strip()}")
            else:
                expanded.append(f"{key}={values}")
        else:
            expanded.append(filter_str)

    # Join with " or " – no parentheses, no "and"
    return " or ".join(expanded)

Example: ["building=house,apartments", "amenity=school"] becomes "building=house or building=apartments or amenity=school".
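A quick usage sketch (hypothetical; it assumes FilterBuilder needs no constructor arguments):

builder = FilterBuilder()
keep = builder.build_filter_string(["building=house,apartments", "amenity=school"])
print(keep)  # building=house or building=apartments or amenity=school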

We also added validation to catch invalid syntax before calling osmfilter:

# author: Jan Tschada
# SPDX-License-Identifier: Apache-2.0

def validate_filter_syntax(self, filter_string: str) -> bool:
    """
    Validate that the filter string uses proper osmfilter syntax.

    Args:
        filter_string: The filter string to validate

    Returns:
        True if syntax appears valid, False otherwise
    """
    if not filter_string:
        return False

    # Check for the patterns that caused the original bug
    invalid_patterns = [
        "(",      # no parentheses
        ")",
        " and ",  # we only ever join conditions with "or"
        "&&",     # no C-style logical operators
        "||",
    ]

    for pattern in invalid_patterns:
        if pattern in filter_string:
            self.logger.warning(f"Invalid filter syntax detected: '{pattern}' in '{filter_string}'")
            return False

    # Basic structure validation
    parts = filter_string.split(" or ")
    for part in parts:
        part = part.strip()
        if not part:
            continue
        if '=' not in part and part != "*":
            self.logger.warning(f"Invalid filter part: '{part}' (missing '=')")
            return False

    return True

This alone eliminated 90% of our "empty file" headaches.
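Wiring the two together means the pipeline fails fast instead of silently writing empty files. A sketch of the call site (same names as above):

keep = builder.build_filter_string(filters)
if not builder.validate_filter_syntax(keep):
    raise ValueError(f"Refusing to run osmfilter with --keep string: {keep!r}")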

Lesson: Never trust external tools to fail loudly. Always validate inputs before calling them.

Step 2: Retry logic and timeout handling

Bash scripts don't retry. If osmconvert fails because of a temporary network issue (yes, reading a local file can still fail), the whole pipeline dies.

We built a simple retry wrapper:

# author: Jan Tschada
# SPDX-License-Identifier: Apache-2.0

import logging
import subprocess
import time

def _run_command_with_retry(self, cmd, description, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
            if result.returncode == 0:
                return True
            logging.warning(f"{description}: attempt {attempt} failed: {result.stderr}")
            time.sleep(2 ** attempt)  # exponential backoff
        except subprocess.TimeoutExpired:
            logging.warning(f"{description}: timeout on attempt {attempt}")
    return False

Now when osmfilter chokes on a huge file, we retry with increasing delays. It's not elegant, but it works.
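For example, the osmfilter call from Step 1 becomes something like this (file names are placeholders):

ok = self._run_command_with_retry(
    ["osmfilter", "region.o5m", f"--keep={keep}", "-o=buildings.osm"],
    description="filter buildings",
)
if not ok:
    logging.error("osmfilter failed after all retries, skipping this category")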

Lesson: External commands are black boxes. Wrap them with retries and timeouts, especially when processing large OSM extracts.

Step 3: Generating RDF (the AI-friendly format)

OSM XML is great for editing maps. But AI agents prefer graphs. Specifically, RDF triples.

We added an --rdf flag to the CLI. When enabled, the pipeline does two extra steps after each category extraction:

  1. Convert the OSM file to PBF using osmconvert
  2. Convert the PBF to TTL using osm2rdf

Here's the code (simplified):

# author: Jan Tschada
# SPDX-License-Identifier: Apache-2.0

import subprocess

def _generate_rdf(self, osm_file, category, output_prefix):
    pbf_file = self.working_dir / f"{output_prefix}_{category}.pbf"
    ttl_file = self.output_dir / f"{output_prefix}_{category}.ttl"

    # Step 1: OSM -> PBF
    subprocess.run(["osmconvert", str(osm_file), f"-o={pbf_file}"], check=True)

    # Step 2: PBF -> TTL
    subprocess.run(["osm2rdf", str(pbf_file), "-o", str(ttl_file)], check=True)

    return ttl_file

Now you get both .osm and .ttl files. The TTL file contains triples like:

:node_123456 a :Node ;
    :hasTag "harbour=yes" ;
    :hasName "Bandar Abbas" ;
    :hasWikidata "Q207137" .

An AI agent can load this into memory or into a graph database for RAG, then run SPARQL queries:

SELECT ?port ?name ?wikidata WHERE {
    ?port :hasTag "harbour=yes" .
    ?port :hasName ?name .
    ?port :hasWikidata ?wikidata .
}
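With rdflib, for example, an agent needs only a few lines (a sketch; it assumes the TTL declares the default prefix used above, and the file name is hypothetical):

import rdflib

graph = rdflib.Graph()
graph.parse("hormuz_maritime.ttl", format="turtle")

# Prefixes parsed from the TTL are reused by the query engine
for row in graph.query("""
    SELECT ?port ?name ?wikidata WHERE {
        ?port :hasTag "harbour=yes" .
        ?port :hasName ?name .
        ?port :hasWikidata ?wikidata .
    }"""):
    print(row.port, row.name, row.wikidata)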

Lesson: Don't make agents parse OSM XML. Convert to RDF once, then let them query.

Step 4: Tag control – include only what matters

Raw OSM objects often have dozens of tags. Most are irrelevant for high-level intelligence. A port might have name, name:en, operator, wikidata, seamark:light:colour, seamark:light:period, source, created_by, etc.

We added include_tags to the YAML configuration:

maritime:
  subcategories:
    seamark:
      filters:
        - "landuse=harbour"
        - "harbor=yes"
      include_tags:
        - "name"
        - "name:en"
        - "operator"
        - "wikidata"

In filter_builder.py, we translate this into the --keep-tags argument:

# author: Jan Tschada
# SPDX-License-Identifier: Apache-2.0

if keep_tags:
    # osmfilter wants 'all' first, then each tag as "key=" (keep any value)
    cmd.extend(["--keep-tags", "all " + " ".join(f"{tag}=" for tag in keep_tags)])

Now the output RDF only contains those tags. Smaller files, cleaner graphs, faster agent queries.
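With the config above, the assembled command ends up looking like this (expressed as a Python argument list; file names are placeholders):

cmd = [
    "osmfilter", "region.o5m",
    "--keep=landuse=harbour or harbour=yes",
    "--keep-tags=all name= name:en= operator= wikidata=",
    "-o=maritime_seamark.osm",
]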

Lesson: Configuration is better than hardcoding. Let users specify which tags matter for their domain.

Step 5: Directory handling and idempotency

A good pipeline should be safe to run multiple times. We added:

  • --force flag to overwrite existing files
  • --dry-run to show what would be executed
  • Automatic creation of output and working directories

The skip-if-exists check is a one-liner:
# author: Jan Tschada
# SPDX-License-Identifier: Apache-2.0

if output_file.exists() and not self.force_overwrite:
    logging.info(f"Output exists, skipping: {output_file}")
    return output_file

This saved us countless headaches during testing.
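--dry-run short-circuits in the same spot; one way to sketch it:

if self.dry_run:
    logging.info(f"[dry-run] Would run: {' '.join(cmd)}")
    return None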

Lesson: Always make your tools idempotent. It costs almost nothing and saves hours.

Real-world examples that worked

We tested on three real scenarios:

1. Maritime features (Strait of Hormuz)

osm-extract --bbox "56.037612,26.951262,56.098037,26.977960" \
            --config maritime_features.yaml \
            --rdf --verbose

Output: A small OSM file and a TTL file containing every port, harbour, and naval base in that area. The RDF was clean enough to answer: "Which ports have a Wikidata ID but no English name?"
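That question translates directly into SPARQL. A sketch against the simplified vocabulary from Step 3 (assuming name:en maps to a hypothetical :hasNameEn predicate):

SELECT ?port ?wikidata WHERE {
    ?port :hasTag "harbour=yes" .
    ?port :hasWikidata ?wikidata .
    FILTER NOT EXISTS { ?port :hasNameEn ?en }
}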

2. Construction sites

We created a separate config that only extracts landuse=construction:

construction:
  subcategories:
    landuse:
      filters: ["landuse=construction"]
      include_tags: ["name", "operator", "construction"]

Running this on Tehran gave us a list of active construction zones. An agent monitoring change over time could detect new infrastructure before it's officially announced.
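A sketch of that change detection with rdflib, using two hypothetical snapshot files:

import rdflib

before = rdflib.Graph().parse("tehran_2024.ttl", format="turtle")
after = rdflib.Graph().parse("tehran_2025.ttl", format="turtle")

# Triples present now but not in the previous run: candidate new construction
for subject, predicate, obj in set(after) - set(before):
    print(subject, predicate, obj)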

3. Global extraction

Yes, you can extract the whole planet:

osm-extract --bbox="-180,-90,180,90" --config features.yaml --rdf

This takes a while (and a lot of RAM), but it works. We don't recommend it for regular use, but it's good for training geospatial foundation models.

What we learned (the hard way)

1. Empty files are still valuable.

We initially skipped RDF generation for files smaller than 1MB. That was a mistake. A 200-byte file with a single harbour=yes node tells the agent "there is a harbor here, but we know nothing else." That's still intelligence. We removed the size check.

2. Logging saves hours.

We added verbose and debug flags. When something fails, we can see exactly which command was run, what the exit code was, and what stderr said. No more guessing.

3. Unit tests are not optional.

Once we fixed the parentheses bug, we wrote tests for it. That bug has never returned. Every new filter pattern is now tested.
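The regression test is tiny (pytest, assuming FilterBuilder needs no arguments):

from filter_builder import FilterBuilder

def test_comma_values_expand_without_parentheses():
    builder = FilterBuilder()
    keep = builder.build_filter_string(["building=house,apartments", "building=yes"])
    assert keep == "building=house or building=apartments or building=yes"
    assert "(" not in keep and " and " not in keep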

4. OSM's human quirks: different tags, same concept

Handling compressed files is the trivial part. The real challenge is that OSM is a human-generated, consensus-driven dataset: two mappers will tag the same real-world feature in completely different ways. One tags a port harbour=yes, another uses landuse=harbour, and a third misspells it harbor=yes.
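One way to cope is a small normalization map; a sketch, where the mapping itself is domain knowledge you have to curate by hand:

# One concept, many community tags (including misspellings seen in the wild)
TAG_SYNONYMS = {
    ("harbour", "yes"): "Port",
    ("harbor", "yes"): "Port",
    ("landuse", "harbour"): "Port",
    ("landuse", "construction"): "ConstructionSite",
}

def normalize(key: str, value: str) -> str | None:
    """Map a raw OSM tag to a canonical concept, or None if unknown."""
    return TAG_SYNONYMS.get((key, value))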

The complete pipeline in one diagram (text version)

User input (region or bbox)
    ↓
Geocoder → bounding box
    ↓
osmconvert → region.o5m
    ↓
For each enabled category:
    ├── Get filters from YAML
    ├── Build osmfilter command (with proper syntax)
    ├── Run osmfilter → category.osm
    └── If --rdf:
            ├── osmconvert → category.pbf
            └── osm2rdf → category.ttl
    ↓
Generate manifest.json
    ↓
Done.
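The manifest is what lets an agent discover the outputs. A hypothetical shape (your field names will differ):

{
  "region": "strait_of_hormuz",
  "bbox": [56.037612, 26.951262, 56.098037, 26.97796],
  "categories": {
    "maritime": {
      "osm": "strait_of_hormuz_maritime.osm",
      "ttl": "strait_of_hormuz_maritime.ttl"
    }
  }
}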

What's next (our roadmap)

  • Auto-generate configs from OSM Wiki – scrape tag documentation and create extraction rules automatically.
  • Agent-in-the-loop extraction – let an agent request custom filters ("all hospitals with helipads within 10km of a highway").
  • Spatial reasoning over RDF – load TTL into a graph database and enable geospatial queries.

But even without those, the current prototype is useful. It turns OSM from a raw data dump into a structured knowledge source that AI agents can actually consume.

Final thoughts

Building this pipeline took longer than we expected. Most of the time wasn't spent on the core logic; it was spent wrestling with edge cases, silent failures, undocumented tool behavior, and the occasional self-inflicted mistake.

But now it works. It's not perfect, but it's reliable enough for real-world extraction. And because it's open, others can improve it.

Let us know if you would be interested in using it...
