kalyanakannan padivasu

Posted on Mar 17

pyGAEB: The Python Library That Unlocks GAEB Construction Data

#gaeb #bim #constructiontechnology #opensource

Parse, validate, classify, and write GAEB DA XML files — with optional LLM-powered item classification — in one open-source Python package.

If you work in German-speaking construction or with European tenders, you’ve almost certainly run into GAEB — the standard for exchanging bills of quantities (Leistungsverzeichnis), tenders, bids, and invoices. GAEB DA XML is the modern format: XML-based, versioned (2.0 through 3.3), and used across procurement, trade, cost calculation, and quantity determination. The catch? Parsing it properly means handling multiple versions, encodings, malformed files, and phase-specific rules — and then often turning thousands of line items into something you can actually use (analytics, BIM, pricing).

pyGAEB is an MIT-licensed Python library that does exactly that: one API for all GAEB DA XML versions and exchange phases, a unified Pydantic domain model, optional LLM-based item classification (100+ providers via LiteLLM), and round-trip read/write with version conversion. This article walks through what it does and how to use it.

Why GAEB, and Why Python?

GAEB (Gemeinsamer Ausschuss Elektronik im Bauwesen, the Joint Committee for Electronics in Construction) defines how construction documents are exchanged electronically in Germany, Austria, and Switzerland. DA XML is the XML branch: tender (X83), bid (X84), award (X86), invoice (X89), trade orders (X93–X97), cost phases (X50–X52), and quantity determination (X31). Different phases carry different fields, versions 2.x use German element names while 3.x use English, and real-world files often have encoding issues or minor spec violations.
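The phase codes listed above lend themselves to a small lookup table. The following is a minimal sketch that assumes the phase can be read from the file extension; the helper name `phase_family` and the family labels are illustrative and not part of pyGAEB.

```python
from pathlib import Path

# Phase numbers grouped as in the paragraph above. The family labels are
# illustrative names chosen for this sketch, not pyGAEB identifiers.
PHASE_FAMILIES = {
    "procurement": {83, 84, 86, 89},
    "trade": set(range(93, 98)),   # X93 to X97
    "cost": {50, 51, 52},
    "quantity": {31},
}


def phase_family(filename: str) -> str:
    """Return the family for a name like 'tender.X83', or 'unknown'."""
    suffix = Path(filename).suffix.lstrip(".").upper()
    # Expect one letter followed by a two-digit phase number, e.g. 'X83'.
    if len(suffix) != 3 or not suffix[1:].isdecimal():
        return "unknown"
    number = int(suffix[1:])
    for family, numbers in PHASE_FAMILIES.items():
        if number in numbers:
            return family
    return "unknown"


print(phase_family("tender.X83"))
print(phase_family("order.x96"))
```

A lookup like this is only useful for routing files before parsing; once a document is parsed, the phase reported by the library is the more reliable source.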

Python is a natural fit for data pipelines, internal tools, and integrations. You want to:

Ingest GAEB files from different software (iTWO, ARRIBA, etc.) without caring about version.
Validate structure and business rules (e.g. totals vs. qty × unit price).
Enrich items with semantic types (Door, Wall, Pipe) for BIM, costing, or grouping.
Export to JSON/CSV or write back GAEB for downstream systems.

pyGAEB is built for that workflow: developer-friendly API, tolerant parsing by default, and optional LLM classification that stays out of the way if you don’t need it.

One Parser for All DA XML Versions

You don’t pick a parser by version — you pass a path. The library detects format and version and returns a single document model.

from pygaeb import GAEBParser

doc = GAEBParser.parse("tender.X83")   # DA XML 3.x
doc = GAEBParser.parse("old.D83")     # DA XML 2.x — same call

print(doc.source_version)   # e.g. SourceVersion.DA_XML_33
print(doc.exchange_phase)   # ExchangePhase.X83
print(doc.item_count)       # Total number of items
print(doc.grand_total)      # Decimal("1234567.89")

All monetary and quantity values are Decimal — no floats, so rounding is predictable and auditable. The same GAEBDocument type is produced whether the source was 2.0, 2.1, or 3.0–3.3.
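To see why Decimal matters for priced quantities, here is a small standalone comparison using only the standard library. It does not touch pyGAEB, and rounding to two places with ROUND_HALF_UP is an assumption made for the example, not a rule taken from the GAEB specification.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats cannot represent most decimal fractions exactly.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)                      # False

# Decimal arithmetic on the same values is exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))         # True

# A line total computed from quantity and unit price, rounded to cents.
qty = Decimal("1170.000")
unit_price = Decimal("45.50")
line_total = (qty * unit_price).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(line_total)                            # 53235.00
```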

Iterate Items the Same Way for Every Document Type

Procurement (X80–X89), trade (X93–X97), cost (X50–X52), and quantity (X31) documents all expose a consistent item-level view:

for item in doc.iter_items():
    print(item.oz)           # "01.02.0030"
    print(item.short_text)   # "Mauerwerk der Innenwand…"
    print(item.qty)          # Decimal("1170.000")
    print(item.unit)         # "m2"
    print(item.unit_price)   # Decimal("45.50")
    print(item.total_price)  # Decimal("53235.00")
    print(item.item_type)    # ItemType.NORMAL

So your analytics or export code can be phase-agnostic.
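As an illustration of phase-agnostic code, the sketch below sums item totals per top-level OZ segment. `DemoItem` is a hypothetical stand-in that carries only the two attributes the function reads, and treating a missing price as `None` is an assumption about unpriced items rather than documented pyGAEB behaviour.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Iterable, Optional


@dataclass
class DemoItem:
    """Hypothetical stand-in with the two attributes the function reads."""
    oz: str
    total_price: Optional[Decimal]


def totals_by_section(items: Iterable[DemoItem]) -> Dict[str, Decimal]:
    """Sum total_price per leading OZ segment, e.g. '01' for '01.02.0030'."""
    sums: Dict[str, Decimal] = defaultdict(Decimal)
    for item in items:
        if item.total_price is None:   # assumed: unpriced items carry None
            continue
        section = item.oz.split(".", 1)[0]
        sums[section] += item.total_price
    return dict(sums)


demo = [
    DemoItem("01.02.0030", Decimal("53235.00")),
    DemoItem("01.02.0040", Decimal("1200.50")),
    DemoItem("02.01.0010", None),
    DemoItem("02.01.0020", Decimal("99.99")),
]
print(totals_by_section(demo))
```

With a parsed document, the same function could presumably be fed from doc.iter_items(), since it only relies on the oz and total_price attributes shown above.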

Document Metadata and Project Info

Every parsed document exposes metadata and, for procurement/trade/cost docs, full project and client details:

# Document summary
print(doc.item_count)           # Total item count
print(doc.document_kind)        # e.g. DocumentKind.PROCUREMENT, TRADE, COST, QUANTITY
print(doc.gaeb_info.version)   # GAEB software version that created the file

# Project info (AwardInfo)
a = doc.award
print(a.project_name, a.project_no, a.description)
print(a.open_date, a.construction_start, a.construction_end)
print(a.currency_label, a.contract_no)
print(a.warranty_duration, a.warranty_unit)
print(a.bid_comm_perm, a.alter_bid_perm)   # Bid permissions

# Owner / client (OWN)
print(a.client, a.award_no)
addr = a.owner_address
if addr:
    print(addr.name, addr.street, addr.pcode, addr.city, addr.country, addr.contact)

Items also carry long text (specifications) as a RichText model — use item.long_text.plain_text for a flat string when exporting or searching.

Navigate the BoQ Hierarchy (Lots → Categories → Positions)

For viewers or reports you often need the structure: lots, then category groups with headers and subtotals, then positions. Use the BoQ hierarchy instead of only flat iter_items():

boq = doc.award.boq
print(boq.is_multi_lot)   # True if document has multiple lots

for lot in boq.lots:
    print(lot.label or lot.rno)           # Lot label / number
    for ctgy in lot.body.categories:
        print("  ", ctgy.rno, ctgy.label)  # Category number and title
        for item in ctgy.items:
            print("    ", item.oz, item.short_text, item.qty, item.unit)
        if ctgy.totals and ctgy.totals.total is not None:
            print("    Subtotal:", ctgy.totals.total)
        for sub in ctgy.subcategories:     # Nested subcategories
            print("    ", sub.rno, sub.label)  # Same pattern again: sub.items, sub.totals

So you can render a proper Bill of Quantities with group headers and position tables (e.g. to Markdown or PDF) without losing the hierarchy.
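Because categories can nest, a renderer is easiest to write recursively. The sketch below produces Markdown lines from stand-in dataclasses that mirror the attribute names used above (rno, label, items, subcategories); the classes themselves are hypothetical and subtotals are left out to keep it short.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List


@dataclass
class DemoItem:
    oz: str
    short_text: str
    qty: Decimal
    unit: str


@dataclass
class DemoCategory:
    rno: str
    label: str
    items: List[DemoItem] = field(default_factory=list)
    subcategories: List["DemoCategory"] = field(default_factory=list)


def render_category(ctgy: DemoCategory, depth: int = 0) -> List[str]:
    """Return Markdown lines for one category and everything below it."""
    # Heading level grows with nesting depth, capped at Markdown's maximum of 6.
    lines = [f"{'#' * min(depth + 2, 6)} {ctgy.rno} {ctgy.label}"]
    for item in ctgy.items:
        lines.append(f"- {item.oz} {item.short_text} ({item.qty} {item.unit})")
    for sub in ctgy.subcategories:
        lines.extend(render_category(sub, depth + 1))
    return lines


walls = DemoCategory(
    "02", "Innenwände",
    items=[DemoItem("01.02.0030", "Mauerwerk der Innenwand", Decimal("1170.000"), "m2")],
)
shell = DemoCategory("01", "Rohbau", subcategories=[walls])
print("\n".join(render_category(shell)))
```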

Validation: Lenient by Default, Strict When You Need It

Real GAEB files often have small spec violations. pyGAEB collects validation issues (structural, numeric, phase-specific) instead of failing on the first error:

from pygaeb import GAEBParser, ValidationMode

doc = GAEBParser.parse("tender.X83")
for issue in doc.validation_results:
    print(issue.severity, issue.message)

# For CI or strict pipelines:
doc = GAEBParser.parse("tender.X83", validation=ValidationMode.STRICT)

You can also register custom validators (for example, “every item must have a unit”), either globally or per call via extra_validators. Their findings appear in doc.validation_results alongside the built-in checks.
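The article does not show the callable signature pyGAEB expects for a custom validator, so the sketch below is a plain function over stand-in items rather than something to register as-is. It collects findings for a missing unit and for a total that disagrees with qty × unit price; the 0.01 tolerance and the `DemoItem` class are assumptions.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable, List, Optional


@dataclass
class DemoItem:
    """Hypothetical stand-in mirroring the item attributes shown earlier."""
    oz: str
    qty: Optional[Decimal]
    unit: Optional[str]
    unit_price: Optional[Decimal]
    total_price: Optional[Decimal]


def check_items(items: Iterable[DemoItem],
                tolerance: Decimal = Decimal("0.01")) -> List[str]:
    """Return human-readable findings instead of raising on the first one."""
    findings: List[str] = []
    for item in items:
        if not item.unit:
            findings.append(f"{item.oz}: missing unit")
        values = (item.qty, item.unit_price, item.total_price)
        if all(v is not None for v in values):
            expected = item.qty * item.unit_price
            if abs(expected - item.total_price) > tolerance:
                findings.append(
                    f"{item.oz}: total {item.total_price} != qty x unit price {expected}"
                )
    return findings


demo = [
    DemoItem("01.02.0030", Decimal("1170.000"), "m2", Decimal("45.50"), Decimal("53235.00")),
    DemoItem("01.02.0040", Decimal("10"), "", Decimal("2.00"), Decimal("25.00")),
]
for finding in check_items(demo):
    print(finding)
```

Collecting findings in a list mirrors the lenient style described above: one bad position does not hide problems in the rest of the file.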

Round-Trip and Version Conversion

Parse, change something, write back — including to another phase (e.g. tender → bid) or another DA XML version:

from pygaeb import GAEBParser, GAEBWriter, ExchangePhase
from decimal import Decimal

doc = GAEBParser.parse("tender.X83")
item = doc.award.boq.get_item("01.02.0030")
item.unit_price = Decimal("48.00")

GAEBWriter.write(doc, "bid.X84", phase=ExchangePhase.X84)

For version conversion without editing:

from pygaeb import GAEBConverter, SourceVersion

# Upgrade 2.x → 3.3
report = GAEBConverter.convert("old.D83", "modern.X83")

# Downgrade for compatibility
report = GAEBConverter.convert(
    "tender.X83", "compat.X83",
    target_version=SourceVersion.DA_XML_32,
)
print(f"Converted {report.items_converted} items, data loss: {report.has_data_loss}")

Export to JSON and CSV

For analytics, BI tools, or custom apps:

from pygaeb.convert import to_json, to_csv

to_json(doc, "boq.json")   # Full nested BoQ tree
to_csv(doc, "items.csv")   # Flat item table (optionally with classification columns)

LLM Classification: Optional and Provider-Agnostic

Many use cases need to know what each line item is — Door, Wall, Pipe, etc. — for BIM linkage, cost grouping, or catalog matching. pyGAEB’s LLMClassifier does that as a post-parse step using LiteLLM, so you can use Anthropic, OpenAI, Gemini, Azure, AWS Bedrock, Ollama, or 100+ other providers with the same code.

import asyncio

from pygaeb import GAEBParser, LLMClassifier

classifier = LLMClassifier(model="anthropic/claude-sonnet-4-6")
# classifier = LLMClassifier(model="gpt-4o")
# classifier = LLMClassifier(model="ollama/llama3")  # Local, no API key

async def main():
    doc = GAEBParser.parse("tender.X83")

    # Check cost before running
    estimate = await classifier.estimate_cost(doc)
    print(f"Will classify {estimate.items_to_classify} items for ~${estimate.estimated_cost_usd:.2f}")

    # Classify all items; results are attached to each item
    await classifier.enrich(doc)

    for item in doc.iter_items():
        if item.classification:
            print(item.oz, item.classification.element_type, item.classification.confidence)

asyncio.run(main())  # The awaited calls need a running event loop

Classification uses a three-level taxonomy (trade → element type → sub-type), optional IFC/DIN 276 codes, and confidence scores. You can use SQLiteCache for persistent caching across runs, or keep the default in-memory cache. Local models (e.g. Ollama) show $0.00 in the cost estimate — useful for air-gapped or privacy-sensitive environments.
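Confidence scores are most useful when they drive a review step. The sketch below splits items into accepted groups per element type and a list for manual review. It uses hypothetical stand-in classes and assumes confidence is a float between 0 and 1, which the article does not state; the 0.8 threshold is arbitrary.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class DemoClassification:
    element_type: str
    confidence: float   # assumed scale: 0.0 to 1.0


@dataclass
class DemoItem:
    oz: str
    classification: Optional[DemoClassification] = None


def triage(items: Iterable[DemoItem],
           threshold: float = 0.8) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split items into accepted groups per element type and a review list."""
    accepted: Dict[str, List[str]] = {}
    review: List[str] = []
    for item in items:
        c = item.classification
        if c is None or c.confidence < threshold:
            review.append(item.oz)          # unclassified or low confidence
        else:
            accepted.setdefault(c.element_type, []).append(item.oz)
    return accepted, review


demo = [
    DemoItem("01.01.0010", DemoClassification("Door", 0.95)),
    DemoItem("01.01.0020", DemoClassification("Door", 0.55)),
    DemoItem("01.02.0030", DemoClassification("Wall", 0.90)),
    DemoItem("01.03.0010"),
]
accepted, review = triage(demo)
print(accepted)
print(review)
```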

Structured Extraction: Your Schema, Your Attributes

After classification you often want typed attributes per item type (e.g. door width, fire rating, material). The StructuredExtractor lets you define a Pydantic schema and pull those attributes from item text, filtered by classification:

import asyncio
from typing import Optional

from pydantic import BaseModel, Field
from pygaeb import StructuredExtractor

class DoorSpec(BaseModel):
    door_type: str = Field("", description="single, double, sliding")
    width_mm: Optional[int] = Field(None, description="Width in mm")
    fire_rating: Optional[str] = Field(None, description="T30, T60, T90")
    glazing: bool = Field(False, description="Has glass panels")
    material: str = Field("", description="wood, steel, aluminium")

extractor = StructuredExtractor(model="anthropic/claude-sonnet-4-6")

async def extract_doors(doc):
    # Only items classified as element type "Door" are sent to the model
    doors = await extractor.extract(doc, schema=DoorSpec, element_type="Door")
    for item, spec in doors:
        print(item.oz, spec.door_type, spec.fire_rating, spec.width_mm)

# doc is a parsed, classified GAEBDocument
asyncio.run(extract_doors(doc))

You can filter by trade, element_type, or sub_type; use built-in starter schemas (DoorSpec, WindowSpec, WallSpec, PipeSpec) or define your own. Results are cached and can be stored on item.extractions[schema_name] for later use.

Security and Robustness

pyGAEB is built for untrusted or messy input:

XXE prevention and Billion Laughs protection in XML parsing
File size limit (configurable, default 100 MB)
Recursion depth limits on hierarchy walks
Encoding repair (e.g. mojibake in German text) via ftfy before parsing
Malformed XML recovery (two-pass parse with warnings) when possible

So you can safely plug it into pipelines that receive files from many sources.
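For readers curious what such guards look like, here is a standard-library sketch of two of them: a size limit and a refusal to parse documents that declare a DTD or entities. It is not pyGAEB's implementation, the function name `parse_untrusted` is made up for the example, and the byte-level check assumes an ASCII-compatible encoding such as UTF-8. For production use, a hardened parser such as defusedxml is the safer choice.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

MAX_BYTES = 100 * 1024 * 1024   # mirrors the 100 MB default mentioned above


def parse_untrusted(path: str, max_bytes: int = MAX_BYTES) -> ET.Element:
    """Parse XML only if it is small enough and declares no DTD or entities."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"file is {size} bytes, limit is {max_bytes}")
    with open(path, "rb") as handle:
        data = handle.read()
    # Entity-expansion and external-entity attacks both require a DTD,
    # so this sketch rejects any document that carries one.
    if b"<!DOCTYPE" in data or b"<!ENTITY" in data:
        raise ValueError("DTD or entity declarations are not allowed")
    return ET.fromstring(data)


def write_temp(content: bytes) -> str:
    """Write bytes to a temporary file and return its path (demo helper)."""
    with tempfile.NamedTemporaryFile(suffix=".X83", delete=False) as handle:
        handle.write(content)
    return handle.name


good_path = write_temp(b"<GAEB><Award/></GAEB>")
bad_path = write_temp(b'<!DOCTYPE x [<!ENTITY a "aaaa">]><x>&a;</x>')

root = parse_untrusted(good_path)
try:
    parse_untrusted(bad_path)
    rejected = False
except ValueError:
    rejected = True
print(root.tag, rejected)

for path in (good_path, bad_path):
    os.remove(path)
```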

Extensibility

Custom validators — project-specific rules, global or per parse
Post-parse hooks — extract vendor-specific XML into item.raw_data
Raw data collection — collect_raw_data=True to keep unknown elements
Custom taxonomy and prompts for the LLM classifier
Custom cache backends — implement the CacheBackend protocol (e.g. Redis) and pass it to the classifier or extractor

Trade, Cost, and Quantity Phases

Beyond classic procurement (X83/X84/X86/X89), pyGAEB supports:

Trade (X93–X97) — orders, order items, supplier info
Cost & calculation (X50–X52) — elemental costing, cost elements
Quantity determination (X31) — take-off, REB 23.003, catalogs, attachments

Same idea: parse → get a typed document → iterate or export. There is also cross-phase validation (e.g. tender vs. bid) to check structural consistency.
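One simple form of a tender-versus-bid check is comparing the position numbers of the two documents. The sketch below works on plain OZ strings and is an illustration of the idea, not pyGAEB's cross-phase validator; the function name and result keys are made up for the example.

```python
from typing import Dict, Iterable, List


def compare_oz(tender_oz: Iterable[str], bid_oz: Iterable[str]) -> Dict[str, List[str]]:
    """Report positions present in only one of the two documents."""
    tender, bid = set(tender_oz), set(bid_oz)
    return {
        "missing_in_bid": sorted(tender - bid),
        "unexpected_in_bid": sorted(bid - tender),
    }


result = compare_oz(
    ["01.01.0010", "01.02.0030", "02.01.0010"],
    ["01.01.0010", "02.01.0010", "02.01.0099"],
)
print(result)
```

With two parsed documents, the inputs could be built from the oz attribute of each item returned by iter_items().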

Installation and Docs

# Core parser, writer, export — no LLM dependencies
pip install pyGAEB

# With LLM classification (LiteLLM + instructor)
pip install "pyGAEB[llm]"   # Quotes keep shells such as zsh from expanding the brackets

Python 3.9+, MIT license. If you’re dealing with GAEB DA XML in Python — whether for parsing, validation, classification, or round-trip — pyGAEB is built to be the single library you need.

DEV Community
