DecDEPO

Posted on Apr 17 • Originally published at github.com

Building an Open Bilingual Q&A Dataset for Swedish Construction Law (503 entries, CC BY 4.0)

#dataset #machinelearning #nlp #opensource

I spent the last few weeks building something that felt missing in the Swedish AI ecosystem: an open, bilingual, legally-grounded Q&A dataset for the construction industry.

Just released v1.2.1 with 503 question-answer pairs across 39 categories, in both Swedish and English, under CC BY 4.0. Here's what I learned building it, how it's structured, and how to use it.

The problem

Swedish construction law (PBL, BBR, ABS 18, AB 04) is dense, fragmented, and lives in PDFs scattered across municipal websites, Boverket, Skatteverket, and court archives.

If you've ever tried to answer "do I need a bygglov for this renovation?" you know the pain — three websites, two PDFs, one Skatteverket hotline, and maybe an answer.

This is exactly the kind of problem LLMs can help with — but only if there's grounded training data. Most open multilingual datasets barely include Swedish at all, and when they do, construction/legal Swedish is a rounding error.

So I built the data.

What's in the dataset

503 Q&A pairs × 2 languages = 1,006 entries

39 categories covering:

Permits: bygglov, attefallshus, tillbyggnad, marklov
Taxes: ROT-avdrag, RUT-avdrag, F-skatt, omvänd moms, personalliggare
Trades: takläggning, fasadrenovering, köksrenovering, badrumsrenovering, isolering, VVS, elinstallation, ventilation, värmesystem
Legal: dolda fel, verifiera byggfirma, ABS 18/AB 04/ABT 06 contracts, arbetsmiljö (AFS)
Regulation: BBR, PBL, Miljöbalken, Energideklaration
Costs & disputes: kostnader, offerter, ARN, dispute resolution

Each answer:

30–150 words
Grounded in a specific Swedish statute or authority guidance
Cites the source (e.g. PBL 9 kap. 2 §, BBR 6:5321, Skatteverket handledning)
Hand-reviewed for factual accuracy

Design choices (and mistakes I learned from)

1. Don't translate legal terms

Early attempt: I translated "bygglov" to "building permit" everywhere in the English set. Bad idea. A Swede reading the English set wants to see bygglov (building permit) so they can map to the original. And an English-speaking researcher working on Swedish legal NLP wants the Swedish term preserved.

Rule I landed on: keep Swedish legal terminology in the English set, with English gloss in parentheses the first time it appears.
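The gloss-on-first-appearance rule is easy to apply mechanically. Here is a minimal sketch of how it could work; the GLOSSARY dict and the gloss_first_occurrence helper are hypothetical names for illustration, not part of the released package.

```python
import re

# Hypothetical two-entry glossary, for illustration only.
GLOSSARY = {
    "bygglov": "building permit",
    "kommun": "municipality",
}

def gloss_first_occurrence(text: str, glossary: dict[str, str]) -> str:
    """Add an English gloss in parentheses after the first occurrence of each term."""
    for term, gloss in glossary.items():
        pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
        # count=1 leaves every later occurrence of the term untouched.
        text = pattern.sub(lambda m: f"{m.group(0)} ({gloss})", text, count=1)
    return text

example = (
    "A bygglov is required for most extensions. "
    "Apply for the bygglov at your kommun."
)
glossed = gloss_first_occurrence(example, GLOSSARY)
print(glossed)
# A bygglov (building permit) is required for most extensions. Apply for the bygglov at your kommun (municipality).
```

The Swedish term stays in the text every time; only the first mention carries the gloss.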

2. Cite sources inside the answer, not just a metadata field

The initial structure had a separate sources: [...] array. That worked for human readers, but a model fine-tuned on it doesn't reliably learn to carry the citation into its output.

Now: citations appear inline in the answer text ("Enligt PBL 9 kap. 2 §...") AND in the metadata field. The model learns to cite, not just to answer.
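Keeping the citation in two places means the two can drift apart, so a consistency check is worth having. This is a rough sketch that assumes each record has an "answer" string and a "sources" list; those field names are my assumption for the example and may not match the published schema.

```python
def missing_inline_citations(record: dict) -> list[str]:
    """Return the metadata sources that never appear verbatim in the answer text."""
    return [src for src in record["sources"] if src not in record["answer"]]

# Illustrative records, not copied from the dataset.
cited = {
    "answer": "Enligt PBL 9 kap. 2 § krävs bygglov för nybyggnad.",
    "sources": ["PBL 9 kap. 2 §"],
}
uncited = {
    "answer": "Bygglov krävs för nybyggnad.",
    "sources": ["PBL 9 kap. 2 §"],
}

print(missing_inline_citations(cited))    # []
print(missing_inline_citations(uncited))  # ['PBL 9 kap. 2 §']
```

A verbatim substring match is strict: it only passes if the metadata and the answer use the same citation format.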

3. 30–150 words per answer

I tested shorter (15 words) and longer (500 words) answers. Shorter answers lose grounding; longer ones drift. 30–150 words is the sweet spot for factual legal Q&A.
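The length bound is simple to enforce at review time. A minimal sketch, counting whitespace-separated words (the helper names are hypothetical, and a different tokenization would shift the edge cases slightly):

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())

def within_length_bounds(answer: str, low: int = 30, high: int = 150) -> bool:
    """True if the answer has between low and high words, inclusive."""
    return low <= word_count(answer) <= high

too_short = "ord " * 29
just_right = "ord " * 30
too_long = "ord " * 151

print(within_length_bounds(too_short))   # False
print(within_length_bounds(just_right))  # True
print(within_length_bounds(too_long))    # False
```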

4. Multi-format release

Shipped in 5 formats:

faq.json — master with metadata
faq.jsonl — HuggingFace-native (one record per line)
faq-alpaca.jsonl — Alpaca instruction format
faq-sharegpt.jsonl — ShareGPT conversation format
faq.csv — for non-ML users (Excel / Google Sheets)

Same data, 5 pipelines, zero conversion friction.
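For a sense of what the conversion involves, here is a sketch that maps master-style records to Alpaca records and serializes them one per line. The "question" and "answer" field names on the input side are assumptions for the example; the instruction/output shape on the output side matches what the Alpaca iterator in the pip package yields.

```python
import json

def to_alpaca(record: dict) -> dict:
    """Map a Q&A record to the Alpaca instruction/output shape."""
    return {"instruction": record["question"], "output": record["answer"]}

def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    # ensure_ascii=False keeps å, ä, ö readable instead of \u escapes.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Illustrative records, not copied from the dataset.
master = [
    {
        "category": "bygglov",
        "question": "Behöver jag bygglov för nybyggnad?",
        "answer": "Enligt PBL 9 kap. 2 § krävs bygglov för nybyggnad.",
    },
    {
        "category": "offerter",
        "question": "(question text)",
        "answer": "(answer text)",
    },
]

jsonl_text = to_jsonl([to_alpaca(r) for r in master])
print(jsonl_text)
```

Each output line parses on its own with json.loads, which is what makes the format convenient for streaming loaders.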

How to use it

Via HuggingFace datasets:

from datasets import load_dataset

# Swedish (default config): ds["train"] has 503 rows
ds = load_dataset("DecDEPO/swedish-construction-faq")

# English: pass "english" as the config name
ds_en = load_dataset("DecDEPO/swedish-construction-faq", "english")

Via pip:

pip install zaragoza-construction-faq

import zaragoza_construction_faq as zcf

zcf.load()                    # 503 SV Q&A as list of dicts
zcf.load(lang="en")           # 503 EN
zcf.load("bygglov")           # filter by category
zcf.categories()              # all 39 categories

# Iterators for LLM fine-tuning (train() is a placeholder for your own training step)
for rec in zcf.iter_alpaca():
    # rec = {"instruction": "...", "output": "..."}
    train(rec)

for rec in zcf.iter_sharegpt(lang="en"):
    # rec = {"conversations": [{"role": "user", ...}, {"role": "assistant", ...}]}
    train(rec)

Via Kaggle / CSV:

Download the zip from the Kaggle dataset page (decdepo/swedish-construction-faq) and drop it into your notebook.
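If you only want the CSV, the standard library is enough. The sketch below filters rows by category; the column names (category, question, answer) are my assumption about the header row, and the sample data is a stand-in so the snippet runs without the download.

```python
import csv
import io

# Stand-in for the contents of faq.csv; column names are assumed.
SAMPLE_CSV = """category,question,answer
bygglov,Example question 1,Example answer 1
ROT-avdrag,Example question 2,Example answer 2
bygglov,Example question 3,Example answer 3
"""

def rows_in_category(csv_file, category: str) -> list[dict]:
    """Read a CSV file object and keep only the rows in the given category."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["category"] == category]

rows = rows_in_category(io.StringIO(SAMPLE_CSV), "bygglov")
print(len(rows))  # 2
```

With the real file, pass open("faq.csv", newline="", encoding="utf-8") in place of the StringIO object.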

Academic citation (DOI assigned)

Zenodo assigned a permanent DOI, so it's citable:

@dataset{zaragoza_swedish_construction_faq_2026,
  author    = {{Zaragoza AB}},
  title     = {Swedish Construction FAQ — Open Q\&A Dataset (SV + EN)},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19630803}
}

DOI: 10.5281/zenodo.19630803

License

CC BY 4.0 — free for commercial and research use, attribution required.

Intentionally chose BY over BY-SA because I want this in commercial products (fine-tune a chatbot for a construction firm, build a RAG system for a municipal permit office, whatever) with no copyleft friction.

Where it lives

🐙 GitHub: zaragoza-ab/swedish-construction-faq-1000
🤗 HuggingFace: DecDEPO/swedish-construction-faq
📦 PyPI: zaragoza-construction-faq
📊 Kaggle: decdepo/swedish-construction-faq
📜 Zenodo: DOI 10.5281/zenodo.19630803

What's next

The target for v2.0 is 1,000+ Q&As. The most underrepresented areas right now:

Kommun-specific rules (each of Sweden's 290 municipalities has its own bygglov process quirks)
Post-2020 case law (big shifts on dolda fel doctrine)
Cross-border cases (what if your contractor is Polish, Romanian, Baltic?)

If you're Swedish and find something outdated or wrong, open a PR or an issue. I'll merge within a day.

Also releasing a 510-entry trilingual construction glossary (Swedish / English / Polish) in a sibling repo, because Polish construction workers in Sweden are a huge demographic and there's zero open terminology for them.

Built this for Zaragoza AB (Helsingborg) — a small construction firm that's using the dataset internally for their customer Q&A chatbot. Open-sourced because Swedish AI needs more domain data and there's no business reason to keep it closed.

Feedback welcome. Especially if you're working on Swedish NLP, building a Swedish legal RAG system, or just trying to renovate your kök and wondering if bygglov applies.
