DEV Community

DecDEPO
Originally published at github.com

Building an Open Bilingual Q&A Dataset for Swedish Construction Law (503 entries, CC BY 4.0)

I spent the last few weeks building something that felt missing in the Swedish AI ecosystem: an open, bilingual, legally-grounded Q&A dataset for the construction industry.

Just released v1.2.1 with 503 question-answer pairs across 39 categories, in both Swedish and English, under CC BY 4.0. Here's what I learned building it, how it's structured, and how to use it.

The problem

Swedish construction law (PBL, BBR, ABS 18, AB 04) is dense, fragmented, and lives in PDFs scattered across municipal websites, Boverket, Skatteverket, and court archives.

If you've ever tried to answer "do I need a bygglov for this renovation?" you know the pain — three websites, two PDFs, one Skatteverket hotline, and maybe an answer.

This is exactly the kind of problem LLMs can help with — but only if there's grounded training data. Most open multilingual datasets barely include Swedish at all, and when they do, construction/legal Swedish is a rounding error.

So I built the data.

What's in the dataset

503 Q&A pairs × 2 languages = 1,006 entries

39 categories covering:

  • Permits: bygglov, attefallshus, tillbyggnad, marklov
  • Taxes: ROT-avdrag, RUT-avdrag, F-skatt, omvänd moms, personalliggare
  • Trades: takläggning, fasadrenovering, köksrenovering, badrumsrenovering, isolering, VVS, elinstallation, ventilation, värmesystem
  • Legal: dolda fel, verifiera byggfirma, ABS 18 / AB 04 / ABT 06 contracts, arbetsmiljö (AFS)
  • Regulation: BBR, PBL, Miljöbalken, Energideklaration
  • Costs & disputes: kostnader, offerter, ARN, dispute resolution

Each answer:

  • 30–150 words
  • Grounded in a specific Swedish statute or authority guidance
  • Cites the source (e.g. PBL 9 kap. 2 §, BBR 6:5321, Skatteverket handledning)
  • Hand-reviewed for factual accuracy
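To make the checklist above concrete, here is a minimal sketch of how one record could be modelled and sanity-checked. The field names (`question`, `answer`, `category`, `lang`, `sources`) are assumptions for illustration; check faq.json for the real schema.

```python
from dataclasses import dataclass, field


@dataclass
class FaqRecord:
    # All field names are hypothetical, not the dataset's confirmed schema.
    question: str
    answer: str
    category: str
    lang: str  # assumed to be "sv" or "en"
    sources: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means none were found."""
        found = []
        if not self.question.strip():
            found.append("empty question")
        if not self.answer.strip():
            found.append("empty answer")
        if self.lang not in ("sv", "en"):
            found.append(f"unexpected language: {self.lang!r}")
        if not self.sources:
            found.append("no source cited")
        return found
```

A check like this only catches structural gaps. It cannot replace the hand review for factual accuracy.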

Design choices (and mistakes I learned from)

1. Don't translate legal terms

Early attempt: I translated "bygglov" to "building permit" everywhere in the English set. Bad idea. A Swede reading the English set wants to see bygglov (building permit) so they can map to the original. And an English-speaking researcher working on Swedish legal NLP wants the Swedish term preserved.

Rule I landed on: keep Swedish legal terminology in the English set, with English gloss in parentheses the first time it appears.
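The gloss-on-first-use rule can be applied mechanically. The sketch below is one possible way to do it, not the tooling used for the dataset; the `GLOSSARY` mapping and the function name are hypothetical.

```python
import re

# Hypothetical mini-glossary: Swedish legal term -> English gloss.
GLOSSARY = {
    "bygglov": "building permit",
    "dolda fel": "hidden defects",
}


def gloss_first_use(text: str, glossary: dict[str, str]) -> str:
    """Add '(gloss)' after the first occurrence of each term; later uses stay bare."""
    for term, gloss in glossary.items():
        match = re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE)
        if match is None:
            continue  # term not used in this text
        if text[match.end():].startswith(" ("):
            continue  # first use already carries a parenthetical
        text = text[:match.end()] + f" ({gloss})" + text[match.end():]
    return text
```

For example, `gloss_first_use("You need a bygglov. A bygglov is issued by the municipality.", GLOSSARY)` glosses only the first "bygglov" and leaves the second one untouched.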

2. Cite sources inside the answer, not just a metadata field

The initial structure had a separate sources: [...] array. That worked for human readers, but a model fine-tuned on the answers doesn't reliably learn to carry the citation into its output.

Now: citations appear inline in the answer text ("Enligt PBL 9 kap. 2 §...") AND in the metadata field. The model learns to cite, not just to answer.
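Keeping the two places in sync is easy to check automatically. A minimal sketch, assuming the metadata stores each source as a string that is repeated verbatim in the answer text (that exact-match convention is an assumption, as is the function name):

```python
def missing_inline_citations(answer: str, sources: list[str]) -> list[str]:
    """Return the metadata sources that never appear in the answer text."""

    def normalize(s: str) -> str:
        # Collapse runs of whitespace and ignore case before comparing.
        return " ".join(s.split()).casefold()

    haystack = normalize(answer)
    return [s for s in sources if normalize(s) not in haystack]


answer = "Enligt PBL 9 kap. 2 § ..."
print(missing_inline_citations(answer, ["PBL 9 kap. 2 §", "BBR 6:5321"]))
```

Any source the function returns is listed in metadata but not cited inline, so a model trained on that answer would never see it.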

3. 30–150 words per answer

I tested shorter answers (15 words) and longer ones (500 words). Shorter answers lose grounding; longer ones drift. 30–150 words is the sweet spot for factual legal Q&A.
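The length window is simple to enforce before a release. This is a sketch that counts whitespace-separated tokens as words, which is an assumption about how the limit is measured:

```python
MIN_WORDS, MAX_WORDS = 30, 150


def length_status(answer: str) -> str:
    """Classify an answer as 'too short', 'ok', or 'too long' by word count."""
    n_words = len(answer.split())
    if n_words < MIN_WORDS:
        return "too short"
    if n_words > MAX_WORDS:
        return "too long"
    return "ok"
```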

4. Multi-format release

Shipped in 5 formats:

  • faq.json — master with metadata
  • faq.jsonl — HuggingFace-native (one record per line)
  • faq-alpaca.jsonl — Alpaca instruction format
  • faq-sharegpt.jsonl — ShareGPT conversation format
  • faq.csv — for non-ML users (Excel / Google Sheets)

Same data, 5 pipelines, zero conversion friction.
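For readers who want to see how one master record maps onto the fine-tuning formats, here is a rough sketch. The `question` and `answer` field names and the `content` key in the ShareGPT-style messages are assumptions; the shipped files may use different keys.

```python
import json


def to_alpaca(record: dict) -> dict:
    """Map a master record to an instruction/output pair."""
    return {"instruction": record["question"], "output": record["answer"]}


def to_sharegpt(record: dict) -> dict:
    """Map a master record to a two-turn conversation."""
    return {
        "conversations": [
            {"role": "user", "content": record["question"]},
            {"role": "assistant", "content": record["answer"]},
        ]
    }


def to_jsonl(records, convert) -> str:
    """Serialize records one JSON object per line, keeping å/ä/ö readable."""
    return "\n".join(json.dumps(convert(r), ensure_ascii=False) for r in records)
```

Passing `ensure_ascii=False` keeps Swedish characters as-is in the output instead of `\u` escapes, which makes the JSONL easier to review by hand.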

How to use it

Via HuggingFace datasets:

from datasets import load_dataset
ds = load_dataset("DecDEPO/swedish-construction-faq")
# Swedish: ds["train"] (503 rows)
# English: load_dataset(..., "english")

Via pip:

pip install zaragoza-construction-faq

import zaragoza_construction_faq as zcf

zcf.load()                    # 503 SV Q&A as list of dicts
zcf.load(lang="en")           # 503 EN
zcf.load("bygglov")           # filter by category
zcf.categories()              # all 39 categories

# Iterators for LLM fine-tuning; train() is a placeholder for your own training step
for rec in zcf.iter_alpaca():
    # rec = {"instruction": "...", "output": "..."}
    train(rec)

for rec in zcf.iter_sharegpt(lang="en"):
    # rec = {"conversations": [{"role": "user", ...}, {"role": "assistant", ...}]}
    train(rec)
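Before fine-tuning, it usually helps to hold out a small evaluation set. A minimal sketch of a reproducible split over any list of records; the function name and defaults are illustrative and not part of the package:

```python
import random


def split_records(records, eval_fraction=0.1, seed=0):
    """Shuffle a copy of the records and return (train, eval) lists."""
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

Assuming `zcf.load()` returns a list of dicts as described above, `train_set, eval_set = split_records(zcf.load())` would give 453 training and 50 evaluation records from the 503 Swedish entries.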

Via Kaggle / CSV:

Kaggle dataset page — download as zip, drop into your notebook.
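Once the CSV is downloaded, the standard library is enough for a first look. The sketch below counts entries per category; the `category` column name is an assumption, and the in-memory sample stands in for the real faq.csv.

```python
import csv
import io
from collections import Counter


def count_by_category(csv_file, column="category"):
    """Count rows per category from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Stand-in data; with the real file use open("faq.csv", newline="", encoding="utf-8").
sample = io.StringIO(
    "category,question,answer\n"
    "bygglov,Q1,A1\n"
    "bygglov,Q2,A2\n"
    "ROT-avdrag,Q3,A3\n"
)
counts = count_by_category(sample)
print(counts.most_common())
```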

Academic citation (DOI assigned)

Zenodo assigned a permanent DOI, so it's citable:

@dataset{zaragoza_swedish_construction_faq_2026,
  author    = {{Zaragoza AB}},
  title     = {Swedish Construction FAQ — Open Q\&A Dataset (SV + EN)},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19630803}
}

DOI: 10.5281/zenodo.19630803

License

CC BY 4.0 — free for commercial and research use, attribution required.

Intentionally chose BY over BY-SA because I want this in commercial products (fine-tune a chatbot for a construction firm, build a RAG system for a municipal permit office, whatever) with no copyleft friction.

Where it lives

What's next

Target is 1000+ Q&As (v2.0). The areas most underrepresented right now:

  • Kommun-specific rules (each of Sweden's 290 municipalities has its own bygglov process quirks)
  • Post-2020 case law (big shifts on dolda fel doctrine)
  • Cross-border cases (what if your contractor is Polish, Romanian, Baltic?)

If you're Swedish and find something outdated or wrong, open a PR or an issue. I'll merge within a day.

Also releasing a 510-entry trilingual construction glossary (Swedish / English / Polish) in a sibling repo, because Polish construction workers in Sweden are a huge demographic and there's zero open terminology for them.


Built this for Zaragoza AB (Helsingborg) — a small construction firm that's using the dataset internally for their customer Q&A chatbot. Open-sourced because Swedish AI needs more domain data and there's no business reason to keep it closed.

Feedback welcome. Especially if you're working on Swedish NLP, building a Swedish legal RAG system, or just trying to renovate your kök and wondering if bygglov applies.