DEV Community

DecDEPO

How I put my open dataset on the Wikidata knowledge graph (and why you should too)

Last week we released the Swedish Construction FAQ — 503 bilingual Q&A pairs, CC BY 4.0, DOI 10.5281/zenodo.19630803. The dataset got onto Hugging Face, Zenodo, Kaggle, PyPI — the usual distribution stack.

But one step moved the needle more than the others: putting it on Wikidata as a first-class entity.

This post is a practical walkthrough. No philosophy, just what we did and why.

What a Wikidata entity gets you

  1. A stable identifier — Q139393633 is the QID for our dataset. It'll outlive our GitHub account, our domain, our company.
  2. Machine-readable citability — Wikidata is the knowledge graph that Google, Siri, Alexa, OpenAI, Anthropic, Perplexity, and basically every LLM training pipeline reads from.
  3. A place for the DOI and license to live together — P275 (license) + P356 (DOI) + P407 (language) + P1476 (title) as proper RDF triples.
  4. Free cross-linking — every related entity (the company, the companion datasets, the subject matter) shows up in the Special:WhatLinksHere graph.
  5. It costs nothing — no API key, no app registration, no fees. Just a SUL account.

The entities we created

For the FAQ dataset alone, we ended up with six connected entities:

| QID | Entity | What it is |
| --- | --- | --- |
| Q139393633 | Swedish Construction FAQ | The flagship dataset |
| Q139393658 | Zaragoza AB | The publisher/creator (our company) |
| Q139393817 | Construction Terminology Glossary | Companion trilingual glossary |
| Q139393818 | Building Materials Specifications | Companion spec dataset |
| Q139393819 | Construction Inspection Templates | Companion template dataset |
| Q139393821 | Renovation Timeline Reference | Companion timeline dataset |

Each companion dataset has P123 (publisher) and P170 (creator) pointing to Q139393658, which means the company's entity automatically aggregates everything we publish.

The minimum viable dataset entity

Here's the shape we used — it's the minimum that actually gets you into the graph in a way that indexers respect.

{
  "labels": {
    "en": { "language": "en", "value": "Swedish Construction FAQ" },
    "sv": { "language": "sv", "value": "Svensk byggbransch FAQ" }
  },
  "descriptions": {
    "en": { "language": "en", "value": "open bilingual Q&A dataset on Swedish construction" }
  },
  "claims": [
    {"property": "P31",  "value": "Q1172284"},       // instance of → dataset
    {"property": "P275", "value": "Q20007257"},      // license → CC BY 4.0
    {"property": "P356", "value": "10.5281/zenodo.19630803"}, // DOI
    {"property": "P407", "value": "Q9027"},          // language → Swedish
    {"property": "P407", "value": "Q1860"},          // language → English
    {"property": "P123", "value": "Q139393658"},     // publisher → Zaragoza AB
    {"property": "P170", "value": "Q139393658"},     // creator → Zaragoza AB
    {"property": "P856", "value": "github.com/..."}, // official website
    {"property": "P1324", "value": "github.com/..."},// source code repository
    {"property": "P577", "value": "+2026-04-17"}     // publication date
  ]
}
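One caveat: the `claims` array above is shorthand for readability. What `wbeditentity` actually accepts is a map of property IDs to full statement objects with "snaks" inside. Here's a sketch of a converter — `toClaim` is our hypothetical helper, and the value-type detection is heuristic (it guesses the datatype from the value's shape, which worked for our properties but isn't general):

```javascript
// Hypothetical helper: expand simplified {property, value} pairs into the
// statement structure that wbeditentity expects. Datatype detection here is
// a heuristic sketch — check each property's actual datatype on Wikidata.
function toClaim(property, value) {
  let datavalue;
  if (/^Q\d+$/.test(value)) {
    // Item-valued properties (P31, P275, P407, P123, P170)
    datavalue = {
      type: 'wikibase-entityid',
      value: { 'entity-type': 'item', 'numeric-id': Number(value.slice(1)) },
    };
  } else if (/^\+\d{4}-\d{2}-\d{2}$/.test(value)) {
    // Time-valued properties (P577); precision 11 = day
    datavalue = {
      type: 'time',
      value: {
        time: `${value}T00:00:00Z`,
        timezone: 0, before: 0, after: 0, precision: 11,
        calendarmodel: 'http://www.wikidata.org/entity/Q1985727',
      },
    };
  } else {
    // String- and URL-valued properties (P356, P856, P1324)
    datavalue = { type: 'string', value };
  }
  return {
    type: 'statement',
    rank: 'normal',
    mainsnak: { snaktype: 'value', property, datavalue },
  };
}

// In the `data` payload, claims are keyed by property ID:
const claims = {};
for (const { property, value } of [
  { property: 'P31', value: 'Q1172284' },
  { property: 'P356', value: '10.5281/zenodo.19630803' },
]) {
  (claims[property] ??= []).push(toClaim(property, value));
}
```

Labels and descriptions, by contrast, go through exactly as written in the shape above.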

Calling the MediaWiki API from Node

The API takes three requests: get a login token, log in, get a CSRF token. Then you can call wbeditentity.

const API = 'https://www.wikidata.org/w/api.php';
const jar = new Map();
const cookie = () => [...jar].map(([k,v]) => `${k}=${v}`).join('; ');

async function call(params, method = 'POST') {
  const body = new URLSearchParams({ ...params, format: 'json' });
  const opts = {
    method,
    headers: { 'User-Agent': 'your-app/1.0', 'Cookie': cookie() },
  };
  if (method === 'POST') {
    opts.headers['Content-Type'] = 'application/x-www-form-urlencoded';
    opts.body = body;
  }
  const res = await fetch(method === 'GET' ? `${API}?${body}` : API, opts);
  // Parse cookies into the jar so subsequent calls are authenticated
  (res.headers.get('set-cookie') || '').split(/,(?=[^;]+=)/).forEach(c => {
    const [kv] = c.split(';'); const [k,v] = kv.split('=');
    if (k && v) jar.set(k.trim(), v.trim());
  });
  return res.json();
}

// 1. Login token
const lt = await call({ action: 'query', meta: 'tokens', type: 'login' }, 'GET');

// 2. Login (SUL credentials work — no bot password required on Wikidata)
await call({
  action: 'login',
  lgname: process.env.WIKI_USER,
  lgpassword: process.env.WIKI_PASS,
  lgtoken: lt.query.tokens.logintoken,
});

// 3. CSRF token for the edit
const ct = await call({ action: 'query', meta: 'tokens', type: 'csrf' }, 'GET');

// 4. Create the entity
const r = await call({
  action: 'wbeditentity',
  new: 'item',
  data: JSON.stringify(entity),
  token: ct.query.tokens.csrftoken,
  summary: 'Create entity for Swedish Construction FAQ (CC BY 4.0)',
});

console.log(r.entity.id); // → "Q139393633"

That's the whole thing. About 30 lines of code, one HTTP session, and your dataset has a permanent identity in the global knowledge graph.
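A quick way to verify the write landed: every entity is served as JSON through the public Special:EntityData endpoint, no auth needed. `fetchEntity` here is a hypothetical helper, not part of the script above:

```javascript
// Read a newly created entity back through the public JSON endpoint.
const entityUrl = (qid) =>
  `https://www.wikidata.org/wiki/Special:EntityData/${qid}.json`;

async function fetchEntity(qid) {
  const res = await fetch(entityUrl(qid), {
    headers: { 'User-Agent': 'your-app/1.0' },
  });
  const json = await res.json();
  return json.entities[qid]; // labels, descriptions, claims — all public
}

// fetchEntity('Q139393633').then(e => console.log(e.labels.en.value));
```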

Gotchas we hit

  • Abuse filter 11 ("badwords") flagged the word "Awesome" in one of our entity descriptions, and we had to reword it. The specific word doesn't matter — if the filter warns, your write fails with abusefilter-warning and you have to retry with different wording.
  • CAPTCHAs appear on newer accounts when you add external links. On sv.wikipedia.org, we had to strip URLs from our first edit. On wikidata.org, none of our edits hit a CAPTCHA — the Wikibase API is more permissive here.
  • SUL (Single User Login) isn't automatic across all projects. Our account worked on wikidata.org from day one but not on sv.wikipedia.org (login failed with "wrongpassword"). If you want to edit multiple projects, expect one manual login per project to "attach" your SUL.
  • Rate limits are generous. We created six entities in under five minutes with a 2-second sleep between them. No throttling.
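Putting the last two gotchas together, our write loop amounted to a small throttle-and-retry wrapper. This is a sketch: `apiCall` stands in for the `call` helper from the earlier snippet, and `reword` is whatever wording fallback you choose:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// One retry when the abuse filter warns: the warning is advisory, so
// resubmitting with reworded text (label/description/summary) can pass.
async function createWithRetry(apiCall, params, reword) {
  const r = await apiCall(params);
  if (r.error?.code === 'abusefilter-warning') {
    return apiCall(reword(params));
  }
  return r;
}

// Usage sketch:
// for (const params of entityParamsList) {
//   await createWithRetry(call, params, rewordFn);
//   await sleep(2000); // polite spacing; we saw no throttling at this rate
// }
```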

Connecting the graph

Once the entities exist, the real value comes from linking them. We made the dataset a "work by" the company entity:

  • Q139393633 (dataset) — has P123 and P170 → Q139393658 (company)
  • All companion datasets do the same
  • The company entity itself has P31 (aktiebolag — the Swedish limited-company form), P17 (Sweden), P159 (Helsingborg), P452 (construction industry)

Run SPARQL on query.wikidata.org and the whole graph falls out:

SELECT ?dataset ?datasetLabel WHERE {
  ?dataset wdt:P123 wd:Q139393658 .
  ?dataset wdt:P31 wd:Q1172284 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,sv" }
}

That's how external tools find everything we've published — one query.
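The endpoint is scriptable, too. Here's the same query from Node — a sketch that assumes only a descriptive User-Agent, which the query service asks clients to send:

```javascript
const SPARQL_ENDPOINT = 'https://query.wikidata.org/sparql';

// List all datasets published by a given entity, as en/sv labels.
async function listDatasets(publisherQid) {
  const query = `
    SELECT ?dataset ?datasetLabel WHERE {
      ?dataset wdt:P123 wd:${publisherQid} ;
               wdt:P31 wd:Q1172284 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,sv" }
    }`;
  const url = `${SPARQL_ENDPOINT}?query=${encodeURIComponent(query)}&format=json`;
  const res = await fetch(url, { headers: { 'User-Agent': 'your-app/1.0' } });
  const { results } = await res.json();
  return results.bindings.map((b) => b.datasetLabel.value);
}

// listDatasets('Q139393658').then(console.log);
```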

What changed after we did this

Small, obvious things:

  • The Hugging Face dataset card now links to six Wikidata QIDs — readers click through.
  • Our Zenodo record has related_identifiers pointing at the Wikidata URI — so when someone cites the DOI, the graph picks up the citation.
  • Wikidata's own "What links here" page now serves as a free backlink-monitoring tool for us.
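For reference, the Zenodo metadata fragment behind that second bullet looks roughly like this — the relation type is our choice; Zenodo accepts the standard DataCite relation vocabulary:

```javascript
// Sketch of the Zenodo deposit-metadata fragment linking DOI → Wikidata.
const zenodoMetadata = {
  related_identifiers: [
    {
      identifier: 'https://www.wikidata.org/wiki/Q139393633',
      relation: 'isDescribedBy',
    },
  ],
};
```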

None of this is rocket science. But it's the one layer most open datasets skip.

Go do it

Your dataset already has a DOI, probably a GitHub repo, probably a Hugging Face mirror. Putting it on Wikidata takes ~30 lines of Node and one afternoon.

The entity you create today will outlive your current URLs. That's worth the afternoon.


Swedish Construction FAQ is maintained by Zaragoza AB, Helsingborg. All datasets CC BY 4.0. Explore the graph: Q139393633 · Q139393658.
