Introduction
I wouldn't call myself a historian; I switched careers a long time ago, and my skills and knowledge have faded. Still, when I look at the current AI revolution changing the software industry, for better or worse, I wonder how historical studies will be affected.
Recently, I have been exploring Andrej Karpathy's idea of creating a wiki as a knowledge base for LLMs. I thought it would be interesting to test this concept on historical documents.
Background
When working with LLMs, one of the crucial parts is providing them with relevant context. Frontier models have incredible knowledge stored in their parameters, yet usually it is not enough to navigate a specific domain.
RAG is a set of techniques meant to solve this issue by providing the LLM with relevant information as context. It is the current state-of-the-art strategy for this.
The thing is that many RAG systems are based on semantic search, which should yield relevant documents. It can feel a bit random, since we rely on vector similarity.
If I understand it correctly, the LLM wiki idea is to replace the vector-similarity black box with structured data in the form of a graph. We still get a way to provide the LLM with relevant context, and at the same time we can cross-check whether the created wiki is correct.
Goal
I plan to create a wiki based on Polish laws from the 17th century. I use Obsidian as my editor and OpenCode to prepare the documents and to create the wiki itself.
Wiki source
In my example, I am using Volumina Legum. It is a collection of laws ("constitutions") passed by the Sejm (the Polish parliament). The volumes were printed in the 19th century and are freely available as scans.
This is an example page:
Build the wiki
Here are the repositories I was working on:
Extract text
To work with the source data, I needed to convert the scans to text. It looks like a simple OCR task, but it turned out to be a bit tricky. The input is laid out in two columns, and the typeface can be hard for a machine to read, especially in the parts written in Latin.
It feels like transcribing this type of text requires some understanding of the source context, so I decided an LLM would be perfect for the task.
The input files are saved in DjVu format, so I needed to convert them to PDF. I used DjVuLibre for this, and once that was done, I started trying different models to read the text.
I hoped to use Qwen3-VL locally, but the sad reality is that I don't have enough VRAM. I tried out a few cloud-based models and initially ended up with Gemini 3 Flash. I was very happy with the quality of the transcription, but it was a bit pricey, as I was using direct API calls.
Then I switched to a subscription model plus a "harness". It was much slower and used many more tokens, but it was cheaper. I decided to go with the $5 OpenCode subscription, which provides, among others, the Qwen-3.6 model. Qwen performed really well when it came to reading the text.
In general, the preparation process looked like this: djvu files -> pdf files -> text files -> logical chunks of text (e.g., paragraphs)
I wanted each step to be done separately, so I could pause and continue with more files at any time. I created a skill for each step describing the expected flow.
The conversion steps were fairly straightforward, as the file formats could be easily transformed with scripts.
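For example, the DjVu-to-PDF step boils down to calling DjVuLibre's `ddjvu` tool in a loop. Here is a minimal Python sketch; the `scans/` and `pdf/` directory names are just placeholders for illustration:

```python
# Convert every .djvu file in scans/ to a PDF in pdf/ using DjVuLibre's ddjvu.
# Directory names are illustrative; adjust them to your own layout.
import subprocess
from pathlib import Path

src_dir = Path("scans")
out_dir = Path("pdf")
out_dir.mkdir(exist_ok=True)

for djvu_file in sorted(src_dir.glob("*.djvu")):
    pdf_file = out_dir / (djvu_file.stem + ".pdf")
    if pdf_file.exists():
        continue  # already converted, so the step can be paused and resumed
    subprocess.run(
        ["ddjvu", "-format=pdf", str(djvu_file), str(pdf_file)],
        check=True,
    )
    print(f"converted {djvu_file.name} -> {pdf_file.name}")
```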
For reading text from the images, it was quite hard to define the SKILL.md. From time to time, the model got confused about where to look for the progress log and would create a new log file, which caused the same file to be read more than once.
The other thing was that when I tried to read text from images using the OpenCode TUI, my context would become enormous after a few files. So I instructed the model to read exactly one file, and I looped over the opencode CLI invocations in a bash script.
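A simplified sketch of that loop, shown here in Python rather than bash, could look like the snippet below. The `opencode run` invocation and the prompt wording are illustrative, not the exact ones I used:

```python
# Sketch: invoke OpenCode once per PDF so each run starts with a fresh context.
# The "opencode run" call and the prompt text are illustrative.
import subprocess
from pathlib import Path

pdf_dir = Path("pdf")
txt_dir = Path("text")
txt_dir.mkdir(exist_ok=True)

for pdf_file in sorted(pdf_dir.glob("*.pdf")):
    txt_file = txt_dir / (pdf_file.stem + ".txt")
    if txt_file.exists():
        continue  # already transcribed, safe to resume later
    prompt = (
        f"Use the transcription skill to read {pdf_file} "
        f"and save the transcribed text to {txt_file}. Process only this one file."
    )
    # Each invocation is a separate session, so the context stays small.
    subprocess.run(["opencode", "run", prompt], check=True)
```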
Some pages weren't transcribed because the model found the content ... inappropriate. It was unexpected and quite funny.
Anyway, I ended up with a bunch of text files containing the transcribed text, split into logical chunks.
From
To
Here are the markdown files served via GitHub Pages: https://szymon-szym.github.io/volumina_legum_read/
Quality
There are two main areas of focus when checking and improving the results of data preparation.
The first is the transcription quality. The language of the text is archaic Polish with a significant number of Latin phrases. There are also a lot of names and places that might be spelled inconsistently. All of this makes the transcription part tedious and error-prone. Overall, the quality is really good, but in the long run, the transcription would need some QA.
The second thing is that the created chunks might be incomplete in some cases; parts of paragraphs might be lost. I guess the flow would benefit from an additional step for linting and cross-checking the created chunks.
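Even a simple heuristic could catch the worst cases. The sketch below is hypothetical: it compares the total length of a page's chunks against the full transcription and flags pages whose chunks cover noticeably less text (the directory layout and the 90% threshold are arbitrary assumptions):

```python
# Hypothetical sanity check: flag pages whose chunks cover much less text
# than the full transcription. Paths and the 0.9 threshold are arbitrary.
from pathlib import Path

full_dir = Path("text")      # one .txt file per transcribed page
chunk_dir = Path("chunks")   # chunk files named like <page>_<n>.md

for page_file in sorted(full_dir.glob("*.txt")):
    page_len = len(page_file.read_text(encoding="utf-8"))
    chunk_files = sorted(chunk_dir.glob(f"{page_file.stem}_*.md"))
    chunks_len = sum(len(c.read_text(encoding="utf-8")) for c in chunk_files)
    if page_len and chunks_len / page_len < 0.9:
        print(f"{page_file.name}: chunks cover only {chunks_len / page_len:.0%} of the page")
```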
Create Wiki
I created a new vault in Obsidian. Then I launched OpenCode in the vault's directory and started shaping the wiki.
First, I needed the AGENTS.md definition, so the LLM would understand how to ingest, lint, and use the wiki.
There are some examples online, but I believe the best way is to write one yourself. I copied the description from Andrej Karpathy's gist and added some context about the source we would be working with. The result is more than enough for a first iteration.
Ingesting files and building the wiki takes some time.
After a few iterations, the wiki starts to look a bit messy, with a bunch of orphan links to non-existent pages (the gray nodes on the graph are orphaned links).
This is why running the lint step a few times throughout the process is crucial. It helps keep the wiki healthy and usable.
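Part of this check doesn't even need the LLM. A minimal sketch of an orphan-link check over the vault could look like this; it only handles plain `[[wikilinks]]`, taking the target name before any `|` alias or `#` heading anchor, and ignores Obsidian's more lenient link resolution:

```python
# Rough orphan-link check for an Obsidian vault: list [[wikilink]] targets
# that don't correspond to any existing note. Only plain links are handled;
# the target is whatever comes before a '|' alias or '#' heading anchor.
import re
from pathlib import Path

vault = Path(".")  # run from the vault root
notes = {p.stem for p in vault.rglob("*.md")}
link_pattern = re.compile(r"\[\[([^\]|#]+)")

orphans: dict[str, set[str]] = {}
for note in vault.rglob("*.md"):
    for match in link_pattern.finditer(note.read_text(encoding="utf-8")):
        target = match.group(1).strip()
        if target not in notes:
            orphans.setdefault(target, set()).add(note.name)

for target, sources in sorted(orphans.items()):
    print(f"missing note '{target}' linked from: {', '.join(sorted(sources))}")
```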
The result is pleasing to the eye.
Let's see whether it is actually useful.
Querying the wiki
Let's ask a general question related to this specific period. I start with the general model, without the wiki:
Qwen3.6 is pretty damn smart and has decent knowledge about history. Let's compare it with the wiki-powered answer:
The version with the wiki is much better, mostly thanks to the citations. In this case, the difference doesn't look spectacular, but pointing to specific sources really makes a difference.
There is a lot of flexibility in how to construct the query logic in AGENTS.md. For me, having citations of actual sources is crucial.
I also like that I could follow the reasoning of the model by checking the pages it read. Based on this information, I could update the query process to modify the model's behavior.
Summary
I have created a wiki for LLMs based on constitutions from the 17th century. The goal was to structure the laws of the Polish-Lithuanian Commonwealth into a wiki that can be used by LLMs. Here are my thoughts on this experiment.
Supervision is crucial
Errors will happen at each step of the wiki creation, from preparing the data to building the wiki itself. Transcription can be skewed, and prepared files might be incomplete. I was surprised to see that in one of the final wiki lints, the model decided to create plenty of nodes not grounded in the source documents. It basically ruined my wiki and forced me to restart the ingestion process.
Error handling
Even for a small proof of concept, it is clear that creating a non-trivial wiki is not something we should just throw at an LLM and hope for a valuable result. We need an easy way to correct the output of each step incrementally and to carry those fixes into the later stages of the flow. When I fix the content of a single paragraph, I don't want to recreate the whole wiki.
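One simple direction would be to track a hash of each source chunk and ask the agent to re-ingest only the chunks that changed since the last run. A hypothetical sketch (the manifest file name and directory layout are assumptions):

```python
# Hypothetical incremental-ingest helper: remember a hash of each chunk and
# report only the chunks that changed since the last run, so the agent can be
# asked to re-ingest just those. Manifest name and layout are made up.
import hashlib
import json
from pathlib import Path

chunk_dir = Path("chunks")
manifest_path = Path("ingest_manifest.json")
manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}

changed = []
for chunk in sorted(chunk_dir.glob("*.md")):
    digest = hashlib.sha256(chunk.read_bytes()).hexdigest()
    if manifest.get(chunk.name) != digest:
        changed.append(chunk.name)
        manifest[chunk.name] = digest

manifest_path.write_text(json.dumps(manifest, indent=2))
print("chunks to re-ingest:", ", ".join(changed) if changed else "none")
```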
Cost
Data preparation and wiki creation require a decent model, which generates costs. In my experiment, I transcribed around 150 pages and built the wiki, and I spent ~50% of my monthly limit on the OpenCode plan, which is quite generous. Using direct API calls would reduce the time but increase the cost further.
Flexibility
Data preparation is a step that needs to be done only once. From the same data, we can build different wikis that focus on different aspects. We could prompt the model to look for links in specific areas, or to focus on particular regional aspects.
It would be interesting to see a growing library of historical sources that are ready to be used by LLMs in different ways for different studies.