DEV Community

shashank ms

Building Humanities Tools with LLMs: A Technical Guide

We are building a primary source analyzer that ingests historical texts and emits structured annotations: entities, dates, relationships, and historical context. It helps historians and digital humanists turn unstructured archives into queryable data without maintaining brittle NLP pipelines or regular expression rules.

What you'll need

  • Python 3.10+
  • The OpenAI SDK: pip install openai
  • An Oxlo.ai API key from https://portal.oxlo.ai
  • A directory of plain text primary source documents

Step 1: Set up the Oxlo.ai client and test connectivity

I always verify the connection before adding application logic. This snippet initializes the client and makes a lightweight call to confirm the endpoint is alive.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ.get("OXLO_API_KEY")
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "Reply with exactly: connection ok"}
    ],
)

assert "connection ok" in response.choices[0].message.content.lower()
print("Oxlo.ai client ready")
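One thing to watch: os.environ.get returns None when the variable is missing, and as far as I know the OpenAI SDK then falls back to the OPENAI_API_KEY environment variable if one is set. A minimal sketch of a fail-fast check follows; require_env is a hypothetical helper of mine, not part of the SDK or of Oxlo.ai.

```python
import os

def require_env(name, environ=None):
    """Return a non-empty environment variable or raise a readable error.

    `environ` is injectable so the check can be exercised without
    touching the real process environment.
    """
    env = os.environ if environ is None else environ
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the script")
    return value

# Hypothetical wiring into the client setup shown earlier:
# client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key=require_env("OXLO_API_KEY"))
```

This way a missing key stops the script before any request is sent.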

Step 2: Ingest and chunk primary source documents

Archival documents vary wildly in length. Because Oxlo.ai uses request-based pricing rather than token-based metering, we can pass large chunks in a single call without cost scaling by the word. See https://oxlo.ai/pricing for plan details. I split on paragraph boundaries and cap at 8,000 words per chunk to keep context windows comfortable while minimizing API calls.

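import glob

def load_corpus(directory="sources", max_words=8000):
    # Sort so chunks are processed in the same order on every run.
    files = sorted(glob.glob(f"{directory}/*.txt"))
    chunks = []
    for path in files:
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
        paragraphs = text.split("\n\n")
        current_chunk = []
        current_len = 0
        for para in paragraphs:
            words = para.split()
            if not words:
                continue  # skip empty paragraphs left by runs of blank lines
            # Flush only when there is something to flush; otherwise an
            # oversized first paragraph would emit an empty chunk.
            if current_chunk and current_len + len(words) > max_words:
                chunks.append({
                    "source": path,
                    "text": "\n\n".join(current_chunk)
                })
                current_chunk = [para]
                current_len = len(words)
            else:
                current_chunk.append(para)
                current_len += len(words)
        # Emit whatever remains once the file is exhausted.
        if current_chunk:
            chunks.append({
                "source": path,
                "text": "\n\n".join(current_chunk)
            })
    return chunks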
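The packing logic is easier to check on an in-memory string than on real files. Here is a minimal sketch of the same greedy strategy; chunk_paragraphs is a hypothetical helper for illustration, and the tiny max_words value is only there to make the split visible.

```python
def chunk_paragraphs(text, max_words=8000):
    """Greedily pack blank-line-separated paragraphs into chunks.

    Each chunk holds at most max_words words, except that a single
    paragraph longer than max_words becomes its own oversized chunk.
    """
    chunks, current, current_len = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        if n == 0:
            continue  # ignore empty paragraphs
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

demo = "one two three\n\nfour five\n\nsix seven eight nine"
for i, chunk in enumerate(chunk_paragraphs(demo, max_words=5)):
    print(i, len(chunk.split()), repr(chunk))
```

Paragraphs are never split, so a chunk can exceed the cap when one paragraph alone is longer than max_words. For archives with very long unbroken passages you may want a secondary split on sentences.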

Step 3: Design the historian system prompt

The system prompt is the only guidance the model gets; there is no fine-tuning step. I keep it strict about output format and evidentiary standards so the model behaves like a careful research assistant rather than a chatbot.

SYSTEM_PROMPT = """You are a digital humanities research assistant analyzing primary historical documents.

For each input text, produce a JSON object with this exact structure:
{
  "entities": [
    {"name": "string", "type": "PERSON|PLACE|ORGANIZATION|EVENT", "context": "relevant sentence"}
  ],
  "dates": [
    {"expression": "original text", "normalized": "YYYY-MM-DD or YYYY-MM or YYYY", "confidence": 0.0-1.0}
  ],
  "relationships": [
    {"subject": "entity name", "predicate": "string", "object": "entity name", "evidence": "quoted phrase"}
  ],
  "summary": "one sentence historical context"
}

Rules:
- Only include information explicitly stated or directly inferable from the text.
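- Drop unknown trailing date components (for example, use YYYY-MM when the day is unknown); use null for "normalized" when not even the year can be determined.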
- Confidence reflects textual clarity, not historical certainty.
"""
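A prompt only asks for a structure; it does not guarantee one. Here is a minimal sketch of a validator that checks a parsed record against the schema described in the prompt. validate_annotation and its constants are hypothetical names I introduce here, and the checks cover only the fields the prompt spells out.

```python
import re

ENTITY_TYPES = ("PERSON", "PLACE", "ORGANIZATION", "EVENT")
# Accepts YYYY, YYYY-MM, or YYYY-MM-DD (format only, not calendar validity).
DATE_PATTERN = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

def validate_annotation(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for key in ("entities", "dates", "relationships"):
        if not isinstance(record.get(key), list):
            problems.append(f"'{key}' must be a list")
    if not isinstance(record.get("summary"), str):
        problems.append("'summary' must be a string")
    if problems:
        return problems  # structure is wrong; skip the per-item checks

    for ent in record["entities"]:
        if not isinstance(ent, dict) or ent.get("type") not in ENTITY_TYPES:
            problems.append(f"bad entity: {ent!r}")
    for date in record["dates"]:
        if not isinstance(date, dict):
            problems.append(f"bad date entry: {date!r}")
            continue
        norm = date.get("normalized")
        if norm is not None and not (isinstance(norm, str) and DATE_PATTERN.match(norm)):
            problems.append(f"bad normalized date: {norm!r}")
        conf = date.get("confidence")
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if not is_number or not 0.0 <= conf <= 1.0:
            problems.append(f"bad confidence: {conf!r}")
    return problems
```

Records that come back with problems can be logged and retried rather than written to the output file.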

Step 4: Build the structured analysis function

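I wrap the API call in a small function that requests JSON output through the response_format parameter and keeps the temperature low so repeated runs stay close to each other. I use kimi-k2.6 here because I have found its reasoning and long-context handling well suited to dense archival prose.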

import json

def analyze_chunk(chunk):
    response = client.chat.completions.create(
        model="kimi-k2.6",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": chunk["text"]},
        ],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    raw = response.choices[0].message.content
    parsed = json.loads(raw)
    parsed["provenance"] = chunk["source"]
    return parsed
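json.loads raises as soon as the reply is not pure JSON, which can happen if a model wraps its answer in a Markdown fence or adds a sentence of preamble. Here is a minimal sketch of a more tolerant parser; parse_json_reply is a hypothetical helper, and the brace-slicing fallback is a heuristic, not a full JSON extractor.

```python
import json

def parse_json_reply(raw):
    """Parse a model reply as a JSON object.

    Tries a strict parse first, then falls back to the span between the
    first '{' and the last '}' to tolerate code fences or stray prose.
    Raises ValueError when no JSON object can be recovered.
    """
    text = (raw or "").strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in model reply")
        # json.JSONDecodeError subclasses ValueError, so a second failure
        # surfaces to the caller as a ValueError as well.
        parsed = json.loads(text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("model reply is valid JSON but not an object")
    return parsed
```

Inside analyze_chunk, this could stand in for the bare json.loads(raw) call.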

Step 5: Batch process a corpus with provenance tracking

Now we wire it together. I process every chunk and write the results to a newline-delimited JSON file for downstream analysis in Pandas, R, or a web frontend.

def process_corpus(directory="sources", output="annotations.jsonl"):
    chunks = load_corpus(directory)
    with open(output, "w", encoding="utf-8") as out:
        for chunk in chunks:
            try:
                result = analyze_chunk(chunk)
                out.write(json.dumps(result, ensure_ascii=False) + "\n")
                print(f"Processed {chunk['source']}")
            except Exception as e:
                print(f"Failed on {chunk['source']}: {e}")

if __name__ == "__main__":
    process_corpus()
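Reading the output back is just as simple as writing it, since every line is an independent JSON document. Here is a minimal sketch using only the standard library; read_annotations and count_entity_types are hypothetical helper names, and the tally is one example of a downstream summary.

```python
import json
from collections import Counter

def read_annotations(path):
    """Yield one annotation record per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def count_entity_types(records):
    """Tally entities by their 'type' field across all records."""
    counts = Counter()
    for record in records:
        for ent in record.get("entities", []):
            counts[ent.get("type", "UNKNOWN")] += 1
    return counts

# Example usage, assuming annotations.jsonl exists from a previous run:
# print(count_entity_types(read_annotations("annotations.jsonl")).most_common())
```

Because read_annotations is a generator, it streams the file and never holds the whole archive in memory.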

Run it

Create a file named sources/sample.txt with this short illustrative passage:

Jefferson's letter to Adams, dated July 5th, 1814, discusses the Library of Congress and the burning of Washington by British forces. Adams replies from Quincy on the 15th of August, noting his concern for the collection.

Save the code from Steps 1 through 5 in a single file named analyze.py, then run python analyze.py. You should see:

Processed sources/sample.txt

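annotations.jsonl should then contain a record similar to this one (the exact wording and confidence values vary between runs and models):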

{"entities": [{"name": "Jefferson", "type": "PERSON", "context": "Jefferson's letter to Adams"}, {"name": "Adams", "type": "PERSON", "context": "letter to Adams"}, {"name": "Library of Congress", "type": "ORGANIZATION", "context": "discusses the Library of Congress"}, {"name": "British forces", "type": "ORGANIZATION", "context": "burning of Washington by British forces"}, {"name": "Quincy", "type": "PLACE", "context": "Adams replies from Quincy"}], "dates": [{"expression": "July 5th, 1814", "normalized": "1814-07-05", "confidence": 1.0}, {"expression": "15th of August", "normalized": "1814-08-15", "confidence": 0.9}], "relationships": [{"subject": "Jefferson", "predicate": "wrote to", "object": "Adams", "evidence": "Jefferson's letter to Adams"}, {"subject": "British forces", "predicate": "burned", "object": "Washington", "evidence": "burning of Washington by British forces"}], "summary": "Correspondence between Jefferson and Adams regarding the War of 1812 and the destruction of the Library of Congress.", "provenance": "sources/sample.txt"}

Next steps

Convert the relationships in annotations.jsonl into an edge list and load it into a network graph tool like Gephi to visualize entity relationships across your entire archive. If your corpus contains multilingual sources, swap in qwen-3-32b by changing only the model name, since Oxlo.ai exposes the same OpenAI-compatible interface for every model.
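Gephi does not read this JSONL layout directly, but it does import CSV edge lists with Source and Target columns. Here is a minimal sketch that turns relationship records into such rows; build_edge_rows and write_edge_csv are hypothetical helpers, and counting repeated triples as Weight is my own choice rather than anything the pipeline requires.

```python
import csv
import io
from collections import Counter

FIELDS = ["Source", "Target", "Label", "Weight"]

def build_edge_rows(records):
    """Collapse relationship triples into weighted edge rows."""
    weights = Counter()
    for record in records:
        for rel in record.get("relationships", []):
            subject, obj = rel.get("subject"), rel.get("object")
            if subject and obj:  # skip incomplete relationships
                key = (str(subject), str(obj), str(rel.get("predicate") or ""))
                weights[key] += 1
    return [
        {"Source": s, "Target": o, "Label": p, "Weight": w}
        for (s, o, p), w in sorted(weights.items())
    ]

def write_edge_csv(rows, handle):
    """Write edge rows as CSV to any text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

sample = [{"relationships": [
    {"subject": "Jefferson", "predicate": "wrote to", "object": "Adams"},
    {"subject": "Jefferson", "predicate": "wrote to", "object": "Adams"},
    {"subject": "British forces", "predicate": "burned", "object": "Washington"},
]}]
buffer = io.StringIO()
write_edge_csv(build_edge_rows(sample), buffer)
print(buffer.getvalue())
```

To write a real file, open it with newline="" as the csv module recommends and pass the handle to write_edge_csv.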
