shashank ms

Building Humanities Tools with LLMs: A Step-by-Step Guide

We are going to build a primary source analyzer that ingests raw historical texts and emits structured research briefs. The tool helps historians and literary scholars extract entities, sentiment, historical context, and research questions without maintaining custom NLP pipelines. Because Oxlo.ai charges a flat price per request, the cost of analyzing a long letter or diary entry does not grow with the length of the document.

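What you'll need

- Python 3.9 or later (the code uses built-in generic type hints such as list[dict])
- The openai Python package (pip install openai)
- An Oxlo.ai API key, exported as the OXLO_API_KEY environment variable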

Step 1: Set up the Oxlo.ai client

I start by initializing the OpenAI-compatible client pointing at Oxlo.ai. I use Llama 3.3 70B because it follows structured instructions reliably and handles long context windows well. If you later work with multilingual archives, Oxlo.ai also hosts Qwen 3 32B, and for documents up to 131K tokens you can switch to Kimi K2.6 without changing any other code.

from openai import OpenAI
import json
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ.get("OXLO_API_KEY", "YOUR_OXLO_API_KEY")
)

# Quick connectivity check
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Say 'Oxlo.ai client ready'"}],
    max_tokens=20
)
print(response.choices[0].message.content)
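
One caveat with the snippet above: if OXLO_API_KEY is not set, the client silently falls back to the placeholder string and the first request fails with an authentication error that can be confusing to debug. A minimal sketch of a fail-fast alternative follows. The helper name load_api_key is my own invention for illustration and is not part of the OpenAI SDK or of Oxlo.ai.

```python
import os

PLACEHOLDER_KEY = "YOUR_OXLO_API_KEY"  # the fallback string used in the snippet above


def load_api_key(env=None, var_name="OXLO_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    This helper is hypothetical and not part of any SDK.
    """
    env = os.environ if env is None else env
    key = (env.get(var_name) or "").strip()
    if not key or key == PLACEHOLDER_KEY:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the analyzer."
        )
    return key
```

With this in place, the client could be constructed with api_key=load_api_key() so a missing key is reported before any request is sent.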

Step 2: Define the system prompt

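The system prompt turns the model into a careful archival assistant. It asks for JSON-only output with a fixed set of keys and confines outside knowledge to the historical_context field, which discourages anachronistic assumptions elsewhere in the brief.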

SYSTEM_PROMPT = """You are a primary source analysis assistant for humanities researchers.
Your job is to read a historical document and produce a structured JSON object with exactly these keys:
- summary: a one-paragraph summary of the document's content
- entities: an array of objects, each with {name, type, description}. Types may be person, place, organization, event, or concept.
- sentiment: the overall emotional tone of the author, chosen from hopeful, anxious, neutral, angry, or mournful
- historical_context: one paragraph situating the document in its likely time period
- research_questions: an array of three specific questions a scholar might pursue based on this text
- anomalies: an array of anything unusual, damaged text, or likely transcription errors

Rules:
- Base all claims strictly on the provided text.
- Do not use outside knowledge unless explicitly labeled as historical_context.
- Output valid JSON only, with no markdown fences or commentary.
"""

Step 3: Build the analysis function with JSON mode

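This function sends the document to Oxlo.ai and requests a JSON object. In OpenAI-compatible APIs, JSON mode is meant to constrain the model to syntactically valid JSON, so we can parse the result with json.loads instead of fragile regex. It does not guarantee that the keys match the schema in the system prompt, and a response cut off by a token limit can still fail to parse.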

def analyze_primary_source(text: str, model: str = "llama-3.3-70b") -> dict:
    """Send a historical document to Oxlo.ai and return a structured analysis."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Analyze the following primary source:\n\n{text}"},
        ],
        response_format={"type": "json_object"},
        temperature=0.2,
    )
    
    raw = response.choices[0].message.content
    return json.loads(raw)
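
Because the schema lives only in the system prompt, it can be worth checking each parsed brief before trusting it. The sketch below is one possible validator, assuming the key names and allowed values listed in SYSTEM_PROMPT. The function name validate_analysis and the wording of the messages are my own choices, not part of any library.

```python
# Expected shape, copied from the rules in SYSTEM_PROMPT.
REQUIRED_KEYS = {
    "summary", "entities", "sentiment",
    "historical_context", "research_questions", "anomalies",
}
# Tuples rather than sets, so unhashable values (e.g. a list) compare safely.
ALLOWED_SENTIMENTS = ("hopeful", "anxious", "neutral", "angry", "mournful")
ALLOWED_ENTITY_TYPES = ("person", "place", "organization", "event", "concept")


def validate_analysis(analysis: dict) -> list[str]:
    """Return human-readable problems; an empty list means the brief looks well-formed."""
    problems = []

    missing = sorted(REQUIRED_KEYS - set(analysis))
    if missing:
        problems.append(f"missing keys: {', '.join(missing)}")

    if "sentiment" in analysis and analysis["sentiment"] not in ALLOWED_SENTIMENTS:
        problems.append(f"unexpected sentiment: {analysis['sentiment']!r}")

    if "entities" in analysis:
        entities = analysis["entities"]
        if not isinstance(entities, list):
            problems.append("entities is not a list")
        else:
            for i, entity in enumerate(entities):
                if not isinstance(entity, dict):
                    problems.append(f"entity {i} is not an object")
                elif entity.get("type") not in ALLOWED_ENTITY_TYPES:
                    problems.append(f"entity {i} has unexpected type: {entity.get('type')!r}")

    if "research_questions" in analysis:
        questions = analysis["research_questions"]
        if not isinstance(questions, list) or len(questions) != 3:
            problems.append("research_questions should be a list of three items")

    return problems
```

A caller could run validate_analysis on the dict returned by analyze_primary_source and either retry the request or flag the file for manual review when the list is not empty.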

Step 4: Batch-process a directory of transcripts

Archival projects rarely involve a single file. This helper reads every .txt file in a folder and runs analysis on each, aggregating the results.

from pathlib import Path

def analyze_directory(dir_path: str, model: str = "llama-3.3-70b") -> list[dict]:
    """Analyze all .txt files in a directory."""
    results = []
    for file_path in sorted(Path(dir_path).glob("*.txt")):  # sorted for a reproducible order
        text = file_path.read_text(encoding="utf-8")
        analysis = analyze_primary_source(text, model=model)
        analysis["source_file"] = file_path.name
        results.append(analysis)
    return results
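
As written, the helper above stops at the first file that fails, whether because of an unreadable file, a network error, or a response that does not parse. For a large archive it may be preferable to record the failure and keep going. The sketch below shows one way to do that. It takes the analysis function as a parameter so it can be exercised without network access; the name analyze_directory_safely and the shape of the failure records are assumptions of mine.

```python
from pathlib import Path
from typing import Callable


def analyze_directory_safely(
    dir_path, analyze: Callable[[str], dict]
) -> tuple[list[dict], list[dict]]:
    """Analyze every .txt file, collecting failures instead of aborting the batch.

    `analyze` is any function that maps document text to a dict,
    for example analyze_primary_source from Step 3.
    """
    results, failures = [], []
    for file_path in sorted(Path(dir_path).glob("*.txt")):
        try:
            text = file_path.read_text(encoding="utf-8")
            analysis = analyze(text)
        except Exception as exc:  # deliberately broad: one bad file should not end the run
            failures.append(
                {"source_file": file_path.name, "error": f"{type(exc).__name__}: {exc}"}
            )
            continue
        analysis["source_file"] = file_path.name
        results.append(analysis)
    return results, failures
```

Calling analyze_directory_safely("letters", analyze_primary_source) would return the successful briefs alongside a list of files to revisit.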

Step 5: Add a CLI entrypoint

I wrap the tool in a small command-line interface so researchers can point it at a file or a folder without touching Python.

if __name__ == "__main__":
    import sys
    
    if len(sys.argv) < 2:
        print("Usage: python analyzer.py <path_to_txt_file_or_directory>")
        sys.exit(1)
    
    target = Path(sys.argv[1])
    
    if target.is_file():
        text = target.read_text(encoding="utf-8")
        result = analyze_primary_source(text)
        print(json.dumps(result, indent=2, ensure_ascii=False))
    elif target.is_dir():
        batch = analyze_directory(str(target))  # analyze_directory is annotated to take a str
        out_file = target / "analysis_results.json"
        out_file.write_text(
            json.dumps(batch, indent=2, ensure_ascii=False),
            encoding="utf-8"
        )
        print(f"Analyzed {len(batch)} files. Results written to {out_file}")
    else:
        print("Invalid path.")
        sys.exit(1)
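
If researchers need to switch models from the command line, the standard library's argparse can replace the manual sys.argv handling. This is a minimal sketch rather than a drop-in replacement: the --model flag and the build_parser name are additions of mine, and the default simply mirrors the model ID used throughout this guide.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one positional path and an optional model override."""
    parser = argparse.ArgumentParser(
        prog="analyzer.py",
        description="Analyze a primary source transcript or a folder of transcripts.",
    )
    parser.add_argument(
        "target", type=Path, help="Path to a .txt file or a directory of .txt files"
    )
    parser.add_argument(
        "--model",
        default="llama-3.3-70b",  # same default as analyze_primary_source
        help="Model ID to request (hypothetical flag, not in the original script)",
    )
    return parser
```

Inside the entrypoint, args = build_parser().parse_args() would provide args.target and args.model, which can be passed on to analyze_primary_source or analyze_directory.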

Run it

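Save the code from Steps 1 through 5 in a single file named analyzer.py, create a file named letter_1862.txt, and run the command below. Note that the connectivity check from Step 1 sends a request on every run, so remove it once the client is confirmed to work.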

python analyzer.py letter_1862.txt

Here is a sample input and the output from one run. Model output is not deterministic, so your results may differ in wording.

Input (letter_1862.txt):

Camp near Falmouth, Va.
December 14th, 1862

My dear wife,

We have just returned from the battlefield near Fredericksburg. The loss on our side is terrible. I am unhurt, though three bullets passed through my coat. The men are in low spirits. Burnside has ordered another assault for tomorrow, though I cannot see how we shall carry those heights. The wounded fill every barn for miles. I pray this letter finds you well.

Your husband,
Thomas

Output:

{
  "summary": "A Union soldier writes to his wife after the Battle of Fredericksburg, describing heavy casualties, his narrow escape, low morale, and impending further assaults ordered by General Burnside.",
  "entities": [
    {"name": "Thomas", "type": "person", "description": "The author, a Union soldier writing to his wife"},
    {"name": "Falmouth, Va.", "type": "place", "description": "Location of the Union camp"},
    {"name": "Fredericksburg", "type": "place", "description": "Site of the recent battle"},
    {"name": "Burnside", "type": "person", "description": "Union general ordering continued attacks"}
  ],
  "sentiment": "mournful",
  "historical_context": "The letter dates from December 1862 during the American Civil War, following the Union defeat at the Battle of Fredericksburg under General Ambrose Burnside.",
  "research_questions": [
    "What regiment did Thomas belong to, and how did its casualty rate compare to brigade averages at Fredericksburg?",
    "How did letters describing low morale circulate among home communities, and did they influence enlistment or support for the war?",
    "What medical infrastructure existed in the barns converted to hospitals around Falmouth in December 1862?"
  ],
  "anomalies": []
}
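
Briefs in this shape are easy to reshape for tools that scholars already use. As one illustration, the sketch below flattens the entities from a list of briefs into CSV text that opens in any spreadsheet. It assumes the key names shown in the output above plus the source_file key added in Step 4; the function name entities_to_csv is hypothetical.

```python
import csv
import io


def entities_to_csv(analyses: list[dict]) -> str:
    """Flatten the entities of several briefs into CSV text, one row per entity."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source_file", "name", "type", "description"])
    for analysis in analyses:
        source = analysis.get("source_file", "")  # absent for single-file runs
        for entity in analysis.get("entities", []):
            writer.writerow([
                source,
                entity.get("name", ""),
                entity.get("type", ""),
                entity.get("description", ""),
            ])
    return buffer.getvalue()
```

The returned string can be written to disk with Path("entities.csv").write_text(..., encoding="utf-8").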

Next steps

Here are two concrete ways to extend this tool.

First, add semantic search across a corpus by piping the same texts through Oxlo.ai's embeddings endpoint. Store the vectors in a local database such as Chroma or pgvector so researchers can ask, "Find me all letters that mention desertion."

Second, wire the analyzer into a lightweight web UI with Gradio or Streamlit. Non-technical collaborators can then upload transcripts and view structured briefs without touching the command line.
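
For the semantic search idea, the ranking step itself needs no special infrastructure. The sketch below ranks documents by cosine similarity in plain Python, assuming you already have one embedding vector per document and one for the query. With the OpenAI SDK those vectors would presumably come from client.embeddings.create, but this guide does not name an Oxlo.ai embedding model, so that call is left out. The function names are my own.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vector: list[float],
    doc_vectors: dict[str, list[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k (document name, similarity) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

A vector database such as Chroma or pgvector performs the same comparison at scale; this version is only meant to make the idea concrete for a few hundred letters.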
