We are going to build a primary source analyzer that ingests raw historical texts and emits structured research briefs. This tool helps historians and literary scholars extract entities, sentiment, historical context, and research questions without maintaining custom NLP pipelines. Because Oxlo.ai charges one flat cost per request, analyzing long letters or diary entries never inflates the price as the document grows.
What you'll need
- Python 3.10 or higher
- An Oxlo.ai API key from https://portal.oxlo.ai
- The OpenAI SDK:
pip install openai
Step 1: Set up the Oxlo.ai client
I start by initializing the OpenAI-compatible client pointing at Oxlo.ai. I use Llama 3.3 70B because it follows structured instructions reliably and handles long context windows well. If you later work with multilingual archives, Oxlo.ai also hosts Qwen 3 32B, and for documents up to 131K tokens you can switch to Kimi K2.6 without changing any other code.
from openai import OpenAI
import json
import os
client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ.get("OXLO_API_KEY", "YOUR_OXLO_API_KEY")
)
# Quick connectivity check
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Say 'Oxlo.ai client ready'"}],
    max_tokens=20
)
print(response.choices[0].message.content)
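If the environment variable is missing, the client above falls back to the placeholder string and the first request fails with an authentication error. A minimal sketch of a fail-fast check follows; `require_api_key` is a hypothetical helper of my own, not part of the OpenAI SDK or Oxlo.ai.

```python
import os


def require_api_key(env=None, name="OXLO_API_KEY"):
    """Return the API key, or raise a clear error if it is unset.

    Hypothetical helper for illustration; `env` defaults to os.environ
    and is only a parameter so the check can be exercised with a plain dict.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Treat the tutorial's placeholder value the same as a missing key.
    if not key or key == "YOUR_OXLO_API_KEY":
        raise RuntimeError(f"Set the {name} environment variable before running the analyzer.")
    return key
```

You could then pass `api_key=require_api_key()` when constructing the client, so a missing key surfaces before any request is sent.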
Step 2: Define the system prompt
The system prompt turns the model into a careful archival assistant. It asks for JSON-only output and restricts every claim to the provided text, which discourages anachronistic assumptions.
SYSTEM_PROMPT = """You are a primary source analysis assistant for humanities researchers.
Your job is to read a historical document and produce a structured JSON object with exactly these keys:
- summary: a one-paragraph summary of the document's content
- entities: an array of objects, each with {name, type, description}. Types may be person, place, organization, event, or concept.
- sentiment: the overall emotional tone of the author, chosen from hopeful, anxious, neutral, angry, or mournful
- historical_context: one paragraph situating the document in its likely time period
- research_questions: an array of three specific questions a scholar might pursue based on this text
- anomalies: an array of anything unusual, damaged text, or likely transcription errors
Rules:
- Base all claims strictly on the provided text.
- Do not use outside knowledge unless explicitly labeled as historical_context.
- Output valid JSON only, with no markdown fences or commentary.
"""
Step 3: Build the analysis function with JSON mode
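This function sends the document to Oxlo.ai and requests a JSON object. With JSON mode enabled, the response should parse with json.loads, so we avoid fragile regex extraction. A response cut off by a token limit can still fail to parse, so treat a JSON error as a possible outcome rather than an impossibility.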
def analyze_primary_source(text: str, model: str = "llama-3.3-70b") -> dict:
    """Send a historical document to Oxlo.ai and return a structured analysis."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Analyze the following primary source:\n\n{text}"},
        ],
        response_format={"type": "json_object"},
        temperature=0.2,
    )
    raw = response.choices[0].message.content
    return json.loads(raw)
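JSON mode only promises syntactically valid JSON. It does not promise that the keys requested in the system prompt are actually present. A minimal sketch of a shape check follows; `validate_analysis`, `REQUIRED_KEYS`, and `ALLOWED_SENTIMENTS` are hypothetical names I introduce here, mirroring the keys and sentiment labels listed in the prompt.

```python
# Hypothetical constants mirroring the keys and labels in SYSTEM_PROMPT.
ALLOWED_SENTIMENTS = {"hopeful", "anxious", "neutral", "angry", "mournful"}
REQUIRED_KEYS = {
    "summary": str,
    "entities": list,
    "sentiment": str,
    "historical_context": str,
    "research_questions": list,
    "anomalies": list,
}


def validate_analysis(analysis) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(analysis, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in analysis:
            problems.append(f"missing key: {key}")
        elif not isinstance(analysis[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    sentiment = analysis.get("sentiment")
    if isinstance(sentiment, str) and sentiment not in ALLOWED_SENTIMENTS:
        problems.append(f"unexpected sentiment: {sentiment}")
    questions = analysis.get("research_questions")
    if isinstance(questions, list) and len(questions) != 3:
        problems.append(f"expected 3 research questions, got {len(questions)}")
    return problems
```

Calling `validate_analysis(analyze_primary_source(text))` and logging any returned problems is one way to catch a drifting response before it lands in a results file.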
Step 4: Batch-process a directory of transcripts
Archival projects rarely involve a single file. This helper reads every .txt file in a folder and runs analysis on each, aggregating the results.
from pathlib import Path
def analyze_directory(dir_path: str, model: str = "llama-3.3-70b") -> list[dict]:
    """Analyze all .txt files in a directory."""
    results = []
    for file_path in Path(dir_path).glob("*.txt"):
        text = file_path.read_text(encoding="utf-8")
        analysis = analyze_primary_source(text, model=model)
        analysis["source_file"] = file_path.name
        results.append(analysis)
    return results
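In a long batch, one unreadable file or one unparseable response would abort the whole run. A minimal sketch of a more forgiving variant follows. `analyze_directory_safely` is a hypothetical name, and the analysis function is passed in as a parameter (an assumption I make so the loop can be tried with a stub instead of a live API call).

```python
from pathlib import Path
from typing import Callable


def analyze_directory_safely(
    dir_path, analyze: Callable[[str], dict]
) -> tuple[list[dict], list[dict]]:
    """Analyze every .txt file, collecting failures instead of stopping.

    Hypothetical variant for illustration. Returns (results, failures).
    """
    results, failures = [], []
    # sorted() keeps the processing order stable between runs.
    for file_path in sorted(Path(dir_path).glob("*.txt")):
        try:
            text = file_path.read_text(encoding="utf-8")
            analysis = analyze(text)
            analysis["source_file"] = file_path.name
            results.append(analysis)
        except Exception as exc:
            # Deliberately broad: encoding errors, JSON errors, and API
            # errors are all recorded so the remaining files still run.
            failures.append(
                {"source_file": file_path.name, "error": f"{type(exc).__name__}: {exc}"}
            )
    return results, failures
```

With the function from Step 3, the call would be `analyze_directory_safely("letters", analyze_primary_source)`, where the folder name is only an example. Writing the failures list next to the results makes it easy to retry just the files that failed.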
Step 5: Add a CLI entrypoint
I wrap the tool in a small command-line interface so researchers can point it at a file or a folder without touching Python.
if __name__ == "__main__":
    import sys
    if len(sys.argv) < 2:
        print("Usage: python analyzer.py <path_to_txt_file_or_directory>")
        sys.exit(1)
    target = Path(sys.argv[1])
    if target.is_file():
        text = target.read_text(encoding="utf-8")
        result = analyze_primary_source(text)
        print(json.dumps(result, indent=2, ensure_ascii=False))
    elif target.is_dir():
        batch = analyze_directory(target)
        out_file = target / "analysis_results.json"
        out_file.write_text(
            json.dumps(batch, indent=2, ensure_ascii=False),
            encoding="utf-8"
        )
        print(f"Analyzed {len(batch)} files. Results written to {out_file}")
    else:
        print("Invalid path.")
        sys.exit(1)
Run it
Save the script as analyzer.py, create a file named letter_1862.txt, and run:
python analyzer.py letter_1862.txt
Here is a sample input and the resulting output.
Input (letter_1862.txt):
Camp near Falmouth, Va.
December 14th, 1862
My dear wife,
We have just returned from the battlefield near Fredericksburg. The loss on our side is terrible. I am unhurt, though three bullets passed through my coat. The men are in low spirits. Burnside has ordered another assault for tomorrow, though I cannot see how we shall carry those heights. The wounded fill every barn for miles. I pray this letter finds you well.
Your husband,
Thomas
Output:
{
  "summary": "A Union soldier writes to his wife after the Battle of Fredericksburg, describing heavy casualties, his narrow escape, low morale, and impending further assaults ordered by General Burnside.",
  "entities": [
    {"name": "Thomas", "type": "person", "description": "The author, a Union soldier writing to his wife"},
    {"name": "Falmouth, Va.", "type": "place", "description": "Location of the Union camp"},
    {"name": "Fredericksburg", "type": "place", "description": "Site of the recent battle"},
    {"name": "Burnside", "type": "person", "description": "Union general ordering continued attacks"}
  ],
  "sentiment": "mournful",
  "historical_context": "The letter dates from December 1862 during the American Civil War, following the Union defeat at the Battle of Fredericksburg under General Ambrose Burnside.",
  "research_questions": [
    "What regiment did Thomas belong to, and how did its casualty rate compare to brigade averages at Fredericksburg?",
    "How did letters describing low morale circulate among home communities, and did they influence enlistment or support for the war?",
    "What medical infrastructure existed in the barns converted to hospitals around Falmouth in December 1862?"
  ],
  "anomalies": []
}
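Raw JSON is convenient for scripts but less so for a reading group. A minimal sketch that turns one analysis dict into a plain-text brief follows. `render_brief` is a hypothetical helper of my own, and it assumes the keys shown in the output above.

```python
def render_brief(analysis: dict) -> str:
    """Format one analysis dict as a short plain-text brief.

    Hypothetical helper for illustration; missing keys render as empty
    values rather than raising.
    """
    lines = []
    if analysis.get("source_file"):
        lines.append(f"Source: {analysis['source_file']}")
    lines.append(f"Summary: {analysis.get('summary', '')}")
    lines.append(f"Sentiment: {analysis.get('sentiment', '')}")
    lines.append(f"Historical context: {analysis.get('historical_context', '')}")
    lines.append("Entities:")
    for entity in analysis.get("entities", []):
        lines.append(
            f"  - {entity.get('name', '?')} ({entity.get('type', '?')}): "
            f"{entity.get('description', '')}"
        )
    lines.append("Research questions:")
    for number, question in enumerate(analysis.get("research_questions", []), start=1):
        lines.append(f"  {number}. {question}")
    # An empty anomalies array is reported explicitly instead of left blank.
    anomalies = analysis.get("anomalies") or ["none reported"]
    lines.append("Anomalies: " + "; ".join(str(item) for item in anomalies))
    return "\n".join(lines)
```

Printing `render_brief(result)` in place of the `json.dumps` call in the CLI would give a brief that can be pasted straight into research notes.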
Next steps
Two concrete ways to extend this tool.
First, add semantic search across a corpus by piping the same texts through Oxlo.ai's embeddings endpoint. Store the vectors in a local database such as Chroma or pgvector so researchers can ask, "Find me all letters that mention desertion."
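The ranking step behind that kind of search can be sketched without any database. The example below assumes you already have one embedding vector per document and one for the query, for instance from the OpenAI SDK's `client.embeddings.create` call pointed at Oxlo.ai (I do not know which embedding models Oxlo.ai exposes, so that part is left out). `cosine_similarity` and `rank_documents` are hypothetical helper names, and the toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, doc_vectors: dict, top_k: int = 3) -> list[tuple]:
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy two-dimensional vectors standing in for real embeddings.
toy_vectors = {
    "letter_a.txt": [1.0, 0.0],
    "letter_b.txt": [0.0, 1.0],
    "letter_c.txt": [1.0, 1.0],
}
top_matches = rank_documents([1.0, 0.0], toy_vectors, top_k=2)
```

A vector store such as Chroma or pgvector performs the same comparison at scale with indexing, so this loop is only a way to see the idea on a handful of letters.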
Second, wire the analyzer into a lightweight web UI with Gradio or Streamlit. Non-technical collaborators can then upload transcripts and view structured briefs without touching the command line.