Harnessing LLMs for Natural Language Processing in Text Analysis

#learnai #oxlo #ai

We are building a batch text analysis agent that reads raw .txt files and emits structured JSON containing sentiment, entities, topics, and a summary. It is aimed at developers who need to process unstructured documents without maintaining separate NLP libraries for each task. We will run it against Oxlo.ai, where flat per-request pricing means cost does not scale with input length, unlike token-based providers such as Together AI, Fireworks AI, or OpenRouter. That makes Oxlo.ai a strong fit for long-context text analysis workloads. See https://oxlo.ai/pricing for details.

What you'll need

Python 3.10 or newer
The OpenAI SDK installed with pip install openai
An Oxlo.ai API key from https://portal.oxlo.ai
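
The listings in this post pass the placeholder string YOUR_OXLO_API_KEY straight to the client. One way to keep a real key out of source control is to read it from the environment. This is a minimal sketch: the variable name OXLO_API_KEY and the helper get_api_key are our own assumptions, not something Oxlo.ai or the OpenAI SDK prescribes.

```python
import os

def get_api_key(var_name: str = "OXLO_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The default variable name is a hypothetical choice for this tutorial.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the script"
        )
    return key
```

With this in place, the client could be created with api_key=get_api_key() instead of a literal string.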

Step 1: Define the system prompt

The system prompt forces the model to return only a JSON object with four required keys. Keeping this strict reduces parsing errors later.

SYSTEM_PROMPT = """You are a precise text analysis engine. Analyze the user provided text and respond with a single JSON object containing exactly these keys:
- sentiment: one of Positive, Negative, Neutral, or Mixed
- entities: an array of objects with keys name and type (Person, Organization, Location, Product, or Event)
- topics: an array of up to five strings representing main themes
- summary: a one sentence summary of the text

Respond with only the JSON object. No markdown fences, no commentary."""
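
The prompt describes a contract, but nothing forces the model to honor it. As a rough sketch, a parsed response could be checked against the four keys and the allowed sentiment labels listed above. The function validate_analysis and both constants are hypothetical names introduced here for illustration; they are not used by the rest of the script.

```python
# Hypothetical validator mirroring the rules stated in SYSTEM_PROMPT.
REQUIRED_KEYS = {"sentiment", "entities", "topics", "summary"}
ALLOWED_SENTIMENTS = {"Positive", "Negative", "Neutral", "Mixed"}

def validate_analysis(result) -> list[str]:
    """Return a list of problems; an empty list means the result looks valid."""
    if not isinstance(result, dict):
        return ["result is not a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - result.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    sentiment = result.get("sentiment")
    if "sentiment" in result and not (
        isinstance(sentiment, str) and sentiment in ALLOWED_SENTIMENTS
    ):
        problems.append(f"unexpected sentiment: {sentiment!r}")
    topics = result.get("topics", [])
    if not isinstance(topics, list) or len(topics) > 5:
        problems.append("topics must be a list of at most five strings")
    entities = result.get("entities", [])
    if not isinstance(entities, list):
        problems.append("entities must be a list")
    else:
        for entity in entities:
            # Each entity needs at least the name and type keys.
            if not isinstance(entity, dict) or not {"name", "type"} <= entity.keys():
                problems.append(f"malformed entity: {entity!r}")
    return problems
```

An empty return value means the object matches the shape the prompt asks for; anything else can be logged or used to trigger a retry.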

Step 2: Initialize the client and build the analysis function

We point the OpenAI SDK at Oxlo.ai and wrap the call in a small function that strips accidental Markdown fences and parses the JSON. We use Llama 3.3 70B here because it follows structured instructions reliably.

from openai import OpenAI
import json

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

def analyze_text(text: str) -> dict:
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.strip("`").strip()
        if raw.lower().startswith("json"):
            raw = raw[4:].strip()
    return json.loads(raw)
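
To see what the fence cleanup does without making an API call, the same logic can be pulled into a standalone helper and run on a hand-written string. This is only a sketch: strip_code_fence is a hypothetical name, and the fence marker is built from single backticks purely to keep this listing easy to embed.

```python
import json

FENCE = "`" * 3  # three backticks, i.e. a Markdown code fence marker

def strip_code_fence(raw: str) -> str:
    """Remove a surrounding Markdown code fence, if present."""
    raw = raw.strip()
    if raw.startswith(FENCE):
        # Drop backticks from both ends, then an optional "json" language tag.
        raw = raw.strip("`").strip()
        if raw.lower().startswith("json"):
            raw = raw[4:].strip()
    return raw

fenced = FENCE + 'json\n{"sentiment": "Neutral"}\n' + FENCE
print(json.loads(strip_code_fence(fenced)))  # prints {'sentiment': 'Neutral'}
```

Text that does not start with a fence passes through unchanged apart from whitespace trimming, so a clean response is unaffected.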

Step 3: Add batch file ingestion

Real workloads rarely involve a single string. This helper reads every .txt file in a directory and attaches the filename to the result so we can trace output back to its source.

from pathlib import Path

def analyze_directory(directory: str):
    results = []
    for path in Path(directory).glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        try:
            analysis = analyze_text(text)
            analysis["file"] = path.name
            results.append(analysis)
        except Exception as e:
            print(f"Failed on {path.name}: {e}")
    return results
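
Path.glob does not promise any particular file order, so output lines may come out in a different order from run to run. A variant that sorts the paths and takes the analysis function as a parameter is sketched below; the names analyze_directory_with and fake_analyze are hypothetical, and the stub only exists so the loop can be exercised offline.

```python
from pathlib import Path
from typing import Callable

def analyze_directory_with(directory: str, analyze: Callable[[str], dict]) -> list[dict]:
    """Run `analyze` over every .txt file in `directory`, in sorted filename order."""
    results = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        try:
            analysis = analyze(text)
            analysis["file"] = path.name  # keep the source filename for tracing
            results.append(analysis)
        except Exception as e:
            # One bad file should not stop the rest of the batch.
            print(f"Failed on {path.name}: {e}")
    return results

def fake_analyze(text: str) -> dict:
    # Stand-in for analyze_text; returns a dummy result without calling any API.
    return {"summary": text[:20], "length": len(text)}
```

Calling analyze_directory_with("sample_docs", analyze_text) would behave like the helper above, just with a stable order.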

Step 4: Harden against malformed output

Even with a strong prompt, an LLM can occasionally prepend a stray word. Rather than crashing the pipeline, we catch JSON decoding errors and retry once with a follow-up message that asks for JSON only.

import json

def safe_analyze(text: str) -> dict:
    try:
        return analyze_text(text)
    except json.JSONDecodeError:
        response = client.chat.completions.create(
            model="llama-3.3-70b",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text},
                {"role": "user", "content": "That was not valid JSON. Return only the JSON object."},
            ],
        )
        raw = response.choices[0].message.content.strip()
        if raw.startswith("```"):
            raw = raw.strip("`").strip()
            if raw.lower().startswith("json"):
                raw = raw[4:].strip()
        return json.loads(raw)
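
A retry costs a second request. When the only problem is stray text around otherwise valid JSON, a local repair may be enough. The sketch below is a different technique from the retry above, not a replacement the original script uses: it scans for the first parseable JSON object with the standard library's JSONDecoder.raw_decode. The function name extract_first_json_object is hypothetical.

```python
import json

def extract_first_json_object(raw: str) -> dict:
    """Parse the first JSON object found in `raw`, ignoring text around it."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # tolerates trailing text after it.
            obj, _end = decoder.raw_decode(raw, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = raw.find("{", start + 1)
    raise ValueError("no JSON object found in model output")

print(extract_first_json_object('Sure! {"sentiment": "Mixed"} Hope that helps.'))
# prints {'sentiment': 'Mixed'}
```

One possible design is to try this extraction first and fall back to the retry only when it raises ValueError.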

Step 5: Build the CLI interface

We add a small argparse wrapper so the script accepts a directory and writes newline-delimited JSON, one object per line. This makes it easy to pipe results into jq or load them into pandas.

import argparse
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

SYSTEM_PROMPT = """You are a precise text analysis engine. Analyze the user provided text and respond with a single JSON object containing exactly these keys:
- sentiment: one of Positive, Negative, Neutral, or Mixed
- entities: an array of objects with keys name and type (Person, Organization, Location, Product, or Event)
- topics: an array of up to five strings representing main themes
- summary: a one sentence summary of the text

Respond with only the JSON object. No markdown fences, no commentary."""

def analyze_text(text: str) -> dict:
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.strip("`").strip()
        if raw.lower().startswith("json"):
            raw = raw[4:].strip()
    return json.loads(raw)

def safe_analyze(text: str) -> dict:
    try:
        return analyze_text(text)
    except json.JSONDecodeError:
        response = client.chat.completions.create(
            model="llama-3.3-70b",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text},
                {"role": "user", "content": "That was not valid JSON. Return only the JSON object."},
            ],
        )
        raw = response.choices[0].message.content.strip()
        if raw.startswith("```"):
            raw = raw.strip("`").strip()
            if raw.lower().startswith("json"):
                raw = raw[4:].strip()
        return json.loads(raw)

def analyze_directory(directory: str):
    results = []
    for path in Path(directory).glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        try:
            analysis = safe_analyze(text)
            analysis["file"] = path.name
            results.append(analysis)
        except Exception as e:
            print(f"Failed on {path.name}: {e}")
    return results

def main():
    parser = argparse.ArgumentParser(description="Batch text analysis with Oxlo.ai")
    parser.add_argument("directory", help="Path to directory containing .txt files")
    parser.add_argument("--output", default="analysis.jsonl", help="Output file")
    args = parser.parse_args()
    
    results = analyze_directory(args.directory)
    with open(args.output, "w", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    print(f"Wrote {len(results)} analyses to {args.output}")

if __name__ == "__main__":
    main()

Run it

Save the full script from Step 5 as analyze.py, create a few sample documents, and invoke it. Because Oxlo.ai uses flat per-request pricing, each request costs the same whether the file is two hundred or two thousand tokens long.

mkdir sample_docs
echo "Apple Inc. reported record quarterly earnings yesterday, driven by strong iPhone sales across Asia and a new partnership with a major Japanese carrier." > sample_docs/tech.txt
echo "The city council approved a new downtown park project after months of debate. Local residents and the Green Earth Organization celebrated the decision." > sample_docs/local.txt
python analyze.py sample_docs --output results.jsonl
cat results.jsonl

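Expected output (entities, topics, and wording will vary between runs and models, and the line order depends on the order in which the files were read):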

{"sentiment": "Positive", "entities": [{"name": "Apple Inc.", "type": "Organization"}, {"name": "Asia", "type": "Location"}], "topics": ["earnings", "iPhone sales", "partnership"], "summary": "Apple Inc. reported record quarterly earnings driven by strong iPhone sales in Asia and a new Japanese partnership.", "file": "tech.txt"}
{"sentiment": "Positive", "entities": [{"name": "Green Earth Organization", "type": "Organization"}, {"name": "city council", "type": "Organization"}], "topics": ["urban planning", "public parks", "local government"], "summary": "The city council approved a new downtown park project celebrated by residents and the Green Earth Organization.", "file": "local.txt"}
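
Because each line of the output file is a complete JSON object, it can be read back with nothing but the standard library. As a small sketch, the helpers below load the file and tally sentiment labels; load_jsonl and sentiment_counts are hypothetical names, and records without a sentiment key are counted under "Unknown" by our own convention.

```python
import json
from collections import Counter
from pathlib import Path

def load_jsonl(path: str) -> list[dict]:
    """Read newline-delimited JSON, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

def sentiment_counts(records: list[dict]) -> Counter:
    """Count how many records carry each sentiment label."""
    return Counter(r.get("sentiment", "Unknown") for r in records)

if __name__ == "__main__" and Path("results.jsonl").exists():
    # Only runs when the output file from the command above is present.
    print(sentiment_counts(load_jsonl("results.jsonl")))
```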

Wrap-up

You now have a working batch text analysis pipeline that turns unstructured files into structured JSON. Two concrete next steps: deploy this as a FastAPI endpoint so other services can POST text directly for analysis, or experiment with qwen-3-32b on Oxlo.ai for multilingual document processing without changing any client code.