shashank ms

Building a Document Summarization Tool with an LLM: A Step-by-Step Guide

Document summarization is one of the most common production tasks for LLMs. In this guide, I will walk through building a Python CLI tool that ingests a text file, chunks it if necessary, and returns a structured summary using Oxlo.ai. The flat per-request pricing makes it practical to throw long reports or meeting transcripts at the model without watching token costs scale.

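What you'll need

Python 3, the openai package (pip install openai), and an Oxlo.ai API key exported as the OXLO_API_KEY environment variable.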

Step 1: Configure the Oxlo.ai client

Create a file called summarize.py. Start with imports and the client setup. I use llama-3.3-70b as the default because it handles long context well and responds quickly.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ.get("OXLO_API_KEY"),
)

MODEL = "llama-3.3-70b"
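
Because os.environ.get returns None when the variable is unset, a missing key may be easier to diagnose if the script stops early with an explicit message. A minimal sketch, assuming a hypothetical helper named require_env that is not part of the original script:

```python
import os


def require_env(name, environ=os.environ):
    """Return a non-empty environment variable or exit with a clear message.

    require_env is a hypothetical helper, not part of the original script.
    The environ parameter exists only so the function can be tested with a dict.
    """
    value = (environ.get(name) or "").strip()
    if not value:
        raise SystemExit(f"Missing environment variable: {name}")
    return value
```

One possible use is to pass api_key=require_env("OXLO_API_KEY") when constructing the client.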

Step 2: Load and chunk the document

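A real document might exceed context limits, so we split on double newlines to keep paragraphs intact, then pack consecutive paragraphs into chunks of at most max_chars characters. Note that a single paragraph longer than max_chars is not split further in this version; it becomes its own oversized chunk.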

def load_and_chunk(file_path, max_chars=3000):
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current_chunk = ""

    for para in paragraphs:
        if len(current_chunk) + len(para) + 2 <= max_chars:
            current_chunk += para + "\n\n"
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para + "\n\n"

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks
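
The function above keeps a paragraph whole even when it is longer than max_chars. If that matters for your documents, a character-limit fallback could look roughly like this. It is a sketch under my own assumptions: split_oversized is a hypothetical helper, and breaking at the last space is just one reasonable choice.

```python
def split_oversized(paragraph, max_chars=3000):
    """Split one paragraph into pieces of at most max_chars characters.

    Prefers to break at the last space inside the window and falls back to a
    hard cut when the window contains no space. Hypothetical helper, not part
    of the original script.
    """
    pieces = []
    remaining = paragraph.strip()
    while len(remaining) > max_chars:
        # Look for the last space at or before position max_chars.
        cut = remaining.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars  # no usable space: hard cut
        pieces.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        pieces.append(remaining)
    return pieces
```

Inside load_and_chunk, the paragraphs list could be flattened through split_oversized(p, max_chars) before the packing loop, so every piece fits within the limit.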

Step 3: Write the system prompt

The system prompt keeps the model focused on extracting key points rather than just rewriting sentences. I ask for bullet points and a one-line headline.

SYSTEM_PROMPT = """You are a precise document summarizer. Your task is to summarize the provided text into 3 to 5 bullet points. 
Start with a one-sentence headline that captures the main topic. 
Each bullet should be concise, specific, and preserve quantitative details like dates, names, and numbers. 
Do not use introductory phrases like 'This document discusses'. Output only the headline and bullets."""
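
Models do not always follow formatting instructions, so it can help to check the shape of a response before using it. The sketch below tests for a headline followed by 3 to 5 bullets. The function name and the accepted bullet markers are my assumptions, since the prompt does not fix a marker.

```python
# Assumed bullet markers; the system prompt does not specify one.
BULLET_PREFIXES = ("- ", "* ", "• ")


def looks_like_summary(text, min_bullets=3, max_bullets=5):
    """Return True if text is one headline line followed by 3 to 5 bullets."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # The first non-empty line must be a headline, not a bullet.
    if not lines or lines[0].startswith(BULLET_PREFIXES):
        return False
    bullets = lines[1:]
    # Every remaining line must be a bullet.
    if not all(line.startswith(BULLET_PREFIXES) for line in bullets):
        return False
    return min_bullets <= len(bullets) <= max_bullets
```

A failed check could trigger a retry or simply a warning, depending on how strict the tool needs to be.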

Step 4: Summarize each chunk

We map the summarization function over every chunk, making one request per chunk. Because Oxlo.ai uses flat per-request pricing, the cost per chunk is predictable regardless of how long the chunk is. For very long inputs, models like kimi-k2.6 or deepseek-v4-flash on Oxlo.ai support 131K and 1M context windows respectively, but the chunking approach works with any model. Check current rates at https://oxlo.ai/pricing.

def summarize_chunk(chunk_text):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": chunk_text},
        ],
        temperature=0.3,
        max_tokens=512,
    )
    return response.choices[0].message.content.strip()

def map_summaries(chunks):
    return [summarize_chunk(chunk) for chunk in chunks]

Step 5: Reduce to a final summary

If we only have one chunk, we return its summary directly. Otherwise, we concatenate the intermediate summaries and run one final pass to merge them into a coherent output.

def reduce_summaries(partial_summaries):
    if len(partial_summaries) == 1:
        return partial_summaries[0]

    combined = "\n\n".join(
        f"Section {i+1} summary:\n{s}" for i, s in enumerate(partial_summaries)
    )

    final_prompt = (
        "The following are summaries of different sections of a longer document. "
        "Synthesize them into one coherent summary with 3 to 5 bullet points and a single headline. "
        "Preserve key facts and remove redundancy.\n\n" + combined
    )

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": final_prompt},
        ],
        temperature=0.3,
        max_tokens=512,
    )
    return response.choices[0].message.content.strip()
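
With flat per-request pricing, the cost of a run depends on how many requests the pipeline makes: one per chunk in the map step, plus one reduce request when there is more than one chunk. A small sketch of that arithmetic follows. The function names are hypothetical and price_per_request is a placeholder, not a real Oxlo.ai rate.

```python
def count_requests(num_chunks):
    """Number of API requests the map-reduce pipeline makes for a document.

    One request per chunk, plus one reduce request when there is more than
    one chunk (a single chunk's summary is returned directly).
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    if num_chunks == 1:
        return 1
    return num_chunks + 1


def estimate_cost(num_chunks, price_per_request):
    """Estimated cost of one run under flat per-request pricing.

    price_per_request is a placeholder; use the rate from the pricing page.
    """
    return count_requests(num_chunks) * price_per_request
```

A larger max_chars means fewer chunks and therefore fewer requests, as long as each chunk still fits the model's context window.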

Step 6: Wire up the CLI

Add a small main block so we can run the tool from the terminal. It reads the file, chunks it, maps summaries, and reduces them.

if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        print("Usage: python summarize.py <file.txt>")
        sys.exit(1)

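    file_path = sys.argv[1]
    chunks = load_and_chunk(file_path)

    # An empty file yields no chunks; stop before making any API request.
    if not chunks:
        print(f"No text found in {file_path}.")
        sys.exit(1)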

    print(f"Loaded {len(chunks)} chunk(s). Summarizing...")
    partials = map_summaries(chunks)
    final = reduce_summaries(partials)

    print("\n" + final)
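
If you want to tune the chunk size from the terminal, the standard library's argparse can replace the manual sys.argv check. This is a sketch of one possible interface. The --max-chars flag and the build_parser name are my additions, not part of the original script.

```python
import argparse


def build_parser():
    """Build a CLI parser with a file path and an optional chunk size.

    The --max-chars flag is a hypothetical addition to the original tool.
    """
    parser = argparse.ArgumentParser(
        prog="summarize.py",
        description="Summarize a text file with a map-reduce LLM pipeline.",
    )
    parser.add_argument("file_path", help="path to a UTF-8 text file")
    parser.add_argument(
        "--max-chars",
        type=int,
        default=3000,
        help="maximum characters per chunk (default: 3000)",
    )
    return parser
```

In the main block, args = build_parser().parse_args() followed by load_and_chunk(args.file_path, max_chars=args.max_chars) would wire it in.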

Run it

Create a sample document and test the pipeline. Here is a fake quarterly update you can save as report.txt.

Q3 Engineering Update

The platform migration to Kubernetes finished six weeks ahead of schedule. 
API latency dropped by 34 percent, and error rates are now below 0.01 percent. 
We hired three senior backend engineers and closed two enterprise contracts worth 1.2 million dollars. 
Next quarter we plan to ship role-based access control and audit logging.

Run the tool:

export OXLO_API_KEY="YOUR_OXLO_API_KEY"
python summarize.py report.txt

Example output:

Loaded 1 chunk(s). Summarizing...

Q3 Engineering Update: Kubernetes Migration Ahead of Schedule and Strong Growth
- Completed Kubernetes migration six weeks early, reducing API latency by 34% and error rates to under 0.01%
- Expanded team by three senior backend engineers to support scaling efforts
- Secured two enterprise contracts valued at $1.2 million
- Planned Q4 deliverables include role-based access control and audit logging features

Next steps

Swap in kimi-k2.6 or deepseek-v4-flash to process far larger chunks, or even entire long documents in a single request thanks to their extended context windows on Oxlo.ai. You could also add JSON mode to return machine-readable summaries with fields like headline, key_points, and action_items for downstream automation.
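
For the JSON idea, OpenAI-compatible APIs commonly accept response_format={"type": "json_object"} on the chat completions call. Whether Oxlo.ai honors that parameter for a given model is an assumption you should verify. Either way, the response is worth validating before downstream code relies on it. A minimal sketch, with parse_summary_json as a hypothetical helper:

```python
import json

# Field names taken from the suggestion above.
REQUIRED_FIELDS = ("headline", "key_points", "action_items")


def parse_summary_json(raw):
    """Parse a JSON summary string and check its shape.

    Raises ValueError if the text is not valid JSON or a field is missing
    or has the wrong type. Hypothetical helper, not part of the original.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"Missing fields: {', '.join(missing)}")
    if not isinstance(data["headline"], str):
        raise ValueError("headline must be a string")
    for name in ("key_points", "action_items"):
        if not isinstance(data[name], list):
            raise ValueError(f"{name} must be a list")
    return data
```

The system prompt would also need to ask for exactly these fields, since the validator only checks what comes back.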