Summarizing text is one of the main use cases for large language models. Clients often want to summarize articles, financial documents, chat history, tables, pages, books, and more. We all expect the LLM to distill only the important pieces of information, especially from long texts. However, this isn't always possible with the expected level of quality. Even a larger token limit isn't a guaranteed solution. Fortunately, there are approaches that help summarize texts of different lengths - whether it's a couple of sentences, paragraphs, pages, an entire book, or an unknown amount of text.
This guide, built from Belitsoft's experience as a custom software development company, explores those advanced approaches. We provide full-cycle generative AI implementation, from selecting model architectures to deploying scalable systems for processing complex documents like legal contracts, medical records, and financial disclosures. In this article, we'll break down the practical techniques that make large-scale LLM summarization work.
Basic Prompts to Summarize a Couple of Sentences
This is the default behavior across almost any LLM, whether it's OpenAI, Anthropic, Mistral, Llama, or others.
In this case, we simply copy and paste some text from the source and put it inside a prompt, giving the LLM an instruction like: “Please provide a summary of the following passage”.
If the output is still a little too complicated, we can adjust the instructions to get a different type of summary, for example: “Please provide a summary of the following text. Your output should be in a manner that a five-year-old would understand”, to get a much more digestible result.
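As a minimal sketch, assuming the OpenAI Python SDK and a placeholder model name, that kind of instruction is a single API call:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

passage = "..."  # a short passage, roughly 150-200 words

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder - any chat-capable model works here
    messages=[
        {
            "role": "user",
            "content": (
                "Please provide a summary of the following text. "
                "Your output should be in a manner that a five-year-old "
                f"would understand:\n\n{passage}"
            ),
        }
    ],
)

print(response.choices[0].message.content)
```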
This approach works when there aren't too many tokens in the prompt (say, 200 tokens or about 150 words). But as the number of tokens grows, as with larger documents, summarization with basic prompts becomes inaccurate, and many things get omitted regardless of whether they were important to us.
Prompt Templates to Get the Summary in a Preferred Format
Prompt Templates help deal with the issue of inconsistent summary output across different texts - something that often happens when the input is long and you're using only a basic prompt like “Summarize this”.
For example, the prompt template may look like a rule: "Please write a one-sentence summary of the following text: {}". Notice that we ask for a "one-sentence summary" instead of just "summarize".
With Prompt Templates, we can influence the quality of summary output in specific directions: format and length (“1 sentence” or “3 bullet points”), tone (simple language, executive style), and focus (only risks, only outcomes, only decisions). Keeping the output structure uniform is what we need for automation.
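A sketch of such a template using LangChain's PromptTemplate (a plain Python f-string works just as well); the alternative focus and format lines are examples:

```python
from langchain.prompts import PromptTemplate

# One template reused for every document keeps the output format uniform
template = """Please write a one-sentence summary of the following text:

{text}"""

prompt = PromptTemplate(template=template, input_variables=["text"])

# The same rule can steer format, tone, or focus, for example:
# "Summarize the following text as 3 bullet points covering only the risks: {text}"
filled = prompt.format(text="<document text goes here>")
```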
MapReduce Method to Summarize… Summaries
Most LLM users think of summarization as "throw the whole text at the model → get a summary". That's why the output is often not as expected.
But the MapReduce method changes that into "break the text into chunks → summarize each → summarize the summaries". You first generate individual summaries (map), then combine and condense them into one final summary (reduce). This reflects how we deal with long texts: we read in parts, take notes, then consolidate them to get the big picture.
So again, the main idea of the MapReduce method is to “chunk our document into pieces (that fit within the token limit), get a summary of each individual chunk, and then finally get a summary of the summaries”.
The MapReduce method is mostly used for building custom apps or workflows that run on top of general-purpose LLMs (like GPT-4, Claude, Llama, etc.) using frameworks like LangChain or raw Python.
Here is the general workflow for summarization using the MapReduce technique:
- Load the input document into RAM (LangChain equivalent: open(file).read())
- Estimate whether the text exceeds the token limit (for example, 2,000 tokens) (LangChain equivalent: llm.get_num_tokens(text))
- Split the text into smaller chunks that fit within the LLM's context window (LangChain equivalent: RecursiveCharacterTextSplitter())
- Convert chunks into a structured format (for example, a list of texts or document objects) (LangChain equivalent: create_documents())
- Write a per-chunk prompt template to summarize each individual chunk (LangChain equivalent: map_prompt_template)
- Write a final combine prompt template that summarizes the chunk-level summaries into bullet points or another format (LangChain equivalent: combine_prompt_template)
- Run the Map phase - apply the per-chunk prompt to each chunk (LangChain: map_reduce)
- Run the Reduce phase - apply the final combine prompt to the intermediate summaries (LangChain: map_reduce)
- Output the final result - the combined summary (in list, bullet point, or paragraph form) (LangChain equivalent: print(output))
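Putting those steps together, here is a sketch of the workflow in LangChain. Import paths and chain APIs differ somewhat between LangChain versions, and the chunk sizes, prompts, and model choice are illustrative:

```python
from langchain_openai import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0)  # pick whichever chat model you use

# 1-2. Load the document and check its size against the token budget
text = open("report.txt").read()
print(llm.get_num_tokens(text))

# 3-4. Split the text into chunks that fit the context window, as document objects
splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=300)
docs = splitter.create_documents([text])

# 5. Per-chunk (map) prompt
map_prompt = PromptTemplate(
    template="Write a concise summary of the following text:\n\n{text}",
    input_variables=["text"],
)

# 6. Final combine (reduce) prompt
combine_prompt = PromptTemplate(
    template="Combine these summaries into 5 bullet points:\n\n{text}",
    input_variables=["text"],
)

# 7-8. Run the Map and Reduce phases
chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
)

# 9. Output the final combined summary
result = chain.invoke({"input_documents": docs})  # chain.run(docs) on older versions
print(result["output_text"])
```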
As you can see, if you try to build a real application that does reliable document summarization - say, for legal teams, financial analysts, internal knowledge search, or anything beyond casual reading - the tools available in the ChatGPT web interface or a raw API call aren't enough on their own.
Yes, you can upload a PDF into the web version of ChatGPT and ask it to apply the MapReduce method, and if the file is small and the content is simple, you'll get a decent summary. But you'll hit limitations: unpredictable behavior, such as content being skipped or compressed too aggressively. It's very hard to control how the content is split, to loop over each section with a consistent prompt, or to combine the outputs in a structured way.
Even if you use OpenAI API, you still have to build everything else yourself: chunk the input, manage prompts for each part, send multiple API calls, and then combine the outputs. The API just gives you the LLM - it doesn’t provide a system for managing workflows.
That’s where a middle layer comes in. You build a lightweight backend that handles the logic: read a long document, split it, summarize each piece with the same prompt, and combine the results at the end. This logic is what frameworks like LangChain or LlamaIndex help with, but you can also build it yourself in plain Python.
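If you skip the framework, the same middle layer can be hand-rolled in plain Python. A minimal sketch, assuming the OpenAI SDK, naive fixed-size character chunking, and placeholder file and model names:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def summarize_document(text: str, chunk_size: int = 8000) -> str:
    # Naive character-based chunking; a real system would split on tokens or sections
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Map: summarize every chunk with the same prompt
    partials = [
        ask(f"Write a concise summary of the following text:\n\n{chunk}")
        for chunk in chunks
    ]

    # Reduce: combine the intermediate summaries into one final answer
    joined = "\n\n".join(partials)
    return ask(
        "Combine these partial summaries into one coherent bullet-point "
        f"summary:\n\n{joined}"
    )

print(summarize_document(open("contract.txt").read()))
```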
Embeddings and Clustering to Summarize… Books
Some PDFs may contain as much content as a full book, and sometimes we want to summarize that amount of text as well. What kind of size are we talking about? For example, 140,000 tokens - roughly 100,000+ words.
If you send a prompt with that much text to a commercial LLM, it'll cost you a significant amount, even if the model can process it in one go. Commercial LLMs like ChatGPT charge you twice: once for the input tokens they process, and once for the output tokens they generate.
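For a rough sense of scale, the arithmetic looks like this (the per-token prices below are purely hypothetical - substitute your provider's current rates):

```python
# Illustrative arithmetic only - plug in your provider's actual prices
input_tokens = 140_000        # a book-length document
output_tokens = 1_000         # the summary itself
price_per_m_input = 2.50      # hypothetical $ per 1M input tokens
price_per_m_output = 10.00    # hypothetical $ per 1M output tokens

cost_per_call = (input_tokens * price_per_m_input
                 + output_tokens * price_per_m_output) / 1_000_000
print(f"${cost_per_call:.2f} per call, ${cost_per_call * 1_000:,.0f} per 1,000 documents")
```

The cost scales linearly with input size and with the number of documents, which is why reducing the tokens you send matters.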
Moreover, there’s something important to understand that should make you think twice before applying the MapReduce method to such a large amount of text: semantic similarity. Experienced readers know that a book rarely contains completely unique content from beginning to end. The same ideas are often repeated - just phrased in different ways.
That’s yet another reason against blindly chunking a book and applying MapReduce - you’ll likely send multiple chunks with the same meaning to the LLM, get nearly identical summaries, and overpay for it. That’s not the kind of situation smart people who care about costs want to end up in.
Starting from the idea of semantic similarity, we may realize that all we need to do before sending a book’s text to an LLM is remove parts that are similar (in other words, not important for the summary because they repeat the same ideas). So, in general, our goal becomes compressing the meaning before submitting it for processing.
This is exactly the stage where we start thinking about using text preprocessing methods like embedding (converting each text chunk into a vector that captures its meaning, like [0.11, -0.44, 0.85, ...], so we can measure similarity between chunks) and clustering (grouping similar vectors together to avoid redundancy and pick one best passage from each group).
So again, we’re not going to feed the entire book to the model - only the important parts, let’s say the 10 best sections that represent most of the meaning. We ignore the rest because it adds no new angle or dimension to what we want to learn from the summary.
At this stage, what we really want is to scientifically select only those sections of the book that represent a holistic and diverse view - covering the most important, distinct parts that describe the book best. To do this, we need to form “meaning clusters,” and from each diverse cluster, we want to select just one best representative - the one that is closest to the “cluster centroid” (each cluster has its own centroid).
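A minimal sketch of that selection step, assuming OpenAI embeddings and scikit-learn's KMeans; the embedding model, cluster count, and helper names are illustrative:

```python
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # Turn each chunk into a vector that captures its meaning
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

def pick_representative_chunks(chunks: list[str], n_clusters: int = 10) -> list[str]:
    vectors = embed(chunks)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(vectors)

    selected = []
    for i in range(n_clusters):
        # Distance of every chunk to this cluster's centroid
        distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)
        # Keep the single chunk closest to the centroid as the cluster's representative
        selected.append(int(np.argmin(distances)))

    # Return the chosen chunks in their original reading order
    return [chunks[i] for i in sorted(set(selected))]

# The ~10 representative chunks can then go through the MapReduce flow above
```

Only the selected chunks are then summarized, so the cost stays proportional to the number of clusters rather than the length of the book.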
Building AI Agents to Summarize Documents
Finally, what should we do if our workflow logic requires summarizing an unknown amount of text? We can use agents for this.
Such agents are able to handle complex tasks. For example, the question we want to ask the LLM requires several steps to answer: searching more than one source, summarizing, and combining the findings.
The agent should grab the first doc, pull out the key points, then do the same with the second, etc. After that, it should combine overlapping ideas and write the final answer. The agentic approach is designed to handle that chain of steps automatically.
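A deliberately simplified, hand-rolled sketch of that chain of steps is below. A real agent would plan the steps and pick its tools (search, retrieval, summarization) on its own, usually via an agent framework, but the shape of the workflow is the same; the model name and helper functions are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def answer_from_sources(question: str, documents: list[str]) -> str:
    # Steps 1..n: pull the key points out of each source, one document at a time
    notes = [
        ask(f"Extract the key points relevant to '{question}' from this document:\n\n{doc}")
        for doc in documents
    ]
    # Final step: merge overlapping ideas and write the answer
    merged = "\n\n".join(notes)
    return ask(
        f"Using these notes, merge overlapping ideas and answer the question "
        f"'{question}' in a short, well-structured summary:\n\n{merged}"
    )
```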