<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arun Venkataswamy</title>
    <description>The latest articles on DEV Community by Arun Venkataswamy (@arun_venkataswamy).</description>
    <link>https://dev.to/arun_venkataswamy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1679360%2Fba404864-deb4-4251-94db-b37aecfa2903.jpeg</url>
      <title>DEV Community: Arun Venkataswamy</title>
      <link>https://dev.to/arun_venkataswamy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arun_venkataswamy"/>
    <language>en</language>
    <item>
      <title>Exploring the Extractive Capabilities of Large Language Models – Beyond Generation and Copilots</title>
      <dc:creator>Arun Venkataswamy</dc:creator>
      <pubDate>Tue, 16 Jul 2024 16:29:57 +0000</pubDate>
      <link>https://dev.to/arun_venkataswamy/exploring-the-extractive-capabilities-of-large-language-models-beyond-generation-and-copilots-1i7h</link>
      <guid>https://dev.to/arun_venkataswamy/exploring-the-extractive-capabilities-of-large-language-models-beyond-generation-and-copilots-1i7h</guid>
      <description>&lt;p&gt;We have all seen the power of Large Language Models in the form of a GPT-based personal assistant from OpenAI called &lt;a href="https://chatgpt.com/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt;. You can ask questions about the world, ask for recipes, or ask it to generate a poem about a person. We have all been awestruck by the capabilities of this personal assistant.&lt;/p&gt;

&lt;p&gt;Unlike many other personal assistants, this is not a toy. It has significant capabilities that can increase your productivity. You can ask it to write marketing copy or a Python script for work, or to provide a detailed itinerary for a weekend getaway.&lt;/p&gt;

&lt;p&gt;This is powered by Large Language Models (LLMs) using a technology called Generative Pre-trained Transformer (GPT). LLMs are a subset of a broader category of AI models known as neural networks, which are systems inspired by the human brain. It all started with the pivotal paper “&lt;a href="https://research.google/pubs/attention-is-all-you-need/" rel="noopener noreferrer"&gt;Attention is all you need&lt;/a&gt;” released by Google in 2017.&lt;/p&gt;

&lt;p&gt;Since then, brilliant scientists and engineers have created and refined transformer models, driving groundbreaking changes that are disrupting the status quo in areas ranging from creative writing, language translation, image generation, and software coding to personalized education.&lt;/p&gt;

&lt;p&gt;These models harness the patterns in the vast quantities of text data they were trained on to predict and generate outputs. Until now, this path-breaking technology has been used by enterprises primarily in these areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personal assistants&lt;/li&gt;
&lt;li&gt;Chatbots&lt;/li&gt;
&lt;li&gt;Content generation (marketing copies, blogs, etc)&lt;/li&gt;
&lt;li&gt;Question answering over documents (RAG)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the main capabilities of these LLMs is their ability to reason within a given context. We do not know whether they reason the way humans do, but given the right prompts they show emergent behaviour that approximates reasoning. It might not match humans, but it is good enough to extract information from a given context. This extraction capability powers the question-answering use case of LLMs.&lt;/p&gt;

&lt;h2&gt;Structured data from unstructured sources&lt;/h2&gt;

&lt;p&gt;Multiple analysts estimate that up to 80% of the data held by enterprises exists in unstructured form: information stored in text documents, video, audio, social media, server logs, and so on. If enterprises could extract information from these unstructured sources, it would give them a huge competitive advantage.&lt;/p&gt;

&lt;p&gt;Unfortunately, extracting information from these unstructured sources today requires humans, and it is costly, slow, and error-prone. We could write applications to extract the information, but that would be a very difficult and expensive project, and in some cases impossible. Given the ability of LLMs to “see” patterns in text and perform some form of “pseudo reasoning”, they are a good choice for extracting information from these vast troves of unstructured data in the form of PDFs and other document files.&lt;/p&gt;

&lt;h2&gt;Defining our use case&lt;/h2&gt;

&lt;p&gt;For the sake of this discussion, let us define our typical use cases. These are actual real-world use cases that many of our customers have. Note that some customers need information extracted from tens of thousands of such documents every month.&lt;/p&gt;

&lt;p&gt;The information extracted could be simple, like personal data (name, email address, address), or complex, like line items (details of each product/service item in an invoice, details of all prior employers in a resume, etc.).&lt;/p&gt;

&lt;p&gt;Most of these documents are between 1 and 20 pages long and fit within the context windows of OpenAI’s GPT-4 Turbo and Google’s Gemini Pro LLMs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Information extraction from Invoices.&lt;/li&gt;
&lt;li&gt;Information extraction from Resumes.&lt;/li&gt;
&lt;li&gt;Information extraction from Purchase orders.&lt;/li&gt;
&lt;li&gt;Information extraction from Medical bills.&lt;/li&gt;
&lt;li&gt;Information extraction from Insurance documents.&lt;/li&gt;
&lt;li&gt;Information extraction from Bank and Credit card statements.&lt;/li&gt;
&lt;li&gt;Information extraction from SaaS contracts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Traditional RAG is overkill for many use cases&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation is a technique used in natural language processing that combines the capabilities of a pre-trained language model with information retrieval to enhance the generation of text. This method leverages the strengths of two different types of models: a language model and a document retrieval system. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG is typically used in question-answering scenarios: when we have many documents, or one large document, and want to answer a specific question, we use RAG techniques to:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Determine which document contains the information&lt;/li&gt;
&lt;li&gt;Determine which part of the document contains the information&lt;/li&gt;
&lt;li&gt;Send this part of the document as context, along with the question, to an LLM and get an answer&lt;/li&gt;
&lt;/ol&gt;
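&lt;p&gt;The three steps above can be sketched in a few lines of Python. This is a toy illustration with invented documents; real RAG systems use embedding models and vector stores (for example via LlamaIndex or LangChain) rather than the keyword-overlap scoring used here.&lt;/p&gt;

```python
# Toy sketch of the three RAG steps listed above. Real systems use embedding
# models and vector stores (e.g. via LlamaIndex or LangChain); the keyword
# overlap score below is only a stand-in for semantic similarity.

def score(question, text):
    """Steps 1 and 2 helper: count shared words between question and text."""
    return len(set(question.lower().split()).intersection(text.lower().split()))

def retrieve(question, documents, top_k=1):
    """Steps 1 and 2: pick the chunk(s) most relevant to the question."""
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    return ranked[:top_k]

def build_llm_prompt(question, documents):
    """Step 3: package the retrieved chunk and the question for an LLM."""
    context = "\n".join(retrieve(question, documents))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

docs = [
    "The invoice total is 557 rupees.",
    "Our refund policy allows returns within 30 days.",
]
prompt = build_llm_prompt("What is the invoice total?", docs)
```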

&lt;p&gt;The above steps cover the simplest RAG use case. Libraries like LlamaIndex and LangChain provide the tools to deploy a RAG solution, and they offer workflows for more complex and smarter RAG implementations. &lt;/p&gt;

&lt;p&gt;RAG is very useful for implementing smart chatbots that let employees or customers of enterprises interact with vast amounts of information. RAG can be used for information extraction too, but it would be overkill for many use cases, and sometimes needlessly expensive.&lt;/p&gt;

&lt;p&gt;We deal with customers who need information extracted from tens of thousands of documents every month, and the extracted information is not for human consumption: it goes straight into a database or to other downstream automated services. Here, simple prompt-based extraction can be far more efficient than traditional RAG, from both a cost and a computational-complexity perspective. More on this in the next section.&lt;/p&gt;

&lt;h2&gt;Prompt-based data extraction&lt;/h2&gt;

&lt;p&gt;The context windows of LLMs are increasing and the cost of LLM services is coming down. We can reasonably expect this trend to continue into the near future, and take advantage of it by using direct prompting techniques to extract information from documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source document&lt;/strong&gt;&lt;br&gt;
Let’s take a couple of restaurant invoices as the source documents to explore the extraction process. An enterprise might encounter hundreds of these documents in claims processing. Note that the two documents are completely different in form and layout; traditional machine learning and intelligent document processing (IDP) tools cannot parse both with the same training or setup. The true power of LLMs is their ability to understand context through language. We will see how LLMs can extract information from both documents using the same prompts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyywc50tg9f2mqer6bhot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyywc50tg9f2mqer6bhot.png" alt="poor-photo" width="800" height="1522"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Document #1 - Photo of printed restaurant invoice&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxxilcodet4re82srw88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxxilcodet4re82srw88.png" alt="Invoice" width="800" height="1288"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Document #2 - PDF of restaurant invoice&lt;/em&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Preprocessing&lt;/strong&gt;&lt;br&gt;
LLMs require plain text as input, which means every document needs to be converted to plain text first. The weakest link in an LLM-based extraction toolchain is this conversion of the original document into plain text that LLMs can consume.&lt;/p&gt;

&lt;p&gt;Most documents available in enterprises are in PDF format. PDFs can contain text, or their pages can be scanned documents that exist as images inside the file. Even when information is stored as text inside PDFs, extracting it is no simple task. PDFs were not designed as a text store: they contain layout information that reproduces the “document” for printing or viewing. The text inside a PDF can be broken and split at arbitrary places, and it does not always follow a logical order. But the layout information lets PDF rendering software reassemble the text so that it looks coherent to the human eye.&lt;/p&gt;

&lt;p&gt;For example, the simple text “Hello world, welcome to PDFs” could be split up as “Hello”, “world, wel ”, “come”, “to” and “PDFs”. And the order can be mixed up too. But precise location information would be available for the rendering software to reassemble the text visually.&lt;/p&gt;
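&lt;p&gt;Here is a toy illustration of the reassembly problem: the fragments below are invented, stored out of order, and carry only coordinates, so a converter has to sort them top-to-bottom and left-to-right to recover the sentence. Real extractors also have to handle fonts, rotation, and columns.&lt;/p&gt;

```python
# Toy illustration of reassembling PDF text fragments by position. The
# coordinates and fragments are invented for this example; real PDF-to-text
# tools must also handle fonts, rotation, and multi-column layouts.

def reassemble(fragments):
    """Sort fragments top-to-bottom, then left-to-right, and join them."""
    ordered = sorted(fragments, key=lambda f: (f["y"], f["x"]))
    return "".join(f["text"] for f in ordered)

# (x, y) origin at top-left; the fragments are stored out of order.
fragments = [
    {"x": 120, "y": 10, "text": "wel"},
    {"x": 0,   "y": 10, "text": "Hello "},
    {"x": 150, "y": 10, "text": "come "},
    {"x": 50,  "y": 10, "text": "world, "},
    {"x": 0,   "y": 30, "text": "to PDFs"},
]
print(reassemble(fragments))  # Hello world, welcome to PDFs
```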

&lt;p&gt;A PDF-to-text converter has to consider the layout information and try to reconstruct the text as the author intended, so that it makes grammatical sense. In the case of scanned PDF documents, the information is stored as images, and we need OCR to extract the text from the PDF.&lt;/p&gt;

&lt;p&gt;The following texts are extracted from the documents mentioned above using Unstract’s &lt;a href="https://unstract.com/llmwhisperer/" rel="noopener noreferrer"&gt;LLM Whisperer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data extracted from Document #1&lt;br&gt;
From the photo of the physical restaurant invoice&lt;br&gt;
&lt;a href="https://pg.llmwhisperer.unstract.com/" rel="noopener noreferrer"&gt;Extracted with LLMWhisperer&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffn9ywpgjzsmfk3j47jwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffn9ywpgjzsmfk3j47jwv.png" alt="extracted-text" width="800" height="860"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data extracted from Document #2&lt;br&gt;
PDF of restaurant invoice&lt;br&gt;
&lt;a href="https://pg.llmwhisperer.unstract.com/" rel="noopener noreferrer"&gt;Extracted with LLMWhisperer&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkymvbbj2gxw6dk3l15q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkymvbbj2gxw6dk3l15q.png" alt="extracted-text2" width="800" height="1069"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Extraction prompt engineering&lt;/h2&gt;

&lt;p&gt;Constructing an extraction prompt for an LLM is generally an iterative process: we keep tweaking the prompt until we can extract the information we require. For generalised extraction, where the same prompt has to work across multiple different documents, more care should be taken by experimenting with a sample set of documents. By “multiple different documents” I mean different documents with the same central context. Take the two documents in this article: their forms and layouts are completely different, but their context is the same. They are both restaurant invoices.&lt;/p&gt;

&lt;p&gt;The following prompt structure is what we use when dealing with relatively large LLMs like GPT-3.5, GPT-4 and Gemini Pro:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preamble&lt;/li&gt;
&lt;li&gt;Context&lt;/li&gt;
&lt;li&gt;Grammar&lt;/li&gt;
&lt;li&gt;Task&lt;/li&gt;
&lt;li&gt;Postamble&lt;/li&gt;
&lt;/ol&gt;
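&lt;p&gt;Assembling these five sections is mechanical enough to script. Here is a minimal sketch: the preamble and postamble strings are abridged from the examples in this article, and the helper simply concatenates whichever sections are present.&lt;/p&gt;

```python
# Minimal sketch of assembling the five-part prompt structure. The preamble
# and postamble are abridged from the examples in this article; the grammar
# section is optional and skipped when empty.

PREAMBLE = (
    "Your ability to extract and summarise this restaurant invoice accurately "
    "is essential for effective analysis. Only use the information provided "
    "in the context to answer the questions."
)
POSTAMBLE = "Do not include any explanation in the reply."

def build_prompt(context, task, preamble=PREAMBLE, grammar="", postamble=POSTAMBLE):
    parts = [preamble, "Context:\n----------\n" + context + "\n----------"]
    if grammar:
        parts.append("Note on terminology: " + grammar)
    parts.append(task)
    parts.append(postamble + "\n\nYour response:")
    return "\n\n".join(parts)

prompt = build_prompt(
    context="BURGER SEIGNEUR ... Total Invoice Value: 557",
    task="Extract the name of the restaurant.",
)
```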

&lt;p&gt;A &lt;strong&gt;preamble&lt;/strong&gt; is the text we prepend to every prompt. A typical preamble would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your ability to extract and summarise this restaurant invoice accurately is essential for effective analysis. Pay close attention to the context's language, structure, and any cross-references to ensure a comprehensive and precise extraction of information. Do not use prior knowledge or information from outside the context to answer the questions. Only use the information provided in the context to answer the questions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Context&lt;/strong&gt; is the text we extracted from the PDF or image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grammar&lt;/strong&gt; is used when we want to provide synonym information, especially for smaller models. For example, for the document type we are considering, restaurant invoices, an invoice may be called a “bill” in some countries. For the sake of this example, we will ignore grammar information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task&lt;/strong&gt; is the actual prompt or question you want to ask: the crux of the extraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postamble&lt;/strong&gt; is text we add to the end of every prompt. A typical postamble would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Do not include any explanation in the reply. Only include the extracted information in the reply.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that except for the context and the task, none of the other sections of the prompt is compulsory.&lt;/p&gt;

&lt;p&gt;Let’s put an entire prompt together and see the results. Let’s ignore the grammar bit for now. In this example, our task prompt would be,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Extract the name of the restaurant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire prompt to send to the LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your ability to extract and summarise this restaurant invoice accurately is essential for effective analysis. Pay close attention to the context's language, structure, and any cross-references to ensure a comprehensive and precise extraction of information. Do not use prior knowledge or information from outside the context to answer the questions. Only use the information provided in the context to answer the questions.

Context:
—-------

          BURGER       SEIGNEUR 

            No. 35, 80 feet road, 
              HAL 3rd Stage, 
           Indiranagar, Bangalore 
         GST: 29AAHFL9534H1ZV 

   Order Number    : T2- 57 

   Type : Table 
   Table   Number:     2 

   Bill   No .: T2 -- 126653 
  Date:2023-05-31   23:16:50 
  Kots: 63 

  Item               Qty    Amt 

  Jack The 
  Ripper           1        400.00 
  Plain Fries + 
  Coke 300 ML      1        130.00 

  Total Qty:        2 
  SubTotal:                 530.00 

  GST@5%                     26.50 
      CGST @2.5%       13.25 
      SGST @2.5%       13.25 

 Round Off :                  0.50 
 Total Invoice   Value:        557 

       PAY    : 557 

 Thank you, visit   again! 

Powered   by - POSIST

-----------
Extract the name of the restaurant.

Your response:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy and paste the above prompt into the ChatGPT assistant, or use the OpenAI API directly to complete the prompt. &lt;/p&gt;
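&lt;p&gt;If you take the API route, the prompt is sent as a chat message. Below is a minimal sketch; the &lt;code&gt;openai&lt;/code&gt; package, the model name, and the API key setup are illustrative assumptions, so the actual network call is left commented out.&lt;/p&gt;

```python
# Sketch of the API route. The openai package, model name, and environment
# setup are assumptions for illustration; the network call itself is left
# commented out, so this snippet only builds the request payload.

def make_messages(full_prompt):
    """Wrap the extraction prompt as a single-turn chat message list."""
    return [{"role": "user", "content": full_prompt}]

messages = make_messages("...the entire prompt shown above...")

# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
# response = client.chat.completions.create(
#     model="gpt-4-turbo", messages=messages, temperature=0
# )
# print(response.choices[0].message.content)
```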

&lt;p&gt;The result you get is this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The name of the restaurant is Burger Seigneur.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you just need the name of the restaurant and not a verbose answer, you can play around with the postamble or the task definition itself. Let’s change the task to be more specific:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Extract the name of the restaurant. Reply with just the name.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The result you get now is:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BURGER SEIGNEUR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you construct a similar prompt for document #2, you will get the following result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CHAI KINGS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Here is a list of task prompts and their results&lt;/strong&gt;&lt;br&gt;
Please note that if you use the same prompts in ChatGPT, the results can be a bit more verbose. These results are from Azure OpenAI with the GPT-4 Turbo model accessed through the API. You can always tweak the prompts to get the desired outputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task Prompt 1
Extract the name of the restaurant
Document 1 response
BURGER SEIGNEUR
Document 2 response
Chai Kings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task Prompt 2
Extract the date of the invoice
Document 1 response
2023-05-31
Document 2 response
07 March 2024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task Prompt 3
Extract the customer name if it is present. Else return null
Document 1 response
NULL
Document 2 response
Arun Venkataswamy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task Prompt 4
Extract the address of the restaurant in the following JSON format:
{
    "address": "",
    "city": "" 
}
Document 1 response
{ 
    "address": "No. 35, 80 feet road, HAL 3rd Stage, Indiranagar",
    "city": "Bangalore" 
}
Document 2 response
{
    "address": "Old Door 28, New 10, Kader Nawaz Khan Road, Thousand Lights",
    "city": "Chennai"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task Prompt 5
What is the total value of the invoice
Document 1 response
557
Document 2 response
₹196.84
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task Prompt 6
Extract the line items in the invoice in the following JSON format:
[
    {
        "item": "",
        "quantity": 0,
        "total_price": 0
    }
]
Document 1 response
[
    {
        "item": "Jack The Ripper",
        "quantity": 1,
        "total_price": 400
    },
    {
        "item": "Plain Fries + Coke 300 ML",
        "quantity": 1,
        "total_price": 130
    }
]
Document 2 response
[
    {
        "item": "Bun Butter Jam",
        "quantity": 1,
        "total_price": 50
    },
    {
        "item": "Masala Pori",
        "quantity": 2,
        "total_price": 50
    },
    {
        "item": "Ginger Chai",
        "quantity": 1,
        "total_price": 158
     }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see from the results above, LLMs are remarkably good at extracting information from a given context. A single prompt works across multiple documents with different forms and layouts. This is a huge step up from traditional machine learning models and methods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post-processing&lt;/strong&gt;&lt;br&gt;
We can extract almost any piece of information from a given context using LLMs. But sometimes it takes multiple passes with an LLM to get a result that can be sent directly to a downstream application. For example, if the downstream application or database requires a number, we have to convert the result to a number. Look at the invoice value extraction prompt in the results above: for document #2 the LLM returned ₹196.84, a number with a currency symbol. In this case we need one more step to convert the extracted information into an acceptable format. This can be done in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Programmatically: We can programmatically convert the result into a number format. But this is harder than it looks, since the formatting can include thousands separators, for example $1,456.34, which needs to become 1456.34. The separators also differ between locales: in some European formats the same value is written €1.456,34. &lt;/li&gt;
&lt;li&gt;With LLMs: Using an LLM to convert the result into the required format can be much easier. Since the full document context is not needed, the cost involved is also much smaller than the extraction itself. A prompt like “Convert the following to a number which can be directly stored in the database: $1,456.34. Answer with just the number. No explanations required.” will produce the output: 1456.34&lt;/li&gt;
&lt;/ol&gt;
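&lt;p&gt;To see why the programmatic route in point 1 is fiddly, here is a rough normalizer that guesses the decimal mark from whichever of “.” or “,” appears last in the string. It handles the examples above, but it is exactly the kind of heuristic, edge-case-prone code that makes the LLM-based route in point 2 attractive.&lt;/p&gt;

```python
# A rough programmatic normalizer for the separator problem in point 1:
# strip currency symbols, then guess which of "." or "," is the decimal
# mark from whichever appears last. A heuristic sketch, not a robust
# locale-aware parser.
import re

def to_number(text):
    digits = re.sub(r"[^0-9.,]", "", text)   # drop currency symbols and spaces
    last_dot, last_comma = digits.rfind("."), digits.rfind(",")
    if last_comma > last_dot:                 # comma is the decimal mark
        digits = digits.replace(".", "").replace(",", ".")
    else:                                     # dot is the decimal mark
        digits = digits.replace(",", "")
    return float(digits)

print(to_number("$1,456.34"))   # 1456.34
print(to_number("€1.456,34"))   # 1456.34
print(to_number("₹196.84"))     # 196.84
```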

&lt;p&gt;As with numbers, we might have to post-process the results for dates and boolean values too.&lt;/p&gt;

&lt;h2&gt;Introducing Unstract and LLMWhisperer&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://unstract.com/" rel="noopener noreferrer"&gt;Unstract&lt;/a&gt; is a no-code platform to eliminate manual processes involving unstructured data using the power of LLMs. The entire process discussed above can be set up without writing a single line of code. And that’s only the beginning. The extraction you set up can be deployed in one click as an API or ETL pipeline.&lt;/p&gt;

&lt;p&gt;With an API deployment, you can expose an API to which you send a PDF or an image and get back structured data in JSON format. With an ETL deployment, you can simply drop files into Google Drive, an Amazon S3 bucket, or a variety of other sources, and the platform will run the extractions and store the extracted data in a database or a warehouse like Snowflake automatically. Unstract is open-source software and is available at &lt;a href="https://github.com/Zipstack/unstract" rel="noopener noreferrer"&gt;https://github.com/Zipstack/unstract&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to quickly try it out, sign up for our free trial. More information &lt;a href="https://unstract.com/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;LLMWhisperer is a &lt;a href="https://unstract.com/llmwhisperer/" rel="noopener noreferrer"&gt;document-to-text converter&lt;/a&gt; that preps data from complex documents for use in Large Language Models. LLMs are powerful, but their output is only as good as the input you provide. Documents can be a mess: widely varying formats and encodings, scans of images, numbered sections, and complex tables. Extracting data from these documents and blindly feeding it to LLMs is not a recipe for reliable results. LLMWhisperer presents data from complex documents to LLMs in a way they can best understand.&lt;/p&gt;

&lt;p&gt;If you want to take it for a quick test drive, check out our &lt;a href="https://pg.llmwhisperer.unstract.com/" rel="noopener noreferrer"&gt;free playground&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note: I originally posted this on the &lt;a href="https://unstract.com/blog/extractive-capabilities-of-large-language-models/" rel="noopener noreferrer"&gt;Unstract blog&lt;/a&gt; a couple of weeks ago.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>PDF Hell and Practical RAG Applications</title>
      <dc:creator>Arun Venkataswamy</dc:creator>
      <pubDate>Mon, 01 Jul 2024 11:01:34 +0000</pubDate>
      <link>https://dev.to/arun_venkataswamy/pdf-hell-and-practical-rag-applications-api</link>
      <guid>https://dev.to/arun_venkataswamy/pdf-hell-and-practical-rag-applications-api</guid>
<description>&lt;p&gt;If you have tried to extract text from PDFs, you will have come across a myriad of complications. It is relatively easy to do a POC or an experiment, but handling real-world PDFs consistently is a tremendously difficult problem to solve. In this blog post, we explore a common but often difficult challenge: &lt;a href="https://unstract.com/"&gt;extracting text from PDFs&lt;/a&gt; for use in RAG, natural language processing, and other applications of large language models (LLMs). While PDFs are a universal and ubiquitous format, valued for their ability to preserve the layout and integrity of content across different platforms, they were not originally designed for easy extraction of the text they contain. This presents a unique set of challenges for developers who need to repurpose content from PDF documents into dynamic, text-based applications.&lt;/p&gt;

&lt;p&gt;Our experience stems from building &lt;a href="https://unstract.com/llmwhisperer/"&gt;LLMWhisperer&lt;/a&gt;, a Text Extraction service that extracts data from images and PDFs, preparing it and optimizing it for consumption by Large Language Models or LLMs.&lt;/p&gt;

&lt;h2&gt;Advanced PDF Text Extractor Architecture&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5mirgw7k20avqjbawls.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5mirgw7k20avqjbawls.png" alt="Advanced PDF Text Extractor Architecture" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Why is it difficult to extract meaningful text from PDFs?&lt;/h2&gt;

&lt;p&gt;PDFs are primarily designed to maintain the exact layout and presentation of content across varied devices and platforms, ensuring that documents look the same regardless of where they are viewed or printed. This design goal is highly beneficial for document preservation, consistent printing, and sharing fully formatted documents between users. Another popular use case is PDF forms that can be filled out electronically and portably.&lt;/p&gt;

&lt;p&gt;However, this very strength of the PDF format can become a challenge when extracting text for RAG or natural language processing (NLP) applications. Let’s delve a little deeper into how text is organized in PDFs. Refer to the figure below. Text in a PDF file is organized as text frames or records. It is based on a fixed layout and lacks any logical or semantic structure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqhqhzvnv70izvt1yh7i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqhqhzvnv70izvt1yh7i.png" alt="A sample PDF file opened in Libreoffice shows how a PDF file is organized" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: &lt;a href="https://www.libreoffice.org/"&gt;LibreOffice&lt;/a&gt; is a good tool for opening PDFs to understand how they are organized. It opens PDF documents in its drawing tool; you can make minor edits, but it is not really designed for editing PDFs.&lt;/p&gt;

&lt;h3&gt;Fixed Layout&lt;/h3&gt;

&lt;p&gt;The fixed layout of PDFs is essential for ensuring documents appear identical across different platforms and devices (unlike in say, HTML where text generally adapts to the device’s form factor it’s being displayed on). This fixed layout feature is particularly valuable in contexts like legal documents, invoices, academic papers, and professional publications, where formatting is important. However, for NLP tasks, this fixed layout presents several issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-linear Text Flow:&lt;/strong&gt; Text in PDFs might be visually organized in columns, sidebars, or around images. This makes intuitive sense to a human reader navigating the page visually, but when the text is extracted programmatically, the order can come out mixed up. For example, a text extraction tool might read across a two-column format from left to right, resulting in sentences that alternate between columns, completely breaking the text semantically.&lt;/p&gt;
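&lt;p&gt;A toy demonstration of the column problem, with invented coordinates: reading fragments strictly row by row interleaves the two columns, while splitting on an assumed column boundary first preserves each column’s text.&lt;/p&gt;

```python
# Invented text fragments from a two-column page; x is the horizontal position.
frags = [
    {"x": 0,   "y": 0,  "text": "LLMs are"},   # left column
    {"x": 300, "y": 0,  "text": "PDFs are"},   # right column
    {"x": 0,   "y": 20, "text": "powerful."},  # left column
    {"x": 300, "y": 20, "text": "tricky."},    # right column
]

def naive_order(frags):
    """Row by row, left to right: the columns get interleaved."""
    ordered = sorted(frags, key=lambda f: (f["y"], f["x"]))
    return " ".join(f["text"] for f in ordered)

def column_aware_order(frags, column_split=200):
    """Read everything left of the split first, then the right column."""
    ordered = sorted(frags, key=lambda f: (f["x"] >= column_split, f["y"]))
    return " ".join(f["text"] for f in ordered)

print(naive_order(frags))         # LLMs are PDFs are powerful. tricky.
print(column_aware_order(frags))  # LLMs are powerful. PDFs are tricky.
```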

&lt;p&gt;&lt;strong&gt;Position-Based Text:&lt;/strong&gt; Since text placement in PDFs is based on exact coordinates rather than relational structure, extracting text often yields strings of content without the contextual positioning that would inform a reader of headings, paragraph beginnings, or document sections. This spatial arrangement must be programmatically interpreted, which is not always straightforward and often requires advanced processing to deduce the structure from the raw coordinates.&lt;/p&gt;
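&lt;p&gt;As a minimal sketch of what that interpretation involves, here is a hypothetical helper that regroups extracted word boxes into visual lines using only their coordinates. The word dictionaries (with &lt;code&gt;text&lt;/code&gt;, &lt;code&gt;x0&lt;/code&gt; and &lt;code&gt;top&lt;/code&gt; keys) mirror the shape of pdfplumber's &lt;code&gt;extract_words()&lt;/code&gt; output, and the vertical tolerance is an assumption you would tune per document:&lt;/p&gt;

```python
def group_into_lines(words, y_tol=3):
    """Group word boxes (dicts with 'text', 'x0', 'top') into visual lines.

    Words whose vertical positions differ by less than y_tol are treated
    as belonging to the same line; each line is then sorted left to right.
    """
    lines = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(lines[-1][0]["top"] - w["top"]) < y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]
```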

&lt;h3&gt;
  
  
  Lack of Logical Structure
&lt;/h3&gt;

&lt;p&gt;While the format theoretically supports it, PDFs in the wild most often do not encode the semantic structure of their content. A visually formatted document might appear to have a clear organization into headings, paragraphs, and sections, but this structure is often not explicitly represented in the PDF's internal data hierarchy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visual vs. Semantic Markup:&lt;/strong&gt; Unlike HTML, which uses tags to denote headings, paragraphs, and other content blocks, PDFs typically lack these semantic markers. Text might be larger or in bold to indicate a heading to a human, but without proper tagging, a text extraction tool sees only a string of characters. This makes it difficult to programmatically distinguish between different types of content like titles, main text, or captions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Absence of Standard Structure Tags:&lt;/strong&gt; Although PDF/A (an ISO-standardized version of PDF specialized for archiving and long-term preservation) and tagged PDFs exist, most PDFs in the real world do not take advantage of these enhancements. Tagged PDFs include metadata about document structure, which aids in reflowing text and improving accessibility. Without these tags, automated tools must rely on heuristic methods to infer the document structure, such as analyzing font sizes and styles, indentation, or the relative position on the page.&lt;/p&gt;

&lt;p&gt;To address these challenges in NLP use cases, we might have to write sophisticated and hybrid document analysis tools that combine optical character recognition (OCR) and machine learning models that can learn from large datasets of documents to better predict and reconstruct the logical ordering of text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools/Libraries for text extraction from text PDFs
&lt;/h3&gt;

&lt;p&gt;A list of popular Python libraries for extracting text from PDFs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/jsvine/pdfplumber"&gt;pdfplumber&lt;/a&gt; (Our favorite, it is based on pdfminer.six)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/PyPDF2/"&gt;PyPDF2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/pdfminer.six/"&gt;pdfminer.six&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pymupdf/PyMuPDF"&gt;PyMuPDF&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each library has its own pros and cons. Choosing the right one will be based on what type of PDF documents you are going to process and/or the eventual use of the text extracted.&lt;/p&gt;
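&lt;p&gt;To give a feel for the API, here is a minimal sketch of whole-document extraction with pdfplumber. The form-feed page separator and the lazy import are our own choices for illustration, not anything the library mandates:&lt;/p&gt;

```python
def join_pages(page_texts):
    # Join per-page text with form feeds so page boundaries survive
    # into downstream chunking steps.
    return "\f".join(t or "" for t in page_texts)

def pdf_to_text(path):
    # pdfplumber (built on pdfminer.six) reassembles words and lines
    # from the raw positioned text records described above.
    import pdfplumber
    with pdfplumber.open(path) as pdf:
        return join_pages(page.extract_text() for page in pdf.pages)
```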

&lt;h2&gt;
  
  
  Why is it even more difficult to extract meaningful text from PDFs?
&lt;/h2&gt;

&lt;p&gt;Many PDFs are not “text” PDFs. They contain scanned or photographed images of pages. In these cases the only option is to either extract the image from the PDF or convert the PDF pages to images and then use an OCR application to extract the text from these images. Then the output from the OCR should be reconstructed as a page of text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d30ofpgtiq8earg3gky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d30ofpgtiq8earg3gky.png" alt="A sample PDF file opened in Libreoffice to show&amp;lt;br&amp;gt;
how a PDF file with scanned contents  is organized" width="800" height="611"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A sample PDF file opened in Libreoffice to show how a PDF file with scanned contents  is organized&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Preprocessing
&lt;/h3&gt;

&lt;p&gt;Many scanned PDFs are not perfect. Scanned images might contain unwanted artifacts which will degrade OCR output quality. If the PDF contains a photo of a document page rather than a proper scan of it, the issues you might face are potentially multiplied: lighting conditions, rotation, skew, coverage and compression levels of the original photo might degrade OCR output quality even further.&lt;/p&gt;

&lt;p&gt;Preprocessing is an important step which might need to be taken up before sending the image to OCR. Preprocessing typically involves noise reduction, rescaling, de-rotation, cropping, level adjustments and grayscale conversion. Note that some of the OCR providers have the preprocessing step built in. For example when you use &lt;a href="https://unstract.com/llmwhisperer/"&gt;LLMWhisperer&lt;/a&gt;, preprocessing is done automatically and frees the user from worrying about it.&lt;/p&gt;
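&lt;p&gt;A minimal sketch of such a preprocessing pass using Pillow is shown below. The specific filter choices and their order are illustrative assumptions, not a recommended recipe; de-rotation and cropping would typically need additional tooling such as OpenCV:&lt;/p&gt;

```python
from PIL import Image, ImageFilter, ImageOps

def preprocess_for_ocr(img):
    img = img.convert("L")                         # grayscale conversion
    img = img.filter(ImageFilter.MedianFilter(3))  # basic noise reduction
    img = ImageOps.autocontrast(img)               # level adjustment
    return img
```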

&lt;h3&gt;
  
  
  OCR
&lt;/h3&gt;

&lt;p&gt;If you’ve read thus far, you probably already know OCR stands for Optical Character Recognition. It represents a family of technologies that convert images that contain text to machine readable text (generally speaking, conversion of text in images to ASCII or Unicode). It is a technology that is incredibly useful in digitizing printed text or text images leading to the ability of editing, searching and storing the contents of the original document. In the context of this blog post, it helps us extract text from scanned documents or photographed pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools/Libraries for text extraction from scanned/image PDFs
&lt;/h3&gt;

&lt;p&gt;A small list of utilities for extracting text from images. Note that this is only a small subset; there are a lot of tools out there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Locally runnable

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tesseract-ocr/tessdoc"&gt;Tesseract&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/PaddlePaddle/PaddleOCR"&gt;Paddle OCR&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Cloud services

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://azure.microsoft.com/en-in/products/ai-services/ai-document-intelligence"&gt;Azure Document Intelligence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/document-ai?hl=en"&gt;Google Document AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/textract/"&gt;Amazon Textract&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choosing an OCR is based on multiple factors, not the quality of extraction alone. OCR is a continuously evolving technology, and recent improvements in machine learning have pushed extraction quality to new heights. But unfortunately not everyone has access to high-end CPUs and GPUs to run the models. The cloud services from the big three have very high quality OCRs, but if you have constraints on user privacy and confidentiality, cloud-based services might not be an option for you.&lt;/p&gt;
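&lt;p&gt;For a locally runnable baseline, a thin wrapper around Tesseract via the pytesseract bindings might look like the sketch below. It assumes the Tesseract binary is installed on the machine, and the language code is an illustrative parameter:&lt;/p&gt;

```python
def ocr_image(image, lang="eng"):
    # pytesseract shells out to the locally installed Tesseract binary
    # and returns the recognized text as a plain string.
    import pytesseract
    return pytesseract.image_to_string(image, lang=lang)
```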

&lt;h2&gt;
  
  
  And the other woes
&lt;/h2&gt;

&lt;p&gt;Apart from the difficulties created by the actual format itself, functional requirements and the quality of PDFs can add to the complexities of extracting text from them. Samples from the real world can have a bewildering list of issues making it extremely challenging to extract text. Based on our experience developing and running LLMWhisperer, here are some functional and quality issues we commonly see in the wild.&lt;/p&gt;

&lt;h3&gt;
  
  
  Searchable PDFs
&lt;/h3&gt;

&lt;p&gt;This format allows the document to maintain the visual nature of the original scanned image while also including searchable and selectable text thanks to the OCR’d layer. This makes it easy to search for specific words or phrases within the document, which would not be possible with a simple image based PDF. Take a look at the image below. The top is how it appears in a PDF viewer. The bottom image has been adjusted to show the two layers. The gray layer is the original scanned image. The white text is the OCR’d text which has been added to the PDF and hidden behind the original scanned image. This is what is “selectable” when seen in a PDF viewer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftva7mv0grzvt0nk2ef3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftva7mv0grzvt0nk2ef3a.png" alt="A sample searchable PDF file containing a scanned image layer and&amp;lt;br&amp;gt;
a searchable text layer which has been OCR’d and added." width="800" height="788"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A sample searchable PDF file containing a scanned image layer and&lt;br&gt;
a searchable text layer which has been OCR’d and added.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This searchable feature is very useful when humans are interacting with the document. But when we want to extract the text programmatically it introduces a bunch of difficulties:&lt;/p&gt;

&lt;h4&gt;
  
  
  Detecting whether it is a searchable PDF
&lt;/h4&gt;

&lt;p&gt;We could detect if there is a large image covering the entire page while also looking for text records in the PDF. But this does not work all the time because many PDFs like certificates or fancy brochures have a background image which can be mistaken for a scanned PDF. This is a difficult problem to solve.&lt;/p&gt;
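&lt;p&gt;A crude first-pass heuristic can be sketched from two cheap signals: the number of native text characters on a page, and the fraction of the page area covered by images. The thresholds below are illustrative guesses, and, as noted above, background images will fool this kind of check:&lt;/p&gt;

```python
def classify_page(char_count, image_area_ratio):
    """Heuristic guess at a page's nature from two cheap signals:
    the number of native text characters on the page and the fraction
    of the page area covered by images. Thresholds are guesses to tune."""
    if char_count == 0 and image_area_ratio > 0.5:
        return "scanned"
    if char_count > 0 and image_area_ratio > 0.8:
        return "searchable"  # full-page image plus a hidden text layer
    return "text"
```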

&lt;h4&gt;
  
  
  Quality of the original OCR
&lt;/h4&gt;

&lt;p&gt;Not all OCRs are great. The original OCR used to generate the searchable text might have created a low quality text base. This is not often easy to detect and objectively quantify especially when there is no human in the loop. In these cases, it is better to consider the document as a purely scanned PDF and take the OCR route and use your own OCR for text extraction, hoping yours is better than the one used to generate the original text.&lt;/p&gt;

&lt;h4&gt;
  
  
  Searchable PDFs are for searching and not full-text extraction
&lt;/h4&gt;

&lt;p&gt;The text records available in these PDFs are not meant for extraction use cases, and they can be split at random locations. Take a look at the example shown above. Text frames/records “one” and “space-character-to…” are part of the same sentence but are split. When rebuilding text for NLP purposes it is difficult to merge them without complex techniques. Another example is the text “Learning Algorithms” in the figure above. This title is not only split into two words, but since the text is large, the original OCR overlay system has double spaced the characters (to match the location of the letters) in the result (take a look at the right pane). There are two records: “L e a r n i n g” and “A l g o r i t h m s”. Un-double-spacing these characters during extraction is, again, a difficult problem to solve. There is also a mistake in positions: “Algorithms” has backed into “Learning”, creating an overlap. Just everyday difficulties extracting text from PDFs!&lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting tables
&lt;/h3&gt;

&lt;p&gt;Unlike HTML or other document formats, PDF is a fixed layout format. This makes it very difficult to semantically understand where a table is and how it is organized. There are many approaches to extracting tables. Some of them try to understand the layout and some of them use computer vision based libraries to detect tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Popular Python libraries to extract tables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://camelot-py.readthedocs.io/en/master/"&gt;Camelot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/tabula-py/"&gt;Tabula&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jsvine/pdfplumber"&gt;Pdfplumber&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/pdftables/"&gt;Pdftables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ashima/pdf-table-extract"&gt;Pdf-table-extract&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
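&lt;p&gt;As an illustration of the layout-analysis route, pdfplumber exposes an &lt;code&gt;extract_tables()&lt;/code&gt; call that returns each table as a list of rows of cell strings. The TSV serialization helper here is our own convenience, not part of the library:&lt;/p&gt;

```python
def table_to_tsv(table):
    # A table arrives as a list of rows, each a list of cell strings
    # (cells can be None when empty).
    return "\n".join("\t".join(cell or "" for cell in row) for row in table)

def extract_all_tables(path):
    import pdfplumber
    with pdfplumber.open(path) as pdf:
        return [t for page in pdf.pages for t in page.extract_tables()]
```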

&lt;p&gt;Some of the common approaches used are:&lt;/p&gt;

&lt;h4&gt;
  
  
  Rules based extraction
&lt;/h4&gt;

&lt;p&gt;This approach defines a set of rules and tries to identify table data using the rules. The rules can be based on identifiable markers of cells or boundaries, keywords and other similar items. This is effective when the format of the PDF remains consistent. This works very well when all the documents we process are of the same format or variety. Unfortunately in the real world, PDFs come in so many different forms, a simple rule based approach is not very reliable except for certain controlled use cases.&lt;/p&gt;

&lt;h4&gt;
  
  
  Computer vision
&lt;/h4&gt;

&lt;p&gt;This approach uses computer vision models to detect lines that can be used to identify tables. The visual structure is analyzed to differentiate between rows, columns and cells. This can be used for identifying tables where traditional approaches fail. But keep in mind that this involves adding machine learning libraries and models which is going to bloat your application and will require some serious CPU (or GPU) horsepower to keep it quick. While this provides good results in many use cases, many more PDFs in the real world have tables which do not have good visual differentiation (fancy tables with colors used to define cells etc). Also note that this requires converting even text PDFs to images for the CV libraries to work. This can get very resource intensive, especially for longer documents.&lt;/p&gt;

&lt;h4&gt;
  
  
  Machine learning
&lt;/h4&gt;

&lt;p&gt;Machine learning models can be trained to recognize structures and patterns that are typical of tables. Machine learning models can give better results than computer vision-based systems as they understand the context rather than depending only on visual cues. Again, just like computer vision, machine learning also increases the footprint of your application and requires more resources to run. Also, training a model from scratch is a pretty involved process and getting training data might not be an easy task. It is best to depend on ready made table extraction libraries mentioned earlier.&lt;/p&gt;

&lt;h4&gt;
  
  
  Hybrid approach
&lt;/h4&gt;

&lt;p&gt;In the real world, no single approach works for a broad variety of document types. We most likely will have to settle for a combination of techniques to reliably extract tables from PDFs.&lt;/p&gt;

&lt;h4&gt;
  
  
  LLMWhisperer’s approach
&lt;/h4&gt;

&lt;p&gt;We at Unstract designed &lt;a href="https://unstract.com/llmwhisperer/"&gt;LLMWhisperer&lt;/a&gt; to extract and reproduce a table’s &lt;em&gt;layout&lt;/em&gt; faithfully, along with hints on where each cell is, rather than trying to extract the table’s individual rows and columns. Most of our customers use the extracted text to drive LLM/RAG/search use cases, and this approach works great. From our experience, LLMs are able to comprehend tables when the layout is preserved. There is no need to bend over backwards to recreate the whole table from the PDF as an HTML or Markdown table: LLMs are smart enough to figure out the contents of most tables when the table’s layout is preserved in the output, with tabs or spaces separating columns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Page orientation
&lt;/h3&gt;

&lt;p&gt;A PDF file’s pages can be organized in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Portrait mode&lt;/li&gt;
&lt;li&gt;Landscape mode&lt;/li&gt;
&lt;li&gt;Hybrid, portrait and landscape mode&lt;/li&gt;
&lt;li&gt;Scanned pages in landscape mode which are rotated 90°, 180°, 270°&lt;/li&gt;
&lt;li&gt;Scanned pages or photographed pages might be rotated arbitrarily by ±30°&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvpz9qgs4r0q7z29cdci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvpz9qgs4r0q7z29cdci.png" alt="Sample of a scanned PDF which has been rotated while photographing the original" width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sample of a scanned PDF which has been rotated while photographing the original&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Trying to extract text from portrait mode or landscape mode is relatively simple. The extraction becomes more difficult when we have a hybrid PDF in which some pages are in portrait mode and some are in landscape mode. If it is a text based PDF, it is relatively easier, but for scanned PDFs we need to detect this change using direct or indirect methods. When dealing with pages that are arbitrarily rotated (especially PDFs created from photographed documents) detection and correction is never easy. We will have to use image processing libraries and probably machine learning to automatically correct such pages before sending them to an OCR.&lt;/p&gt;
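&lt;p&gt;For the 90°/180°/270° cases, Tesseract's orientation-and-script detection (OSD) is one practical signal. A sketch, with the OSD report parsing split out so it can be exercised without Tesseract installed:&lt;/p&gt;

```python
def parse_osd_rotation(osd_text):
    # Tesseract's OSD report contains a line such as "Rotate: 90",
    # the clockwise rotation needed to make the page upright.
    for line in osd_text.splitlines():
        if line.startswith("Rotate:"):
            return int(line.split(":")[1])
    return 0

def detect_rotation(image):
    import pytesseract
    return parse_osd_rotation(pytesseract.image_to_osd(image))
```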

&lt;h3&gt;
  
  
  Bad (for extraction) PDF generators
&lt;/h3&gt;

&lt;p&gt;Some PDF generators will consider every element inside the documents as “curves”. Even characters of the language are stored as “curve” representations. This has certain advantages as it can be reproduced in every medium without the requirement of having font information. But it makes it very difficult to extract text from. The only way to extract text from these documents is to convert the pages to images and then use an OCR for extraction. Figuring out that the given PDF has curves instead of text is a step which needs to be performed before attempting to extract.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd0aljxml07be4tf3y0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd0aljxml07be4tf3y0g.png" alt="A zoomed-out portion of a PDF file with curves instead of text.&amp;lt;br&amp;gt;
Each character is represented as a Bezier curve" width="800" height="613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi column page layout and fancy layouts
&lt;/h3&gt;


&lt;p&gt;Multi column page layout is very common in scientific publications and documents like invoices. The text is laid out as two columns as shown in the image below. As mentioned earlier, text in PDFs has a fixed layout, which makes it very difficult to semantically extract the text as paragraphs from these types of documents. We need heuristics to intelligently extract text in a semantic order. Some text based PDF generators are smart enough to arrange the text records in semantic order, but, as always, in the wild we have to be prepared to encounter badly created PDFs with absolutely no semantic ordering of text records. With scanned documents (with or without searchable content), we have no option but to use intelligent methods to understand multi-column layouts and extract text that makes semantic sense.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzoxdre3ths7gsdvsbl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzoxdre3ths7gsdvsbl5.png" alt="A two column PDF file.The lines and arrows indicate how text records are organized in a multi column PDF." width="800" height="267"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A two column PDF file.The lines and arrows indicate how text records are organized in a multi column PDF.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the example shown above, text records can be organized in a semantically correct order as shown in the red lines. But in some PDFs (and all OCR’d documents) text records can be organized in a non-semantic order reading left to right over to the next column before moving to the next line. When text is collected this way, the final text will make no sense to downstream pipeline steps. We need smart ways to reorganize such text to make semantic sense.&lt;/p&gt;
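&lt;p&gt;For the simple two-column case, one naive reordering strategy is to split word boxes at the page's horizontal midpoint and read each column top to bottom. This sketch again assumes word dictionaries shaped like pdfplumber's &lt;code&gt;extract_words()&lt;/code&gt; output, and it will fail on pages that mix full-width and columnar regions:&lt;/p&gt;

```python
def reorder_two_columns(words, page_width):
    # Split word boxes at the page midline, then read the left column
    # top to bottom followed by the right column.
    mid = page_width / 2
    left = [w for w in words if w["x0"] < mid]
    right = [w for w in words if w["x0"] >= mid]
    ordered = (sorted(left, key=lambda w: (w["top"], w["x0"]))
               + sorted(right, key=lambda w: (w["top"], w["x0"])))
    return " ".join(w["text"] for w in ordered)
```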

&lt;p&gt;Note that the problems described above are also applicable to pages with fancy layouts like invoices, receipts and test reports.&lt;/p&gt;

&lt;h3&gt;
  
  
  Background images and Watermarks
&lt;/h3&gt;

&lt;p&gt;Background images in PDF files can be a problem for both text based PDFs and scanned PDFs. In text based PDFs, the extractor can mistake the background image for a scanned page and switch to OCR based extraction, which is hundreds of times slower and costs far more. In OCR based extraction, a background image with contrasting colors or patterns can confuse the OCR, especially when the background image and the text in front of it have little contrast difference, for example black text on top of a dark coloured background image. Human eyes can easily pick it up, but for many OCR systems it is a challenge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjlpv1fhf14my0taw1ze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjlpv1fhf14my0taw1ze.png" alt="Sample PDF with a strong watermark which can interfere with text extraction" width="800" height="418"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sample PDF with a strong watermark which can interfere with text extraction&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Some background images are watermarks and these watermarks can be text. When using OCR for extraction, these watermark texts can get added into the main body of texts. This is also the case for fancy backgrounds containing text in certificates etc.&lt;/p&gt;

&lt;p&gt;In some cases, while using OCR for extracting text (which is the only way for scanned PDFs) background images with text can completely ruin text extraction, making it unextractable without human intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handwritten forms
&lt;/h3&gt;

&lt;p&gt;PDFs with handwritten text are scanned document PDFs: typically forms or documents annotated by hand and then scanned. Not all OCRs are capable of handwriting recognition, and those that are might be prohibitively expensive, especially when processing larger volumes.&lt;/p&gt;

&lt;h3&gt;
  
  
  PDFs with form elements like checkboxes and radio buttons
&lt;/h3&gt;

&lt;p&gt;A PDF form is a document that includes interactive fields where users can enter information. PDF forms can contain text fields, checkboxes, radio buttons, drop-down menus, and other interactive elements that allow users to input or select information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpz5uf6rypw740v6bcom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpz5uf6rypw740v6bcom.png" alt="Sample PDF with form elements" width="800" height="540"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sample PDF with form elements&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Many PDF libraries are not capable of extracting form elements from a PDF. Even fewer can extract the contents the user has filled into those elements. Even if we decide to convert the form into an image for OCR use, there are a couple of issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The PDF to image conversion software or library should understand form elements. Very few of them support this. PDF.js supports it, but that sits best in a NodeJS stack; if you are using a Python-based stack, your options are limited.&lt;/li&gt;
&lt;li&gt;Not all OCRs are capable of understanding form elements like checkboxes and radio buttons. Your only option might be to train the OCR to recognize and render such elements if you are not willing to use 3rd party web services.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Large scanned PDFs
&lt;/h3&gt;

&lt;p&gt;Scanned PDFs require OCR to extract the contents. OCR by its nature is a compute intensive process and takes time to convert a page into text. When we are dealing with very large documents (&amp;gt; 100 pages) the time to extract all pages can be significant. Apart from time latencies, high quality OCR services also involve a non-trivial cost factor.&lt;/p&gt;
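&lt;p&gt;One common mitigation is to OCR pages concurrently. Since wrappers like pytesseract run the OCR engine as a separate process, even a thread pool yields real parallelism. A generic sketch, where &lt;code&gt;ocr_fn&lt;/code&gt; stands in for whatever per-page OCR function you use:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_pages_concurrently(page_images, ocr_fn, workers=4):
    # map() preserves input order, so results line up with the pages.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_fn, page_images))
```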

&lt;h3&gt;
  
  
  Low level text extraction library bugs
&lt;/h3&gt;

&lt;p&gt;PDF files are complicated, and the variety of documents and generators is so large that writing libraries to process them is an inherently difficult task. There will always be corner cases the authors of a library could never have anticipated, leading to runtime errors which need to be handled. And if a significant portion of your target documents is affected, there is no option but to either write your own extractor to handle these cases or contribute a fix to the library if it is open source.&lt;/p&gt;

&lt;h3&gt;
  
  
  Headers and Footers
&lt;/h3&gt;

&lt;p&gt;Many PDFs have headers and footers. Headers typically contain information about the document and the owner (company name, address etc) and the footer contains copyright information and page numbers etc. These are repeated across all pages. This information is generally not required in most RAG and information extraction use cases. These headers and footers simply add noise to the context when used with LLMs and other machine learning use cases. Though usually not a major issue, a good extraction tool should be able to ignore or even better, remove them from the final extracted text.&lt;/p&gt;
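&lt;p&gt;A simple heuristic for stripping them: collect the first and last line of every page and drop lines that repeat across most pages. The 80% threshold below is an arbitrary starting point, and varying parts such as page numbers would need extra normalization before they match across pages:&lt;/p&gt;

```python
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.8):
    """Remove probable headers/footers: lines that appear as the first
    or last line on at least min_fraction of the pages. Each page is a
    list of text lines; removal applies anywhere a repeated line occurs."""
    counts = Counter()
    for lines in pages:
        for line in {lines[0], lines[-1]}:
            counts[line] += 1
    threshold = min_fraction * len(pages)
    repeated = {line for line, c in counts.items() if c >= threshold}
    return [[l for l in lines if l not in repeated] for lines in pages]
```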

&lt;h3&gt;
  
  
  PDFs with both text and text as images
&lt;/h3&gt;

&lt;p&gt;Some PDFs can have both native text and embedded images that in turn contain text in them. This requires special handling. The simple solution is to send the entire page to an OCR to extract the text. But this method might be expensive for high volume use cases. This can also substantially increase the time latencies of extraction. If cost or time is important, a custom extraction library has to be used in these cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tables spread out horizontally into many pages
&lt;/h3&gt;

&lt;p&gt;This is not a common case, but we might encounter PDFs with wide tables which extend into the next page. It is a very difficult problem to solve: detecting where the table’s horizontal overflow continues is hard, because the next logical page may contain the following rows instead of the horizontal overflow, or the other way around. These should be considered special cases for which custom logic has to be written. It is easier when you know that all documents to be processed have a similar structure; in that case, custom extractors can be written. Unfortunately, if these types of documents are not specially dealt with, it might be impossible to handle this case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Privacy issues
&lt;/h3&gt;

&lt;p&gt;As discussed above, writing a high quality PDF extraction library is a huge challenge, and using a 3rd party service to do the extraction can raise privacy and security issues, since you will be sending information to that service. If your mandate or regulations impose strict privacy requirements, you will have to choose a service that can be deployed on-premise so your data never leaves your network. &lt;a href="https://unstract.com/llmwhisperer/"&gt;LLMWhisperer&lt;/a&gt; is one such service that can be run on-premise, ensuring your data does not leave your network.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layout Preservation
&lt;/h3&gt;

&lt;p&gt;If the target use case is to use the extracted text with LLMs and RAG applications, preserving the layout of the original PDF document leads to better accuracy. Large Language Models do a good job of extracting complex data, especially repeating sections and line items when the layout of documents is preserved in the extracted text. Most PDF extraction libraries or OCRs do not provide a layout preserving output mode. You will have to build the layout preserving output with the help of positional metadata provided for the text by either the PDF library or the OCR.&lt;/p&gt;
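&lt;p&gt;A bare-bones version of such layout reconstruction: map each word's coordinates onto a character grid, deriving rows from the vertical position and columns from the horizontal one. The scale factors are assumptions that would need tuning to the document's font metrics, and word boxes are again assumed to look like pdfplumber's &lt;code&gt;extract_words()&lt;/code&gt; output:&lt;/p&gt;

```python
def render_layout(words, scale=6.0, line_height=12.0):
    """Rebuild a plain-text page that mimics the original layout.
    Each word dict carries 'text', 'x0' and 'top'; x coordinates map
    to columns and y coordinates to rows. Overlapping words are simply
    appended, which a production version would need to handle."""
    rows = {}
    for w in words:
        row = int(w["top"] // line_height)
        col = int(w["x0"] // scale)
        rows.setdefault(row, []).append((col, w["text"]))
    out = []
    for row in sorted(rows):
        line = ""
        for col, text in sorted(rows[row]):
            line = line.ljust(col) + text
        out.append(line.rstrip())
    return "\n".join(out)
```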

&lt;h2&gt;
  
  
  What a PDF to text converter architecture would look like
&lt;/h2&gt;

&lt;p&gt;Considering all the cases described above, a block diagram of a high quality PDF to text converter would look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9qnov5oszm8uhgpwe7s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9qnov5oszm8uhgpwe7s.png" alt="Image description" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Build vs Buy
&lt;/h2&gt;

&lt;p&gt;Building a high quality PDF extractor is a complex and massive exercise. Building your own tool allows complete control over functionality and integration with existing systems. However, this approach requires significant investment in time, expertise and ongoing maintenance. On the other hand, purchasing a ready-made, pre-built solution can be quicker to deploy and often comes with continuous updates and professional support. The choice ultimately depends on your specific needs, strategic priorities, resources and budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing LLMWhisperer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://unstract.com/llmwhisperer/"&gt;LLMWhisperer&lt;/a&gt; is a general purpose PDF to text converter service from Unstract.&lt;/p&gt;

&lt;p&gt;LLMs are powerful, but their output is as good as the input you provide. LLMWhisperer is a technology that presents data from complex documents (different designs and formats) to LLMs in a way that they can best understand.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layout preserving modes&lt;/strong&gt;
Large Language Models do a good job of extracting complex data, especially repeating sections and line items when the layout of documents is preserved in the extracted text. LLMWhisperer’s Layout Preserving mode lets you realize maximum accuracy from LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto mode switching&lt;/strong&gt;
While processing documents, LLMWhisperer can switch automatically to OCR mode if text mode extraction fails to generate sufficient output. You don’t have to worry about the extraction mode when sending documents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-compaction&lt;/strong&gt;
The more tokens that go to the LLM, the more time it takes to process your prompts and the more expensive it becomes. With LLMWhisperer’s Auto-compaction, tokens that might not add value to the output are compacted—all while preserving layout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-processing&lt;/strong&gt;
To get the best results, you can control how pre-processing of the scanned images is done. Parameters like Median Filter and Gaussian Blur can be influenced via the API, if needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible deployment options&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SaaS&lt;/strong&gt;
High-performance, fully managed SaaS offering. No more dealing with updates, security, or other maintenance tasks – we’ve got you covered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-Premise&lt;/strong&gt;
We offer a reliable way of deploying LLMWhisperer on your own servers to ensure the security of ultra-sensitive data.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;And much more&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Support for PDFs and the most common image formats&lt;/li&gt;
&lt;li&gt;High-performance cloud for consistently low-latency processing&lt;/li&gt;
&lt;li&gt;Settable page demarcation&lt;/li&gt;
&lt;li&gt;Three output modes: Layout preserving, Text, Text-Dump&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
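&lt;p&gt;To make the auto-compaction idea concrete, here is a hypothetical sketch of the general technique: shrinking long runs of spaces that cost LLM tokens without carrying meaning, while keeping enough of a gap for the layout to stay recognizable. This illustrates the concept only; it is not LLMWhisperer's actual algorithm:&lt;/p&gt;

```python
# Hypothetical whitespace compaction: collapse interior runs of spaces
# longer than max_gap down to max_gap, leaving leading indentation alone
# so column alignment cues are not destroyed.
import re

def compact(line, max_gap=4):
    pattern = r"(\S) {%d,}(?=\S)" % (max_gap + 1)
    return re.sub(pattern, r"\1" + " " * max_gap, line)

row = "Widget" + " " * 40 + "$9.99"
print(compact(row))  # Widget    $9.99
```

Fewer whitespace tokens mean lower cost and latency per prompt, which is exactly the trade-off the auto-compaction feature targets.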

&lt;h2&gt;
  What’s next? Action items for the curious
&lt;/h2&gt;

&lt;p&gt;If you want to quickly test LLMWhisperer with your own documents, check out our &lt;a href="https://pg.llmwhisperer.unstract.com/"&gt;free playground&lt;/a&gt;. &lt;br&gt;
Alternatively, you can sign up for our &lt;a href="https://llmwhisperer.unstract.com/products"&gt;free trial&lt;/a&gt;, which lets you process up to 100 pages a day for free.&lt;/p&gt;

&lt;p&gt;Even better, &lt;a href="https://unstract.com/schedule-a-demo/"&gt;schedule a call with us&lt;/a&gt;. We’ll help you understand how Unstract leverages AI to automate document processing and how it differs from traditional OCR and RPA solutions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Test drive LLMWhisperer with your own documents. No sign up needed!&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9vz0sly6imoa1tt2gw6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9vz0sly6imoa1tt2gw6.png" alt="Test drive LLMWhisperer with your own documents. No sign up needed!" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: I originally posted this on the &lt;a href="https://unstract.com/blog/pdf-hell-and-practical-rag-applications/"&gt;Unstract blog&lt;/a&gt; a couple of weeks ago.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
      <category>python</category>
    </item>
  </channel>
</rss>
