DEV Community: David Maigari

OCR vs ADE: Mechanisms Behind the Methods

David Maigari — Thu, 09 Oct 2025 09:03:04 +0000

Introduction

In today’s data-driven world, there is a heavy reliance on extracting information from physical sources and interpreting their content. When we talk about extracting data from documents, files, and other sources, there are two approaches that stand out among others. Optical Character Recognition (OCR) and Agentic Document Extraction (ADE) are two methods with distinct capabilities.

The principles of these two mechanisms differ, and understanding their applications helps you leverage them effectively. In a nutshell, OCR takes a visual representation of text and converts it into machine-readable characters. ADE, on the other hand, takes a more complex approach, extracting structured data from a wide range of unstructured inputs.

What is Optical Character Recognition (OCR)?

OCR is an effective method for interpreting text in images and files. This method involves certain mechanisms, like image processing and character extraction. Although it was designed to digitize books and newspapers, OCR has evolved into a vital technology in scanning, document processing, and machine learning applications.

OCR can identify printed or handwritten text in raw image data and convert it into readable text, and it relies on several stages of processing to do so. It first applies image processing to the input. This step helps reduce noise and segment text from other elements, providing the recognition engine with a clean image to work with.

OCR systems also perform feature extraction, identifying the distinct attributes of each character so that varied text can be analyzed. Modern OCR models often use CNN architectures to make feature extraction more seamless. Finally, a post-processing step uses the context of the surrounding words to catch recognition errors. For example, misreading “rn” as “m” can be corrected by analyzing word probability.
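
The word-probability correction described above can be sketched in a few lines. This is only an illustration: the frequency table and the confusion list are made-up assumptions, and a real OCR engine would rely on a much larger dictionary or a language model.

```python
# Toy word-frequency table (assumed values, for illustration only).
WORD_FREQ = {"corner": 400, "comer": 3, "burn": 250, "bum": 20}

# Character spans that are easy to confuse visually (illustrative list).
CONFUSIONS = [("rn", "m"), ("m", "rn")]


def candidates(word):
    """Return the word plus every variant made by swapping one confusable span."""
    found = {word}
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            found.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return found


def correct(word):
    """Pick the most frequent candidate; keep the word if no candidate is known."""
    best = max(candidates(word), key=lambda w: WORD_FREQ.get(w, 0))
    return best if WORD_FREQ.get(best, 0) > 0 else word


print(correct("comer"))
```

The design choice here is to generate only variants that differ by a known confusion, then let frequency decide, so unrelated words are never rewritten.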

What is Agentic Document Extraction?

In contrast to OCR, Agentic Document Extraction provides more data when parsing through files and images. It can even provide context to the data. One of the reasons behind ADE’s robust output is that it can read more than just plain text; ADE is also capable of identifying tables, forms, and so on.

ADE can extract structured data from unstructured documents, files, and images. It does this by identifying the tables, forms, and text in a document while also understanding the context of each element. It can then return the parsed data in a hierarchical JSON format.

What features make ADE so robust and effective? ADE leverages several vital capabilities to execute document extraction tasks. These include the following:

Element Detection: ADE can identify and segment regions of a document with form fields, tables, checkboxes, and text.
Structure Recognition: Each element in a document has a distinct structure, and ADE understands the relationships between the text, lines, and captions in different regions of the document.
Visual Grounding: The JSON output tells you where each piece of information came from, including the page and the location on that page. This makes it easy to double-check and trace records.
Flexible Output: ADE’s ability to return results in Markdown and JSON makes it ready for use in RAG applications and workflows.
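
To make the idea of visual grounding concrete, here is a minimal sketch of what a grounded element could look like. The GroundedChunk class, its field names, and the sample values are hypothetical and are not the schema returned by agentic_doc.

```python
from dataclasses import dataclass


@dataclass
class GroundedChunk:
    """Hypothetical shape of one extracted element (not the agentic_doc schema)."""
    kind: str    # e.g. "table", "text", "form_field"
    text: str    # extracted content
    page: int    # zero-based page index
    box: tuple   # (left, top, right, bottom) as fractions of the page size


def chunks_on_page(chunks, page):
    """Return only the chunks that were found on the given page."""
    return [chunk for chunk in chunks if chunk.page == page]


def describe(chunk):
    """Build a human-readable pointer back to the source location."""
    left, top, right, bottom = chunk.box
    return (f"{chunk.kind} on page {chunk.page + 1} "
            f"at ({left:.2f}, {top:.2f})-({right:.2f}, {bottom:.2f})")


# Made-up sample data for illustration.
sample = [
    GroundedChunk("text", "Payment report", 0, (0.10, 0.05, 0.90, 0.10)),
    GroundedChunk("table", "Trips | Amount", 1, (0.10, 0.20, 0.90, 0.60)),
]
print(describe(chunks_on_page(sample, 1)[0]))
```

Keeping the page and box next to the text is what lets a reviewer jump straight to the region a value came from.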

OCR vs ADE: Working principle

The approach and output of OCR and ADE differ significantly. While OCR specializes in recognizing characters from images and documents, ADE provides more output data by extracting a structured representation of text and other elements.

OCR’s method starts with image processing. This stage ensures that the document goes through a ‘noise reduction’ technique to give a clean image before segmenting the text regions.
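
As a rough illustration of this clean-up stage, the sketch below applies a 3x3 median filter and a fixed threshold to a tiny grayscale grid. The median filter and the threshold of 128 are assumptions chosen for the example; the article does not say which noise-reduction technique a given OCR engine uses.

```python
from statistics import median


def median_filter(pixels):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    Pixels on the border use only the neighbours that exist.
    """
    height, width = len(pixels), len(pixels[0])
    cleaned = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(int(median(window)))
        cleaned.append(row)
    return cleaned


def binarize(pixels, threshold=128):
    """Map grayscale values to 0 (ink) or 255 (paper)."""
    return [[0 if value < threshold else 255 for value in row] for row in pixels]


# A white patch with a single dark speck in the middle.
noisy = [
    [255, 255, 255],
    [255, 0, 255],
    [255, 255, 255],
]
print(binarize(median_filter(noisy)))
```

An isolated speck disappears because it is outvoted by its neighbours, while larger strokes of text would survive the filter.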

ADE takes a different approach. It uses data parsing to break the input down into logical structures. It then uses a blend of pattern-based methods (regex, templates) and AI-driven extraction (NLP, NER) to pull out key fields such as invoice numbers, names, or dates.
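
The pattern-based half of that blend can be sketched with plain regular expressions. The patterns, field names, and sample text below are illustrative assumptions; a production extractor would need patterns tuned to its own documents and would combine them with the AI-driven methods mentioned above.

```python
import re

# Illustrative patterns; real documents will need their own.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "total": re.compile(r"Total\s*:?\s*([0-9]+(?:\.[0-9]{2})?)", re.IGNORECASE),
}


def extract_fields(text):
    """Return the first match for each pattern, or None when a field is absent."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


# Made-up sample text for illustration.
sample_text = "Invoice No: INV-0042\nDate: 2025-01-15\nTotal: 1500.00"
print(extract_fields(sample_text))
```

Returning None for missing fields, rather than raising, leaves the decision about incomplete documents to a later validation step.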

In essence, OCR is the foundation for converting raw images into text, while ADE builds on top of that (or directly from digital text) to deliver actionable, validated, and structured information. OCR answers the “what characters are on this page?” question, whereas ADE answers “what does this document mean, and how do we use its contents in business workflows?”

OCR vs ADE: Comparative Analysis

Let’s put these systems into practice. First, we will explore Landing AI’s Agentic Document Extraction using the agentic_doc Python library. We will run ADE on a payment report from Bolt, a ride-hailing app, and see how it interprets both the visual layout of the document and its context.

So, let’s use the agentic_doc library to extract information from the PDF file.

Prerequisites

A Google Colab environment
A Vision Agent API key
A sample file to extract data from (e.g., JPEG, PDF, or PNG)

Install dependencies

`!pip install agentic-doc`

agentic-doc is the Python library that wraps Landing AI’s ADE, so it must be installed in your environment.

Parsing the document

from agentic_doc.parse import parse


# Parse a local file
result = parse("/content/NG1124-459199.pdf")


# Get the extracted data as markdown
print("Extracted Markdown:")
print(result[0].markdown)


# Get the extracted data as structured chunks of content in a JSON schema
print("Extracted Chunks:")
print(result[0].chunks)

The parse function reads the file at “/content/NG1124-459199.pdf” and returns a list of results containing the structured output for the document. result[0].markdown gives the content as clean, readable Markdown text, and result[0].chunks gives the same content as structured, JSON-like chunks, making it easy to work with specific sections.

Here is the unaltered output appearance on Google Colab.

The output in readable text shows that ADE understands every section and element in the document.

OCR, on the other hand, cannot provide this much context to a document. We will run a file through an OCR model to demonstrate its functionality.

How to Run an OCR Model

Many OCR models work by detecting text and highlighting it with bounding boxes. Others can only recognize a single line of text in an image, whether printed or handwritten. TrOCR is a good example of this; other OCR tools include EasyOCR and PaddleOCR.

Here is a simple illustration of how this works.

Prerequisite

A Google Colab environment.

Installing Dependencies

pip install torch torchvision torchaudio
pip install transformers
pip install pillow
pip install requests

You can run the model using this code:

from transformers import TrOCRProcessor, VisionEncoderDecoderModel

from PIL import Image

import requests


# load image from the IAM database (actually, this model is meant to be used on printed text)

url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'

image = Image.open(requests.get(url, stream=True).raw).convert("RGB")


processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-printed')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-printed')
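pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token ids from the image, then decode them into a string
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]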

Generating the results

generated_text

This displays the text that the model recognized in the handwritten image.

OCR vs ADE: Key Takeaway

While OCR and ADE operate differently, their real power emerges when used together. In enterprise settings like invoice processing, OCR first transforms a scanned PDF into readable text. This step converts static images into digital characters, creating the foundation for further analysis.

ADE then takes over, identifying key details such as vendor names, invoice numbers, dates, and totals. Once extracted, the data is validated and seamlessly entered into systems like ERPs. This synergy creates a smooth automation pipeline, reducing manual effort while ensuring accuracy at scale.
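
The validation step in this pipeline can be sketched as a small check that runs before a record is handed to another system. The required field names, the date format, and the rules are assumptions made for the example, not part of any specific ERP or of ADE itself.

```python
from datetime import datetime

# Fields this sketch expects on every invoice (an assumption for illustration).
REQUIRED = ("vendor", "invoice_number", "date", "total")


def validate_invoice(fields):
    """Return a list of problems; an empty list means the record can be passed on."""
    problems = [f"missing {name}" for name in REQUIRED if not fields.get(name)]

    if fields.get("date"):
        try:
            datetime.strptime(fields["date"], "%Y-%m-%d")
        except ValueError:
            problems.append("date is not YYYY-MM-DD")

    if fields.get("total"):
        try:
            if float(fields["total"]) < 0:
                problems.append("total is negative")
        except ValueError:
            problems.append("total is not a number")

    return problems


# Made-up record for illustration.
record = {
    "vendor": "Acme Ltd",
    "invoice_number": "INV-0042",
    "date": "2025-01-15",
    "total": "1500.00",
}
print(validate_invoice(record))
```

Collecting every problem in a list, instead of stopping at the first one, gives a reviewer the full picture of what needs fixing in one pass.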

Wrapping Up

These two software solutions complement each other regardless of their varying purposes and capabilities. So, it is essential to know how to use them independently or together.

PyMuPDF: A Python Library That Reduces the Size of PDF Files

David Maigari — Tue, 29 Apr 2025 20:46:47 +0000

Many libraries and tools perform various tasks in processing files, images, and documents. PDF files are a good example of files that can undergo several processing stages using Python libraries and AI models.

Reducing the size of a PDF file can be necessary for many reasons, and several tools can accomplish this. One fascinating way is to use the PyMuPDF library. This library is designed for a range of tasks, including extracting text, creating images from documents, adding images to documents, and performing optical character recognition (OCR) on document pages.

However, it can also employ a special technique to reduce the file size of a PDF. This article will examine how Hanif utilized the PyMuPDF library to accomplish this task.

What is PyMuPDF?

PyMuPDF is a Python library that excels in extracting, analyzing, manipulating, and converting data from PDF files and other document formats. With this library, you can extract text and images from a PDF file. But its capabilities do not end here. PyMuPDF can also extract vector graphics and merge PDF files.

Another interesting feature of this library is its ability to manipulate and process PDF files. This library can perform a range of tasks, including adding watermarks and images, as well as embedding and attaching files.

How Do PDFs Get Bloated?

There are several reasons why a PDF file can become too large. High image resolutions, embedded fonts, unused objects, and other factors can affect the size of these files. Compressing them is a very effective optimization technique, as it can provide a better overall experience. Compressing larger PDF files is important for several reasons, including saving storage space, improving web performance, and facilitating easy sharing.

Reducing PDF files using PyMuPDF is straightforward. The library can operate in low-resource environments and yield excellent results. We’ll explore how the technique works and how you can achieve your desired results.

How to Reduce PDF Using this Library

We begin by setting up the environment and installing the modules needed to process the file. Then we use PyMuPDF to compress a PDF file. So, let’s dive in.

Setting up the environment
To run this operation, you must install the PyMuPDF Python library, which allows you to work with PDF files and other document formats.

!pip install pymupdf

Importing the Necessary Libraries

import os
import fitz  # PyMuPDF

This operation requires two modules: os and fitz. The os module provides functions for interacting with the operating system, such as managing file paths and reading file sizes. fitz is the import name of PyMuPDF's main module and is responsible for handling PDFs and other document formats. It can open, load, read, and manipulate PDF files, which is what the size reduction relies on.

Compressing the PDF file

def compress_pdf(input_path, output_path, zoom_x=0.75, zoom_y=0.75):
   try:
       document = fitz.open(input_path)
       new_document = fitz.open()

This function is designed to compress PDF files by reducing the size of their pages. It loads the original PDF file with PyMuPDF's fitz.open function. Then, it creates a new, empty document to which the compressed pages will be added.

With the original file open and the new document created, the next step is the loop that compresses each page.

       for page_num in range(len(document)):
           page = document.load_page(page_num)
           mat = fitz.Matrix(zoom_x, zoom_y)
           pix = page.get_pixmap(matrix=mat, alpha=False)
           img_bytes = pix.tobytes("jpeg")
           new_page = new_document.new_page(width=pix.width, height=pix.height)
           new_page.insert_image(new_page.rect, stream=img_bytes)

This block loops through each page of the original PDF. For every page, it creates a scaling matrix (fitz.Matrix) to shrink the page content, renders the page as a pixmap (image), converts it to JPEG bytes, and then inserts the image into a new, blank page in the compressed PDF. This process reduces the file size by saving the visual content as images. Because each page becomes an image, the text in the output is no longer selectable or searchable.
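
To see why the zoom factors matter, it helps to work out the arithmetic. The sketch below assumes the page is rendered at the default 72 dpi, where one PDF point maps to one pixel before scaling; the helper names are made up for this example, and the real pixmap may differ by a pixel because of rounding.

```python
def scaled_size(width_pt, height_pt, zoom_x=0.75, zoom_y=0.75):
    """Approximate pixel size of a page rendered with the given zoom factors."""
    return round(width_pt * zoom_x), round(height_pt * zoom_y)


def pixel_fraction(zoom_x=0.75, zoom_y=0.75):
    """Fraction of the unscaled pixel count that remains after zooming."""
    return zoom_x * zoom_y


# A US Letter page is 612 x 792 points.
print(scaled_size(612, 792))
print(pixel_fraction())
```

With both factors at 0.75, only about 56% of the pixels remain, which is where much of the saving comes from before JPEG compression is even applied.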

Saving Newly Compressed PDF

       new_document.save(output_path)
       new_document.close()
       document.close()

This code saves the newly compressed PDF file to output_path. It then closes both the original and the compressed documents, which ensures all changes are written and no files are left open in memory.

Results

       print(f"✅ Compressed: {output_path}")
       print(f"   Original: {os.path.getsize(input_path)/1024:.2f} KB")
       print(f"   Compressed: {os.path.getsize(output_path)/1024:.2f} KB")

With this code, you can display the path to the compressed file and view the sizes of the original and the new documents. The sizes come from os.path.getsize(), which returns the size in bytes; dividing by 1024 converts it to kilobytes. The output is formatted to two decimal places, which helps with readability.
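
If you also want a single number that summarizes the saving, a small helper can turn the two sizes into a percentage. The function name is hypothetical and not part of the original script; note that rasterizing a text-only PDF can occasionally make it larger, which shows up here as a negative value.

```python
def reduction_percent(original_bytes, compressed_bytes):
    """Percentage by which the file shrank; negative if it grew."""
    if original_bytes == 0:
        return 0.0
    return (original_bytes - compressed_bytes) / original_bytes * 100


print(f"{reduction_percent(2048, 1024):.1f}% smaller")
```

In the script above, the two arguments would come from os.path.getsize(input_path) and os.path.getsize(output_path).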

Batch PDF Compression and Error Handling

   except Exception as e:
       print(f"❌ Error compressing {input_path}: {str(e)}")


if __name__ == "__main__":
   folder = "/content"


   if not os.path.isdir(folder):
       raise FileNotFoundError(f"Folder '{folder}' does not exist.")


   pdf_files = [f for f in os.listdir(folder) if f.endswith(".pdf")]


   if not pdf_files:
       print("⚠️ No PDF files found in the folder.")
   else:
       for filename in pdf_files:
           input_pdf = os.path.join(folder, filename)
           output_pdf = os.path.join(folder, filename.replace(".pdf", "-compressed.pdf"))
           compress_pdf(input_pdf, output_pdf)

This code handles errors during compression with a try-except block. If something goes wrong, it catches the exception and prints an error message showing which file failed and why. Then, in the main execution block, it sets a folder path (in this case, /content) where the PDF files are located.

It checks if the folder exists and raises an error if it doesn't. Next, it looks for all PDF files in that folder. If no PDFs are found in the folder, a warning is printed. Otherwise, it loops through each PDF file, creates a new output filename for the compressed version, and calls the compress_pdf() function to perform the compression.
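
One thing to watch when running the batch more than once is that earlier outputs also end in .pdf and would be compressed again. A possible way to plan the batch is sketched below; plan_batch and the suffix constant are illustrative names, not part of the original script.

```python
from pathlib import Path

SUFFIX = "-compressed"


def plan_batch(filenames):
    """Pair each source PDF with an output name, skipping earlier outputs."""
    plan = []
    for name in filenames:
        path = Path(name)
        # Ignore non-PDF files and files produced by a previous run.
        if path.suffix.lower() != ".pdf" or path.stem.endswith(SUFFIX):
            continue
        plan.append((name, f"{path.stem}{SUFFIX}.pdf"))
    return plan


print(plan_batch(["report.pdf", "report-compressed.pdf", "notes.txt"]))
```

Comparing the lowercased extension also picks up files named with an uppercase .PDF, which a plain endswith(".pdf") check would miss.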

Results

Running this code yields a compressed copy of each PDF, with ‘-compressed’ appended to its file name, saved in the same folder as the original.

Application of PyMuPDF for Reducing File Size

This tool is particularly useful in various settings where file size reduction is crucial. Good examples include Email Attachment optimization, improved web and app performance, and archiving and storage management.

In many corporate workplaces, email attachments are limited (e.g., 20MB), but with PyMuPDF, you can compress files, reports, proposals, and other documents to stay within the given limits.

A reduced size can also enhance loading speed and improve the user experience. This is especially relevant for banks, academic institutions, and legal organizations that host PDF files online.

Optimizing storage is another reason why the tool is relevant. Hospitals, firms, government agencies, and even some businesses store a large archive of PDF documents. Reducing file size is essential for saving storage and memory.

Exploring PyMuPDF With Streamlit

You can put a simple user interface in front of this compressor with Streamlit. Streamlit helps you set up a page that can handle these files, so you can test the compressor while also making it usable for people without a deep technical understanding of coding.

This section is a simple guide to setting up a page to run this compression tool.

Page Setup

The first step in using Streamlit is to utilize its page setup features. This allows you to create a centered heading and minimalist interface where you can add attributes, links, buttons, file uploads, and other inputs. In this case, using PyMuPDF would require adding an interface to upload files, which brings us to the next step.

File Handling

You can use the bordered container on Streamlit to create a file uploader that accepts only PDF files. This attribute helps you handle the uploaded file effectively, as it allows you to avoid saving the file on disk.

User Control (Added Feature)

This tool features a quality slider that ranges from 0 to 100, allowing users to control the desired compression level precisely.

It displays an informative message explaining that a lower quality setting results in a smaller file size. It also features a large "Compress PDF" button that spans the width of the container, making it highly visible and easy to click.

Processing Flow

When a PDF is uploaded, the app captures the file name and size, then enables compression options. Upon clicking "Compress PDF," a loading spinner appears, the file is processed using temporary storage, and the compression algorithm is executed.

After compression, it displays file size comparisons, provides a download button, and shows an error message if compression fails.
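
The page described in these sections could be laid out roughly as follows. This is a sketch under assumptions and not the code from Hanif's repository: the labels, the default slider value, and the compressed_name helper are made up, and the compression call is left as a placeholder for your own routine. Streamlit is imported inside main() so that the helper can be reused without it.

```python
def compressed_name(filename):
    """Build the download name, e.g. report.pdf -> report-compressed.pdf."""
    stem = filename[:-4] if filename.lower().endswith(".pdf") else filename
    return f"{stem}-compressed.pdf"


def main():
    import streamlit as st

    st.set_page_config(page_title="PDF Compressor", layout="centered")
    st.title("PDF Compressor")

    with st.container(border=True):
        uploaded = st.file_uploader("Upload a PDF", type=["pdf"])
        quality = st.slider("Quality", min_value=0, max_value=100, value=75)
        st.info("A lower quality setting results in a smaller file size.")
        clicked = st.button("Compress PDF", use_container_width=True)

    if uploaded is not None and clicked:
        data = uploaded.getvalue()  # raw bytes of the uploaded file
        try:
            with st.spinner("Compressing..."):
                # Placeholder: pass `data` and `quality` to your own compressor.
                result = data
            st.write(f"Original: {len(data) / 1024:.2f} KB")
            st.write(f"Compressed: {len(result) / 1024:.2f} KB")
            st.download_button(
                "Download compressed PDF",
                data=result,
                file_name=compressed_name(uploaded.name),
                mime="application/pdf",
            )
        except Exception as error:
            st.error(f"Compression failed: {error}")
```

To try it, save the code to a file, add a final call to main() at the bottom, and start it with streamlit run followed by the file name.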

Running Streamlit Locally

As we mentioned earlier, this platform allows you to set up a simple user interface to host your mini projects. It also supports simple elements, such as buttons, file uploaders, and links. Here is a simple guide on how to run this:

Clone the GitHub repo

Navigate to and clone the GitHub Repository at https://github.com/Hanif-adedotun/compress-pdf.git

Navigate to the root of the folder
Open up the terminal on the path
Run pip install -r requirements.txt
Run streamlit run streamlit.py
Open up your browser at localhost:8501

Voila! You now have access to the localhost version of the project running on your system.

Install Streamlit

pip install streamlit

Once you have successfully installed it, you can run the Python code. If you encounter no errors, it means Streamlit has been installed successfully.

Run the Streamlit File

streamlit run filename.py

This command will direct you to a local URL on your web browser, and you can then proceed to start building your page.

With this tool, you can use Streamlit’s basic functions and attributes to run your mini projects, like the PDF compressor. There are lots of features you can add with little to no knowledge of HTML and CSS. Check out this guide for more information on using Streamlit.

Wrapping Up

The article examines how the PyMuPDF Python library effectively compresses the size of PDF files. Users can optimize PDFs for email, web, and storage by converting pages to compressed images and rebuilding the document.

Hanif demonstrates a practical and effective method using simple code, batch processing, and error handling. This technique is particularly valuable for organizations that handle large volumes of digital documents.