The Future of Document Scanning: A Look at LLM-Powered OCR

#rag #ollama #granite #llama

Hands-on experience using ‘Llama’ & ‘Granite’ Vision Models locally with Ollama

Introduction

Have you ever looked at a stack of old documents, a pile of scanned invoices, or a library of PDFs and felt like you were staring at mountains of inaccessible data? For decades, businesses have relied on Optical Character Recognition (OCR) to solve this problem. OCR technology acts as a digital eye, transforming text from images and paper documents into machine-readable formats. While this has been a game-changer for simple data entry and digitization, it often leaves a crucial piece of the puzzle unsolved: comprehension.

In industrial settings, for example, OCR is a powerful tool for inventory and asset management. Imagine a factory floor filled with machinery, each with a small nameplate detailing its model number, serial number, and technical specifications. Manually cataloging this information is a slow, error-prone task. With OCR, a simple photo from a smartphone can instantly digitize this data, providing a quick and efficient way to build a digital twin of a physical inventory. However, without context, this data is just a collection of numbers and letters, offering little insight into the machine’s function or maintenance history.

Traditional OCR is excellent at transcribing “what” the text says, but it struggles with “what it means.” It can’t understand context, interpret tables, or pull out key insights from a complex report. That’s where the new era of Large Language Models (LLMs) comes in. By combining the transcription power of OCR with the analytical and generative capabilities of LLMs, we can unlock a new level of intelligent document processing. This powerful fusion not only digitizes information but enriches it, paving the way for advanced applications like Retrieval-Augmented Generation (RAG), which allows you to have a natural conversation with your documents and get instant, accurate answers from a vast, private dataset.

Implementation and Test

For my testing, I built a simple OCR setup using Ollama, which allows for running LLMs locally. I specifically used two small, vision-enabled models available for download from the Ollama site: “granite3.2-vision” and “llama3.2-vision.” By running these models on a local machine, I can run my OCR tests while keeping the data private and without relying on a cloud-based API.

To test the models’ capabilities, my methodology was straightforward. I searched for and downloaded four random (and possibly fake) images of industrial nameplates from the internet. I then developed a simple Python application to process these images. The application writes one JSON file per image and per model, containing the text the model returned along with the time, in seconds, that the local model took to respond.

So let’s jump into the tests!

Download the models using Ollama:

ollama pull llama3.2-vision
ollama pull granite3.2-vision

Prepare your Python environment:

#!/bin/bash
python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install ollama

Code Implementation ⬇️

import os
import json
import base64
import ollama
import time
from pathlib import Path

# --- Configuration ---
INPUT_DIR = "input"
MODELS = [
    {"name": "granite3.2-vision", "output_dir": "granite-vision-output"},
    {"name": "llama3.2-vision", "output_dir": "llama-vision-output"},
]

# --- Helper Functions ---
def is_image_file(filename):
    """Checks if a file has a common image extension."""
    image_extensions = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp", ".tiff"}
    return Path(filename).suffix.lower() in image_extensions

def get_base64_image(image_path):
    """Reads an image file and returns its Base64-encoded string."""
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    except IOError as e:
        print(f"Error reading image file {image_path}: {e}")
        return None

def process_image_with_ollama(model_name, image_path, base64_image):
    """Sends an image to the Ollama model for text extraction."""
    print(f"Processing image with {model_name}: {image_path}")
    start_time = time.time()
    try:
        response = ollama.chat(
            model=model_name,
            messages=[
                {
                    "role": "user",
                    "content": "Extract all text from this image.",
                    "images": [base64_image],
                }
            ],
        )
        end_time = time.time()
        extracted_text = response['message']['content']
        elapsed_time = end_time - start_time
        return extracted_text, elapsed_time
    except Exception as e:
        end_time = time.time()
        elapsed_time = end_time - start_time
        print(f"Error with Ollama API for {image_path} using model {model_name}: {e}")
        return None, elapsed_time

def save_to_json(filepath, data):
    """Saves data to a JSON file."""
    try:
        with open(filepath, "w") as json_file:
            json.dump(data, json_file, indent=4)
        print(f"Successfully saved text to {filepath}")
    except IOError as e:
        print(f"Error writing to JSON file {filepath}: {e}")

# --- Main Application Logic ---
def main():
    """Main function to run the OCR process for multiple models."""
    print("Starting Ollama OCR application...")

    # Ensure the input directory exists
    if not Path(INPUT_DIR).exists():
        print(f"Error: The input directory '{INPUT_DIR}' does not exist. Please create it and add images.")
        return

    # Create a list of all image paths to process
    image_paths = []
    for root, _, files in os.walk(INPUT_DIR):
        for filename in files:
            if is_image_file(filename):
                image_paths.append(Path(root) / filename)

    if not image_paths:
        print(f"No image files found in '{INPUT_DIR}'. Please add images to the folder.")
        return

    # Process images with each model specified in the MODELS list
    for model_info in MODELS:
        model_name = model_info["name"]
        output_dir = model_info["output_dir"]

        # Ensure the model-specific output directory exists
        Path(output_dir).mkdir(parents=True, exist_ok=True)

        print(f"\nProcessing images with model: {model_name}")

        for image_path in image_paths:
            # Get the Base64 representation of the image
            base64_image = get_base64_image(image_path)
            if not base64_image:
                continue  # Skip to the next file if there was an error

            # Process the image with the Ollama model and get the elapsed time
            extracted_text, elapsed_time = process_image_with_ollama(model_name, image_path, base64_image)

            # Print the elapsed time for the current image
            print(f"  - Time elapsed for this photo: {elapsed_time:.2f} seconds.")

            if extracted_text:
                # Construct the output filename, preserving the relative path
                relative_path = image_path.relative_to(INPUT_DIR)
                output_sub_dir = Path(output_dir) / relative_path.parent

                # Create subdirectories in the output folder to mirror the input folder structure
                output_sub_dir.mkdir(parents=True, exist_ok=True)

                # New filename format: {model_name}_{original_filename_stem}.json
                output_filename = f"{model_name.replace(':', '_')}_{image_path.stem}.json"
                output_filepath = output_sub_dir / output_filename

                # Prepare data for JSON file, including the elapsed time
                json_data = {
                    "original_image_path": str(image_path),
                    "extracted_text": extracted_text,
                    "model_used": model_name,
                    "processing_time_seconds": round(elapsed_time, 2)
                }

                # Save the extracted data
                save_to_json(output_filepath, json_data)
            else:
                print(f"Could not extract text from {image_path} with model {model_name}. Skipping.")

    print("\nOllama OCR application finished.")

if __name__ == "__main__":
    main()
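
Once the script has run, the per-image JSON files can be rolled up into one figure per model. The sketch below is not part of the application above: `summarize_timings` is a hypothetical helper, and it assumes the output directory names and the JSON keys (`model_used`, `processing_time_seconds`) written by the script.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def summarize_timings(output_dirs):
    """Return {model_name: (file_count, mean_seconds)} from the saved JSON files."""
    timings = defaultdict(list)
    for output_dir in output_dirs:
        if not Path(output_dir).is_dir():
            continue  # Nothing was written for this model yet
        # rglob also picks up files in the mirrored subdirectories
        for json_path in Path(output_dir).rglob("*.json"):
            with open(json_path, encoding="utf-8") as json_file:
                record = json.load(json_file)
            timings[record["model_used"]].append(record["processing_time_seconds"])
    return {model: (len(values), mean(values)) for model, values in timings.items()}


if __name__ == "__main__":
    # Directory names taken from the MODELS configuration above
    summary = summarize_timings(["granite-vision-output", "llama-vision-output"])
    for model, (count, average) in sorted(summary.items()):
        print(f"{model}: {count} files, mean {average:.2f} s")
```

A mean over four images is only a rough indicator, but it makes the speed comparison between the two models easy to repeat on a larger image set.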

Test #1

{
    "original_image_path": "input/20200605-202957.jpg",
    "extracted_text": "The image shows a gray card with black text on it, which appears to be a product label or label for a product. The card is placed on a stand and has a white background.\n\n* The card has the following text:\n\t+ \"HYDRO 71\" in large letters at the top\n\t+ \"21 Rue des Acacias - 71200 LE CREUSOT\" in smaller letters below\n\t+ \"T\u00e9l: 03.85.80.76.59 - Fax: 03.85.55.34.62\" in even smaller letters below\n\t+ \"Email: hydro71@orange.fr\" in the smallest letters at the bottom\n* The card also has some other text, including:\n\t+ \"Type: GROUPE \u00c9CHANGEUR\"\n\t+ \"Num\u00e9ro: T16361\"\n\t+ \"Ann\u00e9e: 2017\"\n\t+ \"0,37 KW\"\n\t+ \"12 L/min\"\n\t+ \"20 Bars\"\n\t+ \"380V - 50HZ\"\n\t+ \"500 KG\"\n\t+ \"Puissance: 0,37 KW\"\n\t+ \"D\u00e9bit: 12 L/min\"\n\t+ \"Pression: 20 Bars\"\n\t+ \"Tension: 380V - 50HZ\"\n\t+ \"Poids: 500 KG\"\n\nThe card appears to be a label for a product called \"HYDRO 71\", which is likely a type of water pump or similar device. The text on the card provides information about the product's type, number, and year of production, as well as its technical specifications.",
    "model_used": "llama3.2-vision",
    "processing_time_seconds": 28.06
}



{
    "original_image_path": "input/20200605-202957.jpg",
    "extracted_text": "\nThe text on the label includes:\n- HYDRO 71\n- 21 Rue des Acadies - 71200 LE CREUSOT\n- T\u00e9l : 03.85.80.76.59 - Fax : 03.85.55.34.62\n- Email: hydro71@orange.fr\n- Type: GROUPE \u00c9CHANGEUR\n- Num\u00e9ro: T16361\n- Ann\u00e9e: 2017\n- Puissance: 0,37 kW\n- D\u00e9bit : 12 L/min\n- Pression: 20 bars\n- Poids : 500 KG",
    "model_used": "granite3.2-vision",
    "processing_time_seconds": 15.71
}

Test #2

{
    "original_image_path": "input/plaque-aluminium-gravure-paris.jpg",
    "extracted_text": "The image displays a label with text in French. The text on the label is as follows:\n\n**Header**\n\n* DIRECTION de l'ARMEE\n* du MATERIEL de TERRE\n\n**Label Content**\n\n* D\u00e9signation: GRIC ROULEUR\n* HYDROPNEUMATIQUE - 60T\n* Marque et mod\u00e8le: CAO - N20-3C\n* S\u00e9rie: L.EV005\n* Ann\u00e9e de fabrication: 2013\n* N\u00b0 d'immatriculation: 4910 33 209 7613\n* N\u00b0 Marche: 11 01 597\n* Date: [blank]\n* Titulaire: CAO EQUIPEMENTS\n* 542 044 375 00070\n\n**Additional Text**\n\n* DIRECTION de l'ARMEE\n* du MATERIEL de TERRE\n* D\u00c9SIGNATION\n* HYDROPNEUMATIQUE\n* 60T\n* MARQUE ET MODELE\n* CAO - N20-3C\n* SERIE\n* L.EV005\n* ANN\u00c9E DE FABRICATION\n* 2013\n* N\u00b0 D'IMMARTICULATI\n* 4910 33 209 7613\n* N\u00b0 MARCH\u00c9\n* 11 01 597\n* DATE\n* TITULAIRE\n* CAO EQUIPEMENTS\n* 542 044 375 00070\n* D\u00c9SIGNATION\n* HYDROPNEUMATIQUE\n* 60T\n* MARQUE ET MODELE\n* CAO - N20-3C\n* SERIE\n* L.EV005\n* ANN\u00c9E DE FABRICATION\n* 2013\n* N\u00b0 D'IMMARTICULATI\n* 4910 33 209 7613\n* N\u00b0 MARCH\u00c9\n* 11 01 597\n* DATE\n* TITULAIRE\n* CAO EQUIPEMENTS\n* 542 04\n* 375 00070\n* D\u00c9SIN\n* HY\n* 60T\n* MAR\n* CAO - N20-3C\n* SER\n* L.EV\n* ANN\n* 2013\n* N\n* 4910\n* 33\n* 209\n* 7613\n* N\n* 11\n* 01\n* 597\n* D\n* T\n* CAO\n* 542\n* 04\n* 375\n* 00\n* D\n* H\n* 60T\n* M\n* CAO\n* N\n* L\n* E\n* V\n* 005\n* A\n* 201\n* 13\n* N\n* 49\n* 10\n* 33\n* 20\n* 9\n* 761\n* 3\n* N\n* 11\n* 01\n* 597\n* D\n* T\n* CAO\n* 542\n* 04\n* 375\n* 00\n\nNote: The image appears to be a label for a piece of equipment or a piece of machinery. The label is in French and appears to be part of a larger system for tracking and cataloging equipment. The label includes a serial number, a model number, a manufacturing date, and other information. The label is likely part of a larger system for tracking and cataloging equipment. The label is in French and appears to be part of a larger system for tracking and cataloging equipment. 
The label includes a serial number, a model number, a manufacturing date, and other information. The label is likely part of a larger system for tracking and cataloging equipment. The label is in French and appears to be part of a\u00a0\u2026 The rest of the text is not legible.",
    "model_used": "llama3.2-vision",
    "processing_time_seconds": 42.33
}


{
    "original_image_path": "input/plaque-aluminium-gravure-paris.jpg",
    "extracted_text": "\nThe text extracted from the image reads as follows:\n\n- DIRECTION de l'ARMEE\n- D\u00e9signation HYDRONEUMATIQUE - 60T\n- Marque et modelle CAO- N20-3C\n- Ann\u00e9e de fabrication 2013\n- N\u00b0 d'immatriculation 4910 33 209 7613\n- N\u00b0 March\u00e9 11 01 597\n- Date 2013\n- CAO EQUIPEMENTS 542 044 375 00070\n- Titulaire 542 044 375 00070\n- N\u00b0 d'ordre LEV005\n- Ann\u00e9e de fabrication 2013\n- N\u00b0 March\u00e9 11 01 597\n- Date 2013\n- CAO EQUIPEMENTS 542 044 375 00070\n- Titulaire 542 044 375 00070\n\nThis label appears to be a part of an industrial or military equipment, specifically a hydraulic cylinder with the model number CAO- N20-3C. The information includes details such as the year of manufacture, registration number, and possibly other identifiers related to the equipment's ownership and specifications.",
    "model_used": "granite3.2-vision",
    "processing_time_seconds": 14.65
}

Test #3

{
    "original_image_path": "input/plaque-machine-mjt.jpg",
    "extracted_text": "The image displays a blue and white label with text in French. The label appears to be for a machine or device, possibly a turbine or generator.\n\n*   **Top Left Corner**\n    *   A logo for \"MJ2 Technologies\" is displayed in white text.\n    *   Below the logo, smaller white text reads \"Conception et Fabrication de Turbines Hydro\u00e9lectriques\".\n    *   The address \"ZA Millau-Larzac - 12230 La Cavalerie (France)\" is listed below the smaller text.\n    *   A website URL \"www.vlh-turbine.com\" is provided.\n*   **Top Right Corner**\n    *   A logo for \"VLH\" is displayed in white text.\n*   **Main Body**\n    *   The main body of the label contains various pieces of information about the machine or device, including:\n        *   **Machine Type**: Groupe turbog\u00e9n\u00e9rateur VLH DN 4000\n        *   **Series Number**: 13 VLH 005 / Le Rondeau\n        *   **Serial Number**: #61\n        *   **Year of Fabrication**: 2013\n        *   **Tension G\u00e9n\u00e9rateur**: 527 VAC\n        *   **Puissance G\u00e9n\u00e9rateur**: 585 kW\n        *   **Chute Nette**: 3.99 m\n        *   **Vitesse Nominale**: 53.8 tr/min\n*   **CE Mark**\n    *   A CE mark is displayed in the bottom-right corner of the label, indicating that the device meets European safety and health standards.\n*   **Background**\n    *   The background of the label is a solid blue color, with a white border around the edges.\n\nOverall, the label appears to be a technical document that provides information about the machine or device, including its type, series number, serial number, year of fabrication, and various technical specifications. The CE mark suggests that the device has been certified to meet European safety and health standards.",
    "model_used": "llama3.2-vision",
    "processing_time_seconds": 34.02
}


{
    "original_image_path": "input/plaque-machine-mjt.jpg",
    "extracted_text": "\n<doc>     MJ2 \n     ConceptionetFabrication \n     ZA Millau-Larzac \n     Machine type N\u00b0deserie: Ann\u00e9edefab: Tensiongener Puissancegener Chutenetr Vitessenominale \n     technologies    tiondeTurbinesHydro\u00e9lectriques 1230LaCavalierie(France)-www.vlh-turbine.com \n     GroupturbogenerateurVLHD \n     13VLH005/LeRondeau \n     cation:   2013    ateur:    VAC \n     585 kW \n     3,99 m \n     Cale:    53,8 tr/min </doc>",
    "model_used": "granite3.2-vision",
    "processing_time_seconds": 15.7
}

Test #4

{
    "original_image_path": "input/Plaque-signaletique-gravure-laser-sur-acrylique-bicouche-scaled-e1639557765712-800x420.jpg",
    "extracted_text": "The image shows two black metal plates with white text on them. The text is in French and appears to be some sort of product or product information.\n\n* The first plate has the following text:\n\t+ \"EMBALLAGE SPECIFIQUE (BOITIER DISTRIBUTION EBMR)\" in large letters\n\t+ \"FASL3 8320371000\" in smaller letters\n\t+ \"Masse : 27kg\" in small letters\n\t+ \"Volume : 0,223m3\" in small letters\n\t+ \"Dim. Ext. : 0,80x0,62x0,45m\" in small letters\n* The second plate has the following text:\n\t+ \"QUE BDE (BOITIER ELECTRONIQUE BDE)\" in large letters\n\t+ \"FASL3 8320371000\" in smaller letters\n\nThe plates appear to be some sort of product or product information, possibly related to a specific product or product line. The text on the plates is in French and includes information such as mass, volume, and dimensions.",
    "model_used": "llama3.2-vision",
    "processing_time_seconds": 25.39
}


{
    "original_image_path": "input/Plaque-signaletique-gravure-laser-sur-acrylique-bicouche-scaled-e1639557765712-800x420.jpg",
    "extracted_text": "\nThe text on the label reads as follows:\n\n- EMBALLAGE SPECIFIQUE BDE (BOITIER DISTRIBUTION ELEC)\n- FASL3 8320371000\n- Masse : 27kg\n- Volume : 0.223m3\n- Dim. Ext.: 0.80x0.62x0.45m\n\nThis label appears to be a specification sheet for an electrical component, likely a ballast or distribution box, given the context provided by the text. The dimensions and mass are specified in metric units, which is common for such components.",
    "model_used": "granite3.2-vision",
    "processing_time_seconds": 10.44
}
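
The extracted text in these files is still free-form. One way to make it usable as data is to pick out the "Label: value" lines that both models tend to produce. This is a minimal sketch under that assumption: `parse_fields` is a hypothetical helper, not something the application above contains, and the regular expressions are tuned only to the output styles seen in these four tests.

```python
import re

# Leading list markers and quotes seen in the model outputs ("- ", "* ", "+ ", '"')
_BULLET = re.compile(r'^[\s\-*+"]+')
# "Label: value" or French-style "Label : value"; the label is at most 40 characters
_FIELD = re.compile(r"^([^:]{1,40}?)\s*:\s*(.+)$")


def parse_fields(extracted_text):
    """Collect 'Label: value' lines into a dict, keeping the first value per label."""
    fields = {}
    for line in extracted_text.splitlines():
        # Drop bullet markers at the start and stray quotes at the end of the line
        line = _BULLET.sub("", line).rstrip('" \t')
        match = _FIELD.match(line)
        if match:
            label = match.group(1).strip()
            value = match.group(2).strip()
            # Repeated labels (as in Test #2) do not overwrite the first value
            fields.setdefault(label, value)
    return fields


if __name__ == "__main__":
    # A few lines copied from the Granite output in Test #1
    sample = "- HYDRO 71\n- Type: GROUPE ÉCHANGEUR\n- Numéro: T16361\n- Débit : 12 L/min"
    print(parse_fields(sample))
```

Lines without a colon, such as the "HYDRO 71" header, are skipped, and a line that packs two fields together (the "Tél ... Fax ..." line) ends up as a single entry, so this is a starting point rather than a complete parser.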

After completing my sample tests, the pattern was consistent: the Granite model was roughly two to three times faster (about 10 to 16 seconds per image, versus about 25 to 42 seconds for Llama), while the Llama model provided more verbose and descriptive output. Both models were able to recognize and process text in French, which is a great bonus, although neither was flawless: Llama's output in Test #2 degenerated into repeated fragments, and Granite's output in Test #3 ran words together. The key takeaway is that the choice of model depends on the use case. You might opt for Granite when speed is the priority, or choose Llama for tasks that require more detailed descriptions. The results from either model can be further enriched with external tools and used to build a Retrieval-Augmented Generation (RAG) system for industrial business use cases.
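
To illustrate how the saved results could feed a RAG system, here is a deliberately naive sketch of the retrieval step. It ranks records by keyword overlap, standing in for the embedding search a real RAG pipeline would normally use. The function names (`tokenize`, `retrieve`, `build_prompt`) and the prompt wording are assumptions made for this example; the records are expected to carry the `extracted_text` key from the JSON files above.

```python
import re


def tokenize(text):
    """Lowercase word tokens; digits are kept so serial numbers stay searchable."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, records, top_k=1):
    """Rank records by how many question tokens appear in their extracted text."""
    question_tokens = tokenize(question)
    scored = []
    for record in records:
        overlap = len(question_tokens & tokenize(record["extracted_text"]))
        if overlap:
            scored.append((overlap, record))
    # Sort on the score only, highest overlap first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:top_k]]


def build_prompt(question, records):
    """Assemble a prompt that grounds the question in the retrieved text."""
    context = "\n\n".join(r["extracted_text"] for r in retrieve(question, records))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    records = [
        {"extracted_text": "Type: GROUPE ÉCHANGEUR\nNuméro: T16361"},
        {"extracted_text": "Masse : 27kg\nVolume : 0.223m3"},
    ]
    print(build_prompt("What is the masse of the box?", records))
```

The resulting prompt could then be sent to a local text model through Ollama. Keyword overlap will miss synonyms and translations, which is exactly the gap an embedding-based retriever is meant to close.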

In addition to these models, tools such as Docling offer powerful enrichment features that can be added to the conversion pipeline. These additional steps allow for processing specific document components, like code blocks or pictures, to extract more comprehensive information. However, these extra steps often require the execution of additional models, which can significantly increase processing time. For this reason, most enrichment models are disabled by default, leaving the user to decide on the optimal balance between speed and data richness.

Picture description
The picture description step lets you annotate a picture with a vision model, a task also known as “captioning”. The Docling pipeline can load and run models completely locally, and it can also connect to remote APIs that support the chat template. Below are a few examples of how to use some common vision models and remote services.

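from docling.datamodel.pipeline_options import PdfPipelineOptions, granite_picture_description

# Enrichment steps are disabled by default, so enable picture description explicitly
pipeline_options = PdfPipelineOptions()
pipeline_options.do_picture_description = True
pipeline_options.picture_description_options = granite_picture_description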

…

from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions

pipeline_options.picture_description_options = PictureDescriptionVlmOptions(
    repo_id="",  # <-- add here the Hugging Face repo_id of your favorite VLM
    prompt="Describe the image in three sentences. Be concise and accurate.",
)

Conclusion

In summary, this experiment demonstrates a pragmatic approach to intelligent document processing. By running vision models locally via Ollama, I was able to get both the raw transcription of OCR and the contextual description of a vision model from a single model call. My testing with Granite and Llama showed that the models have distinct strengths (speed versus descriptive detail), and that both can produce usable results on French-language nameplates. Ultimately, this approach can serve as the foundation for customized, RAG-enabled systems that turn raw images into a searchable, conversational knowledge base for industrial applications.