How to Run a Local Model for Text Recognition in Images

Want to extract text from images without relying on cloud services?

You can run a powerful optical character recognition (OCR) pipeline right on your own computer. This local approach gives you full control over the process and keeps your data private. In this article, we'll walk you through setting up an open-source vision model and using it as a local OCR engine. You'll learn how to install the necessary tools, pull a pre-trained model, and process images to recognize their text. Whether you're working on a personal project or developing an application, this guide will help you get started with local text recognition quickly and easily.

This guide uses Windows 11, the Ollama model runner, the Llama 3.2 Vision model, and Python. Let's get started!

1. Install Ollama

First, head to https://ollama.com/download. Download the installer (about 768 MB at the time of writing) and run it to install Ollama.
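
To confirm everything is working, you can check that the Ollama background service is responding. Here's a minimal sketch using only Python's standard library; it assumes Ollama is listening on its default port, 11434:

# Quick check that the local Ollama service is up.
# By default it listens on http://localhost:11434.
from urllib.request import urlopen

with urlopen("http://localhost:11434") as resp:
    print(resp.read().decode())  # expected output: "Ollama is running"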

2. Pull the Llama 3.2 Vision Model

Open your command prompt or terminal. We'll download the Llama 3.2 Vision model using Ollama. You have two size options.

The 11B and 90B labels refer to the size of the Llama 3.2 Vision models, that is, the number of parameters in each:

  • 11B model: This is the smaller version with 11 billion parameters.
  • 90B model: This is the larger version with 90 billion parameters.

Both models are designed for multimodal tasks, capable of processing both text and images. They excel in various applications such as:

  • Document-level understanding
  • Chart and graph analysis
  • Image captioning
  • Visual grounding
  • Visual question answering

The choice between the 11B and 90B models depends on the specific use case, available computational resources, and the desired level of performance for complex visual reasoning tasks.

For the smaller model (11B), which needs at least 8 GB of VRAM (video memory):

ollama pull llama3.2-vision:11b

For the larger model (90B), which needs a whopping 64 GB of VRAM:

ollama pull llama3.2-vision:90b

For home use, running the 90B model locally is extremely challenging due to its massive hardware requirements.
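
If you're not sure how much VRAM your graphics card has, you can query it before pulling a model. The sketch below assumes an NVIDIA GPU with the nvidia-smi utility available on your PATH:

# Print the GPU's total VRAM so you can pick an appropriate model size.
# Assumes an NVIDIA GPU with nvidia-smi on the PATH.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print("Total VRAM:", out.stdout.strip())  # e.g. "12288 MiB" on an RTX 3060 12G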

3. Run the Model

Once the model is downloaded, run it locally with:

ollama run llama3.2-vision
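The ollama run command opens an interactive chat session, which is handy for a quick test. Once the model is pulled, the Ollama background service can also answer requests from code. As a minimal sketch, the official ollama Python package (installed with pip install ollama, assumed here) lets you send an image to the model directly:

# Ask the locally running model about an image using the official
# `ollama` Python package (pip install ollama).
import ollama

response = ollama.chat(
    model="llama3.2-vision:11b",
    messages=[{
        "role": "user",
        "content": "What text appears in this image?",
        "images": ["./your_image.jpg"],  # local path to your image file
    }],
)
print(response["message"]["content"])

The ollama-ocr library we install in the next step wraps this kind of call behind a simpler OCR-focused interface.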

4. Install ollama-ocr

To easily process images, we'll use the ollama-ocr Python library. Install it using pip:

pip install ollama-ocr

5. Python Code for OCR

Here's the Python code to recognize text in an image:

from ollama_ocr import OCRProcessor

# Point the processor at the locally pulled Llama 3.2 Vision model
ocr = OCRProcessor(model_name='llama3.2-vision:11b')

# Send the image to the model and return the recognized text
result = ocr.process_image(
    image_path="./your_image.jpg",  # path to the image you want to process
    format_type="text"              # ask for plain-text output
)
print(result)
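Plain text is just one output option. At the time of writing, the library's README also documents format_type values such as "markdown" and "json", plus a custom_prompt argument for steering the model; treat the snippet below as a sketch and verify the parameters against the version you installed:

# Sketch: request Markdown output and steer the model with a custom prompt.
# Both parameters follow the ollama-ocr README; check your installed version.
from ollama_ocr import OCRProcessor

ocr = OCRProcessor(model_name='llama3.2-vision:11b')

result = ocr.process_image(
    image_path="./your_image.jpg",
    format_type="markdown",
    custom_prompt="Extract only the text, preserving headings and lists.",
)
print(result)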

6. Run the Code

Replace "./your_image.jpg" with the actual path to your image file. Save the code as a .py file (e.g., ocr_script.py). Run the script from your command prompt:

python ocr_script.py

The script will send the image to your locally running Llama 3.2 Vision model, and the recognized text will be printed in your terminal.
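
If you want to reproduce the timing measurements in the next section, wrapping the call in a simple timer is enough. A minimal sketch using Python's standard library:

# Time a single OCR pass.
import time
from ollama_ocr import OCRProcessor

ocr = OCRProcessor(model_name='llama3.2-vision:11b')

start = time.perf_counter()
result = ocr.process_image(image_path="./your_image.jpg", format_type="text")
elapsed = time.perf_counter() - start

print(result)
print(f"Processing took {elapsed:.2f} seconds")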

To complement our guide on using Llama 3.2 Vision locally, we conducted performance tests on a home desktop computer. Here are the results:

Performance Test Results

We ran the Llama 3.2 Vision 11B model on a home desktop with the following specifications:

  • Processor: 13th Gen Intel(R) Core(TM) i7-13700K
  • Graphics Card: Gigabyte RTX 3060 Gaming OC 12G
  • RAM: 64.0 GB DDR4
  • Operating System: Windows 11 Pro 24H2

Image for Testing

For testing, we chose this amusing image.

[Test image: a meme used to test the locally run Llama 3.2 Vision 11B model]

Test Output

Using our Python script, we tasked the model with recognizing text in an image using the standard system prompt. After running the script multiple times on a single test image, we observed processing times ranging from 16.78 to 47.23 seconds. It's worth noting that these results were achieved with the graphics card running at default settings, without any additional tuning or optimizations.

The image is a black-and-white meme featuring two panels with stick figures and speech bubbles.

**Panel 1:**
In the first panel, a stick figure on the left side of the image has its arms outstretched towards another stick figure in the center. The central figure holds a large circle labeled "WEEKEND" in bold white letters. The stick figure on the right side of the image is partially cut off by the edge of the frame.

**Panel 2:**
In the second panel, the same two stick figures are depicted again. However, this time, the central figure now holds a smaller circle labeled "MONDAY" instead of "WEEKEND." The stick figure on the left side of the image has its arms outstretched towards the central figure once more.

**Text and Labels:**
The text in both panels is presented in white letters with bold outlines. In the first panel, the labels read:

* "ME" (on the stick figure's chest)
* "WEEKEND" (inside the large circle)

In the second panel, the labels are:

* "MONDAY" (inside the smaller circle)
* "ME" (on the stick figure's chest)

**Overall:**
The meme humorously portrays the anticipation and excitement of approaching the weekend, as well as the disappointment that follows when it arrives. The use of simple yet expressive stick figures and speech bubbles effectively conveys this sentiment in a relatable and entertaining manner.

Conclusion

That's it! You're now running a local image text recognition system using Ollama and Python. Remember to experiment with different images and adjust your approach as needed for best results.

You can find the scripts referenced in this article in the repository at https://github.com/karavanjo/dev-content/tree/main/llama-local-run.

A video demonstration of the model running is available at the link.
