Yuvan Shankar

Posted on Feb 12

Implementing Tamil OCR Using Python and Tesseract

#programming #python #beginners #career

INTRODUCTION:

Optical Character Recognition (OCR) is a technology that converts images containing text into machine-readable digital text. In this project, I implemented a Tamil OCR system using Python and the Tesseract OCR engine. The goal was to test how accurately the system detects text from two different sources:
Handwritten text on white paper
Printed text from a newspaper
This post explains the complete setup process and how the system works.

Part 1: Installing Python
Step 1: Download Python

First, download Python from the official website:
https://www.python.org/downloads/

While installing, it is very important to check the box:
“Add Python to PATH”
This allows Python to be accessed from the Command Prompt.
After installation, verify it by opening Command Prompt and typing:

python --version

If Python is installed correctly, it will display the installed version number.

Part 2: Installing Tesseract OCR
Python cannot perform OCR on its own; it needs an OCR engine. This project uses Tesseract.

Step 2: Download Tesseract for Windows
Download the Windows installer from:
https://github.com/UB-Mannheim/tesseract/wiki

Install it in the default location:

C:\Program Files\Tesseract-OCR
After installation, verify it by typing in Command Prompt:

tesseract --version

If the version details are displayed, it means Tesseract is installed correctly.
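
The same check can also be done from Python. The sketch below looks for the Tesseract executable on PATH and falls back to the default install location used in this post. The helper name `locate_executable` is hypothetical (it is not part of pytesseract) and it relies only on the standard library.

```python
import shutil
from pathlib import Path
from typing import Optional

# Default install location from this post (Windows installer).
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def locate_executable(name: str, fallback: str) -> Optional[str]:
    """Return the path to `name` if it is on PATH, else `fallback` if that file exists."""
    found = shutil.which(name)
    if found:
        return found
    if Path(fallback).is_file():
        return fallback
    return None


path = locate_executable("tesseract", DEFAULT_TESSERACT)
print(path if path else "Tesseract not found; check the installation path.")
```

If this prints a path, that path is a reasonable value to assign to `pytesseract.pytesseract.tesseract_cmd`.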

Part 3: Adding Tamil Language Support

To detect Tamil text, we must ensure that the Tamil trained data file is available.

Go to:

C:\Program Files\Tesseract-OCR\tessdata

Check whether the following file exists:

tam.traineddata

If not, download it from:

https://github.com/tesseract-ocr/tessdata

and place it inside the tessdata folder.
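
To confirm the file is in place without browsing the folder, a small check like the one below may help. It is a minimal sketch that assumes the default tessdata location from this post; the function name `has_language_data` is made up for illustration.

```python
from pathlib import Path

# Default tessdata folder from this post; adjust if Tesseract is installed elsewhere.
TESSDATA_DIR = r"C:\Program Files\Tesseract-OCR\tessdata"


def has_language_data(tessdata_dir: str, lang: str = "tam") -> bool:
    """Return True if <lang>.traineddata exists inside tessdata_dir."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


if has_language_data(TESSDATA_DIR):
    print("Tamil language data found.")
else:
    print("tam.traineddata is missing from the tessdata folder.")
```

From the command line, `tesseract --list-langs` prints the languages Tesseract can see, and `tam` should appear in that list.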

Part 4: Installing Required Python Libraries

Open Command Prompt and install the required libraries:

pip install pytesseract opencv-python pillow

These libraries are used for:
pytesseract → Connecting Python with Tesseract
opencv-python → Image processing
pillow → Image handling
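
To verify the installation, the following sketch checks whether each library can be found by Python. Note that the import names differ from the pip names for two of them. The helper `missing_packages` is a hypothetical name used only for this example.

```python
import importlib.util

# Import names differ from the pip package names:
# opencv-python is imported as cv2, and pillow as PIL.
REQUIRED = {"pytesseract": "pytesseract", "opencv-python": "cv2", "pillow": "PIL"}


def missing_packages(required: dict) -> list:
    """Return the pip names of packages whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages(REQUIRED)
print("All libraries installed." if not missing else f"Missing: {', '.join(missing)}")
```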

Part 5: Project Setup
Create a project folder named:

OCR_Project
Inside the folder, create:
ocr_test.py (Python file)
test.jpg (Input image)

Part 6: Python OCR Code
Below is the Python code used for Tamil text detection:

import cv2
import pytesseract

# Specify the Tesseract path
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image (cv2.imread returns None if the file cannot be read)
img = cv2.imread("test.jpg")
if img is None:
    raise FileNotFoundError("Could not read test.jpg")

# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Apply thresholding: pixels brighter than 150 become white, the rest black
_, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

# Perform OCR in Tamil
text = pytesseract.image_to_string(thresh, lang='tam')

print("Detected Text:")
print(text)

Part 7: Running the Program
Navigate to the project folder in Command Prompt:

cd Desktop\OCR_Project
Run the program:

python ocr_test.py
The detected Tamil text will be printed in the console.
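
Depending on the console font and code page, Command Prompt may not display Tamil characters correctly. One possible workaround, sketched here as an optional addition rather than part of the original script, is to write the result to a UTF-8 text file and open it in an editor. The function name `save_text` and the file name `output.txt` are assumptions for this example.

```python
from pathlib import Path


def save_text(text: str, out_path: str) -> None:
    """Write OCR output to a UTF-8 file so Tamil characters are preserved."""
    Path(out_path).write_text(text, encoding="utf-8")


# In ocr_test.py this could be called after image_to_string, for example:
# save_text(text, "output.txt")
```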

HOW THE OCR SYSTEM WORKS INTERNALLY:

The system follows these steps:
The image is loaded.
It is converted to grayscale to simplify processing.
Thresholding is applied to separate text from background.
Tesseract detects text regions.
The Tamil language model recognizes characters.
The final detected text is returned as output.
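
To make the grayscale and thresholding steps concrete, here is a pure-Python sketch on a tiny hand-made image. It approximates what `cv2.cvtColor` and `cv2.threshold` do in the script, using the standard luminance weights and the same threshold of 150. The function names `to_gray` and `binarize` are illustrative only, and this is not how OpenCV is implemented internally.

```python
def to_gray(pixel_bgr):
    """Convert one (B, G, R) pixel to a grayscale intensity (0-255)."""
    b, g, r = pixel_bgr
    return round(0.114 * b + 0.587 * g + 0.299 * r)


def binarize(gray_rows, threshold=150):
    """Map intensities above the threshold to 255 (white), the rest to 0 (black)."""
    return [[255 if value > threshold else 0 for value in row] for row in gray_rows]


# A tiny 2x3 "image": dark ink pixels on a light paper background.
image = [
    [(250, 250, 250), (20, 20, 20), (240, 240, 240)],
    [(30, 30, 30), (245, 245, 245), (25, 25, 25)],
]
gray = [[to_gray(p) for p in row] for row in image]
print(binarize(gray))  # [[255, 0, 255], [0, 255, 0]]
```

After thresholding, every pixel is either pure black or pure white, which gives Tesseract a cleaner separation between ink and paper.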

Accuracy Testing: White Paper vs Newspaper

WHITE PAPER TEST:

Clean background
Clear handwriting
Good contrast
Result:
Accuracy is usually high (around 80–95%) because the text is clearly separated from the background.

NEWSPAPER TEST:

Small font size
Multiple columns
Images and advertisements
Background noise
Result:
Accuracy decreases (around 60–80%) because of complex layout and noise.
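
The percentages above are approximate, and this post does not define exactly how they were measured. One common way to quantify OCR accuracy, shown here as an assumption and not necessarily the method used for these tests, is character-level accuracy based on edit distance between the expected text and the detected text. The function names below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_accuracy(expected: str, detected: str) -> float:
    """Return accuracy in percent: 100 * (1 - distance / len(expected)), floored at 0."""
    if not expected:
        return 100.0 if not detected else 0.0
    errors = edit_distance(expected, detected)
    return max(0.0, 100.0 * (1 - errors / len(expected)))


print(char_accuracy("abcd", "abxd"))  # 75.0
```

To use it, type out the correct text for the test image by hand and pass it as `expected`, with the OCR output as `detected`. Because the comparison works on code points, a Tamil letter with a vowel sign counts as more than one character.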
