INTRODUCTION:
Optical Character Recognition (OCR) is a technology that converts images containing text into machine-readable digital text. In this project, I implemented a Tamil OCR system using Python and the Tesseract OCR engine. The goal was to test how accurately the system detects text from two different sources:
Handwritten text on white paper
Printed text from a newspaper
This blog explains the complete setup process and how the system works.
Part 1: Installing Python
Step 1: Download Python
First, download Python from the official website:
https://www.python.org/downloads/
While installing, it is very important to check the box:
“Add Python to PATH”
This allows Python to be accessed from the Command Prompt.
After installation, verify it by opening Command Prompt and typing:
python --version
If Python is installed correctly, it will display the installed version number.
Part 2: Installing Tesseract OCR
Python alone cannot perform OCR. We also need an OCR engine, and this project uses Tesseract.
Step 2: Download Tesseract for Windows
Download the Windows installer from:
https://github.com/UB-Mannheim/tesseract/wiki
Install it in the default location:
C:\Program Files\Tesseract-OCR
After installation, verify it by typing in Command Prompt:
tesseract --version
If the version details are displayed, it means Tesseract is installed correctly.
Part 3: Adding Tamil Language Support
To detect Tamil text, we must ensure that the Tamil trained data file is available.
Go to:
C:\Program Files\Tesseract-OCR\tessdata
Check if the file:
tam.traineddata
exists.
If not, download it from:
https://github.com/tesseract-ocr/tessdata
and place it inside the tessdata folder.
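The manual check above can also be done from Python. The following is a minimal sketch that assumes the default install location mentioned earlier; the function name `has_language` is only illustrative and is not part of Tesseract or pytesseract.

```python
from pathlib import Path

# Default install location used earlier in this post; adjust if you installed elsewhere.
DEFAULT_TESSDATA = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def has_language(lang_code, tessdata_dir=DEFAULT_TESSDATA):
    """Return True if <lang_code>.traineddata exists in tessdata_dir."""
    return (Path(tessdata_dir) / f"{lang_code}.traineddata").is_file()


if __name__ == "__main__":
    if has_language("tam"):
        print("Tamil trained data found.")
    else:
        print("tam.traineddata is missing; copy it into", DEFAULT_TESSDATA)
```

If the file is reported missing even though you downloaded it, check that it was saved with the exact name tam.traineddata and not with an extra extension added by the browser.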
Part 4: Installing Required Python Libraries
Open Command Prompt and install the required libraries:
pip install pytesseract opencv-python pillow
These libraries are used for:
pytesseract → Connecting Python with Tesseract
opencv-python → Image processing
pillow → Image handling
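To confirm that the installation worked, a small check like the one below may help. It is only a sketch: the helper name `missing_packages` is made up for this post. Note that two of the libraries are imported under a different name than the one used with pip (opencv-python is imported as cv2, and pillow as PIL).

```python
import importlib.util

# pip package name -> name used in "import ..." (they differ for two of the three)
REQUIRED = {
    "pytesseract": "pytesseract",
    "opencv-python": "cv2",
    "pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing:", ", ".join(missing))
        print("Install with: pip install", " ".join(missing))
    else:
        print("All required libraries are installed.")
```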
Part 5: Project Setup
Create a project folder named:
OCR_Project
Inside the folder, create:
ocr_test.py (Python file)
test.jpg (Input image)
Part 6: Python OCR Code
Below is the Python code used for Tamil text detection:
import cv2
import pytesseract

# Specify the Tesseract path
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image (cv2.imread returns None if the file cannot be read)
img = cv2.imread("test.jpg")
if img is None:
    raise FileNotFoundError("Could not read test.jpg")

# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Apply thresholding: pixels brighter than 150 become white (255), the rest black (0)
_, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

# Perform OCR in Tamil
text = pytesseract.image_to_string(thresh, lang="tam")
print("Detected Text:")
print(text)
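The script hard-codes the default install path for Tesseract. If Tesseract was installed somewhere else, or is already on PATH, a lookup like the following could be used instead. This is a sketch under those assumptions, and `find_tesseract` is a hypothetical helper, not a pytesseract function.

```python
import os
import shutil

# Default Windows install location from Part 2 of this post.
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(name="tesseract", fallbacks=(DEFAULT_TESSERACT,)):
    """Return a path to the Tesseract executable, or None if it is not found.

    Looks on PATH first, then tries each fallback path in order.
    """
    on_path = shutil.which(name)
    if on_path:
        return on_path
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


# Possible usage inside ocr_test.py:
#     cmd = find_tesseract()
#     if cmd is None:
#         raise RuntimeError("Tesseract not found; install it or fix the path")
#     pytesseract.pytesseract.tesseract_cmd = cmd
```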
Part 7: Running the Program
Navigate to the project folder in Command Prompt:
cd Desktop\OCR_Project
Run the program:
python ocr_test.py
The detected Tamil text will be printed in the console.
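Depending on the console font and code page, Command Prompt may show Tamil characters as boxes or question marks even when the OCR result is correct. One possible workaround is to also write the result to a UTF-8 text file and open it in an editor. The sketch below assumes a file name of output.txt, which is my own choice and not part of the original project.

```python
from pathlib import Path


def save_text(text, path="output.txt"):
    """Write OCR output to a UTF-8 text file and return the path written."""
    out = Path(path)
    out.write_text(text, encoding="utf-8")
    return out


# Possible usage at the end of ocr_test.py:
#     save_text(text)
```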
HOW THE OCR SYSTEM WORKS INTERNALLY:
The system follows these steps:
The image is loaded.
It is converted to grayscale to simplify processing.
Thresholding is applied to separate text from background.
Tesseract detects text regions.
The Tamil language model recognizes characters.
The final detected text is returned as output.
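To make the grayscale and thresholding steps concrete, here is a tiny pure-Python illustration on a made-up 2x3 "image". It is not how OpenCV is implemented internally; it only mirrors the arithmetic: a weighted sum of the color channels for grayscale, and "brighter than the threshold becomes white, everything else becomes black" for binary thresholding. The text detection and recognition steps happen inside Tesseract and are not shown.

```python
def to_gray(pixel_bgr):
    """Grayscale value for one (B, G, R) pixel, using the standard
    luma weights 0.299 R + 0.587 G + 0.114 B."""
    b, g, r = pixel_bgr
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def binarize(gray_rows, thresh=150, maxval=255):
    """Binary threshold: values above thresh become maxval, the rest 0."""
    return [[maxval if v > thresh else 0 for v in row] for row in gray_rows]


# A made-up 2x3 image in BGR order: light "paper" pixels and dark "ink" pixels.
image = [
    [(250, 250, 250), (30, 30, 30), (240, 245, 250)],
    [(20, 25, 30), (200, 210, 220), (90, 100, 110)],
]
gray = [[to_gray(p) for p in row] for row in image]
print(binarize(gray))  # [[255, 0, 255], [0, 255, 0]]
```

With a fixed threshold of 150, a photo taken in dim light could end up almost entirely black, so the threshold value may need tuning for each image.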
Accuracy Testing: White Paper vs Newspaper
WHITE PAPER TEST:
Clean background
Clear handwriting
Good contrast
Result:
Accuracy is usually high (around 80–95%) because the text is clearly separated from the background.
NEWSPAPER TEST:
Small font size
Multiple columns
Images and advertisements
Background noise
Result:
Accuracy decreases (around 60–80%) because of the complex layout and background noise.
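This post does not fix a formula for accuracy, so as one possible way to put a number on it, the sketch below compares the detected text with a hand-typed correct version and computes a character-level accuracy from the edit distance (the number of insertions, deletions, and substitutions needed to turn one string into the other). The function names are illustrative, and this is an assumption about how accuracy could be measured, not necessarily how the percentages above were obtained.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(expected, detected):
    """Accuracy in percent: 100 * (1 - edit_distance / len(expected)), floored at 0."""
    if not expected:
        raise ValueError("expected text must not be empty")
    errors = edit_distance(expected, detected)
    return max(0.0, 100.0 * (1 - errors / len(expected)))


# Possible usage, with the correct text typed by hand:
#     print(char_accuracy(correct_text, text))
```

Note that this counts Unicode code points, so a Tamil letter with a vowel sign counts as more than one character. Stripping extra whitespace from both strings before comparing usually gives a fairer score.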