Image processing and data extraction have become some of the most powerful applications of machine learning. But doing it from scratch is a pain in the a**. The one thing programming taught me that no one else did is not to reinvent the wheel every time and to prioritize getting the job done. With that in mind, here is an easy solution I came across for the problem at hand.
The Problem At Hand
For simplicity's sake, let us consider this to be a page from a book. We want to highlight the word "comment" wherever it occurs. This could be a useful feature for image search engines, directing users' attention to the content they are looking for.
Solution
We are going to be using an OCR (Optical Character Recognition) engine called Tesseract for the image-to-text recognition part. It is free software, released under the Apache License. Install the engine for your desired OS from their official website. I'm using Windows for this. Add the installation path to your environment variables.
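If editing the environment variables is not an option, pytesseract can also be pointed at the executable directly. The sketch below uses only the standard library to look for the binary. The fallback path is an assumption based on a common default install location on Windows and may differ on your machine, and `find_tesseract` is a hypothetical helper name.

```python
import shutil
from pathlib import Path

# Assumed default install location on Windows; adjust to your own setup.
ASSUMED_WINDOWS_PATH = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")


def find_tesseract(fallback=ASSUMED_WINDOWS_PATH):
    """Return the path to the tesseract executable as a string, or None if not found."""
    # shutil.which searches the directories listed in the PATH environment variable
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # otherwise try the assumed fallback location
    if Path(fallback).is_file():
        return str(fallback)
    return None


tesseract_path = find_tesseract()
print("Tesseract found at:", tesseract_path)

# With pytesseract installed, the engine location can then be set explicitly:
# import pytesseract
# pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

If this prints `None`, the engine is either not installed or lives somewhere other than the assumed path.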
Create a Python project with a virtual environment and install the necessary packages.
pip install opencv-python # for image processing
pip install pytesseract # to use the ocr engine in your project
pip install pandas # to conduct search queries
In your main.py, import the necessary libraries and define the necessary variables. Read the image from the source using the imread method. Make a copy of the original image for the overlay. Then extract the text information from the image. It is important to set output_type to a pandas DataFrame, which will ease the filtering process.
import cv2
from pytesseract import pytesseract, Output
ALPHA = 0.4
filename = "devto.png"
query = "comment"
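img = cv2.imread(filename)
# imread returns None instead of raising when the file cannot be read
if img is None:
    raise FileNotFoundError(filename)
# make a copy of the original image for the highlight overlay
overlay = img.copy()
# extract text data from the image as a pandas DataFrame object
# ("ben+eng" needs both the Bengali and English language data installed)
boxes = pytesseract.image_to_data(img, lang="ben+eng", output_type=Output.DATAFRAME)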
The dataframe object returned has the following structure:
level | page_num | block_num | par_num | line_num | word_num | left | top | width | height | conf | text |
---|---|---|---|---|---|---|---|---|---|---|---|
5 | 1 | 5 | 1 | 1 | 4 | 169 | 537 | 99 | 14 | 96.276794 | comments |
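To get a feel for this structure without running the OCR engine, here is a minimal sketch that builds a small dataframe by hand with the same columns. Apart from the "comments" row shown above, all values are made up for illustration. It relies on two properties of Tesseract's TSV output: `level` 5 marks word-level rows, and rows that are not words carry a `conf` of -1. The confidence threshold of 60 is an arbitrary choice.

```python
import pandas as pd

# Hand-built stand-in for the image_to_data output; values are illustrative.
sample = pd.DataFrame(
    {
        "level": [4, 5, 5, 5],
        "page_num": [1, 1, 1, 1],
        "block_num": [5, 5, 5, 5],
        "par_num": [1, 1, 1, 1],
        "line_num": [1, 1, 1, 1],
        "word_num": [0, 1, 2, 4],
        "left": [20, 20, 80, 169],
        "top": [537, 537, 537, 537],
        "width": [248, 50, 70, 99],
        "height": [14, 14, 14, 14],
        "conf": [-1.0, 91.5, 42.0, 96.276794],
        "text": [None, "Top", "rated", "comments"],
    }
)

# keep only word-level rows (level 5); the line-level row has no text
words = sample[sample["level"] == 5]

# optionally discard low-confidence words; 60 is an arbitrary threshold
confident = words[words["conf"] >= 60]

print(confident[["text", "left", "top", "width", "height"]])
```

Filtering on confidence is optional, but it can help keep OCR noise out of the search results.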
We are only interested in the text, left, top, width, and height columns. We need to prepare the dataframe for this specific job by applying a few filters. Drop the rows that have NaN or an empty string in the text column: NaN values would break the boolean filtering below, and empty rows only add useless work. The text column usually contains single words, so we can then filter the rows to keep only those whose text matches our query string.
# drop rows that have NaN values in the text column
boxes = boxes.dropna(subset=["text"])
# remove rows whose text is empty or whitespace only
boxes = boxes[boxes["text"].astype(str).str.strip().str.len() > 0]
# keep only the rows whose text contains the query (case-insensitive, literal match)
boxes = boxes[boxes["text"].astype(str).str.contains(query.strip(), case=False, regex=False)]
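Keep in mind that `str.contains` does substring matching, so the query "comment" also matches "comments" or "commentary". If only whole words should be highlighted, one possible variation is to compare normalized words instead. The following is a sketch on a hand-made dataframe; `whole_word_matches` is a hypothetical helper, not part of pandas or pytesseract.

```python
import string

import pandas as pd


def whole_word_matches(boxes, query):
    """Return rows whose text equals the query, ignoring case and surrounding punctuation."""
    normalized = (
        boxes["text"]
        .astype(str)
        .str.strip(string.punctuation + string.whitespace)
        .str.lower()
    )
    return boxes[normalized == query.strip().lower()]


# illustrative data, as it might look after the NaN and empty rows were dropped
boxes = pd.DataFrame(
    {
        "text": ["Comment,", "comments", "commentary", "(comment)"],
        "left": [10, 60, 130, 220],
    }
)

print(whole_word_matches(boxes, "comment"))
```

Whether substring or whole-word matching is the better fit depends on the use case; for a search feature, the more forgiving substring match is often what users expect.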
Now we can get started with the highlighting part. We will draw filled yellow rectangles over the matched positions on the overlay and then blend the overlay with the original image.
for _, box in boxes.iterrows():
    # cast to plain ints, which cv2.rectangle expects for pixel coordinates
    left = int(box["left"])
    top = int(box["top"])
    width = int(box["width"])
    height = int(box["height"])
    # draw a filled yellow rectangle (BGR) over the matched text
    cv2.rectangle(
        overlay,
        (left, top),
        (left + width, top + height),
        (0, 255, 255),
        -1,
    )
# blend the overlay with the original image to make the highlight translucent
img_new = cv2.addWeighted(overlay, ALPHA, img, 1 - ALPHA, 0)
# scale the result to a width of 1000 px, keeping the aspect ratio
r = 1000.0 / img_new.shape[1]
dim = (1000, int(img_new.shape[0] * r))
resized = cv2.resize(img_new, dim, interpolation=cv2.INTER_AREA)
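The translucent look comes from the weighted sum that `cv2.addWeighted` computes for every pixel: `overlay * ALPHA + img * (1 - ALPHA) + 0`. As a rough illustration, the same arithmetic can be reproduced with NumPy alone on a tiny made-up image. OpenCV stores colors in BGR order, which is why yellow is `(0, 255, 255)`.

```python
import numpy as np

ALPHA = 0.4

# a 2x2 white "page" in BGR order
img = np.full((2, 2, 3), 255, dtype=np.uint8)

# copy it and paint the top-left pixel yellow, like the filled rectangle does
overlay = img.copy()
overlay[0, 0] = (0, 255, 255)

# same weighted sum as addWeighted, done in float and converted back to 8-bit
blended = overlay.astype(np.float64) * ALPHA + img.astype(np.float64) * (1 - ALPHA)
blended = np.clip(np.rint(blended), 0, 255).astype(np.uint8)

print(blended[0, 0])  # highlighted pixel: pale yellow
print(blended[1, 1])  # untouched pixel: still white
```

A larger ALPHA makes the highlight more opaque, while a smaller one lets more of the original text show through.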
Show the modified image using OpenCV's imshow method.
cv2.imshow("Highlighted", resized)
cv2.waitKey(0)
cv2.destroyAllWindows()
The result is the modified image with every occurrence of "comment" highlighted in yellow.
Bonus Tip
The search mechanism above treats the whole query as one string, and since each row of the dataframe usually holds a single word, it can only find single-word queries. If we want to highlight several words that do not appear next to each other, we just need to filter the dataframe with a little bit of pandas magic.
+ from pandas import concat
- boxes[boxes["text"].str.contains(query.strip(), case=False)]
+ boxes = concat(
+     [
+         boxes[boxes["text"].str.contains(word.strip(), case=False)]
+         for word in query.split()
+     ]
+ )
With this, the user can query "essential comments" and it will highlight "essential" and "comments" even though they are not next to each other.
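An alternative sketch for the same idea: instead of concatenating one filtered dataframe per word, the query words can be joined into a single regular expression. This avoids duplicate rows when one OCR word matches several query words, and `re.escape` keeps characters such as `+` or `.` from being read as regex syntax. The sample dataframe and the helper name `match_any_word` are made up for illustration.

```python
import re

import pandas as pd


def match_any_word(boxes, query):
    """Return rows whose text contains at least one of the words in the query."""
    words = query.split()
    if not words:
        # an empty query matches nothing
        return boxes.iloc[0:0]
    # escape each word so it is matched literally, then join with "or"
    pattern = "|".join(re.escape(word) for word in words)
    return boxes[boxes["text"].str.contains(pattern, case=False, regex=True)]


boxes = pd.DataFrame(
    {
        "text": ["Essential", "tips", "and", "comments", "C++"],
        "left": [10, 90, 130, 170, 260],
    }
)

print(match_any_word(boxes, "essential comments"))
```

Both approaches highlight the same words; the single-pattern version just scans the text column once and keeps the rows in their original order.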