Many businesses (including my own) suffer from inefficient processes, such as manual data processing. These issues can be solved through automation, using structured systems such as a CRM and custom tools. Over the years I've dealt with complex environments that require a lot of data processing, analysis and reporting. And "data" can mean anything that's digital.
Some time ago I worked with a client who had thousands of unstructured documents that had piled up over the years. This had become a very unproductive environment, especially when information had to be retrieved and couldn't be found efficiently. Fortunately, technology can help us. OCR stands for Optical Character Recognition, a field of computer vision and machine learning focused on extracting text from images/pictures.
Suppose you have hundreds of files, and most of them are copies of passports, contracts and invoices. Some images were taken with a phone, some were scanned, and some are PDF files containing text and/or images. The demo screenshots below illustrate how we can extract text/keywords from these kinds of documents.
Using the extracted text/keywords, we can process these files according to our own business rules (rename, copy, move, back up), or send/upload them to another pipeline for further processing. Keep in mind that OCR is pretty good but not perfect: it works best when images are clear and don't contain unusual characters. Most languages are supported.
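Because OCR output is not perfect, an exact keyword check can miss a document when a single character is misread. Here is a minimal sketch of a more tolerant check, using only Python's standard `difflib` module. The helper names `normalize` and `contains_keyword` and the `0.8` cutoff are my own assumptions for illustration; they are not part of the OCR library.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Upper-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[A-Z0-9]+", text.upper())


def contains_keyword(text: str, keyword: str, cutoff: float = 0.8) -> bool:
    """Return True if any token in `text` is similar enough to `keyword`.

    `cutoff` is a similarity ratio between 0 and 1. The default of 0.8 is
    an arbitrary starting point (an assumption), not a tuned value.
    """
    tokens = normalize(text)
    matches = difflib.get_close_matches(keyword.upper(), tokens, n=1, cutoff=cutoff)
    return bool(matches)


# Simulated OCR output where the letter "O" was misread as the digit "0".
sample = "Service C0NTRACT between the parties"
print(contains_keyword(sample, "CONTRACT"))
print(contains_keyword(sample, "INVOICE"))
```

A lower cutoff tolerates more misread characters but also raises the chance of false matches, so it is worth checking against a few of your own documents.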
# Basic usage of our OCR library
import ocr

your_file = './demo_files/doc1.pdf'
text = ocr.process(your_file)

# your business rules
if 'CONTRACT' in text:
    ...
else:
    ...
It's as easy as that: you only need basic Python knowledge to get started. For more information, visit our Git repository.
https://github.com/healzer/PyCRM
The "PyCRM" project is a collection of useful tools, tips and tricks for your business. These can be used in almost any industry that has some digital processes: managing clients/data, data extraction & analysis, reports, process automation, etc.
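The business rules mentioned earlier (rename/copy/move) could be sketched roughly as follows. This is only an illustration under assumptions: the keyword-to-folder table, the `unsorted` fallback, and the helper names `pick_folder` and `route_file` are hypothetical and not taken from the PyCRM repository. The extracted text is passed in as a plain string, so the sketch does not depend on any particular OCR call.

```python
import shutil
from pathlib import Path

# Hypothetical rules: keyword found in the extracted text -> destination folder.
RULES = {
    "CONTRACT": "contracts",
    "INVOICE": "invoices",
    "PASSPORT": "passports",
}


def pick_folder(text: str, rules: dict = RULES, default: str = "unsorted") -> str:
    """Return the folder of the first rule whose keyword appears in the text."""
    upper = text.upper()
    for keyword, folder in rules.items():
        if keyword in upper:
            return folder
    return default


def route_file(path: Path, text: str, dest_root: Path) -> Path:
    """Move `path` into the folder chosen for its extracted text."""
    target_dir = dest_root / pick_folder(text)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / path.name
    shutil.move(str(path), str(target))
    return target
```

With the `text` value from the basic usage example, a call might look like `route_file(Path(your_file), text, Path('./sorted'))`, where `./sorted` is a made-up destination. The sketch does not handle name collisions in the target folder, so a real setup would want a check before moving.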
Top comments (11)
How good is this with grocery receipts? I have been planning to set up a system to track my grocery receipts so I can explore my spending habits.
Hey, could you email me a few samples of your grocery receipts? I'll run them through the system for you :)
I don't have any with me right now. I usually throw them away right at the store. I'll collect a few and send them to you next time.
PixLab offers passport and ID card scanning using a state-of-the-art PP-OCR algorithm via its DOCSCAN REST API endpoint. You can find more information at blog.pixlab.io/2020/06/passport-do...
Good stuff. However, I think you should hide the passport details.
It's a random passport image I found on Google, so no harm I guess.
Thought as much.
Interesting Stuff
Does it collect the image and all the text from a passport and a local ID card?
Eeeeek don't provide your passport photo online!
Identity theft is a big issue.
Saw it's someone else's... probably still not good to spread it around? Otherwise, interesting article and cool tech.