
Zeeshan Ahmad


Scanned Documents Classification using Machine Learning

First off, I am a web developer who has recently started exploring the machine learning domain.

So I am looking for some help, starting points, or guidelines on how to implement a machine-learning-based classifier for scanned documents/images that predicts which of 29 categories a document falls into.

The documents are mostly letters, memos and reports (some containing tabular data). So far, I have found Tesseract OCR and OpenCV, which I think will be the tools needed for this task. I also think I will need some kind of NLP techniques to extract the meaning of the text and improve the predictions. However, it would be great if someone could break down the strategy and route to take in simple terms. What are some of the specific techniques/skills/tools/packages I need to learn? Since the scanned images are of varying quality, what image processing techniques can I employ to get the best results?
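To make the question concrete, here is a rough sketch of the kind of pipeline I have in mind: clean up the scan with OpenCV, pull the text out with Tesseract, then classify the text. It is only an illustration under assumptions, not a recommended design. The function and class names (ocr_text, tokenize, NaiveBayesTextClassifier), the three toy categories, and the training sentences are all made up for the example, and the OCR step assumes the opencv-python and pytesseract packages plus the Tesseract binary are installed. The classifier is a plain-Python Naive Bayes stand-in for whatever NLP model ends up being used.

```python
import math
import re
from collections import Counter, defaultdict


def ocr_text(image_path):
    """Clean up a scan with OpenCV, then extract its text with Tesseract.

    Assumption: opencv-python, pytesseract and the Tesseract binary are
    installed. They are imported lazily so the classifier below still
    runs without them.
    """
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 3)  # suppress speckle noise from the scanner
    # Otsu's method picks a global black/white threshold automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)


def tokenize(text):
    """Lowercase the text and keep alphabetic words of 2+ letters."""
    return re.findall(r"[a-z]{2,}", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: in practice these strings would come from
# ocr_text() applied to labelled scans, with one label per category.
train_texts = [
    "dear sir thank you for your letter regards",
    "memo to all staff from management subject meeting",
    "quarterly report table total revenue summary",
]
train_labels = ["letter", "memo", "report"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
predicted = clf.predict("Memo to staff, subject: budget")
```

With real data, the call would look like clf.predict(ocr_text("scan.png")), where scan.png is a placeholder path. I am assuming a global Otsu threshold is a reasonable first try for clean scans; for unevenly lit or low-quality pages something adaptive may be needed, which is part of what I am asking about.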

Top comments (3)

Vesi Staneva

My team just completed an open-source Content Moderation Service built with Node.js, TensorFlowJS, and ReactJS that we have been working on over the past weeks. We have now released the first part of a series of three tutorials, "How to create an NSFW Image Classification REST API", which might help you answer some of those questions. Any comments and suggestions are more than welcome. Thanks in advance!
(Fork it on GitHub or click🌟star to support us and stay connected🙌)

rjs417

I'm also looking for the same.
I'll follow this post.

Gyandeep Singh

I am also looking for some insight into this.
Thanks for posting this question.