DEV Community

Vasil Vasilev

TensorFlow to filter PDF files

Hello everyone,

So I got the idea to filter resumes using TensorFlow, but I don't know how to do it o:). I found a tutorial about training a model to recognize whether an image shows a dog or a cat.

So what do I want to do? Train a model with some keywords (let's say some kind of dictionary), then filter PDF files depending on those keywords and store them in different folders.

Any help?

Oldest comments (6)

lukaszkuczynski • Edited

I think it's a great idea to strip the data from the PDF first.
I was thinking about how to employ Elasticsearch for this some time ago.
There are ways to easily index PDF docs; then you can use similarity or scoring to search.

Thành

I'm also interested in the topic of TensorFlow :))

Alex Miasoiedov

I would use pdf-to-text and then feed the data to github.com/vi3k6i5/flashtext with an annotated dictionary of keywords. Elasticsearch seems too complex a solution, since you only want some basic filtering like QA/DevOps/Java Dev/Python Dev to group resumes by category.
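A rough sketch of the keyword-dictionary idea, assuming the resume text has already been extracted. To stay dependency-free it swaps in the standard `re` module instead of flashtext, and the `KEYWORDS` dictionary, the function names, and the "Uncategorized" fallback are all made up for illustration:

```python
import re
import shutil
from pathlib import Path

# Hypothetical keyword dictionary (category -> keywords); not from the thread.
KEYWORDS = {
    "QA": ["selenium", "test automation", "qa"],
    "DevOps": ["kubernetes", "docker", "terraform"],
    "Java Dev": ["java", "spring"],
    "Python Dev": ["python", "django"],
}


def count_hits(text, keywords):
    """Count whole-word, case-insensitive keyword occurrences in text."""
    total = 0
    for kw in keywords:
        # The lookarounds stop "java" from matching inside "javascript".
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def categorize(text, keyword_map=KEYWORDS, default="Uncategorized"):
    """Return the category with the most keyword hits, or default if none hit."""
    scores = {cat: count_hits(text, kws) for cat, kws in keyword_map.items()}
    best = max(scores, key=scores.get)  # ties go to the first category listed
    return best if scores[best] > 0 else default


def sort_into_folder(pdf_path, text, out_dir):
    """Copy the PDF into out_dir/<category>/ and return the new path."""
    target_dir = Path(out_dir) / categorize(text)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(pdf_path, target_dir))
```

Something like `sort_into_folder("cv.pdf", extracted_text, "sorted")` would then place each resume under a folder named after its best-matching category. Picking the highest hit count is only one possible rule; a resume that mentions both Java and Python ends up wherever the count is higher.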

If you still want to play with TensorFlow, I suggest you think about what kind of features you can extract from the image and what kind of output to expect. That's 90% of the success; the remaining 10% is just coding up the TensorFlow model.
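To make "features" and "output" concrete, here is one possible framing for resumes as text, not the only one. It assumes a bag-of-words count vector as the input features and a one-hot category vector as the expected output. `VOCABULARY`, `CATEGORIES`, and both function names are hypothetical, and no TensorFlow call is shown; the resulting lists are just the kind of numeric data a model would be trained on.

```python
import re
from collections import Counter

# Hypothetical vocabulary and label set, chosen only for illustration.
VOCABULARY = ["python", "java", "docker", "selenium"]
CATEGORIES = ["Python Dev", "Java Dev", "DevOps", "QA"]


def to_features(text, vocabulary=VOCABULARY):
    """Turn text into a fixed-length vector of word counts (the model input)."""
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def to_one_hot(category, categories=CATEGORIES):
    """Encode a category as a one-hot vector (the expected model output)."""
    return [1.0 if c == category else 0.0 for c in categories]
```

Each labeled resume then becomes one (features, one-hot label) pair, and the model's job is to map the first to the second.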

Vasil Vasilev

Hey Alex, thanks for the answer, it makes sense to me!

I have a question: so what you are saying is to create a dictionary with the keywords, then extract the PDF to text and filter the CVs accordingly, right?

How would you do the PDF -> txt?

Alex Miasoiedov

Well that's for you to figure out :)

KYLE RASMUSSEN

Why don't you just do OCR to extract the PDF to txt?
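OCR is mainly needed for scanned PDFs, which contain only page images. Many resumes are exported from a word processor and already carry a text layer, so a plain text extractor may be enough. Below is a minimal sketch that shells out to the `pdftotext` command-line tool from Poppler, assuming it is installed and on the PATH; the two function names are made up for illustration.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path):
    """Build the pdftotext call; "-layout" keeps the page layout, "-" writes to stdout."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def pdf_to_text(pdf_path):
    """Return the text layer of a PDF as a string (assumes pdftotext is installed)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; it ships with Poppler")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

If `pdf_to_text` comes back empty or nearly empty for a file, that is a hint the PDF is a scan, and an OCR step like the one suggested above would be the fallback.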