Long story short, I've wound up starting work on a small Tesseract OCR program. I call it docshund-rs, because it finds things in documents like a dachshund finds gophers in holes, and it's written in Rust. I'm intensely creative.
It took me longer to remember how Rust handles Result<> return types, and to unwrap the results of the tesseract-rs calls accordingly, than it did to get the program working.
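For anyone else who forgets this between projects, here's a minimal sketch of the Result pattern I mean. To be clear, recognize, word_count and OcrError are made-up stand-ins for illustration, not the actual tesseract-rs API, and the "OCR" here just checks the file extension.

```rust
use std::fmt;

// Hypothetical error type, standing in for whatever the OCR crate returns.
#[derive(Debug, PartialEq)]
enum OcrError {
    UnsupportedFormat(String),
}

impl fmt::Display for OcrError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OcrError::UnsupportedFormat(path) => write!(f, "unsupported format: {path}"),
        }
    }
}

impl std::error::Error for OcrError {}

// Hypothetical stand-in for an OCR call (NOT a tesseract-rs function).
// It only inspects the extension and returns placeholder text.
fn recognize(path: &str) -> Result<String, OcrError> {
    let lower = path.to_lowercase();
    let supported = [".jpg", ".jpeg", ".png", ".tif", ".tiff"];
    if supported.iter().any(|ext| lower.ends_with(ext)) {
        Ok(format!("text from {path}"))
    } else {
        Err(OcrError::UnsupportedFormat(path.to_string()))
    }
}

// The `?` operator returns early with the error, where `unwrap()` would panic.
fn word_count(path: &str) -> Result<usize, OcrError> {
    let text = recognize(path)?;
    Ok(text.split_whitespace().count())
}
```

The upshot is that unwrap() is fine for a quick hack, but ? lets the caller decide what to do when a file turns out not to be an image.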
Still, all things considered, it's already pretty cool. It can successfully scan image files like JPEG, PNG and TIF with a reasonable degree of accuracy.
Ultimately I think docshund-rs will be a program that can take a PDF file, turn it into images, and then process a bunch of those pages concurrently before barfing the output back out into a searchable PDF, or at least just a text file dump.
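None of that exists yet, but the concurrent part might look roughly like this sketch using scoped threads from the standard library. The ocr_page function is a hypothetical placeholder that just reports the byte count of a page, and spawning one thread per page is an assumption made for brevity; a real version would probably cap the number of workers.

```rust
use std::thread;

// Hypothetical stand-in for running OCR on one page image.
// It only reports the size of the input so the sketch has no dependencies.
fn ocr_page(page_number: usize, image: &[u8]) -> Result<String, String> {
    if image.is_empty() {
        return Err(format!("page {page_number} is empty"));
    }
    Ok(format!("page {page_number}: {} bytes", image.len()))
}

// Processes every page on its own scoped thread and returns the results
// in page order. Scoped threads can borrow `pages` without cloning it.
fn ocr_pages(pages: &[Vec<u8>]) -> Vec<Result<String, String>> {
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = pages
            .iter()
            .enumerate()
            .map(|(index, image)| scope.spawn(move || ocr_page(index + 1, image)))
            .collect();

        // Join in spawn order, which keeps the output in page order.
        handles
            .into_iter()
            .map(|handle| {
                handle
                    .join()
                    .unwrap_or_else(|_| Err("worker panicked".to_string()))
            })
            .collect()
    })
}
```

Each page gets its own Result, so one bad page doesn't sink the whole document.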
This is also subject to my interest level in the project, which usually varies wildly.
Though I think I'll keep a running tab of Tiny Programs and link it all together as a series, regardless.
Title photo by James Watson on Unsplash