During a project designed to split single .pdf files (that had been condensed by a third party) back into their constituent parts, I ran across a problem I'm surprised I'd never encountered before...
The segmenting into sections I did intentionally sans AI. All of the logic is handled through traditional pattern matching, with an extra step of "ignore words" to help correct false positives. Surprisingly, this worked incredibly well. The pesky problem appeared when I tried to also generate thumbnail preview images of the new .pdf components.
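For the curious, that section-splitting logic can be sketched roughly like this. The heading pattern and the ignore list here are invented for illustration; the real ones were tuned to the documents in question:

```python
import re

# Hypothetical heading pattern; the real patterns were built up
# per document type.
HEADING_RE = re.compile(r"^(?:SECTION|Section|Part)\s+[\dIVX]+\b")

# "Ignore words": lines that match the pattern but are known false
# positives, e.g. a "Section 508" accessibility notice.
IGNORE_LINES = {"Section 508"}

def find_section_starts(lines):
    """Return the indices of lines that open a new section."""
    starts = []
    for i, raw in enumerate(lines):
        line = raw.strip()
        if HEADING_RE.match(line) and line not in IGNORE_LINES:
            starts.append(i)
    return starts
```

Splitting the page list at those indices yields the ranges to hand off to the .pdf writer.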
Every .pdf could arrive with an arbitrary name. Many included UUIDs, some had dates... zero standardization, and there were lots of other "words" that might appear anywhere in the string.
I quickly discovered that a massive list of words to remove and diligent pattern matching wasn't going to get me very far. At one point I stubbornly Googled 'how to remove gibberish from a string, Python'. In 20 years, I'd never had this exact problem. When I'd encountered it in some form before, the patterns were easy to match, or I was able to correct the problem further back in the pipeline, nullifying the issue.
This time, those options were not producing desirable results. Far from it. The best I could do with "classical" methods was around 70% accuracy in removing the correct parts of the string. Since I was also appending this string to the 'sections' of the various .pdfs being generated, the conundrum eventually led me to Named Entity Recognition.
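To give a flavor of what those classical methods looked like, here is a minimal sketch of the kind of heuristic I was fighting with. Every threshold and pattern below is invented for illustration, and this approach has exactly the failure modes described above:

```python
import re

# Long runs of hex characters (UUID fragments, hashes).
HEXISH_RE = re.compile(r"^[0-9a-f]{6,}$", re.I)

def looks_like_gibberish(token: str) -> bool:
    """Crude, made-up rules for spotting machine-generated tokens."""
    token = token.lower()
    if HEXISH_RE.match(token):
        return True
    if sum(c.isdigit() for c in token) > len(token) // 2:  # mostly digits
        return True
    vowels = sum(c in "aeiou" for c in token)
    return len(token) > 3 and vowels == 0  # no vowels at all

def clean(name: str) -> str:
    """Drop gibberish-looking tokens from a filename-ish string."""
    tokens = re.split(r"[\s_\-]+", name)
    return " ".join(t for t in tokens if t and not looks_like_gibberish(t))
```

Rules like these catch the obvious cases but choke on anything in between: a surname with no vowels, a product code that matters, a date you actually want to keep.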
If you're like me, you may have a surface interest in AI and ML, but have never really pondered Named Entity Recognition. The solution to my problem was something people had been working on for decades.
While you could easily download a corpus of the English language, you'd be taking a step in the wrong direction if your intention is to pull the names of people, places, or things out of text: a corpus would exclude those names in almost every instance.
What is it called when you're looking for a word... one that is possibly a name, but definitely not NOT a word? Named Entity Recognition. Lisa F. Rau is credited as the first person to actually implement one of these solutions.
At the very outset of this project, I had the idea that I could just convert all the .pdfs to text (or at least the first 150 characters of each page) and feed that into an LLM (even a local LLM) to get page ranges to extract. While this is a valid solution to the problem, it consumes a lot of resources. Some pages are just images and require OCR, and we're also doing Named Entity Recognition on each incoming file. There are also many instances where, even with advanced parsing, the first 150 characters of a given document could be nearly identical to those of an unrelated document (or worse, nonsensical encoding garbage from Docusign).
Other developers who are miserly with resources might appreciate the solution I used: a Python library called spaCy.
A warning: spaCy took an act of Congress for me to get working properly. At one point, I was compiling various dependencies (it had been a while since I'd used cmake). The results were well worth it: it barely added any processing time and performed almost flawlessly.
The downside is that there were two other server environments (one older, one newer, go figure; all three Ubuntu) where I could NOT get spaCy and the other dependencies of this project to work properly ("could not" as in, I gave up after squandering several hours trying to repeat a task I'd completed just hours prior).
Don't let that scare you, however; I'm fairly n00bish with Python in general. If you're using a proper virtual environment and don't normally have problems with these types of things, you should be fine.
I'm able to run spaCy on an $8 per month unmanaged VPS, meaning the resources required to utilize it are pretty minimal.
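For reference, the core of what I'm doing with spaCy is not much more than this. The model name `en_core_web_sm` and the `preclean` helper are my assumptions for the sketch, not a prescription, and the spaCy import is deferred so the rest of a pipeline still loads on hosts where it refuses to install:

```python
import re

# Strip the mechanical parts (UUIDs, separators) before NER sees the name.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I)

def preclean(filename):
    """Remove UUIDs and collapse underscores/whitespace."""
    name = UUID_RE.sub("", filename)
    return re.sub(r"[_\s]+", " ", name).strip()

def extract_entities(text):
    """Return the spans spaCy tags as named entities (people,
    organizations, places, ...) with their labels."""
    import spacy  # deferred: optional on hosts where spaCy won't build
    nlp = spacy.load("en_core_web_sm")  # small English pipeline
    doc = nlp(preclean(text))
    return [(ent.text, ent.label_) for ent in doc.ents]
```

In practice you'd load the pipeline once at startup rather than per call; `spacy.load` is by far the most expensive line here.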
NER is a rabbit hole that leads into how machine learning and large language models depend on rather arbitrary rules and semantics to tag and understand the world around them. This was rather serendipitous for me and a great ride.
"How do I remove gibberish from a string?"