DEV Community

Claudio Fior for Abbrevia


Extract data from a PDF

One of our providers gives us some data as a PDF, and I have to produce a JSON object from it for further processing.

Demo of the report

For the textual information there was no problem: I used pdftotext to extract the text.

$content = shell_exec('pdftotext -enc UTF-8 -layout input.pdf -');

Then I used regular expressions to extract the data:

 $anagrafica=array();
 if(preg_match('/^Denominazione\W*(.*)/m', $content, $aDenominazione)) {
     $anagrafica['denominazione']=$aDenominazione[1];
 }
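When the report has several labelled fields, the same idea can be driven by a small table of patterns. This is only a minimal sketch: the sample text and the `Codice fiscale` label are assumptions made for illustration, and only the `Denominazione` pattern comes from the real script.

```php
<?php
// Hypothetical sample of pdftotext output; the real report layout may differ.
$content = "Denominazione   ACME S.R.L.\nCodice fiscale  01234567890\n";

// Field name => pattern. Only 'Denominazione' comes from the post;
// 'Codice fiscale' is an assumed label used for illustration.
$patterns = array(
    'denominazione'  => '/^Denominazione\W*(.*)/m',
    'codice_fiscale' => '/^Codice fiscale\W*(.*)/m',
);

$anagrafica = array();
foreach ($patterns as $field => $pattern) {
    if (preg_match($pattern, $content, $match)) {
        // trim() drops the padding that -layout can leave at the end of a line
        $anagrafica[$field] = trim($match[1]);
    }
}

echo json_encode($anagrafica), "\n";
```

Adding a field then only means adding one entry to `$patterns`.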

But how do I extract the data of the traffic lights ("semaphores"), which are images without labels?

I used the Linux command pdftohtml:

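// pdftohtml -xml emits one element per text box or image; grep keeps only the image lines
$rawImages = shell_exec('pdftohtml -enc UTF-8 -noframes -stdout -xml "'.$this->filePath.'" - | grep image');
$images = array();
$tok = strtok($rawImages,"\r\n");
while ($tok !== false) {
    $oImage = simplexml_load_string($tok);
    // skip lines that are not well-formed XML
    if ($oImage !== false) {
        $images[]=$oImage;
    }
    $tok = strtok("\r\n");
}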

The output of pdftohtml is an XML document with one element for each text box or image.

$rawImages is a string containing the XML elements of the images, one per line, and I put them as SimpleXMLElement objects in the $images array.

Then I searched through the array for the images with a width of 77 pixels and sorted them by vertical position.
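The filter and sort step is not shown in the post, so here is a rough sketch of how it could look. The sample `<image>` lines and file names are made up; the attribute names (`top`, `left`, `width`, `height`, `src`) are what I would expect from `pdftohtml -xml`, but check them against your own output.

```php
<?php
// Hypothetical <image> lines shaped like pdftohtml -xml output; all values are made up.
$rawImages = <<<XML
<image top="420" left="500" width="77" height="30" src="report-1_3.png"/>
<image top="120" left="40" width="200" height="60" src="report-1_1.png"/>
<image top="300" left="500" width="77" height="30" src="report-1_2.png"/>
XML;

$images = array();
foreach (preg_split('/\R/', trim($rawImages)) as $line) {
    $oImage = simplexml_load_string($line);
    if ($oImage !== false) {
        $images[] = $oImage;
    }
}

// Keep only the 77-wide images (the traffic lights in this report).
$lights = array_filter($images, function ($img) {
    return (int) $img['width'] === 77;
});

// Sort top to bottom so that position N matches the N-th check in the report.
usort($lights, function ($a, $b) {
    return (int) $a['top'] - (int) $b['top'];
});

foreach ($lights as $pos => $img) {
    echo $pos, ': ', (string) $img['src'], "\n";
}
```

Since `usort` reindexes the array from 0, `$pos` can be used directly as the index into the list of check names.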

The images are saved in the current directory of the script.

I queried the color of a pixel at a specific position in each image with the convert command of ImageMagick and saved the result in the JSON object.

// trim() removes any trailing whitespace so the string comparison below can match
$color = trim((string) shell_exec('convert "'.$imagePath.'" -format \'%[pixel:p{100,50}]\' info:- '));
switch ($color) {
    case 'srgb(253,78,83)':
        $anagrafica[$this::chekcs[$pos]]='red';
    break;
    case 'srgb(123,196,78)':
        $anagrafica[$this::chekcs[$pos]]='green';
    break;
    case 'srgb(254,211,80)':
        $anagrafica[$this::chekcs[$pos]]='yellow';
    break;
}
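The exact string comparison breaks as soon as the provider re-renders the icons with slightly different colors. A possible variation, which is not what the script above does, is to pick the nearest reference color within a tolerance. In this sketch `classifyColor` and the threshold of 30 are hypothetical, and it assumes ImageMagick prints integer `srgb(r,g,b)` values as in the cases above.

```php
<?php
// Reference colors taken from the switch in the post.
const REFERENCE_COLORS = array(
    'red'    => array(253, 78, 83),
    'green'  => array(123, 196, 78),
    'yellow' => array(254, 211, 80),
);

/**
 * Hypothetical helper: map an ImageMagick "srgb(r,g,b)" string to the closest
 * reference color, or null if parsing fails or nothing is within $maxDistance.
 */
function classifyColor($pixel, $maxDistance = 30.0)
{
    if (!preg_match('/^srgba?\((\d+),(\d+),(\d+)/', trim($pixel), $m)) {
        return null;
    }
    $best = null;
    $bestDistance = $maxDistance;
    foreach (REFERENCE_COLORS as $name => $rgb) {
        // Euclidean distance in RGB space
        $distance = sqrt(
            pow($m[1] - $rgb[0], 2) + pow($m[2] - $rgb[1], 2) + pow($m[3] - $rgb[2], 2)
        );
        if ($distance <= $bestDistance) {
            $best = $name;
            $bestDistance = $distance;
        }
    }
    return $best;
}

echo json_encode(array(
    'exact' => classifyColor("srgb(253,78,83)\n"),
    'near'  => classifyColor('srgb(250,80,85)'),
    'other' => classifyColor('srgb(0,0,255)'),
)), "\n";
```

A `null` result is worth logging, because it usually means the report layout changed or the sampled pixel is no longer inside the icon.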

At this point I wonder: is there an easier way to do this?
