DEV Community

Claudio Fior for Abbrevia


Extract data from a PDF

One of our providers gives us some data as a PDF, and I have to produce a JSON object from it for further processing.

Demo of the report

The textual information was no problem: I used pdftotext to extract the text.

$content = shell_exec('pdftotext -enc UTF-8 -layout input.pdf -');

Then I used regular expressions to extract the data:

if (preg_match('/^Denominazione\W*(.*)/m', $content, $aDenominazione)) {
    // $aDenominazione[1] holds the text that follows the label
}
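
The same pattern shape can be wrapped in a small helper so that each labelled field becomes one call. This is only a minimal sketch: `extractField()`, the sample text, the `Codice fiscale` label and the JSON keys are all made up for illustration and do not come from the real report. Note that `\W*` also matches newlines, so a label with an empty value would capture the start of the next line.

```php
<?php
// Hypothetical sample of pdftotext -layout output; the real report differs.
$content = "Denominazione:   ACME S.R.L.\nCodice fiscale:  01234567890\n";

/**
 * Return the text that follows a label at the start of a line, or null.
 * extractField() is an illustrative helper, not part of the original script.
 */
function extractField(string $content, string $label): ?string
{
    // Label, then any non-word separators (colon, spaces), then the value.
    $pattern = '/^' . preg_quote($label, '/') . '\W*(.*)/m';
    if (preg_match($pattern, $content, $matches)) {
        return trim($matches[1]);
    }
    return null;
}

// Hypothetical JSON keys; use whatever the downstream consumer expects.
$data = [
    'denominazione' => extractField($content, 'Denominazione'),
    'codiceFiscale' => extractField($content, 'Codice fiscale'),
];
echo json_encode($data, JSON_PRETTY_PRINT), "\n";
```

Returning null for a missing label makes it easy to spot fields that the regular expression failed to find.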

But how do I extract the data from the traffic-light indicators, which are images without labels?

I used the Linux command pdftohtml:

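// -xml emits one element per text box or image; grep keeps only the image lines
$rawImages = shell_exec('pdftohtml -enc UTF-8 -noframes -stdout -xml '.escapeshellarg($this->filePath).' - | grep image');
$images = [];
$tok = strtok($rawImages, "\r\n");
while ($tok !== false) {
    $oImage = simplexml_load_string($tok); // parse one image element
    $images[] = $oImage;
    $tok = strtok("\r\n");
}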

The output of pdftohtml is an XML document with one element for each text box or image.

$rawImages holds the XML elements of the images, one per line. I parsed each line and stored it as a SimpleXMLElement object in the $images array.

Then I searched the array for the images that are 77 pixels wide and sorted them by vertical position.
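
The filter and the sort can be sketched as follows. The three image lines are invented sample data in the shape that pdftohtml writes with -xml (the attribute values and file names are assumptions), and `$lights` is just an illustrative variable name.

```php
<?php
// Hypothetical image elements; positions, sizes and file names are made up.
$lines = [
    '<image top="410" left="600" width="77" height="30" src="report-1_3.png"/>',
    '<image top="120" left="40" width="200" height="60" src="report-1_1.png"/>',
    '<image top="350" left="600" width="77" height="30" src="report-1_2.png"/>',
];

$images = [];
foreach ($lines as $line) {
    $element = simplexml_load_string($line);
    if ($element !== false) { // skip lines that are not well-formed XML
        $images[] = $element;
    }
}

// Keep only the 77-pixel-wide images (the indicator images in this report).
$lights = array_values(array_filter(
    $images,
    fn (SimpleXMLElement $img): bool => (int) $img['width'] === 77
));

// Sort top to bottom so the order matches the rows of the report.
usort(
    $lights,
    fn (SimpleXMLElement $a, SimpleXMLElement $b): int => (int) $a['top'] <=> (int) $b['top']
);

foreach ($lights as $img) {
    echo $img['top'], ' ', $img['src'], "\n";
}
```

The attributes come back as SimpleXMLElement objects, so they are cast to int before comparing. Otherwise the strict comparison with 77 would never match.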

The images are saved in the current directory of the script.

I queried the color of a pixel at a specific position in the image with ImageMagick's convert command and saved the result in the JSON object.

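// Read the color of the pixel at x=100, y=50; trim() drops any trailing newline
$color = trim((string) shell_exec('convert '.escapeshellarg($imagePath).' -format \'%[pixel:p{100,50}]\' info:-'));
switch ($color) {
    case 'srgb(253,78,83)':  // red
    case 'srgb(123,196,78)': // green
    case 'srgb(254,211,80)': // yellow
}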
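
Exact string comparison breaks as soon as a pixel is slightly off, for example because of anti-aliasing. A possible variant, which is not what the script above does, compares each channel with a tolerance. In this sketch `classifyLight()`, `$tolerance`, the `status` key and the red/green/yellow labels are my own assumptions; only the three RGB triplets come from the switch above.

```php
<?php
// Reference colors from the switch above; the labels are inferred from the RGB values.
const LIGHT_COLORS = [
    'red'    => [253, 78, 83],
    'green'  => [123, 196, 78],
    'yellow' => [254, 211, 80],
];

/**
 * Map an ImageMagick pixel string such as "srgb(253,78,83)" to a label.
 * classifyLight() and $tolerance are illustrative, not from the original script.
 */
function classifyLight(string $pixel, int $tolerance = 10): ?string
{
    if (!preg_match('/^s?rgba?\((\d+),(\d+),(\d+)/', trim($pixel), $m)) {
        return null; // unexpected format, e.g. a named color or percentages
    }
    foreach (LIGHT_COLORS as $label => [$r, $g, $b]) {
        // Accept the color if every channel is within the tolerance.
        if (abs($m[1] - $r) <= $tolerance
            && abs($m[2] - $g) <= $tolerance
            && abs($m[3] - $b) <= $tolerance) {
            return $label;
        }
    }
    return null;
}

// Hypothetical JSON key; the real object layout is not shown in the post.
echo json_encode(['status' => classifyLight("srgb(253,78,83)\n")]), "\n";
```

A null result signals a color that matches none of the three lights, which is safer than silently falling into a default case.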

At this point I wonder: is there an easier way to do this?
