DEV Community

Alain Airom

Using “Docling Parse”!

Extract text, paths and bitmap images with coordinates from programmatic PDFs!

Introduction — what is Docling Parse?

Docling Parse is a simple package that extracts text, paths and bitmap images, together with their coordinates, from programmatic PDFs. It is the package used in the Docling PDF conversion. Below are a few outputs of the latest parser, with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

Usage

Docling Parse is easy to use in a Python application. First, install the package.

pip install docling-parse

Then try it against a PDF. I used a densely laid-out PDF file; the screen capture below shows one of its pages.

[Image: screen capture of one page of the PDF file]

And I ran the sample code below.

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="./pdf/granite-foundation-models.pdf"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells (once per page)
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
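
The loop above prints every word on every page. A common next step is to keep only the words that fall inside a region of interest, such as one column of a table. The sketch below is a minimal illustration of that idea, not part of Docling Parse: it assumes each cell exposes `text` and a `rect` with the corner fields `r_x0` to `r_y3`, as the printed output suggests. The helpers `bbox`, `words_in_region` and `make_cell` are hypothetical names, and `make_cell` builds stand-in cells so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def bbox(rect):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) from four corners."""
    xs = (rect.r_x0, rect.r_x1, rect.r_x2, rect.r_x3)
    ys = (rect.r_y0, rect.r_y1, rect.r_y2, rect.r_y3)
    return min(xs), min(ys), max(xs), max(ys)


def words_in_region(cells, x_min, y_min, x_max, y_max):
    """Return the text of every cell whose box lies fully inside the region."""
    selected = []
    for cell in cells:
        x0, y0, x1, y1 = bbox(cell.rect)
        if x_min <= x0 and y_min <= y0 and x1 <= x_max and y1 <= y_max:
            selected.append(cell.text)
    return selected


def make_cell(x0, y0, x1, y1, text):
    """Hypothetical stand-in for a parsed word cell, so no PDF is needed."""
    rect = SimpleNamespace(r_x0=x0, r_y0=y0, r_x1=x1, r_y1=y0,
                           r_x2=x1, r_y2=y1, r_x3=x0, r_y3=y1)
    return SimpleNamespace(rect=rect, text=text)


# Stand-in cells with coordinates in the same units as the parser output.
cells = [
    make_cell(302.497, 324.897, 320.43, 331.739, "0.584"),
    make_cell(170.425, 315.93, 259.857, 322.772, "mixtral-8x7b-instruct-v01-q"),
    make_cell(170.425, 294.809, 248.795, 301.651, "granite-20b-multilingual"),
]

# Keep only the words in the left-hand column of the table.
model_names = words_in_region(cells, x_min=160, y_min=290, x_max=270, y_max=330)
print(model_names)
```

With real data, the same function could presumably be fed the result of `pred_page.iterate_cells(unit_type=TextCellUnit.WORD)`, provided the cells carry the attributes assumed here.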

Sample console output…

...
r_x0=302.497 r_y0=324.897 r_x1=320.43 r_y1=324.897 r_x2=320.43 r_y2=331.739 r_x3=302.497 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.584
r_x0=332.783 r_y0=324.897 r_x1=350.716 r_y1=324.897 r_x2=350.716 r_y2=331.739 r_x3=332.783 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.411
r_x0=363.07 r_y0=324.897 r_x1=381.003 r_y1=324.897 r_x2=381.003 r_y2=331.739 r_x3=363.07 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.420
r_x0=393.356 r_y0=324.897 r_x1=411.289 r_y1=324.897 r_x2=411.289 r_y2=331.739 r_x3=393.356 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.463
r_x0=423.642 r_y0=324.897 r_x1=441.575 r_y1=324.897 r_x2=441.575 r_y2=331.739 r_x3=423.642 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.462
r_x0=170.425 r_y0=315.93 r_x1=259.857 r_y1=315.93 r_x2=259.857 r_y2=322.772 r_x3=170.425 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  mixtral-8x7b-instruct-v01-q
r_x0=272.21 r_y0=315.93 r_x1=290.143 r_y1=315.93 r_x2=290.143 r_y2=322.772 r_x3=272.21 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.572
r_x0=302.497 r_y0=315.93 r_x1=320.43 r_y1=315.93 r_x2=320.43 r_y2=322.772 r_x3=302.497 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.587
r_x0=332.783 r_y0=315.93 r_x1=350.716 r_y1=315.93 r_x2=350.716 r_y2=322.772 r_x3=332.783 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.368
r_x0=363.07 r_y0=315.93 r_x1=381.003 r_y1=315.93 r_x2=381.003 r_y2=322.772 r_x3=363.07 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.420
r_x0=393.356 r_y0=315.93 r_x1=411.289 r_y1=315.93 r_x2=411.289 r_y2=322.772 r_x3=393.356 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.467
r_x0=423.642 r_y0=315.93 r_x1=441.575 r_y1=315.93 r_x2=441.575 r_y2=322.772 r_x3=423.642 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.467
r_x0=277.415 r_y0=304.174 r_x1=284.939 r_y1=304.174 r_x2=284.939 r_y2=311.016 r_x3=277.415 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  de
r_x0=307.701 r_y0=304.174 r_x1=315.225 r_y1=304.174 r_x2=315.225 r_y2=311.016 r_x3=307.701 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  en
r_x0=337.988 r_y0=304.174 r_x1=345.512 r_y1=304.174 r_x2=345.512 r_y2=311.016 r_x3=337.988 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  de
r_x0=368.274 r_y0=304.174 r_x1=375.798 r_y1=304.174 r_x2=375.798 r_y2=311.016 r_x3=368.274 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  en
r_x0=398.56 r_y0=304.174 r_x1=406.084 r_y1=304.174 r_x2=406.084 r_y2=311.016 r_x3=398.56 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  de
r_x0=428.847 r_y0=304.174 r_x1=436.371 r_y1=304.174 r_x2=436.371 r_y2=311.016 r_x3=428.847 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  en
r_x0=170.425 r_y0=294.809 r_x1=248.795 r_y1=294.809 r_x2=248.795 r_y2=301.651 r_x3=170.425 r_y3=301.651 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  granite-20b-multilingual
...
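
In the output above, words on the same table row share the same `r_y0` value, and with a `BOTTOMLEFT` origin a larger y means higher on the page. That makes it possible to rebuild rows from the word coordinates. The sketch below is one simple way to do it, assuming a fixed vertical tolerance is good enough; `group_rows` and the `(x0, y0, text)` tuples are hypothetical, with values copied from the sample output.

```python
def group_rows(words, tol=1.0):
    """Group (x0, y0, text) tuples into rows, top to bottom, left to right.

    Words whose y0 differs from the current row's y0 by at most `tol`
    are treated as belonging to the same row.
    """
    rows = []
    # Bottom-left origin: sort by descending y so the top row comes first.
    for x0, y0, text in sorted(words, key=lambda w: -w[1]):
        if rows and abs(rows[-1]["y"] - y0) <= tol:
            rows[-1]["cells"].append((x0, text))
        else:
            rows.append({"y": y0, "cells": [(x0, text)]})
    # Within each row, order the words by their left edge.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# (x0, y0, text) values taken from the console output, deliberately shuffled.
words = [
    (307.701, 304.174, "en"),
    (302.497, 315.93, "0.587"),
    (332.783, 324.897, "0.411"),
    (170.425, 315.93, "mixtral-8x7b-instruct-v01-q"),
    (277.415, 304.174, "de"),
    (302.497, 324.897, "0.584"),
    (272.21, 315.93, "0.572"),
]

rows = group_rows(words)
for row in rows:
    print(row)
```

The tolerance is a judgment call: a value that is too small splits rows whose baselines differ slightly, while a value that is too large merges neighbouring rows.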

You can also try the tool from the command line. The command below appears to assume a local checkout of the docling-parse repository with Poetry set up.

# To produce the visualizations yourself, run the command below (replace word with char or line)
poetry run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

Conclusion

Docling Parse, part of the Docling family of tools, is a handy way to extract text, paths and bitmap images with coordinates from programmatic PDFs.

Useful links
Docling Parse GitHub repository: https://github.com/docling-project/docling-parse?tab=readme-ov-file
Docling: https://github.com/docling-project
