Using “Docling Parse”!

#docling #rag #llm

Extract text, paths and bitmap images with coordinates from programmatic PDFs!

Introduction — what is Docling Parse?

Docling Parse is a simple package that extracts text, paths and bitmap images, together with their coordinates, from programmatic PDFs. It is the parser used in the Docling PDF conversion. The examples below show some output of the latest parser, which produces char-, word- and line-level output for text in addition to the extracted paths and bitmap resources.

Usage

Using Docling Parse in a Python application is quite easy. First, install the required package.

pip install docling-parse

Then try it against a PDF. I used a crowded, densely laid-out PDF file (granite-foundation-models.pdf in the code below).

I then ran the following sample code.

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="./pdf/granite-foundation-models.pdf"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells (once per page, after the word loop)
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()

Sample console output…

...
r_x0=302.497 r_y0=324.897 r_x1=320.43 r_y1=324.897 r_x2=320.43 r_y2=331.739 r_x3=302.497 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.584
r_x0=332.783 r_y0=324.897 r_x1=350.716 r_y1=324.897 r_x2=350.716 r_y2=331.739 r_x3=332.783 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.411
r_x0=363.07 r_y0=324.897 r_x1=381.003 r_y1=324.897 r_x2=381.003 r_y2=331.739 r_x3=363.07 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.420
r_x0=393.356 r_y0=324.897 r_x1=411.289 r_y1=324.897 r_x2=411.289 r_y2=331.739 r_x3=393.356 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.463
r_x0=423.642 r_y0=324.897 r_x1=441.575 r_y1=324.897 r_x2=441.575 r_y2=331.739 r_x3=423.642 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.462
r_x0=170.425 r_y0=315.93 r_x1=259.857 r_y1=315.93 r_x2=259.857 r_y2=322.772 r_x3=170.425 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  mixtral-8x7b-instruct-v01-q
r_x0=272.21 r_y0=315.93 r_x1=290.143 r_y1=315.93 r_x2=290.143 r_y2=322.772 r_x3=272.21 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.572
r_x0=302.497 r_y0=315.93 r_x1=320.43 r_y1=315.93 r_x2=320.43 r_y2=322.772 r_x3=302.497 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.587
r_x0=332.783 r_y0=315.93 r_x1=350.716 r_y1=315.93 r_x2=350.716 r_y2=322.772 r_x3=332.783 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.368
r_x0=363.07 r_y0=315.93 r_x1=381.003 r_y1=315.93 r_x2=381.003 r_y2=322.772 r_x3=363.07 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.420
r_x0=393.356 r_y0=315.93 r_x1=411.289 r_y1=315.93 r_x2=411.289 r_y2=322.772 r_x3=393.356 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.467
r_x0=423.642 r_y0=315.93 r_x1=441.575 r_y1=315.93 r_x2=441.575 r_y2=322.772 r_x3=423.642 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  0.467
r_x0=277.415 r_y0=304.174 r_x1=284.939 r_y1=304.174 r_x2=284.939 r_y2=311.016 r_x3=277.415 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  de
r_x0=307.701 r_y0=304.174 r_x1=315.225 r_y1=304.174 r_x2=315.225 r_y2=311.016 r_x3=307.701 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  en
r_x0=337.988 r_y0=304.174 r_x1=345.512 r_y1=304.174 r_x2=345.512 r_y2=311.016 r_x3=337.988 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  de
r_x0=368.274 r_y0=304.174 r_x1=375.798 r_y1=304.174 r_x2=375.798 r_y2=311.016 r_x3=368.274 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  en
r_x0=398.56 r_y0=304.174 r_x1=406.084 r_y1=304.174 r_x2=406.084 r_y2=311.016 r_x3=398.56 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  de
r_x0=428.847 r_y0=304.174 r_x1=436.371 r_y1=304.174 r_x2=436.371 r_y2=311.016 r_x3=428.847 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  en
r_x0=170.425 r_y0=294.809 r_x1=248.795 r_y1=294.809 r_x2=248.795 r_y2=301.651 r_x3=170.425 r_y3=301.651 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> :  granite-20b-multilingual
...
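
Each printed rectangle lists four corners (r_x0/r_y0 through r_x3/r_y3) with a BOTTOMLEFT origin, so y grows upwards, and words on the same table row share the same r_y0. As a rough sketch of what you can do with these coordinates, the snippet below flips a box to a top-left origin (which image libraries usually expect) and groups words into rows. Note that `Quad`, `rect`, `to_bbox`, `to_top_left`, `group_rows` and the page height of 792 points are my own illustrative assumptions, not part of the Docling Parse API; the values are copied from the output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quad:
    """Hypothetical stand-in for a printed rectangle (bottom-left origin)."""
    r_x0: float
    r_y0: float
    r_x1: float
    r_y1: float
    r_x2: float
    r_y2: float
    r_x3: float
    r_y3: float


def rect(x_left: float, y_bottom: float, x_right: float, y_top: float) -> Quad:
    """Build a Quad with the same corner order as the console output."""
    return Quad(x_left, y_bottom, x_right, y_bottom,
                x_right, y_top, x_left, y_top)


def to_bbox(q: Quad) -> tuple[float, float, float, float]:
    """Axis-aligned (left, bottom, right, top) box enclosing the quad."""
    xs = (q.r_x0, q.r_x1, q.r_x2, q.r_x3)
    ys = (q.r_y0, q.r_y1, q.r_y2, q.r_y3)
    return min(xs), min(ys), max(xs), max(ys)


def to_top_left(bbox: tuple[float, float, float, float],
                page_height: float) -> tuple[float, float, float, float]:
    """Convert a bottom-left-origin box to (left, top, right, bottom), top-left origin."""
    left, bottom, right, top = bbox
    return left, page_height - top, right, page_height - bottom


def group_rows(cells: list[tuple[Quad, str]], tol: float = 1.0) -> list[list[str]]:
    """Group (Quad, text) pairs into rows, top of page first, left to right."""
    rows: list[tuple[float, list[tuple[float, str]]]] = []
    # Larger y is higher on the page, so sort by descending baseline.
    for quad, text in sorted(cells, key=lambda c: -c[0].r_y0):
        if rows and abs(rows[-1][0] - quad.r_y0) <= tol:
            rows[-1][1].append((quad.r_x0, text))
        else:
            rows.append((quad.r_y0, [(quad.r_x0, text)]))
    return [[text for _, text in sorted(items)] for _, items in rows]


# A few cells taken from the sample output, deliberately out of order.
cells = [
    (rect(302.497, 324.897, 320.43, 331.739), "0.584"),
    (rect(272.21, 315.93, 290.143, 322.772), "0.572"),
    (rect(170.425, 315.93, 259.857, 322.772), "mixtral-8x7b-instruct-v01-q"),
    (rect(332.783, 324.897, 350.716, 331.739), "0.411"),
    (rect(302.497, 315.93, 320.43, 322.772), "0.587"),
]

PAGE_HEIGHT = 792.0  # assumption: US Letter page height in points

print(group_rows(cells))
# [['0.584', '0.411'], ['mixtral-8x7b-instruct-v01-q', '0.572', '0.587']]
print(to_top_left(to_bbox(cells[0][0]), PAGE_HEIGHT))
```

In a real run you would build the input list from the cells yielded by the loop above and use the actual page height instead of the hard-coded value. The tolerance of 1.0 point is an arbitrary choice that worked for this table; skewed or rotated text would need a more careful approach.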

You can also try the tool from the command line.

# To produce the visualizations yourself, run the following (replace word with char or line):
poetry run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

Conclusion

Docling Parse, part of the Docling family of tools, is a handy way to extract text, paths and bitmap images with coordinates from programmatic PDFs.

Useful links
Docling Parse GitHub repository: https://github.com/docling-project/docling-parse?tab=readme-ov-file
Docling: https://github.com/docling-project
