Extract text, paths and bitmap images with coordinates from programmatic PDFs!
Introduction — what is Docling Parse?
Docling Parse is a simple package that extracts text, paths and bitmap images, together with their coordinates, from programmatic PDFs. It is the parser used in the Docling PDF conversion. Below, we show a few outputs of the latest parser with char, word and line level output for text, in addition to the extracted paths and bitmap resources.
Usage
It’s quite easy to use Docling Parse in a Python application. Install the required package.
pip install docling-parse
Try it against a PDF. I used a crowded PDF file; the screen capture below shows one of its pages.
And I ran the sample code below.
from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()
pdf_doc: PdfDocument = parser.load(
    path_or_stream="./pdf/granite-foundation-models.pdf"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():
    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
Sample console output…
...
r_x0=302.497 r_y0=324.897 r_x1=320.43 r_y1=324.897 r_x2=320.43 r_y2=331.739 r_x3=302.497 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : 0.584
r_x0=332.783 r_y0=324.897 r_x1=350.716 r_y1=324.897 r_x2=350.716 r_y2=331.739 r_x3=332.783 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : 0.411
r_x0=363.07 r_y0=324.897 r_x1=381.003 r_y1=324.897 r_x2=381.003 r_y2=331.739 r_x3=363.07 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : 0.420
r_x0=393.356 r_y0=324.897 r_x1=411.289 r_y1=324.897 r_x2=411.289 r_y2=331.739 r_x3=393.356 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : 0.463
r_x0=423.642 r_y0=324.897 r_x1=441.575 r_y1=324.897 r_x2=441.575 r_y2=331.739 r_x3=423.642 r_y3=331.739 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : 0.462
r_x0=170.425 r_y0=315.93 r_x1=259.857 r_y1=315.93 r_x2=259.857 r_y2=322.772 r_x3=170.425 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : mixtral-8x7b-instruct-v01-q
r_x0=272.21 r_y0=315.93 r_x1=290.143 r_y1=315.93 r_x2=290.143 r_y2=322.772 r_x3=272.21 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : 0.572
r_x0=302.497 r_y0=315.93 r_x1=320.43 r_y1=315.93 r_x2=320.43 r_y2=322.772 r_x3=302.497 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : 0.587
r_x0=332.783 r_y0=315.93 r_x1=350.716 r_y1=315.93 r_x2=350.716 r_y2=322.772 r_x3=332.783 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : 0.368
r_x0=363.07 r_y0=315.93 r_x1=381.003 r_y1=315.93 r_x2=381.003 r_y2=322.772 r_x3=363.07 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : 0.420
r_x0=393.356 r_y0=315.93 r_x1=411.289 r_y1=315.93 r_x2=411.289 r_y2=322.772 r_x3=393.356 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : 0.467
r_x0=423.642 r_y0=315.93 r_x1=441.575 r_y1=315.93 r_x2=441.575 r_y2=322.772 r_x3=423.642 r_y3=322.772 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : 0.467
r_x0=277.415 r_y0=304.174 r_x1=284.939 r_y1=304.174 r_x2=284.939 r_y2=311.016 r_x3=277.415 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : de
r_x0=307.701 r_y0=304.174 r_x1=315.225 r_y1=304.174 r_x2=315.225 r_y2=311.016 r_x3=307.701 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : en
r_x0=337.988 r_y0=304.174 r_x1=345.512 r_y1=304.174 r_x2=345.512 r_y2=311.016 r_x3=337.988 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : de
r_x0=368.274 r_y0=304.174 r_x1=375.798 r_y1=304.174 r_x2=375.798 r_y2=311.016 r_x3=368.274 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : en
r_x0=398.56 r_y0=304.174 r_x1=406.084 r_y1=304.174 r_x2=406.084 r_y2=311.016 r_x3=398.56 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : de
r_x0=428.847 r_y0=304.174 r_x1=436.371 r_y1=304.174 r_x2=436.371 r_y2=311.016 r_x3=428.847 r_y3=311.016 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : en
r_x0=170.425 r_y0=294.809 r_x1=248.795 r_y1=294.809 r_x2=248.795 r_y2=301.651 r_x3=170.425 r_y3=301.651 coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'> : granite-20b-multilingual
...
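Each printed line carries the four corners of the cell's rectangle (r_x0/r_y0 through r_x3/r_y3) in a bottom-left coordinate system, followed by the text. As a rough illustration of how these numbers can be used, the sketch below works on the printed lines only, not on the Docling objects. The helper names (parse_cell, to_top_left, group_rows), the 2-point row tolerance and the 792-point page height are my own assumptions for the example and are not part of Docling Parse.

```python
import re

# Matches "r_x0=302.497" style pairs in a printed rectangle.
_COORD = re.compile(r"(r_[xy][0-3])=(-?\d+(?:\.\d+)?)")

def parse_cell(line):
    """Split one printed line into (bbox, text); bbox is (left, bottom, right, top)."""
    rect_part, _, text = line.rpartition(" : ")
    coords = {name: float(value) for name, value in _COORD.findall(rect_part)}
    xs = [coords[f"r_x{i}"] for i in range(4)]
    ys = [coords[f"r_y{i}"] for i in range(4)]
    return (min(xs), min(ys), max(xs), max(ys)), text.strip()

def to_top_left(bbox, page_height):
    """Flip a bottom-left-origin box to a top-left origin (as image tools expect)."""
    left, bottom, right, top = bbox
    return (left, page_height - top, right, page_height - bottom)

def group_rows(cells, tol=2.0):
    """Group cells whose bottom edges lie within tol points, top row first."""
    rows = []
    for cell in sorted(cells, key=lambda c: -c[0][1]):
        if rows and abs(rows[-1][0][0][1] - cell[0][1]) <= tol:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    # Order each row left to right and keep only the text.
    return [[text for _, text in sorted(row, key=lambda c: c[0][0])] for row in rows]

ORIGIN = "coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>"
SAMPLE = [
    f"r_x0=170.425 r_y0=315.93 r_x1=259.857 r_y1=315.93 r_x2=259.857 r_y2=322.772 r_x3=170.425 r_y3=322.772 {ORIGIN} : mixtral-8x7b-instruct-v01-q",
    f"r_x0=272.21 r_y0=315.93 r_x1=290.143 r_y1=315.93 r_x2=290.143 r_y2=322.772 r_x3=272.21 r_y3=322.772 {ORIGIN} : 0.572",
    f"r_x0=302.497 r_y0=315.93 r_x1=320.43 r_y1=315.93 r_x2=320.43 r_y2=322.772 r_x3=302.497 r_y3=322.772 {ORIGIN} : 0.587",
    f"r_x0=277.415 r_y0=304.174 r_x1=284.939 r_y1=304.174 r_x2=284.939 r_y2=311.016 r_x3=277.415 r_y3=311.016 {ORIGIN} : de",
]

cells = [parse_cell(line) for line in SAMPLE]
print(group_rows(cells))
# [['mixtral-8x7b-instruct-v01-q', '0.572', '0.587'], ['de']]

# 792.0 is a hypothetical page height (US Letter); use the real page size in practice.
print(to_top_left(cells[1][0], page_height=792.0))
```

Grouping by the bottom edge is a crude heuristic that happens to suit table rows like the ones above; text with mixed font sizes or rotated cells would need something more careful.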
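It is also possible to try the tool from the command line. The command below goes through poetry and points at a script inside the repository, so it is meant to be run from a checkout of the Docling Parse repository.
# To produce the visualizations yourself, run the following (replace word with char or line):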
poetry run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive
Conclusion
Docling Parse, part of the Docling family of tools, is a handy way to extract text, paths and bitmap images with coordinates from programmatic PDFs.
Useful links
Docling Parse GitHub repository: https://github.com/docling-project/docling-parse?tab=readme-ov-file
Docling: https://github.com/docling-project