Implementing Picture Annotation using Remote Visual Language Models and Docling!
Introduction
Visual Language Models (VLMs) are a fascinating advancement in artificial intelligence, capable of understanding and generating human-like text from visual inputs like images and videos. They bridge the gap between computer vision and natural language processing, allowing machines to “see” and “describe” what they see. Picture annotation, the process of adding descriptive metadata or labels to images, is a critical application of VLMs with wide-ranging use cases across various industries. For instance, in e-commerce, accurate picture annotations help with product categorization, searchability, and creating rich descriptions that enhance the customer experience. In healthcare, VLMs can assist in annotating medical images (like X-rays or MRIs) to highlight anomalies or specific regions, aiding diagnosis and research. Autonomous vehicles rely heavily on real-time picture annotation to identify objects, pedestrians, and road signs for safe navigation. Even in content creation and management, VLMs can automate the tagging and organization of vast image libraries, making them easily searchable and usable. In this context, we will explore a practical example of picture annotation leveraging the power of a remote VLM in conjunction with Docling, a powerful document processing library.
The HuggingFaceTB/SmolVLM-256M-Instruct model plays a crucial role in our picture annotation pipeline, serving as the lightweight yet effective Visual Language Model behind the local setup. Despite its small size, it offers a compelling balance of visual-understanding performance and efficient resource utilization. As a "SmolVLM" (Small Visual Language Model), it is designed to run well in resource-constrained environments and in applications where rapid inference is critical, making it an excellent choice for a local vLLM setup. Its instruction-tuned nature means it can follow prompts to generate concise, accurate image descriptions, directly fulfilling the annotation requirements of our Docling-based document processing.
As the remote option, we use the granite-vision-3-2-2b model hosted on the IBM Cloud watsonx.ai platform. This choice is driven by several key factors: granite-vision-3-2-2b is a performant, business-ready vision-language model that meets the demands of accurate and robust picture annotation, and its capabilities align well with the requirements for detailed visual understanding and text generation. Furthermore, watsonx.ai provides a reliable and scalable infrastructure for model deployment and, importantly, a cost-effective way to integrate advanced VLM capabilities into our application without the overhead of managing local infrastructure for larger models.
To achieve our goal, we strategically combine Docling's document conversion strengths (robust parsing, structuring, and multi-format export) with the powerful visual understanding of these models. This synergy allows us not only to accurately extract information from complex documents but also to enrich that information with intelligent, AI-generated picture annotations, ensuring a comprehensive and highly descriptive output across formats like Markdown, HTML, and plain text.
Test and Implementation
As a fundamental best practice in Python development, establishing a virtual environment is crucial for isolating project dependencies and preventing conflicts between different package installations.
Let’s jump into the coding and implementation steps.
- Prepare the virtual environment and the requirements ⬇️
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install docling
### I had to make a lot of adjustments on my MacBook M4 for this
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install vllm
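Once the installation finishes, an optional sanity check helps confirm the packages resolved correctly. This is just a quick sketch using importlib.metadata, not part of the app itself:

```python
from importlib.metadata import version

# Print the installed versions of the packages used in this walkthrough.
for pkg in ("docling", "torch", "torchvision", "vllm"):
    print(f"{pkg}: {version(pkg)}")
```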
- To enable seamless communication between Docling and our local Visual Language Model, HuggingFaceTB/SmolVLM-256M-Instruct, we start a local API server with vLLM using the command shown below. It launches vLLM's OpenAI-compatible server, making the SmolVLM model accessible via a standard API endpoint (http://localhost:8000/v1/chat/completions). The --chat-template argument is particularly crucial here; it explicitly defines how incoming chat messages (formatted in the OpenAI style by Docling) are translated into the specific input format expected by the SmolVLM-256M-Instruct model, ensuring accurate and effective interaction for generating picture descriptions. A quick way to verify the endpoint follows after the command.
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--model HuggingFaceTB/SmolVLM-256M-Instruct \
--chat-template '{% for message in messages %}{% if message.role == "user" %}{{ "USER: " + message.content + "\n" }}{% elif message.role == "assistant" %}{{ "ASSISTANT: " + message.content + "\n" }}{% endif %}{% endfor %}'
This must keep running in the background while you run the app!
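Before wiring Docling to it, you can confirm that the server answers. The snippet below is a minimal, text-only sanity check (it assumes the server above is running on localhost:8000); the actual image requests are sent by Docling itself later.

```python
import requests

# Minimal text-only request against the OpenAI-compatible vLLM endpoint.
payload = {
    "model": "HuggingFaceTB/SmolVLM-256M-Instruct",
    "messages": [{"role": "user", "content": "Reply with a short greeting."}],
    "max_tokens": 30,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```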
- If you intend to use IBM Cloud and watsonx.ai (as I did), create a .env file with your credentials as shown below; a quick check that the variables load correctly follows. (A local setup with Ollama is also possible.) ⬇️
# .env
WX_API_KEY="xxxxxxxxx"
WX_PROJECT_ID="12345"
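As a small optional check (a sketch, assuming the .env file sits in the directory you run the script from), you can confirm the variables are picked up before running the full pipeline:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for var in ("WX_API_KEY", "WX_PROJECT_ID"):
    print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")
```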
Now let's implement the code. I used the out-of-the-box sample from here: https://docling-project.github.io/docling/examples/pictures_description_api/ and adapted it to my test. The sample input file I used also comes from Docling's GitHub repo: https://github.com/docling-project/docling/blob/main/tests/data/pdf/2206.01062.pdf
import logging
import os
import argparse
import time
from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionApiOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
import requests
from docling_core.types.doc import PictureItem
from dotenv import load_dotenv
# Using vLLM
def vllm_local_options(model: str):
    options = PictureDescriptionApiOptions(
        url="http://localhost:8000/v1/chat/completions",  # This is the default vLLM API endpoint
        params=dict(
            model=model,
            seed=42,
            max_completion_tokens=200,
        ),
        prompt="Describe the image in three sentences. Be concise and accurate.",
        timeout=90,
    )
    return options
# The LM Studio variant is kept below for reference but commented out,
# since this example uses vLLM locally and watsonx.ai remotely.
# def lms_local_options(model: str):
#     options = PictureDescriptionApiOptions(
#         url="http://localhost:1234/v1/chat/completions",
#         params=dict(
#             model=model,
#             seed=42,
#             max_completion_tokens=200,
#         ),
#         prompt="Describe the image in three sentences. Be concise and accurate.",
#         timeout=90,
#     )
#     return options
def watsonx_vlm_options():
    load_dotenv()
    api_key = os.environ.get("WX_API_KEY")
    project_id = os.environ.get("WX_PROJECT_ID")

    def _get_iam_access_token(api_key: str) -> str:
        res = requests.post(
            url="https://iam.cloud.ibm.com/identity/token",
            headers={
                "Content-Type": "application/x-www-form-urlencoded",
            },
            data=f"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}",
        )
        res.raise_for_status()
        api_out = res.json()
        print(f"{api_out=}")
        return api_out["access_token"]

    options = PictureDescriptionApiOptions(
        url="https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29",
        params=dict(
            model_id="ibm/granite-vision-3-2-2b",
            project_id=project_id,
            parameters=dict(
                max_new_tokens=400,
            ),
        ),
        headers={
            "Authorization": "Bearer " + _get_iam_access_token(api_key=api_key),
        },
        prompt="Describe the image in three sentences. Be concise and accurate.",
        timeout=60,
    )
    return options
def main():
    logging.basicConfig(level=logging.INFO)
    parser = argparse.ArgumentParser(description="Process a PDF document with Docling.")
    parser.add_argument(
        "input_file",
        type=str,
        help="Path to the input PDF document (e.g., ./data/input/pdf/2206.01062.pdf)",
    )
    args = parser.parse_args()

    input_doc_path = Path(args.input_file)
    if not input_doc_path.exists():
        print(f"Error: Input file not found at {input_doc_path}")
        return

    # Define the output directory
    output_dir = Path("output")
    output_dir.mkdir(parents=True, exist_ok=True)  # Create the directory if it doesn't exist

    pipeline_options = PdfPipelineOptions(
        enable_remote_services=True
    )
    pipeline_options.do_picture_description = True
    # Swap in watsonx_vlm_options() here to use granite-vision-3-2-2b on watsonx.ai
    # instead of the local vLLM server.
    pipeline_options.picture_description_options = vllm_local_options(
        model="HuggingFaceTB/SmolVLM-256M-Instruct"
    )

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )

    print(f"Processing document {input_doc_path}...")

    # --- Start Timer ---
    start_time = time.time()
    print(f"Processing started at: {time.ctime(start_time)}")

    result = doc_converter.convert(input_doc_path)

    # --- End Timer ---
    end_time = time.time()
    print(f"Processing finished at: {time.ctime(end_time)}")
    total_duration = end_time - start_time
    print(f"Total processing time: {total_duration:.2f} seconds")

    # --- Exporting documents using Docling's save_as_ methods ---
    if result and result.document:  # Ensure result and result.document exist
        base_output_name = input_doc_path.stem  # Filename without extension

        try:
            # Save as Markdown (.md)
            md_output_path = output_dir / f"{base_output_name}.md"
            result.document.save_as_markdown(md_output_path)
            print(f"Output saved to {md_output_path}")
        except Exception as e:
            print(f"Error saving markdown output: {e}")

        try:
            # Save as HTML (.html)
            html_output_path = output_dir / f"{base_output_name}.html"
            # You can specify image_mode here if needed, e.g., ImageRefMode.EMBEDDED
            result.document.save_as_html(html_output_path)
            print(f"Output saved to {html_output_path}")
        except Exception as e:
            print(f"Error saving HTML output: {e}")

        try:
            # Save as Plain Text (.txt) using save_as_markdown with strict_text=True
            txt_output_path = output_dir / f"{base_output_name}.txt"
            result.document.save_as_markdown(txt_output_path, strict_text=True)
            print(f"Output saved to {txt_output_path}")
        except Exception as e:
            print(f"Error saving plain text output: {e}")
    else:
        print("Document processing returned no result or document object.")


if __name__ == "__main__":
    main()
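Note that PictureItem is imported but only comes into play if you also want to inspect the generated annotations programmatically instead of only through the exported files. The helper below is a small sketch following the pattern of the Docling pictures_description_api example; you could call it inside main() right after the conversion, e.g. print_picture_annotations(result.document):

```python
from docling_core.types.doc import DoclingDocument, PictureItem

def print_picture_annotations(document: DoclingDocument) -> None:
    """Print the VLM-generated description attached to each picture in the document."""
    for element, _level in document.iterate_items():
        if isinstance(element, PictureItem):
            print(
                f"Picture {element.self_ref}\n"
                f"  Caption: {element.caption_text(doc=document)}\n"
                f"  Annotations: {element.annotations}"
            )
```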
- To test the application, run it with the path to the input PDF as the positional argument:
python app.py ./data/input/pdf/2206.01062.pdf
- What you can expect as output are three distinct files: Markdown, HTML, and plain text.
- Excerpt of the text file:
## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com
Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com
Michele Dolfi IBM Research Rueschlikon, Switzerland dol@zurich.ibm.com
Ahmed S. Nassar IBM Research Rueschlikon, Switzerland ahn@zurich.ibm.com
Peter Staar IBM Research Rueschlikon, Switzerland taa@zurich.ibm.com
## ABSTRACT
Accurate document layout analysis is a key requirement for highquality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied on more challenging and diverse layouts. In this paper, we present DocLayNet , a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNettrained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
## CCS CONCEPTS
· Informationsystems → Documentstructure ; · Appliedcomputing → Document analysis ; · Computing methodologies → Machine learning Computer vision ; ; Object detection ;
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
KDD '22, August 14-18, 2022, Washington, DC, USA
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9385-0/22/08.
https://doi.org/10.1145/3534678.3539043
...
- Excerpt of the Markdown:
...
## Example Predictions
To conclude this section, we illustrate the quality of layout predictions one can expect from DocLayNet-trained models by providing a selection of examples without any further post-processing applied. Figure 6 shows selected layout predictions on pages from the test-set of DocLayNet. Results look decent in general across document categories, however one can also observe mistakes such as overlapping clusters of different classes, or entirely missing boxes due to low confidence.
## 6 CONCLUSION
In this paper, we presented the DocLayNet dataset. It provides the document conversion and layout analysis research community a new and challenging dataset to improve and fine-tune novel ML methods on. In contrast to many other datasets, DocLayNet was created by human annotation in order to obtain reliable layout ground-truth on a wide variety of publication- and typesettingstyles. Including a large proportion of documents outside the scientific publishing domain adds significant value in this respect.
From the dataset, we have derived on the one hand reference metrics for human performance on document-layout annotation (through double and triple annotations) and on the other hand evaluated the baseline performance of commonly used object detection methods. We also illustrated the impact of various dataset-related aspects on model performance through data-ablation experiments, both from a size and class-label perspective. Last but not least, we compared the accuracy of models trained on other public datasets and showed that DocLayNet trained models are more robust.
To date, there is still a significant gap between human and ML accuracy on the layout interpretation task, and we hope that this work will inspire the research community to close that gap.
## REFERENCES
- [1] Max Göbel, Tamir Hassan, Ermelinda Oro, and Giorgio Orsi. Icdar 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition , pages 1449-1453, 2013.
- [2] Christian Clausner, Apostolos Antonacopoulos, and Stefan Pletschacher. Icdar2017 competition on recognition of documents with complex layouts rdcl2017. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) , volume 01, pages 1404-1410, 2017.
- [3] Hervé Déjean, Jean-Luc Meunier, Liangcai Gao, Yilun Huang, Yu Fang, Florian Kleber, and Eva-Maria Lang. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR), April 2019. http://sac.founderit.com/.
...
That’s all 👋
Conclusion
In conclusion, this refined application showcases a comprehensive pipeline for advanced document processing, integrating VLM capabilities for enriched picture annotation. We've combined Docling's robust document conversion features (precise parsing and multi-format output in Markdown, HTML, and plain text) with the inferencing power of the HuggingFaceTB/SmolVLM-256M-Instruct model served locally via vLLM, and showed how ibm/granite-vision-3-2-2b on watsonx.ai can be plugged in as a remote alternative. By adhering to Python best practices with virtual environments and configuring the vLLM API server with the correct chat template, we've demonstrated a streamlined and efficient way to automatically extract and enrich visual information within documents. This provides a versatile tool for industries ranging from e-commerce and healthcare to autonomous vehicles and content management, enabling deeper insights and more effective use of visual data.
Links
- Docling VLM annotation: https://docling-project.github.io/docling/examples/pictures_description_api/
- Hugging Face SmolVLM used: https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct
- IBM Granite: https://huggingface.co/ibm-granite