Yeshwanth Reddy

📊 Exploring Vision Language Models (VLMs) for Structured Data Extraction

Over the past few weeks, I've been studying the effectiveness of Vision Language Models (VLMs) for structured data extraction from documents. While many benchmarks exist, none focus exclusively on structured data extraction, which led me to develop a framework that does just that.

๐Ÿ” Models Tested:

  • Open Source: Qwen2, MiniCPM, Bunny
  • Closed Source: GPT-4o mini, Gemini 1.5 Flash, Claude 3.5

These models were chosen because they either offer the most affordable APIs in their respective families or can run on a consumer GPU (less than 24 GB of VRAM).

📚 Datasets:

  • SROIE and CORD receipt datasets: standardized, complex, and ideal for benchmarking VLMs on structured data extraction.
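To make "benchmarking structured data extraction" concrete, here is a minimal sketch of one way to score a model's output against a ground-truth record. It uses field-level exact match after light normalization, which is an assumption for illustration and not necessarily the metric used in the study. The function names and the sample receipt values are hypothetical; the field names (company, date, address, total) follow the SROIE key-information task.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences are not counted as errors."""
    return " ".join(str(value).lower().split())


def field_accuracy(predicted, expected):
    """Return the fraction of expected fields whose predicted value matches after normalization."""
    if not expected:
        return 0.0
    correct = sum(
        1
        for key, truth in expected.items()
        if key in predicted and normalize(predicted[key]) == normalize(truth)
    )
    return correct / len(expected)


# Hypothetical ground truth and model output for a single receipt.
expected = {
    "company": "ACME STORE",
    "date": "2018-03-25",
    "address": "1 Main Street",
    "total": "12.50",
}
predicted = {
    "company": "Acme  Store",   # differs only in case and spacing, counts as correct
    "date": "2018-03-25",
    "address": "1 Main Street",
    "total": "12.05",           # transposed digits, counts as an error
}

score = field_accuracy(predicted, expected)
print(score)  # 0.75
```

Averaging this score over every receipt in a dataset gives a single number per model. Nested structures such as CORD's line items would need a more careful comparison than this flat dictionary check.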

💡 Key Takeaways:

  • Qwen2 is the top-performing open-source model, while Claude 3.5 leads among closed-source models, though both are also the most expensive in their categories.
  • Both model types perform similarly on the simpler SROIE dataset, but closed-source models clearly outperform on the more complex CORD dataset.
  • Paying more for higher accuracy can work out cheaper for customers overall, because the cost of catching and correcting errors from a low-cost model may exceed the API savings.
  • Open-source VLMs have room for fine-tuning, potentially closing the gap with closed-source models.
  • I also discovered a free Google API that offers a decent number of calls per day, an exciting find!
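The cost-versus-accuracy takeaway can be illustrated with some quick arithmetic. The sketch below assumes a simple model in which every wrong extraction incurs a fixed manual correction cost. All prices, accuracies, and the correction cost are made-up numbers chosen for illustration, not results from the study.

```python
def cost_per_document(api_cost, accuracy, correction_cost):
    """Expected cost of one document: the API call plus the expected cost of fixing a wrong extraction."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return api_cost + (1.0 - accuracy) * correction_cost


# Hypothetical figures: a cheap, less accurate model versus a pricier, more accurate one.
# Both share the same assumed cost (in dollars) for a human to fix one bad extraction.
cheap = cost_per_document(api_cost=0.001, accuracy=0.80, correction_cost=0.50)
accurate = cost_per_document(api_cost=0.010, accuracy=0.95, correction_cost=0.50)

print(f"cheap model:    ${cheap:.3f} per document")
print(f"accurate model: ${accurate:.3f} per document")
```

Under these assumed numbers the model with the 10x higher API price is still roughly three times cheaper per document once corrections are included. The conclusion flips if corrections are nearly free, so the correction cost is the figure worth estimating first.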

🔗 Check out the full study here: https://nanonets.com/blog/vision-language-model-vlm-for-data-extraction/

🔗 Source code available here: github.com/nanonets/hands-on-vision-language-models/

I'm currently on the lookout for new benchmarks that focus on structured data extraction from documents. If you know of any relevant datasets, I'd love to benchmark them and update the study in my free time.
Feedback is much appreciated!
