Swapnil Vasant Taware

Intelligent PDF Data Extraction and Database Creation

Project Objective: To create a system that extracts structured and unstructured data from vendor-uploaded PDFs and stores this data in a database for indexing and querying. The system should also support a chatbot capable of answering questions related to the PDF content.

Project Details:

Input Requirements:

PDFs with diverse structures, including plain text, headings, paragraphs, tables, and bullet points.
Examples include Requests for Quotations (RFQs), contracts, manuals, and reports.

Key Features:

Extract all relevant data from PDFs, ignoring irrelevant sections like headers and footers.
Recognize and structure tables accurately, associating them with their respective titles or captions (found in bold text, usually followed by a colon). Handle nested data within tables if applicable.
Identify and extract bullet points within paragraphs and organize them as nested lists.
Dynamically structure text content using headings as keys and their corresponding text as values.
Clean extracted data by removing unnecessary symbols and normalizing spaces.
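The heading-as-key, bullets-as-list, and cleanup features can be sketched with the standard library alone. This is a minimal sketch, not the project's actual code: the names `clean_text`, `structure_lines`, and `BULLET_RE` are hypothetical, the set of symbols kept by `clean_text` is an assumption, and the caller is assumed to supply its own `is_heading` test (font size, bold, casing, or similar).

```python
import re

# Assumed bullet markers: a bullet glyph, dash, asterisk, or "1." / "1)".
BULLET_RE = re.compile(r"^\s*(?:[•\-\*\u2013]|\d+[.)])\s+")


def clean_text(text):
    """Replace symbols outside an assumed allow-list, then normalize spaces."""
    text = re.sub(r"[^\w\s.,;:()%/&'-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def structure_lines(lines, is_heading):
    """Group lines under the most recent heading.

    A section with only bullets becomes a list, a section with only prose
    becomes a string, and a mixed section keeps both under "text"/"items".
    """
    sections = {}
    current = None
    for raw in lines:
        if not raw.strip():
            continue
        if is_heading(raw):
            current = clean_text(raw)
            sections[current] = {"text": [], "bullets": []}
        elif current is not None:
            if BULLET_RE.match(raw):
                item = clean_text(BULLET_RE.sub("", raw))
                sections[current]["bullets"].append(item)
            else:
                sections[current]["text"].append(clean_text(raw))

    result = {}
    for heading, body in sections.items():
        text = " ".join(body["text"])
        if body["bullets"] and not body["text"]:
            result[heading] = body["bullets"]
        elif body["bullets"]:
            result[heading] = {"text": text, "items": body["bullets"]}
        else:
            result[heading] = text
    return result
```

Lines that appear before the first heading are dropped in this sketch; a real pipeline would probably want to keep them under a placeholder key.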

Data Storage and Querying:

Store extracted data in Elasticsearch for efficient indexing and search capabilities.
Ensure the database schema supports structured data (e.g., tables) and unstructured text.
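One possible schema, written as a Python dict, stores each section as its own document with a `kind` field that says whether it holds text, a list, or a table. This is a sketch under assumptions: all field names (`doc_id`, `heading`, `kind`, `text`, `items`, `rows`, `position`) are made up for illustration, and `nested` is just one option for table rows.

```python
# Hypothetical index body: one Elasticsearch document per PDF section.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "doc_id": {"type": "keyword"},    # which PDF the section came from
            "position": {"type": "integer"},  # order of the section in the PDF
            "kind": {"type": "keyword"},      # "text", "list", or "table"
            "heading": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},  # exact-match variant
            },
            "text": {"type": "text"},         # unstructured paragraphs
            "items": {"type": "text"},        # bullet points (arrays need no special type)
            "rows": {"type": "nested"},       # table rows, one object per row
        }
    }
}

# With the official Python client, a dict like this would be passed to
# `indices.create`; the exact keyword arguments depend on the client version.
```

Because column names differ from table to table, dynamic sub-fields under `rows` can grow without bound across many vendors. If that becomes a problem, the `flattened` field type is worth evaluating as an alternative.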

Technical Challenges:

Data Accuracy: Ensuring tables, bullet points, and text are extracted correctly and associated with the right headings.
Header/Footer Removal: Dynamically ignoring irrelevant header/footer content without affecting the core data.
Title Detection for Tables: Associating tables with the correct titles using proximity and formatting cues.
Nested Content: Structuring paragraphs containing bullet points into hierarchical formats for better clarity.
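For the header/footer challenge, one common heuristic is to treat a line as boilerplate only if it recurs near the top or bottom of most pages. The sketch below assumes each page is already a list of text lines; the function names and the `depth` and `min_share` defaults are illustrative guesses, not tuned values.

```python
import math
import re
from collections import Counter


def _normalize(line):
    # Replace digit runs so "Page 3 of 10" and "Page 4 of 10" compare equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def find_repeated_margins(pages, depth=2, min_share=0.6):
    """Normalized lines found in the first/last `depth` lines of at least
    `min_share` of the pages (and on at least two pages)."""
    counts = Counter()
    for lines in pages:
        lines = [l for l in lines if l.strip()]
        margin = lines[:depth] + lines[-depth:]
        counts.update({_normalize(l) for l in margin})  # count once per page
    threshold = max(2, math.ceil(min_share * len(pages)))
    return {line for line, n in counts.items() if n >= threshold}


def strip_margins(pages, depth=2, min_share=0.6):
    """Drop repeated margin lines; lines in the page body are never touched."""
    repeated = find_repeated_margins(pages, depth, min_share)
    cleaned = []
    for lines in pages:
        lines = [l for l in lines if l.strip()]
        last = len(lines) - 1
        cleaned.append([
            line for i, line in enumerate(lines)
            if not ((i < depth or i > last - depth) and _normalize(line) in repeated)
        ])
    return cleaned
```

Restricting removal to the margin band is what protects relevant data: a sentence that happens to repeat in the middle of several pages is kept.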

Desired Outcome:
A script or pipeline that can process a PDF to output structured JSON data. Example format:

{
    "Heading 1": "Text under heading 1",
    "Heading 2": [
        "Bullet point 1",
        "Bullet point 2",
        "Bullet point 3"
    ],
    "Table Title": [
        {"Column 1": "Value 1", "Column 2": "Value 2"},
        {"Column 1": "Value 3", "Column 2": "Value 4"}
    ]
}

Integration with Elasticsearch to index this structured data.

A chatbot API capable of answering natural language questions about the extracted data.
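To connect the JSON output to indexing, the structured dict can be split into one document per heading. The sketch below only builds the action dicts; the function name, the index name `pdf_sections`, and the field names are assumptions. The `_index`/`_id`/`_source` shape is the one accepted by `elasticsearch.helpers.bulk` in the official Python client.

```python
def to_bulk_actions(doc_id, structured, index="pdf_sections"):
    """Yield one bulk action per heading of the structured output."""
    for position, (heading, value) in enumerate(structured.items()):
        source = {"doc_id": doc_id, "heading": heading, "position": position}
        if isinstance(value, str):
            source.update(kind="text", text=value)
        elif value and all(isinstance(v, dict) for v in value):
            source.update(kind="table", rows=value)  # list of row dicts
        else:
            source.update(kind="list", items=list(value))  # bullet points
        yield {
            "_index": index,
            "_id": f"{doc_id}-{position}",  # stable id, so re-runs overwrite
            "_source": source,
        }
```

Deriving `_id` from the document id and section position means re-processing the same PDF replaces its sections instead of duplicating them.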

Current Progress:

Developed base Python scripts using pdfplumber and Apache Tika for text and table extraction.
Implemented logic to remove headers and footers and validate extracted tables.
Structured data into key-value pairs using headings as keys and nested bullet points as values.
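For reference, a page-by-page extraction loop with pdfplumber might look like the sketch below. It is not the scripts described above: the helper names and margin sizes are assumptions, and it assumes the page origin is at (0, 0). It uses `pdfplumber.open`, `page.crop`, `page.extract_text`, and `page.extract_tables`.

```python
def body_bbox(width, height, top_margin=50, bottom_margin=40):
    """Bounding box (x0, top, x1, bottom) that excludes the margin bands.

    Margin sizes are in PDF points and are guesses to be tuned per vendor.
    """
    return (0, top_margin, width, height - bottom_margin)


def iter_pages(path, top_margin=50, bottom_margin=40):
    """Yield (page_number, text, tables) one page at a time."""
    # Third-party dependency, imported lazily so body_bbox works without it.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            bbox = body_bbox(page.width, page.height, top_margin, bottom_margin)
            body = page.crop(bbox)
            # extract_text() can return None on pages without a text layer.
            yield page.page_number, body.extract_text() or "", body.extract_tables()
```

Yielding one page at a time keeps only the current page's results in the caller's hands, which helps with large PDFs. Fixed-size cropping is blunt, though, so it is better combined with a content-based check than used alone.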

Help Needed:

Enhancing the table extraction logic to ensure accurate table title detection from bold text and to handle complex tables with merged cells or irregular structures.
Optimizing the removal of headers/footers to ensure no relevant data is lost.
Recommendations for integrating the chatbot with Elasticsearch for effective querying.
Best practices for handling large PDFs with complex structures.
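As a starting point for the two table problems, here is a sketch that works on plain Python data. It assumes character dicts shaped like pdfplumber's `page.chars` (keys `text`, `x0`, `top`, `fontname`) and tables shaped like `extract_tables()` output (lists of rows, with `None` for empty cells). Treating "bold" in the font name as the bold signal is a heuristic, and all function names and thresholds are hypothetical.

```python
def bold_lines(chars, tolerance=2.0):
    """Group characters into lines by `top`; return (top, text) for each
    line whose visible characters all use a font with "bold" in its name."""
    lines = []
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(ch["top"] - lines[-1]["top"]) <= tolerance:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"top": ch["top"], "chars": [ch]})

    result = []
    for line in lines:
        ordered = sorted(line["chars"], key=lambda c: c["x0"])
        visible = [c for c in ordered if c["text"].strip()]
        if visible and all("bold" in c["fontname"].lower() for c in visible):
            text = "".join(c["text"] for c in ordered).strip()
            result.append((line["top"], text))
    return result


def title_for_table(bold, table_top, max_gap=40.0):
    """Closest bold line above the table within `max_gap` points, or None."""
    above = [(table_top - top, text) for top, text in bold
             if 0 < table_top - top <= max_gap]
    if not above:
        return None
    return min(above)[1].rstrip(":").strip()


def table_to_records(rows):
    """Turn header + body rows into dicts, filling empty cells from the row
    above (a guess at how vertically merged cells should be read)."""
    header, *body = rows
    previous = [None] * len(header)
    records = []
    for row in body:
        filled = [cell if cell not in (None, "") else prev
                  for cell, prev in zip(row, previous)]
        previous = filled
        records.append(dict(zip(header, filled)))
    return records
```

The fill-down step cannot tell a merged cell from a cell that is genuinely empty, so it may need to be limited to columns known to contain merges. Horizontally merged header cells are not handled here.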

Expected Community Support:
Looking for code samples, architecture recommendations, and best practices to:
Refine PDF data extraction (focus on accuracy and efficiency).
Improve the organization of nested and tabular data.
Scale the solution for high volumes of data.
Enhance the chatbot's ability to interpret and answer queries effectively.
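On the chatbot side, one simple pattern is to retrieve the best-matching sections from Elasticsearch and pass them as context to whatever component generates the answer. The sketch below covers only the retrieval half. It assumes documents with `heading`, `text`, and `items` fields (hypothetical names) and uses the standard `multi_match` query and the standard `hits.hits[]._source` response shape.

```python
def build_search_body(question, size=5):
    """Search body that matches the question against headings and content."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": question,
                # "^2" boosts heading matches; the weight is an untuned guess.
                "fields": ["heading^2", "text", "items"],
            }
        },
    }


def hits_to_context(response, max_chars=2000):
    """Flatten search hits into a text block for the answering component."""
    parts = []
    for hit in response["hits"]["hits"]:
        src = hit["_source"]
        body = src.get("text") or "; ".join(src.get("items", []))
        parts.append(f"{src['heading']}: {body}")
    return "\n".join(parts)[:max_chars]
```

Table rows are left out of this query. If they are stored as a `nested` field, they would need a separate `nested` query clause.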
