Intelligent PDF Data Extraction and Database Creation

#python #database #pdf #elasticsearch

Project Objective: To create a system that extracts structured and unstructured data from vendor-uploaded PDFs and stores this data in a database for indexing and querying. The system should also support a chatbot capable of answering questions related to the PDF content.

Project Details:

Input Requirements:

PDFs with diverse structures, including plain text, headings, paragraphs, tables, and bullet points.
Examples include Requests for Quotations (RFQs), contracts, manuals, and reports.

Key Features:

Extract all relevant data from PDFs, ignoring irrelevant sections like headers and footers.
Recognize and structure tables accurately, associating them with their respective titles or captions (found in bold text, usually followed by a colon). Handle nested data within tables if applicable.
Identify and extract bullet points within paragraphs and organize them as nested lists.
Dynamically structure text content using headings as keys and their corresponding text as values.
Clean extracted data by removing unnecessary symbols and normalizing spaces.
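The cleaning and heading/bullet structuring steps above could be sketched roughly as follows. This is a minimal sketch, not the project's actual code: `clean_text`, `structure_lines`, and the `is_heading` callback are hypothetical names, and the bullet pattern is an assumption about which markers the PDFs use. Lines that appear before the first heading are dropped, and wrapped bullet continuation lines are treated as plain text, so a real pipeline would likely need more care.

```python
import re

# Assumed bullet markers: a few common symbols, or "1." / "1)" style numbering.
BULLET_RE = re.compile(r"^(?:[\u2022\u25cf\u25aa\-\*]|\d+[.)])\s+")


def clean_text(text):
    """Drop control characters and collapse runs of whitespace."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def structure_lines(lines, is_heading):
    """Group lines under headings.

    `is_heading` is a caller-supplied function (hypothetical) that decides
    whether a cleaned line is a heading, e.g. based on font or casing.
    """
    sections = {}
    current = None
    for raw in lines:
        line = clean_text(raw)
        if not line:
            continue
        if is_heading(line):
            current = line.rstrip(":").strip()
            sections[current] = {"text": [], "items": []}
        elif current is not None:
            if BULLET_RE.match(line):
                sections[current]["items"].append(BULLET_RE.sub("", line))
            else:
                sections[current]["text"].append(line)

    result = {}
    for heading, parts in sections.items():
        text = " ".join(parts["text"])
        if parts["items"] and text:
            # Mixed content: keep the paragraph and nest the bullets under it.
            result[heading] = {"text": text, "items": parts["items"]}
        elif parts["items"]:
            result[heading] = parts["items"]
        else:
            result[heading] = text
    return result
```

Passing the heading test in as a function keeps the grouping logic independent of how headings are actually recognized (font size, bold text, numbering), which is likely to differ between document types.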

Data Storage and Querying:

Store extracted data in Elasticsearch for efficient indexing and search capabilities.
Ensure the database schema (in Elasticsearch, the index mapping) supports both structured data (e.g., tables) and unstructured text.
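One way to hold both kinds of content in a single index is to store one document per heading, with a searchable `body` field for every section and the raw table rows kept alongside it. The sketch below is an assumption-heavy illustration: the field names (`source_file`, `heading`, `content_type`, `body`, `rows`), the index name, and the choice of the `flattened` field type for table rows are all hypothetical and should be checked against the Elasticsearch version in use. It only covers the three value shapes used in this post (a string, a list of strings, a list of row dictionaries), and it builds the newline-delimited body expected by the `_bulk` endpoint without sending anything.

```python
import json

# Hypothetical mapping: "flattened" avoids a mapping explosion when table
# column names differ from document to document.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "source_file": {"type": "keyword"},
            "heading": {"type": "text", "fields": {"raw": {"type": "keyword"}}},
            "content_type": {"type": "keyword"},
            "body": {"type": "text"},
            "rows": {"type": "flattened"},
        }
    }
}


def to_documents(extracted, source_file):
    """Turn {heading: value} output into one document per heading."""
    docs = []
    for heading, value in extracted.items():
        doc = {"source_file": source_file, "heading": heading}
        if isinstance(value, str):
            doc.update(content_type="text", body=value)
        elif value and all(isinstance(v, dict) for v in value):
            # Tables: keep the rows, plus a text rendering for full-text search.
            flat = "\n".join(
                " | ".join(f"{k}: {v}" for k, v in row.items()) for row in value
            )
            doc.update(content_type="table", rows=value, body=flat)
        else:
            doc.update(content_type="list", body="\n".join(str(v) for v in value))
        docs.append(doc)
    return docs


def to_bulk_body(docs, index_name):
    """Build an NDJSON payload: one action line, then one source line, per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline
```

Rendering each table into `body` as well as `rows` means a plain full-text query can match table values without knowing the column names in advance.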

Technical Challenges:

Data Accuracy: Ensuring tables, bullet points, and text are extracted correctly and associated with the right headings.
Header/Footer Removal: Dynamically ignoring irrelevant header/footer content without affecting the core data.
Title Detection for Tables: Associating tables with the correct titles using proximity and formatting cues.
Nested Content: Structuring paragraphs containing bullet points into hierarchical formats for better clarity.

Desired Outcome:
A script or pipeline that can process a PDF to output structured JSON data. Example format:

{
    "Heading 1": "Text under heading 1",
    "Heading 2": [
        "Bullet point 1",
        "Bullet point 2",
        "Bullet point 3"
    ],
    "Table Title": [
        {"Column 1": "Value 1", "Column 2": "Value 2"},
        {"Column 1": "Value 3", "Column 2": "Value 4"}
    ]
}

Integration with Elasticsearch to index this structured data.

A chatbot API capable of answering natural language questions about the extracted data.
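A chatbot layer could start as plain retrieval: turn the question into an Elasticsearch query, then hand the top hits to whatever component composes the answer. The sketch below only builds the request body and formats a response; it does not call Elasticsearch. The field names `heading`, `body`, and `source_file` are assumptions about how the sections were indexed, and `build_search_body` and `hits_to_context` are hypothetical helpers.

```python
def build_search_body(question, source_file=None, size=5):
    """Build a search request body for a natural language question."""
    query = {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": question,
                        # Boost matches in the heading over matches in the body.
                        "fields": ["heading^2", "body"],
                    }
                }
            ]
        }
    }
    if source_file:
        # Optionally restrict the search to one uploaded PDF.
        query["bool"]["filter"] = [{"term": {"source_file": source_file}}]
    return {
        "size": size,
        "query": query,
        "highlight": {"fields": {"body": {}}},
    }


def hits_to_context(response, max_chars=2000):
    """Concatenate the top hits into a context string for answer generation."""
    parts = []
    total = 0
    for hit in response["hits"]["hits"]:
        src = hit["_source"]
        snippet = f"{src.get('heading', '')}: {src.get('body', '')}"
        if total + len(snippet) > max_chars:
            break
        parts.append(snippet)
        total += len(snippet)
    return "\n".join(parts)
```

The returned dictionary can be sent as the JSON body of a `_search` request against the index; the context string is then what a rule-based or model-based answer step would work from.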

Current Progress:

Developed base Python scripts using pdfplumber and Apache Tika for text and table extraction.
Implemented logic to remove headers and footers and validate extracted tables.
Structured data into key-value pairs using headings as keys and nested bullet points as values.

Help Needed:

Enhancing the table extraction logic to ensure accurate table title detection from bold text and to handle complex tables with merged cells or irregular structures.
Optimizing the removal of headers/footers to ensure no relevant data is lost.
Recommendations for integrating the chatbot with Elasticsearch for effective querying.
Best practices for handling large PDFs with complex structures.
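As a possible starting point for the table questions, the sketch below works on plain Python data so it can be tried without a PDF. It assumes word dictionaries shaped like those from pdfplumber's `page.extract_words(extra_attrs=["fontname"])` (keys `text`, `x0`, `top`, `bottom`, `fontname`), a table top coordinate such as the second value of a found table's `bbox`, and that merged cells show up as `None` in the extracted rows. The helper names, the "bold" substring test on the font name, and the `max_gap` value are all hypothetical heuristics, not guaranteed behavior.

```python
def find_table_title(words, table_top, max_gap=40.0):
    """Return the nearest line of bold text just above a table, or None.

    Assumes bold fonts carry "bold" in their font name, which is common
    but not universal.
    """
    lines = {}
    for w in words:
        if "bold" not in w["fontname"].lower():
            continue
        gap = table_top - w["bottom"]
        if gap < 0 or gap > max_gap:
            continue  # below the table top, or too far above it
        lines.setdefault(round(w["top"]), []).append(w)
    if not lines:
        return None
    nearest = max(lines)  # largest "top" value is the line closest to the table
    ordered = sorted(lines[nearest], key=lambda w: w["x0"])
    return " ".join(w["text"] for w in ordered).rstrip(":").strip()


def rows_to_records(table):
    """Convert extracted rows into dicts, forward-filling merged cells.

    A None cell is assumed to belong to a vertically merged cell and takes
    the value from the row above; empty strings are kept as real empty cells.
    """
    if not table:
        return []
    header = [
        (cell or "").strip() or f"Column {i + 1}"
        for i, cell in enumerate(table[0])
    ]
    records = []
    previous = [None] * len(header)
    for row in table[1:]:
        filled = [
            prev if cell is None else cell
            for cell, prev in zip(row, previous)
        ]
        records.append(dict(zip(header, filled)))
        previous = filled
    return records
```

Forward-filling is a guess about what a merged cell means; for tables where a blank cell is genuinely blank, the fill step would need to be disabled or driven by the cell geometry instead.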

Expected Community Support:
Looking for code samples, architecture recommendations, and best practices to:
Refine PDF data extraction (focus on accuracy and efficiency).
Improve the organization of nested and tabular data.
Scale the solution for high volumes of data.
Enhance the chatbot's ability to interpret and answer queries effectively.
