Rules! Rules! Rules! LLMs are exceptionally good at transforming content and extracting relevant information, making them ideal building blocks for robust document processing. However, without clear rules, even the most advanced LLMs can struggle to deliver consistent and precise outcomes. DataBridge leverages the power of LLMs, guided by user-defined rules, eliminating the need for developers to create complex custom pipelines. With rules acting as straightforward instructions, we enable reliable, repeatable document processing to unlock the full potential of your unstructured data.
Before we jump in, here's a note about DataBridge for those who are new:
DataBridge (Encel AI)
DataBridge is a fully open-source (MIT-licensed), scalable, multimodal database for unstructured data. Developed by Encel AI, DataBridge simplifies handling complex documents by using a rules-based approach that guides LLMs to perform consistent, reliable document processing.
Repository: https://github.com/databridge-org/databridge-core
Now back to rules. This explainer outlines the powerful rules-based document ingestion capabilities in DataBridge, focusing on practical use cases like metadata extraction and content transformation.
Core Concepts
DataBridge's rules system currently allows for two key operations during document ingestion:
- Metadata Extraction: Pull structured information from documents into searchable metadata.
- Content Transformation: Modify document content during ingestion (redaction, summarization, etc.).
Rules are processed in sequence, allowing you to chain operations (e.g., extract metadata from a document, then redact sensitive information).
Architecture Overview
The rules engine works by:
- Accepting rule definitions during document ingestion
- Converting each rule to the appropriate model class
- Sequentially applying rules to the document content
- Using an LLM to perform the extractions or transformations
- Storing both the extracted metadata and the modified content (a simplified sketch of this loop follows)
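To make that loop concrete, here's a minimal sketch of the pattern. The BaseRule interface, apply() method, and return shape are assumptions for illustration, not DataBridge's actual internal API:

# A simplified illustration of the rule-model loop described above.
# NOTE: BaseRule and apply() are hypothetical, not DataBridge internals.
from abc import ABC, abstractmethod

class BaseRule(ABC):
    """A rule receives document text and returns (metadata, transformed_text)."""

    @abstractmethod
    def apply(self, content: str) -> tuple[dict, str]:
        ...

def process_rules(content: str, rules: list[BaseRule]) -> tuple[dict, str]:
    """Apply rules in sequence, accumulating metadata and threading content."""
    metadata: dict = {}
    for rule in rules:
        rule_metadata, content = rule.apply(content)  # each rule sees prior output
        metadata.update(rule_metadata)
    return metadata, content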
Rule Types
MetadataExtractionRule
This rule type extracts structured data from document content according to a schema. It's perfect for converting unstructured documents into structured, queryable data.
from databridge import DataBridge
from databridge.rules import MetadataExtractionRule
from pydantic import BaseModel

# Define a schema for the metadata you want to extract
class ResumeInfo(BaseModel):
    name: str
    email: str
    phone: str
    skills: list[str]
    education: list[dict]
    experience: list[dict]

# Connect to DataBridge
db = DataBridge()

# Ingest a resume with metadata extraction
doc = db.ingest_file(
    "resume.pdf",
    metadata={"type": "resume"},
    rules=[
        MetadataExtractionRule(schema=ResumeInfo)
    ]
)

# The extracted metadata is now available
print(f"Candidate: {doc.metadata['name']}")
print(f"Skills: {', '.join(doc.metadata['skills'])}")
print(f"Education: {len(doc.metadata['education'])} entries")
NaturalLanguageRule
This rule type transforms document content according to natural language instructions. It's perfect for redaction, summarization, formatting changes, etc.
from databridge import DataBridge
from databridge.rules import NaturalLanguageRule

# Connect to DataBridge
db = DataBridge()

# Ingest a document with PII redaction
doc = db.ingest_file(
    "medical_record.pdf",
    rules=[
        NaturalLanguageRule(
            prompt="Remove all personally identifiable information (PII) including patient names, "
                   "addresses, phone numbers, and any other identifying details. Replace with [REDACTED]."
        )
    ]
)

# The document is stored with PII removed
Technical Implementation
Under the hood, the rules processing system leverages large language models (LLMs) to perform both metadata extraction and content transformation. The system is designed to be modular and configurable.
LLM Integration
DataBridge supports multiple LLM providers:
- OpenAI: For high-quality results with cloud-based processing
- Ollama: For self-hosted, private LLM deployments
The LLM provider is configured in your databridge.toml file:
[rules]
provider = "openai" # or "ollama" for self-hosted
model = "gpt-4o" # or your preferred model
batch_size = 4096 # Size of content to process at once
Rules Processing Logic
When a document with rules is ingested:
- The document is parsed and its text is extracted
- The rules are validated and converted to model classes
- For each rule:
  - The appropriate prompt is constructed based on the rule type
  - The prompt and document content are sent to the LLM
  - The LLM's response is parsed (JSON for metadata, plain text for transformed content)
  - Results are stored according to the rule type
For metadata extraction, the LLM is instructed to return structured JSON that matches your schema. For content transformation, the LLM modifies the text based on your natural language instructions.
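As an illustration of the metadata path, the sketch below shows how a schema-driven prompt could be assembled and its JSON response validated with pydantic. The prompt wording and the call_llm helper are assumptions for illustration, not DataBridge's actual internals:

import json
from pydantic import BaseModel

# Hypothetical stand-in for the configured LLM provider (OpenAI or Ollama).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your provider here")

def extract_metadata(content: str, schema: type[BaseModel]) -> dict:
    # Build a prompt that asks for JSON matching the schema's fields.
    prompt = (
        "Extract the following fields from the document and return ONLY "
        f"valid JSON matching this JSON Schema:\n{json.dumps(schema.model_json_schema())}\n\n"
        f"Document:\n{content}"
    )
    response = call_llm(prompt)
    # Validate the LLM's JSON against the schema before storing it.
    return schema.model_validate_json(response).model_dump()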
Performance Considerations
For large documents, DataBridge automatically chunks the content and processes rules in batches, so documents of any size are handled efficiently. The batch_size setting in databridge.toml determines how the content is split before it is passed to the LLM.
Larger batch sizes may improve throughput but require more memory, and very large batches can yield unreliable results. For complex rules or larger documents, tune this setting to your hardware capabilities and your application's latency requirements.
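As a rough mental model (not DataBridge's exact splitting logic), character-count batching could look like this:

def split_into_batches(content: str, batch_size: int = 4096) -> list[str]:
    """Split document text into fixed-size pieces for per-batch rule processing.
    A simplified illustration; real splitters usually respect word or sentence
    boundaries rather than cutting mid-token."""
    return [content[i:i + batch_size] for i in range(0, len(content), batch_size)]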
Example Use Cases
Resume Processing System
Let's explore a complete example for processing resumes.
from databridge import DataBridge
from databridge.rules import MetadataExtractionRule, NaturalLanguageRule
from pydantic import BaseModel

# Define resume schema
class ResumeInfo(BaseModel):
    name: str
    email: str
    phone: str
    location: str
    skills: list[str]
    years_experience: int
    education: list[dict]
    work_history: list[dict]
    certifications: list[str] = []

# Create DataBridge client
db = DataBridge()

# Process resume with multiple rules
doc = db.ingest_file(
    "example_resume.pdf",
    metadata={"type": "resume", "source": "careers_page"},
    rules=[
        # First extract structured metadata
        MetadataExtractionRule(schema=ResumeInfo),
        # Then anonymize the content
        NaturalLanguageRule(
            prompt="Remove all personal contact information (name, email, phone, address) "
                   "but keep skills, experience, and qualifications."
        )
    ]
)

# Retrieve candidates in specific locations
candidates = db.retrieve_docs(
    "python machine learning",
    filters={"type": "resume", "location": "San Francisco"}
)

for candidate in candidates:
    print(f"Candidate: {candidate.metadata['name']}")
    print(f"Experience: {candidate.metadata['years_experience']} years")
    print(f"Skills: {', '.join(candidate.metadata['skills'])}")
    print(f"Education: {candidate.metadata['education'][0]['degree']} from {candidate.metadata['education'][0]['institution']}")
    print("---")
Medical Document Processing with PII Redaction
Medical documents often contain sensitive information that needs redaction while preserving clinical value.
from databridge import DataBridge
from databridge.rules import MetadataExtractionRule, NaturalLanguageRule
from pydantic import BaseModel

# Define medical record schema
class MedicalRecord(BaseModel):
    patient_id: str  # De-identified ID
    date_of_visit: str
    diagnosis: list[str]
    medications: list[str]
    procedures: list[str]
    lab_results: dict
    doctor_notes: str

# Create DataBridge client
db = DataBridge()

# Process medical document
doc = db.ingest_file(
    "patient_record.pdf",
    metadata={"type": "medical_record", "department": "cardiology"},
    rules=[
        # First extract structured metadata
        MetadataExtractionRule(schema=MedicalRecord),
        # Then redact PII from the content
        NaturalLanguageRule(
            prompt="Redact all patient personally identifiable information (PII) including: "
                   "1. Patient name (replace with 'PATIENT') "
                   "2. Exact dates of birth (keep only year) "
                   "3. Addresses (replace with '[ADDRESS]') "
                   "4. Phone numbers (replace with '[PHONE]') "
                   "5. SSN and other ID numbers (replace with '[ID]') "
                   "6. Names of patient family members "
                   "Preserve all clinical information, symptoms, diagnoses, and treatments."
        )
    ]
)

# The document is now stored with structured metadata and redacted content
records = db.retrieve_docs(
    "myocardial infarction treatment",
    filters={"type": "medical_record", "department": "cardiology"}
)

for record in records:
    print(f"Record ID: {record.metadata['patient_id']}")
    print(f"Diagnosis: {', '.join(record.metadata['diagnosis'])}")
    print(f"Medications: {', '.join(record.metadata['medications'])}")
    print(f"Procedures: {', '.join(record.metadata['procedures'])}")
    print("---")

# Or retrieve specific chunks of content
chunks = db.retrieve_chunks(
    "myocardial infarction symptoms",
    filters={"type": "medical_record", "department": "cardiology"}
)

for chunk in chunks:
    print(f"Content: {chunk.content[:100]}...")  # First 100 chars
    print(f"Document ID: {chunk.document_id}")
    print(f"Relevance Score: {chunk.score}")
    print("---")
Optimizing Rule Performance
Prompt Engineering
The effectiveness of your rules depends significantly on the quality of prompts and schemas:
- Be Specific: Clearly define what you want extracted or transformed; for schemas, per-field descriptions help (see the sketch after this list)
- Provide Examples: For complex schemas, include examples in the prompt
- Limit Scope: Focus each rule on a specific task rather than trying to do too much at once
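One way to be specific with schema-based extraction is to attach a description to each field; pydantic includes these in the generated JSON schema, which gives the model more context. Whether DataBridge forwards field descriptions to the LLM is an assumption here, but the pattern costs nothing:

from pydantic import BaseModel, Field

# Hypothetical invoice schema: each description tells the LLM exactly
# what to put in the field, including the expected format.
class InvoiceInfo(BaseModel):
    invoice_number: str = Field(description="The invoice ID, e.g. 'INV-2024-0042'")
    total_amount: float = Field(description="Grand total in the invoice currency, numbers only")
    due_date: str = Field(description="Payment due date in ISO format, e.g. '2024-06-30'")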
Rule Sequencing
Rules are processed in sequence, so order matters:
- Extract First, Transform Later: Generally, extract metadata before transforming content
- Chunking Awareness: For very large documents, remember that rules are applied to each chunk separately
- Rule Complexity: Split complex operations into multiple simpler rules (see the sketch after this list)
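For example, rather than one oversized prompt that extracts, redacts, and summarizes in a single pass, chain three focused rules. The ContractInfo schema and the prompts below are illustrative:

from pydantic import BaseModel
from databridge.rules import MetadataExtractionRule, NaturalLanguageRule

class ContractInfo(BaseModel):  # hypothetical schema for illustration
    parties: list[str]
    effective_date: str

# Three focused rules instead of one complex prompt:
rules = [
    MetadataExtractionRule(schema=ContractInfo),  # extract first
    NaturalLanguageRule(prompt="Redact all party names; replace with [PARTY]."),
    NaturalLanguageRule(prompt="Summarize each section in one sentence, keeping headings."),
]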
LLM Selection
Different tasks may benefit from different LLMs:
- Metadata Extraction: Benefits from models with strong JSON capabilities (like GPT-4)
- Simple Redaction: Works well with smaller, faster models; for redaction you may not want to send data across the network, so local models can be more suitable (see the config sketch after this list)
- Complex Transformation: May require more sophisticated models
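For the local-redaction scenario, the same databridge.toml section shown earlier can simply point at Ollama. The model name and batch size below are examples, not recommendations; use whichever model you have pulled locally:

[rules]
provider = "ollama"   # keep sensitive content on your own hardware
model = "llama3"      # example; substitute any locally available model
batch_size = 2048     # smaller batches can help smaller local models stay reliable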
Conclusion
DataBridge's rules-based ingestion provides a powerful, flexible system for extracting structured data from unstructured documents and transforming content during ingestion. This capability applies to numerous use cases, from resume processing to medical record management to legal document analysis.
The system's architecture balances flexibility, performance, and ease of use:
- Client-Side Simplicity: Define rules using simple schemas and natural language
- Server-Side Power: Leverages LLMs to handle the complex extraction and transformation
- Configurable: Adapt to different deployment scenarios and performance requirements
By separating the rules definition from the ingestion code, you can easily adapt your document processing pipeline to different document types and requirements without changing your application code.