DEV Community

Alain Airom

Hands-on and Testing of Google “langextract” and further thoughts…

A hands-on test of Google’s LangExtract and some further thoughts.

Image from the LangExtract GitHub site

Introduction

LangExtract is an open-source Python library developed by Google that leverages Large Language Models (LLMs) to transform unstructured text into structured, auditable data. It is particularly useful for information extraction, entity recognition, and content structuring from documents like clinical notes, legal contracts, or research papers. The library is designed to be flexible, allowing developers to define custom extraction tasks using natural language prompts and a few high-quality examples, without the need for extensive fine-tuning of the underlying model. A key feature is its “source grounding,” which links every extracted piece of information directly back to its original location in the source text, ensuring transparency and reliability. LangExtract supports various LLMs, including Google’s Gemini family, and can handle long documents by chunking and parallelizing the extraction process.

After installing and setting up the library with a document I created — specifically, a text filled with a variety of relations generated by an LLM — I was excited to see how it would perform. The goal was to test its ability to not just identify entities, but also the complex relationships between them.

The test results were promising. The library effectively extracted the predefined entities and their attributes, and more importantly, it was able to capture the relationships between them, demonstrating its capacity to handle a document with complex interconnections. The interactive HTML visualization feature was also a game-changer, making it incredibly easy to review the extractions and verify their accuracy against the original text.

This leads to a fascinating possibility: integrating LangExtract with Docling. Currently, a user must manually convert a file to text before feeding it to LangExtract, which can result in the loss of crucial document layout information. However, Docling is a library that can parse various document formats (like PDFs, DOCX, and PPTX) into a rich, unified representation that preserves the original layout, including bounding boxes and page numbers. By using Docling as a front-end, one could potentially create a single, seamless pipeline where documents are converted and enriched with layout data before being processed by LangExtract. This would allow the extracted information to be mapped back to its precise location in the original document, not just in the text but also visually, which could be invaluable for auditing, document analysis, and building more advanced AI applications like Graph-RAG systems.

Tests

Image from https://developers.googleblog.com/

To begin, I ran the sample provided on the LangExtract page, and it really rocks!

First things first, prepare your environment.

python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip


pip install langextract

Obtain your API key: https://aistudio.google.com/apikey

I created a .env file to store the API key.

LANGEXTRACT_API_KEY="XXXXXXX"

Run the code.

import langextract as lx
import textwrap
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Retrieve the API key
# The LangExtract library automatically picks up LANGEXTRACT_API_KEY
# from the environment or .env file.
# You can explicitly set it if needed, but it's often not required if configured correctly:
# os.environ["LANGEXTRACT_API_KEY"] = os.getenv("LANGEXTRACT_API_KEY")

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]

# The input text to be processed
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

print("Starting information extraction...")
print(f"Input text: '{input_text}'")

try:
    # Run the extraction
    # Ensure your LANGEXTRACT_API_KEY is set in your environment or .env file
    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash", # Gemini model used for the extraction
    )

    print("\nExtraction complete! Results:")
    print(result)

    # Save the results to a JSONL file
    output_jsonl_filename = "extraction_results.jsonl"
    lx.io.save_annotated_documents([result], output_name=output_jsonl_filename, output_dir=".")
    print(f"\nResults saved to '{output_jsonl_filename}'")

    # Generate the visualization from the file
    output_html_filename = "visualization.html"
    html_content = lx.visualize(output_jsonl_filename)
    # lx.visualize may return an HTML display object; convert to str before writing
    with open(output_html_filename, "w") as f:
        f.write(str(html_content))
    print(f"Interactive visualization saved to '{output_html_filename}'")

except Exception as e:
    print(f"\nAn error occurred during extraction: {e}")
    print("Please ensure your Gemini API key is correctly set in the .env file or as an environment variable.")
    print("Also, verify your internet connection and that the 'gemini-2.5-flash' model is accessible.")


The output is a game-changer. You can interact with it and see how the extracted relations link together, making it incredibly easy to visualize and understand the data.

For the second test, I used an LLM to generate a text containing relationships and ran the code against it. Here is the text I generated:

"The city of Paris, located in France, is renowned for its iconic Eiffel Tower. It is a popular tourist destination. The tower was designed by Gustave Eiffel. Marie Curie, a famous scientist, was born in Paris and made significant contributions to the field of radioactivity. She worked at the Radium Institute. The Seine River flows through Paris."
Explanation of why this is suitable:
This document contains several entities and relationships that can be easily extracted and represented in a knowledge graph:
• Entities: Paris, France, Eiffel Tower, Gustave Eiffel, Marie Curie, Radium Institute, Seine River
• Relationships:
o Paris is located in France.
o Paris is renowned for the Eiffel Tower.
o The Eiffel Tower was designed by Gustave Eiffel.
o Marie Curie was born in Paris.
o Marie Curie was a scientist.
o Marie Curie made contributions to radioactivity.
o Marie Curie worked at the Radium Institute.
o The Seine River flows through Paris.
A knowledge graph constructed from this document would represent these entities as nodes and the relationships as edges, providing a structured representation of the information.
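To make the node-and-edge idea concrete, here is a minimal, dependency-free Python sketch. The (subject, predicate, object) triples are my own hand transcription of the relationships listed above, not LangExtract output; in a real pipeline they would be derived from the extraction results.

```python
# Hand-transcribed (subject, predicate, object) triples from the sample text.
triples = [
    ("Paris", "located_in", "France"),
    ("Paris", "renowned_for", "Eiffel Tower"),
    ("Eiffel Tower", "designed_by", "Gustave Eiffel"),
    ("Marie Curie", "born_in", "Paris"),
    ("Marie Curie", "is_a", "scientist"),
    ("Marie Curie", "contributed_to", "radioactivity"),
    ("Marie Curie", "worked_at", "Radium Institute"),
    ("Seine River", "flows_through", "Paris"),
]

# Entities become nodes; each relationship becomes a labeled edge.
nodes = sorted({n for s, _, o in triples for n in (s, o)})
edges = {(s, o): p for s, p, o in triples}

print(f"{len(nodes)} nodes, {len(edges)} edges")  # -> 9 nodes, 8 edges
for (s, o), p in edges.items():
    if s == "Marie Curie":
        print(f"Marie Curie --{p}--> {o}")
```

With networkx installed (it appears in the Docling setup later in this post), the same triples could be loaded via `DiGraph.add_edge(s, o, label=p)` to get plotting and graph algorithms for free.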

The slightly modified code to run against the file:

import langextract as lx
import textwrap
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Retrieve the API key
# The LangExtract library automatically picks up LANGEXTRACT_API_KEY
# from the environment or .env file.
# You can explicitly set it if needed, but it's often not required if configured correctly:
# os.environ["LANGEXTRACT_API_KEY"] = os.getenv("LANGEXTRACT_API_KEY")

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]

# Define input and output directories
input_folder = "input"
output_folder = "output"

print("Starting information extraction process...")

try:
    # Create the output directory if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)
    print(f"Ensured output directory '{output_folder}' exists.")

    # Check if the input directory exists
    if not os.path.exists(input_folder):
        print(f"Error: Input directory '{input_folder}' not found.")
        print("Please create an 'input' folder and place your text files inside it.")
        exit()

    # Get a list of all files in the input folder
    input_files = [f for f in os.listdir(input_folder) if os.path.isfile(os.path.join(input_folder, f))]

    if not input_files:
        print(f"No files found in the input directory '{input_folder}'. Please add some text files.")
        exit()

    for input_filename in input_files:
        base_name, file_extension = os.path.splitext(input_filename)
        input_file_path = os.path.join(input_folder, input_filename)

        print(f"\nProcessing file: '{input_filename}'")

        try:
            with open(input_file_path, 'r', encoding='utf-8') as f:
                input_text = f.read()
            print(f"Successfully read input text from '{input_filename}'.")

            # Run the extraction
            # Ensure your LANGEXTRACT_API_KEY is set in your environment or .env file
            result = lx.extract(
                text_or_documents=input_text,
                prompt_description=prompt,
                examples=examples,
                model_id="gemini-2.5-flash", # Gemini model used for the extraction
            )

            print("Extraction complete for this file. Results:")
            # print(result) # Uncomment if you want to see the full result object in console

            # Define output filenames based on input document name
            output_jsonl_filename = f"{base_name}_extraction_results.jsonl"
            output_html_filename = f"{base_name}_visualization.html"

            jsonl_path = os.path.join(output_folder, output_jsonl_filename)
            html_path = os.path.join(output_folder, output_html_filename)

            # Save the results to a JSONL file within the output folder
            # Note: save_annotated_documents expects output_name to be just the filename,
            # and output_dir to be the folder path.
            lx.io.save_annotated_documents([result], output_name=output_jsonl_filename, output_dir=output_folder)
            print(f"Results saved to '{jsonl_path}'")

            # Generate the visualization from the file
            # lx.visualize expects the path to the JSONL file it should visualize
            html_object = lx.visualize(jsonl_path)
            html_content_string = html_object.data

            with open(html_path, "w", encoding='utf-8') as f:
                f.write(html_content_string)
            print(f"Interactive visualization saved to '{html_path}'")

        except UnicodeDecodeError:
            print(f"Skipping '{input_filename}': Cannot decode file. Please ensure it's a plain text file (e.g., UTF-8).")
        except Exception as e:
            print(f"An error occurred while processing '{input_filename}': {e}")
            print("Please ensure your Gemini API key is correctly set and the model is accessible.")

except Exception as e:
    print(f"\nAn unexpected error occurred: {e}")
    print("Please check your setup and try again.")

And the output 👍

When you run the application, it produces two key output files for you to explore: a .jsonl file and an .html file. The .jsonl file contains the extracted data in a structured, line-delimited JSON format, making it easy to parse and integrate into other systems. The .html file, on the other hand, provides a powerful and interactive visualization of the extracted information. This makes it incredibly easy to review the data, see how relationships are connected, and verify the accuracy of the extractions against the original text.

{"extractions": [{"extraction_class": "character", "extraction_text": "null", "char_interval": null, "alignment_status": null, "extraction_index": 1, "group_index": 0, "description": null, "attributes": null}, {"extraction_class": "emotion", "extraction_text": "null", "char_interval": null, "alignment_status": null, "extraction_index": 2, "group_index": 0, "description": null, "attributes": null}, {"extraction_class": "relationship", "extraction_text": "these entities as nodes", "char_interval": {"start_pos": 1017, "end_pos": 1040}, "alignment_status": "match_exact", "extraction_index": 3, "group_index": 0, "description": null, "attributes": {"type": "representation"}}, {"extraction_class": "character", "extraction_text": "null", "char_interval": null, "alignment_status": null, "extraction_index": 4, "group_index": 1, "description": null, "attributes": null}, {"extraction_class": "emotion", "extraction_text": "null", "char_interval": null, "alignment_status": null, "extraction_index": 5, "group_index": 1, "description": null, "attributes": null}, {"extraction_class": "relationship", "extraction_text": "the relationships as edges", "char_interval": {"start_pos": 1045, "end_pos": 1071}, "alignment_status": "match_exact", "extraction_index": 6, "group_index": 1, "description": null, "attributes": {"type": "representation"}}], "text": "\"The city of Paris, located in France, is renowned for its iconic Eiffel Tower. It is a popular tourist destination. The tower was designed by Gustave Eiffel. Marie Curie, a famous scientist, was born in Paris and made significant contributions to the field of radioactivity. She worked at the Radium Institute. The Seine River flows through Paris.\"\nExplanation of why this is suitable:\nThis document contains several entities and relationships that can be easily extracted and represented in a knowledge graph:\n• Entities: Paris, France, Eiffel Tower, Gustave Eiffel, Marie Curie, Radium Institute, Seine River\n• Relationships:\no Paris is located in France.\no Paris is renowned for the Eiffel Tower.\no The Eiffel Tower was designed by Gustave Eiffel.\no Marie Curie was born in Paris.\no Marie Curie was a scientist.\no Marie Curie made contributions to radioactivity.\no Marie Curie worked at the Radium Institute.\no The Seine River flows through Paris.\nA knowledge graph constructed from this document would represent these entities as nodes and the relationships as edges, providing a structured representation of the information.", "document_id": "doc_6541634f"}
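Because each line of the .jsonl file is a standalone JSON document, the standard library is enough to post-process it. The sketch below is illustrative only: the record is a trimmed, hand-made stand-in that mirrors the field names in the output above (the offsets and text are mine, not real LangExtract output). It parses one record and uses char_interval to verify source grounding, i.e. that the extracted span really sits at the reported offsets in the source text.

```python
import json

# A hand-made record in the shape LangExtract writes, trimmed to the relevant fields.
record_line = json.dumps({
    "extractions": [
        {"extraction_class": "relationship",
         "extraction_text": "these entities as nodes",
         "char_interval": {"start_pos": 65, "end_pos": 88},
         "alignment_status": "match_exact"},
    ],
    "text": "A knowledge graph constructed from this document would represent these entities as nodes and the relationships as edges.",
})

doc = json.loads(record_line)
for ex in doc["extractions"]:
    interval = ex.get("char_interval")
    if interval is None:  # some extractions come back unaligned (char_interval: null)
        continue
    # Source grounding: slicing the original text at the reported offsets
    # must reproduce the extracted span exactly.
    span = doc["text"][interval["start_pos"]:interval["end_pos"]]
    print(ex["extraction_class"], "->", span, span == ex["extraction_text"])
```

The same loop, pointed at the real extraction_results.jsonl (one `json.loads` per line), is a cheap sanity check before feeding the extractions into a downstream system such as a knowledge graph.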
<style>
.lx-highlight { position: relative; border-radius:3px; padding:1px 2px;}
.lx-highlight .lx-tooltip {
  visibility: hidden;
  opacity: 0;
  transition: opacity 0.2s ease-in-out;
  background: #333;
  color: #fff;
  text-align: left;
  border-radius: 4px;
  padding: 6px 8px;
  position: absolute;
  z-index: 1000;
  bottom: 125%;
  left: 50%;
  transform: translateX(-50%);
  font-size: 12px;
  max-width: 240px;
  white-space: normal;
  box-shadow: 0 2px 6px rgba(0,0,0,0.3);
}
.lx-highlight:hover .lx-tooltip { visibility: visible; opacity:1; }
.lx-animated-wrapper { max-width: 100%; font-family: Arial, sans-serif; }
.lx-controls {
  background: #fafafa; border: 1px solid #90caf9; border-radius: 8px;
  padding: 12px; margin-bottom: 16px;
}
.lx-button-row {
  display: flex; justify-content: center; gap: 8px; margin-bottom: 12px;
}
.lx-control-btn {
  background: #4285f4; color: white; border: none; border-radius: 4px;
  padding: 8px 16px; cursor: pointer; font-size: 13px; font-weight: 500;
  transition: background-color 0.2s;
}
.lx-control-btn:hover { background: #3367d6; }
.lx-progress-container {
  margin-bottom: 8px;
}
.lx-progress-slider {
  width: 100%; margin: 0; appearance: none; height: 6px;
  background: #ddd; border-radius: 3px; outline: none;
}
.lx-progress-slider::-webkit-slider-thumb {
  appearance: none; width: 18px; height: 18px; background: #4285f4;
  border-radius: 50%; cursor: pointer;
}
.lx-progress-slider::-moz-range-thumb {
  width: 18px; height: 18px; background: #4285f4; border-radius: 50%;
  cursor: pointer; border: none;
}
.lx-status-text {
  text-align: center; font-size: 12px; color: #666; margin-top: 4px;
}
.lx-text-window {
  font-family: monospace; white-space: pre-wrap; border: 1px solid #90caf9;
  padding: 12px; max-height: 260px; overflow-y: auto; margin-bottom: 12px;
  line-height: 1.6;
}
.lx-attributes-panel {
  background: #fafafa; border: 1px solid #90caf9; border-radius: 6px;
  padding: 8px 10px; margin-top: 8px; font-size: 13px;
}
.lx-current-highlight {
  text-decoration: underline;
  text-decoration-color: #ff4444;
  text-decoration-thickness: 3px;
  font-weight: bold;
  animation: lx-pulse 1s ease-in-out;
}
@keyframes lx-pulse {
  0% { text-decoration-color: #ff4444; }
  50% { text-decoration-color: #ff0000; }
  100% { text-decoration-color: #ff4444; }
}
.lx-legend {
  font-size: 12px; margin-bottom: 8px;
  padding-bottom: 8px; border-bottom: 1px solid #e0e0e0;
}
.lx-label {
  display: inline-block;
  padding: 2px 4px;
  border-radius: 3px;
  margin-right: 4px;
  color: #000;
}
.lx-attr-key {
  font-weight: 600;
  color: #1565c0;
  letter-spacing: 0.3px;
}
.lx-attr-value {
  font-weight: 400;
  opacity: 0.85;
  letter-spacing: 0.2px;
}

/* Add optimizations with larger fonts and better readability for GIFs */
.lx-gif-optimized .lx-text-window { font-size: 16px; line-height: 1.8; }
.lx-gif-optimized .lx-attributes-panel { font-size: 15px; }
.lx-gif-optimized .lx-current-highlight { text-decoration-thickness: 4px; }
</style>
    <div class="lx-animated-wrapper lx-gif-optimized">
      <div class="lx-attributes-panel">
        <div class="lx-legend">Highlights Legend: <span class="lx-label" style="background-color:#D2E3FC;">relationship</span></div>
        <div id="attributesContainer"></div>
      </div>
      <div class="lx-text-window" id="textWindow">
        &quot;The city of Paris, located in France, is renowned for its iconic Eiffel Tower. It is a popular tourist destination. The tower was designed by Gustave Eiffel. Marie Curie, a famous scientist, was born in Paris and made significant contributions to the field of radioactivity. She worked at the Radium Institute. The Seine River flows through Paris.&quot;
Explanation of why this is suitable:
This document contains several entities and relationships that can be easily extracted and represented in a knowledge graph:
• Entities: Paris, France, Eiffel Tower, Gustave Eiffel, Marie Curie, Radium Institute, Seine River
• Relationships:
o Paris is located in France.
o Paris is renowned for the Eiffel Tower.
o The Eiffel Tower was designed by Gustave Eiffel.
o Marie Curie was born in Paris.
o Marie Curie was a scientist.
o Marie Curie made contributions to radioactivity.
o Marie Curie worked at the Radium Institute.
o The Seine River flows through Paris.
A knowledge graph constructed from this document would represent <span class="lx-highlight lx-current-highlight" data-idx="0" style="background-color:#D2E3FC;">these entities as nodes</span> and <span class="lx-highlight" data-idx="1" style="background-color:#D2E3FC;">the relationships as edges</span>, providing a structured representation of the information.
      </div>
      <div class="lx-controls">
        <div class="lx-button-row">
          <button class="lx-control-btn" onclick="playPause()">▶️ Play</button>
          <button class="lx-control-btn" onclick="prevExtraction()">⏮ Previous</button>
          <button class="lx-control-btn" onclick="nextExtraction()">⏭ Next</button>
        </div>
        <div class="lx-progress-container">
          <input type="range" id="progressSlider" class="lx-progress-slider"
                 min="0" max="1" value="0"
                 onchange="jumpToExtraction(this.value)">
        </div>
        <div class="lx-status-text">
          Entity <span id="entityInfo">1/2</span> |
          Pos <span id="posInfo">[1017-1040]</span>
        </div>
      </div>
    </div>

    <script>
      (function() {
        const extractions = [{"index": 0, "class": "relationship", "text": "these entities as nodes", "color": "#D2E3FC", "startPos": 1017, "endPos": 1040, "beforeText": "o Marie Curie worked at the Radium Institute.\no The Seine River flows through Paris.\nA knowledge graph constructed from this document would represent ", "extractionText": "these entities as nodes", "afterText": " and the relationships as edges, providing a structured representation of the information.", "attributesHtml": "<div><strong>class:</strong> relationship</div><div><strong>attributes:</strong> {<span class=\"lx-attr-key\">type</span>: <span class=\"lx-attr-value\">representation</span>}</div>"}, {"index": 1, "class": "relationship", "text": "the relationships as edges", "color": "#D2E3FC", "startPos": 1045, "endPos": 1071, "beforeText": "Radium Institute.\no The Seine River flows through Paris.\nA knowledge graph constructed from this document would represent these entities as nodes and ", "extractionText": "the relationships as edges", "afterText": ", providing a structured representation of the information.", "attributesHtml": "<div><strong>class:</strong> relationship</div><div><strong>attributes:</strong> {<span class=\"lx-attr-key\">type</span>: <span class=\"lx-attr-value\">representation</span>}</div>"}];
        let currentIndex = 0;
        let isPlaying = false;
        let animationInterval = null;
        let animationSpeed = 1.0;

        function updateDisplay() {
          const extraction = extractions[currentIndex];
          if (!extraction) return;

          document.getElementById('attributesContainer').innerHTML = extraction.attributesHtml;
          document.getElementById('entityInfo').textContent = (currentIndex + 1) + '/' + extractions.length;
          document.getElementById('posInfo').textContent = '[' + extraction.startPos + '-' + extraction.endPos + ']';
          document.getElementById('progressSlider').value = currentIndex;

          const playBtn = document.querySelector('.lx-control-btn');
          if (playBtn) playBtn.textContent = isPlaying ? '⏸ Pause' : '▶️ Play';

          const prevHighlight = document.querySelector('.lx-text-window .lx-current-highlight');
          if (prevHighlight) prevHighlight.classList.remove('lx-current-highlight');
          const currentSpan = document.querySelector('.lx-text-window span[data-idx="' + currentIndex + '"]');
          if (currentSpan) {
            currentSpan.classList.add('lx-current-highlight');
            currentSpan.scrollIntoView({block: 'center', behavior: 'smooth'});
          }
        }

        function nextExtraction() {
          currentIndex = (currentIndex + 1) % extractions.length;
          updateDisplay();
        }

        function prevExtraction() {
          currentIndex = (currentIndex - 1 + extractions.length) % extractions.length;
          updateDisplay();
        }

        function jumpToExtraction(index) {
          currentIndex = parseInt(index);
          updateDisplay();
        }

        function playPause() {
          if (isPlaying) {
            clearInterval(animationInterval);
            isPlaying = false;
          } else {
            animationInterval = setInterval(nextExtraction, animationSpeed * 1000);
            isPlaying = true;
          }
          updateDisplay();
        }

        window.playPause = playPause;
        window.nextExtraction = nextExtraction;
        window.prevExtraction = prevExtraction;
        window.jumpToExtraction = jumpToExtraction;

        updateDisplay();
      })();
    </script>

What if we could combine Docling and LangExtract?

As I mentioned at the start, my initial tests with LangExtract have led me to believe there’s a strong potential for integrating it with Docling to prepare documents. My goal was to test this concept by downloading a free e-book and processing it with Docling first. However, during the LangExtract phase, I ran out of API calls and couldn’t complete the full test. As a result, the code I’m providing here is conceptual and untested. It’s meant to illustrate the idea of how these two libraries could work together, not to serve as a fully functional, proven solution.

  • Set up the Docling requirements:
pip install 'docling[all]'
pip install spacy
pip install networkx
pip install matplotlib
pip install nlp
  • The code:
import langextract as lx
import textwrap
import os
import json
import logging
import time
from pathlib import Path
from dotenv import load_dotenv

# Import docling components
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
_log = logging.getLogger(__name__)

# Load environment variables from .env file (for LANGEXTRACT_API_KEY)
load_dotenv()

# --- LangExtract Configuration ---
# 1. Define the prompt and extraction rules for LangExtract
langextract_prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide LangExtract model
langextract_examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]

# --- Docling Configuration ---
# Using the "Docling Parse with EasyOCR" configuration from your example
docling_pipeline_options = PdfPipelineOptions()
docling_pipeline_options.do_ocr = True
docling_pipeline_options.do_table_structure = True
docling_pipeline_options.table_structure_options.do_cell_matching = True
docling_pipeline_options.ocr_options.lang = ["en"] # Changed to 'en' for general text, you can change to 'es' if your PDFs are in Spanish
docling_pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.AUTO
)

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=docling_pipeline_options)
    }
)

# --- Main Processing Logic ---
def process_document(input_file_path: Path, relative_path: Path, output_base_dir: Path):
    """
    Processes a single document using Docling and then LangExtract.
    Saves results to the specified output directory.
    """
    _log.info(f"Processing document: {input_file_path}")

    # Create the corresponding output subdirectory
    output_sub_dir = output_base_dir / relative_path.parent
    os.makedirs(output_sub_dir, exist_ok=True)

    base_name = input_file_path.stem # Filename without extension

    try:
        # --- Step 1: Convert PDF to Text using Docling ---
        _log.info(f"Converting '{input_file_path.name}' to text using Docling...")
        docling_start_time = time.time()
        conv_result = doc_converter.convert(input_file_path)
        docling_end_time = time.time() - docling_start_time
        _log.info(f"Docling conversion completed in {docling_end_time:.2f} seconds for '{input_file_path.name}'.")

        # Get the extracted text from Docling result
        docling_extracted_text = conv_result.document.export_to_text()
        if not docling_extracted_text.strip():
            _log.warning(f"Docling extracted empty text from '{input_file_path.name}'. Skipping LangExtract.")
            return # Skip LangExtract if no text was extracted

        # --- Step 2: Run LangExtract on the extracted text ---
        _log.info(f"Running LangExtract on text from '{input_file_path.name}'...")
        langextract_start_time = time.time()
        langextract_result = lx.extract(
            text_or_documents=docling_extracted_text,
            prompt_description=langextract_prompt,
            examples=langextract_examples,
            model_id="gemini-2.5-flash", # Ensure LANGEXTRACT_API_KEY is set in .env
        )
        langextract_end_time = time.time() - langextract_start_time
        _log.info(f"LangExtract completed in {langextract_end_time:.2f} seconds for '{input_file_path.name}'.")

        # --- Step 3: Save LangExtract Results ---
        # Define output filenames based on input document name
        output_jsonl_filename = f"{base_name}_extraction_results.jsonl"
        output_html_filename = f"{base_name}_visualization.html"

        jsonl_path = output_sub_dir / output_jsonl_filename
        html_path = output_sub_dir / output_html_filename

        # Save LangExtract results to JSONL
        # save_annotated_documents expects output_name as filename and output_dir as path
        lx.io.save_annotated_documents([langextract_result], output_name=output_jsonl_filename, output_dir=str(output_sub_dir))
        _log.info(f"LangExtract results saved to '{jsonl_path}'")

        # Generate and save LangExtract HTML visualization
        html_object = lx.visualize(str(jsonl_path)) # Pass the full path to the JSONL file
        html_content_string = html_object.data

        with open(html_path, "w", encoding='utf-8') as f:
            f.write(html_content_string)
        _log.info(f"LangExtract visualization saved to '{html_path}'")

    except Exception as e:
        _log.error(f"Error processing '{input_file_path.name}': {e}", exc_info=True)
        _log.error("Please ensure your Gemini API key is correctly set for LangExtract and Docling dependencies are met.")


def main():
    input_root_folder = Path("input")
    output_root_folder = Path("output")

    _log.info(f"Starting document processing from '{input_root_folder}' to '{output_root_folder}'...")

    # Create the root output directory if it doesn't exist
    os.makedirs(output_root_folder, exist_ok=True)
    _log.info(f"Ensured root output directory '{output_root_folder}' exists.")

    if not input_root_folder.exists():
        _log.error(f"Input directory '{input_root_folder}' not found. Please create it and place documents inside.")
        return

    found_documents = False
    for root, dirs, files in os.walk(input_root_folder):
        current_input_dir = Path(root)
        relative_path_from_input_root = current_input_dir.relative_to(input_root_folder)

        for file_name in files:
            if file_name.lower().endswith(".pdf"): # Process only PDF files
                found_documents = True
                input_file_path = current_input_dir / file_name
                process_document(input_file_path, relative_path_from_input_root, output_root_folder)
            else:
                _log.info(f"Skipping non-PDF file: {file_name}")

    if not found_documents:
        _log.warning(f"No PDF documents found in '{input_root_folder}' or its subdirectories.")
    _log.info("Document processing finished.")

if __name__ == "__main__":
    main()


Conclusion and further thoughts

These explorations into Google’s LangExtract have been really helpful. This powerful library transforms unstructured text into structured, auditable data, with the added benefit of source grounding. My initial tests, using an LLM-generated document, confirmed its ability to extract complex relationships, not just isolated entities. The interactive HTML output was a game-changer, making it easy to visualize and verify the data. This success naturally led to the exciting — though currently untested — concept of a LangExtract-Docling pipeline. By using Docling to prepare documents and preserve their rich layout, we could take LangExtract’s capabilities a step further, mapping extracted data back to its precise location within the original document. While my final test fell short due to API limits, the conceptual framework remains solid. The potential for building more advanced document analysis tools and Graph-RAG systems with this approach is immense. This is a space worth watching, and I encourage you to experiment with these tools yourself.
