LangExtract: A Structured Data Extraction Library that Adds Character Positions to Extraction Results

Originally published on 2026-01-18
Original article (Japanese): LangExtract: 抽出結果に文字位置を付ける構造化データ抽出ライブラリ

I want to extract "factors affecting sales" from the securities reports of 100 companies. I want to find "termination clauses" from 1,000 contracts. Such information extraction from large documents has traditionally been done manually.

With the advent of LLMs, "structured data extraction" has become possible, but a new problem has arisen: "Is this information really written in that document?"

LangExtract is an open-source Python library from Google that solves this problem. By attaching character positions (CharInterval) to the extraction results, it clarifies "where the information was extracted from."

Why Extract Information from Large Documents

Example 1: Financial Analysts' Work

Analysts at securities firms read the securities reports of over 100 companies to create industry reports.

Traditional Method:

1. Open the PDF
2. Search for "sales" using Ctrl+F
3. Manually copy and paste the relevant sections into Excel
4. Repeat for 100 companies (takes several days)

Desired Output:

| Company Name | Segment       | Factor          | Amount   | Source Page |
|--------------|---------------|------------------|----------|-------------|
| Company A    | Digital Ads   | New customer acquisition | +30 billion yen | p.23       |
| Company B    | Cloud         | Contract reduction | -5 billion yen  | p.15       |

Can this task be automated with LLMs?

Example 2: Contract Management in Legal Departments

Legal departments in companies manage hundreds of contracts.

Common Challenges:

  • "Which contracts require a 30-day notice?"
  • "I want to identify all contracts with automatic renewal clauses."
  • "I want to list the amounts for termination penalties."

This also takes a tremendous amount of time when done manually.

Structured Data Extraction with LLMs

Traditional Libraries: Instructor

Instructor and LangChain are popular libraries that convert LLM outputs into structured data.

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class Company(BaseModel):
    name: str
    location: str

# Extract structured data from unstructured text. The text reads:
# "CyberAgent, Inc. is headquartered in Shibuya, Tokyo."
text = "株式会社サイバーエージェントは東京都渋谷区に本社を置く。"
result = client.chat.completions.create(
    model="gpt-4",
    response_model=Company,
    messages=[{"role": "user", "content": f"Extract company info: {text}"}]
)

print(result)
# Company(name='株式会社サイバーエージェント', location='東京都渋谷区')

Convenient, but there is a problem.

Problem: "Is this really written in that document?"

Output from Instructor:

Company(name='株式会社サイバーエージェント', location='東京都渋谷区')

Where did this information come from?

  • Which character range in the original text does it come from?
  • Was it really written there? Did the LLM guess?
  • What if an audit asks, "Show the evidence"?

None of these questions can be answered from the output alone.
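
To see why this matters, consider the only verification available without source pointers: a post-hoc substring search over the original text. A minimal sketch (plain Python, not part of Instructor):

# Post-hoc verification without source pointers: a fragile substring search.
text = "株式会社サイバーエージェントは東京都渋谷区に本社を置く。"
extracted_location = "東京都渋谷区"

pos = text.find(extracted_location)
if pos == -1:
    # The LLM may have paraphrased, normalized, or invented the value.
    print("Not found verbatim -- cannot prove it came from the document.")
else:
    # Even on a hit, a repeated string leaves the true source ambiguous.
    print(f"Found at characters {pos}-{pos + len(extracted_location)}")

This breaks as soon as the model paraphrases the source wording, and when the same string appears more than once, it cannot tell you which occurrence was actually used.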

The Need for Source Pointers

Example in Healthcare

When extracting medication information from medical records:

# Extracted with Instructor ("アスピリン" = aspirin)
result = {"drug": "アスピリン", "dosage": "100mg"}

Doctor's Questions:

  • "Was 100mg really written in the medical record?"
  • "Is the LLM just guessing?"
  • "Who takes responsibility if there's a medical error?"

What is needed:

result = {
    "drug": "アスピリン",
    "dosage": "100mg",
    "source_position": "カルテ123-145文字目",  # characters 123-145 of the chart
    "original_text": "アスピリン100mgを処方"   # "prescribe aspirin 100mg"
}

Example in Legal Documents

When extracting termination clauses from contracts:

# Extracted with Instructor ("30日前" = 30 days in advance)
result = {"notice_period": "30日前"}

Lawyer's Questions:

  • "Was 30 days really written in the contract?"
  • "Can I present the original text if it goes to litigation?"
  • "Which page and which clause?"

What is needed:

result = {
    "notice_period": "30日前",
    "source_position": "p.5, 1234-1256文字目",  # p.5, characters 1234-1256
    "original_text": "解約の場合は30日前までに書面で通知すること"  # "give written notice at least 30 days before termination"
}

The Emergence of LangExtract

LangExtract is an open-source library released by Google.

Key Features:

  • Includes character positions (CharInterval) in extraction results
  • Clearly indicates "where the information was extracted from"
  • Usable in fields like healthcare, law, and auditing where "evidence presentation" is essential

Basic Usage

Installation

pip install langextract

Code Example

import os
import langextract as lx
from langextract import data

# Text to extract from. It reads: "CyberAgent, Inc. is an internet
# advertising agency headquartered in Shibuya, Tokyo. Its president is
# Susumu Fujita, and it was founded in 1998. Its main businesses include
# AbemaTV, Ameba, and advertising."
text = """株式会社サイバーエージェントは、東京都渋谷区に本社を置くインターネット広告代理店です。
代表取締役社長は藤田晋氏で、1998年に設立されました。
主な事業はAbemaTVやAmeba、広告事業などです。"""

# Define the prompt. It says: "Extract company names, person names, and
# places from the text. Use the exact text; do not paraphrase."
prompt = """
テキストから企業名、人名、場所を抽出してください。
正確なテキストを使用し、言い換えないでください。
"""

# Few-shot example. The text reads: "Toyota Motor's president Akio Toyoda
# held a press conference at the head office in Toyota City, Aichi."
# Classes: 企業名 = company name, 人名 = person name, 場所 = place.
examples = [
    data.ExampleData(
        text="トヨタ自動車の豊田章男社長は、愛知県豊田市の本社で記者会見を行いました。",
        extractions=[
            data.Extraction(
                extraction_class="企業名",
                extraction_text="トヨタ自動車",
            ),
            data.Extraction(
                extraction_class="人名",
                extraction_text="豊田章男",
            ),
            data.Extraction(
                extraction_class="場所",
                extraction_text="愛知県豊田市",
            ),
        ]
    )
]

# Execute extraction
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    api_key=os.environ.get('GOOGLE_API_KEY'),
    extraction_passes=1,
)

# Display results (with source pointers)
for extraction in result.extractions:
    print(f"Class: {extraction.extraction_class}")
    print(f"Text: {extraction.extraction_text}")
    if extraction.char_interval:
        start = extraction.char_interval.start_pos
        end = extraction.char_interval.end_pos
        print(f"Position: {start}-{end} characters")
        print(f"Verification: '{text[start:end]}'")
        print("---")

Execution Result

Class: 企業名
Text: 株式会社サイバーエージェント
Position: 0-14 characters
Verification: '株式会社サイバーエージェント'
---
Class: 場所
Text: 東京都渋谷区
Position: 16-22 characters
Verification: '東京都渋谷区'
---
Class: 人名
Text: 藤田晋
Position: 56-59 characters
Verification: '藤田晋'
---

Key Points:

  • char_interval.start_pos and char_interval.end_pos identify the positions in the original text
  • text[start:end] allows for immediate verification of the original text
  • Can promptly answer "What is the basis for this information?" during audits, as the sketch below shows
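
Because the offsets make every extraction mechanically checkable, the audit step can itself be automated. A minimal sketch, reusing result and text from the example above (a mismatch is not necessarily an error, but it is a flag worth reviewing):

# Flag any extraction whose recorded span does not reproduce the
# extracted text verbatim.
def audit_extractions(result, source_text):
    for ex in result.extractions:
        if ex.char_interval is None:
            print(f"UNGROUNDED: {ex.extraction_text!r} has no source position")
            continue
        span = source_text[ex.char_interval.start_pos:ex.char_interval.end_pos]
        status = "OK" if span == ex.extraction_text else "MISMATCH"
        print(f"{status}: {ex.extraction_text!r} <- source: {span!r}")

audit_extractions(result, text)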

Practical Example: Analyzing Securities Reports

Use Case

Extract "factors affecting sales" from the securities reports of 100 companies and save them in a database for analysis.

Code Example

import os
import langextract as lx
from langextract import data
import psycopg2
from datetime import datetime

# Text of the securities report. The excerpt reads: "Analysis of operating
# results: net sales were 120 billion yen (+15% YoY). Main factors:
# (1) digital ads +30 billion yen from new customer acquisition, (2) media
# business -10 billion yen from declining membership, (3) forex +5 billion
# yen from the weaker yen."
yuho_text = """
【経営成績の分析】
当期の売上高は1,200億円(前期比15%増)となりました。
増収の主な要因は以下の通りです。

(1) デジタル広告事業
新規顧客の獲得により300億円の増収となりました。

(2) メディア事業
既存サービスの会員数減少により100億円の減収となりました。

(3) 為替影響
円安進行により50億円の増収効果がありました。
"""

# Define the prompt. It says: "Extract sales variance factors from the
# securities report. Include: business segment name, reason for the change,
# and amount (marking increase or decrease). Use exact text; do not paraphrase."
prompt = """
有価証券報告書から売上変動要因を抽出してください。
以下の情報を含めること:
- 事業セグメント名
- 変動理由
- 金額(増減を明示)

正確なテキストを使用し、言い換えないでください。
"""

# Few-shot examples. extraction_class "売上変動要因" = "sales variance factor";
# the example text reads: "The cloud business posted a 20 billion yen
# revenue increase due to growth in new contracts."
examples = [
    data.ExampleData(
        text="クラウド事業は新規契約の増加により200億円の増収となりました。",
        extractions=[
            data.Extraction(
                extraction_class="売上変動要因",
                extraction_text="クラウド事業は新規契約の増加により200億円の増収となりました",
            ),
        ]
    )
]

# Execute extraction
result = lx.extract(
    text_or_documents=yuho_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    api_key=os.environ.get('GOOGLE_API_KEY'),
    extraction_passes=2,  # Two passes for long text
)

# Save to PostgreSQL (also record source pointers)
conn = psycopg2.connect("dbname=analytics user=analyst")
cur = conn.cursor()

for extraction in result.extractions:
    cur.execute("""
        INSERT INTO sales_factors 
        (company_code, fiscal_year, extraction_text, 
         source_start, source_end, document_id, extracted_at)
        VALUES (%s, %s, %s, %s, %s, %s, %s)
    """, (
        "4751",  # CyberAgent
        2024,
        extraction.extraction_text,
        extraction.char_interval.start_pos if extraction.char_interval else None,
        extraction.char_interval.end_pos if extraction.char_interval else None,
        "yuho_4751_2024.pdf",
        datetime.now()
    ))

conn.commit()

Analyzing in the Database

-- Extract companies affected by foreign exchange in 2024
SELECT company_code, extraction_text, source_start, source_end
FROM sales_factors
WHERE fiscal_year = 2024
  AND extraction_text LIKE '%為替%'
ORDER BY company_code;

-- If you want to verify the original text
SELECT document_id, source_start, source_end, extraction_text
FROM sales_factors
WHERE id = 123;
-- → "Please refer to characters 123-156 of yuho_4751_2024.pdf"
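
With the offsets stored, "show the evidence" becomes a string slice rather than a search. A sketch of the retrieval side, reusing the psycopg2 cursor from the save step; load_document_text() is a hypothetical helper that returns the full text of the stored filing:

# Fetch a stored extraction and present its evidence span.
cur.execute("""
    SELECT document_id, source_start, source_end, extraction_text
    FROM sales_factors WHERE id = %s
""", (123,))
document_id, start, end, extracted = cur.fetchone()

source_text = load_document_text(document_id)  # hypothetical loader
print(f"Claim:    {extracted!r}")
print(f"Evidence: {source_text[start:end]!r} ({document_id}, chars {start}-{end})")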

Differences from Traditional Libraries:

| Item | LangExtract | Instructor |
|------|-------------|------------|
| DB storage | ✅ Saves original-text positions | ⚠️ Extraction results only |
| Audit compliance | ✅ Original text presented immediately | ❌ Must be searched manually |
| Traceability | ✅ Complete | ❌ None |

Visualization Features

LangExtract can visualize extraction results in an interactive HTML format.

# Save as JSONL
lx.io.save_annotated_documents([result], output_name="result.jsonl", output_dir=".")

# HTML visualization
html = lx.visualize("result.jsonl")
with open("visualization.html", "w", encoding="utf-8") as f:
    f.write(html if isinstance(html, str) else html.data)

When you open the generated HTML in a browser, the extracted sections are highlighted on the original text.

Example:

株式会社サイバーエージェントは、東京都渋谷区に本社を置く...
^^^^^^^^^^^^^^^^^^^^^^^          ^^^^^^
[Company Name]                   [Location]

This allows for a clear distinction between "information guessed by the LLM" and "information that was actually written in the document."

Differences from Other Libraries

Differentiating from Instructor

| Purpose | Recommended Library |
|---------|---------------------|
| Simple structured output (e.g., parsing API responses) | Instructor |
| Clear extraction basis required (healthcare, law, auditing) | LangExtract |

Differences from LangStruct

LangStruct also provides Source Grounding, but the optimization methods differ:

  • LangExtract: Manually prepares few-shot examples (focus on control)
  • LangStruct: Automatically optimizes with DSPy (focus on efficiency)

When switching models, LangExtract requires readjusting examples, while LangStruct automatically re-optimizes.

Feature Comparison Table

| Feature | LangExtract | Instructor | LangChain |
|---------|-------------|------------|-----------|
| Source grounding | ✅ Down to character positions | ❌ None | ❌ None |
| Visualization | ✅ Built-in HTML | ❌ None | ❌ None |
| Long-document support | ✅ Chunking + parallelism | ❌ Basic support | ⚠️ Needs custom implementation |
| Japanese support | ✅ UAX #29 | ⚠️ Basic | ⚠️ Basic |
| Batch API | ✅ Vertex AI | ❌ None | ❌ None |

Support for Long Documents

LangExtract addresses the "needle in a haystack" problem: the tendency to miss information buried in the middle of long documents.

Chunking Strategy

result = lx.extract(
    text_or_documents=long_document,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,      # Improve accuracy with multiple passes
    max_workers=10,           # Number of parallel processes
    max_char_buffer=2000,     # Chunk size
)

Parameter Meanings:

  • extraction_passes: 1 (fast) to 3 (high accuracy)
  • max_workers: Degree of parallelism (watch for API limits)
  • max_char_buffer: Smaller improves accuracy, larger increases speed

By splitting long documents into smaller chunks and extracting over multiple passes, information that might be overlooked in a single LLM call can be reliably captured.
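
These settings trade API cost for recall, and the cost side is easy to estimate. A back-of-the-envelope sketch, assuming roughly one model call per chunk per pass (an approximation of the chunking behavior, not a documented guarantee):

import math

def estimate_calls(text_len, max_char_buffer=2000, extraction_passes=3):
    # Number of chunks times number of passes ~= number of model calls.
    chunks = math.ceil(text_len / max_char_buffer)
    return chunks * extraction_passes

# A 200,000-character filing at the settings above: about 300 calls.
print(estimate_calls(200_000))  # 300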

Japanese Language Support Implementation

LangExtract includes a Unicode-compliant tokenizer (UAX #29) that supports Japanese sentence boundary detection.

In my testing, it accurately extracted company names, person names, and locations from Japanese text. Other structured-output libraries can stumble on Japanese word boundaries, but LangExtract handled them without issue.
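
One detail worth knowing: the char_interval offsets are character (code-point) positions, as the execution result above already shows, so Python's string slicing works on Japanese text directly. A quick check using the strings from the Basic Usage example:

# Python indexes strings by Unicode code point, so each Japanese character
# counts as one position -- matching the 0-14 / 16-22 offsets shown above.
s = "株式会社サイバーエージェントは、東京都渋谷区に本社を置く"
print(len("株式会社サイバーエージェント"))  # 14
print(s[0:14])   # 株式会社サイバーエージェント
print(s[16:22])  # 東京都渋谷区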

Setup

Obtaining an API Key

The simplest way to get started is with a Google AI Studio API key:

  1. Get API Key: https://aistudio.google.com/app/apikey
  2. Free tier: up to 15 requests per minute
  3. Set Environment Variable:
   export GOOGLE_API_KEY='your-api-key'

Does Not Work with OpenRouter

After some trial and error, I found that OpenRouter's free Gemini model does not work.

Error Message:

openai.BadRequestError: Error code: 400 - 
{'error': {'message': "Invalid parameter: 'response_format' of type 'json_object' is not supported with this model."}}

Cause:

  1. The OpenAI provider for LangExtract forces response_format: {type: "json_object"}
  2. The free Gemini model from OpenRouter does not support this parameter

If using OpenRouter, a paid model (like GPT-4 or Claude) is required.
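
If you call OpenAI directly instead of going through OpenRouter, the LangExtract README documents an OpenAI path. A sketch based on that documentation at the time of writing (parameter requirements may change between versions):

# Per the LangExtract README, OpenAI models need fenced output and
# schema constraints disabled.
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    api_key=os.environ.get("OPENAI_API_KEY"),
    fence_output=True,
    use_schema_constraints=False,
)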

Use Cases

Suitable Applications

  1. Healthcare and Legal Documents: Clear extraction basis is essential
  2. Long Document Processing: Contracts, academic papers, meeting minutes, etc.
  3. Auditing and Verification: Need to confirm the validity of extraction results
  4. Multilingual Support: Non-English texts including Japanese
  5. Database Storage: Save and analyze extraction results as structured data

Unsuitable Applications

  • Simple structured output only (Instructor is sufficient)
  • Need for automatic optimization (LangStruct is appropriate)
  • Emphasis on real-time performance (Batch API is slower)

Conclusion

LangExtract provides a feature that most structured-output libraries lack: character positions attached to LLM extraction results.

Summary of the article:

  1. Need for Structured Extraction: Want to automate information extraction from large documents
  2. Problems with Traditional Libraries: Unable to determine "where the information was extracted from"
  3. Importance of Source Pointers: Essential to present evidence in healthcare, law, and auditing
  4. LangExtract's Solution: Records character positions to ensure traceability
  5. Practical Example: Possible to save and analyze in a database

This feature makes a decisive difference in fields where "evidence is important," such as healthcare, law, and auditing. Since the quality of few-shot examples determines extraction accuracy, providing domain-specific examples allows for high-accuracy extraction without the need for fine-tuning.

Japanese support is solid, and my tests ran smoothly. Note that it does not work with OpenRouter's free model, but the Google AI Studio free tier is more than enough for testing.

If you're interested, please give it a try.
