Originally published on 2026-01-18
Original article (Japanese): LangExtract: 抽出結果に文字位置を付ける構造化データ抽出ライブラリ (a structured data extraction library that attaches character positions to extraction results)
I want to extract "factors affecting sales" from the securities reports of 100 companies. I want to find "termination clauses" from 1,000 contracts. Such information extraction from large documents has traditionally been done manually.
With the advent of LLMs, "structured data extraction" has become possible, but a new problem has arisen: "Is this information really written in that document?"
LangExtract is an open-source Python library from Google that solves this problem. By attaching character positions (CharInterval) to the extraction results, it clarifies "where the information was extracted from."
Why Extract Information from Large Documents
Example 1: Financial Analysts' Work
Analysts at securities firms read the securities reports of over 100 companies to create industry reports.
Traditional Method:
1. Open the PDF
2. Search for "sales" using Ctrl+F
3. Manually copy and paste the relevant sections into Excel
4. Repeat for 100 companies (takes several days)
Desired Output:
| Company Name | Segment | Factor | Amount | Source Page |
|--------------|---------------|------------------|----------|-------------|
| Company A | Digital Ads | New customer acquisition | +30 billion yen | p.23 |
| Company B | Cloud | Contract reduction | -5 billion yen | p.15 |
Can this task be automated with LLMs?
Example 2: Contract Management in Legal Departments
Legal departments in companies manage hundreds of contracts.
Common Challenges:
- "Which contracts require a 30-day notice?"
- "I want to identify all contracts with automatic renewal clauses."
- "I want to list the amounts for termination penalties."
This also takes a tremendous amount of time when done manually.
Structured Data Extraction with LLMs
Traditional Libraries: Instructor
Instructor and LangChain are popular libraries that convert LLM outputs into structured data.
import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class Company(BaseModel):
    name: str
    location: str

# Extract structured data from unstructured text
text = "株式会社サイバーエージェントは東京都渋谷区に本社を置く。"

result = client.chat.completions.create(
    model="gpt-4",
    response_model=Company,
    messages=[{"role": "user", "content": f"Extract company info: {text}"}],
)

print(result)
# Company(name='株式会社サイバーエージェント', location='東京都渋谷区')
Convenient, but there is a problem.
Problem: "Is this really written in that document?"
Output from Instructor:
Company(name='株式会社サイバーエージェント', location='東京都渋谷区')
Where did this information come from?
- From which character position to which character position in the original text?
- Was it really written there? Did the LLM guess?
- What if an audit asks, "Show the evidence"?
None of these questions can be answered.
The Need for Source Pointers
Example in Healthcare
When extracting medication information from medical records:
# Extracted with Instructor
result = {"drug": "アスピリン", "dosage": "100mg"}
Doctor's Questions:
- "Was 100mg really written in the medical record?"
- "Is the LLM just guessing?"
- "Who takes responsibility if there's a medical error?"
What is needed:
result = {
"drug": "アスピリン",
"dosage": "100mg",
"source_position": "カルテ123-145文字目",
"original_text": "アスピリン100mgを処方"
}
Example in Legal Documents
When extracting termination clauses from contracts:
# Extracted with Instructor
result = {"notice_period": "30日前"}
Lawyer's Questions:
- "Was 30 days really written in the contract?"
- "Can I present the original text if it goes to litigation?"
- "Which page and which clause?"
What is needed:
result = {
"notice_period": "30日前",
"source_position": "p.5, 1234-1256文字目",
"original_text": "解約の場合は30日前までに書面で通知すること"
}
The Emergence of LangExtract
LangExtract is an open-source library released by Google.
Key Features:
- Includes character positions (CharInterval) in extraction results
- Clearly indicates "where the information was extracted from"
- Usable in fields like healthcare, law, and auditing where "evidence presentation" is essential
Basic Usage
Installation
pip install langextract
Code Example
import os
import langextract as lx
from langextract import data
# Text to extract from
text = """株式会社サイバーエージェントは、東京都渋谷区に本社を置くインターネット広告代理店です。
代表取締役社長は藤田晋氏で、1998年に設立されました。
主な事業はAbemaTVやAmeba、広告事業などです。"""
# Define prompt and examples
prompt = """
テキストから企業名、人名、場所を抽出してください。
正確なテキストを使用し、言い換えないでください。
"""
examples = [
data.ExampleData(
text="トヨタ自動車の豊田章男社長は、愛知県豊田市の本社で記者会見を行いました。",
extractions=[
data.Extraction(
extraction_class="企業名",
extraction_text="トヨタ自動車",
),
data.Extraction(
extraction_class="人名",
extraction_text="豊田章男",
),
data.Extraction(
extraction_class="場所",
extraction_text="愛知県豊田市",
),
]
)
]
# Execute extraction
result = lx.extract(
text_or_documents=text,
prompt_description=prompt,
examples=examples,
model_id="gemini-2.5-flash",
api_key=os.environ.get('GOOGLE_API_KEY'),
extraction_passes=1,
)
# Display results (with source pointers)
for extraction in result.extractions:
    print(f"Class: {extraction.extraction_class}")
    print(f"Text: {extraction.extraction_text}")
    if extraction.char_interval:
        start = extraction.char_interval.start_pos
        end = extraction.char_interval.end_pos
        print(f"Position: {start}-{end} characters")
        print(f"Verification: '{text[start:end]}'")
    print("---")
Execution Result
Class: 企業名
Text: 株式会社サイバーエージェント
Position: 0-14 characters
Verification: '株式会社サイバーエージェント'
---
Class: 場所
Text: 東京都渋谷区
Position: 16-22 characters
Verification: '東京都渋谷区'
---
Class: 人名
Text: 藤田晋
Position: 56-59 characters
Verification: '藤田晋'
---
Key Points:
- char_interval.start_pos and char_interval.end_pos identify the positions in the original text
- text[start:end] allows immediate verification against the original text
- You can promptly answer "What is the basis for this information?" during audits
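Because the offsets point back into the input string, each extraction can also be checked programmatically. A minimal sketch, reusing result and text from the code above (the strict equality check is my assumption; extraction_text should normally match the source span exactly because the prompt forbids paraphrasing):

# Verify that each extraction's text really appears at the reported offsets
for extraction in result.extractions:
    if extraction.char_interval is None:
        continue
    start = extraction.char_interval.start_pos
    end = extraction.char_interval.end_pos
    ok = text[start:end] == extraction.extraction_text
    print(f"{extraction.extraction_class}: {'OK' if ok else 'MISMATCH'} ({start}-{end})")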
Practical Example: Analyzing Securities Reports
Use Case
Extract "factors affecting sales" from the securities reports of 100 companies and save them in a database for analysis.
Code Example
import os
import langextract as lx
from langextract import data
import psycopg2
from datetime import datetime
# Text of the securities report
yuho_text = """
【経営成績の分析】
当期の売上高は1,200億円(前期比15%増)となりました。
増収の主な要因は以下の通りです。
(1) デジタル広告事業
新規顧客の獲得により300億円の増収となりました。
(2) メディア事業
既存サービスの会員数減少により100億円の減収となりました。
(3) 為替影響
円安進行により50億円の増収効果がありました。
"""
# Define prompt
prompt = """
有価証券報告書から売上変動要因を抽出してください。
以下の情報を含めること:
- 事業セグメント名
- 変動理由
- 金額(増減を明示)
正確なテキストを使用し、言い換えないでください。
"""
# Few-shot examples
examples = [
data.ExampleData(
text="クラウド事業は新規契約の増加により200億円の増収となりました。",
extractions=[
data.Extraction(
extraction_class="売上変動要因",
extraction_text="クラウド事業は新規契約の増加により200億円の増収となりました",
),
]
)
]
# Execute extraction
result = lx.extract(
text_or_documents=yuho_text,
prompt_description=prompt,
examples=examples,
model_id="gemini-2.5-flash",
api_key=os.environ.get('GOOGLE_API_KEY'),
extraction_passes=2, # Two passes for long text
)
# Save to PostgreSQL (also record source pointers)
conn = psycopg2.connect("dbname=analytics user=analyst")
cur = conn.cursor()
for extraction in result.extractions:
    cur.execute("""
        INSERT INTO sales_factors
            (company_code, fiscal_year, extraction_text,
             source_start, source_end, document_id, extracted_at)
        VALUES (%s, %s, %s, %s, %s, %s, %s)
    """, (
        "4751",  # CyberAgent
        2024,
        extraction.extraction_text,
        extraction.char_interval.start_pos if extraction.char_interval else None,
        extraction.char_interval.end_pos if extraction.char_interval else None,
        "yuho_4751_2024.pdf",
        datetime.now(),
    ))

conn.commit()
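For reference, the INSERT above assumes a sales_factors table already exists. The schema below is only a sketch matching those columns (names and types are my assumptions, not something LangExtract defines):

import psycopg2

conn = psycopg2.connect("dbname=analytics user=analyst")
cur = conn.cursor()
# Hypothetical table definition matching the columns used in the INSERT above
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales_factors (
        id              SERIAL PRIMARY KEY,
        company_code    TEXT,
        fiscal_year     INTEGER,
        extraction_text TEXT,
        source_start    INTEGER,
        source_end      INTEGER,
        document_id     TEXT,
        extracted_at    TIMESTAMP
    )
""")
conn.commit()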
Analyzing in the Database
-- Extract companies affected by foreign exchange in 2024
SELECT company_code, extraction_text, source_start, source_end
FROM sales_factors
WHERE fiscal_year = 2024
AND extraction_text LIKE '%為替%'
ORDER BY company_code;
-- If you want to verify the original text
SELECT document_id, source_start, source_end, extraction_text
FROM sales_factors
WHERE id = 123;
-- → "Please refer to characters 123-156 of yuho_4751_2024.pdf"
Differences from Traditional Libraries:
| Item | LangExtract | Instructor |
|---|---|---|
| DB Storage | ✅ Can save original text positions | ⚠️ Only extraction results |
| Audit Compliance | ✅ Can immediately present original text | ❌ Needs to be searched manually |
| Traceability | ✅ Complete | ❌ None |
Visualization Features
LangExtract can visualize extraction results in an interactive HTML format.
# Save as JSONL
lx.io.save_annotated_documents([result], output_name="result.jsonl", output_dir=".")
# HTML visualization
html = lx.visualize("result.jsonl")
with open("visualization.html", "w", encoding="utf-8") as f:
    f.write(html if isinstance(html, str) else html.data)
When you open the generated HTML in a browser, the extracted sections are highlighted on the original text.
Example:
株式会社サイバーエージェントは、東京都渋谷区に本社を置く...
^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^
[Company Name] [Location]
This allows for a clear distinction between "information guessed by the LLM" and "information that was actually written in the document."
Differences from Other Libraries
Differentiating from Instructor
| Purpose | Recommended Library |
|---|---|
| Simple Structured Output (e.g., parsing API responses) | Instructor |
| Need for Clear Extraction Basis (Healthcare, Law, Auditing) | LangExtract |
Differences from LangStruct
LangStruct also provides Source Grounding, but the optimization methods differ:
- LangExtract: Manually prepares few-shot examples (focus on control)
- LangStruct: Automatically optimizes with DSPy (focus on efficiency)
When switching models, LangExtract requires readjusting examples, while LangStruct automatically re-optimizes.
Feature Comparison Table
| Feature | LangExtract | Instructor | LangChain |
|---|---|---|---|
| Source Grounding | ✅ Up to character positions | ❌ None | ❌ None |
| Visualization | ✅ Built-in HTML | ❌ None | ❌ None |
| Long Document Support | ✅ Chunking + Parallel | ⚠️ Basic support | ⚠️ Needs implementation |
| Japanese Support | ✅ UAX#29 | ⚠️ Basic | ⚠️ Basic |
| Batch API | ✅ Vertex AI | ❌ None | ❌ None |
Support for Long Documents
LangExtract addresses the "Needle-in-a-Haystack problem" (the risk of overlooking information buried in the middle of long documents).
Chunking Strategy
result = lx.extract(
text_or_documents=long_document,
prompt_description=prompt,
examples=examples,
model_id="gemini-2.5-flash",
extraction_passes=3, # Improve accuracy with multiple passes
max_workers=10, # Number of parallel processes
max_char_buffer=2000, # Chunk size
)
Parameter Meanings:
- extraction_passes: 1 (fast) to 3 (high accuracy)
- max_workers: degree of parallelism (watch for API rate limits)
- max_char_buffer: chunk size; smaller improves accuracy, larger increases speed
By splitting long documents into smaller chunks and extracting over multiple passes, information that might be overlooked in a single LLM call can be reliably captured.
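To see the recall/speed trade-off on your own data, one rough approach is to run the same document with different pass counts and compare how many extractions come back. A sketch assuming long_document, prompt, and examples from above and GOOGLE_API_KEY in the environment:

import os

# Compare extraction counts as the number of passes increases
for passes in (1, 2, 3):
    r = lx.extract(
        text_or_documents=long_document,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash",
        api_key=os.environ.get("GOOGLE_API_KEY"),
        extraction_passes=passes,
        max_char_buffer=2000,
    )
    print(f"extraction_passes={passes}: {len(r.extractions)} extractions")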
Japanese Language Support Implementation
LangExtract includes a Unicode-compliant tokenizer (UAX #29) that supports Japanese sentence boundary detection.
In this test, it accurately extracted company names, personal names, and locations from Japanese text. Other structured output libraries may fail to recognize Japanese word boundaries correctly, but LangExtract did not encounter this issue.
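Because the reported offsets are plain code-point positions, ordinary Python slicing recovers the Japanese substring directly, matching the positions shown in the earlier run:

# Offsets from the earlier result: 東京都渋谷区 at positions 16-22
text = "株式会社サイバーエージェントは、東京都渋谷区に本社を置くインターネット広告代理店です。"
start, end = 16, 22
print(text[start:end])  # → 東京都渋谷区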
Setup
Obtaining an API Key
The easiest way to get started is with the Google AI Studio API:
- Get API Key: https://aistudio.google.com/app/apikey
- Free Tier: up to 15 requests per minute
- Set Environment Variable:
export GOOGLE_API_KEY='your-api-key'
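A quick sanity check in Python before running extractions (just a convenience sketch, not part of LangExtract):

import os

if not os.environ.get("GOOGLE_API_KEY"):
    raise RuntimeError("GOOGLE_API_KEY is not set; lx.extract() calls will fail")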
Does Not Work with OpenRouter
After trial and error, it was found that the free Gemini model from OpenRouter does not work.
Error Message:
openai.BadRequestError: Error code: 400 -
{'error': {'message': "Invalid parameter: 'response_format' of type 'json_object' is not supported with this model."}}
Cause:
- LangExtract's OpenAI provider forces response_format: {type: "json_object"}
- The free Gemini model on OpenRouter does not support this parameter
If using OpenRouter, a paid model (like GPT-4 or Claude) is required.
Use Cases
Suitable Applications
- Healthcare and Legal Documents: Clear extraction basis is essential
- Long Document Processing: Contracts, academic papers, meeting minutes, etc.
- Auditing and Verification: Need to confirm the validity of extraction results
- Multilingual Support: Non-English texts including Japanese
- Database Storage: Save and analyze extraction results as structured data
Unsuitable Applications
- Simple structured output only (Instructor is sufficient)
- Need for automatic optimization (LangStruct is appropriate)
- Emphasis on real-time performance (Batch API is slower)
Conclusion
LangExtract provides a feature that was previously unavailable: "adding character positions to LLM extraction results."
Summary of the article:
- Need for Structured Extraction: Want to automate information extraction from large documents
- Problems with Traditional Libraries: Unable to determine "where the information was extracted from"
- Importance of Source Pointers: Essential to present evidence in healthcare, law, and auditing
- LangExtract's Solution: Records character positions to ensure traceability
- Practical Example: Possible to save and analyze in a database
This feature makes a decisive difference in fields where "evidence is important," such as healthcare, law, and auditing. Since the quality of few-shot examples determines extraction accuracy, providing domain-specific examples allows for high-accuracy extraction without the need for fine-tuning.
Japanese support is also solid, and the tests ran smoothly. Note that it does not work with OpenRouter's free Gemini model, but the free tier of Google AI Studio is more than enough for trying it out.
If you're interested, please give it a try.