I Built an Open-Source Framework to Make LLM Data Extraction Dead Simple

Sergii Shcherbak — Fri, 02 May 2025 17:51:35 +0000

After getting tired of writing endless boilerplate to extract structured data from documents with LLMs, I built ContextGem - a free, open-source framework that makes this radically easier.

What makes it different?

✅ Automated dynamic prompts and data modeling
✅ Precise reference mapping to source content
✅ Built-in justifications for extractions
✅ Nested context extraction
✅ Works with any LLM provider
and more built-in abstractions that save developer time.

Simple LLM extraction in just a few lines:

from contextgem import Aspect, Document, DocumentLLM

# Create the document and define what to extract from it
doc = Document(raw_text="Your document text here...")
doc.aspects = [
    Aspect(
        name="Intellectual property",
        description="Clauses on intellectual property rights",
    )
]

# Extract with any LLM
llm = DocumentLLM(model="<provider>/<model>", api_key="<api_key>")
doc = llm.extract_all(doc)

# Get results
print(doc.aspects[0].extracted_items)
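
Printing the list directly shows the raw item objects. Here is a minimal sketch of a more readable printout, assuming each extracted item exposes its text through a `value` attribute (check the ContextGem docs for the exact item fields). `item_values` is a hypothetical helper of mine, not part of the library, and the `SimpleNamespace` objects are stand-ins so the snippet runs without an API key:

```python
from types import SimpleNamespace


def item_values(items):
    """Return the text of each extracted item, skipping blank ones."""
    values = []
    for item in items:
        # Assumption: the item's text lives in `.value`;
        # fall back to the item itself if that attribute is missing.
        text = str(getattr(item, "value", item)).strip()
        if text:
            values.append(text)
    return values


# Stand-in items; with real results, pass doc.aspects[0].extracted_items.
sample_items = [
    SimpleNamespace(value="All IP created under this agreement belongs to the Client."),
    SimpleNamespace(value="   "),
]

for text in item_values(sample_items):
    print(f"- {text}")
```

With a real extraction, the same loop would run over `item_values(doc.aspects[0].extracted_items)`.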

ContextGem also features a native DOCX converter, support for multiple LLMs, and full serialization, all under the permissive Apache 2.0 license.

View project on GitHub: https://github.com/shcherbak-ai/contextgem

Try it out and let me know your thoughts!
