In this article, we explore how Pydantic's type system bridges LLM outputs, structured data, and knowledge graph concepts. If you have ever wanted to extract a full knowledge graph and know roughly what you want to extract but not how, this article is for you. Whether you're building document processing pipelines, chatbots, or data extraction workflows, understanding these patterns will help you build more robust LLM applications.
Table of contents:
- Recap: Structured Output Extraction Using LLMs
- Knowledge Graph Concepts
- Mapping Knowledge Graphs to Pydantic
- Ontology Definition
- Entity Extraction
- Moving Forward
- Conclusion
Recap: Structured Output Extraction Using LLMs
The foundation of reliable LLM data extraction is defining clear schemas. Pydantic provides an elegant way to create these schemas using Python's type hints, which can then be converted to JSON Schema for LLM consumption.
Let's start with a simple code example taken from my other article:
from pydantic import BaseModel, Field
import json
from typing import Any, Literal
class Person(BaseModel):
"""A person is a human being with the denoted attributes."""
name: str = Field(...,
description="Which is the name of the person?"
)
age: int = Field(...,
description="Which is the age of the person?"
)
email: str = Field(...,
description="Which is the email of the person?"
)
country: Literal["Germany", "Switzerland", "Austria"] = Field(...,
description="In which country does the person reside?"
)
json_schema: dict[str, Any] = Person.model_json_schema()
print(json.dumps(json_schema, indent=2))
This code defines a Person entity with four attributes. The Field descriptions provide context to the LLM about what information to extract; they are passed to the LLM as part of the JSON Schema, together with the prompt, so the model knows which information to pull out. When converted to JSON Schema, you get:
{
  "description": "A person is a human being with the denoted attributes.",
  "properties": {
    "name": {
      "description": "Which is the name of the person?",
      "title": "Name",
      "type": "string"
    },
    "age": {
      "description": "Which is the age of the person?",
      "title": "Age",
      "type": "integer"
    },
    "email": {
      "description": "Which is the email of the person?",
      "title": "Email",
      "type": "string"
    },
    "country": {
      "description": "In which country does the person reside?",
      "title": "Country",
      "type": "string",
      "enum": ["Germany", "Switzerland", "Austria"]
    }
  },
  "required": [
    "name",
    "age",
    "email",
    "country"
  ],
  "title": "Person",
  "type": "object"
}
This JSON Schema is exactly what modern LLM APIs need to constrain their outputs. Now we can use it with an LLM to extract structured data from unstructured text:
from mistralai import Mistral

# Initialize the Mistral client
client = Mistral(api_key="your-api-key")
# Make structured output request
response = client.chat.complete(
model="mistral-large-latest",
messages=[{
"role": "user",
"content": "Extract person info: John Doe is 30 years old, email: john@example.com, resides in Austria."
}],
response_format={
"type": "json_object",
"schema": Person.model_json_schema()
}
)
# Parse response into Pydantic model
answer: str = response.choices[0].message.content
person: Person = Person.model_validate_json(answer)
assert json.loads(answer) == person.model_dump()
print(json.dumps(person.model_dump(), indent=2))
Resulting JSON string:
{
"name": "John Doe",
"age": 30,
"email": "john@example.com",
"country": "Austria"
}
Knowledge Graph Concepts
Knowledge graphs offer a powerful mental model for structuring LLM extraction tasks. At their core, knowledge graphs represent information as entities (nodes) connected by relationships (edges). This maps remarkably well to how we structure Pydantic models for LLM outputs.
A knowledge graph is a structured representation of knowledge where:
- Entities are the "things" in your domain (people, organizations, products)
- Attributes describe properties of entities (name, age, color)
- Relationships connect entities together (Person works_at Organization)
For example, consider this simple knowledge graph:
Person: "John Doe"
- age: 30
- email: john@example.com
- works_at -> Organization: "Acme Corp"
Organization: "Acme Corp"
- founded: 2010
- industry: "Technology"
This graph contains two entities (John Doe and Acme Corp), several attributes (age, email, founded, industry), and one relationship (works_at).
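To preview the mapping to Pydantic, here is a minimal static sketch of this graph's schema. The Organization and WorksAt models are illustrative additions, not taken from the recap:
from pydantic import BaseModel, Field

class Organization(BaseModel):
    """An organization is a company or institution with the denoted attributes."""
    name: str = Field(..., description="Which is the name of the organization?")
    founded: int = Field(..., description="In which year was the organization founded?")
    industry: str = Field(..., description="In which industry does the organization operate?")

class WorksAt(BaseModel):
    """Relationship: a person works at an organization."""
    person_name: str = Field(..., description="Name of the person (subject)")
    organization_name: str = Field(..., description="Name of the organization (object)")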
Mapping Knowledge Graphs to Pydantic
When building LLM extraction pipelines, thinking in knowledge graph terms helps structure your approach:
- Competency question: the question posed to the LLM to extract an entity, an attribute, or a relationship. Example: "What organizations does this person work for?" It guides what your Pydantic model should capture. In Pydantic, this corresponds to the description parameter of a Field object.
- Ontology: all definitions of entity categories, attributes, and relationships, taken together as the data model of your domain. This is your collection of Pydantic schemas.
- Knowledge graph: the instantiation of your entities and relationships on a concrete set of data. This is the actual extracted and validated data from your documents.
By mapping these concepts, you create a clear separation between schema (ontology) and data (knowledge graph), making your extraction pipeline more maintainable and scalable.
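To make that separation concrete, here is a small sketch reusing the static models from the previous section; the attribute values mirror the running example:
# Ontology: the schema level, i.e. the collection of model classes.
ontology: list[type[BaseModel]] = [Person, Organization, WorksAt]

# Knowledge graph: the instance level, i.e. validated data extracted from documents.
knowledge_graph: list[BaseModel] = [
    Person(name="John Doe", age=30, email="john@example.com", country="Austria"),
    Organization(name="Acme Corp", founded=2010, industry="Technology"),
    WorksAt(person_name="John Doe", organization_name="Acme Corp"),
]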
Ontology Definition
Relationships follow the classic triplet pattern: (subject, predicate, object), for example (John Doe, works_at, Acme Corp). Before you start extracting, however, you need to know which types of entities, attributes, and relationships you are looking for; these make up the ontology. First, we show how to persist the ontology definition as tables. Second, we show how to derive Pydantic BaseModels from it.
Persisting the Ontology
For this tutorial, we define the ontology using three tables with the following schemas.
Entity Classes
| Field Name | Data Type | Description |
|---|---|---|
| ENTITY_NAME | str | Name of the entity class |
| DESCRIPTION | str | Docstring for the entity class |
| FIELD_DESCRIPTION | str | Field description for the list[EntityClass] field in the combined extraction model |
| ID_ATTRIBUTE | str | Name of the ID attribute for the entity |
| ID_FIELD_DESCRIPTION | str | Field description for the ID attribute |
| ID | str | Hash of ENTITY_NAME |
Example:
{
"entity_name": "DamagedObject",
"description": "", # No class-level docstring found; can be added if needed
"field_description": "List of DamagedObject which have been damaged",
"id_attribute": "id",
"id_field_description": "Unique ID for the entity, e.g. damage_01",
}
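For illustration, such a table could be constructed as a polars DataFrame like this; the DESCRIPTION docstring and ID value are assumed placeholders:
import polars as pl

entities_df = pl.DataFrame(
    [
        {
            "ENTITY_NAME": "DamagedObject",
            "DESCRIPTION": "An object that has been damaged.",  # assumed docstring
            "FIELD_DESCRIPTION": "List of DamagedObject which have been damaged",
            "ID_ATTRIBUTE": "id",
            "ID_FIELD_DESCRIPTION": "Unique ID for the entity, e.g. damage_01",
            "ID": "a1b2c3",  # hash of ENTITY_NAME; value is illustrative
        }
    ]
)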
Attribute Classes
| Field Name | Data Type | Description |
|---|---|---|
| ATTRIBUTE_NAME | str | Name of the attribute |
| DTYPE | str | Data type of the attribute |
| FIELD_DESCRIPTION | str | The competency question for the attribute |
| ENTITY_NAME | str | Which entity it belongs to |
| ID | str | Hash of ATTRIBUTE_NAME (should ideally be based on ATTRIBUTE_NAME + ENTITY_ID) |
| ENTITY_ID | str | ID of the entity the attribute belongs to |
Example:
{
    "attribute_name": "DamageSeverity",
    "dtype": "str",
    "field_description": "How severe is each damage to the object?",
    "entity_name": "DamagedObject",
    "id": "",  # hash of attribute_name + entity_id
    "entity_id": ""
}
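The code later in this article derives entity models only. As a hedged sketch, attribute rows could be turned into extra model fields with a hypothetical helper like the following; build_attribute_fields is not part of the original pipeline:
from typing import Any

import polars as pl
from pydantic import Field

def build_attribute_fields(
    attributes_df: pl.DataFrame, entity_name: str
) -> dict[str, tuple[Any, Any]]:
    """Collect (type, Field) tuples for all attributes belonging to one entity class."""
    dtype_map: dict[str, type] = {"str": str, "int": int, "float": float, "bool": bool}
    fields: dict[str, tuple[Any, Any]] = {}
    for row in attributes_df.filter(pl.col("ENTITY_NAME") == entity_name).to_dicts():
        fields[row["ATTRIBUTE_NAME"]] = (
            dtype_map.get(row["DTYPE"], str),  # fall back to str for unknown dtypes
            Field(..., description=row["FIELD_DESCRIPTION"]),
        )
    return fields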
Relationship Classes
| Field Name | Data Type | Description |
|---|---|---|
| SUBJEKT_ENTITAET | str | Subject entity category |
| BEZIEHUNG_NAME | str | Relationship name |
| OBJEKT_ENTITAET | str | Object entity category |
| FIELD_DESCRIPTION | str | Description of the relationship |
| ID | str | Identifier for the relationship |
| SUBJEKT_ENTITY_ID | str | ID of the subject entity |
| OBJEKT_ENTITY_ID | str | ID of the object entity |
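A relationship row can likewise be turned into a triplet model. The following helper is a sketch based on the table schema above, not code from the original pipeline:
from pydantic import BaseModel, Field, create_model

def build_relationship_model(rel_spec: dict[str, str]) -> type[BaseModel]:
    """Create a triplet model for one relationship class, e.g. (Person, works_at, Organization)."""
    fields = {
        "subjekt_id": (
            str,
            Field(..., description=f"ID of the {rel_spec['SUBJEKT_ENTITAET']} acting as subject"),
        ),
        "objekt_id": (
            str,
            Field(..., description=f"ID of the {rel_spec['OBJEKT_ENTITAET']} acting as object"),
        ),
    }
    Model = create_model(rel_spec["BEZIEHUNG_NAME"], __base__=BaseModel, **fields)
    Model.__doc__ = rel_spec["FIELD_DESCRIPTION"]
    return Model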
Deriving Pydantic BaseModels From the Ontology
Example code for creating Pydantic BaseModel classes dynamically for each entity category:
from typing import Any

import polars as pl
from pydantic import BaseModel, Field, create_model

def build_entity_models(
    entities_df: pl.DataFrame,
    *,
    id_type: type = str,  # change this if your IDs are not strings
) -> dict[str, type[BaseModel]]:
    """Build one Pydantic model per entity class from the entity-classes dataframe.

    The dataframe should include the columns:
    - ENTITY_NAME
    - DESCRIPTION
    - ID_ATTRIBUTE
    - ID_FIELD_DESCRIPTION
    """
    specs: list[dict[str, Any]] = entities_df.to_dicts()
    models: dict[str, type[BaseModel]] = {}
    for spec in specs:
        doc_string: str = spec.get("DESCRIPTION", "")
        class_name: str = spec["ENTITY_NAME"]
        id_field_name: str = spec.get("ID_ATTRIBUTE") or f"{class_name.lower()}_id"
        id_description: str = spec.get("ID_FIELD_DESCRIPTION") or (
            f"Unique ID for the entity, e.g. {class_name.lower()}_01"
        )
        # Define fields: name -> (type, FieldInfo)
        fields: dict[str, tuple[Any, Any]] = {
            id_field_name: (id_type, Field(..., description=id_description)),
            "kanonische_bezeichnung": (
                str,
                Field(
                    ...,
                    description="""Short, human-readable canonical label of the entity for display. Derived from the most informative mentions (e.g. name plus a distinguishing detail) and stable across documents. Do not use it as a unique key.
Examples: "John Smith (b. 1980-04-12)", "Policy #DE-12345-2024", "Vehicle [B-AB 1234]"
""",
                ),
            ),
            "aliase": (
                list[str],
                Field(
                    ...,
                    description="""Set of all observed surface forms (mentions) of this entity in the documents, including name variants (spellings, abbreviations, titles) and nominal references or role descriptions, provided they unambiguously refer to this entity in context (e.g. "the supervisor", "the appraiser"). Used for search and traceability; keep the original spelling and remove duplicates.
Examples: ["Dr. John", "Mr. John Smith", "J. Smith", "the supervisor", "the policyholder"].
""",
                ),
            ),
        }
        Model = create_model(
            class_name,
            __base__=BaseModel,
            __module__="dynamic_models",
            **fields,  # type: ignore[call-overload]
        )
        Model.__doc__ = doc_string
        models[class_name] = Model
    return models
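As a quick check, assuming entities_df contains the DamagedObject row shown earlier, you can inspect one of the generated models:
models = build_entity_models(entities_df)
DamagedObject = models["DamagedObject"]
print(json.dumps(DamagedObject.model_json_schema(), indent=2))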
Entity Extraction
Once you have a BaseModel for each entity class, you need to extract a list of instances of each. To do so, you need another BaseModel as an entry point, which can be defined as follows:
def build_entity_extraction_model(
    entity_models: dict[str, type[BaseModel]],
    entities_df: pl.DataFrame,
    *,
    extraction_class_name: str = "EntitaetenExtraktion",
) -> type[BaseModel]:
    """Create a container model (e.g. 'EntitaetenExtraktion') whose fields are lists of the separate entity models."""
    # FIELD_DESCRIPTION from the entity-classes table becomes the description of each list field.
    descriptions: dict[str, str] = {
        row["ENTITY_NAME"]: row.get("FIELD_DESCRIPTION", "") for row in entities_df.to_dicts()
    }
    fields: dict[str, tuple[Any, Any]] = {}
    for entity_name, model in entity_models.items():
        desc = descriptions.get(entity_name) or f"List of extracted {entity_name} entities"
        fields[entity_name] = (list[model], Field(..., description=desc))
    ExtractionModel = create_model(
        extraction_class_name,
        __base__=BaseModel,
        __module__="dynamic_models",
        **fields,  # type: ignore[call-overload]
    )
    ExtractionModel.__doc__ = "Container model for extracted entities."
    return ExtractionModel
entity_models = build_entity_models(entities_df)
EntityExtraction: type[BaseModel] = build_entity_extraction_model(entity_models, entities_df)
entity_extraction_json_schema = EntityExtraction.model_json_schema()
The resulting EntityExtraction model can then be passed to the structured output APIs of most LLM providers, just like the static Person model in the recap.
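Here is a sketch reusing the Mistral client from the recap; document_text stands in for your input text:
document_text = "..."  # your source document

response = client.chat.complete(
    model="mistral-large-latest",
    messages=[{
        "role": "user",
        "content": f"Extract all entities from the following text: {document_text}",
    }],
    response_format={
        "type": "json_object",
        "schema": entity_extraction_json_schema,
    },
)
extraction = EntityExtraction.model_validate_json(response.choices[0].message.content)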
Moving Forward
- Extract the entities from a piece of text as described above.
- Based on the extracted entities and the same piece of text, extract attributes and relationships in a second, consecutive step (see the sketch below).
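A minimal sketch of that second step, assuming the extraction result and client from above; the prompt wording is illustrative, not taken from the original pipeline:
# Step 2 (sketch): feed the extracted entities back to the model so that
# relationship triplets can reference the entity IDs from step 1.
followup_prompt = (
    "Given the following already extracted entities:\n"
    f"{extraction.model_dump_json(indent=2)}\n\n"
    "and the original text:\n"
    f"{document_text}\n\n"
    "extract all relationships between these entities."
)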
Conclusion
Combining Pydantic's static typing with knowledge graph thinking provides a robust framework for LLM data extraction. The structured output approach ensures type safety and validation, while knowledge graph concepts help you design comprehensive data models that capture not just entities, but the relationships between them.
As you build more complex LLM applications, this foundation becomes essential for maintaining data quality, enabling downstream analytics, and scaling your extraction pipelines across diverse document types and domains.