Johann Hagerer

Knowledge Graph Extraction in Pydantic

In this article, we explore how Pydantic's type system bridges LLM outputs, structured data, and knowledge graph concepts. If you want to extract a full knowledge graph and know roughly what you want to extract but not how, this article is for you. Whether you're building document processing pipelines, chatbots, or data extraction workflows, understanding these patterns will help you build more robust LLM applications.

Table of contents:

  • Recap: Structured Output Extraction Using LLMs
  • Knowledge Graph Concepts
  • Mapping Knowledge Graphs to Pydantic
  • Ontology Definition
  • Entity Extraction
  • Moving Forward
  • Conclusion

Recap: Structured Output Extraction Using LLMs

The foundation of reliable LLM data extraction is defining clear schemas. Pydantic provides an elegant way to create these schemas using Python's type hints, which can then be converted to JSON Schema for LLM consumption.

Let's start with a simple code example taken from my other article:

from pydantic import BaseModel, Field
import json
from typing import Any, Literal

class Person(BaseModel):
    """A person is a human being with the denoted attributes."""

    name: str = Field(..., 
        description="Which is the name of the person?"
    )
    age: int = Field(..., 
        description="Which is the age of the person?"
    )
    email: str = Field(..., 
        description="Which is the email of the person?"
    )
    country: Literal["Germany", "Switzerland", "Austria"] = Field(..., 
        description="In which country does the person reside?"
    )

json_schema: dict[str, Any] = Person.model_json_schema()
print(json.dumps(json_schema, indent=2))

This code defines a Person entity with four attributes. The Field descriptions provide context to the LLM about what information to extract; they are passed to the LLM together with the JSON Schema and the prompt, so that the model knows which information should be extracted. When converted to JSON Schema, you get:

{
  "description": "A person is a human being with the denoted attributes.",
  "properties": {
    "name": {
      "description": "Which is the name of the person?",
      "title": "Name",
      "type": "string"
    },
    "age": {
      "description": "Which is the age of the person?",
      "title": "Age",
      "type": "integer"
    },
    "email": {
      "description": "Which is the email of the person?",
      "title": "Email",
      "type": "string"
    },
    "country": {
      "description": "In which country does the person reside?",
      "title": "Country",
      "type": "string",
      "enum": ["Germany", "Switzerland", "Austria"]
    }
  },
  "required": [
    "name",
    "age",
    "email",
    "country"
  ],
  "title": "Person",
  "type": "object"
}

This JSON Schema is exactly what modern LLM APIs need to constrain their outputs. Now we can use it with an LLM to extract structured data from unstructured text:

from mistralai import Mistral

# Initialize client
client = Mistral(api_key="your-api-key")

# Make structured output request
response = client.chat.complete(
    model="mistral-large-latest",
    messages=[{
        "role": "user",
        "content": "Extract person info: John Doe is 30 years old, email: john@example.com, resides in Austria."
    }],
    response_format={
        "type": "json_object",
        "schema": Person.model_json_schema()
    }
)

# Parse response into Pydantic model
answer: str = response.choices[0].message.content
person: Person = Person.model_validate_json(answer)
assert json.loads(answer) == person.model_dump()
print(json.dumps(person.model_dump(), indent=2))

Resulting JSON string:

{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com",
  "country": "Austria"
}

Knowledge Graph Concepts

Knowledge graphs offer a powerful mental model for structuring LLM extraction tasks. At their core, knowledge graphs represent information as entities (nodes) connected by relationships (edges). This maps remarkably well to how we structure Pydantic models for LLM outputs.

A knowledge graph is a structured representation of knowledge where:

  • Entities are the "things" in your domain (people, organizations, products)
  • Attributes describe properties of entities (name, age, color)
  • Relationships connect entities together (Person works_at Organization)

For example, consider this simple knowledge graph:

Person: "John Doe"
  - age: 30
  - email: john@example.com
  - works_at -> Organization: "Acme Corp"

Organization: "Acme Corp"
  - founded: 2010
  - industry: "Technology"

This graph contains two entities (John Doe and Acme Corp), several attributes (age, email, founded, industry), and one relationship (works_at).
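As a minimal sketch, the same graph can be modeled with two Pydantic entity models and a reference between them. Note that this Person variant is illustrative only: it adds a works_at field and drops the country field compared to the Person model from the first example.

from pydantic import BaseModel, Field

class Organization(BaseModel):
    """An organization such as a company or institution."""
    name: str = Field(..., description="Which is the name of the organization?")
    founded: int = Field(..., description="In which year was the organization founded?")
    industry: str = Field(..., description="In which industry does the organization operate?")

class Person(BaseModel):
    """A person is a human being with the denoted attributes."""
    name: str = Field(..., description="Which is the name of the person?")
    age: int = Field(..., description="Which is the age of the person?")
    email: str = Field(..., description="Which is the email of the person?")
    works_at: str = Field(..., description="Which organization does the person work for?")

# Instantiate the two entities and connect them via the works_at reference.
acme = Organization(name="Acme Corp", founded=2010, industry="Technology")
john = Person(name="John Doe", age=30, email="john@example.com", works_at=acme.name)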

Mapping Knowledge Graphs to Pydantic

When building LLM extraction pipelines, thinking in knowledge graph terms helps structure your approach:

  • Competency question: The question in the prompt that asks the LLM to extract an entity, an attribute, or a relationship. Example: "What organizations does this person work for?" This guides what your Pydantic model should capture. In Pydantic, this corresponds to the description parameter of a Field object.

  • Ontology: All definitions of entity categories, attributes, and relationships, taken together as the data model of your domain. This is your collection of Pydantic schemas.

  • Knowledge graph: The instance of your entities and relationships on a concrete set of data. This is the actual extracted and validated data from your documents.

By mapping these concepts, you create a clear separation between schema (ontology) and data (knowledge graph), making your extraction pipeline more maintainable and scalable.
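To make the distinction concrete, here is a minimal sketch, assuming the imports and the Person model (the one with the country field) from the first example are in scope:

# Ontology: the schemas (classes) describing your domain.
ontology: dict[str, type[BaseModel]] = {"Person": Person}

# Knowledge graph: concrete, validated instances extracted from documents.
knowledge_graph: list[BaseModel] = [
    Person(name="John Doe", age=30, email="john@example.com", country="Austria"),
]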

Ontology Definition

Relationships follow the classic triplet pattern: (subject, predicate, object), for example (John Doe, works_at, Acme Corp). Before you start extracting, however, you need to know which types of entities, attributes, and relationships you are looking for. These make up the ontology. First, we show how to persist the ontology definition as tables. Second, we show how to derive Pydantic BaseModels from it. To build intuition, the sketch below shows how such a triplet can be modeled directly in Pydantic.
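A minimal sketch of a triplet model; the class and field names here are illustrative and not part of the ontology tables introduced below:

from typing import Literal
from pydantic import BaseModel, Field

class Relationship(BaseModel):
    """A (subject, predicate, object) triplet connecting two extracted entities."""
    subject_id: str = Field(..., description="ID of the subject entity, e.g. person_01")
    predicate: Literal["works_at"] = Field(..., description="Name of the relationship.")
    object_id: str = Field(..., description="ID of the object entity, e.g. organization_01")

triplet = Relationship(subject_id="person_01", predicate="works_at", object_id="organization_01")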

Persisting the Ontology

For this tutorial, we define them using three tables with the following schemas.

Entity Classes

Field Name            Data Type  Description
ENTITY_NAME           str        Name of the entity class
DESCRIPTION           str        Docstring for the entity class
FIELD_DESCRIPTION     str        Field description for the combined BaseModel where we want list[EntityCategory]
ID_ATTRIBUTE          str        ID attribute for the entity (may be removed from this table)
ID_FIELD_DESCRIPTION  str        Field description for the ID attribute (may be removed from this table)
ID                    str        Hash of the entity_name

Example:

{
    "entity_name": "DamagedObject",
    "description": "",  # No class-level docstring found; can be added if needed
    "field_description": "List of DamagedObject which have been damaged",
    "id_attribute": "id",
    "id_field_description": "Unique ID for the entity, e.g. damage_01",
}

Attribute Classes

Field Name         Data Type  Description
ATTRIBUTE_NAME     str        Name of the attribute
DTYPE              str        Data type of the attribute
FIELD_DESCRIPTION  str        The competency question for the attribute
ENTITY_NAME        str        Name of the entity the attribute belongs to
ID                 str        Hash of the attribute_name (note: should be based on attribute_name + entity_id)
ENTITY_ID          str        ID of the entity

Example:

{
    "attribute_name": "DamageSeverity",
    "dtype": "str",
    "field_description": "List of DamageSeverity attributes telling how severe each damage is.",
    "entity_name": "DamagedObject",
    "id": "",
    "entity_id": ""
}

Relationship Classes

Field Name         Data Type  Description
SUBJEKT_ENTITAET   str        Subject entity category
BEZIEHUNG_NAME     str        Relationship name
OBJEKT_ENTITAET    str        Object entity category
FIELD_DESCRIPTION  str        Description of the relationship
ID                 str        Identifier for the relationship
SUBJEKT_ENTITY_ID  str        ID of the subject entity
OBJEKT_ENTITY_ID   str        ID of the object entity
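
For consistency with the other two tables, here is a hypothetical example row (all values are illustrative, not taken from a real ontology):

{
    "subjekt_entitaet": "Person",
    "beziehung_name": "works_at",
    "objekt_entitaet": "Organization",
    "field_description": "Which organization does the person work for?",
    "id": "",
    "subjekt_entity_id": "",
    "objekt_entity_id": ""
}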

Deriving Pydantic BaseModels From the Ontology

Example code for creating Pydantic BaseModel classes dynamically for each entity category:

from typing import Any
import polars as pl
from pydantic import BaseModel, Field, create_model

def build_entity_models(
    entities_df: pl.DataFrame,
    *,
    id_type: type = str,  # change if your IDs are not strings
) -> dict[str, type[BaseModel]]:
    """Build Pydantic models from an entities DataFrame.

    The DataFrame should include the columns:
    - ENTITY_NAME
    - DESCRIPTION
    - ID_ATTRIBUTE
    - ID_FIELD_DESCRIPTION
    """
    specs: list[dict[str, Any]] = entities_df.to_dicts()
    models: dict[str, type[BaseModel]] = {}
    for spec in specs:
        doc_string: str = spec.get("DESCRIPTION", "")
        class_name: str = spec["ENTITY_NAME"]
        id_field_name: str = f"{class_name}_id"

        # Define fields: name -> (type, default or FieldInfo)
        fields: dict[str, tuple[Any, Any]] = {
            id_field_name: (id_type, Field(..., description=f"Unique ID for the entity, e.g. {class_name.lower()}_01")),
            "kanonische_bezeichnung": (
                str,
                Field(
                    ...,
                    description="""Kurze, menschenlesbare Standardbezeichnung der Entität für die Anzeige. Aus den informativsten Erwähnungen abgeleitet (z. B. Name + Zusatzinfo) und stabil über mehrere Dokumente hinweg. Nicht als eindeutigen Schlüssel verwenden.

                    Beispiele: "John Smith (geb. 1980-04-12)", "Police #DE-12345-2024", "Fahrzeug [B-AB 1234]"
                    """,
                ),
            ),
            "aliase": (
                list[str],
                Field(
                    ...,
                    description="""Menge aller beobachteten Oberflächenformen (Erwähnungen) dieser Entität aus den Dokumenten, inkl. Namensvarianten (Schreibweisen, Abkürzungen, Titel) und nominalen Verweisformen/Rollenbezeichnungen, sofern sie im Kontext eindeutig auf diese Entität zielen (z. B. "der Vorgesetzte", "der Gutachter"). Dient der Suche und Nachvollziehbarkeit; Originalschreibweise beibehalten, Duplikate entfernen.

                    Beispiele: ["Dr. John", "Herr John Smith", "J. Smith", "der Vorgesetzte", "der Versicherungsnehmer"].
                    """,
                ),
            ),
        }

        Model = create_model(
            class_name,
            __base__=BaseModel,
            __module__="dynamic_models",
            **fields,  # type: ignore[call-overload]
        )
        Model.__doc__ = doc_string
        models[class_name] = Model

    return models
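As a quick usage sketch, with a hypothetical one-row ontology table whose values are made up to match the table schema above:

entities_df = pl.DataFrame({
    "ENTITY_NAME": ["DamagedObject"],
    "DESCRIPTION": ["An object that has been damaged."],
    "FIELD_DESCRIPTION": ["List of DamagedObject which have been damaged"],
    "ID_ATTRIBUTE": ["id"],
    "ID_FIELD_DESCRIPTION": ["Unique ID for the entity, e.g. damage_01"],
})

entity_models = build_entity_models(entities_df)
DamagedObject = entity_models["DamagedObject"]
print(DamagedObject.model_json_schema())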

Entity Extraction

Once you have a BaseModel for each entity category, you need to extract a list of instances for each of them. To do so, you need another BaseModel as the entry point, which can be defined as follows:

def build_entity_extraction_model(
    entity_models: dict[str, type[BaseModel]],
    entities_df: pl.DataFrame,
    *,
    extraction_class_name: str = "EntitaetenExtraktion",
) -> type[BaseModel]:
    """Create a container model (e.g. 'EntitaetenExtraktion') whose fields are lists of the separate entity models."""

    # Look up the FIELD_DESCRIPTION for each entity class from the ontology table.
    descriptions: dict[str, str] = {
        row["ENTITY_NAME"]: row["FIELD_DESCRIPTION"] for row in entities_df.to_dicts()
    }

    fields: dict[str, tuple[Any, Any]] = {}
    for entity_name, model in entity_models.items():
        fields[entity_name] = (
            list[model],  # type: ignore[valid-type]
            Field(..., description=descriptions.get(entity_name, "")),
        )

    ExtractionModel = create_model(
        extraction_class_name,
        __base__=BaseModel,
        __module__="dynamic_models",
        **fields,  # type: ignore[call-overload]
    )
    ExtractionModel.__doc__ = "Container model for extracted entities."
    return ExtractionModel

entity_models = build_entity_models(entities_df)
EntityExtraction: type[BaseModel] = build_entity_extraction_model(entity_models, entities_df)
entity_extraction_json_schema = EntityExtraction.model_json_schema()

Finally, the EntityExtraction BaseModel can be passed to the structured output API of most LLM providers, as sketched below.
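Mirroring the earlier Mistral example (the document text is a hypothetical placeholder, and the client from the first example is assumed to be in scope):

response = client.chat.complete(
    model="mistral-large-latest",
    messages=[{
        "role": "user",
        "content": "Extract all entities from the following claim report: ..."
    }],
    response_format={
        "type": "json_object",
        "schema": entity_extraction_json_schema
    }
)

# Validate the raw JSON answer against the dynamically built container model.
extraction = EntityExtraction.model_validate_json(response.choices[0].message.content)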

Moving Forward

  1. Extract the entities from a piece of text as described above.
  2. Based on the extracted entities and the same piece of text, extract attributes and relationships in a separate, subsequent step (see the sketch after this list).
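
A minimal sketch of step 2; the build_relationship_model helper and all names are assumptions, not part of the ontology tables above. The idea is to constrain subject and object IDs to those extracted in step 1 via Literal types:

from typing import Literal
from pydantic import BaseModel, Field, create_model

def build_relationship_model(
    relationship_name: str,
    subject_ids: list[str],
    object_ids: list[str],
    description: str,
) -> type[BaseModel]:
    """Build a triplet model whose subject/object IDs are restricted to already-extracted entities."""
    Model = create_model(
        relationship_name,
        __base__=BaseModel,
        subject_id=(Literal[tuple(subject_ids)], Field(..., description="ID of the subject entity.")),
        object_id=(Literal[tuple(object_ids)], Field(..., description="ID of the object entity.")),
    )
    Model.__doc__ = description
    return Model

WorksAt = build_relationship_model(
    "WorksAt",
    subject_ids=["person_01"],
    object_ids=["organization_01"],
    description="Which organization does the person work for?",
)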

Conclusion

Combining Pydantic's static typing with knowledge graph thinking provides a robust framework for LLM data extraction. The structured output approach ensures type safety and validation, while knowledge graph concepts help you design comprehensive data models that capture not just entities, but the relationships between them.

As you build more complex LLM applications, this foundation becomes essential for maintaining data quality, enabling downstream analytics, and scaling your extraction pipelines across diverse document types and domains.
