In this article, we explore how Pydantic's type system bridges LLM outputs, structured data, and knowledge graph concepts. If you have ever wanted to extract a full knowledge graph and know roughly what you want to extract but not how, this article is for you. Whether you're building document processing pipelines, chatbots, or data extraction workflows, understanding these patterns will help you build more robust LLM applications.
Table of contents:
- Recap: Structured Output Extraction Using LLMs
- Knowledge Graph Concepts
- Mapping Knowledge Graphs to Pydantic
- Ontology Definition
- Entity Extraction
- Moving Forward
- Conclusion
Recap: Structured Output Extraction Using LLMs
The foundation of reliable LLM data extraction is defining clear schemas. Pydantic provides an elegant way to create these schemas using Python's type hints, which can then be converted to JSON Schema for LLM consumption.
Let's start with a simple code example taken from my other article:
from pydantic import BaseModel, Field
import json
from typing import Any, Literal
class Person(BaseModel):
"""A person is a human being with the denoted attributes."""
name: str = Field(...,
description="Which is the name of the person?"
)
age: int = Field(...,
description="Which is the age of the person?"
)
email: str = Field(...,
description="Which is the email of the person?"
)
country: Literal["Germany", "Switzerland", "Austria"] = Field(...,
description="In which country does the person reside?"
)
json_schema: dict[str, Any] = Person.model_json_schema()
print(json.dumps(json_schema, indent=2))
This code defines a Person entity with four attributes. The Field descriptions provide context to the LLM about what information to extract; they are passed to the LLM as part of the JSON Schema, together with the prompt, so the model knows which information to pull out. When converted to JSON Schema, you get:
{
  "description": "A person is a human being with the denoted attributes.",
  "properties": {
    "name": {
      "description": "Which is the name of the person?",
      "title": "Name",
      "type": "string"
    },
    "age": {
      "description": "Which is the age of the person?",
      "title": "Age",
      "type": "integer"
    },
    "email": {
      "description": "Which is the email of the person?",
      "title": "Email",
      "type": "string"
    },
    "country": {
      "description": "In which country does the person reside?",
      "title": "Country",
      "type": "string",
      "enum": ["Germany", "Switzerland", "Austria"]
    }
  },
  "required": [
    "name",
    "age",
    "email",
    "country"
  ],
  "title": "Person",
  "type": "object"
}
This JSON Schema is exactly what modern LLM APIs need to constrain their outputs. Now we can use it with an LLM to extract structured data from unstructured text:
from mistralai import Mistral

# Initialize the Mistral client
client = Mistral(api_key="your-api-key")
# Make structured output request
response = client.chat.complete(
model="mistral-large-latest",
messages=[{
"role": "user",
"content": "Extract person info: John Doe is 30 years old, email: john@example.com, resides in Austria."
}],
response_format={
"type": "json_object",
"schema": Person.model_json_schema()
}
)
# Parse response into Pydantic model
answer: str = response.choices[0].message.content
person: Person = Person.model_validate_json(answer)
assert json.loads(answer) == person.model_dump()
print(json.dumps(person.model_dump(), indent=2))
Resulting JSON string:
{
"name": "John Doe",
"age": 30,
"email": "john@example.com",
"country": "Austria"
}
Knowledge Graph Concepts
Knowledge graphs offer a powerful mental model for structuring LLM extraction tasks. At their core, knowledge graphs represent information as entities (nodes) connected by relationships (edges). This maps remarkably well to how we structure Pydantic models for LLM outputs.
A knowledge graph is a structured representation of knowledge where:
- Entities are the "things" in your domain (people, organizations, products)
- Attributes describe properties of entities (name, age, color)
- Relationships connect entities together (Person works_at Organization)
For example, consider this simple knowledge graph:
Person: "John Doe"
- age: 30
- email: john@example.com
- works_at -> Organization: "Acme Corp"
Organization: "Acme Corp"
- founded: 2010
- industry: "Technology"
This graph contains two entities (John Doe and Acme Corp), several attributes (age, email, founded, industry), and one relationship (works_at).
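To preview the mapping to Pydantic, here is a minimal static sketch of this graph's schema. The Organization and WorksAt models are illustrative additions, not taken from the recap:
from pydantic import BaseModel, Field

class Organization(BaseModel):
    """An organization is a company or institution with the denoted attributes."""
    name: str = Field(..., description="Which is the name of the organization?")
    founded: int = Field(..., description="In which year was the organization founded?")
    industry: str = Field(..., description="In which industry does the organization operate?")

class WorksAt(BaseModel):
    """Relationship: a person works at an organization."""
    person_name: str = Field(..., description="Name of the person (subject)")
    organization_name: str = Field(..., description="Name of the organization (object)")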
Mapping Knowledge Graphs to Pydantic
When building LLM extraction pipelines, thinking in knowledge graph terms helps structure your approach:
- Competency question: the question posed to the LLM to extract an entity, an attribute, or a relationship. Example: "What organizations does this person work for?" It guides what your Pydantic model should capture. In Pydantic, this corresponds to the description parameter of a Field object.
- Ontology: all definitions of entity categories, attributes, and relationships, taken together as the data model of your domain. This is your collection of Pydantic schemas.
- Knowledge graph: the instantiation of your entities and relationships on a concrete set of data. This is the actual extracted and validated data from your documents.
By mapping these concepts, you create a clear separation between schema (ontology) and data (knowledge graph), making your extraction pipeline more maintainable and scalable.
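To make that separation concrete, here is a small sketch reusing the static models from the previous section; the attribute values mirror the running example:
# Ontology: the schema level, i.e. the collection of model classes.
ontology: list[type[BaseModel]] = [Person, Organization, WorksAt]

# Knowledge graph: the instance level, i.e. validated data extracted from documents.
knowledge_graph: list[BaseModel] = [
    Person(name="John Doe", age=30, email="john@example.com", country="Austria"),
    Organization(name="Acme Corp", founded=2010, industry="Technology"),
    WorksAt(person_name="John Doe", organization_name="Acme Corp"),
]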
Ontology Definition
Relationships follow the classic triplet pattern: (subject, predicate, object), for example (John Doe, works_at, Acme Corp). Before you start extracting, however, you need to know which types of entities, attributes, and relationships you are looking for; these make up the ontology. First, we show how to persist the ontology definition as tables. Second, we show how to derive Pydantic BaseModels from it.
Persisting the Ontology
For this tutorial, we define the ontology using three tables with the following schemas.
Entity Classes
| Field Name | Data Type | Description |
|---|---|---|
| ENTITY_NAME | str | Name of the entity class |
| DESCRIPTION | str | Docstring for the entity class |
| FIELD_DESCRIPTION | str | Field description for the list[EntityClass] field in the combined extraction model |
| ID_ATTRIBUTE | str | Name of the ID attribute for the entity |
| ID_FIELD_DESCRIPTION | str | Field description for the ID attribute |
| ID | str | Hash of ENTITY_NAME |
Example:
{
"entity_name": "DamagedObject",
"description": "", # No class-level docstring found; can be added if needed
"field_description": "List of DamagedObject which have been damaged",
"id_attribute": "id",
"id_field_description": "Unique ID for the entity, e.g. damage_01",
}
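For illustration, such a table could be constructed as a polars DataFrame like this; the DESCRIPTION docstring and ID value are assumed placeholders:
import polars as pl

entities_df = pl.DataFrame(
    [
        {
            "ENTITY_NAME": "DamagedObject",
            "DESCRIPTION": "An object that has been damaged.",  # assumed docstring
            "FIELD_DESCRIPTION": "List of DamagedObject which have been damaged",
            "ID_ATTRIBUTE": "id",
            "ID_FIELD_DESCRIPTION": "Unique ID for the entity, e.g. damage_01",
            "ID": "a1b2c3",  # hash of ENTITY_NAME; value is illustrative
        }
    ]
)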
Attribute Classes
| Field Name | Data Type | Description |
|---|---|---|
| ATTRIBUTE_NAME | str | Name of the attribute |
| DTYPE | str | Data type of the attribute |
| FIELD_DESCRIPTION | str | The competency question for the attribute |
| ENTITY_NAME | str | Which entity it belongs to |
| ID | str | Hash of ATTRIBUTE_NAME (should ideally be based on ATTRIBUTE_NAME + ENTITY_ID) |
| ENTITY_ID | str | ID of the entity the attribute belongs to |
Example:
{
    "attribute_name": "DamageSeverity",
    "dtype": "str",
    "field_description": "How severe is each damage to the object?",
    "entity_name": "DamagedObject",
    "id": "",  # hash of attribute_name + entity_id
    "entity_id": ""
}
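The code later in this article derives entity models only. As a hedged sketch, attribute rows could be turned into extra model fields with a hypothetical helper like the following; build_attribute_fields is not part of the original pipeline:
from typing import Any

import polars as pl
from pydantic import Field

def build_attribute_fields(
    attributes_df: pl.DataFrame, entity_name: str
) -> dict[str, tuple[Any, Any]]:
    """Collect (type, Field) tuples for all attributes belonging to one entity class."""
    dtype_map: dict[str, type] = {"str": str, "int": int, "float": float, "bool": bool}
    fields: dict[str, tuple[Any, Any]] = {}
    for row in attributes_df.filter(pl.col("ENTITY_NAME") == entity_name).to_dicts():
        fields[row["ATTRIBUTE_NAME"]] = (
            dtype_map.get(row["DTYPE"], str),  # fall back to str for unknown dtypes
            Field(..., description=row["FIELD_DESCRIPTION"]),
        )
    return fields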
Relationship Classes
| Field Name | Data Type | Description |
|---|---|---|
| SUBJEKT_ENTITAET | str | Subject entity category |
| BEZIEHUNG_NAME | str | Relationship name |
| OBJEKT_ENTITAET | str | Object entity category |
| FIELD_DESCRIPTION | str | Description of the relationship |
| ID | str | Identifier for the relationship |
| SUBJEKT_ENTITY_ID | str | ID of the subject entity |
| OBJEKT_ENTITY_ID | str | ID of the object entity |
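A relationship row can likewise be turned into a triplet model. The following helper is a sketch based on the table schema above, not code from the original pipeline:
from pydantic import BaseModel, Field, create_model

def build_relationship_model(rel_spec: dict[str, str]) -> type[BaseModel]:
    """Create a triplet model for one relationship class, e.g. (Person, works_at, Organization)."""
    fields = {
        "subjekt_id": (
            str,
            Field(..., description=f"ID of the {rel_spec['SUBJEKT_ENTITAET']} acting as subject"),
        ),
        "objekt_id": (
            str,
            Field(..., description=f"ID of the {rel_spec['OBJEKT_ENTITAET']} acting as object"),
        ),
    }
    Model = create_model(rel_spec["BEZIEHUNG_NAME"], __base__=BaseModel, **fields)
    Model.__doc__ = rel_spec["FIELD_DESCRIPTION"]
    return Model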
Deriving Pydantic BaseModels From the Ontology
Example code for creating Pydantic BaseModel classes dynamically for each entity category:
from typing import Any

import polars as pl
from pydantic import BaseModel, Field, create_model

def build_entity_models(
    entities_df: pl.DataFrame,
    *,
    id_type: type = str,  # change this if your IDs are not strings
) -> dict[str, type[BaseModel]]:
    """Build one Pydantic model per entity class from the entity-classes dataframe.

    The dataframe should include the columns:
    - ENTITY_NAME
    - DESCRIPTION
    - ID_ATTRIBUTE
    - ID_FIELD_DESCRIPTION
    """
    specs: list[dict[str, Any]] = entities_df.to_dicts()
    models: dict[str, type[BaseModel]] = {}
    for spec in specs:
        doc_string: str = spec.get("DESCRIPTION", "")
        class_name: str = spec["ENTITY_NAME"]
        id_field_name: str = spec.get("ID_ATTRIBUTE") or f"{class_name.lower()}_id"
        id_description: str = spec.get("ID_FIELD_DESCRIPTION") or (
            f"Unique ID for the entity, e.g. {class_name.lower()}_01"
        )
        # Define fields: name -> (type, FieldInfo)
        fields: dict[str, tuple[Any, Any]] = {
            id_field_name: (id_type, Field(..., description=id_description)),
            "kanonische_bezeichnung": (
                str,
                Field(
                    ...,
                    description="""Short, human-readable canonical label of the entity for display. Derived from the most informative mentions (e.g. name plus a distinguishing detail) and stable across documents. Do not use it as a unique key.
Examples: "John Smith (b. 1980-04-12)", "Policy #DE-12345-2024", "Vehicle [B-AB 1234]"
""",
                ),
            ),
            "aliase": (
                list[str],
                Field(
                    ...,
                    description="""Set of all observed surface forms (mentions) of this entity in the documents, including name variants (spellings, abbreviations, titles) and nominal references or role descriptions, provided they unambiguously refer to this entity in context (e.g. "the supervisor", "the appraiser"). Used for search and traceability; keep the original spelling and remove duplicates.
Examples: ["Dr. John", "Mr. John Smith", "J. Smith", "the supervisor", "the policyholder"].
""",
                ),
            ),
        }
        Model = create_model(
            class_name,
            __base__=BaseModel,
            __module__="dynamic_models",
            **fields,  # type: ignore[call-overload]
        )
        Model.__doc__ = doc_string
        models[class_name] = Model
    return models
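As a quick check, assuming entities_df contains the DamagedObject row shown earlier, you can inspect one of the generated models:
models = build_entity_models(entities_df)
DamagedObject = models["DamagedObject"]
print(json.dumps(DamagedObject.model_json_schema(), indent=2))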
Entity Extraction
Once you have a BaseModel for each entity class, you need to extract a list of instances of each. To do so, you need another BaseModel as an entry point, which can be defined as follows:
def build_entity_extraction_model(
    entity_models: dict[str, type[BaseModel]],
    entities_df: pl.DataFrame,
    *,
    extraction_class_name: str = "EntitaetenExtraktion",
) -> type[BaseModel]:
    """Create a container model (e.g. 'EntitaetenExtraktion') whose fields are lists of the separate entity models."""
    # FIELD_DESCRIPTION from the entity-classes table becomes the description of each list field.
    descriptions: dict[str, str] = {
        row["ENTITY_NAME"]: row.get("FIELD_DESCRIPTION", "") for row in entities_df.to_dicts()
    }
    fields: dict[str, tuple[Any, Any]] = {}
    for entity_name, model in entity_models.items():
        desc = descriptions.get(entity_name) or f"List of extracted {entity_name} entities"
        fields[entity_name] = (list[model], Field(..., description=desc))
    ExtractionModel = create_model(
        extraction_class_name,
        __base__=BaseModel,
        __module__="dynamic_models",
        **fields,  # type: ignore[call-overload]
    )
    ExtractionModel.__doc__ = "Container model for extracted entities."
    return ExtractionModel
entity_models = build_entity_models(entities_df)
EntityExtraction: type[BaseModel] = build_entity_extraction_model(entity_models, entities_df)
entity_extraction_json_schema = EntityExtraction.model_json_schema()
The resulting EntityExtraction model can then be passed to the structured output APIs of most LLM providers, just like the static Person model in the recap.
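Here is a sketch reusing the Mistral client from the recap; document_text stands in for your input text:
document_text = "..."  # your source document

response = client.chat.complete(
    model="mistral-large-latest",
    messages=[{
        "role": "user",
        "content": f"Extract all entities from the following text: {document_text}",
    }],
    response_format={
        "type": "json_object",
        "schema": entity_extraction_json_schema,
    },
)
extraction = EntityExtraction.model_validate_json(response.choices[0].message.content)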
Moving Forward
- Extract the entities from a piece of text as described above.
- Based on the extracted entities and the same piece of text, extract attributes and relationships in a second, consecutive step (see the sketch below).
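A minimal sketch of that second step, assuming the extraction result and client from above; the prompt wording is illustrative, not taken from the original pipeline:
# Step 2 (sketch): feed the extracted entities back to the model so that
# relationship triplets can reference the entity IDs from step 1.
followup_prompt = (
    "Given the following already extracted entities:\n"
    f"{extraction.model_dump_json(indent=2)}\n\n"
    "and the original text:\n"
    f"{document_text}\n\n"
    "extract all relationships between these entities."
)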
Conclusion
Combining Pydantic's static typing with knowledge graph thinking provides a robust framework for LLM data extraction. The structured output approach ensures type safety and validation, while knowledge graph concepts help you design comprehensive data models that capture not just entities, but the relationships between them.
As you build more complex LLM applications, this foundation becomes essential for maintaining data quality, enabling downstream analytics, and scaling your extraction pipelines across diverse document types and domains.