# Building a Knowledge Graph from Text with LLMs: Complete Pipeline
Transform unstructured data into interactive knowledge graphs using Python and language models
## Why Knowledge Graphs?
Unstructured data (articles, documents, biographies) contains valuable information, but it's difficult to query programmatically. A *Knowledge Graph* (KG) structures that information as a network of entities connected by relationships, enabling:
- Queries like “What did Marie Curie discover?”
- Visual navigation of connections between concepts
- Inference of new facts from existing relationships
- Integration with RAG (Retrieval-Augmented Generation) systems
This article presents a complete pipeline that uses LLMs to automatically extract facts from text and build an interactive Knowledge Graph.
## The Concept: SPO Triples
The fundamental unit of a Knowledge Graph is the *SPO triple* (Subject-Predicate-Object):
Each fact from the text is decomposed into three parts:
| Component | Role | Example |
|-----------|------|---------|
| **Subject** | The main entity | marie curie |
| **Predicate** | The relationship/action | discovered |
| **Object** | The related entity | radium |
This structure maps directly to the graph:
- *Subject* and *Object* → Nodes
- *Predicate* → Directed edge (with label)
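For example, the fact "Marie Curie discovered radium" becomes a single triple, shown here as the Python dict the rest of the pipeline works with:
```python
# One SPO triple in the format used throughout the pipeline
triple = {"subject": "marie curie", "predicate": "discovered", "object": "radium"}
```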
## Pipeline Architecture
The complete process follows these steps:
**Summary of stages:**
- **Input:** Unstructured text (any document)
- **Chunking:** Divide into manageable fragments with overlap
- **LLM Extraction:** Send each chunk to the LLM with the SPO prompt
- **Normalization:** Clean, lowercase, and deduplicate triples
- **Construction:** Create the graph with NetworkX
- **Visualization:** Render interactively
## Setup: Dependencies
```
pip install openai networkx ipycytoscape ipywidgets pandas
```
```python
import openai
import json
import networkx as nx
import ipycytoscape
import pandas as pd
import os
import re
```
## LLM Configuration
The pipeline is compatible with any provider that uses the OpenAI API:
```python
# Environment variables
# export OPENAI_API_KEY='your-api-key'
# export OPENAI_API_BASE='https://api.openai.com/v1'  # Optional

api_key = os.getenv("OPENAI_API_KEY")
base_url = os.getenv("OPENAI_API_BASE")  # None for standard OpenAI

# Create client
client = openai.OpenAI(
    api_key=api_key,
    base_url=base_url
)

# Configuration
llm_model = "gpt-4o"    # or "claude-3-sonnet", "deepseek-v3", etc.
llm_temperature = 0.0   # Deterministic for extraction
llm_max_tokens = 4096
```
**Model options:**
- **OpenAI:** `gpt-4o`, `gpt-4o-mini`
- **Anthropic:** `claude-3-5-sonnet` (via a compatible API)
- **Local:** `ollama` with any model
- **Others:** DeepSeek, Mistral, etc.
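For instance, a local setup can reuse the same client by pointing it at Ollama's OpenAI-compatible endpoint. A minimal sketch, assuming Ollama is running on its default port and the `llama3` model has been pulled:
```python
# Local alternative (assumption: Ollama on the default port, model already pulled)
client = openai.OpenAI(
    api_key="ollama",                      # any non-empty string works locally
    base_url="http://localhost:11434/v1"   # Ollama's OpenAI-compatible endpoint
)
llm_model = "llama3"
```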
## Step 1: Input Text
For this example, we’ll use a biography of Marie Curie:
text = """ Marie Curie, born Maria Skłodowska in Warsaw, Poland, was a pioneering physicist and chemist. She conducted groundbreaking research on radioactivity. Together with her husband, Pierre Curie, she discovered the elements polonium and radium. Marie Curie was the first woman to win a Nobel Prize, the first person and only woman to win the Nobel Prize twice, and the only person to win the Nobel Prize in two different scientific fields. She won the Nobel Prize in Physics in 1903 with Pierre Curie and Henri Becquerel. Later, she won the Nobel Prize in Chemistry in 1911 for her work on radium and polonium. """
print(f"Words: {len(text.split())}")
# Words: ~120
## Step 2: Chunking with Overlap
LLMs have context limits. Dividing text into chunks allows processing long documents, and overlap preserves context between fragments:
```python
def chunk_text(text: str, chunk_size: int = 150, overlap: int = 30) -> list:
    """Divide text into chunks with overlap.

    Args:
        text: Text to divide
        chunk_size: Words per chunk
        overlap: Overlapping words between chunks

    Returns:
        List of dicts with 'text' and 'chunk_number'
    """
    words = text.split()
    chunks = []
    start = 0
    chunk_num = 1

    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk_text = " ".join(words[start:end])
        chunks.append({
            "text": chunk_text,
            "chunk_number": chunk_num
        })

        # Next chunk with overlap
        next_start = start + chunk_size - overlap
        if next_start <= start:
            next_start = start + 1
        start = next_start
        chunk_num += 1

        # Safety: avoid infinite loops
        if chunk_num > len(words):
            break

    return chunks


# Apply chunking
chunks = chunk_text(text, chunk_size=150, overlap=30)
print(f"Chunks generated: {len(chunks)}")

# Visualize
for c in chunks:
    words = len(c['text'].split())
    print(f"  Chunk {c['chunk_number']}: {words} words")
```
Output:
```
Chunks generated: 1
  Chunk 1: 120 words
```
For short texts, this may result in a single chunk. In long documents, you’ll see multiple chunks with overlap preserving context.
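To see the overlap in action, here is a quick sketch with a longer, artificial input (400 filler words, purely illustrative):
```python
# Illustrative only: force multiple chunks with a 400-word filler text
long_text = " ".join(["word"] * 400)
demo = chunk_text(long_text, chunk_size=150, overlap=30)
print(len(demo))                               # 4 chunks
print([len(c['text'].split()) for c in demo])  # [150, 150, 150, 40]
# Chunk 2 starts at word 120, so it repeats the last 30 words of chunk 1
```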
## Step 3: SPO Extraction Prompt
The prompt is critical. It must specify exactly the expected output format:
SYSTEM_PROMPT = """ You are an AI expert specialized in knowledge graph extraction. Your task is to identify and extract factual Subject-Predicate-Object (SPO) triples from the given text. Focus on accuracy and adhere strictly to the JSON output format requested. """
USER_PROMPT_TEMPLATE = """ Extract Subject-Predicate-Object (S-P-O) triples from the text below. **RULES:** 1. Output ONLY a valid JSON array. Each element must have keys: "subject", "predicate", "object" 2. NO text before or after the JSON. NO markdown code fences. 3. Keep predicates concise (1-3 words, verbs preferred) 4. ALL values must be LOWERCASE 5. Replace pronouns (she, he, it) with the actual entity name 6. Be specific (e.g., "nobel prize in physics" not just "nobel prize") 7. Extract ALL distinct factual relationships **Text:** {text_chunk} **Required format:** [ {{"subject": "entity1", "predicate": "relation", "object": "entity2"}}, ... ] Your JSON: """
```
**Key Rules Explained:**
| Rule | Reason |
|-------|----------|
| JSON Only | Facilitates automatic parsing |
| Lowercase | Normalization for deduplication |
| Resolve Pronouns | Avoids "she discovered" without knowing who "she" is |
| Concise Predicates | Cleaner and more navigable graphs |
| Specificity | Preserves important information |
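For the opening sentences of the Marie Curie text, output that follows these rules would look roughly like this (illustrative; the exact triples depend on the model):
```python
# Illustrative LLM output (already parsed into Python) for the first sentences
[
    {"subject": "marie curie", "predicate": "born in", "object": "warsaw, poland"},
    {"subject": "marie curie", "predicate": "was", "object": "pioneering physicist"},
    {"subject": "marie curie", "predicate": "conducted research on", "object": "radioactivity"}
]
```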
---
## Step 4: Extraction with the LLM
```python
def extract_triples_from_chunk(client, chunk: dict, model: str) -> list:
    """Extract SPO triples from a chunk using the LLM.

    Returns:
        List of validated triples with 'chunk' source
    """
    prompt = USER_PROMPT_TEMPLATE.format(text_chunk=chunk['text'])

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt}
            ],
            temperature=0.0,
            max_tokens=4096
        )
        raw = response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error in chunk {chunk['chunk_number']}: {e}")
        return []

    # Parse JSON
    try:
        data = json.loads(raw)
        if isinstance(data, dict):
            # Some LLMs return {"triples": [...]}
            data = next((v for v in data.values() if isinstance(v, list)), [])
    except json.JSONDecodeError:
        # Fallback: search for array with regex
        match = re.search(r'\[.*\]', raw, re.DOTALL)
        if match:
            try:
                data = json.loads(match.group())
            except json.JSONDecodeError:
                return []
        else:
            return []

    # Validate structure
    valid_triples = []
    for t in data:
        if isinstance(t, dict):
            s = t.get('subject', '')
            p = t.get('predicate', '')
            o = t.get('object', '')
            if all(isinstance(x, str) and x.strip() for x in [s, p, o]):
                valid_triples.append({
                    'subject': s,
                    'predicate': p,
                    'object': o,
                    'chunk': chunk['chunk_number']
                })

    return valid_triples


# Process all chunks
all_triples = []
for chunk in chunks:
    triples = extract_triples_from_chunk(client, chunk, llm_model)
    all_triples.extend(triples)
    print(f"Chunk {chunk['chunk_number']}: {len(triples)} triples extracted")

print(f"\nTotal raw triples: {len(all_triples)}")
```
**Example Output:**
```
Chunk 1: 18 triples extracted
Total raw triples: 18
```
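Since `pandas` is already imported, a convenient way to eyeball the raw triples before cleaning is to load them into a DataFrame:
```python
# Quick inspection of the raw triples (optional)
df = pd.DataFrame(all_triples)
print(df.head())   # columns: subject, predicate, object, chunk
```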
***
## **Step 5: Normalization and Deduplication**
Raw triples need cleaning before building the graph:
![kg-normalization](https://canada1.discourse-cdn.com/flex009/uploads/inovacon/original/2X/c/c0f502a2c25320ce28741b456fde117423febe23.png)
```python
def normalize_triples(raw_triples: list) -> list:
    """Normalize and deduplicate triples.

    Steps:
        1. Lowercase and trim
        2. Filter empty values
        3. Deduplicate using a set
    """
    normalized = []
    seen = set()
    stats = {
        'original': len(raw_triples),
        'empty_removed': 0,
        'duplicates_removed': 0
    }

    for t in raw_triples:
        # Normalize
        s = t.get('subject', '').strip().lower()
        p = t.get('predicate', '').strip().lower()
        p = re.sub(r'\s+', ' ', p)  # Multiple spaces → one
        o = t.get('object', '').strip().lower()

        # Filter empty values
        if not all([s, p, o]):
            stats['empty_removed'] += 1
            continue

        # Deduplicate
        key = (s, p, o)
        if key in seen:
            stats['duplicates_removed'] += 1
            continue
        seen.add(key)

        normalized.append({
            'subject': s,
            'predicate': p,
            'object': o,
            'source_chunk': t.get('chunk', '?')
        })

    print("Normalization:")
    print(f"  Original: {stats['original']}")
    print(f"  Empty removed: {stats['empty_removed']}")
    print(f"  Duplicates removed: {stats['duplicates_removed']}")
    print(f"  Final: {len(normalized)}")

    return normalized


# Apply normalization
clean_triples = normalize_triples(all_triples)
```
**Example Output:**
```
Normalization:
Original: 18
Empty removed: 0
Duplicates removed: 2
Final: 16
```
***
## **Step 6: Graph Construction**
With clean triples, we build the graph using NetworkX:
```python
def build_knowledge_graph(triples: list) -> nx.DiGraph:
    """Build a NetworkX DiGraph from the triples.

    - Subject   → Node
    - Object    → Node
    - Predicate → Edge label
    """
    G = nx.DiGraph()

    for t in triples:
        subject = t['subject']
        predicate = t['predicate']
        obj = t['object']

        # add_edge automatically creates nodes if they don't exist
        G.add_edge(subject, obj, label=predicate)

    return G


# Build
kg = build_knowledge_graph(clean_triples)

print("Knowledge Graph created:")
print(f"  Nodes (entities): {kg.number_of_nodes()}")
print(f"  Edges (relations): {kg.number_of_edges()}")
```
**Example Output:**
```
Knowledge Graph created:
Nodes (entities): 15
Edges (relations): 16
```
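With the graph built, the question from the introduction, "What did Marie Curie discover?", becomes a simple edge lookup. A minimal sketch, assuming the LLM produced a `discovered`-style predicate as in the sample output further down:
```python
# Query the graph directly: outgoing edges from "marie curie" whose label
# contains "discover" (predicate wording may vary between runs)
discoveries = [
    obj for _, obj, attrs in kg.out_edges("marie curie", data=True)
    if "discover" in attrs.get("label", "")
]
print(discoveries)   # e.g. ['radium', 'polonium']
```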
***
## **Step 7: Interactive Visualization**
Using ipycytoscape to render the graph in Jupyter:
```python
def visualize_kg(G: nx.DiGraph):
    """Create an interactive visualization of the Knowledge Graph."""
    # Convert to Cytoscape format
    nodes = []
    edges = []

    # Calculate degrees for sizing
    degrees = dict(G.degree())
    max_degree = max(degrees.values()) if degrees else 1

    for node_id in G.nodes():
        degree = degrees.get(node_id, 0)
        size = 20 + (degree / max_degree) * 40
        nodes.append({
            'data': {
                'id': str(node_id),
                'label': str(node_id),
                'size': size
            }
        })

    for i, (u, v, data) in enumerate(G.edges(data=True)):
        edges.append({
            'data': {
                'id': f'edge_{i}',
                'source': str(u),
                'target': str(v),
                'label': data.get('label', '')
            }
        })

    # Create widget
    cyto = ipycytoscape.CytoscapeWidget()
    cyto.graph.add_graph_from_json({
        'nodes': nodes,
        'edges': edges
    })

    # Style
    cyto.set_style([
        {
            'selector': 'node',
            'style': {
                'label': 'data(label)',
                'background-color': '#6366f1',
                'color': '#ffffff',
                'text-valign': 'center',
                'width': 'data(size)',
                'height': 'data(size)',
                'font-size': '10px'
            }
        },
        {
            'selector': 'edge',
            'style': {
                'label': 'data(label)',
                'curve-style': 'bezier',
                'target-arrow-shape': 'triangle',
                'line-color': '#94a3b8',
                'target-arrow-color': '#94a3b8',
                'font-size': '8px',
                'color': '#64748b'
            }
        },
        {
            'selector': 'node:selected',
            'style': {
                'background-color': '#22c55e',
                'border-width': 2,
                'border-color': '#16a34a'
            }
        }
    ])

    # Layout
    cyto.set_layout(name='cose', nodeRepulsion=8000)

    return cyto


# Visualize
widget = visualize_kg(kg)
display(widget)
```
The result is an interactive graph where you can:
- Drag nodes to reorganize
- Click on nodes to select
- Zoom in/out with scroll
- View the relationships (predicates) on the edges
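Outside Jupyter, a static rendering is sometimes more practical. A minimal sketch using matplotlib (an assumption: it is not in the dependency list above, so install it separately):
```python
# Static fallback with matplotlib + NetworkX drawing helpers
import matplotlib.pyplot as plt
import networkx as nx

def draw_kg_static(G: nx.DiGraph, path: str = "kg.png"):
    pos = nx.spring_layout(G, seed=42)
    nx.draw_networkx_nodes(G, pos, node_color="#6366f1", node_size=600)
    nx.draw_networkx_labels(G, pos, font_size=8)
    nx.draw_networkx_edges(G, pos, arrows=True, edge_color="#94a3b8")
    nx.draw_networkx_edge_labels(
        G, pos, edge_labels=nx.get_edge_attributes(G, "label"), font_size=6
    )
    plt.axis("off")
    plt.savefig(path, dpi=150, bbox_inches="tight")

draw_kg_static(kg)
```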
---
## Output Example
For the Marie Curie text, the resulting graph shows:
**Central nodes:**
- `marie curie` (main hub with multiple connections)
- `pierre curie`
- `nobel prize in physics`
- `nobel prize in chemistry`
- `radium`, `polonium`
- `warsaw, poland`
**Typical extracted relationships:**
```
(marie curie) —[born in]→ (warsaw, poland)
(marie curie) —[discovered]→ (radium)
(marie curie) —[discovered]→ (polonium)
(marie curie) —[won]→ (nobel prize in physics)
(marie curie) —[won]→ (nobel prize in chemistry)
(marie curie) —[married to]→ (pierre curie)
(pierre curie) —[discovered]→ (radium)
```
---
## Complete Code
```python
# kg_pipeline.py - Complete Knowledge Graph Pipeline
import openai
import json
import networkx as nx
import os
import re


def chunk_text(text, chunk_size=150, overlap=30):
    words = text.split()
    chunks = []
    start = 0
    num = 1
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append({"text": " ".join(words[start:end]), "chunk_number": num})
        next_start = start + chunk_size - overlap
        start = next_start if next_start > start else start + 1  # always advance
        num += 1
        if num > len(words):  # Safety: avoid infinite loops
            break
    return chunks


def extract_triples(client, chunk, model):
    SYSTEM = "You are an AI expert in knowledge graph extraction."
    USER = f"""Extract SPO triples from this text as JSON array.
Rules: lowercase, no markdown, concise predicates, resolve pronouns.
Format: [{{"subject": "x", "predicate": "y", "object": "z"}}]
Text: {chunk['text']}
JSON:"""
    try:
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": USER}],
            temperature=0.0, max_tokens=4096
        )
        data = json.loads(r.choices[0].message.content.strip())
        if isinstance(data, dict):
            data = next((v for v in data.values() if isinstance(v, list)), [])
    except Exception:
        return []
    return [
        {**t, 'chunk': chunk['chunk_number']}
        for t in data
        if all(t.get(k) for k in ['subject', 'predicate', 'object'])
    ]


def normalize(triples):
    seen = set()
    out = []
    for t in triples:
        key = tuple(t[k].strip().lower() for k in ['subject', 'predicate', 'object'])
        if all(key) and key not in seen:
            seen.add(key)
            out.append({'subject': key[0], 'predicate': key[1], 'object': key[2]})
    return out


def build_graph(triples):
    G = nx.DiGraph()
    for t in triples:
        G.add_edge(t['subject'], t['object'], label=t['predicate'])
    return G


# Main
if __name__ == "__main__":
    client = openai.OpenAI()
    text = "..."  # Your text here
    chunks = chunk_text(text)
    raw = [t for c in chunks for t in extract_triples(client, c, "gpt-4o")]
    clean = normalize(raw)
    kg = build_graph(clean)
    print(f"Nodes: {kg.number_of_nodes()}, Edges: {kg.number_of_edges()}")
```
***
## **Next Steps**
This basic pipeline can be extended with:
| **Enhancement** | **Description** |
| --------------------------- | --------------------------------------------------- |
| **Entity Linking** | Connect “Marie Curie” and “M. Curie” to the same ID |
| **Relationship Clustering** | Group “born in” and “was born at” |
| **Persistence** | Save to Neo4j or ArangoDB |
| **Evaluation** | Measure extraction precision/recall |
| **Multi-hop Queries** | “What did Pierre Curie’s wife discover?” (see the sketch below) |
| **RAG Integration** | Use the KG to improve LLM responses |
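As a taste of multi-hop querying, here is a sketch that answers "What did Pierre Curie's wife discover?" by traversing two edges. It assumes the graph contains a marriage edge between the two entities (in either direction) and `discover`-style predicates, which depends on what the LLM extracted:
```python
# Two-hop query sketch: spouse(s) of Pierre Curie, then their discoveries.
# Predicate names ("married to", "discovered") depend on the extraction run.
def spouses_of(G, person):
    out = {v for _, v, d in G.out_edges(person, data=True) if "married" in d.get("label", "")}
    inc = {u for u, _, d in G.in_edges(person, data=True) if "married" in d.get("label", "")}
    return out | inc

for spouse in spouses_of(kg, "pierre curie"):
    found = [v for _, v, d in kg.out_edges(spouse, data=True)
             if "discover" in d.get("label", "")]
    print(f"{spouse} discovered: {found}")
```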
***
## **Resources**
* [NetworkX Documentation](https://networkx.org/documentation/stable/)
* [ipycytoscape](https://github.com/cytoscape/ipycytoscape)
* [Neo4j Graph Database](https://neo4j.com/)
* [Knowledge Graphs - Wikipedia](https://en.wikipedia.org/wiki/Knowledge_graph)
***
*Published on yoDEV.dev — The Latin American developers community*