# Building a Knowledge Graph from Text with LLMs: Complete Pipeline

Transform unstructured data into interactive knowledge graphs using Python and language models


## Why Knowledge Graphs?

Unstructured data (articles, documents, biographies) contains valuable information, but it’s difficult to query programmatically. A *Knowledge Graph* (KG) structures that information as a network of entities connected by relationships, enabling:

- Queries like “What did Marie Curie discover?”
- Visual navigation of connections between concepts
- Inference of new facts from existing relationships
- Integration with RAG (Retrieval-Augmented Generation) systems

This article presents a complete pipeline that uses LLMs to automatically extract facts from text and build an interactive Knowledge Graph.


## The Concept: SPO Triples

The fundamental unit of a Knowledge Graph is the *SPO triple* (Subject-Predicate-Object):

*(Diagram: the SPO triple concept)*

Each fact from the text is decomposed into three parts:

| Component | Role | Example |
|-----------|------|---------|
| **Subject** | The main entity | marie curie |
| **Predicate** | The relationship/action | discovered |
| **Object** | The related entity | radium |

This structure maps directly to the graph:

- **Subject** and **Object** → Nodes
- **Predicate** → Directed edge (with label)
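
For example, the sentence “Marie Curie discovered radium” becomes one such triple. A toy illustration of the mapping (the same convention used by the construction code later in this article):

```python
# One fact expressed as an SPO triple.
triple = {"subject": "marie curie", "predicate": "discovered", "object": "radium"}

# subject and object become graph nodes; the predicate labels the directed edge.
nodes = {triple["subject"], triple["object"]}
edge = (triple["subject"], triple["object"], {"label": triple["predicate"]})
```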

## Pipeline Architecture

The complete process follows these steps:


Summary of stages:

1. **Input:** Unstructured text (any document)
2. **Chunking:** Divide into manageable fragments with overlap
3. **LLM Extraction:** Send each chunk to the LLM with an SPO prompt
4. **Normalization:** Clean, lowercase, deduplicate triples
5. **Construction:** Create the graph with NetworkX
6. **Visualization:** Render interactively

## Setup: Dependencies

```bash
pip install openai networkx ipycytoscape ipywidgets pandas
```

```python
import openai
import json
import networkx as nx
import ipycytoscape
import pandas as pd
import os
import re
```

## LLM Configuration

The pipeline is compatible with any provider that uses the OpenAI API:

```python
# Environment variables
# export OPENAI_API_KEY='your-api-key'
# export OPENAI_API_BASE='https://api.openai.com/v1' # Optional

api_key = os.getenv("OPENAI_API_KEY")
base_url = os.getenv("OPENAI_API_BASE")  # None for standard OpenAI

# Create client
client = openai.OpenAI(
    api_key=api_key,
    base_url=base_url
)

# Configuration
llm_model = "gpt-4o"  # or "claude-3-sonnet", "deepseek-v3", etc.
llm_temperature = 0.0  # Deterministic for extraction
llm_max_tokens = 4096
```

**Model options:**

- **OpenAI:** `gpt-4o`, `gpt-4o-mini`
- **Anthropic:** `claude-3-5-sonnet` (via a compatible API)
- **Local:** `ollama` with any model
- **Others:** DeepSeek, Mistral, etc.
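
For example, if you run models locally with Ollama, the same client works unchanged: Ollama exposes an OpenAI-compatible endpoint (by default at `http://localhost:11434/v1`), so only the base URL and model name differ. The model name below is just an example of whatever you have pulled locally:

```python
# Example: pointing the same pipeline at a local Ollama server (assumed defaults).
client = openai.OpenAI(
    api_key="ollama",                     # any non-empty string works for local use
    base_url="http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint
)
llm_model = "llama3.1"  # e.g., a model previously fetched with `ollama pull llama3.1`
```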

## Step 1: Input Text

For this example, we’ll use a biography of Marie Curie:

text = """ Marie Curie, born Maria Skłodowska in Warsaw, Poland, was a pioneering physicist and chemist. She conducted groundbreaking research on radioactivity. Together with her husband, Pierre Curie, she discovered the elements polonium and radium. Marie Curie was the first woman to win a Nobel Prize, the first person and only woman to win the Nobel Prize twice, and the only person to win the Nobel Prize in two different scientific fields. She won the Nobel Prize in Physics in 1903 with Pierre Curie and Henri Becquerel. Later, she won the Nobel Prize in Chemistry in 1911 for her work on radium and polonium. """

print(f"Words: {len(text.split())}")
# Words: ~120

Enter fullscreen mode Exit fullscreen mode

## Step 2: Chunking with Overlap

LLMs have context limits. Dividing text into chunks allows processing long documents, and overlap preserves context between fragments:

```python
def chunk_text(text: str, chunk_size: int = 150, overlap: int = 30) -> list:
    """
    Divide text into chunks with overlap.

    Args:
        text: Text to divide
        chunk_size: Words per chunk
        overlap: Overlapping words between chunks

    Returns:
        List of dicts with 'text' and 'chunk_number'
    """
    words = text.split()
    chunks = []
    start = 0
    chunk_num = 1

    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk_str = " ".join(words[start:end])
        chunks.append({
            "text": chunk_str,
            "chunk_number": chunk_num
        })

        # Next chunk with overlap
        next_start = start + chunk_size - overlap
        if next_start <= start:
            next_start = start + 1
        start = next_start
        chunk_num += 1

        # Safety: avoid infinite loops
        if chunk_num > len(words):
            break

    return chunks

# Apply chunking
chunks = chunk_text(text, chunk_size=150, overlap=30)
print(f"Chunks generated: {len(chunks)}")

# Visualize
for c in chunks:
    words = len(c['text'].split())
    print(f"  Chunk {c['chunk_number']}: {words} words")
```

Output:

```
Chunks generated: 1
  Chunk 1: 120 words
```

For short texts, this may result in a single chunk. In long documents, you’ll see multiple chunks with overlap preserving context.
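
As a quick sanity check on longer input, you can feed the same function a synthetic document and watch the overlap at work (the repeated sentence below is just filler to reach a few hundred words):

```python
# ~350-word synthetic document → multiple chunks sharing 30 words of context.
long_text = "Marie Curie conducted pioneering research on radioactivity. " * 50
long_chunks = chunk_text(long_text, chunk_size=150, overlap=30)
print(f"Chunks generated: {len(long_chunks)}")
for c in long_chunks:
    print(f"  Chunk {c['chunk_number']}: {len(c['text'].split())} words")
# With chunk_size=150 and overlap=30, each chunk starts 120 words after the
# previous one, so consecutive chunks share 30 words of context.
```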


## Step 3: SPO Extraction Prompt

The prompt is critical. It must specify exactly the expected output format:

```python
SYSTEM_PROMPT = """
You are an AI expert specialized in knowledge graph extraction.
Your task is to identify and extract factual Subject-Predicate-Object (SPO) triples
from the given text. Focus on accuracy and adhere strictly to the JSON output
format requested.
"""

USER_PROMPT_TEMPLATE = """
Extract Subject-Predicate-Object (S-P-O) triples from the text below.

**RULES:**
1. Output ONLY a valid JSON array. Each element must have keys: "subject", "predicate", "object"
2. NO text before or after the JSON. NO markdown code fences.
3. Keep predicates concise (1-3 words, verbs preferred)
4. ALL values must be LOWERCASE
5. Replace pronouns (she, he, it) with the actual entity name
6. Be specific (e.g., "nobel prize in physics" not just "nobel prize")
7. Extract ALL distinct factual relationships

**Text:**
{text_chunk}

**Required format:**
[
  {{"subject": "entity1", "predicate": "relation", "object": "entity2"}},
  ...
]

Your JSON:
"""
```

**Key Rules Explained:**

| Rule | Reason |
|-------|----------|
| JSON Only | Facilitates automatic parsing |
| Lowercase | Normalization for deduplication |
| Resolve Pronouns | Avoids "she discovered" without knowing who "she" is |
| Concise Predicates | Cleaner and more navigable graphs |
| Specificity | Preserves important information |

---

## Step 4: Extraction with the LLM



```python
def extract_triples_from_chunk(client, chunk: dict, model: str) -> list:
    """ Extracts SPO triples from a chunk using the LLM. Returns: List of validated triples with 'chunk' source """
    prompt = USER_PROMPT_TEMPLATE.format(text_chunk=chunk['text'])

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt}
            ],
            temperature=0.0,
            max_tokens=4096
        )
        raw = response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error in chunk {chunk['chunk_number']}: {e}")
        return []

    # Parse JSON
    try:
        data = json.loads(raw)
        if isinstance(data, dict):
            # Some LLMs return {"triples": [...]}
            data = next((v for v in data.values() if isinstance(v, list)), [])
    except json.JSONDecodeError:
        # Fallback: search for array with regex
        match = re.search(r'\[.*\]', raw, re.DOTALL)
        if match:
            try:
                data = json.loads(match.group())
            except json.JSONDecodeError:
                return []
        else:
            return []

    # Validate structure
    valid_triples = []
    for t in data:
        if isinstance(t, dict):
            s = t.get('subject', '')
            p = t.get('predicate', '')
            o = t.get('object', '')
            if all(isinstance(x, str) and x.strip() for x in [s, p, o]):
                valid_triples.append({
                    'subject': s,
                    'predicate': p,
                    'object': o,
                    'chunk': chunk['chunk_number']
                })

    return valid_triples

# Process all chunks
all_triples = []
for chunk in chunks:
    triples = extract_triples_from_chunk(client, chunk, llm_model)
    all_triples.extend(triples)
    print(f"Chunk {chunk['chunk_number']}: {len(triples)} triples extracted")

print(f"\nTotal raw triples: {len(all_triples)}")

```


**Example Output:**

```
Chunk 1: 18 triples extracted

Total raw triples: 18

```
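
Since `pandas` is already imported, a quick way to eyeball the raw triples before cleaning them is to load them into a DataFrame (purely optional; shown here as a convenience):

```python
# Optional: inspect the raw triples as a table before normalization.
df = pd.DataFrame(all_triples)
print(df[['subject', 'predicate', 'object', 'chunk']].head(10))
```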

***

## **Step 5: Normalization and Deduplication**

Raw triples need cleaning before building the graph:

![kg-normalization](https://canada1.discourse-cdn.com/flex009/uploads/inovacon/original/2X/c/c0f502a2c25320ce28741b456fde117423febe23.png)

```python
def normalize_triples(raw_triples: list) -> list:
    """ Normalizes and deduplicates triples. Steps: 1. Lowercase and trim 2. Filter empty values 3. Deduplicate using Set """
    normalized = []
    seen = set()

    stats = {
        'original': len(raw_triples),
        'empty_removed': 0,
        'duplicates_removed': 0
    }

    for t in raw_triples:
        # Normalize
        s = t.get('subject', '').strip().lower()
        p = t.get('predicate', '').strip().lower()
        p = re.sub(r'\s+', ' ', p)  # Multiple spaces → one
        o = t.get('object', '').strip().lower()

        # Filter empty values
        if not all([s, p, o]):
            stats['empty_removed'] += 1
            continue

        # Deduplicate
        key = (s, p, o)
        if key in seen:
            stats['duplicates_removed'] += 1
            continue

        seen.add(key)
        normalized.append({
            'subject': s,
            'predicate': p,
            'object': o,
            'source_chunk': t.get('chunk', '?')
        })

    print(f"Normalization:")
    print(f" Original: {stats['original']}")
    print(f" Empty removed: {stats['empty_removed']}")
    print(f" Duplicates removed: {stats['duplicates_removed']}")
    print(f" Final: {len(normalized)}")

    return normalized

# Apply normalization
clean_triples = normalize_triples(all_triples)

```

**Example Output:**

```
Normalization:
  Original: 18
  Empty removed: 0
  Duplicates removed: 2
  Final: 16

```

***

## **Step 6: Graph Construction**

With clean triples, we build the graph using NetworkX:

```python
def build_knowledge_graph(triples: list) -> nx.DiGraph:
    """ Builds a NetworkX DiGraph from the triples. - Subject → Node - Object → Node - Predicate → Edge label """
    G = nx.DiGraph()

    for t in triples:
        subject = t['subject']
        predicate = t['predicate']
        obj = t['object']

        # add_edge automatically creates nodes if they don't exist
        G.add_edge(subject, obj, label=predicate)

    return G

# Build
kg = build_knowledge_graph(clean_triples)

print(f"Knowledge Graph created:")
print(f" Nodes (entities): {kg.number_of_nodes()}")
print(f" Edges (relations): {kg.number_of_edges()}")

```

**Example Output:**

```
Knowledge Graph created:
  Nodes (entities): 15
  Edges (relations): 16

```
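
With the graph in memory, the question from the introduction (“What did Marie Curie discover?”) becomes a simple filter over the edge data. A minimal sketch, assuming the extractor produced a `discovered` predicate (the exact wording can vary by model):

```python
# Answer "What did Marie Curie discover?" by filtering outgoing edges by label.
def query_predicate(G, subject, predicate):
    return [obj for _, obj, data in G.out_edges(subject, data=True)
            if data.get('label') == predicate]

print(query_predicate(kg, 'marie curie', 'discovered'))
# e.g. ['radium', 'polonium'] (depends on the exact predicates the LLM produced)
```

Note that `DiGraph` keeps a single edge per (subject, object) pair, so if the LLM extracts two different predicates between the same entities, the later label overwrites the earlier one; switch to `nx.MultiDiGraph` if you need parallel edges.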

***

## **Step 7: Interactive Visualization**

Using ipycytoscape to render the graph in Jupyter:


```python
def visualize_kg(G: nx.DiGraph):
    """ Creates interactive visualization of the Knowledge Graph. """
    # Convert to Cytoscape format
    nodes = []
    edges = []

    # Calculate degrees for sizing
    degrees = dict(G.degree())
    max_degree = max(degrees.values()) if degrees else 1

    for node_id in G.nodes():
        degree = degrees.get(node_id, 0)
        size = 20 + (degree / max_degree) * 40
        nodes.append({
            'data': {
                'id': str(node_id),
                'label': str(node_id),
                'size': size
            }
        })

    for i, (u, v, data) in enumerate(G.edges(data=True)):
        edges.append({
            'data': {
                'id': f'edge_{i}',
                'source': str(u),
                'target': str(v),
                'label': data.get('label', '')
            }
        })

    # Create widget
    cyto = ipycytoscape.CytoscapeWidget()
    cyto.graph.add_graph_from_json({
        'nodes': nodes,
        'edges': edges
    })

    # Style
    cyto.set_style([
        {
            'selector': 'node',
            'style': {
                'label': 'data(label)',
                'background-color': '#6366f1',
                'color': '#ffffff',
                'text-valign': 'center',
                'width': 'data(size)',
                'height': 'data(size)',
                'font-size': '10px'
            }
        },
        {
            'selector': 'edge',
            'style': {
                'label': 'data(label)',
                'curve-style': 'bezier',
                'target-arrow-shape': 'triangle',
                'line-color': '#94a3b8',
                'target-arrow-color': '#94a3b8',
                'font-size': '8px',
                'color': '#64748b'
            }
        },
        {
            'selector': 'node:selected',
            'style': {
                'background-color': '#22c55e',
                'border-width': 2,
                'border-color': '#16a34a'
            }
        }
    ])

    # Layout
    cyto.set_layout(name='cose', nodeRepulsion=8000)

    return cyto

# Visualize
widget = visualize_kg(kg)
display(widget)
```

The result is an interactive graph where you can:
- Drag nodes to reorganize
- Click on nodes to select
- Zoom in/out with scroll
- View the relationships (predicates) on the edges
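
Outside a notebook, or to keep the graph between sessions, it also helps to persist it; a minimal option before reaching for a graph database is GraphML (the file name here is just an example):

```python
# Persist the graph so it can be reloaded later or opened in tools like Gephi.
nx.write_graphml(kg, "marie_curie_kg.graphml")
kg_reloaded = nx.read_graphml("marie_curie_kg.graphml")
```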

---

## Output Example

For the Marie Curie text, the resulting graph shows:

**Central nodes:**
- `marie curie` (main hub with multiple connections)
- `pierre curie`
- `nobel prize in physics`
- `nobel prize in chemistry`
- `radium`, `polonium`
- `warsaw, poland`

**Typical extracted relationships:**

```
(marie curie) —[born in]→ (warsaw, poland)
(marie curie) —[discovered]→ (radium)
(marie curie) —[discovered]→ (polonium)
(marie curie) —[won]→ (nobel prize in physics)
(marie curie) —[won]→ (nobel prize in chemistry)
(marie curie) —[married to]→ (pierre curie)
(pierre curie) —[discovered]→ (radium)
```

---

## Complete Code

```python
# kg_pipeline.py - Complete Knowledge Graph Pipeline

import openai
import json
import networkx as nx
import os
import re

def chunk_text(text, chunk_size=150, overlap=30):
    words = text.split()
    chunks = []
    start = 0
    num = 1
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append({"text": " ".join(words[start:end]), "chunk_number": num})
        next_start = start + chunk_size - overlap
        start = next_start if next_start > start else start + 1
        num += 1
        if num > len(words): break
    return chunks

def extract_triples(client, chunk, model):
    SYSTEM = "You are an AI expert in knowledge graph extraction."
    USER = f"""Extract SPO triples from this text as JSON array.
Rules: lowercase, no markdown, concise predicates, resolve pronouns.
Format: [{{"subject": "x", "predicate": "y", "object": "z"}}]

Text: {chunk['text']}

JSON:"""

    try:
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": USER}],
            temperature=0.0, max_tokens=4096
        )
        data = json.loads(r.choices[0].message.content.strip())
        if isinstance(data, dict):
            data = next((v for v in data.values() if isinstance(v, list)), [])
    except Exception:
        return []

    return [
        {**t, 'chunk': chunk['chunk_number']}
        for t in data
        if isinstance(t, dict)
        and all(t.get(k) for k in ['subject', 'predicate', 'object'])
    ]

def normalize(triples):
    seen = set()
    out = []
    for t in triples:
        key = tuple(t[k].strip().lower() for k in ['subject', 'predicate', 'object'])
        if all(key) and key not in seen:
            seen.add(key)
            out.append({'subject': key[0], 'predicate': key[1], 'object': key[2]})
    return out

def build_graph(triples):
    G = nx.DiGraph()
    for t in triples:
        G.add_edge(t['subject'], t['object'], label=t['predicate'])
    return G

# Main
if __name__ == "__main__":
    client = openai.OpenAI()
    text = "..." # Your text here

    chunks = chunk_text(text)
    raw = [t for c in chunks for t in extract_triples(client, c, "gpt-4o")]
    clean = normalize(raw)
    kg = build_graph(clean)

    print(f"Nodes: {kg.number_of_nodes()}, Edges: {kg.number_of_edges()}")

```

***

## **Next Steps**

This basic pipeline can be extended with:

| **Enhancement**             | **Description**                                     |
| --------------------------- | --------------------------------------------------- |
| **Entity Linking**          | Connect “Marie Curie” and “M. Curie” to the same ID |
| **Relationship Clustering** | Group “born in” and “was born at”                   |
| **Persistence**             | Save to Neo4j or ArangoDB                           |
| **Evaluation**              | Measure extraction precision/recall                 |
| **Multi-hop Queries**       | “What did Pierre Curie’s wife discover?” (sketched below) |
| **RAG Integration**         | Use the KG to improve LLM responses                 |
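
As a taste of the multi-hop idea, that question can already be composed from two traversals of the graph built above. A rough sketch, assuming the extractor produced `married to` and `discovered` predicates (exact wording varies by model):

```python
# Two hops: pierre curie --[married to]--> ? --[discovered]--> ?
def spouses(G, person):
    # 'married to' may have been extracted in either direction, so check both.
    out = [o for _, o, d in G.out_edges(person, data=True) if d.get('label') == 'married to']
    inc = [s for s, _, d in G.in_edges(person, data=True) if d.get('label') == 'married to']
    return set(out + inc)

for spouse in spouses(kg, 'pierre curie'):
    discoveries = [o for _, o, d in kg.out_edges(spouse, data=True)
                   if d.get('label') == 'discovered']
    print(spouse, "discovered", discoveries)
```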

***

## **Resources**

* [NetworkX Documentation](https://networkx.org/documentation/stable/)
* [ipycytoscape](https://github.com/cytoscape/ipycytoscape)
* [Neo4j Graph Database](https://neo4j.com/)
* [Knowledge Graphs - Wikipedia](https://en.wikipedia.org/wiki/Knowledge_graph)

***

*Published on yoDEV.dev — The Latin American developers community*