Validating PII Obfuscation using GLiNER-Powered Named Entity Recognition (NER) within Docling
Introduction
The challenge of rigorously detecting and obscuring Personally Identifiable Information (PII) is one that numerous tools — both commercial and open-source — aim to solve using Named Entity Recognition (NER). Yet, this necessity is no longer a matter of best practice, but a critical legal mandate driven primarily by the European Union’s General Data Protection Regulation (GDPR). Passed in 2016, GDPR fundamentally reshaped how organizations worldwide must handle the personal data of EU residents, demanding data minimization, transparency, and high standards of security. Failure to protect information like names, addresses, and account numbers — which could directly or indirectly identify an individual — exposes companies to severe penalties, potentially reaching tens of millions of euros or a significant percentage of global annual turnover. Consequently, adopting advanced techniques like Named Entity Recognition (NER) for automated PII obfuscation has become essential, transforming privacy compliance from a manual checklist item into a scalable, technological safeguard.
What is named entity recognition?
Named entity recognition (NER) — also called entity chunking or entity extraction — is a component of natural language processing (NLP) that identifies predefined categories of objects in a body of text.
These categories can include, but are not limited to, names of individuals, organizations, locations, expressions of times, quantities, medical codes, monetary values and percentages, among others. Essentially, NER is the process of taking a string of text (i.e., a sentence, paragraph or entire document), and identifying and classifying the entities that refer to each category.
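For instance, a minimal sketch with the Hugging Face transformers pipeline (the model and input sentence here are arbitrary example choices) shows the idea in a few lines:
from transformers import pipeline

# Minimal NER sketch; the model and sentence are arbitrary example choices
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # groups subwords into complete entities
)
for entity in ner("Ada Lovelace worked with Charles Babbage in London."):
    # each result is a dict like {'entity_group': 'PER', 'word': 'Ada Lovelace', ...}
    print(entity["entity_group"], "->", entity["word"])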
When the term “NER” was coined at the Sixth Message Understanding Conference (MUC-6), the goal was to streamline information extraction tasks, which involved processing large amounts of unstructured text and identifying key information. Since then, NER has expanded and evolved, owing much of its evolution to advancements in machine learning and deep learning techniques.
The full article can be found here: https://www.ibm.com/think/topics/named-entity-recognition
How can Docling help with NER Obfuscation?
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
Features
- 🗂️ Parsing of multiple document formats incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, VTT, images (PNG, TIFF, JPEG, …), and more
- 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
- 🧬 Unified, expressive DoclingDocument representation format
- ↪️ Various export formats and options, including Markdown, HTML, DocTags and lossless JSON
- 🔒 Local execution capabilities for sensitive data and air-gapped environments
- 🤖 Plug-and-play integrations incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
- 🔍 Extensive OCR support for scanned PDFs and images
- 👓 Support of several Visual Language Models (GraniteDocling)
- 🎙️ Audio support with Automatic Speech Recognition (ASR) models
- 🔌 Connect to any agent using the MCP server
- 💻 Simple and convenient CLI
What’s new
- 📤 Structured information extraction [🧪 beta]
- 📑 New layout model (Heron) by default, for faster PDF parsing
- 🔌 MCP server for agentic applications
- 💬 Parsing of Web Video Text Tracks (WebVTT) files
Coming soon
- 📝 Metadata extraction, including title, authors, references & language
- 📝 Chart understanding (Barchart, Piechart, LinePlot, etc)
- 📝 Complex chemistry understanding (Molecular structures)
Testing the NER Functionality and Implementation
For my tests related to this project, I started from the sample provided out of the box and simply adapted the code to my way of working…
- Here are the steps to set up the environment and test the code
# for the GLiNER package, it is better to use an earlier Python version (3.12 here)!
python3.12 -m venv myenv
source myenv/bin/activate
pip install docling
pip install transformers
pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
pip install gliner
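Before layering NER on top, a quick smoke test confirms that Docling converts documents as expected; a minimal sketch (the PDF path below is a placeholder, any local document works):
from docling.document_converter import DocumentConverter

# Smoke test: convert a local PDF and preview the resulting Markdown
converter = DocumentConverter()
result = converter.convert("input/2206.01062.pdf")  # placeholder path
print(result.document.export_to_markdown()[:500])   # preview the first 500 chars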
- The first implementation does the following: it converts a PDF and saves the original Markdown with embedded images, runs a HF token-classification pipeline (NER) to detect PII-like entities, and obfuscates occurrences in TextItem and TableItem elements with stable, type-based IDs.
The input sample document can be found here: https://github.com/docling-project/docling/blob/main/tests/data/pdf/2206.01062.pdf
import argparse
import logging
import os
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Tuple
from docling_core.types.doc import ImageRefMode, TableItem, TextItem
from tabulate import tabulate
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
_log = logging.getLogger(__name__)
IMAGE_RESOLUTION_SCALE = 2.0
HF_MODEL = "dslim/bert-base-NER" # Swap with another HF NER/PII model if desired, eg https://huggingface.co/urchade/gliner_multi_pii-v1 looks very promising too!
GLINER_MODEL = "urchade/gliner_multi_pii-v1"
def _build_simple_ner_pipeline():
"""Create a Hugging Face token-classification pipeline for NER.
Returns a callable like: ner(text) -> List[dict]
"""
try:
from transformers import (
AutoModelForTokenClassification,
AutoTokenizer,
pipeline,
)
except Exception:
_log.error("Transformers not installed. Please run: pip install transformers")
raise
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)
model = AutoModelForTokenClassification.from_pretrained(HF_MODEL)
ner = pipeline(
"token-classification",
model=model,
tokenizer=tokenizer,
aggregation_strategy="simple", # groups subwords into complete entities
# Note: modern Transformers returns `start`/`end` when possible with aggregation
)
return ner
class SimplePiiObfuscator:
"""Tracks PII strings and replaces them with stable IDs per entity type."""
def __init__(self, ner_callable):
self.ner = ner_callable
self.entity_map: Dict[str, str] = {}
self.counters: Dict[str, int] = {
"person": 0,
"org": 0,
"location": 0,
"misc": 0,
}
# Map model labels to our coarse types
self.label_map = {
"PER": "person",
"PERSON": "person",
"ORG": "org",
"ORGANIZATION": "org",
"LOC": "location",
"LOCATION": "location",
"GPE": "location",
# Fallbacks
"MISC": "misc",
"O": "misc",
}
# Only obfuscate these by default. Adjust as needed.
self.allowed_types = {"person", "org", "location"}
def _next_id(self, typ: str) -> str:
self.counters[typ] += 1
return f"{typ}-{self.counters[typ]}"
def _normalize(self, s: str) -> str:
return re.sub(r"\s+", " ", s).strip()
def _extract_entities(self, text: str) -> List[Tuple[str, str]]:
"""Run NER and return a list of (surface_text, type) to obfuscate."""
if not text:
return []
results = self.ner(text)
# Collect normalized items with optional span info
items = []
for r in results:
raw_label = r.get("entity_group") or r.get("entity") or "MISC"
label = self.label_map.get(raw_label, "misc")
if label not in self.allowed_types:
continue
start = r.get("start")
end = r.get("end")
word = self._normalize(r.get("word") or r.get("text") or "")
items.append({"label": label, "start": start, "end": end, "word": word})
found: List[Tuple[str, str]] = []
# If the pipeline provides character spans, merge consecutive/overlapping
# entities of the same type into a single span, then take the substring
# from the original text. This handles cases like subword tokenization
# where multiple adjacent pieces belong to the same named entity.
have_spans = any(i["start"] is not None and i["end"] is not None for i in items)
if have_spans:
spans = [
i for i in items if i["start"] is not None and i["end"] is not None
]
# Ensure processing order by start (then end)
spans.sort(key=lambda x: (x["start"], x["end"]))
merged = []
for s in spans:
if not merged:
merged.append(dict(s))
continue
last = merged[-1]
if s["label"] == last["label"] and s["start"] <= last["end"]:
# Merge identical, overlapping, or touching spans of same type
last["start"] = min(last["start"], s["start"])
last["end"] = max(last["end"], s["end"])
else:
merged.append(dict(s))
for m in merged:
surface = self._normalize(text[m["start"] : m["end"]])
if surface:
found.append((surface, m["label"]))
# Include any items lacking spans as-is (fallback)
for i in items:
if i["start"] is None or i["end"] is None:
if i["word"]:
found.append((i["word"], i["label"]))
else:
# Fallback when spans aren't provided: return normalized words
for i in items:
if i["word"]:
found.append((i["word"], i["label"]))
return found
def obfuscate_text(self, text: str) -> str:
if not text:
return text
entities = self._extract_entities(text)
if not entities:
return text
# Deduplicate per text, keep stable global mapping
unique_words: Dict[str, str] = {}
for word, label in entities:
if word not in self.entity_map:
replacement = self._next_id(label)
self.entity_map[word] = replacement
unique_words[word] = self.entity_map[word]
# Replace longer matches first to avoid partial overlaps
sorted_pairs = sorted(
unique_words.items(), key=lambda x: len(x[0]), reverse=True
)
def replace_once(s: str, old: str, new: str) -> str:
# Note: re.sub without a count replaces every occurrence of the escaped
# pattern; for stricter matching (e.g. person names), add word boundaries.
# This is a demo, so keep it simple.
pattern = re.escape(old)
return re.sub(pattern, new, s)
obfuscated = text
for old, new in sorted_pairs:
obfuscated = replace_once(obfuscated, old, new)
return obfuscated
def _build_gliner_model():
"""Create a GLiNER model for PII-like entity extraction.
Returns a tuple (model, labels) where model.predict_entities(text, labels)
yields entities with "text" and "label" fields.
"""
try:
from gliner import GLiNER # type: ignore
except Exception:
_log.error(
"GLiNER not installed. Please run: pip install gliner torch --extra-index-url https://download.pytorch.org/whl/cpu"
)
raise
model = GLiNER.from_pretrained(GLINER_MODEL)
# Curated set of labels for PII detection. Adjust as needed.
labels = [
# "work",
"booking number",
"personally identifiable information",
"driver licence",
"person",
"full address",
"company",
# "actor",
# "character",
"email",
"passport number",
"Social Security Number",
"phone number",
]
return model, labels
class AdvancedPIIObfuscator:
"""PII obfuscator powered by GLiNER with fine-grained labels.
- Uses GLiNER's `predict_entities(text, labels)` to detect entities.
- Obfuscates with stable IDs per fine-grained label, e.g. `email-1`.
"""
def __init__(self, gliner_model, labels: List[str]):
self.model = gliner_model
self.labels = labels
self.entity_map: Dict[str, str] = {}
self.counters: Dict[str, int] = {}
def _normalize(self, s: str) -> str:
return re.sub(r"\s+", " ", s).strip()
def _norm_label(self, label: str) -> str:
return (
re.sub(
r"[^a-z0-9_]+", "_", label.lower().replace(" ", "_").replace("-", "_")
).strip("_")
or "pii"
)
def _next_id(self, typ: str) -> str:
self.cc(typ)
self.counters[typ] += 1
return f"{typ}-{self.counters[typ]}"
def cc(self, typ: str) -> None:  # "create counter": ensure a counter exists for this type
if typ not in self.counters:
self.counters[typ] = 0
def _extract_entities(self, text: str) -> List[Tuple[str, str]]:
if not text:
return []
results = self.model.predict_entities(
text, self.labels
) # expects dicts with text/label
found: List[Tuple[str, str]] = []
for r in results:
label = self._norm_label(str(r.get("label", "pii")))
surface = self._normalize(str(r.get("text", "")))
if surface:
found.append((surface, label))
return found
def obfuscate_text(self, text: str) -> str:
if not text:
return text
entities = self._extract_entities(text)
if not entities:
return text
unique_words: Dict[str, str] = {}
for word, label in entities:
if word not in self.entity_map:
replacement = self._next_id(label)
self.entity_map[word] = replacement
unique_words[word] = self.entity_map[word]
sorted_pairs = sorted(
unique_words.items(), key=lambda x: len(x[0]), reverse=True
)
def replace_once(s: str, old: str, new: str) -> str:
pattern = re.escape(old)
return re.sub(pattern, new, s)
obfuscated = text
for old, new in sorted_pairs:
obfuscated = replace_once(obfuscated, old, new)
return obfuscated
def main():
logging.basicConfig(level=logging.INFO)
# --- Start of modifications for input/output handling ---
input_dir = Path("./input")
output_dir = Path("./output")
# Choose engine via CLI flag or env var (default: hf)
parser = argparse.ArgumentParser(description="PII obfuscation example")
parser.add_argument(
"--engine",
choices=["hf", "gliner"],
default=os.getenv("PII_ENGINE", "hf"),
help="NER engine: 'hf' (Transformers) or 'gliner' (GLiNER)",
)
args = parser.parse_args()
# Ensure output dir exists
output_dir.mkdir(parents=True, exist_ok=True)
_log.info(f"Output directory created/verified: {output_dir}")
# --- End of modifications for input/output handling ---
# Keep and generate images so Markdown can embed them
pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
doc_converter = DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
}
)
# Build NER pipeline and obfuscator
if args.engine == "gliner":
_log.info("Using GLiNER-based AdvancedPIIObfuscator")
gliner_model, gliner_labels = _build_gliner_model()
# Create a new obfuscator for each run to reset the ID counter
ObfuscatorClass = lambda: AdvancedPIIObfuscator(gliner_model, gliner_labels)
else:
_log.info("Using HF Transformers-based SimplePiiObfuscator")
ner = _build_simple_ner_pipeline()
# Create a new obfuscator for each run to reset the ID counter
ObfuscatorClass = lambda: SimplePiiObfuscator(ner)
# --- Start of modifications for recursive processing and output saving ---
# Recursively find all files in the input directory
input_files = [p for p in input_dir.rglob("*") if p.is_file()]
_log.info(f"Found {len(input_files)} files in {input_dir}")
for input_doc_path in input_files:
_log.info(f"Processing file: {input_doc_path}")
# Reset obfuscator for each file to ensure unique, sequential IDs per document
obfuscator = ObfuscatorClass()
try:
conv_res = doc_converter.convert(input_doc_path)
conv_doc = conv_res.document
# Use relative path from input folder to preserve directory structure in filename
relative_path = input_doc_path.relative_to(input_dir).with_suffix('')
# Clean up the path for use in a filename (replace slashes with underscores)
file_prefix = str(relative_path).replace(os.sep, '_')
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
# --- Perform PII Obfuscation on the document elements ---
for element, _level in conv_res.document.iterate_items():
if isinstance(element, TextItem):
element.orig = element.text
element.text = obfuscator.obfuscate_text(element.text)
elif isinstance(element, TableItem):
for cell in element.data.table_cells:
cell.text = obfuscator.obfuscate_text(cell.text)
# Save markdown with embedded pictures and obfuscated text
md_filename = output_dir / f"{file_prefix}_{timestamp}_obfuscated.md"
conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)
_log.info(f"Saved obfuscated output to: {md_filename}")
# Optional: log mapping summary
if obfuscator.entity_map:
data = []
for key, val in obfuscator.entity_map.items():
data.append([key, val])
_log.info(
f"Obfuscated entities for {input_doc_path.name}:\n\n{tabulate(data)}",
)
except Exception as e:
_log.error(f"Failed to process {input_doc_path}: {e}")
continue
# --- End of modifications for recursive processing and output saving ---
if __name__ == "__main__":
main()
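With the script saved as app-pii.py, PDFs placed in ./input and results written to ./output, the NER engine is selected with the --engine flag or the PII_ENGINE environment variable:
python app-pii.py                    # default: HF Transformers engine
python app-pii.py --engine gliner    # switch to the GLiNER engine
PII_ENGINE=gliner python app-pii.py  # same choice via environment variable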
- One of the important parts to personalize in the application is the curated label list passed to GLiNER:
model = GLiNER.from_pretrained(GLINER_MODEL)
# Curated set of labels for PII detection. Adjust as needed.
labels = [
# "work",
"booking number",
"personally identifiable information",
"driver licence",
"person",
"full address",
"company",
# "actor",
# "character",
"email",
"passport number",
"Social Security Number",
"phone number",
]
return model, labels
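To see how this label list drives detection on its own, here is a small standalone sketch (the sentence and contact details are made up); GLiNER matches whatever labels it receives at inference time:
from gliner import GLiNER

# Standalone label test; the text and contact details below are made up
model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")
text = "Contact John Doe at john.doe@example.com or +41 44 123 45 67."

for ent in model.predict_entities(text, ["person", "email", "phone number"]):
    # each entity is a dict with "text", "label", "score", "start" and "end"
    print(f'{ent["label"]:12s} -> {ent["text"]}')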
- The output we get on the console is 👇
> python app-pii.py
2025-11-19 11:27:11,720 - INFO - Output directory created/verified: output
2025-11-19 11:27:11,722 - INFO - Using HF Transformers-based SimplePiiObfuscator
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 59.0/59.0 [00:00<00:00, 189kB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 829/829 [00:00<00:00, 11.2MB/s]
vocab.txt: 213kB [00:00, 5.43MB/s]
added_tokens.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.00/2.00 [00:00<00:00, 18.0kB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 112/112 [00:00<00:00, 621kB/s]
model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 433M/433M [00:22<00:00, 18.9MB/s]
Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0
2025-11-19 11:27:37,501 - INFO - Found 1 files in input
2025-11-19 11:27:37,501 - INFO - Processing file: input/2206.01062.pdf
2025-11-19 11:27:37,502 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-11-19 11:27:37,538 - INFO - Going to convert document batch...
2025-11-19 11:27:37,538 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 02e213d66fe10d5cd7525796b8c0a9af
2025-11-19 11:27:37,546 - INFO - Loading plugin 'docling_defaults'
2025-11-19 11:27:37,547 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-11-19 11:27:37,550 - INFO - Loading plugin 'docling_defaults'
2025-11-19 11:27:37,553 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-11-19 11:27:46,451 - INFO - Auto OCR model selected ocrmac.
2025-11-19 11:27:46,456 - INFO - Accelerator device: 'mps'
2025-11-19 11:28:32,670 - INFO - Accelerator device: 'mps'
2025-11-19 11:28:32,930 - INFO - Processing document 2206.01062.pdf
2025-11-19 11:28:48,144 - INFO - Finished converting document 2206.01062.pdf in 70.64 sec.
2025-11-19 11:29:02,303 - INFO - Saved obfuscated output to: output/2206.01062_20251119_112848_obfuscated.md
2025-11-19 11:29:02,304 - INFO - Obfuscated entities for 2206.01062.pdf:
-------------------------------------------------------------- -----------
DocLayNet org-1
Birgit Pfitzman person-1
IBM Research org-2
Rueschlikon location-1
Switzerland location-2
Christoph Au person-2
Ahmed S. Nassa person-3
Michele Dolf person-4
Peter Staar person-5
KEYWORDS org-3
ACMR org-4
Birgit Pfitzmann person-6
Christoph Auer person-7
Michele Dolfi person-8
Ahmed S. Nassar person-9
M org-5
DD org-6
Washington location-3
DC location-4
USA location-5
ACM org-7
New York location-6
NY location-7
ABSTR org-8
PubLayNet org-9
DocBank org-10
ed org-11
COCO org-12
L org-13
KDD location-8
' org-14
KD location-9
D ' org-15
Washington, DC location-10
p org-16
PubMed org-17
Mask org-18
CNN org-19
Financial org-20
SEC org-21
AAPL org-22
AN org-23
MPA org-24
Val org-25
Fin person-10
Man location-11
Pat person-11
Corpus Conversion Service org-26
CCS org-27
CC org-28
DocB org-29
k org-30
MRCNN org-31
FRCNN org-32
YOLO org-33
MR org-34
R org-35
Re org-36
COCO API org-37
Fast org-38
C org-39
Text org-40
PLN org-41
DB org-42
DLN org-43
Max Göbel person-12
Tamir Hassan person-13
Ermelinda Oro person-14
Giorgio Orsi person-15
Icdar org-44
Christian Clausner person-16
Apostolos Antonacopoulos person-17
Stefan Pletschacher person-18
cdar org-45
ICDAR org-46
Hervé Déjean person-19
Jean-Luc Meunier person-20
Liangcai Gao person-21
Yilun Huang person-22
Yu Fang person-23
Florian Kleber person-24
Eva person-25
Maria Lang person-26
Antonio Jimeno Yepes person-27
Peter Zhong person-28
Douglas Burdick person-29
LNC org-47
SpringerVerlag org-48
Logan Markewich person-30
Hao Zhang person-31
Yubin Xing person-32
Navid Lambert person-33
Shirzad person-34
Jiang Zhexin person-35
Roy Lee person-36
Zhi Li person-37
Seok person-38
Bum Ko person-39
International Journal on Document Analysis and Recognition org-49
IJDAR org-50
Xu Zhong person-40
Jianbin Tang person-41
Antonio Jimeno person-42
Yep person-43
Minghao Li person-44
Yiheng Xu person-45
Lei Cui person-46
Shaohan Huang person-47
Furu Wei person-48
Zhoujun Li person-49
Ming Zhou person-50
Docbank org-51
International Committee on Comp org-52
Ling org-53
stics org-54
Riaz Ahmad person-51
Muhammad Tanvir Afzal person-52
M. Qadir person-53
ESWC org-55
Ross B. Girshick person-54
Jeff Donahue person-55
Trevor Darrell person-56
Jitendra Malik person-57
CVPR org-56
IEEE Computer Society org-57
ICCV org-58
Shaoqing Ren person-58
Kaiming He person-59
Ross Girshick person-60
Jian Sun person-61
IEEE Transactions on Pattern Analysis and Machine Intelligence org-59
Georgia Gkioxari person-62
Piotr Dollár person-63
Glenn Jocher person-64
Alex Stoken person-65
Ayush Chaurasia person-66
Jirka Borovec person-67
NanoCode org-60
TaoXie person-68
Yonghye Kwon person-69
Kalen Michael person-70
Liu Changyu person-71
Jiacong Fang person-72
Abhir person-73
V person-74
Laughing person-75
t person-76
y org-61
Piotr Skalski person-77
Adam Hogan person-78
Jebastin Nadar person-79
im person-80
Lorenzo Mammana person-81
Alex Wang person-82
Cristi Fati person-83
Diego Montes person-84
Jan Hajek person-85
Laurent person-86
O org-62
MODEL A org-63
Hunan location-12
B org-64
IJ org-65
da org-66
portob org-67
Diaconu person-87
Mai Thanh Minh person-88
Marc person-89
al person-90
Nicolas Carion person-91
Francisco Massa person-92
Gabriel Synnaeve person-93
Nicolas Usunier person-94
Alexander Kirillov person-95
Sergey Zagoruyko person-96
Co org-68
Mingxing Tan person-97
Ruoming Pang person-98
Q person-99
Le person-100
Tsung person-101
Yi Lin person-102
Michael Maire person-103
Serge J. Belongie person-104
Lubomir D. Bourdev person-105
James Hays person-106
Pietro Perona person-107
Deva Ramanan person-108
C. Lawrence Zitnick person-109
Microsoft org-69
Yuxin Wu person-110
Wan person-111
Yen Lo person-112
Nikolaos Livathinos person-113
Cesar Berrospi person-114
Maksym Lysak person-115
Viktor Kuropiatnyk person-116
Ahmed Nassar person-117
Andre Carvalho person-118
Kasper Dinkla person-119
Peter W. J. Staar person-120
AAAI org-70
D org-71
K org-72
Association for Computing Machinery org-73
Shoubin Li person-121
Xuyan Ma person-122
Shuaiqun Pan person-123
Jun Hu person-124
Lin Shi person-125
Qing Wang person-126
Peng Zhang person-127
Can Li person-128
Liang Qiao person-129
Zhanzhan Cheng person-130
Shiliang Pu person-131
Yi Niu person-132
Fei Wu person-133
Peter W J Staar person-134
Costas Bekas person-135
Connor Shorten person-136
Taghi M. Khoshgoftaar person-137
Journal of Big Data org-74
-------------------------------------------------------------- -----------
- And as I implemented a Markdown export, we can see the excerpt below. Note the leftover characters in some replacements (e.g. person-1n for Birgit Pfitzmann): the HF model's detected spans sometimes miss the last letters of a name, so those letters survive the substitution:
# org-1: A Large Human-Annotated Dataset for Document-Layout Analysis
person-1n org-2 location-1, location-2 bpf@zurich.ibm.com
person-2er org-2 location-1, location-2 cau@zurich.ibm.com
person-3r org-2
location-1, location-2 ahn@zurich.ibm.com
person-4i org-2 location-1, location-2 dol@zurich.ibm.com
person-5 org-2 location-1, location-2 taa@zurich.ibm.com
...
- The second implementation is the advanced version, using GLiNER for richer PII labels.
import logging
import os
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Tuple, Callable
# docling imports
from docling_core.types.doc import ImageRefMode, TableItem, TextItem
from tabulate import tabulate
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
_log = logging.getLogger(__name__)
IMAGE_RESOLUTION_SCALE = 2.0
GLINER_MODEL = "urchade/gliner_multi_pii-v1"
def _build_gliner_model():
"""Create a GLiNER model for PII-like entity extraction.
Returns a tuple (model, labels) where model.predict_entities(text, labels)
yields entities with "text" and "label" fields.
"""
try:
from gliner import GLiNER # type: ignore
except Exception:
_log.error(
"GLiNER not installed. Please run: pip install gliner torch --extra-index-url https://download.pytorch.org/whl/cpu"
)
raise
model = GLiNER.from_pretrained(GLINER_MODEL)
# Curated set of labels for PII detection. Adjust this list as needed.
labels = [
"booking number",
"personally identifiable information",
"driver licence",
"person",
"full address",
"company",
"email",
"passport number",
"Social Security Number",
"phone number",
]
return model, labels
class AdvancedPIIObfuscator:
"""PII obfuscator powered by GLiNER with fine-grained labels.
- Uses GLiNER's `predict_entities(text, labels)` to detect entities.
- Obfuscates with stable IDs per fine-grained label, e.g. `email-1`.
"""
def __init__(self, gliner_model, labels: List[str]):
self.model = gliner_model
self.labels = labels
self.entity_map: Dict[str, str] = {}
self.counters: Dict[str, int] = {}
def _normalize(self, s: str) -> str:
return re.sub(r"\s+", " ", s).strip()
def _norm_label(self, label: str) -> str:
# Converts labels like "full address" to "full_address"
return (
re.sub(
r"[^a-z0-9_]+", "_", label.lower().replace(" ", "_").replace("-", "_")
).strip("_")
or "pii"
)
def _next_id(self, typ: str) -> str:
self.cc(typ)
self.counters[typ] += 1
return f"{typ}-{self.counters[typ]}"
def cc(self, typ: str) -> None:  # "create counter": ensure a counter exists for this type
if typ not in self.counters:
self.counters[typ] = 0
def _extract_entities(self, text: str) -> List[Tuple[str, str]]:
if not text:
return []
# GLiNER entity prediction
results = self.model.predict_entities(
text, self.labels
) # expects dicts with text/label
found: List[Tuple[str, str]] = []
for r in results:
label = self._norm_label(str(r.get("label", "pii")))
surface = self._normalize(str(r.get("text", "")))
if surface:
found.append((surface, label))
return found
def obfuscate_text(self, text: str) -> str:
if not text:
return text
entities = self._extract_entities(text)
if not entities:
return text
# Map unique words/entities to stable IDs
unique_words: Dict[str, str] = {}
for word, label in entities:
if word not in self.entity_map:
replacement = self._next_id(label)
self.entity_map[word] = replacement
unique_words[word] = self.entity_map[word]
# Replace longer matches first to avoid partial overlaps
sorted_pairs = sorted(
unique_words.items(), key=lambda x: len(x[0]), reverse=True
)
def replace_once(s: str, old: str, new: str) -> str:
pattern = re.escape(old)
# Replace the entity with its stable ID
return re.sub(pattern, new, s)
obfuscated = text
for old, new in sorted_pairs:
obfuscated = replace_once(obfuscated, old, new)
return obfuscated
def main():
logging.basicConfig(level=logging.INFO)
input_dir = Path("./input")
output_dir = Path("./output")
# Ensure output dir exists
output_dir.mkdir(parents=True, exist_ok=True)
_log.info(f"Output directory created/verified: {output_dir}")
# --- GLiNER Model Setup (Always used) ---
_log.info("Setting up GLiNER-based AdvancedPIIObfuscator...")
gliner_model, gliner_labels = _build_gliner_model()
# Factory function to create a new obfuscator instance for each file
ObfuscatorFactory: Callable[[], AdvancedPIIObfuscator] = lambda: AdvancedPIIObfuscator(gliner_model, gliner_labels)
# Document Converter Setup
pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
doc_converter = DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
}
)
# Recursively find all files in the input directory
input_files = [p for p in input_dir.rglob("*") if p.is_file()]
_log.info(f"Found {len(input_files)} files in {input_dir}. Starting processing...")
for input_doc_path in input_files:
_log.info(f"Processing file: {input_doc_path}")
# Reset obfuscator for each file to ensure unique, sequential IDs per document type
obfuscator = ObfuscatorFactory()
try:
# 1. Convert Document
conv_res = doc_converter.convert(input_doc_path)
conv_doc = conv_res.document
# Prepare filename components
relative_path = input_doc_path.relative_to(input_dir).with_suffix('')
file_prefix = str(relative_path).replace(os.sep, '_')
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
# 2. Perform PII Obfuscation
for element, _level in conv_res.document.iterate_items():
if isinstance(element, TextItem):
element.orig = element.text
element.text = obfuscator.obfuscate_text(element.text)
elif isinstance(element, TableItem):
for cell in element.data.table_cells:
cell.text = obfuscator.obfuscate_text(cell.text)
# 3. Save Output
# Output filename format: [SourcePathPrefix]_[Timestamp]_obfuscated.md
md_filename = output_dir / f"{file_prefix}_{timestamp}_obfuscated.md"
conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)
_log.info(f"Saved obfuscated output to: {md_filename}")
# Optional: log mapping summary
if obfuscator.entity_map:
data = []
for key, val in obfuscator.entity_map.items():
data.append([key, val])
_log.info(
f"Obfuscated entities for {input_doc_path.name}:\n\n{tabulate(data)}",
)
except Exception as e:
_log.error(f"Failed to process {input_doc_path}: {e}")
continue
if __name__ == "__main__":
main()
- For the same document as input, we get the following output:
python AdvancedPIIObfuscator.py
2025-11-19 12:01:49,809 - INFO - Output directory created/verified: output
2025-11-19 12:01:49,809 - INFO - Setting up GLiNER-based AdvancedPIIObfuscator...
gliner_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 478/478 [00:00<00:00, 8.68MB/s]
README.md: 3.04kB [00:00, 21.1MB/s] | 0.00/478 [00:00<?, ?B/s]
.gitattributes: 1.52kB [00:00, 16.3MB/s]
pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.16G/1.16G [00:51<00:00, 22.5MB/s]
Fetching 4 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:52<00:00, 13.02s/it]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52.0/52.0 [00:00<00:00, 508kB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 579/579 [00:00<00:00, 2.23MB/s]
spm.model: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.31M/4.31M [00:00<00:00, 4.54MB/s]
/Users/alainairom/Devs/docling-pii/myenv/lib/python3.12/site-packages/transformers/convert_slow_tokenizer.py:559: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
warnings.warn(
2025-11-19 12:02:57,088 - INFO - Found 1 files in input. Starting processing...
2025-11-19 12:02:57,088 - INFO - Processing file: input/2206.01062.pdf
2025-11-19 12:02:57,090 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-11-19 12:02:57,115 - INFO - Going to convert document batch...
2025-11-19 12:02:57,115 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 02e213d66fe10d5cd7525796b8c0a9af
2025-11-19 12:02:57,122 - INFO - Loading plugin 'docling_defaults'
2025-11-19 12:02:57,123 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-11-19 12:02:57,125 - INFO - Loading plugin 'docling_defaults'
2025-11-19 12:02:57,127 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-11-19 12:03:04,036 - INFO - Auto OCR model selected ocrmac.
2025-11-19 12:03:04,038 - INFO - Accelerator device: 'mps'
2025-11-19 12:03:12,394 - INFO - Accelerator device: 'mps'
2025-11-19 12:03:12,666 - INFO - Processing document 2206.01062.pdf
2025-11-19 12:03:20,547 - INFO - Finished converting document 2206.01062.pdf in 23.46 sec.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
2025-11-19 12:03:45,543 - INFO - Saved obfuscated output to: output/2206.01062_20251119_120320_obfuscated.md
2025-11-19 12:03:45,543 - INFO - Obfuscated entities for 2206.01062.pdf:
-------------------------------- ----------
DocLayNet company-1
Birgit Pfitzmann person-1
IBM Research company-2
bpf@zurich.ibm.com email-1
Christoph Auer person-2
cau@zurich.ibm.com email-2
Ahmed S. Nassar person-3
ahn@zurich.ibm.com email-3
Michele Dolfi person-4
dol@zurich.ibm.com email-4
Peter Staar person-5
taa@zurich.ibm.com email-5
CCS CONCEPTS company-3
owner/author(s) person-6
KDD '22 person-7
ACM company-4
PubLayNet company-5
DocBank company-6
we person-8
We person-9
AAPL company-7
Val person-10
Man person-11
Corpus Conversion Service company-8
group of 40 dedicated annotators person-12
one annotator person-13
one proficient core team member person-14
staff person-15
32 annotators person-16
annotator staff person-17
human person-18
MRCNN company-9
FRCNN company-10
YOLO company-11
experienced annotation staff person-19
DLN company-12
PubLayNet (PLN) company-13
DocBank (DB) company-14
DocLayNet (DLN) company-15
Max Göbel person-20
Tamir Hassan person-21
Ermelinda Oro person-22
Giorgio Orsi person-23
Christian Clausner person-24
Apostolos Antonacopoulos person-25
Stefan Pletschacher person-26
Hervé Déjean person-27
Jean-Luc Meunier person-28
Liangcai Gao person-29
Yilun Huang person-30
Yu Fang person-31
Florian Kleber person-32
Eva-Maria Lang person-33
Antonio Jimeno Yepes person-34
Peter Zhong person-35
Douglas Burdick person-36
Logan Markewich person-37
Hao Zhang person-38
Yubin Xing person-39
Navid Lambert-Shirzad person-40
Jiang Zhexin person-41
Roy Lee person-42
Zhi Li person-43
Seok-Bum Ko person-44
Xu Zhong person-45
Jianbin Tang person-46
Antonio Jimeno-Yepes person-47
Publaynet company-16
Minghao Li person-48
Yiheng Xu person-49
Lei Cui person-50
Shaohan Huang person-51
Furu Wei person-52
Zhoujun Li person-53
Ming Zhou person-54
Riaz Ahmad person-55
Muhammad Tanvir Afzal person-56
M. Qadir person-57
Ross B. Girshick person-58
Jeff Donahue person-59
Trevor Darrell person-60
Jitendra Malik person-61
Shaoqing Ren person-62
Kaiming He person-63
Ross Girshick person-64
Jian Sun person-65
Georgia Gkioxari person-66
Piotr Dollár person-67
Glenn Jocher person-68
Alex Stoken person-69
Ayush Chaurasia person-70
Jirka Borovec person-71
NanoCode012 person-72
TaoXie person-73
Yonghye Kwon person-74
Kalen Michael person-75
Liu Changyu person-76
Jiacong Fang person-77
Abhiram V person-78
Laughing person-79
tkianai person-80
yxNONG person-81
Piotr Skalski person-82
Adam Hogan person-83
Jebastin Nadar person-84
imyhxy person-85
Lorenzo Mammana person-86
Alex Wang person-87
Cristi Fati person-88
Diego Montes person-89
Jan Hajek person-90
Laurentiu person-91
e person-92
ader creconbn person-93
nalo bonos person-94
sorne imomaban person-95
melan croune person-96
Bichater person-97
Diaconu person-98
Mai Thanh Minh person-99
Marc person-100
albinxavi person-101
fatih person-102
oleg person-103
wanghao yang person-104
Nicolas Carion person-105
Francisco Massa person-106
Gabriel Synnaeve person-107
Nicolas Usunier person-108
Alexander Kirillov person-109
Sergey Zagoruyko person-110
Mingxing Tan person-111
Ruoming Pang person-112
Quoc V. Le person-113
Tsung-Yi Lin person-114
Michael Maire person-115
Serge J. Belongie person-116
Lubomir D. Bourdev person-117
James Hays person-118
Pietro Perona person-119
Deva Ramanan person-120
C. Lawrence Zitnick person-121
Microsoft company-17
Yuxin Wu person-122
Wan-Yen Lo person-123
Nikolaos Livathinos person-124
Cesar Berrospi person-125
Maksym Lysak person-126
Viktor Kuropiatnyk person-127
Ahmed Nassar person-128
Andre Carvalho person-129
Kasper Dinkla person-130
Peter W. J. Staar person-131
Shoubin Li person-132
Xuyan Ma person-133
Shuaiqun Pan person-134
Jun Hu person-135
Lin Shi person-136
Qing Wang person-137
Peng Zhang person-138
Can Li person-139
Liang Qiao person-140
Zhanzhan Cheng person-141
Shiliang Pu person-142
Yi Niu person-143
Fei Wu person-144
Peter W J Staar person-145
Costas Bekas person-146
Connor Shorten person-147
Taghi M. Khoshgoftaar person-148
-------------------------------- ----------
- And the resulting Markdown is almost the same, although the GLiNER spans are noticeably cleaner (complete names and email addresses, no leftover characters):
## company-1: A Large Human-Annotated Dataset for Document-Layout Analysis
person-1 company-2 Rueschlikon, Switzerland email-1
person-2 company-2 Rueschlikon, Switzerland email-2
person-3 company-2
Rueschlikon, Switzerland email-3
person-4 company-2 Rueschlikon, Switzerland email-4
person-5 company-2 Rueschlikon, Switzerland email-5
What is GLiNER?
GLiNER (Generalist and Lightweight Model for Named Entity Recognition) is a cutting-edge Named Entity Recognition (NER) model designed to overcome the limitations of traditional NER systems and the resource demands of Large Language Models (LLMs).
It offers a powerful solution for flexible, custom entity extraction, which is why it is a great choice for detecting Personally Identifiable Information (PII), as seen in the code above.
Key Features of GLiNER
Zero-Shot Learning (Generalist Model):
- The Problem: Traditional NER models are limited to the entities they were explicitly trained on (e.g., PERSON, ORG, LOC). To recognize new entity types (like “passport number” or “booking ID”), you would typically have to gather thousands of examples and retrain the model.
- The GLiNER Solution: GLiNER is zero-shot, meaning you can feed it a list of custom entity labels (like the label list used in the GLiNER script above) and it will find those entities in the text without retraining. It matches text spans to the entity labels in a shared latent space.
Lightweight and Efficient:
- Unlike massive LLMs (like GPT-4), which are often slow and expensive to run at scale, GLiNER is a smaller, bidirectional transformer encoder (similar to BERT). This makes it much faster and capable of running efficiently on standard hardware, including CPUs, which is critical for high-volume data processing and edge deployment.
Parallel Entity Extraction:
- GLiNER processes the text and entity labels simultaneously, allowing for parallel extraction of entities. This is faster than the sequential, token-by-token generation process used by autoregressive LLMs.
In short, GLiNER gives you the flexibility of an LLM to define any entity type you want, combined with the speed and efficiency of a lightweight transformer model.
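A quick way to check this zero-shot behavior yourself is to pass labels chosen on the spot, as in this sketch (the booking number and text are invented):
from gliner import GLiNER

# Zero-shot check: labels are supplied at inference time, not fixed at training
model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")
text = "Your booking number is XJ-48213 and boarding starts in Zurich."

for ent in model.predict_entities(text, ["booking number", "city"]):
    print(ent["label"], "->", ent["text"])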
Conclusion
Ultimately, achieving comprehensive data privacy in the age of complex compliance like GDPR requires fusing advanced linguistic models with robust document processing. This is where the force of Docling becomes indispensable. By seamlessly ingesting and parsing a wide range of unstructured documents — from complex PDFs and digitized forms to embedded tables — Docling provides the structured text foundation necessary for deep analysis. When this is combined with the flexible, zero-shot capabilities of Named Entity Recognition (NER), specifically models like GLiNER, we create an automated, end-to-end pipeline capable of masterfully detecting and securely obfuscating PII, regardless of its location or format within the document. This powerful combination shifts PII protection from a brittle, rule-based chore to a scalable, high-accuracy technological safeguard, ensuring your compliance posture is both secure and future-proof.
Thanks for reading 🥂
Links
- GLiNER: https://github.com/urchade/GLiNER
- GLiNER Python Package: https://pypi.org/project/gliner/0.2.5/
- Docling Documentation: https://docling-project.github.io/docling/
- Detect and obfuscate PII: https://docling-project.github.io/docling/examples/pii_obfuscate/
- Docling GitHub Repository: https://github.com/docling-project
- Source Test Document for running the Code: https://github.com/docling-project/docling/blob/main/tests/data/pdf/2206.01062.pdf
- NER: https://www.ibm.com/think/topics/named-entity-recognition