Alain Airom

The Next Generation of Privacy: Using Docling & GLiNER’s Advanced NER to Masterfully Detect and Obfuscate PII

Validating PII Obfuscation using GLiNER-Powered Named Entity Recognition (NER) within Docling

Introduction

Many tools, both commercial and open-source, use Named Entity Recognition (NER) to detect and obscure Personally Identifiable Information (PII). Doing this rigorously is no longer just a best practice; it is a legal obligation, driven primarily by the European Union’s General Data Protection Regulation (GDPR). Adopted in 2016 and enforceable since May 2018, GDPR reshaped how organizations worldwide must handle the personal data of people in the EU, demanding data minimization, transparency, and high standards of security. Failure to protect information that could directly or indirectly identify an individual, such as names, addresses, and account numbers, exposes companies to severe penalties: up to €20 million or 4% of global annual turnover, whichever is higher. Consequently, automated PII obfuscation with NER has become essential, turning privacy compliance from a manual checklist item into a scalable, technological safeguard.

What is named entity recognition?

Named entity recognition (NER) — also called entity chunking or entity extraction — is a component of natural language processing (NLP) that identifies predefined categories of objects in a body of text.
These categories can include, but are not limited to, names of individuals, organizations, locations, expressions of times, quantities, medical codes, monetary values and percentages, among others. Essentially, NER is the process of taking a string of text (i.e., a sentence, paragraph or entire document), and identifying and classifying the entities that refer to each category.
When the term “NER” was coined at the Sixth Message Understanding Conference (MUC-6), the goal was to streamline information extraction tasks, which involved processing large amounts of unstructured text and identifying key information. Since then, NER has expanded and evolved, owing much of its evolution to advancements in machine learning and deep learning techniques.

The full article can be found here: https://www.ibm.com/think/topics/named-entity-recognition
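
To make the definition concrete, here is a minimal sketch of what an NER step produces: a list of spans, each with a surface text, a label, and character offsets into the input. The `toy_ner` function, the `Entity` tuple, and the two regex patterns are hypothetical stand-ins for a trained model, written only for illustration; they are not part of Docling, Transformers, or GLiNER.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    text: str    # surface form found in the input
    label: str   # category assigned to the span
    start: int   # character offset where the span begins
    end: int     # character offset just past the span


# Hypothetical patterns, for illustration only; a real NER model learns these.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "money": re.compile(r"\$\d+(?:\.\d{2})?"),
}


def toy_ner(text: str) -> List[Entity]:
    """Return every pattern match as a labeled span, ordered by position."""
    found = [
        Entity(m.group(), label, m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(found, key=lambda e: e.start)


sample = "Send the $25.00 invoice to jane.doe@example.com today."
for entity in toy_ner(sample):
    print(entity)
```

The offsets matter for obfuscation: they make it possible to recover the exact substring from the original text, which is what the script later in this article does when the pipeline provides `start` and `end`.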

How can Docling help with NER Obfuscation?

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

Features

  • 🗂️ Parsing of multiple document formats incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, VTT, images (PNG, TIFF, JPEG, …), and more
  • 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
  • 🧬 Unified, expressive DoclingDocument representation format
  • ↪️ Various export formats and options, including Markdown, HTML, DocTags and lossless JSON
  • 🔒 Local execution capabilities for sensitive data and air-gapped environments
  • 🤖 Plug-and-play integrations incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
  • 🔍 Extensive OCR support for scanned PDFs and images
  • 👓 Support of several Visual Language Models (GraniteDocling)
  • 🎙️ Audio support with Automatic Speech Recognition (ASR) models
  • 🔌 Connect to any agent using the MCP server
  • 💻 Simple and convenient CLI

What’s new

  • 📤 Structured information extraction [🧪 beta]
  • 📑 New layout model (Heron) by default, for faster PDF parsing
  • 🔌 MCP server for agentic applications
  • 💬 Parsing of Web Video Text Tracks (WebVTT) files

Coming soon

  • 📝 Metadata extraction, including title, authors, references & language
  • 📝 Chart understanding (Barchart, Piechart, LinePlot, etc)
  • 📝 Complex chemistry understanding (Molecular structures)

Testing the NER Functionality and Implementation

For my tests, which relate to a project I am working on, I started from the provided sample, which works out of the box, and simply adapted the code to my way of working.

  • Here are the steps to test and implement the code:
# To use the GLiNER package, prefer a slightly older Python version (3.12 here)
python3.12 -m venv myenv
source myenv/bin/activate

pip install docling
pip install transformers
pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
pip install gliner
  • A first NER implementation that does the following: converts each input PDF, runs either a Hugging Face token-classification pipeline or GLiNER to detect PII-like entities, replaces every occurrence in TextItem and TableItem elements with a stable, type-based ID (for example, person-1), and saves the result as Markdown with embedded images.

The sample input document is available here: https://github.com/docling-project/docling/blob/main/tests/data/pdf/2206.01062.pdf

import argparse
import logging
import os
import re
from datetime import datetime 
from pathlib import Path
from typing import Dict, List, Tuple

from docling_core.types.doc import ImageRefMode, TableItem, TextItem
from tabulate import tabulate

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

IMAGE_RESOLUTION_SCALE = 2.0
HF_MODEL = "dslim/bert-base-NER"  # Swap with another HF NER/PII model if desired, eg https://huggingface.co/urchade/gliner_multi_pii-v1 looks very promising too!
GLINER_MODEL = "urchade/gliner_multi_pii-v1"


def _build_simple_ner_pipeline():
    """Create a Hugging Face token-classification pipeline for NER.

    Returns a callable like: ner(text) -> List[dict]
    """
    try:
        from transformers import (
            AutoModelForTokenClassification,
            AutoTokenizer,
            pipeline,
        )
    except ImportError:
        _log.error("Transformers not installed. Please run: pip install transformers")
        raise

    tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)
    model = AutoModelForTokenClassification.from_pretrained(HF_MODEL)
    ner = pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",  # groups subwords into complete entities
        # Note: modern Transformers returns `start`/`end` when possible with aggregation
    )
    return ner


class SimplePiiObfuscator:
    """Tracks PII strings and replaces them with stable IDs per entity type."""

    def __init__(self, ner_callable):
        self.ner = ner_callable
        self.entity_map: Dict[str, str] = {}
        self.counters: Dict[str, int] = {
            "person": 0,
            "org": 0,
            "location": 0,
            "misc": 0,
        }
        # Map model labels to our coarse types
        self.label_map = {
            "PER": "person",
            "PERSON": "person",
            "ORG": "org",
            "ORGANIZATION": "org",
            "LOC": "location",
            "LOCATION": "location",
            "GPE": "location",
            # Fallbacks
            "MISC": "misc",
            "O": "misc",
        }
        # Only obfuscate these by default. Adjust as needed.
        self.allowed_types = {"person", "org", "location"}

    def _next_id(self, typ: str) -> str:
        self.counters[typ] += 1
        return f"{typ}-{self.counters[typ]}"

    def _normalize(self, s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()

    def _extract_entities(self, text: str) -> List[Tuple[str, str]]:
        """Run NER and return a list of (surface_text, type) to obfuscate."""
        if not text:
            return []
        results = self.ner(text)
        # Collect normalized items with optional span info
        items = []
        for r in results:
            raw_label = r.get("entity_group") or r.get("entity") or "MISC"
            label = self.label_map.get(raw_label, "misc")
            if label not in self.allowed_types:
                continue
            start = r.get("start")
            end = r.get("end")
            word = self._normalize(r.get("word") or r.get("text") or "")
            items.append({"label": label, "start": start, "end": end, "word": word})

        found: List[Tuple[str, str]] = []
        # If the pipeline provides character spans, merge consecutive/overlapping
        # entities of the same type into a single span, then take the substring
        # from the original text. This handles cases like subword tokenization
        # where multiple adjacent pieces belong to the same named entity.
        have_spans = any(i["start"] is not None and i["end"] is not None for i in items)
        if have_spans:
            spans = [
                i for i in items if i["start"] is not None and i["end"] is not None
            ]
            # Ensure processing order by start (then end)
            spans.sort(key=lambda x: (x["start"], x["end"]))

            merged = []
            for s in spans:
                if not merged:
                    merged.append(dict(s))
                    continue
                last = merged[-1]
                if s["label"] == last["label"] and s["start"] <= last["end"]:
                    # Merge identical, overlapping, or touching spans of same type
                    last["start"] = min(last["start"], s["start"])
                    last["end"] = max(last["end"], s["end"])
                else:
                    merged.append(dict(s))

            for m in merged:
                surface = self._normalize(text[m["start"] : m["end"]])
                if surface:
                    found.append((surface, m["label"]))

            # Include any items lacking spans as-is (fallback)
            for i in items:
                if i["start"] is None or i["end"] is None:
                    if i["word"]:
                        found.append((i["word"], i["label"]))
        else:
            # Fallback when spans aren't provided: return normalized words
            for i in items:
                if i["word"]:
                    found.append((i["word"], i["label"]))
        return found

    def obfuscate_text(self, text: str) -> str:
        if not text:
            return text

        entities = self._extract_entities(text)
        if not entities:
            return text

        # Deduplicate per text, keep stable global mapping
        unique_words: Dict[str, str] = {}
        for word, label in entities:
            if word not in self.entity_map:
                replacement = self._next_id(label)
                self.entity_map[word] = replacement
            unique_words[word] = self.entity_map[word]

        # Replace longer matches first to avoid partial overlaps
        sorted_pairs = sorted(
            unique_words.items(), key=lambda x: len(x[0]), reverse=True
        )

        def replace_all(s: str, old: str, new: str) -> str:
            # Plain substring replacement of every occurrence; for stricter
            # matching, use word boundaries when appropriate (e.g., names).
            # This is a demo, keep it simple.
            return s.replace(old, new)

        obfuscated = text
        for old, new in sorted_pairs:
            obfuscated = replace_all(obfuscated, old, new)
        return obfuscated


def _build_gliner_model():
    """Create a GLiNER model for PII-like entity extraction.

    Returns a tuple (model, labels) where model.predict_entities(text, labels)
    yields entities with "text" and "label" fields.
    """
    try:
        from gliner import GLiNER  # type: ignore
    except ImportError:
        _log.error(
            "GLiNER not installed. Please run: pip install gliner torch --extra-index-url https://download.pytorch.org/whl/cpu"
        )
        raise

    model = GLiNER.from_pretrained(GLINER_MODEL)
    # Curated set of labels for PII detection. Adjust as needed.
    labels = [
        # "work",
        "booking number",
        "personally identifiable information",
        "driver licence",
        "person",
        "full address",
        "company",
        # "actor",
        # "character",
        "email",
        "passport number",
        "Social Security Number",
        "phone number",
    ]
    return model, labels


class AdvancedPIIObfuscator:
    """PII obfuscator powered by GLiNER with fine-grained labels.

    - Uses GLiNER's `predict_entities(text, labels)` to detect entities.
    - Obfuscates with stable IDs per fine-grained label, e.g. `email-1`.
    """

    def __init__(self, gliner_model, labels: List[str]):
        self.model = gliner_model
        self.labels = labels
        self.entity_map: Dict[str, str] = {}
        self.counters: Dict[str, int] = {}

    def _normalize(self, s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()

    def _norm_label(self, label: str) -> str:
        return (
            re.sub(
                r"[^a-z0-9_]+", "_", label.lower().replace(" ", "_").replace("-", "_")
            ).strip("_")
            or "pii"
        )

    def _next_id(self, typ: str) -> str:
        # GLiNER labels are open-ended, so counters are created on first use.
        self.counters[typ] = self.counters.get(typ, 0) + 1
        return f"{typ}-{self.counters[typ]}"

    def _extract_entities(self, text: str) -> List[Tuple[str, str]]:
        if not text:
            return []
        results = self.model.predict_entities(
            text, self.labels
        )  # expects dicts with text/label
        found: List[Tuple[str, str]] = []
        for r in results:
            label = self._norm_label(str(r.get("label", "pii")))
            surface = self._normalize(str(r.get("text", "")))
            if surface:
                found.append((surface, label))
        return found

    def obfuscate_text(self, text: str) -> str:
        if not text:
            return text
        entities = self._extract_entities(text)
        if not entities:
            return text

        unique_words: Dict[str, str] = {}
        for word, label in entities:
            if word not in self.entity_map:
                replacement = self._next_id(label)
                self.entity_map[word] = replacement
            unique_words[word] = self.entity_map[word]

        sorted_pairs = sorted(
            unique_words.items(), key=lambda x: len(x[0]), reverse=True
        )

        def replace_all(s: str, old: str, new: str) -> str:
            # Plain substring replacement of every occurrence.
            return s.replace(old, new)

        obfuscated = text
        for old, new in sorted_pairs:
            obfuscated = replace_all(obfuscated, old, new)
        return obfuscated


def main():
    logging.basicConfig(level=logging.INFO)

    # --- Start of modifications for input/output handling ---
    input_dir = Path("./input")
    output_dir = Path("./output")

    # Choose engine via CLI flag or env var (default: hf)
    parser = argparse.ArgumentParser(description="PII obfuscation example")
    parser.add_argument(
        "--engine",
        choices=["hf", "gliner"],
        default=os.getenv("PII_ENGINE", "hf"),
        help="NER engine: 'hf' (Transformers) or 'gliner' (GLiNER)",
    )
    args = parser.parse_args()

    # Ensure output dir exists
    output_dir.mkdir(parents=True, exist_ok=True)
    _log.info(f"Output directory created/verified: {output_dir}")
    # --- End of modifications for input/output handling ---

    # Keep and generate images so Markdown can embed them
    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    # Build NER pipeline and obfuscator
    if args.engine == "gliner":
        _log.info("Using GLiNER-based AdvancedPIIObfuscator")
        gliner_model, gliner_labels = _build_gliner_model()
        # Create a new obfuscator for each run to reset the ID counter
        ObfuscatorClass = lambda: AdvancedPIIObfuscator(gliner_model, gliner_labels)
    else:
        _log.info("Using HF Transformers-based SimplePiiObfuscator")
        ner = _build_simple_ner_pipeline()
        # Create a new obfuscator for each run to reset the ID counter
        ObfuscatorClass = lambda: SimplePiiObfuscator(ner)

    # --- Start of modifications for recursive processing and output saving ---

    # Recursively find all files in the input directory
    input_files = [p for p in input_dir.rglob("*") if p.is_file()]
    _log.info(f"Found {len(input_files)} files in {input_dir}")

    for input_doc_path in input_files:
        _log.info(f"Processing file: {input_doc_path}")

        # Reset obfuscator for each file to ensure unique, sequential IDs per document
        obfuscator = ObfuscatorClass()

        try:
            conv_res = doc_converter.convert(input_doc_path)
            conv_doc = conv_res.document

            # Use relative path from input folder to preserve directory structure in filename
            relative_path = input_doc_path.relative_to(input_dir).with_suffix('')

            # Clean up the path for use in a filename (replace slashes with underscores)
            file_prefix = str(relative_path).replace(os.sep, '_')

            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

            # --- Perform PII Obfuscation on the document elements ---
            for element, _level in conv_res.document.iterate_items():
                if isinstance(element, TextItem):
                    element.orig = element.text
                    element.text = obfuscator.obfuscate_text(element.text)
                elif isinstance(element, TableItem):
                    for cell in element.data.table_cells:
                        cell.text = obfuscator.obfuscate_text(cell.text)

            # Save markdown with embedded pictures and obfuscated text
            md_filename = output_dir / f"{file_prefix}_{timestamp}_obfuscated.md"
            conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)
            _log.info(f"Saved obfuscated output to: {md_filename}")

            # Optional: log mapping summary
            if obfuscator.entity_map:
                data = []
                for key, val in obfuscator.entity_map.items():
                    data.append([key, val])
                _log.info(
                    f"Obfuscated entities for {input_doc_path.name}:\n\n{tabulate(data)}",
                )

        except Exception as e:
            _log.error(f"Failed to process {input_doc_path}: {e}")
            continue

    # --- End of modifications for recursive processing and output saving ---


if __name__ == "__main__":
    main()
  • One of the key parts to customize in the application is the list of labels:
    model = GLiNER.from_pretrained(GLINER_MODEL)
    # Curated set of labels for PII detection. Adjust as needed.
    labels = [
        # "work",
        "booking number",
        "personally identifiable information",
        "driver licence",
        "person",
        "full address",
        "company",
        # "actor",
        # "character",
        "email",
        "passport number",
        "Social Security Number",
        "phone number",
    ]
    return model, labels
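
To see how the label list drives detection, here is a small sketch that calls `predict_entities(text, labels)` and groups the detected texts by label. It assumes the model accepts a `threshold` keyword and returns dicts with `text`, `label`, and `score` keys, as GLiNER's README describes. `group_by_label` and `FakeModel` are hypothetical: the fake model only mimics the output shape, so the snippet runs without downloading any weights.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(model, text: str, labels: List[str], threshold: float = 0.5) -> Dict[str, List[str]]:
    """Call model.predict_entities and group the detected texts by label."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in model.predict_entities(text, labels, threshold=threshold):
        grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


class FakeModel:
    """Hypothetical stand-in that mimics the shape of GLiNER's output."""

    def predict_entities(self, text, labels, threshold=0.5):
        # Made-up candidates and scores, for illustration only.
        candidates = [
            {"text": "jane.doe@example.com", "label": "email", "score": 0.97},
            {"text": "Jane Doe", "label": "person", "score": 0.91},
            {"text": "Tuesday", "label": "booking number", "score": 0.32},
        ]
        return [
            c for c in candidates
            if c["label"] in labels and c["score"] >= threshold and c["text"] in text
        ]


text = "Jane Doe (jane.doe@example.com) arrives Tuesday"
print(group_by_label(FakeModel(), text, ["person", "email", "booking number"]))
```

With the real model, the object returned by `GLiNER.from_pretrained(GLINER_MODEL)` would take the place of `FakeModel()`. In the script above, this engine is selected with `--engine gliner` or by setting the `PII_ENGINE` environment variable to `gliner`.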
  • The output we get on the console is 👇
> python app-pii.py
2025-11-19 11:27:11,720 - INFO - Output directory created/verified: output
2025-11-19 11:27:11,722 - INFO - Using HF Transformers-based SimplePiiObfuscator
tokenizer_config.json: 100%|██████████| 59.0/59.0 [00:00<00:00, 189kB/s]
config.json: 100%|██████████| 829/829 [00:00<00:00, 11.2MB/s]
vocab.txt: 213kB [00:00, 5.43MB/s]
added_tokens.json: 100%|██████████| 2.00/2.00 [00:00<00:00, 18.0kB/s]
special_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 621kB/s]
model.safetensors: 100%|██████████| 433M/433M [00:22<00:00, 18.9MB/s]
Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0
2025-11-19 11:27:37,501 - INFO - Found 1 files in input
2025-11-19 11:27:37,501 - INFO - Processing file: input/2206.01062.pdf
2025-11-19 11:27:37,502 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-11-19 11:27:37,538 - INFO - Going to convert document batch...
2025-11-19 11:27:37,538 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 02e213d66fe10d5cd7525796b8c0a9af
2025-11-19 11:27:37,546 - INFO - Loading plugin 'docling_defaults'
2025-11-19 11:27:37,547 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-11-19 11:27:37,550 - INFO - Loading plugin 'docling_defaults'
2025-11-19 11:27:37,553 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-11-19 11:27:46,451 - INFO - Auto OCR model selected ocrmac.
2025-11-19 11:27:46,456 - INFO - Accelerator device: 'mps'
2025-11-19 11:28:32,670 - INFO - Accelerator device: 'mps'
2025-11-19 11:28:32,930 - INFO - Processing document 2206.01062.pdf
2025-11-19 11:28:48,144 - INFO - Finished converting document 2206.01062.pdf in 70.64 sec.
2025-11-19 11:29:02,303 - INFO - Saved obfuscated output to: output/2206.01062_20251119_112848_obfuscated.md
2025-11-19 11:29:02,304 - INFO - Obfuscated entities for 2206.01062.pdf:

--------------------------------------------------------------  -----------
DocLayNet                                                       org-1
Birgit Pfitzman                                                 person-1
IBM Research                                                    org-2
Rueschlikon                                                     location-1
Switzerland                                                     location-2
Christoph Au                                                    person-2
Ahmed S. Nassa                                                  person-3
Michele Dolf                                                    person-4
Peter Staar                                                     person-5
KEYWORDS                                                        org-3
ACMR                                                            org-4
Birgit Pfitzmann                                                person-6
Christoph Auer                                                  person-7
Michele Dolfi                                                   person-8
Ahmed S. Nassar                                                 person-9
M                                                               org-5
DD                                                              org-6
Washington                                                      location-3
DC                                                              location-4
USA                                                             location-5
ACM                                                             org-7
New York                                                        location-6
NY                                                              location-7
ABSTR                                                           org-8
PubLayNet                                                       org-9
DocBank                                                         org-10
ed                                                              org-11
COCO                                                            org-12
L                                                               org-13
KDD                                                             location-8
'                                                               org-14
KD                                                              location-9
D '                                                             org-15
Washington, DC                                                  location-10
p                                                               org-16
PubMed                                                          org-17
Mask                                                            org-18
CNN                                                             org-19
Financial                                                       org-20
SEC                                                             org-21
AAPL                                                            org-22
AN                                                              org-23
MPA                                                             org-24
Val                                                             org-25
Fin                                                             person-10
Man                                                             location-11
Pat                                                             person-11
Corpus Conversion Service                                       org-26
CCS                                                             org-27
CC                                                              org-28
DocB                                                            org-29
k                                                               org-30
MRCNN                                                           org-31
FRCNN                                                           org-32
YOLO                                                            org-33
MR                                                              org-34
R                                                               org-35
Re                                                              org-36
COCO API                                                        org-37
Fast                                                            org-38
C                                                               org-39
Text                                                            org-40
PLN                                                             org-41
DB                                                              org-42
DLN                                                             org-43
Max Göbel                                                       person-12
Tamir Hassan                                                    person-13
Ermelinda Oro                                                   person-14
Giorgio Orsi                                                    person-15
Icdar                                                           org-44
Christian Clausner                                              person-16
Apostolos Antonacopoulos                                        person-17
Stefan Pletschacher                                             person-18
cdar                                                            org-45
ICDAR                                                           org-46
Hervé Déjean                                                    person-19
Jean-Luc Meunier                                                person-20
Liangcai Gao                                                    person-21
Yilun Huang                                                     person-22
Yu Fang                                                         person-23
Florian Kleber                                                  person-24
Eva                                                             person-25
Maria Lang                                                      person-26
Antonio Jimeno Yepes                                            person-27
Peter Zhong                                                     person-28
Douglas Burdick                                                 person-29
LNC                                                             org-47
SpringerVerlag                                                  org-48
Logan Markewich                                                 person-30
Hao Zhang                                                       person-31
Yubin Xing                                                      person-32
Navid Lambert                                                   person-33
Shirzad                                                         person-34
Jiang Zhexin                                                    person-35
Roy Lee                                                         person-36
Zhi Li                                                          person-37
Seok                                                            person-38
Bum Ko                                                          person-39
International Journal on Document Analysis and Recognition      org-49
IJDAR                                                           org-50
Xu Zhong                                                        person-40
Jianbin Tang                                                    person-41
Antonio Jimeno                                                  person-42
Yep                                                             person-43
Minghao Li                                                      person-44
Yiheng Xu                                                       person-45
Lei Cui                                                         person-46
Shaohan Huang                                                   person-47
Furu Wei                                                        person-48
Zhoujun Li                                                      person-49
Ming Zhou                                                       person-50
Docbank                                                         org-51
International Committee on Comp                                 org-52
Ling                                                            org-53
stics                                                           org-54
Riaz Ahmad                                                      person-51
Muhammad Tanvir Afzal                                           person-52
M. Qadir                                                        person-53
ESWC                                                            org-55
Ross B. Girshick                                                person-54
Jeff Donahue                                                    person-55
Trevor Darrell                                                  person-56
Jitendra Malik                                                  person-57
CVPR                                                            org-56
IEEE Computer Society                                           org-57
ICCV                                                            org-58
Shaoqing Ren                                                    person-58
Kaiming He                                                      person-59
Ross Girshick                                                   person-60
Jian Sun                                                        person-61
IEEE Transactions on Pattern Analysis and Machine Intelligence  org-59
Georgia Gkioxari                                                person-62
Piotr Dollár                                                    person-63
Glenn Jocher                                                    person-64
Alex Stoken                                                     person-65
Ayush Chaurasia                                                 person-66
Jirka Borovec                                                   person-67
NanoCode                                                        org-60
TaoXie                                                          person-68
Yonghye Kwon                                                    person-69
Kalen Michael                                                   person-70
Liu Changyu                                                     person-71
Jiacong Fang                                                    person-72
Abhir                                                           person-73
V                                                               person-74
Laughing                                                        person-75
t                                                               person-76
y                                                               org-61
Piotr Skalski                                                   person-77
Adam Hogan                                                      person-78
Jebastin Nadar                                                  person-79
im                                                              person-80
Lorenzo Mammana                                                 person-81
Alex Wang                                                       person-82
Cristi Fati                                                     person-83
Diego Montes                                                    person-84
Jan Hajek                                                       person-85
Laurent                                                         person-86
O                                                               org-62
MODEL A                                                         org-63
Hunan                                                           location-12
B                                                               org-64
IJ                                                              org-65
da                                                              org-66
portob                                                          org-67
Diaconu                                                         person-87
Mai Thanh Minh                                                  person-88
Marc                                                            person-89
al                                                              person-90
Nicolas Carion                                                  person-91
Francisco Massa                                                 person-92
Gabriel Synnaeve                                                person-93
Nicolas Usunier                                                 person-94
Alexander Kirillov                                              person-95
Sergey Zagoruyko                                                person-96
Co                                                              org-68
Mingxing Tan                                                    person-97
Ruoming Pang                                                    person-98
Q                                                               person-99
Le                                                              person-100
Tsung                                                           person-101
Yi Lin                                                          person-102
Michael Maire                                                   person-103
Serge J. Belongie                                               person-104
Lubomir D. Bourdev                                              person-105
James Hays                                                      person-106
Pietro Perona                                                   person-107
Deva Ramanan                                                    person-108
C. Lawrence Zitnick                                             person-109
Microsoft                                                       org-69
Yuxin Wu                                                        person-110
Wan                                                             person-111
Yen Lo                                                          person-112
Nikolaos Livathinos                                             person-113
Cesar Berrospi                                                  person-114
Maksym Lysak                                                    person-115
Viktor Kuropiatnyk                                              person-116
Ahmed Nassar                                                    person-117
Andre Carvalho                                                  person-118
Kasper Dinkla                                                   person-119
Peter W. J. Staar                                               person-120
AAAI                                                            org-70
D                                                               org-71
K                                                               org-72
Association for Computing Machinery                             org-73
Shoubin Li                                                      person-121
Xuyan Ma                                                        person-122
Shuaiqun Pan                                                    person-123
Jun Hu                                                          person-124
Lin Shi                                                         person-125
Qing Wang                                                       person-126
Peng Zhang                                                      person-127
Can Li                                                          person-128
Liang Qiao                                                      person-129
Zhanzhan Cheng                                                  person-130
Shiliang Pu                                                     person-131
Yi Niu                                                          person-132
Fei Wu                                                          person-133
Peter W J Staar                                                 person-134
Costas Bekas                                                    person-135
Connor Shorten                                                  person-136
Taghi M. Khoshgoftaar                                           person-137
Journal of Big Data                                             org-74
--------------------------------------------------------------  -----------
  • Since I also implemented a Markdown export, here is an excerpt of the obfuscated output:
# org-1: A Large Human-Annotated Dataset for Document-Layout Analysis

person-1n org-2 location-1, location-2 bpf@zurich.ibm.com

person-2er org-2 location-1, location-2 cau@zurich.ibm.com

person-3r org-2

location-1, location-2 ahn@zurich.ibm.com

person-4i org-2 location-1, location-2 dol@zurich.ibm.com

person-5 org-2 location-1, location-2 taa@zurich.ibm.com
...
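In the excerpt above the e-mail addresses are still readable after obfuscation. One possible complement to NER for such rigidly structured identifiers is a plain regex pass. The sketch below is only an illustration: the pattern `EMAIL_RE` and the function `obfuscate_emails` are hypothetical names that do not appear in the original scripts, and the pattern is deliberately simple rather than a complete address validator.

```python
import re
from typing import Dict

# Deliberately simple pattern (an assumption, not a full RFC 5322 validator).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def obfuscate_emails(text: str, email_map: Dict[str, str]) -> str:
    """Replace each e-mail address with a stable ID such as `email-1`."""

    def _sub(match: re.Match) -> str:
        address = match.group(0)
        # Reuse the ID if this address was already seen in the document.
        if address not in email_map:
            email_map[address] = f"email-{len(email_map) + 1}"
        return email_map[address]

    return EMAIL_RE.sub(_sub, text)


seen: Dict[str, str] = {}
line = "person-1 org-2 bpf@zurich.ibm.com, cc cau@zurich.ibm.com"
print(obfuscate_emails(line, seen))
```

Passing the same `email_map` for every text element of a document keeps the IDs stable, in the same spirit as the `entity_map` used by the obfuscator classes.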
  • The second implementation is the advanced version, which uses GLiNER for richer PII labels.
import logging
import os
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Tuple, Callable

# docling imports
from docling_core.types.doc import ImageRefMode, TableItem, TextItem
from tabulate import tabulate

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

IMAGE_RESOLUTION_SCALE = 2.0
GLINER_MODEL = "urchade/gliner_multi_pii-v1"


def _build_gliner_model():
    """Create a GLiNER model for PII-like entity extraction.

    Returns a tuple (model, labels) where model.predict_entities(text, labels)
    yields entities with "text" and "label" fields.
    """
    try:
        from gliner import GLiNER  # type: ignore
    except Exception:
        _log.error(
            "GLiNER not installed. Please run: pip install gliner torch --extra-index-url https://download.pytorch.org/whl/cpu"
        )
        raise

    model = GLiNER.from_pretrained(GLINER_MODEL)
    # Curated set of labels for PII detection. Adjust this list as needed.
    labels = [
        "booking number",
        "personally identifiable information",
        "driver licence",
        "person",
        "full address",
        "company",
        "email",
        "passport number",
        "Social Security Number",
        "phone number",
    ]
    return model, labels


class AdvancedPIIObfuscator:
    """PII obfuscator powered by GLiNER with fine-grained labels.

    - Uses GLiNER's `predict_entities(text, labels)` to detect entities.
    - Obfuscates with stable IDs per fine-grained label, e.g. `email-1`.
    """

    def __init__(self, gliner_model, labels: List[str]):
        self.model = gliner_model
        self.labels = labels
        self.entity_map: Dict[str, str] = {}
        self.counters: Dict[str, int] = {}

    def _normalize(self, s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()

    def _norm_label(self, label: str) -> str:
        # Converts labels like "full address" to "full_address"
        return (
            re.sub(
                r"[^a-z0-9_]+", "_", label.lower().replace(" ", "_").replace("-", "_")
            ).strip("_")
            or "pii"
        )

    def _next_id(self, typ: str) -> str:
        self.cc(typ)
        self.counters[typ] += 1
        return f"{typ}-{self.counters[typ]}"

    def cc(self, typ: str) -> None:
        if typ not in self.counters:
            self.counters[typ] = 0

    def _extract_entities(self, text: str) -> List[Tuple[str, str]]:
        if not text:
            return []
        # GLiNER entity prediction
        results = self.model.predict_entities(
            text, self.labels
        )  # expects dicts with text/label
        found: List[Tuple[str, str]] = []
        for r in results:
            label = self._norm_label(str(r.get("label", "pii")))
            surface = self._normalize(str(r.get("text", "")))
            if surface:
                found.append((surface, label))
        return found

    def obfuscate_text(self, text: str) -> str:
        if not text:
            return text
        entities = self._extract_entities(text)
        if not entities:
            return text

        # Map unique words/entities to stable IDs
        unique_words: Dict[str, str] = {}
        for word, label in entities:
            if word not in self.entity_map:
                replacement = self._next_id(label)
                self.entity_map[word] = replacement
            unique_words[word] = self.entity_map[word]

        # Replace longer matches first to avoid partial overlaps
        sorted_pairs = sorted(
            unique_words.items(), key=lambda x: len(x[0]), reverse=True
        )

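        def replace_once(s: str, old: str, new: str) -> str:
            pattern = re.escape(old)
            # Despite the name, this replaces every occurrence of the entity
            # in `s` with its stable ID. The callable keeps `new` literal, so
            # backslashes in it are never read as regex group references.
            return re.sub(pattern, lambda _m: new, s)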

        obfuscated = text
        for old, new in sorted_pairs:
            obfuscated = replace_once(obfuscated, old, new)
        return obfuscated


def main():
    logging.basicConfig(level=logging.INFO)

    input_dir = Path("./input")
    output_dir = Path("./output")

    # Ensure output dir exists
    output_dir.mkdir(parents=True, exist_ok=True)
    _log.info(f"Output directory created/verified: {output_dir}")

    # --- GLiNER Model Setup (Always used) ---
    _log.info("Setting up GLiNER-based AdvancedPIIObfuscator...")
    gliner_model, gliner_labels = _build_gliner_model()
    # Factory function to create a new obfuscator instance for each file
    ObfuscatorFactory: Callable[[], AdvancedPIIObfuscator] = lambda: AdvancedPIIObfuscator(gliner_model, gliner_labels)

    # Document Converter Setup
    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    # Recursively find all files in the input directory
    input_files = [p for p in input_dir.rglob("*") if p.is_file()]
    _log.info(f"Found {len(input_files)} files in {input_dir}. Starting processing...")

    for input_doc_path in input_files:
        _log.info(f"Processing file: {input_doc_path}")

        # Reset obfuscator for each file to ensure unique, sequential IDs per document type
        obfuscator = ObfuscatorFactory()

        try:
            # 1. Convert Document
            conv_res = doc_converter.convert(input_doc_path)
            conv_doc = conv_res.document

            # Prepare filename components
            relative_path = input_doc_path.relative_to(input_dir).with_suffix('')
            file_prefix = str(relative_path).replace(os.sep, '_')
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

            # 2. Perform PII Obfuscation
            for element, _level in conv_res.document.iterate_items():
                if isinstance(element, TextItem):
                    element.orig = element.text
                    element.text = obfuscator.obfuscate_text(element.text)
                elif isinstance(element, TableItem):
                    for cell in element.data.table_cells:
                        cell.text = obfuscator.obfuscate_text(cell.text)

            # 3. Save Output
            # Output filename format: [SourcePathPrefix]_[Timestamp]_obfuscated.md
            md_filename = output_dir / f"{file_prefix}_{timestamp}_obfuscated.md"
            conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)
            _log.info(f"Saved obfuscated output to: {md_filename}")

            # Optional: log mapping summary
            if obfuscator.entity_map:
                data = []
                for key, val in obfuscator.entity_map.items():
                    data.append([key, val])
                _log.info(
                    f"Obfuscated entities for {input_doc_path.name}:\n\n{tabulate(data)}",
                )

        except Exception as e:
            _log.error(f"Failed to process {input_doc_path}: {e}")
            continue


if __name__ == "__main__":
    main()
  • For the same input document, we get the following output:
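The script above replaces every occurrence of a detected surface string inside a text element, so a very short detection (a single letter, for instance) would also rewrite unrelated words that contain it. A possible alternative, sketched here under the assumption that each prediction carries `start` and `end` character offsets next to `text` and `label` (the shape GLiNER's `predict_entities` returns), is to substitute only the detected spans. The names `replace_by_offsets` and `span` are hypothetical helpers for this illustration, and the label normalization is simplified compared with `_norm_label`.

```python
from typing import Dict, List


def replace_by_offsets(text: str, entities: List[dict],
                       entity_map: Dict[str, str],
                       counters: Dict[str, int]) -> str:
    """Replace only the detected spans, using their character offsets."""
    # Assign IDs in reading order so numbering follows the document.
    for ent in sorted(entities, key=lambda e: e["start"]):
        surface = text[ent["start"]:ent["end"]]
        if surface not in entity_map:
            label = ent["label"].lower().replace(" ", "_")
            counters[label] = counters.get(label, 0) + 1
            entity_map[surface] = f"{label}-{counters[label]}"

    # Substitute right to left so the offsets of earlier spans stay valid.
    out = text
    next_start = len(text)
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["end"] > next_start:
            continue  # overlaps a span that was already replaced; skip it
        surface = text[ent["start"]:ent["end"]]
        out = out[:ent["start"]] + entity_map[surface] + out[ent["end"]:]
        next_start = ent["start"]
    return out


def span(text: str, surface: str, label: str) -> dict:
    """Demo helper: build a prediction-shaped dict for the first match."""
    start = text.index(surface)
    return {"start": start, "end": start + len(surface),
            "text": surface, "label": label}


text = "Contact Peter Staar at taa@zurich.ibm.com."
found = [span(text, "Peter Staar", "person"),
         span(text, "taa@zurich.ibm.com", "email")]
print(replace_by_offsets(text, found, {}, {}))
```

The trade-off is that a name the model misses in one place is no longer caught just because it was detected elsewhere in the same element; combining both strategies is possible but not shown here.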
python AdvancedPIIObfuscator.py
2025-11-19 12:01:49,809 - INFO - Output directory created/verified: output
2025-11-19 12:01:49,809 - INFO - Setting up GLiNER-based AdvancedPIIObfuscator...
gliner_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 478/478 [00:00<00:00, 8.68MB/s]
README.md: 3.04kB [00:00, 21.1MB/s]                                                                                                                  | 0.00/478 [00:00<?, ?B/s]
.gitattributes: 1.52kB [00:00, 16.3MB/s]
pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.16G/1.16G [00:51<00:00, 22.5MB/s]
Fetching 4 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:52<00:00, 13.02s/it]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52.0/52.0 [00:00<00:00, 508kB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 579/579 [00:00<00:00, 2.23MB/s]
spm.model: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.31M/4.31M [00:00<00:00, 4.54MB/s]
/Users/alainairom/Devs/docling-pii/myenv/lib/python3.12/site-packages/transformers/convert_slow_tokenizer.py:559: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
  warnings.warn(
2025-11-19 12:02:57,088 - INFO - Found 1 files in input. Starting processing...
2025-11-19 12:02:57,088 - INFO - Processing file: input/2206.01062.pdf
2025-11-19 12:02:57,090 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-11-19 12:02:57,115 - INFO - Going to convert document batch...
2025-11-19 12:02:57,115 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 02e213d66fe10d5cd7525796b8c0a9af
2025-11-19 12:02:57,122 - INFO - Loading plugin 'docling_defaults'
2025-11-19 12:02:57,123 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-11-19 12:02:57,125 - INFO - Loading plugin 'docling_defaults'
2025-11-19 12:02:57,127 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-11-19 12:03:04,036 - INFO - Auto OCR model selected ocrmac.
2025-11-19 12:03:04,038 - INFO - Accelerator device: 'mps'
2025-11-19 12:03:12,394 - INFO - Accelerator device: 'mps'
2025-11-19 12:03:12,666 - INFO - Processing document 2206.01062.pdf
2025-11-19 12:03:20,547 - INFO - Finished converting document 2206.01062.pdf in 23.46 sec.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
2025-11-19 12:03:45,543 - INFO - Saved obfuscated output to: output/2206.01062_20251119_120320_obfuscated.md
2025-11-19 12:03:45,543 - INFO - Obfuscated entities for 2206.01062.pdf:

--------------------------------  ----------
DocLayNet                         company-1
Birgit Pfitzmann                  person-1
IBM Research                      company-2
bpf@zurich.ibm.com                email-1
Christoph Auer                    person-2
cau@zurich.ibm.com                email-2
Ahmed S. Nassar                   person-3
ahn@zurich.ibm.com                email-3
Michele Dolfi                     person-4
dol@zurich.ibm.com                email-4
Peter Staar                       person-5
taa@zurich.ibm.com                email-5
CCS CONCEPTS                      company-3
owner/author(s)                   person-6
KDD '22                           person-7
ACM                               company-4
PubLayNet                         company-5
DocBank                           company-6
we                                person-8
We                                person-9
AAPL                              company-7
Val                               person-10
Man                               person-11
Corpus Conversion Service         company-8
group of 40 dedicated annotators  person-12
one annotator                     person-13
one proficient core team member   person-14
staff                             person-15
32 annotators                     person-16
annotator staff                   person-17
human                             person-18
MRCNN                             company-9
FRCNN                             company-10
YOLO                              company-11
experienced annotation staff      person-19
DLN                               company-12
PubLayNet (PLN)                   company-13
DocBank (DB)                      company-14
DocLayNet (DLN)                   company-15
Max Göbel                         person-20
Tamir Hassan                      person-21
Ermelinda Oro                     person-22
Giorgio Orsi                      person-23
Christian Clausner                person-24
Apostolos Antonacopoulos          person-25
Stefan Pletschacher               person-26
Hervé Déjean                      person-27
Jean-Luc Meunier                  person-28
Liangcai Gao                      person-29
Yilun Huang                       person-30
Yu Fang                           person-31
Florian Kleber                    person-32
Eva-Maria Lang                    person-33
Antonio Jimeno Yepes              person-34
Peter Zhong                       person-35
Douglas Burdick                   person-36
Logan Markewich                   person-37
Hao Zhang                         person-38
Yubin Xing                        person-39
Navid Lambert-Shirzad             person-40
Jiang Zhexin                      person-41
Roy Lee                           person-42
Zhi Li                            person-43
Seok-Bum Ko                       person-44
Xu Zhong                          person-45
Jianbin Tang                      person-46
Antonio Jimeno-Yepes              person-47
Publaynet                         company-16
Minghao Li                        person-48
Yiheng Xu                         person-49
Lei Cui                           person-50
Shaohan Huang                     person-51
Furu Wei                          person-52
Zhoujun Li                        person-53
Ming Zhou                         person-54
Riaz Ahmad                        person-55
Muhammad Tanvir Afzal             person-56
M. Qadir                          person-57
Ross B. Girshick                  person-58
Jeff Donahue                      person-59
Trevor Darrell                    person-60
Jitendra Malik                    person-61
Shaoqing Ren                      person-62
Kaiming He                        person-63
Ross Girshick                     person-64
Jian Sun                          person-65
Georgia Gkioxari                  person-66
Piotr Dollár                      person-67
Glenn Jocher                      person-68
Alex Stoken                       person-69
Ayush Chaurasia                   person-70
Jirka Borovec                     person-71
NanoCode012                       person-72
TaoXie                            person-73
Yonghye Kwon                      person-74
Kalen Michael                     person-75
Liu Changyu                       person-76
Jiacong Fang                      person-77
Abhiram V                         person-78
Laughing                          person-79
tkianai                           person-80
yxNONG                            person-81
Piotr Skalski                     person-82
Adam Hogan                        person-83
Jebastin Nadar                    person-84
imyhxy                            person-85
Lorenzo Mammana                   person-86
Alex Wang                         person-87
Cristi Fati                       person-88
Diego Montes                      person-89
Jan Hajek                         person-90
Laurentiu                         person-91
e                                 person-92
ader creconbn                     person-93
nalo bonos                        person-94
sorne imomaban                    person-95
melan croune                      person-96
Bichater                          person-97
Diaconu                           person-98
Mai Thanh Minh                    person-99
Marc                              person-100
albinxavi                         person-101
fatih                             person-102
oleg                              person-103
wanghao yang                      person-104
Nicolas Carion                    person-105
Francisco Massa                   person-106
Gabriel Synnaeve                  person-107
Nicolas Usunier                   person-108
Alexander Kirillov                person-109
Sergey Zagoruyko                  person-110
Mingxing Tan                      person-111
Ruoming Pang                      person-112
Quoc V. Le                        person-113
Tsung-Yi Lin                      person-114
Michael Maire                     person-115
Serge J. Belongie                 person-116
Lubomir D. Bourdev                person-117
James Hays                        person-118
Pietro Perona                     person-119
Deva Ramanan                      person-120
C. Lawrence Zitnick               person-121
Microsoft                         company-17
Yuxin Wu                          person-122
Wan-Yen Lo                        person-123
Nikolaos Livathinos               person-124
Cesar Berrospi                    person-125
Maksym Lysak                      person-126
Viktor Kuropiatnyk                person-127
Ahmed Nassar                      person-128
Andre Carvalho                    person-129
Kasper Dinkla                     person-130
Peter W. J. Staar                 person-131
Shoubin Li                        person-132
Xuyan Ma                          person-133
Shuaiqun Pan                      person-134
Jun Hu                            person-135
Lin Shi                           person-136
Qing Wang                         person-137
Peng Zhang                        person-138
Can Li                            person-139
Liang Qiao                        person-140
Zhanzhan Cheng                    person-141
Shiliang Pu                       person-142
Yi Niu                            person-143
Fei Wu                            person-144
Peter W J Staar                   person-145
Costas Bekas                      person-146
Connor Shorten                    person-147
Taghi M. Khoshgoftaar             person-148
--------------------------------  ----------
  • The resulting Markdown is similar, except that the e-mail addresses are now obfuscated, while the city and country, which match none of the configured labels, are left in place:
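The mapping above also contains entries such as "we", "e" and "human", which are clearly not PII. One hedged way to reduce such false positives is to post-filter the predictions before building the mapping. The sketch below assumes each prediction is a dict with `text` and `score` keys, as returned by GLiNER's `predict_entities`; the function name `filter_entities`, the stopword set and both cut-off values are illustrative assumptions, not tuned settings.

```python
from typing import Iterable, List, Set

# Illustrative stopword set, taken from false positives in the table above.
STOPWORDS: Set[str] = {"we", "human", "staff"}


def filter_entities(entities: Iterable[dict],
                    min_score: float = 0.6,
                    min_chars: int = 3,
                    stopwords: Set[str] = STOPWORDS) -> List[dict]:
    """Keep only predictions that look like plausible PII."""
    kept = []
    for ent in entities:
        surface = ent["text"].strip()
        if ent.get("score", 0.0) < min_score:
            continue  # model was not confident enough
        if len(surface) < min_chars:
            continue  # fragments such as "e"
        if surface.lower() in stopwords:
            continue  # pronouns and generic nouns
        kept.append(ent)
    return kept


predictions = [
    {"text": "Peter Staar", "label": "person", "score": 0.95},
    {"text": "we", "label": "person", "score": 0.71},
    {"text": "e", "label": "person", "score": 0.66},
    {"text": "YOLO", "label": "company", "score": 0.41},
]
print([p["text"] for p in filter_entities(predictions)])
```

The scores in `predictions` are made up for the demo. `predict_entities` also accepts a `threshold` argument, so raising it is a simpler first step; note that a length filter can drop legitimate short names, so it should be checked against real documents.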
## company-1: A Large Human-Annotated Dataset for Document-Layout Analysis

person-1 company-2 Rueschlikon, Switzerland email-1

person-2 company-2 Rueschlikon, Switzerland email-2

person-3 company-2

Rueschlikon, Switzerland email-3

person-4 company-2 Rueschlikon, Switzerland email-4

person-5 company-2 Rueschlikon, Switzerland email-5

What is GLiNER?

GLiNER (Generalist and Lightweight Model for Named Entity Recognition) is a cutting-edge Named Entity Recognition (NER) model designed to overcome the limitations of traditional NER systems and the resource demands of Large Language Models (LLMs).

It offers a powerful solution for flexible, custom entity extraction, which is why it is a good fit for detecting Personally Identifiable Information (PII), as seen in the code above.

Key Features of GLiNER

Zero-Shot Learning (Generalist Model):

  • The Problem: Traditional NER models are limited to the entities they were explicitly trained on (e.g., PERSON, ORG, LOC). To recognize new entity types (like “passport number” or “booking ID”), you would typically have to gather thousands of examples and retrain the model.
  • The GLiNER Solution: GLiNER is zero-shot, meaning you can feed it a list of custom entity labels (like the labels list returned by _build_gliner_model() in the script above) and it will find those entities in the text without retraining. It matches text spans to the entity labels in a shared latent space.
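To make the "shared latent space" idea more concrete, here is a toy sketch of span-to-label matching: each candidate span and each label is a vector, and a span is assigned a label when the sigmoid of their dot product passes a threshold. This follows the scoring idea as I understand it from the GLiNER design, but every vector below is invented by hand for illustration; in the real model both sides are produced by the transformer encoder, and `match_spans` is a hypothetical name.

```python
import math
from typing import Dict, List, Tuple


def match_spans(span_vecs: Dict[str, List[float]],
                label_vecs: Dict[str, List[float]],
                threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Assign each span its best-scoring label if the score is high enough."""
    results = []
    for span, sv in span_vecs.items():
        best_label, best_score = None, 0.0
        for label, lv in label_vecs.items():
            dot = sum(a * b for a, b in zip(sv, lv))
            score = 1.0 / (1.0 + math.exp(-dot))  # sigmoid of the dot product
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= threshold:
            results.append((span, best_label, round(best_score, 3)))
    return results


# Hand-made 2-d vectors (pure assumptions): axis 0 ~ "person", axis 1 ~ "email".
spans = {
    "Birgit Pfitzmann": [3.0, 0.2],
    "bpf@zurich.ibm.com": [0.1, 3.0],
    "dataset": [-2.0, -2.0],
}
labels = {"person": [1.0, 0.0], "email": [0.0, 1.0]}
for row in match_spans(spans, labels):
    print(row)
```

Because labels are just additional vectors, a new entity type can be added at call time without touching the span side, which is the intuition behind the zero-shot behavior described above.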

Lightweight and Efficient:

  • Unlike massive LLMs (like GPT-4), which are often slow and expensive to run at scale, GLiNER is a smaller, bidirectional transformer encoder (similar to BERT).

  • This makes it much faster and capable of running efficiently on standard hardware, including CPUs, which is critical for high-volume data processing and edge deployment.

Parallel Entity Extraction:

  • GLiNER processes the text and entity labels simultaneously, allowing for parallel extraction of entities. This is faster than the sequential, token-by-token generation process used by autoregressive LLMs.

In short, GLiNER gives you the flexibility of an LLM to define any entity type you want, combined with the speed and efficiency of a lightweight transformer model.

Conclusion

Ultimately, achieving comprehensive data privacy in the age of complex compliance like GDPR requires fusing advanced linguistic models with robust document processing. This is where the force of Docling becomes indispensable. By seamlessly ingesting and parsing a wide range of unstructured documents — from complex PDFs and digitized forms to embedded tables — Docling provides the structured text foundation necessary for deep analysis. When this is combined with the flexible, zero-shot capabilities of Named Entity Recognition (NER), specifically models like GLiNER, we create an automated, end-to-end pipeline capable of masterfully detecting and securely obfuscating PII, regardless of its location or format within the document. This powerful combination shifts PII protection from a brittle, rule-based chore to a scalable, high-accuracy technological safeguard, ensuring your compliance posture is both secure and future-proof.

Thanks for reading 🥂
