DEV Community: Athroniaeth

How to let GPT-5.5 proofread a CV without ever showing it a single piece of personal data

Athroniaeth — Wed, 27 May 2026 16:09:06 +0000

TLDR

Before you send out an important CV, you can hand it to an LLM for proofreading. A few seconds later, you have a list of mistakes. Except you've also just handed your name, your address, your employers and your dates to a third-party service.

piighost-proofreader fixes that. The CV is anonymized locally before the LLM call, and the corrections find their way back onto the right word in the original PDF.

The LLM never sees a name, a date, an address.

Anonymization is the easy part. The painful bit is finding, back in the PDF, a word the LLM only ever saw as Markdown. And the LLM and PyMuPDF don't tokenize the same way.

1. Why not just a regex?

First idea: before sending the CV to the LLM, you replace the sensitive data with one big regex. That works for emails and phone numbers, which have a recognizable format. For everything else, forget it.

A name has no distinctive syntactic shape. Paul Martin looks like any two capitalized words; nothing in the text tells a regex it's a name.
Orange is a company. It's also a fruit. Mars, Apple, Carrefour, same story.
A date in a CV can be a birth, a degree, a job change. The format is identical.
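To make that limit concrete, here is a minimal sketch. It is not code from the project, and both patterns are deliberately simplified assumptions (the phone pattern only covers common French formats). It shows a regex pass catching the email and the phone number while leaving the name and the employer untouched.

```python
import re

# Illustrative patterns only; production-grade email/phone matching needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?:\+33\s?|0)[1-9](?:[\s.]?\d{2}){4}")


def redact_structured(text: str) -> str:
    """Replace emails and French-style phone numbers with fixed tags."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


cv_line = "Paul Martin, paul.martin@example.com, 06 12 34 56 78, worked at Orange"
redacted = redact_structured(cv_line)
# The email and the phone number are replaced, but "Paul Martin" and "Orange"
# survive: no pattern separates them from ordinary capitalized words.
```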

You need a trained detector, not a pattern. piighost provides one, and the call looks like this:

# src/proofreader/anonymize.py
async def anonymize(self, text: str, *, thread_id: str) -> str:
    return await self._call(
        "/v1/anonymize", text, thread_id, response_key="anonymized_text"
    )

The thread_id is a UUID per CV. The entity→placeholder mapping stays server-side, scoped by that ID: the same name becomes the same placeholder on every occurrence.

2. Streaming the mistakes with `instructor`

A two-page CV holds a good fifteen mistakes, and the LLM takes several seconds to spit them out. Without streaming, the user stares at a loader the whole time. With it, the mistakes show up one by one as the model emits them.

The catch: most structured-output libs (LangChain with_structured_output, OpenAI Functions, Pydantic AI) return the complete result. You ask for a list[Mistake], you get the whole list once inference is done. No object-by-object granularity.

instructor is built for exactly this. Its create_iterable method parses the LLM's streamed JSON on the fly and yields each pydantic object as soon as it's complete:

# src/proofreader/llm.py
client = instructor.from_litellm(litellm.acompletion)
response = client.chat.completions.create_iterable(
    model=model,
    response_model=Mistake,   # a single object, not list[Mistake]
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT_STREAM.format(language=language)},
        {"role": "user", "content": markdown},
    ],
)
async for mistake in response:
    yield mistake

Two complications that aren't obvious:

The prompt changes with the mode. For LangChain's with_structured_output, you ask the LLM to return a wrapper object with a list of Mistakes inside. For create_iterable, you ask it to emit a single Mistake JSON per generation turn. The two prompts aren't quite the same. The project keeps both side by side: LangChain for the one-shot Streamlit path, instructor for the FastAPI streaming path.
The SSE streaming downstream. Each Mistake emitted is immediately repackaged into a Server-Sent Events event on the FastAPI side, then pushed to the frontend. The locator from the next section runs per-Mistake, so the user watches each red rectangle pop up as it goes, not all at once at the end.
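The repackaging step can be sketched without any web framework. This is an assumption-laden illustration, not the project's code: `Mistake` is a plain dataclass here (the real model is a pydantic class), and the `mistake` event name and `to_sse` helper are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Mistake:
    # Field names follow the article; the types are assumed.
    error_text: str
    correction: str
    context_before: str
    description: str


def to_sse(mistake: Mistake, *, event: str = "mistake") -> str:
    """Serialize one mistake as a single Server-Sent Events frame."""
    payload = json.dumps(asdict(mistake), ensure_ascii=False)
    # An SSE frame is a set of "field: value" lines terminated by a blank line.
    return f"event: {event}\ndata: {payload}\n\n"


frame = to_sse(Mistake("recieve", "receive", "I will", "Spelling"))
```

Because `json.dumps` emits a single line by default, the payload fits in one `data:` field; a multi-line payload would need one `data:` line per line of content.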

3. Back onto the PDF: four fallback strategies

For each Mistake instructor yields, I have an error_text, a correction, a context_before, and a description. The LLM, though, never saw a single pixel of the PDF: it worked on the extracted Markdown. No field carries coordinates.

But the user wants to see the corrections on the original PDF, not flat text on a results page. So for each mistake, I have to locate the word in the PDF.

On the PDF side, I use PyMuPDF, which gives me a word stream: the list of every word on the page with its bbox (rectangles in points). The problem becomes: find the window [word1, word2, …] in that list. Except the LLM and PyMuPDF don't tokenize the same way, the typographic apostrophes don't line up, and on a two-column CV the LLM sometimes hallucinates its context_before.
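For reference, a word stream can be modeled as below. The `Word` dataclass and `words_from_tuples` helper are hypothetical (the project's own `Word` type is not shown in this article). The only assumption about PyMuPDF is the documented tuple shape returned by `page.get_text("words")`. The sample data stands in for a real page so the sketch runs without a PDF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points
    page: int


def words_from_tuples(raw: list[tuple], *, page: int) -> list[Word]:
    """Convert PyMuPDF word tuples into Word records.

    page.get_text("words") yields tuples shaped like
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    return [Word(text=t[4], bbox=(t[0], t[1], t[2], t[3]), page=page) for t in raw]


# With PyMuPDF installed, the raw tuples would come from something like:
#   import pymupdf  # older releases: import fitz
#   raw = pymupdf.open("cv.pdf")[0].get_text("words")
sample = [
    (72.0, 90.0, 101.5, 102.0, "Managed", 0, 0, 0),
    (104.0, 90.0, 112.0, 102.0, "a", 0, 0, 1),
    (115.0, 90.0, 140.0, 102.0, "team", 0, 0, 2),
]
words = words_from_tuples(sample, page=0)
```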

Hence four strategies tried in order. Each one catches a case the previous can't handle:

# src/proofreader/locator.py
def locate_mistake(mistake: Mistake, *, words: list[Word]) -> LocatedMistake | None:
    err_tokens = mistake.error_text.split()
    if not err_tokens:
        return None
    ctx_tokens = mistake.context_before.split()

    # Strategy 1: strict whole-word match.
    matched = _match_window(ctx_tokens, err_tokens, words, normalize=False)
    if matched is not None:
        return _build_located(mistake, matched)

    # Strategy 2: punctuation-tolerant (casefold + ASCII quotes + strip punct).
    matched = _match_window(ctx_tokens, err_tokens, words, normalize=True)
    if matched is not None:
        return _build_located(mistake, matched)

    # Strategy 3: error_text alone if it appears exactly once on the page.
    # Catches LLM context drift in multi-column layouts.
    matched = _find_error_alone_if_unique(err_tokens, words)
    if matched is not None:
        return _build_located(mistake, matched)

    # Strategy 4: substring of the concatenated normalised stream. Handles LLM
    # tokenisation drift like `d'une` → `d' + une`, where the standalone word
    # has no PyMuPDF token equivalent.
    matched = _find_error_as_substring_if_unique(err_tokens, words)
    if matched is not None:
        return _build_located(mistake, matched)

    return None

Why this exact order:

Strict. The context_before + error_text window matches word for word, no normalization. The happy case: the LLM quotes the PDF perfectly, exact match, zero ambiguity.
Tolerant. The LLM capitalizes the first word of a sentence, or replaces a straight ' with a typographic ’. _normalize casefolds everything, replaces curly quotes and apostrophes with their ASCII version, and strips the punctuation PyMuPDF glues onto tokens.
Error-only unique. On two-column CVs, the context_before the LLM produces is sometimes lifted from the wrong column (models linearize multi-column layouts clumsily). If error_text appears exactly once on the page, take it, context be damned. That's enough in the vast majority of cases.
Substring of the concatenated stream. Nasty case: PyMuPDF splits words on whitespace, so d'une is a single token in the PDF, but the LLM may treat it as d' + une and return error_text="une" as a standalone word with no matching PyMuPDF token. Fix: concatenate all the page's tokens into a single string and search by substring. We gate on _MIN_SUBSTRING_CHARS = 5, because without it an error_text="une" shows up inside commune, lacune, tribune. Cue the false positives.
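The helpers behind these strategies are not shown in the article, so here is a rough sketch of how the window match (strategies 1 and 2) and the unique-substring fallback (strategy 4) could work. Every function below is a hypothetical reconstruction working on plain strings, not the project's implementation. In particular, `compact` (which drops everything except letters and digits) is an assumption about how the concatenated stream might be normalised.

```python
import string
from bisect import bisect_left, bisect_right
from itertools import accumulate

_MIN_SUBSTRING_CHARS = 5  # same threshold as quoted in the article
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize(token: str) -> str:
    """Map curly quotes to ASCII, casefold, strip punctuation glued to the ends."""
    return token.translate(_QUOTES).casefold().strip(string.punctuation)


def compact(token: str) -> str:
    """Keep only casefolded letters and digits so token boundaries stop mattering."""
    return "".join(ch for ch in token.casefold() if ch.isalnum())


def match_window(ctx, err, words, *, normalized):
    """Return (start, end) word indexes of err when ctx + err matches contiguously."""
    prep = normalize if normalized else str
    target = [prep(t) for t in ctx + err]
    stream = [prep(w) for w in words]
    for i in range(len(stream) - len(target) + 1):
        if stream[i : i + len(target)] == target:
            return i + len(ctx), i + len(target)
    return None


def find_unique_substring(err, words):
    """Return the (start, end) word range covering err if it occurs exactly once."""
    needle = "".join(compact(t) for t in err)
    if len(needle) < _MIN_SUBSTRING_CHARS:
        return None  # too short: would match inside unrelated words
    parts = [compact(w) for w in words]
    stream = "".join(parts)
    pos = stream.find(needle)
    if pos == -1 or stream.find(needle, pos + 1) != -1:
        return None  # absent or ambiguous
    # offsets[i] is the character position where word i starts in the stream.
    offsets = list(accumulate((len(p) for p in parts), initial=0))
    start = bisect_right(offsets, pos) - 1
    end = bisect_left(offsets, pos + len(needle))
    return start, end


pdf_words = ["Responsable", "d’une", "équipe", "de", "dévelopeurs,", "Paris"]
strict = match_window(["de"], ["dévelopeurs"], pdf_words, normalized=False)
tolerant = match_window(["de"], ["dévelopeurs"], pdf_words, normalized=True)
drifted = find_unique_substring(["d'", "une", "équipe"], pdf_words)
too_short = find_unique_substring(["une"], pdf_words)
```

In this sample the strict pass fails because of the trailing comma, the tolerant pass finds word index 4, and the substring pass maps three LLM tokens back onto two PDF words.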

If none of the four catches anything, the mistake lands in a "Not located" section of the result instead of being silently dropped. A visible mistake the user can read but that has no red rectangle is less bad than a mistake we claim is somewhere it isn't.

Takeaways

If you're hacking on something similar, three things to remember:

A regex doesn't detect names, companies or dates. You need a trained detector.
If you want to stream structured output (pydantic objects on the fly, not the whole list at the end), the usual libs won't cut it. instructor is built for that.
If the LLM works on text extracted from a document (PDF, OCR, scans), it hands you mistakes with no coordinates. You have to relocate them afterward, and accept it won't always be possible.

piighost handles the first point. instructor handles the second. The third is what made me write this project, whose code is open.

Live app: https://piighost-proofreader.athroniaeth.cloud/
piighost: github.com/Athroniaeth/piighost, the anonymization lib used here.
piighost-proofreader: github.com/Athroniaeth/piighost-proofreader, the full project, live demo, locator included.

Issues and PRs welcome. If you work with private text in an LLM, the three points above will probably ring a bell.

PIIGhost: a Python library for PII anonymization in LLM agents

Athroniaeth — Mon, 27 Apr 2026 18:18:21 +0000

I've been building agents on top of LangGraph for a while now, and I keep running into the same problem: every message sent to the LLM might contain sensitive data, and depending on the provider you're using, what happens to that data changes completely.

To simplify, there are three families of providers:

Non-EU cloud (OpenAI, Anthropic, Google): the best models, but data leaves the EU, which is problematic on many fronts. I wrote a summary here.
Sovereign EU cloud (Mistral, Aleph Alpha): processing happens in the EU, but a more restricted catalog.
Self-hosted (Ollama, vLLM, open-weight models): you never hand your data to a third party, you control everything, but you have to manage the infrastructure yourself.

I'm currently working on notarial documents, which in practice limits me to Mistral. So I can't take advantage of the best LLMs to do my work. The only clean way to decouple the LLM from the sensitivity of the content is to anonymize upstream.

Why it's harder than it looks

On paper, it's simple. You take a detector (regex for emails, NER model for names), replace what matches with placeholders, and send to the LLM.

In practice, four problems show up almost immediately.

Placeholder consistency. The point of anonymization is to replace "Patrick" with a placeholder like <<PERSON:1>>, which tells the LLM two things: a person has been hidden here, and every occurrence of <<PERSON:1>> refers to the same person. If "Patrick" becomes <<PERSON:1>> at the start of the text and <<PERSON:3>> at the end, the LLM can no longer reason about the fact that it's the same individual.

Variants missed by the detector. The NER detects "Patrick Dupont" at the start of the text but misses "Patrick" alone two sentences later. Or it detects "Patrick" but not "patrick" in lowercase. Or not "Patriick" with a typo.

Overlap between detectors. You chain two NERs to boost recall. On "Patrick", both can claim the same span with different labels (one says PERSON, the other says ORG because it confused it with a company name). Without arbitration, the final replacement hits the same position twice and breaks the text.

Persistence across messages. Once the LLM has seen <<PERSON:1>> in message 1, message 2 needs to use the same placeholder. Without shared memory, "Patrick" becomes <<PERSON:1>> then <<PERSON:7>> depending on the moment, and the LLM loses track.

And that's before we even get to the agent, where tools need to receive the real values (to send an email, for example) while the LLM should only see placeholders. On the front-end side, you also have to deanonymize the placeholders before showing the response to the user, without the LLM ever knowing the mapping.

It's to address all of this that I built PIIGhost, an open-source project that adds a layer of detection, anonymization and deanonymization on top of your detectors (NER, regex, LLM, whatever you want). It also offers a conversational mode and a LangChain middleware that plugs into LangGraph without modifying your existing code.

The rest of the article follows the pipeline order: detection, span arbitration, entity linking, merging, anonymization, then the conversational and agent layers.

Step 1: Detection

Everything starts with detection. A detector takes text and returns a list of Detection objects (text found, label, position, confidence). PIIGhost ships several out of the box:

RegexDetector for structured formats (emails, phone numbers, IBAN).
ExactMatchDetector for fixed words known in advance, useful for tests or business dictionaries.
Gliner2Detector for NER, backed by GLiNER2 by default.
CompositeDetector to combine multiple detectors into one.

The interface is an AnyDetector protocol, so you can plug in your own (an LLM call, another NER model, whatever you want).

Here's an example without an ML model, just to show the mechanics:

from piighost import ExactMatchDetector

detector = ExactMatchDetector([
    ("Patrick", "PERSON"),
    ("Paris", "LOCATION"),
])

detections = await detector.detect("Patrick lives in Paris.")
# Detection(text='Patrick', label='PERSON',   position=Span(0, 7),   confidence=1.0)
# Detection(text='Paris',   label='LOCATION', position=Span(17, 22), confidence=1.0)

At this stage, we have a raw list of detections. No anonymization, no duplicate handling, nothing. Just "here's what looks like PII and where it sits".
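To illustrate what "plugging in your own" detector means, here is a standalone sketch of the idea using a structural `Protocol`. It does not import piighost: the `Detection` fields, the `Detector` protocol and its `detect` signature are assumptions made for this example, so check the library's actual `AnyDetector` definition before relying on them.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Detection:
    # Hypothetical shape, loosely mirroring the fields listed above.
    text: str
    label: str
    start: int
    end: int
    confidence: float


class Detector(Protocol):
    async def detect(self, text: str) -> list[Detection]: ...


class EmailDetector:
    """Satisfies Detector structurally: no inheritance required."""

    _pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    async def detect(self, text: str) -> list[Detection]:
        return [
            Detection(m.group(), "EMAIL", m.start(), m.end(), 1.0)
            for m in self._pattern.finditer(text)
        ]


async def run(detector: Detector, text: str) -> list[Detection]:
    return await detector.detect(text)


found = asyncio.run(run(EmailDetector(), "Write to patrick@example.com today."))
```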

Step 2: Span arbitration

First real problem. When you chain multiple detectors on the same text, they can claim the same chunk with different labels. This is typically what happens when you combine two NERs to boost recall. They step on each other and one of them is wrong.

A concrete example. On the following sentence:

"Patrick works at Orange since 2015."

You run two NERs:

NER A (a generalist model) detects "Patrick" → PERSON, span [0:7], confidence 0.95
NER B (a domain model less reliable on first names) detects "Patrick" → ORG, span [0:7], confidence 0.60 (it confused it with a company name)

Both point to exactly the same span [0:7], but with mutually exclusive labels. If we replace both, we hit the same position twice and end up with something broken like <<ORG:1>><<PERSON:1>> works at.... We have to choose.

That's the role of the span resolver. PIIGhost ships two by default:

ConfidenceSpanConflictResolver: keeps the detection with the highest confidence in case of overlap. The reasonable default.
DisabledSpanConflictResolver: does nothing, to use if your detections are already clean or if you want to handle the case yourself.

You can also write your own (prefer the longest span, prefer a specific label, etc.) by implementing the SpanConflictResolver protocol.

from piighost import ConfidenceSpanConflictResolver

resolver = ConfidenceSpanConflictResolver()
clean = resolver.resolve(detections)

# Input detections:
#   - PERSON "Patrick" [0:7] confidence=0.95   (NER A)
#   - ORG    "Patrick" [0:7] confidence=0.60   (NER B)
#
# After resolution, only this remains:
#   - PERSON "Patrick" [0:7] confidence=0.95

At the end of this step, no more overlaps. Each chunk of text is claimed by only one detection.

Overlap isn't necessarily exact. The resolver also handles cases where one span is included in another, or where two spans partially overlap. The principle stays the same. Keep the most confident.
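The "keep the most confident" rule can be sketched in a few lines. This is a simplified stand-in rather than PIIGhost's resolver: the `Detection` dataclass and `resolve_by_confidence` function are hypothetical, and ties are broken by input order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    text: str
    label: str
    start: int
    end: int  # exclusive
    confidence: float


def overlaps(a: Detection, b: Detection) -> bool:
    """True for exact, nested and partial overlaps alike."""
    return a.start < b.end and b.start < a.end


def resolve_by_confidence(detections: list[Detection]) -> list[Detection]:
    """Greedily keep the most confident detection of each overlapping group."""
    kept: list[Detection] = []
    for det in sorted(detections, key=lambda d: d.confidence, reverse=True):
        if not any(overlaps(det, other) for other in kept):
            kept.append(det)
    return sorted(kept, key=lambda d: d.start)


# "Patrick works at Orange since 2015."
raw = [
    Detection("Patrick", "PERSON", 0, 7, 0.95),    # NER A
    Detection("Patrick", "ORG", 0, 7, 0.60),       # NER B, same span
    Detection("at Orange", "LOCATION", 14, 23, 0.50),  # partial overlap
    Detection("Orange", "ORG", 17, 23, 0.90),
]
clean = resolve_by_confidence(raw)
```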

Step 3: Entity linking

Second problem. The NER misses occurrences. It finds "Patrick Dupont" in sentence 1 but misses "Patrick" alone in sentence 3. If we stop at raw detection, "Patrick" stays in clear text in the anonymized output. That's exactly what we want to avoid.

The linker fixes this. ExactEntityLinker does two things:

For each detection, it searches for all other occurrences of the same text in the document, using a word-boundary regex (to avoid matching "Patric" inside "Patricia").
It groups every detection that points to the same normalized text into a single Entity object.

Concretely:

Text: "Patrick Dupont lives in Paris. Patrick loves Paris."

Raw NER detections:
  - PERSON   "Patrick Dupont"  (sentence 1)
  - LOCATION "Paris"            (sentence 1)
  # "Patrick" and "Paris" in sentence 2 were missed by the NER

After ExactEntityLinker:
  - Entity(label=PERSON,   detections=["Patrick Dupont", "Patrick"])
  - Entity(label=LOCATION, detections=["Paris", "Paris"])

All occurrences are recovered, grouped by entity. The NER misses things, the linker catches them.

One caveat. The linker does exact string matching. It won't catch "patrick" in lowercase or "Patriick" with a typo. For that, you need a fuzzy linker, which you can write by implementing the EntityLinker protocol.
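The word-boundary search is easy to reproduce with the standard library. The function below is a hypothetical sketch, not the linker's code. It shows why "Patric" does not match inside "Patricia" and why a lowercase variant is missed.

```python
import re


def find_occurrences(text: str, surface: str) -> list[tuple[int, int]]:
    """Return the (start, end) span of every whole-word occurrence of surface."""
    pattern = re.compile(rf"\b{re.escape(surface)}\b")
    return [m.span() for m in pattern.finditer(text)]


spans = find_occurrences("Patric met Patricia. Patric left.", "Patric")
case_sensitive = find_occurrences("patrick and Patrick", "Patrick")
```

One limitation of this sketch: `\b` only behaves as expected when the surface form starts and ends with a word character, so strings such as "C++" would need a different boundary check.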

Step 4: Entity merging

Third problem, more subtle. Imagine two detectors that see the same person but with different spans:

The NER detects "Patrick Dupont" → entity A, label PERSON
A business dictionary detects "Patrick" alone (because they're in the firm's associates list) → entity B, label PERSON

After the linker, you end up with two distinct entities even though it's clearly the same person. If you anonymize as is, "Patrick Dupont" becomes <<PERSON:1>> and "Patrick" alone becomes <<PERSON:2>>. The LLM thinks these are two different people.

The entity resolver merges these duplicates. Two options:

MergeEntityConflictResolver: uses union-find to merge entities sharing at least one detection (strict matching). The default.
FuzzyEntityConflictResolver: uses Jaro-Winkler distance to merge entities whose canonical text is close (e.g. "Patrick" and "Patriick" with a typo). More tolerant, but higher false-positive risk.

A concrete example:

Before merge:
  - Entity(label=PERSON, detections=["Patrick Dupont"])
  - Entity(label=PERSON, detections=["Patrick"])
  # Both entities share a detection on the string "Patrick"

After MergeEntityConflictResolver:
  - Entity(label=PERSON, detections=["Patrick Dupont", "Patrick"])

At this stage, you have a clean list of entities, each grouping all of its occurrences. No more duplicates, no more overlaps.
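For readers unfamiliar with union-find, here is a small sketch of the merging idea. It is an illustration under assumptions, not the resolver's implementation: entities are modeled as sets of `(start, end)` spans, and "sharing a detection" is taken to mean sharing an identical span.

```python
Span = tuple[int, int]


def merge_entities(entities: list[set[Span]]) -> list[set[Span]]:
    """Merge every group of entities connected by at least one shared span."""
    parent = list(range(len(entities)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    owner: dict[Span, int] = {}  # first entity seen for each span
    for idx, spans in enumerate(entities):
        for span in spans:
            if span in owner:
                union(idx, owner[span])
            else:
                owner[span] = idx

    merged: dict[int, set[Span]] = {}
    for idx, spans in enumerate(entities):
        merged.setdefault(find(idx), set()).update(spans)
    return [merged[root] for root in sorted(merged)]


# "Patrick Dupont lives in Paris. Patrick loves Paris."
entities = [
    {(0, 14)},             # "Patrick Dupont" from the NER
    {(31, 38)},            # "Patrick" from the dictionary
    {(0, 14), (31, 38)},   # linker output tying both spans together
    {(24, 29)},            # "Paris"
]
result = merge_entities(entities)
```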

Step 5: Anonymization

Now we can replace. The Anonymizer generates a unique placeholder per entity via a PlaceholderFactory, then replaces the spans in the text from right to left (so the positions of the following spans don't shift).

from piighost import Anonymizer, LabelCounterPlaceholderFactory

anonymizer = Anonymizer(LabelCounterPlaceholderFactory())
result = anonymizer.anonymize(text, entities)

# Patrick Dupont lives in Paris. Patrick loves Paris.
# becomes
# <<PERSON:1>> lives in <<LOCATION:1>>. <<PERSON:1>> loves <<LOCATION:1>>.
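The right-to-left trick is worth seeing in isolation. A minimal sketch, with a hypothetical `replace_spans` helper and character offsets computed by hand for the example sentence:

```python
def replace_spans(text: str, replacements: list[tuple[int, int, str]]) -> str:
    """Apply (start, end, placeholder) replacements from the last span to the first.

    Editing from the right means earlier offsets are still valid when we reach them,
    even though placeholders and originals have different lengths.
    """
    for start, end, placeholder in sorted(replacements, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


source = "Patrick Dupont lives in Paris. Patrick loves Paris."
anonymized = replace_spans(
    source,
    [
        (0, 14, "<<PERSON:1>>"),
        (24, 29, "<<LOCATION:1>>"),
        (31, 38, "<<PERSON:1>>"),
        (45, 50, "<<LOCATION:1>>"),
    ],
)
```

Going left to right instead would require adjusting every later offset by the length difference of each replacement already made.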

Several factories are provided, to choose based on your case:

LabelCounterPlaceholderFactory: <<PERSON:1>>, <<LOCATION:1>>. Readable in logs and traces.
LabelHashPlaceholderFactory: <<PERSON:a3f9>>. Avoids leaking the order in which entities appear from one conversation to another.
FakerCounterPlaceholderFactory: "John Smith", "Springfield". Preserves linguistic flow for the LLM (useful if the model struggles with raw placeholders).
MaskPlaceholderFactory: [REDACTED]. Pure anonymization, irreversible.

The default <<LABEL:N>> format has four useful properties:

it's unique as a token in theory,
the LLM immediately sees what type of PII it's dealing with,
it's not ambiguous in regular text,
it can't be confused with another placeholder (unlike a plain <<PERSON>>, which doesn't distinguish people from one another).

The assembled pipeline

All the steps above chain together into a pipeline:

from piighost.pipeline import AnonymizationPipeline
from piighost import (
    ConfidenceSpanConflictResolver,
    ExactEntityLinker,
    MergeEntityConflictResolver,
    Anonymizer,
    LabelCounterPlaceholderFactory,
)

pipeline = AnonymizationPipeline(
    detector=detector,
    span_resolver=ConfidenceSpanConflictResolver(),
    entity_linker=ExactEntityLinker(),
    entity_resolver=MergeEntityConflictResolver(),
    anonymizer=Anonymizer(LabelCounterPlaceholderFactory()),
)

anonymized, entities = await pipeline.anonymize(
    "Patrick Dupont lives in Paris. Patrick loves Paris."
)
# <<PERSON:1>> lives in <<LOCATION:1>>. <<PERSON:1>> loves <<LOCATION:1>>.

original, _ = await pipeline.deanonymize(anonymized)
# Patrick Dupont lives in Paris. Patrick loves Paris.

The pipeline keeps a cache of the mapping (SHA-256 key on the input text), so deanonymization is free after the first call.

The conversation problem

All of this works for an isolated message. In a real conversation, it breaks because of three problems.

Counters not shared. Every call to anonymize starts from scratch. The Patrick → <<PERSON:1>> mapping from message 1 is not guaranteed to be reused at message 2.

Detections missed across messages. The NER detects "Patrick" in message 1 but misses it in message 5. Without memory of entities already seen, we can't fill the gap.

Concurrent conversations. If multiple users share the same pipeline instance, their entities mix together. The <<PERSON:1>> of one and the other become indistinguishable.

Bug demonstration:

# Message 1
m1, _ = await pipeline.anonymize("Patrick lives in Paris.")
# <<PERSON:1>> lives in <<LOCATION:1>>.

# Message 2, state not shared
m2, _ = await pipeline.anonymize("Bob is happy.")
# <<PERSON:1>> is happy.   ← the counter restarted at 1
# Bob inherits the same placeholder as Patrick → collision:
# the LLM thinks it's the same person.

ThreadAnonymizationPipeline extends the standard pipeline with a ConversationMemory scoped by thread_id. The memory accumulates entities across messages, deduplicated by (text.lower(), label). Each call passes a thread_id, and the cache is prefixed with that identifier so conversations stay isolated.

from piighost.pipeline.thread import ThreadAnonymizationPipeline

pipeline = ThreadAnonymizationPipeline(detector=..., span_resolver=..., ...)

# Conversation A
m1, _ = await pipeline.anonymize("Patrick lives in Paris.", thread_id="user-A")
# <<PERSON:1>> lives in <<LOCATION:1>>.

m2, _ = await pipeline.anonymize("Patrick is happy.", thread_id="user-A")
# <<PERSON:1>> is happy.   ← guaranteed, shared via the thread memory

# Conversation B in parallel, isolated
m3, _ = await pipeline.anonymize("Bob loves Lyon.", thread_id="user-B")
# <<PERSON:1>> loves <<LOCATION:1>>.   ← counter independent from conversation A
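The thread-scoped memory can be approximated with two nested dictionaries. This is a sketch of the concept only: the `ThreadMemory` class and `placeholder_for` method are hypothetical names, and the `(text.lower(), label)` key is taken from the description above.

```python
from collections import defaultdict


class ThreadMemory:
    """Remembers which placeholder each entity received, per conversation."""

    def __init__(self) -> None:
        # thread_id -> {(lowercased text, label): placeholder}
        self._known: dict[str, dict[tuple[str, str], str]] = defaultdict(dict)
        # thread_id -> {label: last counter value}
        self._counters: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def placeholder_for(self, thread_id: str, text: str, label: str) -> str:
        key = (text.lower(), label)
        known = self._known[thread_id]
        if key not in known:
            self._counters[thread_id][label] += 1
            known[key] = f"<<{label}:{self._counters[thread_id][label]}>>"
        return known[key]


memory = ThreadMemory()
a1 = memory.placeholder_for("user-A", "Patrick", "PERSON")
a2 = memory.placeholder_for("user-A", "patrick", "PERSON")  # same entity, later message
a3 = memory.placeholder_for("user-A", "Bob", "PERSON")
b1 = memory.placeholder_for("user-B", "Bob", "PERSON")      # separate conversation
```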

ThreadAnonymizationPipeline also adds two operations useful for the agent case:

anonymize_with_ent(text, thread_id=...): pure string replacement, without detection. Uses the entities already known to the thread to anonymize a new text. Faster, but doesn't detect new PII.
deanonymize_with_ent(text, thread_id=...): inverse replacement. Useful when the LLM produces text with placeholders we want to restore.

These two operations correctly handle cases where one placeholder is a prefix of another (<<PERSON:1>> vs <<PERSON:10>>) by replacing the longer ones first.
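The longest-first rule matters most when placeholders have no closing delimiter, which is the situation the sketch below uses to make the failure visible. The `replace_longest_first` helper and the `PERSON_1` / `PERSON_10` tokens are hypothetical, chosen for illustration rather than taken from the library.

```python
import re


def replace_longest_first(text: str, mapping: dict[str, str]) -> str:
    """Replace every key of mapping in one pass, preferring longer keys."""
    if not mapping:
        return text
    # Regex alternation tries branches in order, so longer keys must come first.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group()], text)


mapping = {"PERSON_1": "Patrick", "PERSON_10": "Bob"}
restored = replace_longest_first("PERSON_10 wrote to PERSON_1.", mapping)

# The naive order corrupts the longer token:
naive = "PERSON_10 wrote to PERSON_1.".replace("PERSON_1", "Patrick")
```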

The agent problem

In a LangGraph agent, the LLM doesn't just process messages. It calls tools, reads their results, and reasons in a loop. Anonymizing properly in this setting requires three interventions at precise moments.

Before the LLM call. All messages have to be anonymized. This is the standard pipeline.anonymize(), applied to each message of the context.

Before and after a tool execution. The LLM calls send_email(to=<<PERSON:1>>). The tool needs the real address, not the placeholder. We deanonymize the arguments via deanonymize_with_ent, execute, then re-anonymize the result before handing it back to the LLM.

Before display to the user. The LLM produces "Done, I sent the email to <<PERSON:1>>". The user wants to see "Patrick", not the placeholder.

PIIAnonymizationMiddleware wires these three hooks into LangGraph:

from langchain.agents import create_agent
from piighost.middleware import PIIAnonymizationMiddleware

middleware = PIIAnonymizationMiddleware(pipeline=pipeline)

agent = create_agent(
    model="mistral:mistral-large-latest",
    tools=[send_email, get_weather],
    middleware=[middleware],
)

Under the hood, the middleware reads the thread_id from the LangGraph config (get_config()["configurable"]["thread_id"]) and passes it to every pipeline operation. The LLM never sees real values, the tools receive them normally, the user gets the response with their names intact. No agent code to modify.

piighost-chat: the human-in-the-loop demo

To make all of this concrete, I built a chatbot on top of the library. The user sees what is about to be anonymized before the message is sent to the LLM. They can deselect a span flagged by mistake, or select text the detector missed. Once validated, the message goes into the pipeline.

This kind of human-in-the-loop UX is what makes auto-anonymization actually usable in real workflows, where automatic detection accuracy often plateaus around 90-95% and the few percent it misses can be a problem. The auto pass does the heavy lifting; the human catches the edge cases.

For instance, you type your message, it goes through the piighost API, and the front end shows what was detected and what is about to be anonymized.

You can remove anonymized entities if there's a false positive.

You can also select text to add new entities to anonymize.

If you ask for information about an anonymized PII, for instance which letter the word starts with, the LLM won't be able to answer.

The library is in its early days. I tried to anticipate as many cases as possible starting from my own needs on notarial documents, but I know that's a particular angle and that many things can be debated: components that aren't generic enough, abstractions that don't pull their weight, use cases I haven't seen.
If you give it a try, your feedback genuinely matters to me:

what felt missing or counter-intuitive,
what feels too complex or pointless and should be removed,
the use cases where it doesn't hold up.

Anything is welcome, whether through a GitHub issue, a PR, or even a direct message. I'd rather cut early on what doesn't belong than accumulate debt.

Thanks for reading.

PIIGhost: a Python library for PII anonymization in LLM agents

Athroniaeth — Sun, 26 Apr 2026 23:38:08 +0000

I've been building agents with LangGraph for a while, and I keep running into the same problem: every message sent to the LLM may contain sensitive data, and what happens to that data depends entirely on which provider you use.

Simplifying a bit, there are three families of providers:

Non-European cloud (OpenAI, Anthropic, Google): the best models, but the data leaves the EU, which is problematic in many respects. I wrote a summary of the issues here.
European sovereign cloud (Mistral, Aleph Alpha): processing stays in the EU, but the catalog is more limited.
Self-hosted (Ollama, vLLM, open-weight models): you never hand your data to a third party and you control everything, but you have to run the infrastructure yourself.

I'm currently working on notarial documents, which in practice restricts me to Mistral, so I can't use the best LLMs for my tasks. The only clean way to decouple the LLM from the sensitivity of the content is to anonymize upstream.

Why it's harder than it looks

On paper, it's simple: take a detector (regex for emails, an NER model for names), replace whatever matches with placeholders, and send the result to the LLM.

In practice, four problems show up almost immediately.

Placeholder consistency. The goal of anonymization is to replace "Patrick" with a placeholder such as <<PERSON:1>>, which tells the LLM two things: a person was hidden here, and every occurrence of <<PERSON:1>> refers to the same person. If "Patrick" becomes <<PERSON:1>> at the start of the text and <<PERSON:3>> at the end, the LLM can no longer reason about the fact that they are the same individual.

Variants missed by the detector. The NER detects "Patrick Dupont" at the start of the text but misses "Patrick" on its own two sentences later. Or it detects "Patrick" but not the lowercase "patrick", or the misspelled "Patriick".

Overlap between detectors. You chain two NER models to increase recall. On "Patrick", both may claim the same span with different labels (one says PERSON, the other says ORG because it mistook it for a company name). Without arbitration, the final replacement hits the same position twice and corrupts the text.

Persistence across messages. Once the LLM has seen <<PERSON:1>> in message 1, message 2 has to use the same placeholder. Without shared memory, "Patrick" becomes <<PERSON:1>> and then <<PERSON:7>> depending on the moment, and the LLM loses the thread.

And that's before even getting to the agent, where tools must receive the real values (to send an email, for example) while the LLM must only see placeholders. On the front-end side, placeholders also have to be deanonymized before the response is shown to the user, without the LLM ever knowing the mapping.

That's what I built PIIGhost for: an open-source project that adds a detection, anonymization, and deanonymization layer on top of your detectors (NER, regex, LLM, whatever you like). It also offers a conversational mode and a LangChain middleware that plugs into LangGraph without modifying your existing code.

The rest of the article follows the order of the pipeline: detection, span arbitration, entity linking, merging, anonymization, then the conversational and agent layers.

Step 1: Detection

Everything starts with detection. A detector takes text and returns a list of Detection objects (matched text, label, position, confidence). PIIGhost ships several out of the box:

RegexDetector for structured formats (emails, phone numbers, IBANs).
ExactMatchDetector for fixed words known in advance, useful for tests or for domain dictionaries.
Gliner2Detector for NER, wired to GLiNER2 by default.
CompositeDetector to combine several detectors into one.

The interface is an AnyDetector protocol, so you can plug in your own (an LLM call, another NER model, whatever you like).
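
To give an idea of what plugging in your own detector could look like, here is a self-contained sketch. It does not import piighost: the `Span` and `Detection` dataclasses below are local stand-ins that mirror the fields listed above, `EmailDetector` is a hypothetical class, and the async `detect(text)` signature is an assumption based on how detectors are called in this article.

```python
import asyncio
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int


@dataclass(frozen=True)
class Detection:
    text: str
    label: str
    position: Span
    confidence: float


class EmailDetector:
    """Toy detector: flags anything that looks like an email address."""

    _pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    async def detect(self, text: str) -> list[Detection]:
        # A regex match is either there or not, so confidence is fixed at 1.0.
        return [
            Detection(m.group(0), "EMAIL", Span(m.start(), m.end()), 1.0)
            for m in self._pattern.finditer(text)
        ]


found = asyncio.run(EmailDetector().detect("Write to patrick@example.com today."))
```

Any object exposing the same method shape could stand in for it, which is the point of using a protocol rather than a base class.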

Here is an example without an ML model, just to show the mechanics:

from piighost import ExactMatchDetector

detector = ExactMatchDetector([
    ("Patrick", "PERSON"),
    ("Paris", "LOCATION"),
])

detections = await detector.detect("Patrick lives in Paris.")
# Detection(text='Patrick', label='PERSON',   position=Span(0, 7),   confidence=1.0)
# Detection(text='Paris',   label='LOCATION', position=Span(17, 22), confidence=1.0)

At this stage, we have a raw list of detections. No anonymization yet, no duplicate handling, nothing. Just: "here is what looks like PII and where it is".

Step 2: Span arbitration

First real problem: when you chain several detectors on the same text, they can claim the same chunk with different labels. This is typically what happens when you combine two NER models to increase recall: they step on each other and one of them gets it wrong.

Let's take a concrete example. On the following sentence:

"Patrick has worked at Orange since 2015."

You run two NER models:

NER A (a general-purpose model) detects "Patrick" → PERSON, span [0:7], confidence 0.95
NER B (a domain-specific model that is less reliable on first names) detects "Patrick" → ORG, span [0:7], confidence 0.60 (it mistook it for a company name)

Both point at exactly the same span [0:7], but with mutually exclusive labels. If we replace both, we hit the same position twice and end up with something broken like <<ORG:1>><<PERSON:1>> has worked at.... We have to choose.

That is the job of the span resolver. PIIGhost provides two by default:

ConfidenceSpanConflictResolver: keeps the detection with the highest confidence when spans overlap. This is the reasonable default.
DisabledSpanConflictResolver: does nothing; use it if your detections are already clean or if you want to handle conflicts yourself.

You can also write your own (prefer the longest span, prefer a specific label, etc.) by implementing the SpanConflictResolver protocol.

from piighost import ConfidenceSpanConflictResolver

resolver = ConfidenceSpanConflictResolver()
clean = resolver.resolve(detections)

# Input detections:
#   - PERSON "Patrick" [0:7] confidence=0.95   (NER A)
#   - ORG    "Patrick" [0:7] confidence=0.60   (NER B)
#
# After resolution, only this one remains:
#   - PERSON "Patrick" [0:7] confidence=0.95

At the end of this step, there are no more overlaps. Each chunk of text is claimed by a single detection.

The overlap doesn't have to be exact. The resolver also handles cases where one span is contained in another, or where two spans partially overlap. The principle stays the same: keep the most confident one.
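
The "keep the most confident" rule can be sketched as a greedy pass. This is a simplified illustration under my own assumptions, not the library's resolver: the flat `Detection` dataclass and `resolve_by_confidence` are hypothetical, and the positions refer to the sentence "Patrick has worked at Orange since 2015.".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    text: str
    label: str
    start: int
    end: int
    confidence: float


def overlaps(a: Detection, b: Detection) -> bool:
    """True for identical, nested, or partially overlapping spans."""
    return a.start < b.end and b.start < a.end


def resolve_by_confidence(detections: list[Detection]) -> list[Detection]:
    kept: list[Detection] = []
    # Visit the most confident detections first; drop anything that collides
    # with a detection already kept.
    for candidate in sorted(detections, key=lambda d: d.confidence, reverse=True):
        if not any(overlaps(candidate, other) for other in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda d: d.start)


detections = [
    Detection("Patrick", "PERSON", 0, 7, 0.95),        # NER A
    Detection("Patrick", "ORG", 0, 7, 0.60),           # NER B, exact same span
    Detection("Orange", "ORG", 22, 28, 0.90),
    Detection("at Orange", "LOCATION", 19, 28, 0.40),  # contains the span above
]
clean = resolve_by_confidence(detections)
```

A "prefer the longest span" resolver would only need a different sort key in the loop.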

Step 3: Entity linking

Second problem: the NER misses occurrences. It finds "Patrick Dupont" in sentence 1 but misses "Patrick" on its own in sentence 3. If we stop at raw detection, "Patrick" stays in clear text in the anonymized output, which is exactly what we want to avoid.

The linker fixes that. ExactEntityLinker does two things:

For each detection, it searches the document for every other occurrence of the same text, using a word-boundary regex (to avoid matching "Patric" inside "Patricia").
It groups all detections that point to the same normalized text into a single Entity object.

Concretely:

Text: "Patrick Dupont lives in Paris. Patrick loves Paris."

Raw NER detections:
  - PERSON   "Patrick Dupont"  (sentence 1)
  - LOCATION "Paris"            (sentence 1)
  # "Patrick" and "Paris" in sentence 2 were missed by the NER

After ExactEntityLinker:
  - Entity(label=PERSON,   detections=["Patrick Dupont", "Patrick"])
  - Entity(label=LOCATION, detections=["Paris", "Paris"])

Every occurrence is found and grouped by entity. The NER misses things; the linker catches them afterwards.

Note: the linker does exact string matching. It won't catch the lowercase "patrick" or the misspelled "Patriick". For that you need a fuzzy linker, which you can write by implementing the EntityLinker protocol.
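
The word-boundary search at the heart of this step fits in a few lines. The sketch below is only an illustration of that one idea (`find_occurrences` is a hypothetical helper, not part of the library), and it assumes the searched text starts and ends with a word character, which is when `\b` behaves as expected.

```python
import re


def find_occurrences(text: str, needle: str) -> list[tuple[int, int]]:
    """Return (start, end) for every whole-word occurrence of `needle` in `text`."""
    # re.escape protects names containing regex metacharacters; \b on both
    # sides rejects matches that sit inside a longer word.
    pattern = re.compile(rf"\b{re.escape(needle)}\b")
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


text = "Patrick Dupont lives in Paris. Patrick loves Paris."
paris_spans = find_occurrences(text, "Paris")

# "Patric" is not found inside "Patricia" or "Patrick".
partial = find_occurrences("Patricia met Patrick.", "Patric")
```

Making the search case-insensitive would be a first step toward the fuzzy linker mentioned above, at the cost of more false positives on common words.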

Step 4: Entity merging

Third problem, and a subtler one. Imagine two detectors that see the same person but with different spans:

The NER detects "Patrick Dupont" → entity A, label PERSON
A domain dictionary detects "Patrick" on its own (because he is on the firm's list of partners) → entity B, label PERSON

After the linker, you end up with two distinct entities even though they are clearly the same person. If you anonymize as is, "Patrick Dupont" becomes <<PERSON:1>> and "Patrick" on its own becomes <<PERSON:2>>. The LLM thinks these are two different people.

The entity resolver merges these duplicates. Two options:

MergeEntityConflictResolver: uses a union-find to merge entities that share at least one detection (strict matching). This is the default.
FuzzyEntityConflictResolver: uses Jaro-Winkler distance to merge entities whose canonical text is close (e.g. "Patrick" and "Patriick" with a typo). More tolerant, but with a higher risk of false positives.

Concrete example:

Before merging:
  - Entity(label=PERSON, detections=["Patrick Dupont"])
  - Entity(label=PERSON, detections=["Patrick"])
  # Both entities share a detection on the string "Patrick"

After MergeEntityConflictResolver:
  - Entity(label=PERSON, detections=["Patrick Dupont", "Patrick"])

At this stage, you have a clean list of entities, each grouping all of its occurrences. No more duplicates, no more overlaps.
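
For readers who haven't met union-find before, here is a minimal sketch of merging entities that share a detection. It is a simplification made for illustration: each entity is reduced to a set of (start, end) spans, and `merge_entities` is a hypothetical function rather than the resolver's real code.

```python
Span = tuple[int, int]


def merge_entities(entities: list[set[Span]]) -> list[set[Span]]:
    """Merge entities that share at least one span, transitively."""
    parent = list(range(len(entities)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    # The first entity that claims a span "owns" it; later claimants are
    # unioned with the owner.
    owner: dict[Span, int] = {}
    for index, spans in enumerate(entities):
        for span in spans:
            if span in owner:
                union(owner[span], index)
            else:
                owner[span] = index

    merged: dict[int, set[Span]] = {}
    for index, spans in enumerate(entities):
        merged.setdefault(find(index), set()).update(spans)
    return [merged[root] for root in sorted(merged)]


# Entity A (NER), entity B (dictionary), and an unrelated location.
result = merge_entities([{(0, 14), (31, 38)}, {(31, 38)}, {(24, 29)}])
```

Because unions are transitive, three entities chained by pairwise shared detections end up as a single entity.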

Step 5: Anonymization

Now we can replace. The Anonymizer generates a unique placeholder per entity via a PlaceholderFactory, then replaces the spans in the text from right to left (so that replacing one span does not shift the positions of the spans still to be processed).

from piighost import Anonymizer, LabelCounterPlaceholderFactory

anonymizer = Anonymizer(LabelCounterPlaceholderFactory())
result = anonymizer.anonymize(text, entities)

# Patrick Dupont lives in Paris. Patrick loves Paris.
# becomes
# <<PERSON:1>> lives in <<LOCATION:1>>. <<PERSON:1>> loves <<LOCATION:1>>.
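
The right-to-left trick is easy to verify on a small case. The sketch below is a bare-bones illustration, assuming non-overlapping spans; `replace_spans` is a hypothetical helper and the offsets are written by hand for this one sentence.

```python
def replace_spans(text: str, replacements: list[tuple[int, int, str]]) -> str:
    """Apply (start, end, placeholder) replacements without recomputing offsets."""
    # Rightmost span first: everything to its left keeps its original offsets,
    # even though the placeholder has a different length than the original.
    for start, end, placeholder in sorted(replacements, key=lambda r: r[0], reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


source = "Patrick Dupont lives in Paris. Patrick loves Paris."
anonymized = replace_spans(
    source,
    [
        (0, 14, "<<PERSON:1>>"),
        (24, 29, "<<LOCATION:1>>"),
        (31, 38, "<<PERSON:1>>"),
        (45, 50, "<<LOCATION:1>>"),
    ],
)
```

Going left to right would also work, but only by tracking a running offset after every replacement.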

Several factories are provided; pick one according to your use case:

LabelCounterPlaceholderFactory: <<PERSON:1>>, <<LOCATION:1>>. Readable in logs and traces.
LabelHashPlaceholderFactory: <<PERSON:a3f9>>. Avoids leaking the order in which entities appear from one conversation to the next.
FakerCounterPlaceholderFactory: "John Smith", "Springfield". Preserves the linguistic flow for the LLM (useful if the model struggles with raw placeholders).
MaskPlaceholderFactory: [REDACTED]. Pure, irreversible anonymization.
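
To make the difference between the counter and hash styles concrete, here is a rough sketch of both. These are not the library's factories: `CounterFactory` and `hash_placeholder` are hypothetical, and using the first four hex characters of a SHA-256 digest is my assumption, chosen only to match the <<PERSON:a3f9>> shape shown above.

```python
import hashlib


class CounterFactory:
    """<<LABEL:N>> placeholders, numbered per label in order of first appearance."""

    def __init__(self) -> None:
        self._counters: dict[str, int] = {}
        self._assigned: dict[tuple[str, str], str] = {}

    def placeholder(self, text: str, label: str) -> str:
        key = (text, label)
        if key not in self._assigned:
            self._counters[label] = self._counters.get(label, 0) + 1
            self._assigned[key] = f"<<{label}:{self._counters[label]}>>"
        return self._assigned[key]


def hash_placeholder(text: str, label: str, length: int = 4) -> str:
    """Stateless variant: the suffix depends only on the entity, not on order."""
    digest = hashlib.sha256(f"{label}:{text}".encode("utf-8")).hexdigest()
    return f"<<{label}:{digest[:length]}>>"


factory = CounterFactory()
first = factory.placeholder("Patrick", "PERSON")
second = factory.placeholder("Bob", "PERSON")
again = factory.placeholder("Patrick", "PERSON")
hashed = hash_placeholder("Patrick", "PERSON")
```

A four-character suffix is short enough to collide on large documents, so a real implementation would need a longer digest or a collision check.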

The default <<LABEL:N>> format has four useful properties:

it is, in theory, unique as a token,
the LLM immediately sees what type of PII it is,
it is unambiguous in normal text,
it cannot be confused with another placeholder (unlike a bare <<PERSON>>, which doesn't distinguish one person from another).

The assembled pipeline

All the steps above chain together in a pipeline:

from piighost.pipeline import AnonymizationPipeline
from piighost import (
    ConfidenceSpanConflictResolver,
    ExactEntityLinker,
    MergeEntityConflictResolver,
    Anonymizer,
    LabelCounterPlaceholderFactory,
)

pipeline = AnonymizationPipeline(
    detector=detector,
    span_resolver=ConfidenceSpanConflictResolver(),
    entity_linker=ExactEntityLinker(),
    entity_resolver=MergeEntityConflictResolver(),
    anonymizer=Anonymizer(LabelCounterPlaceholderFactory()),
)

anonymized, entities = await pipeline.anonymize(
    "Patrick Dupont lives in Paris. Patrick loves Paris."
)
# <<PERSON:1>> lives in <<LOCATION:1>>. <<PERSON:1>> loves <<LOCATION:1>>.

original, _ = await pipeline.deanonymize(anonymized)
# Patrick Dupont lives in Paris. Patrick loves Paris.

The pipeline keeps a cache of the mapping (keyed by a SHA-256 of the input text), so deanonymization is free after the first call.
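
One way such a cache could work is sketched below. This is a guess at the general shape, not the pipeline's internals: `MappingCache` is hypothetical, and keying on the hash of the anonymized text (so that deanonymize can look the mapping up from what it receives) is my own assumption.

```python
import hashlib


class MappingCache:
    """Remembers placeholder -> original value mappings per anonymized text."""

    def __init__(self) -> None:
        self._store: dict[str, dict[str, str]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def remember(self, anonymized: str, mapping: dict[str, str]) -> None:
        self._store[self._key(anonymized)] = dict(mapping)

    def restore(self, anonymized: str) -> str:
        # Raises KeyError if this exact text was never produced by the cache owner.
        mapping = self._store[self._key(anonymized)]
        for placeholder in sorted(mapping, key=len, reverse=True):
            anonymized = anonymized.replace(placeholder, mapping[placeholder])
        return anonymized


cache = MappingCache()
message = "<<PERSON:1>> lives in <<LOCATION:1>>."
cache.remember(message, {"<<PERSON:1>>": "Patrick", "<<LOCATION:1>>": "Paris"})
restored = cache.restore(message)
```

Restoring is then a dictionary lookup plus string replacement, with no detector call involved.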

The conversation problem

All of this works for an isolated message. In a real conversation, it breaks because of three problems.

Unshared counters. Each call to anonymize starts from scratch. The Patrick → <<PERSON:1>> mapping from message 1 is not guaranteed to be reused in message 2.

Detections missed across messages. The NER detects "Patrick" in message 1 but misses it in message 5. Without a memory of entities already seen, there is no way to fill the gap.

Concurrent conversations. If several users share the same pipeline instance, their entities get mixed up. One user's <<PERSON:1>> becomes indistinguishable from another's.

Demonstration of the bug:

# Message 1
m1, _ = await pipeline.anonymize("Patrick lives in Paris.")
# <<PERSON:1>> lives in <<LOCATION:1>>.

# Message 2: state is not shared
m2, _ = await pipeline.anonymize("Bob is happy.")
# <<PERSON:1>> is happy.   ← the counter restarted at 1
# Bob therefore gets the same placeholder as Patrick → collision:
# the LLM thinks they are the same person.

ThreadAnonymizationPipeline extends the standard pipeline with a ConversationMemory scoped by thread_id. The memory accumulates entities across messages, deduplicated by (text.lower(), label). Each call passes a thread_id, and the cache is prefixed with that identifier to isolate conversations.

from piighost.pipeline.thread import ThreadAnonymizationPipeline

pipeline = ThreadAnonymizationPipeline(detector=..., span_resolver=..., ...)

# Conversation A
m1, _ = await pipeline.anonymize("Patrick lives in Paris.", thread_id="user-A")
# <<PERSON:1>> lives in <<LOCATION:1>>.

m2, _ = await pipeline.anonymize("Patrick is happy.", thread_id="user-A")
# <<PERSON:1>> is happy.   ← guaranteed, shared via the thread memory

# Conversation B in parallel, isolated
m3, _ = await pipeline.anonymize("Bob loves Lyon.", thread_id="user-B")
# <<PERSON:1>> loves <<LOCATION:1>>.   ← counter independent from conversation A
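
The thread-scoped memory can be pictured as a dictionary of dictionaries. The sketch below is a toy model written for this explanation, not the library's ConversationMemory: `ThreadMemory` and `placeholder_for` are hypothetical names, and only the (text.lower(), label) deduplication key comes from the article.

```python
from collections import defaultdict


class ThreadMemory:
    """Per-thread registry of known entities and their placeholders."""

    def __init__(self) -> None:
        # thread_id -> {(lowercased text, label): placeholder}
        self._entities: dict[str, dict[tuple[str, str], str]] = defaultdict(dict)

    def placeholder_for(self, text: str, label: str, *, thread_id: str) -> str:
        known = self._entities[thread_id]
        key = (text.lower(), label)
        if key not in known:
            # Next counter value for this label, within this thread only.
            rank = 1 + sum(1 for (_, seen_label) in known if seen_label == label)
            known[key] = f"<<{label}:{rank}>>"
        return known[key]


memory = ThreadMemory()
a1 = memory.placeholder_for("Patrick", "PERSON", thread_id="user-A")
a2 = memory.placeholder_for("patrick", "PERSON", thread_id="user-A")  # same entity
a3 = memory.placeholder_for("Bob", "PERSON", thread_id="user-A")      # new entity
b1 = memory.placeholder_for("Bob", "PERSON", thread_id="user-B")      # other thread
```

In thread A, Bob gets <<PERSON:2>> because Patrick is already known there; in thread B, the counter starts over.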

ThreadAnonymizationPipeline also adds two operations useful for the agent case:

anonymize_with_ent(text, thread_id=...): pure string replacement, without detection. Uses the entities already known to the thread to anonymize a new text. Faster, but doesn't detect new PII.
deanonymize_with_ent(text, thread_id=...): inverse replacement. Useful when the LLM produces text with placeholders we want to restore.

These two operations correctly handle cases where one placeholder is a prefix of another (<<PERSON:1>> vs <<PERSON:10>>) by replacing the longer ones first.

The agent problem

In a LangGraph agent, the LLM doesn't just process messages. It calls tools, reads their results, and reasons in a loop. Anonymizing properly in this setting requires three interventions at precise moments.

Before the LLM call. All messages have to be anonymized. This is the standard pipeline.anonymize(), applied to each message of the context.

Before and after a tool execution. The LLM calls send_email(to=<<PERSON:1>>). The tool needs the real address, not the placeholder. We deanonymize the arguments via deanonymize_with_ent, execute, then re-anonymize the result before handing it back to the LLM.

Before display to the user. The LLM produces "Done, I sent the email to <<PERSON:1>>". The user wants to see "Patrick", not the placeholder.

PIIAnonymizationMiddleware wires these three hooks into LangGraph:

from langchain.agents import create_agent
from piighost.middleware import PIIAnonymizationMiddleware

middleware = PIIAnonymizationMiddleware(pipeline=pipeline)

agent = create_agent(
    model="mistral:mistral-large-latest",
    tools=[send_email, get_weather],
    middleware=[middleware],
)

Under the hood, the middleware reads the thread_id from the LangGraph config (get_config()["configurable"]["thread_id"]) and passes it to every pipeline operation. The LLM never sees real values, the tools receive them normally, and the user gets the response with the real names restored. No agent code to modify.

piighost-chat: the human-in-the-loop demo

To make all of this concrete, I built a chatbot on top of the library. The user sees what is about to be anonymized before the message is sent to the LLM. They can deselect a span flagged by mistake, or select text the detector missed. Once validated, the message goes into the pipeline.

This kind of human-in-the-loop UX is what makes automatic anonymization actually usable in real workflows, where automatic detection accuracy often plateaus around 90-95% and the few percent it misses can be a problem. The automatic pass does the heavy lifting; the human catches the edge cases.

For instance, you type your message, it goes through the piighost API, and the front end shows what was detected and what is about to be anonymized.

You can remove anonymized entities if there's a false positive.

You can also select text to add new entities to anonymize.

If you ask for information about an anonymized PII, for instance which letter the word starts with, the LLM won't be able to answer.

The library is in its early days. I tried to anticipate as many cases as possible starting from my own needs on notarial documents, but I know that's a particular angle and that many things can be debated: components that aren't generic enough, abstractions that don't pull their weight, use cases I haven't seen.
If you give it a try, your feedback genuinely matters to me:

what felt missing or counter-intuitive,
what feels too complex or pointless and should be removed,
the use cases where it doesn't hold up.

Anything is welcome, whether through a GitHub issue, a PR, or even a direct message. I'd rather cut early on what doesn't belong than accumulate debt.

Thanks for reading.
