TLDR
Before you send out an important CV, you can hand it to an LLM for proofreading. A few seconds later, you have a list of mistakes. Except you've also just handed your name, your address, your employers and your dates to a third-party service.
piighost-proofreader fixes that. The CV is anonymized locally before the LLM call, and the corrections find their way back onto the right word in the original PDF:
The LLM never sees a name, a date, an address.
Anonymization is the easy part. The painful bit is finding, back in the PDF, a word the LLM only ever saw as Markdown. And the LLM and PyMuPDF don't tokenize the same way.
1. Why not just a regex?
First idea: before sending the CV to the LLM, you replace the sensitive data with one big regex. That works for emails and phone numbers, which have a recognizable format. For everything else, forget it.
- A name has no distinctive syntactic shape.
Paul Martinlooks like any two capitalized words; nothing in the text tells a regex it's a name. -
Orangeis a company. It's also a fruit.Mars,Apple,Carrefour, same story. - A date in a CV can be a birth, a degree, a job change. The format is identical.
You need a trained detector, not a pattern. piighost provides one, and the call looks like this:
# src/proofreader/anonymize.py
async def anonymize(self, text: str, *, thread_id: str) -> str:
return await self._call(
"/v1/anonymize", text, thread_id, response_key="anonymized_text"
)
The thread_id is a UUID per CV. The entity→placeholder mapping stays server-side, scoped by that ID: the same name becomes the same placeholder on every occurrence.
2. Streaming the mistakes with instructor
A two-page CV holds a good fifteen mistakes, and the LLM takes several seconds to spit them out. Without streaming, the user stares at a loader the whole time. With it, the mistakes show up one by one as the model emits them.
The catch: most structured-output libs (LangChain with_structured_output, OpenAI Functions, Pydantic AI) return the complete result. You ask for a list[Mistake], you get the whole list once inference is done. No object-by-object granularity.
instructor is built for exactly this. Its create_iterable method parses the LLM's streamed JSON on the fly and yields each pydantic object as soon as it's complete:
# src/proofreader/llm.py
client = instructor.from_litellm(litellm.acompletion)
response = client.chat.completions.create_iterable(
model=model,
response_model=Mistake, # a single object, not list[Mistake]
messages=[
{"role": "system", "content": SYSTEM_PROMPT_STREAM.format(language=language)},
{"role": "user", "content": markdown},
],
)
async for mistake in response:
yield mistake
Two complications that aren't obvious:
The prompt changes with the mode. For LangChain's
with_structured_output, you ask the LLM to return a wrapper object with a list of Mistakes inside. Forcreate_iterable, you ask it to emit a single Mistake JSON per generation turn. The two prompts aren't quite the same. The project keeps both side by side: LangChain for the one-shot Streamlit path,instructorfor the FastAPI streaming path.The SSE streaming downstream. Each
Mistakeemitted is immediately repackaged into a Server-Sent Events event on the FastAPI side, then pushed to the frontend. The locator from the next section runs per-Mistake, so the user watches each red rectangle pop up as it goes, not all at once at the end.
3. Back onto the PDF: four fallback strategies
For each Mistake instructor yields, I have an error_text, a correction, a context_before, and a description. The LLM, though, never saw a single pixel of the PDF: it worked on the extracted Markdown. No field carries coordinates.
But the user wants to see the corrections on the original PDF, not flat text on a results page. So for each mistake, I have to find the word back in the PDF.
On the PDF side, I use PyMuPDF, which gives me a word stream: the list of every word on the page with its bbox (rectangles in points). The problem becomes: find the window [word1, word2, …] in that list. Except the LLM and PyMuPDF don't tokenize the same way, the typographic apostrophes don't line up, and on a two-column CV the LLM sometimes hallucinates its context_before.
Hence four strategies tried in order. Each one catches a case the previous can't handle:
# src/proofreader/locator.py
def locate_mistake(mistake: Mistake, *, words: list[Word]) -> LocatedMistake | None:
err_tokens = mistake.error_text.split()
if not err_tokens:
return None
ctx_tokens = mistake.context_before.split()
# Strategy 1: strict whole-word match.
matched = _match_window(ctx_tokens, err_tokens, words, normalize=False)
if matched is not None:
return _build_located(mistake, matched)
# Strategy 2: punctuation-tolerant (casefold + ASCII quotes + strip punct).
matched = _match_window(ctx_tokens, err_tokens, words, normalize=True)
if matched is not None:
return _build_located(mistake, matched)
# Strategy 3: error_text alone if it appears exactly once on the page.
# Catches LLM context drift in multi-column layouts.
matched = _find_error_alone_if_unique(err_tokens, words)
if matched is not None:
return _build_located(mistake, matched)
# Strategy 4: substring of the concatenated normalised stream. Handles LLM
# tokenisation drift like `d'une` → `d' + une`, where the standalone word
# has no PyMuPDF token equivalent.
matched = _find_error_as_substring_if_unique(err_tokens, words)
if matched is not None:
return _build_located(mistake, matched)
return None
Why this exact order:
Strict. The
context_before + error_textwindow matches word for word, no normalization. The happy case: the LLM quotes the PDF perfectly, exact match, zero ambiguity.Tolerant. The LLM capitalizes the first word of a sentence, or swaps
'for'(a typographic apostrophe)._normalizecasefolds everything, replaces curly quotes and apostrophes with their ASCII version, and strips the punctuation PyMuPDF glues onto tokens.Error-only unique. On two-column CVs, the
context_beforethe LLM produces is sometimes lifted from the wrong column (models linearize multi-column layouts clumsily). Iferror_textappears exactly once on the page, take it, context be damned. That's enough in the vast majority of cases.Substring of the concatenated stream. Nasty case:
d'uneis one word to the LLM, but PyMuPDF tokenizes it asd'+une. The LLM may returnerror_text="une"as a standalone word with no matching PyMuPDF token. Fix: concatenate all the page's tokens into a single string and search by substring. We gate on_MIN_SUBSTRING_CHARS = 5, because without it anerror_text="une"shows up insidecommune,lacune,tribune. Cue the false positives.
If none of the four catches anything, the mistake lands in a "Not located" section of the result instead of being silently dropped. A visible mistake the user can read but that has no red rectangle is less bad than a mistake we claim is somewhere it isn't.
Takeaways
If you're hacking on something similar, three things to remember:
- A regex doesn't detect names, companies or dates. You need a trained detector.
- If you want to stream structured output (pydantic objects on the fly, not the whole list at the end), the usual libs won't cut it.
instructoris built for that. - If the LLM works on text extracted from a document (PDF, OCR, scans), it hands you mistakes with no coordinates. You have to relocate them afterward, and accept it won't always be possible.
piighost handles the first point. instructor handles the second. The third is what made me write this project, whose code is open.
- Example of app : https://piighost-proofreader.athroniaeth.cloud/
- piighost: github.com/Athroniaeth/piighost, the anonymization lib used here.
- piighost-proofreader: github.com/Athroniaeth/piighost-proofreader, the full project, live demo, locator included.
Issues and PRs welcome. If you work with private text in an LLM, the three points above will probably ring a bell.

Top comments (0)