A client sends you a PDF brochure and asks you to update the tagline.
You open it. It looks nicely designed and clean.
You click on a word and start typing. A harmless substitution. You save.
When you open it again, it is wrong - garbage. Weird characters. Wrong spacing. Stuff moved.
You try another tool. It does not really fix the old text. It paints a new word on top instead. The old word is still there. You can still search for it. A screen reader can still see it.
You think you are editing a document. You are not.
PDF calls itself a document and it pretends to be one, but internally it is mostly commands to a drawing machine:
- move here
- use this font
- draw that glyph
- draw a line
- one more line
- fill that
A stream of instructions, indifferent to the idea that someone, somewhere might want to change a single word.
In this article, I will take you on a deep dive into how PDFs represent text, and why “just replacing a word” is so hard. Then we will look at what a reliable solution has to do differently.
How text is represented in PDF
Let's have a closer look at how PDFs work internally.
A PDF page is drawn by interpreting a content stream - a sequence of operators (drawing instructions) with operands.
There are 73 content stream operators defined in PDF, grouped roughly like this:

| Category | Count | Operators |
|---|---|---|
| Path Construction | 7 | m, l, c, v, y, h, re |
| Path Painting | 10 | S, s, f, F, f*, B, B*, b, b*, n |
| Clipping | 2 | W, W* |
| Text State | 7 | Tc, Tw, Tz, TL, Tf, Tr, Ts |
| Text Positioning | 4 | Td, TD, Tm, T* |
| Text Showing | 4 | Tj, TJ, ', " |
| Text Object | 2 | BT, ET |
| Color | 12 | CS, cs, SC, SCN, sc, scn, G, g, RG, rg, K, k |
| Graphics State | 11 | q, Q, cm, w, J, j, M, d, i, ri, gs |
| Shading | 1 | sh |
| External Objects | 1 | Do |
| Inline Images | 3 | BI, ID, EI |
| Marked Content | 5 | BMC, BDC, EMC, DP, MP |
| Type 3 Fonts | 2 | d0, d1 |
| Compatibility | 2 | BX, EX |
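To make "a sequence of operators with operands" concrete, here is a minimal Python sketch of that shape. It is not a real parser (it ignores strings containing spaces, nested arrays, dictionaries, comments, and inline image data), and the operator set is just a tiny subset for illustration:

```python
# Minimal illustration of "operands, then an operator" - NOT a real tokenizer.
OPERATORS = {"q", "Q", "cm", "BT", "ET", "Tf", "Td", "Tj", "re", "f", "W", "n"}

def split_instructions(stream: str):
    operands, instructions = [], []
    for token in stream.split():
        if token in OPERATORS:
            instructions.append((operands, token))
            operands = []
        else:
            operands.append(token)
    return instructions

for operands, op in split_instructions("q 2 0 0 2 0 0 cm 10 10 100 50 re f Q"):
    print(op, operands)
# q []
# cm ['2', '0', '0', '2', '0', '0']
# re ['10', '10', '100', '50']
# f []
# Q []
```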
When you see text in a PDF viewer, you might assume the file stores something like:

```
<h1>TITLE</h1>
```

or something like:

```
TJ TITLE
```

A typical PDF text draw looks more like this:
```
BT
/F1 24 Tf
72 720 Td
(<bytes>) Tj
ET
```
Or, when spacing adjustments are involved:
```
BT
/F1 24 Tf
72 720 Td
[(<bytes>) -120 (<more-bytes>)] TJ
ET
```
Important detail: those () are not “Unicode text”. They are bytes interpreted through the font’s encoding (and for many PDFs, through CMaps and CID mappings). The result is then used to pick glyphs to draw.
So conceptually, it is closer to:
bytes -> character codes (often CIDs) -> glyphs -> paint shapes
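As a sketch of that indirection (the lookup tables here are made up; real fonts ship their own encodings and CMaps):

```python
# Hypothetical lookup tables - the point is only that the bytes are keys
# into font-specific tables, not text.
ENCODING = {0x01: 72, 0x02: 101}                 # byte -> character code / CID
CODE_TO_GLYPH = {72: "gid_417", 101: "gid_88"}   # code -> glyph in the font program

for b in b"\x01\x02":
    code = ENCODING[b]            # depends on the font's encoding / CMap
    glyph = CODE_TO_GLYPH[code]   # depends on the embedded font
    print(f"byte {b:#04x} -> code {code} -> paint {glyph}")
```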
Wait, what's a glyph?
A glyph is the visual representation of a character, and it differs from font to font.
What the pipeline above ends up selecting is a tiny vector graphic - not the semantic character.
The same character can map to visibly different glyphs: compare a lowercase 'g' in Arial with one in Open Sans.
That distinction matters because when you “replace a character”, you are not replacing “an ‘o’”. You are changing bytes that might map to a completely different glyph depending on the font’s internal encoding.
But can’t we just map it back to Unicode?
Sometimes.
PDF has a mechanism (ToUnicode) to map glyphs back to Unicode so copy/paste and search can work. But in the real world:
- the mapping can be missing
- the mapping can be incomplete
- the mapping can be wrong
- and reverse mapping (Unicode -> the right glyph code) is often not possible
ToUnicode is the optional mapping table that tells a PDF viewer “this glyph code corresponds to this Unicode character.”
In theory it makes copy/paste, search, and accessibility work.
In practice it is frequently missing or wrong for a few boring-but-common reasons.
Many PDFs are produced by toolchains that prioritize visual output over text semantics (print pipelines, converters, some design apps), so they never emit a correct mapping.
Subset fonts and custom encodings make it worse - the bytes in the content stream are often not Unicode, and without the exact font/CMap logic the producer can easily write a mapping that is incomplete or shifted.
And sometimes it is wrong on purpose: some generators intentionally scramble mappings to discourage copying, or they embed “best effort” text that looks right on screen but decodes to nonsense for anything trying to edit it.
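A small Python sketch of what an incomplete ToUnicode table does to both directions (the table itself is made up):

```python
# Hypothetical ToUnicode table for a subset font: glyph code -> Unicode.
# Note there is no entry for the glyph that draws 'x'.
TO_UNICODE = {0x0001: "F", 0x0002: "a"}

def decode(codes):                      # what copy/paste and search rely on
    return "".join(TO_UNICODE.get(c, "\ufffd") for c in codes)

def encode(text):                       # what an editor needs: Unicode -> glyph code
    reverse = {v: k for k, v in TO_UNICODE.items()}
    return [reverse[ch] for ch in text]

print(decode([0x0001, 0x0002, 0x0003]))  # 'Fa\ufffd' - incomplete mapping
print(encode("Fo"))                       # KeyError: 'o' has no reverse mapping
```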
We are still at the character level though. But what about a word? How is a word represented in PDF?
The CID Soup Problem
Let’s say the word on the page is:
Hello
In a friendly PDF, you might see one text showing operator that draws those letters in order.
In many real PDFs, you do not.
A PDF is free to draw letters in any order, split across multiple operations, interleave with other drawings, and place glyphs anywhere on the page. You can draw the first letter last - no problem.
Conceptually, you can end up with something like:
- draw “H” at x=0
- draw some other stuff
- draw “o” at x=120
- draw “l” at x=66
- draw “e” at x=12
- draw more stuff
- draw “l” at x=72
Same visible word, but the drawing order is not the reading order.
Two consequences:
- Finding the word “Hello” is already non-trivial - you cannot assume the characters are adjacent in the stream.
- Even if you find the right glyphs, changing one of them can change spacing, kerning, and layout in ways the PDF never explicitly modeled as “a word”.
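To see why, here is a toy version of that first consequence: before you can even search for the word, you have to rebuild reading order from geometry (real documents also need baseline grouping, transforms, and per-glyph widths):

```python
# Glyphs as recovered from the content stream, in drawing order.
glyphs = [("H", 0), ("o", 120), ("l", 66), ("e", 12), ("l", 72)]

# Reading order has to be reconstructed from positions, not from the stream.
reading_order = "".join(ch for ch, x in sorted(glyphs, key=lambda g: g[1]))
print(reading_order)  # Hello
```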
Font subsetting makes it worse again:
- Most PDFs embed only the handful of glyphs that actually appear in the file, so the font inside the PDF is incomplete by design.
- If the glyph you need for an edit isn’t already in that subset (even for a common letter), there is nothing to “switch to” - you either have to extend the embedded font or substitute a different one and then deal with the layout fallout.
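If you have already extracted the embedded font program from the PDF (itself a separate chunk of work), a quick check with fontTools shows whether the subset can even draw the character you need - the file name here is just a placeholder:

```python
from fontTools.ttLib import TTFont

font = TTFont("subset.ttf")                # hypothetical extracted subset font
cmap = font["cmap"].getBestCmap()          # Unicode code point -> glyph name
if ord("o") not in cmap:
    print("no 'o' glyph: extend the subset or substitute another font")
```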
The Graphics State Stack
Ready for the next PDF internal that makes editing even harder?
PDF operators are interpreted in context. That context is managed via a stack - you can push and pop drawing state while nesting transformations, clipping, and style settings.
You will see this pattern everywhere:
```
q
...set transform, clip, colors...
...draw things...
Q
```
That q/Q pair saves and restores the current graphics state.
The graphics state can include (among other things):
- transformations (CTM): translate, scale, rotate, skew
- stroking/non-stroking colors
- line width and dash patterns
- clipping paths
Text itself also has its own state (font, size, spacing), typically used inside BT/ET text objects. But the key point remains:
The final position of a glyph on the page is often the result of multiple nested transformations.
Example:
- outer CTM scales by 2x
- inner CTM scales by 2x
- result inside the inner scope is 4x
For editing, that means:
- the “x,y” you see in a text operator may not be in page coordinates
- you must evaluate the transformation stack to know where a glyph ends up
- if you insert or replace text, you must apply the same transforms (or reconstruct the local coordinate space)
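A minimal sketch of evaluating that stack, using the usual PDF matrix form [a b c d e f]; it reproduces the 2x-inside-2x example above:

```python
def multiply(m, n):                       # compose two PDF matrices (m applied first)
    a, b, c, d, e, f = m
    A, B, C, D, E, F = n
    return (a*A + b*C, a*B + b*D,
            c*A + d*C, c*B + d*D,
            e*A + f*C + E, e*B + f*D + F)

def apply(m, x, y):                       # map a point through a matrix
    a, b, c, d, e, f = m
    return (a*x + c*y + e, b*x + d*y + f)

stack, ctm = [], (1, 0, 0, 1, 0, 0)       # identity CTM

stack.append(ctm); ctm = multiply((2, 0, 0, 2, 0, 0), ctm)  # q ... 2 0 0 2 0 0 cm
stack.append(ctm); ctm = multiply((2, 0, 0, 2, 0, 0), ctm)  # q ... 2 0 0 2 0 0 cm

print(apply(ctm, 10, 20))                 # (40, 80): effectively 4x
ctm = stack.pop()                         # Q restores the outer state
```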
Clipping Masks
Clipping is omnipresent in real PDFs.
A clipping path defines the visible region. Everything you draw after setting a clip is only visible where it overlaps the clipping path.
Simple example:
- start with a full A4 page
- define a clipping rectangle (50x50) in the center
- now draw anything - only the part inside that rectangle will show
This is how PDFs crop images, mask shapes, and sometimes even create text effects.
So it basically means:
draw whatever you want, but only show the parts that fall inside this path.
The path can be anything:
- rectangle
- circle
- arbitrary logo outline
- combination of shapes
- even glyph shapes (yes, text can act as a clipping path)
For editing, clipping is a frequent source of surprises:
- move text slightly -> it gets partially clipped
- remove or change a clip -> hidden content may suddenly appear
- insert new text -> it might be fully clipped without you noticing
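A tiny sketch of the sanity check an editor needs before emitting new text, simplified to an axis-aligned rectangular clip in page space (real clips can be arbitrary paths):

```python
def fully_clipped(text_bbox, clip_rect):
    tx0, ty0, tx1, ty1 = text_bbox
    cx0, cy0, cx1, cy1 = clip_rect
    # No overlap at all -> the new text would be completely invisible.
    return tx1 <= cx0 or tx0 >= cx1 or ty1 <= cy0 or ty0 >= cy1

clip = (272.5, 396, 322.5, 446)                     # 50x50 box centered on an A4 page
print(fully_clipped((60, 700, 180, 724), clip))     # True: header text would never show
```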
Putting it all together
Given all this, how is it even possible to edit text?
Let’s use a concrete example in a moderately complex PDF:
- embedded TrueType subset font
- one transformation matrix in effect
- a simple clipping mask
There is a typo we want to fix: change “Fax” to “Fox”.
A naive approach might look like this:

1. Locate the occurrences of “Fax”
   - Walk the content stream and find text showing operators (Tj, TJ, ', ").
   - For each candidate, try to decode to Unicode via ToUnicode (if present).
   - Look for an occurrence that decodes to “Fax”.
2. Verify the font can render “o”
   - In many PDFs, the font is subsetted and may not include the glyph for “o” unless “o” is used somewhere with that same font subset.
3. Replace the middle glyph
   - Change the encoded bytes / codes so the drawn glyph becomes “o” while preserving spacing.
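In code, the naive approach boils down to something like this sketch. The data structures are toys (a text-showing op is just glyph codes plus a position, and ToUnicode is a plain dict); every line hides real-world work:

```python
to_unicode = {1: "F", 2: "a", 3: "x", 4: "o"}        # hypothetical subset font mapping
reverse = {v: k for k, v in to_unicode.items()}
text_ops = [([1, 2, 3], 72, 720)]                    # draws "Fax" at (72, 720)

def replace_naive(ops, old, new):
    for codes, x, y in ops:
        if "".join(to_unicode.get(c, "?") for c in codes) != old:
            continue
        if any(ch not in reverse for ch in new):     # glyph missing from the subset?
            raise ValueError(f"cannot encode {new!r} with this font")
        codes[:] = [reverse[ch] for ch in new]       # swap the glyph codes in place
        return True
    return False

print(replace_naive(text_ops, "Fax", "Fox"), text_ops)  # True [([1, 4, 3], 72, 720)]
```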
This can work - but only in the happy path.
It breaks when:
- Characters are not drawn contiguously or in reading order (the CID soup problem).
- The subset font does not contain the target glyph (common).
- The Unicode mapping is missing/wrong, so you cannot reliably find or reverse-map the glyph codes.
- The text is drawn under transforms/clips you did not account for.
Disambiguation makes it worse
Now make it slightly more realistic:
- There are two occurrences of “Fax” on the page.
- One is correct: a fax number at the bottom.
- One is wrong: the typo in the header.
If you only search for the decoded word “Fax”, you do not know which one you found.
To tell them apart, you must:
- evaluate the graphics state stack (q/Q) around the operators you matched
- compute the effective transformation matrix at that point
- account for clipping
- and determine where on the page those glyphs actually render
Only then can you say: “this occurrence is at the top, so this is the typo”.
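Once positions are resolved into page space, the disambiguation itself is the easy part - a sketch:

```python
# Both occurrences decode to "Fax"; only their resolved page positions differ.
occurrences = [
    {"text": "Fax", "page_y": 790},   # header - the typo
    {"text": "Fax", "page_y": 40},    # footer - the real fax number
]
typo = max(occurrences, key=lambda o: o["page_y"])   # "top of the page" wins
print(typo)
```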
And if you want to replace a whole line or paragraph, even this approach stops being practical. PDFs often do not store enough semantic structure (word boundaries, line breaks, paragraph structure) to make “edit a paragraph” a simple mutation.
So what does a reliable solution do?
The solution: semantic reconstruction
In the end, there is no way around a semantic reconstruction approach:
- extract all drawn text fragments (with their geometry)
- normalize them into a layout model (baselines, fonts, sizes, spacing)
- group into lines (and handle columns)
- infer paragraphs (harder than it sounds)
- decide where replacement text should go
- and then re-emit content that matches the original look:
  - correct font appearance (or a compatible substitute)
  - line spacing, kerning, ligatures
  - correct interaction with transforms and clipping
By “reconstruction” I mean rebuilding the structure a PDF doesn’t store:
- you start from what the page actually draws (glyphs + exact positions + the active transforms/clips)
- you turn that back into a layout you can reason about (characters -> words -> lines -> paragraphs)
- you apply the edit in that model
- then write new PDF instructions that reproduce the same visual result
Concretely, instead of “swap one byte code”, semantic editing looks like this:
this line contains Fax at (x,y) in this embedded font with these spacing adjustments -> replace it with Fox -> re-layout just enough to keep the same baseline and spacing, so nothing jumps and nothing gets clipped.
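A sketch of the very first reconstruction steps - clustering glyphs into lines by baseline, then sorting each line by x. A real implementation also has to handle columns, rotated text, varying font sizes, and tolerances, but the shape is this:

```python
from collections import defaultdict

glyphs = [  # (char, x, baseline_y) in page space, in drawing order
    ("o", 120, 700), ("H", 72, 700), ("l", 104, 700), ("e", 88, 700), ("l", 112, 700),
    ("F", 72, 680), ("x", 98, 680), ("a", 86, 680),
]

lines = defaultdict(list)
for ch, x, y in glyphs:
    lines[round(y)].append((x, ch))      # naive baseline clustering

for y in sorted(lines, reverse=True):    # top of the page first
    print("".join(ch for _, ch in sorted(lines[y])))
# Hello
# Fax
```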
This is the approach I ended up building.
Try it
I built PDFDancer to make real-world PDFs editable at a higher level (text, fonts, graphics) while preserving the original layout.
For early adopters, we offer a free Pro plan until December 17, 2025.
Once you treat PDFs as drawing programs, the path to real editing is obvious: reconstruction, not replacement. The surprising part is how far you can take it when you do that properly.




