Michael Lahr

Why is PDF so hard to edit?

A client sends you a PDF brochure and asks you to update the tagline.

You open it. It looks nicely designed and clean.

You click on a word and start typing. A harmless substitution. You save.

When you open it again, it is garbage. Weird characters. Wrong spacing. Things have moved.

You try another tool. It does not really fix the old text. It paints a new word on top instead. The old word is still there. You can still search for it. A screen reader can still see it.

You think you are editing a document. You are not.


PDF calls itself a document and it pretends to be one, but internally it is mostly commands to a drawing machine:

  • move here
  • use this font
  • draw that glyph
  • draw a line
  • one more line
  • fill that

A stream of instructions, indifferent to the idea that someone, somewhere might want to change a single word.


In this article, I will take you on a deep dive into how PDFs represent text, and why “just replacing a word” is so hard. Then we will look at what a reliable solution has to do differently.

How text is represented in PDF

Let's have a closer look at how PDFs work internally.

A PDF page is drawn by interpreting a content stream - a sequence of operators (drawing instructions) with operands.

There are 73 content stream operators (drawing instructions) defined in PDF.

Category Count Examples
Path Construction 7 m, l, c, v, y, h, re
Path Painting 10 S, s, f, F, f*, B, B*, b, b*, n
Clipping 2 W, W*
Text State 7 Tc, Tw, Tz, TL, Tf, Tr, Ts
Text Positioning 4 Td, TD, Tm, T*
Text Showing 4 Tj, TJ, ', "
Text Object 2 BT, ET
Color 12 CS, cs, SC, SCN, sc, scn, G, g, RG, rg, K, k
Graphics State 11 q, Q, cm, w, J, j, M, d, i, ri, gs
Shading 1 sh
External Objects 1 Do
Inline Images 3 BI, ID, EI
Marked Content 5 BMC, BDC, EMC, DP, MP
Type 3 Fonts 2 d0, d1
Compatibility 2 BX, EX

When you see text in a PDF viewer, you might assume the file stores something like:

<h1>TITLE</h1>

or something like

TJ TITLE

A typical PDF text draw looks more like this:

BT
/F1 24 Tf
72 720 Td
(<bytes>) Tj
ET

Or, when spacing adjustments are involved:

BT
/F1 24 Tf
72 720 Td
[(<bytes>) -120 (<more-bytes>)] TJ
ET

Important detail: those () are not “Unicode text”. They are bytes interpreted through the font’s encoding (and for many PDFs, through CMaps and CID mappings). The result is then used to pick glyphs to draw.

So conceptually, it is closer to:

bytes -> character codes (often CIDs) -> glyphs -> paint shapes
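
To make that pipeline concrete, here is a minimal Python sketch of what a viewer conceptually does with the bytes inside a Tj operand. The two lookup tables are invented stand-ins for what a real font's encoding and ToUnicode CMap would provide:

GLYPH_FOR_CODE = {0x01: "glyph_H", 0x02: "glyph_e", 0x03: "glyph_l", 0x04: "glyph_o"}  # hypothetical encoding
UNICODE_FOR_CODE = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}  # hypothetical ToUnicode-style table

def show_text(operand: bytes) -> None:
    """Roughly what happens to the bytes inside (...) Tj."""
    for code in operand:
        glyph = GLYPH_FOR_CODE.get(code)            # pick the shape to paint
        char = UNICODE_FOR_CODE.get(code, "?")      # only needed for copy/search, not for drawing
        print(f"code 0x{code:02x} -> {glyph} (decodes to {char!r})")

show_text(bytes([0x01, 0x02, 0x03, 0x03, 0x04]))    # paints the glyphs for "Hello"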

Wait, what's a glyph?

A glyph is the visual representation of a character, and it differs from font to font.
So what the viewer "draws" is really a lookup of tiny vector graphics - not the semantic character.

Same character, different glyph (Arial vs OpenSans lowercase 'g'):

[Image: Arial vs OpenSans lowercase 'g']

That distinction matters because when you “replace a character”, you are not replacing “an ‘o’”. You are changing bytes that might map to a completely different glyph depending on the font’s internal encoding.

But can’t we just map it back to Unicode?

Sometimes.

PDF has a mechanism (ToUnicode) to map glyphs back to Unicode so copy/paste and search can work. But in the real world:

  • the mapping can be missing

  • the mapping can be incomplete

  • the mapping can be wrong

  • and reverse mapping (Unicode -> the right glyph code) is often not possible


ToUnicode is the optional mapping table that tells a PDF viewer “this glyph code corresponds to this Unicode character.”

In theory it makes copy/paste, search, and accessibility work.

In practice it is frequently missing or wrong for a few boring-but-common reasons.

Many PDFs are produced by toolchains that prioritize visual output over text semantics (print pipelines, converters, some design apps), so they never emit a correct mapping.

Subset fonts and custom encodings make it worse - the bytes in the content stream are often not Unicode, and without the exact font/CMap logic the producer can easily write a mapping that is incomplete or shifted.

And sometimes it is wrong on purpose: some generators intentionally scramble mappings to discourage copying, or they embed “best effort” text that looks right on screen but decodes to nonsense for anything trying to edit it.
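
You can see how common the problem is by checking your own files. A rough sketch with pikepdf (brochure.pdf is a placeholder file name; the dictionary-style access follows pikepdf's documented API):

import pikepdf

# Sketch: report which page fonts carry a ToUnicode CMap at all.
with pikepdf.open("brochure.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        resources = page.obj.get("/Resources")
        fonts = resources.get("/Font") if resources is not None else None
        if fonts is None:
            continue
        for font_name, font in fonts.items():
            status = "present" if "/ToUnicode" in font else "missing"
            print(f"page {page_number} {font_name}: ToUnicode {status}")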


So far we are still at the character level. But what about a word - how is a word represented in PDF?

The CID Soup Problem

Let’s say the word on the page is:

Hello

In a friendly PDF, you might see one text showing operator that draws those letters in order.

In many real PDFs, you do not.

A PDF is free to draw letters in any order, split across multiple operations, interleave with other drawings, and place glyphs anywhere on the page. You can draw the first letter last - no problem.

Conceptually, you can end up with something like:

  • draw “H” at x=0

  • draw some other stuff

  • draw “o” at x=120

  • draw “l” at x=66

  • draw “e” at x=12

  • draw more stuff

  • draw “l” at x=72

[Image: CID soup example]

Same visible word, but the drawing order is not the reading order.

Two consequences:

  1. Finding the word “Hello” is already non-trivial - you cannot assume the characters are adjacent in the stream.

  2. Even if you find the right glyphs, changing one of them can change spacing, kerning, and layout in ways the PDF never explicitly modeled as “a word”.
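
Recovering the reading order therefore means collecting every glyph together with its resolved position and sorting geometrically, instead of trusting the stream order. A minimal sketch (plain Python, one baseline, page coordinates already computed):

# Glyphs in content-stream order, using the positions from the example above.
drawn = [("H", 0), ("o", 120), ("l", 66), ("e", 12), ("l", 72)]

# On a single baseline, reading order is just "sort by x". Real extraction also has
# to group by baseline, handle columns, rotation, and right-to-left scripts.
reading_order = sorted(drawn, key=lambda glyph: glyph[1])
print("".join(char for char, _ in reading_order))   # -> Hello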

Font subsetting makes it worse again:
most PDFs embed only the handful of glyphs that actually appear in the file, so the font inside the PDF is incomplete by design.

If the glyph you need for an edit isn’t already in that subset (even for a common letter), there is nothing to “switch to” - you either have to extend the embedded font or substitute a different one and then deal with the layout fallout.
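
Once the embedded font program has been extracted from the PDF, you can check the subset with fontTools. A sketch, assuming subset.ttf stands for the extracted TrueType program and that it still carries a usable cmap table (CID-keyed subsets often do not):

from fontTools.ttLib import TTFont

font = TTFont("subset.ttf")          # placeholder path for the extracted font program
cmap = font.getBestCmap()            # Unicode code point -> glyph name

if ord("o") in cmap:
    print("subset has a glyph for 'o':", cmap[ord("o")])
else:
    print("no glyph for 'o' - extend the embedded font or substitute another one")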

The Graphics State Stack

Ready for the next PDF internal that makes editing even harder?

PDF operators are interpreted in context. That context is managed via a stack - you can push and pop drawing state while nesting transformations, clipping, and style settings.

You will see this pattern everywhere:

q
  ...set transform, clip, colors...
  ...draw things...
Q

That q/Q pair saves and restores the current graphics state.

The graphics state can include (among other things):

  • transformations (CTM): translate, scale, rotate, skew

  • stroking/non-stroking colors

  • line width and dash patterns

  • clipping paths

Text itself also has its own state (font, size, spacing), typically used inside BT/ET text objects. But the key point remains:

The final position of a glyph on the page is often the result of multiple nested transformations.

Example:

  • outer CTM scales by 2x

  • inner CTM scales by 2x

  • result inside the inner scope is 4x

For editing, that means:

  • the “x,y” you see in a text operator may not be in page coordinates

  • you must evaluate the transformation stack to know where a glyph ends up

  • if you insert or replace text, you must apply the same transforms (or reconstruct the local coordinate space)
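
The bookkeeping behind that is matrix math plus a stack. A small sketch of how nested q/Q scopes and cm operators combine, written in Python with matrices in PDF's [a b c d e f] form:

def multiply(m, n):
    """Compose two PDF matrices: the result applies m first, then n."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)

IDENTITY = (1, 0, 0, 1, 0, 0)
SCALE_2X = (2, 0, 0, 2, 0, 0)

stack, ctm = [], IDENTITY
stack.append(ctm); ctm = multiply(SCALE_2X, ctm)   # q ... 2 0 0 2 0 0 cm  (outer scope)
stack.append(ctm); ctm = multiply(SCALE_2X, ctm)   # q ... 2 0 0 2 0 0 cm  (inner scope)
print(ctm)                                         # -> (4, 0, 0, 4, 0, 0): effectively 4x
ctm = stack.pop()                                  # Q restores the 2x state
ctm = stack.pop()                                  # Q restores the identity

Any x,y you read out of a Td or Tm only becomes a page coordinate after being pushed through whatever ctm is in effect at that point.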

[Image: transformation matrices in action]

Clipping Masks

Clipping is omnipresent in real PDFs.

A clipping path defines the visible region. Everything you draw after setting a clip is only visible where it overlaps the clipping path.

Simple example:

  • start with a full A4 page

  • define a clipping rectangle (50x50) in the center

  • now draw anything - only the part inside that rectangle will show

[Image: clipping mask in action]

This is how PDFs crop images, mask shapes, and sometimes even create text effects.

So it basically means:

draw whatever you want, but only show the parts that fall inside this path.

The path can be anything:

  • rectangle

  • circle

  • arbitrary logo outline

  • combination of shapes

  • even glyph shapes (yes, text can act as a clipping path)

For editing, clipping is a frequent source of surprises:

  • move text slightly -> it gets partially clipped

  • remove or change a clip -> hidden content may suddenly appear

  • insert new text -> it might be fully clipped without you noticing
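
Guarding against that means knowing the clip that is active wherever the new content lands. A tiny sketch of the kind of check that catches the fully-clipped case, simplified to axis-aligned rectangles (real clips can be arbitrary paths, even glyph outlines):

from typing import NamedTuple, Optional

class Rect(NamedTuple):
    x: float       # lower-left corner, page coordinates
    y: float
    width: float
    height: float

def visible_part(content: Rect, clip: Rect) -> Optional[Rect]:
    """Intersect a drawn box with the active clip; None means it is fully clipped away."""
    x0, y0 = max(content.x, clip.x), max(content.y, clip.y)
    x1 = min(content.x + content.width, clip.x + clip.width)
    y1 = min(content.y + content.height, clip.y + clip.height)
    if x1 <= x0 or y1 <= y0:
        return None
    return Rect(x0, y0, x1 - x0, y1 - y0)

clip = Rect(272.5, 396, 50, 50)            # the 50x50 clip in the middle of an A4 page
new_text = Rect(72, 720, 200, 24)          # where we would like to insert a line of text
print(visible_part(new_text, clip))        # -> None: the inserted text would be invisible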

Putting it all together

Given all this, how is it even possible to edit text?

Let’s use a concrete example in a moderately complex PDF:

  • embedded TrueType subset font

  • one transformation matrix in effect

  • a simple clipping mask

There is a typo we want to fix: change “Fax” to “Fox”.

A naive approach might look like this:

  1. Locate the occurrences of “Fax”

    1. Walk the content stream and find text showing operators (Tj, TJ, ', ").
    2. For each candidate, try to decode to Unicode via ToUnicode (if present).
    3. Look for an occurrence that decodes to “Fax”.
  2. Verify the font can render “o”

    In many PDFs, the font is subsetted and may not include the glyph for “o” unless “o” is used somewhere with that same font subset.

  3. Replace the middle glyph

    Change the encoded bytes / codes so the drawn glyph becomes “o” while preserving spacing.
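
Under ideal assumptions, those three steps fit in a few lines. The sketch below is deliberately naive: the byte values and encoding tables are invented, and it only works when the whole word sits in one text showing operator with one byte per glyph:

UNICODE_FOR_CODE = {0x29: "F", 0x44: "a", 0x51: "o", 0x5B: "x"}   # hypothetical subset encoding
CODE_FOR_UNICODE = {char: code for code, char in UNICODE_FOR_CODE.items()}

def decode(operand: bytes) -> str:
    return "".join(UNICODE_FOR_CODE.get(code, "\ufffd") for code in operand)

def naive_fix(tj_operand: bytes, old: str, new: str) -> bytes:
    text = decode(tj_operand)
    if old not in text:                                            # step 1: locate
        return tj_operand
    if any(char not in CODE_FOR_UNICODE for char in new):          # step 2: glyph available?
        raise ValueError("subset font has no code for the replacement glyph")
    start = text.index(old)
    patched = bytearray(tj_operand)
    patched[start:start + len(old)] = bytes(CODE_FOR_UNICODE[c] for c in new)   # step 3: replace
    return bytes(patched)

operand = bytes([0x29, 0x44, 0x5B])                 # decodes to "Fax" with the table above
print(decode(naive_fix(operand, "Fax", "Fox")))     # -> Fox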

This can work - but only in the happy path.

It breaks when:

  • Characters are not drawn contiguously or in reading order (the CID soup problem).

  • The subset font does not contain the target glyph (common).

  • The Unicode mapping is missing/wrong, so you cannot reliably find or reverse-map the glyph codes.

  • The text is drawn under transforms/clips you did not account for.

Disambiguation makes it worse

Now make it slightly more realistic:

There are two occurrences of “Fax” on the page.

  • One is correct: a fax number at the bottom.

  • One is wrong: the typo in the header.

If you only search for the decoded word “Fax”, you do not know which one you found.

To tell them apart, you must:

  • evaluate the graphics state stack (q/Q) around the operators you matched

  • compute the effective transformation matrix at that point

  • account for clipping

  • and determine where on the page those glyphs actually render

Only then can you say: “this occurrence is at the top, so this is the typo”.
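
In code, that boils down to pushing each match's text-space origin through its effective transform and comparing page positions. A sketch with invented numbers:

def page_position(x, y, ctm):
    """Map a text-space point through a PDF matrix [a b c d e f] into page space."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)

matches = [
    {"label": "header", "origin": (72, 60), "ctm": (1, 0, 0, 1, 0, 760)},
    {"label": "footer", "origin": (72, 40), "ctm": (1, 0, 0, 1, 0, 0)},
]

# PDF's y axis points up, so the occurrence with the larger y is the one at the top.
typo = max(matches, key=lambda m: page_position(*m["origin"], m["ctm"])[1])
print("the typo is the", typo["label"], "occurrence")   # -> header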

And if you want to replace a whole line or paragraph, even this approach stops being practical. PDFs often do not store enough semantic structure (word boundaries, line breaks, paragraph structure) to make “edit a paragraph” a simple mutation.

So what does a reliable solution do?

The solution: semantic reconstruction

In the end, there is no way around a semantic reconstruction approach:

  • extract all drawn text fragments (with their geometry)

  • normalize them into a layout model (baselines, fonts, sizes, spacing)

  • group into lines (and handle columns)

  • infer paragraphs (harder than it sounds)

  • decide where replacement text should go

  • and then re-emit content that matches the original look:

    • correct font appearance (or a compatible substitute)
    • line spacing, kerning, ligatures
    • correct interaction with transforms and clipping

By “reconstruction” I mean rebuilding the structure a PDF doesn’t store:

  • you start from what the page actually draws (glyphs + exact positions + the active transforms/clips)
  • you turn that back into a layout you can reason about (characters -> words -> lines -> paragraphs)
  • you apply the edit in that model
  • then write new PDF instructions that reproduce the same visual result

Concretely, instead of "swap one byte code", semantic editing looks like this:

this line contains Fax at (x,y) in this embedded font with these spacing adjustments -> replace it with Fox -> re-layout just enough to keep the same baseline and spacing, so nothing jumps and nothing gets clipped.
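
The early steps of that reconstruction (extract positioned fragments, group them into lines) already look something like this simplified sketch. Real layout recovery also has to deal with rotation, columns, hyphenation, and font metrics:

from collections import defaultdict

# Extracted fragments: (text, x, baseline_y), with transforms and clips already resolved.
fragments = [
    ("Fox", 110, 720), ("The", 40, 720), ("quick", 72, 720),
    ("over", 90, 700), ("jumps", 40, 700),
]

def group_into_lines(fragments, baseline_tolerance=2.0):
    """Bucket fragments by (roughly) shared baseline, then sort each line left to right."""
    lines = defaultdict(list)
    for text, x, y in fragments:
        lines[round(y / baseline_tolerance)].append((x, text))
    # Top of the page first (y grows upwards), words left to right within a line.
    return [" ".join(text for _, text in sorted(words))
            for _, words in sorted(lines.items(), reverse=True)]

print(group_into_lines(fragments))   # -> ['The quick Fox', 'jumps over']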

[Image: semantic reconstruction in action]

This is the approach I ended up building.

Try it

I built PDFDancer to make real-world PDFs editable at a higher level (text, fonts, graphics) while preserving the original layout.

For early adopters, we offer a free Pro plan until December 17, 2025.

Once you treat PDFs as drawing programs, the path to real editing is obvious: reconstruction, not replacement. The surprising part is how far you can take it when you do that properly.
