Pranav Mailarpawar

Posted on Mar 26

I Built a PDF to Word Converter That Places Every Word at Its Exact Coordinates

PDF to Word conversion is a solved problem if you don't care about accuracy. Export the text, dump it into paragraphs, call it a docx. The result opens in Word and technically contains the words from the original document, arranged in ways that bear no resemblance to the original layout.

Doing it accurately is a different problem entirely.

The PDF to Word converter inside ihatepdf.cv has two modes. Pixel Perfect — every page becomes a high-resolution JPEG embedded in the docx, visually identical to the original, text not editable. Ultra-Accurate Editable — every word extracted from the PDF's internal structure with its exact X/Y coordinates, font size, font family, bold, and italic state, then placed as an absolutely-positioned text box in the docx at its precise location. Both run entirely in your browser. No upload.

Here's how the accurate mode works.

What a PDF actually stores

A PDF is not a document format in the way Word is. It is closer to a list of drawing instructions. When a PDF renderer encounters text, it executes a series of PostScript-derived commands that describe exactly where to place each glyph on the page.

The key data structure is the text transform matrix. Every text item in a PDF has a 6-element array:

[a, b, c, d, tx, ty]

This is an affine transformation matrix. For horizontal text (which is most text), a and d are the horizontal and vertical scale factors, and tx, ty are the translation — the absolute position of the character on the page in PDF points.

The font size can be extracted directly from the matrix:

const [a, b, c, d, tx, ty] = item.transform;
const fontSize = Math.sqrt(a * a + b * b); // actual rendered size in pts

Math.sqrt(a² + b²) gives the magnitude of the horizontal basis vector — which is exactly the rendered font size accounting for any scaling applied to the text state.
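As a quick sanity check of that formula, here is a minimal sketch that applies it to two hand-written matrices. The helper name `describeTransform` and both sample matrices are made up for illustration; they are not part of PDF.js or of the converter itself.

```javascript
// Hypothetical helper: derive font size and rotation from a PDF text matrix.
function describeTransform([a, b, c, d, tx, ty]) {
  const fontSize = Math.hypot(a, b);                    // same as sqrt(a*a + b*b)
  const rotationDeg = Math.atan2(b, a) * 180 / Math.PI; // 0 for horizontal text
  return { fontSize, rotationDeg, tx, ty };
}

// Upright 12 pt text with its origin at (72, 700).
const upright = describeTransform([12, 0, 0, 12, 72, 700]);

// 10 pt text rotated 90 degrees counter-clockwise: a = 0, b = 10, c = -10, d = 0.
const rotated = describeTransform([0, 10, -10, 0, 300, 400]);

console.log(upright.fontSize, upright.rotationDeg);
console.log(rotated.fontSize, rotated.rotationDeg);
```

For the rotated run, a is zero, so reading the font size straight off `a` would give 0. Taking the vector magnitude still recovers 10 pt.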

The Y coordinate needs coordinate system conversion. PDFs use a bottom-left origin; HTML, CSS, and Word use top-left. The conversion:

const x = tx;
const y = viewport.height - ty; // flip Y axis

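This gives us the text origin measured from the top of the page, in the PDF's point coordinate system, which is what we need to place it in Word. The origin sits on the text baseline, so the top edge of a text box lies roughly one item height above this value.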
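To make the flip concrete, here is a small worked sketch. The helper name `toTopLeftBox` is hypothetical, and it assumes the item height is a good enough stand-in for the ascent when moving from the baseline up to the top edge of the box.

```javascript
// Hypothetical helper: convert a PDF-space text origin to a top-left box.
// Assumes ty is the baseline and approximates the ascent with the item height.
function toTopLeftBox(tx, ty, width, height, pageHeight) {
  const baselineFromTop = pageHeight - ty; // flip Y axis
  return {
    x: tx,
    y: baselineFromTop - height,           // top edge of the box
    w: width,
    h: height,
  };
}

// US Letter page (792 pt tall), 12 pt text with its origin at ty = 700.
const box = toTopLeftBox(72, 700, 150, 12, 792);
console.log(box); // { x: 72, y: 80, w: 150, h: 12 }
```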

Extracting the full text layer

The extraction function reads every text item from PDF.js's getTextContent() API:

async function extractNativeTextItems(pdfPage) {
  const content = await pdfPage.getTextContent({
    includeMarkedContent: false,
  });
  const vp = pdfPage.getViewport({ scale: 1 });

  const items = [];
  for (const item of content.items) {
    if (!item.str || !item.str.trim()) continue;

    const [a, b, c, d, tx, ty] = item.transform;
    const fontSize = Math.sqrt(a * a + b * b);
    const x = tx;
    const y = vp.height - ty;         // flip Y for top-left origin
    const w = item.width || fontSize * item.str.length * 0.6;
    const h = item.height || fontSize * 1.2;

    items.push({
      str:       item.str,
      x, y, w, h,
      fontSize,
      fontName:  (item.fontName || '').toLowerCase(),
      transform: item.transform,
      hasEOL:    item.hasEOL || false,
    });
  }
  return { items, pageW: vp.width, pageH: vp.height };
}

The item.fontName from PDF.js is the PDF's internal font identifier — typically something like BCDEAA+Arial-BoldItalic or g_d0_f1. The six-character prefix before the + is a PDF subset tag that can be ignored; everything after it is the actual font information.
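A minimal sketch of splitting off that subset tag, using the two example names above. The helper name `splitFontName` is hypothetical. Note that the tag is upper-case in the PDF, so this match has to run before the name is lower-cased (or use a case-insensitive pattern).

```javascript
// Hypothetical helper: split a PDF font name into subset tag and base name.
function splitFontName(fontName) {
  const match = /^([A-Z]{6})\+(.+)$/.exec(fontName);
  return match
    ? { subsetTag: match[1], baseName: match[2] }
    : { subsetTag: null, baseName: fontName };
}

const subsetFont = splitFontName('BCDEAA+Arial-BoldItalic');
const opaqueFont = splitFontName('g_d0_f1');

console.log(subsetFont); // { subsetTag: 'BCDEAA', baseName: 'Arial-BoldItalic' }
console.log(opaqueFont); // { subsetTag: null, baseName: 'g_d0_f1' }
```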

Font detection from internal names

PDF font names encode style information in their naming patterns. Bold and italic are detected with regex:

function detectFontStyle(fontName = '') {
  const f = fontName.toLowerCase();
  const isBold   = /bold|heavy|black|semibold|demi/i.test(f);
  const isItalic = /italic|oblique|slanted/i.test(f);
  return { isBold, isItalic };
}

The font family name is normalised through a lookup table that maps PDF internal names to Word-compatible font families:

const FONT_MAP = [
  [/times|tmnr|nimbus rom/i,   'Times New Roman'],
  [/arial|helvetica|swiss/i,   'Arial'],
  [/courier/i,                  'Courier New'],
  [/georgia/i,                  'Georgia'],
  [/verdana/i,                  'Verdana'],
  [/calibri/i,                  'Calibri'],
  [/cambria/i,                  'Cambria'],
  [/palatino|palladio/i,        'Palatino Linotype'],
  // ...
];

function normaliseFont(rawName, fallbackFont) {
  if (!rawName) return fallbackFont;
  for (const [re, mapped] of FONT_MAP) {
    if (re.test(rawName)) return mapped;
  }
  // Strip subset prefix (case-insensitive, because extracted names are
  // lower-cased), then clean up separators
  const clean = rawName.replace(/^[A-Z]{6}\+/i, '').replace(/[-_,]/g, ' ').trim();
  return clean || fallbackFont;
}

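If the raw name is empty, the user's chosen fallback font (Calibri by default) is used. A non-empty name that matches no pattern is passed through in cleaned-up form, so a completely opaque internal name such as g_d0_f1 reaches Word as "g d0 f1" and is left to Word's own font substitution.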
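Putting the two steps together, here is a rough sketch of annotating one extracted item with a family and style flags. Everything in it is a trimmed-down stand-in: `annotateItem`, `STYLE_PATTERNS`, and `FAMILY_PATTERNS` are hypothetical names, the pattern lists are shortened, and unlike normaliseFont it falls back to Calibri for any unmatched name.

```javascript
// Trimmed-down stand-ins for the article's detectFontStyle and FONT_MAP.
const STYLE_PATTERNS = {
  isBold: /bold|heavy|black|semibold|demi/i,
  isItalic: /italic|oblique|slanted/i,
};
const FAMILY_PATTERNS = [
  [/times/i, 'Times New Roman'],
  [/arial|helvetica/i, 'Arial'],
  [/courier/i, 'Courier New'],
];

// Hypothetical helper: attach family and style flags to one extracted item.
function annotateItem(item, fallbackFont = 'Calibri') {
  const name = item.fontName || '';
  const hit = FAMILY_PATTERNS.find(([re]) => re.test(name));
  return {
    ...item,
    fontFamily: hit ? hit[1] : fallbackFont, // simplification: no pass-through
    isBold: STYLE_PATTERNS.isBold.test(name),
    isItalic: STYLE_PATTERNS.isItalic.test(name),
  };
}

const styled = annotateItem({ str: 'Total', fontName: 'bcdeaa+arial-bolditalic' });
const opaque = annotateItem({ str: 'Total', fontName: 'g_d0_f1' });

console.log(styled.fontFamily, styled.isBold, styled.isItalic); // Arial true true
console.log(opaque.fontFamily, opaque.isBold, opaque.isItalic); // Calibri false false
```

Carrying isBold and isItalic on each item up front means later stages can compare spans without re-parsing font names.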

Line clustering — grouping spans that belong together

PDF.js returns text as individual spans, not lines. A single visual line of text might be 20 separate spans if the PDF uses different fonts, sizes, or colors mid-line. To place them sensibly in Word, they need to be grouped into lines first.

The clustering algorithm compares vertical overlap between consecutive spans:

function groupIntoLines(items) {
  const sorted = [...items].sort((a, b) =>
    a.y !== b.y ? a.y - b.y : a.x - b.x
  );

  const lines = [];
  if (sorted.length === 0) return lines; // nothing to cluster on an empty page
  let current = [sorted[0]];

  for (let i = 1; i < sorted.length; i++) {
    const prev = current[current.length - 1];
    const cur  = sorted[i];
    const prevMid = prev.y + prev.h / 2;
    const curMid  = cur.y  + cur.h / 2;
    const overlap = Math.min(prev.y + prev.h, cur.y + cur.h)
                  - Math.max(prev.y, cur.y);
    const minH    = Math.min(prev.h, cur.h);

    if (Math.abs(prevMid - curMid) < minH * 0.65 || overlap > minH * 0.4) {
      current.push(cur);
    } else {
      lines.push(current.sort((a, b) => a.x - b.x));
      current = [cur];
    }
  }
  if (current.length) lines.push(current.sort((a, b) => a.x - b.x));
  return lines;
}
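To see how the two thresholds behave, the same-line test can be pulled out into a standalone predicate and fed a couple of hand-picked spans. The name `onSameLine` and the sample coordinates are made up for this sketch.

```javascript
// Hypothetical predicate: the same-line test used inside the clustering loop.
function onSameLine(prev, cur) {
  const prevMid = prev.y + prev.h / 2;
  const curMid  = cur.y + cur.h / 2;
  const overlap = Math.min(prev.y + prev.h, cur.y + cur.h)
                - Math.max(prev.y, cur.y);
  const minH    = Math.min(prev.h, cur.h);
  return Math.abs(prevMid - curMid) < minH * 0.65 || overlap > minH * 0.4;
}

const body        = { y: 100,   h: 12 }; // 12 pt body text
const superscript = { y: 97,    h: 7  }; // small raised span on the same line
const nextLine    = { y: 114.4, h: 12 }; // following line at 1.2x leading

console.log(onSameLine(body, superscript)); // true
console.log(onSameLine(body, nextLine));    // false
```

The superscript fails the midpoint test (5.5 pt apart against a 4.55 pt limit) but passes on overlap (4 pt against a 2.8 pt limit), which is why the condition uses both checks.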

After grouping, adjacent spans on the same line that have matching font properties are merged with a space if the gap between them warrants one:

function mergeAdjacentSpans(lines, mergeGapThreshold = 2) {
  return lines.map(line => {
    const merged = [{ ...line[0] }];
    for (let i = 1; i < line.length; i++) {
      const prev = merged[merged.length - 1];
      const cur  = line[i];
      const gap  = cur.x - (prev.x + prev.w);
      const sameFontSize = Math.abs(cur.fontSize - prev.fontSize) < 0.8;
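      // isBold / isItalic are assumed to have been set on each item beforehand
      // (for example via detectFontStyle)
      const sameBold     = cur.isBold   === prev.isBold;
      const sameItalic   = cur.isItalic === prev.isItalic;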

      if (gap <= mergeGapThreshold && sameFontSize && sameBold && sameItalic) {
        prev.str += (gap > 0.5 ? ' ' : '') + cur.str;
        prev.w    = cur.x + cur.w - prev.x;
      } else {
        merged.push({ ...cur });
      }
    }
    return merged;
  });
}
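The gap rules are easier to read in isolation. The sketch below, with a hypothetical helper `classifyGap`, reproduces only the gap decision and deliberately ignores the font-size, bold, and italic checks.

```javascript
// Hypothetical helper mirroring the gap rules in mergeAdjacentSpans.
function classifyGap(prev, cur, mergeGapThreshold = 2) {
  const gap = cur.x - (prev.x + prev.w);
  if (gap > mergeGapThreshold) return 'separate';
  return gap > 0.5 ? 'merge-with-space' : 'merge';
}

const word = { x: 72, w: 30 }; // ends at x = 102

console.log(classifyGap(word, { x: 102.2, w: 10 })); // merge
console.log(classifyGap(word, { x: 103.5, w: 10 })); // merge-with-space
console.log(classifyGap(word, { x: 130,   w: 10 })); // separate
```

Overlapping spans produce a negative gap and are merged without a space, which matches how kerned fragments of a single word tend to arrive.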

Converting coordinates to EMU

Word's internal coordinate system uses EMU — English Metric Units. One point equals 12,700 EMU. One inch equals 914,400 EMU. This is the coordinate space that Word uses for absolute positioning of floating elements.

const PT_TO_EMU = 12700;

// For each text item:
const xEmu = Math.round(item.x * PT_TO_EMU);
const yEmu = Math.round(item.y * PT_TO_EMU);
const wEmu = Math.round(item.w * PT_TO_EMU);
const hEmu = Math.round(item.h * PT_TO_EMU);

The page dimensions also need to be converted for the DOCX section properties. PDF uses points for page dimensions; Word uses twips (twentieths of a point):

const PT_TO_TWIPS = 20;

const pgWTwips = Math.round(pageWPt * PT_TO_TWIPS);
const pgHTwips = Math.round(pageHPt * PT_TO_TWIPS);

This ensures the output DOCX has exactly the same page dimensions as the original PDF — no layout shift.
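As a worked example of both conversions, here is a sketch for a US Letter page. The helper name `pageUnits` is hypothetical; the constants are the ones defined above.

```javascript
const PT_TO_EMU = 12700; // 914,400 EMU per inch / 72 pt per inch
const PT_TO_TWIPS = 20;  // 1,440 twips per inch / 72 pt per inch

// Hypothetical helper: page size in both units Word needs.
function pageUnits(pageWPt, pageHPt) {
  return {
    emu: {
      w: Math.round(pageWPt * PT_TO_EMU),
      h: Math.round(pageHPt * PT_TO_EMU),
    },
    twips: {
      w: Math.round(pageWPt * PT_TO_TWIPS),
      h: Math.round(pageHPt * PT_TO_TWIPS),
    },
  };
}

const letter = pageUnits(612, 792); // US Letter, 8.5 x 11 in
console.log(letter.emu);   // { w: 7772400, h: 10058400 }
console.log(letter.twips); // { w: 12240, h: 15840 }
```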

Building the DOCX anchored text boxes

Each text item becomes a <wp:anchor> — an absolutely-positioned floating text box in OOXML (the XML format underlying docx files). The anchor is positioned relative to the page at the exact EMU coordinates extracted from the PDF:

function buildAnchoredTextBox(params, id) {
  const { xEmu, yEmu, wEmu, hEmu, runs, pageWEmu, pageHEmu } = params;

  // Clamp to page bounds: prevents out-of-range crashes in Word
  // 91440 EMU = 0.1 inch (7.2 pt), used as the minimum margin and box size
  const safeX = Math.max(0, Math.min(xEmu, pageWEmu - 91440));
  const safeY = Math.max(0, Math.min(yEmu, pageHEmu - 91440));
  const safeW = Math.max(91440, Math.min(wEmu, pageWEmu - safeX));
  const safeH = Math.max(91440, Math.min(hEmu, pageHEmu - safeY));

  return (
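    // relativeFrom belongs on positionH / positionV, not on the anchor itself;
    // the anchor's own required attributes are omitted here for brevity
    `<wp:anchor>` +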
    `<wp:positionH relativeFrom="page">` +
    `  <wp:posOffset>${safeX}</wp:posOffset>` +
    `</wp:positionH>` +
    `<wp:positionV relativeFrom="page">` +
    `  <wp:posOffset>${safeY}</wp:posOffset>` +
    `</wp:positionV>` +
    `<wp:extent cx="${safeW}" cy="${safeH}"/>` +
    `<wp:wrapNone/>` +
    // ... text box content with runs ...
    `</wp:anchor>`
  );
}

<wp:wrapNone/> means the text box doesn't affect text flow — it floats freely at its absolute position, exactly like the original PDF content. relativeFrom="page" anchors the coordinates to the page origin rather than the text area, which is essential because PDFs use page-relative coordinates.

The bounds clamping (91440 EMU = 0.1 inch, or 7.2 points) prevents a common crash in Word where text boxes positioned outside the page area cause the file to be reported as corrupted.
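Here is the clamping step as a standalone sketch, run against a box that starts left of the page and runs past the right edge. The name `clampBox` and the sample values are made up; the arithmetic is the same as in buildAnchoredTextBox.

```javascript
const MIN_EMU = 91440; // 0.1 inch, the minimum margin and box size

// Hypothetical standalone version of the clamping step.
function clampBox({ xEmu, yEmu, wEmu, hEmu }, pageWEmu, pageHEmu) {
  const x = Math.max(0, Math.min(xEmu, pageWEmu - MIN_EMU));
  const y = Math.max(0, Math.min(yEmu, pageHEmu - MIN_EMU));
  const w = Math.max(MIN_EMU, Math.min(wEmu, pageWEmu - x));
  const h = Math.max(MIN_EMU, Math.min(hEmu, pageHEmu - y));
  return { x, y, w, h };
}

// US Letter page: 7,772,400 x 10,058,400 EMU.
const clamped = clampBox(
  { xEmu: -50000, yEmu: 1000000, wEmu: 9000000, hEmu: 152400 },
  7772400,
  10058400
);
console.log(clamped); // { x: 0, y: 1000000, w: 7772400, h: 152400 }
```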

The OCR fallback for scanned PDFs

When extractNativeTextItems() returns five or fewer text items (which happens with scanned documents, image-only PDFs, or PDFs where text was saved as outlines), the tool switches to Tesseract.js OCR:

if (nativeResult && nativeResult.items.length > 5) {
  // use native extraction
} else {
  // render page at 3× scale for OCR
  const vpScaled = page.getViewport({ scale: RENDER_SCALE }); // RENDER_SCALE = 3
  const canvas   = document.createElement('canvas');
  canvas.width   = Math.round(vpScaled.width);
  canvas.height  = Math.round(vpScaled.height);
  const ctx      = canvas.getContext('2d');

  await page.render({ canvasContext: ctx, viewport: vpScaled }).promise;

  const tessData = await ocrPageCanvas(canvas, opts.ocrLang, onProgress);
  canvas.width = canvas.height = 0; // release GPU memory

  items = tessWordsToItems(tessData, RENDER_SCALE, pageWPt, pageHPt);
}

Rendering at 3× (216 DPI equivalent) before OCR significantly improves Tesseract accuracy compared to rendering at 1× or 2×. The 3× rendered pixel coordinates are then divided back by the render scale to get PDF point coordinates, which are then converted to EMU for placement in the docx.

Tesseract provides word-level bounding boxes (word.bbox) with confidence scores. Words with confidence below 15% are discarded — they're almost certainly noise or artifacts rather than real text.
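The scale-back and confidence filter might look roughly like this. The article does not show tessWordsToItems, so `wordsToItems` below is a hypothetical stand-in, the output field names are assumptions, and the sample data is only shaped like Tesseract.js word results (text, confidence, bbox with x0/y0/x1/y1).

```javascript
// Hypothetical sketch of a word-to-item mapper, not the tool's actual code.
function wordsToItems(words, renderScale, minConfidence = 15) {
  return words
    .filter(w => w.confidence >= minConfidence && w.text.trim())
    .map(w => ({
      str: w.text,
      x: w.bbox.x0 / renderScale,               // canvas pixels -> PDF points
      y: w.bbox.y0 / renderScale,               // canvas origin is already top-left
      w: (w.bbox.x1 - w.bbox.x0) / renderScale,
      h: (w.bbox.y1 - w.bbox.y0) / renderScale,
      confidence: w.confidence,
    }));
}

// Made-up sample: one real word and one low-confidence speck.
const sampleWords = [
  { text: 'Invoice', confidence: 96, bbox: { x0: 216, y0: 300, x1: 516, y1: 348 } },
  { text: '~',       confidence: 9,  bbox: { x0: 30,  y0: 30,  x1: 36,  y1: 33  } },
];

const ocrItems = wordsToItems(sampleWords, 3);
console.log(ocrItems);
```

Because the canvas already uses a top-left origin, OCR coordinates need only the division by the render scale and no Y flip.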

The Pixel Perfect mode — for when visual fidelity matters more than editability

For PDFs where layout accuracy matters more than editability — forms, certificates, complex multi-column layouts — the pixel pipeline renders each page as a JPEG and embeds it inline in the docx:

async function canvasToJpeg(canvas, quality = 0.93) {
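  // toBlob passes null to the callback if encoding fails, so reject in that case
  const blob = await new Promise((resolve, reject) =>
    canvas.toBlob(
      b => (b ? resolve(b) : reject(new Error('JPEG encoding failed'))),
      'image/jpeg',
      quality
    )
  );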
  const bytes = await blobToUint8Array(blob);
  canvas.width = canvas.height = 0; // GPU memory release
  return bytes;
}

The JPEG quality options map to 0.82 (balanced), 0.93 (high, default), and 0.97 (maximum). High quality at 2× render scale produces JPEG images that are visually indistinguishable from the original PDF when viewed at normal sizes.

Each page image is embedded using <wp:inline> rather than <wp:anchor> — inline drawing elements in OOXML flow with the document and don't overlap, which is what you want when each page is the full-width content of a section.
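The numbers for one rasterised page can be sketched as follows. The helper name `pageImageGeometry` is hypothetical, and it assumes the image extent is set to the full page size, which only works if the section margins are zero.

```javascript
const PT_TO_EMU = 12700;

// Hypothetical helper: canvas size and DOCX extent for one rasterised page.
function pageImageGeometry(pageWPt, pageHPt, renderScale) {
  return {
    pixelW: Math.round(pageWPt * renderScale),
    pixelH: Math.round(pageHPt * renderScale),
    dpi: 72 * renderScale,               // 72 pt per inch
    cx: Math.round(pageWPt * PT_TO_EMU), // extent width in EMU
    cy: Math.round(pageHPt * PT_TO_EMU), // extent height in EMU
  };
}

const geometry = pageImageGeometry(612, 792, 2); // US Letter at 2x
console.log(geometry);
// { pixelW: 1224, pixelH: 1584, dpi: 144, cx: 7772400, cy: 10058400 }
```

The extent depends only on the page size in points, so changing the render scale changes the pixel density of the JPEG but not how large it appears in Word.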

The privacy architecture

The entire pipeline — PDF parsing, text extraction, OCR, DOCX assembly — runs locally in the browser. No bytes of your document touch any server.

For sensitive PDFs — contracts, financial statements, medical records, legal filings — this matters. The document you upload to a conversion service goes somewhere. It sits on a server. It may be retained. With ihatepdf.cv, the conversion happens in your browser tab and the result goes directly to your Downloads folder. The file never leaves your device.

Open DevTools → Network tab → convert a PDF. You'll see PDF.js and Tesseract.js loading once and being cached by the service worker. You'll see zero upload requests for your document.

When to use which mode

Ultra-Accurate Editable is the right choice when:

You need to edit the converted document — update names, fix typos, change dates
The PDF was generated digitally (not scanned) — Word, InDesign, Google Docs exports
You want selectable, copyable, searchable text in the output

Pixel Perfect is the right choice when:

Layout accuracy matters more than editability — forms, certificates, designed documents
The PDF has complex layouts that are difficult to reconstruct with text boxes
You want guaranteed visual fidelity and don't need to edit the content

Try it

ihatepdf.cv/pdf-to-word

Free. No account. No upload. No watermark. Both conversion modes available for every file, no paywalls.

If you work with PDFs professionally and have conversion cases that break — unusual layouts, complex tables, right-to-left scripts, mathematical content — I read comments. Edge cases are how the tool improves.

Part of an ongoing series on building a privacy-first PDF toolkit in the browser. The architecture overview is at ihatepdf.cv/technical-blog. Previous posts: PDF compression with Ghostscript-WASM · PDF to JPG at 600 DPI.