sunshey

Posted on Jun 30

How I Built a Browser-Based PDF to Word Converter with Vue 3, pdf.js, and docx

#javascript #vue #pdf #webdev

Converting a PDF to a Word document is one of those tasks that sounds simple until you try to do it privately. Most converters upload your file to a server, process it, and send it back. That works, but it means trusting someone else with your document.

I wanted a converter that runs entirely in the browser. The result is en.sotool.top/pdf-to-word/. Here's how I built it.

The Goal

Extract selectable text from a PDF and package it into a .docx file, without ever sending the PDF to a server.

The scope is intentionally narrow:

Text-only output
No layout preservation
No image extraction
No OCR for scanned PDFs

This covers a lot of real use cases — contracts, reports, essays, meeting notes — while staying fast and private.

The Stack

Vue 3 — UI and state management
pdfjs-dist — Extract text from each PDF page
docx — Generate .docx files in the browser
File API + Blob — Read input and trigger downloads

npm install pdfjs-dist docx

Loading the PDF

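pdfjs-dist needs a worker. I point it to a CDN worker file to avoid bundling the large worker script. The worker version has to match the library version, which is why the URL is built from pdfjs.version.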

import * as pdfjs from 'pdfjs-dist'

pdfjs.GlobalWorkerOptions.workerSrc =
  `https://cdnjs.cloudflare.com/ajax/libs/pdf.js/${pdfjs.version}/pdf.worker.min.mjs`

Then load the document from a file:

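// Despite the name, this only loads the document; the text is pulled out per page afterwards.
async function extractText(file: File) {
  // Read the whole file into memory; nothing leaves the browser.
  const arrayBuffer = await file.arrayBuffer()
  // getDocument() returns a loading task; its promise resolves to the parsed PDF.
  const pdf = await pdfjs.getDocument({ data: arrayBuffer }).promise
  return pdf
}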
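
Before handing the bytes to pdf.js, a cheap signature check can reject files that are obviously not PDFs. This is a minimal sketch, not part of the original tool: looksLikePdf is a hypothetical helper, and it assumes the file starts with the "%PDF-" header exactly at byte 0. Some viewers tolerate leading junk before the header, so a strict check like this would reject those files.

```typescript
// Hypothetical helper (not from the original tool): true when the
// bytes begin with the "%PDF-" signature that opens a PDF file.
function looksLikePdf(bytes: Uint8Array): boolean {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d] // "%PDF-"
  if (bytes.length < signature.length) return false
  return signature.every((byte, i) => bytes[i] === byte)
}

// Possible usage inside an async handler (assumption):
//   const bytes = new Uint8Array(await file.arrayBuffer())
//   if (!looksLikePdf(bytes)) { /* show a "not a PDF" message */ }
```

Checking the bytes rather than file.type avoids relying on the MIME type the browser guesses from the extension.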

Extracting Text Page by Page

For each page, getTextContent() returns an items array; the entries that carry a str property are the text items. I collect the text and split it into paragraphs.

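const pdf = await extractText(file)
const paragraphs: string[] = []

for (let i = 1; i <= pdf.numPages; i++) {
  const page = await pdf.getPage(i)
  const content = await page.getTextContent()
  const text = content.items
    .filter((item: any) => 'str' in item)
    // Items contain no newlines of their own; pdf.js flags line ends with hasEOL.
    .map((item: any) => (item.hasEOL ? item.str + '\n' : item.str))
    .join(' ')

  // Split on blank lines where pdf.js reports them (a page with none stays
  // one paragraph), then collapse the remaining line breaks and extra spaces.
  const pageParagraphs = text
    .split(/\n\s*\n/)
    .map(p => p.replace(/\s+/g, ' ').trim())
    .filter(Boolean)
  paragraphs.push(...pageParagraphs)
}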

The result is an array of paragraph strings. We lose exact layout, but the text content is preserved.
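
Paragraph boundaries are not stored in a PDF, so any split is a heuristic. One option is to rebuild lines from the hasEOL flag and start a new paragraph when the vertical gap between two lines is clearly larger than a normal line step. The sketch below is an illustration, not the tool's actual code: TextItemLike lists only the fields it reads from pdf.js text items, and groupIntoParagraphs and the gapFactor default of 1.8 are assumptions to tune per document.

```typescript
// Minimal shape of the pdf.js text-item fields this sketch relies on.
interface TextItemLike {
  str: string
  hasEOL: boolean
  transform: number[] // transform[5] is the baseline y position
  height: number
}

interface Line { text: string; y: number; height: number }

// Hypothetical helper: merge items into lines, then lines into paragraphs.
function groupIntoParagraphs(items: TextItemLike[], gapFactor = 1.8): string[] {
  // Pass 1: accumulate items into a line until pdf.js flags an end of line.
  const lines: Line[] = []
  let buffer = ''
  let lineY = 0
  let lineHeight = 0
  for (const item of items) {
    if (buffer === '') lineY = item.transform[5] ?? 0
    buffer += item.str
    lineHeight = Math.max(lineHeight, item.height)
    if (item.hasEOL) {
      if (buffer.trim()) lines.push({ text: buffer.trim(), y: lineY, height: lineHeight })
      buffer = ''
      lineHeight = 0
    }
  }
  if (buffer.trim()) lines.push({ text: buffer.trim(), y: lineY, height: lineHeight })

  // Pass 2: a gap well above one line height starts a new paragraph.
  const paragraphs: string[] = []
  let prev: Line | null = null
  for (const line of lines) {
    const gap = prev ? Math.abs(prev.y - line.y) : 0
    if (prev && gap <= gapFactor * prev.height) {
      paragraphs[paragraphs.length - 1] += ' ' + line.text
    } else {
      paragraphs.push(line.text)
    }
    prev = line
  }
  return paragraphs
}
```

Multi-column pages and tables will still confuse this, which is consistent with the text-only scope.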

Building the DOCX File

The docx library lets you create a Word-compatible document without a backend.

import { Document, Paragraph, Packer } from 'docx'

const doc = new Document({
  sections: [{
    properties: {},
    children: paragraphs.map(text => new Paragraph({ text })),
  }],
})

const blob = await Packer.toBlob(doc)

Packer.toBlob() returns a promise that resolves to a Blob, which you can download with a simple anchor element.
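
A .docx file is XML under the hood, and text extracted from PDFs sometimes contains control characters that XML 1.0 does not allow. Whether the docx library filters these itself is not something this post confirms, so treat the following as a defensive assumption: stripInvalidXmlChars is a hypothetical helper that could be applied to each paragraph string before building the document.

```typescript
// Hypothetical helper: remove C0 control characters that XML 1.0 forbids.
// Tab (U+0009), line feed (U+000A) and carriage return (U+000D) are legal
// in XML and are left untouched.
function stripInvalidXmlChars(text: string): string {
  return text.replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, '')
}

// Possible usage (assumption):
//   const safeParagraphs = paragraphs.map(stripInvalidXmlChars)
```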

Downloading the Result

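function downloadBlob(blob: Blob, filename: string) {
  const url = URL.createObjectURL(blob)
  const a = document.createElement('a')
  a.href = url
  a.download = filename
  // Attach the anchor so the click also works in browsers that ignore detached elements.
  document.body.appendChild(a)
  a.click()
  a.remove()
  // Revoke on the next tick so the download has started before the URL goes away.
  setTimeout(() => URL.revokeObjectURL(url), 0)
}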
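
The filename passed to downloadBlob can be derived from the uploaded file's name. A minimal sketch, assuming a hypothetical toDocxFilename helper and a fallback name of "converted" (both are illustrations, not taken from the tool):

```typescript
// Hypothetical helper: swap a trailing ".pdf" (any letter case) for ".docx".
// Falls back to "converted" when nothing is left of the base name.
function toDocxFilename(pdfName: string): string {
  const base = pdfName.replace(/\.pdf$/i, '')
  return `${base || 'converted'}.docx`
}

// Possible usage (assumption):
//   downloadBlob(blob, toDocxFilename(file.name))
```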

UI Considerations

Set expectations early. We show a clear message that conversion is text-only and that scanned PDFs won't work.

Preview the first three pages. Users can see the extracted text before downloading, which builds trust and lets them catch problems early.
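
The preview can be built from the per-page text that has already been extracted. This sketch assumes an array with one string per page; buildPreview, PagePreview, and the 500-character cap are hypothetical choices for illustration, not details from the tool.

```typescript
interface PagePreview { page: number; text: string }

// Hypothetical helper: keep the first few pages and cap each page's text
// so the preview stays light even for dense documents.
function buildPreview(pageTexts: string[], maxPages = 3, maxChars = 500): PagePreview[] {
  return pageTexts.slice(0, maxPages).map((text, index) => {
    const trimmed = text.trim()
    return {
      page: index + 1, // pages are 1-based, matching pdf.getPage()
      text: trimmed.length > maxChars ? trimmed.slice(0, maxChars) + '…' : trimmed,
    }
  })
}
```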

Affiliate guidance for complex needs. If a user needs layout preservation, images, or OCR, we recommend a desktop tool. We use a CJ Affiliate link for Wondershare PDFelement with rel="noopener sponsored".

Lessons Learned

Text extraction is easy; layout preservation is hard. Trying to keep columns, tables, and images in a pure browser tool quickly becomes a research project. Text-only is a pragmatic cutoff.

Scanned PDFs are the biggest support burden. Users expect any "PDF to Word" tool to handle scanned documents. We detect low or zero text content and show a specific message explaining the limitation.
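
A check along these lines can drive that message. It is a minimal sketch, assuming one extracted string per page; looksScanned and the threshold of 20 non-whitespace characters per page are hypothetical values to tune against real files, not the tool's actual numbers.

```typescript
// Hypothetical helper: treat the document as scanned when the average
// number of non-whitespace characters per page is below a small threshold.
function looksScanned(pageTexts: string[], minCharsPerPage = 20): boolean {
  if (pageTexts.length === 0) return true // nothing extractable at all
  const totalChars = pageTexts.reduce(
    (sum, text) => sum + text.replace(/\s/g, '').length,
    0,
  )
  return totalChars / pageTexts.length < minCharsPerPage
}
```

An average is used rather than a per-page test so that a mostly textual document with a blank or image-only page is not flagged.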

Preview reduces disappointment. Letting users see the first few pages of extracted text before downloading prevents the "this output is broken" reaction.

Worker source matters. Bundling the pdf.js worker (pdf.worker.min.mjs) adds a sizable chunk to the build. Pointing to a CDN version keeps the initial bundle smaller.

Try It

The tool is live at en.sotool.top/pdf-to-word/.

Free, no signup, no upload. Full source is on GitHub.

Need Full Formatting?

For complex documents with tables, images, or scanned pages, a desktop tool is still the better option. Wondershare PDFelement converts PDFs to Word while preserving formatting and includes OCR.

This post contains affiliate links.

Have you built document conversion tools in the browser? What trade-offs did you make?
