ArshTechPro

Posted on Jun 7

LiteParse: A Fast, Local Document Parser for Developers

#javascript #opensource #ai #python

LiteParse is a fast, local document parser for extracting text from clean, well-structured files. It handles PDFs, DOCX, HTML, and more, with minimal setup and no API calls. Everything runs locally, so your documents never leave your environment.

The project is honest about its scope, which is refreshing. It's a strong fit when:

Your documents are relatively straightforward, without complex tables, mixed layouts, or scanned pages.
You want parsing to run locally rather than sending data to an external service.
You're prototyping or building a lightweight pipeline and don't need enterprise-grade accuracy.

For the genuinely hard stuff (dense tables, multi-column layouts, charts, handwriting, scanned PDFs), the maintainers point you toward LlamaParse, their cloud product. LiteParse deliberately stays in the "fast and light" lane rather than trying to be everything.

Key Features

Under the hood, LiteParse leans on PDF.js for spatial text parsing and gives you a few things that matter for AI pipelines:

Text extraction with precise bounding boxes, so you know where each piece of text sits on the page.
A flexible OCR system: Tesseract.js works out of the box with zero setup, and you can plug in HTTP OCR servers like EasyOCR or PaddleOCR for higher accuracy.
Screenshot generation, which produces page images that LLM agents can use to capture visual information text alone misses.
Output in either JSON or plain text.
A standalone binary that runs across Linux, macOS (Intel and ARM), and Windows.

Getting Started

LiteParse ships as both a CLI and a library. Here's the fast path for each.

Install the CLI

The recommended approach is a global npm install, which gives you the lit command everywhere:

npm i -g @llamaindex/liteparse

On macOS and Linux you can also use Homebrew:

brew tap run-llama/liteparse
brew install llamaindex-liteparse

Parse your first document

# Basic parsing (OCR is on by default via Tesseract)
lit parse document.pdf

# Output JSON to a file
lit parse document.pdf --format json -o output.md

# Parse only specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"

# Skip OCR entirely
lit parse document.pdf --no-ocr

Parse a whole folder

For pipelines, batch mode reuses the PDF engine across files for efficiency:

lit batch-parse ./input-directory ./output-directory

Generate page screenshots

# All pages
lit screenshot document.pdf -o ./screenshots

# Specific pages at higher resolution
lit screenshot document.pdf --target-pages "1,3,5" --dpi 300 -o ./screenshots

Using LiteParse as a Library

If you'd rather call it from code, install it as a dependency:

npm install @llamaindex/liteparse
# or
pnpm add @llamaindex/liteparse

Then parsing is a few lines:

import { LiteParse } from '@llamaindex/liteparse';

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse('document.pdf');
console.log(result.text);

Parsing More Than PDFs

One thing that sets LiteParse apart from PDF-only tools is automatic format conversion. Point it at an Office document or an image and it will convert to PDF first, provided you have the right helper installed.

For Office documents (Word, PowerPoint, spreadsheets), install LibreOffice:

# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice

For images (JPG, PNG, GIF, BMP, TIFF, WebP, SVG), install ImageMagick:

# macOS
brew install imagemagick

# Ubuntu/Debian
apt-get install imagemagick

Once these are present, LiteParse handles the conversion behind the scenes.

Configuration

You can drive everything from CLI flags, or set defaults in a liteparse.config.json file:

{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "preserveVerySmallText": false
}

To point at an external OCR server, add an ocrServerUrl:

{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}

Then run:

lit parse document.pdf --config liteparse.config.json

Bringing Your Own OCR

The default Tesseract.js engine needs no setup, but if you want better accuracy you can wire in any OCR service that implements LiteParse's simple API specification. The contract is minimal: a POST /ocr endpoint that accepts a file and a language, and returns JSON with each result's text, bounding box, and confidence score. The repo includes ready-made example wrappers for EasyOCR and PaddleOCR you can use as templates.

Should You Use It?

LiteParse is purpose-built and clear about its boundaries, which makes it easy to reason about. If you need local, fast text extraction from clean documents for a RAG pipeline, an agent, or a quick prototype, it's a solid, dependency-light choice. If your documents are messy, scanned, or table-heavy, the maintainers are upfront that you'll want a heavier solution.

DEV Community