fayismahmood

Posted on May 18

PDF to Text Conversion with Rust on AWS Lambda

#rust #webdev #opensource

A complete guide to building a serverless PDF conversion service using Rust, pdf_oxide, and cargo-lambda.

Demo: https://pdf-to-text-phi.vercel.app

Repo: https://github.com/fayismahmood/pdf-to-text

Prerequisites

You'll need Rust 1.70+ installed, the AWS CLI configured with credentials, and basic familiarity with AWS Lambda concepts. cargo-lambda installation is covered in the next section.

Your AWS IAM user also needs permissions for:

lambda:* (or the required Lambda deployment permissions)
iam:CreateRole
iam:AttachRolePolicy
iam:PassRole

Without these permissions, cargo lambda deploy may fail during the initial deployment, when it creates the Lambda execution role.

Installing cargo-lambda

cargo-lambda is an open-source Cargo subcommand for building, testing, and deploying Rust functions on AWS Lambda.

macOS / Linux

curl -L https://cargo-lambda.info/install.sh | sh

Verify: cargo lambda --version

Project Setup

Create new project

cargo lambda new pdf-converter --http
cd pdf-converter

This creates an HTTP-compatible Lambda project with API Gateway integration.

Update `Cargo.toml`

[package]
name = "pdf-converter"
version = "0.1.0"
edition = "2021"

[dependencies]
lambda_http = "1.0"
pdf_oxide = "0.3"
tokio = { version = "1", features = ["macros"] }

Code Implementation

src/http_handler.rs

use lambda_http::{Body, Error, Request, RequestExt, Response};
use pdf_oxide::{PdfDocument, converters::ConversionOptions};

pub(crate) enum FileType {
    Html,
    Text,
    Markdown,
}

impl FileType {
    fn from_str(s: &str) -> Option<Self> {
        match s.to_lowercase().as_str() {
            "html" => Some(FileType::Html),
            "text" => Some(FileType::Text),
            "markdown" => Some(FileType::Markdown),
            _ => None,
        }
    }
}

pub(crate) async fn function_handler(event: Request) -> Result<Response<Body>, Error> {
    // Read the requested output format from the query string; default to plain text.
    let file_type = event
        .query_string_parameters_ref()
        .and_then(|params| params.first("file_type"))
        .unwrap_or("text");

    // Unknown values also fall back to plain text instead of returning an error.
    let file_type = FileType::from_str(file_type).unwrap_or(FileType::Text);

    // The request body is the raw PDF; parse it in memory.
    let body = event.body().to_vec();
    let pdf_data = PdfDocument::from_bytes(body)?;

    let options = ConversionOptions::default();
    let page_count = pdf_data.page_count()?;
    let mut result = String::new();

    // Convert each page in order and append it to a single output string.
    for i in 0..page_count {
        let page_content = match file_type {
            FileType::Html => pdf_data.to_html(i, &options)?,
            FileType::Text => pdf_data.to_plain_text(i, &options)?,
            FileType::Markdown => pdf_data.to_markdown(i, &options)?,
        };
        result.push_str(&page_content);
    }

    // Pick the response content type that matches the output format.
    let content_type = match file_type {
        FileType::Html => "text/html",
        FileType::Text => "text/plain",
        FileType::Markdown => "text/markdown",
    };

    let resp = Response::builder()
        .status(200)
        .header("content-type", content_type)
        .body(result.into())
        .map_err(Box::new)?;
    Ok(resp)
}
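The handler above parses file_type with an inherent from_str method and matches on the enum a second time to pick the content type. As a possible alternative, here is a minimal standard-library sketch that implements std::str::FromStr and moves the MIME lookup into a method. The content_type helper and the String error type are illustrative assumptions, not part of the original repo.

use std::str::FromStr;

/// Output formats supported by the handler (mirrors the enum above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Html,
    Text,
    Markdown,
}

impl FromStr for FileType {
    // Assumption: a plain String is enough to describe a bad parameter.
    type Err = String;

    // Case-insensitive parse; unknown values become an error instead of None.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "html" => Ok(FileType::Html),
            "text" => Ok(FileType::Text),
            "markdown" => Ok(FileType::Markdown),
            other => Err(format!("unsupported file_type: {}", other)),
        }
    }
}

impl FileType {
    /// Hypothetical helper: the MIME type to send in the content-type header.
    pub fn content_type(self) -> &'static str {
        match self {
            FileType::Html => "text/html",
            FileType::Text => "text/plain",
            FileType::Markdown => "text/markdown",
        }
    }
}

With this in place, "Markdown".parse::<FileType>() yields the Markdown variant, and unwrap_or(FileType::Text) on the result keeps the handler's current fallback. Returning the error instead would let the handler answer unknown formats with a 400 rather than silently converting to text.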

Local Testing

Start the local server

cargo lambda watch

Send a PDF via curl

curl -X POST 'http://localhost:9000/function/function_handler?file_type=markdown' \
  -H 'Content-Type: application/pdf' \
  --data-binary @document.pdf
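A common slip with this command is dropping the @, in which case curl posts the literal string document.pdf instead of the file contents. The sketch below shows one way a handler could reject such bodies before handing them to the PDF parser. It is an assumption-laden illustration: looks_like_pdf and check_upload are hypothetical names, and the check only looks for the %PDF- marker at the very start of the body.

/// Returns true if the bytes begin with the `%PDF-` header marker.
pub fn looks_like_pdf(bytes: &[u8]) -> bool {
    bytes.starts_with(b"%PDF-")
}

/// Hypothetical pre-check a handler could run before parsing the body.
/// Returns a short reason when the upload is clearly not a PDF.
pub fn check_upload(bytes: &[u8]) -> Result<(), &'static str> {
    if bytes.is_empty() {
        return Err("empty request body");
    }
    if !looks_like_pdf(bytes) {
        return Err("body does not start with %PDF-");
    }
    Ok(())
}

A handler could map the Err case to a 400 response, so a malformed upload is reported as a client error rather than surfacing as a parser failure.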

Deployment

Build for production

cargo lambda build --release --arm64

The --arm64 flag targets AWS Graviton (ARM64) processors, which generally offer better price-performance than x86_64 for Lambda workloads.

Deploy to AWS

cargo lambda deploy

The first deployment will create an IAM role automatically. Subsequent deployments will reuse it.

Via AWS CLI

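# Package the function as a zip archive
cargo lambda build --release --arm64 --output-format zip

# Upload the archive to an existing function
aws lambda update-function-code \
  --function-name pdf-converter \
  --zip-file fileb://target/lambda/pdf-converter/bootstrap.zip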

Performance Benchmarks

pdf_oxide Performance

pdf_oxide ranks among the fastest PDF libraries in benchmarks run on 3,830 real-world PDFs:

Python PDF Libraries Comparison

Library	Mean Time	Pass Rate	License
pdf_oxide	0.8ms	100%	MIT
PyMuPDF	4.6ms	99.3%	AGPL-3.0
pypdfium2	4.1ms	99.2%	Apache-2.0
pdfminer	16.8ms	98.8%	MIT
pdfplumber	23.2ms	98.8%	MIT

Rust PDF Libraries Comparison

Library	Mean Time	Pass Rate
pdf_oxide	0.8ms	100%
unfpdf	2.8ms	95.1%
pdf_extract	4.08ms	91.5%
oxidize_pdf	13.5ms	99.1%

pdf_oxide is 5× faster than pdf_extract and 17× faster than oxidize_pdf in Rust.
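The speedup figures follow directly from the mean times in the table. As a quick check of the arithmetic, this small sketch divides each library's mean time by pdf_oxide's 0.8ms. The function names are illustrative, and the numbers are copied from the table above rather than measured here.

/// Ratio of another library's mean time to the baseline's mean time.
pub fn speedup(baseline_ms: f64, other_ms: f64) -> f64 {
    other_ms / baseline_ms
}

/// Speedups of pdf_oxide over the other Rust libraries, using the table's mean times.
pub fn rust_speedups() -> Vec<(&'static str, f64)> {
    let pdf_oxide_ms = 0.8;
    let others = [("unfpdf", 2.8), ("pdf_extract", 4.08), ("oxidize_pdf", 13.5)];
    others
        .iter()
        .map(|&(name, ms)| (name, speedup(pdf_oxide_ms, ms)))
        .collect()
}

The ratios come out to roughly 3.5, 5.1, and 16.9, which round to the 5× and 17× quoted above.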

AWS Lambda Cold Start

Rust's minimal runtime and small compiled binaries keep cold starts short:

Runtime	Cold Start (typical)
Rust (provided.al2023)	~50-100ms
Node.js	~100-200ms
Python	~100-300ms
Java	~500-2000ms

Memory Usage

With pdf_oxide's efficient design, memory usage stays low:

PDF Size	Peak Memory
100KB	~5MB
1MB	~20MB
10MB	~100MB

API Usage

Request

POST /function/function_handler?file_type={html|text|markdown}
Content-Type: application/pdf

<binary PDF data>

Response

Returns the converted content with appropriate content-type header.
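Both the uploaded PDF and the converted response travel through Lambda's synchronous invocation path, where binary bodies are base64-encoded inside the event. The sketch below estimates whether a PDF of a given size still fits after encoding. The 6 MB limit is an assumption to verify against the current AWS Lambda quotas, and the function names are hypothetical.

/// Assumed synchronous invocation payload limit (6 MB); check the current AWS quotas.
pub const SYNC_PAYLOAD_LIMIT_BYTES: usize = 6 * 1024 * 1024;

/// Length of the padded base64 encoding of `raw_len` bytes:
/// every group of up to 3 input bytes becomes 4 output bytes.
pub fn base64_len(raw_len: usize) -> usize {
    (raw_len + 2) / 3 * 4
}

/// True if a PDF of `raw_len` bytes fits under the assumed limit once encoded.
/// The JSON envelope around the body is ignored, so the result is optimistic.
pub fn fits_sync_invoke(raw_len: usize) -> bool {
    base64_len(raw_len) <= SYNC_PAYLOAD_LIMIT_BYTES
}

Under these assumptions the cutoff is about 4.5 MiB of raw PDF, so the 10MB case in the memory table would need a different upload path, for example reading the file from S3.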

Example with AWS CLI

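# Note: lambda_http expects an API Gateway or Function URL style event,
# so a bare {"file_type": ...} payload like the one below will likely be
# rejected; wrap the parameters in the event shape your integration uses.
aws lambda invoke \
  --function-name pdf-converter \
  --payload '{"file_type": "markdown"}' \
  --cli-binary-format raw-in-base64-out \
  response.json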
