fayismahmood

PDF to Text Conversion with Rust on AWS Lambda

A complete guide to building a serverless PDF conversion service using Rust, pdf_oxide, and cargo-lambda.

Demo: https://pdf-to-text-phi.vercel.app

Repo: https://github.com/fayismahmood/pdf-to-text

Prerequisites

You'll need Rust 1.70+ installed, AWS CLI configured with credentials, and basic familiarity with AWS Lambda concepts. For cargo-lambda installation, follow the official guide.

Your AWS IAM user also needs permissions for:

  • lambda:* (or the required Lambda deployment permissions)
  • iam:CreateRole
  • iam:AttachRolePolicy
  • iam:PassRole

Without these permissions, cargo lambda deploy may fail during the initial deployment when creating the Lambda execution role.

Installing cargo-lambda

cargo-lambda is a Cargo subcommand for building, testing, and deploying Rust functions on AWS Lambda.

macOS / Linux

curl -L https://cargo-lambda.info/install.sh | sh

Verify: cargo lambda --version

Project Setup

Create new project

cargo lambda new pdf-converter --http
cd pdf-converter

This scaffolds a Lambda function built on lambda_http, which handles HTTP events from API Gateway, Application Load Balancers, and Lambda function URLs.

Update Cargo.toml

[package]
name = "pdf-converter"
version = "0.1.0"
edition = "2021"

[dependencies]
lambda_http = "1.0"
pdf_oxide = "0.3"
tokio = { version = "1", features = ["macros"] }

Code Implementation

src/http_handler.rs

use lambda_http::{Body, Error, Request, RequestExt, Response};
use pdf_oxide::{PdfDocument, converters::ConversionOptions};

pub(crate) enum FileType {
    Html,
    Text,
    Markdown,
}

impl FileType {
    fn from_str(s: &str) -> Option<Self> {
        match s.to_lowercase().as_str() {
            "html" => Some(FileType::Html),
            "text" => Some(FileType::Text),
            "markdown" => Some(FileType::Markdown),
            _ => None,
        }
    }
}

pub(crate) async fn function_handler(event: Request) -> Result<Response<Body>, Error> {
    let file_type = event
        .query_string_parameters_ref()
        .and_then(|params| params.first("file_type"))
        .unwrap_or("text");

    let file_type = FileType::from_str(file_type).unwrap_or(FileType::Text);

    let body = event.body().to_vec();
    let pdf_data = PdfDocument::from_bytes(body)?;

    let options = ConversionOptions::default();
    let page_count = pdf_data.page_count()?;
    let mut result = String::new();

    for i in 0..page_count {
        let page_content = match file_type {
            FileType::Html => pdf_data.to_html(i, &options)?,
            FileType::Text => pdf_data.to_plain_text(i, &options)?,
            FileType::Markdown => pdf_data.to_markdown(i, &options)?,
        };
        result.push_str(&page_content);
    }

    let content_type = match file_type {
        FileType::Html => "text/html",
        FileType::Text => "text/plain",
        FileType::Markdown => "text/markdown",
    };

    let resp = Response::builder()
        .status(200)
        .header("content-type", content_type)
        .body(result.into())
        .map_err(Box::new)?;
    Ok(resp)
}
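The handler's file_type handling can also be expressed with the standard FromStr trait, which lets callers use str::parse and avoids an inherent method that shares its name with the trait method. This is a minimal, std-only sketch. The helpers resolve_file_type and content_type are illustrative names, not part of the original handler:

```rust
use std::str::FromStr;

/// Output formats the handler can produce.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Html,
    Text,
    Markdown,
}

impl FileType {
    /// MIME type to send in the `content-type` response header.
    pub fn content_type(self) -> &'static str {
        match self {
            FileType::Html => "text/html",
            FileType::Text => "text/plain",
            FileType::Markdown => "text/markdown",
        }
    }
}

impl FromStr for FileType {
    type Err = String;

    /// Case-insensitive parse of the `file_type` query parameter.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "html" => Ok(FileType::Html),
            "text" => Ok(FileType::Text),
            "markdown" => Ok(FileType::Markdown),
            other => Err(format!("unsupported file_type: {other}")),
        }
    }
}

/// Mirrors the handler's fallback: a missing or unknown value becomes `Text`.
/// (Hypothetical helper for illustration.)
pub fn resolve_file_type(param: Option<&str>) -> FileType {
    param.and_then(|s| s.parse().ok()).unwrap_or(FileType::Text)
}
```

The Option<&str> produced by the query-string lookup can be passed to resolve_file_type directly. Deriving Copy also makes it explicit that the value can be matched on repeatedly inside the page loop.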

Local Testing

Start the local server

cargo lambda watch

Send a PDF via curl

curl -X POST 'http://localhost:9000/function/function_handler?file_type=markdown' \
  -H 'Content-Type: application/pdf' \
  --data-binary @document.pdf

Deployment

Build for production

cargo lambda build --release --arm64

The --arm64 flag targets AWS Graviton processors for better cost/performance.

Deploy to AWS

cargo lambda deploy

The first deployment will create an IAM role automatically. Subsequent deployments will reuse it.

Via AWS CLI

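# Build and package the function as a zip archive
cargo lambda build --release --arm64 --output-format zip

# Upload the archive to an existing function
aws lambda update-function-code \
  --function-name pdf-converter \
  --zip-file fileb://target/lambda/pdf-converter/bootstrap.zip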

Performance Benchmarks

pdf_oxide Performance

In benchmarks on 3,830 real-world PDFs, pdf_oxide posts the lowest mean time and the highest pass rate of the libraries compared:

Python PDF Libraries Comparison

Library      Mean Time   Pass Rate   License
pdf_oxide    0.8ms       100%        MIT
PyMuPDF      4.6ms       99.3%       AGPL-3.0
pypdfium2    4.1ms       99.2%       Apache-2.0
pdfminer     16.8ms      98.8%       MIT
pdfplumber   23.2ms      98.8%       MIT

Rust PDF Libraries Comparison

Library       Mean Time   Pass Rate
pdf_oxide     0.8ms       100%
unfpdf        2.8ms       95.1%
pdf_extract   4.08ms      91.5%
oxidize_pdf   13.5ms      99.1%

Among the Rust libraries, pdf_oxide is about 5× faster than pdf_extract (4.08ms vs 0.8ms) and about 17× faster than oxidize_pdf (13.5ms vs 0.8ms).
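The speedup figures follow directly from the mean times in the Rust comparison table. Here is a small sketch of that arithmetic, using only the numbers listed there (the constant and function names are illustrative):

```rust
/// pdf_oxide's mean time per PDF, in milliseconds, from the table.
pub const PDF_OXIDE_MS: f64 = 0.8;

/// Mean times of the other Rust libraries, in milliseconds, from the table.
pub const OTHER_RUST_LIBRARIES_MS: [(&str, f64); 3] = [
    ("unfpdf", 2.8),
    ("pdf_extract", 4.08),
    ("oxidize_pdf", 13.5),
];

/// How many times slower `baseline_ms` is than `candidate_ms`.
pub fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

/// Speedup of pdf_oxide over each of the other Rust libraries.
pub fn speedups() -> Vec<(&'static str, f64)> {
    OTHER_RUST_LIBRARIES_MS
        .iter()
        .map(|&(name, ms)| (name, speedup(ms, PDF_OXIDE_MS)))
        .collect()
}
```

The ratios work out to roughly 3.5× for unfpdf, 5.1× for pdf_extract, and 16.9× for oxidize_pdf, which the text rounds to 5× and 17×.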

AWS Lambda Cold Start

Rust compiles to a small native binary with no interpreter or VM to start, so cold starts are typically short:

Runtime                  Cold Start (typical)
Rust (provided.al2023)   ~50-100ms
Node.js                  ~100-200ms
Python                   ~100-300ms
Java                     ~500-2000ms

Memory Usage

With pdf_oxide's efficient design, memory usage stays low:

PDF Size   Peak Memory
100KB      ~5MB
1MB        ~20MB
10MB       ~100MB

API Usage

Request

POST /function/function_handler?file_type={html|text|markdown}
Content-Type: application/pdf

<binary PDF data>
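The handler passes the request body straight to the PDF parser, so a cheap pre-check can reject obviously bad input first. The sketch below is an assumption layered on top of the article's handler. MAX_PDF_BYTES, PayloadError, validate_pdf_payload, and the status-code mapping are all hypothetical names and choices. The 4 MiB cap is an arbitrary value picked to stay under Lambda's 6 MB synchronous payload limit once the body is base64-encoded:

```rust
/// Assumed size cap for this sketch; not a value from the article.
pub const MAX_PDF_BYTES: usize = 4 * 1024 * 1024;

/// Reasons a request body is rejected before PDF parsing.
#[derive(Debug, PartialEq, Eq)]
pub enum PayloadError {
    Empty,
    TooLarge { len: usize, max: usize },
    MissingPdfHeader,
}

/// Checks that the body is non-empty, within `max_bytes`, and starts with
/// the `%PDF-` marker. This is a strict check: files with leading junk
/// before the marker are rejected even though some readers tolerate them.
pub fn validate_pdf_payload(bytes: &[u8], max_bytes: usize) -> Result<(), PayloadError> {
    if bytes.is_empty() {
        return Err(PayloadError::Empty);
    }
    if bytes.len() > max_bytes {
        return Err(PayloadError::TooLarge { len: bytes.len(), max: max_bytes });
    }
    if !bytes.starts_with(b"%PDF-") {
        return Err(PayloadError::MissingPdfHeader);
    }
    Ok(())
}

/// One possible mapping from a rejection to an HTTP status code.
pub fn status_code(err: &PayloadError) -> u16 {
    match err {
        PayloadError::Empty => 400,            // Bad Request
        PayloadError::TooLarge { .. } => 413,  // Payload Too Large
        PayloadError::MissingPdfHeader => 415, // Unsupported Media Type
    }
}
```

A handler could call validate_pdf_payload on the body bytes before parsing and return the mapped status on failure instead of surfacing a parser error.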

Response

Returns the converted content with the matching content-type header: text/html, text/plain, or text/markdown.

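Example with AWS CLI

Because the handler is built on lambda_http, a direct invocation has to supply an API Gateway-style HTTP event (query string parameters plus a base64-encoded body). A bare object such as {"file_type": "markdown"} carries neither the query string nor the PDF, so treat the command below as the invocation shape only: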

aws lambda invoke \
  --function-name pdf-converter \
  --payload '{"file_type": "markdown"}' \
  --cli-binary-format raw-in-base64-out \
  response.json
