The problem
I kept hitting the same wall when building RAG pipelines over research papers: every generic PDF parser I tried mangled the equations.
Adobe Extract, AWS Textract, pdfplumber, PyMuPDF — they all collapse display math into plain-text garbage. Attention(Q, K, V) = softmax(QKᵀ / √d_k) V becomes something like:
QKT √dk
Attention(Q,K,V ) = softmax(
)V (1)
Unusable. Your embedding model sees a soup of tokens. Your LLM has no idea what the equation means. Your RAG answers are wrong on anything math-heavy.
What I tried
I benchmarked the obvious options on a handful of arxiv papers I cared about:
- Docling (IBM): drops every display equation as a placeholder. ~5/12 on a controlled equation-extraction benchmark.
- Nougat (Meta): the results were actually good when it worked, but the repo is essentially unmaintained and the dependency tree is a minefield.
- Mistral OCR: cheap and general-purpose, but equation fidelity is inconsistent on papers with dense notation.
- LlamaParse: optimized for "give me RAG chunks", not "preserve the math".
- Marker (github.com/datalab-to/marker): the only OSS tool that consistently produced clean LaTeX. Scored ~10.5/12 on the same benchmark Docling scored 5 on.
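For context on how scores like 10.5/12 arise: the benchmark gave partial credit per equation rather than strict pass/fail. I haven't published the harness, but the scorer had roughly this shape — normalize the extracted LaTeX, then fall back to a similarity ratio for near misses. The normalization rules and thresholds below are illustrative, not the exact ones I used:

```python
import re
from difflib import SequenceMatcher

def normalize_latex(s: str) -> str:
    """Strip whitespace and a few cosmetic variants so that, e.g.,
    '\\frac{QK^T}{\\sqrt{d_k}}' and '\\frac{ QK^{T} }{ \\sqrt{d_k} }'
    compare as equal."""
    s = re.sub(r"\s+", "", s)
    s = s.replace("^{T}", "^T").replace("\\left", "").replace("\\right", "")
    return s

def score_equation(extracted: str, reference: str) -> float:
    """1.0 for an exact normalized match; otherwise a similarity ratio,
    so a mangled-but-recognizable equation earns partial credit."""
    a, b = normalize_latex(extracted), normalize_latex(reference)
    if a == b:
        return 1.0
    return round(SequenceMatcher(None, a, b).ratio(), 2)
```

Summing a float like this per equation over a 12-equation set is what produces non-integer totals.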
Why I didn't just use Marker directly
Marker is the right tool, but running it yourself is not trivial:
- 5GB of model weights to download on first run
- CUDA + PyTorch + transformers + torchvision version dance
- GPU server to host it (T4 or better — CPU inference takes ~10x longer)
- A queue because parses take 60–180 seconds and you can't block an HTTP request that long
- Idle GPU bills when nobody is parsing anything
For a side project, this was 2+ days of yak shaving before I could POST my first PDF. I wanted a one-line API.
What I built
I wrapped Marker in a Modal deployment and put an async FastAPI app on top of it. Two endpoints:
# Submit a paper
curl -X POST https://scientific-paper-parser1.p.rapidapi.com/parse-paper \
-H "X-RapidAPI-Key: $KEY" \
-F "url=https://arxiv.org/pdf/1706.03762"
# → {"call_id": "fc-01K...", "status": "queued"}
# Poll for the result
curl https://scientific-paper-parser1.p.rapidapi.com/parse-paper/$ID \
-H "X-RapidAPI-Key: $KEY"
# → {"status": "done", "result": {"markdown": "...", ...}}
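From the client side, the flow is submit once, then poll until done. Here is a minimal polling helper with the HTTP call injected as a callable, so the retry logic stands on its own; the function name, interval, and timeout are my sketch, not part of the API:

```python
import time

def poll_until_done(poll_fn, interval=5.0, max_wait=600.0):
    """Call poll_fn() until it reports status 'done', or give up.

    poll_fn returns a dict shaped like the API responses above:
    {"status": "processing"} or {"status": "done", "result": {...}}.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        resp = poll_fn()
        if resp.get("status") == "done":
            return resp["result"]
        time.sleep(interval)  # parses take 60-180s; no point polling faster
    raise TimeoutError("parse did not finish within max_wait")
```

In practice `poll_fn` would be a lambda wrapping the GET request with your RapidAPI key.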
And on the same Attention paper, it returns:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Clean LaTeX. It embeds well in any RAG pipeline, renders in any markdown viewer with math support, and goes straight into Claude or GPT without preprocessing.
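One practical note on embedding this output: when chunking the markdown for retrieval, treat each $$...$$ block as atomic so an equation never gets split across chunks. A minimal chunker sketch — the size threshold and greedy packing strategy are illustrative, not a prescribed pipeline:

```python
import re

def chunk_markdown(md: str, max_chars: int = 1000):
    """Split parsed markdown into RAG-sized chunks, keeping each
    $$...$$ display equation whole so no chunk contains half an equation."""
    # Capturing group keeps the equations in the split output.
    pieces = re.split(r"(\$\$.*?\$\$)", md, flags=re.DOTALL)
    chunks, cur = [], ""
    for piece in pieces:
        if cur and len(cur) + len(piece) > max_chars:
            chunks.append(cur)
            cur = ""
        cur += piece  # a single oversized piece stays whole by design
    if cur:
        chunks.append(cur)
    return chunks
```

A piece longer than max_chars is kept intact rather than split, which is the behavior you want for a long equation.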
The Modal architecture
Three things made the economics work:
1. Persistent volume for model weights. The first container to start downloads Marker's ~5GB of weights into a Modal Volume; every later container mounts that volume and reuses them. Cold start with a warm volume is ~10 seconds instead of ~5 minutes.
import modal

app = modal.App("paper-parser")  # app name is illustrative
models_volume = modal.Volume.from_name("marker-models", create_if_missing=True)

@app.cls(
    volumes={"/root/.cache/datalab": models_volume},
    gpu="T4",
    scaledown_window=300,
)
class Parser:
    @modal.enter()
    def load_models(self):
        # Runs once per container start; weights load from the mounted volume.
        from marker.converters.pdf import PdfConverter
        from marker.models import create_model_dict
        self.converter = PdfConverter(artifact_dict=create_model_dict())
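The parse method itself isn't shown above. Stripped of the Modal scaffolding, its core is "bytes in, temp file, converter, markdown out". Here's that core as a plain function with the converter and text-extraction step injected as callables — in the real deployment those would be the PdfConverter instance (which takes a file path) and Marker's text-extraction helper; this factoring and the names are my sketch, not the deployed code:

```python
import tempfile

def parse_pdf_bytes(pdf_bytes, converter, extract_text):
    """Run a Marker-style converter over raw PDF bytes.

    converter: callable taking a file path, returning a rendered document
    extract_text: callable pulling markdown out of the rendered document
    """
    # Marker's converter wants a path, not bytes, so spill to a temp file.
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()
        rendered = converter(tmp.name)
    return {"markdown": extract_text(rendered)}
```

Keeping the method's body this thin is also what makes it easy to test without a GPU.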
2. spawn-and-poll for long parses. A 50-page paper takes 90–180 seconds. You can't hold an HTTP connection open that long, especially not behind a CDN. Modal's function.spawn() returns a FunctionCall object you can look up by ID later:
from fastapi import FastAPI, File, Form, UploadFile
import modal

api = FastAPI()

@api.post("/parse-paper")
async def submit(file: UploadFile = File(None), url: str = Form(None)):
    pdf_bytes = await _fetch_pdf(file, url)   # read the upload or download the URL
    call = Parser().parse.spawn(pdf_bytes)    # fire-and-forget onto the GPU worker
    return {"call_id": call.object_id, "status": "queued"}

@api.get("/parse-paper/{call_id}")
async def poll(call_id: str):
    call = modal.FunctionCall.from_id(call_id)
    try:
        # timeout=0 returns immediately if finished, raises TimeoutError if not
        result = call.get(timeout=0)
        return {"status": "done", "result": result}
    except TimeoutError:
        return {"status": "processing"}
3. Scale-to-zero. scaledown_window=300 keeps a warm container for 5 minutes after the last request. After that, the container shuts down and idle cost drops to zero. I pay only for the seconds I'm actually parsing something.
The business side
I put it behind a RapidAPI listing so distribution is one click for anyone comparing parsing APIs. Free tier is 2 papers/month (no credit card) and paid plans start at $9/mo for 75 papers.
I'm not trying to beat Marker on quality (it IS Marker). I'm not trying to beat Mistral OCR on price (I can't). I'm solving one specific problem: "I want Marker quality without running a GPU server."
Honest about what this is not
- Not my model. It's Marker (Apache 2.0), hosted. I'm explicit about this on the landing page.
- Not the cheapest per-page option. Mistral OCR is cheaper if you don't care about equation fidelity.
- Not for scanned PDFs. Typeset only — Marker doesn't do OCR.
- Not for arxiv-only workflows. There's a free tool called arxiv2md that parses arxiv's HTML source if that's all you need.
Where this fits
If you're doing RAG over biorxiv, chemrxiv, published journal PDFs, internal research docs, or any scientific PDF that isn't on arxiv, and equation fidelity matters for your answers, this saves you a weekend.
Landing: https://paper-parser-landing.vercel.app
API: https://rapidapi.com/kjyounai/api/scientific-paper-parser1
Feedback welcome — especially if you've tried self-hosting Marker before or have opinions on the async polling pattern. Happy to answer questions about the Modal setup or the Marker tradeoffs in the comments.