<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sergey Shmakov</title>
    <description>The latest articles on DEV Community by Sergey Shmakov (@sergeyshmakov).</description>
    <link>https://dev.to/sergeyshmakov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F377612%2F8f5facd4-874b-435e-92f4-05e50e4b722c.jpeg</url>
      <title>DEV Community: Sergey Shmakov</title>
      <link>https://dev.to/sergeyshmakov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sergeyshmakov"/>
    <language>en</language>
    <item>
      <title>Structuring MinerU output into a clean doc tree</title>
      <dc:creator>Sergey Shmakov</dc:creator>
      <pubDate>Thu, 04 Jun 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/sergeyshmakov/structuring-mineru-output-into-a-clean-doc-tree-2cdm</link>
      <guid>https://dev.to/sergeyshmakov/structuring-mineru-output-into-a-clean-doc-tree-2cdm</guid>
      <description>&lt;p&gt;&lt;em&gt;Last Updated: 2026-06-04&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/blog/clause-aligned-batching-large-pdf-mineru/" rel="noopener noreferrer"&gt;part 1&lt;/a&gt; I parsed the whole 5,039-page ECMA-376 Part 1 standard with &lt;a href="https://github.com/opendatalab/MinerU" rel="noopener noreferrer"&gt;MinerU&lt;/a&gt; on &lt;a href="https://runpod.io?ref=31jdfpnq" rel="noopener noreferrer"&gt;RunPod&lt;/a&gt;: 36 clause-aligned batches and about $1.15 of GPU time produced &lt;strong&gt;36 &lt;code&gt;content_list.json&lt;/code&gt; files&lt;/strong&gt;. That’s where most write-ups stop, but parsed is not the same as usable. A vision-language model hands you a flat stream of typed blocks with OCR quirks and no document structure. For a coding agent to answer “what does &lt;code&gt;§17.9.4 isLgl&lt;/code&gt; (legal numbering) say?” it needs &lt;em&gt;one small, faithful, addressable file&lt;/em&gt;, not a 200-page batch of blocks.&lt;/p&gt;

&lt;p&gt;This post is the post-processing half: cleaning, structuring, cross-linking, and verifying the output. The whole thing is distilled into a small, document-agnostic toolkit you can run on your own MinerU output: &lt;strong&gt;&lt;a href="https://github.com/sergeyshmakov/mineru-runpod/tree/main/examples/doc-structuring" rel="noopener noreferrer"&gt;&lt;code&gt;examples/doc-structuring/&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; (start with the &lt;a href="https://github.com/sergeyshmakov/mineru-runpod/blob/main/examples/doc-structuring/README.md" rel="noopener noreferrer"&gt;README&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;End state: &lt;strong&gt;5,039 pages became 9,948 Markdown files&lt;/strong&gt;, every section addressable, ~7,900 cross-references turned into relative links, and every tag and attribute name verified against the official schema &lt;em&gt;and&lt;/em&gt; the source PDF.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does MinerU’s content_list.json actually give you?
&lt;/h2&gt;

&lt;p&gt;A flat, page-ordered list of typed blocks: &lt;code&gt;text&lt;/code&gt; (with an optional &lt;code&gt;text_level&lt;/code&gt; for headings), &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;table&lt;/code&gt; (HTML in &lt;code&gt;table_body&lt;/code&gt;), &lt;code&gt;code&lt;/code&gt; (in &lt;code&gt;code_body&lt;/code&gt;, not &lt;code&gt;text&lt;/code&gt;), &lt;code&gt;equation&lt;/code&gt;, and &lt;code&gt;image&lt;/code&gt; (with a VLM &lt;code&gt;content&lt;/code&gt; description). Mixed in is &lt;code&gt;page_number&lt;/code&gt;/&lt;code&gt;header&lt;/code&gt; noise. No tree, no cross-reference graph, and a long tail of OCR artifacts.&lt;/p&gt;

&lt;p&gt;That shape is fine for a quick read. It’s wrong for retrieval. Three jobs turn it into something an agent can navigate: rebuild the structure, render each block faithfully to Markdown, and verify the names didn’t get garbled on the way through the model. Everything below is generic; nothing in the toolkit knows about ECMA-376 specifically.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you rebuild the document structure?
&lt;/h2&gt;

&lt;p&gt;The blocks arrive in reading order, so structure is just segmenting that stream by heading. A single forward walk does it: each block belongs to the section whose heading appeared most recently. Boundaries are only ever set by a real heading, so a section can never steal a neighbour’s content. You inject one callback, &lt;code&gt;heading_id(block)&lt;/code&gt;, where all your domain logic lives.&lt;/p&gt;

&lt;p&gt;That callback is the only place document specifics enter: numbered headings, styled &lt;code&gt;text_level&lt;/code&gt; lines, the occasional heading MinerU buried inside a code caption. The forward walk is the part that doesn’t break, because it has no notion of page or position to get wrong. The segmenter lives in &lt;a href="https://github.com/sergeyshmakov/mineru-runpod/blob/main/examples/doc-structuring/segment.py" rel="noopener noreferrer"&gt;&lt;code&gt;segment.py&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
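
&lt;p&gt;A minimal sketch of that forward walk, not the code in &lt;code&gt;segment.py&lt;/code&gt;. The &lt;code&gt;front-matter&lt;/code&gt; bucket and the &lt;code&gt;numbered_heading&lt;/code&gt; callback are hypothetical names for illustration, and the callback assumes headings start with a dotted clause number:&lt;/p&gt;

```python
from typing import Callable, Optional

Block = dict  # one entry from content_list.json


def segment(blocks: list[Block],
            heading_id: Callable[[Block], Optional[str]]) -> dict[str, list[Block]]:
    """Assign every block to the most recently seen heading in one forward pass."""
    current = "front-matter"  # hypothetical bucket for blocks before the first heading
    sections: dict[str, list[Block]] = {current: []}
    for block in blocks:
        found = heading_id(block)
        if found is not None:
            # Only a real heading can open a new section.
            current = found
            sections.setdefault(current, [])
        sections[current].append(block)
    return sections


def numbered_heading(block: Block) -> Optional[str]:
    """Example callback: a text_level line that starts with a clause number is a heading."""
    if block.get("type") != "text" or "text_level" not in block:
        return None
    first = block.get("text", "").split(" ", 1)[0]
    return first if all(part.isdigit() for part in first.split(".")) else None
```

&lt;p&gt;Calling &lt;code&gt;segment(blocks, numbered_heading)&lt;/code&gt; returns a mapping from section id to its blocks, in document order. Swapping the callback is the only change needed for a document with a different heading style.&lt;/p&gt;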

&lt;p&gt;Then the tree itself (&lt;a href="https://github.com/sergeyshmakov/mineru-runpod/blob/main/examples/doc-structuring/tree.py" rel="noopener noreferrer"&gt;&lt;code&gt;tree.py&lt;/code&gt;&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a section &lt;strong&gt;with&lt;/strong&gt; sub-sections becomes a folder plus a barrel file (&lt;code&gt;*-0-index.md&lt;/code&gt;) holding its intro prose and a child index;&lt;/li&gt;
&lt;li&gt;a leaf becomes &lt;strong&gt;one file&lt;/strong&gt;. The golden rule: never split a leaf into parts.&lt;/li&gt;
&lt;li&gt;an agent walks root → barrel → barrel → leaf, reading one small index per level instead of one giant batch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Naming encodes the section id plus a short slug (&lt;code&gt;17-4-37-tbl-table.md&lt;/code&gt;), so any clause is glob-findable and the camelCase tag (&lt;code&gt;instrText&lt;/code&gt;) survives for a regex lookup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is the rendering the hard part?
&lt;/h2&gt;

&lt;p&gt;Because faithful Markdown is where every MinerU artifact bites. The structuring is a clean algorithm; the rendering is a pile of special cases, each one traced to a real defect on this run. Skip any of them and the content silently corrupts: examples render empty, tables lose columns, prose gets promoted to a heading.&lt;/p&gt;

&lt;p&gt;Every fix in &lt;a href="https://github.com/sergeyshmakov/mineru-runpod/blob/main/examples/doc-structuring/render.py" rel="noopener noreferrer"&gt;&lt;code&gt;render.py&lt;/code&gt;&lt;/a&gt; earned its place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code lives in &lt;code&gt;code_body&lt;/code&gt; (pre-fenced) and &lt;code&gt;code_caption&lt;/code&gt;, not &lt;code&gt;text&lt;/code&gt;.&lt;/strong&gt; Miss this and all 3,591 XML examples render &lt;em&gt;empty&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;[Example: … end example]&lt;/code&gt; markers&lt;/strong&gt; have to bracket the code. They routinely land before or inside the fence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Page-split code halves&lt;/strong&gt; (one example broken across a page) sit directly adjacent, so merge them. Genuinely separate examples have prose between them, so they’re left alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mislabelled fences&lt;/strong&gt; (&lt;code&gt;txt&lt;/code&gt;/&lt;code&gt;asp&lt;/code&gt;/&lt;code&gt;hcl&lt;/code&gt; that actually hold XML) get relabelled to &lt;code&gt;xml&lt;/code&gt;, but only when the body has namespaced tags, so a text-output example stays text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tables&lt;/strong&gt;: HTML to Markdown. Inline XML examples wrapped as &lt;code&gt;$&amp;lt;w:…&amp;gt;$&lt;/code&gt; math get deleted by naive tag-stripping unless you protect them. Fully-empty illustration columns get dropped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lists&lt;/strong&gt;: no doubled bullets (&lt;code&gt;- - foo&lt;/code&gt;), and ordered items keep their numbers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A long or sentence-ending &lt;code&gt;text_level&lt;/code&gt; block is figure text or prose, not a heading.&lt;/strong&gt; Don’t render it as &lt;code&gt;##&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$§17.4.62$&lt;/code&gt;-wrapped references and &lt;code&gt;\-&lt;/code&gt; escaped-dash bullets get normalized.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are exotic. They’re just what VLM output looks like at scale, and each one quietly damages content if you skip it.&lt;/p&gt;
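
&lt;p&gt;The page-split merge is the easiest of these to show in isolation. The sketch below is an illustration rather than the code in &lt;code&gt;render.py&lt;/code&gt;. It assumes &lt;code&gt;page_number&lt;/code&gt; and &lt;code&gt;header&lt;/code&gt; noise has already been filtered out, and that &lt;code&gt;code_body&lt;/code&gt; holds the raw code with any fence already stripped:&lt;/p&gt;

```python
def merge_adjacent_code(blocks: list[dict]) -> list[dict]:
    """Join code blocks that sit directly next to each other in the stream."""
    merged: list[dict] = []
    for block in blocks:
        previous = merged[-1] if merged else None
        both_code = (
            previous is not None
            and previous.get("type") == "code"
            and block.get("type") == "code"
        )
        if both_code:
            # Two halves of one example split by a page break: glue the bodies.
            previous["code_body"] = (
                previous.get("code_body", "") + "\n" + block.get("code_body", "")
            )
        else:
            merged.append(dict(block))  # copy so the input list stays untouched
    return merged
```

&lt;p&gt;Any prose block between two code blocks resets the run, which is what keeps genuinely separate examples apart.&lt;/p&gt;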

&lt;h2&gt;
  
  
  How do you cross-link a densely-referenced spec?
&lt;/h2&gt;

&lt;p&gt;ECMA-376 cites itself constantly (&lt;code&gt;§17.9.11&lt;/code&gt;, &lt;code&gt;ST_Jc (§17.18.44)&lt;/code&gt;, and so on). Two moves in &lt;a href="https://github.com/sergeyshmakov/mineru-runpod/blob/main/examples/doc-structuring/crosslink.py" rel="noopener noreferrer"&gt;&lt;code&gt;crosslink.py&lt;/code&gt;&lt;/a&gt;: normalize every reference to one canonical &lt;code&gt;§N.N.N&lt;/code&gt; form so a single regex collects them all, then turn each resolvable one into a relative Markdown link to the target’s file or barrel.&lt;/p&gt;

&lt;p&gt;The regex is &lt;code&gt;§(\d+(?:\.\d+)*|[A-Z](?:\.\d+)+)&lt;/code&gt;, which catches both numbered clauses and lettered annexes and gives you the citation graph for free. Links are relative on purpose: they’re computed section-to-section &lt;em&gt;within the tree&lt;/em&gt;, so they stay valid wherever you mount it. No host path baked in, no rewrite on move. On this run, &lt;strong&gt;7,877 of 7,959 references (98%)&lt;/strong&gt; became working relative links.&lt;/p&gt;
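
&lt;p&gt;To make the relative-link step concrete, here is a small sketch built around that regex. It is not the code in &lt;code&gt;crosslink.py&lt;/code&gt;: the &lt;code&gt;link_refs&lt;/code&gt; function, the &lt;code&gt;targets&lt;/code&gt; mapping from section id to file path, and the file names in the usage note are all assumptions made for illustration:&lt;/p&gt;

```python
import posixpath
import re

# The reference pattern quoted above: numbered clauses or lettered annexes.
REF = re.compile(r"§(\d+(?:\.\d+)*|[A-Z](?:\.\d+)+)")


def link_refs(markdown: str, current_file: str, targets: dict[str, str]) -> str:
    """Turn each resolvable section reference into a link relative to current_file."""
    here = posixpath.dirname(current_file)

    def replace(match: re.Match) -> str:
        target = targets.get(match.group(1))
        if target is None:
            return match.group(0)  # unresolved references stay as plain text
        return f"[{match.group(0)}]({posixpath.relpath(target, here)})"

    return REF.sub(replace, markdown)
```

&lt;p&gt;With a hypothetical target &lt;code&gt;17/17-18/17-18-44-ST_Jc.md&lt;/code&gt; and a current file under &lt;code&gt;17/17-9/&lt;/code&gt;, the reference &lt;code&gt;§17.18.44&lt;/code&gt; becomes a link to &lt;code&gt;../17-18/17-18-44-ST_Jc.md&lt;/code&gt;. Both paths are relative to the tree root, so the result does not depend on where the tree is mounted.&lt;/p&gt;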

&lt;h2&gt;
  
  
  How do you verify the parse is actually correct?
&lt;/h2&gt;

&lt;p&gt;Two independent signals, both in &lt;a href="https://github.com/sergeyshmakov/mineru-runpod/blob/main/examples/doc-structuring/verify.py" rel="noopener noreferrer"&gt;&lt;code&gt;verify.py&lt;/code&gt;&lt;/a&gt;. First, a vocabulary check: build the canonical set of element, attribute, and type names plus enum values straight from the official XSDs, then flag any name in the tree that isn’t in it but closely resembles one. Second, a source cross-check against the PDF text layer, which is the definitive one.&lt;/p&gt;

&lt;p&gt;The vocabulary check (&lt;code&gt;Vocabulary.from_xsd([...])&lt;/code&gt;) is fast and needs no PDF. It catches the obvious garbles: &lt;code&gt;fontAlign&lt;/code&gt; when the schema only knows &lt;code&gt;fontAlgn&lt;/code&gt;. But it misses the nasty case where a misread happens to spell a &lt;em&gt;different&lt;/em&gt; real name.&lt;/p&gt;

&lt;p&gt;That’s what the source cross-check is for. A name a file uses that is absent from that section’s own PDF pages, while a near-miss correct name &lt;em&gt;is&lt;/em&gt; present, is a confirmed garble. The PDF text layer is independent ground truth. This catches &lt;code&gt;algn&lt;/code&gt; misread as &lt;code&gt;align&lt;/code&gt;: &lt;code&gt;align&lt;/code&gt; is a real element elsewhere, so it passes the vocabulary check, but on the actual page the source says &lt;code&gt;algn&lt;/code&gt;. The check is bounded: each token is tested only against its section’s pages and deduped, so there’s no whole-document scan and no processing hole.&lt;/p&gt;
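
&lt;p&gt;Both signals reduce to the same question asked against a different reference set, which a short sketch can show. This is an illustration using the standard library’s &lt;code&gt;difflib&lt;/code&gt;, not the matching logic in &lt;code&gt;verify.py&lt;/code&gt;. The &lt;code&gt;near_misses&lt;/code&gt; helper, the 0.8 cutoff, and the three small sets are assumptions, not data from the run:&lt;/p&gt;

```python
import difflib


def near_misses(names: set[str], reference: set[str], cutoff: float = 0.8) -> dict[str, str]:
    """Map each name absent from `reference` to its closest name that is present."""
    candidates = sorted(reference)
    flagged = {}
    for name in sorted(names - reference):
        close = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
        if close:
            flagged[name] = close[0]
    return flagged


# Illustrative data only, not taken from the real run.
schema_vocabulary = {"algn", "align", "fontAlgn", "rPr"}   # names the XSDs declare
section_page_tokens = {"algn", "fontAlgn", "rPr"}          # tokens on this section's pages
names_in_file = {"align", "fontAlign", "rPr"}              # names the rendered file uses

vocab_flags = near_misses(names_in_file, schema_vocabulary)
page_flags = near_misses(names_in_file, section_page_tokens)
```

&lt;p&gt;Here &lt;code&gt;vocab_flags&lt;/code&gt; contains only &lt;code&gt;fontAlign&lt;/code&gt;, because &lt;code&gt;align&lt;/code&gt; is a legitimate schema name. &lt;code&gt;page_flags&lt;/code&gt; also maps &lt;code&gt;align&lt;/code&gt; to &lt;code&gt;algn&lt;/code&gt;, since the page itself never says &lt;code&gt;align&lt;/code&gt;.&lt;/p&gt;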

&lt;p&gt;Confirmed garbles feed a vetted correction map (&lt;a href="https://github.com/sergeyshmakov/mineru-runpod/blob/main/examples/doc-structuring/corrections.py" rel="noopener noreferrer"&gt;&lt;code&gt;corrections.py&lt;/code&gt;&lt;/a&gt;), applied scoped to name contexts so a garble that collides with a real name is corrected only as an attribute, never as an element. Re-run the verifier and it reports zero. On this run that fixed a consistent &lt;code&gt;align→algn&lt;/code&gt; / &lt;code&gt;fontAlign→fontAlgn&lt;/code&gt; class across DrawingML, plus &lt;code&gt;displacedByCustomXML&lt;/code&gt;, &lt;code&gt;t12br→tl2br&lt;/code&gt;, &lt;code&gt;subseted&lt;/code&gt;, and more, each confirmed against the PDF before it was applied.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why replace OCR’d schema dumps instead of correcting them?
&lt;/h2&gt;

&lt;p&gt;Because for the annexes you already have the real source, so OCR is the wrong input. The annexes are machine-generated schema listings (the full XSD and RELAX-NG for the formats). MinerU OCR’d them like everything else, producing the same garbles: &lt;code&gt;CT_Placelder&lt;/code&gt; for &lt;code&gt;CT_Placeholder&lt;/code&gt;, underscores read as spaces, a dangling fragment where a split cut mid-element. The real schema files exist, so swap them in.&lt;/p&gt;

&lt;p&gt;The generic core is &lt;a href="https://github.com/sergeyshmakov/mineru-runpod/blob/main/examples/doc-structuring/schema.py" rel="noopener noreferrer"&gt;&lt;code&gt;schema.py&lt;/code&gt;&lt;/a&gt;. Index every declaration in the official &lt;code&gt;.xsd&lt;/code&gt;/&lt;code&gt;.rnc&lt;/code&gt;, work out which schema file each annex dump came from (highest declaration-name overlap), and replace each parsed declaration with the authoritative one, matched by name then kind, exact → case-insensitive → fuzzy. The authoritative &lt;em&gt;kind&lt;/em&gt; even drives the output folder and filename, so a mis-named OCR fragment self-corrects on rebuild. Result: &lt;strong&gt;99.8% (5,699 of 5,710) declarations replaced&lt;/strong&gt;. The 11 that were too garbled to match confidently keep their OCR text.&lt;/p&gt;
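
&lt;p&gt;The exact → case-insensitive → fuzzy cascade can be sketched in a few lines. This version matches on name only and leaves out the kind check, so it is a simplification rather than what &lt;code&gt;schema.py&lt;/code&gt; does. The &lt;code&gt;match_declaration&lt;/code&gt; name and the 0.85 cutoff are assumptions:&lt;/p&gt;

```python
import difflib
from typing import Optional


def match_declaration(ocr_name: str, official: dict[str, str],
                      cutoff: float = 0.85) -> Optional[str]:
    """Return the official declaration name for an OCR'd one, or None if unsure."""
    # 1. Exact match.
    if ocr_name in official:
        return ocr_name
    # 2. Case-insensitive match.
    folded = {name.lower(): name for name in official}
    hit = folded.get(ocr_name.lower())
    if hit is not None:
        return hit
    # 3. Fuzzy match, accepted only above the cutoff.
    close = difflib.get_close_matches(ocr_name, list(official), n=1, cutoff=cutoff)
    return close[0] if close else None
```

&lt;p&gt;Given an index that contains &lt;code&gt;CT_Placeholder&lt;/code&gt;, the garbled &lt;code&gt;CT_Placelder&lt;/code&gt; resolves through the fuzzy step. A name that matches nothing returns &lt;code&gt;None&lt;/code&gt;, which is the case where the OCR text is kept.&lt;/p&gt;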

&lt;h2&gt;
  
  
  How do you close the long tail of one-off OCR damage?
&lt;/h2&gt;

&lt;p&gt;The residue is per-instance damage that won’t generalize into a correction map: a &lt;code&gt;\@&lt;/code&gt; date-switch read as &lt;code&gt;$@$&lt;/code&gt;, a &lt;code&gt;&amp;lt;&lt;/code&gt; read as &lt;code&gt;#&lt;/code&gt;, a dropped &lt;code&gt;)&lt;/code&gt;, glued attribute names. Past a point, stop writing heuristics. Let agents propose fixes and gate every one of them on the source PDF.&lt;/p&gt;

&lt;p&gt;I ran an adversarial multi-agent fan-out. The worklist was ~100 items: every &lt;code&gt;§17.16.5&lt;/code&gt; field clause, enum truncations found by diffing the rendered table against the authoritative XSD, and the named defects. Each item carried its PDF ground truth. About 130 agents proposed fixes, and only &lt;strong&gt;PDF-verified&lt;/strong&gt; ones were accepted: 41 patches, zero that the build couldn’t apply.&lt;/p&gt;

&lt;p&gt;They live as a per-section overlay (&lt;a href="https://github.com/sergeyshmakov/mineru-runpod/blob/main/examples/doc-structuring/corrections.py" rel="noopener noreferrer"&gt;&lt;code&gt;apply_overlay&lt;/code&gt;&lt;/a&gt;) applied &lt;em&gt;last&lt;/em&gt;, so each &lt;code&gt;find&lt;/code&gt; matches the on-disk text and a stale one gets reported on the next rebuild rather than vanishing silently. After all of it, &lt;code&gt;verify_against_pdf&lt;/code&gt; reports &lt;strong&gt;0 actionable garbles&lt;/strong&gt;. The 11 it still flags were each reviewed against the PDF as genuine, distinct OOXML names (&lt;code&gt;useFirstPageNumber&lt;/code&gt; &lt;em&gt;and&lt;/em&gt; &lt;code&gt;firstPageNumber&lt;/code&gt; both exist; &lt;code&gt;o:cname&lt;/code&gt; confirmed on p4968) and recorded as benign.&lt;/p&gt;
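
&lt;p&gt;A rough sketch of why applying the overlay last makes stale patches visible. The real &lt;code&gt;apply_overlay&lt;/code&gt; may look different. &lt;code&gt;apply_patches&lt;/code&gt; is a hypothetical stand-in, and the &lt;code&gt;replace&lt;/code&gt; key is an assumption (only &lt;code&gt;find&lt;/code&gt; is named above):&lt;/p&gt;

```python
def apply_patches(text: str, patches: list[dict]) -> tuple[str, list[dict]]:
    """Apply find/replace patches to rendered text and collect the ones that no longer match."""
    stale = []
    for patch in patches:
        if patch["find"] in text:
            # Replace the first occurrence only, so one patch fixes one instance.
            text = text.replace(patch["find"], patch["replace"], 1)
        else:
            # The rendered text changed underneath the patch: report it, don't drop it.
            stale.append(patch)
    return text, stale
```

&lt;p&gt;A patch that turns &lt;code&gt;$@$&lt;/code&gt; back into &lt;code&gt;\@&lt;/code&gt; either applies to the on-disk text or comes back in the stale list for review on the next rebuild.&lt;/p&gt;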

&lt;h2&gt;
  
  
  What does the finished tree look like?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;5,039 pages, 36 MinerU batches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;9,948 Markdown files&lt;/strong&gt; (4,245 leaf clauses, 356 barrels, 5,130 split schema declarations, 1 root index)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-references linked&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;7,877 / 7,959 (98%)&lt;/strong&gt;, relative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification&lt;/td&gt;
&lt;td&gt;names vs official XSD vocab (2,058 elements + 1,806 attributes) &lt;strong&gt;and&lt;/strong&gt; vs the source PDF text layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The full worked wiring is &lt;a href="https://github.com/sergeyshmakov/mineru-runpod/blob/main/examples/doc-structuring/example_pipeline.py" rel="noopener noreferrer"&gt;&lt;code&gt;example_pipeline.py&lt;/code&gt;&lt;/a&gt;. Note how little domain code it is: an outline loader, a heading detector, a naming scheme, a reference regex, and a correction map, all driven by CLI flags with nothing hard-coded. Everything else is the library.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where does this fall down?
&lt;/h2&gt;

&lt;p&gt;Honest limits, because the toolkit isn’t magic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The long tail is OCR, not logic&lt;/strong&gt;, and verification closes it, not more regex. The systematic fixes get you most of the way, the authoritative-schema swap handles the annexes, and the per-instance residue is closed by the adversarial fan-out where every proposed edit is gated by the source PDF before it lands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification needs a schema and a text-layer PDF.&lt;/strong&gt; Without an authoritative vocabulary you lose the first signal; without a real text layer (a scanned PDF) you lose the second.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure quality equals outline quality.&lt;/strong&gt; The tree is only as good as the section hierarchy you feed it (here, the PDF bookmarks). Garbage outline, garbage tree.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Isn’t MinerU’s Markdown output enough?
&lt;/h3&gt;

&lt;p&gt;For reading, sometimes. For an addressable, agent-navigable, verified corpus, no. You need structure (the tree), a citation graph (the cross-links), and verification against ground truth. That’s the post-processing this toolkit does on top of what MinerU emits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a per-section PDF cross-check instead of just trusting the schema?
&lt;/h3&gt;

&lt;p&gt;Because a garble can collide with a valid name elsewhere (&lt;code&gt;align&lt;/code&gt; is a real element), so the schema vocabulary alone passes it. The source page is the only authority on which name belongs &lt;em&gt;here&lt;/em&gt;. Scoping the check to the section’s own pages keeps it cheap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I have to use it on ECMA-376?
&lt;/h3&gt;

&lt;p&gt;No, it’s document-agnostic. Supply your own &lt;code&gt;Section&lt;/code&gt; hierarchy and a few callbacks and it runs on any MinerU output. See the &lt;a href="https://github.com/sergeyshmakov/mineru-runpod/blob/main/examples/doc-structuring/README.md" rel="noopener noreferrer"&gt;README&lt;/a&gt;. ECMA-376 is just the worked example.&lt;/p&gt;

&lt;h3&gt;
  
  
  Relative or absolute cross-links?
&lt;/h3&gt;

&lt;p&gt;Relative, computed within the tree. They’re identical wherever the tree is mounted, so the output ships anywhere with zero rewriting.&lt;/p&gt;

&lt;h3&gt;
  
  
  What block types does content_list.json contain?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;text&lt;/code&gt; (with an optional &lt;code&gt;text_level&lt;/code&gt; for headings), &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;table&lt;/code&gt; (HTML in &lt;code&gt;table_body&lt;/code&gt;), &lt;code&gt;code&lt;/code&gt; (in &lt;code&gt;code_body&lt;/code&gt;), &lt;code&gt;equation&lt;/code&gt;, and &lt;code&gt;image&lt;/code&gt; (with a VLM &lt;code&gt;content&lt;/code&gt; description), plus &lt;code&gt;page_number&lt;/code&gt; and &lt;code&gt;header&lt;/code&gt; noise you filter out.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you handle code that MinerU split across a page?
&lt;/h3&gt;

&lt;p&gt;The two halves arrive directly adjacent in the block stream, so merge adjacent code blocks. Genuinely separate examples always have prose between them, so they’re left untouched.&lt;/p&gt;

&lt;p&gt;If this saved you time, the easiest way to say thanks is &lt;a href="https://runpod.io?ref=31jdfpnq" rel="noopener noreferrer"&gt;signing up for RunPod through this link&lt;/a&gt;. Star the &lt;a href="https://github.com/sergeyshmakov/mineru-runpod" rel="noopener noreferrer"&gt;repo on GitHub&lt;/a&gt; for updates.&lt;/p&gt;




&lt;p&gt;&lt;small&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; RunPod links in this post use a referral code that credits me at no cost to you. The post would read the same without it.&lt;/small&gt;&lt;/p&gt;

</description>
      <category>mineru</category>
      <category>runpod</category>
      <category>pdfparsing</category>
      <category>documentai</category>
    </item>
    <item>
      <title>Fix RunPod's 'no resources to deploy your pod' error</title>
      <dc:creator>Sergey Shmakov</dc:creator>
      <pubDate>Wed, 03 Jun 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/sergeyshmakov/fix-runpods-no-resources-to-deploy-your-pod-error-3gc4</link>
      <guid>https://dev.to/sergeyshmakov/fix-runpods-no-resources-to-deploy-your-pod-error-3gc4</guid>
      <description>&lt;p&gt;&lt;em&gt;Last Updated: 2026-06-03&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If RunPod fails a deploy with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;This machine does not have the resources to deploy your pod. Please try a different machine.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;your Docker image is fine. This is a capacity error: RunPod’s scheduler tried to place your pod on a host that didn’t have a free GPU of the type you asked for, and bailed. It’s transient. The fix is to make RunPod try again on a different host, and &lt;em&gt;how&lt;/em&gt; you trigger that retry depends on which RunPod product you’re using. On a &lt;strong&gt;Serverless endpoint&lt;/strong&gt; wired to GitHub, push any commit to the watched branch. On a &lt;strong&gt;RunPod Hub&lt;/strong&gt; template, cut a new GitHub Release. Those two triggers are not interchangeable, which is the part that trips people up.&lt;/p&gt;

&lt;p&gt;I hit this constantly while maintaining the &lt;a href="https://github.com/sergeyshmakov/mineru-runpod" rel="noopener noreferrer"&gt;mineru-runpod&lt;/a&gt; template. The rest of this post is what the error actually means, why it’s not your fault, and the exact retry mechanic for each workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does “this machine does not have the resources to deploy your pod” mean on RunPod?
&lt;/h2&gt;

&lt;p&gt;It means RunPod’s scheduler picked a physical host to run your pod, found that host couldn’t satisfy the requested GPU, RAM, or disk, and refused the placement. It’s a per-machine capacity miss, not a global outage and not a problem with your image. “Try a different machine” is literal advice: another host of the same GPU type may have room.&lt;/p&gt;

&lt;p&gt;The message fires during the &lt;strong&gt;scheduling phase&lt;/strong&gt;, before your container ever starts. That timing is the tell. A broken image fails differently: you’d see an image-pull error, a non-zero container exit, or a failed health check. This message means the scheduler never got that far. It looked at the GPU type you requested, compared it against free capacity on the candidate host, and found the fit impossible.&lt;/p&gt;

&lt;p&gt;For most people the requested GPU is the binding constraint. Popular pools like the RTX 4090 get contended, especially in a busy region. When every host of that type in that data center is full, a fresh placement attempt fails until one frees up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is it a RunPod bug, or did I break something?
&lt;/h2&gt;

&lt;p&gt;Neither. It’s a legitimate, expected response from RunPod’s scheduler reporting that the GPU pool you targeted had no free host at that moment. Your Dockerfile, your handler, and your config are all irrelevant to this error. GPU capacity on RunPod fluctuates minute to minute, so the same deploy that fails now often succeeds 30 seconds later on a different host.&lt;/p&gt;

&lt;p&gt;The reason it feels like a bug is that it’s non-deterministic. The exact same config fails one minute and works the next, purely because the cluster’s free capacity moved. That’s also why retrying works: you’re not changing anything about your build, you’re just asking the scheduler to roll the dice again against a pool whose occupancy has shifted.&lt;/p&gt;

&lt;p&gt;Where you actually meet this string matters. RunPod throws it whenever it spins up a &lt;em&gt;real pod&lt;/em&gt;, which in this template’s world is two places: the &lt;strong&gt;Hub validator test pod&lt;/strong&gt; that runs after a release, and a &lt;strong&gt;GPU Pod&lt;/strong&gt; you launch directly. A Serverless worker that can’t find capacity usually surfaces it as workers stuck in a throttled or initializing state rather than this exact sentence, but the underlying cause and the recovery are the same: force a rebuild so RunPod reschedules.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do I fix it on a RunPod Serverless endpoint?
&lt;/h2&gt;

&lt;p&gt;Push any commit to the branch your endpoint watches. Per RunPod’s docs, &lt;a href="https://docs.runpod.io/serverless/github-integration" rel="noopener noreferrer"&gt;“every git push to your specified branch results in an updated Endpoint”&lt;/a&gt;, so a no-op commit triggers a fresh build and redeploy. The new workers get scheduled again, almost always landing on a host with free capacity. You can also hit &lt;strong&gt;Rebuild&lt;/strong&gt; in the RunPod console to do the same thing without a commit.&lt;/p&gt;

&lt;p&gt;This is the path most people need. If you deployed your worker by connecting a GitHub repo, your endpoint redeploys on every push, so a trivial commit is the lowest-friction way to re-roll the scheduler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
git commit &lt;span class="nt"&gt;--allow-empty&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"chore: re-trigger RunPod build"&lt;/span&gt;

git push

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--allow-empty&lt;/code&gt; flag is the point: you don’t need a real change to force a rebuild, you just need a new commit on the watched branch. RunPod’s layer caching means the rebuild is fast after the first one, since only the layers that changed get rebuilt (and for an empty commit, none did).&lt;/p&gt;

&lt;p&gt;If you’d rather not pollute history, the console’s manual &lt;strong&gt;Rebuild&lt;/strong&gt; button is the cleaner equivalent. Either way you’re doing the same thing: asking RunPod to provision workers again, on hosts whose occupancy has moved since the last attempt.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do I fix it when publishing a RunPod Hub template?
&lt;/h2&gt;

&lt;p&gt;Cut a new GitHub Release. The Hub does not watch commits. Per RunPod’s publishing guide, &lt;a href="https://docs.runpod.io/hub/publishing-guide" rel="noopener noreferrer"&gt;“repository integration connects with GitHub repos using releases (not commits) for versioning”&lt;/a&gt;, so pushing to your branch does nothing on the Hub side. Only a new Release re-runs the build and the validator test pod, which is where this error shows up for template authors.&lt;/p&gt;

&lt;p&gt;Here’s the trick that saves you: a Release is just a tag, and a tag can point at a commit you already have. You don’t need to change a single line of code to re-trigger the Hub. Tag the same &lt;code&gt;HEAD&lt;/code&gt; you already shipped and publish a Release for it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
git tag v1.6.4 &lt;span class="c"&gt;# same commit, new tag&lt;/span&gt;

git push origin v1.6.4

&lt;span class="c"&gt;# then publish a GitHub Release for v1.6.4 in the UI or via gh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RunPod treats each tag as a distinct template version and re-runs the full pipeline: build the image, spin up the validator test pod from &lt;code&gt;.runpod/tests.json&lt;/code&gt;, and index the listing (usually within an hour). If the previous Release failed only because the validator pod couldn’t get a GPU, the new Release gives it a fresh roll of the scheduler.&lt;/p&gt;

&lt;p&gt;The catch worth stating: each retry adds a version to your Hub listing, even if two versions are byte-identical. That’s the cost of the release-driven model. It’s cosmetic, but if you retry five times you’ll have five versions, so don’t spin on it if the failure is actually persistent (see below).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is the retry different for Serverless vs the Hub?
&lt;/h2&gt;

&lt;p&gt;Because the two products use different GitHub triggers. A Serverless endpoint rebuilds on &lt;strong&gt;every push&lt;/strong&gt; to its watched branch, so a commit is your retry. The Hub builds only on a &lt;strong&gt;new Release tag&lt;/strong&gt;, so a release is your retry. Pushing commits at a Hub listing does nothing; pushing releases at a Serverless endpoint isn’t how it watches for changes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Serverless endpoint&lt;/th&gt;
&lt;th&gt;RunPod Hub template&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Build trigger&lt;/td&gt;
&lt;td&gt;Every push to the watched branch&lt;/td&gt;
&lt;td&gt;New GitHub Release (tag)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry the deploy by&lt;/td&gt;
&lt;td&gt;Empty commit, or &lt;strong&gt;Rebuild&lt;/strong&gt; in console&lt;/td&gt;
&lt;td&gt;New Release on the same commit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Who needs this&lt;/td&gt;
&lt;td&gt;Anyone running a GitHub-connected worker&lt;/td&gt;
&lt;td&gt;Template authors publishing to the Hub&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where the error appears&lt;/td&gt;
&lt;td&gt;Workers stuck initializing / throttled&lt;/td&gt;
&lt;td&gt;The validator test pod after a Release&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most readers are in the left column. You deployed a worker from a repo (or one-click from the Hub) and you’re iterating on it; a commit re-rolls it. The right column is for the smaller group of people &lt;em&gt;authoring&lt;/em&gt; a Hub listing, where the validator test pod is the thing hitting the capacity wall. If you’re not publishing your own Hub template, you can ignore the release workflow entirely. For the full deploy walkthrough, the &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/getting-started/overview/" rel="noopener noreferrer"&gt;getting-started guide&lt;/a&gt; covers both paths.&lt;/p&gt;

&lt;h2&gt;
  
  
  What if re-triggering doesn’t fix it?
&lt;/h2&gt;

&lt;p&gt;If the same GPU pool fails every time, the capacity miss is persistent, not transient, and rolling the scheduler won’t help. Switch to a higher-availability GPU pool, change region, or lower the resources you’re requesting. For the mineru-runpod Hub validator, that means editing the &lt;code&gt;gpuTypeId&lt;/code&gt; in &lt;code&gt;.runpod/tests.json&lt;/code&gt; to a pool that’s actually free, then cutting a new Release.&lt;/p&gt;

&lt;p&gt;The template defaults its validator to &lt;code&gt;"NVIDIA GeForce RTX 4090"&lt;/code&gt; because it has the best pool availability across RunPod’s regions. &lt;code&gt;"NVIDIA RTX A5000"&lt;/code&gt; works too but tends to be scarcer. I’ve bounced the test pod’s GPU between the A40, the A5000, and the 4090 across releases, chasing whichever pool had capacity on a given day, and the 4090 wins most often.&lt;/p&gt;
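
&lt;p&gt;If you switch pools often, the edit can be scripted. The sketch below assumes only that &lt;code&gt;.runpod/tests.json&lt;/code&gt; is valid JSON with one or more &lt;code&gt;gpuTypeId&lt;/code&gt; keys somewhere inside it. It makes no assumption about how those keys are nested, and both function names are hypothetical:&lt;/p&gt;

```python
import json
from pathlib import Path


def set_gpu_type(node, gpu_type_id: str) -> int:
    """Recursively replace every 'gpuTypeId' value; return how many were changed."""
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "gpuTypeId":
                node[key] = gpu_type_id
                changed += 1
            else:
                changed += set_gpu_type(value, gpu_type_id)
    elif isinstance(node, list):
        for item in node:
            changed += set_gpu_type(item, gpu_type_id)
    return changed


def retarget_tests_file(path: str, gpu_type_id: str) -> int:
    """Rewrite the tests file in place with the new GPU pool."""
    file = Path(path)
    data = json.loads(file.read_text(encoding="utf-8"))
    changed = set_gpu_type(data, gpu_type_id)
    file.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return changed
```

&lt;p&gt;Run &lt;code&gt;retarget_tests_file(".runpod/tests.json", "NVIDIA GeForce RTX 4090")&lt;/code&gt;, commit the change, and cut the new Release. A return value of 0 means no &lt;code&gt;gpuTypeId&lt;/code&gt; key was found, so check the file by hand.&lt;/p&gt;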

&lt;p&gt;Three levers when retries aren’t enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Change the GPU type&lt;/strong&gt; to a less-contended pool. A 24 GB workload fits several pools; pick the one with capacity rather than the one you assumed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change the region&lt;/strong&gt; if your endpoint or template pins one. Capacity is per data center, so a pool that’s full in one region can be wide open in another.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce the request.&lt;/strong&gt; Oversized container disk or volume sizes shrink the set of hosts that can fit your pod. Trim them if they’re padded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For template authors, there’s also an escape hatch documented in the &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/troubleshooting/" rel="noopener noreferrer"&gt;troubleshooting guide&lt;/a&gt;: if a release is urgent and the validator is the only blocker, rename &lt;code&gt;.runpod/tests.json&lt;/code&gt; to &lt;code&gt;.runpod/tests_.json&lt;/code&gt; so the Hub skips the test pod entirely. You lose all CI signal, so it’s a temporary unblock, not a default. For the GPU-pool math behind these choices, see &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/choosing-gpu/" rel="noopener noreferrer"&gt;Choosing a GPU&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is “this machine does not have the resources to deploy your pod” a RunPod outage?
&lt;/h3&gt;

&lt;p&gt;No. It’s a per-host capacity miss, not a global outage. The scheduler tried one machine, found it full, and stopped. Other hosts of the same GPU type may have room, which is why a retry often succeeds within seconds even while RunPod is otherwise healthy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does pushing a commit fix the error on a RunPod Hub template?
&lt;/h3&gt;

&lt;p&gt;No. The Hub builds only on new GitHub Releases, not commits. Pushing to your branch leaves the Hub listing untouched. You have to publish a new Release (a new tag, which can point at the same commit) to re-run the Hub build and its validator test pod.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I re-trigger a RunPod Serverless build without changing code?
&lt;/h3&gt;

&lt;p&gt;Push an empty commit with &lt;code&gt;git commit --allow-empty&lt;/code&gt; to the watched branch, or click &lt;strong&gt;Rebuild&lt;/strong&gt; in the RunPod console. Both force a fresh build and redeploy, so workers get scheduled again on hosts whose free capacity has shifted since the last attempt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I create a GitHub Release without a new commit?
&lt;/h3&gt;

&lt;p&gt;Yes. A Release is anchored to a tag, and a tag can point at any existing commit. Tag your current &lt;code&gt;HEAD&lt;/code&gt; and publish a Release for it. RunPod treats every new Release as a new version and re-runs the build, so this re-triggers the Hub without any code change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does the same deploy fail once and work the next time?
&lt;/h3&gt;

&lt;p&gt;GPU capacity on RunPod fluctuates minute to minute. The same config hits a full host on one attempt and a free host on the next, with nothing about your image changing. That non-determinism is exactly why retrying is the first thing to try.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I stop hitting the capacity error repeatedly?
&lt;/h3&gt;

&lt;p&gt;Stop targeting a contended pool. Switch &lt;code&gt;gpuTypeId&lt;/code&gt; to a higher-availability GPU (RTX 4090 pools are usually the most available), change region, or reduce requested disk and volume sizes so more hosts can fit your pod.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to next
&lt;/h2&gt;

&lt;p&gt;The error is annoying but harmless once you know it’s capacity and not your build. For a Serverless worker, a commit re-rolls it; for a Hub template, a Release does. If it persists, it’s a pool-availability problem, and the fix lives in your GPU choice, not your Dockerfile. The full set of Hub build failures (this one, the CUDA floor mismatch, and the 30-minute build timeout) is catalogued in the &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/troubleshooting/" rel="noopener noreferrer"&gt;troubleshooting guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If this saved you a debugging session, star the &lt;a href="https://github.com/sergeyshmakov/mineru-runpod" rel="noopener noreferrer"&gt;repo on GitHub&lt;/a&gt; for updates, or open an issue if you hit a build failure that isn’t covered here.&lt;/p&gt;

</description>
      <category>runpod</category>
      <category>serverless</category>
      <category>runpodhub</category>
      <category>github</category>
    </item>
    <item>
      <title>Clause-aligned batching for large PDFs on MinerU + RunPod</title>
      <dc:creator>Sergey Shmakov</dc:creator>
      <pubDate>Wed, 03 Jun 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/sergeyshmakov/clause-aligned-batching-for-large-pdfs-on-mineru-runpod-3lio</link>
      <guid>https://dev.to/sergeyshmakov/clause-aligned-batching-for-large-pdfs-on-mineru-runpod-3lio</guid>
      <description>&lt;p&gt;&lt;em&gt;Last Updated: 2026-06-03&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ecma-international.org/publications-and-standards/standards/ecma-376/" rel="noopener noreferrer"&gt;ECMA-376&lt;/a&gt; Part 1 (the &lt;em&gt;Office Open XML File Formats — Fundamentals and Markup Language Reference&lt;/em&gt;) is &lt;strong&gt;5,039 pages&lt;/strong&gt; of dense, table-heavy, XML-schema-laden specification in a single 35 MB PDF. It is the document that defines &lt;code&gt;.docx&lt;/code&gt;, &lt;code&gt;.xlsx&lt;/code&gt;, and &lt;code&gt;.pptx&lt;/code&gt; down to the attribute. If you want a machine-readable, clause-addressable version of it, you have to parse all 5,039 pages, and almost everything about that page count makes a naive approach fall over.&lt;/p&gt;

&lt;p&gt;This is the story of parsing the whole thing through the &lt;a href="https://github.com/sergeyshmakov/mineru-runpod" rel="noopener noreferrer"&gt;mineru-runpod&lt;/a&gt; serverless worker. The headline result: &lt;strong&gt;36 batches, 5,039 pages, 46,637 content blocks, 4,174 tables, full contiguous coverage, ~$1.15 of GPU time.&lt;/strong&gt; The interesting part is not the total. It’s &lt;em&gt;how you cut a 5,000-page document into pieces without breaking it&lt;/em&gt;, which it turns out is a decision about clause structure, not page numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just send the whole 5,000-page PDF?
&lt;/h2&gt;

&lt;p&gt;The worker accepts a &lt;code&gt;file_url&lt;/code&gt; and parses front-to-back, so technically you could send all 5,039 pages as one job. You shouldn’t, for four reasons that all get worse with size:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All-or-nothing failure.&lt;/strong&gt; A single job that dies at page 4,800 (OOM, a transient GPU eviction, a timeout) costs you the entire run. At ~78 minutes of GPU work (more on that below), that’s an expensive coin flip.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No resumability.&lt;/strong&gt; One job has no natural checkpoint. If it fails you start over.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The 20 MB response cap.&lt;/strong&gt; MinerU’s output for a few hundred pages already blows past &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/blog/runpod-20mb-response-cap-r2-bridge/" rel="noopener noreferrer"&gt;RunPod’s ~20 MB sync-response ceiling&lt;/a&gt;. For 5,000 pages it isn’t close: the extracted output here was &lt;strong&gt;869 MB on disk&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory.&lt;/strong&gt; Holding the layout model output for thousands of pages in one process is a needless VRAM/RAM risk when the work is embarrassingly sliceable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Batching fixes all four, but only if you batch at the &lt;em&gt;right&lt;/em&gt; boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why cut at clause boundaries instead of every N pages?
&lt;/h2&gt;

&lt;p&gt;The obvious split is “every 100 pages.” The problem: a standard isn’t a stream of interchangeable pages, it’s a tree of clauses. Clause §17.4 (Tables) might start three lines from the bottom of a page and run for 40 pages. If a batch boundary lands in the middle of it, you’ve torn a logical unit across two parse jobs, and every downstream step (clause extraction, cross-referencing, chunking for retrieval) has to stitch it back together.&lt;/p&gt;

&lt;p&gt;So I don’t cut by page count. I cut by &lt;strong&gt;clause&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build an outline from the PDF’s &lt;strong&gt;4,600+ bookmarks&lt;/strong&gt;, giving a clause → page index for the whole document.&lt;/li&gt;
&lt;li&gt;Place batch boundaries &lt;strong&gt;only at clause starts&lt;/strong&gt;, never mid-clause.&lt;/li&gt;
&lt;li&gt;Treat the huge top-level clauses (§17 WordprocessingML, §18 SpreadsheetML, §19 PresentationML, §20/§21 DrawingML, §22 Shared MLs, and the annexes) as &lt;strong&gt;mandatory anchors&lt;/strong&gt;, so a big reference section always begins a fresh batch.&lt;/li&gt;
&lt;li&gt;Aim for ~100 pages per batch, allow up to ~200, and accept whatever the nearest clause boundary gives.&lt;/li&gt;
&lt;/ol&gt;
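&lt;p&gt;The four rules above can be sketched as a small greedy planner. This is an illustration under stated assumptions, not the runner’s actual code: &lt;code&gt;plan_batches&lt;/code&gt;, its parameters, and the sample page numbers are hypothetical, pages are 0-based, and each range’s end is exclusive.&lt;/p&gt;

```python
def plan_batches(clause_starts, anchors, total_pages, target=100, limit=200):
    """Cut only at clause starts; return (start, end) ranges with end exclusive.

    Hypothetical sketch: `target` and `limit` mirror the ~100 / ~200 page
    goals, and `anchors` are clause starts that must open a fresh batch.
    """
    anchor_set = set(anchors)
    # Candidate cut points: every clause start plus the mandatory anchors.
    cuts = sorted((set(clause_starts) | anchor_set) - {0})
    batches = []
    begin = 0
    for i, page in enumerate(cuts):
        following = cuts[i + 1] if i + 1 != len(cuts) else total_pages
        size_now = page - begin
        size_if_we_wait = following - begin
        # Cut at an anchor, once the batch has reached the target, or when
        # waiting for the next clause start would push it past the limit.
        if page in anchor_set or size_now >= target or size_if_we_wait > limit:
            batches.append((begin, page))
            begin = page
    batches.append((begin, total_pages))
    return batches


# Made-up outline: clause starts at these page indexes, one mandatory anchor.
plan = plan_batches([0, 40, 90, 130, 150, 400, 420], anchors=[150], total_pages=500)
print(plan)  # [(0, 130), (130, 150), (150, 400), (400, 500)]
```

&lt;p&gt;Note the 250-page batch in the sample output: a single clause longer than the limit is never split, so the limit is a soft preference rather than a hard cap.&lt;/p&gt;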

&lt;p&gt;The result was &lt;strong&gt;36 batches averaging 140 pages&lt;/strong&gt; (smallest 66, largest 238). Every batch starts and ends on a clause edge, so no clause is ever split across the seam between two parse jobs.&lt;/p&gt;

&lt;p&gt;(One calibration gotcha specific to this PDF: printed page 1 is PDF page index 9. There’s a 9-page front-matter offset you have to fold into the bookmark→page mapping or every boundary is off by nine.)&lt;/p&gt;

&lt;p&gt;A useful consequence: because the worker slices the PDF server-side via &lt;code&gt;start_page&lt;/code&gt;/&lt;code&gt;end_page&lt;/code&gt; (see the &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/reference/api/" rel="noopener noreferrer"&gt;API reference&lt;/a&gt;), &lt;strong&gt;you never pre-split the PDF&lt;/strong&gt;. You upload it once and each batch job asks for its page range out of the same source file.&lt;/p&gt;
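&lt;p&gt;As a rough sketch, each batch becomes one request body over the same source URL. The &lt;code&gt;file_url&lt;/code&gt;, &lt;code&gt;start_page&lt;/code&gt;, &lt;code&gt;end_page&lt;/code&gt;, and &lt;code&gt;transport&lt;/code&gt; fields are the ones named in this post, wrapped in RunPod’s usual &lt;code&gt;input&lt;/code&gt; envelope. The helper name, the example URL, and the assumption that &lt;code&gt;end_page&lt;/code&gt; is inclusive are mine; check the API reference before relying on them.&lt;/p&gt;

```python
def build_jobs(file_url, batches):
    """Turn (start, end_exclusive) page ranges into per-batch request bodies.

    Hypothetical helper. Assumes 0-based pages and an inclusive end_page.
    """
    jobs = []
    for index, (start, end_exclusive) in enumerate(batches):
        jobs.append({
            "label": f"b{index:02d}",  # local bookkeeping only, not sent to the worker
            "payload": {
                "input": {
                    "file_url": file_url,
                    "start_page": start,
                    "end_page": end_exclusive - 1,  # assumption: inclusive bound
                    "transport": "s3",
                }
            },
        })
    return jobs


jobs = build_jobs("https://example.com/spec.pdf", [(0, 130), (130, 150)])
print(jobs[1]["label"], jobs[1]["payload"]["input"])
```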

&lt;h2&gt;
  
  
  Did the batches actually cover the whole document?
&lt;/h2&gt;

&lt;p&gt;Yes, and this is worth verifying mechanically rather than trusting. After the run, I checked each batch’s &lt;em&gt;produced&lt;/em&gt; page span against its &lt;em&gt;planned&lt;/em&gt; range and confirmed the batches tile the document with no gaps and no overlaps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Batches&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pages&lt;/td&gt;
&lt;td&gt;5,039 (contiguous &lt;strong&gt;0–5038&lt;/strong&gt;, 0 gaps, 0 overlaps)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content blocks&lt;/td&gt;
&lt;td&gt;46,637&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tables&lt;/td&gt;
&lt;td&gt;4,174&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code blocks&lt;/td&gt;
&lt;td&gt;3,591&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pages/batch&lt;/td&gt;
&lt;td&gt;mean 140, min 66 (b35), max 238 (b25)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output downloaded&lt;/td&gt;
&lt;td&gt;~465 MB (compressed tarballs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output on disk&lt;/td&gt;
&lt;td&gt;869 MB (extracted)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The contiguity check is the one piece of validation I wouldn’t skip on a document this size: it’s the difference between “the run finished” and “the run is complete.”&lt;/p&gt;
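&lt;p&gt;A minimal version of that check, assuming each batch reports an inclusive, 0-based &lt;code&gt;(first_page, last_page)&lt;/code&gt; span. The function name and the sample spans are illustrative, not taken from the runner.&lt;/p&gt;

```python
def check_tiling(spans, total_pages):
    """Return (gaps, overlaps) for inclusive 0-based page spans."""
    gaps, overlaps = [], []
    expected = 0  # next page index we expect a batch to start on
    for first, last in sorted(spans):
        if first > expected:
            gaps.append((expected, first - 1))
        elif first != expected:
            # The span starts before the previous one ended.
            overlaps.append((first, min(last, expected - 1)))
        expected = max(expected, last + 1)
    if total_pages > expected:
        gaps.append((expected, total_pages - 1))  # tail of the document missing
    elif expected != total_pages:
        raise ValueError("spans run past the end of the document")
    return gaps, overlaps


clean = check_tiling([(0, 129), (130, 149), (150, 399), (400, 499)], 500)
broken = check_tiling([(0, 99), (110, 199), (190, 299)], 300)
print(clean)   # ([], [])
print(broken)  # ([(100, 109)], [(190, 199)])
```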

&lt;h2&gt;
  
  
  How was the document transported in and out?
&lt;/h2&gt;

&lt;p&gt;Two different transports, for two different size problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input: R2 URL.&lt;/strong&gt; At 35 MB the PDF is well over the 20 MB inline (&lt;code&gt;file_b64&lt;/code&gt;) limit, so it can’t ride in the request body. I put it on Cloudflare &lt;strong&gt;R2&lt;/strong&gt; and passed a public URL as &lt;code&gt;file_url&lt;/code&gt;. The worker downloads it (≤200 MB cap) and slices the requested pages itself. One upload, 36 jobs read from it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output: &lt;code&gt;transport="s3"&lt;/code&gt;.&lt;/strong&gt; Per-batch output is large (the biggest batch produced a &lt;strong&gt;32 MB&lt;/strong&gt; tarball), so embedding results in the sync response was out. With &lt;code&gt;transport="s3"&lt;/code&gt;, the worker uploads each result &lt;code&gt;.tar.gz&lt;/code&gt; back to R2 and returns a presigned URL the client downloads and extracts. The tarball carries everything: &lt;code&gt;content_list.json&lt;/code&gt; (the flat, typed, page-indexed block list I treat as source of truth), the rendered markdown, &lt;code&gt;middle.json&lt;/code&gt;, and a layout-overlay PDF.&lt;/p&gt;

&lt;p&gt;The presigned URL has a &lt;strong&gt;1-hour TTL&lt;/strong&gt;, which has a real consequence for batching: you must download each batch’s result &lt;em&gt;as its job finishes&lt;/em&gt;, not in a sweep at the end of a 78-minute run. By then the early URLs have expired.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GPU and backend, and what did throughput look like?
&lt;/h2&gt;

&lt;p&gt;Backend: &lt;code&gt;vlm-auto-engine&lt;/code&gt; (MinerU 2.5 Pro, the &lt;code&gt;MinerU2.5-Pro-2605-1.2B&lt;/code&gt; vision-language model) on a 24 GB AMPERE_24 (RTX A5000-class) &lt;a href="https://runpod.io?ref=31jdfpnq" rel="noopener noreferrer"&gt;RunPod&lt;/a&gt; serverless GPU. One parse per worker (&lt;code&gt;MINERU_MAX_CONCURRENCY=1&lt;/code&gt;: vLLM’s KV cache isn’t safe to drive from concurrent parses on a 24 GB card). For how to pick a card, see &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/choosing-gpu/" rel="noopener noreferrer"&gt;Choosing a GPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Across the 35 timed batches, total GPU compute was &lt;strong&gt;4,674.8 s (77.9 min)&lt;/strong&gt; at an overall &lt;strong&gt;1.04 pages/sec&lt;/strong&gt;, with individual batches ranging &lt;strong&gt;0.84–1.27 pp/s&lt;/strong&gt; depending on table density. A few representative batches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Batch&lt;/th&gt;
&lt;th&gt;Clause&lt;/th&gt;
&lt;th&gt;Pages&lt;/th&gt;
&lt;th&gt;Worker time&lt;/th&gt;
&lt;th&gt;pp/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;b00&lt;/td&gt;
&lt;td&gt;§1 Scope (front matter)&lt;/td&gt;
&lt;td&gt;176&lt;/td&gt;
&lt;td&gt;145.0 s&lt;/td&gt;
&lt;td&gt;1.21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b01&lt;/td&gt;
&lt;td&gt;§17 WordprocessingML&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;102.8 s&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b17&lt;/td&gt;
&lt;td&gt;§18.17.7 (functions)&lt;/td&gt;
&lt;td&gt;176&lt;/td&gt;
&lt;td&gt;147.9 s&lt;/td&gt;
&lt;td&gt;1.19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b25&lt;/td&gt;
&lt;td&gt;§21.2 DrawingML – Charts&lt;/td&gt;
&lt;td&gt;238&lt;/td&gt;
&lt;td&gt;231.8 s&lt;/td&gt;
&lt;td&gt;1.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b32&lt;/td&gt;
&lt;td&gt;Annex L Primer&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;80.2 s&lt;/td&gt;
&lt;td&gt;1.27&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cost worked out to roughly &lt;strong&gt;$0.00023/page, ~$1.15 for the whole standard&lt;/strong&gt;. Before committing to that, a &lt;strong&gt;3-page smoke test&lt;/strong&gt; (cents, ~110–130 s dominated by cold start) validated the entire pipeline end-to-end (URL fetch → parse → R2 upload → download → extract), the cheapest insurance you can buy on a big run.&lt;/p&gt;
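&lt;p&gt;The figures above are easy to re-derive. The snippet below only recomputes numbers already quoted in this post (the per-batch rows from the table, the total GPU seconds, and the total cost); nothing in it is a new measurement.&lt;/p&gt;

```python
# (pages, worker seconds) for the representative batches in the table above.
rows = {
    "b00": (176, 145.0),
    "b01": (100, 102.8),
    "b17": (176, 147.9),
    "b25": (238, 231.8),
    "b32": (102, 80.2),
}

# Pages per second for each batch, rounded the way the table shows them.
rates = {name: round(pages / seconds, 2) for name, (pages, seconds) in rows.items()}

gpu_minutes = round(4674.8 / 60, 1)        # total GPU compute in minutes
mean_batch_pages = round(5039 / 36)        # average pages per batch
cost_per_page = round(1.15 / 5039, 5)      # dollars per page from the ~$1.15 total

print(rates)
print(gpu_minutes, mean_batch_pages, cost_per_page)
```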

&lt;h2&gt;
  
  
  The parallelism lesson: it’s RunPod-side, not in the worker
&lt;/h2&gt;

&lt;p&gt;This is the part that cost the most confusion: &lt;strong&gt;a single batch is already parallelized inside the worker&lt;/strong&gt; (the VLM batches many page-images through the GPU at once), but &lt;strong&gt;running multiple batches at once is a RunPod scaling decision, not something you trigger by submitting more jobs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I learned this the hard way. The client submitted 3 batches concurrently, and RunPod ran exactly one while two sat in the queue. The endpoint was configured &lt;code&gt;workersMax=1&lt;/code&gt;: one GPU worker, one batch at a time, no matter how many jobs you fire. Raising &lt;code&gt;workersMax&lt;/code&gt; to 3 (and matching the client’s concurrency) is what actually delivered 3×: the remaining 31 batches then finished in &lt;strong&gt;27.8 minutes wall-clock&lt;/strong&gt;. The &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/scaling/" rel="noopener noreferrer"&gt;scaling guide&lt;/a&gt; covers how concurrency and &lt;code&gt;workersMax&lt;/code&gt; interact.&lt;/p&gt;

&lt;p&gt;The mental-model fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inside one job:&lt;/strong&gt; pages are parallelized on one GPU. Already maxed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;workersMax&lt;/code&gt;:&lt;/strong&gt; how many separate GPUs run separate jobs at once. &lt;em&gt;This&lt;/em&gt; is your throughput dial.&lt;/li&gt;
&lt;/ul&gt;
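&lt;p&gt;A toy model makes the difference concrete. It assumes jobs are handed out first-come, first-served to whichever worker frees up first, ignores cold starts, and uses made-up job durations, so treat it as an intuition aid rather than a description of RunPod’s scheduler.&lt;/p&gt;

```python
import heapq


def wall_clock(durations, workers):
    """Estimate total wall-clock seconds for a queue of jobs on N workers."""
    free_at = [0.0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for seconds in durations:
        start = heapq.heappop(free_at)        # earliest available worker
        heapq.heappush(free_at, start + seconds)
    return max(free_at)


durations = [100.0] * 6  # six hypothetical 100-second batches
one_worker = wall_clock(durations, workers=1)
three_workers = wall_clock(durations, workers=3)
print(one_worker, three_workers)  # 600.0 200.0
```

&lt;p&gt;Submitting all six jobs at once changes nothing in the one-worker case: the queue is simply longer. Only the &lt;code&gt;workers&lt;/code&gt; argument, standing in for &lt;code&gt;workersMax&lt;/code&gt;, moves the result.&lt;/p&gt;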

&lt;p&gt;A related myth worth busting: MinerU’s pipeline logs mention a &lt;code&gt;window_size=64&lt;/code&gt;. That is a &lt;strong&gt;GPU throughput batch&lt;/strong&gt; (how many page-images stream through the model at a time to bound VRAM), &lt;strong&gt;not a context window&lt;/strong&gt;. Pages are recognized independently regardless of it, so it has zero effect on content continuity across pages. Which is exactly &lt;em&gt;why&lt;/em&gt; clause-aligned &lt;strong&gt;batch&lt;/strong&gt; boundaries matter and the internal window size doesn’t: continuity is something you protect at the batch layer, not by tuning a throughput knob.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which clauses produced the most structure?
&lt;/h2&gt;

&lt;p&gt;Block and table counts track the content shape of the standard almost perfectly. The reference-material and function-catalog clauses dominate:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Batch&lt;/th&gt;
&lt;th&gt;Clause&lt;/th&gt;
&lt;th&gt;Blocks&lt;/th&gt;
&lt;th&gt;Tables&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;b25&lt;/td&gt;
&lt;td&gt;§21.2 DrawingML – Charts&lt;/td&gt;
&lt;td&gt;2,829&lt;/td&gt;
&lt;td&gt;378&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b17&lt;/td&gt;
&lt;td&gt;§18.17.7 (spreadsheet functions)&lt;/td&gt;
&lt;td&gt;2,805&lt;/td&gt;
&lt;td&gt;239&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b26&lt;/td&gt;
&lt;td&gt;§22 Shared MLs&lt;/td&gt;
&lt;td&gt;2,509&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b10&lt;/td&gt;
&lt;td&gt;§17.17 Miscellaneous&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;238&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b11&lt;/td&gt;
&lt;td&gt;§18 SpreadsheetML&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are the dense element/attribute reference tables that make ECMA-376 what it is. They’re a good reminder to spot-check table fidelity on exactly these batches before trusting the output downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  The annex schema dumps look completely different
&lt;/h2&gt;

&lt;p&gt;The most striking per-batch contrast is the annexes. Annex A (W3C XML Schema), Annex B (RELAX NG) and friends are &lt;strong&gt;long code listings&lt;/strong&gt;, not prose with tables, and the numbers show it. Same ~150-page batch size, radically smaller output:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Batch&lt;/th&gt;
&lt;th&gt;Annex&lt;/th&gt;
&lt;th&gt;Tarball&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;b29&lt;/td&gt;
&lt;td&gt;Annex B (RELAX NG)&lt;/td&gt;
&lt;td&gt;1.15 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b30&lt;/td&gt;
&lt;td&gt;B.3 PresentationML&lt;/td&gt;
&lt;td&gt;1.75 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b27&lt;/td&gt;
&lt;td&gt;Annex A (XML Schema)&lt;/td&gt;
&lt;td&gt;2.06 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b28&lt;/td&gt;
&lt;td&gt;A.3 PresentationML&lt;/td&gt;
&lt;td&gt;2.23 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compare that to the prose-and-table batches that ran &lt;strong&gt;31–32 MB&lt;/strong&gt; (b10, b11) for a similar page count: roughly a 15× size difference driven entirely by content type. MinerU classifies the schema listings as code, so they compress to almost nothing relative to a table-dense reference section.&lt;/p&gt;

&lt;h2&gt;
  
  
  How did resumability actually work?
&lt;/h2&gt;

&lt;p&gt;The runner keeps a &lt;code&gt;manifest.json&lt;/code&gt; keyed by batch, and writes each batch’s result &lt;strong&gt;atomically&lt;/strong&gt;: extract into a temporary directory, then rename into place. A batch is only marked &lt;code&gt;ok&lt;/code&gt; after its download, extraction, and rename all succeed. Two payoffs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pause/resume.&lt;/strong&gt; Midway through, I paused the run to raise &lt;code&gt;workersMax&lt;/code&gt; (you don’t want to change cluster settings while jobs are in flight). Stopping the client abandoned the in-flight jobs, but because their downloads hadn’t completed, the manifest never marked them done, so resuming re-ran them. Completed batches were skipped. No corruption, no duplicate downloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crash recovery is free.&lt;/strong&gt; The same mechanism means any crash resumes from the last completed batch.&lt;/li&gt;
&lt;/ul&gt;
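&lt;p&gt;The extract-then-rename pattern described above looks roughly like this. It is a sketch, not the runner’s code: the function names, the directory layout, and the manifest shape (batch id mapped to &lt;code&gt;"ok"&lt;/code&gt;) are assumptions, and it expects the tarball to be already downloaded from a trusted source.&lt;/p&gt;

```python
import json
import os
import shutil
import tarfile
import tempfile


def load_manifest(root):
    path = os.path.join(root, "manifest.json")
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        return json.load(handle)


def save_manifest(root, manifest):
    # Write to a sibling file first so a crash never leaves a half-written manifest.
    path = os.path.join(root, "manifest.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as handle:
        json.dump(manifest, handle, indent=2)
    os.replace(tmp, path)


def commit_batch(root, batch_id, tarball_path):
    """Extract a downloaded tarball atomically; return False if already done."""
    os.makedirs(root, exist_ok=True)
    manifest = load_manifest(root)
    if manifest.get(batch_id) == "ok":
        return False  # completed on an earlier run, skip it
    staging = tempfile.mkdtemp(prefix=batch_id + ".", dir=root)
    with tarfile.open(tarball_path, "r:gz") as archive:
        archive.extractall(staging)
    final_dir = os.path.join(root, batch_id)
    shutil.rmtree(final_dir, ignore_errors=True)  # clear leftovers from a crashed run
    os.rename(staging, final_dir)
    manifest[batch_id] = "ok"  # recorded only after extract and rename succeed
    save_manifest(root, manifest)
    return True
```

&lt;p&gt;Because the staging directory lives under the same root as the final one, the rename stays on one filesystem, which is what keeps it a single step rather than a copy.&lt;/p&gt;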

&lt;p&gt;For a 36-job run that you might interrupt, the resumable manifest is what turns “a long fragile script” into “a process you can walk away from.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this falls down / what I’d change
&lt;/h2&gt;

&lt;p&gt;Honest limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1-hour presign expiry forces eager download.&lt;/strong&gt; You cannot defer pulling results to the end of a long run; download each batch as it lands. My runner does this, but it’s a constraint to design around, not a free lunch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clause boundaries are only as good as the outline.&lt;/strong&gt; The whole scheme leans on the PDF’s bookmark tree being accurate and complete. A document with missing or wrong bookmarks needs a fallback (TOC parsing, heading detection) before this works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table/code fidelity needs spot-checking.&lt;/strong&gt; 4,174 tables and 3,591 code blocks is a lot of structure to trust blindly; the dense reference batches (b25, b17, b11) and the annex code dumps are where I’d sample-verify first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One GPU per job is the ceiling.&lt;/strong&gt; Throughput is fundamentally &lt;code&gt;workersMax × per-GPU rate&lt;/code&gt;. There’s no in-job trick to go faster: you pay for more workers or you wait. And more workers means more cold starts, so wall-clock and cost don’t scale perfectly linearly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I’d change next time: drive client concurrency directly from the endpoint’s live &lt;code&gt;workersMax&lt;/code&gt; so the two never drift, and prune the &lt;code&gt;middle.json&lt;/code&gt; + layout PDF from batches where I only need &lt;code&gt;content_list.json&lt;/code&gt;. They were roughly half the on-disk footprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How long does it take to parse a 5,000-page PDF with MinerU?
&lt;/h3&gt;

&lt;p&gt;About 78 minutes of single-GPU compute (~1 page/sec on a 24 GB RTX A5000-class card with the VLM backend), or ~28 minutes of wall-clock at 3× worker concurrency. Cost is roughly $1.15 total at ~$0.00023/page.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why batch at clause boundaries instead of fixed page counts?
&lt;/h3&gt;

&lt;p&gt;So no logical unit is split across two parse jobs. A clause can start mid-page and span dozens of pages; cutting by page count tears it in half and forces every downstream step to reassemble it. Cutting at clause starts keeps each clause whole within a batch.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you handle output larger than RunPod’s 20 MB response cap?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;transport="s3"&lt;/code&gt;: the worker uploads each result tarball to an S3-compatible bucket (Cloudflare R2 here) and returns a presigned URL you download. Per-batch output here reached 32 MB, far past the sync-response ceiling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does sending more concurrent jobs make a single endpoint faster?
&lt;/h3&gt;

&lt;p&gt;No. Concurrency above the endpoint’s &lt;code&gt;workersMax&lt;/code&gt; just fills the queue. Parallelism is the number of GPU workers RunPod runs, set by &lt;code&gt;workersMax&lt;/code&gt;. Raise that to go wider.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need to split the PDF before uploading?
&lt;/h3&gt;

&lt;p&gt;No. Upload the full PDF once (or host it on R2 and pass &lt;code&gt;file_url&lt;/code&gt;); each batch job requests its page range via &lt;code&gt;start_page&lt;/code&gt;/&lt;code&gt;end_page&lt;/code&gt; and the worker slices server-side.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I know the whole document was covered?
&lt;/h3&gt;

&lt;p&gt;Verify mechanically: check each batch’s produced page span against its planned range and confirm the batches tile the document with zero gaps and zero overlaps. “The run finished” and “the run is complete” are not the same claim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to next
&lt;/h2&gt;

&lt;p&gt;The output is 36 batches of &lt;code&gt;content_list.json&lt;/code&gt; with page-indexed, typed blocks. The next step is joining each block’s page back to the clause outline to emit a clause-addressable tree (one compact file per clause) that an agent or a retrieval index can navigate. The clause-aligned batching is what makes that join clean: every block already lives inside exactly one clause’s batch.&lt;/p&gt;

&lt;p&gt;ECMA-376 is freely available from &lt;a href="https://ecma-international.org/publications-and-standards/standards/ecma-376/" rel="noopener noreferrer"&gt;Ecma International&lt;/a&gt;; it’s used here purely as a parsing benchmark. The parsed corpus is kept in a private repository for internal use, and this post shares only the parsing process and aggregate statistics, not the standard’s content.&lt;/p&gt;

&lt;p&gt;If you want per-phase timings (fetch / parse / package) and throughput dashboards for a run like this, the worker can ship OpenTelemetry traces and metrics to any OTLP backend. See the &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/observability/" rel="noopener noreferrer"&gt;observability guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If this saved you time, the easiest way to say thanks is &lt;a href="https://runpod.io?ref=31jdfpnq" rel="noopener noreferrer"&gt;signing up for RunPod through this link&lt;/a&gt;. Star the &lt;a href="https://github.com/sergeyshmakov/mineru-runpod" rel="noopener noreferrer"&gt;repo on GitHub&lt;/a&gt; for updates.&lt;/p&gt;




&lt;p&gt;&lt;small&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; RunPod links in this post use a referral code that credits me at no cost to you. The post would read the same without it.&lt;/small&gt;&lt;/p&gt;

</description>
      <category>mineru</category>
      <category>runpod</category>
      <category>pdfparsing</category>
      <category>batching</category>
    </item>
    <item>
      <title>Ship MinerU on RunPod logs to Axiom via OpenTelemetry</title>
      <dc:creator>Sergey Shmakov</dc:creator>
      <pubDate>Fri, 29 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/sergeyshmakov/ship-mineru-on-runpod-logs-to-axiom-via-opentelemetry-14a3</link>
      <guid>https://dev.to/sergeyshmakov/ship-mineru-on-runpod-logs-to-axiom-via-opentelemetry-14a3</guid>
      <description>&lt;p&gt;&lt;em&gt;Last Updated: 2026-05-29&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you're running a serverless worker on &lt;a href="https://runpod.io?ref=31jdfpnq" rel="noopener noreferrer"&gt;RunPod&lt;/a&gt;, you can't ssh in to read logs and you can't run a sidecar agent — the worker scales to zero between jobs. You need to ship logs and metrics off the box during the request lifetime, on the worker's own time. The mineru-runpod template includes an OpenTelemetry exporter for exactly this, and &lt;a href="https://axiom.co/" rel="noopener noreferrer"&gt;Axiom&lt;/a&gt; is the sink I picked for my own deployment.&lt;/p&gt;

&lt;p&gt;This post is the exact env-var layout. If you use a different OTLP backend (Honeycomb, Grafana, Datadog, Jaeger, your own &lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry Collector&lt;/a&gt;), the &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/observability/"&gt;observability guide&lt;/a&gt; covers the vendor-neutral setup; come back here only for the Axiom-specific values.&lt;/p&gt;

&lt;p&gt;Setup time from a fresh Axiom account to logs flowing: ~10 minutes, dominated by waiting for the worker's next cold start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Axiom for serverless worker observability?
&lt;/h2&gt;

&lt;p&gt;Axiom ingests OTLP/HTTP directly, charges per event rather than per host, and has no agent to install — which matters when your workers scale to zero between jobs. For a mineru-runpod deployment processing a few hundred PDFs a day, the events fit inside Axiom's free tier, and the metrics dataset is queryable via APL.&lt;/p&gt;

&lt;p&gt;Three concrete reasons it fits this workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OTLP-native ingest.&lt;/strong&gt; No collector to run inside the container, no DaemonSet, no Fluent Bit config. The mineru-runpod worker calls Axiom's regional edge endpoint (&lt;code&gt;https://eu-central-1.aws.edge.axiom.co/v1/logs&lt;/code&gt; or the US variant) directly via the OpenTelemetry Python SDK. The serverless model rules out anything that needs a long-running sidecar, so "the exporter IS the integration" is the only model that works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-event pricing instead of per-host.&lt;/strong&gt; A worker serving 500 PDFs a day emits roughly 5,000 log records and 2,500 spans. That fits comfortably inside Axiom's free 0.5 GB/month tier. Per-host pricing models (the Datadog and New Relic shape) penalize the ephemeral-worker pattern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;APL reads like SPL.&lt;/strong&gt; Axiom Processing Language sits between Splunk SPL and KQL ergonomically. Filter by attribute, group by backend, drill into a span: the queries you actually run during an incident are easy in APL. No tutorial needed if you've used either.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have no affiliation with Axiom. They're the backend I picked for my own deployment after looking at Honeycomb, Grafana Cloud, and self-hosted Jaeger.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do I create Axiom datasets for OpenTelemetry?
&lt;/h2&gt;

&lt;p&gt;Create two datasets in the Axiom UI: one &lt;strong&gt;Events&lt;/strong&gt; dataset (holds both traces and logs) and one &lt;strong&gt;Metrics&lt;/strong&gt; dataset. Those are the only two dataset types Axiom exposes in the UI; metrics are kept separate because they use a different storage format under the hood.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign in to Axiom and go to &lt;strong&gt;Datasets&lt;/strong&gt; → &lt;strong&gt;New dataset&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;mineru-events&lt;/code&gt; with &lt;strong&gt;Events&lt;/strong&gt; type. This holds traces and logs together.&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;mineru-metrics&lt;/code&gt; with &lt;strong&gt;Metrics&lt;/strong&gt; type.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Names are arbitrary — substitute whatever fits your naming convention. I prefix everything with the service name so multiple endpoints (staging, prod, experiments) don't collide in queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do I generate an Axiom API token for OTLP ingest?
&lt;/h2&gt;

&lt;p&gt;In the Axiom UI, go to &lt;strong&gt;Settings → API tokens → Generate new token&lt;/strong&gt;. Use an &lt;strong&gt;Advanced API token&lt;/strong&gt; (prefix &lt;code&gt;xaat-&lt;/code&gt;) and explicitly grant the &lt;strong&gt;Ingest&lt;/strong&gt; scope on &lt;strong&gt;both&lt;/strong&gt; datasets you just created. Forgetting the scope on the metrics dataset is one of the common failure modes — it surfaces as &lt;code&gt;403 Forbidden&lt;/code&gt; in the worker logs while events ingest works fine. Copy the resulting &lt;code&gt;xaat-&lt;/code&gt; prefixed string and paste it into your RunPod endpoint config in the next section.&lt;/p&gt;

&lt;p&gt;Treat the token like any production secret: paste it only into RunPod's &lt;strong&gt;Environment Variables&lt;/strong&gt; UI (which encrypts at rest), never check it into git, and rotate when employees with access leave.&lt;/p&gt;

&lt;p&gt;If your Axiom workspace is in the EU region, the management API lives at &lt;code&gt;https://api.eu.axiom.co&lt;/code&gt; (US is &lt;code&gt;https://api.axiom.co&lt;/code&gt;). This is the host you query for token CRUD and REST queries, &lt;strong&gt;not&lt;/strong&gt; the OTLP ingest URL — that's a separate edge-deployment hostname documented in the env-var section below.&lt;/p&gt;

&lt;h2&gt;
  
  
  What environment variables ship RunPod worker telemetry to Axiom?
&lt;/h2&gt;

&lt;p&gt;Set four environment variables on your RunPod endpoint. &lt;code&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/code&gt; covers both traces and logs (they share the Events dataset); &lt;code&gt;OTEL_EXPORTER_OTLP_METRICS_HEADERS&lt;/code&gt; overrides it for metrics only, because Axiom uses a different header for its metrics ingest.&lt;/p&gt;

&lt;p&gt;Paste this into your endpoint's &lt;strong&gt;Environment Variables&lt;/strong&gt; section in the RunPod dashboard, substituting your token and dataset names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://eu-central-1.aws.edge.axiom.co
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;Authorization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Bearer xaat-YOUR-TOKEN,x-axiom-dataset&lt;span class="o"&gt;=&lt;/span&gt;mineru-events
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_METRICS_HEADERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;Authorization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Bearer xaat-YOUR-TOKEN,x-axiom-metrics-dataset&lt;span class="o"&gt;=&lt;/span&gt;mineru-metrics
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http/protobuf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The endpoint URL is your Axiom edge deployment, NOT &lt;code&gt;api.axiom.co&lt;/code&gt; / &lt;code&gt;api.eu.axiom.co&lt;/code&gt;.&lt;/strong&gt; This is the single biggest gotcha and the one that cost me hours when I first set this up. The &lt;code&gt;api.*&lt;/code&gt; hosts are for management API (token creation, queries via REST). OTLP ingest goes to your workspace's &lt;em&gt;edge deployment&lt;/em&gt; hostname. As of writing, the two are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;US East 1:&lt;/strong&gt; &lt;code&gt;https://us-east-1.aws.edge.axiom.co&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EU Central 1:&lt;/strong&gt; &lt;code&gt;https://eu-central-1.aws.edge.axiom.co&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use the one that matches the region you picked when creating the Axiom workspace. The current full list lives at &lt;a href="https://axiom.co/docs/reference/edge-deployments" rel="noopener noreferrer"&gt;Axiom's edge deployments doc&lt;/a&gt;. If you send OTLP to &lt;code&gt;api.{eu.}axiom.co&lt;/code&gt;, Axiom returns &lt;code&gt;400 mismatched region&lt;/code&gt; or &lt;code&gt;403 forbidden&lt;/code&gt; depending on path — the OTel SDK logs only the HTTP status code, so you'll see &lt;code&gt;Failed to export ... code: 400&lt;/code&gt; (or 403) with no clue why. Axiom support flagged this for me after I'd been chasing 403s for an hour against the wrong host.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;How the SDK routes each signal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traces + logs&lt;/strong&gt; → use the generic &lt;code&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/code&gt;, so spans and log records both land in the &lt;code&gt;mineru-events&lt;/code&gt; dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; → &lt;code&gt;OTEL_EXPORTER_OTLP_METRICS_HEADERS&lt;/code&gt; overrides for metrics only, with the distinct &lt;code&gt;x-axiom-metrics-dataset&lt;/code&gt; header that Axiom's metrics ingest requires.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The dataset names in the two &lt;code&gt;*-dataset&lt;/code&gt; headers must exactly match the names you created in the datasets section above, and your API token must have ingest scope on both.&lt;/strong&gt; This is a frequent source of "everything is configured but nothing shows up": &lt;code&gt;mineru-events&lt;/code&gt; and &lt;code&gt;mineru-metrics&lt;/code&gt; above are example names, not magic strings. If you named your datasets differently, update the headers to match. Mismatches surface as &lt;code&gt;404 Not Found&lt;/code&gt; (wrong dataset name) or &lt;code&gt;403 Forbidden&lt;/code&gt; (token lacks ingest scope on that dataset). Both fail silently from the caller's side; the only signal is in the worker's stdout, where the OTel SDK logs each retry with the HTTP status code.&lt;/p&gt;

&lt;p&gt;Three details that trip people up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Base URL, not the full path.&lt;/strong&gt; Set the endpoint to the edge-deployment root, e.g. &lt;code&gt;https://eu-central-1.aws.edge.axiom.co&lt;/code&gt;, NOT &lt;code&gt;https://eu-central-1.aws.edge.axiom.co/v1/traces&lt;/code&gt;. The &lt;a href="https://opentelemetry-python.readthedocs.io/en/latest/exporter/otlp/otlp.html" rel="noopener noreferrer"&gt;OpenTelemetry Python SDK&lt;/a&gt; appends &lt;code&gt;/v1/traces&lt;/code&gt;, &lt;code&gt;/v1/logs&lt;/code&gt;, &lt;code&gt;/v1/metrics&lt;/code&gt; per signal automatically. If you set the full path, the SDK double-appends and Axiom returns 404.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Header name differs by signal.&lt;/strong&gt; Events ingest uses &lt;code&gt;x-axiom-dataset&lt;/code&gt;. Metrics ingest uses &lt;code&gt;x-axiom-metrics-dataset&lt;/code&gt; (different header name, with &lt;code&gt;-metrics-&lt;/code&gt; in it). Copying the events headers into &lt;code&gt;OTEL_EXPORTER_OTLP_METRICS_HEADERS&lt;/code&gt; as-is sends metrics to the events dataset and Axiom's metrics view stays empty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Protocol must be protobuf for metrics.&lt;/strong&gt; Axiom's metrics ingest accepts only protobuf, not JSON. The &lt;code&gt;http/protobuf&lt;/code&gt; value above is the default the SDK ships with; don't override it to &lt;code&gt;http/json&lt;/code&gt; thinking it's the safer choice. JSON works for logs and traces but quietly drops metrics.&lt;/p&gt;

&lt;p&gt;Save the variables, redeploy any active workers (or wait for the next cold start), and Axiom should see records within ~60 seconds of the first request that hits a warm worker. The metric reader flushes every 10 s; traces and logs flush every 500 ms.&lt;/p&gt;
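&lt;p&gt;The gotchas above are easy to check mechanically before you paste anything into RunPod. The sketch below is a hypothetical pre-flight helper, not part of the mineru-runpod worker: &lt;code&gt;parse_otlp_headers&lt;/code&gt; and &lt;code&gt;check_axiom_env&lt;/code&gt; are names I made up for illustration, and the checks only encode the rules described in this section (edge host instead of &lt;code&gt;api.*&lt;/code&gt;, base URL without a path, the two dataset header names, protobuf protocol).&lt;/p&gt;

```python
from urllib.parse import urlsplit


def parse_otlp_headers(raw):
    """Split 'k1=v1,k2=v2' (the OTEL_EXPORTER_OTLP_HEADERS format) into a dict."""
    headers = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        headers[key.strip().lower()] = value.strip()
    return headers


def check_axiom_env(env):
    """Return a list of problems found in a dict of env vars (empty list = looks fine)."""
    problems = []

    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
    parts = urlsplit(endpoint)
    host = parts.hostname or ""
    if not endpoint:
        problems.append("OTEL_EXPORTER_OTLP_ENDPOINT is empty")
    if host.startswith("api."):
        problems.append("endpoint is the management API host, not an edge deployment")
    if parts.path.strip("/"):
        problems.append("endpoint includes a path; use the base URL only")

    events = parse_otlp_headers(env.get("OTEL_EXPORTER_OTLP_HEADERS", ""))
    if "x-axiom-dataset" not in events:
        problems.append("OTEL_EXPORTER_OTLP_HEADERS lacks x-axiom-dataset")

    metrics = parse_otlp_headers(env.get("OTEL_EXPORTER_OTLP_METRICS_HEADERS", ""))
    if "x-axiom-metrics-dataset" not in metrics:
        problems.append("OTEL_EXPORTER_OTLP_METRICS_HEADERS lacks x-axiom-metrics-dataset")

    # An unset protocol is treated as fine, following the default described above.
    if env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "http/protobuf") != "http/protobuf":
        problems.append("protocol is not http/protobuf; metrics ingest needs protobuf")

    return problems


example_env = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://eu-central-1.aws.edge.axiom.co",
    "OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer xaat-YOUR-TOKEN,x-axiom-dataset=mineru-events",
    "OTEL_EXPORTER_OTLP_METRICS_HEADERS": "Authorization=Bearer xaat-YOUR-TOKEN,x-axiom-metrics-dataset=mineru-metrics",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
}

for problem in check_axiom_env(example_env):
    print("config problem:", problem)
```

&lt;p&gt;This only validates the shape of the values. It cannot tell whether the dataset exists or whether the token has Ingest scope; those still show up as 404 or 403 at export time.&lt;/p&gt;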

&lt;h2&gt;
  
  
  How do I verify OpenTelemetry data is reaching Axiom?
&lt;/h2&gt;

&lt;p&gt;Send one request to the worker, then check three views in the Axiom UI: &lt;strong&gt;Stream&lt;/strong&gt; and &lt;strong&gt;Traces&lt;/strong&gt; on the events dataset, and &lt;strong&gt;Metrics&lt;/strong&gt; on the metrics dataset.&lt;/p&gt;

&lt;p&gt;What each view should show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stream&lt;/strong&gt; view → set dataset = &lt;code&gt;mineru-events&lt;/code&gt; → you should see JSON log records flowing as soon as the worker handles a request. Each carries &lt;code&gt;service.name=mineru-runpod&lt;/code&gt;, &lt;code&gt;runpod.endpoint_id=&amp;lt;your-endpoint&amp;gt;&lt;/code&gt;, and &lt;code&gt;job_id&lt;/code&gt; for correlation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces&lt;/strong&gt; view → on the same &lt;code&gt;mineru-events&lt;/code&gt; dataset → one trace per RunPod job. The root span is &lt;code&gt;mineru.job&lt;/code&gt;. Its children are &lt;code&gt;mineru.fetch_input&lt;/code&gt;, &lt;code&gt;mineru.parse&lt;/code&gt;, and &lt;code&gt;mineru.package&lt;/code&gt;. The &lt;code&gt;mineru.warmup&lt;/code&gt; span shows up once per worker boot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; view → set dataset = &lt;code&gt;mineru-metrics&lt;/code&gt; → after the first 10-second flush you should see &lt;code&gt;mineru.jobs.total&lt;/code&gt;, &lt;code&gt;mineru.job.duration&lt;/code&gt;, the GPU memory gauges, and the rest of the &lt;a href="https://dev.to/mineru-runpod/guides/observability/#what-gets-emitted"&gt;metric catalog&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If &lt;strong&gt;nothing arrives in any view&lt;/strong&gt;, open the worker's stdout (RunPod dashboard → Logs) and grep for &lt;code&gt;Failed to export&lt;/code&gt;. The OTel SDK logs each retry with the HTTP status code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Failed to export ... code: 404, reason: Not Found&lt;/code&gt; — the dataset name in one of the &lt;code&gt;*-dataset&lt;/code&gt; headers doesn't exist in your Axiom workspace. Rename the dataset or update the env var so they match.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Failed to export ... code: 403, reason: Forbidden&lt;/code&gt; — the dataset exists but your API token doesn't have ingest scope on it. Open the token in &lt;strong&gt;Settings → API tokens&lt;/strong&gt;, add Ingest scope on the dataset, and update the secret in RunPod.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Failed to export ... code: 403, reason: Forbidden&lt;/code&gt; — &lt;strong&gt;the #1 cause is using &lt;code&gt;api.axiom.co&lt;/code&gt; or &lt;code&gt;api.eu.axiom.co&lt;/code&gt; as the endpoint instead of the edge-deployment URL.&lt;/strong&gt; Confirm by running &lt;code&gt;curl -v -X POST https://api.eu.axiom.co/v1/traces -H "Authorization: Bearer xaat-..." -H "x-axiom-dataset: &amp;lt;yours&amp;gt;" -H "Content-Type: application/x-protobuf" --data-binary ""&lt;/code&gt; with your own token and dataset. If that returns 403 but the same request against &lt;code&gt;https://eu-central-1.aws.edge.axiom.co/v1/traces&lt;/code&gt; (or the US edge variant) returns 422, the URL is the issue. If the edge host also returns 403, check the token's Ingest scope on the dataset, as in the previous bullet.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Failed to export ... code: 400, reason: Bad Request&lt;/code&gt; — the dataset resolves and auth works, but the payload is being rejected. Causes: region mismatch (workspace is EU but you're hitting &lt;code&gt;us-east-1.aws.edge.axiom.co&lt;/code&gt;, or vice versa; Axiom returns &lt;code&gt;mismatched region&lt;/code&gt; in the body), wrong header name on metrics (use &lt;code&gt;x-axiom-metrics-dataset&lt;/code&gt;, not &lt;code&gt;x-axiom-dataset&lt;/code&gt;), or the dataset was created with the wrong type for the signal.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Failed to export ... code: 401, reason: Unauthorized&lt;/code&gt; — the API token is wrong or expired. Generate a fresh token in &lt;strong&gt;Settings → API tokens&lt;/strong&gt; and update the env var.&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;Failed to export&lt;/code&gt; lines AND no &lt;code&gt;[mineru-telemetry] init failed&lt;/code&gt; either — &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt; is empty or the worker hasn't cold-started since you set it. RunPod env-var changes only take effect on the next cold start; warm workers keep the previous values.&lt;/li&gt;
&lt;/ul&gt;
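&lt;p&gt;If you have a long stdout dump, the list above can be turned into a quick triage script. This is a minimal sketch under two assumptions: the log lines contain &lt;code&gt;Failed to export&lt;/code&gt; followed somewhere by &lt;code&gt;code: NNN&lt;/code&gt;, as in the examples above, and &lt;code&gt;summarize_export_failures&lt;/code&gt; plus the &lt;code&gt;LIKELY_CAUSE&lt;/code&gt; table are hypothetical names that simply restate the bullets.&lt;/p&gt;

```python
import re
from collections import Counter

# Matches the status code in lines shaped like the examples in the list above.
EXPORT_FAILURE = re.compile(r"Failed to export .*code: (\d{3})")

# One-line restatement of each bullet, keyed by HTTP status code.
LIKELY_CAUSE = {
    "400": "region mismatch, wrong metrics header name, or wrong dataset type",
    "401": "API token is wrong or expired",
    "403": "api.* host used instead of the edge URL, or token lacks Ingest scope",
    "404": "dataset name in a *-dataset header does not exist",
}


def summarize_export_failures(log_text):
    """Count export failures per status code and attach the likely cause."""
    matches = (EXPORT_FAILURE.search(line) for line in log_text.splitlines())
    codes = Counter(m.group(1) for m in matches if m)
    return {
        code: (count, LIKELY_CAUSE.get(code, "not covered by the list above"))
        for code, count in codes.items()
    }


sample = "\n".join([
    "Failed to export batch code: 404, reason: Not Found",
    "some unrelated worker line",
    "Failed to export batch code: 403, reason: Forbidden",
    "Failed to export batch code: 403, reason: Forbidden",
])

for code, (count, cause) in sorted(summarize_export_failures(sample).items()):
    print(code, count, cause)
```

&lt;p&gt;An empty result with telemetry still missing points at the last bullet: the endpoint variable is unset or the worker has not cold-started since you changed it.&lt;/p&gt;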

&lt;h2&gt;
  
  
  Which APL queries help debug a mineru-runpod worker?
&lt;/h2&gt;

&lt;p&gt;APL queries answer the three operational questions that come up most: which errors fired in the last hour, which parses are slowest, and how throughput breaks down by backend or endpoint. All three queries hit the events dataset, where both traces and logs land. The templates below are starting points — adjust attribute paths to match how your Axiom workspace unrolls OTLP resource and span attributes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recent errors, grouped by error type:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['mineru-events']
| where ['service.name'] == "mineru-runpod"
| where ['severity_text'] == "ERROR"
| where _time &amp;gt; ago(1h)
| summarize count() by tostring(['attributes.error_type'])
| sort by count_ desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Slowest parses in the last hour:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['mineru-events']
| where ['name'] == "mineru.parse"
| where _time &amp;gt; ago(1h)
| project _time, duration = ['duration'], backend = ['attributes.mineru.backend'], input_format = ['attributes.mineru.input_format']
| sort by duration desc
| take 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Throughput by endpoint over time:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['mineru-events']
| where ['name'] == "mineru.job"
| where _time &amp;gt; ago(24h)
| summarize jobs = count() by endpoint_id = tostring(['resource.runpod.endpoint_id']), bin(_time, 5m)
| render timechart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attribute paths in APL depend on how Axiom unrolls OTLP records — &lt;code&gt;['resource.runpod.endpoint_id']&lt;/code&gt; works in my workspace but yours may need &lt;code&gt;['runpod.endpoint_id']&lt;/code&gt; directly. Run a quick &lt;code&gt;| take 5 | project *&lt;/code&gt; against the dataset first to see the actual field names your workspace produces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does enabling OpenTelemetry slow down cold starts?
&lt;/h2&gt;

&lt;p&gt;Enabling OTel adds roughly 200–500 ms to the first-ever cold start (one-time SDK init plus DNS resolution for the OTLP endpoint). Subsequent FlashBoot snapshot restores on the same host inherit the warm state, so the cost amortizes per host, not per request. For a parse that takes 5–30 seconds wall-clock, the overhead is invisible.&lt;/p&gt;

&lt;p&gt;The numbers from my own deployment: on an RTX 4090 with &lt;code&gt;vlm-auto-engine&lt;/code&gt;, the bare cold start is ~110 s (image pull + vLLM init + model load + warmup parse). OTel init adds 200–500 ms on top of that. On a FlashBoot-restored boot, the same overhead is 0 — the snapshot captures the initialized SDK along with the rest of the process state. See the &lt;a href="https://dev.to/mineru-runpod/blog/runpod-flashboot-mechanism-investigation/"&gt;FlashBoot mechanism investigation&lt;/a&gt; for how the snapshot path actually works.&lt;/p&gt;

&lt;p&gt;If you're cost-sensitive about cold starts and don't need observability on every deployment, leave &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt; unset. The mineru-runpod worker skips the OTel SDK import entirely when that variable is empty — zero overhead, zero behavior change. Flip it on for the endpoints where you actually want the visibility (production, staging) and leave it off for experimentation runs.&lt;/p&gt;
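&lt;p&gt;The "skip the import when the variable is empty" pattern is simple to reproduce in your own worker. The sketch below is not the mineru-runpod implementation; &lt;code&gt;telemetry_enabled&lt;/code&gt; and &lt;code&gt;init_tracing&lt;/code&gt; are hypothetical names, and only tracing is wired up to keep it short. It assumes the &lt;code&gt;opentelemetry-sdk&lt;/code&gt; and &lt;code&gt;opentelemetry-exporter-otlp-proto-http&lt;/code&gt; packages are installed when telemetry is switched on.&lt;/p&gt;

```python
import os


def telemetry_enabled(env=None):
    """True only when OTEL_EXPORTER_OTLP_ENDPOINT is set to a non-blank value."""
    env = os.environ if env is None else env
    return bool(env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip())


def init_tracing(env=None):
    """Set up OTLP/HTTP tracing, or return None without importing the SDK."""
    if not telemetry_enabled(env):
        return None  # disabled path: the OTel packages are never imported

    # Imports live inside the function so the disabled path pays nothing for them.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider()
    # OTLPSpanExporter() with no arguments reads endpoint and headers
    # from the process environment (os.environ), not from the env argument.
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)
    return provider
```

&lt;p&gt;Calling &lt;code&gt;init_tracing()&lt;/code&gt; once at worker boot is enough; with the variable unset it returns &lt;code&gt;None&lt;/code&gt; and the rest of the worker runs unchanged.&lt;/p&gt;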

&lt;h2&gt;
  
  
  Where does this setup fall down?
&lt;/h2&gt;

&lt;p&gt;Four real limitations: Axiom's free tier caps ingest at 0.5 GB/month and 30-day retention; the GPU gauges emit one time series per device label per metric and cardinality adds up; OTLP/HTTP export adds modest latency on cold starts (200–500 ms); and logs are stored twice, once in RunPod's dashboard and once in Axiom. None of these are blockers for a small or mid-volume serverless deployment, but they're the things to watch as volume grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free-tier ceilings.&lt;/strong&gt; Axiom's free plan is generous (0.5 GB ingest/month, 30-day retention) but you can blow through it with a chatty worker. A debug-logging worker processing 1,000 PDFs/day at ~30 KB of logs per parse hits ~900 MB/month — past the cap. Either keep the log level at &lt;code&gt;info&lt;/code&gt; (the default) or move to Axiom's paid tier (currently $25/month for 5 GB ingest).&lt;/p&gt;
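&lt;p&gt;The arithmetic behind that estimate is worth keeping around so you can plug in your own volume. A minimal sketch, assuming decimal units (1 MB = 1,000 KB) and a 30-day month; the function names are mine, not part of any Axiom tooling.&lt;/p&gt;

```python
FREE_TIER_MB = 500  # 0.5 GB/month free-tier ingest cap


def monthly_ingest_mb(parses_per_day, kb_per_parse, days=30):
    """Estimated log ingest per month, in MB."""
    return parses_per_day * kb_per_parse * days / 1000


def max_kb_per_parse(parses_per_day, cap_mb=FREE_TIER_MB, days=30):
    """Largest per-parse log size (KB) that still fits under the cap."""
    return cap_mb * 1000 / (parses_per_day * days)


debug_mb = monthly_ingest_mb(parses_per_day=1000, kb_per_parse=30)
headroom_mb = FREE_TIER_MB - debug_mb  # negative means over the cap

print("monthly ingest (MB):", debug_mb)
print("headroom under free tier (MB):", headroom_mb)
print("budget per parse at 1,000/day (KB):", round(max_kb_per_parse(1000), 1))
```

&lt;p&gt;At 1,000 parses a day, the budget works out to roughly 16.7 KB of logs per parse, which is why debug-level logging is what pushes a worker past the cap.&lt;/p&gt;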

&lt;p&gt;&lt;strong&gt;Metric cardinality.&lt;/strong&gt; The GPU gauges (&lt;code&gt;mineru.gpu.memory_used_bytes&lt;/code&gt;, &lt;code&gt;mineru.gpu.utilization_percent&lt;/code&gt;) emit one time series per &lt;code&gt;device&lt;/code&gt; label per gauge per worker. GPUs per worker, times worker instances, times four GPU metrics multiplies fast. Axiom's metrics pricing is per-event rather than per-series, so this is a "watch the bill" concern rather than a hard limit. If you scale to dozens of concurrent workers, drop the device label or sample less frequently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold-start latency.&lt;/strong&gt; OTel init adds 200–500 ms on first-ever boot. For most workloads this is dominated by the existing ~110 s of vLLM init, so it doesn't matter. If you're optimizing the cold start specifically (chatbot-style low-latency workloads, for example), benchmark with and without OTel before committing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logs are mirrored, not exclusive.&lt;/strong&gt; The worker still writes stdout JSON to RunPod's dashboard regardless of OTel. That's deliberate: RunPod's UI remains a working fallback when the OTel pipeline misbehaves. The cost is paying twice for log storage if you care about long retention in both places. Most teams don't, and the duplication is the price of the dashboard fallback.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does Axiom support OpenTelemetry natively?
&lt;/h3&gt;

&lt;p&gt;Yes. Axiom ingests OTLP/HTTP traces, logs, and metrics on &lt;code&gt;/v1/traces&lt;/code&gt;, &lt;code&gt;/v1/logs&lt;/code&gt;, &lt;code&gt;/v1/metrics&lt;/code&gt; paths — but &lt;strong&gt;the base hostname is your region's edge deployment&lt;/strong&gt;, not &lt;code&gt;api.axiom.co&lt;/code&gt;. The two as of writing are &lt;code&gt;https://us-east-1.aws.edge.axiom.co&lt;/code&gt; and &lt;code&gt;https://eu-central-1.aws.edge.axiom.co&lt;/code&gt; (full list at &lt;a href="https://axiom.co/docs/reference/edge-deployments" rel="noopener noreferrer"&gt;Axiom's edge deployments doc&lt;/a&gt;). The OpenTelemetry Python SDK in the mineru-runpod worker speaks this directly with no Collector or agent in between. See &lt;a href="https://axiom.co/docs/send-data/opentelemetry" rel="noopener noreferrer"&gt;Axiom's OpenTelemetry docs&lt;/a&gt; for the full list of supported signals and headers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why aren't my metrics showing up in Axiom?
&lt;/h3&gt;

&lt;p&gt;Three common causes. First: &lt;code&gt;OTEL_EXPORTER_OTLP_METRICS_HEADERS&lt;/code&gt; uses the wrong header key. Axiom needs &lt;code&gt;x-axiom-metrics-dataset&lt;/code&gt; (lowercase, with &lt;code&gt;-metrics-dataset&lt;/code&gt;), distinct from the &lt;code&gt;x-axiom-dataset&lt;/code&gt; header used for logs and traces. Second: the API token doesn't have ingest scope on the metrics dataset — common when you create the token with only the logs dataset selected, then later add the metrics one. Surfaces as &lt;code&gt;403 Forbidden&lt;/code&gt; in the worker stdout. Third: &lt;code&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/code&gt; is set to &lt;code&gt;http/json&lt;/code&gt; and Axiom's metrics endpoint accepts only protobuf.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the cheapest way to observe a RunPod serverless worker?
&lt;/h3&gt;

&lt;p&gt;For ≤1,000 jobs/day with structured logs at &lt;code&gt;info&lt;/code&gt; level, Axiom's free tier (0.5 GB/month, 30-day retention) is the cheapest path with real query power. The OTel SDK is already in the mineru-runpod image; setup is four env vars on the endpoint. RunPod's own log dashboard is free but lacks query language, metrics, and traces — fine for triage, not for SLO tracking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I send traces and logs to the same Axiom dataset?
&lt;/h3&gt;

&lt;p&gt;Yes — that's the standard setup. Axiom exposes only two dataset types in the UI (Events and Metrics), and the Events type holds both traces and logs. The configuration above routes traces + logs to one events dataset via &lt;code&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/code&gt;, and metrics to a separate metrics dataset via &lt;code&gt;OTEL_EXPORTER_OTLP_METRICS_HEADERS&lt;/code&gt;. No per-signal traces override is needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does OTLP/HTTP export add to cold start time?
&lt;/h3&gt;

&lt;p&gt;200–500 ms on the first-ever cold start of a worker on a host. That's one-time SDK init and DNS resolution for the OTLP endpoint. Subsequent FlashBoot snapshot restores on the same host pay zero overhead — the initialized SDK is captured in the process snapshot. On a baseline 110 s cold start (vLLM + model load + warmup), the OTel cost is invisible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why isn't &lt;code&gt;api.axiom.co&lt;/code&gt; the OTLP endpoint?
&lt;/h3&gt;

&lt;p&gt;Because Axiom splits two concerns onto two hostnames: &lt;code&gt;api.{eu.}axiom.co&lt;/code&gt; is the management API (token CRUD, REST queries, dashboards), while OTLP ingest goes to the &lt;em&gt;edge deployment&lt;/em&gt; hostname for your workspace's region. The split isn't obvious from the OpenTelemetry side because most other backends expose ingest on the same hostname as the management API. Axiom's own &lt;a href="https://axiom.co/docs/send-data/opentelemetry" rel="noopener noreferrer"&gt;OpenTelemetry guide&lt;/a&gt; and &lt;a href="https://axiom.co/docs/reference/edge-deployments" rel="noopener noreferrer"&gt;edge deployments doc&lt;/a&gt; document the edge URLs, but it's easy to miss if you start from a generic OTel tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does enabling OpenTelemetry require an OpenTelemetry Collector?
&lt;/h3&gt;

&lt;p&gt;No. The mineru-runpod worker uses the OpenTelemetry Python SDK with the OTLP/HTTP exporter, talking directly to Axiom's ingest endpoint. A Collector is useful when you want to fan out to multiple backends, apply sampling rules, or buffer locally — none of which apply to a single-sink serverless worker shipping into Axiom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same template, different backend
&lt;/h2&gt;

&lt;p&gt;The four env vars above are the only Axiom-specific configuration in this whole setup. Swap the URL and headers and the mineru-runpod worker ships the same logs, traces, and metrics to anything that speaks OTLP/HTTP — Honeycomb, Grafana Cloud, Datadog's OTLP intake, your own OpenTelemetry Collector. Most other backends don't even need the metrics-headers override that Axiom requires — a single &lt;code&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/code&gt; value covers all three signals. The &lt;a href="https://dev.to/mineru-runpod/guides/observability/"&gt;observability guide&lt;/a&gt; covers the vendor-neutral env-var layout and the metric catalog. Different backends, same template.&lt;/p&gt;

&lt;p&gt;If this saved you time, the easiest way to say thanks is &lt;a href="https://runpod.io?ref=31jdfpnq" rel="noopener noreferrer"&gt;signing up for RunPod through this link&lt;/a&gt;. Star the &lt;a href="https://github.com/sergeyshmakov/mineru-runpod" rel="noopener noreferrer"&gt;repo on GitHub&lt;/a&gt; for updates.&lt;/p&gt;




&lt;p&gt;&lt;small&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; RunPod links in this post use a referral code that credits me at no cost to you. The post would read the same without it.&lt;/small&gt;&lt;/p&gt;

</description>
      <category>opentelemetry</category>
      <category>axiom</category>
      <category>runpod</category>
      <category>observability</category>
    </item>
    <item>
      <title>How RunPod FlashBoot Actually Works (4-Request Test)</title>
      <dc:creator>Sergey Shmakov</dc:creator>
      <pubDate>Tue, 26 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/sergeyshmakov/how-runpod-flashboot-actually-works-4-request-test-49of</link>
      <guid>https://dev.to/sergeyshmakov/how-runpod-flashboot-actually-works-4-request-test-49of</guid>
      <description>&lt;p&gt;&lt;em&gt;Last Updated: 2026-05-27&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you’re shipping vLLM or any heavy ML model on RunPod Serverless, you’ve probably looked at FlashBoot, ticked the checkbox, and then watched your cold starts still take 60-120 seconds. RunPod’s marketing says “1-second cold starts.” Their docs describe FlashBoot as “pre-loading container images.” Neither of those matches what most ML workloads actually see.&lt;/p&gt;

&lt;p&gt;I ran four cold-start tests on a deployed RunPod endpoint serving a vLLM-backed PDF parser. The wall-clock numbers ranged from 7 seconds to 7 minutes. The point of this post is to explain &lt;em&gt;why&lt;/em&gt; — what FlashBoot actually does at the systems level, when it kicks in, and how to set up your worker so it kicks in more often.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does FlashBoot actually do?
&lt;/h2&gt;

&lt;p&gt;FlashBoot is a CRIU-style process snapshot mechanism. When a worker scales to zero, RunPod captures the full process state (Python interpreter, CUDA VRAM, subprocess tree) into a snapshot on the host’s local storage. When the worker scales back up &lt;em&gt;on the same host&lt;/em&gt;, RunPod restores from that snapshot. The restored process resumes mid-stride: model still in VRAM, vLLM engine subprocess still alive, IPC pipes still connected.&lt;/p&gt;

&lt;p&gt;The key qualifier that RunPod’s docs don’t mention: &lt;strong&gt;snapshots are per (host, image SHA), not per endpoint&lt;/strong&gt;. If the next scale-from-zero lands on a different host, there’s no snapshot to restore from. The worker boots fresh and pays the full warmup cost. Once.&lt;/p&gt;

&lt;p&gt;The TL;DR for an ML workload: &lt;strong&gt;set up an eager warmup at worker boot, then let FlashBoot do its thing.&lt;/strong&gt; Each new host pays the warmup tax once. Subsequent scale-from-zeroes on that same host get the snapshot restore and finish a typical request in single-digit seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do “cold” starts sometimes take 7 seconds and sometimes 110?
&lt;/h2&gt;

&lt;p&gt;Because they’re hitting different parts of the per-host model. Four consecutive requests against the same endpoint, single-page parse on each, with a deliberate scale-to-zero between every one:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Request&lt;/th&gt;
&lt;th&gt;Wall-clock&lt;/th&gt;
&lt;th&gt;Host&lt;/th&gt;
&lt;th&gt;Snapshot?&lt;/th&gt;
&lt;th&gt;What the worker did&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;456 s&lt;/td&gt;
&lt;td&gt;A (post-rebuild)&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;Image pull + fitness checks + warmup (101 s) + parse (5.6 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.6 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A (same as R1)&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;Snapshot restore + parse (4.7 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;122 s&lt;/td&gt;
&lt;td&gt;B (different host)&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;Fitness checks + warmup (101.5 s) + parse (5.6 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.4 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;B (same as R3)&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;Snapshot restore + parse (4.6 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;First hit on a fresh host pays ~110 s for the warmup. Every subsequent restore on that same host is ~7-8 s.&lt;/strong&gt; A new host, when RunPod’s scheduler picks one, starts the cycle over.&lt;/p&gt;

&lt;p&gt;The 456 s on Request 1 included a one-time image pull (the worker image is ~27 GB; this was the first time that physical host had ever seen it). Strip that off and you get ~110 s of actual boot work, which matches what Request 3's worker log shows on a host that needed no pull: about 109 s from the first fitness check to the finished parse.&lt;/p&gt;

&lt;h2&gt;
  
  
  How can you tell if a request hit a snapshot restore?
&lt;/h2&gt;

&lt;p&gt;By what’s &lt;em&gt;missing&lt;/em&gt; from the worker logs. A FlashBoot-restored worker skips its boot sequence entirely — no fitness checks, no Python import logs, no vLLM engine initialization, no model load. The first log line is &lt;code&gt;Jobs in queue: 1&lt;/code&gt;, immediately followed by your handler’s “starting job” entry.&lt;/p&gt;

&lt;p&gt;Compare a fresh boot to a snapshot restore for the same request shape:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fresh boot (Request 3):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;04:45:45 Running 7 fitness check(s)...
04:45:46 All fitness checks passed. (1285.99ms)
04:45:46 [mineru-warmup] starting (backend=vlm-auto-engine ...)
04:45:51 Using vllm-async-engine as the inference engine for VLM.
04:46:23 Initializing a V1 LLM engine (v0.11.2) ...
04:46:47 Model loading took 2.1601 GiB memory and 18.41 seconds
04:47:14 torch.compile takes 22.81 s in total
04:47:17 init engine (profile, create kv cache, warmup model) took 30.66 seconds
04:47:18 get vllm-async-engine predictor cost: 87.26s
04:47:28 [mineru-warmup] done in 101.5s
04:47:28 Jobs in queue: 1
04:47:28 Started.
04:47:28 "starting job" {...}
04:47:34 "done" {...elapsed_seconds: 5.58...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Snapshot restore (Request 4):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;04:51:25 Jobs in queue: 1
04:51:25 Started.
04:51:25 "starting job" {...}
04:51:26 Using vllm-async-engine ... (instant — engine handle restored from snapshot)
04:51:30 "done" {...elapsed_seconds: 4.58...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No boot sequence. Three timestamps. The vLLM engine subprocess PID from the previous boot is reused — same &lt;code&gt;EngineCore_DP0 pid=NNN&lt;/code&gt; from the snapshot. If you grep your own worker logs for the gap between &lt;code&gt;Jobs in queue: 1&lt;/code&gt; and the previous activity, you’ll see whether RunPod did a fresh boot or a snapshot restore.&lt;/p&gt;
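&lt;p&gt;That "what's missing" check is easy to automate. The sketch below is a hypothetical classifier, assuming you feed it the log lines for a single worker activation in order, and that the marker strings match the two excerpts above. &lt;code&gt;BOOT_MARKERS&lt;/code&gt; and &lt;code&gt;classify_start&lt;/code&gt; are illustrative names.&lt;/p&gt;

```python
# Substrings that only appear when the worker runs its full boot sequence.
BOOT_MARKERS = (
    "fitness check",
    "[mineru-warmup]",
    "Initializing a V1 LLM engine",
    "Model loading took",
)


def classify_start(log_lines):
    """Label one activation as 'fresh boot', 'snapshot restore', or 'unknown'."""
    for line in log_lines:
        if any(marker in line for marker in BOOT_MARKERS):
            return "fresh boot"  # boot output seen before the first job
        if "Jobs in queue" in line:
            return "snapshot restore"  # first job arrived with no boot output
    return "unknown"


fresh = [
    "04:45:45 Running 7 fitness check(s)...",
    "04:45:46 [mineru-warmup] starting (backend=vlm-auto-engine ...)",
    "04:47:28 Jobs in queue: 1",
]
restored = [
    "04:51:25 Jobs in queue: 1",
    "04:51:25 Started.",
]

print(classify_start(fresh))
print(classify_start(restored))
```

&lt;p&gt;The classifier only looks at ordering, so it gives wrong answers if the excerpt starts mid-boot or mixes lines from several activations.&lt;/p&gt;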

&lt;h2&gt;
  
  
  What does the FlashBoot snapshot preserve?
&lt;/h2&gt;

&lt;p&gt;Everything that lived in the worker process at snapshot time, mediated by CRIU semantics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python interpreter state.&lt;/strong&gt; Module imports stay loaded. Globals (job counters, contextvars, signal handlers) keep their values. The &lt;code&gt;MinerU&lt;/code&gt; engine registry returns the same handles it returned before the snapshot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU VRAM.&lt;/strong&gt; Model weights (~2.16 GiB for our VLM), vLLM’s KV cache (~8.17 GiB on a 24 GB card), and captured CUDA graphs (~0.3 GiB) all survive. The first request after restore parses with the same allocations it had before.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The subprocess tree.&lt;/strong&gt; vLLM runs its engine in a child process for memory isolation. That subprocess gets captured along with the parent and restored with its IPC pipes intact. The engine PID persists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;torch.compile&lt;/code&gt; cache.&lt;/strong&gt; The JIT-compiled Dynamo / Inductor output stays valid across restore. No 22-second recompile.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What doesn’t survive forever is the snapshot itself: its lifetime is limited. RunPod doesn’t publish the eviction policy, but an obvious trigger is an image rebuild (the new SHA invalidates the snapshot), and a long enough idle period on a busy host presumably gets the snapshot evicted from storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke before this worked? The asyncio gotcha
&lt;/h2&gt;

&lt;p&gt;The “eager warmup at boot” idea is obvious in principle: run one throwaway parse during worker startup so the model is loaded and warm before any user request arrives. The implementation has one trap.&lt;/p&gt;

&lt;p&gt;vLLM’s &lt;code&gt;AsyncLLMEngine&lt;/code&gt; binds its IPC primitives (transports, queues) to the asyncio event loop that initialized it. If you call &lt;code&gt;asyncio.run(warmup())&lt;/code&gt; followed by &lt;code&gt;runpod.serverless.start()&lt;/code&gt;, your warmup creates loop A, runs the parse, then tears loop A down when &lt;code&gt;asyncio.run&lt;/code&gt; returns. Then &lt;code&gt;runpod.serverless.start()&lt;/code&gt; creates loop B for serving. When the first user request tries to talk to the vLLM engine through loop B, the engine handle is bound to the now-dead loop A. Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"error_type": "EngineDeadError",
"error_message": "EngineCore encountered an issue. See stack trace (above) for the root cause."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The engine subprocess itself is still alive. It’s only the parent process’s IPC reference that’s broken.&lt;/p&gt;

&lt;p&gt;The fix is to keep the warmup and the serve loop on the same asyncio event loop. RunPod’s &lt;code&gt;runpod.serverless.start()&lt;/code&gt; internally calls &lt;code&gt;asyncio.run(JobScaler.run())&lt;/code&gt;, but &lt;code&gt;JobScaler&lt;/code&gt; (in &lt;code&gt;runpod.serverless.modules.rp_scale&lt;/code&gt;) is constructible directly and its &lt;code&gt;run()&lt;/code&gt; is an awaitable coroutine. So you can compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

from runpod.serverless.modules import rp_ping, rp_scale
from runpod.serverless.modules.rp_fitness import run_fitness_checks

# handler, _concurrency_modifier, and warmup_async are defined elsewhere in the worker.
config = {"handler": handler, "concurrency_modifier": _concurrency_modifier, "rp_args": {}}

async def _bootstrap():
    await run_fitness_checks()
    await warmup_async()  # &amp;lt;- engine binds to THIS loop
    rp_ping.Heartbeat().start_ping()
    await rp_scale.JobScaler(config).run()  # &amp;lt;- and stays here

asyncio.run(_bootstrap())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now both phases share one event loop. The engine handle stays valid across the warmup → serve transition. FlashBoot captures a snapshot of a process where the loop, the engine, and the IPC are all alive together. On restore, they come back together too.&lt;/p&gt;
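&lt;p&gt;The loop-binding failure can be reproduced without vLLM. The sketch below uses a hypothetical &lt;code&gt;FakeEngine&lt;/code&gt; stand-in (not vLLM’s API) that remembers the event loop it was created on and refuses calls from any other loop. It is a simplified model of the behavior described above, not the real engine:&lt;/p&gt;

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine handle bound to one event loop."""

    def __init__(self):
        # Must be constructed inside a running loop, like an async engine.
        self.loop = asyncio.get_running_loop()

    async def parse(self):
        if asyncio.get_running_loop() is not self.loop:
            raise RuntimeError("engine is bound to a different event loop")
        return "ok"


async def make_engine():
    return FakeEngine()


def two_loops():
    """Warmup and serving under separate asyncio.run() calls (the trap)."""
    engine = asyncio.run(make_engine())      # loop A is created, then closed
    try:
        return asyncio.run(engine.parse())   # loop B: the handle is stale
    except RuntimeError as exc:
        return f"failed: {exc}"


def one_loop():
    """Warmup and serving composed under a single asyncio.run()."""
    async def bootstrap():
        engine = await make_engine()         # binds to this loop
        return await engine.parse()          # same loop, handle still valid
    return asyncio.run(bootstrap())


print(two_loops())
print(one_loop())
```

&lt;p&gt;The only difference between the two functions is how many times &lt;code&gt;asyncio.run()&lt;/code&gt; is called, which is the same difference as between the broken and the fixed bootstrap.&lt;/p&gt;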

&lt;p&gt;This does reach into runpod-python’s internals (the &lt;code&gt;runpod.serverless.modules.*&lt;/code&gt; submodules aren’t part of the documented public API). Cheap to guard against drift: a unit test that asserts &lt;code&gt;JobScaler&lt;/code&gt; exists with the expected constructor and an awaitable &lt;code&gt;run()&lt;/code&gt; method. If RunPod refactors, CI catches it before production does.&lt;/p&gt;
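&lt;p&gt;A minimal sketch of such a drift guard, using only &lt;code&gt;inspect&lt;/code&gt;. The helper name &lt;code&gt;scaler_contract_problems&lt;/code&gt; and the &lt;code&gt;StandInScaler&lt;/code&gt; class are hypothetical; in a real test suite you would pass the imported &lt;code&gt;JobScaler&lt;/code&gt; instead of the stand-in:&lt;/p&gt;

```python
import inspect


def scaler_contract_problems(cls):
    """Return a list of contract violations; an empty list means no drift."""
    problems = []
    try:
        # The constructor should accept a single positional config argument.
        inspect.signature(cls).bind({})
    except TypeError:
        problems.append("constructor does not accept a single config argument")
    run = getattr(cls, "run", None)
    if run is None:
        problems.append("run() is missing")
    elif not inspect.iscoroutinefunction(run):
        problems.append("run() is not a coroutine function")
    return problems


class StandInScaler:
    """Hypothetical class with the shape the bootstrap code relies on."""

    def __init__(self, config):
        self.config = config

    async def run(self):
        return None


assert scaler_contract_problems(StandInScaler) == []
# In the real test suite, check the class the worker actually imports:
#   from runpod.serverless.modules.rp_scale import JobScaler
#   assert scaler_contract_problems(JobScaler) == []
```

&lt;p&gt;Returning a list of problems rather than a boolean keeps the CI failure message specific about which part of the contract moved.&lt;/p&gt;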

&lt;h2&gt;
  
  
  When does the warmup pay off and when doesn’t it?
&lt;/h2&gt;

&lt;p&gt;The warmup pays off per host, not per endpoint, so the math depends on your traffic pattern.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Likely outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;workers_min ≥ 1&lt;/code&gt; (always-on worker)&lt;/td&gt;
&lt;td&gt;Worker stays on its host. Every request is on a fully warm worker (~5 s parse). No cold starts at all.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-frequency endpoint, workers scale up and down fast&lt;/td&gt;
&lt;td&gt;Same hosts get re-selected. Most cold starts are happy-path restores (~7 s).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quiet endpoint, infrequent requests, long idle gaps&lt;/td&gt;
&lt;td&gt;RunPod’s scheduler may pick a different host. Some cold starts will be on new hosts (~110 s).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First request after a rebuild&lt;/td&gt;
&lt;td&gt;Always cold path. Every endpoint’s first request after a fresh image pays ~5-7 min (image pull) + ~110 s (warmup). One-time per worker host.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;MINERU_SKIP_WARMUP=1&lt;/code&gt; (warmup off)&lt;/td&gt;
&lt;td&gt;Every cold start is ~110-130 s. No per-host amortization. Don’t do this in production.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The case that stings is “quiet endpoint with sporadic traffic” — a few requests an hour, 10-minute idle gaps, RunPod bouncing between hosts. Without warmup, every cold start would be ~110-130 s. With warmup, you get a mix: some 7-second restores, some 110-second fresh boots. The mix tilts toward fast as the endpoint warms up across more hosts and RunPod’s scheduler starts re-selecting them.&lt;/p&gt;
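&lt;p&gt;To put a number on that mix, here is a back-of-envelope sketch. The 7 s and 110 s defaults are the measurements quoted above; &lt;code&gt;restore_fraction&lt;/code&gt; is an assumption you would have to estimate from your own worker logs:&lt;/p&gt;

```python
def expected_cold_start_seconds(restore_fraction, restore_s=7.0, fresh_s=110.0):
    """Average cold-start latency for a given share of snapshot restores."""
    if restore_fraction > 1.0 or 0.0 > restore_fraction:
        raise ValueError("restore_fraction must be between 0 and 1")
    return restore_fraction * restore_s + (1.0 - restore_fraction) * fresh_s


for fraction in (0.0, 0.5, 0.9):
    seconds = expected_cold_start_seconds(fraction)
    print(f"{fraction:.0%} restores: {seconds:.1f} s average cold start")
```

&lt;p&gt;The average hides the bimodal shape: individual users still see either the fast or the slow path, never the mean.&lt;/p&gt;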

&lt;p&gt;If your traffic is sustained enough that you can pin a worker (&lt;code&gt;workers_min=1&lt;/code&gt;), you skip the entire question. You’re paying for the GPU 24/7 but never paying a cold start. For workloads with even modest cost sensitivity, the warmup + FlashBoot path is the better trade.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means if you’re shipping vLLM on RunPod
&lt;/h2&gt;

&lt;p&gt;Three takeaways from the live measurements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Always set up an eager warmup at worker boot.&lt;/strong&gt; Loading the model on first request is silently worse than it sounds: you pay the ~110 s load on every cold start, because no host ever holds a snapshot with the model in it, and you forfeit the per-host amortization that makes the second hit on a host cheap. Without warmup, FlashBoot has nothing useful to snapshot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compose warmup and the serving loop under one &lt;code&gt;asyncio.run()&lt;/code&gt;.&lt;/strong&gt; If you &lt;code&gt;asyncio.run()&lt;/code&gt; the warmup separately, the engine dies at the loop boundary. The fix is straightforward but the failure mode is opaque (&lt;code&gt;EngineDeadError&lt;/code&gt; 75 ms into the first request) — easy to misdiagnose as a vLLM bug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don’t market your cold start as “X seconds” without acknowledging the per-host mix.&lt;/strong&gt; A snapshot-restore cold start is genuinely 7-8 seconds. A new-host cold start is ~110 s. The restore path is a big improvement over the no-warmup baseline (~110-130 s on every cold start), and the new-host path is no worse than it. But your users will see the mix, and a too-clean claim makes the bad days look broken.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole investigation was on a 24 GB A5000 / RTX 4090 class GPU running MinerU’s 1.2B VLM via vLLM 0.11.2. The numbers will shift on larger models (more VRAM to snapshot, longer model load on cold path) but the mechanism applies the same way. If your cold start dominates wall-clock latency on a serverless GPU workload, set up boot-time warmup, watch the worker logs for the snapshot pattern, and tune your &lt;code&gt;workers_min&lt;/code&gt; accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does FlashBoot snapshot the vLLM engine subprocess?
&lt;/h3&gt;

&lt;p&gt;Yes. The vLLM engine runs as a child process for memory isolation, and FlashBoot’s CRIU-style mechanism captures the full process tree including subprocesses. The engine’s PID persists across snapshot/restore, and its IPC pipes back to the parent stay connected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does my cold start take 60-120 seconds even with FlashBoot enabled?
&lt;/h3&gt;

&lt;p&gt;Most likely your model is being loaded lazily on first request rather than at worker boot. FlashBoot only snapshots state that already exists in the worker process when it scales to zero. If your model loads on first request, the snapshot captures a worker without the model, and every cold start has to load the model again. Move the model load to worker boot, before the worker starts accepting jobs, and FlashBoot will start carrying the warm state forward. For an async engine such as vLLM, run that load on the same event loop as the serving loop rather than in a separate &lt;code&gt;asyncio.run()&lt;/code&gt; ahead of &lt;code&gt;runpod.serverless.start()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the difference between FlashBoot and a network volume?
&lt;/h3&gt;

&lt;p&gt;A network volume is shared file storage attached to your worker (e.g., for model weights you don’t want to bake into the Docker image). FlashBoot is process-state preservation — it captures the running Python process, including data already loaded from disk into VRAM. They solve different problems and can be used together: a network volume avoids re-downloading model files on image pull; FlashBoot avoids re-loading them into VRAM on cold start.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does FlashBoot work for non-GPU workloads?
&lt;/h3&gt;

&lt;p&gt;The mechanism (process snapshot via CRIU or equivalent) doesn’t depend on GPU memory specifically. CPU-bound workloads with significant cold-start cost (heavy library imports, large in-memory indices, JIT compilation) should benefit similarly. The framing in this post happens to use a GPU workload because that’s where the cold-start tax is most painful.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I know if my worker is hitting a snapshot restore vs a fresh boot?
&lt;/h3&gt;

&lt;p&gt;Check the worker logs in the RunPod dashboard. A fresh boot shows fitness checks, framework init logs, and any warmup output. A snapshot restore is silent until the first &lt;code&gt;Jobs in queue: 1&lt;/code&gt; line, then jumps straight to your handler’s request-processing logs. The presence or absence of the boot sequence is the cleanest signal.&lt;/p&gt;
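&lt;p&gt;That check can be automated. The sketch below assumes you have already sliced the log down to the lines for one worker start (everything after the previous job’s activity). The &lt;code&gt;[mineru-warmup]&lt;/code&gt; marker comes from this template’s logs; for another worker, substitute whatever your own boot sequence prints. The function name is hypothetical:&lt;/p&gt;

```python
def classify_start(lines, boot_markers=("[mineru-warmup]",), job_marker="Jobs in queue"):
    """Classify one worker start as 'fresh_boot' or 'snapshot_restore'.

    A fresh boot prints boot/warmup output before the first job line;
    a snapshot restore goes straight to the job line.
    """
    for line in lines:
        if any(marker in line for marker in boot_markers):
            return "fresh_boot"
        if job_marker in line:
            return "snapshot_restore"
    return "unknown"


fresh = [
    "04:47:28 [mineru-warmup] done in 101.5s",
    "04:47:28 Jobs in queue: 1",
    "04:47:28 Started.",
]
restored = [
    "04:51:25 Jobs in queue: 1",
    "04:51:25 Started.",
]
print(classify_start(fresh))
print(classify_start(restored))
```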

&lt;h3&gt;
  
  
  Is FlashBoot the same as RunPod’s “Active Workers” tier?
&lt;/h3&gt;

&lt;p&gt;No. Active Workers are a billing tier where you pre-commit to a number of workers that are always on, billed at a discount in exchange for the 24/7 commitment. FlashBoot is a free runtime optimization that applies to flex (scale-to-zero) workers. The two can be combined: an Active Worker on the same endpoint can also benefit from FlashBoot when it cycles, though for a worker that never goes idle there’s nothing to snapshot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will FlashBoot survive a Docker image rebuild?
&lt;/h3&gt;

&lt;p&gt;No. Each image gets its own SHA, and FlashBoot snapshots are scoped to (host, image SHA). When you push a new image, all existing snapshots are invalid. The first request after a rebuild on any host pays the full cold-start cost (image pull + warmup). Once each host has served the new image once, subsequent restores work normally.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;runpod-mineru&lt;/code&gt; repo wraps all of this into one Docker image: MinerU 3.2.x + the &lt;code&gt;MinerU2.5-Pro-2605-1.2B&lt;/code&gt; VLM, the JobScaler-bypass composition for warmup, structured logging, and the rest. Open source (&lt;a href="https://github.com/sergeyshmakov/mineru-runpod" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;), MIT-licensed, deploys from the RunPod Hub in two clicks.&lt;/p&gt;

&lt;p&gt;If you want the deeper breakdown of which phases of a cold start cost what, the &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/troubleshooting/#flashboot-mechanism-confirmed" rel="noopener noreferrer"&gt;troubleshooting guide&lt;/a&gt; has the per-phase timing table from the same test runs. The &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/scaling/#flashboot" rel="noopener noreferrer"&gt;scaling guide&lt;/a&gt; covers when to pair FlashBoot with &lt;code&gt;workers_min ≥ 1&lt;/code&gt; for fully predictable latency.&lt;/p&gt;




&lt;p&gt;&lt;small&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; RunPod links in this post use a referral code that credits me at no cost to you. The post would read the same without it.&lt;/small&gt;&lt;/p&gt;

</description>
      <category>runpod</category>
      <category>flashboot</category>
      <category>serverless</category>
      <category>vllm</category>
    </item>
    <item>
      <title>RunPod 20 MB Response Cap: Fix NoneType with Cloudflare R2</title>
      <dc:creator>Sergey Shmakov</dc:creator>
      <pubDate>Wed, 20 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/sergeyshmakov/runpod-20-mb-response-cap-fix-nonetype-with-cloudflare-r2-48la</link>
      <guid>https://dev.to/sergeyshmakov/runpod-20-mb-response-cap-fix-nonetype-with-cloudflare-r2-48la</guid>
      <description>&lt;p&gt;&lt;em&gt;Last Updated: 2026-05-27&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If your RunPod serverless worker logs say &lt;code&gt;done&lt;/code&gt; but your client raises &lt;code&gt;unexpected handler return type: &amp;lt;class 'NoneType'&amp;gt;&lt;/code&gt;, you’ve hit RunPod’s bidirectional 20 MB payload cap on &lt;code&gt;/runsync&lt;/code&gt;. The handler succeeded. The gateway dropped the response on the way back because the payload was too large.&lt;/p&gt;

&lt;p&gt;The fix is two steps. Set &lt;code&gt;return: "s3"&lt;/code&gt; on the job, and configure four env vars on the endpoint pointing at a Cloudflare R2 bucket. The worker uploads the result to R2 and returns a small presigned URL. Your client downloads from R2 directly. No gateway cap in the path.&lt;/p&gt;

&lt;p&gt;I hit this on an 82-page Cyrillic fiscal report (30 MB input, ~25 MB output with embedded images) running my open-source &lt;a href="https://github.com/sergeyshmakov/mineru-runpod" rel="noopener noreferrer"&gt;mineru-runpod&lt;/a&gt; template. Two retries via &lt;code&gt;return: "inline"&lt;/code&gt; and &lt;code&gt;return: "tarball_b64"&lt;/code&gt; failed the same way. R2 mode worked first try. The rest of this post is the symptom, the env-var recipe, the cost comparison vs S3, and a few gotchas worth knowing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does my RunPod worker return NoneType after a successful parse?
&lt;/h2&gt;

&lt;p&gt;The worker handler completed and returned a valid dict. RunPod’s runtime then tried to POST that result back to RunPod’s API via &lt;code&gt;/job-done&lt;/code&gt;, and the API returned HTTP 400 because the payload exceeded ~20 MB. The result was discarded. The SDK saw no output, returned &lt;code&gt;None&lt;/code&gt; to the client, and the client wrapper raised the &lt;code&gt;NoneType&lt;/code&gt; error.&lt;/p&gt;

&lt;p&gt;The worker logs make the chain explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[mineru-worker] done: elapsed=91.77s phase_ms={'fetch_input': 972, 'mineru_parse': 90789, 'package': 66}
{"requestId": "sync-fdcd03cd-...", "message": "Failed to return job results. | 400, message='Bad Request', url='https://api.runpod.ai/v2/&amp;lt;endpoint&amp;gt;/job-done/&amp;lt;worker&amp;gt;/sync-fdcd03cd-...?gpu=NVIDIA+RTX+A5000&amp;amp;isStream=false'"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first line shows the handler finished cleanly: 82 pages parsed in 91.8 s on the worker (this test ran on A5000; on the current 4090 default the warm parse is 2–3× faster). The second line shows the gateway rejecting the result. The handler already returned and never knows the rejection happened. The SDK sees the discarded result and returns &lt;code&gt;None&lt;/code&gt; to your code.&lt;/p&gt;

&lt;p&gt;If you see this &lt;code&gt;NoneType&lt;/code&gt; error on a small doc, the diagnosis is different (worker OOM, crash, timeout). On a multi-page parse that the worker logs as &lt;code&gt;done&lt;/code&gt;, the answer is almost always the 20 MB cap.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is RunPod’s /runsync response payload limit?
&lt;/h2&gt;

&lt;p&gt;RunPod’s &lt;code&gt;/runsync&lt;/code&gt; gateway caps payloads at roughly &lt;strong&gt;20 MB in both directions&lt;/strong&gt;. The request cap affects &lt;code&gt;file_b64&lt;/code&gt; inline uploads. The response cap affects what the worker can return. Both are independent of execution time and memory budget. A fast, successful parse can hit the response cap simply by producing a large output.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;th&gt;Limit&lt;/th&gt;
&lt;th&gt;What triggers it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Request → gateway → worker&lt;/td&gt;
&lt;td&gt;~20 MB&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;file_b64&lt;/code&gt; inline transport for large PDFs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worker → gateway → client&lt;/td&gt;
&lt;td&gt;~20 MB&lt;/td&gt;
&lt;td&gt;Multi-page parse outputs with embedded images&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The request cap is in &lt;a href="https://docs.runpod.io/serverless/endpoints/operations" rel="noopener noreferrer"&gt;RunPod’s docs&lt;/a&gt; and widely discussed. The response cap is mentioned only in passing. I found three open issues on the runpod-workers repos where other users hit the same symptom and didn’t realise what it was, so this post is partly to make that searchable.&lt;/p&gt;

&lt;p&gt;Practical threshold for mineru-runpod: pure-text PDFs stay under the cap for far more pages. Image-heavy PDFs with embedded raster output hit the response cap around 50–80 pages on &lt;code&gt;inline&lt;/code&gt; or &lt;code&gt;tarball_b64&lt;/code&gt; transport.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does &lt;code&gt;return: "tarball_b64"&lt;/code&gt; get around the 20 MB cap?
&lt;/h2&gt;

&lt;p&gt;No. &lt;code&gt;return: "tarball_b64"&lt;/code&gt; gzips the output into a single .tar.gz before base64-encoding it. Gzip compresses the JSON and Markdown text well, but the page images inside the tarball are already raster bytes (PNG, JPEG) and barely compress further. Multi-page parses with embedded images keep the tarball over 20 MB.&lt;/p&gt;

&lt;p&gt;I confirmed this on the same 82-page PDF. Same 400 from &lt;code&gt;/job-done&lt;/code&gt;. Same &lt;code&gt;NoneType&lt;/code&gt; in the client. Both &lt;code&gt;inline&lt;/code&gt; and &lt;code&gt;tarball_b64&lt;/code&gt; route through the gateway response, so both inherit the cap. Only &lt;code&gt;return: "s3"&lt;/code&gt; avoids it because the worker uploads out-of-band.&lt;/p&gt;
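&lt;p&gt;The compression argument is easy to check locally. In this sketch, seeded random bytes stand in for already-compressed PNG/JPEG data and a repeated JSON line stands in for text output; both payloads are synthetic assumptions, not real MinerU output. It also shows a second cost of this transport: base64 encoding adds roughly a third on top of whatever gzip leaves.&lt;/p&gt;

```python
import base64
import gzip
import random


def transport_sizes(payload):
    """Return (raw, gzipped, gzipped-then-base64) sizes in bytes."""
    gzipped = gzip.compress(payload)
    encoded = base64.b64encode(gzipped)
    return len(payload), len(gzipped), len(encoded)


rng = random.Random(0)
image_like = rng.randbytes(1_000_000)  # stand-in for compressed raster bytes
text_like = b'{"type": "text", "text": "lorem ipsum"}\n' * 25_000

for name, payload in (("image-like", image_like), ("text-like", text_like)):
    raw, gz, b64 = transport_sizes(payload)
    print(f"{name}: raw={raw} gzip={gz} gzip+base64={b64}")
```

&lt;p&gt;The image-like payload comes out of gzip no smaller than it went in and then grows again under base64, while the text-like payload shrinks dramatically. A tarball dominated by page images behaves like the first case.&lt;/p&gt;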

&lt;h2&gt;
  
  
  How do I configure Cloudflare R2 to bypass the RunPod response cap?
&lt;/h2&gt;

&lt;p&gt;Set &lt;code&gt;return: "s3"&lt;/code&gt; in the job input, then add four env vars on the RunPod endpoint pointing at a &lt;a href="https://developers.cloudflare.com/r2/" rel="noopener noreferrer"&gt;Cloudflare R2&lt;/a&gt; bucket. The worker uploads the gzipped tarball directly to R2 and returns a small presigned URL (~1 h TTL). Your client downloads from R2.&lt;/p&gt;

&lt;p&gt;The job input changes one field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "input": {
    "file_url": "https://example.com/big.pdf",
    "return": "s3"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four env vars go on the &lt;strong&gt;endpoint&lt;/strong&gt; (not the template — they’re secrets):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Env var&lt;/th&gt;
&lt;th&gt;Cloudflare R2 value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BUCKET_ENDPOINT_URL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://&amp;lt;account-id&amp;gt;.r2.cloudflarestorage.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BUCKET_NAME&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;your bucket name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BUCKET_ACCESS_KEY_ID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;R2 API token access key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BUCKET_SECRET_ACCESS_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;R2 API token secret&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;BUCKET_REGION&lt;/code&gt; &lt;em&gt;(optional)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;auto&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You generate the access key pair in the Cloudflare dashboard: &lt;strong&gt;R2 → Manage R2 API Tokens → Create API Token → Object Read &amp;amp; Write scoped to the bucket&lt;/strong&gt;. The worker auto-restarts when you save endpoint env vars in RunPod. Test with one small doc before sending production traffic.&lt;/p&gt;
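&lt;p&gt;Before that first test, it can help to confirm that the required variables are actually set. A small preflight sketch follows; the variable names are the ones in the table above, while the helper itself is hypothetical and not part of the template:&lt;/p&gt;

```python
import os

REQUIRED_BUCKET_VARS = (
    "BUCKET_ENDPOINT_URL",
    "BUCKET_NAME",
    "BUCKET_ACCESS_KEY_ID",
    "BUCKET_SECRET_ACCESS_KEY",
)


def missing_bucket_vars(env=None):
    """Return the required bucket variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_BUCKET_VARS if not env.get(name, "").strip()]


missing = missing_bucket_vars()
if missing:
    print("missing bucket configuration:", ", ".join(missing))
else:
    print("bucket configuration looks complete")
```

&lt;p&gt;This only checks presence, not validity. A wrong secret still passes here and fails at upload time, which is why the small test document is still worth sending.&lt;/p&gt;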

&lt;h2&gt;
  
  
  Why pick Cloudflare R2 over AWS S3 for RunPod output storage?
&lt;/h2&gt;

&lt;p&gt;R2 has &lt;strong&gt;zero egress fees&lt;/strong&gt;, a 10 GB free storage tier, 1M Class A ops and 10M Class B ops per month free, and is fully S3-compatible. AWS S3 charges egress at roughly $0.085/GB plus storage at $0.023/GB/month. For a RunPod pipeline doing dozens of GB of I/O per month, R2’s bill stays near zero while S3 lands in the $5–$15 range.&lt;/p&gt;

&lt;p&gt;A back-of-envelope month for the workload I tested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,000 multi-page parses, average output 8 MB → 8 GB stored then deleted&lt;/li&gt;
&lt;li&gt;1,000 worker→bucket uploads + 1,000 client→bucket downloads = 2,000 ops&lt;/li&gt;
&lt;li&gt;Storage: free (under 10 GB). Egress: free (R2 doesn’t bill egress). Ops: free (well under 1M Class A).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same workload on S3: ~$0.18 storage + ~$0.68 egress + per-request fees, maybe $1–$3 total. Cheap but R2’s $0 is cheaper.&lt;/p&gt;
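&lt;p&gt;The S3 side of that estimate, as a sketch you can rerun with your own volumes. The rates are the approximate figures quoted above, not current AWS pricing. Per-request fees are left out, and storage is charged as if every object were kept for a full month, so the storage figure is an upper bound:&lt;/p&gt;

```python
def s3_monthly_estimate(parses, avg_output_mb, storage_per_gb=0.023, egress_per_gb=0.085):
    """Rough S3 storage and egress cost in USD for one month of outputs."""
    total_gb = parses * avg_output_mb / 1000  # decimal GB, as in the estimate above
    return {
        "total_gb": total_gb,
        "storage_usd": total_gb * storage_per_gb,
        "egress_usd": total_gb * egress_per_gb,
    }


estimate = s3_monthly_estimate(parses=1000, avg_output_mb=8)
print(f"{estimate['total_gb']:.0f} GB stored and downloaded")
print(f"storage ~${estimate['storage_usd']:.2f}, egress ~${estimate['egress_usd']:.2f}")
```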

&lt;p&gt;S3 still makes sense if you’re already deep in AWS, if you need IAM-controlled access patterns, or if RunPod workers and your AWS region are colocated tightly enough that egress doesn’t apply. For everyone else and especially for solo / indie deploys, R2 is the right default. See &lt;a href="https://developers.cloudflare.com/r2/pricing/" rel="noopener noreferrer"&gt;R2 pricing&lt;/a&gt; for current rates.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does the parse flow look like end-to-end with &lt;code&gt;return: "s3"&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;The worker fetches the input PDF, runs MinerU, gzips the outputs into a tarball, uploads to R2 via the configured &lt;code&gt;BUCKET_*&lt;/code&gt; env vars, and returns a small JSON response with &lt;code&gt;tarball_url&lt;/code&gt;, &lt;code&gt;tarball_url_expires_in&lt;/code&gt; (3600 s), and &lt;code&gt;bucket_key&lt;/code&gt;. Your client follows the URL and extracts the tarball locally. No payload ever crosses RunPod’s 20 MB-capped response path.&lt;/p&gt;

&lt;p&gt;The client call and result shape from the 82-page test (on A5000; current default is 4090):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = client.parse_document(
    file_url="https://pub-....r2.dev/report.pdf",
    backend="vlm-auto-engine",
    return_format="s3",
)
# result["tarball_url"] -&amp;gt; presigned R2 URL, valid ~1 h
# result["tarball_url_expires_in"] -&amp;gt; 3600
# result["bucket_key"] -&amp;gt; "report-&amp;lt;hash&amp;gt;.tar.gz"

client.save_s3_tarball(result, "./out/")
# downloads + extracts -&amp;gt; out/report.md, out/report_content_list_v2.json, out/images/, ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;End-to-end wall-clock: 211 s for an 82-page doc on a cold worker. Breakdown: ~112 s before MinerU started parsing (worker boot + warmup), ~92 s warm parsing (1.1 s/page on A5000), ~11 s gzip and upload to R2 (the &lt;code&gt;package&lt;/code&gt; phase). The extracted output: 313 KB Markdown plus structured JSON plus per-page images. Roughly 3.5 minutes for a document that previously couldn’t return its output at all.&lt;/p&gt;

&lt;p&gt;The cold-start portion is a separate concern from the response cap. &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/blog/runpod-flashboot-mechanism-investigation/" rel="noopener noreferrer"&gt;The FlashBoot mechanism investigation&lt;/a&gt; covers why the ~112 s exists, how the boot-time warmup interacts with RunPod’s snapshot system, and when subsequent cold starts are much faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  What should I watch out for with the R2 bridge?
&lt;/h2&gt;

&lt;p&gt;Four things the docs don’t say loudly. The presigned URL TTL is 60 minutes. R2 doesn’t auto-clean uploaded objects. One bucket can serve input and output. The 20 MB cap applies to &lt;code&gt;/run&lt;/code&gt; (async) too, not just &lt;code&gt;/runsync&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Presigned URL TTL is 60 minutes.&lt;/strong&gt; If your client is slow to download (e.g. a job-queue worker that picks up results minutes later), bump &lt;code&gt;_S3_PRESIGN_TTL_SECONDS&lt;/code&gt; in the handler. Don’t rely on the default in long-tail flows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R2 doesn’t auto-clean uploaded objects.&lt;/strong&gt; Add an &lt;a href="https://developers.cloudflare.com/r2/buckets/object-lifecycles/" rel="noopener noreferrer"&gt;R2 lifecycle rule&lt;/a&gt; (e.g. delete after 7 days) so your output bucket doesn’t grow forever.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One R2 bucket can serve input and output.&lt;/strong&gt; Upload your PDFs to R2 ahead of time, pass &lt;code&gt;file_url&lt;/code&gt; pointing at the R2 public dev URL, and the worker writes outputs to the same bucket at the root. Add &lt;code&gt;BUCKET_PREFIX&lt;/code&gt; env var if you want outputs in a subdirectory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The 20 MB cap applies to &lt;code&gt;/run&lt;/code&gt; (async) too.&lt;/strong&gt; Same gateway, same limit. Switching to async polling doesn’t help.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do I get the R2 access key for &lt;code&gt;BUCKET_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;BUCKET_SECRET_ACCESS_KEY&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;In the Cloudflare dashboard: &lt;strong&gt;R2 → Manage R2 API Tokens → Create API Token&lt;/strong&gt;. Set permissions to “Object Read &amp;amp; Write” scoped to the specific bucket. Cloudflare shows the access key ID and secret access key once; copy both into your RunPod endpoint env vars immediately. The secret isn’t retrievable later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does the presigned URL expire?
&lt;/h3&gt;

&lt;p&gt;Yes. The default TTL is 3600 seconds (one hour). If your downstream client picks up the response asynchronously (job queue, cron, etc.), download promptly or bump &lt;code&gt;_S3_PRESIGN_TTL_SECONDS&lt;/code&gt; in the worker handler before redeploying.&lt;/p&gt;
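&lt;p&gt;For asynchronous consumers, a cheap guard is to check the remaining lifetime before attempting the download. This sketch assumes the client records the time it received the response as &lt;code&gt;issued_at&lt;/code&gt; (a hypothetical field, not part of the worker’s output) and uses the &lt;code&gt;tarball_url_expires_in&lt;/code&gt; value from the response:&lt;/p&gt;

```python
import time


def presigned_url_usable(issued_at, expires_in, now=None, margin_s=60.0):
    """True if the URL should still be valid with margin_s seconds to spare."""
    now = time.time() if now is None else now
    return issued_at + expires_in - margin_s > now


received_at = time.time()  # recorded when the job response arrived
print(presigned_url_usable(received_at, expires_in=3600))
```

&lt;p&gt;The margin covers the download itself starting late. If the check fails, the URL cannot be refreshed from the client side, so the options are a longer TTL in the handler or a re-run of the job.&lt;/p&gt;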

&lt;h3&gt;
  
  
  Can I reuse the same R2 bucket for input and output?
&lt;/h3&gt;

&lt;p&gt;Yes. The worker doesn’t care about the bucket layout. Upload your input PDFs to &lt;code&gt;bucket/inputs/&lt;/code&gt; and the worker writes outputs to &lt;code&gt;bucket/&amp;lt;basename&amp;gt;-&amp;lt;hash&amp;gt;.tar.gz&lt;/code&gt; at the root. Add &lt;code&gt;BUCKET_PREFIX&lt;/code&gt; env var if you want outputs pushed into a subdirectory.&lt;/p&gt;

&lt;h3&gt;
  
  
  What if I can’t set up R2? Is there a fallback?
&lt;/h3&gt;

&lt;p&gt;Page chunking. Split the parse with &lt;code&gt;start_page&lt;/code&gt; and &lt;code&gt;end_page&lt;/code&gt; into segments small enough that each output tarball stays under 20 MB, then concatenate the &lt;code&gt;.md&lt;/code&gt; files client-side. Slower (you may pay multiple cold starts if the worker scales to zero between calls) and you handle joining yourself, but no infra changes needed.&lt;/p&gt;
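&lt;p&gt;A sketch of the range computation for that fallback. It assumes &lt;code&gt;start_page&lt;/code&gt; and &lt;code&gt;end_page&lt;/code&gt; are 0-based and inclusive, which this post does not state, so check the worker’s input schema before relying on it. The chunk size of 30 is an arbitrary starting point to tune against your own output sizes:&lt;/p&gt;

```python
def page_chunks(total_pages, pages_per_chunk):
    """Split a document into consecutive (start_page, end_page) pairs.

    Pages are 0-based and end_page is inclusive (an assumption).
    """
    if 1 > total_pages or 1 > pages_per_chunk:
        raise ValueError("total_pages and pages_per_chunk must be positive")
    chunks = []
    for start in range(0, total_pages, pages_per_chunk):
        end = min(start + pages_per_chunk, total_pages) - 1
        chunks.append((start, end))
    return chunks


for start_page, end_page in page_chunks(82, 30):
    # One job per range; pass these as start_page / end_page in the job input.
    print(start_page, end_page)
```

&lt;p&gt;Sending the chunks back-to-back to the same endpoint keeps the worker warm between calls, which limits the extra cold starts mentioned above.&lt;/p&gt;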

&lt;h3&gt;
  
  
  Is the 20 MB cap on &lt;code&gt;/run&lt;/code&gt; too, or only &lt;code&gt;/runsync&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;Both. RunPod’s &lt;code&gt;/run&lt;/code&gt; (async) and &lt;code&gt;/runsync&lt;/code&gt; (synchronous) share the same gateway and the same payload limits. Switching to async doesn’t help the response-size problem. The cap is at the gateway layer, not the polling protocol.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does using &lt;code&gt;return: "s3"&lt;/code&gt; add to cold-start time?
&lt;/h3&gt;

&lt;p&gt;No. The S3 upload happens at the end of the parse, not the beginning. The handler’s &lt;code&gt;package&lt;/code&gt; phase grew from ~95 ms (in-memory tarball) to ~11 s (gzip + upload to R2) on an 82-page job, but cold-start is unchanged. The S3 mode adds a small constant to warm-job latency, not a multiplier.&lt;/p&gt;

&lt;h3&gt;
  
  
  How big can the R2-uploaded tarball be?
&lt;/h3&gt;

&lt;p&gt;Effectively unlimited for mineru-runpod workloads. R2 supports multipart uploads up to 5 TB per object. You’ll hit the worker’s &lt;code&gt;executionTimeoutMs&lt;/code&gt; long before you hit R2’s per-object limit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does R2 work for input PDFs too, or only output?
&lt;/h3&gt;

&lt;p&gt;Both. The worker accepts &lt;code&gt;file_url&lt;/code&gt; pointing at an R2 public dev URL (or a presigned R2 GET URL for private buckets) and fetches the input from R2. This avoids the inbound 20 MB cap on &lt;code&gt;file_b64&lt;/code&gt; for large PDFs. You can run an R2-in / R2-out setup with one bucket and avoid every payload-size limit RunPod has.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to next
&lt;/h2&gt;

&lt;p&gt;If you’ve shipped a multi-page PDF pipeline on RunPod and you’re not using &lt;code&gt;return: "s3"&lt;/code&gt;, you’ll hit the gateway cap eventually. Set it up before you need it. The cost is ten minutes of env-var configuration and possibly zero dollars per month at indie volumes.&lt;/p&gt;

&lt;p&gt;If you’re new to the template, the &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/getting-started/overview/" rel="noopener noreferrer"&gt;getting-started guide&lt;/a&gt; walks through the full deploy in about ten minutes. For the cold-start side of the picture (separate from the response cap covered here), see &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/blog/runpod-flashboot-mechanism-investigation/" rel="noopener noreferrer"&gt;the FlashBoot mechanism investigation&lt;/a&gt;. For GPU sizing, &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/choosing-gpu/" rel="noopener noreferrer"&gt;Choosing a GPU&lt;/a&gt; covers when the default &lt;code&gt;ADA_24&lt;/code&gt; (RTX 4090) is enough and when to opt up.&lt;/p&gt;

&lt;p&gt;If this saved you time, the easiest way to say thanks is &lt;a href="https://runpod.io?ref=31jdfpnq" rel="noopener noreferrer"&gt;signing up for RunPod through this link&lt;/a&gt;. Star the &lt;a href="https://github.com/sergeyshmakov/mineru-runpod" rel="noopener noreferrer"&gt;repo on GitHub&lt;/a&gt; for updates.&lt;/p&gt;




&lt;p&gt;&lt;small&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; RunPod links in this post use a referral code that credits me at no cost to you. The post would read the same without it.&lt;/small&gt;&lt;/p&gt;

</description>
      <category>runpod</category>
      <category>cloudflarer2</category>
      <category>serverless</category>
      <category>mineru</category>
    </item>
    <item>
      <title>Serverless MinerU on RunPod: honest cost math (2026)</title>
      <dc:creator>Sergey Shmakov</dc:creator>
      <pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/sergeyshmakov/serverless-mineru-on-runpod-honest-cost-math-2026-3abe</link>
      <guid>https://dev.to/sergeyshmakov/serverless-mineru-on-runpod-honest-cost-math-2026-3abe</guid>
      <description>&lt;p&gt;&lt;em&gt;Last Updated: 2026-05-27&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you’re building a RAG pipeline, a document indexer, or any product that ingests PDFs at scale, you’ve probably hit the same wall I did. Hosted OCR APIs charge pennies per page that compound into thousands per million. CPU parsers are too slow for production volume. A permanent GPU pod is wasteful when traffic comes in bursts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/opendatalab/MinerU" rel="noopener noreferrer"&gt;MinerU 2.5&lt;/a&gt; is genuinely state-of-the-art for PDF → Markdown / structured JSON. Apache 2.0 license. The &lt;a href="https://huggingface.co/opendatalab/MinerU2.5-Pro-2605-1.2B" rel="noopener noreferrer"&gt;&lt;code&gt;MinerU2.5-Pro-2605-1.2B&lt;/code&gt; model&lt;/a&gt; fits comfortably on a 24 GB GPU. RunPod Serverless scales to zero when nothing is calling. Wiring those two together is the obvious move.&lt;/p&gt;

&lt;p&gt;Real numbers from my open-source &lt;a href="https://github.com/sergeyshmakov/mineru-runpod" rel="noopener noreferrer"&gt;mineru-runpod&lt;/a&gt; template, measured on a 24 GB RTX 4090 in May 2026: &lt;strong&gt;~$0.001 per page for warm parses, plus a ~$0.03 fixed tax per cold start&lt;/strong&gt;. The all-in per-page cost depends on how much work you do before the worker scales back to zero. Here’s the deploy, the response shape, and the workload patterns this template is the right fit for.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does it actually cost to run MinerU on RunPod Serverless?
&lt;/h2&gt;

&lt;p&gt;About &lt;strong&gt;$0.001 per page&lt;/strong&gt; on an RTX 4090 once the worker is warm. Each scale-from-zero adds a &lt;strong&gt;~$0.03 fixed tax&lt;/strong&gt;: roughly 110 seconds of GPU billing for vLLM engine init plus model load. The all-in per-page cost depends on how many pages you amortize each cold start over. Sparse traffic with one short request per cold start lands closer to $0.005–$0.01 per page.&lt;/p&gt;

&lt;p&gt;Real workload-shape math using &lt;code&gt;ADA_24&lt;/code&gt; (RTX 4090, ~$1.10/hr Flex):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload shape&lt;/th&gt;
&lt;th&gt;Per-page cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1,000 pages amortized across one cold start&lt;/td&gt;
&lt;td&gt;~$0.001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 pages amortized across one cold start&lt;/td&gt;
&lt;td&gt;~$0.0013&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 pages then idle out&lt;/td&gt;
&lt;td&gt;~$0.004&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One short doc per scale-from-zero (worst case)&lt;/td&gt;
&lt;td&gt;~$0.007&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
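
&lt;p&gt;The rows above follow from one formula: the warm per-page cost plus the cold-start tax spread over the pages parsed before the worker idles out. A minimal sketch using this post’s measured figures (~$0.001 per warm page, ~$0.03 per cold start). The function name and the 5-page stand-in for “one short doc” are illustrative assumptions, not part of the template:&lt;/p&gt;

```python
WARM_COST_PER_PAGE = 0.001  # USD, measured warm-parse figure from this post
COLD_START_TAX = 0.03       # USD, approximate cost of one scale-from-zero


def effective_cost_per_page(pages, cold_starts=1):
    """Return the all-in USD cost per page for a given workload shape."""
    total = WARM_COST_PER_PAGE * pages + COLD_START_TAX * cold_starts
    return total / pages


# Hypothetical workload shapes; 5 pages stands in for "one short doc".
for pages in (1000, 100, 10, 5):
    cost = effective_cost_per_page(pages)
    print(f"{pages:5d} pages per cold start: ${cost:.4f} per page")
```

&lt;p&gt;Plug in your own average pages per cold start to see where your workload lands between the best and worst rows.&lt;/p&gt;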

&lt;p&gt;Compared to alternatives:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool / setup&lt;/th&gt;
&lt;th&gt;Per-page cost&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hosted OCR APIs (typical)&lt;/td&gt;
&lt;td&gt;$0.001 – $0.01&lt;/td&gt;
&lt;td&gt;vendor lock-in, rate limits, documents leave your stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Permanent GPU pod (24 h on A5000)&lt;/td&gt;
&lt;td&gt;$0.001 – $0.003&lt;/td&gt;
&lt;td&gt;24 h of bills whether you use it or not&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mineru-runpod, amortized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.001 – $0.004&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;scales to zero; cold-start tax is real&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marker / Nougat on CPU&lt;/td&gt;
&lt;td&gt;$0 cash, $$$ time&lt;/td&gt;
&lt;td&gt;~30 s/page sequential (&lt;a href="https://github.com/datalab-to/marker" rel="noopener noreferrer"&gt;Marker docs&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The trick is RunPod’s per-second billing. No worker running, no bill. The catch is every scale-from-zero pays a real fixed cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do I deploy MinerU to RunPod Serverless in ten minutes?
&lt;/h2&gt;

&lt;p&gt;Fork the repo, point RunPod’s GitHub auto-build at your fork, create a Serverless Endpoint with &lt;code&gt;ADA_24&lt;/code&gt; (RTX 4090) and FlashBoot enabled, send a request via the included Python client. Total wall-clock from RunPod sign-up to first parse: roughly ten minutes, dominated by the image build (~5–10 min) plus the first cold start (~110 s).&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Get a RunPod account
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://runpod.io?ref=31jdfpnq" rel="noopener noreferrer"&gt;Sign up here&lt;/a&gt;. Add $5 of credit. At ~$0.03 per cold start and ~$0.001 per warm page, that covers roughly 160 cold starts or about 5,000 warm pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Fork the repo
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
gh repo fork sergeyshmakov/mineru-runpod --clone
cd mineru-runpod

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repo stays small. &lt;code&gt;Dockerfile&lt;/code&gt;, &lt;code&gt;handler.py&lt;/code&gt;, a &lt;code&gt;worker/&lt;/code&gt; package, a Python client (&lt;code&gt;mineru_client&lt;/code&gt;), three GitHub Actions workflows, Hub metadata under &lt;code&gt;.runpod/&lt;/code&gt;. MIT licensed, ~30 files.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Wire RunPod’s GitHub auto-build
&lt;/h3&gt;

&lt;p&gt;In the RunPod dashboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Serverless → Templates → New → Import Git Repository&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Point at your fork. Branch &lt;code&gt;main&lt;/code&gt;, Dockerfile path &lt;code&gt;Dockerfile&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;RunPod clones, builds the image, stores it in its own registry, and gives you a &lt;code&gt;template_id&lt;/code&gt;. The build runs ~5–10 minutes. Watch the log if you want.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  4. Create the endpoint
&lt;/h3&gt;

&lt;p&gt;Dashboard path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Serverless → Endpoints → New&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Template: the one you just created&lt;/li&gt;
&lt;li&gt;GPU pool: &lt;code&gt;ADA_24&lt;/code&gt; (RTX 4090, 24 GB)&lt;/li&gt;
&lt;li&gt;Workers min: &lt;code&gt;0&lt;/code&gt;, max: &lt;code&gt;3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Idle timeout: &lt;code&gt;10&lt;/code&gt; seconds&lt;/li&gt;
&lt;li&gt;FlashBoot: on&lt;/li&gt;
&lt;li&gt;Save, grab the endpoint id&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or as code (reproducible across redeploys):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
pip install -e ".[deploy]"
python deploy.py --template-id &amp;lt;tid&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;deploy.py&lt;/code&gt; exposes every endpoint setting as a CLI flag.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Parse your first PDF
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
from mineru_client import MineruClient

client = MineruClient(
    endpoint_id="&amp;lt;your-endpoint-id&amp;gt;",
    api_key="&amp;lt;your-runpod-api-key&amp;gt;",
)

result = client.parse_document(
    file_url="https://example.com/report.pdf",
    end_page=4,  # smoke test on first 5 pages
)

client.save_tarball(result, "./out/doc")
# → ./out/doc/&amp;lt;basename&amp;gt;.md
# → ./out/doc/&amp;lt;basename&amp;gt;_content_list_v2.json
# → ./out/doc/&amp;lt;basename&amp;gt;_middle.json
# → ./out/doc/images/*.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first parse pays a cold start. Subsequent parses on the same warm worker run at ~1–6 s/page on the 4090, depending on content density. After 10 s of idle, the worker scales to zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does the MinerU response actually contain?
&lt;/h2&gt;

&lt;p&gt;Three structured outputs plus extracted images. &lt;code&gt;&amp;lt;basename&amp;gt;.md&lt;/code&gt; is Markdown with LaTeX equations, HTML tables, and image references. &lt;code&gt;&amp;lt;basename&amp;gt;_content_list_v2.json&lt;/code&gt; is a flat list of typed entries (text, equation, table, image, code) each tagged with &lt;code&gt;page_idx&lt;/code&gt;. &lt;code&gt;&amp;lt;basename&amp;gt;_middle.json&lt;/code&gt; carries the full layout with bounding boxes and reading order. Pick the transport via &lt;code&gt;return&lt;/code&gt;: &lt;code&gt;tarball_b64&lt;/code&gt;, &lt;code&gt;inline&lt;/code&gt;, or &lt;code&gt;s3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For a document indexer or RAG pipeline, &lt;code&gt;content_list_v2.json&lt;/code&gt; is the file you’ll spend the most time with. Group entries by &lt;code&gt;level: "title"&lt;/code&gt; boundaries for section-based chunking. Embed each chunk and store &lt;code&gt;page_idx&lt;/code&gt; for citation back to the source.&lt;/p&gt;
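
&lt;p&gt;A minimal sketch of that section-based grouping. It assumes each entry is a dict carrying the &lt;code&gt;level&lt;/code&gt; and &lt;code&gt;page_idx&lt;/code&gt; keys described above plus a &lt;code&gt;text&lt;/code&gt; key. The &lt;code&gt;text&lt;/code&gt; key, the function name, and the sample entries are assumptions for illustration, so check them against a real &lt;code&gt;content_list_v2.json&lt;/code&gt; before relying on this:&lt;/p&gt;

```python
def chunk_by_title(entries):
    """Group a flat entry list into sections, starting a new one at each title."""
    chunks = []
    for entry in entries:
        is_title = entry.get("level") == "title"  # assumed marker, per the text above
        if is_title or not chunks:
            # Entries that appear before the first title land in an untitled chunk.
            chunks.append({
                "title": entry.get("text", "") if is_title else None,
                "page_idx": entry.get("page_idx"),  # first page, kept for citations
                "texts": [],
            })
            if is_title:
                continue
        chunks[-1]["texts"].append(entry.get("text", ""))
    return chunks


# Hand-written sample entries, not real MinerU output.
sample = [
    {"level": "title", "text": "1 Scope", "page_idx": 0},
    {"text": "This clause defines the scope.", "page_idx": 0},
    {"level": "title", "text": "2 Terms", "page_idx": 1},
    {"text": "Term A.", "page_idx": 1},
    {"text": "Term B.", "page_idx": 2},
]

for chunk in chunk_by_title(sample):
    print(chunk["page_idx"], chunk["title"], len(chunk["texts"]))
```

&lt;p&gt;Each chunk keeps the page where its section starts, which is enough to cite back to the source after embedding.&lt;/p&gt;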

&lt;p&gt;The Markdown is for human-readable display. &lt;code&gt;middle.json&lt;/code&gt; has bounding boxes per span when you need page coordinates for hover-to-source UI.&lt;/p&gt;

&lt;p&gt;Transport options on the request: &lt;code&gt;tarball_b64&lt;/code&gt; (default) for outputs under ~20 MB, &lt;code&gt;inline&lt;/code&gt; if you want the markdown directly in the JSON response, &lt;code&gt;s3&lt;/code&gt; for anything that would exceed RunPod’s response cap. See &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/blog/runpod-20mb-response-cap-r2-bridge/" rel="noopener noreferrer"&gt;the R2 bridge post&lt;/a&gt; for the &lt;code&gt;s3&lt;/code&gt; setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  When does mineru-runpod fit your workload, and when doesn’t it?
&lt;/h2&gt;

&lt;p&gt;Good fit: batch ingest jobs, bursty traffic (50 docs in a minute, then quiet), background pipelines, OCR-API replacement. Poor fit: interactive single-document apps (cold starts make users think it’s broken), sparse traffic (one job per cold start dominates the bill), strict latency SLOs without provisioning &lt;code&gt;workers_min ≥ 1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I run this template in production for a document indexer. After six months of operation, here’s the honest fit picture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch ingest.&lt;/strong&gt; Drop 500 PDFs into a queue. One cold start amortizes across the whole batch at ~$0.001 per page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bursty traffic.&lt;/strong&gt; A user uploads 50 documents in a minute. One cold start, 49 warm parses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background pipelines.&lt;/strong&gt; Nightly cron processes yesterday’s intake. Cold start cost is rounding error against a multi-hour batch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OCR-API replacement.&lt;/strong&gt; Comparable per-page cost without shipping documents to a third party.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Poor fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interactive single-document parsing.&lt;/strong&gt; Your user uploads one PDF and waits two minutes for the cold start. They’ll think it’s broken.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sparse traffic (one job every 20–60 min).&lt;/strong&gt; Almost every request is a cold start, so the ~$0.03 cold-start tax dominates. Rent a permanent low-tier GPU pod instead and skip serverless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict latency SLOs.&lt;/strong&gt; Cold-start latency is partly outside your control. Provisioning &lt;code&gt;workers_min ≥ 1&lt;/code&gt; eliminates cold starts but you pay for the warm worker around the clock.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repo’s defaults (&lt;code&gt;workers_min=0&lt;/code&gt;, &lt;code&gt;idle_timeout=10s&lt;/code&gt;) are tuned for batch-with-bursts. The dashboard’s scaling settings are where you tune for other patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s the real cold-start cost on RunPod Serverless?
&lt;/h2&gt;

&lt;p&gt;Roughly &lt;strong&gt;110 seconds&lt;/strong&gt; of billed GPU time on your first request after a scale-from-zero. The main phases: ~3 s fitness checks, ~20 s vLLM engine config, ~20 s model load, ~25 s torch.compile, ~5 s CUDA graph capture, ~5 s of actual parse. Billed at ~$1.10/hr on the 4090 default, that’s roughly &lt;strong&gt;$0.03 per cold start&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The per-phase breakdown is documented in the &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/troubleshooting/" rel="noopener noreferrer"&gt;troubleshooting guide&lt;/a&gt; if you want to see where the time goes. The boot-time warmup in this template loads MinerU’s model and JIT-compiles vLLM kernels &lt;em&gt;before&lt;/em&gt; the worker accepts requests. When RunPod’s FlashBoot snapshot is available on a subsequent scale-from-zero, the wall-clock drops to ~7–8 seconds because the snapshot captured a warm process. When the snapshot isn’t available (new host, image rebuild), warmup re-runs and you pay the full ~110 s again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sergeyshmakov.github.io/mineru-runpod/blog/runpod-flashboot-mechanism-investigation/" rel="noopener noreferrer"&gt;The FlashBoot mechanism investigation&lt;/a&gt; covers when the fast path applies, with measured numbers across multiple consecutive cold starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What should I watch out for before going to production?
&lt;/h2&gt;

&lt;p&gt;Four production gotchas the marketing won’t mention. The 20 MB response cap silently drops large outputs (symptom: &lt;code&gt;NoneType&lt;/code&gt; after a successful parse; fixed by the R2 bridge). &lt;code&gt;execution_timeout&lt;/code&gt; defaults to 900 s and won’t cover full books. &lt;code&gt;file_b64&lt;/code&gt; inline payloads cap around 10 MB on the way in. Cold-start economics decide whether the per-page price holds. None of these crash the worker; the first three manifest as confusing client-side errors.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;20 MB response cap.&lt;/strong&gt; RunPod’s &lt;code&gt;/runsync&lt;/code&gt; gateway drops responses over ~20 MB. Multi-page parses with embedded images hit this around 50–80 pages. Worker logs &lt;code&gt;done&lt;/code&gt;; client gets &lt;code&gt;NoneType&lt;/code&gt;. Fix: &lt;code&gt;return: "s3"&lt;/code&gt; + Cloudflare R2, walked through in &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/blog/runpod-20mb-response-cap-r2-bridge/" rel="noopener noreferrer"&gt;the R2 bridge post&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-job timeout.&lt;/strong&gt; The repo defaults to &lt;code&gt;execution_timeout=900s&lt;/code&gt; (good for ~150–300 pages on the 4090). A 5,000-page book takes 80–500 minutes depending on content density. Bump &lt;code&gt;execution_timeout&lt;/code&gt; for long jobs; the endpoint upper limit is 24 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inline payload cap on the way in.&lt;/strong&gt; &lt;code&gt;file_b64&lt;/code&gt; requests cap around 10 MB. For bigger files, pass &lt;code&gt;file_url&lt;/code&gt; and let the worker fetch from your storage. R2 public dev URLs work well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold-start economics.&lt;/strong&gt; “Pennies per page” depends on amortization. Track average pages per cold start in your logs. If it’s under 30, bump &lt;code&gt;idle_timeout&lt;/code&gt; or run &lt;code&gt;workers_min=1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
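
&lt;p&gt;For the long-job timeout, a rough way to size &lt;code&gt;execution_timeout&lt;/code&gt; is to multiply the page count by a pessimistic per-page time. A sketch using the slow end of the ~1–6 s/page warm range from this post and the 24-hour endpoint ceiling. The function name and the 25% safety margin are assumptions, not something the template ships:&lt;/p&gt;

```python
import math

MAX_TIMEOUT_S = 24 * 3600  # endpoint upper limit of 24 hours


def suggest_execution_timeout(pages, seconds_per_page=6.0, margin=1.25):
    """Return a suggested execution_timeout in seconds, capped at the endpoint limit."""
    estimate = pages * seconds_per_page * margin
    return min(math.ceil(estimate), MAX_TIMEOUT_S)


print(suggest_execution_timeout(200))   # a mid-sized report
print(suggest_execution_timeout(5000))  # a full book at the slow end of the range
```

&lt;p&gt;If your documents are mostly light text, lower &lt;code&gt;seconds_per_page&lt;/code&gt; toward the fast end of the range instead of trusting the default.&lt;/p&gt;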

&lt;h2&gt;
  
  
  Where to next
&lt;/h2&gt;

&lt;p&gt;The repo ships with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Typed Python client (&lt;code&gt;MineruClient&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deploy.py&lt;/code&gt; / &lt;code&gt;destroy.py&lt;/code&gt; for endpoint lifecycle automation&lt;/li&gt;
&lt;li&gt;Reference adapter pattern for wrapping MinerU output into domain models&lt;/li&gt;
&lt;li&gt;96 unit tests, CI on every PR&lt;/li&gt;
&lt;li&gt;Commitlint + semantic-release for automated CHANGELOG / GitHub Releases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the deeper context that didn’t fit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sergeyshmakov.github.io/mineru-runpod/blog/runpod-flashboot-mechanism-investigation/" rel="noopener noreferrer"&gt;How RunPod FlashBoot actually works&lt;/a&gt; — four-request investigation into the cold-start mechanism and the per-host snapshot caveat.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sergeyshmakov.github.io/mineru-runpod/blog/runpod-20mb-response-cap-r2-bridge/" rel="noopener noreferrer"&gt;The R2 bridge for the 20 MB response cap&lt;/a&gt; — fix for &lt;code&gt;NoneType&lt;/code&gt; on multi-page outputs.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/choosing-gpu/" rel="noopener noreferrer"&gt;Choosing a GPU&lt;/a&gt; — when 24 GB is enough, when to opt up to 48 GB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this saved you time, the easiest way to say thanks is &lt;a href="https://runpod.io?ref=31jdfpnq" rel="noopener noreferrer"&gt;signing up for RunPod through this link&lt;/a&gt;. Star the &lt;a href="https://github.com/sergeyshmakov/mineru-runpod" rel="noopener noreferrer"&gt;repo on GitHub&lt;/a&gt; for updates.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How does mineru-runpod compare to hosted PDF APIs?
&lt;/h3&gt;

&lt;p&gt;Per-page cost is in the same ballpark ($0.001–$0.004) when amortizing cold starts across reasonable batches. The differences are control and lock-in. You deploy your own RunPod endpoint, pick your GPU and concurrency, run whichever MinerU version you want, and never send documents to a third party. The trade-off is operating a serverless template instead of consuming a managed API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can MinerU 2.5 handle non-English PDFs?
&lt;/h3&gt;

&lt;p&gt;Yes. The &lt;code&gt;vlm-auto-engine&lt;/code&gt; default backend handles English and Chinese well per &lt;a href="https://huggingface.co/opendatalab/MinerU2.5-Pro-2605-1.2B" rel="noopener noreferrer"&gt;the model card&lt;/a&gt;. For other scripts (Cyrillic, Arabic, Devanagari, Japanese, Korean), the &lt;code&gt;pipeline&lt;/code&gt; backend uses PaddleOCR with script-family models, covering 109 languages. Empirically the Pro VLM also handles Cyrillic correctly even though &lt;code&gt;lang&lt;/code&gt; is ignored on the VLM path. Switch backends per-request via the &lt;code&gt;backend&lt;/code&gt; field.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the difference between &lt;code&gt;vlm-auto-engine&lt;/code&gt;, &lt;code&gt;pipeline&lt;/code&gt;, and &lt;code&gt;hybrid-auto-engine&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;vlm-auto-engine&lt;/code&gt; uses MinerU’s 1.2B VLM via &lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt;. Fastest on English / Chinese, ~1–6 s/page warm. &lt;code&gt;pipeline&lt;/code&gt; uses PaddleOCR plus dedicated layout / formula / table models. Slower (~3–5 s/page) but more memory-predictable (4 GB minimum VRAM) and covers 109 languages. &lt;code&gt;hybrid-auto-engine&lt;/code&gt; routes each page through either backend based on content. Highest quality on mixed-content docs; needs 48 GB on dense layouts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does the per-page cost include the cold-start tax?
&lt;/h3&gt;

&lt;p&gt;No. The ~$0.001 per page is warm-worker math. Each scale-from-zero adds a roughly $0.03 fixed cost on the 4090 default. Your effective per-page cost is &lt;code&gt;((0.001 × pages) + (0.03 × cold_starts)) / pages&lt;/code&gt;. For 100 pages across one cold start, that’s $0.0013 per page. For 10 pages, it’s $0.004.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use mineru-runpod with my own MinerU model?
&lt;/h3&gt;

&lt;p&gt;Yes. Fork the repo and update the Dockerfile’s &lt;code&gt;huggingface_hub.snapshot_download&lt;/code&gt; call to point at your model. Rebuild and redeploy. The handler is model-agnostic; MinerU’s &lt;code&gt;aio_do_parse&lt;/code&gt; resolves whatever model is in &lt;code&gt;HF_HOME&lt;/code&gt; at runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  What GPU does the template default to?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ADA_24&lt;/code&gt; (RTX 4090, 24 GB). Switched from &lt;code&gt;AMPERE_24&lt;/code&gt; (A5000) on 2026-05-26 after measuring per-page cost. The 4090 is 2–4× faster per page than the A5000 and cheaper per page despite the higher hourly rate. See &lt;a href="https://sergeyshmakov.github.io/mineru-runpod/guides/choosing-gpu/" rel="noopener noreferrer"&gt;Choosing a GPU&lt;/a&gt; for the full math and when to opt up to 48 GB.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I keep my RunPod endpoint warm to avoid cold starts?
&lt;/h3&gt;

&lt;p&gt;Set &lt;code&gt;workers_min=1&lt;/code&gt; on the endpoint. You pay for the always-on worker around the clock (~$0.000306/s on the 4090 default, so ~$26/day or ~$800/month). Worth it if your traffic is steady enough that the warm worker stays busy, or if your latency SLO can’t tolerate the cold-start window. For bursty traffic, &lt;code&gt;workers_min=0&lt;/code&gt; with FlashBoot enabled is usually cheaper.&lt;/p&gt;
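
&lt;p&gt;The always-on figures are plain unit conversion from the hourly rate. A small sketch, assuming the ~$1.10/hr rate quoted in this post also applies to an always-on worker and that a month has 30 days. Both are assumptions, so check current RunPod pricing before budgeting:&lt;/p&gt;

```python
HOURLY_RATE_USD = 1.10  # approximate RTX 4090 rate quoted in this post


def always_on_cost(hourly_rate=HOURLY_RATE_USD, days_per_month=30):
    """Return (per_second, per_day, per_month) USD cost of one always-on worker."""
    per_second = hourly_rate / 3600
    per_day = hourly_rate * 24
    per_month = per_day * days_per_month
    return per_second, per_day, per_month


per_second, per_day, per_month = always_on_cost()
print(f"${per_second:.6f}/s, ${per_day:.2f}/day, ${per_month:.0f}/month")
```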




&lt;p&gt;&lt;small&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; RunPod links in this post use a referral code that credits me at no cost to you. The post would read the same without it.&lt;/small&gt;&lt;/p&gt;

</description>
      <category>pdf</category>
      <category>runpod</category>
      <category>serverless</category>
      <category>mineru</category>
    </item>
  </channel>
</rss>
