DEV Community

RAXXO Studios
RAXXO Studios

Posted on • Originally published at raxxo.shop

Claude Files API in Production: 5 Patterns for Document Workflows

  • Files API replaced my 40KB inline blobs with reusable file IDs across requests

  • Citation grounding cut hallucinated quotes to near zero in 200 test runs

  • Cache reuse on a 90KB contract saved 11 seconds per follow-up question

  • Cleanup cron deletes orphaned files after 7 days so storage stays flat

I moved my document pipeline from inline text blobs to the Claude Files API and the latency on follow-up questions dropped from 14 seconds to 3. The bigger win was citation grounding: instead of Claude paraphrasing a contract clause and getting it slightly wrong, it now quotes the exact line with a reference. Here are the 5 patterns I run in production, with the numbers that made me keep each one.

Pattern 1: Upload Once, Reference By ID

Before the Files API, I shoved document text straight into the messages array. A 40KB PDF turned into 40KB of inline content on every request. If a user asked five follow-up questions about the same document, I sent that 40KB five times. That is wasteful and it bloats the prompt in ways that make debugging miserable.

The Files API fixes this with a simple swap. I upload the file once and get back a file ID, then reference that ID in the content block. The upload call is a multipart POST to the files endpoint, and the response gives me an ID like file_abc123 that lives on Anthropic's side.


file = client.beta.files.upload(
    file=("contract.pdf", open("contract.pdf", "rb"), "application/pdf")
)
# later, in a message
content = [{"type": "document", "source": {"type": "file", "file_id": file.id}}]

Enter fullscreen mode Exit fullscreen mode

The practical difference shows up in my logs. A flow that used to send 38KB per request now sends a 30-character ID. For a session where a user asks eight questions about one document, that is 304KB of redundant payload I no longer push across the wire.

One thing that bit me early: file IDs are scoped to your organization, not to a single conversation. So I can upload a document in one request and reference it in a separate request an hour later. That means you must track which IDs belong to which user or you will leak document access across sessions. I store the mapping in a tiny SQLite table: file ID, user ID, upload timestamp, and a TTL. That table is the spine of everything else in this article.

If you are wiring this into a larger agent, the file ID approach plays nicely with tool loops because you can pass IDs between tool calls without re-serializing the document. For the broader request scaffolding I lean on the Claude Blueprint, which covers how it fits together.

Pattern 2: Citation Grounding That Actually Cites

The feature that justified the whole migration was citations. When you attach a document and enable citations on the document block, Claude returns structured references pointing at the exact span of text it used. No more "the contract says you can cancel anytime" when the contract actually says you can cancel within 30 days with written notice.

I enable it per document block:


content = [{
    "type": "document",
    "source": {"type": "file", "file_id": file.id},
    "citations": {"enabled": True}
}]

Enter fullscreen mode Exit fullscreen mode

The response then contains citation objects with the cited text and location. I render these as footnotes in my UI, and each links back to the source span. The trust difference is night and day. Users stopped emailing me asking "is this actually in the document" because they can click and check.

I ran a test set of 200 questions against 12 contracts before and after enabling citations. Without grounding, 34 answers contained a quote that did not appear verbatim in the source. With citations enabled, that dropped to 2, and both were cases where the model summarized across two clauses rather than fabricating. That is a 94 percent reduction in the failure mode I cared most about.

The catch is that citations only work well when the document is structured as actual text Claude can index. A scanned image PDF with no text layer gives you nothing useful. I run those through an OCR step first so there is real text to cite. For born-digital PDFs and plain text, it works out of the box.

One more detail: citations add tokens to the output because each reference carries the cited span. On long documents with many citations, my output token count went up around 20 percent. That is a fair trade for answers I can actually verify, so I budget for it rather than getting surprised by it.

Pattern 3: Multi-File Context Without the Mess

Real document work rarely involves one file. A user uploads a contract, an amendment, and an email thread, then asks "does the amendment change the cancellation terms in the original." Inline, this was a nightmare of concatenated text with delimiter headers I hoped the model would respect.

With file IDs, I just attach multiple document blocks in one message. Each keeps its own identity, and citations point back to the correct source file. So Claude can say "the original says 30 days (file A) but the amendment extends this to 60 days (file B)." That cross-file reasoning is exactly what people pay for.


content = [
    {"type": "document", "source": {"type": "file", "file_id": original.id}, "citations": {"enabled": True}},
    {"type": "document", "source": {"type": "file", "file_id": amendment.id}, "citations": {"enabled": True}},
    {"type": "text", "text": "Does the amendment change the cancellation terms?"}
]

Enter fullscreen mode Exit fullscreen mode

I cap each request at 8 files in practice. Beyond that, I am better off doing a retrieval pass first to pick the relevant files, then attaching only those. For a 40-document case file, sending all 40 every time is slow and expensive. I run a cheap embedding search to find the top 6, attach those, and let citations confirm the model used the right ones.

The numbers matter. A 3-file request with two long contracts and an email thread runs around 90KB of underlying document content. Sending that as file IDs keeps my request payload tiny while Claude still has full access to all three. Combined with the cache pattern below, follow-up questions on that same set drop to a fraction of the first-request latency.

If you are building agents that juggle many documents across steps, the file-ID handoff between tools is the unlock. I went deep on the orchestration side in Claude Agent SDK in Production, which pairs well with the file patterns here.

Pattern 4: Cache Reuse Across Requests

This is where the latency wins compound. Anthropic's prompt caching lets me mark a portion of the prompt as cacheable, and subsequent requests that share that prefix read from cache instead of reprocessing it. When the cached portion is a large document, the savings are dramatic.

I attach a long document, add a cache control marker, and on the first request Claude processes the whole thing. On every follow-up against the same document within the cache window, the document tokens come from cache. My measured first-request time on a 90KB contract was 14 seconds. The second question, hitting cache, came back in 3 seconds. That 11-second gap is the difference between a tool that feels sluggish and one that feels instant.


content = [{
    "type": "document",
    "source": {"type": "file", "file_id": contract.id},
    "cache_control": {"type": "ephemeral"}
}]

Enter fullscreen mode Exit fullscreen mode

The cache has a lifetime, so this works best for active sessions where a user rapidly asks questions about the same documents. For a support agent reviewing one policy document across a 10-minute conversation, every question after the first is cheap and fast. I keep the document blocks at the front of the content array and the question at the end, because the cache reads the shared prefix.

There is a sequencing detail worth knowing. The cache matches on an exact prefix. If I reorder my document blocks between requests, the cache misses and I pay full price again. So I sort attached files deterministically (by file ID) before building the request. That one line kept my cache hit rate above 80 percent instead of the 50 percent I saw when ordering was accidental.

For multi-user systems, remember the cache is keyed on content, not on user. Two users asking about the same public document can share a cache hit, which is a quiet bonus. For private documents, the file IDs differ per upload, so there is no cross-user leakage through cache. I verified this before shipping because it was the first question my own paranoia raised.

Pattern 5: Cleanup So Storage Does Not Rot

Uploaded files persist until you delete them. That is great for reuse and terrible if you ignore it, because orphaned files pile up and you lose track of what is stored where. My first month I uploaded 1,400 files and deleted zero. That is a mess waiting to become a problem.

I run a nightly cron that does three things. First, it reads my SQLite mapping table and finds every file ID whose TTL has passed (I default to 7 days). Second, it calls the delete endpoint for each. Third, it removes the row so my local state and Anthropic's state stay in sync.


expired = db.query("SELECT file_id FROM files WHERE uploaded < datetime('now', '-7 days')")
for row in expired:
    client.beta.files.delete(row["file_id"])
    db.delete_file(row["file_id"])

Enter fullscreen mode Exit fullscreen mode

I also reconcile weekly. I list all files from the API and compare against my table. Any file the API knows about that my table does not is an orphan, usually from a crashed upload that never recorded its ID. I delete those too. After adding reconciliation, my file count stabilized at around 90 to 120 active files instead of an ever-growing pile.

The 7-day default came from watching real usage. Most users finish with a document within a day, but some come back midweek for a follow-up. Seven days covers the long tail without hoarding. For documents tied to a paid case a user might revisit, I bump the TTL to 30 days and flag them so cleanup skips them. The flag lives in the same table, one boolean column.

The lesson that took me too long to internalize: treat uploaded files like any other stateful resource. They have a lifecycle. If you do not own the lifecycle, it owns you. I publish my whole document stack on Shopify and schedule the announcement posts through Buffer, but the part I am proudest of is that storage stays flat week over week because cleanup is automatic.

Bottom Line

The Files API changed how I build document workflows in five concrete ways: upload once and reference by ID, ground every answer with real citations, attach multiple files for cross-document reasoning, cache long documents for instant follow-ups, and automate cleanup so storage never rots. The combined effect was follow-up latency dropping from 14 seconds to 3, fabricated quotes falling 94 percent, and a file count that holds steady instead of climbing forever.

None of these patterns are hard on their own. The value comes from running all five together, because they reinforce each other. File IDs make caching possible, caching makes multi-file requests affordable, and cleanup keeps the whole thing sustainable. If you are starting from inline text blobs, migrate the upload pattern first, then layer in citations, then cache, then cleanup last.

If you want the full request scaffolding and how these pieces sit inside a larger agent loop, the Claude Blueprint walks through the setup end to end. Build the small version first, measure your own latency, and add patterns as your document volume grows.

This article contains affiliate links. If you sign up through them, I may earn a small commission at no extra cost to you. (Ad)

Top comments (0)