We still haven't figured out how to optimize binge-watching all 38 seasons of The Simpsons 🍿. Luckily, there's no real need to, but in the world of LLMs, it's a whole different story.
Here's how some providers handle prompt caching (minimal code sketches for each follow the list):
OpenAI: Auto-caches the longest matching prefix (after 1,024 tokens, then in 128-token chunks). No config needed; up to 80% lower latency and 50% lower input costs.
⚙️ Anthropic: Manual caching via headers (`anthropic-beta: prompt-caching-2024-07-31` + `cache_control`). Works only for exact prefix matches. Reads save ~90% of input cost; writes add ~25%.
🔧 AWS Bedrock: Opt-in with `EnablePromptCaching=true`, TTL of 5 minutes. Saves up to 90% on input and 85% on latency.
📦 Google Vertex: Manual `CachedContent`, billed by cached token-hours, up to 75% discount on reads, default TTL of 1 hour. More complex to manage.
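
Here is a minimal sketch of what relying on OpenAI's automatic caching looks like in practice: nothing to configure, the trick is keeping the long, static part of the prompt as a byte-identical prefix and putting the variable part last. The model name and prompt text are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Long, static prefix (>1,024 tokens in practice) that stays identical across calls.
STATIC_SYSTEM_PROMPT = "You are a support assistant. <several pages of policies go here>"

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # stable prefix -> cacheable
            {"role": "user", "content": question},                # variable suffix
        ],
    )
    # Recent SDK versions report cache hits under prompt_tokens_details.
    details = response.usage.prompt_tokens_details
    if details is not None:
        print("cached prompt tokens:", details.cached_tokens)
    return response.choices[0].message.content
```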
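
For Anthropic, a minimal sketch that marks the static part of the system prompt with `cache_control` (document text and model name are placeholders; the beta header from the post is passed via `extra_headers` and may no longer be required on newer API versions):

```python
import anthropic

client = anthropic.Anthropic()

LONG_DOCUMENT = "<several thousand tokens of reference material>"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {"type": "text", "text": "You answer questions about the attached document."},
        {
            "type": "text",
            "text": LONG_DOCUMENT,
            # Everything up to and including this block becomes the cached prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What are the key findings?"}],
)

# cache_creation_input_tokens = write (~+25%), cache_read_input_tokens = read (~-90%).
print(response.usage)
```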
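
For Bedrock, a minimal sketch using the Converse API, where the opt-in is expressed as a `cachePoint` block that marks the end of the reusable prefix. This assumes your chosen model and region support prompt caching; the model ID and prompt text are placeholders.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

LONG_CONTEXT = "<several thousand tokens of reference material>"

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    system=[
        {"text": LONG_CONTEXT},
        {"cachePoint": {"type": "default"}},  # everything before this point is cached (~5 min TTL)
    ],
    messages=[
        {"role": "user", "content": [{"text": "Summarize the key points."}]},
    ],
)

# Cache reads/writes are reported alongside the regular token usage.
print(response["usage"])
```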
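
And for Vertex AI, a minimal sketch that creates a `CachedContent` resource once and points a model at it for subsequent calls. Project, location, model name, and the cached text are placeholders, and the caching helpers sit under the preview namespace in the SDK versions I have seen, so treat the exact imports as an assumption:

```python
import datetime

import vertexai
from vertexai.preview import caching
from vertexai.preview.generative_models import Content, GenerativeModel, Part

vertexai.init(project="my-project", location="us-central1")

# Create the cache once; you pay per cached token-hour while it lives (TTL below is 1 hour).
cached = caching.CachedContent.create(
    model_name="gemini-1.5-pro-002",
    system_instruction="You answer questions about the attached report.",
    contents=[
        Content(role="user", parts=[Part.from_text("<a very large body of reference material>")]),
    ],
    ttl=datetime.timedelta(hours=1),
)

# Reuse the cache across many requests at the discounted input rate.
model = GenerativeModel.from_cached_content(cached_content=cached)
print(model.generate_content("What are the main risks?").text)
```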
If you're interested in new methods and possibilities with a stateful API for AI model inference, head over to ark-labs.cloud.
Give it a try at ark-labs.cloud 🚀