
Conrad Bogusz

Prompt caching vs Stateful approach

We still haven't figured out how to optimize binge-watching all 38 seasons of The Simpsons 🍿. Luckily, there's no real need to, but in the world of LLMs, it's a whole different story.

Hereโ€™s how some providers handle prompt caching:

✅ OpenAI: Auto-caches the longest matching prefix (after 1,024 tokens, then in 128-token chunks). No config needed; up to 80% lower latency and 50% lower input costs.
⚙️ Anthropic: Manual caching via headers (`anthropic-beta: prompt-caching-YYYY-MM-DD` + `cache_control`). Works only for exact prefix matches. Reads save ~90% of input cost; writes add ~25% (see the sketch below).
🔧 AWS Bedrock: Opt-in with `EnablePromptCaching=true`, TTL of 5 minutes. Saves up to 90% on input and 85% on latency.
📦 Google Vertex: Manual `CachedContent`, billed by token-hours, up to 75% discount on reads, TTL up to 1 hour. More complex to manage.
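
To make the Anthropic flow concrete, here's a minimal sketch using the Anthropic Python SDK. The model alias, the `LONG_SYSTEM_PROMPT` placeholder, and the single cache breakpoint are illustrative assumptions, not the only way to set it up:

```python
import anthropic

# Assumes ANTHROPIC_API_KEY is set in the environment.
client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # placeholder: a reusable prefix of 1,024+ tokens

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the end of the reusable prefix as a cache breakpoint.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the document above."}],
)

# The usage block reports how many input tokens were written to vs read from the cache.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```

On the first call the prefix is written to the cache (`cache_creation_input_tokens` > 0); later calls with an identical prefix read it back (`cache_read_input_tokens` > 0) at the discounted rate. With OpenAI the same effect is automatic: check `usage.prompt_tokens_details.cached_tokens` in the response to see how much of the prompt was served from cache.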

If you're more interested in a new method and the possibilities a STATEFUL API opens up for AI model inference, head over to ark-labs.cloud.

Give it a try at ark-labs.cloud 🚀
