We still haven't figured out how to optimize binge-watching all 38 seasons of The Simpsons 🍿. Luckily, there's no real need to, but in the world of LLMs, it's a whole different story.
Here's how some providers handle prompt caching (minimal code sketches for each follow the list):
OpenAI: Auto-caches the longest matching prefix (after 1,024 tokens, then in 128-token chunks). No config needed; up to 80% lower latency and 50% lower input costs.
⚙️ Anthropic: Manual caching via headers (`anthropic-beta: prompt-caching-2024-07-31` + `cache_control`). Works only for exact prefix matches. Reads save ~90% of input cost; writes add ~25%.
🔧 AWS Bedrock: Opt-in with `EnablePromptCaching=true`, TTL of 5 minutes. Saves up to 90% on input and 85% on latency.
📦 Google Vertex: Manual `CachedContent`, billed by cached token-hours, up to 75% discount on reads, default TTL of 1 hour. More complex to manage.
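
Here is a minimal sketch of what relying on OpenAI's automatic caching looks like in practice: nothing to configure, the trick is keeping the long, static part of the prompt as a byte-identical prefix and putting the variable part last. The model name and prompt text are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Long, static prefix (>1,024 tokens in practice) that stays identical across calls.
STATIC_SYSTEM_PROMPT = "You are a support assistant. <several pages of policies go here>"

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # stable prefix -> cacheable
            {"role": "user", "content": question},                # variable suffix
        ],
    )
    # Recent SDK versions report cache hits under prompt_tokens_details.
    details = response.usage.prompt_tokens_details
    if details is not None:
        print("cached prompt tokens:", details.cached_tokens)
    return response.choices[0].message.content
```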
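
For Anthropic, a minimal sketch that marks the static part of the system prompt with `cache_control` (document text and model name are placeholders; the beta header from the post is passed via `extra_headers` and may no longer be required on newer API versions):

```python
import anthropic

client = anthropic.Anthropic()

LONG_DOCUMENT = "<several thousand tokens of reference material>"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {"type": "text", "text": "You answer questions about the attached document."},
        {
            "type": "text",
            "text": LONG_DOCUMENT,
            # Everything up to and including this block becomes the cached prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What are the key findings?"}],
)

# cache_creation_input_tokens = write (~+25%), cache_read_input_tokens = read (~-90%).
print(response.usage)
```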
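
For Bedrock, a minimal sketch using the Converse API, where the opt-in is expressed as a `cachePoint` block that marks the end of the reusable prefix. This assumes your chosen model and region support prompt caching; the model ID and prompt text are placeholders.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

LONG_CONTEXT = "<several thousand tokens of reference material>"

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    system=[
        {"text": LONG_CONTEXT},
        {"cachePoint": {"type": "default"}},  # everything before this point is cached (~5 min TTL)
    ],
    messages=[
        {"role": "user", "content": [{"text": "Summarize the key points."}]},
    ],
)

# Cache reads/writes are reported alongside the regular token usage.
print(response["usage"])
```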
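
And for Vertex AI, a minimal sketch that creates a `CachedContent` resource once and points a model at it for subsequent calls. Project, location, model name, and the cached text are placeholders, and the caching helpers sit under the preview namespace in the SDK versions I have seen, so treat the exact imports as an assumption:

```python
import datetime

import vertexai
from vertexai.preview import caching
from vertexai.preview.generative_models import Content, GenerativeModel, Part

vertexai.init(project="my-project", location="us-central1")

# Create the cache once; you pay per cached token-hour while it lives (TTL below is 1 hour).
cached = caching.CachedContent.create(
    model_name="gemini-1.5-pro-002",
    system_instruction="You answer questions about the attached report.",
    contents=[
        Content(role="user", parts=[Part.from_text("<a very large body of reference material>")]),
    ],
    ttl=datetime.timedelta(hours=1),
)

# Reuse the cache across many requests at the discounted input rate.
model = GenerativeModel.from_cached_content(cached_content=cached)
print(model.generate_content("What are the main risks?").text)
```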
If you're interested in new methods and possibilities with a stateful API for AI model inference, head over to ark-labs.cloud.
Give it a try at ark-labs.cloud 🚀