What is Context Caching?
Google Gemini's Context Caching is a feature that caches context once it's input and reuses it in subsequent requests. Cached tokens can be used at approximately 25% of the standard rate, enabling significant cost savings.
Basic Specifications
- Cache Validity Period: Default 3,600 seconds (1 hour)
- Token Requirement: Minimum 32,768 tokens
- Model-Specific Isolation: Caches are managed in conjunction with the model name.
Use Cases
Large-Scale DB Analysis
When analyzing an SQLite DB built with FTS5+BM25, leveraging Context Caching can be highly effective.
- Keyword Extraction Phase: Fast processing with the Flash model
- Answer Generation Phase: High-accuracy analysis with the Pro model
- Fact-Checking Phase: Cache training data to reduce processing time
Batch Analysis
During batch processing, reusing the same context across all samples is effective.
from google import genai
from google.genai import caching
client = genai.Client()
# キャッシュ作成(32,768トークン以上が必要)
cached_content = caching.CachedContent.create(
client=client,
name="patent_batch_cache",
contents=[{"text": doc["title"]} for doc in patent_docs[:1000]],
model="gemini-2.5-flash",
config={"expire_time": "2026-03-09T00:00:00Z"}
)
# キャッシュを使った生成
for doc in patent_docs:
result = client.models.generate_content(
model="gemini-2.5-flash",
contents=doc["text"],
cached_content=cached_content.name
)
Cost Calculation Example
For training data of 75,458 tokens:
- Without caching: Standard rate
- With caching: Approximately 25% of the standard rate
When the same context is repeatedly used in batch processing, the effect of caching becomes very significant.
Limitations and Considerations
- A minimum of 32,768 tokens is required when creating a cache.
- Caches are isolated per model, so sharing caches between different models is not possible.
- It is important to set the expiration time to match the analysis processing time.
Summary
By appropriately utilizing Context Caching, you can significantly reduce API costs for large-scale data analysis and shorten processing times. The key to success is to use the Flash model for keyword extraction, the Pro model for answer generation, and to reuse the same context during batch processing.
Top comments (0)