128K context sounds great, until your prompts cost $2 each. Here's how to optimize tokens and process massive documents for pennies.
You got access to 128K context. Excited, you paste your entire codebase. Then you check the bill.
100K tokens per request × $2.80/1M tokens = $0.28 per call. Not bad for one request. But 1,000 calls? $280.
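To sanity-check that arithmetic, here is a minimal sketch. The function name `estimate_cost` and its parameters are illustrative, and the $2.80 per 1M tokens rate is simply the figure quoted above, not a price taken from any provider's documentation.

```python
def estimate_cost(tokens_per_call: int, price_per_million: float, calls: int = 1) -> float:
    """Return the estimated spend in dollars for `calls` requests."""
    return tokens_per_call / 1_000_000 * price_per_million * calls


per_call = estimate_cost(100_000, 2.80)
per_thousand = estimate_cost(100_000, 2.80, calls=1000)
print(f"${per_call:.2f} per call, ${per_thousand:.2f} per 1000 calls")
```

Swapping in your own token counts and rates shows quickly whether a workload is worth optimizing at all.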
Here's how to process massive documents smarter.
1. Trim Before Sending
def trim_context(text, max_chars=4000):
    """Keep only what matters."""
    # Collapse all runs of whitespace (spaces, tabs, newlines) into single spaces
    text = " ".join(text.split())
    # Truncate and append a marker so the model knows the text was cut
    if len(text) > max_chars:
        text = text[:max_chars] + "...[truncated]"
    return text
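Savings: around 60% fewer tokens on verbose documents. The exact figure depends on how much whitespace the text carries and how much gets truncated.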
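One way to check what trimming actually saves on your own documents is to compare rough token counts before and after. The sketch below assumes about 4 characters per token, which is only a common rule of thumb for English text. `estimate_tokens` and `trim_savings` are hypothetical helpers, and a real tokenizer would give exact counts.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: ~4 characters per token)."""
    return max(1, round(len(text) / chars_per_token))


def trim_savings(original: str, trimmed: str) -> float:
    """Return the fraction of estimated tokens removed by trimming."""
    before = estimate_tokens(original)
    after = estimate_tokens(trimmed)
    return 1 - after / before


raw = "Line   one\n\n\n   line    two\t\t end   "
collapsed = " ".join(raw.split())  # same whitespace collapse as trim_context
print(f"{trim_savings(raw, collapsed):.0%} fewer estimated tokens")
```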
2. Chunk + Summarize (RAG-Lite)
from openai import OpenAI

# Assumption: an OpenAI-compatible endpoint, configured through the
# OPENAI_API_KEY and OPENAI_BASE_URL environment variables.
client = OpenAI()

def chunk_and_summarize(text, user_question, chunk_size=2000):
    """Split large docs, summarize each chunk, then answer from the summaries."""
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    summaries = []
    for chunk in chunks:
        summary = client.chat.completions.create(
            model="deepseek-v4-flash",
            messages=[{"role": "user", "content": f"Summarize:\n{chunk}"}],
            max_tokens=100  # ← Cap response
        )
        summaries.append(summary.choices[0].message.content)
    # Combine summaries into final answer
    combined = " ".join(summaries)
    return client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[{"role": "user", "content": f"Based on: {combined}\n\nAnswer: {user_question}"}]
    )
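Fixed-size slicing can cut a sentence or word in half at a chunk boundary. A possible refinement, not part of the function above, is to pack whole paragraphs into chunks up to the size limit. `chunk_by_paragraph` is a hypothetical helper, and it assumes paragraphs are separated by blank lines.

```python
def chunk_by_paragraph(text: str, chunk_size: int = 2000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most chunk_size chars.

    A single paragraph longer than chunk_size is hard-split as a fallback.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # Hard-split oversized paragraphs so no piece exceeds the limit.
        pieces = [para[i:i + chunk_size] for i in range(0, len(para), chunk_size)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


doc = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(chunk_by_paragraph(doc, chunk_size=40))
```

The returned list could replace the `chunks` list comprehension, with the rest of the summarize loop unchanged.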
3. Use Cheaper Models for Preprocessing
| Task | Model | Cost |
|---|---|---|
| Summarize chunks | glm-4-flash | $0.10/1M |
| Final answer | deepseek-v4-pro | $1.40/1M |
Strategy: Flash model for the heavy lifting, Pro model only for the final polish.
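The routing idea can be written down as a small lookup. This is a sketch only: the task names, the `MODEL_FOR_TASK` table, `pick_model`, and `task_cost` are made up for illustration, while the model names and prices are the ones listed in the table above.

```python
# Model name and price per 1M tokens (USD), copied from the table above.
MODEL_FOR_TASK = {
    "summarize_chunk": ("glm-4-flash", 0.10),
    "final_answer": ("deepseek-v4-pro", 1.40),
}


def pick_model(task: str) -> str:
    """Return the model to use for a task, defaulting to the cheap one."""
    model, _price = MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["summarize_chunk"])
    return model


def task_cost(task: str, tokens: int) -> float:
    """Estimate the dollar cost of sending `tokens` tokens for a known task."""
    _model, price = MODEL_FOR_TASK[task]
    return tokens / 1_000_000 * price


print(pick_model("summarize_chunk"), pick_model("final_answer"))
print(f"${task_cost('summarize_chunk', 2_000_000):.2f}")
```

Defaulting unknown tasks to the cheaper model is a design choice here: an unexpected task name costs little rather than silently running on the expensive tier.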
The Math
| Approach | Tokens (100K doc × 1000 calls) | Cost |
|---|---|---|
| Naive (pro model, full text) | 100M tokens | $280 |
| Trimmed + flash preprocess | 20M + 10M tokens | $32 |
| Savings | | 89% |
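The savings figure follows directly from the two cost totals. A quick check, using only the dollar amounts from the table (`savings_percent` is an illustrative name):

```python
def savings_percent(baseline_cost: float, optimized_cost: float) -> float:
    """Return the percentage saved relative to the baseline cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - optimized_cost) / baseline_cost * 100


print(f"{savings_percent(280, 32):.0f}%")  # prints "89%"
```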
Try it free: aibridge-api.com — 14 models, one API.



