ARE EXTRA SERVICES BURNING YOUR TOKENS?
I built an autonomous AI trading agent that runs 24/7, scanning hundreds of crypto and traditional finance markets, analyzing technical indicators, and executing trades when signals align.
It worked perfectly, until I checked the bill.
The Problem: Silent Token Bleeding
Every 60 seconds, the system called an expensive AI model to analyze potentially dozens of markets simultaneously. Even when markets were quiet or showing weak signals, it still ran full AI analysis.
| Metric | Before |
|---|---|
| AI calls/day | 7,200 |
| Daily token cost | $8-$52 |
| Monthly cost | $240-$600 |
I was paying premium AI prices for noise.
The Architecture (What Changed)
Before: Scan → Trigger → AI Research → Execute
(every trigger burns tokens)
After: Scan → Trigger → TA Filter (cheap) → AI Research → Execute
(only CONFIRMED signals reach the AI)
The key insight: use expensive AI as a last resort, not a first step.
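As a rough illustration of that ordering, here is a minimal sketch of a cheap gate in front of an expensive call. The names `run_cycle`, `cheap_score`, and `expensive_ai` are hypothetical stand-ins, not the agent's real functions, and the AI call is stubbed so the example runs offline.

```python
from typing import Callable, Iterable


def run_cycle(
    signals: Iterable[dict],
    cheap_score: Callable[[dict], float],
    expensive_ai: Callable[[dict], str],
    threshold: float = 65.0,
) -> list[str]:
    """Call the expensive model only for signals that pass the cheap filter."""
    decisions = []
    for signal in signals:
        if cheap_score(signal) < threshold:
            continue  # filtered out by plain math; no tokens spent
        decisions.append(expensive_ai(signal))
    return decisions


# Stub standing in for the paid model call; it records each invocation.
ai_calls: list[str] = []


def fake_ai(signal: dict) -> str:
    ai_calls.append(signal["market"])
    return f"research:{signal['market']}"


# Hypothetical pre-scored signals, for illustration only.
demo_signals = [
    {"market": "BTC", "score": 82.0},
    {"market": "ETH", "score": 40.0},
    {"market": "SOL", "score": 65.0},
]
demo_decisions = run_cycle(demo_signals, lambda s: s["score"], fake_ai)
```

In this toy run only two of the three signals reach the stubbed model; the weak one costs nothing.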
Layer 1: Skill System Purge
My agent loaded 100+ skills on every turn: ASCII art generators, pixel art tools, Minecraft server managers, smart home controllers. None of them were relevant to autonomous trading.
Fix: Removed 16 unnecessary categories (~80+ skills). System prompt shrank 50-60%. Each turn saves ~2,000+ tokens.
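The idea can be sketched as an allow-list over skill categories. This is an assumption-heavy illustration: the `Skill` record, the category names, and the manifest below are all hypothetical, and Hermes Agent's actual skill-loading mechanism may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    category: str
    prompt_text: str  # text this skill contributes to the system prompt


def select_skills(manifest: list[Skill], allowed: set[str]) -> list[Skill]:
    """Keep only skills whose category is on the allow-list."""
    return [s for s in manifest if s.category in allowed]


def prompt_chars(skills: list[Skill]) -> int:
    """Characters the selected skills add to every turn's system prompt."""
    return sum(len(s.prompt_text) for s in skills)


# Hypothetical manifest, for illustration only.
manifest = [
    Skill("ta_indicators", "trading", "Compute EMA, RSI, ADX, ATR."),
    Skill("order_router", "trading", "Place and cancel orders."),
    Skill("ascii_art", "creative", "Render text as ASCII art banners."),
    Skill("minecraft_admin", "gaming", "Manage a Minecraft server."),
]

kept = select_skills(manifest, allowed={"trading"})
saved_chars = prompt_chars(manifest) - prompt_chars(kept)
```

An allow-list, rather than a block-list, means a newly installed skill stays out of the prompt until someone decides it belongs there.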
Layer 2: Pre-AI Technical Analysis Filter
Built a statistical pre-filter that runs multi-timeframe technical analysis:
- EMA crossovers (1h/4h/1d)
- RSI, ADX, ATR calculations
- Volume confirmation
- Trend alignment scoring
All pure computation. Zero AI cost. Only signals that score ≥65/100 are marked CONFIRMED and proceed to AI analysis.
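To make the filter concrete, here is a simplified sketch of such a scorer. The EMA and RSI formulas are the standard ones (RSI with Wilder smoothing), but the weights, the 9/21 EMA periods, the RSI band, and the WEAK cutoff are assumptions chosen for illustration; only the ≥65 CONFIRMED threshold comes from the post. ADX and ATR are left out to keep it short.

```python
def ema(values: list[float], period: int) -> float:
    """Exponential moving average, seeded with the first value."""
    alpha = 2.0 / (period + 1)
    result = values[0]
    for value in values[1:]:
        result = alpha * value + (1 - alpha) * result
    return result


def rsi(closes: list[float], period: int = 14) -> float:
    """Relative Strength Index with Wilder smoothing."""
    if len(closes) <= period:
        raise ValueError("need more than `period` closes")
    deltas = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(d, 0.0) for d in deltas]
    losses = [max(-d, 0.0) for d in deltas]
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    for gain, loss in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + gain) / period
        avg_loss = (avg_loss * (period - 1) + loss) / period
    if avg_loss == 0:
        return 100.0  # no losses in the window
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)


def score_long_signal(closes_by_tf: dict[str, list[float]], volumes: list[float]) -> float:
    """Score a long setup out of 100 using hypothetical weights."""
    score = 0.0
    for timeframe in ("1h", "4h", "1d"):  # trend alignment: 20 points each
        closes = closes_by_tf[timeframe]
        if ema(closes, 9) > ema(closes, 21):
            score += 20.0
    if 50.0 <= rsi(closes_by_tf["1h"]) <= 70.0:  # momentum, not overbought
        score += 20.0
    prior = volumes[:-1]
    if volumes[-1] > sum(prior) / len(prior):  # volume confirmation
        score += 20.0
    return score


def classify(score: float, confirmed_at: float = 65.0, weak_at: float = 40.0) -> str:
    """Map a score to a label; `weak_at` is an assumed cutoff."""
    if score >= confirmed_at:
        return "CONFIRMED"
    return "WEAK" if score >= weak_at else "REJECTED"


# Synthetic steadily rising closes and a volume spike, for illustration only.
rising = [100.0 + i for i in range(30)]
demo_score = score_long_signal({"1h": rising, "4h": rising, "1d": rising}, [10.0] * 5 + [20.0])
demo_label = classify(demo_score)
```

On the synthetic data all three timeframes align and volume confirms, while RSI sits at its ceiling and earns no points, so the signal clears the 65 bar without a perfect score.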
Layer 3: Systematic Tuning
| Setting | Before | After |
|---|---|---|
| Scan interval | 60s | 3min |
| Trigger threshold | 75 | 80 |
| Max AI per cycle | 5 | 2 |
| Max tokens/call | 2048 | 1024 |
| News fetch | Every call | Removed |
| System prompt | ~2,200 chars | ~1,645 chars |
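One way to keep these knobs in one place is a small config object. The sketch below is hypothetical (the class and field names are mine, not the agent's), and it assumes the worst case where every scan cycle uses its full AI-call allowance.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass(frozen=True)
class TuningConfig:
    scan_interval_s: int
    trigger_threshold: int
    max_ai_calls_per_cycle: int
    max_tokens_per_call: int
    fetch_news: bool

    def max_ai_calls_per_day(self) -> int:
        """Upper bound on AI calls if every cycle hits its cap."""
        cycles_per_day = SECONDS_PER_DAY // self.scan_interval_s
        return cycles_per_day * self.max_ai_calls_per_cycle


# Values taken from the table above.
BEFORE = TuningConfig(60, 75, 5, 2048, True)
AFTER = TuningConfig(180, 80, 2, 1024, False)

ceiling_before = BEFORE.max_ai_calls_per_day()
ceiling_after = AFTER.max_ai_calls_per_day()
reduction_pct = round((1 - ceiling_after / ceiling_before) * 100)
```

Under that worst-case assumption the ceilings come out to 7,200 and 960 calls per day, which lines up with the before/after call counts reported in this post.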
The Results
| Metric | Before | After | Reduction |
|---|---|---|---|
| AI calls/day | 7,200 | ~960 | 87% |
| Daily cost | $8-$52 | $3-$10 | 80%+ |
| System prompt | 2,200 chars | 1,645 chars | 25% |
| Monthly cost | $240-$600 | $90-$300 | $150-$300 saved |
The Architecture That Emerged
The system now has five distinct layers, each reducing load for the next:
- Heartbeat (every 3 min) — scans 230+ markets
- Trigger Engine — fires statistical signals (price spikes, volume, breakouts)
- TA Filter — multi-TF analysis, scores signals CONFIRMED/WEAK/REJECTED
- AI Research (only CONFIRMED) — deep analysis with reasoning
- Risk Gates — 10 independent compliance checks before execution
The AI model is deployed only when statistical analysis has already validated the opportunity.
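The final layer, independent risk gates, can be sketched as a set of named predicates that must all pass. The three gates and their limits below are invented for illustration; the post does not list the actual ten checks.

```python
from typing import Callable, NamedTuple


class Order(NamedTuple):
    market: str
    notional_usd: float
    leverage: float


Gate = Callable[[Order], bool]

# Hypothetical gates and limits, for illustration only.
GATES: dict[str, Gate] = {
    "max_notional": lambda o: o.notional_usd <= 1_000.0,
    "max_leverage": lambda o: o.leverage <= 3.0,
    "allowed_market": lambda o: o.market in {"BTC", "ETH"},
}


def failed_gates(order: Order, gates: dict[str, Gate]) -> list[str]:
    """Return the names of every gate the order fails."""
    return [name for name, check in gates.items() if not check(order)]


def may_execute(order: Order, gates: dict[str, Gate]) -> bool:
    """An order executes only if no gate fails."""
    return not failed_gates(order, gates)


good = Order("BTC", 500.0, 2.0)
bad = Order("DOGE", 500.0, 5.0)
bad_reasons = failed_gates(bad, GATES)
```

Evaluating every gate instead of stopping at the first failure costs almost nothing and leaves a complete record of why an order was blocked.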
Key Lessons for Building AI Systems
- Never let AI compute what you can calculate — Technical indicators are math. Run them cheaply first.
- Every service loaded costs something — Don't load skills your agent doesn't need. They accumulate.
- Align frequency with signal timeframes — Trading 4-hour candles? Don't scan every 60 seconds.
- Use statistical thresholds before AI — Filter with math, reserve AI for nuance.
- Build defensive architecture — Each layer should reduce workload for the next.
Why This Matters
This isn't just about saving tokens. It demonstrates:
- Systems thinking — Mapped entire architecture, identified bottlenecks systematically
- Data-driven optimization — Measured before/after metrics with clear ROI
- Real-world AI deployment — Running 24/7 AI systems in production
- Engineering rigor — Layered architecture where each component has a specific purpose
Built with Hermes Agent, Next.js 16, Hyperliquid API, and OpenRouter. Source code on GitHub. I'm a Hermes Agent contributor. If your team builds AI systems and needs someone who knows where the hidden costs live, let's talk.
Top comments (6)
The pre-AI statistical filter is the layer most teams underuse. We
saw a similar 60-70% drop in another domain (auto-translation of user
content for our LMS) by not sending text to the model unless a content
hash had actually changed. Same shape -- cheap deterministic gate before
the expensive call. Two things that compounded for us: caching keyed
by (content_hash, source_lang, target_lang) instead of (entity_id,
locale), and bundling canonical artifacts (Bible verses in our case)
as substitution tokens so the model only translates the prose around
them, not the verbatim quotes. Wonder if you've explored a similar
cache layer for the AI analysis stage -- same signal triggering the
same prompt-shape should ideally hit a cache, not re-invoke.
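The cache keying described in this comment could look roughly like the sketch below. The class name and the stubbed translator are hypothetical; the only detail taken from the comment is the key shape `(content_hash, source_lang, target_lang)`.

```python
import hashlib
from typing import Callable


class TranslationCache:
    """Cache keyed by content, so unchanged text never hits the model again."""

    def __init__(self, translate: Callable[[str, str, str], str]) -> None:
        self._translate = translate
        self._store: dict[tuple[str, str, str], str] = {}
        self.misses = 0

    def get(self, text: str, source_lang: str, target_lang: str) -> str:
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (content_hash, source_lang, target_lang)
        if key not in self._store:
            self.misses += 1  # only a miss pays for a model call
            self._store[key] = self._translate(text, source_lang, target_lang)
        return self._store[key]


# Stub standing in for the expensive model call.
def fake_translate(text: str, source_lang: str, target_lang: str) -> str:
    return f"[{source_lang}->{target_lang}] {text}"


cache = TranslationCache(fake_translate)
first = cache.get("Welcome to the course", "en", "es")
second = cache.get("Welcome to the course", "en", "es")  # served from cache
third = cache.get("Welcome to the course", "en", "fr")   # new target language
```

Keying on the content hash rather than an entity ID means two records with identical text share one cached result, and an edit that leaves the text unchanged triggers no new call.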
The Layer 1 skill purge is the part nobody talks about. I spent a week running four agents and discovered I'd been shipping the full 100-skill manifest on every turn: the same problem in a totally different domain. Trimming to 12 skills cut my context burn by ~60%, and the agents made fewer wrong-tool choices, which I didn't predict. "Cheap pre-filter first, expensive reasoning last" is a useful rule. The thing your post doesn't address: when the cheap TA filter is wrong, you lose the trade you would have won.
80% reduction is exactly the kind of result that shows up when you stop treating the agent as one monolithic premium call. Curious how much of your win came from each lever - in these architectures it's usually some mix of: caching the stable context instead of re-sending it, pruning what actually goes in the window, and routing cheap calls to a small model.
The trading-agent angle adds a nice twist: a lot of the loop is structured, repetitive classification/parsing that a small fast model nails for cents, and you only need the expensive reasoning model for the genuinely ambiguous decisions. That task-difficulty split is where the 80% almost always hides. Would love to see the actual model-routing breakdown if you do a follow-up - which calls you kept on the big model vs pushed down. Great architecture writeup.
yup! this really helped me