I used DeepSeek to build a Python script that tracks AI API costs across four providers. It saved my team roughly $2,000 a month in wasted spend. Getting there took a week of evenings fixing hallucinations the AI introduced. Here is what happened.
What Problem Made Me Build This?
My team was burning through AI API credits like there was no tomorrow. We had three different engineers using OpenAI for different tasks, someone spun up an Anthropic Claude research project, I was testing DeepSeek for internal tools, and our data team was experimenting with Google's Gemini. Nobody had a unified view of what we were spending.
The monthly invoice from each provider came in a different format, at a different time of the month, and nobody was watching. When I finally added them up, I nearly fell off my chair. We were spending $4,700 a month on AI APIs, and roughly 40% of that was on premium models for simple tasks that a cheaper model could handle just as well.
I wanted a single CLI tool that would:
- Pull usage data from all four providers
- Normalize costs into a single view
- Show me which models were burning the most money
- Generate a simple HTML dashboard I could share with the team
This looked like a job for DeepSeek. I can write basic Python, but stitching together four API SDKs with auth, rate limiting, and cost normalization? That is exactly the grunt work I wanted the AI to chew on.
How I Prompted DeepSeek to Build the Tracker
I wrote a single detailed prompt describing the full architecture. I wanted the AI to give me a working skeleton I could extend.
This is the prompt I fed it:
Prompt:
Write a Python CLI tool that tracks AI API costs across multiple providers. Call it track.py. It should:
1. Support OpenAI, Anthropic (Claude), Google (Gemini), and DeepSeek APIs.
2. Read API keys from environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.)
3. Accept a --days flag (default 7) to look back N days.
4. For each provider:
a. Authenticate and call their usage/list endpoints
b. Extract: model name, prompt tokens, completion tokens, total cost
c. Handle 401 (bad key - skip provider), 429 (rate limit - retry with backoff), 5xx (log and continue)
5. Normalize all costs to USD.
6. Print a summary table grouped by provider showing:
- Total tokens (input/output split)
- Total cost
- Cost per million tokens
7. Include a --serve flag that generates a standalone HTML dashboard file (dashboard.html) with charts showing daily cost trends.
8. Log all API errors to cost-tracker.log with timestamps.
9. Use only standard library plus requests, with fallback handling if requests is missing.
10. Output colorized status messages.
Please output the complete script with comments, a sample config, and usage examples.
DeepSeek returned about 800 lines of code. It looked great in the preview. It imported requests, datetime, json, and os, nothing beyond what the prompt allowed. It even added a cute animated spinner while fetching data.
Where Did the AI Output Break?
I pasted the code into track.py, set my environment variables, and ran python track.py --days 30. It fell apart almost immediately.
Hallucinated API Endpoints
The first failure showed up right away. DeepSeek invented an endpoint for OpenAI's usage API:
openai/usage?date=2026-06-01
This endpoint does not exist. OpenAI's actual usage data comes through a different API. You have to hit the billing dashboard API or parse the CSV exports. The AI just made up a REST path that looked reasonable on paper.
I got a clean 404, which the script handled gracefully ("Provider OpenAI: No data returned"), but that silence was worse than an error. It looked like OpenAI had zero usage. I spent two hours debugging before I realized the endpoint was fake.
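A minimal sketch of how that silent failure could be avoided, assuming a hypothetical `interpret_response` helper and `ProviderError` exception (neither name comes from the original script). The idea is that a 404 means the code is calling the wrong URL, so it should stop the run instead of being reported as zero usage:

```python
class ProviderError(Exception):
    """Raised when a provider call fails in a way that must not read as 'no usage'."""


def interpret_response(provider: str, status_code: int, records: list) -> list:
    """Return usage records for a successful call, or raise.

    Hypothetical helper: the name and the policy are illustrative.
    """
    if status_code == 404:
        # A missing endpoint is a bug in our code, not an empty result.
        raise ProviderError(
            f"{provider}: endpoint not found (404); check the URL against the docs"
        )
    if status_code == 401:
        raise ProviderError(f"{provider}: authentication failed (401)")
    if status_code >= 400:
        raise ProviderError(f"{provider}: unexpected HTTP {status_code}")
    # Only a successful response may legitimately mean "zero usage".
    return records
```

With a check like this, an empty list can only come from a successful response, so "No data returned" becomes a statement about usage and not about a broken URL.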
Wrong Cost Calculation Logic
The AI assumed every provider used the same cost structure: (prompt_tokens * input_price + completion_tokens * output_price) / 1,000,000. Roughly correct in theory, but the reality is messier:
- Anthropic rounds token counts per request, not aggregated
- Google Gemini API returns costs in micros (millionths of a currency unit), not dollars
- DeepSeek's API returns token counts without cost. You have to calculate it yourself from their published pricing.
The AI used the same formula for all four and got three of them wrong.
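To make the arithmetic concrete, here is a small sketch of the shared formula plus a unit conversion. The function names and the prices are illustrative placeholders, not real provider rates, and the conversion assumes that micros means millionths of one currency unit:

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)


def token_cost_usd(prompt_tokens, completion_tokens, input_price, output_price):
    """Cost in USD, given prices quoted per million tokens."""
    return (
        Decimal(prompt_tokens) * Decimal(input_price)
        + Decimal(completion_tokens) * Decimal(output_price)
    ) / MILLION


def micros_to_units(micros):
    """Convert micros to whole currency units (assumes 1 unit = 1,000,000 micros)."""
    return Decimal(micros) / MILLION


# Placeholder prices: $3.00 per million input tokens, $15.00 per million output tokens.
example_cost = token_cost_usd(2_000_000, 500_000, "3.00", "15.00")
```

Using `Decimal` with string prices avoids the float rounding drift that shows up when thousands of small per-request costs are summed.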
No Pagination Handling
When I ran it against our actual production data, it only returned the last 100 records from each provider. Turns out every API paginates differently:
- OpenAI uses a cursor-based after parameter
- Anthropic uses offset pagination
- Google uses page tokens
- DeepSeek uses limit/offset
The AI code handled zero of these. It fetched one page and declared victory.
Rate Limit Chaos
I have 12 API keys across four providers (some personal, some team). The AI added a simple time.sleep(1) between calls. That is fine for a single key, but when you have multiple keys hitting the same provider, you get rate limited fast. I saw:
429 Too Many Requests
The script's "retry with backoff" was a single retry after 5 seconds. Not enough for the burst pattern of our team's usage.
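For comparison, a more forgiving retry loop might look like the sketch below: exponential backoff with jitter, capped at a maximum delay. The `call` argument is a hypothetical zero-argument function returning a `(status_code, body)` tuple, and the default values are illustrative, not tuned for any provider:

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call `call()` until it returns a non-429 status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = call()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts; report the last 429 to the caller
        # The delay ceiling doubles each attempt, capped at max_delay.
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        # Jitter: sleep a random amount up to the ceiling so that
        # several keys do not all retry at the same instant.
        sleep(random.uniform(0, ceiling))
    return status, body
```

Passing `sleep` as a parameter keeps the function testable without real waiting. If a provider sends a Retry-After header with its 429, honoring that value is usually a better choice than a computed delay.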
What I Had to Fix
I spent roughly six evenings rewriting the script. Here is what I changed:
API endpoint discovery (2 hours): I read the actual API docs for each provider and replaced the fake endpoints with real ones. For OpenAI, I used the GET /dashboard/billing/usage endpoint. For Anthropic, I hit their admin API. For Google, I used the projects/locations/global/costs endpoint. DeepSeek's was the simplest — their dashboard API actually exists and works.
Cost normalization (1 hour): I wrote a provider-specific cost calculator for each one. Each provider publishes per-model pricing as a separate lookup table in its own format. I hardcoded the current prices (as of June 2026) and added a --refresh-prices flag that pulls the latest from each provider's pricing page.
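One way to combine hardcoded prices with a refresh step is sketched below. Everything here is hypothetical: the table shape, the `load_prices` and `price_for` names, and the placeholder provider and model. The `refresh` callable stands in for whatever fetches current prices:

```python
# Placeholder prices in USD per million tokens, not real rates.
DEFAULT_PRICES = {
    ("example-provider", "example-model"): {"input": 1.00, "output": 2.00},
}


def load_prices(refresh=None, defaults=DEFAULT_PRICES):
    """Return a price table, preferring refreshed values over defaults.

    `refresh` is a hypothetical zero-argument callable returning a dict
    shaped like DEFAULT_PRICES, or raising on failure.
    """
    prices = dict(defaults)
    if refresh is not None:
        try:
            prices.update(refresh())
        except Exception as exc:  # network error, parse error, and so on
            print(f"price refresh failed, using defaults: {exc}")
    return prices


def price_for(prices, provider, model):
    """Look up a model's prices; an unknown model is an error, not zero."""
    try:
        return prices[(provider, model)]
    except KeyError:
        raise KeyError(f"no price configured for {provider}/{model}") from None
```

Raising on an unknown model matters here: silently pricing it at zero would hide exactly the kind of spend the tracker exists to surface.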
Pagination (2 hours): I added pagination loops for each provider separately. This was the most tedious part because the patterns are all different. I used a generator pattern so the main loop just iterates over pages without caring about the specifics.
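The generator pattern can be sketched roughly like this. The `fetch` callables and the `data` and `next_cursor` field names are assumptions for illustration; each real provider uses its own parameter and field names:

```python
from typing import Callable, Iterator, Optional


def cursor_pages(fetch: Callable[[Optional[str]], dict]) -> Iterator[list]:
    """Yield pages from a cursor-style API.

    `fetch(cursor)` is a hypothetical callable returning a dict with
    "data" (list of records) and "next_cursor" (str or None).
    """
    cursor = None
    while True:
        page = fetch(cursor)
        yield page["data"]
        cursor = page.get("next_cursor")
        if not cursor:
            break


def offset_pages(fetch: Callable[[int, int], list], limit: int = 100) -> Iterator[list]:
    """Yield pages from a limit/offset API; stop on a short or empty page.

    `fetch(offset, limit)` is a hypothetical callable returning a list.
    """
    offset = 0
    while True:
        records = fetch(offset, limit)
        if records:
            yield records
        if len(records) < limit:
            break
        offset += limit


def all_records(pages: Iterator[list]) -> Iterator:
    """Flatten pages so the main loop never sees pagination details."""
    for page in pages:
        yield from page
```

The main loop then reads `for record in all_records(offset_pages(fetch)):` and works the same way whichever pagination style sits underneath.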
Rate limiting (1 hour): I replaced the naive sleep with a proper token-bucket rate limiter that respects per-minute and per-hour limits for each provider. Each API key gets its own bucket. When one key is exhausted, the script rotates to another key if available.
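A stripped-down version of that idea, as a sketch: one token bucket per key, plus a pool that hands out the first key with capacity left. The `TokenBucket` and `KeyPool` names are illustrative, and this version models a single limit per key, not the separate per-minute and per-hour limits described above:

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill based on elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class KeyPool:
    """One bucket per API key; hand out the first key that has a token."""

    def __init__(self, keys, capacity, rate, clock=time.monotonic):
        self.buckets = {key: TokenBucket(capacity, rate, clock) for key in keys}

    def acquire(self):
        for key, bucket in self.buckets.items():
            if bucket.try_acquire():
                return key
        return None  # every key is exhausted; the caller should wait
```

Injecting the clock makes the limiter deterministic in tests. Modeling two windows per key would mean holding two buckets per key and requiring a token from both.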
Cost dashboard (30 minutes): The AI generated a decent HTML template using Chart.js, but it was pulling data from the hallucinated API. I rewired it to use the normalized cost data from our new pipeline. The dashboard now shows a 7-day cost trend per provider and a breakdown by model.
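For readers who want the shape of the standalone file, here is a rough sketch. The template, the `render_dashboard` and `write_dashboard` names, and the payload layout are assumptions, not the original dashboard.py. It loads Chart.js from the jsDelivr CDN, so the page needs internet access when opened:

```python
import json
from pathlib import Path

PAGE = """<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <title>AI API cost dashboard</title>
  <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
</head>
<body>
  <canvas id="trend"></canvas>
  <script>
    const payload = __PAYLOAD__;
    new Chart(document.getElementById("trend"), {
      type: "line",
      data: {
        labels: payload.days,
        datasets: Object.entries(payload.series).map(
          ([provider, costs]) => ({label: provider, data: costs})
        )
      }
    });
  </script>
</body>
</html>
"""


def render_dashboard(days, series):
    """Embed the normalized data as JSON so the file needs no server."""
    payload = json.dumps({"days": days, "series": series})
    # Escape "</" so a value can never close the <script> tag early.
    return PAGE.replace("__PAYLOAD__", payload.replace("</", "<\\/"))


def write_dashboard(path, days, series):
    """Write the rendered page to `path` as UTF-8."""
    Path(path).write_text(render_dashboard(days, series), encoding="utf-8")
```

Here `days` is a list of date labels and `series` maps each provider name to a list of daily costs in the same order.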
The final script came in at roughly 1,400 lines across three files:
- track.py (main CLI, 600 lines)
- providers.py (provider-specific logic, 500 lines)
- dashboard.py (HTML generation, 300 lines)
I tested it against 90 days of our actual usage data. The cost totals matched our billing dashboard within 3-5% for each provider.
The Actual Impact
Once we had the dashboard, the savings were immediate:
- OpenAI GPT-5 usage: Someone had left a background job running against GPT-5 that should have been using GPT-4-mini. That alone was $800/month.
- Anthropic Claude Opus: Our research team was running batch analysis on Claude Opus when Sonnet would have been sufficient. $600/month saved.
- DeepSeek: We were double-paying — using DeepSeek's API and also routing through a third-party aggregator that added a 30% markup. Cut the aggregator out. $400/month saved.
- Google Gemini: Several test accounts had API keys with no usage limits. They were running experiments that should have been on a dev budget. $200/month saved.
Total: $2,000/month, give or take.
The Exact Prompt
The raw prompt I started with is reproduced in full under "How I Prompted DeepSeek to Build the Tracker" above. You can copy-paste it into DeepSeek and expect similar output. Be ready to fix the endpoints, because the AI will invent some.
What I Learned
A few things stuck with me after this experiment.
The AI will invent API endpoints that look right on paper but do not exist. I wasted two hours chasing a fake URL before I checked the actual docs. That one is on me, but it is also a pattern I see in every DeepSeek-generated script I have tested. The AI is confident, specific, and wrong.
Provider pricing changes constantly. The cost tables the AI generated were already stale by the time I ran the script. The --refresh-prices flag fixed this, but it taught me to never hardcode prices without a fallback.
Rate limiting with multiple keys is fundamentally different from rate limiting with one. A simple time.sleep(1) works for a demo on a single key. Throw twelve keys at the same provider and you will hit 429s within minutes. A token bucket per key is the right answer.
Pagination is the silent killer. The AI fetched one page and declared the data complete. If I had trusted it, I would have reported 10% of our actual spend and nobody would have caught the gap.
The dashboard was useful, but the real money came from model selection. The tracking is just a flashlight. You still have to walk over and turn off the lights. Finding that GPT-5 background job alone paid for the entire project.
For a similar story, here is how I built a log monitoring script with DeepSeek and hit the same wall of hallucinated features.
FAQ
Q: Can this script track costs from multiple AI providers at once?
A: Yes. The script supports OpenAI, Anthropic, Google, and DeepSeek out of the box. You set each API key as an environment variable and the aggregator handles the rest. The key is the unified cost model: the script normalizes every provider's cost data into the same JSON structure.
Q: Does the token dashboard require a web server?
A: No. The dashboard is a standalone HTML file generated by the script. You run python track.py --serve and it writes a single dashboard.html that you can open in any browser. No Python web framework needed.
Q: How accurate is the cost estimation?
A: Within 3-5% of the provider's billing dashboard in my testing. The small gap is because some providers round token counts differently (Anthropic rounds per-request, OpenAI aggregates). I added a configurable buffer percentage to handle this.
Q: What happens if an API key is invalid or rate-limited?
A: The script logs the error to a file and continues with the remaining providers. Each provider runs in its own thread with a 2-second backoff on rate limits. I added a --retry flag that retries failed providers every 5 minutes.
Q: Do I need to expose my API keys in plain text?
A: No. The script reads keys from environment variables by default. You can set OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, and DEEPSEEK_API_KEY in your shell or .env file. The config file only stores metadata like model names and cost-per-token.
Related Guides
- I Built a Log Monitor with DeepSeek — Full Breakdown — The same pattern of AI hallucination and manual fix applied to server log monitoring.
- Automating TLS Certificate Renewal with DeepSeek — Production lessons from automating TLS certs with AI-generated Python.
- I Asked DeepSeek to Build My Sysadmin Toolkit — A suite of Python automation scripts, including what the AI got wrong.