For months I was paying for Claude twice. The monthly subscription and the API tokens every time an agent made a call.
Turns out I didn't have to.
The problem
When you import the Anthropic SDK directly, every token gets billed:
This charges per token consumed
from anthropic import Anthropic
client = Anthropic(api_key="sk-...")
response = client.messages.create(...)
1,000 fiscal document analyses per month with Sonnet: between $25 and $80 USD on top of what you already pay for the subscription. And that scales linearly with every new agent you add to the system.
The discovery
Claude Code CLI has an authentication hierarchy that almost nobody documents:
- CLAUDE_CODE_USE_BEDROCK / USE_VERTEX (cloud providers)
- ANTHROPIC_AUTH_TOKEN (proxies / gateways)
- ANTHROPIC_API_KEY ← per-token billing starts here
- apiKeyHelper script (rotating credentials)
- ~/.claude/.credentials.json ← your Max subscription lives here When you run claude login, the CLI stores an OAuth token at ~/.claude/.credentials.json. That token is exactly the same one your interactive terminal uses. If there is no ANTHROPIC_API_KEY defined in the environment, the CLI falls back to position 5 and uses your subscription session. claude -p spawned as a subprocess from your code does the same thing. No API key. No additional invoice. What claude_runner is It's a module that acts as a bridge between agents and Claude, using claude -p as a subprocess instead of the SDK. It exists in two versions: tacos-aragon-fiscal/src/claude_runner.py ← Python, fiscal document analysis pmo-agent/claude-runner.js ← JavaScript, production agent Both do the same thing: spawn the claude -p process, parse the output, and return the response to the calling agent. No secrets in the code. No environment variable that can leak into logs. Python implementation (79 lines) The Python version is minimal. No external dependencies beyond the standard library. """ claude_runner.py — No API key, no per-token costs. Replaces direct Anthropic SDK calls. """
import subprocess, sys, os, tempfile
CLAUDE_TIMEOUT = 300 # 5 minutes default
def run(system_prompt: str, user_message: str,
model: str = "sonnet",
max_budget: float = 2.0,
timeout: int = CLAUDE_TIMEOUT,
session_id: str | None = None) -> str:
full_prompt = f"=== INSTRUCTIONS ===\n{system_prompt}\n\n=== TASK ===\n{user_message}"
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.txt',
delete=False, encoding='utf-8')
tmp.write(full_prompt)
tmp.close()
try:
cmd = (
f'type "{tmp.name}" | claude -p'
f' --output-format text'
f' --model {model}'
f' --permission-mode bypassPermissions'
f' --max-budget-usd {max_budget}'
)
if session_id:
cmd += f' --session-id {session_id}'
comspec = os.environ.get('COMSPEC', r'C:\Windows\system32\cmd.exe')
result = subprocess.run(
[comspec, '/c', cmd],
capture_output=True, text=True,
timeout=timeout, encoding='utf-8', errors='replace'
)
if result.returncode != 0:
raise RuntimeError(f"Claude error: {result.stderr}")
return result.stdout.strip()
finally:
os.unlink(tmp.name)
Why a temp file instead of passing the prompt directly? Windows has a ~32,767 character limit on command-line arguments. Fiscal analysis prompts exceed that regularly. The temp file is the reliable solution.
JavaScript implementation (with stream-json)
const { spawn } = require('child_process');
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');
const path = require('path');
let _running = false; // Anti-reentrance guard
async function runClaude({ promptFile, userPrompt, projectName }) {
if (_running) {
return { ok: false, error: 'Execution already in progress' };
}
_running = true;
const mcpConfig = getMcpConfigForProject(projectName);
const { sessionId, isNew } = getOrCreateSession(projectName);
const finalPrompt = isNew
? ${await fs.promises.readFile(promptFile, 'utf8')}\n\n${userPrompt}
: userPrompt;
const tmpId = crypto.randomBytes(4).toString('hex');
const tmpPrompt = path.join(os.tmpdir(), pmo-prompt-${tmpId}.txt);
const tmpBat = path.join(os.tmpdir(), pmo-run-${tmpId}.bat);
await fs.promises.writeFile(tmpPrompt, finalPrompt, 'utf8');
const batContent = [
'@echo off',
type "${tmpPrompt}" | claude -p ^,
--output-format stream-json ^,
--model sonnet ^,
--mcp-config "${mcpConfig}" ^,
--strict-mcp-config ^,
--permission-mode bypassPermissions ^,
--max-turns 20 ^,
--max-budget-usd 2.00 ^,
--session-id ${sessionId}
].join('\r\n');
await fs.promises.writeFile(tmpBat, batContent, 'utf8');
return new Promise((resolve) => {
let finalOutput = '';
let watchdog;
const resetWatchdog = () => {
clearTimeout(watchdog);
watchdog = setTimeout(() => {
proc.kill();
resolve({ ok: false, error: 'Inactivity timeout' });
}, 5 * 60 * 1000);
};
const proc = spawn('cmd.exe', ['/c', tmpBat]);
proc.stdout.on('data', (chunk) => {
resetWatchdog();
for (const line of chunk.toString().split('\n')) {
if (!line.trim()) continue;
try {
const event = JSON.parse(line);
processStreamEvent(event, (output) => { finalOutput = output; });
} catch { /* non-JSON line, ignore */ }
}
});
proc.on('close', async (code) => {
clearTimeout(watchdog);
_running = false;
await Promise.allSettled([
fs.promises.unlink(tmpPrompt),
fs.promises.unlink(tmpBat)
]);
resolve({ ok: code === 0, output: finalOutput, exitCode: code, sessionId });
});
resetWatchdog();
});
}
The stream-json event parser
function processStreamEvent(event, onResult) {
switch (event.type) {
case 'system':
console.log(`🔌 MCP: ${event.mcp_servers.map(s => s.name).join(', ')}`);
break;
case 'assistant':
for (const block of event.message.content) {
if (block.type === 'thinking') console.log(`💭 ${block.thinking}`);
if (block.type === 'tool_use') console.log(`🔧 ${block.name}`);
if (block.type === 'text') console.log(`💬 ${block.text}`);
}
break;
case 'result':
onResult(event.result);
// cost_usd is informational in Plan Max — not a real charge
console.log(`💰 Equivalent cost: $${event.cost_usd.toFixed(4)}`);
break;
}
}
Why stream-json instead of text? With text the process looks frozen until it finishes. With stream-json the inactivity watchdog can tell Claude is still working because each tool_use event resets the timer. Five minutes without any event means something went wrong.
The three decisions that make it work in production
- Anti-reentrance guard Without _running, the same error can trigger the agent twice in a row. Two Claude instances operating on the same codebase simultaneously is a race condition that ends badly.
- Dynamic MCP config per project Instead of loading the global config with all servers in the ecosystem, the runner generates a minimal JSON before each execution: { "mcpServers": { "project-tacos-bot": { "command": "python", "args": ["C:/servers/tacos-bot-mcp/index.py"] } } } If you load 6 MCP servers with 24 tools each when you only need 24 tools from one of them, you are injecting the schema of 144 tools into the context. More input tokens, slower execution, and Claude has to ignore tools it will never use.
- Session management per project Each project gets a session UUID with a 1-hour TTL and an 8-message limit: const SESSION_TTL_MS = 60 * 60 * 1000; const SESSION_MAX_MESSAGES = 8;
function getOrCreateSession(projectName) {
const now = Date.now();
const existing = activeSessions.get(projectName);
if (existing &&
(now - existing.createdAt) < SESSION_TTL_MS &&
existing.messages < SESSION_MAX_MESSAGES) {
return { sessionId: existing.sessionId, isNew: false };
}
const sessionId = crypto.randomUUID();
activeSessions.set(projectName, { sessionId, createdAt: now, messages: 0 });
return { sessionId, isNew: true };
}
On an active session, the prompt skips the instructions. Claude remembers them from the first message of the session. On a new session, they go in full. The difference in input tokens is significant when the system prompt is long.
The trap to avoid
If at any point you define ANTHROPIC_API_KEY in the server environment, the CLI detects it at position 3 in the hierarchy and starts billing per token with no warning. The runner stops using the subscription without you noticing.
Check before going to production
echo $ANTHROPIC_API_KEY
If this returns anything, there is a problem
unset ANTHROPIC_API_KEY
That variable must not exist in the environment where the agent runs.
The cost_usd field is not a charge
The result event in the stream-json output includes a cost_usd field. That number shows what it would cost if you were in API mode. In Plan Max it is not deducted from any balance and generates no billing.
I log it as an efficiency reference to know exactly how much I am saving per execution and to catch if a call is consuming more context than expected.
Cost comparison
Setup
Per execution
500 analyses/month
1,000/month
Sonnet SDK direct
~$0.047
~$23
~$47
GPT-4o API direct
~$0.039
~$19
~$39
claude_runner Plan Max
$0.00
$0.00
$0.00
Break-even against Plan Max at $200/month: 71 executions per day. After that point, the API costs more than the full plan.
What runs in production
tacos-aragon-fiscal (Python): downloads CFDIs from Mexico's SAT tax authority, parses the fiscal XML, calculates taxes, and detects inconsistencies between declared income and actual sales.
pmo-agent (JavaScript): detects errors in production processes, proposes fixes ordered by impact, waits for Telegram approval, applies changes, commits, restarts PM2, and reports exactly what it did. If something ends up worse than before, it runs git checkout automatically.
aragon-git-guardian: intercepts every git push, scans the repo with grep across 10 security categories, and if it finds anything calls claude_runner with the specific evidence for contextual analysis. No AI when there are no findings, no cost when the repo is clean.
Monthly cost for all these agents combined: $0 additional on top of the subscription.
How to replicate it
1. Install and authenticate Claude Code
npm install -g @anthropic-ai/claude-code
claude login
2. Verify there is no API key in the environment
unset ANTHROPIC_API_KEY
echo $ANTHROPIC_API_KEY # should be empty
3. Test that the CLI uses the subscription
claude -p "respond with just: ok"
Should respond without asking for an API key
After that: spawn claude -p as a subprocess with spawn (Node.js) or subprocess (Python), parse the stream-json, and manage the process lifecycle.
The highest-impact decision is writing your own MCP server instead of using generic third-party MCPs. Your own context is cleaner, uses fewer tokens, and Claude enters the session already knowing how your system is built.
PMO Agent repo: github.com/Gumagonza1/pmo-agent
Claude_runner repo:https://github.com/Gumagonza1/claude-runner
Self-healing server article: dev.to/gumagonza1/i-built-a-self-healing-production-server-using-claude-code-no-api-key-no-extra-cost-1eoo
Top comments (0)