Gumaro Gonzalez
claude_runner — how I eliminated Claude API costs by using the subscription I was already paying for

For months I was paying for Claude twice: the monthly subscription, and API tokens every time an agent made a call.
Turns out I didn't have to.
The problem
When you import the Anthropic SDK directly, every token gets billed:

```python
# This path charges per token consumed
from anthropic import Anthropic

client = Anthropic(api_key="sk-...")
response = client.messages.create(...)
```
1,000 fiscal document analyses per month with Sonnet: between $25 and $80 USD on top of what you already pay for the subscription. And that scales linearly with every new agent you add to the system.
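As a rough sketch of where that range comes from (the token counts per analysis are assumptions for illustration; Sonnet API pricing at the time of writing was roughly $3 per million input tokens and $15 per million output tokens):

```python
# Back-of-envelope cost estimate for direct SDK usage.
# Token counts per call are illustrative assumptions, not measurements.
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens (Sonnet, approx.)
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens (Sonnet, approx.)

def monthly_cost(analyses: int, input_tokens: int, output_tokens: int) -> float:
    per_call = (input_tokens * INPUT_PRICE_PER_MTOK
                + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000
    return analyses * per_call

# A fiscal-document prompt of ~10k input tokens and ~1k output tokens:
print(f"${monthly_cost(1000, 10_000, 1_000):.2f}/month")  # → $45.00/month
```

Bigger prompts or longer answers push you toward the top of the range; caching and shorter prompts pull you toward the bottom.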
The discovery
Claude Code CLI has an authentication hierarchy that almost nobody documents:

  1. CLAUDE_CODE_USE_BEDROCK / USE_VERTEX (cloud providers)
  2. ANTHROPIC_AUTH_TOKEN (proxies / gateways)
  3. ANTHROPIC_API_KEY ← per-token billing starts here
  4. apiKeyHelper script (rotating credentials)
  5. ~/.claude/.credentials.json ← your Max subscription lives here

When you run claude login, the CLI stores an OAuth token at ~/.claude/.credentials.json. That token is exactly the same one your interactive terminal uses. If there is no ANTHROPIC_API_KEY defined in the environment, the CLI falls back to position 5 and uses your subscription session. claude -p spawned as a subprocess from your code does the same thing. No API key. No additional invoice.

What claude_runner is

It's a module that acts as a bridge between agents and Claude, using claude -p as a subprocess instead of the SDK. It exists in two versions:

  tacos-aragon-fiscal/src/claude_runner.py ← Python, fiscal document analysis
  pmo-agent/claude-runner.js ← JavaScript, production agent

Both do the same thing: spawn the claude -p process, parse the output, and return the response to the calling agent. No secrets in the code. No environment variable that can leak into logs.

Python implementation (79 lines)

The Python version is minimal. No external dependencies beyond the standard library.

```python
"""
claude_runner.py — No API key, no per-token costs.
Replaces direct Anthropic SDK calls.
"""
import os
import subprocess
import tempfile

CLAUDE_TIMEOUT = 300  # 5 minutes default


def run(system_prompt: str, user_message: str,
        model: str = "sonnet",
        max_budget: float = 2.0,
        timeout: int = CLAUDE_TIMEOUT,
        session_id: str | None = None) -> str:

    full_prompt = f"=== INSTRUCTIONS ===\n{system_prompt}\n\n=== TASK ===\n{user_message}"

    tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.txt',
                                      delete=False, encoding='utf-8')
    tmp.write(full_prompt)
    tmp.close()

    try:
        cmd = (
            f'type "{tmp.name}" | claude -p'
            f' --output-format text'
            f' --model {model}'
            f' --permission-mode bypassPermissions'
            f' --max-budget-usd {max_budget}'
        )
        if session_id:
            cmd += f' --session-id {session_id}'

        comspec = os.environ.get('COMSPEC', r'C:\Windows\system32\cmd.exe')
        result = subprocess.run(
            [comspec, '/c', cmd],
            capture_output=True, text=True,
            timeout=timeout, encoding='utf-8', errors='replace'
        )

        if result.returncode != 0:
            raise RuntimeError(f"Claude error: {result.stderr}")

        return result.stdout.strip()

    finally:
        os.unlink(tmp.name)
```

Why a temp file instead of passing the prompt directly? Windows has a ~32,767 character limit on command-line arguments. Fiscal analysis prompts exceed that regularly. The temp file is the reliable solution.
JavaScript implementation (with stream-json)
```javascript
const { spawn } = require('child_process');
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');
const path = require('path');

let _running = false; // Anti-reentrance guard

async function runClaude({ promptFile, userPrompt, projectName }) {

  if (_running) {
    return { ok: false, error: 'Execution already in progress' };
  }
  _running = true;

  const mcpConfig = getMcpConfigForProject(projectName);
  const { sessionId, isNew } = getOrCreateSession(projectName);

  const finalPrompt = isNew
    ? `${await fs.promises.readFile(promptFile, 'utf8')}\n\n${userPrompt}`
    : userPrompt;

  const tmpId = crypto.randomBytes(4).toString('hex');
  const tmpPrompt = path.join(os.tmpdir(), `pmo-prompt-${tmpId}.txt`);
  const tmpBat = path.join(os.tmpdir(), `pmo-run-${tmpId}.bat`);

  await fs.promises.writeFile(tmpPrompt, finalPrompt, 'utf8');

  const batContent = [
    '@echo off',
    `type "${tmpPrompt}" | claude -p ^`,
    '  --output-format stream-json ^',
    '  --model sonnet ^',
    `  --mcp-config "${mcpConfig}" ^`,
    '  --strict-mcp-config ^',
    '  --permission-mode bypassPermissions ^',
    '  --max-turns 20 ^',
    '  --max-budget-usd 2.00 ^',
    `  --session-id ${sessionId}`
  ].join('\r\n');

  await fs.promises.writeFile(tmpBat, batContent, 'utf8');

  return new Promise((resolve) => {
    let finalOutput = '';
    let watchdog;

    const resetWatchdog = () => {
      clearTimeout(watchdog);
      watchdog = setTimeout(() => {
        proc.kill();
        resolve({ ok: false, error: 'Inactivity timeout' });
      }, 5 * 60 * 1000);
    };

    const proc = spawn('cmd.exe', ['/c', tmpBat]);

    proc.stdout.on('data', (chunk) => {
      resetWatchdog();
      for (const line of chunk.toString().split('\n')) {
        if (!line.trim()) continue;
        try {
          const event = JSON.parse(line);
          processStreamEvent(event, (output) => { finalOutput = output; });
        } catch { /* non-JSON line, ignore */ }
      }
    });

    proc.on('close', async (code) => {
      clearTimeout(watchdog);
      _running = false;
      await Promise.allSettled([
        fs.promises.unlink(tmpPrompt),
        fs.promises.unlink(tmpBat)
      ]);
      resolve({ ok: code === 0, output: finalOutput, exitCode: code, sessionId });
    });

    resetWatchdog();
  });
}
```
The stream-json event parser
```javascript
function processStreamEvent(event, onResult) {
  switch (event.type) {

    case 'system':
      console.log(`🔌 MCP: ${event.mcp_servers.map(s => s.name).join(', ')}`);
      break;

    case 'assistant':
      for (const block of event.message.content) {
        if (block.type === 'thinking') console.log(`💭 ${block.thinking}`);
        if (block.type === 'tool_use') console.log(`🔧 ${block.name}`);
        if (block.type === 'text')     console.log(`💬 ${block.text}`);
      }
      break;

    case 'result':
      onResult(event.result);
      // cost_usd is informational in Plan Max — not a real charge
      console.log(`💰 Equivalent cost: $${event.cost_usd.toFixed(4)}`);
      break;
  }
}
```
Why stream-json instead of text? With text the process looks frozen until it finishes. With stream-json the inactivity watchdog can tell Claude is still working because each tool_use event resets the timer. Five minutes without any event means something went wrong.
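The Python runner could consume the same stream-json output line by line. A minimal sketch of the parsing side (event shapes are inferred from the JavaScript parser above, and the sample lines here are fabricated to match that shape):

```python
import json

def parse_stream_lines(lines):
    """Extract the final answer from stream-json output lines.

    Mirrors the JavaScript parser: 'assistant' events carry content
    blocks along the way; a final 'result' event carries the answer.
    """
    final_output = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON line, ignore
        if event.get("type") == "result":
            final_output = event.get("result")
    return final_output

# Fabricated events in the shape the parser expects:
sample = [
    '{"type": "assistant", "message": {"content": [{"type": "text", "text": "working..."}]}}',
    'not json, skipped',
    '{"type": "result", "result": "ok", "cost_usd": 0.0123}',
]
print(parse_stream_lines(sample))  # → ok
```

In a real runner each parsed event would also reset the inactivity watchdog, exactly as the JavaScript version does.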
The three decisions that make it work in production

  1. Anti-reentrance guard. Without _running, the same error can trigger the agent twice in a row. Two Claude instances operating on the same codebase simultaneously is a race condition that ends badly.

  2. Dynamic MCP config per project. Instead of loading the global config with all servers in the ecosystem, the runner generates a minimal JSON before each execution:

```json
{
  "mcpServers": {
    "project-tacos-bot": {
      "command": "python",
      "args": ["C:/servers/tacos-bot-mcp/index.py"]
    }
  }
}
```

  If you load 6 MCP servers with 24 tools each when you only need the 24 tools from one of them, you are injecting the schema of 144 tools into the context. More input tokens, slower execution, and Claude has to ignore tools it will never use.

  3. Session management per project. Each project gets a session UUID with a 1-hour TTL and an 8-message limit:

```javascript
const SESSION_TTL_MS = 60 * 60 * 1000;
const SESSION_MAX_MESSAGES = 8;

function getOrCreateSession(projectName) {
  const now = Date.now();
  const existing = activeSessions.get(projectName);

  if (existing &&
      (now - existing.createdAt) < SESSION_TTL_MS &&
      existing.messages < SESSION_MAX_MESSAGES) {
    return { sessionId: existing.sessionId, isNew: false };
  }

  const sessionId = crypto.randomUUID();
  activeSessions.set(projectName, { sessionId, createdAt: now, messages: 0 });
  return { sessionId, isNew: true };
}
```
On an active session, the prompt skips the instructions. Claude remembers them from the first message of the session. On a new session, they go in full. The difference in input tokens is significant when the system prompt is long.
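A Python counterpart of the per-project MCP config generation could look like this (the registry dict and server paths are assumptions for illustration; only the JSON shape comes from the example above):

```python
import json
import os
import tempfile

# Hypothetical registry mapping each project to its single MCP server.
MCP_SERVERS = {
    "tacos-bot": {"command": "python",
                  "args": ["C:/servers/tacos-bot-mcp/index.py"]},
}

def get_mcp_config_for_project(project_name: str) -> str:
    """Write a minimal MCP config containing only this project's server,
    instead of the global config with every server in the ecosystem."""
    config = {"mcpServers": {f"project-{project_name}": MCP_SERVERS[project_name]}}
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return path  # pass this to claude -p via --mcp-config
```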
The trap to avoid
If at any point you define ANTHROPIC_API_KEY in the server environment, the CLI detects it at position 3 in the hierarchy and starts billing per token with no warning. The runner stops using the subscription without you noticing.

Check before going to production:

```shell
echo $ANTHROPIC_API_KEY
# If this returns anything, there is a problem
unset ANTHROPIC_API_KEY
```
That variable must not exist in the environment where the agent runs.
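A defensive check inside the runner itself can fail fast if the variable sneaks back into the environment (a small sketch; call it before spawning claude -p):

```python
import os

def assert_subscription_auth(env=os.environ):
    """Refuse to run if ANTHROPIC_API_KEY is set: the CLI would pick it
    up at position 3 of the hierarchy and silently bill per token."""
    if env.get("ANTHROPIC_API_KEY"):
        raise RuntimeError(
            "ANTHROPIC_API_KEY is set; claude -p would bill per token. "
            "Unset it so the CLI falls back to the subscription session."
        )
```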
The cost_usd field is not a charge
The result event in the stream-json output includes a cost_usd field. That number shows what it would cost if you were in API mode. In Plan Max it is not deducted from any balance and generates no billing.
I log it as an efficiency reference to know exactly how much I am saving per execution and to catch if a call is consuming more context than expected.
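A minimal way to keep that reference, assuming the result events shown earlier (the alert threshold is an arbitrary illustrative value):

```python
class CostTracker:
    """Accumulates the informational cost_usd of each run.
    On Plan Max this number is savings, not spending."""

    def __init__(self, alert_threshold_usd: float = 0.50):
        self.total = 0.0
        self.alert_threshold = alert_threshold_usd

    def record(self, result_event: dict) -> float:
        cost = result_event.get("cost_usd", 0.0)
        self.total += cost
        if cost > self.alert_threshold:
            # A single call this expensive usually means bloated context
            print(f"⚠️ Call consumed ${cost:.4f} equivalent; check prompt/MCP config")
        return cost

tracker = CostTracker()
tracker.record({"type": "result", "result": "ok", "cost_usd": 0.0123})
print(f"Saved so far: ${tracker.total:.4f}")  # → Saved so far: $0.0123
```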
Cost comparison
| Setup | Per execution | 500 analyses/month | 1,000/month |
| --- | --- | --- | --- |
| Sonnet SDK direct | ~$0.047 | ~$23 | ~$47 |
| GPT-4o API direct | ~$0.039 | ~$19 | ~$39 |
| claude_runner + Plan Max | $0.00 | $0.00 | $0.00 |
Break-even against Plan Max at $200/month: about 142 executions per day ($200 ÷ 30 days ÷ ~$0.047 per execution). Past that point, the API costs more than the full plan.
What runs in production
tacos-aragon-fiscal (Python): downloads CFDIs from Mexico's SAT tax authority, parses the fiscal XML, calculates taxes, and detects inconsistencies between declared income and actual sales.
pmo-agent (JavaScript): detects errors in production processes, proposes fixes ordered by impact, waits for Telegram approval, applies changes, commits, restarts PM2, and reports exactly what it did. If something ends up worse than before, it runs git checkout automatically.
aragon-git-guardian: intercepts every git push, scans the repo with grep across 10 security categories, and if it finds anything calls claude_runner with the specific evidence for contextual analysis. No AI when there are no findings, no cost when the repo is clean.
Monthly cost for all these agents combined: $0 additional on top of the subscription.
How to replicate it

1. Install and authenticate Claude Code:

```shell
npm install -g @anthropic-ai/claude-code
claude login
```

2. Verify there is no API key in the environment:

```shell
unset ANTHROPIC_API_KEY
echo $ANTHROPIC_API_KEY  # should be empty
```

3. Test that the CLI uses the subscription:

```shell
claude -p "respond with just: ok"
# Should respond without asking for an API key
```

After that: spawn claude -p as a subprocess with spawn (Node.js) or subprocess (Python), parse the stream-json, and manage the process lifecycle.
The highest-impact decision is writing your own MCP server instead of using generic third-party MCPs. Your own context is cleaner, uses fewer tokens, and Claude enters the session already knowing how your system is built.

PMO Agent repo: github.com/Gumagonza1/pmo-agent

claude_runner repo: https://github.com/Gumagonza1/claude-runner

Self-healing server article: dev.to/gumagonza1/i-built-a-self-healing-production-server-using-claude-code-no-api-key-no-extra-cost-1eoo
