DEV Community

Custodia-Admin

Posted on • Originally published at pagebolt.dev

How to Optimize AI Agent Costs — Inference, API Calls, and Infrastructure


Agents are expensive. Every API call costs money. Every inference costs money. Every screenshot costs money. At scale, the bill adds up fast.

Your agent workflow might cost $0.02 per execution. That's fine for 100 runs. At 10,000 runs per month, you're paying $200. At 100,000 runs, you're at $2,000.

Here's how to cut those costs without sacrificing performance.

Where Agent Costs Live

1. Inference (LLM calls)

  • GPT-4: $0.03 per 1K input tokens
  • GPT-3.5: $0.0005 per 1K input tokens
  • Claude 3: $0.003 per 1K input tokens

A single agent workflow might make 5-10 LLM calls. Each call costs tokens. At scale, this dominates the budget.
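A quick way to see where the money goes is to price a workflow from its token counts. Here's a minimal sketch using the list prices above (the rates are the ones quoted in this post and will drift; check your provider's current pricing):

```python
# Approximate input-token prices per 1K tokens, from the list above.
# Treat these as placeholders -- providers change pricing often.
PRICE_PER_1K_INPUT = {
    "gpt-4": 0.03,
    "gpt-3.5": 0.0005,
    "claude-3": 0.003,
}

def workflow_cost(calls_per_run, model, tokens_per_call, runs):
    """Estimate monthly inference cost: calls/run x tokens/call x rate x runs."""
    rate = PRICE_PER_1K_INPUT[model]
    return calls_per_run * (tokens_per_call / 1000) * rate * runs

# 5 GPT-4 calls of ~1K input tokens each, 10,000 runs/month:
monthly = workflow_cost(calls_per_run=5, model="gpt-4",
                        tokens_per_call=1000, runs=10_000)
print(f"${monthly:,.2f}/month")  # -> $1,500.00/month
```

Run this against your own call counts before optimizing; it tells you whether inference is actually your biggest line item.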

2. API Calls

  • Stripe: $0 per call (but rate limits bite at high volume)
  • AWS API calls: $0.0000002 per call (negligible)
  • Third-party/custom APIs: depends on the provider's pricing

3. Infrastructure

  • Browser automation: Puppeteer, Playwright, Selenium = CPU-intensive
  • PageBolt API: Pay per screenshot/video
  • Hosting agents: EC2, Lambda, serverless containers

4. Data Transfer

  • Screenshots, videos, logs = bandwidth costs
  • S3 storage: $0.023 per GB/month

Cost Optimization Strategies

Strategy 1: Reduce Inference Calls

Not every step needs an LLM call. Many agent actions are deterministic.

# Bad: LLM decides every step
for page in pages:
    decision = llm.call(f"What should I do with {page}?")
    execute(decision)

# Good: Use logic for deterministic steps, LLM only for ambiguous ones
for page in pages:
    if page.matches_pattern(expected_format):
        execute_deterministic_action(page)
    else:
        decision = llm.call(f"How should I handle this unexpected format: {page}?")
        execute(decision)

Result: 80% fewer LLM calls. Cost drops from $200 to $40.

Strategy 2: Batch Processing

Process multiple items in a single LLM call instead of one-by-one.

# Bad: 1 LLM call per item
for item in items:
    classification = llm.call(f"Classify this: {item}")

# Good: Batch 10 items per LLM call
for batch in chunks(items, size=10):
    classifications = llm.call(f"Classify these 10 items: {batch}")

Result: 10x fewer calls. Cost drops proportionally.
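The `chunks` helper used above isn't a built-in; a minimal version looks like this:

```python
def chunks(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 25 items in batches of 10 -> 3 LLM calls instead of 25.
batches = list(chunks(list(range(25)), size=10))
```

One caveat: the model must return per-item results in a parseable format (e.g. a JSON array in the same order as the inputs), so batching trades a little prompt engineering for far fewer calls.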

Strategy 3: Caching and Memoization

Agent workflows often repeat the same tasks. Cache the results.

cache = {}

def process_page(url):
    if url in cache:
        return cache[url]

    result = agent.process(url)
    cache[url] = result
    return result

If 30% of your agent's work is duplicate, caching eliminates that cost.

Result: 30% cost reduction for repeated workflows.
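For single-process agents, Python's `functools.lru_cache` does the same job as the bare dict above, with a size bound so the cache can't grow forever. The `expensive_agent_process` function here is a stand-in for your real workflow call:

```python
from functools import lru_cache

calls = {"count": 0}

def expensive_agent_process(url):
    # Stand-in for the real agent workflow (LLM calls, scraping, etc.).
    calls["count"] += 1
    return f"processed:{url}"

@lru_cache(maxsize=1024)  # bounded, unlike a bare dict
def process_page(url: str) -> str:
    return expensive_agent_process(url)

process_page("https://example.com/a")
process_page("https://example.com/a")  # cache hit: no second agent call
```

If your agents run across multiple processes or machines, swap the in-memory cache for a shared store such as Redis; the pattern is the same.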

Strategy 4: Use Cheaper Models for Initial Filtering

Use GPT-3.5 for simple filtering, GPT-4 only for complex reasoning.

# Tier 1: GPT-3.5 for classification
initial_category = gpt35.call(f"Is this a support ticket or sales inquiry? {message}")

# Tier 2: GPT-4 only if ambiguous
if initial_category == "ambiguous":
    refined_category = gpt4.call(f"Deeper analysis required for: {message}")

Result: 90% of work done at 1/60th the cost.
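The arithmetic behind that claim, assuming every item goes through the cheap model and a fraction (say 10%) escalates to GPT-4. The per-call prices follow the ~1K-token calls priced earlier in this post and are assumptions, not guarantees:

```python
GPT4_PER_CALL = 0.03     # ~1K input tokens at $0.03/1K
GPT35_PER_CALL = 0.0005  # ~1K input tokens at $0.0005/1K (60x cheaper)

def blended_cost(n_items, escalation_rate):
    """Every item hits GPT-3.5; only a fraction also hits GPT-4."""
    return n_items * GPT35_PER_CALL + n_items * escalation_rate * GPT4_PER_CALL

all_gpt4 = 10_000 * GPT4_PER_CALL    # $300: everything through GPT-4
tiered = blended_cost(10_000, 0.10)  # $5 + $30 = $35
```

Even after paying for the escalated calls twice (cheap pass plus expensive pass), the blended cost here is under an eighth of running everything through GPT-4.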

Strategy 5: Minimize Screenshots/Videos

Screenshots and videos are expensive to store and process. Capture selectively.

# Bad: Screenshot every step
for step in workflow:
    execute(step)
    screenshot()  # 10 screenshots per workflow

# Good: Screenshot only critical steps
critical_steps = ["login", "form_submission", "confirmation"]
for step in workflow:
    execute(step)
    if step.name in critical_steps:
        screenshot()  # 3 screenshots per workflow

Result: 70% fewer screenshots. Saves on storage and API costs.

Strategy 6: Optimize API Calls

Not all APIs are equal. Some are slow, some are expensive.

# Bad: one API call per transaction
for transaction in transactions:
    payments_api.create_charge(transaction)  # 1 API call per transaction

# Good: use the provider's batch endpoint, where one exists
payments_api.create_charges_batch(transactions)  # 1 API call for 100 transactions

Result: 100x fewer API calls.
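Not every provider offers a batch endpoint. When one doesn't, you can't reduce the number of billable calls, but you can cut wall-clock time by pipelining requests concurrently. A sketch with Python's standard library; `submit_transaction` is a hypothetical stand-in for one network call to your payments API:

```python
from concurrent.futures import ThreadPoolExecutor

def submit_transaction(txn):
    # Hypothetical stand-in for one blocking network call.
    return {"id": txn["id"], "status": "ok"}

transactions = [{"id": i, "amount": 100} for i in range(100)]

# 10 calls in flight at a time instead of 100 sequential round-trips.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(submit_transaction, transactions))
```

Keep `max_workers` below the provider's rate limit, or the retries will cost you more than the sequential loop did.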

Real-World Example

Agent workflow: "Process 10,000 customer support tickets per month"

Original costs:

  • 5 LLM calls per ticket × 10,000 tickets = 50,000 GPT-4 calls (~1K input tokens each) = $1,500
  • 3 screenshots per ticket × 10,000 = 30,000 screenshots = $300
  • Infrastructure: $200
  • Total: $2,000/month

Optimized:

  • Tier 1 (GPT-3.5): 10,000 calls = $5
  • Tier 2 (GPT-4, 20% of tickets): 2,000 calls = $60
  • Batch processing on Tier 2: 80% fewer calls, so $12 instead of $60
  • Selective screenshots (3 → 1 per ticket): $100
  • Infrastructure: $200
  • Total: $317/month

Savings: 84%

Where to Focus First

  1. Identify your biggest cost driver — Is it inference? Screenshots? API calls?
  2. Optimize that single thing — Often yields 50%+ savings
  3. Iterate — Move to the next biggest cost driver

Most teams can cut costs by 60-80% with targeted optimizations.


Try it free: 100 requests/month on PageBolt—optimize your agent workflow costs while maintaining visibility into every action. No credit card required.
