
AgencyBoxx

Posted on • Originally published at agencyboxx.com

How We Run 12 AI Agents for $3/Day: OpenClaw Token Management

Burning $50 in API credits in under two hours taught us a hard lesson: AI token spend can quickly kill a multi-agent system budget if you're not careful. This wasn't just a cost blip; it was a foundational moment for our team. Today, we run 12 distinct AI agents across three instances, serving over 75 concurrent clients and processing more than 700 email actions daily, all while keeping our total AI token cost consistently between $2.50 and $3.00 per day.

This isn't some theoretical best practice; it's our production reality. We built our entire OpenClaw architecture around this principle: make AI economically sustainable at scale. If you're building agent-based systems and want to avoid the token cost spiral, this post is for you. I'm sharing the exact strategies, model routing decisions, and architectural choices that keep our OpenClaw token management cost predictable and low.

The $50 Burn: Why AI Costs Explode Without a Strategy

When we first deployed a new agent, we made a classic mistake: we let it run wild, routing every task to a premium, expensive model like GPT-4. It felt right at the time – maximum intelligence for every decision! The result? A $50 bill in less than two hours. It was a painful, but incredibly valuable, lesson. We realized that for anyone managing an agency's operations or a dev team's budget, unpredictable, consumption-based AI costs are a non-starter.

Every API call to models like Claude, GPT-4, or Gemini incurs a charge based on input and output tokens. If you have a system running 50+ background services, and every minor task triggers a call to a frontier model, your budget will hemorrhage. This is why token cost became an operational problem for us, not just a development one. We needed a deliberate strategy, not just a "hope for the best" approach.

Our 80/20 Rule: The Core of Cost Optimization

Here's the critical insight we gained: most multi-agent tasks don't require frontier-model intelligence. Our production data shows that roughly 80% of agent activity involves tasks where smaller, cheaper models (or even free local models) perform on par with their more expensive counterparts. This 80/20 rule is the cornerstone of effective OpenClaw cost optimization.

What kind of tasks fall into the 80%? Think about it:

  • Simple Classification: Is this email a sales lead, support request, or internal communication?
  • Data Extraction: Pulling a name, email address, or order number from structured text.
  • Summarization of Short Texts: Condensing a few sentences into a single one.
  • Sentiment Analysis: Basic positive/negative detection.
  • Routing Decisions: Deciding which specialized agent should handle a request based on keywords.

For these tasks, a model like gpt-3.5-turbo, claude-3-haiku, or even a locally hosted Llama 3 8B can often do the job just as well as gpt-4-turbo or claude-3-opus. The cost difference is orders of magnitude. For the remaining 20% of tasks that genuinely demand high-level reasoning (complex problem-solving, creative content generation, nuanced interpretation), we ensure premium models are only engaged after the input has been compressed and filtered. This means only the minimum, most critical context reaches the expensive model.
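To make "orders of magnitude" concrete, here's a back-of-the-envelope calculation. The per-1K-token prices are illustrative placeholders, not current published rates, so plug in your provider's actual pricing:

```python
# Back-of-the-envelope cost comparison. Prices below are illustrative
# placeholders per 1K input tokens, NOT current published rates.
PRICE_PER_1K_INPUT = {
    "gpt-4-turbo": 0.01,      # assumed premium-tier price
    "gpt-3.5-turbo": 0.0005,  # assumed standard-tier price
    "local_llama_3_8b": 0.0,  # marginal cost once the hardware is paid for
}

def daily_cost(calls_per_day: int, avg_input_tokens: int, model: str) -> float:
    """Rough daily input-token spend for one task type on one model."""
    return calls_per_day * (avg_input_tokens / 1000) * PRICE_PER_1K_INPUT[model]

# ~700 email actions/day at ~500 input tokens each:
# gpt-4-turbo   -> roughly $3.50/day for this one task type alone
# gpt-3.5-turbo -> roughly $0.18/day, a 20x gap before local models even enter
```

With these assumed prices, a single high-volume task type on a premium model would blow the entire $3/day budget by itself, which is exactly why routing the 80% downmarket matters.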

Implementing Smart Model Routing: Our ModelRouter Pattern

To put the 80/20 rule into practice, we developed a ModelRouter component. This isn't just an if/else statement; it's a dedicated service that intercepts agent requests and intelligently directs them to the most cost-effective model capable of handling the task. This is perhaps the single most impactful architectural decision we made.

Here's a simplified look at how our ModelRouter might work. Imagine an agent needs to perform a classify_email action:

# src/agents/model_router.py

class ModelRouter:
    def __init__(self, config):
        self.config = config
        self.premium_model = config.get('PREMIUM_MODEL', 'gpt-4-turbo')
        self.standard_model = config.get('STANDARD_MODEL', 'gpt-3.5-turbo')
        self.local_model_endpoint = config.get('LOCAL_LLM_ENDPOINT', 'http://localhost:11434/api/generate')

    def route_task_to_model(self, task_type: str, prompt: str, complexity_score: float = 0.5):
        if task_type == 'classify_email':
            # Simple classification, can use cheaper or local model
            if len(prompt.split()) < 100 and complexity_score < 0.3:
                print(f"Routing '{task_type}' to local LLM for efficiency.")
                return 'local_llama_3_8b'
            else:
                print(f"Routing '{task_type}' to standard cloud LLM.")
                return self.standard_model

        elif task_type == 'complex_reasoning_and_planning':
            # Requires advanced reasoning, use premium model
            print(f"Routing '{task_type}' to premium cloud LLM.")
            return self.premium_model

        elif task_type == 'summarize_short_text':
            # Simple summarization, standard model is fine
            print(f"Routing '{task_type}' to standard cloud LLM.")
            return self.standard_model

        # Default to standard if no specific rule matched
        print(f"Routing '{task_type}' to standard cloud LLM by default.")
        return self.standard_model

# Example usage in an agent
# agent_router = ModelRouter(app_config)
# chosen_model = agent_router.route_task_to_model('classify_email', email_content, email_complexity)
# response = llm_client.call(chosen_model, prompt)

This ModelRouter allows us to define specific criteria for each task type. We use simple heuristics like prompt length and a dynamically calculated complexity_score (derived from keyword density, entity count, and previous model error rates) to make routing decisions. This isn't just about saving money; it's about optimizing resource allocation for every task.
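To illustrate the kind of heuristic we mean, here's a hypothetical complexity_score sketch built from keyword density plus a crude named-entity proxy. The keyword list, weights, and normalization are invented for this example; our production scorer also folds in previous model error rates:

```python
# Hypothetical complexity_score heuristic, clamped to [0, 1].
# Keyword list and weights are illustrative, not production values.
REASONING_KEYWORDS = {"why", "plan", "strategy", "compare", "tradeoff", "negotiate"}

def complexity_score(prompt: str) -> float:
    words = prompt.split()
    if not words:
        return 0.0
    hits = sum(w.lower().strip(".,?!") in REASONING_KEYWORDS for w in words)
    keyword_density = hits / len(words)
    # Capitalized tokens after the first word serve as a rough entity proxy.
    entity_count = sum(1 for w in words[1:] if w[:1].isupper())
    entity_factor = min(entity_count / 10, 1.0)
    # Weighted blend; keyword density dominates, clamped to [0, 1].
    return min(10 * 0.7 * keyword_density + 0.3 * entity_factor, 1.0)
```

An agent would compute this once per request and pass it into route_task_to_model alongside the task type.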

The Power of Local LLMs for the 80%

One of the most significant advancements in our cost-saving strategy has been the integration of local Large Language Models. For those 80% of tasks, especially simple classifications or data extractions, running a model like Llama 3 8B or Mistral 7B on our own infrastructure via tools like Ollama or llama.cpp is virtually free after the initial hardware cost.

We run a dedicated Ollama instance on a small GPU-enabled server (an old gaming PC, honestly) that handles a significant portion of our classify_email and summarize_short_text tasks. This isn't just about zero token cost; it also offers:

  • Privacy: Sensitive data never leaves our network.
  • Speed: Latency can be lower than cloud APIs for specific tasks.
  • Control: We can fine-tune these models without incurring cloud costs for every iteration.

Here's how you might integrate a local LLM call into your agent's workflow, assuming your ModelRouter decides to use it:

# src/agents/llm_client.py
import requests
import json

class LLMClient:
    def __init__(self, local_endpoint: str = 'http://localhost:11434/api/generate'):
        self.local_endpoint = local_endpoint

    def call(self, model_name: str, prompt: str, temperature: float = 0.7):
        if model_name.startswith('local_'): # Our convention for local models
            return self._call_local_llm(model_name.replace('local_', ''), prompt)
        else:
            # Placeholder for actual cloud LLM calls (e.g., OpenAI, Anthropic clients)
            print(f"Calling cloud LLM: {model_name} with prompt: {prompt[:50]}...")
            # In a real scenario, this would use the appropriate SDK
            # For example: openai.chat.completions.create(model=model_name, messages=[...])
            return f"[Cloud LLM Response for {model_name}] {prompt.upper()}"

    def _call_local_llm(self, model_id: str, prompt: str):
        try:
            payload = {
                "model": model_id,
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.1}
            }
            # requests sets the JSON headers for us; the timeout stops a hung
            # local endpoint from stalling the whole agent loop.
            response = requests.post(self.local_endpoint, json=payload, timeout=30)
            response.raise_for_status() # Raise an exception for HTTP errors
            return response.json()['response']
        except requests.exceptions.ConnectionError:
            print(f"Local LLM endpoint {self.local_endpoint} not reachable. Falling back to standard cloud model.")
            # Fallback strategy: if local LLM is down, route to a cheap cloud model
            return self.call('gpt-3.5-turbo', f"Fallback: {prompt}") # Recursive call with fallback
        except Exception as e:
            print(f"Error calling local LLM {model_id}: {e}")
            return self.call('gpt-3.5-turbo', f"Error fallback: {prompt}")

# Example of using the client after routing
# llm_client = LLMClient()
# router = ModelRouter(app_config)
# chosen_model = router.route_task_to_model('classify_email', 'Is this a new client lead?', 0.2)
# response = llm_client.call(chosen_model, 'Is this a new client lead?')
# print(response)

This setup allows for a resilient system where local LLMs handle the bulk, and cloud models act as a fallback or for high-complexity tasks. The ConnectionError fallback is a crucial "gotcha" – always have a plan B when relying on local infrastructure.
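One way to formalize that plan B is a small circuit breaker in front of the local endpoint, so repeated failures stop costing a connection attempt (and its timeout) on every single task. This is a generic sketch with names of our own invention, not OpenClaw's actual implementation:

```python
import time

# Minimal circuit-breaker sketch: after `max_failures` consecutive local-LLM
# errors, route straight to the cloud fallback for `cooldown_s` seconds
# instead of retrying a dead endpoint on every task.
class LocalLLMCircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 60.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0

    def local_allowed(self) -> bool:
        if self.failures < self.max_failures:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.failures = 0  # half-open: let one request probe the endpoint
            return True
        return False
```

The ModelRouter would consult local_allowed() before returning a local model, and the LLMClient would call record_failure() from its ConnectionError handler and record_success() after a good response.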

Input Compression and Filtering for Premium Models

Even when we do need a premium model for that 20% of complex tasks, we optimize the input. This is critical because token costs are directly tied to the length of your input and output. We implement a multi-stage filtering and compression pipeline:

  1. Initial Filtering: Remove irrelevant boilerplate, signatures, or quoted history from emails before passing them to the agent.
  2. Contextual Summarization (Cheap Model First): For lengthy documents that still need premium analysis, we first pass them through a cheap gpt-3.5-turbo or local LLM to extract the absolute core information relevant to the task. This summarized context is then fed to the premium model.
  3. Keyword/Entity Extraction: Instead of sending an entire document, sometimes we only need specific entities or keywords. A cheap model can extract these, and only the extracted data is sent to the premium model for further reasoning.

This isn't about dumbing down the input; it's about intelligent distillation. We ensure the premium model receives only the most pertinent, high-signal tokens, maximizing the value of every dollar spent.
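The first stage of that pipeline can be as simple as a line filter. This sketch handles a few common email conventions (quoted ">" lines, "On ... wrote:" reply headers, the "--" signature delimiter); a production rule set would be broader:

```python
import re

# Stage-1 filtering sketch: strip quoted history and signatures from an
# email body before any model sees it. Patterns cover common conventions
# only, not an exhaustive production rule set.
def strip_email_boilerplate(body: str) -> str:
    kept = []
    for line in body.splitlines():
        if line.startswith(">"):  # quoted reply history
            continue
        if re.match(r"^On .+ wrote:$", line.strip()):  # reply header
            continue
        if line.strip() == "--":  # conventional signature delimiter
            break
        kept.append(line)
    return "\n".join(kept).strip()
```

Every character removed here is a token the premium model never has to pay for, and the filter itself costs nothing.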

Key Takeaways

  1. Embrace the 80/20 Rule: Most AI agent tasks don't need expensive frontier models. Identify and route simpler tasks to cheaper or local LLMs.
  2. Implement a ModelRouter: Create a dedicated component to intelligently direct requests to the most cost-effective model based on task type, prompt length, and inferred complexity.
  3. Leverage Local LLMs: For privacy, speed, and zero token costs, integrate local models (e.g., via Ollama) for high-volume, low-complexity tasks. Always plan for a cloud fallback.
  4. Optimize Input for Premium Models: Even for complex tasks, pre-filter, summarize, or extract key information using cheaper methods before engaging expensive LLMs to minimize token consumption.
  5. Monitor Costs Relentlessly: Continuously track your token usage and costs. Our $3/day budget wasn't an accident; it was the result of constant iteration and optimization based on real-world spend data.
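Relentless monitoring can start small. Here's a minimal per-model spend tracker; the per-1K-token prices are placeholders chosen only to keep the sketch self-contained:

```python
from collections import defaultdict

# Minimal cost-tracking sketch. Prices are illustrative placeholders,
# expressed as (input, output) dollars per 1K tokens.
class TokenCostTracker:
    PRICES = {
        "gpt-4-turbo": (0.01, 0.03),
        "gpt-3.5-turbo": (0.0005, 0.0015),
        "local_llama_3_8b": (0.0, 0.0),
    }

    def __init__(self):
        self.spend = defaultdict(float)

    def record(self, model: str, input_tokens: int, output_tokens: int):
        in_p, out_p = self.PRICES.get(model, (0.0, 0.0))
        self.spend[model] += (input_tokens / 1000) * in_p + (output_tokens / 1000) * out_p

    def total(self) -> float:
        return sum(self.spend.values())
```

Wrap every LLM call in a record() and alert when total() crosses a daily threshold; that feedback loop is what turned our spend from a surprise into a dial we control.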

What are your biggest challenges in managing AI agent costs in production? Share your thoughts below!


Originally published at https://agencyboxx.com/blog/openclaw-token-management-cost
