Let's be real. You’ve seen the "BabyAGI" demos. You’ve watched agents spin up, browse the web, write code, and... run up a massive API bill.
There's a moment of pure terror every developer feels the morning after letting an autonomous agent run wild. You slowly open the OpenAI billing dashboard, peeking through your fingers, half-expecting to see a number that looks less like a server cost and more like a down payment on a car.
The hard truth is: LLM calls are expensive. And autonomous agents are, by their very nature, loops that make a lot of LLM calls.
But what if they didn't have to be? What if you could build, test, and even deploy agents that are smart, capable, and—most importantly—don't require you to sell a kidney?
Good news. You can. The secret isn't in the AI model itself; it's in the scaffolding you build around it. It's about control, budgeting, and making smart architectural choices. We’re going to dive into how to do this, using real-world code examples from the Flowork platform, which is designed from the ground up to solve this exact problem.
Strategy 1: Ditch the Pricey APIs (When You Can)
The most straightforward way to save money is to... well, not spend it.
The primary cost of 99% of AI agents is the call to a proprietary, high-end model like GPT-4, Claude 3 Opus, or Gemini Advanced. These models are incredible, but they're also priced like premium, handcrafted goods.
The solution? Go local.
For the last year, the open-source community has been on fire. Models like Llama 3, Mistral 7B, and Phi-3 are now so good that they can handle a massive chunk of agent-related tasks.
Think about what an agent really does. It’s a loop:
- Observe: Look at the current state.
- Think: Decide what to do next.
- Act: Execute a tool (e.g., Google search, read a file).
- Repeat.
While the "Think" step can be a complex, multi-step reasoning process, a lot of the time it’s simple classification or routing.
- "Based on this user prompt, should I use the
Google Searchtool or theread_filetool?" - "Is this user's sentiment positive, negative, or neutral?"
- "Summarize this block of text."
You don't need a $20-per-million-token model for that. A local Mistral 7B model running on your own hardware (or a cheap GPU instance) can handle it almost instantly, at essentially zero marginal cost.
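Here's a minimal sketch of that routing idea (not Flowork's code; the model file, the model name, and the needs_deep_reasoning heuristic are placeholders): send the cheap "Think" steps to a local GGUF model and keep the paid API for the rare hard cases.

# Illustrative sketch only -- model paths, model names, and the routing heuristic are placeholders.
from llama_cpp import Llama   # pip install llama-cpp-python
from openai import OpenAI     # pip install openai

local_llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096, verbose=False)
paid_client = OpenAI()        # reads OPENAI_API_KEY from the environment

def needs_deep_reasoning(prompt: str) -> bool:
    # Placeholder heuristic: only long, multi-step prompts go to the paid API.
    return len(prompt) > 2000

def think(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    if needs_deep_reasoning(prompt):
        # Premium route: pay per token.
        resp = paid_client.chat.completions.create(model="gpt-4o", messages=messages)
        return resp.choices[0].message.content
    # Free route: local GGUF model, zero marginal cost.
    resp = local_llm.create_chat_completion(messages=messages, max_tokens=256)
    return resp["choices"][0]["message"]["content"]

# The agent loop calls think() for its routing/classification steps:
print(think("Should I use the Google Search tool or the read_file tool for: 'open notes.txt'?"))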
How Flowork Implements This
This isn't just theory. A smart agent platform builds this choice right into its core. In Flowork, the AIProviderManagerService is the central "router" that decides which AI to call.
When a task needs AI, it doesn't just hardcode openai.chat.completions.create(). It calls its internal manager, which then checks what kind of AI the user assigned to that task.
Look at this snippet from ai_provider_manager_service.py. This is the brain of the operation:
#
# A snippet from C:\FLOWORK\flowork-core\flowork_kernel\services\ai_provider_manager_service\ai_provider_manager_service.py
def query_ai_by_task(
    self, task_type: str, prompt: str, endpoint_id: str = None, **kwargs
) -> dict:
    # ... (logic to find the target_endpoint_id) ...
    if not target_endpoint_id:
        return {
            "error": f"No default or specified AI model is configured for task type '{task_type}'."
        }
    # PATH 1: The Expensive Route (Paid API)
    if target_endpoint_id in self.loaded_providers:
        provider = self.get_provider(target_endpoint_id)
        if provider:
            is_ready, msg = provider.is_ready()
            if is_ready:
                return provider.generate_response(prompt, **kwargs)
            else:
                return {
                    "error": f"Provider '{target_endpoint_id}' for task '{task_type}' is not ready: {msg}"
                }
        # ... (error handling) ...
    # PATH 2: The "Free" Route (Local Model)
    elif target_endpoint_id.startswith("(Local Model)"):
        model_info = self.local_models.get(target_endpoint_id)
        if not model_info:
            return {
                "error": f"Local model '{target_endpoint_id}' not found in the manager's index."
            }
        model_type = model_info.get("type")
        if model_type == "gguf":
            if not LLAMA_CPP_AVAILABLE:
                return {
                    "error": "Library 'llama-cpp-python' is required to use local GGUF models."
                }
            model_full_path = model_info.get("full_path")
            # ... (error handling) ...
            try:
                # This is the magic: it runs a *local script* instead of an API call
                worker_path = os.path.join(
                    self.kernel.project_root_path,
                    "flowork_kernel",
                    "workers",
                    "ai_worker.py",
                )
                gpu_layers_setting = self.loc.get_setting("ai_gpu_layers", 40)
                command = [
                    sys.executable,
                    worker_path,
                    model_full_path,
                    str(gpu_layers_setting),
                ]
                # ... (call the subprocess) ...
                process = subprocess.run(
                    command,
                    input=prompt,
                    # ... (other args) ...
                )
                if process.returncode == 0:
                    return {"type": "text", "data": process.stdout}
                else:
                    return {
                        "type": "text",
                        "data": f"ERROR: AI Worker process failed: {process.stderr}",
                    }
            except Exception as e:
                self.logger.critical(
                    f"Error calling local GGUF worker: {e}"
                )
                return {"error": str(e)}
        # ... (other local model types) ...
    else:
        return {
            "error": f"Unsupported or unknown AI endpoint type for task '{task_type}': {target_endpoint_id}"
        }
This is a beautiful piece of architecture. The system doesn't care if the "AI Provider" is a billion-dollar API or a 4GB GGUF file on your hard drive. It just makes the request, and the manager routes it.
And what does that ai_worker.py look like? It's a clean, isolated Python script that does one thing: load a local model with llama-cpp-python and answer a prompt.
#
# This is the *entire* worker script: C:\FLOWORK\flowork-core\flowork_kernel\workers\ai_worker.py
import sys
import json
import io
from llama_cpp import Llama

def main():
    """
    This worker script is designed to be called by a subprocess.
    It loads a GGUF model, takes a prompt from stdin,
    generates a response, and prints the response to stdout.
    This isolates the memory-intensive AI operations from the main application.
    """
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')
    sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8', errors='replace')
    if len(sys.argv) < 3:
        print(json.dumps({"error": "Worker requires model_path and n_gpu_layers arguments."}))
        sys.exit(1)
    model_path = sys.argv[1]
    n_gpu_layers_from_arg = int(sys.argv[2])
    print(f"AI Worker: Initializing. Loading model: {model_path} with n_gpu_layers={n_gpu_layers_from_arg}", file=sys.stderr)
    try:
        prompt = sys.stdin.read()
        llm = Llama(
            model_path=model_path,
            n_ctx=8192,
            n_gpu_layers=n_gpu_layers_from_arg,
            verbose=False
        )
        messages = [{"role": "user", "content": prompt}]
        response = llm.create_chat_completion(
            messages=messages,
            max_tokens=2048,
            temperature=0.2
        )
        raw_suggestions = response['choices'][0]['message']['content'].strip()
        print(raw_suggestions)  # Send result to stdout
    except Exception as e:
        error_response = {"error": f"AI Worker failed: {e}", "traceback": str(e)}
        print(json.dumps(error_response), file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()
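If you want to sanity-check a worker like this on its own, you can drive it the same way the manager does: spawn it as a subprocess, pipe the prompt in on stdin, and read the answer from stdout. A quick test harness (the model path and layer count are placeholders):

# Quick standalone check of the worker -- model path and GPU layer count are placeholders.
import subprocess
import sys

process = subprocess.run(
    [
        sys.executable,
        "flowork_kernel/workers/ai_worker.py",
        "./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder model file
        "40",                                        # n_gpu_layers
    ],
    input="Reply with exactly one word: ping",
    capture_output=True,
    text=True,
)
print("stdout:", process.stdout.strip())  # the model's answer
print("stderr:", process.stderr.strip())  # init log and any error JSON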
Takeaway: Your first step to a cheap agent is to stop hardcoding OpenAI. Build a routing layer that can direct tasks to a "free" local model or a "premium" API, and use the free one 90% of the time.
Strategy 2: Put Your Agent on a Leash (The "Gas" Model)
Okay, so you've swapped your "brain" (the LLM) for a free, local one. You're safe now, right?
Wrong.
What about the "Act" step? Your agent's tools can be just as dangerous.
- An agent in a loop calling Google Search can burn through your Search API quota.
- An agent in a loop calling fs_write can fill up your hard drive.
- An agent in a loop calling shell_exec can... well, God help you.
The problem is the loop. You need a way to kill the agent if it starts "misbehaving" or gets stuck.
The solution: Give your agent a finite budget.
This is exactly how Ethereum works. Every operation—adding two numbers, storing a value—costs a tiny amount of "gas." When you run out of gas, the program stops, period.
We can apply the exact same logic to an AI agent. We can create a "price list" for every tool it's allowed to use.
How Flowork Implements This
This is, in my opinion, the most brilliant part of Flowork's design. It's called the Flowork Agent Contract (FAC), and it's enforced by the AgentContext.
Step 1: The "Price List"
First, define the cost of every "dangerous" operation. This is just a simple dictionary.
#
# From: C:\FLOWORK\flowork-core\flowork_kernel\context.py
GAS_COSTS = {
    "HTTP_FETCH": 15,
    "FS_READ_KB": 1,
    "FS_WRITE_KB": 2,
    "EPISODIC_WRITE": 5,
    "EPISODIC_READ": 2,
    "AGENT_TOOL_CALL": 10,
    "SHELL_EXEC": 25
}
Simple, right? Making a web request costs 15 "gas." Running a shell command costs 25. Reading a file costs 1 gas unit per kilobyte.
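To get a feel for what a budget actually buys, the arithmetic is simple. A back-of-the-envelope sketch (assuming the per-kilobyte costs scale linearly, as the _KB suffix suggests):

# Back-of-the-envelope budget math using the price list above.
GAS_COSTS = {"HTTP_FETCH": 15, "FS_READ_KB": 1, "SHELL_EXEC": 25}

budget = 10_000  # total gas granted to one agent run

print(budget // GAS_COSTS["HTTP_FETCH"])           # 666 web requests, if it did nothing else
print(budget // GAS_COSTS["SHELL_EXEC"])           # 400 shell commands, if it did nothing else
print(budget // (500 * GAS_COSTS["FS_READ_KB"]))   # 20 reads of a 500 KB file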
Step 2: The "Bouncer" (Budget Enforcement)
Next, you need a function that enforces this price. In Flowork, this is the _enforce_gas method within the AgentContext. This function is the "bouncer" at the club.
#
# From: C:\FLOWORK\flowork-core\flowork_kernel\context.py
def _enforce_gas(self, cost: int, operation_name: str):
    if self.kill_flag:
        raise Exception(f"Agent (ID: {self.agent_id}) is terminated. Cannot perform '{operation_name}'.")
    # This is the check. fac_runtime holds the budget.
    if not self.fac_runtime.consume_gas(cost):
        # Not enough gas!
        self.kill_flag = True
        self.timeline.log(
            event_type="agent_killed",
            data={"reason": "OUT_OF_GAS", "failed_operation": operation_name, "gas_spent": self.fac_runtime.get_gas_spent()}
        )
        # Kill the agent by raising an exception
        raise Exception(f"Out of Gas! Agent (ID: {self.agent_id}) terminated. Failed on '{operation_name}'.")
    # If we're here, we had enough gas. Log the purchase.
    self.timeline.log(
        event_type="gas_spent",
        data={"operation": operation_name, "cost": cost, "total_spent": self.fac_runtime.get_gas_spent()}
    )
This is so clean. The self.fac_runtime.consume_gas(cost) line (which we'll see next) tries to "spend" the gas. If it fails, the agent is immediately killed.
Step 3: The "Wallet" (The Budget Itself)
The fac_runtime (Flowork Agent Contract Runtime) holds the "wallet." It's just a simple class that tracks the budget.
#
# From: C:\FLOWORK\flowork-core\flowork_kernel\fac_enforcer.py
class BudgetMeter:
    """ (English Hardcode) Simple gas meter. """
    def __init__(self, total: int):
        self.total = int(total)
        self.used = 0

    def remaining(self) -> int:
        return max(0, self.total - self.used)

    def consume(self, units: int) -> bool:
        if units < 0:
            units = 0
        if self.used + units > self.total:
            # Not enough gas! Return False.
            return False
        self.used += units
        # Gas spent. Return True.
        return True

class FacRuntime:
    """
    (English Hardcode) Bind FAC to executable capability instances with budget & memory access.
    """
    def __init__(self, fac_dict: Dict[str, Any]):
        # ... (other validation) ...
        # Here's the wallet!
        self.budget = BudgetMeter(int(fac_dict.get("budget_gas", 0)))
        # ... (other init) ...
When an agent is created (via boot_agent in context.py), it's given a fac_dict (a contract) that specifies its budget_gas. This creates the BudgetMeter with a full tank.
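In practice, that contract is just a dictionary. Something like the following (only budget_gas appears in the snippets above; the other keys are illustrative guesses, not Flowork's documented schema):

# Hypothetical contract -- only "budget_gas" is confirmed by the code above;
# the other keys are illustrative.
fac_dict = {
    "budget_gas": 10_000,                      # the agent's entire allowance for this run
    "permissions": {                           # assumed: whatever _enforce_permission checks
        "http": ["https://api.example.com/*"],
        "shell": [],
    },
}

runtime = FacRuntime(fac_dict)
print(runtime.budget.remaining())              # 10000 -- a full tank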
Step 4: Collecting the "Tax"
Finally, you just have to "tax" every dangerous action. You hook into every tool call and run the bouncer (_enforce_gas) before you execute the tool.
#
# From: C:\FLOWORK\flowork-core\flowork_kernel\context.py
def http_fetch(self, url: str, method: str = "GET", headers: dict = None, json_data: dict = None, params: dict = None) -> dict:
    """(Refactor R4) Check permissions FIRST, then gas."""
    # 1. (Check permissions - not shown)
    self._enforce_permission("http", url, "http_fetch")
    # 2. COLLECT THE TAX
    self._enforce_gas(GAS_COSTS["HTTP_FETCH"], "http_fetch")
    # 3. If the tax was paid (no exception), run the action.
    response = None
    log_data = {"url": url, "method": method}
    try:
        response = self.http_client.request(
            method=method, url=url, headers=headers, json=json_data, params=params
        )
        # ... (rest of the function) ...
    finally:
        self.timeline.log("http_fetch", log_data)

def shell_exec(self, command: str) -> dict:
    """(R4 NEW) New action: shell execution (shell_exec)."""
    # 1. Check permissions
    self._enforce_permission("shell", command, "shell_exec")
    # ... (chaos injection for testing) ...
    log_data = {"command": command}
    try:
        # 2. COLLECT THE (bigger) TAX
        self._enforce_gas(GAS_COSTS["SHELL_EXEC"], "shell_exec")
        # 3. Run the action
        result = subprocess.run(
            command,
            shell=True,
            capture_output=True,
            text=True,
            timeout=10,
            check=True
        )
        # ... (rest of the function) ...
    finally:
        self.timeline.log("shell_exec", log_data)
This is a robust, non-negotiable budget. The agent can't "forget" to pay or find a loophole. The AgentContext is its only connection to the outside world, and that context is the tax collector.
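You can watch the whole mechanism work in a dozen lines using the BudgetMeter and GAS_COSTS from above: give the meter a tiny budget, charge it the HTTP_FETCH price in a loop, and it refuses the call that would overdraw it.

# Simulating a runaway request loop against the BudgetMeter shown earlier.
meter = BudgetMeter(total=100)

for i in range(1, 20):
    if not meter.consume(GAS_COSTS["HTTP_FETCH"]):  # 15 gas per request
        print(f"Request #{i} refused: out of gas ({meter.remaining()} left). Agent killed.")
        break
    print(f"Request #{i} allowed, {meter.remaining()} gas remaining.")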
Takeaway: Don't just trust your agent. Give it a small allowance ("gas") and kill it when its wallet is empty. This is your ultimate safety net against runaway costs.
Strategy 3: The "Swarm" - Parallelism for Pennies
Here's an advanced concept: what if you need to run 10 tasks at once?
Most agents are sequential: Think -> Act -> Think -> Act. This is slow. A Swarm Architecture lets you "fan out" a task to multiple agents at once and aggregate the results.
This sounds expensive, but it's actually a massive cost-saving feature if you combine it with our first two strategies.
- Fan-out to Local Models: Spin up 10 local agents. They all work in parallel for free.
- Set a "Quorum": You don't always need all 10 agents to finish. Maybe you just need the first good answer.
- Timeouts and Retries: A good swarm coordinator, like Flowork's, will manage timeouts. If an agent gets "stuck" (or you just kill it), the whole system doesn't collapse.
Look at the SwarmCoordinator from swarm.py. It's built to manage this.
#
# From: C:\FLOWORK\flowork-core\flowork_kernel\swarm.py
def fan_out(self,
            engine_ids: List[str],
            task: Dict[str, Any],
            *,
            quorum: str = "all",  # <-- The magic word
            per_engine_timeout_s: Optional[float] = None,
            retries: Optional[int] = None,
            backoff_base_s: Optional[float] = None) -> Dict[str, Any]:
    # ... (setup) ...
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_map: Dict[Future, Tuple[str, Dict[str, Any]]] = {}
        for eid in engine_ids:
            # ... (submit tasks to the pool) ...
        for fut in as_completed(future_map):
            # ... (get result) ...
            if ok:
                success_count += 1
            else:
                failure_count += 1
            # This is the cost-saver!
            if quorum == "any" and success_count >= 1:
                # (In a real implementation, you'd cancel the other futures here)
                pass
    # ... (aggregate results) ...
    # Check if we got *enough* answers
    quorum_met = self._check_quorum(quorum, total, success_count)
    # ... (return summary) ...
By setting quorum="any", you can fire off a task to five different local models. The first one to return a valid answer "wins," and you can ignore the rest. This is high-speed, parallel processing for the cost of a few CPU cycles.
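If you want to see the quorum="any" pattern outside of Flowork, the core of it fits in a couple dozen lines of standard-library Python. This is a generic illustration, not the SwarmCoordinator itself; ask_engine is a stand-in for a call to a local model.

# Generic "first good answer wins" fan-out using only the standard library.
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def ask_engine(engine_id: str, task: str) -> str:
    # Placeholder: in practice this would call a local model endpoint.
    time.sleep(random.uniform(0.1, 1.0))
    return f"{engine_id}: answer to {task!r}"

def fan_out_any(engine_ids, task, timeout_s=10.0):
    pool = ThreadPoolExecutor(max_workers=len(engine_ids))
    futures = {pool.submit(ask_engine, eid, task): eid for eid in engine_ids}
    try:
        for fut in as_completed(futures, timeout=timeout_s):
            try:
                return fut.result()        # first success wins; the rest are ignored
            except Exception:
                continue                   # one failed engine doesn't sink the swarm
        raise RuntimeError("No engine produced an answer in time.")
    finally:
        # Don't wait for the losers; cancel anything that hasn't started yet (Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)

print(fan_out_any(["mistral-7b", "phi-3-mini", "llama-3-8b"], "Summarize this block of text."))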
Conclusion: It's All About the Scaffolding
An AI agent is not just an LLM in a while True: loop. That's a recipe for a financial disaster.
A robust, cost-effective agent is a sophisticated piece of engineering where the AI is just one component. The real magic is in the scaffolding that controls it:
- A Model Router: A system (like Flowork's AIProviderManagerService) that can intelligently send a task to a "cheap" local model or a "premium" API, depending on the need.
- A "Gas" System: A non-negotiable budget (like the BudgetMeter and GAS_COSTS) that "taxes" every action an agent takes: API calls, file I/O, shell commands.
- An Execution Context: A secure "sandbox" (like the AgentContext) that enforces this gas budget and kills the agent instantly if it runs out of money.
Stop worrying about your agent running up a $5,000 bill. Start building the scaffolding that makes it impossible for it to do so. Give it a $5 budget (or 10,000 "gas") and let it run. The worst that can happen is it stops—which is exactly what you want.
https://github.com/flowork-dev/Visual-AI-Workflow-Automation-Platform