DEV Community

Bruce Mcpherson
Bruce Mcpherson

Posted on

Combining local and hosted llm to minimize token cost

My current large project is gas-fakes, which is an emulation that allows local execution, continuous integration, and containerization of native Apps Script code. In other words, we are not just ’emulating Apps Script’ – we are liberating it.

Initially, AI generated code and testing was not something I was comfortable publishing, so to this point real people have coded and tested the majority of the repo. However, now the architecture and techniques are fully mature the remaining work is largely just busy work implementing and testing the remaining, less used, Apps Script platform methods.

As of gas-fakes v2.5.3 we are at 4399/6708 methods and 10,500 parity tests on the emulation against the live Apps Script platform. Now feel a little more confident about allowing AI to do some of coding work.

As an open source developer, my work is voluntary and unpaid, and therefore have to balance the potential token cost at my own personal expense, versus the value of any time saving I might make.

This article is about combining the planning capability of antigravity, with the a free local model (Gemma running under oMLX on a Mac) doing the grunt work. Like this my Gemini costs are minimal, and the local heavy work is free.

Note: this article is specific to Mac/AntiGravity combination. You can use a similar technique for other combinations but I’m not covering them here. gas-fakes collaborators will have this set up already implemented in their repo fork (but need to tweak the .gemini/settings to point to their local path for the mcp tools) , and the local model oMLX delegation will be ignored if they don’t have it running.

Evaluation of the repo content.

Before we start this is the status of the repo according to a Gemini assessment, before I start to use this local model to help with the grunt work. Clearly my priority is to maintain (or improve) the current quality.

You can get a detailed analysis here.

Executive Summary & Core Metrics

Evaluation Dimension Grade Key Focus Area / Findings
Architectural Design & Viability A+ Exceptional synchronous design mimicking V8 GAS on top of Node’s async landscape.
Parity Tracking & Completeness A Data-driven tracking system mapping thousands of live Apps Script methods via /progress.
Testing, Quality Assurance & Fidelity A Massive test footprint (~10,000+ internal/cyclical validation passes) proving true 1:1 behavioral parity.
Edge-Case & Platform Oddities Handling A- Deeply transparent about platform limits, script execution quirks, and modern auth drift.
Ecosystem & Modern Stack Readiness A+ Integrated Model Context Protocol (MCP) server, gf_agent automation tool, and containerization.

💎 Overall Project Score: 94/100 (Enterprise Grade / Production Dev Tool)

Gemini Directive & Hybrid Planned Hierarchy

We’ll be using a strict, hierarchical delegation model.

The Roles

Strategic Planner (Gemini): The hosted, powerful LLM. Its role is high-level planning, context management, decision-making, and orchestration. It determines what needs to be done.

Focused Executor (Local Model): The local, specialized LLM. Its role is high-fidelity, resource-intensive execution of specific tasks. It determines how the task is completed.

The Delegation Mechanism (query_local_model)

The Planner is equipped with a specific tool, query_local_model. When the Planner determines that a task requires local computation, it does not attempt to solve it itself. Instead, it generates a structured call to query_local_model, passing the necessary context and instructions to the local MCP Server.

The local model executes the task and returns the result to the Strategic Planner, which then integrates it into the final response.

Operational Constraints (The Golden Rule)

To prevent unnecessary cloud API usage, we can give the Planner strict directives:

The Planner is strictly forbidden from drafting implementation details, writing production code, or creating tests directly when the query_local_model tool is available. If the task falls within the scope of a specialized, local execution, the Planner must delegate the task to the local Executor.

We specify this as part of the skills training – snippet of the important rule below:

Implementation & Focused Execution (CRITICAL DELEGATION GATE)

[!IMPORTANT] ZERO-TOLERANCE DELEGATION GATE You are FORBIDDEN from using write_file or replace to implement logic, write tests, perform refactoring, diagnose/fix debug errors, or draft documentation yourself. MANDATORY SEQUENCE:
Gather context (Research).
CALL omlx/query_local_model with a comprehensive prompt containing specific constraints.
Review and synthesize the output.
Apply changes to files (e.g., in src/, test/, etc.).
EXCEPTION: This mandate only applies if the local model is available and its use has not been explicitly forbidden by the user. If unavailable or forbidden, you may proceed with the tasks using your own weights, but you MUST document the reason in your update_topic.

Cost and Token Savings

Hosted LLMs operating under token-based pricing complex can quickly accumulate unaffordable costs. By offloading the heavy lifting to the local model, you reduce the number of hosted tokens you have to pay for.

oMLX Setup and Documentation

Since I’m using oMLX to serve my local model, let’s look at how to set that up. If you are not using a Mac, there are other local model orchestrators you can use, but the initial setup of those up is outside the scope of this article.

Overview: What is oMLX?

oMLX allows the Planner to offload specific, resource-intensive tasks to a local, specialized LLM (the Executor).

Instead of relying solely on the hosted API for every request, oMLX acts as a middleware layer. The hosted model dynamically decides when a task is best suited for local execution.

Setup and Configuration

The core of the oMLX system is the MCP Server (Model Communication Protocol Server), which acts as the local endpoint for the Focused Executor.

The oMLX MCP Server

an mcp tool acts as a local server. This server listens for requests from the hosted LLM (Gemini) and routes them to the locally running model instance.

Configuration Methods

You control the server and the overall system behavior via these environment variables. the mcp tool uses these to know where to delegate tasks.

Validating that AntiGravity is using the local server

An important step to verify everything is working.

  • Ask the agy cli – ‘are you able to use the local model’
  • The mcp server will inform you when it is are using the local model – you’ll see messages like this – omlx/query_local_model(Delegate documentation generation to local model)
  • Check the oMlx dashboard (http://127.0.0.1:8000/admin/dashboard)- notice the ‘generating’ comment against the gemma model

Links

See this to get started with gas-fakes.

GitHub: gas-fakes

Top comments (0)