Wilson Felipe
Edge AI in Practice: Attempting to Run Hermes Agent on an Android Inference Server

If NASA landed on the Moon using a computer with only 4KB of RAM, why can't I run a personal AI Agent directly on my phone? 🚀📱
Over the last few weeks, I decided to test the limits of Edge AI. The idea was to create a local inference server on my Samsung S20 FE, using it as the engine for autonomous agents like Hermes or OpenClaw.
To achieve this, I forked the Google AI Edge Gallery, built an embedded Ktor server, and exposed the Gemma 3n model through an API 100% compatible with the OpenAI standard (/v1/chat/completions). And the best part? It worked! I even managed to get the model to execute native Function Calling directly from the phone to check the weather forecast.
But it wasn't all smooth sailing at the bleeding edge of technology. I quickly hit a physical wall: context management (KV Cache) and RAM.

🏗️ The Architecture: Turning a Phone into an OpenAI API

The objective was to transform a mobile device (in my case, a workhorse Samsung Galaxy S20 FE) into a local inference engine for Large Language Models (LLMs). Using it as a backend for autonomous agents ensures data privacy and zero cloud costs.
Here is the technical stack I used:

  • AI Engine: MediaPipe LLM Tasks + LiteRT
  • Web Server: Ktor Server running on Android port 8080
  • Contract: POST /v1/chat/completions endpoint, mapping exactly to the OpenAI standard (a minimal Ktor sketch of this endpoint appears after the example payloads below)

The first major challenge was tool calling. For the agent to make decisions, the server needed to understand tools. After fixing a critical parsing bug where the Message.tool() wrapper failed to trigger inference, I saw success: I managed to get the phone to run Gemma 3n and respond to a function call perfectly. Here is the actual payload processed on the device.

Request (cURL to the Phone):
curl --request POST \
  --url http://192.168.0.209:8080/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "gemma",
    "messages": [{"role": "user", "content": "What is the weather in São Paulo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Returns the weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "stream": false
}'


Response Generated by Android:

{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "tool_calls": [{
        "id": "call_ff9134b36d9341efb8c19ce1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\":{\"name\":\"São Paulo\"}}"
        }
      }]
    }
  }]
}


Everything looked promising. The Ktor server was stable, and tools were operational. Seeing the GPU speed on LiteRT made me extremely optimistic.
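
For the curious, the endpoint itself is only a few lines of Ktor. Here is a minimal sketch, not the actual project code: runInference is a hypothetical stand-in for the call into MediaPipe's LLM session, and real parsing of messages, tools, and streaming is omitted.

import io.ktor.http.ContentType
import io.ktor.server.cio.CIO
import io.ktor.server.engine.embeddedServer
import io.ktor.server.request.receiveText
import io.ktor.server.response.respondText
import io.ktor.server.routing.post
import io.ktor.server.routing.routing

// Minimal OpenAI-compatible endpoint embedded in the Android app.
// `runInference` is an assumed hook that feeds the request to the
// MediaPipe LLM session and returns an OpenAI-style completion JSON.
fun startOpenAiCompatServer(runInference: (String) -> String) =
    embeddedServer(CIO, port = 8080) {
        routing {
            post("/v1/chat/completions") {
                val requestJson = call.receiveText() // raw OpenAI-style request body
                val completionJson = runInference(requestJson)
                call.respondText(completionJson, ContentType.Application.Json)
            }
        }
    }.start(wait = false)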

🧱 The Wall: Physical Limitations and Token Overflow

The trouble started when I connected the Hermes Agent continuously. Anyone who works with agents knows: they burn through tokens. To complete a task, Hermes sends a rich context (a long system prompt in messages[0]) and the entire conversation history with every call.
By the second turn of the conversation, my debug endpoint logged an error:
Input token ids are too long: 14739 >= 4000

I investigated the KV cache management (the memory space the model uses to "remember" context). While Gemma 3n E2B on LiteRT theoretically supports up to 32,000 context tokens, the real bottleneck is physical RAM.
My S20 FE has 6GB of RAM, while the minimum requirements for the Gemma-3n-E2B-it model specify at least 8GB. When I tried to raise the token limit (maxTokens) in the LiteRT engine to 16,000, the hardware choked: the device heated up, Android flagged the RAM spike from the growing KV cache, and summarily killed the app with an Out Of Memory (OOM) error.
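
To see why 16k tokens chokes a 6GB phone, some back-of-the-envelope math helps. The formula below is the standard KV cache estimate; the layer and head counts are illustrative placeholders, not Gemma 3n's actual architecture:

// KV cache size: two tensors (K and V) per layer, each holding
// kvHeads * headDim values per token, at `bytes` per element (fp16 = 2).
// The numbers in main() are ILLUSTRATIVE, not Gemma 3n's real config.
fun kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int, seqLen: Int, bytes: Int = 2): Long =
    2L * layers * kvHeads * headDim * seqLen * bytes

fun main() {
    val mib = kvCacheBytes(layers = 30, kvHeads = 8, headDim = 128, seqLen = 16_000) / (1024 * 1024)
    println("~$mib MiB of KV cache at 16k tokens") // ≈ 1875 MiB
}

With numbers in that ballpark, the cache alone approaches 2GB before you count the model weights and the OS itself, so the OOM kill is no surprise.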
To make matters worse, the Hermes Agent has a hardcoded threshold: it only activates context compression for models with a capacity of 64,000 tokens or more. This means a 16k server gets knocked out after just a few messages, because the agent stacks history indefinitely.
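
The obvious workaround is a server-side guard that trims history before it ever reaches the engine. Here is a rough sketch of the idea; the ~4-characters-per-token estimate is a crude assumption standing in for the model's real tokenizer:

// Hypothetical pre-inference guard: evict the oldest turns until the
// estimated prompt size fits the LiteRT session's token window.
fun fitToWindow(messages: MutableList<Pair<String, String>>, maxTokens: Int = 4000) {
    // ~4 chars per token is a crude heuristic, not a real tokenizer.
    fun estimatedTokens() = messages.sumOf { (_, content) -> content.length / 4 + 4 }
    while (estimatedTokens() > maxTokens && messages.size > 1) {
        messages.removeAt(1) // keep messages[0], the system prompt Hermes depends on
    }
}

It isn't pretty, but until the agent itself compresses context for small windows, evicting old turns on the server side is the only lever available.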

🔮 Looking Ahead

Unfortunately, I couldn't get a 100% autonomous agent running routines on the phone without exhausting its memory, at least not this time. However, the concept is proven: the on-device HTTP server works wonderfully well.
The Edge AI revolution is happening, but it requires us to be software engineers who are much more conscious of physical resource management than the cloud era has accustomed us to be.
What do you guys think? In the coming years, will our phones replace the cloud for personal AI tasks? Or will hardware bottlenecks always keep us tethered to Big Tech servers? Drop your thoughts in the comments! 👇
