This is a submission for the Hermes Agent Challenge
What I Built
I built a CPU-first local Hermes Agent runtime pattern that removes the usual GPU/server assumption from agent execution.
The idea is simple:
Hermes Agent should be able to run local GGUF model generation on CPU-only hardware, stream the output visibly, tokenize or chunk the generation as it arrives, and time out safely if the model stops producing output.
Most AI agent stacks assume one of two things:
- you have access to a cloud API
- you have access to a GPU/server inference backend
This project goes in the opposite direction.
It uses llamafile as the local execution layer for compatible GGUF models so the agent can run directly on a normal computer without needing a hosted model server, rented GPU, remote inference API, or always-online backend.
The runtime pattern focuses on:
- local llamafile execution
- compatible GGUF models
- CPU-first inference
- no required GPU
- no required cloud server
- no required remote API
- one-word-at-a-time visible streaming
- token/chunk output tracking
- stall detection
- timeout protection
- partial output preservation
- future ARC-style receipts for replay and debugging
This is grounded in my existing local AI work in LuciferAI_Local, which is focused on local/offline assistant behavior, llamafile / GGUF model use, and privacy-first execution without requiring cloud infrastructure.
The goal is not only to run a model locally.
The goal is to make local model generation observable enough for agent workflows.
A local agent should know whether its model is alive, generating, slow, stalled, timed out, or complete.
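To make those states concrete, here is a minimal Python sketch of a run-state classifier. The state names and idle thresholds are hypothetical illustrations, not a fixed Hermes API:
import time
from enum import Enum

class RunState(Enum):
    GENERATING = "generating"
    SLOW = "slow"
    STALLED = "stalled"
    TIMED_OUT = "timed_out"
    COMPLETED = "completed"

# Hypothetical idle thresholds; tune per machine and model size.
SLOW_AFTER_S = 3.0
STALL_AFTER_S = 10.0
TIMEOUT_AFTER_S = 20.0

def classify_run(last_output_time: float, finished: bool) -> RunState:
    """Classify a run by the idle time since its last emitted unit."""
    if finished:
        return RunState.COMPLETED
    idle = time.time() - last_output_time
    if idle > TIMEOUT_AFTER_S:
        return RunState.TIMED_OUT
    if idle > STALL_AFTER_S:
        return RunState.STALLED
    if idle > SLOW_AFTER_S:
        return RunState.SLOW
    return RunState.GENERATING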
Demo
The runtime flow looks like this:
User task
↓
Hermes Agent receives the task
↓
Runtime sends the prompt to a local llamafile / GGUF backend
↓
The model runs locally on CPU
↓
Output streams back one word or token chunk at a time
↓
Each generated unit is tracked
↓
A watchdog monitors the time since the last output
↓
If nothing new appears, the run times out safely
↓
Partial output and generation metadata are preserved
The important part is that there is no remote inference requirement in this flow.
No GPU server
No cloud API
No hosted model endpoint
No external inference dependency
The local machine runs the compatible GGUF model through llamafile.
The agent runtime watches the stream.
Example stream:
Hermes
is
running
locally
on
CPU
through
llamafile
with
tracked
tokenized
generation
...
Example successful generation record:
{
  "run_id": "cpu-gguf-demo-001",
  "engine": "llamafile",
  "model_format": "GGUF",
  "execution_mode": "cpu_first",
  "gpu_required": false,
  "server_required": false,
  "stream_mode": "one_word_or_token_chunk_at_a_time",
  "generated_units": 11,
  "last_generated": "generation",
  "status": "completed",
  "timeout_triggered": false
}
Example timeout record:
{
  "run_id": "cpu-gguf-demo-002",
  "engine": "llamafile",
  "model_format": "GGUF",
  "execution_mode": "cpu_first",
  "gpu_required": false,
  "server_required": false,
  "stream_mode": "one_word_or_token_chunk_at_a_time",
  "generated_units": 4,
  "last_generated": "locally",
  "status": "timed_out",
  "timeout_triggered": true,
  "reason": "no new generation detected inside timeout window"
}
This makes the model call observable.
Instead of blindly waiting for a local model process to finish, the runtime can see progress as it happens.
Code
Core related repository:
- LuciferAI_Local — local/offline AI terminal assistant direction using local model execution, llamafile / GGUF support, and no required cloud API dependency.
Related ARC / local-agent infrastructure:
- ARC-Neuron LLMBuilder — local AI build-and-memory system focused on model promotion, benchmark receipts, and governed model improvement.
- arc-lucifer-cleanroom-runtime — local-first runtime direction for receipts, replay, rollback, ranked memory, and sandboxed AI execution.
- ARC-Core — event/receipt spine for tracking state changes and execution records.
- omnibinary-runtime — binary-first runtime direction for intake, classification, planning, and execution records.
- Arc-RAR — archive/rollback direction for preserving runs and project state.
- arc-language-module — language graph and routing foundation for future model/language memory work.
- TizWildin Entertainment HUB — public hub for the broader software, AI, automation, and audio ecosystem.
This Hermes Agent challenge build focuses on the local model runtime pattern:
Hermes Agent
↓
local runtime wrapper
↓
llamafile
↓
compatible GGUF model
↓
CPU inference
↓
streamed output
↓
token/chunk tracker
↓
timeout watchdog
↓
agent-safe final or partial result
My Tech Stack
- Hermes Agent
- llamafile
- compatible GGUF models
- CPU-first local inference
- Python runtime wrapper
- local process / local HTTP streaming
- one-word-at-a-time output streaming
- token/chunk generation tracking
- timeout watchdog
- JSON / JSONL generation records (a writer sketch follows this list)
- local-first execution
- optional ARC-style receipt/event logging
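As a sketch of the JSONL record piece, assuming a hypothetical append_record helper rather than an existing library call:
import json
import time

def append_record(path: str, record: dict) -> None:
    """Append one generation record as a single JSONL line."""
    stamped = {"logged_at": time.time(), **record}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(stamped) + "\n")

append_record("runs.jsonl", {
    "run_id": "cpu-gguf-demo-001",
    "status": "completed",
    "generated_units": 11,
})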
The core runtime pattern is:
Prompt
↓
llamafile / GGUF
↓
CPU generation
↓
streamed words or token chunks
↓
generation tracker
↓
timeout watchdog
↓
final or partial result
Conceptual Python-style loop:
import time

last_output_time = time.time()
generated_units = []
timeout_seconds = 20

for chunk in stream_from_llamafile(prompt):
    units = tokenize_or_split_output(chunk)
    for unit in units:
        generated_units.append(unit)
        last_output_time = time.time()
        print(unit, flush=True)
    # A blocking stream read never reaches this check while stalled,
    # so the real stall check belongs in a separate watchdog thread.
    if time.time() - last_output_time > timeout_seconds:
        raise TimeoutError("No new generation detected.")
The real point is the watchdog.
The agent does not wait forever.
The runtime tracks whether new output is arriving.
If generation stops for too long, the runtime can time out safely, preserve partial output, and let the agent decide whether to retry, fall back, or stop.
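Here is a minimal sketch of that safety path, reusing the hypothetical stream_from_llamafile generator from the loop above and running the stall check in a daemon thread so it fires even while the stream read blocks:
import threading
import time

timeout_seconds = 20
last_output_time = time.time()
timed_out = threading.Event()

def watchdog() -> None:
    # Polls beside the stream loop, so it fires even if the read blocks.
    while not timed_out.is_set():
        if time.time() - last_output_time > timeout_seconds:
            timed_out.set()
        time.sleep(0.5)

threading.Thread(target=watchdog, daemon=True).start()

generated_units = []
for chunk in stream_from_llamafile(prompt):  # hypothetical generator
    if timed_out.is_set():
        break  # stop cleanly; generated_units is the preserved partial output
    generated_units.append(chunk)
    last_output_time = time.time()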
How I Used Hermes Agent
Hermes Agent is the agentic layer that benefits from this CPU-first runtime.
The runtime gives Hermes Agent a local model execution path that does not require a GPU server or remote inference endpoint.
That matters because local agent execution should be accessible.
A developer should be able to experiment with an agent on a normal machine, using a compatible GGUF model, without needing to deploy a backend server or rent GPU time just to see the agent think.
In this pattern:
- Hermes Agent supplies the agent workflow.
- llamafile supplies the local GGUF execution path.
- The CPU supplies the inference hardware.
- The stream tracker supplies liveness.
- The tokenizer/chunker turns raw output into trackable generation units (a chunker sketch follows this list).
- The timeout watchdog supplies safety.
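A minimal sketch of that chunker, assuming whitespace-delimited words as the tracked unit (real token boundaries would come from the model's own tokenizer):
def split_into_units(chunks):
    """Yield whitespace-delimited words from a stream of text chunks,
    buffering partial words that span chunk boundaries."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        *complete, buffer = buffer.split(" ")
        for word in complete:
            if word:
                yield word
    if buffer:
        yield buffer

# Three raw chunks become five tracked units.
print(list(split_into_units(["Hermes is ", "running on", " CPU"])))
# -> ['Hermes', 'is', 'running', 'on', 'CPU']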
Together, it creates an agent runtime that can answer questions like:
- Did the local model start generating?
- Is it still generating?
- Is it generating slowly or normally?
- How much has it generated?
- What was the last token, word, or chunk?
- Did it stall?
- Did it time out safely?
- Is there partial output worth preserving?
- Should the agent retry, fall back, or stop?
That turns local model generation into an observable process instead of a blind wait.
Why This Negates the Usual GPU / Server Requirement
This project is built around a different assumption:
The first useful version of a local agent should not require a GPU cluster or hosted model server.
A GPU can make inference faster.
A server can make deployment easier for teams.
But neither should be mandatory for a basic local agent runtime.
llamafile makes this possible because it packages local model execution into a developer-friendly form that can run compatible GGUF models directly on the machine.
That means the agent runtime can be designed around:
- local files
- local processes
- CPU execution
- local streaming
- local generation tracking
- local timeout rules
- local logs
- local privacy
The practical result:
Instead of:
Hermes Agent → remote API/server/GPU backend → response
Use:
Hermes Agent → local llamafile → compatible GGUF on CPU → streamed tracked response
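As a hedged sketch of that local call path, assuming a llamafile is already serving its built-in llama.cpp-style HTTP API on localhost:8080 (endpoint names and payload fields can vary by llamafile version):
import json
import requests

# Assumes a llamafile is already running its built-in local HTTP server.
# The embedded llama.cpp-style server exposes /completion; with
# "stream": true it returns server-sent events as "data: {...}" lines.
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Say hello from CPU.", "stream": True},
    stream=True,
    timeout=60,
)
for line in resp.iter_lines():
    if line.startswith(b"data: "):
        event = json.loads(line[len(b"data: "):])
        print(event.get("content", ""), end="", flush=True)
        if event.get("stop"):
            break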
This does not mean every GGUF will be fast on every CPU.
Large models still need enough RAM, and model size/quantization matter.
But the runtime no longer requires a GPU or external server as a hard dependency.
That is the key win.
It makes agent experimentation more accessible, more private, and more portable.
Tokenized / Chunked Output Tracking
Streaming alone is useful, but tracking the stream is what makes it agent-safe.
The runtime should not only print output.
It should record generation progress.
That can include the fields below (a tracker sketch follows the list):
- generated token/chunk count
- generated word count
- time of first output
- time of last output
- tokens or chunks per second
- timeout threshold
- final status
- partial output
- error reason
- retry/fallback decision
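One possible shape for that tracker, as a sketch with hypothetical names, producing a record like the example below:
import time

class GenerationTracker:
    """Collects run metadata while units stream in."""

    def __init__(self, run_id: str, timeout_seconds: int):
        self.run_id = run_id
        self.timeout_seconds = timeout_seconds
        self.started = time.time()
        self.first_output = None
        self.last_output = None
        self.units = []

    def record(self, unit: str) -> None:
        now = time.time()
        if self.first_output is None:
            self.first_output = now
        self.last_output = now
        self.units.append(unit)

    def summary(self, status: str) -> dict:
        # Convert absolute timestamps to ms offsets from run start.
        ms = lambda t: int((t - self.started) * 1000) if t else None
        return {
            "run_id": self.run_id,
            "first_output_after_ms": ms(self.first_output),
            "last_output_after_ms": ms(self.last_output),
            "generated_units": len(self.units),
            "timeout_seconds": self.timeout_seconds,
            "status": status,
            "partial_output_preserved": bool(self.units),
        }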
Example run metadata:
{
  "run_id": "tracked-local-generation-001",
  "first_output_after_ms": 812,
  "last_output_after_ms": 6912,
  "generated_units": 42,
  "timeout_seconds": 20,
  "status": "completed",
  "partial_output_preserved": true
}
This gives Hermes Agent a much better local model boundary.
Instead of asking only, “What was the answer?” the runtime can ask:
Did generation begin?
Did it keep moving?
Did it stall?
Did it finish?
What partial state can be saved?
For agents, that difference matters.
Older Hardware Direction
This runtime pattern is also designed for older CPU-only machines.
The goal is not to pretend old hardware will run huge models quickly.
The goal is to make the runtime graceful:
- small compatible GGUF models can run locally
- output appears progressively
- slow generation is still visible
- stalls are detected
- partial output is not lost
- timeouts prevent infinite waits
- the agent can fall back instead of freezing (a fallback sketch follows this list)
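A sketch of that fallback decision, with hypothetical run_generation and save_partial helpers standing in for the real runtime calls:
# Hypothetical helpers: run_generation returns (status, partial_output)
# and save_partial writes a partial-output record. Neither is a real API.
MODELS = ["medium-q4.gguf", "small-q4.gguf"]  # try a smaller model on timeout

def generate_with_fallback(prompt: str) -> str:
    partial = ""
    for model in MODELS:
        status, partial = run_generation(prompt, model=model)
        if status == "completed":
            return partial
        save_partial(model, partial)  # keep what was generated so far
    return partial  # best partial result if every model timed out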
That means even limited hardware can participate in local AI workflows.
The machine does not need to be a GPU workstation to be useful.
Current Status
This is an experimental Hermes Agent challenge submission focused on a local-first runtime pattern.
The current focus is:
- CPU-first compatible GGUF execution through llamafile
- removing hard GPU dependency for local agent experiments
- removing hard server/API dependency for local agent experiments
- one-word-at-a-time or token/chunk streaming
- generation progress tracking
- timeout detection when no new output appears
- partial output preservation
- older-hardware-friendly execution direction
- future ARC-style run receipts and replay logs
It is not presented as a finished production inference framework.
It is a practical runtime direction for making local Hermes Agent workflows more observable, safer, more portable, and easier to debug.
Future Roadmap
Next steps:
- Add a clean demo script for running a compatible GGUF through llamafile
- Add configurable timeout windows
- Track generated words, chunks, and token timing
- Save generation records as JSONL
- Add retry and fallback behavior
- Add ARC-style receipts for each generation run
- Add replayable local run manifests
- Connect successful and failed runs into the broader ARC runtime archive
- Add UI indicators for “generating,” “slow,” “stalled,” “timed out,” and “completed”
- Document old-hardware test profiles from LuciferAI_Local-style runs
- Add model-size guidance for CPU-only GGUF usage
Closing Thought
Local agents should not require a GPU server just to begin thinking.
If a compatible GGUF model can run through llamafile on CPU, the agent runtime should be able to stream it, track it, tokenize or chunk its output, detect stalls, and preserve the run.
That is the core idea of this Hermes Agent experiment:
Run locally.
Require no server.
Require no GPU.
Stream visibly.
Track generation.
Time out safely.
Preserve the run.