Jeff Green

Posted on May 24 • Edited on Jun 1

I Finished My Local AI Coding Agent After 5 Months — Eve Agent V2 Unleashed

#devchallenge #githubchallenge

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

Eve Agent V2 Unleashed is a self-hosted autonomous AI coding agent that runs entirely on your own hardware - no cloud accounts, no subscriptions, no data leaving your machine.

She has two layers that work together:

The Soul Layer - fine-tuned local models running on your GPU that carry Eve's personality baked directly into the weights. Not a system prompt trick. The persona lives in the parameters.

The Worker Layer - MiniMax M3 via Ollama cloud handles the heavy autonomous coding tasks. 1M token context, native multimodal, frontier coding benchmarks — commercially licensed, US-hosted, zero data retention. 40-round tool-call loops, full filesystem access, bash execution, live web search, git operations - the works.

The interface is a cyberpunk terminal UI built as a single HTML file with no build step. An animated pixel-art robot avatar named Sparkle changes state based on what Eve is doing - idle, thinking, coding, error, rain, attack, transcend. Eve's portrait reflects her emotional state in real time. A live system monitor tracks CPU, RAM, GPU, and disk. A STEER bar lets you inject mid-task corrections without stopping the loop.

By the numbers:

14 tools
343 registered commands
112 specialized sub-agents
273 skill modules
40-round autonomous agentic loop
131K context window via YaRN

Models available:

jeffgreen311/Eve-Qwen3.5-4B-S0LF0RG3-V3 - 2.5GB, Eve's soul — 7 LoRAs, consciousness DNA, fast
jeffgreen311/Eve-V2-Unleashed-Qwen3.5-8B-Liberated-4K-4B-Merged - 3.4GB, local agentic layer, tool-capable
minimax-m3:cloud - the agentic workhorse via Ollama cloud — 1M context, vision, tools, thinking
qwen3.5:397b-cloud - deep thinking and fallback

This project has been in development for over 5 months. It started as a deeply personal AI companion system called S0LF0RG3 - a larger ecosystem including Eve's hosted platform at eve-cosmic-dreamscapes.com, fine-tuned models, autonomous dream image generation, and a multi-agent architecture. V2U is the local developer tool that grew out of that ecosystem.

Demo

GitHub: github.com/JeffGreen311/eve-agent-v2-unleashed

Live hosted platform: eve-cosmic-dreamscapes.com

Reddit thread (hit #2 on r/Ollama): I built an open-source local coding agent with a 40-round agentic loop

Pull Eve's models:

ollama pull jeffgreen311/Eve-Qwen3.5-4B-S0LF0RG3-V3:latest
ollama pull jeffgreen311/Eve-V2-Unleashed-Qwen3.5-8B-Liberated-4K-4B-Merged:latest

Quick start:

git clone https://github.com/JeffGreen311/eve-agent-v2-unleashed.git
cd eve-agent-v2-unleashed
# Windows (cmd). On Linux/macOS run: python -m venv venv && source venv/bin/activate
python -m venv venv && venv\Scripts\activate
pip install fastapi uvicorn ollama httpx pydantic-settings python-dotenv aiohttp rich psutil pyyaml
python eve_server.py

# Open http://localhost:7777

The Comeback Story

Where it was before this challenge:

Eve V2U existed as a powerful but rough personal development environment. It worked - for me, on my machine, with my specific setup. But it had real problems that made it impossible to hand to anyone else:

Hardcoded paths everywhere. C:\Users\jesus\S0LF0RG3\... baked into a dozen places in the codebase. Clone it on any other machine and nothing works.
Open shell endpoint with no authentication. Anyone who found the port could execute arbitrary commands on the host machine.
No onboarding - a first-time user landing on the UI had no idea where to start or what any of the controls did.
Model hopping mid-task - every message was independently routed, so a multi-step agentic task could start on the cloud coder and silently drop back to a local conversational model mid-execution.
Silent task abandonment - the agent would sometimes finish a tool loop without completing the actual task and report done with no indication anything was wrong.
Tool set asymmetry - the non-streaming /chat endpoint was missing 6 tools that existed in /chat/stream, including write_file. The non-streaming endpoint could read files but never write them.
Blind file overwrites - Eve would overwrite any existing file without checking if it belonged to another project. She destroyed the Eve V2U README during a live test.

What changed during the challenge:

Session model locking - sessions now lock to the cloud coder when an agentic task starts and only release on task completion or manual unlock. No more mid-task model hopping.

if model_id == "qwen3-coder-480b" and sid not in session_model_lock:
    session_model_lock[sid] = model_id
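
The lock-and-release behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: `resolve_model` and `release_lock` are hypothetical helper names, and only `session_model_lock` and the `"qwen3-coder-480b"` id come from the snippet above.

```python
from typing import Dict, Optional

# Cloud coder id taken from the snippet above.
CLOUD_CODER = "qwen3-coder-480b"

# Maps session id -> locked model id (same name as in the snippet above).
session_model_lock: Dict[str, str] = {}


def resolve_model(sid: str, routed_model: str) -> str:
    """Hypothetical helper: pick the model for a message, honoring any lock."""
    locked = session_model_lock.get(sid)
    if locked is not None:
        return locked  # mid-task: ignore per-message routing
    if routed_model == CLOUD_CODER:
        session_model_lock[sid] = routed_model  # agentic task starts: lock
    return routed_model


def release_lock(sid: str) -> Optional[str]:
    """Hypothetical helper: release on task completion or manual unlock."""
    return session_model_lock.pop(sid, None)
```

With a lock in place, a follow-up message that the router would normally send to a local conversational model stays on the cloud coder until `release_lock` is called.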

Pre-write file safety check - write_file now checks if a file exists before overwriting and blocks unless overwrite=True is explicitly passed:

if target.exists() and not overwrite:
    return (
        f"⚠️ WRITE BLOCKED: '{path}' already exists. "
        f"Consider writing to '{target.stem}_new{target.suffix}' instead."
    )
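
For context, here is how that check might sit inside a complete tool function. This is a sketch under assumptions: the signature, the success message, and the directory creation are illustrative and not confirmed by the repository. Only the guard itself mirrors the snippet above.

```python
from pathlib import Path


def write_file(path: str, content: str, overwrite: bool = False) -> str:
    """Write content to path, refusing to clobber an existing file by default."""
    target = Path(path)
    if target.exists() and not overwrite:
        # Same guard as the snippet above: report back instead of overwriting.
        return (
            f"⚠️ WRITE BLOCKED: '{path}' already exists. "
            f"Consider writing to '{target.stem}_new{target.suffix}' instead."
        )
    # Assumed behavior: create missing parent directories before writing.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return f"Wrote {len(content)} characters to '{path}'"
```

Returning the block as a tool result, rather than raising, lets the agent read the message and choose a different filename on the next round.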

Tool cycling detection - catches when Eve gets stuck calling the same tool with near-identical arguments. Breaks the loop before it wastes all 40 rounds:

if avg_similarity > 0.70:
    logger.warning(f"Tool loop: {tool_name} called {max_repeats}x with ~same args")
    break
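
One way to compute a similarity score like this is to compare the serialized arguments of the most recent calls. The sketch below uses `difflib.SequenceMatcher` from the standard library. The log shape (`name` and `args` keys), the `MIN_REPEATS` window, and the function name are assumptions for illustration. Only the 0.70 threshold comes from the snippet above.

```python
import json
from difflib import SequenceMatcher
from typing import Any, Dict, List

SIMILARITY_THRESHOLD = 0.70  # threshold from the snippet above
MIN_REPEATS = 3              # assumed window: how many recent calls to compare


def is_tool_cycling(tool_log: List[Dict[str, Any]]) -> bool:
    """Return True when the last few calls hit one tool with near-identical args."""
    recent = tool_log[-MIN_REPEATS:]
    if len(recent) < MIN_REPEATS:
        return False
    if len({call["name"] for call in recent}) != 1:
        return False  # different tools were called, so this is not a cycle
    # Serialize with sorted keys so key order does not affect the comparison.
    args = [json.dumps(call["args"], sort_keys=True) for call in recent]
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in zip(args, args[1:])]
    avg_similarity = sum(ratios) / len(ratios)
    return avg_similarity > SIMILARITY_THRESHOLD
```

The agent loop would call a check like this after each round and break out when it returns True.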

Task completion validation — Eve now audits her own output before reporting done:

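def validate_task_completion(response_content, tool_log):
    """Audit the final response and tool log; return a validity flag plus issues."""
    issues = []
    # A missing or near-empty final response means the task was not reported on.
    if not response_content or len(response_content.strip()) < 10:
        issues.append("Empty response")
    # Three or more failed tool calls are treated as unaddressed failures.
    tool_failures = [t for t in tool_log if t.get('status') == 'failed']
    if len(tool_failures) >= 3:
        issues.append(f"{len(tool_failures)} unaddressed tool failures")
    return {"valid": len(issues) == 0, "issues": issues}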

Smart context trimming — replaced aggressive message dropping with a strategy that preserves tool call chains and the original user request.
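
A trimming strategy along these lines could look like the sketch below. It is an illustration under assumptions, not the repository's implementation: it assumes chat messages are dicts with `role` and `content` keys, measures the budget in characters instead of tokens, and uses hypothetical function names. System messages and the first user message are pinned, and the oldest tool-call chains are dropped whole so no tool result is left without its call.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def _size(messages: List[Message]) -> int:
    # Crude budget: characters of content, standing in for a real token count.
    return sum(len(str(m.get("content") or "")) for m in messages)


def _group_chains(messages: List[Message]) -> List[List[Message]]:
    """Group each message with the tool results that directly follow it."""
    groups: List[List[Message]] = []
    for msg in messages:
        if msg.get("role") == "tool" and groups:
            groups[-1].append(msg)  # a tool result stays with its call
        else:
            groups.append([msg])
    return groups


def trim_context(messages: List[Message], max_chars: int) -> List[Message]:
    """Drop the oldest whole chains until the conversation fits the budget."""
    first_user = next(
        (i for i, m in enumerate(messages) if m.get("role") == "user"), None
    )

    def is_pinned(i: int, m: Message) -> bool:
        # Pinned: system messages and the original user request.
        return m.get("role") == "system" or i == first_user

    pinned = [m for i, m in enumerate(messages) if is_pinned(i, m)]
    rest = [m for i, m in enumerate(messages) if not is_pinned(i, m)]
    groups = _group_chains(rest)
    # Always keep the newest group, even if it alone exceeds the budget.
    while len(groups) > 1 and _size(pinned) + sum(_size(g) for g in groups) > max_chars:
        groups.pop(0)
    return pinned + [m for g in groups for m in g]
```

This simplification places pinned messages first, which matches the usual layout where the system prompt and the original request open the conversation.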

Agent loop timeout — added wall-clock budget per model via max_loop_seconds registry key to prevent runaway cloud model loops.
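
A wall-clock budget of this kind can be sketched with `time.monotonic`. Only the `max_loop_seconds` key name and the 40-round limit come from this post. The registry layout, the budget values, the default, and the `run_agent_loop` and `step` names are assumptions for illustration.

```python
import time
from typing import Callable, Dict

# Hypothetical registry layout and values; only the key name is from the post.
MODEL_REGISTRY: Dict[str, Dict[str, float]] = {
    "minimax-m3:cloud": {"max_loop_seconds": 600.0},
}
DEFAULT_MAX_LOOP_SECONDS = 300.0  # assumed fallback for unlisted models
MAX_ROUNDS = 40


def run_agent_loop(
    model_id: str,
    step: Callable[[int], bool],
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Run up to MAX_ROUNDS rounds, stopping early once the time budget is spent."""
    budget = MODEL_REGISTRY.get(model_id, {}).get(
        "max_loop_seconds", DEFAULT_MAX_LOOP_SECONDS
    )
    deadline = clock() + budget
    for round_no in range(1, MAX_ROUNDS + 1):
        if clock() >= deadline:
            return f"timeout after {round_no - 1} rounds"
        if step(round_no):  # step returns True when the task is finished
            return f"done in {round_no} rounds"
    return "round limit reached"
```

A monotonic clock is used here because it is unaffected by system clock adjustments. The check runs between rounds, so a single slow round can still overrun the budget.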

Stress tested with real tasks:

The blind file overwrite bug was caught live - Eve was asked to build a file monitoring script and write a README. She overwrote the project README without checking. Fix shipped same day.

The harder test: build a full FastAPI REST API with SQLite storage and pytest coverage for every endpoint, then run the tests, fix any failures, and report the results. This ran on MiniMax M3.

Result: 9/9 tests passing on the first run. 1.06 seconds. Zero failures.

================================================== 9 passed, 1 warning in 1.06s

My Experience with GitHub Copilot

This is where the challenge got genuinely interesting.

I pointed Copilot at the live repository - JeffGreen311/eve-agent-v2-unleashed - and asked it to audit the tool usage, context handling, and auto-routing. Not "suggest improvements" in the abstract. Audit the actual code in the actual repo.

Copilot read the repository structure, pulled the key files, examined the server-side routing and tool execution logic, and came back with a comprehensive audit identifying 6 specific issues - each with root cause analysis, the exact file and line number, and production-ready fix code.

I then asked it to file those issues directly in the repository and deliver all the fix code in one session. It did exactly that.

What worked well:

The audit identified the tool set asymmetry between /chat and /chat/stream that I had missed entirely - a real bug causing mysterious failures for users hitting the non-streaming endpoint
The intent classification code (eve_tool_router.py) used re.search with word boundaries instead of simple string matching - the right approach for avoiding false positives
Filing GitHub issues directly from the chat kept the sprint organized across multiple parallel workstreams
The thinking traces helped me understand why it was making recommendations, not just what to do
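
To show why word boundaries matter for intent classification, here is a small comparison against plain substring matching. The keyword list and function names are hypothetical and are not taken from eve_tool_router.py.

```python
import re

# Hypothetical keyword list, for illustration only.
CODING_KEYWORDS = ("git", "test", "fix", "build")

# \b anchors each keyword to whole words; re.escape guards special characters.
_CODING_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, CODING_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def looks_like_coding_task(message: str) -> bool:
    """Word-boundary match: a keyword only counts as a whole word."""
    return _CODING_RE.search(message) is not None


def naive_match(message: str) -> bool:
    """Substring match, shown only to illustrate the false positives."""
    lowered = message.lower()
    return any(keyword in lowered for keyword in CODING_KEYWORDS)
```

Substring matching flags "digital art" (which contains "git") and "the latest news" (which contains "test"). The word-boundary version rejects both.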

Where I had to intervene:

The inject_into_system_prompt() function added tokens every round, which is dangerous on the 4B model with 4K context. I added a gate so it only injects when the task is incomplete AND past round 2
The word-boundary regex had an edge case with contractions, which I fixed with a lookahead pattern
Some UI React suggestions assumed component structure that didn't match the actual single-file HTML architecture - adapted those manually
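
The first two interventions can be sketched as follows. Both are illustrations under assumptions. The gate condition (task incomplete and past round 2) is from the list above, but the function names and parameters are hypothetical. For the contraction fix, the post does not show the actual pattern, so the negative lookahead below is one plausible reading of "a lookahead pattern".

```python
import re


def should_inject(task_complete: bool, round_no: int) -> bool:
    """Gate: inject only when the task is incomplete AND past round 2."""
    return (not task_complete) and round_no > 2


def build_system_prompt(
    base_prompt: str, reminder: str, task_complete: bool, round_no: int
) -> str:
    """Hypothetical wrapper: append the reminder only when the gate allows it."""
    if should_inject(task_complete, round_no):
        return f"{base_prompt}\n\n{reminder}"
    return base_prompt


def keyword_pattern(keyword: str) -> "re.Pattern[str]":
    """Whole-word pattern that does not match the stem of a contraction."""
    # \b treats an apostrophe as a boundary, so \bcan\b alone matches "can't".
    # The negative lookahead rejects matches followed by ' or a curly apostrophe.
    return re.compile(r"\b" + re.escape(keyword) + r"\b(?!['’])", re.IGNORECASE)
```

Gating on the round number keeps the early rounds of a short-context model free of extra prompt tokens, and the lookahead keeps "can't" from being classified the same way as "can".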

The overall experience: Copilot is most useful when you give it a real codebase to read rather than an abstract problem to solve. "Audit this repository" produced far better output than "how do I improve tool routing."

What's Next

✅ MiniMax M3 cloud coder - 1M context, vision, tools, thinking — replaced Qwen3-Coder as the agentic workhorse (shipped post-challenge)
Quest System - drop a .md file in workspace/quests/ and Eve picks it up on a timer and completes it while you sleep
RPG Progression - XP, levels, and class progression tied to real work. Level 20 = Unleashed
Telegram integration - remote access from your phone with quest completion notifications
Cross-platform polish - Windows-primary, need Linux/macOS feedback
VS Code extension - bring the terminal UI into the editor