garlicfarmer
From a personal AI agent to a phone-based agentic operating environment

I am a garlic farmer in South Korea.
Since I quit my job in Seoul and settled in the countryside, I have had no personal PC for 16 years. Even now, my main work environment is a single Android phone. I install Termux on it and do almost everything there.

I never formally learned to code, and I am not someone who can write code from scratch by myself.
Most of the code I made was built little by little — talking to multiple AIs, copying, pasting, running again, seeing errors, then taking those errors to another AI and asking again.

I am a non-English speaker, and all my thinking happens in Korean, so this post is also based on Korean thinking. When moved into English, it may feel a bit awkward. I would appreciate your understanding on that.

I cannot say I perfectly understand the entire structure of this system. Because this system itself is something that a human and AIs have shaped together, even when writing this post, I had to rely on AI for parts of translation and structural explanation. But I do not just post whatever AI gives me. I compare multiple explanations, verify them, and the final judgment is always mine before I post.

This post is not simply a story about "a farmer used AI."
It is a record of the process where I manually orchestrated multiple AIs with one phone, and built a personal AI system that actually runs.

What I built

What I built is a personal AI system called garlic-agent, based on Android Termux.
The name might sound a bit grand, but for me it is not a simple chatbot — it is closer to a personal assistant that I actually assign tasks to, so I call it that.

Before, I just called it a personal AI agent. Because I am a garlic farmer, the name garlic-agent came naturally. But as I kept adding structure to it, I started thinking that calling it just an "agent" does not cover what it has become.

Now I think it is closer to a small operating environment — where a human orchestrates multiple AIs in the middle, and on top of that, execution, verification, search, backup, and restore structures are layered. If I had to express it in English, maybe Agentic Operating Environment or Agent Orchestration Framework would fit better.

After building it and tearing it apart again and again, I feel it is no longer just a collection of prompts bundled together. Inside my system, tool execution, security restrictions, search, verification, snapshots, restore, and skill execution are all connected as one flow.

Simply put, it works like this.

User input
   ↓
garlic-agent
   ├─ If it is a frequent task → GL-DIRECT immediate execution
   └─ Otherwise → LLM generates GarlicLang script
                ↓
          Local execution
                ↓
      Verify / Check logs / Return result
                ↓
      If needed → Snapshot / Backup / Restore
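If I sketch that flow in code, it would look roughly like this. This is only an illustrative Python sketch with made-up names (`GL_DIRECT_SCRIPTS`, `ask_llm`, `run_local`), not the actual garlic-agent source.

```python
import subprocess

# Hypothetical table of pre-saved GL-DIRECT scripts for frequent tasks.
GL_DIRECT_SCRIPTS = {"disk-usage": "df -h"}

def ask_llm(request):
    """Stand-in for the LLM call that generates a script for the request."""
    return f"echo 'generated for: {request}'"

def run_local(command):
    """Execute locally and capture the output, so the result can be verified."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout.strip()

def handle(request):
    if request in GL_DIRECT_SCRIPTS:
        return run_local(GL_DIRECT_SCRIPTS[request])  # frequent task: no LLM call
    return run_local(ask_llm(request))  # otherwise: LLM generates, we execute locally

print(handle("water the garlic"))
```

The point of the shape is only that every answer passes through local execution, never straight from the model to me.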

I cannot say it is a finished product yet. But at least for me personally, it has already become an operating environment that I use every day: modifying, recovering, and redeploying.

Simple structure

If I draw it very simply, the flow looks like this.

User natural language input
        ↓
   web / chat UI
        ↓
      agent.py
        ↓
 tools.py ─ security.py
        ↓
      Execution result
   ├─ search.py / knowledge.db
   ├─ GarlicLang verification
   └─ snapshots / backup / restore

The important thing in this structure is that the AI does not just answer — it actually calls tools, inspects the results again, and if necessary, goes all the way to restore.

From the outside it may look like a chat window, but internally there are separate layers: tool execution layer, security layer, search layer, verification layer, snapshot layer, restore layer. I have been building and operating all of this alone, without a PC, on an Android phone.

Why this structure was needed

I never planned to build something this big from the start. I used to work inside LLM chat windows that had sandboxes, but that approach was always unstable. Some days things worked; the next day they did not. It said it saved, but actually it had not saved. It said it executed, but when I checked the logs, there was no execution record.

As these experiences piled up, I felt more and more clearly.

You should not leave everything to AI. Help is fine, but what matters is structure, verification, and enforcement.

So I started pulling control back to my side, little by little. Not just fixing prompts, but splitting workflows, saving frequent tasks, making things restore on failure, separating search and verification, and attaching structures that make it hard for AI to just skip ahead on its own.

garlic-agent is the result of all that trial and error piled up.

How I built it

This system is not a fully autonomous AI. It is closer to the opposite. I used a manual multi-orchestration approach where a human intervenes in the middle.

Simply put, I did not let AIs talk to each other directly. I stood in the middle like a router — getting design and analysis from Claude, giving implementation or different-angle verification to models like Gemini, DeepSeek, MiniMax, then looking at the results myself and copying them to the next AI.

This process may look very primitive. I keep multiple chat windows open and keep copy-pasting with a physical keyboard on my phone, playing ping-pong back and forth. At first I thought this was a hopelessly clumsy method. But surprisingly, it was stronger than I expected.

Because a human is in the middle:

  • You can directly feel the personality differences between each AI
  • You can filter out false success reports
  • If one AI gets stuck, you can immediately switch to another
  • You can compare and verify results right away

If I draw this structure very simply, it looks like this.

Claude      → Design / Analysis
Gemini      → Implementation
DeepSeek    → Additional analysis
MiniMax     → Cross-verification
Grok/others → Supporting opinions
                 ↓
              Human (me)
      Judge / Compare / Copy / Pass / Final choice
                 ↓
        Apply in Termux / Test / Fix / Retry

Over the past two years, opening and closing countless chat windows, I learned firsthand that each model has quite a different personality. My personal feeling is that Gemini follows rules relatively well, DeepSeek is strong at analysis but tends to repeat calls, and MiniMax can be unpredictable but helps with cross-verification. This is less a rigorous benchmark than a farmer's intuition built from long observation.

Why I made GarlicLang

The most frequent problem I saw while building this system was that AI could not handle tools properly, summarized results in strange ways, or even said it did something it never did.

I did not want to just dismiss that as "well, LLMs have limitations like that." If I was going to actually delegate tasks, I needed to handle failure, verification, and exception handling more structurally.

For that reason, with help from multiple AIs, I made a Korean-syntax scripting language called GarlicLang. The AIs explained it as a DSL. Rather than strict terminology, I think of it as "a language I made on my side to execute and verify AI tasks a bit more safely."

For example, it looks like this.

[시도]                          # try
  [실행]                        # execute
    명령어: cat ~/garlic-agent/config.json
[환각시]                        # on hallucination
  [출력]                        # print
    내용: AI fabricated the result
[실패시]                        # on failure
  [출력]                        # print
    내용: Command execution failed

The important thing is, I did not make this language to look cool. I kept facing real failures, and I needed a structure that could handle verification, failure handling, and restore better. GarlicLang was one of the results.

In other words, GarlicLang is not the center of the whole system — it is one sub-component that came out of operating garlic-agent.
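To show the shape of those three branches, here is how I would mimic them in plain Python. This is just an illustrative sketch of the idea, not the real GarlicLang interpreter, which is its own 967-line component.

```python
import subprocess

def execute_step(command, claimed_output=None):
    """Mimics the [시도] / [환각시] / [실패시] shape: run, then distrust claims."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    if proc.returncode != 0:
        return "Command execution failed"          # [실패시] branch
    actual = proc.stdout.strip()
    if claimed_output is not None and claimed_output != actual:
        return "AI fabricated the result"          # [환각시] branch
    return actual                                  # [시도] succeeded

print(execute_step("echo garlic"))
```

The key idea is that "what the AI says happened" and "what actually happened" are compared as two separate values.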

Problems that showed up in real operation

What I felt while building this system is that the problem with AI is not simply "is it smart or not." In real operation, much more practical problems keep popping up.

1. The problem of not saving even when told to save

No matter how strongly I wrote "you must save to a file" in the prompt, some models ignored it and just said "saved." When I actually checked, there were 0-byte files three times in a row.

The lesson I learned then was simple.

Do not ask AI nicely. Force it with code.

In the end, I changed direction to putting a forced-save interceptor at the system level.
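A minimal sketch of that idea, assuming a plain file write; the real interceptor sits at the tool layer, but the principle is the same: never trust "saved", check the disk.

```python
import os
import tempfile

def forced_save(path, content):
    """Write, flush to disk, then verify instead of trusting a report."""
    with open(path, "w") as f:
        f.write(content)
        f.flush()
        os.fsync(f.fileno())                 # force the bytes onto disk
    # The check that would have caught my 0-byte files:
    return os.path.exists(path) and os.path.getsize(path) > 0

path = os.path.join(tempfile.gettempdir(), "garlic_save_test.txt")
print(forced_save(path, "config: ok"))
```

If the function returns False, the caller treats the save as failed, no matter what the model claimed.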

2. The problem of safety checks killing normal work

A verification module I made to prevent number hallucination ended up blocking all normal analysis too. The AI suggested "let's fix line 921," but because the number 921 did not appear in the tool output, the system treated it as a hallucination.

This incident gave me quite a big lesson.

Safety checks are necessary, but if they are too strong, the system itself stops.
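The direction I ended up with was, roughly, to separate warning from blocking. Here is my own simplified sketch of that idea, not the actual module:

```python
import re

def check_numbers(ai_text, tool_output, strict=False):
    """Find numbers in the AI's answer that never appeared in the tool output.
    In strict mode they block; otherwise they are only reported, so a
    harmless suggestion like 'fix line 921' does not kill the analysis."""
    seen = set(re.findall(r"\d+", tool_output))
    unseen = [n for n in re.findall(r"\d+", ai_text) if n not in seen]
    if unseen and strict:
        raise ValueError(f"possibly fabricated numbers: {unseen}")
    return unseen

print(check_numbers("let's fix line 921", "error near line 917"))
```

Strict mode is reserved for places where a wrong number is dangerous, like file paths or backup counts.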

3. The problem of reporting results without executing

Some models gave results like "PASS 7" without even executing the tool. When I checked the logs later, there was no actual call record.

As these experiences piled up, I became more and more certain.

Trust actual executed results more than what AI says.
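The mechanism behind that lesson can be shown with a toy log. This is an illustrative sketch; in my real setup the record would live in a log file or DB table, not a Python list.

```python
import time

CALL_LOG = []   # every real tool call leaves a record here

def logged_run(tool, args):
    """Record the call before anything gets reported upward."""
    CALL_LOG.append({"tool": tool, "args": args, "ts": time.time()})
    return f"executed: {tool} {args}"

def claim_is_backed(claimed_calls):
    """An AI report of N executions only counts if N records actually exist."""
    return len(CALL_LOG) >= claimed_calls

logged_run("pytest", "-q")
print(claim_is_backed(7))   # False: the log shows only one real call
```

A "PASS 7" report with one log entry is rejected automatically, without me having to notice it myself.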

4. The problem where small syntax mistakes become fatal without restore

I once pasted external text into a triple-quoted string in a Python file, and a quote mismatch caused the string to terminate early, killing the server. The backup from the same point even contained the broken file, so I had to go back to an even earlier backup to restore.

The more I went through these incidents, the clearer it became.

An AI system is not just a matter of answer quality — it is also a matter of operation and restore structure.
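One cheap guard against exactly that failure is to parse a Python file before letting it replace running code. A sketch using only the standard library:

```python
import ast
import os
import tempfile

def safe_to_deploy(path):
    """Refuse to deploy a .py file that does not even parse.
    The quote mismatch that killed my server would fail right here."""
    try:
        with open(path) as f:
            ast.parse(f.read())
        return True
    except SyntaxError:
        return False

# Demo: a file with an unterminated string, like my triple-quote accident
broken = os.path.join(tempfile.gettempdir(), "broken_demo.py")
with open(broken, "w") as f:
    f.write('text = "unterminated\n')
print(safe_to_deploy(broken))   # False
```

Running this check before a restart costs milliseconds; skipping it once cost me a dead server and two backups.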

So the reliability layers were born

Going through these problems, garlic-agent gradually gained multiple reliability layers.

For example:

  • A security layer that only passes allowed paths and commands
  • A structure that automatically backs up before file modifications
  • A structure that auto-reverts when post-write verification fails
  • A structure that detects file changes and takes snapshots
  • A logging structure that records tool execution history
  • A GL-DIRECT structure that executes frequent tasks immediately without LLM

In other words, something like Hallucination Guard did not fall from the sky as a separate project. It naturally emerged as one sub-layer while actually operating this larger system and responding to the problems that showed up.

This is the point I most want to make in this post.

If I draw just the write-safety flow separately, the structure is roughly like this.

tool:write request
    ↓
1) Pre-backup
   ├─ Create .bak file
   ├─ Save DB snapshot
   └─ Record PRE_WRITE anchor
    ↓
2) File write
    ↓
3) GarlicLang verification
   ├─ Pass → Keep
   └─ Fail → Auto-revert
    ↓
4) Autosnap additional record

The reason I added this structure is simple. I experienced multiple times where an LLM wrote wrong code, or said it saved but actually did not, or gave partially correct results. Rather than solving these problems with prompts alone, I wanted the system to catch and revert automatically.

GL-DIRECT and cost issues

Inside this system, there is also a structure called GL-DIRECT. Simply put, frequent tasks are pre-saved as GarlicLang scripts, and later they run immediately without any LLM call.

Before adding this, even the same repetitive task required calling the API again every time. After adding GL-DIRECT, I could handle frequent commands much more lightly.

At first I did not even understand API cost structures well. I was surprised when costs suddenly jumped, and only after that did I realize that repetitive tasks should be moved to immediate execution as much as possible.

I think exact numbers need to be re-verified constantly. But the direction itself was clear.

For repetitive tasks, running verified scripts directly is far better than asking AI every time.
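The cost logic is easy to show with a toy counter: the first request pays for one LLM call, and every repeat afterwards is free. This is illustrative only; `ask_llm` is a stub, not a real API client.

```python
llm_calls = 0          # counts paid API calls

def ask_llm(prompt):
    """Stub for a paid LLM API call."""
    global llm_calls
    llm_calls += 1
    return "df -h"     # pretend the model produced this script

saved_scripts = {}     # the GL-DIRECT idea: keep verified scripts around

def handle(request):
    if request not in saved_scripts:
        saved_scripts[request] = ask_llm(request)   # pay once
    return saved_scripts[request]                   # then reuse for free

for _ in range(10):
    handle("show disk usage")
print(llm_calls)   # 1, not 10
```

In the real system the saved entry is a verified GarlicLang script rather than a raw string, but the cost shape is the same.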

Current scale

It is still a small personal project, but by my own standards, I feel I have come quite far.

Currently my system has roughly these elements.

Item                           Count
agent.py                       1,210 lines
Total .py files                31
Skills                         13
GL-DIRECT                      6
GarlicLang interpreter         967 lines
GarlicLang commands            32
GarlicLang verification types  17
knowledge.db documents         6,488
GL scripts                     359
Backup tar.gz archives         128
LLMs used                      Gemini, DeepSeek, MiniMax, Cerebras, Groq, NVIDIA, Claude (external analysis support)

These numbers are based on values I re-checked from my system before writing this post. Still, I think numbers are always something that needs re-verification. From my experience, numbers are the part that most easily goes wrong.

What I still do not know

This system is useful for me, but whether it would be equally useful for someone else — I still do not know. I have grown this structure to fit my own work style and judgment flow, so from another person's perspective, it might look overly complex and unfriendly.

I tried local LLMs multiple times too, but due to phone hardware limitations, it was not practical yet. I think if I attach something like a Mac Mini later, different results might come out.

And above all, there is still so much I do not know. Things like indentation rules, AST pattern matching, regression testing, change history management — I am still learning these little by little while wrestling with AIs. The way I learned coding is closer to keep asking AI, failing, pasting again, restoring, rather than from books or courses.

What I want to do next

My next goal right now is to fix the variable scope bug in the GarlicLang interpreter. And I want to stack CHANGELOG records more structurally, so that the agent can search its own history.

I feel more and more strongly that failure records are far more important than success records. AI repeats the same mistakes often, and humans forget quickly too. So I think task logs, failure records, and restore points should not be just notes — they should be part of the system.

Someday I want to open-source it. But right now it is still close to a personal system, and my philosophy and usage patterns are too deeply embedded in it, so I think it would be hard for others to use right away.

Still, I see possibility. If a farmer with no PC and not much coding experience has come this far, I think in the future, more ordinary people might be able to build their own AI systems in their own way.

I do not want to make a grand declaration. But what I felt over this time is this.

In the AI era, it is not only important to get good answers. Building a structure that fits your own environment, recording failures, attaching verification and restore, and handling multiple AIs in your own way — I think these abilities will become more and more important too.

This post is closer to an observation log of one garlic farmer from Korea who kept experimenting with one phone, rather than a finished success story.

Thank you for reading this long post.

from garlic farmer
