I wanted to push an LLM beyond simple chat and see if it could actually build real code.
So I gave it direct access to the file system and the ability to run terminal commands. The task was straightforward: “Create a clean React login page with email, password, remember-me checkbox, and form validation.”
It started confidently. Within minutes everything broke.
The System We Built
We connected two tools to the LLM:
file_system (list, read, write, delete files)
run_command (execute npm, start dev server, etc.)
We used MCP (the “USB-C for AI” protocol) so the model could call tools cleanly. The goal was to let the LLM act like a real developer — explore the folder, create files, install packages, and test the app.
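The two tools can be sketched as MCP-style tool descriptors. The exact registration API depends on which MCP server SDK you use; the names and schemas below are illustrative reconstructions, not the actual config we ran:

```javascript
// Illustrative MCP-style tool descriptors. The shapes are assumptions
// based on the common "name / description / inputSchema" pattern;
// adapt them to your MCP server SDK's real registration API.
const tools = [
  {
    name: "file_system",
    description: "List, read, write, or delete files in the project folder",
    inputSchema: {
      type: "object",
      properties: {
        action: { type: "string", enum: ["list", "read", "write", "delete"] },
        path: { type: "string" },
        content: { type: "string" }, // only used when action is "write"
      },
      required: ["action", "path"],
    },
  },
  {
    name: "run_command",
    description: "Execute a shell command (npm install, start dev server, etc.)",
    inputSchema: {
      type: "object",
      properties: {
        command: { type: "string" },
        cwd: { type: "string" },
      },
      required: ["command"],
    },
  },
];

module.exports = { tools };
```

Keeping the schemas this small matters: every extra parameter is another thing the model can get wrong on a tool call.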
It sounded simple. It was not.
Failure #1: It Assumed the Project Already Existed
What broke:
The model immediately started writing Login.jsx in an empty folder. No package.json, no React setup, no dependencies.
Why it broke:
The LLM had no understanding of project bootstrapping. It assumed a full React app was already there.
What we learned:
We had to explicitly tell it “first create the project structure” in every new session. This became our first mandatory step.
Failure #2: It Ran Commands at the Wrong Time
What broke:
After creating a few files, it ran npm start and npm run build before any dependencies were installed. The terminal exploded with 47 errors.
Why it broke:
The model treated commands like a checklist instead of understanding dependencies. It didn’t realise you can’t run the app before npm install.
What we learned:
We added a rule: never run npm start or npm run build until package.json exists and all dependencies are installed. This single rule saved us from multiple crashes.
Failure #3: It Mixed Concerns and Created Messy Code
What broke:
It put all the Tailwind CSS and form logic inside a single Login.jsx file. The component became 180 lines long, impossible to read, and had styling mixed with business logic.
Why it broke:
The model was optimising for “one file = done” instead of proper component structure.
What we learned:
We had to force it to create separate files (Login.jsx, Login.css, utils/validation.js). Once we added this constraint, the code quality jumped dramatically.
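The separated utils/validation.js ended up as plain pure functions, roughly along these lines (our reconstruction of the shape, not the model's exact output):

```javascript
// utils/validation.js — pure validation helpers, kept free of any React
// or styling concerns so Login.jsx stays a thin component.

function validateEmail(email) {
  // Simple shape check, not full RFC 5322 validation.
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

function validatePassword(password) {
  // At least 8 characters with one letter and one digit (our example policy).
  return password.length >= 8 && /[A-Za-z]/.test(password) && /\d/.test(password);
}

function validateLoginForm({ email, password }) {
  const errors = {};
  if (!validateEmail(email)) errors.email = "Enter a valid email address.";
  if (!validatePassword(password))
    errors.password = "Password needs 8+ characters with a letter and a digit.";
  return errors; // an empty object means the form is valid
}

module.exports = { validateEmail, validatePassword, validateLoginForm };
```

Because these are pure functions with no React imports, they can be unit-tested without rendering anything, which is exactly what a 180-line mixed component prevents.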
Failure #4: It Had No Memory of Previous Mistakes
What broke:
Even after we fixed the directory issue, on the very next loop it tried to recreate the same file in the same wrong location.
Why it broke:
The model had no persistent memory of what it had already tried and failed.
What we learned:
We started saving a small agent-log.md file after every loop so the model could read its own history before making the next decision. This simple trick reduced repeated mistakes by almost 70%.
After 8 loops and 14 minutes, we finally had a clean, working React login page with proper validation and structure.
The Real Lesson
The LLM wasn’t the problem. The problem was that we treated it like a magician instead of a junior developer with superpowers.
Once we gave it real tools (file system + terminal) and made it work inside clear constraints, it went from completely broken to actually useful.
In 2026, the biggest unlock isn’t a smarter model.
It’s giving the model the right tools and the right guardrails.
I no longer ask LLMs to “write me some code.”
I give them a file system, terminal access, and clear rules.
That single change is what turns toys into tools you can actually ship.
