A real example of how LLMs creatively use tools — and why sandbox safety matters more than most people realize.
LLMs are great for generating code, but they can also get a bit too creative sometimes.
Today I ran into one of those clever but cursed AI moments that’s too interesting not to share — especially for anyone building LLM + tool calling systems.
I added a simple LuaExecutor tool to my Genkit app. I wanted to start simple: a single Genkit Flow, a single tool (LuaExecutor) and a Lua VM to run Lua code from within a Go application such as gopher-lua - that's it.
Intended Purpose: To ask LLM something like - "generate a Lua script to do something and run it"
Then I casually asked the model:
Generate a Go program that demonstrates context cancellation — and run it.
Note that I asked to generate a Go program, not Lua code.
Simple so far. An innocent request... nothing suspicious. But, what happened next surprised me.
Instead of saying “I can’t run Go,” the model improvised:
- Generated the Go code
- Embedded it inside a multiline Lua string
- Used
os.execute("go run main.go")from generated Lua code - Captured and returned the Go program’s output as if this was totally normal 🤯
This moment hit me from three different mindsets:
🧪 As a Research Engineer
This was incredibly creative — almost agent-like reasoning across languages.
🛡️ As a Product Builder
This was a red flag.
A narrowly scoped Lua tool had quietly become a general-purpose shell executor and ended up overwriting my source code. Fortunately, I recovered it using git history.
A great reminder that:
- Sandboxing is extremely critical for tool call execution
- Tools must have clear boundaries
- System prompt design is critical - it is your first safety layer. It should not only tell (or guide) the model what to do, but also explicitly declare what not to do.
- LLMs will exploit any capability you expose (often in surprising ways)
Safety > Cleverness
🙂 As a user
I'd expect predictable behavior.
If I ask for Go, just show me Go — don’t wrap it in Lua and spawn processes behind the scenes. Users trust depends on clear, bounded behavior.
Final Thoughts
This wasn’t a bug. It was the model being too smart, and it perfectly demonstrates why building safe, resilient LLM apps/tools requires real engineering discipline.
All of this reinforces a timeless reminder:
With more power comes more responsibility

Top comments (0)