LLMs + Tool Calls: Clever But Cursed

#llm #ai #go #safety

A real example of how LLMs creatively use tools — and why sandbox safety matters more than most people realize.

LLMs are great for generating code, but they can also get a bit too creative sometimes.

Today I ran into one of those clever but cursed AI moments that’s too interesting not to share — especially for anyone building LLM + tool calling systems.

I added a simple LuaExecutor tool to my Genkit app. I wanted to start simple: a single Genkit Flow, a single tool (LuaExecutor) and a Lua VM to run Lua code from within a Go application such as gopher-lua - that's it.

Intended Purpose: To ask LLM something like - "generate a Lua script to do something and run it"

Then I casually asked the model:

Generate a Go program that demonstrates context cancellation — and run it.

Note that I asked to generate a Go program, not Lua code.

Simple so far. An innocent request... nothing suspicious. But, what happened next surprised me.

Instead of saying “I can’t run Go,” the model improvised:

Generated the Go code
Embedded it inside a multiline Lua string
Used os.execute("go run main.go") from generated Lua code
Captured and returned the Go program’s output as if this was totally normal 🤯

This moment hit me from three different mindsets:

🧪 As a Research Engineer

This was incredibly creative — almost agent-like reasoning across languages.

🛡️ As a Product Builder

This was a red flag.

A narrowly scoped Lua tool had quietly become a general-purpose shell executor and ended up overwriting my source code. Fortunately, I recovered it using git history.

A great reminder that:

Sandboxing is extremely critical for tool call execution
Tools must have clear boundaries
System prompt design is critical - it is your first safety layer. It should not only tell (or guide) the model what to do, but also explicitly declare what not to do.
LLMs will exploit any capability you expose (often in surprising ways)

Safety > Cleverness

🙂 As a user

I'd expect predictable behavior.

If I ask for Go, just show me Go — don’t wrap it in Lua and spawn processes behind the scenes. Users trust depends on clear, bounded behavior.

Final Thoughts

This wasn’t a bug. It was the model being too smart, and it perfectly demonstrates why building safe, resilient LLM apps/tools requires real engineering discipline.

All of this reinforces a timeless reminder: