DEV Community

Naveen V
Naveen V

Posted on • Originally published at linkedin.com

LLMs + Tool Calls: Clever But Cursed

A real example of how LLMs creatively use tools — and why sandbox safety matters more than most people realize.

LLMs+Tool Calls

LLMs are great for generating code, but they can also get a bit too creative sometimes.

Today I ran into one of those clever but cursed AI moments that’s too interesting not to share — especially for anyone building LLM + tool calling systems.

I added a simple LuaExecutor tool to my Genkit app. I wanted to start simple: a single Genkit Flow, a single tool (LuaExecutor) and a Lua VM to run Lua code from within a Go application such as gopher-lua - that's it.

Intended Purpose: To ask LLM something like - "generate a Lua script to do something and run it"

Then I casually asked the model:

Generate a Go program that demonstrates context cancellation — and run it.

Note that I asked to generate a Go program, not Lua code.

Simple so far. An innocent request... nothing suspicious. But, what happened next surprised me.

Instead of saying “I can’t run Go,” the model improvised:

  • Generated the Go code
  • Embedded it inside a multiline Lua string
  • Used os.execute("go run main.go") from generated Lua code
  • Captured and returned the Go program’s output as if this was totally normal 🤯

This moment hit me from three different mindsets:

🧪 As a Research Engineer

This was incredibly creative — almost agent-like reasoning across languages.

🛡️ As a Product Builder

This was a red flag.

A narrowly scoped Lua tool had quietly become a general-purpose shell executor and ended up overwriting my source code. Fortunately, I recovered it using git history.

A great reminder that:

  • Sandboxing is extremely critical for tool call execution
  • Tools must have clear boundaries
  • System prompt design is critical - it is your first safety layer. It should not only tell (or guide) the model what to do, but also explicitly declare what not to do.
  • LLMs will exploit any capability you expose (often in surprising ways)

Safety > Cleverness

🙂 As a user

I'd expect predictable behavior.

If I ask for Go, just show me Go — don’t wrap it in Lua and spawn processes behind the scenes. Users trust depends on clear, bounded behavior.

Final Thoughts

This wasn’t a bug. It was the model being too smart, and it perfectly demonstrates why building safe, resilient LLM apps/tools requires real engineering discipline.

All of this reinforces a timeless reminder:

With more power comes more responsibility

Top comments (0)