Francesco Marconi

My Robot's Brain is a Markdown File (LLM-First, Tool as Prompt in action)

A practical guide using LLM-First and the "Tool as Prompt" paradigm to create an AI planning engine.

The SCRAP Robot: Field Validation

Recently, I completed the development of SCRAP, an intelligent robotic assistant (as shown in the example video πŸ€–).

Beyond the hardware challenges, the real complexity lay in the control software. The solution is best captured in SCRAP's own words from the video:

"I am able to break down complex requests into simpler actions... and I use logical reasoning to optimize my actions."

This isn't a pre-programmed phrase; it's the control philosophy I instilled through an LLM-First prompt that transforms documentation into an executable API (Tool as Prompt). This article walks through the architecture and methodology that allow SCRAP to reason this way.

To illustrate the concept clearly, the prompt presented here is a simplified version for educational purposes, focused on essential navigation operations. Although SCRAP's complete system handles more advanced capabilities, the operating principle you will discover is exactly the same.

No training. No fine-tuning. Just structured documentation.

Try It Now

1. Load the Prompt

Take the README from this repository and paste it into your favorite LLM (Claude, ChatGPT, Gemini).

If your LLM supports browsing, simply paste the URL and ask it to analyze the repository.

Load https://github.com/fra00/tap-robot-planner-sample
-- or, in raw format:
Load https://raw.githubusercontent.com/fra00/tap-robot-planner-sample/refs/heads/main/README.md

2. Send a Command

{
  "action": "request",
  "user_utterance": "Hello robot, tell me your position",
  "robot_location": "entrance"
}
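
If you'd rather drive these steps from code than from a chat window, the whole exchange is one API call. Below is a minimal sketch, assuming the OpenAI Python SDK; the model name is illustrative, and any chat-capable LLM client works the same way.

# Minimal sketch: load the README as a system prompt and send one command.
# Assumes the OpenAI Python SDK; the model name is illustrative, not part of
# the repository.
import json
import requests
from openai import OpenAI

RAW_README = "https://raw.githubusercontent.com/fra00/tap-robot-planner-sample/refs/heads/main/README.md"

prompt = requests.get(RAW_README, timeout=10).text  # the "Tool as Prompt" document

command = {
    "action": "request",
    "user_utterance": "Hello robot, tell me your position",
    "robot_location": "entrance",
}

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": json.dumps(command)},
    ],
)
print(response.choices[0].message.content)  # expected: a <JSON>...</JSON> action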

3. See the Result

<JSON>
{
  "action": "position",
  "reasoning": "The user has requested to know my current position. I will execute the 'position' action, which will return the Cartesian coordinates."
}
</JSON>

Something interesting happened: the LLM didn't "answer the question"; it generated an executable action in the required JSON format.
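
In a real system, the orchestrator's first job is to pull that action out of its wrapper. Here is a minimal sketch of the extraction step in Python; the function name is mine, not the repository's, and it assumes the model honors the <JSON>...</JSON> convention the prompt requires.

import json
import re

def extract_action(llm_response: str) -> dict:
    """Pull the JSON action out of the <JSON>...</JSON> wrapper."""
    match = re.search(r"<JSON>\s*(\{.*?\})\s*</JSON>", llm_response, re.DOTALL)
    if match is None:
        raise ValueError("No <JSON> block found in the LLM response")
    return json.loads(match.group(1))

action = extract_action('<JSON>{"action": "position", "reasoning": "..."}</JSON>')
print(action["action"])  # -> "position"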

Real-World Example: The Full Turn

This is a real test conducted with Gemini. It shows how the LLM handles abstract requests.

User Input

{
  "action": "request",
  "user_utterance": "do a 360-degree turn, and confirm me when you have finish",
  "robot_location": "entrance"
}

Response 1: Intelligent Decomposition

<JSON>
{
  "action": "rotate",
  "direction": "right",
  "angle": 180,
  "reasoning": "The request to do a full 360-degree turn is not a single permitted action, so I am breaking it down into two consecutive 180-degree rotations. This is the first rotation."
}
</JSON>

System Confirmation (you)

{
  "action": "rotate",
  "status": "success",
  "direction": "right",
  "angle": 180
}

Response 2: Second Rotation

<JSON>
{
  "action": "rotate",
  "direction": "right",
  "angle": 180,
  "reasoning": "The first 180-degree rotation was completed successfully. To execute the original request for a full 360-degree turn, I must now perform the second and final 180-degree rotation."
}
</JSON>

System Confirmation (you)

{
  "action": "rotate",
  "status": "success",
  "direction": "right",
  "angle": 180
}

Response 3: Feedback to the User

<JSON>
{
  "action": "speech",
  "text": "I have completed the 360-degree turn.",
  "reasoning": "The second 180-degree rotation was completed successfully, leading to the completion of the abstract request. I am communicating the completion to the user."
}
</JSON>

System Confirmation (you)

{
  "action": "speech",
  "status": "success"
}

Response 4: Task Closure

<JSON>
{
  "action": "completed",
  "reasoning": "The last 'speech' action to communicate the completion of the 360-degree turn was executed successfully. There are no further requests or sub-tasks to perform."
}
</JSON>

The Infrastructure

To make this system work, you need three components, each handling a specific type of complexity:

🧠 1. LLM + Prompt (this repo)
Manages Cognitive Complexity: Its responsibility is to interpret human intent, break down abstract tasks, and generate a sequential plan. It deals with ambiguity and reasoning.

πŸ› οΈ 2. Orchestrator (to be implemented)
Manages Operational Complexity: Its responsibility is to execute the plan safely and reliably. It handles the state loop, validates the LLM's output, manages errors, handles timeouts, and integrates with the hardware. Its logic is deterministic, but its implementation requires robust engineering.

βš™οΈ 3. Robot API (to be implemented based on your specs)
Manages Physical Complexity: It executes actions in the real world, interfacing with motors and sensors.
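
These three responsibilities map naturally onto code boundaries. As a rough sketch, the Robot API can be expressed as a Python protocol; the method names mirror the base operations used in this article, but the exact signatures are an assumption to be replaced by your hardware's real interface.

from typing import Protocol

class RobotAPI(Protocol):
    """Hypothetical hardware-facing interface; adapt it to your robot's real SDK.
    Each call is expected to return a status dict such as
    {"action": "rotate", "status": "success", ...}."""

    def move(self, direction: str, distance_cm: int) -> dict: ...
    def rotate(self, direction: str, angle: int) -> dict: ...
    def navigate(self, destination: str) -> dict: ...
    def speech(self, text: str) -> dict: ...
    def position(self) -> dict: ...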

The Flow

sequenceDiagram
    participant U as User
    participant O as Orchestrator
    participant L as LLM+Prompt
    participant R as Robot

    U->>O: "turn 360Β°"
    loop while action != "completed"
        O->>L: Send complete context
        L->>O: {"action":"rotate", "angle":180}
        O->>R: Do rotate(180)
        R->>O: {"status":"success"}
        O->>O: Add the result to the context
    end
    O->>U: Task completed

Context as Memory

The LLM remembers nothing between calls. The orchestrator maintains the history and passes it each time:

[
  {"action":"request", "user_utterance":"do 360Β°"},
  {"action":"rotate", "angle":180, "reasoning":"2Γ—180Β°"},
  {"action":"rotate", "status":"success"},
  {"action":"rotate", "angle":180, "reasoning":"second one"},
  {"action":"rotate", "status":"success"},
  {"action":"completed"}
]

The orchestrator is mechanical: it collects, passes, and executes. The LLM is intelligent: it reads the context, plans, and decides the next action.

Once the LLM returns {"action": "completed"}, the loop terminates, and the orchestrator clears the context, preparing for a new request from scratch. This keeps the system efficient and focused.

What You Need to Implement

Your system must (a minimal orchestrator sketch in Python follows this list):
βœ… Expose base actions: move, rotate, navigate, speech, etc.

βœ… Loop while action != "completed".

βœ… Maintain the context array (and clear it after completed).

βœ… Extract JSON from <JSON>...</JSON>.

βœ… Call the robot's APIs.

βœ… Handle basic errors (timeouts, invalid format).
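
Putting the checklist together, the orchestrator is a short loop. Here is a minimal sketch, assuming the extract_action helper sketched earlier, a RobotAPI implementation, and a call_llm(context) function that sends the prompt plus the context to your model of choice; the whitelist and step budget stand in for real validation and timeout handling.

# Minimal orchestrator sketch. call_llm(), extract_action() and the robot
# object are assumed to exist (see the earlier sketches); the whitelist and
# step budget are placeholders for real validation and timeout handling.
ALLOWED_ACTIONS = {"move", "rotate", "navigate", "speech", "position", "completed"}

def handle_request(user_utterance, robot_location, robot, call_llm, max_steps=20):
    context = [{
        "action": "request",
        "user_utterance": user_utterance,
        "robot_location": robot_location,
    }]

    for _ in range(max_steps):  # hard cap instead of an unbounded loop
        llm_response = call_llm(context)       # send the full history every time
        action = extract_action(llm_response)  # parse the <JSON>...</JSON> block

        if action.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"LLM proposed a non-whitelisted action: {action}")

        context.append(action)                 # the plan step becomes part of memory

        if action["action"] == "completed":
            return context                     # loop ends; the caller clears the context

        # Dispatch to the hardware and feed the result back into the context.
        params = {k: v for k, v in action.items() if k not in ("action", "reasoning")}
        result = getattr(robot, action["action"])(**params)
        context.append(result)

    raise TimeoutError("Task did not complete within the step budget")

In production, field-level validation of parameters, retries on malformed output, and hardware timeouts all belong inside this loop, as the checklist above suggests.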

Why Does It Work?

The prompt guides the LLM by providing:
πŸ“ Expected format: A JSON schema for each action β†’ consistent output.

πŸ”„ Procedural rules: "One action at a time," "Wait for confirmation" β†’ sequential behavior.

🧠 Decomposition logic: "If abstract β†’ divide into sub-tasks" β†’ complexity management.

In essence, the prompt doesn't just give instructions: it transforms the LLM into the decision-making engine of a state machine. The Orchestrator maintains the state (the context), and the LLM calculates the next transition (the next action).

This is the heart of "Tool as Prompt": documentation that the LLM loads as operational logic, not reads as narrative text.

Adaptability

This prompt handles the basic functionalities of a domestic robot, but it's easily expandable. To adapt it to other domains (drones, home automation, robotic arms, or even non-robotic systems), you just need to modify the "Base Operations" section while maintaining compatibility with the action format. The framework remains identical.
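
For example, teaching the planner a hypothetical drone operation only requires documenting it in "Base Operations" and keeping the same output shape. The action name and field below are illustrative, not taken from the repository:

{
  "action": "takeoff",
  "target_altitude_m": 10,
  "reasoning": "The user asked the drone to take off; 'takeoff' is a permitted base operation."
}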

Limitations and Considerations

This approach is powerful, but it's important to understand its trade-offs and risks:

πŸ›‘οΈ Reliability and Safety (Hallucinations): LLMs are not 100% reliable and can "hallucinate," generating invalid or unexpected outputs. In a robotics context, this is a critical risk. The Orchestrator must act as a firewall, rigorously validating every action and parameter returned by the LLM against a predefined schema of allowed commands. An unforeseen or unsafe action must never reach the hardware.

🐒 Latency and the Role of the Planner: The LLM acts as a high-level planner, not a real-time controller. The latency of each API call makes it perfect for strategic tasks (e.g., "go to the kitchen"). Reflexes and low-latency actions, like avoiding an obstacle, are delegated to the robot's onboard systems.

πŸ“œ Context Length: The orchestrator sends the entire history with each call. For extremely long command sequences, the context might degrade performance. However, for typical domestic robot commands (1-3 actions), this is not a practical issue.
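
As noted under Reliability and Safety, the Orchestrator should refuse anything it does not recognize. The sketch below extends the simple action whitelist from the loop above to field-level checks, using a hand-written per-action schema; in practice you might prefer JSON Schema or Pydantic, and the schema entries shown are illustrative.

# Minimal "firewall" sketch: every field the LLM proposes is checked against a
# hand-written schema before the action is allowed to touch the hardware.
ACTION_SCHEMAS = {
    "rotate": {"direction": {"left", "right"}, "angle": range(1, 361)},
    "speech": {"text": str},
    "completed": {},
}

def validate_action(action: dict) -> None:
    name = action.get("action")
    schema = ACTION_SCHEMAS.get(name)
    if schema is None:
        raise ValueError(f"Action '{name}' is not in the allowed set")
    for field, allowed in schema.items():
        if field not in action:
            raise ValueError(f"Action '{name}' is missing required field '{field}'")
        value = action[field]
        if isinstance(allowed, (set, range)) and value not in allowed:
            raise ValueError(f"Field '{field}'={value!r} is out of range for '{name}'")
        if isinstance(allowed, type) and not isinstance(value, allowed):
            raise ValueError(f"Field '{field}' must be of type {allowed.__name__}")

validate_action({"action": "rotate", "direction": "right", "angle": 180, "reasoning": "..."})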

Conclusion

This practical case study demonstrates how structured documentation becomes an operational tool for an LLM, providing a framework for managing complex planning:

πŸš€ Quick Setup (load the prompt)

πŸ”„ Reusability (adapt in minutes)

πŸ›‘οΈ Controlled Reliability (output validated by the Orchestrator)

πŸ” Transparency (inspectable reasoning field)

This architecture allows for the separation of cognitive complexity (managed by the LLM) from operational complexity (managed by the Orchestrator). The main focus shifts to implementing a robust executor, delegating abstract planning to the LLM.

Useful Links

πŸ€– This Repository - tap-robot-planner-sample

πŸ“„ Tool as Prompt - The Paradigm

πŸ“š LLM-First Documentation Framework

πŸ› οΈ 2WHAV - Prompt Engineering

This article demonstrates the principles it describes. The prompt and example code are available in the tap-robot-planner-sample repository.
