shashank ms

Posted on Jul 1

Integrating LLMs with Robotics: A Step-by-Step Guide

#aiinfrastructure #oxlo #ai

Integrating large language models into robotics stacks is moving from research novelty toward production use. Robots in these stacks need to parse natural language instructions, maintain multi-step task memory, and invoke tools such as grippers, cameras, and navigation planners. This guide walks through a practical integration architecture, from perception encoding to actuation, and shows how to run the inference backend on Oxlo.ai using standard OpenAI SDK patterns.

Architecture Overview

A typical LLM-driven robot pipeline has four layers. First, perception modules convert sensor data, images, or telemetry into text or embeddings. Second, a state manager assembles prompts that include the current environment description, task history, and available actions. Third, an inference backend generates structured plans or function calls. Fourth, an execution layer maps those outputs to motor commands or API calls via ROS2, MQTT, or direct hardware interfaces.
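As a rough illustration of how the four layers fit together, here is a minimal sketch in which each layer is a plain Python callable. The `Pipeline` class and the `fake_*` stubs are hypothetical names invented for this example; in a real stack they would wrap your perception nodes, prompt builder, inference client, and ROS2 or MQTT publishers.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Pipeline:
    """Wires the four layers together; each layer is a plain callable."""
    perceive: Callable[[], dict]                # layer 1: sensors -> structured observation
    build_prompt: Callable[[dict, list], list]  # layer 2: observation + history -> messages
    infer: Callable[[list], dict]               # layer 3: messages -> structured plan
    execute: Callable[[dict], str]              # layer 4: plan -> result of the command
    history: list = field(default_factory=list)

    def step(self) -> str:
        observation = self.perceive()
        messages = self.build_prompt(observation, self.history)
        plan = self.infer(messages)
        result = self.execute(plan)
        # Record the plan and its outcome so the next prompt can include them.
        self.history.append({"plan": plan, "result": result})
        return result


# Stub layers (hypothetical) so the sketch runs without a robot or a network.
def fake_perceive() -> dict:
    return {"gripper_state": "open"}


def fake_build_prompt(observation: dict, history: list) -> list:
    content = f"state={observation}, steps_so_far={len(history)}"
    return [{"role": "user", "content": content}]


def fake_infer(messages: list) -> dict:
    return {"action": "inspect", "parameters": {}}


def fake_execute(plan: dict) -> str:
    return f"executed {plan['action']}"


pipeline = Pipeline(fake_perceive, fake_build_prompt, fake_infer, fake_execute)
print(pipeline.step())
```

Keeping the layers as separate callables makes it easy to swap the stubbed `infer` for a real API client later without touching perception or execution code.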

The inference layer is where latency, context length, and cost become critical. Robotics workloads often append long sensor logs and multi-turn conversation history to every request, which makes token-based billing unpredictable. Oxlo.ai uses flat per-request pricing, so the cost of a planning cycle stays constant even when you include detailed LiDAR summaries or extended system prompts.

Choosing an Inference Backend

Robotics agents need models that support function calling, JSON mode, and long context windows. Oxlo.ai provides more than 45 models across seven categories, including several that are particularly well suited for embodied AI.

Qwen 3 32B: Built for multilingual reasoning and agent workflows, making it a strong default for instruction following and tool orchestration.
GLM 5: A 744B MoE model optimized for long-horizon agentic tasks when the robot must execute extended sequences.
Minimax M2.5: Focused on coding and agentic tool use, useful when the robot interacts with software APIs or generates motion scripts.
DeepSeek V4 Flash: Offers a 1M context window and efficient MoE inference, ideal if you want to pass entire telemetry buffers or previous episode logs in one prompt.
Kimi K2.6: Supports advanced reasoning, agentic coding, and vision with a 131K context, which lets you include base64-encoded camera frames or scene descriptions.

Because Oxlo.ai is fully compatible with the OpenAI SDK, you can point existing robot code at https://api.oxlo.ai/v1 without rewriting your client logic. There are no cold starts on popular models, so the first request after idle time returns at full speed, an important property when a robot waits for a plan before actuating.

Step 1: Perception and State Encoding

Raw sensor data is usually too verbose for the context window. The standard approach is to compress observations into a structured text description. For example, a manipulation stack might publish:

{
  "objects": [
    {"label": "red_cube", "position": [0.45, -0.12, 0.02], "graspable": true},
    {"label": "blue_bowl", "position": [0.51, 0.08, 0.03], "graspable": false}
  ],
  "gripper_state": "open",
  "last_action": "approach_red_cube"
}

Your state manager should serialize this into a concise system or user message. Keep units explicit and coordinate frames consistent. If you run vision models, Oxlo.ai supports image input through the chat completions endpoint, so you can also pass a camera frame directly to a vision-capable model such as Kimi K2.6 or Gemma 3 27B and ask for a textual scene graph in return.
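One possible way to do this serialization is sketched below. The function name `serialize_state`, the `base_link` frame, and metres as the unit are assumptions made for illustration; the example state above does not say which frame or units it uses, so substitute your own.

```python
def serialize_state(state: dict, frame: str = "base_link", units: str = "m") -> str:
    """Render a perception snapshot as a compact, prompt-ready text block.

    The frame name and units are assumptions; pass the ones your stack uses.
    """
    lines = [f"Coordinate frame: {frame}; positions in {units} as [x, y, z]."]
    for obj in state.get("objects", []):
        x, y, z = obj["position"]
        graspable = "graspable" if obj.get("graspable") else "not graspable"
        lines.append(f"- {obj['label']} at [{x:.2f}, {y:.2f}, {z:.2f}] ({graspable})")
    lines.append(f"Gripper: {state.get('gripper_state', 'unknown')}")
    lines.append(f"Last action: {state.get('last_action', 'none')}")
    return "\n".join(lines)


state = {
    "objects": [
        {"label": "red_cube", "position": [0.45, -0.12, 0.02], "graspable": True},
        {"label": "blue_bowl", "position": [0.51, 0.08, 0.03], "graspable": False},
    ],
    "gripper_state": "open",
    "last_action": "approach_red_cube",
}

print(serialize_state(state))
```

Stating the frame and units once at the top of the block is usually cheaper in tokens than repeating them per object, and it gives the model a single place to look.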

Step 2: Prompt Engineering for Control

Robotics prompts should separate environment state from task instructions. A reliable pattern is:

SYSTEM:
You are a robot control agent. The user will describe a goal.
Respond with a JSON object containing:
- "reasoning": a brief plan
- "action": one of ["move", "grasp", "place", "inspect"]
- "parameters": action-specific arguments

Current environment:
{serialized_state}

USER:
Pick up the red cube and place it in the blue bowl.

Enable JSON mode in the API request to constrain output and simplify downstream parsing. Because Oxlo.ai supports JSON mode and streaming responses, you can either stream the reply and parse it incrementally for early feedback, or wait for the complete object and validate it against your schema before sending anything to the motor controller.
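JSON mode constrains the reply to valid JSON, but it does not guarantee that the object matches the schema described in the prompt, so it is worth validating before acting. The sketch below checks a complete reply against the keys and action names from the prompt above; `parse_plan` is a hypothetical helper, not part of any SDK. With the OpenAI SDK, JSON mode is typically requested by passing `response_format={"type": "json_object"}`, and I am assuming the model you pick on Oxlo.ai accepts that parameter.

```python
import json

# Action names taken from the system prompt in this section.
ALLOWED_ACTIONS = {"move", "grasp", "place", "inspect"}
REQUIRED_KEYS = {"reasoning", "action", "parameters"}


def parse_plan(raw: str) -> dict:
    """Parse and validate the model's reply; raise ValueError on any problem."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(plan, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - plan.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    action = plan["action"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if not isinstance(plan["parameters"], dict):
        raise ValueError("parameters must be a JSON object")
    return plan


reply = '{"reasoning": "cube is graspable", "action": "grasp", "parameters": {"target": "red_cube"}}'
plan = parse_plan(reply)
print(plan["action"], plan["parameters"])
```

Raising on any mismatch keeps the failure on the software side: the caller can re-prompt the model or abort the task instead of forwarding a malformed command to the controller.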

Step 3: Tool Use and Function Calling

For complex agents, hard-coding a JSON schema is fragile. Function calling lets the model decide which capability to invoke and with what arguments. Below is a minimal Python example using the OpenAI SDK against Oxlo.ai. The robot has two tools: move_base and capture_image.

import openai
import os

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ["OXLO_API_KEY"]
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "move_base",
            "description": "Move the robot to a target pose in map coordinates",
            "parameters": {
                "type": "object",
                "properties": {
                    "x": {"type": "number"},
                    "y": {"type": "number"},
                    "theta": {"type": "number"}
                },
                "required": ["x", "y", "theta"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "capture_image",
            "description": "Capture an image from the front camera",
            "parameters": {
                "type": "object",
                "properties": {}
            }
        }
    }
]

response = client.chat.completions.create(
    model=os.environ["OXLO_MODEL"],  # e.g., Qwen 3 32B or Minimax M2.5
    messages=[
        {"role": "system", "content": "You control a mobile manipulator. Use tools to complete tasks."},
        {"role": "user", "content": "Go to coordinates (2.0, 1.5, 0.0), then take a picture."}
    ],
    tools=tools,
    tool_choice="auto"
)

print(response.choices[0].message.tool_calls)

Models such as Qwen 3 32B, GLM 5, and Minimax M2.5 on Oxlo.ai handle multi-tool agentic workflows well. If a task needs more steps than fit in a single exchange, keep the conversation history and resend it on each turn; with flat per-request pricing, long multi-turn state logs do not raise the cost of a request.

Step 4: Closing the Loop with Actuation

Once the LLM returns tool calls or JSON, the execution layer validates arguments, checks safety bounds, and converts them to hardware commands. A simple loop looks like this:

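import json

def execute_tool_call(tool_call):
    """Dispatch one tool call and return a string observation for the model."""
    name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)

    if name == "move_base":
        ros2_publish_goal(args["x"], args["y"], args["theta"])
        return "goal published"
    elif name == "capture_image":
        img = camera.capture()
        return encode_image_base64(img)
    # ... additional actuators
    return f"unknown tool: {name}"

# Main loop
# task_not_complete is a placeholder flag; your task monitor must clear it.
while task_not_complete:
    state = get_robot_state()
    message = llm_plan(state, history)  # assistant message from the completion
    # The assistant turn has to be in the history before its tool results.
    history.append(message)
    for tc in message.tool_calls or []:  # tool_calls is None for plain-text replies
        observation = execute_tool_call(tc)
        # Every tool call needs a matching tool message, even if it returns no data.
        history.append({
            "role": "tool",
            "content": observation,
            "tool_call_id": tc.id
        })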

This loop is essentially a ReAct pattern implemented with native function calling. By keeping the inference backend on Oxlo.ai, you avoid cold-start latency that could stall the control loop between iterations.
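The validation and safety-bound check mentioned at the start of this step might look like the sketch below for the move_base tool. The workspace limits and the `check_move_base` helper are hypothetical values and names chosen for illustration; real limits depend on your map and robot.

```python
import math

# Hypothetical workspace limits in map coordinates (metres); tune for your robot.
WORKSPACE = {"x": (-5.0, 5.0), "y": (-5.0, 5.0)}


def check_move_base(args: dict) -> dict:
    """Return sanitized move_base arguments or raise ValueError."""
    clean = {}
    for key in ("x", "y", "theta"):
        value = args.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be a number, got {value!r}")
        if not math.isfinite(value):
            raise ValueError(f"{key} must be finite")
        clean[key] = float(value)
    for key, (low, high) in WORKSPACE.items():
        if not low <= clean[key] <= high:
            raise ValueError(f"{key}={clean[key]} is outside [{low}, {high}]")
    # Wrap the heading into [-pi, pi) so downstream code sees one convention.
    clean["theta"] = (clean["theta"] + math.pi) % (2 * math.pi) - math.pi
    return clean


print(check_move_base({"x": 2.0, "y": 1.5, "theta": 0.0}))

try:
    check_move_base({"x": 50.0, "y": 1.5, "theta": 0.0})
except ValueError as exc:
    print("rejected:", exc)
```

Running a check like this between `json.loads` and the publish call means a hallucinated or out-of-range target is rejected in software, and the error text can be returned to the model as the tool observation so it can replan.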

Managing Context and Cost

Robotics prompts grow quickly. A single planning request might include a detailed world model, previous failure explanations, and image captions. On token-based providers, this directly increases cost and forces engineers to trim context aggressively, which hurts task success rates.

Oxlo.ai charges a flat rate per API request regardless of prompt length. For long-context and agentic workloads, this can yield significant savings because a request with a 10,000-token system prompt costs the same as a 100-token greeting. You can therefore afford to keep richer state history and larger telemetry buffers in context. See https://oxlo.ai/pricing for current plan details.
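To see where flat pricing pays off for your own workload, you can compute the break-even prompt size. The sketch below does the arithmetic; every number in it is a made-up placeholder, not a real price from Oxlo.ai or any other provider, so plug in the rates from the pricing pages you are comparing.

```python
def token_cost(prompt_tokens: int, completion_tokens: int,
               in_rate: float, out_rate: float) -> float:
    """Cost of one request under per-token billing (rates are per token)."""
    return prompt_tokens * in_rate + completion_tokens * out_rate


def break_even_prompt_tokens(flat_price: float, completion_tokens: int,
                             in_rate: float, out_rate: float) -> float:
    """Prompt size above which a flat per-request price is the cheaper option."""
    return (flat_price - completion_tokens * out_rate) / in_rate


# Placeholder numbers for illustration only; they are NOT real prices.
FLAT_PRICE = 0.005  # hypothetical cost per request
IN_RATE = 1e-6      # hypothetical cost per prompt token
OUT_RATE = 2e-6     # hypothetical cost per completion token

for prompt_tokens in (100, 10_000):
    per_token = token_cost(prompt_tokens, 200, IN_RATE, OUT_RATE)
    print(f"{prompt_tokens:>6} prompt tokens: per-token {per_token:.4f}, flat {FLAT_PRICE:.4f}")

threshold = break_even_prompt_tokens(FLAT_PRICE, 200, IN_RATE, OUT_RATE)
print("break-even prompt tokens:", round(threshold))
```

Requests with prompts shorter than the break-even size are cheaper under per-token billing, so the advantage of a flat rate depends on how much state your planning prompts actually carry.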

The free plan offers 60 requests per day across more than 16 models, which is enough to prototype a control loop. When you move to production, the Pro and Premium plans provide 1,000 and 5,000 requests per day respectively, with priority queueing on Premium for time-sensitive robot operations.

Deployment Patterns

Most teams start with a cloud-backed inference stack to iterate quickly. Oxlo.ai fits this pattern naturally because the API is reachable from any edge device with an internet connection and the OpenAI SDK handles retries and streaming. For robots that operate in bandwidth-constrained environments, you can cache model responses or run smaller local policies for reflexive tasks while reserving the LLM for high-level planning calls to Oxlo.ai.
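The caching and local-policy idea can be sketched as a thin wrapper around the remote planner. Everything here is hypothetical: `CachedPlanner`, the `obstacle_distance_m` field, and the 0.3 m stop threshold are illustrative choices, and `remote_stub` stands in for a real API call.

```python
import hashlib
import json
from typing import Callable, Optional


class CachedPlanner:
    """Answer reflexes locally and serve repeated planning requests from memory."""

    def __init__(self, remote_plan: Callable[[dict], dict],
                 reflex: Callable[[dict], Optional[dict]]):
        self.remote_plan = remote_plan
        self.reflex = reflex
        self.cache = {}
        self.remote_calls = 0

    @staticmethod
    def _key(state: dict) -> str:
        # Canonical JSON so that key order does not change the hash.
        encoded = json.dumps(state, sort_keys=True).encode()
        return hashlib.sha256(encoded).hexdigest()

    def plan(self, state: dict) -> dict:
        local = self.reflex(state)
        if local is not None:
            return local  # reflexive task: no network round trip
        key = self._key(state)
        if key not in self.cache:
            self.remote_calls += 1
            self.cache[key] = self.remote_plan(state)
        return self.cache[key]


def stop_if_too_close(state: dict) -> Optional[dict]:
    # Hypothetical reflex: halt when an obstacle is closer than 0.3 m.
    if state.get("obstacle_distance_m", float("inf")) < 0.3:
        return {"action": "stop", "parameters": {}}
    return None


def remote_stub(state: dict) -> dict:
    # Stand-in for a call to the hosted model.
    return {"action": "move", "parameters": {"goal": state["goal"]}}


planner = CachedPlanner(remote_stub, stop_if_too_close)
planner.plan({"goal": "dock", "obstacle_distance_m": 2.0})
planner.plan({"obstacle_distance_m": 2.0, "goal": "dock"})  # same state, cache hit
print(planner.plan({"goal": "dock", "obstacle_distance_m": 0.1}))
print("remote calls:", planner.remote_calls)
```

An exact-match cache only helps when states repeat, so in practice you would probably quantize continuous fields such as distances before hashing; that step is left out here to keep the sketch short.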

If you need guaranteed throughput or dedicated compute, the Enterprise tier provides custom pricing, unlimited requests, and dedicated GPUs. This is worth evaluating when you move from lab demos to fleet deployment.

Conclusion

Integrating an LLM into a robotics stack is primarily an exercise in state management and structured output parsing. The inference backend should stay out of your way, offer predictable costs, and support the tool-use patterns that agentic control requires. Oxlo.ai provides OpenAI-compatible endpoints, request-based pricing that favors long-context robot prompts, and a wide model catalog including Qwen 3 32B, GLM 5, and DeepSeek V4 Flash for planning and reasoning. If you are building a new robot agent, the 7-day full-access trial on the free tier is a practical place to validate your control loop before scaling up.

DEV Community