Daniel Romero

Posted on Jul 2

Building my humanoid robot

#ai #robotics #machinelearning #python

In December 2025 I decided to finally work on an idea I'd had for a while: to build, set up, and train a humanoid robot. My starting point was the K-Bot, an open source project from K-Scale, with open CAD and detailed documentation.

From the K-Scale documentation page I grabbed the project's Onshape links and started printing the parts. I used PLA for most of the structure. The parts that take more stress, the sides of the torso, I ordered in nylon from JLC3DP, because I wanted more strength and a bit of flex there. That same CAD also holds the description of the robot's joints and links, a file I used later for simulation and to look up the limits of each joint, which saved me a lot of guessing when it came to programming the motion.

With the parts printed I started assembling piece by piece, checking fits, screws, cable routing, and the warping from 3D printing, reading the notes in the documentation and going through the history of their Discord conversations. This was the stage that took the most patience, because I wanted to avoid slack or bigger problems in the assembly.

I bought the motors from the Robstride store on AliExpress and they took a while to arrive, but in the end it worked out. With the parts assembled and the motors in hand, the next step was to make all of it move.

The motors and CAN communication

I decided to start with the K-Bot's right arm, which has six Robstride motors: five in the arm joints (pitch, roll, yaw, elbow, and wrist) and one in the gripper. They are different models and sizes depending on the joint: bigger near the shoulder, where the required torque is higher, and smaller toward the tips. Each one has its own ID, and they all talk over the same CAN bus.

CAN, short for Controller Area Network, is a bus that came from the automotive industry. It's two twisted wires carrying a differential signal, with several devices hanging off that same pair, usually running at 1 Mbps. Each message has an identifier, and that ID also settles priority when two nodes try to talk at the same time: whoever has the lower ID wins the bus. In my case, the host sends the command frames and reads the return frames through a USB to CAN converter, a SavvyCAN-FD-X2, which supports CAN-FD and reaches 12 Mbps, even though the motor bus runs at 1 Mbps. I picked this converter based on another open source project, OpenArm. To give an idea of the headroom: at peak use, with six motors commanded 100 times per second, I take up around 15% of the bus bandwidth.
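
To make the arithmetic behind that 15% concrete, here is a rough estimate. It is a sketch under assumptions the text doesn't spell out: one command frame and one reply frame per motor per cycle, 8 data bytes per frame, 29-bit extended identifiers, and no bit stuffing. The function names are made up for the example.

```python
# Back-of-the-envelope CAN bus load. Assumptions (not measurements):
# one command + one reply frame per motor per cycle, 8 data bytes,
# 29-bit extended IDs, bit stuffing ignored.

def frame_bits(data_bytes: int = 8, extended: bool = True) -> int:
    """Nominal bits on the wire for one classical CAN data frame."""
    # Fixed overhead: SOF, arbitration, control, CRC, ACK, EOF, interframe space.
    overhead = 67 if extended else 47
    return overhead + 8 * data_bytes


def bus_load(motors: int, rate_hz: float, bitrate: float = 1_000_000,
             frames_per_motor: int = 2) -> float:
    """Fraction of the bus bitrate used by the motor traffic."""
    bits_per_second = motors * frames_per_motor * rate_hz * frame_bits()
    return bits_per_second / bitrate


load = bus_load(motors=6, rate_hz=100)
print(f"{load:.1%}")  # prints 15.7%
```

Bit stuffing adds a few extra bits per frame in practice, so the real number sits a little above this estimate.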

I tested each motor separately before integrating everything: power it up, bring up the bus, check the communication, watch the motion response and the limits of each joint. This isolated test helped me understand the behavior of each actuator and catch problems early, before sending commands to the whole robot.

Each Robstride is controlled in what they call MIT mode, a scheme that became known through MIT's mini-cheetah quadruped. In a single CAN frame I send the target position, the target velocity, the stiffness and damping gains (kp and kd), and a reference torque, all packed into the 8 bytes of data in the frame. The motor itself closes the loop and computes the final torque: kp times the position error, plus kd times the velocity error, plus the reference torque. That lets me choose how firm or soft each joint feels just by changing the gains. A higher kd is what gave me smooth motion, right at the motor, without needing any filter in software. And the soft-stop, when I want to release the arm, is just zeroing the stiffness and leaving a light damping, so it stops without locking up abruptly.
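
The control law the motor applies can be written out in a few lines. This is only a sketch of the formula described above, with hypothetical names and example gains. It leaves out the packing into the 8 data bytes, since the scaling ranges depend on the motor model.

```python
from dataclasses import dataclass


@dataclass
class MitCommand:
    position: float   # target position, rad
    velocity: float   # target velocity, rad/s
    kp: float         # stiffness gain
    kd: float         # damping gain
    torque_ff: float  # reference (feedforward) torque, N*m


def mit_torque(cmd: MitCommand, measured_pos: float, measured_vel: float) -> float:
    """Torque the motor would apply: kp * pos error + kd * vel error + reference."""
    return (cmd.kp * (cmd.position - measured_pos)
            + cmd.kd * (cmd.velocity - measured_vel)
            + cmd.torque_ff)


# Holding a pose: stiff spring plus damping (example gains, not real settings).
hold = MitCommand(position=1.0, velocity=0.0, kp=20.0, kd=1.0, torque_ff=0.0)
hold_torque = mit_torque(hold, measured_pos=0.9, measured_vel=0.5)

# Soft-stop: zero stiffness, light damping. The torque only opposes motion.
soft_stop = MitCommand(position=0.0, velocity=0.0, kp=0.0, kd=0.5, torque_ff=0.0)
stop_torque = mit_torque(soft_stop, measured_pos=0.9, measured_vel=2.0)
```

With kp at zero the position target stops mattering, which is why the soft-stop never pulls the arm anywhere: it only brakes whatever velocity is left.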

One detail that makes this scale well: I don't sit waiting for each motor's reply in the middle of the loop. The library I chose keeps a little memory box per motor with the last state it reported, position, velocity, torque, and temperature, and keeps updating that box in the background as the return frames come in over the bus. When my code asks where the motor is, it reads that memory right away, without going out to the wire. That's what lets me command the six motors at 100 Hz without choking.
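
The idea of that memory box can be sketched like this. It is not the library's actual code or API, just a minimal illustration of the pattern with hypothetical names: a background thread writes the latest state, and the control loop reads it without touching the bus.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class MotorState:
    position: float
    velocity: float
    torque: float
    temperature: float


class StateCache:
    """Keeps only the most recent state reported by each motor."""

    def __init__(self):
        self._lock = threading.Lock()
        self._states = {}

    def update(self, motor_id: int, state: MotorState) -> None:
        # Called by the receiver thread whenever a return frame arrives.
        with self._lock:
            self._states[motor_id] = state

    def latest(self, motor_id: int):
        # Called by the control loop; returns immediately, never waits on the bus.
        with self._lock:
            return self._states.get(motor_id)


cache = StateCache()


def fake_receiver():
    # Stand-in for the thread that parses return frames from the bus.
    for i in range(1, 4):
        cache.update(1, MotorState(0.1 * i, 0.0, 0.0, 30.0 + i))


rx = threading.Thread(target=fake_receiver)
rx.start()
rx.join()
latest = cache.latest(1)  # last state written for motor 1
```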

Getting LeRobot to talk to these motors, though, took me a good while. This framework, which I use for teleoperation and training, already comes with support for Robstride motors. Except that support talks to them using the standard CAN frame, with an 11-bit identifier, in MIT mode. The motors on my K-Bot are set to Robstride's default protocol, the private mode, which uses the 29-bit extended identifier. They are two legitimate modes of the motor itself, but they don't talk to each other: with the motors expecting 29-bit frames and LeRobot sending 11-bit ones, the motor simply wouldn't move, and it didn't throw any error either, which threw me off quite a bit at the start. To get to 11-bit mode I'd have to send a protocol switch command to each motor and power cycle it.

I had two paths: reconfigure the motors to 11-bit mode or talk to them in the mode they were already in. I went with the second, because I didn't want to have to do that protocol switch on each motor, one by one. For that I used motorbridge, a driver written in Rust that speaks Robstride's private protocol on the 29-bit bus, with the same MIT command underneath. It has a wheel for aarch64, so it runs on the Raspberry Pi without any hassle. I wrapped that driver in a layer of my own application and started sending all the commands through it. That layer also solves a unit difference for free: LeRobot does the motion math in degrees and the motor uses radians, and the conversion happens on every read and write, so I don't have to remember that in the rest of the code.
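
The unit conversion at that boundary looks roughly like this. `FakeDriver` is a stand-in invented for the example, not motorbridge's real interface, and the method names are hypothetical.

```python
import math


class FakeDriver:
    """Stand-in for the motor driver. Like the motor, it works in radians."""

    def __init__(self):
        self.position_rad = 0.0

    def write_position(self, rad: float) -> None:
        self.position_rad = rad

    def read_position(self) -> float:
        return self.position_rad


class DegreeJoint:
    """Wrapper that exposes degrees upward and speaks radians downward."""

    def __init__(self, driver):
        self._driver = driver

    def set_target_deg(self, deg: float) -> None:
        self._driver.write_position(math.radians(deg))

    def get_position_deg(self) -> float:
        return math.degrees(self._driver.read_position())


driver = FakeDriver()
joint = DegreeJoint(driver)
joint.set_target_deg(90.0)  # the driver receives about 1.5708 rad
```

Keeping the conversion in one wrapper means nothing above it ever sees a radian, so there is a single place to get it right.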

Adapting LeRobot

LeRobot is an open source library maintained by Hugging Face that standardizes the whole flow of teaching a robot by demonstration: you define the robot and a way to teleoperate it, record the demonstrations in a common dataset format, train a policy on top of that data, and run inference on the real robot. The base class contracts hold for any robot, so if mine follows those contracts, it drops into that pipeline and reuses the recording, training, and visualization tools that are already there.

It all revolves around two ideas: a Robot, which knows how to read an observation and execute an action, and a Teleoperator, which produces an action from some input. I wrote my arm as a subclass of Robot that, underneath, sends the MIT commands through motorbridge, and I wrote the PS4 controller as a Teleoperator.

The official teleoperation CLI didn't fit my case. It had the feedback sending tied to a specific robot, it didn't call the part that reads the controller buttons, and it didn't turn on the motor torque, so the arm would stay loose the whole time. So I wrote my own teleoperation command. It runs a loop around 100 times per second: reads the observation, reads the controller, computes the action, and sends it to the arm. The PS4 buttons become commands to engage the control, stop, and go back to the starting position, and there's a ramp on the gains when I engage, so it doesn't jump, plus per-joint limits so it doesn't go past what the mechanical structure can take.
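
The gain ramp and the per-joint limits can be sketched in a few lines. The function names, the one-second ramp, and the numbers are hypothetical, picked only to illustrate the two ideas.

```python
def ramped_gain(target_gain: float, elapsed_s: float, ramp_s: float = 1.0) -> float:
    """Scale a gain linearly from 0 to its full value over ramp_s seconds."""
    fraction = min(max(elapsed_s / ramp_s, 0.0), 1.0)
    return target_gain * fraction


def clamp_to_limits(target: float, lower: float, upper: float) -> float:
    """Keep a joint target inside its mechanical range."""
    return max(lower, min(upper, target))


# Half a second after engaging, the stiffness is at half its final value.
kp_now = ramped_gain(40.0, elapsed_s=0.5)

# A target past the upper limit is pulled back to the limit.
safe_target = clamp_to_limits(2.0, lower=-1.5, upper=1.5)
```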

On the joystick, each analog stick controls the velocity of a joint: the more I tilt it, the faster it turns. On each pass of the loop I take that tilt, multiply it by the max velocity of that joint and by the time of the step, and add the result to a position target that keeps growing. Holding the stick pushes that target little by little, which gives a natural feel of steering the joint. The teleop works only with that target, and what brings the arm's real position up to it is the control at the motor.
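
That stick-to-target integration is a one-liner per joint. A minimal sketch, with a hypothetical function name and example values:

```python
def step_target(target: float, stick: float, max_velocity: float, dt: float) -> float:
    """Advance the position target by one control step.

    stick is the analog tilt in [-1, 1]; max_velocity is in rad/s; dt in seconds.
    """
    return target + stick * max_velocity * dt


# Holding the stick halfway for one second at 100 Hz, with 1 rad/s max velocity.
target = 0.0
for _ in range(100):
    target = step_target(target, stick=0.5, max_velocity=1.0, dt=0.01)
# target has moved by about 0.5 rad
```

Releasing the stick makes the increment zero, so the target simply stays where it was left.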

That changes how I turn on the torque. If I simply powered the motors, each one would try to go to the target stored at that moment, which is usually zero. Since the arm is almost never sitting exactly at zero, the motor would pull hard to close that gap all at once, and the arm would jerk. To avoid that, the instant I engage, before anything else I copy the current position of each joint into its target. That way the motor turns on already wanting to stay where the arm is, without moving, and only from there do the sticks start pushing the targets, with no jolt.
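
A small numeric sketch shows why the copy matters. The joint names, the positions, and the kp value are invented for the example.

```python
def engage(measured_positions: dict) -> dict:
    """Start every target at the measured position, so the error begins at zero."""
    return dict(measured_positions)


KP = 20.0  # example stiffness, not a real setting
measured = {"elbow": 0.8, "wrist": -0.3}

# Naive engage: targets left at zero, so the spring term pulls hard right away.
naive_torque = {j: KP * (0.0 - pos) for j, pos in measured.items()}

# Synced engage: targets copied from the arm, so the spring term starts at zero.
targets = engage(measured)
synced_torque = {j: KP * (targets[j] - measured[j]) for j in measured}
```

In the naive case the elbow would see a sudden torque of about -16, while the synced case starts from zero on every joint.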

Collecting data

With teleoperation working, I started recording demonstrations. Each demonstration is a whole episode of picking up the bottles and putting them in the basket, recorded while I teleoperate the arm myself. On each frame LeRobot stores the observation of that instant, the images from the three cameras and the state of all the joints, along with the action I ran through the controller. It's that observation-action pair, repeated frame by frame across hundreds of episodes, that becomes the training material.

While the control runs at 100 times per second, the recording happens at 30 frames per second. Storing three images and writing everything to disk on every control step would be too heavy, and 30 fps is already enough for the model to learn the motion, on top of being the rate the model I chose is trained at. LeRobot separates the data of each episode: the numeric part, state and action, goes into a columnar table, and the images from each camera are grouped into a compressed video, one per camera. Since there are thousands of frames per episode, that saves a lot of space. The image writing runs on separate threads so it doesn't stall the control loop, and the video compression happens at the end, when I close the episode.
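
One way to get 30 frames per second out of a 100 Hz loop is to store a frame only on the control steps that cross the next frame deadline. This is an illustrative sketch of that idea, not necessarily how LeRobot or my command schedules it, and the function name is hypothetical.

```python
def steps_to_record(control_hz: int, record_fps: int, duration_s: int) -> list:
    """Indices of the control steps on which a frame gets stored."""
    recorded = []
    next_frame = 0
    for step in range(control_hz * duration_s):
        # Integer form of: step / control_hz >= next_frame / record_fps
        if step * record_fps >= next_frame * control_hz:
            recorded.append(step)
            next_frame += 1
    return recorded


frames = steps_to_record(control_hz=100, record_fps=30, duration_s=1)
# 30 frames in one second, landing on steps 0, 4, 7, 10, ...
```

The comparison uses integers on purpose, so the frame count doesn't drift from floating point rounding over a long episode.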

An important choice was how to describe the task. In the text that goes with each episode I include the object and the quantity, something like pick up a number X of bottles and put them in the basket. That way the model has to read the instruction to know how many times to repeat the motion. The most valuable scenes are the ones where the table has more bottles than what was asked, for example three on the table and the request to pick only one. Those are what teach the model to stop at the right amount, instead of just grabbing everything in front of it.

Collecting is more hands-on than it looks. I vary the position and rotation of the bottles on each episode to cover the whole workspace, and when I mess something up, grab a bottle the wrong way, drop one, or fumble in the middle, I re-record that episode instead of letting it slide, because a bad example teaches worse than a missing one.

Choosing and training the model

With the data in hand, I still had to choose the model, and to move fast I went with a VLA. VLA stands for Vision-Language-Action. It's a kind of model that takes image, text, and the robot's state at the same time and produces movement as output. It starts from the models that already understand image and language, the same ones behind the assistants that can see a photo, and gains the ability to generate action, translating all of that into commands for the joints. When I show the cameras and say in text what the task is, it responds with the arm's movement.

Among the open VLAs, I picked SmolVLA, a compact version of this kind of model, made inside the LeRobot ecosystem, from Hugging Face. Inside it has a vision and language model as a base and a part dedicated to producing action, and it comes pretrained with lots of examples from robots of many kinds. It's small enough to train and run on my GPU without much trouble. I did set up the path for a bigger model, pi0.5, but SmolVLA stayed as the main one because it's lighter and faster to iterate.

With the model chosen, I moved on to the fine-tuning: taking this model that already knows the basics of manipulating things and adjusting it with my own examples, from my robot and my task. In this fine-tuning the model still learns from my images and instructions, but only the action part gets updated, something like 100 million of the 450 million parameters, while the vision and language base stays as it was pretrained. That's what makes it fit comfortably on a single GPU (in my case an RTX 3090). It comes out much cheaper than training from scratch, and that's what let me get to a result with a few hundred demonstrations instead of thousands. The training itself is watching the loss curve drop and settle, saving several checkpoints along the way, and then testing some of them on the real arm to find the best one, which isn't always the last.

At the end of this process I have a checkpoint that handles the task. What was left then was the practical part: putting this trained model in command of the arm.

Inference: when the model takes over the robot

During data collection, the one generating the actions was the PS4 controller: on each pass of the loop, the teleop read the joystick, computed the joint targets, and the follower sent that to the motors. At inference, the model steps in exactly at that point. The only thing that changes in the loop is where the action comes from: where I used to read the controller, now I call SmolVLA. It gets the same observation, the camera images and the joint state, returns an action in the same format, and it goes down through the same layer to the motors. In practice, the model drives the arm through the same door I used with the controller in my hand. The difference is the pace: inference runs at 30 times per second, against the 100 of manual control, so I interpolate between one action of the model and the next to smooth the target that reaches the motors.
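
The interpolation between two model actions can be as simple as a linear blend. A minimal sketch, assuming both actions are already known when the control step runs. The function names and the action values are hypothetical.

```python
def blend_factor(t_now: float, t_prev_action: float, model_hz: float = 30.0) -> float:
    """Position between the previous and the next model action, in [0, 1]."""
    return min(max((t_now - t_prev_action) * model_hz, 0.0), 1.0)


def interpolate(prev_action: list, next_action: list, alpha: float) -> list:
    """Linear blend, joint by joint, between two consecutive actions."""
    return [p + (n - p) * alpha for p, n in zip(prev_action, next_action)]


prev_action = [0.0, 1.0]   # joint targets from the previous inference step
next_action = [0.3, 0.4]   # joint targets from the next one

# Two control steps (20 ms) after the previous action: 60% of the way there.
alpha = blend_factor(t_now=0.02, t_prev_action=0.0)
smoothed = interpolate(prev_action, next_action, alpha)
```

Each 100 Hz step then sends a target a little closer to the next action, instead of holding the old one and jumping every third step.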

With the action path identical between training and inference, what the model learns to produce is exactly what the robot knows how to execute, without any translation in the middle. And since the source of the action is interchangeable, if inference starts drifting off I take over the arm with the controller right away, through the same layer, without having to stop anything.

What the model sees comes from three USB cameras, each from a different point of view: one on the robot's head, looking forward; one on the wrist, close to the gripper, for finer manipulation; and a third fixed on a tripod above the table, giving a top view where no object gets hidden behind another. On each pass SmolVLA gets the three images along with the task text and the arm state, and from that it decides the next action. The multiple views give a better sense of depth and object position than a single camera would, and that counts a lot when it comes to closing the gripper at the right spot.

Physical API

Up to here I've talked about the whole software layer that controls the robot. The idea I've been chasing the most lately is called a physical API. We use APIs all the time to send commands to a system and get responses, and what I'm building is a version of that for the physical world, a layer that connects what the robot does in the real world to the data, the training, and the interaction with the people around it.

This starts with the hardware that stays with the robot: in the head sits a Raspberry Pi 5. It's what runs the teleoperation and the recording, sends the datasets to the training machine, and also drives a 7-inch touchscreen that became the robot's face. When idle, the screen shows an animation of blinking eyes.

The first part of the physical API lives on that screen: collecting human feedback during inference. While the robot runs a task, anyone can judge right there whether that run went well or not and, when it didn't, point out what went wrong. Underneath, my teleoperation command brings up a local HTTP server. When I hit stop, the state changes, the screen notices and swaps the eyes for the feedback window, then the answer goes back to the server and gets recorded, with the plan of using that to decide what goes into the next training.

And the feedback is just the start. The same layer that talks to the motors gives telemetry and observability: since each motor already reports its own state on every cycle, I can track the temperature of each one and catch overheating before it turns into a problem, or check the battery health from the bus voltage. And the same channel works for maintenance, like doing a firmware update on the motors without taking anything apart.
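
The temperature check on top of that telemetry can start very simple. In this sketch the 70 °C limit, the joint names, and the readings are all made up for illustration. The real limit depends on the motor model.

```python
def motors_over_limit(temperatures_c: dict, limit_c: float = 70.0) -> list:
    """Names of the motors whose last reported temperature reached the limit."""
    return sorted(name for name, temp in temperatures_c.items() if temp >= limit_c)


# Hypothetical readings taken from the last state each motor reported.
readings = {"shoulder_pitch": 48.5, "elbow": 71.2, "gripper": 39.0}
hot = motors_over_limit(readings)  # only the elbow is above the example limit
```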

The plan is for this layer to grow beyond feedback and become the physical API I have in mind: a way to help the robot get better, in a continuous loop of use and correction, without relying only on isolated data collection sessions.

Conclusion

It took me about 90 days to do everything I've described here. It was a deep dive into areas I didn't know well, and what's standing today became the base to keep going.

The list of next steps is already big. I want to build the second arm, swap the joystick for a miniature replica of the robot, so I teleoperate by moving the small copy instead of mapping everything onto the controller, and have some parts made in aluminum, because the printed plastic structure won't take the weight of the motors of both arms.

There's a lot ahead, and I plan to document every step. Thanks for following along, and see you in the next one.

Top comments (2)

mote • Jul 5

The per-motor memory box is a good detail: reading last-known state locally instead of blocking on the bus is how you get to 100Hz without jitter.

When the robot pulls off something worth keeping during teleop (a clean grasp, a stumble recovery), where does that episode live? If it is only in LeRobot training buffer, you lose the raw motor state that made it work. We have been building moteDB as a Rust-native store for this gap: time-series motor data, video frames, and structured events in one place, so the control loop and training pipeline read from the same source.

You mentioned 15% bus utilization at 100Hz with 6 motors. Does the full robot push that significantly higher, or is there room for IMU data on the same CAN bus?

Daniel Romero • Jul 6

Good question about storage. LeRobot datasets actually record motor state (positions/velocities) alongside the video frames for every episode, so the raw state behind a good grasp isn't lost. The training pipeline reads from the same dataset. moteDB looks interesting though, I'll take a look.

About the bus: scaling linearly, a full robot with 20+ motors at 100Hz would push utilization past 50%, so I'd go with multiple CAN buses instead of a single one. There's definitely room for an IMU though. Something like the HEXFELLOW Y200, a CAN-native IMU mounted on the torso, only adds a few percent of load at 200Hz. The practical catch is bitrate, since every node on the same physical bus needs the same speed. So the IMU either runs at the motors' 1Mbps or gets its own bus.