We ran 216 RoboTurk robot teleoperation episodes through a physics
checker. 21.9% failed.
> "This is not synthetic data. These are publicly used datasets."
Not mislabeled. Not missing values. Physically impossible motion —
data that violates Newton's laws, rigid-body kinematics, or IMU
internal consistency.
These episodes were being used to train robot arms.
## What we checked
Seven biomechanical laws validated per window:
- Newton's Second Law (F = ma coupling)
- Segment resonance frequency
- Rigid body kinematics
- Jerk bounds (human motion ≤ 500 m/s³, Flash & Hogan 1985)
- IMU internal consistency
- Ballistocardiography
- Joule heating (EMG + thermal)
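To make one of these checks concrete, here is a minimal sketch of a jerk-bound test: differentiate acceleration once and compare the peak magnitude against the ~500 m/s³ limit from Flash & Hogan (1985). This is an illustration of the idea, not the tool's actual implementation, and the function names are ours.

```python
import numpy as np

def max_jerk(accel, fs):
    """Peak jerk magnitude (m/s^3) for an (N, 3) acceleration array sampled at fs Hz."""
    jerk = np.diff(accel, axis=0) * fs  # finite-difference time derivative
    return float(np.max(np.linalg.norm(jerk, axis=1)))

def passes_jerk_bound(accel, fs, bound=500.0):
    """Human motion rarely exceeds ~500 m/s^3 (Flash & Hogan, 1985)."""
    return max_jerk(accel, fs) <= bound
```

A smooth 1 Hz sinusoid at 100 Hz sampling passes easily; a single-sample spike of 100 m/s² produces a jerk of ~10,000 m/s³ and fails.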
Each window gets a score 0–100 and a tier:
GOLD / SILVER / BRONZE / REJECTED
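The score-to-tier mapping can be pictured as a simple threshold function. The cutoffs below are illustrative placeholders, not the tool's actual thresholds:

```python
def tier_from_score(score):
    """Map a 0-100 physics score to a certification tier.
    Thresholds here are illustrative; s2s-certify's real cutoffs may differ."""
    if score >= 90:
        return "GOLD"
    if score >= 75:
        return "SILVER"
    if score >= 50:
        return "BRONZE"
    return "REJECTED"
```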
## Results on real datasets
| Dataset | Result |
|---|---|
| RoboTurk Open-X (216 episodes) | 21.9% rejected as physically invalid |
| PAMAP2 (100Hz IMU) | +4.23% F1 after filtering corrupted windows |
| WESAD (stress classification) | +3.1% F1 improvement |
| UCI HAR | +2.51% F1 vs corrupted baseline |
| WISDM 2019 | +1.74% F1 improvement |
The F1 gains come from training the same classifier on certified-only
data versus on all data. Not a bigger model. Not more data. Just cleaner data.
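The "certified-only" training setup amounts to dropping REJECTED windows before fitting, keeping features, labels, and tiers aligned. A minimal sketch (our own helper, not part of the s2s-certify API):

```python
def filter_certified(X, y, tiers, keep=frozenset({"GOLD", "SILVER", "BRONZE"})):
    """Keep only windows whose certification tier is in `keep`.
    X, y, tiers are parallel sequences of windows, labels, and tiers."""
    pairs = [(x, label) for x, label, t in zip(X, y, tiers) if t in keep]
    if not pairs:
        return [], []
    X_kept, y_kept = zip(*pairs)
    return list(X_kept), list(y_kept)
```

The filtered `X`, `y` then feed into whatever classifier you were already using; nothing about the model changes.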
## Why this matters for Physical AI
When you train on images, bad data hurts accuracy.
When you train a robot arm, bad data teaches physically impossible
movement patterns. The arm learns to move like it has no mass.
A prosthetic hand trained on corrupted EMG data fails the person
wearing it. A humanoid robot trained on synthetic motion that violates
rigid-body kinematics learns to move like a cartoon.
There's no standard quality floor for motion data.
We built one.
## The tool
```shell
pip install s2s-certify
s2s-certify your_imu_data.csv --segment forearm
```
```python
from s2s_standard_v1_3 import S2SPipeline

# ts, acc, gyro: timestamp and sensor arrays from your recording
pipe = S2SPipeline(segment="forearm")
result = pipe.certify(imu_raw={
    "timestamps_ns": ts,
    "accel": acc,
    "gyro": gyro,
})
print(result["tier"])   # GOLD / SILVER / BRONZE / REJECTED
print(result["score"])  # 0-100
```
Zero runtime dependencies. 116/116 tests passing.
## Reference benchmark
We published a reproducible benchmark: 29 windows from NinaPro DB5,
PAMAP2, and WESAD — real data, not synthetic.
- real_human: 20/21 correctly certified (95%)
- corrupted: correctly rejected or downgraded
- Results: experiments/s2s_reference_benchmark.json
Anyone can run it and get the same numbers.
## What we found that surprised us
The most common failure mode in RoboTurk wasn't jerk violations.
It was IMU internal consistency — the translation and rotation
channels showed different teleoperation latency.
52% of rejected windows had this pattern.
This means the robot arm's internal sensors were decoupled during
recording — the motion looked valid to a human reviewer but the
physics said otherwise.
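One way to detect this kind of channel decoupling is to cross-correlate a translational signal (e.g. acceleration magnitude) against a rotational one (gyro magnitude) and look for a non-zero lag at the correlation peak. The sketch below is our own illustration of the idea, not the tool's consistency check:

```python
import numpy as np

def channel_lag_samples(accel_mag, gyro_mag):
    """Estimate how many samples the second signal lags the first,
    via the peak of the full cross-correlation. A sizable non-zero lag
    suggests the translation and rotation channels were recorded with
    different latencies."""
    a = accel_mag - accel_mag.mean()
    g = gyro_mag - gyro_mag.mean()
    corr = np.correlate(a, g, mode="full")
    # index (len(g) - 1) corresponds to zero lag
    return (len(g) - 1) - int(np.argmax(corr))
```

On a well-synchronized recording the estimated lag should sit near zero; a consistent offset of several samples is the "different teleoperation latency" signature described above.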
GitHub: https://github.com/timbo4u1/S2S
PyPI: pip install s2s-certify
DOI: 10.5281/zenodo.18878307
If you're working with IMU, EMG, or robot teleoperation data and
want to know what percentage of your dataset is physically valid —
run it and see.