After reading the excellent Action100M paper, I became very excited about the potential of fully automated, large-scale video action annotation.
High-quality temporal action hierarchies open doors for training stronger video world models, video-language models (VLMs), vision-language-action models (VLAs), humanoid control policies, and physical reasoning systems.
But two practical problems quickly appeared:
- There was no convenient way to visualize these rich, hierarchical annotations together with the video.
- Generating such annotations at scale for new/custom video datasets still felt out of reach for many researchers and engineers.
So I built two tools to help move things forward.
1. Kriya Visualizer – See Action100M-style Annotations Come Alive
I created a lightweight, static web-based visualizer specifically designed for Action100M-style temporal action trees.
Features (current version):
- Video player synced with the annotation timeline
- Hierarchical timeline (one row per level in the action tree)
- Nodes highlight at the current timestamp
- Side panel with metadata, full transcript, and raw JSON view
- Clean, single-screen layout (no installation needed)
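For context, here is a minimal sketch of what a hierarchical temporal annotation of this kind can look like, plus the timestamp lookup a timeline visualizer performs to highlight the currently active node at each level. Field names (`label`, `start`, `end`, `children`) are illustrative, not the exact Action100M schema:

```python
# A toy hierarchical annotation: one coarse action with finer sub-actions.
# Field names are illustrative, not the exact Action100M schema.
annotation = {
    "video_id": "demo_clip",
    "segments": [
        {
            "label": "prepare breakfast",  # level-0 (coarse) action
            "start": 0.0, "end": 30.0,
            "children": [
                {"label": "crack egg", "start": 0.0, "end": 8.0, "children": []},
                {"label": "whisk egg", "start": 8.0, "end": 15.0, "children": []},
                {"label": "fry egg",   "start": 15.0, "end": 30.0, "children": []},
            ],
        }
    ],
}

def active_labels(segments, t, depth=0):
    """Return (depth, label) pairs for every node whose span covers
    timestamp t -- one highlighted node per timeline row."""
    hits = []
    for seg in segments:
        if seg["start"] <= t < seg["end"]:
            hits.append((depth, seg["label"]))
            hits.extend(active_labels(seg["children"], t, depth + 1))
    return hits

print(active_labels(annotation["segments"], 10.0))
# → [(0, 'prepare breakfast'), (1, 'whisk egg')]
```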
It's open source under the MIT license → feel free to fork, improve, or use it in your projects.
Access Here: https://ankk98.github.io/kriya-viz/
GitHub repo: https://github.com/Ankk98/kriya-viz
If you're working with Action100M data (or any similar dense temporal action hierarchy), give it a try and let me know what features would make it more useful.
2. Kriya-EPIC-KITCHENS – Automatic Annotations on Egocentric Videos
Next, I wanted to test how well fully automatic annotation works on real, challenging egocentric data.
I ran the Kriya Full Automated Action Annotation API (early preview) on a small subset of videos from the popular EPIC-KITCHENS-100 dataset.
Result: a preview Hugging Face dataset with ~6 videos fully annotated in Action100M style, with no human labeling involved.
- Temporal segments with hierarchical actions
- Natural language captions/descriptions per segment
- Ready to download and use
Dataset link: https://huggingface.co/datasets/ankk98/kriya-epic-kitchens
Early results on kitchen egocentric videos look very promising. I'm excited to see if/how these annotations can feed downstream tasks:
- Video world models
- VLM / VLA fine-tuning
- Robotic manipulation from egocentric views
- Physical AI reasoning
The current API version deliberately follows the Action100M pipeline closely. An improved version that addresses some limitations is already in the works.
API docs (early preview): https://mindandmotionlabs.com/api-docs.html
(You send videos → get back structured temporal action hierarchies)
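As a rough sketch of how a client might consume such a response (the response shape here is hypothetical — see the API docs for the real interface), one common first step is flattening the nested hierarchy into per-level rows for fine-tuning or analysis:

```python
import json

# Hypothetical response payload: nested segments with captions.
# The real API's field names may differ -- this is an assumed shape.
response_text = json.dumps({
    "segments": [
        {"caption": "wash vegetables", "start": 2.0, "end": 12.0,
         "children": [
             {"caption": "rinse lettuce", "start": 2.0, "end": 7.0, "children": []},
             {"caption": "shake dry", "start": 7.0, "end": 12.0, "children": []},
         ]},
    ]
})

def flatten(segments, level=0):
    """Yield (level, start, end, caption) rows depth-first --
    a flat table that is easy to feed into downstream training code."""
    for seg in segments:
        yield (level, seg["start"], seg["end"], seg["caption"])
        yield from flatten(seg["children"], level + 1)

rows = list(flatten(json.loads(response_text)["segments"]))
print(rows)
# → [(0, 2.0, 12.0, 'wash vegetables'),
#    (1, 2.0, 7.0, 'rinse lettuce'),
#    (1, 7.0, 12.0, 'shake dry')]
```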
Why This Matters
Manual video annotation at scale is expensive and slow. If high-quality automatic annotation becomes reliable, we can:
- Train on orders-of-magnitude more grounded video data
- Build more general-purpose video understanding and action generation models
- Accelerate progress toward capable robotic and embodied AI systems
These two small releases are just early steps: Kriya Visualizer for inspection/debugging, and Kriya-EPIC-KITCHENS as a proof-of-concept dataset.
Feedback, feature requests, collaboration ideas, or even just "I tried it and here's what broke" are very welcome!
What are you building with video action data right now? Drop a comment below 👇
