After reading the excellent Action100M paper, I became very excited about the potential of fully automated, large-scale video action annotation.
High-quality temporal action hierarchies open doors for training stronger video world models, video-language models (VLMs), vision-language-action models (VLAs), humanoid control policies, and physical reasoning systems.
But two practical problems quickly appeared:
- There was no convenient way to visualize these rich, hierarchical annotations together with the video.
- Generating such annotations at scale for new/custom video datasets still felt out of reach for many researchers and engineers.
So I built two tools to help move things forward.
1. Kriya Visualizer – See Action100M-style Annotations Come Alive
I created a lightweight, static web-based visualizer specifically designed for Action100M-style temporal action trees.
Features (current version):
- Video player synced with the annotation timeline
- Hierarchical timeline (one row per level in the action tree)
- Nodes highlight at the current timestamp
- Side panel with metadata, full transcript, and raw JSON view
- Clean, single-screen layout (no installation needed)
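For context, here is a minimal sketch of what a hierarchical temporal annotation of this kind can look like, plus the timestamp lookup a timeline visualizer performs to highlight the currently active node at each level. Field names (`label`, `start`, `end`, `children`) are illustrative, not the exact Action100M schema:

```python
# A toy hierarchical annotation: one coarse action with finer sub-actions.
# Field names are illustrative, not the exact Action100M schema.
annotation = {
    "video_id": "demo_clip",
    "segments": [
        {
            "label": "prepare breakfast",  # level-0 (coarse) action
            "start": 0.0, "end": 30.0,
            "children": [
                {"label": "crack egg", "start": 0.0, "end": 8.0, "children": []},
                {"label": "whisk egg", "start": 8.0, "end": 15.0, "children": []},
                {"label": "fry egg",   "start": 15.0, "end": 30.0, "children": []},
            ],
        }
    ],
}

def active_labels(segments, t, depth=0):
    """Return (depth, label) pairs for every node whose span covers
    timestamp t -- one highlighted node per timeline row."""
    hits = []
    for seg in segments:
        if seg["start"] <= t < seg["end"]:
            hits.append((depth, seg["label"]))
            hits.extend(active_labels(seg["children"], t, depth + 1))
    return hits

print(active_labels(annotation["segments"], 10.0))
# → [(0, 'prepare breakfast'), (1, 'whisk egg')]
```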
It's open source under the MIT license → feel free to fork, improve, or use it in your projects.
Access Here: https://ankk98.github.io/kriya-viz/
GitHub repo: https://github.com/Ankk98/kriya-viz
If you're working with Action100M data (or any similar dense temporal action hierarchy), give it a try and let me know what features would make it more useful.
2. Kriya-EPIC-KITCHENS – Automatic Annotations on Egocentric Videos
Next, I wanted to test how well fully automatic annotation works on real, challenging egocentric data.
I ran the Kriya Full Automated Action Annotation API (early preview) on a small subset of videos from the popular EPIC-KITCHENS-100 dataset.
Result: a preview Hugging Face dataset with ~6 videos fully annotated in Action100M style, with no human labeling involved.
- Temporal segments with hierarchical actions
- Natural language captions/descriptions per segment
- Ready to download and use
Dataset link: https://huggingface.co/datasets/ankk98/kriya-epic-kitchens
Early results on kitchen egocentric videos look very promising. I'm excited to see if/how these annotations can feed downstream tasks:
- Video world models
- VLM / VLA fine-tuning
- Robotic manipulation from egocentric views
- Physical AI reasoning
The current API version deliberately follows the Action100M pipeline closely. An improved version that addresses some limitations is already in the works.
API docs (early preview): https://mindandmotionlabs.com/api-docs.html
(You send videos → get back structured temporal action hierarchies)
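As a rough sketch of how a client might consume such a response (the response shape here is hypothetical — see the API docs for the real interface), one common first step is flattening the nested hierarchy into per-level rows for fine-tuning or analysis:

```python
import json

# Hypothetical response payload: nested segments with captions.
# The real API's field names may differ -- this is an assumed shape.
response_text = json.dumps({
    "segments": [
        {"caption": "wash vegetables", "start": 2.0, "end": 12.0,
         "children": [
             {"caption": "rinse lettuce", "start": 2.0, "end": 7.0, "children": []},
             {"caption": "shake dry", "start": 7.0, "end": 12.0, "children": []},
         ]},
    ]
})

def flatten(segments, level=0):
    """Yield (level, start, end, caption) rows depth-first --
    a flat table that is easy to feed into downstream training code."""
    for seg in segments:
        yield (level, seg["start"], seg["end"], seg["caption"])
        yield from flatten(seg["children"], level + 1)

rows = list(flatten(json.loads(response_text)["segments"]))
print(rows)
# → [(0, 2.0, 12.0, 'wash vegetables'),
#    (1, 2.0, 7.0, 'rinse lettuce'),
#    (1, 7.0, 12.0, 'shake dry')]
```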
Why This Matters
Manual video annotation at scale is expensive and slow. If high-quality automatic annotation becomes reliable, we can:
- Train on orders-of-magnitude more grounded video data
- Build more general-purpose video understanding and action generation models
- Accelerate progress toward capable robotic and embodied AI systems
These two small releases are just early steps: Kriya Visualizer for inspection/debugging, and Kriya-EPIC-KITCHENS as a proof-of-concept dataset.
Feedback, feature requests, collaboration ideas, or even just "I tried it and here's what broke" are very welcome!
What are you building with video action data right now? Drop a comment below 👇
