Most people think of AI video as "type a prompt, wait 30 seconds, get a clip." Sure, that's one way to do it. But there's another side of this that doesn't get enough attention: generating video in real-time. As in, live, at 30+ frames per second while you're watching.
Been working in this space for the past few months, and I want to walk through what's actually possible right now, what tools exist, and what people are building. Not hype. Just the current state of things.
What "real-time" actually means here
When I say real-time AI video, I mean running diffusion models fast enough that the output feels live. You point a webcam at yourself, and the model transforms what it sees into something else. Frame by frame, in the moment. Or you feed in a text prompt and watch the scene generate continuously, not as a rendered clip you download later.
The breakthrough that made this possible was StreamDiffusion, which hit 91 FPS on an RTX 4090 using stream batch processing and a technique called Stochastic Similarity Filtering (basically, it skips redundant frames). That was the proof-of-concept that diffusion models could actually keep up with live video.
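To give a feel for the frame-skipping idea, here's a minimal sketch in Python. It's not StreamDiffusion's actual implementation (their filter skips probabilistically based on how similar consecutive frames are); it just shows the core trick of reusing the previous output when the input has barely changed.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel().astype(np.float32), b.ravel().astype(np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class SimilarityFilter:
    """Skip the expensive model call when the new frame is nearly identical."""

    def __init__(self, threshold: float = 0.98):
        self.threshold = threshold
        self.last_input = None
        self.last_output = None

    def __call__(self, frame: np.ndarray, generate):
        # The real filter skips probabilistically; this sketch uses a hard
        # threshold for clarity. `generate` is your diffusion call.
        if self.last_input is not None and cosine_similarity(frame, self.last_input) > self.threshold:
            return self.last_output  # reuse the previous result
        self.last_input = frame
        self.last_output = generate(frame)
        return self.last_output
```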
Since then, the space has moved fast:
- LTX-Video (Lightricks) generates 30 FPS at 1216x704, faster than real-time on H100
- CausVid (MIT/Adobe) uses causal DiT with KV caching for continuous streaming generation at ~9.4 FPS, indefinitely
- SDXL Lightning does high-quality image generation in 4 steps, fast enough for interactive use
- Krea Realtime takes a different approach, with high visual quality and a hefty 14B model
And there are more: Memflow, RewardForcing, LongLive...
We're also past the demo stage: people are building actual things with these models.
What people are actually building
Here's what I find genuinely interesting. The real-time constraint changes what's possible. Once AI video works live, it stops being a "content creation tool" and becomes something closer to a medium.
Some examples from projects I've seen recently:
Interactive art installations. Feed a camera pointed at gallery visitors into a depth estimation preprocessor, then into a diffusion model. The art reacts to the audience in real-time. One person built a projection-mapped canvas where body tracking drove the AI generation.
Live VJ performances. TouchDesigner users are routing real-time AI generation into their visual performance rigs. MIDI controllers adjust prompts and LoRA weights on the fly. Beat detection triggers style changes. The AI becomes another instrument.
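The plumbing for this kind of control is smaller than you might expect. Here's a hedged sketch of mapping a MIDI knob to denoising strength with the mido library; `set_denoise_strength` is a stand-in for whatever parameter hook your generation loop actually exposes.

```python
import mido

def midi_to_strength(value: int) -> float:
    """Map a 0-127 CC value to a 0.2-0.9 denoising strength."""
    return 0.2 + (value / 127) * 0.7

def listen(port_name: str, set_denoise_strength):
    # Open the MIDI input port and react to control-change messages.
    with mido.open_input(port_name) as port:
        for msg in port:
            # CC 1 is commonly the mod wheel; use whichever knob you like.
            if msg.type == "control_change" and msg.control == 1:
                set_denoise_strength(midi_to_strength(msg.value))
```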
Games with AI-generated worlds. We built a proof of concept: a choose-your-own-adventure game where Unity handles the game logic and an LLM writes the narrative, but every scene is rendered live by a diffusion model. The visuals are generated, not pre-made. It's janky and early, but the potential is obvious.
Webcam transformations. The simplest use case. Point your webcam at something, see it transformed in real-time. Change styles, apply LoRAs, adjust denoising strength. It's the "hello world" of real-time AI video, and it's surprisingly fun to just play with.
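If you want to see roughly what that loop looks like in code, here's a sketch using OpenCV and Hugging Face diffusers with a distilled model (stabilityai/sd-turbo). This isn't Scope's implementation, just the general pattern; actual FPS depends on your GPU, resolution, and step count.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import AutoPipelineForImage2Image

# Load a distilled model that can do img2img in very few steps.
pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16
).to("cuda")

cap = cv2.VideoCapture(0)
prompt = "a watercolor painting, soft light"

while True:
    ok, frame = cap.read()
    if not ok:
        break
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).resize((512, 512))
    result = pipe(
        prompt=prompt,
        image=image,
        num_inference_steps=2,  # distilled models need very few steps
        strength=0.5,           # how far the output drifts from the input frame
        guidance_scale=0.0,     # sd-turbo is trained to run without CFG
    ).images[0]
    cv2.imshow("transformed", cv2.cvtColor(np.array(result), cv2.COLOR_RGB2BGR))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```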
Preprocessor/postprocessor pipelines. This is where it gets technical. You can chain depth estimation (Depth Anything V2), pose detection (MediaPipe), edge detection, and segmentation (SAM2) as preprocessors before the diffusion model, then run frame interpolation (RIFE) and upscaling (FlashVSR) as postprocessors. The whole pipeline runs in real-time. It's basically building a video processing graph where AI models are nodes.
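If you squint, that graph is just callables chained around the generator. Here's a bare-bones sketch of the idea; the stage names in the commented usage line are placeholders, not modules from any specific tool.

```python
from typing import Callable, List
import numpy as np

# A stage takes a frame (as an array) and returns a frame.
Stage = Callable[[np.ndarray], np.ndarray]

class FramePipeline:
    def __init__(self, pre: List[Stage], generate: Stage, post: List[Stage]):
        self.pre, self.generate, self.post = pre, generate, post

    def __call__(self, frame: np.ndarray) -> np.ndarray:
        for stage in self.pre:
            frame = stage(frame)       # e.g. depth estimation, pose, edges
        frame = self.generate(frame)   # the diffusion model itself
        for stage in self.post:
            frame = stage(frame)       # e.g. frame interpolation, upscaling
        return frame

# Hypothetical usage (estimate_depth, run_model, etc. are placeholders):
# pipeline = FramePipeline(pre=[estimate_depth], generate=run_model,
#                          post=[interpolate, upscale])
```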
One community member left Scope running for 2.5 hours, generating real-time video the whole time. You can watch it on YouTube, and it's pretty amazing to see!
Here's more about David's project, and the cover image for this post is from his video!
The open source tool we're building for this
Full disclosure: I work on this. But I think it's genuinely useful, so here goes.
Scope is an open-source tool for running real-time generative video models/pipelines. It's built by Daydream.
What it does:
- Runs multiple real-time models: StreamDiffusion V2, LongLive, Krea Realtime, and more
- LoRA support for style control
- Local-first (runs on your GPU)
- API for integration with Unity, Unreal Engine, TouchDesigner, ComfyUI
- Spout/NDI output for routing video to other tools
- Desktop app for quick experimentation
- Plugin architecture for preprocessors and postprocessors
We've been shipping updates weekly. LoRA support, new model integrations, desktop app improvements. The docs walk you through getting started from scratch.
I've been documenting my own experiments in a YouTube series called "Building with Scope in Public" where I test things like how resolution and denoising steps affect FPS across different models, and whether AI upscalers can fix the quality tradeoff from running at lower resolutions. Short answer: they can, and the free open-source option (Video2X) holds up surprisingly well against the $300 commercial one.
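If you want to run that kind of resolution-and-steps experiment yourself, a crude timing loop gets you most of the way. This sketch assumes the same diffusers img2img setup as the webcam example above; your numbers will differ by GPU, model, and precision.

```python
import time
import torch
from PIL import Image
from diffusers import AutoPipelineForImage2Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16
).to("cuda")

def measure_fps(size: int, steps: int, frames: int = 30) -> float:
    dummy = Image.new("RGB", (size, size))  # a blank frame is fine for timing
    kwargs = dict(prompt="benchmark", image=dummy, num_inference_steps=steps,
                  strength=0.5, guidance_scale=0.0)
    pipe(**kwargs)  # warm-up so caching/compilation doesn't skew the numbers
    start = time.perf_counter()
    for _ in range(frames):
        pipe(**kwargs)
    return frames / (time.perf_counter() - start)

for size in (384, 512, 768):
    for steps in (2, 4, 8):
        print(f"{size}px, {steps} steps: {measure_fps(size, steps):.1f} FPS")
```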
If you want to build something with this
We're kicking off our second AI Video Program cohort on February 9th.
You should apply before the end of February 6!
It's a structured program where a small group of developers, researchers, and creative technologists build projects with Scope over a few weeks.
What you get:
- Hands-on workshops walking through Scope setup, the API, integration with other tools
- Daily office hours where you can ask anything
- Individual check-ins with the team
- A curated group of people who are actually experimenting in this space
- $5,000+ in prizes ($2,500 / $1,750 / $750)
The first cohort ran over the holidays and produced some genuinely cool projects. People built everything from real-time poetry machines (voice to text to image generation in TouchDesigner) to projection-mapped interactive pieces to experimental game prototypes.
We're looking for people who want to experiment. Maybe you want to extend Scope's pipeline with new preprocessor or postprocessor modules. Maybe you want to connect it to a game engine or a VJ rig. Maybe you just want to see what happens when you point a diffusion model at your webcam and start tweaking parameters. All of that works.
You need a GPU (local or cloud; for cloud, we've got you covered if you're comfortable with RunPod), 5+ hours per week, and some technical background. You don't need to be an expert in diffusion models.
If you have questions about the program or about real-time AI video in general, drop them in the comments. Happy to dig into specifics. Or just find me (@viborc) on our Discord.
