A few weeks ago, generating video from a single image required either a cloud API with per-second billing or a GPU with 24+ GB VRAM. FramePack changed that.
FramePack F1 generates video from a single image on 6 GB of VRAM. That's a GTX 1660, an RTX 3060, or just about any mid-range GPU from the last five years. I've been running it locally and the results are genuinely usable — not "proof of concept" usable, but "I'd actually put this in a project" usable.
Here's what's actually involved, because "runs on 6 GB VRAM" doesn't tell the whole story.
What You're Actually Downloading
FramePack isn't one file. It's a pipeline with five components, and you need all of them:
| Component | Size | What It Does |
|---|---|---|
| FramePack F1 I2V Model (FP8) | 13 GB | The core diffusion model — generates video frames |
| LLaVA LLaMA3 Text Encoder (FP8) | 8.5 GB | Understands your text prompt |
| HunyuanVideo VAE | 2.3 GB | Encodes your input image to latent space, decodes generated frames back to pixels |
| SigCLIP Vision Encoder | 900 MB | Understands the content of your input image |
| CLIP-L Text Encoder | 240 MB | Additional text understanding (shared with HunyuanVideo) |
Total download: ~25 GB. Plus you need ComfyUI installed and the ComfyUI-FramePackWrapper custom nodes from Kijai.
So yeah — the model fits in 6 GB VRAM, but your hard drive needs 25 GB and the initial download takes a while.
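The per-component sizes in the table really do add up to that figure; a quick sanity check, with the sizes in GB taken straight from the table above:

```python
# Component download sizes from the table above, in GB
sizes_gb = {
    "FramePack F1 I2V (FP8)": 13.0,
    "LLaVA LLaMA3 text encoder (FP8)": 8.5,
    "HunyuanVideo VAE": 2.3,
    "SigCLIP vision encoder": 0.9,
    "CLIP-L text encoder": 0.24,
}

total = sum(sizes_gb.values())
print(f"Total download: ~{total:.1f} GB")  # ~24.9 GB, i.e. roughly 25 GB
```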
Why 6 GB VRAM Works
Most video generation models load everything into VRAM at once. A 14B parameter model at FP16 needs ~28 GB just for the weights. That's why Wan 2.1 14B needs a 3090 or better.
FramePack uses next-frame prediction. Instead of generating all frames simultaneously, it generates one frame at a time, keeping only what it needs in memory. The model itself is 13 GB on disk, but FP8 quantization and the frame-by-frame approach mean VRAM usage peaks around 6 GB.
The trade-off is speed. Generating a 3-second clip takes several minutes on a mid-range GPU. On a high-end card it's faster, but it's never going to be real-time. The architecture is optimized for memory, not throughput.
What It Uses Under the Hood
FramePack F1 is built on the HunyuanVideo backbone. That's why it shares components with HunyuanVideo (the VAE, CLIP-L encoder). The pipeline works like this:
- SigCLIP Vision Encoder looks at your input image and creates visual embeddings — a numerical representation of what's in the image
- DualCLIPLoader loads both text encoders (CLIP-L + LLaVA LLaMA3) to process your text prompt
- VAE encodes your input image into latent space
- FramePackSampler takes the image latent, vision embeddings, and text conditioning, then generates video frames one at a time using next-frame prediction
- VAE decodes the generated latent frames back into actual pixels
The sampler has a `gpu_memory_preservation` parameter set to 6.0 GB by default — it actively manages memory to stay within that budget.
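The pipeline above can be sketched as a plain loop. To be clear, this is illustrative pseudocode, not the actual ComfyUI node API: the function names are made up, and tiny numpy arrays stand in for latents. The point is the shape of the data flow — and the bounded frame history that keeps memory flat no matter how long the clip gets:

```python
import numpy as np

LATENT_SHAPE = (16, 64, 96)   # (channels, H/8, W/8) -- placeholder sizes
HISTORY = 4                   # frames of context kept in memory, illustrative

def encode_image(image):
    # Stand-in for the VAE encoder: image -> latent
    return np.zeros(LATENT_SHAPE)

def predict_next_frame(context, text_cond, vision_emb):
    # Stand-in for the diffusion model's next-frame prediction step.
    # It only ever sees the last HISTORY frames, so memory use stays flat.
    return context[-1] + np.random.randn(*LATENT_SHAPE) * 0.01

def generate(image, text_cond, vision_emb, num_frames=12):
    frames = [encode_image(image)]           # the input image seeds frame 0
    for _ in range(num_frames - 1):
        context = frames[-HISTORY:]          # bounded context window
        frames.append(predict_next_frame(context, text_cond, vision_emb))
    return np.stack(frames)                  # VAE-decode these in reality

video_latents = generate(image=None, text_cond=None, vision_emb=None)
print(video_latents.shape)  # (12, 16, 64, 96)
```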
What the Results Look Like
FramePack does motion from a still image. Give it a photo of a person and it'll add natural movement — head turns, blinking, subtle body motion. Give it a landscape and it'll add wind, clouds, water flow.
It's strongest with:
- Portraits and people — natural micro-movements
- Nature scenes — wind, water, atmospheric effects
- Simple compositions — one clear subject against a background
It struggles with:
- Complex multi-person scenes — tracking gets confused
- Fast action — it's tuned for gentle, natural motion
- Long durations — quality degrades after ~4 seconds
The output resolution follows your input image. Feed it a 512x768 portrait and you get a 512x768 video.
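That works because the VAE's downsampling is a fixed ratio. Assuming the usual 8x spatial factor for this family of video VAEs (an assumption on my part, not something FramePack documents in the UI), latent dimensions scale directly with the input:

```python
def latent_size(width: int, height: int, factor: int = 8) -> tuple[int, int]:
    # Input dimensions should be divisible by the VAE's downsample factor;
    # frontends typically round or crop to the nearest valid size for you.
    return width // factor, height // factor

print(latent_size(512, 768))  # (64, 96): a 512x768 input decodes back to 512x768
```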
Running It
If you want to set this up manually: install ComfyUI, clone the FramePackWrapper custom nodes, download all five model files to the correct ComfyUI subdirectories, build a workflow connecting all the nodes in the right order, and pray nothing conflicts with your existing setup.
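If you do go the manual route, a quick script to confirm the files landed in the right folders saves a lot of workflow-won't-load debugging. The subdirectories below are standard ComfyUI model folders; the filenames are placeholders I made up — substitute whatever your actual downloads are called:

```python
from pathlib import Path

COMFYUI = Path.home() / "ComfyUI"  # adjust to your install location

# (ComfyUI subdirectory, placeholder filename) -- use your real filenames
REQUIRED = [
    ("models/diffusion_models", "framepack_f1_i2v_fp8.safetensors"),
    ("models/text_encoders",    "llava_llama3_fp8.safetensors"),
    ("models/text_encoders",    "clip_l.safetensors"),
    ("models/vae",              "hunyuan_video_vae.safetensors"),
    ("models/clip_vision",      "sigclip_vision.safetensors"),
]

def missing_models(root: Path) -> list[str]:
    """Return the relative paths of any required model files not on disk."""
    return [f"{sub}/{name}" for sub, name in REQUIRED
            if not (root / sub / name).exists()]

if __name__ == "__main__":
    gaps = missing_models(COMFYUI)
    print("All models present" if not gaps else f"Missing: {gaps}")
```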
Or — and this is what I built — Locally Uncensored handles the entire pipeline. Open the Create tab, pick the FramePack bundle, one-click download all five components, upload an image, write what motion you want, generate. The app builds the correct workflow automatically.
It also does text-to-image, image-to-image, and text-to-video with other models (Wan 2.1, CogVideoX, FLUX, SDXL). ComfyUI gets auto-detected or one-click installed. Open source, AGPL-3.0.
The Honest Take
6 GB VRAM video generation is real, and it works. But let's not pretend it's magic:
- 25 GB download before you generate anything
- Several minutes per clip on mid-range hardware
- 3-4 seconds of usable output per generation
- Quality varies — some images animate beautifully, others look weird
It's a tool for specific use cases, not a replacement for cloud video gen services. But for those use cases — quick social media content, animated product shots, bringing concept art to life — running it locally for free on a GPU you already own is genuinely compelling.
GitHub: PurpleDoubleD/locally-uncensored