A few weeks ago, generating video from a single image required either a cloud API with per-second billing or a GPU with 24+ GB VRAM. FramePack changed that.
FramePack F1 generates video from a single image on 6 GB of VRAM. That's a GTX 1660, an RTX 3060, or just about any mid-range GPU from the last five years. I've been running it locally and the results are genuinely usable — not "proof of concept" usable, but "I'd actually put this in a project" usable.
Here's what's actually involved, because "runs on 6 GB VRAM" doesn't tell the whole story.
What You're Actually Downloading
FramePack isn't one file. It's a pipeline with five components, and you need all of them:
| Component | Size | What It Does |
|---|---|---|
| FramePack F1 I2V Model (FP8) | 13 GB | The core diffusion model — generates video frames |
| LLaVA LLaMA3 Text Encoder (FP8) | 8.5 GB | Understands your text prompt |
| HunyuanVideo VAE | 2.3 GB | Encodes your input image to latent space, decodes generated frames back to pixels |
| SigCLIP Vision Encoder | 900 MB | Understands the content of your input image |
| CLIP-L Text Encoder | 240 MB | Additional text understanding (shared with HunyuanVideo) |
Total download: ~25 GB. Plus you need ComfyUI installed and the ComfyUI-FramePackWrapper custom nodes from Kijai.
So yeah — the model fits in 6 GB VRAM, but your hard drive needs 25 GB and the initial download takes a while.
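The per-component sizes in the table really do add up to that figure; a quick sanity check, with the sizes in GB taken straight from the table above:

```python
# Component download sizes from the table above, in GB
sizes_gb = {
    "FramePack F1 I2V (FP8)": 13.0,
    "LLaVA LLaMA3 text encoder (FP8)": 8.5,
    "HunyuanVideo VAE": 2.3,
    "SigCLIP vision encoder": 0.9,
    "CLIP-L text encoder": 0.24,
}

total = sum(sizes_gb.values())
print(f"Total download: ~{total:.1f} GB")  # ~24.9 GB, i.e. roughly 25 GB
```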
Why 6 GB VRAM Works
Most video generation models load everything into VRAM at once. A 14B parameter model at FP16 needs ~28 GB just for the weights. That's why Wan 2.1 14B needs a 3090 or better.
FramePack uses next-frame prediction. Instead of generating all frames simultaneously, it generates one frame at a time, keeping only what it needs in memory. The model itself is 13 GB on disk, but FP8 quantization and the frame-by-frame approach mean VRAM usage peaks around 6 GB.
The trade-off is speed. Generating a 3-second clip takes several minutes on a mid-range GPU. On a high-end card it's faster, but it's never going to be real-time. The architecture is optimized for memory, not throughput.
What It Uses Under the Hood
FramePack F1 is built on the HunyuanVideo backbone. That's why it shares components with HunyuanVideo (the VAE, CLIP-L encoder). The pipeline works like this:
- SigCLIP Vision Encoder looks at your input image and creates visual embeddings — a numerical representation of what's in the image
- DualCLIPLoader loads both text encoders (CLIP-L + LLaVA LLaMA3) to process your text prompt
- VAE encodes your input image into latent space
- FramePackSampler takes the image latent, vision embeddings, and text conditioning, then generates video frames one at a time using next-frame prediction
- VAE decodes the generated latent frames back into actual pixels
The sampler has a `gpu_memory_preservation` parameter set to 6.0 GB by default — it actively manages memory to stay within that budget.
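The pipeline above can be sketched as a plain loop. To be clear, this is illustrative pseudocode, not the actual ComfyUI node API: the function names are made up, and tiny numpy arrays stand in for latents. The point is the shape of the data flow — and the bounded frame history that keeps memory flat no matter how long the clip gets:

```python
import numpy as np

LATENT_SHAPE = (16, 64, 96)   # (channels, H/8, W/8) -- placeholder sizes
HISTORY = 4                   # frames of context kept in memory, illustrative

def encode_image(image):
    # Stand-in for the VAE encoder: image -> latent
    return np.zeros(LATENT_SHAPE)

def predict_next_frame(context, text_cond, vision_emb):
    # Stand-in for the diffusion model's next-frame prediction step.
    # It only ever sees the last HISTORY frames, so memory use stays flat.
    return context[-1] + np.random.randn(*LATENT_SHAPE) * 0.01

def generate(image, text_cond, vision_emb, num_frames=12):
    frames = [encode_image(image)]           # the input image seeds frame 0
    for _ in range(num_frames - 1):
        context = frames[-HISTORY:]          # bounded context window
        frames.append(predict_next_frame(context, text_cond, vision_emb))
    return np.stack(frames)                  # VAE-decode these in reality

video_latents = generate(image=None, text_cond=None, vision_emb=None)
print(video_latents.shape)  # (12, 16, 64, 96)
```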
What the Results Look Like
FramePack does motion from a still image. Give it a photo of a person and it'll add natural movement — head turns, blinking, subtle body motion. Give it a landscape and it'll add wind, clouds, water flow.
It's strongest with:
- Portraits and people — natural micro-movements
- Nature scenes — wind, water, atmospheric effects
- Simple compositions — one clear subject against a background
It struggles with:
- Complex multi-person scenes — tracking gets confused
- Fast action — it's tuned for gentle, natural motion
- Long durations — quality degrades after ~4 seconds
The output resolution follows your input image. Feed it a 512x768 portrait and you get a 512x768 video.
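That works because the VAE's downsampling is a fixed ratio. Assuming the usual 8x spatial factor for this family of video VAEs (an assumption on my part, not something FramePack documents in the UI), latent dimensions scale directly with the input:

```python
def latent_size(width: int, height: int, factor: int = 8) -> tuple[int, int]:
    # Input dimensions should be divisible by the VAE's downsample factor;
    # frontends typically round or crop to the nearest valid size for you.
    return width // factor, height // factor

print(latent_size(512, 768))  # (64, 96): a 512x768 input decodes back to 512x768
```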
Running It
If you want to set this up manually: install ComfyUI, clone the FramePackWrapper custom nodes, download all five model files to the correct ComfyUI subdirectories, build a workflow connecting all the nodes in the right order, and pray nothing conflicts with your existing setup.
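If you do go the manual route, a quick script to confirm the files landed in the right folders saves a lot of workflow-won't-load debugging. The subdirectories below are standard ComfyUI model folders; the filenames are placeholders I made up — substitute whatever your actual downloads are called:

```python
from pathlib import Path

COMFYUI = Path.home() / "ComfyUI"  # adjust to your install location

# (ComfyUI subdirectory, placeholder filename) -- use your real filenames
REQUIRED = [
    ("models/diffusion_models", "framepack_f1_i2v_fp8.safetensors"),
    ("models/text_encoders",    "llava_llama3_fp8.safetensors"),
    ("models/text_encoders",    "clip_l.safetensors"),
    ("models/vae",              "hunyuan_video_vae.safetensors"),
    ("models/clip_vision",      "sigclip_vision.safetensors"),
]

def missing_models(root: Path) -> list[str]:
    """Return the relative paths of any required model files not on disk."""
    return [f"{sub}/{name}" for sub, name in REQUIRED
            if not (root / sub / name).exists()]

if __name__ == "__main__":
    gaps = missing_models(COMFYUI)
    print("All models present" if not gaps else f"Missing: {gaps}")
```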
Or — and this is what I built — Locally Uncensored handles the entire pipeline. Open the Create tab, pick the FramePack bundle, one-click download all five components, upload an image, write what motion you want, generate. The app builds the correct workflow automatically.
It also does text-to-image, image-to-image, and text-to-video with other models (Wan 2.1, CogVideoX, FLUX, SDXL). ComfyUI gets auto-detected or one-click installed. Open source, AGPL-3.0.
The Honest Take
6 GB VRAM video generation is real, and it works. But let's not pretend it's magic:
- 25 GB download before you generate anything
- Several minutes per clip on mid-range hardware
- 3-4 seconds of usable output per generation
- Quality varies — some images animate beautifully, others look weird
It's a tool for specific use cases, not a replacement for cloud video gen services. But for those use cases — quick social media content, animated product shots, bringing concept art to life — running it locally for free on a GPU you already own is genuinely compelling.
GitHub: PurpleDoubleD/locally-uncensored