Furkan Gözükara

Best open source Image to Video CogVideoX1.5-5B-I2V is pretty decent and optimized for low VRAM

CogVideoX1.5-5B-I2V, the best open source image-to-video model, is pretty decent and optimized for low-VRAM machines at high resolution. Its native resolution is 1360px, and it generates up to 10 seconds (161 frames) of video. Audio was generated with a new open source audio model.

 

 

Resources and Details for CogVideoX1.5-5B-I2V Image-to-Video Generation

This section lists the resources, tools, and configurations I used with the CogVideoX1.5-5B-I2V model for image-to-video generation.

Video Tutorial and Installation Guides:

 

 

 

  • 1-Click Installers: For streamlined setup, I’ve created 1-Click installers for Windows, RunPod, and Massed Compute environments. These are available at: https://www.patreon.com/posts/112848192. Note: These installers set up the model within a Python 3.11 virtual environment (VENV).

 

Model Repositories and Prompts:

 

 

Configuration and Optimizations:

 

  • Video Settings: I generated videos from 1360x768 px input images at 16 FPS with 81 frames, which gives roughly 5-second clips (81 / 16 ≈ 5.06 s, counting the initial frame).

  • Enabled Optimizations: I utilized the following optimizations recommended on the Hugging Face page:

      • pipe.enable_sequential_cpu_offload()

      • pipe.vae.enable_slicing()

      • pipe.vae.enable_tiling()

  • Quantization: I used int8_weight_only quantization, which requires TorchAO. DeepSpeed also works effectively on Windows with a Python 3.11 VENV.
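
As a rough sketch of how the settings above fit together with the Hugging Face diffusers library, the helper below applies the three memory optimizations and, optionally, int8 weight-only quantization via TorchAO to an already loaded pipeline. The function names `clip_seconds` and `apply_low_vram_settings` are hypothetical, quantizing only the transformer is an assumption, and this is not the code used by the installers or the Gradio app.

```python
def clip_seconds(num_frames: int, fps: int = 16) -> float:
    """Clip length in seconds for a given frame count and frame rate."""
    return num_frames / fps


def apply_low_vram_settings(pipe, quantize: bool = False):
    """Apply the optimizations listed above to a loaded diffusers pipeline.

    `pipe` is assumed to be a CogVideoX image-to-video pipeline (or any
    object exposing the same methods). Quantization is optional.
    """
    if quantize:
        # Imported lazily so the helper also works without TorchAO installed.
        # Assumes a TorchAO version that exports int8_weight_only.
        from torchao.quantization import quantize_, int8_weight_only

        # Assumption: quantize only the transformer, the largest component.
        quantize_(pipe.transformer, int8_weight_only())

    pipe.enable_sequential_cpu_offload()  # keep weights on CPU, move per layer
    pipe.vae.enable_slicing()             # decode the batch one sample at a time
    pipe.vae.enable_tiling()              # decode each frame in spatial tiles
    return pipe


print(clip_seconds(81))   # 5.0625
print(clip_seconds(161))  # 10.0625
```

With diffusers and torch installed, the pipeline could be loaded with `CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX1.5-5B-I2V", torch_dtype=torch.bfloat16)` and then passed to `apply_low_vram_settings(pipe, quantize=True)`. The repository ID is an assumption here, since the post does not state it.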

 

Audio Generation:

 

  • MMAudio Model: For adding audio to the generated videos, I used the MMAudio model: https://github.com/hkchengrex/MMAudio

  • MMAudio Installers: 1-Click installers for MMAudio (Windows, RunPod, Massed Compute) are available at: https://www.patreon.com/posts/117990364. Note: These installers use a Python 3.10 VENV.

  • Prompting MMAudio: I used simple prompts for audio generation. Be aware that MMAudio may struggle when the input video contains human figures. In such cases, consider using text-to-audio alternatives.

 

VRAM Usage Observations:

I tested CogVideoX1.5-5B-I2V at various resolutions and frame counts to measure VRAM usage. Here are some of my findings (GPUs with less VRAM might still work, albeit more slowly):

 

  • 512x288 (41 frames): ~7700 MB

  • 576x320 (41 frames): ~7900 MB

  • 576x320 (81 frames): ~8850 MB

  • 704x384 (81 frames): ~8950 MB

  • 768x432 (81 frames): ~10600 MB

  • 896x496 (81 frames): ~12050 MB

  • 960x528 (81 frames): ~12850 MB

  • 1024x576 (81 frames): ~13900 MB

  • 1280x720 (81 frames): ~17950 MB

  • 1360x768 (81 frames): ~19000 MB
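
To make the measurements above easier to apply, here is a minimal sketch that looks up the largest tested 81-frame resolution fitting a given VRAM budget. The values are copied from the list above. The function name `largest_resolution_for` is hypothetical, and treating the measured figures as hard limits is a simplifying assumption, since GPUs with less VRAM might still work, only more slowly.

```python
# Measured VRAM usage in MB for 81-frame runs, copied from the list above.
VRAM_MB_81_FRAMES = {
    (576, 320): 8850,
    (704, 384): 8950,
    (768, 432): 10600,
    (896, 496): 12050,
    (960, 528): 12850,
    (1024, 576): 13900,
    (1280, 720): 17950,
    (1360, 768): 19000,
}


def largest_resolution_for(vram_mb: int):
    """Return the largest tested (width, height) that fits, or None."""
    fitting = [res for res, used in VRAM_MB_81_FRAMES.items() if used <= vram_mb]
    # Rank candidates by pixel count; default=None covers the empty case.
    return max(fitting, key=lambda res: res[0] * res[1], default=None)


print(largest_resolution_for(12 * 1024))  # (896, 496)
print(largest_resolution_for(24 * 1024))  # (1360, 768)
print(largest_resolution_for(8 * 1024))   # None
```

The budget passed in should be the VRAM actually free for the model, which is usually somewhat less than the card's nominal capacity.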

 

Gradio App:

Our Gradio application is highly advanced and functions flawlessly.

 
