DEV Community

Furkan Gözükara

Best open source Image to Video CogVideoX1.5-5B-I2V is pretty decent and optimized for low VRAM

CogVideoX1.5-5B-I2V is the best open source image-to-video model I have tried: the results are pretty decent, and it is optimized for low-VRAM machines while still supporting high resolution. The native resolution is 1360px, and it can generate up to 10 seconds (161 frames). The audio was generated with a new open source audio model.

 

 

Resources and Details for CogVideoX1.5-5B-I2V Image-to-Video Generation

This section covers the resources, tools, and configurations I used with the CogVideoX1.5-5B-I2V model for image-to-video generation.

Video Tutorial and Installation Guides:

 

 

 

  • 1-Click Installers: For streamlined setup, I’ve created 1-Click installers for Windows, RunPod, and Massed Compute environments. These are available at: https://www.patreon.com/posts/112848192. Note: These installers set up the model within a Python 3.11 virtual environment (VENV).

 

Model Repositories and Prompts:

 

 

Configuration and Optimizations:

 

  • Video Settings: I generated videos using 1360x768px resolution images at 16 FPS for 81 frames (resulting in approximately 5-second videos, including the initial frame).

  • Enabled Optimizations: I utilized the following optimizations recommended on the Hugging Face page:

      • pipe.enable_sequential_cpu_offload()

      • pipe.vae.enable_slicing()

      • pipe.vae.enable_tiling()

  • Quantization: I used int8_weight_only quantization, which requires TorchAO. DeepSpeed also works well on Windows in a Python 3.11 VENV.
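
The settings above could be wired together with the diffusers library roughly as follows. This is a minimal sketch, not the code from the installers: the repository id in MODEL_ID is an assumption (the post does not state it), the helper names video_duration_seconds and generate_video are hypothetical, and int8 quantization is left out to keep the example short. It assumes a diffusers release that ships CogVideoXImageToVideoPipeline.

```python
# Settings taken from the post. MODEL_ID is an ASSUMED Hugging Face repo id.
MODEL_ID = "THUDM/CogVideoX1.5-5B-I2V"  # assumption: not stated in the post
WIDTH, HEIGHT = 1360, 768
NUM_FRAMES = 81
FPS = 16


def video_duration_seconds(num_frames: int, fps: int) -> float:
    """Clip length in seconds, counting the initial frame."""
    return num_frames / fps


def generate_video(image_path: str, prompt: str,
                   output_path: str = "output.mp4") -> str:
    """Hypothetical helper: image-to-video with the low-VRAM options enabled."""
    # Heavy imports are kept inside the function so that the settings above
    # can be inspected on a machine without a GPU or diffusers installed.
    import torch
    from diffusers import CogVideoXImageToVideoPipeline
    from diffusers.utils import export_to_video, load_image

    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    )

    # The three optimizations listed in the post.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    image = load_image(image_path)
    frames = pipe(
        prompt=prompt,
        image=image,
        height=HEIGHT,
        width=WIDTH,
        num_frames=NUM_FRAMES,
    ).frames[0]

    export_to_video(frames, output_path, fps=FPS)
    return output_path


if __name__ == "__main__":
    # 81 frames at 16 FPS is just over five seconds.
    print(video_duration_seconds(NUM_FRAMES, FPS))
```

Sequential CPU offload trades speed for memory, so expect generation to be slower than with the whole model resident on the GPU.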

 

Audio Generation:

 

  • MMAudio Model: For adding audio to the generated videos, I used the MMAudio model: https://github.com/hkchengrex/MMAudio

  • MMAudio Installers: 1-Click installers for MMAudio (Windows, RunPod, Massed Compute) are available at: https://www.patreon.com/posts/117990364. Note: These installers use a Python 3.10 VENV.

  • Prompting MMAudio: I used simple prompts for audio generation. Be aware that MMAudio may struggle when the input video contains human figures. In such cases, consider using text-to-audio alternatives.
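
For scripting MMAudio outside the installers, a command line can be assembled as in the sketch below. It assumes the repository's demo.py script accepts --video, --prompt, and --duration flags, and that omitting --video gives text-to-audio; verify both against the MMAudio README before relying on them. The helper name build_mmaudio_command and the file name in the example are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_mmaudio_command(prompt: str, duration_s: float,
                          video_path: Optional[str] = None) -> List[str]:
    """Build an argument list for MMAudio's demo.py (flag names are assumed).

    With video_path set, the audio is conditioned on the video. Without it,
    the call falls back to text-to-audio, which may suit clips with human
    figures better.
    """
    cmd = ["python", "demo.py", f"--duration={duration_s}"]
    if video_path is not None:
        cmd.append(f"--video={video_path}")
    cmd += ["--prompt", prompt]
    return cmd


def run_mmaudio(cmd: List[str], repo_dir: str) -> None:
    """Run the command from inside a local MMAudio checkout."""
    subprocess.run(cmd, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Video-conditioned audio for a 5-second clip (hypothetical file name).
    print(build_mmaudio_command("waves crashing on rocks", 5, "output.mp4"))
    # Text-to-audio fallback: same prompt, no video input.
    print(build_mmaudio_command("waves crashing on rocks", 5))
```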

 

VRAM Usage Observations:

I tested CogVideoX1.5-5B-I2V with various resolutions and frame counts to determine VRAM usage. Here are some of my findings (note that GPUs with less VRAM might still work, albeit more slowly):

 

  • 512x288 (41 frames): ~7700 MB

  • 576x320 (41 frames): ~7900 MB

  • 576x320 (81 frames): ~8850 MB

  • 704x384 (81 frames): ~8950 MB

  • 768x432 (81 frames): ~10600 MB

  • 896x496 (81 frames): ~12050 MB

  • 960x528 (81 frames): ~12850 MB

  • 1024x576 (81 frames): ~13900 MB

  • 1280x720 (81 frames): ~17950 MB

  • 1360x768 (81 frames): ~19000 MB
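
To choose a setting for a particular GPU, the measurements above can be put in a small lookup table. The sketch below is illustrative only: the numbers are copied from the list, the helper name largest_setting_within is hypothetical, and it leaves no headroom for other processes using the GPU.

```python
from typing import Dict, Optional, Tuple

Setting = Tuple[int, int, int]  # (width, height, frames)

# Approximate VRAM usage in MB, copied from the measurements in the post.
VRAM_MB: Dict[Setting, int] = {
    (512, 288, 41): 7700,
    (576, 320, 41): 7900,
    (576, 320, 81): 8850,
    (704, 384, 81): 8950,
    (768, 432, 81): 10600,
    (896, 496, 81): 12050,
    (960, 528, 81): 12850,
    (1024, 576, 81): 13900,
    (1280, 720, 81): 17950,
    (1360, 768, 81): 19000,
}


def largest_setting_within(budget_mb: int, frames: int = 81) -> Optional[Setting]:
    """Return the highest-resolution measured setting that fits the budget.

    Only entries with the requested frame count are considered. Returns
    None when no measured setting fits.
    """
    candidates = [
        setting for setting, used_mb in VRAM_MB.items()
        if setting[2] == frames and used_mb <= budget_mb
    ]
    if not candidates:
        return None
    # Rank by pixel count (width * height).
    return max(candidates, key=lambda s: s[0] * s[1])


if __name__ == "__main__":
    print(largest_setting_within(12288))            # a 12 GB budget
    print(largest_setting_within(8000, frames=41))  # an 8000 MB budget, shorter clip
```

A None result only means that no measured setting fits the budget; as noted above, GPUs with less VRAM might still work, only more slowly.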

 

Gradio App:

Our Gradio application is highly advanced and functions flawlessly.

 
