DEV Community

local ai
I Ran LTX 2.3 Locally — Image to Video with Audio, No Cloud Required

Last Wednesday night, I got my 12th "content policy violation" of the month.

I wasn't doing anything illegal. Just a portrait photo, a simple motion prompt. The kind of thing any filmmaker would shoot on set.

The platform didn't care. The error message was the same cold boilerplate it always is.

That was the moment I decided I was done with cloud video generation.


Two hours later, someone dropped a link in a Discord server I'm in.

"LTX 2.3 GGUF is out. Runs on consumer GPUs. Image-to-video with native audio."

I stared at that message for a few seconds.

Native audio. Not dubbed afterward. Not a separate step. Generated alongside the video, synchronized, as one output.

I closed the browser tab with the content violation error and started downloading the model.


What is LTX 2.3?

LTX-Video is an open-source video generation model from Lightricks, an Israeli company that's been in the media processing space for a while. Version 2.3 is their most capable release yet. Here's what makes it genuinely interesting compared to everything else out there:

It generates video and audio simultaneously.

Not video first, then audio layered on top. The model jointly produces both streams — synchronized dialogue, ambient sound, environmental audio — as a single generation pass. That's architecturally different from most pipelines where audio is an afterthought.

Other notable upgrades in 2.3:

  • Redesigned VAE for sharper fine details (hair, fabric texture, edges)
  • Significantly improved image-to-video quality
  • 4K resolution support at up to 50 FPS
  • Better prompt understanding and camera motion control
  • Portrait (9:16) support alongside landscape

The base model sits at 19 billion parameters. Running it at full precision would require 38GB+ VRAM — firmly in server territory.

Then GGUF happened.


Why GGUF Changes Everything

The short version: GGUF is a quantization format that compresses model weights from 16-bit floats down to 4-bit (or lower). Same model, roughly a quarter of the size.
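The arithmetic is simple enough to sketch. The ~4.5 bits-per-weight figure for Q4_K_S is an approximation (exact GGUF sizes depend on which layers use which quant type), but it lines up with the file sizes below:

```python
def model_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of a weights-only checkpoint."""
    return params * bits_per_weight / 8 / 1e9

PARAMS = 19e9  # LTX 2.3 base model parameter count

full_bf16 = model_size_gb(PARAMS, 16)   # 16-bit floats
q4_k_s    = model_size_gb(PARAMS, 4.5)  # ~4.5 bits/weight effective

print(f"BF16:   {full_bf16:.1f} GB")   # BF16:   38.0 GB
print(f"Q4_K_S: {q4_k_s:.1f} GB")      # Q4_K_S: 10.7 GB
```

38GB shrinks to roughly 10.7GB, which is exactly how a "server territory" model ends up within reach of a 10GB consumer card.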

The version I'm using is Q4_K_S — about 10.7GB for the main model file. My GPU is an RTX 3080 with 10GB VRAM. The text encoder (Gemma 3 12B) offloads to CPU/RAM. Main model runs on GPU.

Result: a 5-second, 960×544 video with audio in about 2-3 minutes.

Is that fast? No. Is it running entirely on my own hardware, with no cloud, no API calls, no usage logs? Yes.

That trade-off is completely worth it to me.


What the Output Actually Looks Like

I ran an image-to-video test with a portrait photo. The prompt was minimal — I wanted to see what the model would do with almost no direction.

Input image:

First output:

Second test with a different input:

Honest assessment: it's not perfect. At Q4 quantization you lose some sharpness compared to the full BF16 model. Motion can be slightly jerky on complex scenes.

But the audio synchronization is genuinely impressive. And more importantly — this ran on my desk, with no data leaving my machine.


The Privacy Argument (And Why It Actually Matters)

Let me be direct about something most AI tool reviews dance around.

Every image you upload to a cloud video generation service is stored on someone else's server. Every prompt you type is logged. Every generation becomes part of your usage profile. The terms of service you clicked through without reading probably give them broad rights to that data.

I'm not being paranoid. This is just how SaaS works.

Local inference changes the equation completely. The model lives on your hard drive. Inference runs on your GPU. The output files go to your output folder.

The entire pipeline is air-gapped from the internet.

No usage logs. No content moderation API calls. No third party with visibility into what you're creating.

If you're working on creative projects that might not survive a content policy review — not because they're harmful, but because algorithms are bad at context — this matters.

What you create is between you and your hardware.


Hardware Requirements

Here's what you actually need:

Component   Minimum                       Recommended
GPU         RTX 3080 10GB                 RTX 4080 16GB+
RAM         32GB (text encoder on CPU)    64GB
Storage     30GB free                     50GB+
OS          Windows 10/11                 Windows 11

Model files you need:

  • Main model: LTX-2.3-distilled-Q4_K_S.gguf (~10.7GB)
  • Text encoder: Gemma 3 12B fp4 + LTX text projection layer
  • Video VAE: LTX23_video_vae_bf16.safetensors
  • Audio VAE: LTX23_audio_vae_bf16.safetensors
  • LoRA: LTX-2-Image2Vid-Adapter.safetensors

If your VRAM is under 12GB, the text encoder (Gemma 3 12B) will run on CPU. You'll need 32GB of system RAM for that to work without swapping to disk.
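The placement logic amounts to a simple decision rule. This is an illustration of the principle, not the actual ComfyUI offloading code — the function name, the thresholds, and the ~8GB estimate for the text encoder are my own assumptions:

```python
def place_components(vram_gb: float, ram_gb: float,
                     main_model_gb: float = 10.7,   # Q4_K_S main model
                     text_encoder_gb: float = 8.0   # assumed Gemma 3 12B footprint
                     ) -> dict:
    """Illustrative sketch of the offload decision, not real ComfyUI code.

    Rule of thumb from this article: under ~12GB VRAM, the Gemma 3 12B
    text encoder goes to CPU/RAM; the main model stays on the GPU.
    """
    placement = {"main_model": "gpu"}
    if vram_gb >= main_model_gb + text_encoder_gb:
        placement["text_encoder"] = "gpu"
    elif ram_gb >= 32:
        placement["text_encoder"] = "cpu"  # slower, but avoids out-of-memory
    else:
        placement["text_encoder"] = "cpu (will swap to disk -- expect stalls)"
    return placement

# RTX 3080 10GB + 32GB system RAM, as in my setup:
print(place_components(vram_gb=10, ram_gb=32))
# {'main_model': 'gpu', 'text_encoder': 'cpu'}
```

On a 24GB card both components fit on the GPU and generation is correspondingly faster; the CPU path exists precisely so 10GB cards aren't locked out.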


One-Click Setup

I've packaged a complete pre-configured environment that includes:

  • Full ComfyUI installation with all required custom nodes pre-installed
  • All model files (no separate downloads needed)
  • A Gradio web interface — just open a browser, upload an image, write a prompt, hit generate
  • Pre-tuned workflow matching the settings that produced the videos above

Double-click 01-run.bat. Browser opens. Generate.

No Python environment setup. No node installation. No YAML configuration. It just works.

Download: https://www.patreon.com/localai


A Note on What This Enables

I've been running local AI models for a few years now. What's changed recently isn't the existence of local models — it's the capability gap closing.

Twelve months ago, local video generation was a curiosity. The outputs were bad enough that cloud services, despite their restrictions, were clearly better.

That's no longer true.

LTX 2.3 at Q4 quantization produces outputs that are competitive with mid-tier cloud services. And it does something cloud services can't do by design: it generates audio and video together, in a single pass, with no content filtering, on hardware you own.

That's a meaningful shift.

The technology for completely private, unrestricted, high-quality video generation now fits on a consumer GPU. What people do with that capability — the creative projects they pursue, the content they make — is genuinely up to them.

That's new.


Download the one-click package: https://www.patreon.com/posts/ltx-2-3-locally-152521808

Questions about getting it running? Drop a comment. I respond to most of them.
