DEV Community

deAPI

Getting Started with LTX-Video: Text-to-Video Generation API (Python)

A 5-second AI video clip used to cost between $0.25 and $0.50 on platforms like Runway. Through deAPI, a comparable clip runs on LTX-Video models for roughly $0.006 to $0.053, depending on model and resolution. This tutorial gets you from zero to a generated video in Python.

We'll cover three generation modes: text-to-video, image-to-video (animating a still image), and audio-to-video (lip-synced clips from an audio file). All three run through the same API with the same authentication.

What you'll need

  • Python 3.8+
  • The requests library (pip install requests)
  • A deAPI account - sign up at app.deapi.ai/dashboard and grab your API key from Settings -> API Keys. $5 in free credits, no card required.

Pick your model

deAPI runs three LTX-Video models. Each trades speed for quality:

Model         | Slug                         | Max resolution | Audio sync | Price (768x768)
LTX-Video 13B | Ltxv_13B_0_9_8_Distilled_FP8 | 768x768        | No         | ~$0.009 (~4s max)
LTX-2 19B     | Ltx2_19B_Dist_FP8            | 1024x1024      | No         | ~$0.041 (5s)
LTX-2.3 22B   | Ltx2_3_22B_Dist_INT8         | 1024x1024      | Yes        | ~$0.047 (5s)

Start with LTX-Video 13B for testing - it's the fastest and cheapest. Move to LTX-2.3 when you need higher quality or audio synchronization.
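
If you'd rather not hard-code slugs, the table can be turned into a small lookup. This is only a sketch: the slugs, maximum resolutions, and audio-sync flags are copied from the table above, while the `MODELS` list and the `pick_model` helper are hypothetical names made up for illustration and are not part of deAPI.

```python
# Values copied from the model table; the helper itself is a hypothetical convenience.
MODELS = [
    # (slug, max square resolution in pixels, supports audio sync)
    ("Ltxv_13B_0_9_8_Distilled_FP8", 768, False),
    ("Ltx2_19B_Dist_FP8", 1024, False),
    ("Ltx2_3_22B_Dist_INT8", 1024, True),
]


def pick_model(size: int, audio_sync: bool = False) -> str:
    """Return the first model in table order (cheapest first) that fits the request."""
    for slug, max_size, has_audio in MODELS:
        if size <= max_size and (has_audio or not audio_sync):
            return slug
    raise ValueError(f"No listed model supports {size}x{size} (audio_sync={audio_sync})")


if __name__ == "__main__":
    print(pick_model(512))                    # quick test clip
    print(pick_model(1024))                   # higher resolution
    print(pick_model(768, audio_sync=True))   # lip-sync
```

The helper only checks the maximum resolution and the audio-sync column, so treat its answer as a starting point rather than a guarantee that a given size is offered for that model.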

Your first video

Every deAPI request follows an async pattern: send a generation request, get back a request_id, poll until the result is ready. Here's the complete flow:

import requests
import time

API_KEY = "dpn-sk-your-key-here"
BASE = "https://api.deapi.ai/api/v2"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Accept": "application/json",
    "Content-Type": "application/json"
}

# Generate a clip (120 frames at 30 fps = ~4 seconds, the LTX-Video 13B max)
response = requests.post(f"{BASE}/videos/generations", headers=HEADERS, json={
    "prompt": "A lighthouse on a rocky cliff during a storm at night. Rain hammers the rocks while the beam cuts through thick fog. Waves crash against the base, sending spray upward. Camera slowly pushes in from a wide shot. Cinematic, 35mm film grain.",
    "model": "Ltxv_13B_0_9_8_Distilled_FP8",
    "width": 512,
    "height": 512,
    "frames": 120,
    "fps": 30,
    "steps": 1,
    "guidance": 7.5,
    "seed": 42
})

response.raise_for_status()  # fail fast on HTTP errors (bad key, invalid parameters)
request_id = response.json()["data"]["request_id"]
print(f"Request ID: {request_id}")

# Poll for the result
while True:
    result = requests.get(
        f"{BASE}/jobs/{request_id}",
        headers=HEADERS
    ).json()

    status = result["data"]["status"]
    print(f"Status: {status}")

    if status == "done":
        video_url = result["data"]["result_url"]
        print(f"Video URL: {video_url}")

        video = requests.get(video_url)
        with open("output.mp4", "wb") as f:
            f.write(video.content)
        print("Saved to output.mp4")
        break

    if status == "error":
        print(f"Error: {result['data']}")
        break

    time.sleep(3)

Generation takes 30-90 seconds depending on the model and resolution. The result URL returns an MP4 file.
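
The polling loop above runs forever if a job never leaves the queue. One way to bound it is a small helper with a timeout, sketched here. It assumes the same response shape the script uses (`data.status` equal to `"done"` or `"error"`) and treats any other status as still in progress. The names `wait_for_job` and `fetch_status` are hypothetical, and the 300-second default is an arbitrary choice, not a documented limit.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 3.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports "done" or "error"; raise TimeoutError if neither happens in time."""
    deadline = time.monotonic() + timeout
    while True:
        data = fetch_status()["data"]
        if data["status"] == "done":
            return data
        if data["status"] == "error":
            raise RuntimeError(f"Generation failed: {data}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {data['status']!r} after {timeout} seconds")
        sleep(interval)


# Usage with the variables from the script above:
# data = wait_for_job(
#     lambda: requests.get(f"{BASE}/jobs/{request_id}", headers=HEADERS, timeout=30).json()
# )
# video_url = data["result_url"]
```

Passing the fetch step in as a callable keeps the helper independent of any one endpoint, so the same function can wait on text-to-video, image-to-video, or audio-to-video jobs.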

Writing prompts that work

LTX-Video reads prompts like a language model, not a keyword parser. Three things make the biggest difference:

First: write in full sentences. "A 35-year-old woman with dark hair speaks to the camera in a modern office" beats "woman, dark hair, office, talking, 35yo." Front-load your subject - the model weighs earlier words more heavily.

Micro-motions matter more than you'd expect. Without explicit instructions like "subtle head nods" or "natural blinks," faces tend to freeze. A prompt that says "she speaks expressively, gesturing with her right hand" produces a fundamentally different clip than "a woman talking."

Camera movement is the other lever. Static or slow-dolly shots pair best with dialogue and faces. Save the sweeping drone moves for wide establishing shots where face coherence doesn't matter.

Aim for 60-200 words in your prompt. Below 60, the output gets generic. Within that range, you give the model enough detail to stay coherent across the full clip.
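
To keep prompts inside that range while you iterate, a quick pre-flight check helps. The sketch below assembles a prompt subject-first from full sentences and counts whitespace-separated words. `build_prompt` and `prompt_length_ok` are hypothetical helpers, and the 60/200 bounds are simply the guideline from this section, not limits enforced by the API.

```python
def build_prompt(subject: str, motion: str, camera: str, style: str = "") -> str:
    """Join full-sentence parts with the subject first, skipping empty parts."""
    parts = [subject, motion, camera, style]
    return " ".join(p.strip() for p in parts if p.strip())


def prompt_length_ok(prompt: str, low: int = 60, high: int = 200) -> bool:
    """True when the prompt's word count falls within the suggested range."""
    return low <= len(prompt.split()) <= high


if __name__ == "__main__":
    prompt = build_prompt(
        subject="A 35-year-old woman with dark hair speaks to the camera in a modern office.",
        motion="She speaks expressively, gesturing with her right hand, with subtle head nods and natural blinks.",
        camera="The camera holds a static medium shot.",
    )
    print(len(prompt.split()), "words, within range:", prompt_length_ok(prompt))
```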

Image-to-video: animate a still

The image-to-video endpoint takes a still image and a motion prompt, then generates a clip where that image comes to life. Product photos and portraits work especially well - anything with a clear subject and defined edges.

import requests
import time

API_KEY = "dpn-sk-your-key-here"
BASE = "https://api.deapi.ai/api/v2"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Accept": "application/json"}

with open("portrait.jpg", "rb") as img:
    response = requests.post(f"{BASE}/videos/animations", headers=HEADERS,
        data={
            "prompt": "The woman slowly turns her head to the right, smiles gently, and looks directly at the camera. Soft natural light from a window. Shallow depth of field.",
            "model": "Ltx2_19B_Dist_FP8",
            "width": 768,
            "height": 768,
            "frames": 120,
            "fps": 24,
            "steps": 8,
            "guidance": 1,
            "seed": 42
        },
        files={"first_frame_image": ("portrait.jpg", img, "image/jpeg")}
    )

response.raise_for_status()  # fail fast on HTTP errors (bad key, invalid parameters)
request_id = response.json()["data"]["request_id"]

while True:
    result = requests.get(
        f"{BASE}/jobs/{request_id}", headers=HEADERS
    ).json()
    if result["data"]["status"] == "done":
        video = requests.get(result["data"]["result_url"])
        with open("animated_portrait.mp4", "wb") as f:
            f.write(video.content)
        print("Saved to animated_portrait.mp4")
        break
    if result["data"]["status"] == "error":
        print(f"Error: {result['data']}")
        break
    time.sleep(3)

LTX-2 and LTX-2.3 also support a last_frame_image parameter. Pin both the start and end frames, and the model interpolates the motion between them - useful for controlled transitions in product demos.
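
A request that pins both frames might be assembled as follows. This is a sketch under one explicit assumption: that `last_frame_image` is uploaded as a multipart file field in the same way as `first_frame_image` in the script above. Check the API reference before relying on it. `build_keyframe_request` is a hypothetical helper, and the remaining parameters are copied from the image-to-video example.

```python
from pathlib import Path


def build_keyframe_request(prompt: str, first_path: str, last_path: str):
    """Return (data, files) suitable for requests.post(url, headers=..., data=data, files=files)."""
    data = {
        "prompt": prompt,
        "model": "Ltx2_19B_Dist_FP8",  # LTX-2; the article says LTX-2.3 also accepts a last frame
        "width": 768,
        "height": 768,
        "frames": 120,
        "fps": 24,
        "steps": 8,
        "guidance": 1,
        "seed": 42,
    }
    files = {
        "first_frame_image": (Path(first_path).name, Path(first_path).read_bytes(), "image/jpeg"),
        # ASSUMPTION: sent as a file field, mirroring first_frame_image.
        "last_frame_image": (Path(last_path).name, Path(last_path).read_bytes(), "image/jpeg"),
    }
    return data, files


# Usage with BASE and HEADERS from the script above:
# data, files = build_keyframe_request("The lid opens smoothly.", "closed.jpg", "open.jpg")
# response = requests.post(f"{BASE}/videos/animations", headers=HEADERS, data=data, files=files)
```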

Audio-to-video: lip-sync from speech

This is where LTX-2.3 stands alone. Feed it an audio file alongside a text prompt, and the generated character's mouth movements sync to the speech. Phoneme-level accuracy, not just generic "mouth opening and closing."

import requests
import time

API_KEY = "dpn-sk-your-key-here"
BASE = "https://api.deapi.ai/api/v2"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Accept": "application/json"}

with open("narration.mp3", "rb") as audio:
    response = requests.post(f"{BASE}/videos/audio-syncs", headers=HEADERS,
        data={
            "prompt": "A male news anchor in his 40s sits behind a studio desk, speaking directly to camera. Professional lighting, shallow depth of field on the background. He gestures occasionally with his right hand while maintaining eye contact.",
            "model": "Ltx2_3_22B_Dist_INT8",
            "width": 768,
            "height": 768,
            "frames": 120,
            "fps": 24,
            "seed": 42
        },
        files={"audio": ("narration.mp3", audio, "audio/mpeg")}
    )

response.raise_for_status()  # fail fast on HTTP errors (bad key, invalid parameters)
request_id = response.json()["data"]["request_id"]

while True:
    result = requests.get(
        f"{BASE}/jobs/{request_id}", headers=HEADERS
    ).json()
    if result["data"]["status"] == "done":
        video = requests.get(result["data"]["result_url"])
        with open("lipsync_video.mp4", "wb") as f:
            f.write(video.content)
        print("Saved to lipsync_video.mp4")
        break
    if result["data"]["status"] == "error":
        print(f"Error: {result['data']}")
        break
    time.sleep(3)

Audio files can be up to 11 seconds long (MP3, WAV, OGG, or FLAC). For the strongest results, combine audio with a first_frame_image - the image locks the character's appearance while the audio drives every lip movement.
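
Checking the file before uploading saves a failed request. The sketch below validates the extension against the four formats listed above and, for WAV files only, measures duration with Python's standard `wave` module. MP3, OGG, and FLAC durations are not checked here because that needs a third-party decoder. `check_audio` and `wav_seconds` are hypothetical helpers, and the 11-second limit is the figure quoted in this section.

```python
import wave
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".ogg", ".flac"}
MAX_AUDIO_SECONDS = 11.0  # limit quoted in the article


def wav_seconds(path: str) -> float:
    """Duration of a WAV file in seconds (frame count divided by sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def check_audio(path: str) -> None:
    """Raise ValueError for an unsupported extension or an over-long WAV file."""
    ext = Path(path).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {ext or '(none)'}")
    if ext == ".wav":
        seconds = wav_seconds(path)
        if seconds > MAX_AUDIO_SECONDS:
            raise ValueError(f"Audio is {seconds:.1f}s, over the {MAX_AUDIO_SECONDS:.0f}s limit")


# Usage: call check_audio("narration.wav") before opening the file for upload.
```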

What it costs

Model         | 512x512 | 768x768 | 1024x1024
LTX-Video 13B | ~$0.006 | ~$0.009 | -
LTX-2 19B     | -       | ~$0.041 | ~$0.046
LTX-2.3 22B   | -       | ~$0.047 | ~$0.053

Note on clip length: LTX-Video 13B runs at a fixed 30 fps, so its 120-frame maximum is a ~4-second clip (the 512x512 figure above is for that clip). LTX-2 and LTX-2.3 run at 24 fps, so 120 frames is a true 5-second clip.
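
The arithmetic behind that note is simply frames divided by frames per second. A minimal sketch, with `clip_seconds` and `frames_for` as hypothetical helper names:

```python
def clip_seconds(frames: int, fps: int) -> float:
    """Clip duration in seconds."""
    return frames / fps


def frames_for(seconds: float, fps: int) -> int:
    """Frame count needed for a target duration, rounded to the nearest frame."""
    return round(seconds * fps)


if __name__ == "__main__":
    print(clip_seconds(120, 30))  # LTX-Video 13B at 30 fps
    print(clip_seconds(120, 24))  # LTX-2 / LTX-2.3 at 24 fps
    print(frames_for(5, 24))      # frames for a 5-second clip at 24 fps
```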

For comparison, Runway Gen-3 charges $0.25-0.50 per 5-second clip. Generating 100 test clips on LTX-Video 13B costs about $0.60 total - the kind of budget where you can iterate freely without watching a billing dashboard.

The $5 free credit covers roughly 800 clips on LTX-Video 13B, or about 100 on LTX-2.3 at 768x768.
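
Those estimates are easy to reproduce for your own budget. The sketch below uses the approximate per-clip prices from the table, so the results are estimates rather than billing figures. `clips_for_budget` and `batch_cost` are hypothetical helpers, and `Decimal` is used only to avoid floating-point rounding surprises.

```python
from decimal import Decimal


def clips_for_budget(budget_usd: float, price_per_clip: float) -> int:
    """Whole number of clips a budget covers at a given per-clip price."""
    return int(Decimal(str(budget_usd)) // Decimal(str(price_per_clip)))


def batch_cost(clips: int, price_per_clip: float) -> float:
    """Total cost in USD for a batch of clips."""
    return float(Decimal(str(price_per_clip)) * clips)


if __name__ == "__main__":
    print(clips_for_budget(5, 0.006))  # LTX-Video 13B at 512x512
    print(clips_for_budget(5, 0.047))  # LTX-2.3 at 768x768
    print(batch_cost(100, 0.006))      # 100 test clips on LTX-Video 13B
```

At about $0.006 per clip the $5 credit works out to 833 clips, and at about $0.047 it works out to 106, which matches the rounded figures above.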

What's next

Pair video generation with deAPI's text-to-speech endpoint and you have a complete talking-head pipeline: generate the narration with Kokoro or Qwen3 TTS, then feed that audio straight into LTX-2.3. The lip-synced clip comes back ready to use.

Sign up at deapi.ai to get $5 in free credits. Your first video generates in under two minutes.

Top comments (1)

Piotr

deAPI is really cheap!