Preecha

How to use the Grok text to video API (complete guide)

TL;DR

The Grok text-to-video API generates video from a text prompt. You call POST /v1/videos/generations, get a request_id immediately, then poll GET /v1/videos/{request_id} until status is "done". The model is grok-imagine-video, pricing starts at $0.05 per second at 480p, and the xAI Python SDK can handle polling automatically.

Introduction

xAI generated 1.2 billion videos in January 2026 alone, the same month it launched the Grok text-to-video API (January 28, 2026). The model also ranked number one on the Artificial Analysis text-to-video leaderboard that month. Those numbers matter because they suggest the infrastructure has already been tested at scale.

This guide shows you how to:

  • Make your first text-to-video request
  • Poll for the generated video
  • Tune duration, resolution, and aspect ratio
  • Write better prompts
  • Use reference images
  • Extend or edit existing videos
  • Test the async polling flow without spending credits on every frontend test

The API is async. Your frontend should not block while waiting for video generation. Instead, it needs to render loading, success, and error states while polling for completion.

If you're building a video generation UI, mock the generation and polling endpoints during development. Apidog's Smart Mock can simulate both endpoints so your team can build the player UI before the backend flow is finalized.

What is the Grok text-to-video API?

The Grok text-to-video API is part of xAI's media generation suite at https://api.x.ai.

You send a text prompt to the grok-imagine-video model, and the API generates a short video clip from scratch. No source image is required.

The API sits alongside:

  • A synchronous image generation endpoint: POST /v1/images/generations
  • The grok-imagine-image model
  • Video extension and editing endpoints

The text-to-video endpoint is different from image-to-video generation because you provide only words. The model creates the scene, motion, composition, and visual style from your prompt.

Use text-to-video when you want the model to create the scene from scratch. Use image-to-video when you already have a source image and want to animate it.

How text-to-video generation works

Most API calls are synchronous:

  1. Send a request
  2. Wait briefly
  3. Receive the final response

Video generation takes longer, so the Grok video API uses an async pattern:

  1. Send a POST request with your prompt
  2. Receive a request_id immediately
  3. Poll a GET endpoint with that request_id
  4. Continue polling while status is "processing"
  5. Stop when status becomes "done"
  6. Read the generated video URL from the response

Flow:

POST /v1/videos/generations
        ↓
{ "request_id": "..." }
        ↓
GET /v1/videos/{request_id}
        ↓
status: processing
        ↓
GET /v1/videos/{request_id}
        ↓
status: done
        ↓
video.url

This keeps HTTP connections short and lets your app decide how often to poll.
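
As a rough sketch, the states a frontend has to render can be derived from the status field of each poll response. The function name ui_state_from_poll and the state labels are illustrative and not part of the xAI API; treating an unrecognized status as an error is an assumption.

```python
def ui_state_from_poll(data: dict) -> str:
    """Map a poll response to the UI state a frontend should render."""
    status = data.get("status")
    if status == "processing":
        return "loading"
    if status == "done":
        return "success"
    # "failed" and any unrecognized status are surfaced as errors (assumption).
    return "error"
```

Keeping this mapping in one place means the player component only has to switch on three values, regardless of how the polling itself is implemented.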

Prerequisites

Before writing code, set up the following:

  1. An xAI account at console.x.ai
  2. An API key from the xAI console
  3. Billing access enabled for generation requests

Store your API key as an environment variable instead of hardcoding it:

export XAI_API_KEY="your_api_key_here"
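
On the Python side, a small helper can fail fast with a readable message when the variable is missing. This is a minimal sketch; load_api_key is an illustrative name and not part of any SDK.

```python
import os
from collections.abc import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the xAI API key from the environment, or raise a clear error."""
    key = env.get("XAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("XAI_API_KEY is not set; export it before running.")
    return key
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.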

If you want to use the xAI Python SDK:

pip install xai-sdk

For raw HTTP requests:

pip install requests

Your first text-to-video request

Endpoint:

POST https://api.x.ai/v1/videos/generations

Required fields:

| Field | Value |
| --- | --- |
| model | grok-imagine-video |
| prompt | Your video description |

Using curl

curl -X POST https://api.x.ai/v1/videos/generations \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-imagine-video",
    "prompt": "A golden retriever running through autumn leaves in slow motion, cinematic lighting"
  }'

Response:

{
  "request_id": "d97415a1-5796-b7ec-379f-4e6819e08fdf"
}

That request_id is used to retrieve the completed video.

Using Python with requests

import os
import requests

API_KEY = os.environ["XAI_API_KEY"]
BASE_URL = "https://api.x.ai"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

payload = {
    "model": "grok-imagine-video",
    "prompt": "A golden retriever running through autumn leaves in slow motion, cinematic lighting",
}

response = requests.post(
    f"{BASE_URL}/v1/videos/generations",
    headers=headers,
    json=payload,
    timeout=30,  # avoid hanging indefinitely on a stalled connection
)

response.raise_for_status()

data = response.json()
request_id = data["request_id"]

print(f"Generation started. Request ID: {request_id}")

Polling for the video result

After receiving a request_id, poll:

GET /v1/videos/{request_id}

The status field can be:

| Status | Meaning |
| --- | --- |
| processing | The video is still generating |
| done | The video is complete and the URL is available |
| failed | The generation failed |

Python polling loop

import os
import time
import requests

API_KEY = os.environ["XAI_API_KEY"]
BASE_URL = "https://api.x.ai"

headers = {
    "Authorization": f"Bearer {API_KEY}",
}

def poll_video(request_id: str, interval: int = 5, max_attempts: int = 60) -> dict:
    """Poll until video generation is complete."""
    url = f"{BASE_URL}/v1/videos/{request_id}"

    for attempt in range(max_attempts):
        # A timeout keeps one stalled request from blocking the whole loop.
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()

        data = response.json()

        status = data.get("status")
        progress = data.get("progress", 0)

        print(f"Attempt {attempt + 1}: status={status}, progress={progress}%")

        if status == "done":
            return data

        if status == "failed":
            raise RuntimeError(f"Video generation failed: {data}")

        time.sleep(interval)

    raise TimeoutError(f"Video not ready after {max_attempts} attempts")

Full generate-and-poll workflow

import os
import time
import requests

API_KEY = os.environ["XAI_API_KEY"]
BASE_URL = "https://api.x.ai"

headers = {
    "Authorization": f"Bearer {API_KEY}",
}

def poll_video(request_id: str, interval: int = 5, max_attempts: int = 60) -> dict:
    url = f"{BASE_URL}/v1/videos/{request_id}"

    for attempt in range(max_attempts):
        # A timeout keeps one stalled request from blocking the whole loop.
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()

        data = response.json()
        status = data.get("status")
        progress = data.get("progress", 0)

        print(f"Attempt {attempt + 1}: status={status}, progress={progress}%")

        if status == "done":
            return data

        if status == "failed":
            raise RuntimeError(f"Video generation failed: {data}")

        time.sleep(interval)

    raise TimeoutError(f"Video not ready after {max_attempts} attempts")

def generate_video(prompt: str) -> str:
    """Generate a video and return its URL."""
    response = requests.post(
        f"{BASE_URL}/v1/videos/generations",
        headers={**headers, "Content-Type": "application/json"},
        json={
            "model": "grok-imagine-video",
            "prompt": prompt,
        },
        timeout=30,  # avoid hanging indefinitely on a stalled connection
    )

    response.raise_for_status()

    request_id = response.json()["request_id"]
    print(f"Request ID: {request_id}")

    result = poll_video(request_id)

    video_url = result["video"]["url"]
    print(f"Video ready: {video_url}")

    return video_url

video_url = generate_video(
    "A timelapse of a city skyline at sunset transitioning to night, aerial view"
)

When complete, the poll response looks like this:

{
  "status": "done",
  "video": {
    "url": "https://vidgen.x.ai/....mp4",
    "duration": 8,
    "respect_moderation": true
  },
  "progress": 100,
  "usage": {
    "cost_in_usd_ticks": 500000000
  }
}
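
To avoid scattering nested dictionary lookups through application code, the poll response can be flattened into a small typed object. The sketch below assumes the response shape shown above; VideoResult and parse_poll_response are illustrative names, and treating video and usage as absent while the job is still processing is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VideoResult:
    status: str
    progress: int
    url: Optional[str] = None
    duration: Optional[int] = None
    cost_in_usd_ticks: Optional[int] = None


def parse_poll_response(data: dict) -> VideoResult:
    """Flatten a poll response, tolerating missing video/usage objects."""
    video = data.get("video") or {}
    usage = data.get("usage") or {}
    return VideoResult(
        status=data.get("status", "unknown"),
        progress=data.get("progress", 0),
        url=video.get("url"),
        duration=video.get("duration"),
        cost_in_usd_ticks=usage.get("cost_in_usd_ticks"),
    )
```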

Using the xAI Python SDK

If you do not want to implement polling yourself, use the xAI SDK. The client.video.generate() method blocks until the video is ready.

from xai_sdk import Client
import os

client = Client(api_key=os.environ["XAI_API_KEY"])

result = client.video.generate(
    model="grok-imagine-video",
    prompt="A golden retriever running through autumn leaves in slow motion",
    duration=8,
    resolution="720p",
    aspect_ratio="16:9",
)

print(f"Video URL: {result.video.url}")
print(f"Duration: {result.video.duration}s")

Use the SDK when you want the shortest path to working code.

Use raw HTTP requests when you need:

  • Custom retry behavior
  • Frontend progress updates
  • Custom polling intervals
  • More detailed logging
  • Test control over processing, done, and failed states

Writing effective prompts for video generation

Your prompt is the most important input. A specific prompt usually produces better results than a vague one.

A useful structure:

[subject and scene].
[motion].
[camera behavior].
[style, lighting, and mood].

1. Describe the scene clearly

Weak:

A coffee mug.

Better:

A white ceramic coffee mug on a wooden table beside a rain-streaked window.

2. Add explicit motion

Weak:

A coffee mug on a table.

Better:

A white ceramic coffee mug on a wooden table. Steam curls upward while raindrops slide down the window behind it.

3. Specify the camera style

Use terms like:

  • close-up
  • tracking shot
  • overhead drone view
  • handheld
  • slow dolly in
  • camera orbit
  • wide establishing shot

Example:

The camera slowly orbits the mug as steam rises from the coffee.

4. Define lighting and mood

Lighting examples:

  • golden hour
  • overcast
  • neon-lit
  • studio three-point lighting
  • soft window light

Mood examples:

  • melancholic
  • calm
  • energetic
  • cinematic
  • dreamlike

Example:

Foggy morning, soft window light, quiet melancholic mood.

5. Add style references in text

You can guide the visual format with terms like:

  • cinematic
  • documentary
  • anime
  • stop-motion
  • hyperlapse
  • IMAX-style
  • product commercial

Complete prompt example

A lone astronaut floats past the International Space Station,
tether drifting behind them. The camera tracks slowly alongside,
showing Earth below. Cinematic, IMAX quality, warm sunrise light
reflecting off the visor.
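
If the parts of a prompt come from separate form fields, the scene, motion, camera, and style structure described earlier can be assembled in code. This is a minimal sketch; build_prompt is a hypothetical helper, and the API itself only sees the final string.

```python
def build_prompt(scene: str, motion: str = "", camera: str = "", style: str = "") -> str:
    """Join the non-empty prompt parts into sentences, in a fixed order."""
    if not scene.strip():
        raise ValueError("scene is required")

    sentences = []
    for part in (scene, motion, camera, style):
        text = part.strip()
        if not text:
            continue
        # End each part with sentence punctuation so the parts stay distinct.
        if text[-1] not in ".!?":
            text += "."
        sentences.append(text)

    return " ".join(sentences)
```

Keeping the order fixed makes prompts easier to compare when you iterate on a single part, such as the camera behavior.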

Controlling resolution, duration, and aspect ratio

The generation endpoint accepts optional parameters for output length and dimensions.

Duration

{
  "duration": 10
}

Range:

  • Minimum: 1 second
  • Maximum: 15 seconds
  • Default: 6 seconds

Longer videos cost more. For example, a 10-second clip at 480p costs $0.50.

Resolution

{
  "resolution": "720p"
}

Options:

| Resolution | Use case |
| --- | --- |
| 480p | Default, prototyping, cheaper tests |
| 720p | Production output where quality matters |

Aspect ratio

{
  "aspect_ratio": "9:16"
}

Available ratios:

| Ratio | Best for |
| --- | --- |
| 16:9 | Desktop, YouTube, presentations |
| 9:16 | TikTok, Instagram Reels, mobile |
| 1:1 | Instagram feed, social cards |
| 4:3 | Classic video, presentations |
| 3:4 | Portrait mobile content |
| 3:2 | Standard photo ratio |
| 2:3 | Portrait photography |

Full request with all parameters

curl -X POST https://api.x.ai/v1/videos/generations \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-imagine-video",
    "prompt": "A coastal town at dawn, waves breaking gently on a rocky shore",
    "duration": 10,
    "resolution": "720p",
    "aspect_ratio": "16:9"
  }'
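
Validating the optional parameters before sending a request catches mistakes without a round trip. The sketch below encodes the limits listed in this section (1 to 15 seconds, two resolutions, seven aspect ratios); build_generation_payload is an illustrative helper, and the limits should be rechecked against xAI's documentation if they change.

```python
from typing import Optional

ALLOWED_RESOLUTIONS = {"480p", "720p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "3:2", "2:3"}


def build_generation_payload(
    prompt: str,
    duration: Optional[int] = None,
    resolution: Optional[str] = None,
    aspect_ratio: Optional[str] = None,
) -> dict:
    """Build a request body, omitting optional fields that were not set."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")

    payload = {"model": "grok-imagine-video", "prompt": prompt}

    if duration is not None:
        if not 1 <= duration <= 15:
            raise ValueError("duration must be between 1 and 15 seconds")
        payload["duration"] = duration

    if resolution is not None:
        if resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {resolution}")
        payload["resolution"] = resolution

    if aspect_ratio is not None:
        if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
        payload["aspect_ratio"] = aspect_ratio

    return payload
```

Leaving unset fields out of the body lets the API apply its own defaults instead of hardcoding them client-side.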

Using reference images to guide video style

The reference_images parameter accepts an array of up to 7 image URLs.

These images guide the style and content of the generated video, but they do not become the source frame.

Example:

{
  "model": "grok-imagine-video",
  "prompt": "A coastal town at dawn, waves breaking gently on a rocky shore",
  "reference_images": [
    {
      "url": "https://example.com/my-style-reference.jpg"
    },
    {
      "url": "https://example.com/color-palette-reference.jpg"
    }
  ]
}

Reference images work best when they share a consistent aesthetic. Avoid mixing unrelated styles unless you intentionally want the model to blend them.

Use reference images to guide:

  • Color grading
  • Composition
  • Texture
  • Lighting style
  • Overall visual mood

Do not confuse reference images with image-to-video. In text-to-video with reference images, the prompt still drives the scene. In image-to-video, the source image becomes the first frame.
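
A small helper can enforce the 7-image limit and produce the array shape shown in the example request. This is a sketch; build_reference_images is a hypothetical name, and it does not check that the URLs are reachable.

```python
MAX_REFERENCE_IMAGES = 7


def build_reference_images(urls: list[str]) -> list[dict]:
    """Convert image URLs into the reference_images array format."""
    if len(urls) > MAX_REFERENCE_IMAGES:
        raise ValueError(
            f"at most {MAX_REFERENCE_IMAGES} reference images are allowed, got {len(urls)}"
        )
    return [{"url": url} for url in urls]
```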

Extending and editing generated videos

xAI provides two additional endpoints for videos you have already generated.

Extend a video

POST /v1/videos/extensions

Use this endpoint to add more footage to an existing generated video.

You pass:

  • The request_id of the original video
  • A new prompt for the extension

This is useful when you want a longer sequence without generating more than 15 seconds in a single request.

Edit a video

POST /v1/videos/edits

Use this endpoint to modify an existing generated video with a text instruction.

Examples:

  • Change the visual style
  • Alter scene details
  • Apply effects
  • Adjust the look of an existing clip

Both endpoints use the same async pattern:

  1. Send the request
  2. Receive a request_id
  3. Poll GET /v1/videos/{request_id}
  4. Wait for status: "done"
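
Because generation, extension, and editing share this pattern, the submit-then-poll logic can be written once and reused. The sketch below takes the HTTP calls as plain callables, so it makes no assumptions about the request body of the extension or edit endpoints, whose field names this guide does not spell out. submit_and_poll and its parameters are illustrative names.

```python
import time
from typing import Callable


def submit_and_poll(
    submit: Callable[[], dict],
    fetch: Callable[[str], dict],
    interval: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Run one async job: submit it, then poll until it is done or failed.

    submit() must return the JSON body containing "request_id".
    fetch(request_id) must return the JSON body of the poll endpoint.
    """
    request_id = submit()["request_id"]

    for _ in range(max_attempts):
        data = fetch(request_id)
        status = data.get("status")

        if status == "done":
            return data
        if status == "failed":
            raise RuntimeError(f"Job {request_id} failed: {data}")

        sleep(interval)

    raise TimeoutError(f"Job {request_id} not ready after {max_attempts} attempts")
```

In real use, submit would wrap a requests.post call to the endpoint you need and fetch would wrap the GET to /v1/videos/{request_id}. Injecting sleep also lets tests run the loop without waiting.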

Reading the cost from the API response

The completed poll response includes a usage object:

{
  "usage": {
    "cost_in_usd_ticks": 500000000
  }
}

The unit is USD ticks. The figures in this guide (500000000 ticks for a $0.50 clip) imply 1,000,000,000 ticks per dollar, so the examples here divide by 1,000,000,000. Confirm the conversion factor against xAI's documentation before using it for billing.

cost_in_usd = result["usage"]["cost_in_usd_ticks"] / 1_000_000_000

print(f"Cost: ${cost_in_usd:.4f}")

Output:

Cost: $0.5000

Pricing reference

| Resolution | Price per second | 10-second clip |
| --- | --- | --- |
| 480p | $0.05 | $0.50 |
| 720p | $0.07 | $0.70 |

A value of 500000000 ticks equals $0.50. That is a 10-second clip at 480p.
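
To budget before sending a request, the expected price can be estimated from the per-second rates in the table above. This is a sketch with the rates hardcoded as of this guide; estimate_cost_usd is an illustrative helper, and Decimal is used to avoid floating-point rounding in currency math.

```python
from decimal import Decimal

# Per-second prices taken from the pricing table in this guide.
PRICE_PER_SECOND = {
    "480p": Decimal("0.05"),
    "720p": Decimal("0.07"),
}


def estimate_cost_usd(duration_seconds: int, resolution: str = "480p") -> Decimal:
    """Estimate the price of one clip from its duration and resolution."""
    if resolution not in PRICE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution}")
    return PRICE_PER_SECOND[resolution] * duration_seconds
```

For example, estimate_cost_usd(10, "720p") gives 0.70, matching the 10-second column of the table.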

For production systems, log cost_in_usd_ticks from every completed response. This gives you a simple usage dashboard without querying billing separately.

Example log payload:

{
  "request_id": "d97415a1-5796-b7ec-379f-4e6819e08fdf",
  "status": "done",
  "duration": 10,
  "resolution": "480p",
  "cost_in_usd_ticks": 500000000
}
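
A log entry in that shape can be built from the completed poll response. The sample poll response in this guide does not include request_id or resolution, so the sketch below assumes the caller supplies them from the original request; build_usage_log is a hypothetical helper.

```python
import json


def build_usage_log(request_id: str, resolution: str, result: dict) -> str:
    """Serialize one completed generation as a JSON log line."""
    entry = {
        "request_id": request_id,
        "status": result.get("status"),
        "duration": result.get("video", {}).get("duration"),
        "resolution": resolution,
        "cost_in_usd_ticks": result.get("usage", {}).get("cost_in_usd_ticks"),
    }
    return json.dumps(entry)
```

Emitting one JSON object per line keeps the log easy to aggregate later with standard tooling.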

How to test your Grok video API with Apidog

The async polling pattern creates a frontend testing problem.

Your UI needs to handle:

  1. Loading while polling
  2. Success when the video URL is available
  3. Failure when generation fails

Testing those states with real API calls costs money and takes time. Apidog's Smart Mock lets you define mock responses for both endpoints and test the full flow instantly.

Use case 1: Mock the frontend flow with Smart Mock

You need to mock two endpoints:

POST /v1/videos/generations
GET /v1/videos/{request_id}

Mock the generation endpoint

In Apidog:

  1. Create POST /v1/videos/generations
  2. Define the response schema with a request_id string field
  3. Enable Smart Mock

Mock response:

{
  "request_id": "d97415a1-5796-b7ec-379f-4e6819e08fdf"
}

Mock the polling endpoint

Create:

GET /v1/videos/{request_id}

Define the response schema with:

  • status
  • video.url
  • video.duration
  • video.respect_moderation
  • progress
  • usage.cost_in_usd_ticks

Mock successful response:

{
  "status": "done",
  "video": {
    "url": "https://vidgen.x.ai/mock-video-12345.mp4",
    "duration": 8,
    "respect_moderation": true
  },
  "progress": 100,
  "usage": {
    "cost_in_usd_ticks": 400000000
  }
}

To test loading state, return:

{
  "status": "processing",
  "progress": 45
}

To test failure state, return:

{
  "status": "failed",
  "progress": 100
}

Now frontend developers can build the complete video player flow without spending real API credits.
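
The same idea works for backend unit tests: an in-memory stand-in can replay the three mock responses in sequence. The sketch below is not an Apidog feature and not part of the xAI SDK; FakeVideoAPI is a hypothetical test double that returns "processing" a fixed number of times before finishing.

```python
import uuid


class FakeVideoAPI:
    """In-memory stand-in for the generation and polling endpoints."""

    def __init__(self, polls_until_done: int = 2, fail: bool = False):
        self.polls_until_done = polls_until_done
        self.fail = fail
        self._poll_counts: dict[str, int] = {}

    def create(self, prompt: str) -> dict:
        """Mimic POST /v1/videos/generations."""
        request_id = str(uuid.uuid4())
        self._poll_counts[request_id] = 0
        return {"request_id": request_id}

    def get(self, request_id: str) -> dict:
        """Mimic GET /v1/videos/{request_id}."""
        if request_id not in self._poll_counts:
            raise KeyError(f"unknown request_id: {request_id}")

        self._poll_counts[request_id] += 1
        count = self._poll_counts[request_id]

        if count <= self.polls_until_done:
            progress = count * 100 // (self.polls_until_done + 1)
            return {"status": "processing", "progress": progress}

        if self.fail:
            return {"status": "failed", "progress": 100}

        return {
            "status": "done",
            "video": {
                "url": "https://vidgen.x.ai/mock-video-12345.mp4",
                "duration": 8,
                "respect_moderation": True,
            },
            "progress": 100,
            "usage": {"cost_in_usd_ticks": 400000000},
        }
```

A polling function that accepts its HTTP calls as parameters can be pointed at create and get directly, which exercises the loading, success, and failure paths without network access.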

Use case 2: Validate polling with Test Scenarios

After your integration is working, use Apidog Test Scenarios to automate the generate-then-poll flow.

Step 1: Add the generate request

Add this request as the first step:

POST /v1/videos/generations

In the post-processor, extract request_id using JSONPath:

$.request_id

Store it as:

videoRequestId

Step 2: Add the polling request

Add this request as the second step:

GET /v1/videos/{{videoRequestId}}

Wrap it in a loop.

Break condition:

response.body.status == "done"

Add a wait processor between iterations:

5 seconds

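This avoids hammering the endpoint. Also cap the number of iterations, or break when status is "failed", so a failed generation does not keep the loop running indefinitely.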

Step 3: Assert the final result

Add an assertion to the final GET response:

$.video.url is not empty

This confirms the async flow completed successfully.

You can run this scenario in CI to catch regressions when polling logic changes.

Text-to-video vs image-to-video: which should you use?

Both modes use the grok-imagine-video model, but they solve different problems.

Choose text-to-video when

  • You are generating original content from a concept or script
  • You want the model to control the composition
  • Users provide text prompts
  • You do not have a source image

Choose image-to-video when

  • You have a product photo, illustration, or brand asset to animate
  • You need to preserve details from an existing image
  • You are creating consistent animations from related images
  • You want to animate your own artwork or photography

The key distinction:

Text-to-video creates a scene from scratch.
Image-to-video makes an existing image move.

For products that support both modes, route requests based on input type:

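def choose_generation_mode(prompt: str, image_url: str | None) -> str:
    """Pick the generation flow based on whether a source image was provided."""
    # prompt is accepted for interface symmetry; only image_url drives routing.
    if image_url:
        return "image-to-video"

    return "text-to-video"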

If the user uploads an image, route to the image-to-video flow. If the user provides only a prompt, route to:

POST /v1/videos/generations

Common errors and fixes

401 Unauthorized

Your API key is missing, expired, or incorrectly formatted.

Check that your header is exactly:

Authorization: Bearer YOUR_XAI_API_KEY

Also confirm that the key is active in the xAI console.

429 Too Many Requests

You hit a rate limit.

The API allows:

  • 60 requests per minute
  • 1 request per second

Fixes:

  • Add delays between requests
  • Poll every 5 to 10 seconds
  • Avoid tight polling loops
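
One common way to add those delays is exponential backoff: wait, retry, and double the wait each time a 429 comes back. The sketch below is a generic pattern and not something prescribed by xAI. call_with_backoff is an illustrative name, and the wrapped call is assumed to return a (status_code, body) tuple.

```python
import time
from typing import Callable


def call_with_backoff(
    call: Callable[[], tuple[int, dict]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, dict]:
    """Retry a call while it returns HTTP 429, doubling the delay each time."""
    delay = base_delay

    for attempt in range(max_retries + 1):
        status_code, body = call()
        if status_code != 429:
            return status_code, body

        if attempt == max_retries:
            break

        sleep(delay)
        delay = min(delay * 2, max_delay)  # cap the wait between retries

    raise RuntimeError(f"Still rate limited after {max_retries} retries")
```

With requests, the wrapped call could be a lambda that performs the GET and returns (response.status_code, response.json()).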

status: "failed" in the poll response

The generation failed.

This usually means the prompt was rejected by content moderation. If respect_moderation is true, moderation was applied.

Fixes:

  • Revise the prompt
  • Remove ambiguous wording
  • Remove potentially sensitive language
  • Try a more specific and neutral scene description

Video URL returns 404

Generated video URLs expire after a period of time.

Fix:

Download the MP4 to your own storage immediately after retrieving video.url.

Do not store the generated URL and assume it will work days later.
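
A minimal download sketch using only the standard library is shown below. It assumes the generated URL can be fetched with a plain GET and no extra headers, which this guide does not confirm; download_video and the destination path are illustrative.

```python
import shutil
import urllib.request
from pathlib import Path


def download_video(url: str, dest: str) -> Path:
    """Stream a video URL to a local file and return the saved path."""
    path = Path(dest)
    path.parent.mkdir(parents=True, exist_ok=True)

    # Copy in chunks so the whole MP4 is never held in memory at once.
    with urllib.request.urlopen(url, timeout=60) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)

    return path
```

In production you would typically upload the saved file to object storage and keep that location, rather than the original video.url, in your database.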

Empty or frozen video

Vague prompts or prompts without motion cues can produce minimal movement.

Weak:

A car on a road.

Better:

A red sports car speeds along a winding mountain road. The camera follows from behind as trees blur past on both sides.

Add:

  • What moves
  • Direction of movement
  • Speed
  • Camera behavior

Slow generation or polling

720p videos take longer than 480p. Longer durations also take more time.

For development, use:

{
  "duration": 3,
  "resolution": "480p"
}

Then switch to longer 720p generations for production output.

Conclusion

The Grok text-to-video API follows a simple async workflow:

  1. Send a prompt to POST /v1/videos/generations
  2. Receive a request_id
  3. Poll GET /v1/videos/{request_id}
  4. Wait for status: "done"
  5. Read the MP4 URL from video.url

Once your polling loop works, the rest of the integration is mostly parameter tuning.

For production:

  • Track cost_in_usd_ticks
  • Download generated videos to your own storage
  • Poll at reasonable intervals
  • Handle processing, done, and failed
  • Mock both endpoints during frontend development
  • Add automated tests for the async flow

Use Apidog to mock the Grok video endpoints and validate your polling logic before spending credits on real generations.

FAQ

What model name do I use for text-to-video generation?

Use:

grok-imagine-video

This is the required model value for:

POST /v1/videos/generations

How long does video generation take?

It depends on duration and resolution.

Short 480p clips may complete in under 30 seconds. Longer 720p clips can take a few minutes.

Poll every 5 to 10 seconds instead of continuously calling the endpoint.

Can I generate a video longer than 15 seconds?

Not in a single request.

The maximum duration is 15 seconds. To create longer videos, generate a clip and then use:

POST /v1/videos/extensions

How do I download the generated video?

Use the URL from the completed poll response:

video_url = result["video"]["url"]

Download the MP4 to your own storage immediately. The URL is temporary and will expire.

What happens if my prompt violates content moderation?

The job can return:

{
  "status": "failed"
}

The respect_moderation field indicates that moderation was applied. Revise the prompt and try again.

Is there a free tier for the video API?

xAI charges per second of output generated. There is no free tier specifically for video generation. Check console.x.ai for current credit offers for new accounts.

How do reference_images differ from starting with a source image?

reference_images guide the visual style of a text-to-video generation. They influence the look but do not become the subject.

A source image for image-to-video becomes the first frame of the generated video.

What's the best way to test the polling loop without spending credits?

Use Apidog Smart Mock to mock both endpoints:

POST /v1/videos/generations
GET /v1/videos/{request_id}

Define mock responses for:

  • processing
  • done
  • failed

Then your frontend and polling code can run without calling the real API.