Preecha

Posted on Jun 9

How to use the Grok text to video API (complete guide)

TL;DR

The Grok text-to-video API generates video from a text prompt. You call POST /v1/videos/generations, get a request_id immediately, then poll GET /v1/videos/{request_id} until status is "done". The model is grok-imagine-video, pricing starts at $0.05 per second at 480p, and the xAI Python SDK can handle polling automatically.


Introduction

xAI generated 1.2 billion videos in January 2026 alone, the same month it launched the Grok text-to-video API (January 28, 2026). The model also ranked number one on the Artificial Analysis text-to-video leaderboard that month. Those numbers matter because they suggest the infrastructure has already been tested at scale.

This guide shows you how to:

Make your first text-to-video request
Poll for the generated video
Tune duration, resolution, and aspect ratio
Write better prompts
Use reference images
Extend or edit existing videos
Test the async polling flow without spending credits on every frontend test

The API is async. Your frontend should not block while waiting for video generation. Instead, it needs to render loading, success, and error states while polling for completion.

If you're building a video generation UI, mock the generation and polling endpoints during development. Apidog's Smart Mock can simulate both endpoints so your team can build the player UI before the backend flow is finalized.

What is the Grok text-to-video API?

The Grok text-to-video API is part of xAI's media generation suite at https://api.x.ai.

You send a text prompt to the grok-imagine-video model, and the API generates a short video clip from scratch. No source image is required.

The API sits alongside:

A synchronous image generation endpoint: POST /v1/images/generations
The grok-imagine-image model
Video extension and editing endpoints

The text-to-video endpoint is different from image-to-video generation because you provide only words. The model creates the scene, motion, composition, and visual style from your prompt.

Use text-to-video when you want the model to create the scene from scratch. Use image-to-video when you already have a source image and want to animate it.

How text-to-video generation works

Most API calls are synchronous:

Send a request
Wait briefly
Receive the final response

Video generation takes longer, so the Grok video API uses an async pattern:

Send a POST request with your prompt
Receive a request_id immediately
Poll a GET endpoint with that request_id
Continue polling while status is "processing"
Stop when status becomes "done"
Read the generated video URL from the response

Flow:

POST /v1/videos/generations
        ↓
{ "request_id": "..." }
        ↓
GET /v1/videos/{request_id}
        ↓
status: processing
        ↓
GET /v1/videos/{request_id}
        ↓
status: done
        ↓
video.url

This keeps HTTP connections short and lets your app decide how often to poll.

Prerequisites

Before writing code, make sure you have:

An xAI account at console.x.ai
An API key from the xAI console
Billing access enabled for generation requests

Store your API key as an environment variable instead of hardcoding it:

export XAI_API_KEY="your_api_key_here"

If you want to use the xAI Python SDK:

pip install xai-sdk

For raw HTTP requests:

pip install requests

Your first text-to-video request

Endpoint:

POST https://api.x.ai/v1/videos/generations

Required fields:

Field	Value
`model`	`grok-imagine-video`
`prompt`	Your video description

Using curl

curl -X POST https://api.x.ai/v1/videos/generations \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-imagine-video",
    "prompt": "A golden retriever running through autumn leaves in slow motion, cinematic lighting"
  }'

Response:

{
  "request_id": "d97415a1-5796-b7ec-379f-4e6819e08fdf"
}

That request_id is used to retrieve the completed video.

Using Python with `requests`

import os
import requests

API_KEY = os.environ["XAI_API_KEY"]
BASE_URL = "https://api.x.ai"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

payload = {
    "model": "grok-imagine-video",
    "prompt": "A golden retriever running through autumn leaves in slow motion, cinematic lighting",
}

response = requests.post(
    f"{BASE_URL}/v1/videos/generations",
    headers=headers,
    json=payload,
)

response.raise_for_status()

data = response.json()
request_id = data["request_id"]

print(f"Generation started. Request ID: {request_id}")

Polling for the video result

After receiving a request_id, poll:

GET /v1/videos/{request_id}

The status field can be:

Status	Meaning
`processing`	The video is still generating
`done`	The video is complete and the URL is available
`failed`	The generation failed

Python polling loop

import os
import time
import requests

API_KEY = os.environ["XAI_API_KEY"]
BASE_URL = "https://api.x.ai"

headers = {
    "Authorization": f"Bearer {API_KEY}",
}

def poll_video(request_id: str, interval: int = 5, max_attempts: int = 60) -> dict:
    """Poll until video generation is complete."""
    url = f"{BASE_URL}/v1/videos/{request_id}"

    for attempt in range(max_attempts):
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        data = response.json()

        status = data.get("status")
        progress = data.get("progress", 0)

        print(f"Attempt {attempt + 1}: status={status}, progress={progress}%")

        if status == "done":
            return data

        if status == "failed":
            raise RuntimeError(f"Video generation failed: {data}")

        time.sleep(interval)

    raise TimeoutError(f"Video not ready after {max_attempts} attempts")

Full generate-and-poll workflow

import os
import time
import requests

API_KEY = os.environ["XAI_API_KEY"]
BASE_URL = "https://api.x.ai"

headers = {
    "Authorization": f"Bearer {API_KEY}",
}

def poll_video(request_id: str, interval: int = 5, max_attempts: int = 60) -> dict:
    url = f"{BASE_URL}/v1/videos/{request_id}"

    for attempt in range(max_attempts):
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        data = response.json()
        status = data.get("status")
        progress = data.get("progress", 0)

        print(f"Attempt {attempt + 1}: status={status}, progress={progress}%")

        if status == "done":
            return data

        if status == "failed":
            raise RuntimeError(f"Video generation failed: {data}")

        time.sleep(interval)

    raise TimeoutError(f"Video not ready after {max_attempts} attempts")

def generate_video(prompt: str) -> str:
    """Generate a video and return its URL."""
    response = requests.post(
        f"{BASE_URL}/v1/videos/generations",
        headers={**headers, "Content-Type": "application/json"},
        json={
            "model": "grok-imagine-video",
            "prompt": prompt,
        },
    )

    response.raise_for_status()

    request_id = response.json()["request_id"]
    print(f"Request ID: {request_id}")

    result = poll_video(request_id)

    video_url = result["video"]["url"]
    print(f"Video ready: {video_url}")

    return video_url

video_url = generate_video(
    "A timelapse of a city skyline at sunset transitioning to night, aerial view"
)

When complete, the poll response looks like this:

{
  "status": "done",
  "video": {
    "url": "https://vidgen.x.ai/....mp4",
    "duration": 8,
    "respect_moderation": true
  },
  "progress": 100,
  "usage": {
    "cost_in_usd_ticks": 400000000
  }
}

Using the xAI Python SDK

If you do not want to implement polling yourself, use the xAI SDK. The client.video.generate() method blocks until the video is ready.

from xai_sdk import Client
import os

client = Client(api_key=os.environ["XAI_API_KEY"])

result = client.video.generate(
    model="grok-imagine-video",
    prompt="A golden retriever running through autumn leaves in slow motion",
    duration=8,
    resolution="720p",
    aspect_ratio="16:9",
)

print(f"Video URL: {result.video.url}")
print(f"Duration: {result.video.duration}s")

Use the SDK when you want the shortest path to working code.

Use raw HTTP requests when you need:

Custom retry behavior
Frontend progress updates
Custom polling intervals
More detailed logging
Test control over processing, done, and failed states
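As a rough sketch of what that extra control can look like, the polling loop can take its HTTP call and its sleep function as arguments. The names poll_until_done, fetch_status, and on_progress are hypothetical helpers introduced here for illustration; they are not part of the xAI SDK. The status values are the ones listed earlier in this guide.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    fetch_status: Callable[[], dict],
    on_progress: Optional[Callable[[dict], None]] = None,
    interval: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until it reports "done" or "failed".

    fetch_status is any zero-argument callable that returns the parsed
    poll response, so the real HTTP call can be swapped for a fake in tests.
    """
    for _ in range(max_attempts):
        data = fetch_status()

        # Hand every poll response to the caller (UI update, logging, ...).
        if on_progress is not None:
            on_progress(data)

        status = data.get("status")
        if status == "done":
            return data
        if status == "failed":
            raise RuntimeError(f"Video generation failed: {data}")

        sleep(interval)

    raise TimeoutError(f"Video not ready after {max_attempts} attempts")
```

With the requests setup used earlier, fetch_status could be something like lambda: requests.get(url, headers=headers, timeout=30).json(). In tests, pass a function that returns canned responses and a no-op sleep so the loop runs instantly.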

Writing effective prompts for video generation

Your prompt is the most important input. A specific prompt usually produces better results than a vague one.

A useful structure:

[subject and scene].
[motion].
[camera behavior].
[style, lighting, and mood].

1. Describe the scene clearly

Weak:

A coffee mug.

Better:

A white ceramic coffee mug on a wooden table beside a rain-streaked window.

2. Add explicit motion

Weak:

A coffee mug on a table.

Better:

A white ceramic coffee mug on a wooden table. Steam curls upward while raindrops slide down the window behind it.

3. Specify the camera style

Use terms like:

close-up
tracking shot
overhead drone view
handheld
slow dolly in
camera orbit
wide establishing shot

Example:

The camera slowly orbits the mug as steam rises from the coffee.

4. Define lighting and mood

Lighting examples:

golden hour
overcast
neon-lit
studio three-point lighting
soft window light

Mood examples:

melancholic
calm
energetic
cinematic
dreamlike

Example:

Foggy morning, soft window light, quiet melancholic mood.

5. Add style references in text

You can guide the visual format with terms like:

cinematic
documentary
anime
stop-motion
hyperlapse
IMAX-style
product commercial

Prompt template

A lone astronaut floats past the International Space Station,
tether drifting behind them. The camera tracks slowly alongside,
showing Earth below. Cinematic, IMAX quality, warm sunrise light
reflecting off the visor.
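If your app collects these pieces from separate form fields, a small helper can assemble them in the order suggested above. This is only a sketch: build_prompt and its parameter names are hypothetical, and the API itself just receives the final string.

```python
def build_prompt(subject: str, motion: str = "", camera: str = "", style: str = "") -> str:
    """Join the four prompt parts into one string, one sentence per part."""
    if not subject.strip():
        raise ValueError("subject is required")

    sentences = []
    for part in (subject, motion, camera, style):
        text = part.strip()
        if not text:
            continue  # optional parts may be left empty
        if text[-1] not in ".!?":
            text += "."  # make sure each part ends as a sentence
        sentences.append(text)

    return " ".join(sentences)


prompt = build_prompt(
    subject="A white ceramic coffee mug on a wooden table beside a rain-streaked window",
    motion="Steam curls upward while raindrops slide down the glass",
    camera="The camera slowly orbits the mug",
    style="Soft window light, quiet melancholic mood",
)
```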

Controlling resolution, duration, and aspect ratio

The generation endpoint accepts optional parameters for output length and dimensions.

Duration

{
  "duration": 10
}

Range:

Minimum: 1 second
Maximum: 15 seconds
Default: 6 seconds

Longer videos cost more. For example, a 10-second clip at 480p costs $0.50.

Resolution

{
  "resolution": "720p"
}

Options:

Resolution	Use case
`480p`	Default, prototyping, cheaper tests
`720p`	Production output where quality matters

Aspect ratio

{
  "aspect_ratio": "9:16"
}

Available ratios:

Ratio	Best for
`16:9`	Desktop, YouTube, presentations
`9:16`	TikTok, Instagram Reels, mobile
`1:1`	Instagram feed, social cards
`4:3`	Classic video, presentations
`3:4`	Portrait mobile content
`3:2`	Standard photo ratio
`2:3`	Portrait photography

Full request with all parameters

curl -X POST https://api.x.ai/v1/videos/generations \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-imagine-video",
    "prompt": "A coastal town at dawn, waves breaking gently on a rocky shore",
    "duration": 10,
    "resolution": "720p",
    "aspect_ratio": "16:9"
  }'
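To catch invalid values before they reach the API, you could validate the optional parameters client-side. The sketch below assumes the ranges and options listed in this section; build_generation_payload is a hypothetical helper, and the API remains the source of truth for what it accepts.

```python
from typing import Optional

ALLOWED_RESOLUTIONS = {"480p", "720p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "3:2", "2:3"}


def build_generation_payload(
    prompt: str,
    duration: Optional[int] = None,
    resolution: Optional[str] = None,
    aspect_ratio: Optional[str] = None,
) -> dict:
    """Build the JSON body for POST /v1/videos/generations."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")

    payload = {"model": "grok-imagine-video", "prompt": prompt}

    # Optional fields are only sent when set, so API defaults still apply.
    if duration is not None:
        if not 1 <= duration <= 15:
            raise ValueError("duration must be between 1 and 15 seconds")
        payload["duration"] = duration

    if resolution is not None:
        if resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {resolution}")
        payload["resolution"] = resolution

    if aspect_ratio is not None:
        if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
        payload["aspect_ratio"] = aspect_ratio

    return payload
```

The returned dict can be passed as json=payload to requests.post, as in the earlier examples.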

Using reference images to guide video style

The reference_images parameter accepts an array of up to 7 image URLs.

These images guide the style and content of the generated video, but they do not become the source frame.

Example:

{
  "model": "grok-imagine-video",
  "prompt": "A coastal town at dawn, waves breaking gently on a rocky shore",
  "reference_images": [
    {
      "url": "https://example.com/my-style-reference.jpg"
    },
    {
      "url": "https://example.com/color-palette-reference.jpg"
    }
  ]
}

Reference images work best when they share a consistent aesthetic. Avoid mixing unrelated styles unless you intentionally want the model to blend them.

Use reference images to guide:

Color grading
Composition
Texture
Lighting style
Overall visual mood

Do not confuse reference images with image-to-video. In text-to-video with reference images, the prompt still drives the scene. In image-to-video, the source image becomes the first frame.
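A small helper can attach reference images to an existing payload and enforce the limit before the request is sent. This is a sketch that assumes the 7-image cap and the {"url": ...} object shape shown above; add_reference_images is a hypothetical name.

```python
MAX_REFERENCE_IMAGES = 7  # limit stated in this guide


def add_reference_images(payload: dict, image_urls: list[str]) -> dict:
    """Return a copy of payload with a reference_images array added."""
    if len(image_urls) > MAX_REFERENCE_IMAGES:
        raise ValueError(f"at most {MAX_REFERENCE_IMAGES} reference images are allowed")

    for url in image_urls:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an http(s) URL: {url}")

    # Copy instead of mutating, so the base payload can be reused.
    return {**payload, "reference_images": [{"url": url} for url in image_urls]}
```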

Extending and editing generated videos

xAI provides two additional endpoints for videos you have already generated.

Extend a video

POST /v1/videos/extensions

Use this endpoint to add more footage to an existing generated video.

You pass:

The request_id of the original video
A new prompt for the extension

This is useful when you want a longer sequence without generating more than 15 seconds in a single request.

Edit a video

POST /v1/videos/edits

Use this endpoint to modify an existing generated video with a text instruction.

Examples:

Change the visual style
Alter scene details
Apply effects
Adjust the look of an existing clip

Both endpoints use the same async pattern:

Send the request
Receive a request_id
Poll GET /v1/videos/{request_id}
Wait for status: "done"

Reading the cost from the API response

The completed poll response includes a usage object:

{
  "usage": {
    "cost_in_usd_ticks": 500000000
  }
}

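The unit is USD ticks. Going by the figures in this guide, where 500000000 ticks corresponds to a $0.50 clip, one dollar is 1,000,000,000 ticks, so divide by 1,000,000,000 to convert ticks to dollars.

cost_in_usd = result["usage"]["cost_in_usd_ticks"] / 1_000_000_000

print(f"Cost: ${cost_in_usd:.4f}")

Output:

Cost: $0.5000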

Pricing reference

Resolution	Price per second	10-second clip
`480p`	$0.05	$0.50
`720p`	$0.07	$0.70

A value of 500000000 ticks equals $0.50. That is a 10-second clip at 480p.
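To estimate spend before sending a request, you can multiply the duration by the per-second price. The sketch below hardcodes the two prices from the table above, which may change, and estimate_cost_usd is a hypothetical helper. It uses Decimal so the result is exact rather than a rounded float.

```python
from decimal import Decimal

# Per-second prices copied from the pricing table in this guide.
PRICE_PER_SECOND_USD = {
    "480p": Decimal("0.05"),
    "720p": Decimal("0.07"),
}


def estimate_cost_usd(duration_seconds: int, resolution: str = "480p") -> Decimal:
    """Estimate the cost of one clip in US dollars."""
    if resolution not in PRICE_PER_SECOND_USD:
        raise ValueError(f"no price known for resolution: {resolution}")
    if duration_seconds < 1:
        raise ValueError("duration must be at least 1 second")
    return PRICE_PER_SECOND_USD[resolution] * duration_seconds


print(estimate_cost_usd(10, "480p"))  # 0.50
print(estimate_cost_usd(10, "720p"))  # 0.70
```

Comparing this estimate with the cost_in_usd_ticks value in the completed response is a simple way to notice pricing changes.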

For production systems, log cost_in_usd_ticks from every completed response. This gives you a simple usage dashboard without querying billing separately.

Example log payload:

{
  "request_id": "d97415a1-5796-b7ec-379f-4e6819e08fdf",
  "status": "done",
  "duration": 10,
  "resolution": "480p",
  "cost_in_usd_ticks": 500000000
}

How to test your Grok video API with Apidog

The async polling pattern creates a frontend testing problem.

Your UI needs to handle:

Loading while polling
Success when the video URL is available
Failure when generation fails

Testing those states with real API calls costs money and takes time. Apidog's Smart Mock lets you define mock responses for both endpoints and test the full flow instantly.

Use case 1: Mock the frontend flow with Smart Mock

You need to mock two endpoints:

POST /v1/videos/generations
GET /v1/videos/{request_id}

Mock the generation endpoint

In Apidog:

Create POST /v1/videos/generations
Define the response schema with a request_id string field
Enable Smart Mock

Mock response:

{
  "request_id": "d97415a1-5796-b7ec-379f-4e6819e08fdf"
}

Mock the polling endpoint

Create:

GET /v1/videos/{request_id}

Define the response schema with:

status
video.url
video.duration
video.respect_moderation
progress
usage.cost_in_usd_ticks

Mock successful response:

{
  "status": "done",
  "video": {
    "url": "https://vidgen.x.ai/mock-video-12345.mp4",
    "duration": 8,
    "respect_moderation": true
  },
  "progress": 100,
  "usage": {
    "cost_in_usd_ticks": 400000000
  }
}

To test loading state, return:

{
  "status": "processing",
  "progress": 45
}

To test failure state, return:

{
  "status": "failed",
  "progress": 100
}

Now frontend developers can build the complete video player flow without spending real API credits.
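On the client side, the three mock responses map naturally onto three UI states. The sketch below shows one way to express that mapping in Python. The state names "loading", "success", and "error" and the ui_state helper are illustrative assumptions, not part of the API.

```python
def ui_state(poll_response: dict) -> dict:
    """Translate a poll response into a state the UI can render."""
    status = poll_response.get("status")

    if status == "done":
        return {"state": "success", "video_url": poll_response["video"]["url"]}

    if status == "failed":
        return {"state": "error", "message": "Video generation failed"}

    if status == "processing":
        return {"state": "loading", "progress": poll_response.get("progress", 0)}

    # Anything unexpected is treated as an error instead of spinning forever.
    return {"state": "error", "message": f"Unexpected status: {status!r}"}


# The mock responses from this section, used as fixtures.
print(ui_state({"status": "processing", "progress": 45}))
print(ui_state({"status": "failed", "progress": 100}))
```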

Use case 2: Validate polling with Test Scenarios

After your integration is working, use Apidog Test Scenarios to automate the generate-then-poll flow.

Step 1: Add the generate request

Add this request as the first step:

POST /v1/videos/generations

In the post-processor, extract request_id using JSONPath:

$.request_id

Store it as:

videoRequestId

Step 2: Add the polling request

Add this request as the second step:

GET /v1/videos/{{videoRequestId}}

Wrap it in a loop.

Break condition:

response.body.status == "done"

Add a wait processor between iterations:

5 seconds

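This avoids hammering the endpoint. Also cap the number of loop iterations so a job that ends in "failed" does not keep the scenario polling indefinitely.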

Step 3: Assert the final result

Add an assertion to the final GET response:

$.video.url is not empty

This confirms the async flow completed successfully.

You can run this scenario in CI to catch regressions when polling logic changes.

Text-to-video vs image-to-video: which should you use?

Both modes use the grok-imagine-video model, but they solve different problems.

Choose text-to-video when

You are generating original content from a concept or script
You want the model to control the composition
Users provide text prompts
You do not have a source image

Choose image-to-video when

You have a product photo, illustration, or brand asset to animate
You need to preserve details from an existing image
You are creating consistent animations from related images
You want to animate your own artwork or photography

The key distinction:

Text-to-video creates a scene from scratch.
Image-to-video makes an existing image move.

For products that support both modes, route requests based on input type:

def choose_generation_mode(prompt: str, image_url: str | None):
    if image_url:
        return "image-to-video"

    return "text-to-video"

If the user uploads an image, route to the image-to-video flow. If the user provides only a prompt, route to:

POST /v1/videos/generations

Common errors and fixes

`401 Unauthorized`

Your API key is missing, expired, or incorrectly formatted.

Check that your header is exactly:

Authorization: Bearer YOUR_XAI_API_KEY

Also confirm that the key is active in the xAI console.

`429 Too Many Requests`

You hit a rate limit.

The API allows:

60 requests per minute
1 request per second

Fixes:

Add delays between requests
Poll every 5 to 10 seconds
Avoid tight polling loops
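One common way to add delays after a 429 is exponential backoff: wait longer after each consecutive rate-limited response, up to a cap. The sketch below is a general technique rather than something prescribed by xAI, and backoff_delay with its default values (5-second base, 60-second cap) is an assumption chosen to match the polling interval suggested here.

```python
import random


def backoff_delay(attempt: int, base: float = 5.0, cap: float = 60.0, jitter: float = 0.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles with each attempt (base, 2*base, 4*base, ...) and is
    capped at `cap`. Up to `jitter` random seconds are added on top so that
    many clients do not retry at exactly the same moment.
    """
    delay = min(cap, base * (2 ** attempt))
    if jitter > 0:
        delay += random.uniform(0, jitter)
    return delay


# Delays for the first five consecutive 429 responses, without jitter.
print([backoff_delay(n) for n in range(5)])  # [5.0, 10.0, 20.0, 40.0, 60.0]
```

Call time.sleep(backoff_delay(attempt)) when a response has status code 429, and reset attempt to 0 after the next successful response.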

`status: "failed"` in the poll response

The generation failed.

This usually means the prompt was rejected by content moderation. If respect_moderation is true, moderation was applied.

Fixes:

Revise the prompt
Remove ambiguous wording
Remove potentially sensitive language
Try a more specific and neutral scene description

Video URL returns `404`

Generated video URLs expire after a period of time.

Fix:

Download the MP4 to your own storage immediately after retrieving video.url.

Do not store the generated URL and assume it will work days later.
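A minimal download helper might look like the sketch below. It uses only the Python standard library and assumes the URL can be fetched with a plain GET without extra headers. download_video is a hypothetical name, and in production you would likely upload to object storage instead of local disk.

```python
import shutil
import urllib.request
from pathlib import Path


def download_video(video_url: str, dest: Path, timeout: float = 60.0) -> Path:
    """Stream the video at video_url to dest and return dest."""
    dest.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temporary ".part" file first so a failed download
    # never leaves a truncated file at the final path.
    tmp = dest.with_suffix(dest.suffix + ".part")

    with urllib.request.urlopen(video_url, timeout=timeout) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once

    tmp.replace(dest)
    return dest
```

Call it right after polling completes, for example download_video(result["video"]["url"], Path("videos") / f"{request_id}.mp4"), and store the local path instead of the temporary URL.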

Empty or frozen video

Vague prompts or prompts without motion cues can produce minimal movement.

Weak:

A car on a road.

Better:

A red sports car speeds along a winding mountain road. The camera follows from behind as trees blur past on both sides.

Add:

What moves
Direction of movement
Speed
Camera behavior

Slow generation or polling

720p videos take longer than 480p. Longer durations also take more time.

For development, use:

{
  "duration": 3,
  "resolution": "480p"
}

Then switch to longer 720p generations for production output.

Conclusion

The Grok text-to-video API follows a simple async workflow:

Send a prompt to POST /v1/videos/generations
Receive a request_id
Poll GET /v1/videos/{request_id}
Wait for status: "done"
Read the MP4 URL from video.url

Once your polling loop works, the rest of the integration is mostly parameter tuning.

For production:

Track cost_in_usd_ticks
Download generated videos to your own storage
Poll at reasonable intervals
Handle processing, done, and failed
Mock both endpoints during frontend development
Add automated tests for the async flow

Use Apidog to mock the Grok video endpoints and validate your polling logic before spending credits on real generations.

FAQ

What model name do I use for text-to-video generation?

Use:

grok-imagine-video

This is the required model value for:

POST /v1/videos/generations

How long does video generation take?

It depends on duration and resolution.

Short 480p clips may complete in under 30 seconds. Longer 720p clips can take a few minutes.

Poll every 5 to 10 seconds instead of continuously calling the endpoint.

Can I generate a video longer than 15 seconds?

Not in a single request.

The maximum duration is 15 seconds. To create longer videos, generate a clip and then use:

POST /v1/videos/extensions

How do I download the generated video?

Use the URL from the completed poll response:

video_url = result["video"]["url"]

Download the MP4 to your own storage immediately. The URL is temporary and will expire.

What happens if my prompt violates content moderation?

The job can return:

{
  "status": "failed"
}

The respect_moderation field indicates that moderation was applied. Revise the prompt and try again.

Is there a free tier for the video API?

xAI charges per second of output generated. There is no free tier specifically for video generation. Check console.x.ai for current credit offers for new accounts.

How do `reference_images` differ from starting with a source image?

reference_images guide the visual style of a text-to-video generation. They influence the look but do not become the subject.

A source image for image-to-video becomes the first frame of the generated video.

What's the best way to test the polling loop without spending credits?

Use Apidog Smart Mock to mock both endpoints:

POST /v1/videos/generations
GET /v1/videos/{request_id}

Define mock responses for:

processing
done
failed

Then your frontend and polling code can run without calling the real API.
