xAI shipped Grok Voice with the Grok 4.3 release. For developers, the key point is simple: Grok Voice is free on the xAI Console. There is no per-minute charge and no per-token charge for the voice agent model, text-to-speech, speech-to-text, or Custom Voices clone tool. The only billable resource is the underlying Grok 4.3 token usage when the agent reasons, and that usage has its own free console allowance for testing.
This guide shows how to run Grok Voice at zero voice-feature cost: create a console key, clone a voice, open a WebSocket session, stream audio, add tool calls, and test the flow with Apidog before wiring it into a product.
If you also want the broader Grok 4.3 API guide, or a head-to-head against OpenAI’s stack in Grok Voice vs GPT-Realtime, those companion posts cover the rest of the surface.
TL;DR
- Grok Voice is free for users on the xAI Console (console.x.ai): no per-minute or per-token charge for TTS, STT, the voice agent, or Custom Voices.
- Flagship model: `grok-voice-think-fast-1.0`.
- Time-to-first-audio is under 1 second; xAI claims it is roughly 5x faster than the closest competitor.
- 80+ preset voices across 28 languages.
- 5 built-in voice agent personas: Eve, Ara, Rex, Sal, Leo.
- Custom voice cloning works from about 1 minute of speech.
- Production-ready voice generation completes in under 2 minutes.
- WebSocket endpoint: `wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0`
- REST endpoints are available for TTS, STT, and Custom Voices.
- Use Apidog to script WebSocket sessions and replay them without rerecording audio.
What Grok Voice gives you for free
The xAI Console is the path to free access. Sign in at console.x.ai, generate an API key, and you can call four voice surfaces with no charge tied to the voice features themselves.
You get access to:
- Voice Agent: real-time speech-to-speech with tool use, server-side voice activity detection, and turn-taking.
- Text-to-Speech: 80+ preset voices across 28 languages, with MP3 or μ-law output.
- Speech-to-Text: streaming and batch transcription across 25 input languages, with word-level timestamps and speaker diarization.
- Custom Voices: clone your voice from a short sample and use the resulting `voice_id` across TTS and voice agent APIs.
The only meter that ticks is Grok 4.3 token usage when the agent reasons over a request. The console also gives you free credit for testing that surface, which is enough to validate end-to-end flows before billing starts.
Step 1: Get a console key
Go to console.x.ai and sign in with your X account.
From the API Keys page:
- Create a new API key.
- Enable the `voice` and `chat` scopes.
- Export the key once.
- Store it in your local environment.
```shell
export XAI_API_KEY="xai-..."
```
For client-side apps, do not ship the parent API key to the browser. Instead, mint an ephemeral token from the console settings or via the /v1/realtime/sessions endpoint.
Ephemeral tokens carry the same scope but expire in minutes, so they are suitable for browser-based WebSocket sessions.
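Here is a minimal sketch of the server-side half, assuming the `/v1/realtime/sessions` endpoint accepts a JSON POST with the model name and returns a `client_secret` field; those field names are assumptions, so check the console docs for the exact shape:

```javascript
// Sketch: mint a short-lived token server-side, then hand only that token
// to the browser. The request/response field names (model, client_secret)
// are assumptions based on common realtime APIs.
function buildSessionRequest(apiKey, model = "grok-voice-think-fast-1.0") {
  return {
    url: "https://api.x.ai/v1/realtime/sessions",
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model }),
    },
  };
}

// Server-side usage (Node 18+ has a global fetch):
// const { url, options } = buildSessionRequest(process.env.XAI_API_KEY);
// const session = await (await fetch(url, options)).json();
// res.json({ token: session.client_secret }); // only the ephemeral token leaves the server
```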
Step 2: Pick a voice
You can start with a preset voice or create a custom clone.
Option A: Use a preset voice
The voice agent includes five named personas:
| Voice | Description | Good fit |
|---|---|---|
| `eve` | Female, energetic | Upbeat support flows |
| `ara` | Female, warm | General assistance |
| `rex` | Male, confident | Sales scripts |
| `sal` | Neutral, smooth | Narration and longer reads |
| `leo` | Male, authoritative | Compliance and formal flows |
For the broader TTS API, the preset library is larger: more than 80 voices across 28 languages. You select them with the voice parameter on the TTS endpoint.
Option B: Clone a custom voice
Upload a WAV file with about one minute of clean speech from a single speaker.
```shell
curl https://api.x.ai/v1/custom-voices \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F "name=narrator-jane" \
  -F "language=en" \
  -F "audio=@sample.wav"
```
The API returns a `voice_id` in under two minutes. You can reuse that ID across both the TTS endpoint and the voice agent.
Keep the reference clip clean:
- Use a quiet room.
- Record one speaker only.
- Avoid music, effects, or background noise.
- Prefer a consistent single take.
- Do not assume longer is better; the maximum reference clip length is 120 seconds, but clean audio matters more than duration.
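Before uploading, you can sanity-check the clip programmatically. This sketch assumes a canonical 44-byte PCM WAV header; real files may carry extra chunks, so treat it as a quick check, not a full validator:

```javascript
// Sketch: sanity-check a reference clip before uploading it for cloning.
// Assumes the canonical 44-byte PCM WAV header layout.
function checkReferenceClip(buf) {
  if (buf.toString("ascii", 0, 4) !== "RIFF" ||
      buf.toString("ascii", 8, 12) !== "WAVE") {
    throw new Error("Not a WAV file");
  }
  const channels = buf.readUInt16LE(22); // 1 = mono
  const byteRate = buf.readUInt32LE(28); // bytes of audio per second
  const dataSize = buf.readUInt32LE(40); // size of the PCM payload
  const seconds = dataSize / byteRate;
  if (channels !== 1) throw new Error("Record one speaker on one channel");
  if (seconds > 120) throw new Error("Clip exceeds the 120-second maximum");
  return { channels, seconds };
}

// Usage: checkReferenceClip(fs.readFileSync("sample.wav"));
```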
Step 3: Make Grok talk over WebSocket
The voice agent runs over a single WebSocket session:
- Open the WebSocket.
- Send a `session.update` event.
- Stream user audio into the socket.
- Receive audio deltas back from the model.
Endpoint:
wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0
A minimal Node.js client:
```javascript
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0",
  {
    headers: {
      Authorization: `Bearer ${process.env.XAI_API_KEY}`,
    },
  }
);

// Configure the session as soon as the socket opens.
ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      voice: "ara",
      instructions: "You are a friendly support agent. Keep replies under two sentences.",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      turn_detection: {
        type: "server_vad",
      },
    },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.audio.delta") {
    // Audio arrives base64-encoded; write raw PCM16 to stdout.
    process.stdout.write(Buffer.from(event.delta, "base64"));
  }
  if (event.type === "response.audio.done") {
    console.error("Turn complete");
  }
});
```
User audio is sent with `input_audio_buffer.append` events as base64-encoded PCM16 frames.
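As a sketch, chunking captured PCM16 into those append events looks like this; the 20 ms frame size is a common real-time choice, not a protocol requirement:

```javascript
// Sketch: chunk raw PCM16 audio into input_audio_buffer.append events.
const SAMPLE_RATE = 24000;              // 24 kHz PCM16
const FRAME_SAMPLES = SAMPLE_RATE / 50; // 20 ms per frame
const FRAME_BYTES = FRAME_SAMPLES * 2;  // 2 bytes per 16-bit sample

function pcmToAppendEvents(pcmBuffer) {
  const events = [];
  for (let off = 0; off < pcmBuffer.length; off += FRAME_BYTES) {
    const frame = pcmBuffer.subarray(off, off + FRAME_BYTES);
    events.push(JSON.stringify({
      type: "input_audio_buffer.append",
      audio: frame.toString("base64"),
    }));
  }
  return events;
}

// Usage: pcmToAppendEvents(capturedAudio).forEach((e) => ws.send(e));
```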
The server responds with:
- `response.audio.delta`: streamed audio chunks
- `response.audio.done`: end of the current response turn
PCM16 at 24 kHz is the safe default for browser and desktop apps. Use μ-law when bridging to phone systems.
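The μ-law conversion itself is standard G.711, so a bridge can encode each 16-bit sample in a few lines. A sketch (for telephony you would also downsample 24 kHz to 8 kHz with a proper resampler first):

```javascript
// Sketch: G.711 mu-law encoding of one 16-bit PCM sample to one byte.
function linearToMulaw(sample) {
  const BIAS = 0x84;
  const CLIP = 32635;
  const sign = (sample >> 8) & 0x80;   // keep the sign bit
  if (sign) sample = -sample;          // work with the magnitude
  if (sample > CLIP) sample = CLIP;    // clamp to avoid overflow
  sample += BIAS;
  let exponent = 7;
  for (let mask = 0x4000; (sample & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;                        // find the segment
  }
  const mantissa = (sample >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff; // mu-law bytes are inverted
}
```

Run it over a Buffer of PCM16 samples with `buf.readInt16LE(i * 2)` to produce one μ-law byte per sample.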
Step 4: Add tool use
The voice agent supports function calling, so the model can call your APIs during a conversation.
Declare tools in the session config:
```javascript
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    tools: [
      {
        type: "function",
        name: "lookup_order",
        description: "Look up the status of a customer order by order number.",
        parameters: {
          type: "object",
          properties: {
            order_id: {
              type: "string",
            },
          },
          required: ["order_id"],
        },
      },
    ],
  },
}));
```
When the model wants to call your function, it emits a `response.function_call_arguments.done` event.
Your app should then:
- Parse the function name and arguments.
- Run the function on your side.
- Send the result back with a `conversation.item.create` event of type `function_call_output`.
- Let the model continue and narrate the result.
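Those steps can be sketched as a single handler. The event and item types come from the docs above; field names like `call_id` and `arguments` mirror common realtime APIs and are assumptions:

```javascript
// Sketch: turn a tool-call event into the two follow-up messages.
const tools = {
  lookup_order: ({ order_id }) => ({ order_id, status: "shipped" }),
};

function handleToolCall(event) {
  const args = JSON.parse(event.arguments); // arguments arrive as a JSON string
  const result = tools[event.name](args);   // run the function on your side
  return [
    JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }),
    JSON.stringify({ type: "response.create" }), // let the model narrate the result
  ];
}

// Usage: handleToolCall(event).forEach((msg) => ws.send(msg));
```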
A built-in web_search tool is also available, which is useful when you need fresh data without building a retrieval layer yourself.
Step 5: Use TTS without the voice agent
If you only need text-to-speech for audio prompts, voiceovers, podcast intros, or static app audio, skip the WebSocket and call the REST endpoint.
```shell
curl https://api.x.ai/v1/tts \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-tts-1",
    "voice": "ara",
    "input": "Welcome back to your account. Your last login was Tuesday at 3pm.",
    "format": "mp3"
  }' \
  --output greeting.mp3
```
Supported output formats:
- `mp3`: high-fidelity output
- `mulaw`: 8 kHz telephony output
The TTS endpoint is synchronous. You send text and receive audio bytes back; no streaming session is required.
Step 6: Test the whole flow in Apidog
WebSocket APIs are harder to debug from the terminal because the conversation is stateful. A repeatable test setup helps you isolate changes in voice, instructions, tool calls, and audio frames.
A practical workflow:
- Create a new WebSocket request in Apidog.
- Save the WebSocket URL: `wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0`
- Store your bearer token in an Apidog environment variable.
- Stage a script of JSON messages: `session.update`, `input_audio_buffer.append`, `response.create`.
- Replay the script against one connection.
- Capture every server event into a tree.
- Diff two runs side by side when you change the voice or instructions.
This is useful for catching drift in turn-taking behavior before you ship.
Download Apidog, create a WebSocket request, and paste your XAI_API_KEY under environment variables.
The same collection can also hold your TTS and STT REST requests, so you can keep all Grok Voice surfaces in one project. For more on stateful API testing patterns, see API testing tool for QA engineers.
Free tier limits
The console gives you full access without a per-minute or per-token charge for the voice features themselves. The main limits are operational:
- Rate limits: the console enforces request-per-minute caps on each endpoint to prevent abuse. They are suitable for development and demos, not production traffic.
- Custom voice quota: a single account can hold a finite number of custom voice clones at once. Delete unused clones to free slots.
- Reasoning tokens: when the voice agent uses Grok 4.3 reasoning under the hood, it bills against your console credit. Free credit is enough for prototyping; production requires a paid plan.
If you hit rate-limit errors, batch your requests or move to a paid tier. The API behavior stays the same; only the cap changes.
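A small retry wrapper keeps development scripts resilient to those caps. This is a generic sketch; the 429 status code and the delay schedule are conventions, not something the xAI docs mandate:

```javascript
// Sketch: exponential backoff for rate-limited requests.
function backoffDelays(attempts, baseMs = 500, capMs = 8000) {
  return Array.from({ length: attempts }, (_, i) =>
    Math.min(baseMs * 2 ** i, capMs));
}

async function withRetry(fn, attempts = 5) {
  for (const delay of backoffDelays(attempts)) {
    try {
      return await fn();
    } catch (err) {
      if (err.status !== 429) throw err; // only retry rate limits
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error("Rate limited after all retries");
}

// Usage: const audio = await withRetry(() => callTtsEndpoint(payload));
```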
Compare voices before shipping
Run the same script through every candidate voice before going live. Voices handle tone differently, and short tests catch poor pairings quickly.
Use a small test set:
- A two-sentence greeting.
- A confirmation phrase: “Got it, that’s all set.”
- A longer sentence with a number, a date, and a comma.
Also test the same prompt at different tones:
- Calm
- Normal
- Urgent
Grok’s preset voices handle tone shifts better than many TTS engines we have benchmarked, but you should still audit the actual output for your use case.
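One way to automate the comparison is to generate the full voice-by-line matrix of TTS request bodies and replay them against the endpoint. The payload shape matches the TTS example above; the test lines are illustrative:

```javascript
// Sketch: one TTS request body per voice/line pair, so the same script
// runs through every candidate voice.
const voices = ["eve", "ara", "rex", "sal", "leo"];
const lines = [
  "Welcome back! How can I help you today?",
  "Got it, that's all set.",
  "Your order of 3 items ships on March 12, and arrives within 5 days.",
];

function buildComparisonRequests() {
  return voices.flatMap((voice) =>
    lines.map((input) => ({
      model: "grok-tts-1",
      voice,
      input,
      format: "mp3",
    })));
}

// Usage: POST each body to https://api.x.ai/v1/tts and save the files
// as `${voice}-${index}.mp3` for side-by-side listening.
```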
FAQ
Is the API actually free, or is there a hidden cap?
The voice features — TTS, STT, voice agent, and Custom Voices — carry no per-minute or per-token charge on the console.
The reasoning model under the hood bills against console credit. The console allowance is enough for prototyping.
Do I need an X account?
Yes. Console sign-in uses an X account.
Can I use Grok Voice from a browser?
Yes, but use an ephemeral token.
Mint the token server-side via /v1/realtime/sessions, hand the short-lived token to the browser, and connect to the WebSocket directly. The parent API key should never leave your server.
What audio quality can I expect?
TTS output is available as high-fidelity MP3 or 8 kHz μ-law. The voice agent runs PCM16 at 24 kHz internally.
Quality is on par with major commercial TTS engines; latency is the differentiator.
Does it work with telephony?
Yes. μ-law output is the standard format for SIP and PSTN bridges.
You still need a SIP provider. xAI does not ship its own SIP gateway today.
How does cloning quality compare to other tools?
Cloning quality depends more on reference audio quality than length.
A clean 60-second sample in a quiet room beats a noisy 120-second sample. The resulting voice_id works across both the TTS endpoint and the voice agent without recloning.
Can I use Grok Voice for AI characters in a game?
Yes. The TTS endpoint is fast enough for runtime generation, and Custom Voices lets each character use its own clone.
Watch latency on long lines. Chunked TTS is the recommended pattern.
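A minimal chunking sketch, splitting on sentence boundaries so the first chunk can start playing while the rest is still generating (the 120-character limit is an illustrative choice, not an API constraint):

```javascript
// Sketch: split a long character line into sentence-sized TTS chunks.
function chunkLine(text, maxChars = 120) {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [text];
  const chunks = [];
  let current = "";
  for (const s of sentences) {
    if (current && (current + s).length > maxChars) {
      chunks.push(current.trim()); // flush before the chunk grows too long
      current = "";
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

// Usage: chunkLine(dialogueLine).map((chunk) => requestTts(chunk));
```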
Wrapping up
Grok Voice is a direct path to building a real-time voice agent with no per-minute charge on the xAI Console. Start with a console key, pick a preset voice, test a WebSocket session, and only then add custom voice cloning or tool calls.
The fastest validation loop is:
- Script a session in Apidog.
- Run it against three preset voices.
- Compare latency, tone, and turn-taking.
- Add tool calls once the base conversation works.
When you are ready to plug it into Grok 4.3 reasoning, see the Grok 4.3 API guide. For a side-by-side against OpenAI’s stack, see Grok Voice vs GPT-Realtime.

