OpenAI recently introduced a powerful feature called Predicted Outputs. It was announced on the company's developer account on X but received little attention from the technical media, so I decided to highlight it here, because it's a genuinely useful feature.
Predicted Outputs significantly reduce latency for model responses, especially when much of the output is known ahead of time. This feature is particularly beneficial for applications that involve regenerating text documents or code files with minor modifications.
What Are Predicted Outputs?
Predicted Outputs allow developers to speed up API responses from Chat Completions when the expected output is largely predictable. By providing a prediction of the expected response using the prediction parameter in Chat Completions, the model can generate the required output more efficiently. This functionality is currently available with the latest GPT-4o and GPT-4o-mini models.
How Do Predicted Outputs Work?
When you have a response where most of the content is already known, you can supply that expected content as a prediction to the model. The model then uses this prediction to expedite the generation of the response, reducing latency and improving performance.
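One simplistic way to picture the idea: the model checks the predicted tokens against what it would have generated itself, and stretches that match can be passed through cheaply, while mismatches fall back to normal (slower) generation. The toy function below is my own illustration of that intuition, not OpenAI's actual algorithm.

```javascript
// Toy illustration only: counts how many predicted tokens line up
// positionally with the tokens the model actually produces. This is
// NOT how OpenAI implements Predicted Outputs internally.
function countAcceptedTokens(predicted, actual) {
  let accepted = 0;
  const n = Math.min(predicted.length, actual.length);
  for (let i = 0; i < n; i++) {
    if (predicted[i] === actual[i]) accepted++;
  }
  return { accepted, rejected: predicted.length - accepted };
}

// A one-word edit leaves most of the prediction usable:
const predicted = ["{", '"enableFeatureX"', ":", "false", "}"];
const actual    = ["{", '"enableFeatureX"', ":", "true",  "}"];
console.log(countAcceptedTokens(predicted, actual)); // { accepted: 4, rejected: 1 }
```

The takeaway: the closer your prediction is to the final output, the more of the work is skipped, which is exactly why this feature shines for small edits to large files.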
Custom Example: Updating Configuration Files
Imagine you have a JSON configuration file that needs a minor update. Here's an example of such a file:
{
  "appName": "MyApp",
  "version": "1.0.0",
  "settings": {
    "enableFeatureX": false,
    "maxUsers": 100
  }
}
Suppose you want to update enableFeatureX to true. Instead of generating the entire file from scratch, you can provide the original file as a prediction and instruct the model to make the necessary changes.
import OpenAI from "openai";

const config = `
{
  "appName": "MyApp",
  "version": "1.0.0",
  "settings": {
    "enableFeatureX": false,
    "maxUsers": 100
  }
}
`.trim();

const openai = new OpenAI();

const updatePrompt = `
Change "enableFeatureX" to true in the following JSON configuration. Respond only with the updated JSON, without any additional text.
`;

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "user", content: updatePrompt },
    { role: "user", content: config }
  ],
  prediction: {
    type: "content",
    content: config
  }
});

// Output the updated configuration
console.log(completion.choices[0].message.content);
In this example, the model quickly generates the updated configuration file, leveraging the prediction to minimize response time.
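You can also check how much of your prediction was actually used: the response's usage object reports accepted and rejected prediction token counts under completion_tokens_details. A small sketch of summarizing them (the sample usage object below is illustrative, not a real API response):

```javascript
// Summarize prediction effectiveness from a Chat Completions usage
// object. Field names follow OpenAI's documented usage schema.
function predictionStats(usage) {
  const details = usage.completion_tokens_details ?? {};
  const accepted = details.accepted_prediction_tokens ?? 0;
  const rejected = details.rejected_prediction_tokens ?? 0;
  const total = accepted + rejected;
  return {
    accepted,
    rejected,
    acceptanceRate: total > 0 ? accepted / total : 0,
  };
}

// Illustrative sample; in practice pass completion.usage from a real response.
const sampleUsage = {
  completion_tokens: 40,
  completion_tokens_details: {
    accepted_prediction_tokens: 35,
    rejected_prediction_tokens: 5,
  },
};
console.log(predictionStats(sampleUsage)); // { accepted: 35, rejected: 5, acceptanceRate: 0.875 }
```

A high acceptance rate means the prediction paid off; a low one means you are paying for rejected tokens without the latency benefit.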
Streaming with Predicted Outputs
For applications that require streaming responses, Predicted Outputs offer even greater latency reductions. Here's how you can implement the previous example using streaming:
import OpenAI from "openai";

const config = `...`; // Original JSON configuration

const openai = new OpenAI();

const updatePrompt = `...`; // Prompt as before

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [ /* ... */ ],
  prediction: {
    type: "content",
    content: config
  },
  stream: true
});

for await (const chunk of completion) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
At the time of writing, I'm not aware of a comparable offering from competitors. OpenAI's Predicted Outputs appears to be a unique feature that addresses the specific need for latency reduction when regenerating known content with minor modifications.
Tips for Developers
The best part is how little it takes to adopt: you add a single new parameter to an existing API request, and that's it. This makes the feature very easy to integrate into existing applications.
Limitations
While Predicted Outputs offer significant advantages, there are important considerations:
- Model Compatibility: Only the gpt-4o and gpt-4o-mini models support Predicted Outputs. In practice this is not much of a limitation: they are currently OpenAI's strongest models, and the original GPT-4 would be too slow for this use case anyway.
- Billing Implications: Rejected prediction tokens are billed at the same rate as generated tokens. Monitor the rejected_prediction_tokens in the usage data to manage costs.
- Unsupported Parameters:
  - n (values higher than 1)
  - logprobs
  - presence_penalty (values greater than 0)
  - frequency_penalty (values greater than 0)
  - max_completion_tokens
  - tools (function calling is not supported)
- Modality Restrictions: Only text modalities are supported; audio inputs and outputs are incompatible with Predicted Outputs.
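Because rejected prediction tokens are billed at the output-token rate, it's worth estimating the overhead before relying on large predictions. A rough sketch, using a placeholder price (substitute the current output price for your model from OpenAI's pricing page):

```javascript
// Rough cost estimate for rejected prediction tokens.
// PRICE_PER_OUTPUT_TOKEN is a placeholder, not a real OpenAI price.
const PRICE_PER_OUTPUT_TOKEN = 10 / 1_000_000; // hypothetical $10 per 1M output tokens

function rejectedTokenCost(rejectedPredictionTokens, pricePerToken = PRICE_PER_OUTPUT_TOKEN) {
  return rejectedPredictionTokens * pricePerToken;
}

// 5,000 rejected tokens at the placeholder price:
console.log(rejectedTokenCost(5000)); // ≈ 0.05 (dollars)
```

If your predictions routinely miss, you pay for the rejected tokens twice over: once in money and once in lost latency savings.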
Conclusion
OpenAI's Predicted Outputs is a groundbreaking feature that addresses a common challenge in AI applications: reducing latency when the response is largely predictable. By allowing developers to supply expected outputs, it accelerates response times and enhances user experience.
In my personal opinion, OpenAI's models are not as strong as Anthropic's. However, OpenAI ships many genuinely useful solutions in other areas, and features like Predicted Outputs set it apart from other AI providers by meeting specific needs in the developer community.