Stream LLM Responses from Cache

LLM costs grow as your app consumes more tokens. Portkey's AI gateway lets you cache LLM responses and serve users from the cache to save costs. Here's the best part: caching now works with streaming enabled.

Streams are an efficient way to work with large responses because:

  • They reduce perceived latency for your users.
  • Your app doesn't have to buffer the entire response in memory.

Let's walk through how to get cached responses to your app through streams, chunk by chunk. Every time Portkey serves a request from the cache, you save the token costs for that request.

With streaming and caching enabled, we will make a chat completion call to OpenAI through Portkey.

Import and instantiate the Portkey client.

import Portkey from "portkey-ai";

const portkey = new Portkey({
  apiKey: process.env.PORTKEYAI_API_KEY,   // Portkey API key
  virtualKey: process.env.OPENAI_API_KEY,  // Virtual Key referencing the OpenAI key stored in Portkey's vault
  config: {
    cache: {
      mode: "semantic"                     // serve semantically similar requests from the cache
    }
  }
});

  • apiKey: Sign up for Portkey and copy your API key.
  • virtualKey: Securely store your provider key in Portkey's vault and reference it with a Virtual Key.
  • config: Pass configurations to enable caching.
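
If you'd rather not define the cache config inline, Portkey also supports saving configs in its dashboard and referencing them by ID. Here's a quick sketch; the config ID below is a placeholder, so use the one from your own dashboard.

import Portkey from "portkey-ai";

// Same client as above, but referencing a config saved in the Portkey dashboard.
// "pc-birthday-cache" is a placeholder ID; replace it with your own config's ID.
const portkey = new Portkey({
  apiKey: process.env.PORTKEYAI_API_KEY,
  virtualKey: process.env.OPENAI_API_KEY,
  config: "pc-birthday-cache",
});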

Our app will list tasks to help plan a birthday party.

const messages = [
  {
    role: "system",
    content: "You are a very good program manager and have organised many events before. You break every task into simple steps so that others can pick it up.",
  },
  {
    role: "user",
    content: "Help me plan a birthday party for my 8 yr old kid?",
  },
];

Portkey follows the same signature as OpenAI's SDK, so enabling streamed responses is as simple as passing the stream: true option.

try {
  const response = await portkey.chat.completions.create({
    messages,
    model: "gpt-3.5-turbo",
    stream: true,
  });
  // Write each chunk to stdout as soon as it arrives
  for await (const chunk of response) {
    process.stdout.write(chunk.choices[0]?.delta?.content || "");
  }
} catch (error) {
  console.error("Error while streaming the chat completion:", error);
}

You can iterate over the response object, processing each chunk and presenting it to the user as soon as it's received.
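
If you also need the full message once the stream finishes (to store it in chat history, for example), you can accumulate the chunks while rendering them. A minimal sketch, reusing the portkey client and messages from above:

// Accumulate streamed chunks into the full response while rendering them.
let fullText = "";

const stream = await portkey.chat.completions.create({
  messages,
  model: "gpt-3.5-turbo",
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content || "";
  fullText += delta;            // keep building the complete response
  process.stdout.write(delta);  // show each chunk as soon as it arrives
}

// fullText now holds the entire response, whether it came from the cache or the provider.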

Here's a tip: you can force a fresh response instead of a cache HIT by passing cacheForceRefresh.

const response = await portkey.chat.completions.create({
  messages,
  model: "gpt-3.5-turbo",
  stream: true,
}, {
  cacheForceRefresh: true,   // bypass the cache and fetch a fresh response
});

Combined with caching, streaming gives your users a smoother experience while keeping your app's memory usage efficient.

Put this into practice today!
