Vrushank for Portkey
Stream LLM Responses from Cache

LLMs get more expensive as your app consumes more tokens. Portkey's AI gateway lets you cache LLM responses and serve users from the cache to save costs. And here's the best part: caching now works with streaming enabled.

Streams are an efficient way to work with large responses because:

  • They reduce the perceived latency for users of your app.
  • Your app doesn't have to buffer the entire response in memory.

Let's check out how to get cached responses to your app through streams, chunk by chunk. Every time Portkey serves a request from the cache, you save on token costs.

With streaming and caching enabled, we will make a chat completion call to OpenAI through Portkey.

Import and instantiate the Portkey client.

import Portkey from "portkey-ai";

const portkey = new Portkey({
  apiKey: process.env.PORTKEYAI_API_KEY,  // your Portkey API key
  virtualKey: process.env.OPENAI_API_KEY, // virtual key referencing the provider key stored in Portkey's vault
  config: {
    cache: {
      mode: "semantic"                    // enable semantic caching on the gateway
    }
  }
});
  • apiKey — Sign up for Portkey and copy your API key.
  • virtualKey — Securely store your provider key in the vault and reference it using Virtual Keys.
  • config — Pass configurations to enable caching.
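
If you prefer to manage configurations in the Portkey dashboard, you can also pass a saved config by its ID instead of an inline object. A minimal sketch, assuming you've already created a config with semantic caching enabled (the config ID below is a placeholder):

import Portkey from "portkey-ai";

// Reference a config saved in the Portkey dashboard instead of defining it inline.
// "pc-cache-config" is a placeholder — use your own config ID.
const portkey = new Portkey({
  apiKey: process.env.PORTKEYAI_API_KEY,
  virtualKey: process.env.OPENAI_API_KEY,
  config: "pc-cache-config",
});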

Our app will list out the tasks involved in planning a birthday party.

const messages = [
    {
        role: "system",
        content: "You are a very good program manager and have organised many events before. You can break every task into simple steps so that others can pick them up.",
    },
    {
        role: "user",
        content: "Help me plan a birthday party for my 8 yr old kid?",
    },
];

Portkey follows the same signature as OpenAI's SDK, so enabling streaming is as simple as passing the stream: true option.

try {
    const response = await portkey.chat.completions.create({
        messages,
        model: "gpt-3.5-turbo",
        stream: true,        // ask for a streamed response
    });
    // Write each chunk to stdout as soon as it arrives
    for await (const chunk of response) {
        process.stdout.write(chunk.choices[0]?.delta?.content || "");
    }
} catch (error) {
    console.error("Error while streaming the response:", error);
}

You can iterate over the response object, processing each chunk and presenting it to the user as soon as it's received.
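
If you also want the complete reply once the stream ends (say, to store it or render it again), you can accumulate the chunks as they arrive. A minimal sketch of the same loop, using plain JavaScript and no extra Portkey features:

let fullText = "";
for await (const chunk of response) {
    const delta = chunk.choices[0]?.delta?.content || "";
    fullText += delta;           // keep the full reply for later use
    process.stdout.write(delta); // still stream it to the user as it arrives
}
console.log("\nReceived", fullText.length, "characters in total.");

Note that the stream can only be consumed once, so this replaces the loop above rather than running after it.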

Here's a tip: you can bypass a cache HIT and force a fresh response by passing cacheForceRefresh in the request options.

const response = await portkey.chat.completions.create({
    messages,
    model: "gpt-3.5-turbo",
    stream: true,
}, {
    cacheForceRefresh: true // skip the cached entry for this request
});
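You can also control how long cached responses stay fresh. A sketch, assuming the max_age cache setting (in seconds) from Portkey's gateway config — the key name here is an assumption, so verify it against your config schema:

const portkey = new Portkey({
  apiKey: process.env.PORTKEYAI_API_KEY,
  virtualKey: process.env.OPENAI_API_KEY,
  config: {
    cache: {
      mode: "semantic",
      max_age: 3600, // assumed setting: cache entries expire after an hour
    },
  },
});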

Streaming gives your users a smoother experience and keeps your app's memory usage low, while caching keeps your token costs down.

Put this into practice today!
