Vrushank for Portkey
Stream LLM Responses from Cache

LLMs get more expensive as your app consumes more tokens. Portkey's AI gateway lets you cache LLM responses and serve users from the cache to save costs. And here's the best part: caching now works with streaming enabled.

Streams are an efficient way to work with large responses because:

  • They reduce the perceived latency for users of your app.
  • Your app doesn't have to buffer the entire response in memory.

Let's check out how to get cached responses to your app through streams, chunk by chunk. Every time Portkey serves a request from the cache, you save on token costs.

With streaming and caching enabled, we will make a chat completion call to OpenAI through Portkey.

Import and instantiate the Portkey client.

import Portkey from "portkey-ai";

const portkey = new Portkey({
  apiKey: process.env.PORTKEYAI_API_KEY,  // your Portkey API key
  virtualKey: process.env.OPENAI_API_KEY, // virtual key referencing the provider key stored in Portkey's vault
  config: {
    cache: {
      mode: "semantic"                    // enable semantic caching on the gateway
    }
  }
});
  • apiKey — Sign up for Portkey and copy your API key.
  • virtualKey — Securely store your provider key in the vault and reference it using Virtual Keys.
  • config — Pass configurations to enable caching.
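
If you prefer to manage configurations in the Portkey dashboard, you can also pass a saved config by its ID instead of an inline object. A minimal sketch, assuming you've already created a config with semantic caching enabled (the config ID below is a placeholder):

import Portkey from "portkey-ai";

// Reference a config saved in the Portkey dashboard instead of defining it inline.
// "pc-cache-config" is a placeholder — use your own config ID.
const portkey = new Portkey({
  apiKey: process.env.PORTKEYAI_API_KEY,
  virtualKey: process.env.OPENAI_API_KEY,
  config: "pc-cache-config",
});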

Our app will list out the tasks involved in planning a birthday party.

const messages = [
    {
        role: "system",
        content: "You are a very good program manager and have organised many events before. You can break every task into simple steps so that others can pick them up.",
    },
    {
        role: "user",
        content: "Help me plan a birthday party for my 8 yr old kid?",
    },
];

Portkey follows the same signature as OpenAI's SDK, so enabling streaming is as simple as passing the stream: true option.

try {
    const response = await portkey.chat.completions.create({
        messages,
        model: "gpt-3.5-turbo",
        stream: true,        // ask for a streamed response
    });
    // Write each chunk to stdout as soon as it arrives
    for await (const chunk of response) {
        process.stdout.write(chunk.choices[0]?.delta?.content || "");
    }
} catch (error) {
    console.error("Error while streaming the response:", error);
}

You can iterate over the response object, processing each chunk and presenting it to the user as soon as it's received.
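
If you also want the complete reply once the stream ends (say, to store it or render it again), you can accumulate the chunks as they arrive. A minimal sketch of the same loop, using plain JavaScript and no extra Portkey features:

let fullText = "";
for await (const chunk of response) {
    const delta = chunk.choices[0]?.delta?.content || "";
    fullText += delta;           // keep the full reply for later use
    process.stdout.write(delta); // still stream it to the user as it arrives
}
console.log("\nReceived", fullText.length, "characters in total.");

Note that the stream can only be consumed once, so this replaces the loop above rather than running after it.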

Here's a tip: you can bypass a cache HIT and force a fresh response by passing cacheForceRefresh in the request options.

const response = await portkey.chat.completions.create({
    messages,
    model: "gpt-3.5-turbo",
    stream: true,
}, {
    cacheForceRefresh: true // skip the cached entry for this request
});
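You can also control how long cached responses stay fresh. A sketch, assuming the max_age cache setting (in seconds) from Portkey's gateway config — the key name here is an assumption, so verify it against your config schema:

const portkey = new Portkey({
  apiKey: process.env.PORTKEYAI_API_KEY,
  virtualKey: process.env.OPENAI_API_KEY,
  config: {
    cache: {
      mode: "semantic",
      max_age: 3600, // assumed setting: cache entries expire after an hour
    },
  },
});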

Streaming gives your users a smoother experience and keeps your app's memory usage low, while caching keeps your token costs down.

Put this into practice today!
