Aravind Putrevu for Portkey


How to create LLM fallback from Gemini Flash to GPT-4o?

Generative AI has been the hottest technology trend of the past year, from enterprises to startups. Almost every brand is incorporating GenAI and Large Language Models (LLMs) in its solutions.

However, an underexplored part of Generative AI is managing resiliency. It is easy to build on an API provided by an LLM vendor like OpenAI, but it is hard to keep the application running when that vendor has a service disruption.

In this blog, we will take a look at how you can create a resilient generative AI application that falls back from Gemini Flash to GPT-4o by using the fallback feature of the open-source AI Gateway.

Before that..

What is a fallback?

In a scenario involving APIs, a fallback strategy for high availability typically uses a load balancer configured with both active and standby endpoints. When the active endpoint or server goes down, one of the configured standby endpoints takes over and continues to serve the incoming traffic.
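As a minimal sketch of this idea, the snippet below picks the first healthy endpoint from an ordered list. The endpoint names, URLs, and the `healthy` flag are hypothetical and only illustrate the active/standby pattern; a real load balancer would determine health through probes.

```javascript
// Hypothetical endpoint list: the first entry is the active endpoint,
// the rest are standbys, in order of preference.
const endpoints = [
  { name: "active", url: "https://api-primary.example.com", healthy: false },
  { name: "standby-1", url: "https://api-standby.example.com", healthy: true },
];

// Return the first healthy endpoint, or null when every endpoint is down.
function pickEndpoint(list) {
  return list.find((endpoint) => endpoint.healthy) ?? null;
}

const chosen = pickEndpoint(endpoints);
console.log(chosen ? `Routing traffic to ${chosen.name}` : "No endpoint available");
```

Because the active endpoint is marked unhealthy here, traffic is routed to the standby.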

Why do we need fallbacks?

Fallbacks ensure application resiliency in disaster scenarios and aid quick recovery.

Note: In many cases, some loss of incoming traffic (such as HTTP requests) during recovery is a common phenomenon.

Why fallbacks in LLMs?

In the context of Generative AI, having a fallback strategy is crucial to managing resiliency. The scenario is no different from traditional server resiliency: if the active LLM becomes unavailable, one of the configured secondary LLMs takes over and continues to serve incoming requests, thereby maintaining an uninterrupted experience for users.

Challenges in creating fallbacks for LLMs

While fallbacks for LLMs look very similar in concept to managing server resiliency, in reality it is harder to simply switch over and get similar output quality and experience. The ecosystem is growing, providers follow different standards, and each model exposes its own levers for changing the output.

Moreover, given the changing landscape of LLMs and LLM providers, the amount of custom logic and effort needed to add this functionality will be hard to justify for someone whose core business is not managing LLMs.
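To make the "multiple standards" point concrete, here is a simplified sketch that converts OpenAI-style chat messages into the `contents` shape used by Gemini's REST API, as we understand the two public request formats. The helper name `toGeminiContents` is hypothetical, and the sketch only covers plain user and assistant text messages. A gateway handles this kind of translation so that the application does not have to.

```javascript
// Convert OpenAI-style chat messages into a Gemini-style "contents" array.
// Simplified: handles only user/assistant text messages, where the
// assistant role is called "model" on the Gemini side.
function toGeminiContents(messages) {
  return messages.map(({ role, content }) => ({
    role: role === "assistant" ? "model" : "user",
    parts: [{ text: content }],
  }));
}

const openAiStyle = [
  { role: "user", content: "Hello" },
  { role: "assistant", content: "Hi, how can I help?" },
];

console.log(JSON.stringify(toGeminiContents(openAiStyle), null, 2));
```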

Using open-source AI Gateway to implement fallbacks

To demonstrate the fallback feature, we'll build a sample Node.js application and integrate Google's Gemini. We'll use the OpenAI SDK and Portkey's open-source AI Gateway to demonstrate the fallback to GPT-4o.

If you are new to AI Gateway, you can refer to our previous post to learn about the features of the open-source AI Gateway.


Creating Node.js Project

To start our project, we need to set up a Node.js environment. The command below initializes a new Node.js project.

npm init

Install Dependencies

Let's install the required dependencies of our project.

npm install express body-parser dotenv openai portkey-ai

This will install the following packages:

  • express: a popular web framework for Node.js

  • body-parser: middleware for parsing request bodies

  • openai: the OpenAI SDK, which we'll point at the gateway

  • portkey-ai: a package that lets us access multiple AI models

  • dotenv: loads environment variables from a .env file

Setting Environment Variables

Next, we'll create a .env file to securely store sensitive information such as API credentials.

# .env
GEMINI_API_KEY=YOUR_API_KEY
PORT=3000

Get API Key

Before using Gemini, we need to set up API credentials with Google. For that, we need to sign in to our Google account and create an API key.

Once signed in, go to Google AI Studio.

Click on the Create API key button. It will generate a unique API key that we'll use to authenticate requests to the Google Generative AI API.

After getting the API key we'll update the .env file with our API key.
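A missing key usually surfaces later as a confusing authentication error. As a small sketch, a helper like the one below can fail fast at startup instead. The name `requireEnv` is hypothetical and not part of dotenv; the `env` parameter exists only so the helper can be tried without touching the real environment.

```javascript
// Read a required setting and fail fast with a clear message if it is missing.
// `env` defaults to process.env and is a parameter only to make testing easy.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Example with an in-memory object instead of the real environment.
const sampleEnv = { GEMINI_API_KEY: "YOUR_API_KEY", PORT: "3000" };
console.log(requireEnv("PORT", sampleEnv));
```

In the application, this could be called as `requireEnv("GEMINI_API_KEY")` right after `dotenv.config()`.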

Create Express Server

Let's create an index.js file in the root directory and set up a basic Express server.

const express = require("express");
const dotenv = require("dotenv");

dotenv.config();

const app = express();
const port = process.env.PORT;

app.get("/", (req, res) => {
  res.send("Hello World");
});

app.listen(port, () => {
  console.log(`Server running on port ${port}`);
});

Here, we're using the dotenv package to read the PORT number from the .env file.

At the top of the file, we load the environment variables with dotenv.config() so that they are accessible throughout the file.

Executing the project

In this step, we'll add a start script to the package.json file to easily run our project.

Add the following script to the package.json file:

"scripts": {
  "start": "node index.js"
}

The package.json file should look like below:

Let's run the project using the following command:

npm run start

The above command will start the Express server. Now if we go to http://localhost:3000, we'll get this:

Hello World

The project setup is now done. In the next section, we'll add Gemini to our project.

Adding Google Gemini

Set up Route

To add Gemini to our project, we'll create a /generate route where we'll communicate with Gemini.

For that, add the following code to the index.js file:

const bodyParser = require("body-parser");
const { generateResponse } = require("./controllers/index.js");

//middleware to parse the body content to JSON
app.use(bodyParser.json());

app.post("/generate", generateResponse);

Here, we're using the body-parser middleware to parse the request body as JSON.

Configure OpenAI Client with Portkey Gateway

Let's create a controllers folder and create an index.js file within it.

Here, we will create a new controller function to handle the /generate route declared in the above code.

First, we'll import the required packages and API keys that we'll be using.

Note: Portkey adheres to OpenAI API compatibility. Using Portkey AI further enables you to communicate with any LLM using our universal API feature.

const OpenAI = require("openai");
const dotenv = require("dotenv");
const { createHeaders } = require("portkey-ai");

dotenv.config();
const GEMINIKEY = process.env.GEMINI_API_KEY;

Then, we'll instantiate our OpenAI client and pass the relevant provider details.

const gateway = new OpenAI({
  apiKey: GEMINIKEY,
  baseURL: "http://localhost:8787/v1", // local Portkey gateway
  defaultHeaders: createHeaders({
    provider: "google",
  }),
});

Note: To integrate the Portkey gateway with the OpenAI SDK, we have:

  • Set the baseURL to the Portkey Gateway URL

  • Included Portkey-specific headers such as provider

Implement Controller Function

Now, we'll write a controller function, generateResponse, to handle the /generate route and respond to user requests.

const generateResponse = async (req, res) => {
  try {
    const { prompt } = req.body;

    // Forward the prompt to Gemini through the gateway
    const completion = await gateway.chat.completions.create({
      messages: [{ role: "user", content: prompt }],
      model: "gemini-1.5-flash-latest",
    });

    const text = completion.choices[0].message.content;

    res.send({ response: text });
  } catch (err) {
    console.error(err);
    res.status(500).json({ message: "Internal server error" });
  }
};

// Export in CommonJS form so that index.js can load it with require()
module.exports = { generateResponse };

Here we are taking the prompt from the request body and generating a response based on the prompt using the gateway.chat.completions.create method.
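The controller passes req.body.prompt straight to the gateway. As a minimal sketch, a small helper could check the prompt first, so that the route can answer with a 400 for a missing or empty prompt instead of a 500. The name `extractPrompt` is hypothetical and not part of the original code.

```javascript
// Return a trimmed prompt string, or null when the body has no usable prompt.
function extractPrompt(body) {
  const prompt = body && body.prompt;
  if (typeof prompt !== "string" || prompt.trim() === "") return null;
  return prompt.trim();
}

console.log(extractPrompt({ prompt: "  Are you an OpenAI model?  " }));
console.log(extractPrompt({}));
```

Inside the controller, a null result could be handled with `return res.status(400).json({ message: "prompt is required" });` before calling the gateway.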

Run Gateway Locally

To run the gateway locally, run the following command in your terminal:

npx @portkey-ai/gateway

This will spin up the gateway locally at http://localhost:8787/

Run the project

Now, let's check whether our app is working correctly.

Let's run our project using:

npm run start

Validating Gemini's Response

Next, we'll make a POST request using Postman to validate our controller function.

We'll send a POST request to http://localhost:3000/generate with the following JSON payload:

{
  "prompt": "Are you an OpenAI model?"
}
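If Postman is not at hand, the same request can be sent from a short script. The sketch below assumes Node.js 18 or later (for the built-in fetch) and that the server from this post is listening on port 3000. The helper names `buildGenerateRequest` and `sendPrompt` are hypothetical.

```javascript
// Build the URL and fetch options for a POST to the /generate route.
function buildGenerateRequest(prompt, baseUrl = "http://localhost:3000") {
  return {
    url: `${baseUrl}/generate`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
    },
  };
}

// Send the request with the built-in fetch and return the parsed JSON body.
async function sendPrompt(prompt) {
  const { url, options } = buildGenerateRequest(prompt);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return res.json();
}
```

With the server running, `sendPrompt("Are you an OpenAI model?").then(console.log)` sends the payload shown above.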


And we got our response:

{
    "response": "I am a large language model, trained by Google. \n"
}

Great! Our Gemini integration is working as expected!

Adding Fallback using AI Gateway

So far, the project is working as expected. But what if Gemini's API doesn't respond?

As discussed earlier, a resilient app yields a better customer experience.

That's where Portkey's AI Gateway shines. It has a fallback feature that seamlessly switches between LLMs based on their availability.

If the primary LLM fails to respond or encounters an error, AI Gateway will automatically fall back to the next LLM in the list, ensuring our application's robustness and reliability.

Now, let's add the fallback feature to our project!

Create Portkey Configs

First, we'll create a Portkey configuration to define routing rules for all the requests coming to our gateway. The config references an OpenAI API key, so add it to the .env file as well (the variable name OPENAI_API_KEY is our assumption). Then add the following code to the controller file:

// Assumes OPENAI_API_KEY has been added to the .env file
const OpenAIKEY = process.env.OPENAI_API_KEY;

const configObj = {
  "strategy": {
    "mode": "fallback"
  },
  "targets": [
    {
      "provider": "google",
      "api_key": GEMINIKEY // Add your Gemini API Key
    },
    {
      "provider": "openai",
      "api_key": OpenAIKEY,
      "override_params": {
        "model": "gpt-4o"
      }
    }
  ]
}

This config will fall back to OpenAI's gpt-4o if Google's gemini-1.5-flash-latest fails.
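To build intuition for what such a config means, here is a simplified sketch of how a fallback config could be interpreted: try each target in order, apply that target's override_params, and return the first success. This is not Portkey's actual implementation. `runWithFallback` and `callProvider` are hypothetical names, and the provider call is faked so that the snippet runs without any API keys.

```javascript
// Try each target in order and return the first successful result.
// `callProvider` is a hypothetical stand-in for the real HTTP call.
async function runWithFallback(config, params, callProvider) {
  let lastError;
  for (const target of config.targets) {
    // Per-target overrides (for example, a different model) win over the request params.
    const finalParams = { ...params, ...(target.override_params || {}) };
    try {
      return await callProvider(target.provider, finalParams);
    } catch (err) {
      lastError = err; // remember the failure and move on to the next target
    }
  }
  throw lastError ?? new Error("No targets configured");
}

// Demo with a fake provider call: "google" always fails, "openai" succeeds.
const demoConfig = {
  strategy: { mode: "fallback" },
  targets: [
    { provider: "google" },
    { provider: "openai", override_params: { model: "gpt-4o" } },
  ],
};

const fakeCall = async (provider, params) => {
  if (provider === "google") throw new Error("google unavailable");
  return `${provider} answered with ${params.model}`;
};

const demoResult = runWithFallback(demoConfig, { model: "gemini-1.5-flash-latest" }, fakeCall);
demoResult.then((text) => console.log(text));
```

Note how override_params replaces the Gemini model name with gpt-4o for the second target, which is why the same request can be served by a different provider.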

Update OpenAI Client

To add the Portkey config to our OpenAI client, we'll simply pass the config object to createHeaders in defaultHeaders.

const gateway = new OpenAI({
  apiKey: GEMINIKEY,
  baseURL: "http://localhost:8787/v1",
  defaultHeaders: createHeaders({
    provider: "google",
    config: configObj
  })
})

Note: If we want to attach the configuration to only a few requests instead of modifying the client, we can send it in the request headers. For example:

const reqHeaders = createHeaders({ config: configObj });
gateway.chat.completions.create({
  messages: [{ role: "user", content: "Say this is a test" }],
  model: "gemini-1.5-flash-latest"
}, { headers: reqHeaders });

Also, if you have a default configuration set in the client but also include a configuration in a specific request, the request-specific configuration takes precedence and replaces the default config for that particular request.
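The precedence rule can be pictured as a simple merge where request-level values overwrite client-level ones. The snippet below only illustrates that rule. `resolveSettings` and the placeholder values are hypothetical and do not reflect how the SDK merges headers internally.

```javascript
// Merge client-level and request-level settings; request-level keys win.
function resolveSettings(clientDefaults, requestOverrides = {}) {
  return { ...clientDefaults, ...requestOverrides };
}

const clientDefaults = { provider: "google", config: "client-level-config" };
const perRequest = { config: "request-level-config" };

console.log(resolveSettings(clientDefaults, perRequest));
```

The provider stays as set on the client, while the config comes from the individual request.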

That's it! Our setup is done.

Testing the Fallback

To see whether our fallback feature is working, we'll remove the Gemini API key from the .env file and restart the server. Then we'll send a POST request to http://localhost:3000/generate with the following JSON payload:

{
  "prompt": "Are you an OpenAI model?"
}


And we'll get this response:

{
    "response": "Yes, I am powered by the OpenAI text generation model known as GPT-4o."
}

Awesome! This means our fallback feature is working perfectly!

Because we deleted the Gemini API key, the first request failed. Portkey detected the failure and automatically fell back to the next LLM in the list, which is OpenAI's gpt-4o.

Conclusion

In this article, we explored how to integrate Gemini into a Node.js application and how to leverage AI Gateway's fallback feature when Gemini is not available.

If you want to know more about Portkey's AI Gateway, give us a star and join our LLMs in Production Discord to hear more about what other AI engineers are building.

Happy Building!
