Compare (OpenAI) LLM performance from Supabase Edge function directly

EDIT: added information about how to configure "inputs" for your Supabase Edge function and how to run the CLI tool.

As I continued building my app, I started to wonder: am I using the optimal AI model for my Edge functions? Behind the scenes, these Edge functions make a call to the OpenAI API, and I'm sure each model has its strengths and weaknesses, so I wanted a way to compare them without manually changing the model, capturing responses, and so on.

I then started looking for a utility tool out there, thinking I couldn't be the first one who wanted to do this. I found similar tools like promptfoo, but the starting point is different: while I admit that measuring the LLM call directly is probably the most generic approach, I want to call the Supabase Edge function directly because there is quite a bit of logic inside my Edge functions, and I don't want to manually extract all of the inputs to the LLM myself.

My idea is simple. Since I'm already using Supabase Edge functions, I can create a (temporary?) table to hold the data, so I don't need to modify the response object from the Edge function. I can also easily extend the request parameters to the Edge function without breaking functionality, so I can pass extra information like the model and an ID to mark each group of runs. To measure pure LLM performance, I do need to inject stopwatch-style timing code into my Edge function, but I learned that you can simply do:

const start = performance.now();

// make the LLM call, e.g.
const analyzeResponse = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  ...
});

const end = performance.now();
const latencyMs = end - start;
console.log(`Latency: ${latencyMs} ms`);

and apparently the performance object is built into Deno, so I figured that's not too bad. Then, at the end, I simply dump the record into a database table (and since I wanted to do this for multiple Edge functions, I wrote a utility tool; I'll share the link to the repo at the end of this article where you can see all of this code).
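
For illustration, the dump itself can look something like this from inside the Edge function (a minimal sketch with supabase-js; the llm_benchmark table, its columns, and the runId/model variables are my assumptions here, not the exact schema from the repo):

import { createClient } from 'npm:@supabase/supabase-js@2';

// table/column names are illustrative -- see the repo for the real schema
const supabase = createClient(
  Deno.env.get('SUPABASE_URL')!,
  Deno.env.get('SUPABASE_SERVICE_ROLE_KEY')!,
);

await supabase.from('llm_benchmark').insert({
  run_id: runId,         // comes in with the request, groups one comparison run
  model,                 // which OpenAI model this call used
  latency_ms: latencyMs, // the performance.now() delta from above
});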

To minimize modifications to the Edge function itself, I insert records twice: once from the Supabase Edge function and once from the CLI tool. Again, I need to do this so I can capture the raw response from the LLM while still being able to group the executions. I feel this is a good compromise, but you might not like it (if so, feel free to modify the script; this is expected to run locally, which is why I made these tools public. Pull requests are always welcome!)
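
Conceptually the CLI side boils down to something like this (a simplified sketch, not the actual script; the scenario object is the parsed input file shown below, and the table name is an assumption):

import { createClient } from '@supabase/supabase-js';

// simplified sketch of the CLI flow -- the real script lives in the repo
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);
const runId = crypto.randomUUID(); // one ID groups the whole comparison run

for (const model of scenario.models) {
  // the extra parameters (model, runId) ride along with the normal inputs
  const response = await fetch(`${process.env.SUPABASE_URL}/functions/v1/${functionName}`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.SUPABASE_ANON_KEY}`,
    },
    body: JSON.stringify({ ...scenario.inputs, model, runId }),
  });

  // second insert: capture the raw LLM response, keyed to the same runId
  await supabase.from('llm_benchmark_responses').insert({
    run_id: runId,
    model,
    raw_response: await response.json(),
  });
}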

I should also mention how to pass in parameters. Hopefully it is pretty obvious: the input file you use to run the CLI tool handles that. For example:

{
  "models": ["o3", "o3-mini", "gpt-4o", "gpt-4o-mini"],

  "inputs": {
    "textInput": "{some resume data}",
    "targetJobDescription": "",
    "userPrompts": [],
    "resumeId": "new"
  }
}

here "inputs" are input to Edge function and this is of course expected to be custom to your need. Once you have input file ready, you can simply execute (assuming you already done npm install)

npm start <function-name> <path-to-scenario>

or, with a concrete scenario:

npm start analyze-resume-data scenario/analyze-resume-data/test-data.json
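
On the Edge function side, the only change besides the timing code is to pick up the extra parameters the CLI sends along; something like this sketch (the field names match the example above, and the fallback model is my assumption):

// Destructure the normal inputs plus the extra benchmark parameters;
// model and runId are only present when the CLI tool makes the call.
const { textInput, targetJobDescription, userPrompts, resumeId, model, runId } =
  await req.json();

// fall back to the function's usual model when not benchmarking (assumed default)
const modelToUse = model ?? 'gpt-4o-mini';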

Once I got the data into the table, I wanted a quick way to visualize the results, so I vibe coded this super-preliminary viewer (using Gemini Code Assistant). I plan to expand it, but it already serves my purpose. When you use the CLI tool to generate data, this visualization app simply fetches the records, groups them, and shows them to you in a way that is (hopefully) a bit more consumable. Here is a screenshot of one of my test results:

Screenshot of the viewer
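
Under the hood, the grouping the viewer does is nothing fancy; it boils down to something like this (the row shape is hypothetical):

// Hypothetical row shape matching the benchmark table sketched earlier
type BenchmarkRow = { run_id: string; model: string; latency_ms: number };

// Group rows by model and compute the average latency per model
function averageLatencyByModel(rows: BenchmarkRow[]): Map<string, number> {
  const byModel = new Map<string, number[]>();
  for (const row of rows) {
    const list = byModel.get(row.model) ?? [];
    list.push(row.latency_ms);
    byModel.set(row.model, list);
  }
  const averages = new Map<string, number>();
  for (const [model, ms] of byModel) {
    averages.set(model, ms.reduce((a, b) => a + b, 0) / ms.length);
  }
  return averages;
}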

Maybe there is a way to achieve this without injecting code, but for now I'm pretty satisfied with what I have. If you have a way to extend this, please do so and share it back with the community!

Lastly, here are the links to my repos:

Hopefully some of you find this interesting/useful - happy (vibe?) coding!
