Multimodal Experience with AI/ML API in NodeJS

Introduction

Large Language Models excel at text-related tasks. But what if you need to make a model multimodal? How can you teach a text model to process an audio file, for example?

There is a solution: combine two different models, one that transcribes the audio recording and one that processes the transcription. The result of this processing is a description of what is happening in the recording.

This can easily be implemented using the text models of AI/ML API together with an audio transcription service such as Deepgram.

Choosing a Text Model in AI/ML API

Since the text model needs to follow instructions strictly, the best candidate for this is an instruct-tuned model.

By going to the models section, we find the right one for our purposes. A good candidate is the Mixtral 8x7B Instruct model, which is the model identifier used in the code below.

Obtaining a Token in Deepgram

You can get the key here.

Obtaining a Token in AI/ML API

You can get the key here.

Implementation

Make sure that NodeJS is installed on your machine. If necessary, you can find all the instructions for installing NodeJS here.

For a clear example of implementing multimodality, you can create a web server that accepts the URL of an audio file and a brief "type" of the recording, so that the models can understand the context of the speech.

Preparation

You need to create a new project. To do this, create a new folder named aimlapi-multimodal-example in any convenient location and navigate into it.

mkdir aimlapi-multimodal-example
cd ./aimlapi-multimodal-example

Here, create a new project using npm and install the required dependencies:

npm init -y
npm i express @deepgram/sdk openai

Create a source file that will hold all the code and open the project in your preferred IDE. In my case, that is VSCode.

touch ./index.js
code .

Importing Dependencies

To build the required functionality, you will need the Deepgram API and the AI/ML API. Any framework or module can serve as the web server, but for simplicity I suggest using express.

AI/ML API supports usage through the OpenAI SDK, so you can limit the import of all dependencies to the following:


const deepgram = require('@deepgram/sdk');
const express = require('express');
const { OpenAI } = require('openai');

API Interfaces and Prompts

The next step is to create all the constants, an express application, and interfaces for accessing the APIs:

const PORT = 8080;
const app = express();

const deepgramModel = 'nova-2';
const openaiModel = 'mistralai/Mixtral-8x7B-Instruct-v0.1';
const deepgramApi = deepgram.createClient('<DEEPGRAM_TOKEN>');
const openaiApi = new OpenAI({ baseURL: 'https://api.aimlapi.com', apiKey: '<AIMLAPI_TOKEN>' });
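
The keys are hard-coded here as the placeholders '<DEEPGRAM_TOKEN>' and '<AIMLAPI_TOKEN>' for brevity. In practice you may prefer to read them from environment variables instead; a minimal sketch, where the variable names are just an example:

// Same clients as above, but with the keys taken from the environment.
const deepgramApi = deepgram.createClient(process.env.DEEPGRAM_TOKEN);
const openaiApi = new OpenAI({
  baseURL: 'https://api.aimlapi.com',
  apiKey: process.env.AIMLAPI_TOKEN,
});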

Text models operate on prompts. Therefore, you need to create prompts that instruct the model on how to process the transcription. There will be two prompts:

  • summary prompt: produces a detailed textual description of the audio file
  • context prompt: validates and edits that description

Declare them in this manner:


const getSummaryPrompt =
  () => `Please provide a detailed report of the text transcription. The transcript of which I provide below in triple quotes, including key summary outcomes.
KEEP THESE RULES STRICTLY:
STRICTLY SPLIT OUTPUT IN PARAGRAPHS: Topic and the matter of discourse, Key outcomes, Ideas and Conclusions.
OUTPUT MUST BE STRICTLY LIMITED TO 2000 CHARACTERS!
STRICTLY KEEP THE SENTENCES COMPACT WITH BULLET POINTS! THIS IS IMPORTANT!
ALL CONTEXT OF THE TRANSCRIPT MUST BE INCLUDED IN OUTPUT!
DO NOT INCLUDE MESSAGES ABOUT CHARACTERS COUNT IN THE OUTPUT!`;

const getContextPrompt = (
  type,
) => `Ensure integrity and quality of the given summary, it is the summary of a ${type}, edit it accordingly.
OUTPUT MUST BE STRICTLY LIMITED TO 2000 CHARACTERS!
STRICTLY KEEP THE SENTENCES COMPACT WITH BULLET POINTS! THIS IS IMPORTANT!
ALL CONTEXT OF THE TRANSCRIPT MUST BE INCLUDED IN OUTPUT!
DO NOT INCLUDE MESSAGES ABOUT CHARACTERS COUNT IN THE OUTPUT!`;


These are template functions that return the required prompt strings.
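
For example, getContextPrompt('voice') expands to the editing instructions for a voice recording. A quick way to check this locally (a throwaway sketch, not part of the server):

console.log(getSummaryPrompt());
console.log(getContextPrompt('voice'));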

Express Endpoint

Our task will be handled by a GET HTTP endpoint at /summarize.

We declare it using express:

app.get('/summarize', async (req, res, next) => {})

All of the following snippets go inside this handler. Two parameters will be sent in the request: type and url. We extract them from the query string and perform basic validation.


const { type, url } = req.query;
if (!type || !url) {
  return res.status(400).send({ error: "'type' and 'url' parameters required" });
}

Next, we need to send a request to the Deepgram API and obtain a textual transcription of the audio file:

const {
  result: {
    results: {
      channels: [
        {
          alternatives: [{ transcript }],
        },
      ],
    },
  },
} = await deepgramApi.listen.prerecorded.transcribeUrl(
  {
    url: url,
  },
  {
    model: deepgramModel,
    smart_format: true,
  },
);

We are interested only in the first transcription result, so we ignore all other channels and alternatives and extract the transcript using a destructuring assignment.
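
Note that this nested destructuring will throw if Deepgram returns no channels or alternatives. As a small defensive sketch (assuming the SDK also returns an error field alongside the result object used above), you could extract the transcript with optional chaining and reject the request when it is missing:

const { result, error } = await deepgramApi.listen.prerecorded.transcribeUrl(
  { url },
  { model: deepgramModel, smart_format: true },
);
const transcript =
  result?.results?.channels?.[0]?.alternatives?.[0]?.transcript;
if (error || !transcript) {
  // Nothing was transcribed, so there is nothing to summarize.
  return res.status(422).send({ error: 'failed to transcribe the audio file' });
}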

Next, we need to process the transcription using the AI/ML API. For this, we will use the OpenAI SDK and its chat.completions.create method:


const summaryCompletion = await openaiApi.chat.completions.create({
  model: openaiModel,
  messages: [
    { role: 'system', content: getSummaryPrompt() },
    { role: 'user', content: transcript },
  ],
});

const contextedCompletion = await openaiApi.chat.completions.create({
  model: openaiModel,
  messages: [
    { role: 'system', content: getContextPrompt(type) },
    { role: 'user', content: summaryCompletion.choices[0].message.content },
  ],
});

This runs the text through the model twice, improving the quality of the result and correcting some errors the model might have made on the first pass.

Now we need to return the response, formatting it visually:

const response = `<pre style="font-family: sans-serif; white-space: pre-line;">${contextedCompletion.choices[0].message.content}</pre>`;
res.send(response);

With this, the processing of the /summarize request is complete. All that remains is to launch the web server:

app.listen(PORT, () => {
  console.log(`listening on http://127.0.0.1:${PORT}`);
});
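
One thing to keep in mind: Express 4 does not forward a rejected promise from an async handler to its error-handling middleware on its own. A minimal sketch of how you could guard against that, using the next callback already declared in the handler signature:

app.get('/summarize', async (req, res, next) => {
  try {
    // ...the handler body shown above...
  } catch (err) {
    // Pass any Deepgram or AI/ML API failure on to Express.
    next(err);
  }
});

// Registered after the routes: turns forwarded errors into a 500 response.
app.use((err, req, res, next) => {
  console.error(err);
  res.status(500).send({ error: 'internal server error' });
});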

Result

Launch the application using the command:

node ./index.js

You will see a message in the console with the address of the running server. You can check the result in the browser by going to the server's address with the API path appended: http://127.0.0.1:8080/summarize.

You will immediately see an error:

{"error":"'type' and 'url' parameters required"}

This indicates that basic parameter validation is working. Now specify the necessary parameters in the URL for the request to be processed correctly:

http://127.0.0.1:8080/summarize?url=https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3&type=voice

This will return a result of approximately the following kind:

Summary:

* Speaker admires Mr. Rochester's beauty and devotion.
* Mr. Rochester is described as subdued and open to external influences.
* Speaker's admiration suggests a positive relationship.
* Use of language hints at Mr. Rochester's strength and control.

The text appears to be a fragmented transcription about a person named Mr. Rochester. The speaker expresses admiration for Mr. Rochester's beauty and will, describing him as subdued and devoted. The speaker's admiration and use of language suggest a positive relationship and impression of Mr. Rochester. The phrase "bowed to let might in" is unclear but may indicate Mr. Rochester's openness to external influences. The text's limited and fragmented nature makes definitive conclusions difficult, but the speaker's admiration and use of language hint at Mr. Rochester's strength and control.

Voila! We have created an application that transcribes an audio file and produces a brief description of it, launched it as a web server, and can now use it in completely different contexts. For example, instead of a browser, we can use the wget utility and see the result directly in the terminal:

wget -q -O - 'http://127.0.0.1:8080/summarize?url=https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3&type=voice'

Conclusion

Using text models in a multimodal setup opens up tasks that previously seemed impossible. For example, we can transcribe YouTube videos, explain complex diagrams in simple language, or conduct an entire study by describing the instructions to the model in plain human language.
