<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AI/ML API</title>
    <description>The latest articles on DEV Community by AI/ML API (@nikolayaimlapi).</description>
    <link>https://dev.to/nikolayaimlapi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1287258%2F2bea1339-c955-4e34-973a-ca8b45b9015e.png</url>
      <title>DEV Community: AI/ML API</title>
      <link>https://dev.to/nikolayaimlapi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nikolayaimlapi"/>
    <language>en</language>
    <item>
      <title>Multimodal Experience with AI/ML API in NodeJS</title>
      <dc:creator>AI/ML API</dc:creator>
      <pubDate>Tue, 30 Apr 2024 21:02:10 +0000</pubDate>
      <link>https://dev.to/nikolayaimlapi/multimodal-experience-with-aiml-api-in-nodejs-jb0</link>
      <guid>https://dev.to/nikolayaimlapi/multimodal-experience-with-aiml-api-in-nodejs-jb0</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Large Language Models excel at text-related tasks. But what if you need to make a model multimodal? How can you teach a text model to process an audio file, for example?&lt;/p&gt;

&lt;p&gt;There is a solution: combine two different models. One model transcribes the audio recording, and another processes the resulting text. The output of this processing is a description of what is happening in the recording.&lt;/p&gt;

&lt;p&gt;This can be implemented easily using the text models of the AI/ML API together with an audio transcription service such as Deepgram.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Choosing a Text Model in AI/ML API&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Since the text model needs to follow instructions strictly, the best candidate for this is an instruct-tuned model.&lt;/p&gt;

&lt;p&gt;By going to the &lt;a href="https://aimlapi.com/models" rel="noopener noreferrer"&gt;models section&lt;/a&gt;, we can find the right one for our purposes. A good candidate is the &lt;strong&gt;Mixtral 8x7B Instruct&lt;/strong&gt; model, which is the one referenced in the code below.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Obtaining a Token in Deepgram&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can get the key &lt;a href="https://deepgram.com/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Obtaining a Token in AI/ML API&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can get the key &lt;a href="https://aimlapi.com/app/keys" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Implementation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Make sure that NodeJS is installed on your machine. If necessary, you can find all the instructions for installing NodeJS &lt;a href="https://nodejs.org/en" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For a clear example of implementing multimodality, we will create a web server that accepts the URL of an audio file and a brief "type" of the recording, so that the models can understand the context of the speech.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Preparation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You need to create a new project. To do this, create a new folder named &lt;strong&gt;&lt;code&gt;aimlapi-multimodal-example&lt;/code&gt;&lt;/strong&gt; in any convenient location and navigate into it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;aimlapi-multimodal-example
&lt;span class="nb"&gt;cd&lt;/span&gt; ./aimlapi-multimodal-example
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, create a new project using &lt;strong&gt;&lt;code&gt;npm&lt;/code&gt;&lt;/strong&gt; and install the required dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm init &lt;span class="nt"&gt;-y&lt;/span&gt;
npm i express @deepgram/sdk openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a source file that will hold all the necessary code, and open the project in your preferred IDE. In my case, I will be using VSCode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;touch&lt;/span&gt; ./index.js
code &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Importing Dependencies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To create the required functionality, you will need to use the &lt;strong&gt;&lt;code&gt;Deepgram API&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;AI/ML API&lt;/code&gt;&lt;/strong&gt;. As a web server, any framework or module can be used, but for simplicity, I suggest using &lt;strong&gt;&lt;code&gt;express&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;AI/ML API supports usage through the OpenAI SDK, so you can limit the import of all dependencies to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deepgram&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@deepgram/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;API Interfaces and Prompts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The next step is to create all the constants, an &lt;strong&gt;&lt;code&gt;express&lt;/code&gt;&lt;/strong&gt; application, and interfaces for accessing the APIs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;PORT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deepgramModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;nova-2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openaiModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mistralai/Mixtral-8x7B-Instruct-v0.1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deepgramApi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;deepgram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;DEEPGRAM_TOKEN&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openaiApi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://api.aimlapi.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;AIMLAPI_TOKEN&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Text models operate with prompts, so you need to create prompts that instruct the model on how to process the transcriptions. There will be two prompts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summary prompt: produces a detailed textual description of the audio file&lt;/li&gt;
&lt;li&gt;context prompt: validates and edits that description for the given recording type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Declare them in this manner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getSummaryPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`Please provide a detailed report of the text transcription. The transcript of which I provide below in triple quotes, including key summary outcomes.
KEEP THESE RULES STRICTLY:
STRICTLY SPLIT OUTPUT IN PARAGRAPHS: Topic and the matter of discourse, Key outcomes, Ideas and Conclusions.
OUTPUT MUST BE STRICTLY LIMITED TO 2000 CHARACTERS!
STRICTLY KEEP THE SENTENCES COMPACT WITH BULLET POINTS! THIS IS IMPORTANT!
ALL CONTEXT OF THE TRANSCRIPT MUST BE INCLUDED IN OUTPUT!
DO NOT INCLUDE MESSAGES ABOUT CHARACTERS COUNT IN THE OUTPUT!`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getContextPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`Ensure integrity and quality of the given summary, it is the summary of a &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, edit it accordingly.
OUTPUT MUST BE STRICTLY LIMITED TO 2000 CHARACTERS!
STRICTLY KEEP THE SENTENCES COMPACT WITH BULLET POINTS! THIS IS IMPORTANT!
ALL CONTEXT OF THE TRANSCRIPT MUST BE INCLUDED IN OUTPUT!
DO NOT INCLUDE MESSAGES ABOUT CHARACTERS COUNT IN THE OUTPUT!`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are template functions that return the required prompt string.&lt;/p&gt;
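&lt;p&gt;As a quick check of the template-function pattern, the context prompt interpolates the recording type into the instruction string. A standalone sketch (the prompt text below is shortened for the example):&lt;/p&gt;

```javascript
// Standalone demo of the prompt template pattern used above.
// The prompt text is abbreviated; the full version appears in the article.
const getContextPrompt = (type) =>
  `Ensure integrity and quality of the given summary, it is the summary of a ${type}, edit it accordingly.`;

// The recording type is interpolated into the instructions,
// giving the model context about the kind of audio it is editing.
const prompt = getContextPrompt('voice');
console.log(prompt);
```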

&lt;h3&gt;
  
  
  &lt;strong&gt;Express Endpoint&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our task will be handled by a GET HTTP endpoint at &lt;strong&gt;&lt;code&gt;/summarize&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We declare it using &lt;strong&gt;&lt;code&gt;express&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/summarize&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two parameters will be sent in the request: &lt;strong&gt;&lt;code&gt;type&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;url&lt;/code&gt;&lt;/strong&gt;. We will extract them from the request and perform basic validation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;'type' and 'url' parameters required&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need to send a request to the Deepgram API and obtain a textual transcription of the audio file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;channels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;alternatives&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="nx"&gt;transcript&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;deepgramApi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prerecorded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribeUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;deepgramModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;smart_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are interested only in the first result, so we ignore all other possible alternatives and extract the data using &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Destructuring_assignment#binding_and_assignment" rel="noopener noreferrer"&gt;destructuring assignment&lt;/a&gt;.&lt;/p&gt;
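&lt;p&gt;For readers unfamiliar with the pattern, the nested destructuring above is equivalent to indexing into the object step by step. A minimal illustration with a mock of the response shape (the real object comes from the Deepgram SDK):&lt;/p&gt;

```javascript
// Mock of the Deepgram response shape (illustrative only).
const response = {
  result: {
    results: {
      channels: [{ alternatives: [{ transcript: 'hello world' }] }],
    },
  },
};

// Indexing step by step...
const viaIndexing = response.result.results.channels[0].alternatives[0].transcript;

// ...is equivalent to the nested destructuring used in the endpoint,
// which binds only the first channel's first alternative:
const {
  result: {
    results: {
      channels: [{ alternatives: [{ transcript }] }],
    },
  },
} = response;

console.log(transcript === viaIndexing); // both read the same field
```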

&lt;p&gt;Next, we need to process the transcription using the AI/ML API. For this, we will use the OpenAI SDK and the &lt;strong&gt;&lt;code&gt;chat.completions&lt;/code&gt;&lt;/strong&gt; methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summaryCompletion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openaiApi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;openaiModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getSummaryPrompt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;transcript&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;contextedCompletion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openaiApi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;openaiModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getContextPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;summaryCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs the result through the model twice, improving its quality and eliminating some errors the model might have made on the first pass.&lt;/p&gt;

&lt;p&gt;Now we need to return the response, formatting it visually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&amp;lt;pre style="font-family: sans-serif; white-space: pre-line;"&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;contextedCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;/pre&amp;gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this, the processing of the &lt;strong&gt;&lt;code&gt;/summarize&lt;/code&gt;&lt;/strong&gt; request is complete. All that remains is to launch the web server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;PORT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`listening on http://127.0.0.1:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;PORT&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Result&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Launch the application using the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node ./index.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see a console message with the running server's address. You can check the result in the browser by going to the server's address and adding the API request path: &lt;a href="http://127.0.0.1:8080/summarize" rel="noopener noreferrer"&gt;http://127.0.0.1:8080/summarize&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You will immediately see an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"'type' and 'url' parameters required"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This indicates that basic parameter validation is working. Now specify the necessary parameters in the URL for the request to be processed correctly:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://127.0.0.1:8080/summarize?url=https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3&amp;amp;type=voice" rel="noopener noreferrer"&gt;http://127.0.0.1:8080/summarize?url=https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3&amp;amp;type=voice&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will return a result similar to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Summary:

* Speaker admires Mr. Rochester's beauty and devotion.
* Mr. Rochester is described as subdued and open to external influences.
* Speaker's admiration suggests a positive relationship.
* Use of language hints at Mr. Rochester's strength and control.

The text appears to be a fragmented transcription about a person named Mr. Rochester. The speaker expresses admiration for Mr. Rochester's beauty and will, describing him as subdued and devoted. The speaker's admiration and use of language suggest a positive relationship and impression of Mr. Rochester. The phrase "bowed to let might in" is unclear but may indicate Mr. Rochester's openness to external influences. The text's limited and fragmented nature makes definitive conclusions difficult, but the speaker's admiration and use of language hint at Mr. Rochester's strength and control.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voila! We have created an application capable of transcribing an audio file and producing a brief description of it, launched it on a web server, and can now use it in completely different contexts. For example, instead of a browser, we can use the wget utility and see the result directly in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; - &lt;span class="s1"&gt;'http://127.0.0.1:8080/summarize?url=https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3&amp;amp;type=voice'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Using text models through a multimodal approach opens up the possibility of solving tasks that previously seemed out of reach. For example, we can transcribe and summarize YouTube videos, explain complex diagrams in simple language, or conduct an entire study by giving the model its instructions in plain human language.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>tutorial</category>
      <category>ai</category>
      <category>api</category>
    </item>
    <item>
      <title>DBRX, Grok, Mixtral: Mixture-of-Experts is a trending architecture for LLMs</title>
      <dc:creator>AI/ML API</dc:creator>
      <pubDate>Thu, 11 Apr 2024 12:23:04 +0000</pubDate>
      <link>https://dev.to/nikolayaimlapi/dbrx-grok-mixtral-mixture-of-experts-is-a-trending-architecture-for-llms-1nll</link>
      <guid>https://dev.to/nikolayaimlapi/dbrx-grok-mixtral-mixture-of-experts-is-a-trending-architecture-for-llms-1nll</guid>
      <description>&lt;p&gt;Mixture-of-Experts (MoE) architecture is a relatively new wave in the development of large language models (LLMs), offering a flexible solution that efficiently tackles computational challenges. Leveraging the MoE technique, models like DBRX demonstrate enhanced performance by activating only a relevant subset of ‘experts’ for each input. This not only reduces the computational cost but also scales model capacity without proportionately increasing resource demands.&lt;/p&gt;

&lt;p&gt;The recent introduction of models such as Databricks’ DBRX, Grok-1 by xAI, and Mixtral 8x7B by Mistral AI marks a significant trend toward the adoption of MoE architecture in open-source LLM development, making it a focal point for researchers and practitioners alike.&lt;/p&gt;

&lt;p&gt;The adoption of MoE models, including DBRX, is paving the way for advancements in efficient LLM training, addressing critical aspects like flop efficiency per parameter and decreased latency. Such models have become instrumental in applications requiring retrieval-augmented generation (RAG) and autonomous agents, thanks to their cost-effective training methods and improved generalization capabilities.&lt;/p&gt;

&lt;p&gt;With a focus on scalable, high-performing, and efficient LLMs, this article will explore the intricacies of MoE architecture, highlighting how pioneering open implementations by Databricks and others are setting new benchmarks in the field.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rise of Mixture-of-Experts in LLMs
&lt;/h2&gt;

&lt;p&gt;The inception of Mixture-of-Experts (MoE) can be traced back to the early 1990s, marking a pivotal moment in neural network design. This innovative architecture, initially introduced by Jacobs et al.[1], revolutionized the way large language models (LLMs) are developed by integrating multiple “expert” networks. Each of these networks specializes in processing distinct subsets of input data, with a gating mechanism efficiently directing each input to the most relevant expert(s). This approach not only enhances model performance but also significantly reduces computational costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features of MoE Models:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalability: MoE models uniquely maintain a relatively constant computational cost during inference, allowing for the scaling up of model size. This is achieved without the proportional increase in resource demand typically seen in dense models.&lt;/li&gt;
&lt;li&gt;Efficiency: These models are celebrated for their flop efficiency per weight, making them ideal for scenarios with fixed computational budgets. This efficiency enables the processing of more tokens within the same time or compute constraints.&lt;/li&gt;
&lt;/ul&gt;
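&lt;p&gt;To make the scalability point concrete, consider Mixtral-8x7B-style numbers: eight experts per MoE layer with only two routed per token, so inference touches a fraction of the total weights. A back-of-the-envelope sketch (the per-component parameter counts are illustrative assumptions, not official figures):&lt;/p&gt;

```javascript
// Rough parameter accounting for a Mixtral-8x7B-style MoE model.
// The component sizes below are assumptions for demonstration only.
const numExperts = 8;        // experts per MoE layer
const activeExperts = 2;     // experts routed per token (top-2 gating)
const expertParams = 5.5e9;  // assumed parameters per expert, summed across layers
const sharedParams = 3e9;    // assumed shared attention/embedding parameters

const totalParams = sharedParams + numExperts * expertParams;     // model capacity
const activeParams = sharedParams + activeExperts * expertParams; // per-token compute

console.log(`total:  ${totalParams / 1e9}B parameters`);
console.log(`active: ${activeParams / 1e9}B parameters per token`);
// Inference cost tracks the active count, so capacity can grow by adding
// experts without a proportional increase in per-token compute.
```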

&lt;p&gt;&lt;strong&gt;Challenges and Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training Stability and Overfitting: MoE models are more susceptible to training instabilities and tend to overfit, especially with smaller datasets. Strategies like careful regularization and dataset augmentation are vital.&lt;/li&gt;
&lt;li&gt;Load Balancing and Communication Overhead: Ensuring even distribution of workload among experts and managing communication overhead in distributed setups are critical for optimal performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MoE’s application in LLMs, such as DBRX and Mixtral 8x7B, demonstrates its capability to handle complex and diverse datasets with high efficiency. By dynamically allocating tasks to specialized experts, MoE models achieve nuanced understanding and high-performance standards, setting a new benchmark in the field of AI and opening avenues for further exploration in various domains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inside the Architecture: Understanding MoE
&lt;/h2&gt;

&lt;p&gt;Applying the Mixture-of-Experts (MoE) architecture to transformers involves a significant architectural shift, particularly in how dense feedforward neural network (FFN) layers are reimagined. Here’s a closer look at this transformative process:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replacement of Dense FFN Layers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional Architecture: Dense FFN layers where each layer is fully connected and participates in the computation for every input.&lt;/li&gt;
&lt;li&gt;MoE Architecture: Sparse MoE layers replace dense FFNs. Each MoE layer houses multiple expert FFNs and a gating mechanism, fundamentally altering the network’s computation strategy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operational Dynamics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gating Mechanism: Acts as a traffic director, guiding each input sequence to the most relevant subset of experts.&lt;/li&gt;
&lt;li&gt;Selective Activation: Only a specific group of experts is activated for a given input, optimizing computational resources and efficiency.&lt;/li&gt;
&lt;/ul&gt;
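&lt;p&gt;The gating and selective activation described above can be sketched in a few lines of plain Python. This is a toy illustration with made-up dimensions and a softmax top-2 router, not the implementation of any particular model:&lt;/p&gt;

```python
import math
import random

random.seed(0)

D_MODEL, D_FF, N_EXPERTS, TOP_K = 4, 8, 4, 2

def rand_matrix(n_in, n_out):
    return [[random.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def matvec(x, w):
    """Multiply vector x (length n_in) by matrix w (n_in by n_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

# Each expert is a tiny two-layer FFN mapping D_MODEL to D_FF and back.
experts = [(rand_matrix(D_MODEL, D_FF), rand_matrix(D_FF, D_MODEL))
           for _ in range(N_EXPERTS)]
# The router produces one score per expert for a given token.
w_gate = rand_matrix(D_MODEL, N_EXPERTS)

def moe_layer(x):
    """Send token x to its TOP_K highest-scoring experts and mix their outputs."""
    scores = matvec(x, w_gate)
    top = sorted(range(N_EXPERTS), key=lambda i: scores[i])[-TOP_K:]
    exp_scores = [math.exp(scores[i]) for i in top]
    total = sum(exp_scores)
    gates = [s / total for s in exp_scores]  # softmax over the selected experts only
    out = [0.0] * D_MODEL
    for gate, idx in zip(gates, top):
        w1, w2 = experts[idx]
        hidden = [max(h, 0.0) for h in matvec(x, w1)]  # ReLU nonlinearity
        for j, v in enumerate(matvec(hidden, w2)):
            out[j] += gate * v  # gate-weighted sum of expert outputs
    return out

token = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
print(len(moe_layer(token)))  # prints 4
```

&lt;p&gt;Only TOP_K of the N_EXPERTS feedforward networks run for each token, which is exactly why compute stays roughly constant as more experts (and thus more parameters) are added.&lt;/p&gt;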

&lt;p&gt;&lt;strong&gt;Scalability and Efficiency:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MoE models maintain a constant computational cost during inference, a stark contrast to traditional models where costs escalate with size. This trait is particularly valuable in resource-constrained deployment scenarios, ensuring larger models can be trained and deployed without proportional increases in computational demands.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shift to MoE architecture, as seen in models like DBRX, Grok-1, and Mixtral 8x7B, represents a new trend in developing large, efficient LLMs. By partitioning tasks among specialized experts, MoE models offer a refined approach to handling complex, high-dimensional tasks, setting the stage for more sophisticated and capable AI systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Example of MoE Performance
&lt;/h2&gt;

&lt;p&gt;You can explore the capabilities of the MoE architecture yourself. Below is an example of a text generation task handled by the Mixtral 8x7B Instruct MoE model through the &lt;a href="https://aimlapi.com"&gt;AI/ML API&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import openai

client = openai.OpenAI(
api_key=”***”,
base_url=”https://api.aimlapi.com",
)

def get_code_completion(messages, max_tokens=2500, model=”mistralai/Mixtral-8x7B-Instruct-v0.1"):
chat_completion = client.chat.completions.create(
messages=messages,
model=model,
max_tokens=max_tokens,
top_p=1,
n=10,
temperature=0.7,
)
return chat_completion

if __name__ == ‘__main__’:
messages = [
{“role”: “system”, “content”: “Assist in writing an article on a given topic. Write a detailed text with examples and reasoning.”},
{“role”: “user”, “content”: “I need an article about the impact of AI on the World Wide Web.”},
]
start = time.perf_counter()
chat_completion = get_code_completion(messages)
print(chat_completion.choices[0].message.content)
print(f’Elapsed time (sec): {time.perf_counter() — start}’)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can replace the model id mistralai/Mixtral-8x7B-Instruct-v0.1 with another supported model, such as meta-llama/Llama-2-70b-chat-hf, and vary the prompt to compare different aspects of MoE performance against other models. Two things you are likely to notice right away are Mixtral’s fast inference and accurate instruction following: both are benefits of the computationally efficient MoE architecture and its smart selection of experts for a given prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  DBRX: A New Benchmark in LLM Efficiency
&lt;/h2&gt;

&lt;p&gt;DBRX, developed by Databricks, is emerging as a new benchmark in the landscape of large language models (LLMs), pushing the frontiers of efficiency and performance. This open LLM distinguishes itself through several key features:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outperforms GPT-3.5 and rivals Gemini 1.0 Pro in standard benchmarks.&lt;/li&gt;
&lt;li&gt;Demonstrates superior capabilities in coding tasks, surpassing CodeLLaMA-70B.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Efficiency and Size:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Achieves up to double the inference speed of LLaMA2-70B.&lt;/li&gt;
&lt;li&gt;Maintains a compact size, with both total and active parameter counts being about 40% smaller than Grok-1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Generative Speed and Training Data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When integrated with Mosaic AI Model Serving, it achieves a generation speed of up to 150 tokens per second per user.&lt;/li&gt;
&lt;li&gt;Pre-trained on a massive corpus of 12T tokens of text and code data, supporting a maximum context length of 32k tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DBRX’s standing on the Open LLM leaderboard is noteworthy, outperforming models like Mistral Instruct and Grok-1 in the majority of benchmarks. Its licensing model is uniquely designed to encourage wide usage while imposing restrictions on very large user bases (more than 700 million monthly active users). Positioned as twice as compute-efficient compared to leading LLMs, DBRX not only sets a new standard for open-source models but also paves the way for customizable, transparent generative AI across various enterprises. Its availability across major cloud platforms and its expected integration into NVIDIA’s ecosystem further underscore its accessibility and potential for widespread adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grok-1: The First Open MoE Model of 300B+ Size
&lt;/h2&gt;

&lt;p&gt;Grok-1 by xAI stands as a pioneering implementation of the Mixture-of-Experts (MoE) architecture in the realm of large-scale LLMs. This transformer-based model features a staggering 314 billion parameters. However, its efficiency is highlighted by the fact that only about 86 billion parameters (approximately 25%) are active for any given token at a time. This selective activation significantly reduces computational demands while maintaining high-performance levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Attributes of Grok-1:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture: Mixture-of-8-Experts, with each token processed by two experts during inference.&lt;/li&gt;
&lt;li&gt;Training: Developed from scratch using a custom stack based on JAX and Rust, without fine-tuning for specific applications.&lt;/li&gt;
&lt;li&gt;Accessibility: Available under the Apache 2.0 license for broad usage, including commercial applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grok-1’s technical specifications are impressive, with 64 transformer layers, 6,144-dimensional embeddings, and the ability to process sequences up to 8,192 tokens long. Despite its large size and the substantial computational resources required (e.g., 8x A100 GPUs), Grok-1’s design facilitates efficient computation, employing bfloat16 precision. Another notable technical detail is the use of rotary positional embeddings to further enhance the model’s capability to manage extensive data sequences efficiently. This model exemplifies the new trend in open-source LLM development, emphasizing the importance of MoE architecture for achieving both scale and efficiency in AI models.&lt;/p&gt;
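&lt;p&gt;The parameter figures above lend themselves to a quick back-of-the-envelope check: with 2 of 8 experts active per token, the active count is roughly the shared (non-expert) parameters plus a quarter of the expert parameters. The helper below is a rough sketch; the shared-parameter figure is inferred from the published totals, not an official breakdown:&lt;/p&gt;

```python
def active_fraction(total_params, shared_params, n_experts, active_experts):
    """Fraction of parameters touched per token in a simple MoE model.

    Assumes the model splits into shared parameters (attention, embeddings)
    plus equally sized experts, of which active_experts run per token.
    """
    expert_params = total_params - shared_params
    active = shared_params + expert_params * active_experts / n_experts
    return active / total_params

# Grok-1: 314B total, ~86B active, 2 of 8 experts per token.
# Solving 86 = shared + (314 - shared) * 2 / 8 gives shared of about 10B.
print(f"{active_fraction(314e9, 10e9, 8, 2):.0%}")  # prints 27%
```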

&lt;h2&gt;
  
  
  Mixtral: Fine-Grained MoE for Enhanced Performance
&lt;/h2&gt;

&lt;p&gt;Mixtral 8x7B, developed by Mistral AI, represents a significant advancement in the mixture-of-experts (MoE) architecture, showcasing the power of fine-grained MoE for enhanced performance in large language models (LLMs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consists of eight experts, each with 7 billion parameters.&lt;/li&gt;
&lt;li&gt;During inference, only two experts are activated per token, reducing computational costs effectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Surpasses the 70-billion-parameter Llama 2 model in performance metrics.&lt;/li&gt;
&lt;li&gt;Offers about six times faster inference, making it a leader in efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multilingual Support and Context Handling:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports multiple languages including English, French, Italian, German, and Spanish.&lt;/li&gt;
&lt;li&gt;Can process up to 32,000 tokens, approximately 50 pages of text, showcasing its robustness in handling extensive data sequences.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An easy way to try out the capabilities of the model is to sign up for access to the AI/ML API.&lt;/p&gt;

&lt;p&gt;Mixtral 8x7B not only excels in general benchmarks, outperforming Llama 2 70B in areas like commonsense reasoning, world knowledge, and code but also demonstrates remarkable proficiency in multilingual benchmarks. This proficiency is particularly notable in French, German, Spanish, and Italian, where it significantly outperforms Llama 2 70B. Additionally, Mixtral’s approach to bias and sentiment, as evidenced in the BBQ and BOLD benchmarks, shows less bias and more positive sentiment compared to its counterparts. This combination of efficiency, performance, and ethical considerations positions Mixtral 8x7B as a model of choice for developers and researchers seeking scalable, high-performance, and ethically conscious LLM solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Trends and Directions in MoE LLMs
&lt;/h2&gt;

&lt;p&gt;Exploring the horizon of large language models (LLMs) reveals a compelling shift towards a more nuanced architecture, the Mixture of Tokens (MoT), promising to address the challenges faced by the Mixture of Experts (MoE). The MoT technique, by blending different token representations, paves the way for a richer data understanding in NLP tasks. Its potential lies in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enhanced Scalability and Efficiency: MoTs tackle MoE’s limitations like training instability and load imbalance head-on, offering a scalable solution without the computational heft.&lt;/li&gt;
&lt;li&gt;Performance and Training Efficiency: By mixing tokens from various examples before presenting them to experts, MoTs not only boost model performance but also streamline the training process.&lt;/li&gt;
&lt;li&gt;Parameter Reduction: A notable achievement is the drastic cut in parameters, showcasing MoT’s capability to deliver high-performing models with fewer resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Models like GLaM by Google and initiatives by Cohere AI underscore the industry’s move towards adopting MoT and refining MoE architectures. These advancements hint at an exciting future where LLMs achieve unprecedented efficiency and specialization, making them more accessible and effective across a wider range of applications. The journey from MoE to MoT represents a significant leap towards overcoming existing barriers, heralding a new era of AI that is more adaptable, efficient, and powerful.&lt;/p&gt;

&lt;p&gt;[1] &lt;a href="https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf"&gt;https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>moe</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
