<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AI/ML API</title>
    <description>The latest articles on DEV Community by AI/ML API (@nikolayaimlapi).</description>
    <link>https://dev.to/nikolayaimlapi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1287258%2F2bea1339-c955-4e34-973a-ca8b45b9015e.png</url>
      <title>DEV Community: AI/ML API</title>
      <link>https://dev.to/nikolayaimlapi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nikolayaimlapi"/>
    <language>en</language>
    <item>
      <title>Multimodal Experience with AI/ML API in NodeJS</title>
      <dc:creator>AI/ML API</dc:creator>
      <pubDate>Tue, 30 Apr 2024 21:02:10 +0000</pubDate>
      <link>https://dev.to/nikolayaimlapi/multimodal-experience-with-aiml-api-in-nodejs-jb0</link>
      <guid>https://dev.to/nikolayaimlapi/multimodal-experience-with-aiml-api-in-nodejs-jb0</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Large Language Models excel at text-related tasks. But what if you need to make a model multimodal? How can you teach a text model to process an audio file, for example?&lt;/p&gt;

&lt;p&gt;There is a solution: combine two different models. One model transcribes the audio recording, and another processes the resulting text. The output of this processing is a description of what is happening in the recording.&lt;/p&gt;

&lt;p&gt;This can be implemented easily using the text models of the AI/ML API together with an audio transcription service such as Deepgram.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Choosing a Text Model in AI/ML API&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Since the text model needs to follow instructions strictly, the best candidate for this is an instruct-tuned model.&lt;/p&gt;

&lt;p&gt;By going to the &lt;a href="https://aimlapi.com/models" rel="noopener noreferrer"&gt;models section&lt;/a&gt;, we can find the right one for our purposes. A good candidate is the &lt;strong&gt;Mixtral 8x7B Instruct&lt;/strong&gt; model, which is the one referenced in the code below.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Obtaining a Token in Deepgram&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can get the key &lt;a href="https://deepgram.com/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Obtaining a Token in AI/ML API&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can get the key &lt;a href="https://aimlapi.com/app/keys" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Implementation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Make sure that NodeJS is installed on your machine. If necessary, you can find all the instructions for installing NodeJS &lt;a href="https://nodejs.org/en" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For a clear example of implementing multimodality, we will create a web server that accepts the URL of an audio file and a brief "type" of the recording, so that the models can understand the context of the speech.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Preparation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You need to create a new project. To do this, create a new folder named &lt;strong&gt;&lt;code&gt;aimlapi-multimodal-example&lt;/code&gt;&lt;/strong&gt; in any convenient location and navigate into it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;aimlapi-multimodal-example
&lt;span class="nb"&gt;cd&lt;/span&gt; ./aimlapi-multimodal-example
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, create a new project using &lt;strong&gt;&lt;code&gt;npm&lt;/code&gt;&lt;/strong&gt; and install the required dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm init &lt;span class="nt"&gt;-y&lt;/span&gt;
npm i express @deepgram/sdk openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a source file that will hold all the necessary code, and open the project in your preferred IDE. In my case, I will be using VSCode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;touch&lt;/span&gt; ./index.js
code &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Importing Dependencies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To create the required functionality, you will need to use the &lt;strong&gt;&lt;code&gt;Deepgram API&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;AI/ML API&lt;/code&gt;&lt;/strong&gt;. As a web server, any framework or module can be used, but for simplicity, I suggest using &lt;strong&gt;&lt;code&gt;express&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;AI/ML API supports usage through the OpenAI SDK, so you can limit the import of all dependencies to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deepgram&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@deepgram/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;API Interfaces and Prompts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The next step is to create all the constants, an &lt;strong&gt;&lt;code&gt;express&lt;/code&gt;&lt;/strong&gt; application, and interfaces for accessing the APIs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;PORT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deepgramModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;nova-2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openaiModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mistralai/Mixtral-8x7B-Instruct-v0.1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deepgramApi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;deepgram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;DEEPGRAM_TOKEN&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openaiApi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://api.aimlapi.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;AIMLAPI_TOKEN&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Text models operate with prompts, so you need to create prompts that instruct the model on how to process the transcriptions. There will be two prompts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summary prompt: produces a detailed textual description of the audio file&lt;/li&gt;
&lt;li&gt;context prompt: validates and edits that description for the given recording type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Declare them in this manner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getSummaryPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`Please provide a detailed report of the text transcription. The transcript of which I provide below in triple quotes, including key summary outcomes.
KEEP THESE RULES STRICTLY:
STRICTLY SPLIT OUTPUT IN PARAGRAPHS: Topic and the matter of discourse, Key outcomes, Ideas and Conclusions.
OUTPUT MUST BE STRICTLY LIMITED TO 2000 CHARACTERS!
STRICTLY KEEP THE SENTENCES COMPACT WITH BULLET POINTS! THIS IS IMPORTANT!
ALL CONTEXT OF THE TRANSCRIPT MUST BE INCLUDED IN OUTPUT!
DO NOT INCLUDE MESSAGES ABOUT CHARACTERS COUNT IN THE OUTPUT!`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getContextPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`Ensure integrity and quality of the given summary, it is the summary of a &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, edit it accordingly.
OUTPUT MUST BE STRICTLY LIMITED TO 2000 CHARACTERS!
STRICTLY KEEP THE SENTENCES COMPACT WITH BULLET POINTS! THIS IS IMPORTANT!
ALL CONTEXT OF THE TRANSCRIPT MUST BE INCLUDED IN OUTPUT!
DO NOT INCLUDE MESSAGES ABOUT CHARACTERS COUNT IN THE OUTPUT!`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are template functions that return the required prompt string.&lt;/p&gt;
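&lt;p&gt;As a quick check of the template-function pattern, the context prompt interpolates the recording type into the instruction string. A standalone sketch (the prompt text below is shortened for the example):&lt;/p&gt;

```javascript
// Standalone demo of the prompt template pattern used above.
// The prompt text is abbreviated; the full version appears in the article.
const getContextPrompt = (type) =>
  `Ensure integrity and quality of the given summary, it is the summary of a ${type}, edit it accordingly.`;

// The recording type is interpolated into the instructions,
// giving the model context about the kind of audio it is editing.
const prompt = getContextPrompt('voice');
console.log(prompt);
```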

&lt;h3&gt;
  
  
  &lt;strong&gt;Express Endpoint&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our task will be handled by a GET HTTP endpoint at &lt;strong&gt;&lt;code&gt;/summarize&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We declare it using &lt;strong&gt;&lt;code&gt;express&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/summarize&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two parameters will be sent in the request: &lt;strong&gt;&lt;code&gt;type&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;url&lt;/code&gt;&lt;/strong&gt;. We will extract them from the request and perform basic validation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;'type' and 'url' parameters required&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need to send a request to the Deepgram API and obtain a textual transcription of the audio file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;channels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;alternatives&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="nx"&gt;transcript&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;deepgramApi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prerecorded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribeUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;deepgramModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;smart_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are interested only in the first result, so we ignore all other possible alternatives and extract the data using &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Destructuring_assignment#binding_and_assignment" rel="noopener noreferrer"&gt;destructuring assignment&lt;/a&gt;.&lt;/p&gt;
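&lt;p&gt;For readers unfamiliar with the pattern, the nested destructuring above is equivalent to indexing into the object step by step. A minimal illustration with a mock of the response shape (the real object comes from the Deepgram SDK):&lt;/p&gt;

```javascript
// Mock of the Deepgram response shape (illustrative only).
const response = {
  result: {
    results: {
      channels: [{ alternatives: [{ transcript: 'hello world' }] }],
    },
  },
};

// Indexing step by step...
const viaIndexing = response.result.results.channels[0].alternatives[0].transcript;

// ...is equivalent to the nested destructuring used in the endpoint,
// which binds only the first channel's first alternative:
const {
  result: {
    results: {
      channels: [{ alternatives: [{ transcript }] }],
    },
  },
} = response;

console.log(transcript === viaIndexing); // both read the same field
```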

&lt;p&gt;Next, we need to process the transcription using the AI/ML API. For this, we will use the OpenAI SDK and the &lt;strong&gt;&lt;code&gt;chat.completions&lt;/code&gt;&lt;/strong&gt; methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summaryCompletion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openaiApi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;openaiModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getSummaryPrompt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;transcript&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;contextedCompletion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openaiApi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;openaiModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getContextPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;summaryCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs the result through the model twice, improving its quality and eliminating some errors the model might have made on the first pass.&lt;/p&gt;

&lt;p&gt;Now we need to return the response, formatting it visually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&amp;lt;pre style="font-family: sans-serif; white-space: pre-line;"&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;contextedCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;/pre&amp;gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this, the processing of the &lt;strong&gt;&lt;code&gt;/summarize&lt;/code&gt;&lt;/strong&gt; request is complete. All that remains is to launch the web server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;PORT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`listening on http://127.0.0.1:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;PORT&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Result&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Launch the application using the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node ./index.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see a console message with the running server's address. You can check the result in the browser by going to the server's address and adding the API request path: &lt;a href="http://127.0.0.1:8080/summarize" rel="noopener noreferrer"&gt;http://127.0.0.1:8080/summarize&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You will immediately see an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"'type' and 'url' parameters required"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This indicates that basic parameter validation is working. Now specify the necessary parameters in the URL for the request to be processed correctly:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://127.0.0.1:8080/summarize?url=https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3&amp;amp;type=voice" rel="noopener noreferrer"&gt;http://127.0.0.1:8080/summarize?url=https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3&amp;amp;type=voice&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will return a result similar to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Summary:

* Speaker admires Mr. Rochester's beauty and devotion.
* Mr. Rochester is described as subdued and open to external influences.
* Speaker's admiration suggests a positive relationship.
* Use of language hints at Mr. Rochester's strength and control.

The text appears to be a fragmented transcription about a person named Mr. Rochester. The speaker expresses admiration for Mr. Rochester's beauty and will, describing him as subdued and devoted. The speaker's admiration and use of language suggest a positive relationship and impression of Mr. Rochester. The phrase "bowed to let might in" is unclear but may indicate Mr. Rochester's openness to external influences. The text's limited and fragmented nature makes definitive conclusions difficult, but the speaker's admiration and use of language hint at Mr. Rochester's strength and control.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voila! We have created an application capable of transcribing an audio file and producing a brief description of it, launched it on a web server, and can now use it in completely different contexts. For example, instead of a browser, we can use the wget utility and see the result directly in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; - &lt;span class="s1"&gt;'http://127.0.0.1:8080/summarize?url=https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3&amp;amp;type=voice'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Using text models through a multimodal approach opens up the possibility of solving tasks that previously seemed out of reach. For example, we can transcribe and summarize YouTube videos, explain complex diagrams in simple language, or conduct an entire study by giving the model its instructions in plain human language.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>tutorial</category>
      <category>ai</category>
      <category>api</category>
    </item>
    <item>
      <title>DBRX, Grok, Mixtral: Mixture-of-Experts is a trending architecture for LLMs</title>
      <dc:creator>AI/ML API</dc:creator>
      <pubDate>Thu, 11 Apr 2024 12:23:04 +0000</pubDate>
      <link>https://dev.to/nikolayaimlapi/dbrx-grok-mixtral-mixture-of-experts-is-a-trending-architecture-for-llms-1nll</link>
      <guid>https://dev.to/nikolayaimlapi/dbrx-grok-mixtral-mixture-of-experts-is-a-trending-architecture-for-llms-1nll</guid>
      <description>&lt;p&gt;Mixture-of-Experts (MoE) architecture is a relatively new wave in the development of large language models (LLMs), offering a flexible solution that efficiently tackles computational challenges. Leveraging the MoE technique, models like DBRX demonstrate enhanced performance by activating only a relevant subset of ‘experts’ for each input. This not only reduces the computational cost but also scales model capacity without proportionately increasing resource demands.&lt;/p&gt;

&lt;p&gt;The recent introduction of models such as Databricks’ DBRX, Grok-1 by xAI, and Mixtral 8x7B by Mistral AI marks a significant trend toward the adoption of MoE architecture in open-source LLM development, making it a focal point for researchers and practitioners alike.&lt;/p&gt;

&lt;p&gt;The adoption of MoE models, including DBRX, is paving the way for advancements in efficient LLM training, addressing critical aspects like flop efficiency per parameter and decreased latency. Such models have become instrumental in applications requiring retrieval-augmented generation (RAG) and autonomous agents, thanks to their cost-effective training methods and improved generalization capabilities.&lt;/p&gt;

&lt;p&gt;With a focus on scalable, high-performing, and efficient LLMs, this article will explore the intricacies of MoE architecture, highlighting how pioneering open implementations by Databricks and others are setting new benchmarks in the field.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rise of Mixture-of-Experts in LLMs
&lt;/h2&gt;

&lt;p&gt;The inception of Mixture-of-Experts (MoE) can be traced back to the early 1990s, marking a pivotal moment in neural network design. This innovative architecture, initially introduced by Jacobs et al.[1], revolutionized the way large language models (LLMs) are developed by integrating multiple “expert” networks. Each of these networks specializes in processing distinct subsets of input data, with a gating mechanism efficiently directing each input to the most relevant expert(s). This approach not only enhances model performance but also significantly reduces computational costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features of MoE Models:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalability: MoE models uniquely maintain a relatively constant computational cost during inference, allowing for the scaling up of model size. This is achieved without the proportional increase in resource demand typically seen in dense models.&lt;/li&gt;
&lt;li&gt;Efficiency: These models are celebrated for their flop efficiency per weight, making them ideal for scenarios with fixed computational budgets. This efficiency enables the processing of more tokens within the same time or compute constraints.&lt;/li&gt;
&lt;/ul&gt;
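&lt;p&gt;To make the scalability point concrete, consider Mixtral-8x7B-style numbers: eight experts per MoE layer with only two routed per token, so inference touches a fraction of the total weights. A back-of-the-envelope sketch (the per-component parameter counts are illustrative assumptions, not official figures):&lt;/p&gt;

```javascript
// Rough parameter accounting for a Mixtral-8x7B-style MoE model.
// The component sizes below are assumptions for demonstration only.
const numExperts = 8;        // experts per MoE layer
const activeExperts = 2;     // experts routed per token (top-2 gating)
const expertParams = 5.5e9;  // assumed parameters per expert, summed across layers
const sharedParams = 3e9;    // assumed shared attention/embedding parameters

const totalParams = sharedParams + numExperts * expertParams;     // model capacity
const activeParams = sharedParams + activeExperts * expertParams; // per-token compute

console.log(`total:  ${totalParams / 1e9}B parameters`);
console.log(`active: ${activeParams / 1e9}B parameters per token`);
// Inference cost tracks the active count, so capacity can grow by adding
// experts without a proportional increase in per-token compute.
```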

&lt;p&gt;&lt;strong&gt;Challenges and Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training Stability and Overfitting: MoE models are more susceptible to training instabilities and tend to overfit, especially with smaller datasets. Strategies like careful regularization and dataset augmentation are vital.&lt;/li&gt;
&lt;li&gt;Load Balancing and Communication Overhead: Ensuring even distribution of workload among experts and managing communication overhead in distributed setups are critical for optimal performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MoE’s application in LLMs, such as DBRX and Mixtral 8x7B, demonstrates its capability to handle complex and diverse datasets with high efficiency. By dynamically allocating tasks to specialized experts, MoE models achieve nuanced understanding and high-performance standards, setting a new benchmark in the field of AI and opening avenues for further exploration in various domains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inside the Architecture: Understanding MoE
&lt;/h2&gt;

&lt;p&gt;Applying the Mixture-of-Experts (MoE) architecture to transformers involves a significant architectural shift, particularly in how dense feedforward neural network (FFN) layers are reimagined. Here’s a closer look at this transformative process:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replacement of Dense FFN Layers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional Architecture: Dense FFN layers where each layer is fully connected and participates in the computation for every input.&lt;/li&gt;
&lt;li&gt;MoE Architecture: Sparse MoE layers replace dense FFNs. Each MoE layer houses multiple expert FFNs and a gating mechanism, fundamentally altering the network’s computation strategy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operational Dynamics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gating Mechanism: Acts as a traffic director, guiding each input sequence to the most relevant subset of experts.&lt;/li&gt;
&lt;li&gt;Selective Activation: Only a specific group of experts is activated for a given input, optimizing computational resources and efficiency.&lt;/li&gt;
&lt;/ul&gt;
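&lt;p&gt;The gating and selective activation described above can be sketched in a few lines of plain Python. This is a toy illustration with made-up dimensions and a softmax top-2 router, not the implementation of any particular model:&lt;/p&gt;

```python
import math
import random

random.seed(0)

D_MODEL, D_FF, N_EXPERTS, TOP_K = 4, 8, 4, 2

def rand_matrix(n_in, n_out):
    return [[random.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def matvec(x, w):
    """Multiply vector x (length n_in) by matrix w (n_in by n_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

# Each expert is a tiny two-layer FFN mapping D_MODEL to D_FF and back.
experts = [(rand_matrix(D_MODEL, D_FF), rand_matrix(D_FF, D_MODEL))
           for _ in range(N_EXPERTS)]
# The router produces one score per expert for a given token.
w_gate = rand_matrix(D_MODEL, N_EXPERTS)

def moe_layer(x):
    """Send token x to its TOP_K highest-scoring experts and mix their outputs."""
    scores = matvec(x, w_gate)
    top = sorted(range(N_EXPERTS), key=lambda i: scores[i])[-TOP_K:]
    exp_scores = [math.exp(scores[i]) for i in top]
    total = sum(exp_scores)
    gates = [s / total for s in exp_scores]  # softmax over the selected experts only
    out = [0.0] * D_MODEL
    for gate, idx in zip(gates, top):
        w1, w2 = experts[idx]
        hidden = [max(h, 0.0) for h in matvec(x, w1)]  # ReLU nonlinearity
        for j, v in enumerate(matvec(hidden, w2)):
            out[j] += gate * v  # gate-weighted sum of expert outputs
    return out

token = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
print(len(moe_layer(token)))  # prints 4
```

&lt;p&gt;Only TOP_K of the N_EXPERTS feedforward networks run for each token, which is exactly why compute stays roughly constant as more experts (and thus more parameters) are added.&lt;/p&gt;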

&lt;p&gt;&lt;strong&gt;Scalability and Efficiency:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MoE models maintain a constant computational cost during inference, a stark contrast to traditional models where costs escalate with size. This trait is particularly valuable in resource-constrained deployment scenarios, ensuring larger models can be trained and deployed without proportional increases in computational demands.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shift to MoE architecture, as seen in models like DBRX, Grok-1, and Mixtral 8x7B, represents a new trend in developing large, efficient LLMs. By partitioning tasks among specialized experts, MoE models offer a refined approach to handling complex, high-dimensional tasks, setting the stage for more sophisticated and capable AI systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Example of MoE Performance
&lt;/h2&gt;

&lt;p&gt;You can explore the capabilities of the MoE architecture yourself. Below is an example of a text generation task handled by the Mixtral 8x7B Instruct MoE model through the &lt;a href="https://aimlapi.com"&gt;AI/ML API&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import openai

client = openai.OpenAI(
api_key=”***”,
base_url=”https://api.aimlapi.com",
)

def get_code_completion(messages, max_tokens=2500, model=”mistralai/Mixtral-8x7B-Instruct-v0.1"):
chat_completion = client.chat.completions.create(
messages=messages,
model=model,
max_tokens=max_tokens,
top_p=1,
n=10,
temperature=0.7,
)
return chat_completion

if __name__ == ‘__main__’:
messages = [
{“role”: “system”, “content”: “Assist in writing an article on a given topic. Write a detailed text with examples and reasoning.”},
{“role”: “user”, “content”: “I need an article about the impact of AI on the World Wide Web.”},
]
start = time.perf_counter()
chat_completion = get_code_completion(messages)
print(chat_completion.choices[0].message.content)
print(f’Elapsed time (sec): {time.perf_counter() — start}’)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can replace the model id mistralai/Mixtral-8x7B-Instruct-v0.1 with another supported model, such as meta-llama/Llama-2-70b-chat-hf, and vary the prompt to compare different aspects of MoE performance against other models. Two things you are likely to notice right away are Mixtral’s fast inference and accurate instruction following: both are benefits of the computationally efficient MoE architecture and its smart selection of experts for a given prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  DBRX: A New Benchmark in LLM Efficiency
&lt;/h2&gt;

&lt;p&gt;DBRX, developed by Databricks, is emerging as a new benchmark in the landscape of large language models (LLMs), pushing the frontiers of efficiency and performance. This open LLM distinguishes itself through several key features:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outperforms GPT-3.5 and rivals Gemini 1.0 Pro in standard benchmarks.&lt;/li&gt;
&lt;li&gt;Demonstrates superior capabilities in coding tasks, surpassing CodeLLaMA-70B.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Efficiency and Size:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Achieves up to double the inference speed of LLaMA2-70B.&lt;/li&gt;
&lt;li&gt;Maintains a compact size, with both total and active parameter counts being about 40% smaller than Grok-1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Generative Speed and Training Data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When integrated with Mosaic AI Model Serving, it achieves a generation speed of up to 150 tokens per second per user.&lt;/li&gt;
&lt;li&gt;Pre-trained on a massive corpus of 12T tokens of text and code data, supporting a maximum context length of 32k tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DBRX’s standing on the Open LLM leaderboard is noteworthy, outperforming models like Mistral Instruct and Grok-1 in the majority of benchmarks. Its licensing model is uniquely designed to encourage wide usage while imposing restrictions on very large user bases (more than 700 million monthly active users). Positioned as twice as compute-efficient compared to leading LLMs, DBRX not only sets a new standard for open-source models but also paves the way for customizable, transparent generative AI across various enterprises. Its availability across major cloud platforms and its expected integration into NVIDIA’s ecosystem further underscore its accessibility and potential for widespread adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grok-1: The First Open MoE Model of 300B+ Size
&lt;/h2&gt;

&lt;p&gt;Grok-1 by xAI stands as a pioneering implementation of the Mixture-of-Experts (MoE) architecture in the realm of large-scale LLMs. This transformer-based model features a staggering 314 billion parameters. However, its efficiency is highlighted by the fact that only about 86 billion parameters (approximately 25%) are active for any given token at a time. This selective activation significantly reduces computational demands while maintaining high-performance levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Attributes of Grok-1:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture: Mixture-of-8-Experts, with each token processed by two experts during inference.&lt;/li&gt;
&lt;li&gt;Training: Developed from scratch using a custom stack based on JAX and Rust, without fine-tuning for specific applications.&lt;/li&gt;
&lt;li&gt;Accessibility: Available under the Apache 2.0 license for broad usage, including commercial applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grok-1’s technical specifications are impressive, with 64 transformer layers, 6,144-dimensional embeddings, and the ability to process sequences up to 8,192 tokens long. Despite its large size and the substantial computational resources required (e.g., 8x A100 GPUs), Grok-1’s design facilitates efficient computation, employing bfloat16 precision. Another notable technical detail is the use of rotary positional embeddings to further enhance the model’s capability to manage extensive data sequences efficiently. This model exemplifies the new trend in open-source LLM development, emphasizing the importance of MoE architecture for achieving both scale and efficiency in AI models.&lt;/p&gt;
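&lt;p&gt;The parameter figures above lend themselves to a quick back-of-the-envelope check: with 2 of 8 experts active per token, the active count is roughly the shared (non-expert) parameters plus a quarter of the expert parameters. The helper below is a rough sketch; the shared-parameter figure is inferred from the published totals, not an official breakdown:&lt;/p&gt;

```python
def active_fraction(total_params, shared_params, n_experts, active_experts):
    """Fraction of parameters touched per token in a simple MoE model.

    Assumes the model splits into shared parameters (attention, embeddings)
    plus equally sized experts, of which active_experts run per token.
    """
    expert_params = total_params - shared_params
    active = shared_params + expert_params * active_experts / n_experts
    return active / total_params

# Grok-1: 314B total, ~86B active, 2 of 8 experts per token.
# Solving 86 = shared + (314 - shared) * 2 / 8 gives shared of about 10B.
print(f"{active_fraction(314e9, 10e9, 8, 2):.0%}")  # prints 27%
```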

&lt;h2&gt;
  
  
  Mixtral: Fine-Grained MoE for Enhanced Performance
&lt;/h2&gt;

&lt;p&gt;Mixtral 8x7B, developed by Mistral AI, represents a significant advancement in the mixture-of-experts (MoE) architecture, showcasing the power of fine-grained MoE for enhanced performance in large language models (LLMs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consists of eight experts, each with 7 billion parameters.&lt;/li&gt;
&lt;li&gt;During inference, only two experts are activated per token, reducing computational costs effectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Surpasses the 70-billion-parameter Llama 2 model in performance metrics.&lt;/li&gt;
&lt;li&gt;Offers about six times faster inference, making it a leader in efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multilingual Support and Context Handling:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports multiple languages including English, French, Italian, German, and Spanish.&lt;/li&gt;
&lt;li&gt;Can process up to 32,000 tokens, approximately 50 pages of text, showcasing its robustness in handling extensive data sequences.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An easy way to try out the capabilities of the model is to sign up for access to the AI/ML API.&lt;/p&gt;

&lt;p&gt;Mixtral 8x7B not only excels in general benchmarks, outperforming Llama 2 70B in areas like commonsense reasoning, world knowledge, and code but also demonstrates remarkable proficiency in multilingual benchmarks. This proficiency is particularly notable in French, German, Spanish, and Italian, where it significantly outperforms Llama 2 70B. Additionally, Mixtral’s approach to bias and sentiment, as evidenced in the BBQ and BOLD benchmarks, shows less bias and more positive sentiment compared to its counterparts. This combination of efficiency, performance, and ethical considerations positions Mixtral 8x7B as a model of choice for developers and researchers seeking scalable, high-performance, and ethically conscious LLM solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Trends and Directions in MoE LLMs
&lt;/h2&gt;

&lt;p&gt;Exploring the horizon of large language models (LLMs) reveals a compelling shift towards a more nuanced architecture, the Mixture of Tokens (MoT), promising to address the challenges faced by the Mixture of Experts (MoE). The MoT technique, by blending different token representations, paves the way for a richer data understanding in NLP tasks. Its potential lies in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enhanced Scalability and Efficiency: MoTs tackle MoE’s limitations like training instability and load imbalance head-on, offering a scalable solution without the computational heft.&lt;/li&gt;
&lt;li&gt;Performance and Training Efficiency: By mixing tokens from various examples before presenting them to experts, MoTs not only boost model performance but also streamline the training process.&lt;/li&gt;
&lt;li&gt;Parameter Reduction: A notable achievement is the drastic cut in parameters, showcasing MoT’s capability to deliver high-performing models with fewer resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Models like GLaM by Google and initiatives by Cohere AI underscore the industry’s move towards adopting MoT and refining MoE architectures. These advancements hint at an exciting future where LLMs achieve unprecedented efficiency and specialization, making them more accessible and effective across a wider range of applications. The journey from MoE to MoT represents a significant leap towards overcoming existing barriers, heralding a new era of AI that is more adaptable, efficient, and powerful.&lt;/p&gt;

&lt;p&gt;[1] &lt;a href="https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf"&gt;https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>moe</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
