In the rapidly evolving realm of artificial intelligence and machine learning, OpenAI offers a wide range of API services that can augment and enhance applications. The API provides GPT-3.5, GPT-4, and GPT-4 Turbo for chat completions, a text-to-speech engine called TTS that converts text to natural-sounding speech, Whisper for converting speech to text, an embeddings engine that converts text into numerical form, fine-tuning for training models on custom data, a moderation model that detects sensitive text, and DALL-E for image generation.
The audio endpoint supports text to speech, speech to text, and translation. A text-to-speech request accepts a model, an input, a voice, a response_format, and a speed, and returns audio in the specified format. The model determines how the input is processed; there are currently two, tts-1 and tts-1-hd. The input is a string of text with a maximum length of 4,096 characters. The voice determines which voice speaks the response; there are currently six options. The response_format determines the kind of audio file returned; mp3, opus, aac, and flac are currently supported. The speed can range anywhere from 0.25x to 4x natural spoken speed.
The speech-to-text (transcription) request takes an audio file, a model, a language, a prompt, a response_format, and a temperature, and returns the transcribed text. The only currently supported model is whisper-1. The optional language parameter specifies the language of the input, which can improve accuracy. The optional prompt can include the correct spellings of unusual words, such as brand names, that appear in the audio and might otherwise be mistranscribed; the transcription will follow the capitalization and spelling given in the prompt. The response can be formatted as json, text, srt, verbose_json, or vtt. The temperature is a number between 0 and 1, where 0 produces the most deterministic output and values near 1 produce more random output. The create translation request likewise takes an audio file, a model, a prompt, a response_format, and a temperature, also uses the whisper-1 model, and returns the audio translated into English and transcribed to text (a sketch follows the text-to-speech example below).
An example of a text to speech request:
// Assumes an Express app and an OpenAI client, e.g.:
// const OpenAI = require('openai');
// const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

app.post('/text-to-speech-openai', async (req, res) => {
  const text = req.body.text;
  try {
    // Request audio (mp3 by default) for the given text
    const response = await openai.audio.speech.create({
      model: 'tts-1',
      input: text,
      voice: 'fable'
    });
    // The SDK returns a web Response; convert its body to a Node Buffer
    const buffer = Buffer.from(await response.arrayBuffer());
    res.status(200).set('Content-Type', 'audio/mpeg').send(buffer);
  } catch (error) {
    console.error('error in text to speech: ', error);
    res.status(500).send('Error synthesizing speech from OpenAI');
  }
});
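A similar sketch for speech to text, assuming the same Express app and openai client as above; the route name and the multer upload middleware are illustrative choices, not part of the OpenAI API. Swapping openai.audio.transcriptions.create for openai.audio.translations.create would return an English translation instead.
const fs = require('fs');
const multer = require('multer');
const upload = multer({ dest: 'uploads/' }); // hypothetical temp directory for uploads

app.post('/speech-to-text-openai', upload.single('audio'), async (req, res) => {
  try {
    // Transcribe the uploaded audio file with Whisper
    const transcription = await openai.audio.transcriptions.create({
      model: 'whisper-1',                       // the only supported model
      file: fs.createReadStream(req.file.path), // the uploaded audio file
      response_format: 'json'                   // json, text, srt, verbose_json, or vtt
    });
    res.status(200).send({ text: transcription.text });
  } catch (error) {
    console.error('error in speech to text: ', error);
    res.status(500).send('Error transcribing speech from OpenAI');
  }
});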
The chat completion request takes a messages array, a model, and optional parameters including frequency_penalty, logit_bias, max_tokens, n, presence_penalty, response_format, seed, stop, stream, temperature, top_p, tools, tool_choice, user, and the older function_call and functions. It returns a chat completion object produced by the selected model. The messages array contains objects that establish the role of the AI along with past user messages and AI responses. The create image request takes a prompt, a model, n (the number of images), the image quality, the desired response_format, the size, the style the image should be rendered in, and the user (a sketch follows the chat completion example below).
An example of a chat completion request:
// Assumes axios is imported and OPENAI_API_KEY holds your API key, e.g.:
// const axios = require('axios');
// const OPENAI_API_KEY = process.env.OPENAI_API_KEY;

app.post('/openAIGetResponse', async (req, res) => {
  try {
    const { messages } = req.body;
    // Body of the request: the model plus the conversation so far
    const request = {
      model: "gpt-3.5-turbo",
      messages: messages
    };
    const headers = {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${OPENAI_API_KEY}`
    };
    // Call the chat completions endpoint directly with axios
    const airesp = await axios.post('https://api.openai.com/v1/chat/completions',
      request,
      { headers: headers });
    // The model's reply is the first choice's message content
    const responseText = airesp.data.choices[0].message.content;
    res.status(200).send({ response: responseText });
  } catch (error) {
    console.error(error);
    res.status(500).send('Internal Server Error');
  }
});
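The image generation parameters described above map directly onto a request. A minimal sketch using the openai client; the route name and the parameter values chosen here are assumptions, not requirements.
app.post('/generate-image-openai', async (req, res) => {
  try {
    const image = await openai.images.generate({
      model: 'dall-e-3',        // or 'dall-e-2'
      prompt: req.body.prompt,  // text description of the desired image
      n: 1,                     // number of images to generate
      size: '1024x1024',        // output dimensions
      response_format: 'url'    // 'url' (default) or 'b64_json'
    });
    res.status(200).send({ url: image.data[0].url });
  } catch (error) {
    console.error('error in image generation: ', error);
    res.status(500).send('Error generating image from OpenAI');
  }
});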
The embeddings endpoint converts text into numerical vectors that capture semantic meaning, which applications can use for tasks like search, clustering, and recommendations. Fine-tuning allows the user to train a model on specific bodies of text to shape its responses. The create moderation endpoint detects text that violates OpenAI's usage policies.
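For instance, a minimal embeddings sketch with the same client; the route name and the model choice are assumptions. A moderation request follows the same pattern via openai.moderations.create.
app.post('/embed-openai', async (req, res) => {
  try {
    const result = await openai.embeddings.create({
      model: 'text-embedding-ada-002',
      input: req.body.text      // the string to convert into a vector
    });
    // result.data[0].embedding is an array of floats representing the text
    res.status(200).send({ embedding: result.data[0].embedding });
  } catch (error) {
    console.error('error creating embedding: ', error);
    res.status(500).send('Error creating embedding from OpenAI');
  }
});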
These API services provide many tools that will be useful to enterprises and smaller developers alike. The capabilities offered by AI are expanding rapidly and will redefine what is possible in our digital world.