Vikram Vaswani

Posted on May 17, 2022 • Originally published at docs.rev.ai

Transcribe Audio with Automatic Language Identification

#speechrecognition #node #tutorial

By Vikram Vaswani, Developer Advocate

This tutorial was originally published at https://docs.rev.ai/resources/tutorials/transcribe-audio-automatic-language-identification/ on May 16, 2022.

Introduction

Rev AI's Asynchronous Speech-to-Text API can transcribe spoken audio even if it's not in English: simply specify the language of the audio file in your transcription job request. However, this assumes that your application can identify the language before requesting transcription, which may not always be the case.

That's where Rev AI's Language Identification API comes in. This API is able to automatically identify the most probable language used in an audio file. It accepts and analyzes an input audio file and returns a list of possible languages, ranked by confidence.

A unique feature of the Language Identification API is that it performs language identification without requiring a list of possible language codes upfront. This feature eliminates the need to first acquire and validate information on language possibilities, reducing work (and code dependencies) for developers.

This tutorial explains how to integrate the Language Identification API with the Asynchronous Speech-to-Text API. It uses a webhook to create a seamless, asynchronous language identification and transcription process for use in ASR applications.

Assumptions

This tutorial assumes that:

You have a Rev AI account and access token. If not, sign up for a free account and generate an access token.
You have a properly configured Node.js development environment with Node.js v16.x or v17.x. If not, download and install Node.js for your operating system.
You have some familiarity with webhooks. If not, learn the basics of using Rev AI API webhooks and then read about using webhooks to send email notifications on job completion.
You have some familiarity with the Express framework. If not, familiarize yourself with the basics using this example application.
Your webhook will be available at a public URL. If not, or if you prefer to develop and test locally, download and install ngrok to generate a temporary public URL for your webhook.
You have an audio file to transcribe. If not, use this example audio file from Rev AI.

NOTE: The Language Identification API is under active development. Always refer to the API documentation for the most up-to-date information.

Technical approach

There are two stages in performing transcription with automatic language identification.

Stage 1: Language identification

To perform language identification on an audio file, you must submit an HTTP POST request with various parameters (including either the audio file or its URL) to the API endpoint at https://api.rev.ai/languageid/v1/jobs. Here is an example request:

curl -X POST "https://api.rev.ai/languageid/v1/jobs" \
     -H "Authorization: Bearer <REVAI_ACCESS_TOKEN>" \
     -H "Content-Type: application/json" \
     -d '{"media_url":"https://www.rev.ai/FTC_Sample_1.mp3","callback_url":"https://example.com/callback"}'

When a webhook URL is included with the job parameters, as in the example above, then, on job completion, the Language Identification API will send an HTTP POST request containing the job status to the specified webhook URL.
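The same request can also be assembled in Node.js. The following is a minimal sketch, not part of the Rev AI SDK: buildLanguageIdRequest is a hypothetical helper that only builds the request (endpoint, headers, and JSON body) without sending it, so the payload can be inspected before it is handed to an HTTP client.

```javascript
// Sketch: assemble the language identification request from the curl example.
// buildLanguageIdRequest is a hypothetical helper, not a Rev AI SDK method.
const LANGUAGE_ID_JOBS_URL = 'https://api.rev.ai/languageid/v1/jobs';

function buildLanguageIdRequest(token, mediaUrl, callbackUrl) {
  return {
    url: LANGUAGE_ID_JOBS_URL,
    options: {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${token}`,
        'Content-Type': 'application/json'
      },
      // same JSON fields as the -d argument in the curl example
      body: JSON.stringify({ media_url: mediaUrl, callback_url: callbackUrl })
    }
  };
}

const languageIdRequest = buildLanguageIdRequest(
  '<REVAI_ACCESS_TOKEN>',
  'https://www.rev.ai/FTC_Sample_1.mp3',
  'https://example.com/callback'
);
console.log(languageIdRequest.options.body);
```

The returned url and options can then be passed to any HTTP client that accepts a method, headers, and a string body.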

The webhook URL handler will receive and parse this status and, if the job was successful, it will make a GET request to the API endpoint at https://api.rev.ai/languageid/v1/jobs/<ID>/result to obtain the list of identified languages. The most probable language for the submitted audio file is specified in the top_language property of the response. Here is an example response:

{
  "top_language": "en",
  "language_confidences": [
    {
      "language": "en",
      "confidence": 0.907
    },
    {
      "language": "nl",
      "confidence": 0.023
    }
  ]
}
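An application may not want to trust a low-confidence result blindly. The sketch below shows one possible way to read the response above: it returns top_language only when its confidence clears a threshold, and otherwise falls back to a default. The pickLanguage helper, the 0.5 threshold, and the fallback language are illustrative assumptions, not part of the API.

```javascript
// Hypothetical helper: choose a language code from a language identification result.
// minConfidence and fallbackLanguage are illustrative assumptions, not API parameters.
function pickLanguage(result, minConfidence = 0.5, fallbackLanguage = 'en') {
  const candidates = (result && result.language_confidences) || [];
  // find the confidence entry that corresponds to top_language
  const top = candidates.find(c => c.language === result.top_language);
  if (top && top.confidence >= minConfidence) {
    return top.language;
  }
  return fallbackLanguage;
}

// the example response shown above
const exampleResult = {
  top_language: 'en',
  language_confidences: [
    { language: 'en', confidence: 0.907 },
    { language: 'nl', confidence: 0.023 }
  ]
};

console.log(pickLanguage(exampleResult)); // prints "en"
```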

Stage 2: Transcription

With the language identification complete, the webhook URL handler will then trigger a new transcription request to the Asynchronous Speech-to-Text API endpoint at https://api.rev.ai/speechtotext/v1/jobs, passing along the language identification data with the request. Here is an example request:

curl -X POST "https://api.rev.ai/speechtotext/v1/jobs" \
     -H "Authorization: Bearer <REVAI_ACCESS_TOKEN>" \
     -H "Content-Type: application/json" \
     -d '{"media_url":"https://www.rev.ai/FTC_Sample_1.mp3","language":"en","callback_url":"https://example.com/callback"}'

Here too, since a webhook URL is included with the job parameters, the Asynchronous Speech-to-Text API will send an HTTP POST request containing the job status to the specified webhook URL once the job is complete.

The webhook URL handler will check this status and, if it is successful, it will make a GET request to the API endpoint at https://api.rev.ai/speechtotext/v1/jobs/<ID>/transcript to obtain the final transcript. Here is an example transcript response from the API:

{
  "monologues": [
    {
      "speaker": 1,
      "elements": [
        {
          "type": "text",
          "value": "Hi",
          "ts": 0.27,
          "end_ts": 0.32,
          "confidence": 1
        },
        {
          "type": "punct",
          "value": ","
        },
        {
          "type": "punct",
          "value": " "
        },
        {
          "type": "text",
          "value": "my",
          "ts": 0.35,
          "end_ts": 0.46,
          "confidence": 1
        },
        {
          "type": "punct",
          "value": " "
        },
        {
          "type": "text",
          "value": "name's",
          "ts": 0.47,
          "end_ts": 0.59,
          "confidence": 1
        },
        {
          ...
        }
      ]
    },
    {
      ...
    }
  ]
}
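Because punctuation and spacing arrive as separate elements, a readable transcript can be rebuilt by concatenating each element's value in order. The following is a minimal sketch that assumes the response shape shown above; transcriptToText is a hypothetical helper name, and the "Speaker N:" prefix is an arbitrary formatting choice.

```javascript
// Hypothetical helper: flatten a transcript object into plain text,
// one line per monologue, by joining the element values in order.
function transcriptToText(transcript) {
  return (transcript.monologues || [])
    .map(monologue => {
      const text = (monologue.elements || []).map(el => el.value).join('');
      return `Speaker ${monologue.speaker}: ${text}`;
    })
    .join('\n');
}

// the first elements of the example transcript shown above
const exampleTranscript = {
  monologues: [
    {
      speaker: 1,
      elements: [
        { type: 'text', value: 'Hi', ts: 0.27, end_ts: 0.32, confidence: 1 },
        { type: 'punct', value: ',' },
        { type: 'punct', value: ' ' },
        { type: 'text', value: 'my', ts: 0.35, end_ts: 0.46, confidence: 1 },
        { type: 'punct', value: ' ' },
        { type: 'text', value: "name's", ts: 0.47, end_ts: 0.59, confidence: 1 }
      ]
    }
  ]
};

console.log(transcriptToText(exampleTranscript)); // prints "Speaker 1: Hi, my name's"
```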

NOTE: Learn more about submitting an asynchronous transcription job and obtaining a transcript.

Sequence diagram

The following diagram explains the communication between the client and the two APIs visually:

As an alternative to custom-crafting HTTP GET and POST requests to various API endpoints and evaluating the resulting responses, this tutorial uses the Rev AI Node SDK, which provides ready-made, tested and documented methods to communicate with the different Rev AI APIs.

Step 1: Install required packages

This tutorial will use:

The Rev AI Node SDK, to submit transcription requests to, and retrieve transcripts from, the Asynchronous Speech-to-Text API;
The Axios HTTP client, to retrieve language identification results from the Language Identification API;
The Express Web framework and body-parser middleware, to receive and parse webhook requests.

Begin by installing the required packages:

npm i revai-node-sdk axios express body-parser

Step 2: Create a webhook handler

The next step is to define a webhook handler within the application that receives job notifications from the APIs.

The following example demonstrates a webhook handler that receives both language identification and transcription job results from the respective APIs. If the results are successful, it performs the following additional processing:

For language identification jobs, it obtains the list of identified languages and the most probable language, and then initiates an asynchronous transcription request that includes this language information.
For asynchronous transcription jobs, it obtains the final transcript and prints it to the console.

To use this example, replace the <REVAI_ACCESS_TOKEN> placeholder with your Rev AI account's access token.

const { RevAiApiClient } = require('revai-node-sdk');
const bodyParser = require('body-parser');
const express = require('express');
const axios = require('axios');

const token = '<REVAI_ACCESS_TOKEN>';

// create Axios client
const http = axios.create({
  baseURL: 'https://api.rev.ai/',
  headers: {
    'Authorization': `Bearer ${token}`,
    'Content-Type': 'application/json'
  }
});

// create Rev AI API client
const revAiClient = new RevAiApiClient(token);

const getLanguageIdentificationJobResult = async (jobId) => {
  return await http.get(`languageid/v1/jobs/${jobId}/result`,
    { headers: { 'Accept': 'application/vnd.rev.languageid.v1.0+json' } })
    .then(response => response.data)
    .catch(console.error);
};

// create Express application
const app = express();
app.use(bodyParser.json());

// define webhook handler
app.post('/hook', async (req, res) => {
  // get job, media URL, callback URL
  const job = req.body.job;
  const fileUrl = job.media_url;
  const callbackUrl = job.callback_url;
  console.log(`Received status for job id ${job.id}: ${job.status}`);
  // acknowledge the notification right away so the request does not hang
  res.sendStatus(200);

  try {
    switch (job.type) {
      // language job result handler
      case 'language_id':
        if (job.status === 'completed') {
          const languageJobResult = await getLanguageIdentificationJobResult(job.id);
          // retrieve most probable language
          // use as input to transcription request
          const languageId = languageJobResult.top_language;
          console.log(`Received result for job id ${job.id}: language '${languageId}'`);
          const transcriptJobSubmission = await revAiClient.submitJobUrl(fileUrl, {
            language: languageId,
            callback_url: callbackUrl
          });
          console.log(`Submitted for transcription with job id ${transcriptJobSubmission.id}`);
        }
        break;
      // transcription job result handler
      case 'async':
        if (job.status === 'transcribed') {
          // retrieve transcript
          const transcriptJobResult = await revAiClient.getTranscriptObject(job.id);
          console.log(`Received transcript for job id ${job.id}`);
          // do something with transcript
          // for example: print to console
          console.log(transcriptJobResult);
        }
        break;
    }
  } catch (e) {
    console.error(e);
  }
});


// start application on port 3000
app.listen(3000, () => {
  console.log('Webhook listening');
});

Save this code listing as index.js and take a closer look at it:

This code listing begins by importing the required packages and credentials and creating a Rev AI API client RevAiApiClient for the Asynchronous Speech-to-Text API. It also creates an Axios HTTP client http for the Language Identification API.
It starts an Express application on port 3000 and waits for incoming POST requests to the /hook URL route.
When the application receives a POST request at /hook, it parses the incoming JSON message body, extracts the file and callback URLs and checks the job type.
For language identification jobs (type: language_id):
- It checks the job status and if completed, it requests the list of identified languages via the getLanguageIdentificationJobResult() function. The returned object contains a top_language property with the language code for the most probable language.
- It submits the audio file for transcription using the Rev AI API client's submitJobUrl() method. The second argument to this method is an object containing job parameters. Here, the parameters are the webhook URL (callback_url), which is set to the current webhook URL, and the language (language), which is set to the top_language value.
For asynchronous transcription jobs (type: async):
- It checks the job status and if transcribed, it uses the client's getTranscriptObject() method to retrieve the complete transcript as a JSON document. This transcript can then be processed further depending on the requirements of the application. In this illustrative example, it is simply sent to the console but for more complex scenarios, it could be saved to a database, presented to the user for review, or acted upon in a different way.
Errors, if any, in the above process are sent to the console.
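The branching described above can also be isolated into a small pure function, which makes it easy to unit test without a running server or real API calls. This is only a sketch of that idea: nextAction and the action names it returns are hypothetical, and it covers only the two job type and status combinations handled in this tutorial.

```javascript
// Hypothetical helper: decide what the webhook should do with a job notification.
// The returned action names are illustrative labels, not Rev AI API values.
function nextAction(job) {
  if (job.type === 'language_id' && job.status === 'completed') {
    return 'submit_transcription';
  }
  if (job.type === 'async' && job.status === 'transcribed') {
    return 'fetch_transcript';
  }
  // any other type or status is not processed further
  return 'ignore';
}

console.log(nextAction({ type: 'language_id', status: 'completed' })); // prints "submit_transcription"
console.log(nextAction({ type: 'async', status: 'transcribed' }));     // prints "fetch_transcript"
```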

Step 3: Test the webhook

To see the webhook in action, first ensure that you have replaced the placeholders as described in the previous step and then start the application using the command below.

node index.js

Next, submit an audio file for language identification to Rev AI and include the callback_url parameter in your request. This parameter specifies the webhook URL that the Rev AI API should invoke on job completion.

Here is an example of submitting an audio file with a webhook using curl.

curl -X POST "https://api.rev.ai/languageid/v1/jobs" \
     -H "Authorization: Bearer <REVAI_ACCESS_TOKEN>" \
     -H "Content-Type: application/json" \
     -d '{"media_url":"<URL>","callback_url":"http://<WEBHOOK-HOST>/hook"}'

Replace the <REVAI_ACCESS_TOKEN> placeholder with your Rev AI access token and the <URL> placeholder with the direct URL to your audio file. Additionally, replace the <WEBHOOK-HOST> placeholder as follows:

If you are developing and testing in the public cloud, your Express application will typically be available at a public domain or IP address. In this case, replace the <WEBHOOK-HOST> placeholder with the correct domain name or IP address, including the port number 3000 if required.
If you are developing and testing locally, your Express application will not be available publicly and you must therefore configure a public forwarding URL using a tool like ngrok. Obtain this URL using the command ngrok http 3000 and replace the <WEBHOOK-HOST> placeholder with the temporary forwarding URL generated by ngrok.
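Either way, the final callback URL is the host followed by the /hook route. As a minimal sketch, the hypothetical buildCallbackUrl helper below derives it from a bare host or a full forwarding URL; the host values used are placeholders, and defaulting to http when no scheme is given is an assumption that mirrors the curl example above.

```javascript
// Hypothetical helper: build the webhook callback URL from a host or forwarding URL.
// Assumes http when the host is given without a scheme.
function buildCallbackUrl(host, path = '/hook') {
  const base = /^https?:\/\//i.test(host) ? host : `http://${host}`;
  return new URL(path, base).href;
}

// placeholder hosts for illustration only
console.log(buildCallbackUrl('https://abc123.ngrok.io')); // prints "https://abc123.ngrok.io/hook"
console.log(buildCallbackUrl('example.com:3000'));        // prints "http://example.com:3000/hook"
```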

Once the job is processed, the Rev AI Language Identification API will send a POST request to the webhook URL. This will trigger the process described above and shortly after, the transcript will be printed to the console. The transcript can also be viewed through the Rev AI dashboard.

If the webhook doesn't work as expected, you can test and inspect the webhook data.