Viet Hoang

Build a Telegram voice chatbot using ChatGPT API and Whisper

In this article, I will provide you with a step-by-step guide on how to create your own voice chatbot in Telegram. With this, you will be able to engage in conversations with your chatbot in a way that is natural and intuitive.

You can chat with a Telegram bot or send it a voice file, and it will send a response back along with a voice file as a reply. The conversation can continue until you choose to reset it.

You can check the source code here.

https://github.com/ngviethoang/telegram-voice-chatbot

This guide requires you to deploy on your own server with a public IP so you can set up webhooks for your Telegram bot.

I also have a personal bot built for both Messenger and Telegram; you can check it here.

https://github.com/ngviethoang/ai-chatbot

It also integrates DALL-E 2 and other models hosted on Replicate. If you are curious about this, I will write another post about it.

Set up project

We will use Bottender, a framework that makes writing Telegram bots faster. It also supports sessions, which let us store past conversation messages and other data, so it is more convenient to build a chatbot with conversational memory.

To create a project, run this command:

npx create-bottender-app telegram-bot

In this step, select the Telegram platform.

You can check their documentation here for more details. https://bottender.js.org/docs/

We also need to add an Express server for serving static files, and TypeScript so the project is easier to develop and maintain later.

npm install body-parser express
npm install typescript ts-node nodemon --save-dev
# or yarn
yarn add body-parser express
yarn add typescript ts-node nodemon --dev

Update the scripts in the package.json file:

{
  "scripts": {
    "build": "tsc",
    "dev": "nodemon --exec ts-node src/server.ts",
    "lint": "eslint . --ext=js,ts",
    "start": "tsc && node dist/server.js",
    "test": "jest"
  }
}

Add a tsconfig.json file:

{
  "include": ["src/**/*"],
  "exclude": ["**/__tests__", "**/*.{spec,test}.ts"],
  "compilerOptions": {
    "target": "es2016",
    "lib": ["es2017", "es2018", "es2019", "es2020", "esnext.asynciterable"],
    "module": "commonjs",
    "skipLibCheck": true,
    "moduleResolution": "node",
    "esModuleInterop": true,
    "strict": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true,
    "isolatedModules": true,
    "rootDir": "./src",
    "outDir": "./dist",
    "types": ["node", "jest"]
  }
}

Install NPM packages

In this project, we will use the OpenAI APIs: the gpt-3.5-turbo model for chat completion and Whisper to transcribe audio to text.

To generate voice from text, I will use the Azure Speech service, so we will install the microsoft-cognitiveservices-speech-sdk package. You can use any other service, such as Google or Amazon, for this purpose.

We will also install a few packages for helper functions: axios, util and uuid.

# The code in this guide uses the v3 interface of the openai package
# (Configuration and OpenAIApi), so pin it to 3.x
npm install openai@3 gpt-3-encoder microsoft-cognitiveservices-speech-sdk axios util uuid
npm install @types/uuid --save-dev
# Or use yarn
yarn add openai@3 gpt-3-encoder microsoft-cognitiveservices-speech-sdk axios util uuid
yarn add @types/uuid --dev

Telegram setup

Edit the bottender.config.js file to enable the Telegram channel:

module.exports = {
  channels: {
    telegram: {
      enabled: true,
      path: '/webhooks/telegram',
      accessToken: process.env.TELEGRAM_ACCESS_TOKEN,
    },
  },
};

Make sure to set the channels.telegram.enabled field to true.

Create a bot and generate access token

You can get a Telegram bot account and a bot token by sending the /newbot command to @BotFather on Telegram.

After you get your Telegram Bot Token, paste the value into the TELEGRAM_ACCESS_TOKEN field in your .env file:

TELEGRAM_ACCESS_TOKEN=<Your Telegram Bot Token>

Set up commands for bot

Send /setcommands to @BotFather to create the commands for our bot.

new - Clear old conversation and create a new one
voice - Set up voice for bot to speak
language - Set up whisper language

Set up Express server

Replace the contents of the index.js file in the root directory with this code:

/* eslint-disable import/no-unresolved */
module.exports = require('./dist').default;

Create a server.ts file in the src directory and copy this code into it:

import bodyParser from 'body-parser';
import express from 'express';
import { bottender } from 'bottender';

const app = bottender({
  dev: process.env.NODE_ENV !== 'production',
});

const port = Number(process.env.PORT) || 5000;

const handle = app.getRequestHandler();

app.prepare().then(() => {
  const server = express();

  server.use(
    bodyParser.json({
      verify: (req, _, buf) => {
        (req as any).rawBody = buf.toString();
      },
    })
  );

  server.use('/static', express.static('static'));

  server.get('/api', (req, res) => {
    res.json({ ok: true });
  });

  server.all('*', (req, res) => {
    return handle(req, res);
  });

  server.listen(port, () => {
    console.log(`> Ready on http://localhost:${port}`);
  });
});

As you can see, I have a static directory to store all static files. We will use this directory to store the voice files that we generate in the next steps.

Let’s create a static directory in the root directory and a voices directory inside it.

mkdir static
mkdir static/voices

Handling Telegram events

There are three kinds of events from Telegram that we need to handle:

  • Commands
  • Text messages
  • Voice messages

We will go over each of these events in detail.

Delete the old index.js and index.test.js files in the src directory, then create an index.ts file inside src.

Let’s use the router to route each event to its handler.

import { Action, TelegramContext } from 'bottender';
import { router, text } from 'bottender/router';

export default async function App(
  context: TelegramContext
): Promise<Action<any> | void> {
  if (context.event.voice) {
    return HandleVoice;
  }
  return router([
    text(/^[/.](?<command>\w+)(?:\s(?<content>.+))?/i, HandleCommand),
    text('*', HandleText),
  ]);
}
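
To make the command pattern concrete, here is a small standalone sketch of what the regular expression in the router captures. The parseCommand helper is hypothetical and exists only for illustration; in the bot itself, the named groups reach HandleCommand through match.groups.

```typescript
// Same pattern as in the router: a leading "/" or ".", the command name,
// then optional content after a single whitespace character.
const COMMAND_PATTERN = /^[/.](?<command>\w+)(?:\s(?<content>.+))?/i;

// Hypothetical helper, for illustration only.
function parseCommand(input: string): { command: string; content?: string } | null {
  const match = COMMAND_PATTERN.exec(input);
  if (!match || !match.groups) {
    return null; // not a command, so the router would fall through to HandleText
  }
  return { command: match.groups.command, content: match.groups.content };
}
```

For example, "/voice en-US-GuyNeural" yields the command "voice" with the content "en-US-GuyNeural", while "/new" yields the command "new" with no content.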

Handling text messages

First, we will handle the user’s text messages: send them to the ChatGPT API, send the response back to the user, then save both messages so the conversation can continue.

async function HandleText(context: TelegramContext) {
  await context.sendChatAction(ChatAction.Typing);
  let { text, replyToMessage } = context.event;
  // Add reply message to text content
  const { text: replyText } = replyToMessage || {}
  if (replyText) {
    text += `\n${replyText}`
  }

  await handleChat(context, text)
}

Next, let’s write a function to handle chat completion with ChatGPT API.

import { Configuration, OpenAIApi, ChatCompletionRequestMessage } from 'openai';

const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);

export const createCompletion = async (messages: ChatCompletionRequestMessage[], max_tokens?: number, temperature?: number) => {
  const response = await openai.createChatCompletion({
    model: "gpt-3.5-turbo",
    messages,
    max_tokens,
    temperature,
  });
  return response.data.choices;
};

export const createCompletionFromConversation = async (
  context: TelegramContext,
  messages: ChatCompletionRequestMessage[]) => {
  try {
    // limit response to avoid message length limit, you can change this if you want
    const response_max_tokens = 500
    const GPT3_MAX_TOKENS = 4096
    // max_tokens applies to the reply only, and prompt + reply must fit in the
    // context window, so cap the reply by the room left after the prompt.
    // getTokens (not shown here) counts the tokens in the messages.
    const max_tokens = Math.min(response_max_tokens, GPT3_MAX_TOKENS - getTokens(messages))

    const response = await createCompletion(messages, max_tokens);
    return response[0].message?.content;
  } catch (e) {
    console.error(e);
    return null;
  }
};
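
The token arithmetic is easy to get wrong, so here is a short illustration. In the Chat Completions API, max_tokens limits only the generated reply, and the prompt tokens plus max_tokens must fit within the model's context window. The sketch below shows one way to compute that budget; completionBudget and its parameter names are hypothetical, and the prompt token count is assumed to come from a tokenizer such as gpt-3-encoder.

```typescript
// Hypothetical helper: how many tokens can we request for the reply?
function completionBudget(
  promptTokens: number,
  desiredReplyTokens = 500, // the reply limit used in this guide
  contextLimit = 4096 // context window assumed for gpt-3.5-turbo
): number {
  const remaining = contextLimit - promptTokens;
  // Never ask for more than what is left, and never for a negative amount.
  return Math.max(0, Math.min(desiredReplyTokens, remaining));
}
```

A result of 0 means the conversation no longer fits in the context window, which is a good moment to suggest that the user start over with /new.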

We need to add the OPENAI_API_KEY variable to our .env file. You can get your API key here.

https://platform.openai.com/account/api-keys

OPENAI_API_KEY=<Your OpenAI API key>

You can also set the system role message with your own prompt. In this way, your bot will have its own character and can serve your own purposes, such as a personal trainer, advisor, or any character from movies, novels...
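
As a rough sketch of that idea, the system message can be prepended to the stored history before each request. The ChatMessage type, the SYSTEM_PROMPT text and the buildMessages helper below are all hypothetical names used for illustration; the openai package exports its own, richer message type.

```typescript
// Minimal local type for this sketch.
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Hypothetical prompt text; replace it with your own character.
const SYSTEM_PROMPT = 'You are a friendly personal trainer. Keep answers short.';

// Put the system message first, then the saved history, then the new user turn.
function buildMessages(history: ChatMessage[], userText: string): ChatMessage[] {
  return [
    { role: 'system', content: SYSTEM_PROMPT },
    ...history,
    { role: 'user', content: userText },
  ];
}
```

Keeping the system message out of the saved history and adding it at request time means you can change the prompt later without rewriting stored sessions.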

Now, we will send the response back to the user as a message in Markdown format.

We also want the conversation to keep going when the user sends a new message, so we will save these messages. Bottender supports this through session state. You can read about it here.

https://bottender.js.org/docs/the-basics-session

export const handleChat = async (context: TelegramContext, text: string) => {
  // Fall back to an empty history when the session has no conversation yet
  const history = (context.state.context as any[]) || [];
  const response = await createCompletionFromConversation(context, [
    ...history,
    { role: 'user', content: text },
  ]);
  if (!response) {
    await context.sendText('Sorry! Please try again.');
    return;
  }
  const content = response.trim();

  await context.sendMessage(content, { parseMode: ParseMode.Markdown });
  await handleTextToSpeech(context, content, getAzureVoiceName(context));

  // Save the current conversation in the session
  context.setState({
    ...context.state,
    context: [
      ...history,
      { role: 'user', content: text },
      { role: 'assistant', content },
    ],
  });
};

You can also use a different session driver by editing session.driver in the bottender.config.js file.

// bottender.config.js

module.exports = {
  session: {
    driver: 'memory',
    stores: {
      memory: {
        maxSize: 500,
      },
      file: {
        dirname: '.sessions',
      },
      redis: {
        port: 6379,
        host: '127.0.0.1',
        password: 'auth',
        db: 0,
      },
      mongo: {
        url: 'mongodb://localhost:27017',
        collectionName: 'sessions',
      },
    },
  },
};

Send bot’s response as voice

Next, we want to convert this response to a voice file and send it to the user in Telegram, as if the bot were talking to them. To do this, I will use the Azure Speech service.

You can set up Azure service by following the documentation here.

Text-to-speech quickstart - Speech service - Azure Cognitive Services | Microsoft Learn

After creating the Speech resource, let’s set the environment variables in the .env file.

AZURE_SPEECH_KEY=
AZURE_SPEECH_REGION=

The code below converts the bot’s message into an audio file in ogg format.

import { v4 as uuidv4 } from 'uuid';
import { SpeechConfig, AudioConfig, SpeechSynthesizer, ResultReason } from 'microsoft-cognitiveservices-speech-sdk'

export const textToSpeech = async (text: string, outputFile: string, voiceName?: string) => {
  return new Promise((resolve, reject) => {
    // This example requires environment variables named "AZURE_SPEECH_KEY" and "AZURE_SPEECH_REGION"
    const speechConfig = SpeechConfig.fromSubscription(process.env.AZURE_SPEECH_KEY || '', process.env.AZURE_SPEECH_REGION || '');
    const audioConfig = AudioConfig.fromAudioFileOutput(outputFile);

    // The language of the voice that speaks.
    speechConfig.speechSynthesisVoiceName = voiceName || "en-US-JennyNeural";

    // Create the speech synthesizer.
    const synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);

    synthesizer?.speakTextAsync(text,
      function (result) {
        if (result.reason === ResultReason.SynthesizingAudioCompleted) {
          // console.log("synthesis finished.");
        } else {
          console.error("Speech synthesis canceled, " + result.errorDetails +
            "\nDid you set the speech resource key and region values?");
        }
        synthesizer?.close();
        resolve(result);
      },
      function (err) {
        console.trace("err - " + err);
        synthesizer?.close();
        reject(err);
      });
  });
}

export const handleTextToSpeech = async (context: TelegramContext, message: string, voiceName?: string) => {
  try {
    await context.sendChatAction(ChatAction.Typing);

    // set random filename
    const fileId = uuidv4().replace(/-/g, '')
    const outputDir = `static/voices`
    const outputFile = `${outputDir}/voice_${fileId}.ogg`
    const encodedOutputFile = `${outputDir}/voice_${fileId}_encoded.ogg`

    const result = await textToSpeech(
      message || '',
      outputFile,
      voiceName || getAzureVoiceName(context)
    )
    await encodeOggWithOpus(outputFile, encodedOutputFile)

    const voiceUrl = `${process.env.PROD_API_URL}/${encodedOutputFile}`

    await context.sendVoice(voiceUrl)
  } catch (err) {
    console.trace("err - " + err);
  }
}

To send this audio file as a voice message in Telegram, we need one more small step: encoding the ogg file with Opus. See the sendVoice method in the Telegram Bot API documentation for details. One way to do this is with ffmpeg.

Let’s install ffmpeg on our machine first. You can check how to install it here.

On Windows, run this command:

choco install ffmpeg

On Linux, run this command:

sudo apt install ffmpeg

Next, we will run ffmpeg from our code to convert the ogg file into an Opus-encoded file.

import { exec } from 'child_process';
import { promisify } from 'util';

const asyncExec = promisify(exec);

export const encodeOggWithOpus = async (inputFile: string, outputFile: string) => {
  try {
    const { stdout, stderr } = await asyncExec(`ffmpeg -loglevel error -i ${inputFile} -c:a libopus -b:a 96K ${outputFile}`);
    // console.log(stdout);

    if (stderr) {
      console.error(stderr);
    }
  } catch (err) {
    console.error(err);
  }
}
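
Because the ffmpeg command is built by string interpolation and run through a shell, a file name containing spaces or shell metacharacters would break it. A possible alternative, sketched here with Node's built-in execFile, passes the arguments as an array so no shell is involved. The helper names buildOpusArgs and encodeOggWithOpusSafe are hypothetical; the ffmpeg flags are the same as in the command above.

```typescript
import { execFile } from 'child_process';
import { promisify } from 'util';

const asyncExecFile = promisify(execFile);

// Build the ffmpeg argument list; each element is passed verbatim, unquoted.
function buildOpusArgs(inputFile: string, outputFile: string): string[] {
  return ['-loglevel', 'error', '-i', inputFile, '-c:a', 'libopus', '-b:a', '96K', outputFile];
}

// Rejects if ffmpeg is missing or exits with a non-zero code, so the caller
// can decide not to send a voice message for a file that was never created.
async function encodeOggWithOpusSafe(inputFile: string, outputFile: string): Promise<void> {
  await asyncExecFile('ffmpeg', buildOpusArgs(inputFile, outputFile));
}
```

Letting the error propagate, instead of only logging it, gives the caller a chance to handle a failed conversion in its own try/catch.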

Great, after converting this new file, we will send it to the user.

Note that I send this file as a URL, which is why we store these files in the static directory we created earlier. Telegram needs the full URL, so remember to include the domain you use to run this bot.

const voiceUrl = `${process.env.PROD_API_URL}/${encodedOutputFile}`

await context.sendVoice(voiceUrl)

Set the environment variable PROD_API_URL in the .env file to your domain, for example https://example.com

PROD_API_URL=<your api url>
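
Plain string concatenation produces a double slash if PROD_API_URL ends with "/". As a small sketch of a more forgiving approach, the hypothetical buildPublicUrl helper below joins the base URL and the relative file path with the standard URL class.

```typescript
// Hypothetical helper: join the public base URL and a relative file path.
function buildPublicUrl(baseUrl: string, relativePath: string): string {
  // Make sure the base ends with "/" so the path is appended, not substituted.
  const base = baseUrl.endsWith('/') ? baseUrl : `${baseUrl}/`;
  // Drop leading slashes so the path stays relative to the base.
  return new URL(relativePath.replace(/^\/+/, ''), base).toString();
}
```

In the bot this could be used as buildPublicUrl(process.env.PROD_API_URL || '', encodedOutputFile), and the result is the same with or without a trailing slash in the variable.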

Handling commands

Handle all commands with the code below.

async function HandleCommand(
  context: TelegramContext,
  {
    match: {
      groups: { command, content },
    },
  }: any
) {
  switch (command.toLowerCase()) {
    case 'new':
      await clearServiceData(context);
      break;
    case 'voice':
      await setAzureVoiceName(context, content)
      break;
    case 'language':
      await setWhisperLang(context, content)
      break;
    default:
      await context.sendText('Sorry! Command not found.');
      break;
  }
}

With the /new command, we simply clear the conversation data from the state.

export const clearServiceData = async (context: TelegramContext) => {
  context.setState({
    ...context.state,
    context: [],
  });
  await context.sendText('New conversation.');
};

With the /voice command, we save the chosen voice in the settings state.

const getSettings = (context: TelegramContext): any => {
  return context.state.settings || {}
}

export const setSettings = async (context: TelegramContext, key: string, value: string) => {
  let newValue: any = value
  if (value === 'true') {
    newValue = true
  } else if (value === 'false') {
    newValue = false
  }
  context.setState({
    ...context.state,
    settings: {
      ...getSettings(context),
      [key]: newValue,
    },
  })
}

export const setAzureVoiceName = async (context: TelegramContext, voiceName: string) => {
  await setSettings(context, 'azureVoiceName', voiceName)
}
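
The code in this guide calls getAzureVoiceName(context) without showing it. One plausible shape for the lookup, written as a pure function so it can be tried without Bottender, is sketched below. The Settings type and the resolveVoiceName name are assumptions made for this sketch, not the original project's code.

```typescript
// Shape of the settings kept in session state (an assumption for this sketch).
type Settings = { azureVoiceName?: string; whisperLang?: string };

// Hypothetical helper: return the configured voice, or a default when the
// setting is missing or blank (for example after "/voice" with no argument).
function resolveVoiceName(settings: Settings | undefined, fallback = 'en-US-JennyNeural'): string {
  const name = settings?.azureVoiceName?.trim();
  return name ? name : fallback;
}
```

With this in place, getAzureVoiceName(context) could simply return resolveVoiceName(context.state.settings).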

You can check the supported voices here:

Language support - Speech service - Azure Cognitive Services | Microsoft Learn

The /language command works the same way as the /voice command. We will use it to set the Whisper API’s language parameter.

export const setWhisperLang = async (context: TelegramContext, language: string) => {
  await setSettings(context, 'whisperLang', language)
}
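
The Whisper API's language parameter expects an ISO-639-1 code such as "en" or "vi", and the language is detected automatically when the parameter is omitted. Since users type the value by hand, a light normalization step can help. The normalizeWhisperLang helper below is a hypothetical sketch; it only checks the shape of the input, not whether the code is a real language.

```typescript
// Hypothetical helper: accept inputs like "EN" or " vi " and return a lowercase
// two-letter code, or undefined so that Whisper detects the language itself.
function normalizeWhisperLang(input?: string): string | undefined {
  const code = (input || '').trim().toLowerCase();
  return /^[a-z]{2}$/.test(code) ? code : undefined;
}
```

It could be applied inside setWhisperLang before saving the value, or right before calling the transcription API.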

Handling voice messages

When the user sends a voice message to our bot, we need to transcribe it to text. We will use the Whisper API for this.

Handle the voice event with the code below.

async function HandleVoice(context: TelegramContext) {
  await handleAudioForChat(context)
}

export const handleAudioForChat = async (context: TelegramContext) => {
  let transcription: any
  const fileUrl = await getFileUrl(context.event.voice.fileId)
  if (fileUrl) {
    // Pass the language saved by the /language command, if any
    transcription = await getTranscription(context, fileUrl, getSettings(context).whisperLang)
  }
  if (!transcription) {
    await context.sendText(`Error getting transcription!`);
    return
  }

  await context.sendMessage(`_${transcription}_`, { parseMode: ParseMode.Markdown });

  await context.sendChatAction(ChatAction.Typing);
  await handleChat(context, transcription)
}

When we receive a voice event from the webhook, it contains only the file id. So, in the next step, we will use the Telegram API to get the full URL of the voice file.

import axios from "axios"

export const getFileUrl = async (file_id: string) => {
  try {
    const response = await axios({
      method: 'GET',
      url: `https://api.telegram.org/bot${process.env.TELEGRAM_ACCESS_TOKEN}/getFile`,
      params: {
        file_id
      }
    })
    if (response.status !== 200) {
      console.error(response.data);
      return null;
    }

    const filePath = response.data.result.file_path;
    return `https://api.telegram.org/file/bot${process.env.TELEGRAM_ACCESS_TOKEN}/${filePath}`
  } catch (e) {
    console.error(e);
    return null;
  }
}

After receiving the file URL, we will download the file to the static directory in oga format. However, the Whisper API only accepts certain audio formats, such as mp3, so we also use ffmpeg to convert the file to mp3 with this command.

const asyncExec = promisify(exec);

export const convertOggToMp3 = async (inputFile: string, outputFile: string) => {
  try {
    const { stdout, stderr } = await asyncExec(`ffmpeg -loglevel error -i ${inputFile} -c:a libmp3lame -q:a 2 ${outputFile}`);
    // console.log(stdout);

    if (stderr) {
      console.error(stderr);
    }
  } catch (err) {
    console.error(err);
  }
}

Now, let’s send this mp3 file to Whisper API to get the transcription.

import fs from 'fs';

const downloadsPath = './static/voices';

export const getTranscription = async (context: TelegramContext, url: string, language?: string) => {
  try {
    let filePath = await downloadFile(url, downloadsPath);
    if (filePath.endsWith('.oga')) {
      const newFilePath = filePath.replace('.oga', '.mp3')
      await convertOggToMp3(filePath, newFilePath)
      filePath = newFilePath
    }
    const response = await openai.createTranscription(
      fs.createReadStream(filePath) as any,
      'whisper-1',
      undefined, undefined, undefined,
      language,
    );
    return response.data.text
  } catch (e) {
    console.error(e);
    return null;
  }
}
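
The downloadFile helper is used above but not shown. A minimal sketch using only Node's built-in modules might look like the following. The fileNameFromUrl helper and the choice to name the local file after the last segment of the URL path are assumptions, not necessarily what the original project does, and the target directory is assumed to exist already.

```typescript
import * as fs from 'fs';
import * as https from 'https';
import * as path from 'path';
import { pipeline } from 'stream/promises';

// Assumption: name the local file after the last segment of the URL path.
function fileNameFromUrl(url: string): string {
  return path.basename(new URL(url).pathname);
}

// Download `url` into `dir` and resolve with the local file path.
function downloadFile(url: string, dir: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const filePath = path.join(dir, fileNameFromUrl(url));
    https
      .get(url, (res) => {
        if (res.statusCode !== 200) {
          res.resume(); // discard the body so the socket is released
          reject(new Error(`Download failed with status ${res.statusCode}`));
          return;
        }
        // Stream the response to disk; resolves once the file is fully written.
        pipeline(res, fs.createWriteStream(filePath)).then(() => resolve(filePath), reject);
      })
      .on('error', reject);
  });
}
```

Telegram voice files arrive with an .oga extension, so the returned path ends with .oga and triggers the mp3 conversion branch in getTranscription.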

After we get the transcription, the rest works like the text flow: handleAudioForChat passes the transcription to the handleChat function, which handles the user’s message and sends ChatGPT’s reply back to the user.

You can see our bot also converts its reply to speech and responds back to the user.

You can also have a settings option here to disable text message responses and only send back voice to the user.
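
One way to sketch such an option is with a boolean setting, here given the hypothetical name voiceOnly. It could be stored through setSettings, which already converts the strings 'true' and 'false' to booleans. The ReplySettings type and the replyModes helper are illustrative names only.

```typescript
// Hypothetical setting shape for this sketch.
type ReplySettings = { voiceOnly?: boolean };

// Decide which kinds of reply to send for the current settings.
function replyModes(settings: ReplySettings | undefined): { sendText: boolean; sendVoice: boolean } {
  const voiceOnly = settings?.voiceOnly === true;
  // Voice is always sent; text is skipped only when voiceOnly is switched on.
  return { sendText: !voiceOnly, sendVoice: true };
}
```

In handleChat, the sendMessage call could then be wrapped in a check such as if (replyModes(getSettings(context)).sendText), leaving the text-to-speech call unchanged.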

Deployment

Run Bottender on your server with the following command:

npm start
# or use yarn
yarn start

Set Up Webhook for Production

Run this command to set up the webhook for your Telegram bot, assuming your URL is https://example.com/webhooks/telegram:

npx bottender telegram webhook set -w https://example.com/webhooks/telegram

Now you are ready to talk to your bot.

Send a voice message and wait for the bot to respond.

Conclusion

I hope with this guide, you can build your own talking chatbot and have fun talking with it.

Hope you like it!
