Technical Articles
Dmitry Buslov

Voice bot powered by SAP Conversational AI

These days it is quite easy to build a voice recognition bot based on SAP and open-source technology, and the synergy is clear: if the speech recognition output is slightly wrong, or contains typos, CAI (SAP Conversational AI) can still recognise the correct intent. This is the first part, focused on Docker and CAI settings. In the second part we will go through the publishing process to Kyma.


From an architecture point of view, we are going to connect CAI with a Docker container where all the code for Automatic Speech Recognition (ASR) runs. The Telegram bot token and chat ID (group or personal) will also live in that container.

So, the picture will look like this:

The file structure will be:

  • cai.py – interaction with CAI
  • main.py – ASR and main logic
  • Dockerfile – instructions for the container build

I think you have already guessed that the code will be in Python ;)

Automatic Speech Recognition

There are a lot of different engines for ASR these days. We will use the transformers library from Hugging Face. You can find the full list of available models for your language here:

Also, it is quite easy to replace this model with NVIDIA NeMo.

You can find the relevant tutorials here.
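If you want to support more languages later, a small lookup table is easier to extend than an if/else switch. Here is a minimal sketch: the two checkpoints are the ones used later in this post, while the English fallback for unmapped languages is my own assumption.

```python
# Map a language code to a wav2vec2 checkpoint on the Hugging Face hub.
MODELS = {
    "en": "facebook/wav2vec2-base-960h",
    "ru": "jonatasgrosman/wav2vec2-large-xlsr-53-russian",
}

def model_for(lang_id):
    # Fall back to the English model for languages we have not mapped yet.
    return MODELS.get(lang_id, MODELS["en"])

print(model_for("ru"))  # jonatasgrosman/wav2vec2-large-xlsr-53-russian
```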


All the code here is not production-ready. These are just examples!

So, to bring this idea to life, let's create a folder with three files: cai.py, main.py and a Dockerfile.

cai.py:

from oauthlib.oauth2 import BackendApplicationClient
from requests_oauthlib import OAuth2Session
import uuid
import requests
import json
import os

class CAI:
    oAuthClientID = os.environ['oAuthClientID']
    oAuthClientSecret = os.environ['oAuthClientSecret']
    CAIreqToken = os.environ['CAIreqToken']
    def __init__(self):
        # Fill in your tenant's OAuth token URL and the CAI dialog endpoint,
        # e.g. https://<subaccount>.authentication.eu10.hana.ondemand.com/oauth/token
        # and https://api.cai.tools.sap/build/v1/dialog
        self.oAuthURL = ''
        self.dialogURL = ''
        self.token = self._get_bearer()

    def _get_bearer(self):
        client = BackendApplicationClient(client_id=self.oAuthClientID)
        oauth = OAuth2Session(client=client)
        token = oauth.fetch_token(token_url=self.oAuthURL,
                                  client_id=self.oAuthClientID,
                                  client_secret=self.oAuthClientSecret)
        return token['access_token']

    def get_response(self, text):
        dialogPayload = {"message": {"type": "text", "content": text},
                         "conversation_id": str(uuid.uuid1())}
        dialogHeaders = {
                "Authorization": "Bearer " + self.token,
                "X-Token": "Token " + self.CAIreqToken,
                "Content-Type": "application/json"
        }
        dialogResponse = requests.post(self.dialogURL,
                                       data=json.dumps(dialogPayload),
                                       headers=dialogHeaders)
        return dialogResponse.json()['results']['messages']
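The body posted to the CAI dialog endpoint is tiny; here it is pulled out into a standalone sketch. Note that get_response generates a fresh conversation_id per message, so every request starts a new conversation; passing an existing id (the optional parameter below is my own addition) lets CAI keep context between turns.

```python
import json
import uuid

def build_dialog_payload(text, conversation_id=None):
    # A fresh uuid per call starts a new CAI conversation; pass an existing
    # id instead to keep conversational memory across messages.
    return {
        "message": {"type": "text", "content": text},
        "conversation_id": conversation_id or str(uuid.uuid1()),
    }

print(json.dumps(build_dialog_payload("turn on the lights", "demo-1")))
```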

main.py:
import telegram
from telegram.ext import Updater,MessageHandler,Filters,CommandHandler
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import logging
from cai import CAI
import os


logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO)

# Telegram settings come from environment variables (set via docker run -e)
config = {
    'API_KEY': os.environ['API_KEY'],
    'id': os.environ['id']
}

LANG_ID = "en"  # or "ru"
if LANG_ID == 'ru':
    MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-russian"
else:
    MODEL_ID = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

def get_preds(OUTFILE):
    # Telegram voice notes arrive as 48 kHz OGG files; wav2vec2 expects 16 kHz
    resampler = torchaudio.transforms.Resample(48_000, 16_000)

    def speech_file_to_array_fn(batch):
        speech_array, sampling_rate = torchaudio.load(batch)
        batch = resampler(speech_array).squeeze().numpy()
        return batch

    test_dataset = speech_file_to_array_fn(OUTFILE)

    inputs = processor(test_dataset, sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        if LANG_ID == 'ru':
            logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
        else:
            logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)

c = CAI()

def voice_handler(update, context):
    # download the incoming voice note from Telegram
    file_handler = context.bot.get_file(update.message.voice.file_id)
    file = file_handler.download('./voice.ogg')
    text = get_preds(file)[0]
    logging.info(f'The text - {text}')
    cai_resp = c.get_response(text)
    for i in cai_resp:
        if i['type'] == 'text':
            context.bot.send_message(chat_id=config['id'], text=i['content'])

def text_handler(update, context):
    cai_resp = c.get_response(update.message.text)
    for i in cai_resp:
        if i['type'] == 'text':
            context.bot.send_message(chat_id=config['id'], text=i['content'])

def help_command(update, context):
    update.message.reply_text('Send me a voice or a text message!')

def main() -> None:
    """Run the bot."""
    logging.info('Ready!')
    # Create the Updater and pass it your bot's token.
    updater = Updater(config['API_KEY'])

    # Get the dispatcher to register handlers
    dispatcher = updater.dispatcher
    dispatcher.add_handler(MessageHandler(Filters.voice, voice_handler))
    dispatcher.add_handler(MessageHandler(Filters.text & ~Filters.command, text_handler))

    dispatcher.add_handler(CommandHandler("help", help_command))

    # Start polling Telegram for updates and block until interrupted
    updater.start_polling()
    updater.idle()


if __name__ == '__main__':
    main()
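A side note on what argmax plus batch_decode does inside get_preds: wav2vec2 emits one token per audio frame, and greedy CTC decoding collapses repeated tokens and drops the blank symbol. A toy illustration in plain Python (the underscore blank and the character vocabulary are made up for this example; the real processor uses the model's own vocabulary):

```python
BLANK = "_"  # stand-in for the CTC blank token in this toy example

def ctc_greedy_decode(frame_tokens):
    """Collapse repeats, then drop blanks - greedy CTC decoding."""
    decoded = []
    prev = None
    for tok in frame_tokens:
        if tok != prev and tok != BLANK:
            decoded.append(tok)
        prev = tok
    return "".join(decoded)

# Eleven frames of per-frame best guesses collapse to "hello":
print(ctc_greedy_decode(list("hh_e_ll_llo")))  # hello
```

The blank token is what lets the model emit genuine double letters: the two l's in "hello" survive because a blank separates them.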


Dockerfile:
FROM pytorch/pytorch:latest
# pin python-telegram-bot to the v13 API that main.py uses
RUN pip3 install torchaudio python-telegram-bot==13.15 transformers oauthlib requests-oauthlib
COPY cai.py main.py ./
CMD [ "python3", "main.py" ]

Instructions to start

First of all, after preparing all the files, we have to build the docker image.

We can do it with:

> docker build -t cai .

After that we need some keys.

From CAI we need the ClientID, ClientSecret and Token – you can find all the relevant info in this nice blog post.

Also, we need the Telegram bot token and a group or person ID. I hope you can find those yourself. If not – don't hesitate to ask.

So, we can run our bot locally with this command (just replace the values with yours):

> docker run -d --name cairun -e oAuthClientID='YOUR CAI CLIENT ID' -e oAuthClientSecret='YOUR CAI CLIENT SECRET' -e CAIreqToken='YOUR CAI TOKEN' -e API_KEY='TELEGRAM BOT KEY' -e id='YOUR TELEGRAM ID' cai

After that, you can try it out.

My native language is Russian, so my bot talks Russian. With LANG_ID set to "en", yours will talk English with the help of the wav2vec2 model from Facebook.

Happy voice-botting

As a next step, we will push this container to the Kyma runtime to make it available as a service.

