Architecture

former_member671115 · ‎09-01-2021

It is now quite easy to build a voice recognition bot with SAP and open source technology, and the synergy is clear: if the voice recognition makes mistakes or the text contains typos, CAI (SAP Conversational AI) can still recognise the correct intent. This is the first part, and it focuses on Docker and the CAI settings. In the second part we will go through the process of publishing to Kyma.

Architecture

From an architecture point of view, we are going to connect CAI with a Docker container in which all the code for Automatic Speech Recognition (ASR) will run. The container is also configured with a specific Telegram bot and an ID (group or personal).

So, the picture will look like this:

The file structure will be:

cai.py - interaction with CAI

voicebot.py - ASR and main logic

Dockerfile - instruction for container build

I think you have already guessed that the code will be in Python ;)
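Put together, the flow described above is: a voice message is transcribed by the ASR model, the text is sent to CAI, and the text replies go back to the chat. Here is a minimal sketch of that pipeline. The function `handle_voice` and the two stand-in functions are hypothetical names used only for illustration; the real ASR and CAI code follows in the Code section.

```python
from typing import Callable, Dict, List


def handle_voice(
    audio_path: str,
    transcribe: Callable[[str], str],
    ask_cai: Callable[[str], List[Dict[str, str]]],
) -> List[str]:
    """Run one voice message through ASR and CAI and return the text replies."""
    text = transcribe(audio_path)   # ASR step (voicebot.py)
    messages = ask_cai(text)        # intent recognition and answer (cai.py)
    return [m["content"] for m in messages if m.get("type") == "text"]


# Stand-ins for the real ASR model and the CAI client, for illustration only.
def fake_transcribe(path: str) -> str:
    return "HELO BOT"  # deliberately misrecognised text


def fake_cai(text: str) -> List[Dict[str, str]]:
    return [{"type": "text", "content": "Hi! How can I help?"}]


replies = handle_voice("voice.ogg", fake_transcribe, fake_cai)
print(replies)  # ['Hi! How can I help?']
```

Passing the two steps in as functions keeps the sketch runnable without a model or CAI credentials.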

Automatic Speech Recognition

There are a lot of different ASR engines now. We will use the transformers library from Hugging Face. You can find the full list of available models for your language here:

Also, it is quite easy to replace this model with NVIDIA NeMo.

You can find relevant tutorials here.
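The wav2vec2 models used later in this post are CTC models (hence the class name Wav2Vec2ForCTC): the model predicts one token per audio frame, and decoding collapses repeated tokens and drops a special blank token. Here is a minimal sketch of that greedy decoding step in plain Python. The toy vocabulary and the position of the blank token are assumptions for the example; in the real code the processor does this for you in `batch_decode`.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int], vocab: Sequence[str], blank_id: int = 0) -> str:
    """Collapse repeated frame predictions and drop the blank token."""
    chars: List[str] = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only when it differs from the previous frame and is not blank
        if token_id != previous and token_id != blank_id:
            chars.append(vocab[token_id])
        previous = token_id
    return "".join(chars)


# Toy vocabulary; index 0 is the blank token (an assumption for this example).
vocab = ["_", "H", "E", "L", "O"]
# Per-frame argmax output: H H _ E L L _ L O O
frames = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)  # HELLO
```

The blank between the two L runs is what lets the decoder output a double letter.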

Code

None of the code here is production ready. These are just examples!

So, to bring this idea to life, let's create a folder with these files:

cai.py, voicebot.py, Dockerfile

cai.py

import json
import os
import uuid

import requests
from oauthlib.oauth2 import BackendApplicationClient
from requests_oauthlib import OAuth2Session


class CAI:
    # Credentials are read from environment variables (see the docker run command)
    oAuthClientID = os.environ['oAuthClientID']
    oAuthClientSecret = os.environ['oAuthClientSecret']
    CAIreqToken = os.environ['CAIreqToken']

    def __init__(self):
        self.oAuthURL = 'https://sapcai-community.authentication.eu10.hana.ondemand.com/oauth/token'
        self.dialogURL = 'https://api.cai.tools.sap/build/v1/dialog'
        self.token = self._get_bearer()

    def _get_bearer(self):
        # OAuth2 client credentials flow: exchange client ID and secret for a bearer token
        client = BackendApplicationClient(client_id=self.oAuthClientID)
        oauth = OAuth2Session(client=client)
        token = oauth.fetch_token(token_url=self.oAuthURL, client_id=self.oAuthClientID,
                client_secret=self.oAuthClientSecret)
        return token['access_token']

    def get_response(self, text):
        # A new conversation_id is generated for every message
        dialogPayload = {"message": {"type": "text", "content": text}, "conversation_id": str(uuid.uuid1())}
        dialogHeaders = {
                "Authorization": "Bearer " + self.token,
                "X-Token": "Token " + self.CAIreqToken,
                "Content-Type": "application/json"
            }
        dialogResponse = requests.post(self.dialogURL, data=json.dumps(dialogPayload), headers=dialogHeaders)
        if dialogResponse.status_code == requests.codes.ok:
            return dialogResponse.json()['results']['messages']
        # Return an empty list on errors so that callers can always iterate over the result
        return []
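Note that get_response creates a new conversation_id for every message. If the bot should keep context between messages, one option is to reuse one ID per Telegram chat. This assumes that CAI uses conversation_id to group messages into a conversation. Here is a minimal sketch; the class `ConversationIds` is a hypothetical helper and not part of the code above.

```python
import uuid
from typing import Dict


class ConversationIds:
    """Keep one CAI conversation_id per Telegram chat (hypothetical helper)."""

    def __init__(self) -> None:
        self._ids: Dict[int, str] = {}

    def for_chat(self, chat_id: int) -> str:
        # Create the ID on first use, then return the same one for later messages
        if chat_id not in self._ids:
            self._ids[chat_id] = str(uuid.uuid4())
        return self._ids[chat_id]


ids = ConversationIds()
first = ids.for_chat(42)
second = ids.for_chat(42)
other = ids.for_chat(7)
print(first == second, first == other)  # True False
```

To use it, get_response would need an extra chat_id argument and would put ids.for_chat(chat_id) into the payload instead of a fresh UUID.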

voicebot.py

import logging
import os

import torch
import torchaudio
# Written against the python-telegram-bot v13 API (Updater, Filters)
from telegram.ext import Updater, MessageHandler, Filters, CommandHandler
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

from cai import CAI

logging.basicConfig(level=logging.INFO)

config = {
    'API_KEY': os.environ['API_KEY'],
    'id': [int(os.environ['id'])],
}

LANG_ID = "en"  # or "ru"
if LANG_ID == 'ru':
    MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-russian"
else:
    MODEL_ID = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)


def get_preds(OUTFILE):
    # Load the audio and resample it from its actual rate to the 16 kHz the model expects
    speech_array, sampling_rate = torchaudio.load(OUTFILE)
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    speech = resampler(speech_array).squeeze().numpy()

    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        if LANG_ID == 'ru':
            logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
        else:
            logits = model(inputs.input_values).logits

    # Greedy decoding: take the most likely token for every frame
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)


c = CAI()


def reply_with_cai(update, text):
    # Send the text to CAI and forward every text message from its answer
    for i in c.get_response(text) or []:
        if i['type'] == 'text':
            update.message.reply_text(i['content'])


def voice_handler(update, context):
    file_handler = context.bot.getFile(update.message.voice.file_id)
    file = file_handler.download('./voice.ogg')
    try:
        text = get_preds(file)[0]
        logging.info(f'The text - {text}')
        reply_with_cai(update, text)
    except Exception:
        logging.exception('Voice message processing failed')
        update.message.reply_text('Sorry!')


def text_handler(update, context):
    reply_with_cai(update, update.message.text)


def help_command(update, context):
    update.message.reply_text('Help!')


def main() -> None:
    """Run the bot."""
    logging.info('Ready!')
    # Create the Updater and pass it your bot's token.
    updater = Updater(config['API_KEY'])

    # Get the dispatcher to register handlers
    dispatcher = updater.dispatcher

    # Only react to messages from the configured chat ID
    allowed = Filters.chat(chat_id=config['id'])
    dispatcher.add_handler(CommandHandler("help", help_command))
    dispatcher.add_handler(MessageHandler(Filters.voice & allowed, voice_handler))
    # Exclude commands so that /help is not sent to CAI as plain text
    dispatcher.add_handler(MessageHandler(Filters.text & ~Filters.command & allowed, text_handler))

    updater.start_polling()
    updater.idle()


if __name__ == '__main__':
    main()
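The if/else on LANG_ID in voicebot.py could also be written as a lookup table, so that adding a language becomes a one-line change. Here is a minimal sketch; the names `MODEL_IDS`, `DEFAULT_LANG` and `pick_model` are hypothetical, while the two model IDs are the ones used above.

```python
# Language code -> Hugging Face model ID (both IDs are the ones used in voicebot.py)
MODEL_IDS = {
    "en": "facebook/wav2vec2-base-960h",
    "ru": "jonatasgrosman/wav2vec2-large-xlsr-53-russian",
}
DEFAULT_LANG = "en"


def pick_model(lang_id: str) -> str:
    """Return the model ID for a language, falling back to English."""
    return MODEL_IDS.get(lang_id, MODEL_IDS[DEFAULT_LANG])


print(pick_model("ru"))  # jonatasgrosman/wav2vec2-large-xlsr-53-russian
print(pick_model("de"))  # facebook/wav2vec2-base-960h (fallback)
```

The fallback mirrors the else branch of the original code. The attention_mask branch in get_preds would still need its own per-model setting.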

Dockerfile

FROM pytorch/pytorch:latest
COPY ./cai.py cai.py
COPY ./voicebot.py voicebot.py
# voicebot.py uses the python-telegram-bot v13 API (Updater, Filters), so stay below v20
RUN pip3 install torchaudio "python-telegram-bot<20" transformers oauthlib requests-oauthlib
CMD [ "python3", "voicebot.py"]

Instructions to start

First of all, once all the files are prepared, we have to build the Docker image.

We can do it with:

> docker build -t cai .

After that, we need some keys.

From CAI we need the Client ID, Client Secret and Token. You can find all the relevant info in this nice blog post.

We also need the Telegram bot token and a group or person ID. I hope you can find them yourself. If not, don't hesitate to ask.

So, we can run our bot locally with this command (just replace the values with yours):

> docker run -d --name cairun -e oAuthClientID='YOUR CAI CLIENT ID' -e oAuthClientSecret='YOUR CAI CLIENT SECRET' -e CAIreqToken='YOUR CAI TOKEN' -e API_KEY='TELEGRAM BOT KEY' -e id='YOUR TELEGRAM ID' cai
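The container reads all five values from environment variables, and a missing one only shows up as a KeyError when the bot starts. Here is a minimal sketch of a check that reports every missing variable at once. The helper `missing_settings` is a hypothetical name; the variable names are the ones from the command above.

```python
from typing import List, Mapping

# Environment variables read by cai.py and voicebot.py
REQUIRED = ["oAuthClientID", "oAuthClientSecret", "CAIreqToken", "API_KEY", "id"]


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED if not env.get(name, "").strip()]


# Example environment with one variable missing and one blank
example_env = {
    "oAuthClientID": "abc",
    "oAuthClientSecret": "def",
    "API_KEY": "123:xyz",
    "id": " ",
}
missing = missing_settings(example_env)
print(missing)  # ['CAIreqToken', 'id']
```

In the bot itself you would call missing_settings(os.environ) before creating the CAI client and exit with a clear message if the list is not empty.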

After that, you can try it out.

My native language is Russian, so my bot talks Russian. This one should talk English with the help of the wav2vec2 model from Facebook.

Happy voice-botting

As the next step, we will push this container to the Kyma runtime to make it available as a service.

Voice bot powered by SAP Conversational AI
