It’s Subhan here with another exciting episode of developing in python with XSA! In this blog, I am going to talk about how to build conversational interfaces in XSA applications by integrating speech-to-text and text-to-speech APIs. I am also going to briefly cover how web sockets work in python for XSA applications. For those who are not familiar, web sockets are used for live bi-directional communication between clients and a server. Numerous clients can be connected to the same server, through which they receive live updates as events. They do not need to make explicit HTTP calls to check for status updates. In addition, the server can also initiate communication with the client and send updates, as opposed to plain HTTP where the client must always initiate communication, which often leads to polling interfaces.

With rapid growth in speech analysis and audio processing technology, it has become much easier for developers to incorporate conversational interfaces in their applications. This helps diversify the user interface, allowing users greater flexibility for input options. In addition, voice commands and audio playback options come in very handy when saying something is much easier than typing it, or worse, selecting it from long menus. Not to mention, conversational interfaces have become very common in commercial applications, so it is about time they make their way into enterprise applications as well!

Getting Started

I am going to assume you know how to set up a basic python XSA application. If that is not the case, please refer to my first few blogs to be up-to-speed with everything!

For making voice-enabled web applications, the development workflow in general involves configuring a microphone on the front-end browser to capture audio stream from the user, encoding the stream to a valid audio format for the speech recognition API and transferring it to the back-end module. The back-end module then sends the audio stream to the speech-to-text API which responds with the transcript. This transcript is analyzed using Natural Language Processing tools, yielding commands that can be executed. Once the required actions are completed, the back-end can synthesize an audio response using a text-to-speech API and send that back as voice feedback for the user. The diagram below depicts this workflow.
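As a rough sketch of the back-end half of this workflow, the steps can be chained as shown below. The function and parameter names here are hypothetical placeholders, and the three stages are passed in as plain callables so that the flow can be tried out without calling any cloud API.

```python
from typing import Callable, Tuple


def handle_voice_request(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    execute: Callable[[str], Tuple[str, bool]],
    synthesize: Callable[[str], bytes],
) -> dict:
    """Run one round trip: audio in, text (and optionally audio) out."""
    transcript = transcribe(audio)            # speech-to-text step
    reply, read_aloud = execute(transcript)   # NLP + command execution step
    result = {"transcript": transcript, "text": reply}
    if read_aloud:
        result["audio"] = synthesize(reply)   # text-to-speech step
    return result


# Stand-in stages (assumptions), used here only to exercise the flow.
def fake_transcribe(audio: bytes) -> str:
    return "show sales"


def fake_execute(command: str) -> Tuple[str, bool]:
    return ("Showing results for: " + command, True)


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


demo = handle_voice_request(b"\x00\x01", fake_transcribe, fake_execute, fake_synthesize)
```

In a real application each stand-in would be swapped for a call to the corresponding API or to your own command logic.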

There are various API providers you can choose from for speech-to-text and text-to-speech processing including Amazon Web Services (AWS), Microsoft Azure, IBM Watson, and Google Cloud Platform (GCP). I have personally worked with Microsoft, IBM, and GCP, and would recommend GCP as it comes with a python client which makes programming much simpler. The client also handles security for all HTTP and web socket communication itself. You do not need to deal with authorization tokens or API keys or anything. It just requires you to have all your credentials accessible in a JSON file. The documentation for GCP APIs is also quite easy to navigate and understand. You are welcome to check out the other API providers since some of them might suit your needs better. For my demo in this blog, I am going to use GCP!

Before you can start, you need to sign up for GCP if you don’t already have an account. The nice thing is that when you make a new account, you get $300 free credit to use their services! Once you have an account, you can follow the steps at this link to set up a new project, enable the Cloud Speech API, connect it to a service account, and download the credentials file. Make sure you save the credentials file somewhere in your XSA application so that you can use it later. You can also rename it to something nicer if you wish! Follow the same steps as before to enable the Cloud Text-to-Speech API as well. This time, you do not need to create a new service account; you can just connect to the one you created before. You also don’t need to download a new credentials file.

Now you need to download the GCP python clients for speech-to-text and text-to-speech. Execute the following commands from a terminal:

#for local testing
pip3 install google-cloud-speech google-cloud-texttospeech		

#for XSA application deployment
pip3 download google-cloud-speech google-cloud-texttospeech -d /path/to/vendor/folder/of/xsa/python/module		

Test locally before moving on to XSA to check that everything has been set up properly so far. To do this, first you need to set an environment variable so that the python client can access your application credentials. Execute the following command:

#windows
set GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials/json/file	

#linux	
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials/json/file
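If you want to double-check the variable before calling any API, a small helper along these lines can be used. This is only a sketch: the function name check_credentials is made up for illustration, and it assumes the file is a standard service account key containing the type, project_id, and client_email fields.

```python
import json
import os


def check_credentials(env_var: str = "GOOGLE_APPLICATION_CREDENTIALS") -> str:
    """Return the service account e-mail if the credentials file looks usable."""
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(env_var + " is not set")
    if not os.path.isfile(path):
        raise RuntimeError("No file found at " + path)
    with open(path, "r", encoding="utf-8") as f:
        info = json.load(f)
    # Service account key files are assumed to carry these fields.
    for key in ("type", "project_id", "client_email"):
        if key not in info:
            raise RuntimeError("Missing field in credentials file: " + key)
    return info["client_email"]
```

Calling check_credentials() in the same terminal session should return the e-mail address of your service account, or raise an error that says what is missing.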

Run the following python script. This should make a call to the text-to-speech API and save the received audio file to the desktop with name audio.wav. If everything is working fine, the audio file should say “Hello Google World” when played.

from google.cloud import texttospeech as tts

# Note: tts.types and tts.enums belong to the client interface used in this blog
# (client versions before 2.0).
client = tts.TextToSpeechClient()

text = 'Hello Google World'
input_text = tts.types.SynthesisInput(text=text)

voice = tts.types.VoiceSelectionParams(
            language_code = 'en-US',
            ssml_gender = tts.enums.SsmlVoiceGender.FEMALE)

audio_config = tts.types.AudioConfig(
                   audio_encoding = tts.enums.AudioEncoding.LINEAR16)

response = client.synthesize_speech(input_text, voice, audio_config)
binAudio = response.audio_content

# The LINEAR16 response is already wave encoded, so write the bytes as-is.
# The raw string (r'...') keeps the backslashes in the Windows path literal.
with open(r'C:\Users\<user>\Desktop\audio.wav', 'wb') as f:
    f.write(binAudio)

XSA Application Setup

This section should be pretty straightforward! You need to specify the path to the application credentials JSON file that you downloaded earlier in an environment variable of your python module. This allows the python client from GCP to access these credentials when making calls to the APIs. Modify the manifest file in your XSA project as shown to specify the environment variable under your python module:

- name: core-py
  path: ./core-py/
  env:
    GOOGLE_APPLICATION_CREDENTIALS: "./google-credentials.json"
  ...

This should work if you have the credentials file saved in the main directory of your python module! Keep in mind “./” in the path under env section means relative to the python module directory, not relative to the main XSA project directory (where the manifest file is).

You also need to specify dependency on GCP python clients in the requirements.txt file for your python module. Add the following entries to the requirements file:

google_cloud_texttospeech
google_cloud_speech

That’s all the initial setup you need from XSA to start developing using speech-to-text and text-to-speech APIs. You are all ready to move on to the real exciting part now!

Recording Audio from User

You need to program the front-end module in your XSA application to stream microphone input from the user to your backend (python) application. There are a number of JavaScript options available to help you record microphone input, including Pizzicato, the browser's MediaRecorder API, and Recorder.js. Most of them, however, do not let you record audio in a format that the Google Speech API takes as input. At least I couldn’t find a way to make them do so; if you do, feel free to comment below! The Google Speech API requires mono audio, ideally with a sample rate of 16 kHz and a sample width of 2 bytes (16 bit). Mono audio means there is input from only one channel, as opposed to stereo audio which uses two channels. The sample rate of 16 kHz is preferred but can be different. Mono audio input, however, is a must. Most JavaScript recording APIs/libraries only let you record stereo audio with a sample rate of 48 kHz or 44.1 kHz, although these values also depend on your browser defaults and the hardware you have available, so they might be slightly different.
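To see whether a recording meets these requirements once it reaches the back-end, the wave header can be inspected with the python standard library. The sketch below assumes the audio arrives as wave encoded bytes; the helper names describe_wav and is_speech_ready are made up for illustration.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read channel count, sample width and sample rate from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def is_speech_ready(data: bytes) -> bool:
    """True for mono, 16-bit audio (the sample rate is left flexible)."""
    info = describe_wav(data)
    return info["channels"] == 1 and info["sample_width_bytes"] == 2
```

Logging the output of describe_wav for the first few recordings is a quick way to find out what your browser and hardware actually produce.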

Recorder.js is one of the few libraries that lets you record mono audio. It is compatible with most mainstream browsers. There is also a wave audio encoder included in it, which makes it easier for you to share recorded audio with the backend and eventually with the Google Speech API. To view the code for this library, you can visit this repository. You can use this library in your front-end module by including the following line in the html file.

<script src="https://cdn.rawgit.com/mattdiamond/Recorderjs/08e7abd9/dist/recorder.js"></script>

You can now add the following code to your client-side JavaScript file to start recording microphone input when the Start button is clicked. In this code, first you initialize an audio context which represents the main audio system and manages all audio input and output to the browser. Then you gain access to the user’s microphone stream with their consent and connect it to an audio source node in the audio context. Finally, you initialize a recorder object which gets microphone input through the audio source node and starts recording. During initialization, you define the number of channels to be one so that you get a mono audio recording in the end.

var context = new AudioContext();
var rec;          //recorder, shared with the stop handler
var micStream;    //microphone stream, shared with the stop handler

startBtn.addEventListener('click', () => {
    navigator.mediaDevices.getUserMedia({
        audio: true,
        video: false
    }).then((stream) => {
        micStream = stream;
        var microphone = context.createMediaStreamSource(micStream);
        rec = new Recorder(microphone, {
            numChannels: 1
        });
        rec.record();
        console.log('Started recording');
    });
});

Once the user is done recording, you need to turn off the microphone and share the recorded microphone audio data to your back-end module for processing. You can use the following code to do so. This snippet turns off the microphone, sends audio data off for wave format encoding, and eventually shares the encoded audio BLOB (binary large object) to the back-end once the Stop button is clicked. The shareAudio function uses a web socket to send the BLOB to back-end. I will explain how this works in the next section.

stopBtn.addEventListener('click', stopRecording);
function stopRecording(){
    console.log('Stopped recording.');
    rec.stop();	//stop recording
    micStream.getAudioTracks()[0].stop();	//turn off mic
    rec.exportWAV(shareAudio);		//wave encode and share
}

function shareAudio(blob){
    //send blob over to python using web sockets
}

If you are wondering why wave format encoding is needed, well it is not needed per se, but it does make it much easier to handle audio data. The encoder fills in the sample rate of the audio so you don’t need to specify it in your transcription call to Google Speech API. It also transforms raw audio data from Float32Array to Linear PCM 16 bit data which the Google Speech API easily takes as input.
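To make the conversion step more concrete, here is a python sketch of what such an encoder does: it clamps each floating point sample to the range -1.0 to 1.0, scales it to a signed 16 bit integer, and wraps the result in a wave container. This illustrates the idea only and is not the Recorder.js code itself; the function name floats_to_wav and the default sample rate are assumptions.

```python
import io
import struct
import wave


def floats_to_wav(samples, sample_rate=16000):
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))            # clamp out-of-range values
        scale = 0x7FFF if s >= 0 else 0x8000  # int16 range is -32768..32767
        frames += struct.pack("<h", int(s * scale))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes = 16 bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buf.getvalue()
```

The returned bytes start with a wave header that records the channel count, sample width, and sample rate, which is why those values do not have to be passed separately later on.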

Web Sockets

As mentioned earlier, web sockets are used for bi-directional communication between clients and a server, allowing both the client and the server to send and receive information. They entail asynchronous event-based processing, where each trigger initiates an event the results of which can be broadcasted to all clients connected to the server. As a result, all clients stay up-to-date with live information without having to make any HTTP calls at constant intervals to check for updates. You can read more about web sockets here.

The main reason I would recommend using web sockets over HTTP is that they are extremely simple and easy to program. Sending binary data, which you probably will in all speech-interfacing applications, is much easier over web sockets as well. In the code examples below, you’ll see how you can use web sockets with just a few simple lines of code. In addition, based on your use case, real-time multi-client data sharing can also be very useful. For instance, if you are making an application for easier data querying during meetings or in other similar group settings, having one interface that can show live query results to all participants on their screens simultaneously can come in handy. Web sockets also tend to scale better than HTTP polling when many clients are connected in parallel.

There are several ways to implement web sockets. I recommend using Socket.IO as there is a flask-socketio library which allows you to integrate web sockets into a python flask application. Since XSA applications are built on flask, this library comes in quite handy. Socket.IO has various JavaScript libraries as well, which means you can easily implement web sockets between a JavaScript front-end module and a python back-end module. In our case, you will set up the back-end to be the server and the front-end to be the client. Note that Socket.IO falls back to AJAX long polling when web sockets are not available. If you do not want that to happen, you can use the plain WebSocket API instead. To read more about Socket.IO vs. WebSocket API, follow this link.

Client Side

For the client side, you need to import the Socket.IO library in your html file.

<script src="https://cdnjs.cloudflare.com/ajax/libs/socket.io/2.0.3/socket.io.js"></script>

You can start a web socket connection in your JavaScript file as follows:

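var websocketPromise = new Promise((resolve, reject) => {
    var socket = io.connect('wss://' + pythonURL + namespace);
    //pass functions as handlers; calling resolve()/reject() inline would settle the promise right away
    socket.on('connect', () => resolve(socket));
    socket.on('connect_error', reject);
});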

The pythonURL is the direct URL to your python module. I’ll talk about namespace in a bit once we get into server-side implementation.

Once the websocketPromise is resolved, meaning the socket is open to connect, you can define a callback function to specify all event handlers for your web socket. The general syntax for these looks like:

websocketPromise.then((socket) => {
    globalSocket = socket;	//if you need to use socket outside of this scope
    socket.on(<eventTitle>, (input) => {
        //use input as needed
    });
});

You can also send information back to the server using either of the commands below:

socket.send('this is a sample message');
socket.emit('customEvent', 'this is a sample message');

The send command shares the data to the server under “message” event, whereas the emit command lets you define custom events. You can now go back to the shareAudio function from the last section and add in the following command to share the audio BLOB.

globalSocket.emit('streamForTranscription', blob);

Server Side

On the server side, you need to use flask-socketio and eventlet which are both available on PyPI. Eventlet is platform specific, so be sure to download the linux compatible version to your vendor folder for XSA deployment. Make sure you also include both of these libraries in the requirements file for your python module.

The programming for implementing flask-socketio is quite simple. Basically, you need to initialize a SocketIO object and define event handlers for the web socket connection. You can use either the decorator syntax (similar to the default flask route handlers) or class-based namespace syntax. Namespace in this context is almost like a URL which the client connects to. You can define different namespaces to serve different endpoints. Finally, you have to replace the app.run() line that you normally use to start all flask applications, with socketio.run(). All of this is explained in the code snippet below.

import os

from flask import Flask
from flask_socketio import SocketIO
from flask_socketio import send, emit, Namespace

app = Flask(__name__)
socketio = SocketIO(app)    #initialize SocketIO
app_port = int(os.environ.get('PORT', 3000))

###### decorator syntax ######
@socketio.on('myEvent', namespace='/test')
def doSomething(message):
    # event code goes here
    # reply back to client using send('text')
    # or emit('event', 'text')
    pass

###### class-based namespace syntax ######
class customNamespace(Namespace):
    def on_myEvent(self, message):
        # event code goes here
        # reply back using send or emit
        pass
socketio.on_namespace(customNamespace('/test'))	#initialize '/test' namespace

if __name__ == '__main__':
    socketio.run(app, port=app_port)    #not app.run(port)

Using Recorded Audio

Now that you know how to record and share audio data, let's talk about how you can use this data to do meaningful tasks. In my demo application, which is attached at the end of this blog, the main event that processes audio data is as follows:

def on_streamForTranscription(self, blob):
    command = transcribe(blob)
    emit('transcribeSuccess', command, broadcast=True)

    response, read = executeCommand(command)

    if (read):
        resAudio = audioSynthesis(response)
        emit('speechResponse', 
             {
                 "audio": resAudio, 
                 "text": response
             }, broadcast=True)
    else:
        emit('textResponse', response, broadcast=True)

In this function, I start with transcribing the audio input using Google Speech API. The transcribe function is as follows:

def transcribe(blob):
    from google.cloud import speech
    from google.cloud.speech import enums, types

    client = speech.SpeechClient()
    
    audio = types.RecognitionAudio(content = blob)
    config = types.RecognitionConfig(
        language_code = 'en-US'
    )

    response = client.recognize(config, audio)

    # an empty results list means no speech was recognized
    if not response.results:
        return "Error in speech transcription"
    else:
        return response.results[0].alternatives[0].transcript

This function sends audio data to the Google Speech API using the client and returns the first result received. The config object usually takes the sample rate and encoding type as input parameters as well, but those are not needed in our case since our audio is wave encoded and both of those values are included in the encoded object.
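For longer recordings the response may hold more than one result, each covering a consecutive part of the audio. A minimal sketch for joining them is shown below. It assumes the same results, alternatives, and transcript structure as in the transcribe function; the helper name full_transcript and the stand-in response object are made up so that the snippet runs without calling the API.

```python
from types import SimpleNamespace


def full_transcript(response) -> str:
    """Join the top alternative of every result; empty string if none."""
    parts = []
    for result in response.results:
        if result.alternatives:
            parts.append(result.alternatives[0].transcript.strip())
    return " ".join(parts)


# Stand-in object mimicking the shape results -> alternatives -> transcript.
fake_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[SimpleNamespace(transcript="show me sales")]),
    SimpleNamespace(alternatives=[SimpleNamespace(transcript=" for last month")]),
])

joined = full_transcript(fake_response)
```

An empty string from this helper can then be treated the same way as the error case in the transcribe function.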

Once the audio is transcribed, I send the transcription back to the client. Here I set broadcast to True, which ensures that the transcription is sent to all connected clients as opposed to just the one that initiated the streamForTranscription event. The executeCommand function takes in the transcript, analyzes it to extract the user command, and calls the appropriate functions to execute that command. This is where you can make use of Natural Language Processing (NLP), which involves manipulating blocks of text to extract meaning. There are multiple tools you can use for NLP, including the Cloud NLP API provided by Google and the Natural Language Toolkit for python. Also keep in mind that this is an ideal place to make use of some HANA capabilities. You can store the transcripts in a table for persistence and further processing. You can perform sentiment analysis, full text indexing, and text analytics, or you can just use the string to perform a query. For more details on HANA Text Analytics, follow this link!
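If you do not need a full NLP toolkit, a few simple rules can already turn a transcript into a structured command. The sketch below is a deliberately tiny, hypothetical parser and not the executeCommand function from my demo; the action words, the stop words, and the output format are all assumptions chosen for illustration.

```python
import re


def parse_command(transcript: str) -> dict:
    """Very small rule-based parser: finds an action and a numeric limit."""
    words = re.findall(r"[a-z0-9']+", transcript.lower())
    actions = {"show": "select", "list": "select", "count": "count"}
    # First recognized action word wins; None if there is none.
    action = next((actions[w] for w in words if w in actions), None)
    # First number in the transcript is treated as a row limit.
    limit = next((int(w) for w in words if w.isdigit()), None)
    stop = {"me", "the", "a", "of", "for", "top", "please"}
    keywords = [
        w for w in words
        if w not in actions and not w.isdigit() and w not in stop
    ]
    return {"action": action, "limit": limit, "keywords": keywords}
```

For example, parse_command("Show me the top 5 customers") yields the action select, the limit 5, and the single keyword customers, which could then be mapped to a query.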

Finally, there is the part where I respond back to the client after the commands have been executed. For sending back an audio response which can be played back to the user, I call the audioSynthesis function which in turn calls the Google text-to-speech API. The code for it is as follows:

def audioSynthesis(text):
    from google.cloud import texttospeech as tts

    client = tts.TextToSpeechClient()

    input_text = tts.types.SynthesisInput(text=text)

    voice = tts.types.VoiceSelectionParams(
            language_code = 'en-US',
            ssml_gender = tts.enums.SsmlVoiceGender.FEMALE)

    audio_config = tts.types.AudioConfig(
        audio_encoding = tts.enums.AudioEncoding.LINEAR16)

    response = client.synthesize_speech(input_text, voice, audio_config)
    return response.audio_content

This snippet sends the input text to the text-to-speech API which responds back with a wave encoded audio file. The voice type can be modified by changing parameters in the tts.types.VoiceSelectionParams(). The audio response type can also be changed from wave encoded Linear 16 bit to other options if need be. To see what other options are available you can refer to the API documentation here.

Audio Playback in Browser

So far, we have covered how to record audio, send it for processing, execute commands based on the audio, and generate a response which includes an audio file to be played back to the user. To play this audio, you can add the following code to your client-side JavaScript file.

context.decodeAudioData(audioResponse, function(buffer) {
    var source = context.createBufferSource();
    source.buffer = buffer;
    source.connect(context.destination);
    source.start(0);
});

Here, audioResponse is the audio input from the back-end module. The decodeAudioData function decodes the audio input into a PCM AudioBuffer that is ready to be played. You just need to create an audio source node that takes the buffer as its source and connect that node to context.destination, which represents the audio output (speakers). Lastly, you can start playing the audio using source.start().

Ta da! Now you are for real done. I hope all of that information is actually helpful and can get you started with integrating conversational interfaces in your XSA applications. Before you go, I have a quick demo video of a voice command application I built on XSA using python, followed by some next steps you can take if you are interested in the speech interface topic. The code for the demo is available at this repository.

Demo

This application allows users to query data from a HANA database through speech commands. Users can have the application running in multiple windows (or on multiple machines) and all windows should see results from each query live.

Next Steps

You can do so much more within your XSA applications to extend the speech interface I have introduced. There are many services provided by SAP, Google, Microsoft, etc. in addition to just speech-to-text and text-to-speech. One such example is the Speaker Recognition API by Microsoft that can distinguish between different speakers. You can use this API to build applications that behave differently for different people (i.e. show different data results, etc.).

You can also use SAP Translation Hub to accommodate more languages in your application. The Translation Hub also lets you translate complete pages and applications in case you want to develop your front-end in English but want to show it to your user in their preferred language. Another service you can combine speech interface with is the SAP Concur API which allows you to automate expense reporting and easily manage spending for business trips. You can greatly enhance user experience by allowing voice commands when entering an expense. For example, saying “I spent twenty dollars for lunch at Pizza Hut and ten dollars for the ride there” is much easier than having to enter all this information manually to different forms.

Lastly, for my demo application, I have designed it so that the users record the whole command before it is shared with the back-end and transcribed. You don’t have to make it that way. You can also do real-time streaming and start transcription as the user speaks. This does require some knowledge about generators and multi-threading in python. If you are interested, you can find more details and sample code here.
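To give a feel for the generator part, here is a minimal sketch that splits audio bytes into small chunks, the way a streaming call would typically consume them one at a time. The function name audio_chunks is hypothetical, and the default chunk size assumes 16 kHz, 16 bit mono audio, where 3200 bytes correspond to 100 ms.

```python
from typing import Iterator


def audio_chunks(data: bytes, chunk_size: int = 3200) -> Iterator[bytes]:
    """Yield successive slices of raw audio, e.g. 100 ms of 16 kHz 16-bit mono."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]
```

In a live setup the generator would read from a buffer that the web socket handler keeps filling, rather than from a complete recording.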
