Speech recognition using Google Dialogflow

Following the recent release of Google Home in Swedish, I have been exploring how speech recognition in Swedish can be leveraged in various applications.

In September 2016, Google acquired a service then known as API.AI, and integrated it with their other services in GCP. It was later renamed to Dialogflow, and is currently – as far as I know – the only major service providing speech recognition in Swedish, using the underlying Google Cloud Speech API to convert voice into text.

While Google Home and Assistant are the featured platforms, Dialogflow also comes with built-in integrations for a number of third-party services, including Facebook, Slack, Twitter, Skype, Alexa, Cortana and several others. You can, of course, also build your own integration using the REST interface.

In this post, I’d like to demonstrate two different applications using the Dialogflow service.

 

KPerson for Google Home & Assistant

The first one uses Dialogflow as a chatbot in Google Home or Assistant, parsing questions into intents and generating a response in the form of voice and text.

I chose to build a simple version of KPerson (the internal Jayway employee directory) that can be interacted with by voice or text.

For example, you can ask it questions such as:

“Hur många arbetar på kontoret i Stockholm?”
“How many people work in the Stockholm office?”

“Vad är könsfördelningen i Sverige?”
“What is the gender distribution in Sweden?”

“Vilka arbetar på kontoret i Malmö?”
“Who works in the Malmö office?”

“Vem är Adam Tibbing?”
“Who is Adam Tibbing?”

 

But you can also ask it more complex questions, where multiple parameters are used:

“Hur många procent kvinnor jobbar i Palo Alto?”
“How many percent women work in Palo Alto?”

“Vad har Adam Tibbing för telefonnummer?”
“What is Adam Tibbing’s phone number?”

“Visa alla som arbetar med .NET i Danmark”
“Show me everyone that works with .NET in Denmark”

To achieve this, you have to set up intents for your agent in the Dialogflow Console. Each intent is trained with one or more phrases, and you can mark which parts of the training phrases are entities that you want to extract. The entities (marked in color in the examples above) can be built-in types like cities, numbers and names, or custom ones that you define yourself.

For each intent, you can either respond with a randomly selected predefined string or forward the request to your fulfillment API using a webhook. The fulfillment request contains information such as the detected intent, its parameters, and a context that can be used for follow-up questions and more complex dialog.
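To make the fulfillment step a bit more concrete, here is a minimal sketch in Python (not the .NET code from the repo). It assumes the Dialogflow v2 webhook format, where the detected intent and parameters arrive under `queryResult` and the reply is returned as `fulfillmentText`. The intent name `office.headcount`, the parameter name `office` and the sample employees are made up for illustration.

```python
# Hypothetical in-memory directory; the real API reads a JSON dump of KPerson.
EMPLOYEES = [
    {"name": "Anna Andersson", "office": "Stockholm"},
    {"name": "Erik Svensson", "office": "Stockholm"},
    {"name": "Lisa Berg", "office": "Malmö"},
]


def handle_fulfillment(request_body: dict) -> dict:
    """Build a webhook response for a Dialogflow v2-style fulfillment request."""
    query_result = request_body.get("queryResult", {})
    intent = query_result.get("intent", {}).get("displayName", "")
    parameters = query_result.get("parameters", {})

    if intent == "office.headcount":  # hypothetical intent name
        office = parameters.get("office", "")  # hypothetical parameter name
        count = sum(1 for e in EMPLOYEES if e["office"].lower() == office.lower())
        text = f"{count} people work in the {office} office."
    else:
        # Fallback when the intent is not one this sketch knows about.
        text = "Sorry, I did not understand that."

    # Dialogflow reads the spoken/displayed reply from "fulfillmentText".
    return {"fulfillmentText": text}
```

In a real deployment this function would sit behind an HTTPS endpoint that the agent's webhook setting points to.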

 

Google provides SDK Libraries for Node.js, Python, Java, Go, Ruby, C# and PHP.

I chose to create a basic .NET Core API in App Engine to handle fulfillments. Beware, though: the .NET SDK is still poorly documented, and it took some testing to figure out how it’s supposed to be used.

KPerson for Google Home infrastructure layout

 

My API uses a JSON dump of the KPerson directory, but could quite easily be integrated directly with KPerson instead, using a service account with the necessary permissions. One piece of data that was not available in KPerson was gender, which is instead guessed from the first names in KPerson using genderize.io.
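As a rough illustration of the gender guessing and the resulting distribution, here is a small Python sketch. It assumes that a genderize.io response is a JSON object with `gender` and `probability` fields (with `gender` possibly null for unknown names). The 0.8 threshold and the function names are my own choices for the example, not something taken from the repo.

```python
from collections import Counter


def guess_gender(response: dict, min_probability: float = 0.8) -> str:
    """Turn one parsed genderize.io-style response into a label.

    Returns "unknown" when no gender is given or the probability is too low.
    """
    gender = response.get("gender")
    probability = response.get("probability") or 0.0
    if gender is None or probability < min_probability:
        return "unknown"
    return gender


def gender_distribution(genders: list) -> dict:
    """Return the share of each label in percent, rounded to one decimal."""
    total = len(genders)
    if total == 0:
        return {}
    counts = Counter(genders)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}
```

Keeping an explicit "unknown" bucket avoids presenting uncertain guesses as facts when answering questions such as the percentage of women in an office.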

Another challenge was determining which name in KPerson the user means. Even though the speech recognition is fairly accurate, names can be spelled in different ways, and some people have double names. I ended up using a best-guess approach, fetching similar names in KPerson and ranking the matches by similarity.
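The best-guess idea can be sketched in a few lines of Python. This is not the algorithm from the repo; it swaps in `difflib.get_close_matches` from the standard library as one possible similarity measure, and the function name, cutoff and sample names are assumptions made for the example.

```python
import difflib


def rank_name_matches(heard: str, directory: list, limit: int = 3, cutoff: float = 0.6) -> list:
    """Return directory names similar to the recognized name, best match first."""
    # Compare case-insensitively, but return the names as stored in the directory.
    by_lowercase = {name.lower(): name for name in directory}
    matches = difflib.get_close_matches(
        heard.lower(), list(by_lowercase), n=limit, cutoff=cutoff
    )
    return [by_lowercase[match] for match in matches]


if __name__ == "__main__":
    names = ["Anna Andersson", "Anna Anderson", "Erik Svensson", "Karl-Johan Lind"]
    # The recognizer heard a slightly different spelling of the first name.
    print(rank_name_matches("Ana Andersson", names))
```

The cutoff decides how forgiving the matching is: a lower value tolerates more spelling variation but also returns more false positives, so it is worth tuning against real names.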

You can view the complete source code in the GitHub repo (restricted to Jayway employees). If you’d like to experiment on your own, the Dialogflow Agent file is included as well as deployment instructions. Or, just ping me if you’d like to be invited as a tester and you can try it on your phone or Google Home device.

 

Doorman fork

The other application I’d like to talk about is a fork of the Doorman Project by Gustaf Nilklint. If you haven’t checked it out already, it’s a great example of how IoT can be used in everyday situations. It uses Amazon Rekognition to detect whether a person standing outside a door is a registered employee of Jayway, and if so, opens the door.

I decided to extend this project by handling non-Jayway employees visiting the office. If the person outside cannot be recognized, the device asks who they are there to meet. If they reply with the name of a Jayway employee, an SMS (or email) is sent to that employee containing a link to open the door remotely.

 

Since Amazon’s own speech recognition service, Amazon Transcribe, does not yet support Swedish, I chose Dialogflow here as well. The recorded audio is sent as a base64-encoded string to a Lambda function, which forwards it to Dialogflow, which in turn returns the detected name.
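To show roughly what the forwarding step involves, here is a Python sketch (the actual Lambda in the fork is written in .NET Core). It assumes the Dialogflow v2 REST `detectIntent` method, which takes the audio as a base64 string in `inputAudio` together with an `audioConfig`. The sample rate, the encoding and the parameter name `name` are assumptions for this example, and the HTTP call and authentication are left out.

```python
import base64


def build_detect_intent_request(audio_base64: str,
                                language_code: str = "sv",
                                sample_rate_hertz: int = 16000) -> dict:
    """Build a JSON body for a Dialogflow v2-style detectIntent call with audio."""
    # Fail early (binascii.Error) if the payload is not valid base64.
    base64.b64decode(audio_base64, validate=True)
    return {
        "queryInput": {
            "audioConfig": {
                # Assumes 16-bit linear PCM; this must match the actual recording.
                "audioEncoding": "AUDIO_ENCODING_LINEAR_16",
                "sampleRateHertz": sample_rate_hertz,
                "languageCode": language_code,
            }
        },
        "inputAudio": audio_base64,
    }


def extract_name(detect_intent_response: dict):
    """Return the detected name, or None if the agent did not extract one."""
    parameters = detect_intent_response.get("queryResult", {}).get("parameters", {})
    return parameters.get("name") or None  # "name" is a hypothetical parameter
```

The body would be posted to the `detectIntent` endpoint of a session in your agent, and the returned name then looked up in KPerson.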

If the name is found in KPerson, a message with a link is sent to that person using Amazon SNS. The link leads to a Lambda function that notifies the IoT hub, which both the RPI and the door lock subscribe to. The RPI listens for messages that it should say out loud, such as “Door unlocked”, and the door lock listens for the signal to unlock.

The initial idea was to be able to reply to the SMS instead of clicking a link. But that would require setting up a short number that can receive incoming messages, and while that can be done in AWS, it’s costly and requires a manual application to be approved by the AWS support team.

There are many different parts and services involved, but here is a general overview of how everything fits together.

Doorman fork infrastructure layout

 

You can view the complete source code on GitHub (restricted to Jayway employees).

 

A couple of notes regarding this fork:

  • The Lambda function checking uploaded photos with Rekognition has been rewritten in .NET Core, and two more Lambda functions have been added: one for parsing uploaded audio data and sending SMS messages, and another for handling clicked links that should open the door.
  • Some logic present in the original project has been stripped, such as adding and blacklisting faces.
  • There are a couple of security aspects that are not covered. For example, the link that is sent can be used multiple times (this is mostly a POC and is not intended to be used in production).
  • The SMS feature is strictly limited by the sandbox account in AWS: only $1 worth of messages (about 5) can be sent per day. For testing purposes, I have mostly used email notifications, but that requires the receiving email address to be approved manually in advance.
  • A USB microphone has to be connected to the RPI, since it does not support analog microphones.
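
As for the reusable link mentioned in the notes above, one common way to close that gap is a single-use token with an expiry time. The sketch below is only an illustration and is not part of the fork. The class name and the five-minute lifetime are assumptions, and the in-memory dictionary would have to be replaced by a shared store in practice, since Lambda functions do not keep state between invocations.

```python
import secrets
import time


class OneTimeLinks:
    """Issue door-opening tokens that can be redeemed once, within a time limit."""

    def __init__(self, ttl_seconds: float = 300.0):
        self._ttl = ttl_seconds
        self._expires_at = {}  # token -> expiry timestamp (in-memory, sketch only)

    def issue(self, now: float = None) -> str:
        """Create a random token to embed in the link that is sent out."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._expires_at[token] = now + self._ttl
        return token

    def redeem(self, token: str, now: float = None) -> bool:
        """Return True only the first time a valid, unexpired token is used."""
        now = time.time() if now is None else now
        # pop() removes the token, so a second click on the same link fails.
        expires_at = self._expires_at.pop(token, None)
        return expires_at is not None and now <= expires_at
```

The Lambda function behind the link would call `redeem` before notifying the door lock, and refuse to unlock when it returns False.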

 
