Face recognition with AWS and Android Things

This blog post is part 2 in our series about Wallace, our four-wheeled company house robot powered by Android Things on a Raspberry Pi 3. You can find the first part of the series here.

In our quest to extend the capabilities of Wallace, we turn our focus to face recognition. One of our goals with Wallace was that he should be able to recognize a person’s face and greet them by saying their name. As it turned out, a couple of our colleagues, David Tran and Jonathan Böcker, had already built a smart mirror, which was Electron-based but used Amazon’s object and face recognition service, Amazon Rekognition. The smart mirror had been trained on a set of pictures of all employees, and we decided to use that as our data source.

A brief history

When we started working on adding vision capabilities to Wallace, the latest Android Things version was Dev Preview 5.1. We experienced issues with the Pi Camera Module V2 on that version, where the platform failed to open a stream from the camera. However, we discovered that a USB camera worked perfectly, even though USB cameras were not officially supported.

On Dev Preview 6 and beyond the roles were reversed, and we switched to using the Pi Camera. This is the setup we prefer, as we’re also using a USB sound card with both a microphone and a speaker: we discovered that the USB camera would be detected as an audio source/destination by the AudioManager, which prevented the USB sound card from being detected.

Hardware

Camera

  • Microsoft LifeCam USB camera (Android Things Dev Preview 5.1)
  • Raspberry Pi Camera Module V2 (Android Things Dev Preview 6 and beyond)

Audio

  • USB sound card plugged into the RPi3
  • Panda head speaker connected via a 3.5mm jack on the USB sound card
  • Microphone connected via a 3.5mm jack on the USB sound card
[Image: Wallace, the robot, with the panda speaker and a camera mounted on top]

The speaker (panda) with the USB camera mounted on top. The top of Wallace is getting a bit crowded with the microphone in the back almost falling off.


Implementation

The rough plan was to:

  • Capture an image with the camera
  • Pass the image to Google’s Mobile Vision library
  • If a face is detected, pass the image to AWS Rekognition
  • If the face is recognized, speak the name of the person

Google Mobile Vision

The Mobile Vision library has a face detector that we’ll use as a face recognition trigger. The detector also provides certain features of the detected face(s), one of which is a confidence value for whether the person is smiling. To save bandwidth, we only trigger face recognition if the “isSmiling” confidence value is above 95%.
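A minimal sketch of how such a trigger can be wired up with Mobile Vision’s FaceDetector (the helper names and the exact threshold check below are our own):

```kotlin
import android.content.Context
import android.graphics.Bitmap
import com.google.android.gms.vision.Frame
import com.google.android.gms.vision.face.FaceDetector

// Build a detector with classification enabled so that smile
// probabilities are reported for each detected face.
fun createFaceDetector(context: Context): FaceDetector =
    FaceDetector.Builder(context)
        .setClassificationType(FaceDetector.ALL_CLASSIFICATIONS)
        .setTrackingEnabled(false)
        .build()

// Returns true if any detected face is smiling with > 95% confidence.
fun shouldTriggerRecognition(detector: FaceDetector, bitmap: Bitmap): Boolean {
    val frame = Frame.Builder().setBitmap(bitmap).build()
    val faces = detector.detect(frame)
    return (0 until faces.size()).any { i ->
        faces.valueAt(i).isSmilingProbability > 0.95f
    }
}
```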

The Android Things image can be preloaded with Play Services. Once preloaded, it will not be updated, so to avoid version mismatches we need to make sure we use the same version of the client library as the one baked into the image. Mobile Vision relies on a native library to do its face detection, which is downloaded over the network and stored to disk. As we discovered, the download can take a while, and until it has finished the APIs will just return empty detection results.
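One way to guard against this is to check the detector’s isOperational flag before trusting a frame’s results; a small sketch:

```kotlin
import android.util.Log
import com.google.android.gms.vision.face.FaceDetector

private const val TAG = "FaceRecognition"

// The detector reports whether its native face-detection library has
// finished downloading. Until then, detect() silently returns empty
// results, so we skip frames instead of treating them as "no faces".
fun isDetectorReady(detector: FaceDetector): Boolean {
    if (!detector.isOperational) {
        Log.w(TAG, "Mobile Vision native library not yet available, skipping frame")
        return false
    }
    return true
}
```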

AWS

Amazon’s Android SDK is quite modular and we’re only picking libraries for the services we are going to interface with – Amazon Rekognition, DynamoDB and DynamoDB Mapper.
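For reference, a dependency block along these lines (Gradle Kotlin DSL; the version is a placeholder) pulls in just those modules:

```kotlin
// build.gradle.kts -- only the AWS modules we actually talk to.
// The version below is a placeholder; use the latest 2.x release.
dependencies {
    implementation("com.amazonaws:aws-android-sdk-rekognition:2.+")
    implementation("com.amazonaws:aws-android-sdk-ddb:2.+")
    implementation("com.amazonaws:aws-android-sdk-ddb-mapper:2.+")
}
```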

Kotlin coroutines

While not essential, we found that Kotlin’s coroutines made the code for the asynchronous calls to AWS more readable than it would have been with a mechanism like AsyncTask.
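As an illustration, a blocking SDK call can be wrapped in a suspending function and moved off the main thread; this sketch uses the current kotlinx.coroutines API:

```kotlin
import com.amazonaws.services.rekognition.AmazonRekognitionClient
import com.amazonaws.services.rekognition.model.SearchFacesByImageRequest
import com.amazonaws.services.rekognition.model.SearchFacesByImageResult
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Sketch: run the blocking AWS SDK call on an IO dispatcher so it can be
// awaited from a coroutine without blocking the main thread.
suspend fun searchFaces(
    client: AmazonRekognitionClient,
    request: SearchFacesByImageRequest
): SearchFacesByImageResult = withContext(Dispatchers.IO) {
    client.searchFacesByImage(request)
}
```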

Camera

Android’s Camera2 API provides a lot of flexibility in terms of how pictures should be captured and rendered, and thankfully there are a couple of great examples out there that show how to set up and use the camera, such as https://github.com/androidthings/doorbell and https://github.com/googlesamples/android-Camera2Video. The Raspberry Pi is currently limited to rendering to a single surface, which meant that we couldn’t render to a TextureView and an ImageReader’s surface at the same time. This is fine when Wallace is roaming around, but during development we found it useful to be able to show the camera output on a connected display by drawing to an ImageView.
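To illustrate the single-surface constraint, here is a simplified sketch of a preview session whose only output is an ImageReader (error handling and the camera-open flow are omitted, and the function and parameter names are our own):

```kotlin
import android.graphics.ImageFormat
import android.hardware.camera2.CameraCaptureSession
import android.hardware.camera2.CameraDevice
import android.media.ImageReader
import android.os.Handler

// Sketch: a repeating preview request whose only target is an ImageReader,
// staying within the single-surface limitation on the Raspberry Pi.
fun startPreview(camera: CameraDevice, handler: Handler, width: Int, height: Int): ImageReader {
    val reader = ImageReader.newInstance(width, height, ImageFormat.YUV_420_888, 2)
    camera.createCaptureSession(
        listOf(reader.surface),
        object : CameraCaptureSession.StateCallback() {
            override fun onConfigured(session: CameraCaptureSession) {
                val request = camera.createCaptureRequest(CameraDevice.TEMPLATE_PREVIEW)
                    .apply { addTarget(reader.surface) }
                    .build()
                session.setRepeatingRequest(request, null, handler)
            }

            override fun onConfigureFailed(session: CameraCaptureSession) {
                // In a real app, log and recover here.
            }
        },
        handler
    )
    return reader
}
```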

Image conversions

The camera produces frames in the YUV_420_888 format. Mobile Vision expects the image format to be NV16, NV21, or YV12. Amazon Rekognition in turn expects images in an RGB format (we’re using ARGB_8888), which is also the format the debug ImageView expects. The conversion between these formats also needed to be reasonably performant, since we’re going to try to process images at the pace the camera delivers preview frames. While the Android ecosystem is blessed with a smörgåsbord of image loading libraries, we could not find any that converts images between all the formats we needed, so we developed utility methods for this, which we are planning to open-source.
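Our utility methods aren’t published yet, but to illustrate the kind of work involved, here is a minimal sketch of a YUV_420_888 to NV21 conversion. It assumes tightly packed planes; real camera buffers often carry row and pixel strides that a production converter must honour.

```kotlin
import android.media.Image

// Minimal sketch of a YUV_420_888 -> NV21 conversion, assuming tightly
// packed planes (no row padding).
fun yuv420888ToNv21(image: Image): ByteArray {
    val width = image.width
    val height = image.height
    val ySize = width * height
    val nv21 = ByteArray(ySize + ySize / 2)

    // Copy the luma plane as-is.
    image.planes[0].buffer.get(nv21, 0, ySize)

    // NV21 stores interleaved V/U pairs after the Y plane.
    val uBuffer = image.planes[1].buffer
    val vBuffer = image.planes[2].buffer
    val uPixelStride = image.planes[1].pixelStride
    val vPixelStride = image.planes[2].pixelStride
    for (i in 0 until ySize / 4) {
        nv21[ySize + i * 2] = vBuffer.get(i * vPixelStride)
        nv21[ySize + i * 2 + 1] = uBuffer.get(i * uPixelStride)
    }
    return nv21
}
```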

Recognizing a face

Once we have the image converted to RGB, we compress it to a JPEG ByteBuffer to save network bandwidth. The AWS SDK provides an AmazonRekognitionClient class, and we call its searchFacesByImage method with a SearchFacesByImageRequest containing the collection ID (a String that all the indexed face images are assigned to) and the image ByteBuffer. If there are face matches, the result contains a face ID for each recognized face. The ID is just a generated hash, so to have a meaningful greeting we keep a mapping between face ID and employee name in a DynamoDB table. To load that mapping we first declare it using Kotlin’s neat data class declaration, roughly along these lines (the table and attribute names below are illustrative):
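```kotlin
import com.amazonaws.mobileconnectors.dynamodbv2.dynamodbmapper.DynamoDBHashKey
import com.amazonaws.mobileconnectors.dynamodbv2.dynamodbmapper.DynamoDBTable

// Sketch of the mapping class. The table and attribute names are
// illustrative; DynamoDBMapper needs a no-arg constructor, hence the
// default values.
@DynamoDBTable(tableName = "face-mappings")
data class FaceMapping(
    @get:DynamoDBHashKey(attributeName = "faceId")
    var faceId: String = "",
    var name: String = ""
)
```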

We then pass the FaceMapping class and the face ID to DynamoDBMapper’s load method and get an object containing the name back, which we’ll pass on to the last step.
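Putting the pieces together, the lookup ends up looking roughly like this sketch (credential setup is omitted and the collection ID is a placeholder):

```kotlin
import com.amazonaws.mobileconnectors.dynamodbv2.dynamodbmapper.DynamoDBMapper
import com.amazonaws.services.rekognition.AmazonRekognitionClient
import com.amazonaws.services.rekognition.model.Image
import com.amazonaws.services.rekognition.model.SearchFacesByImageRequest
import java.nio.ByteBuffer

// Sketch: send the JPEG bytes to Rekognition and, if a face matched,
// resolve the face ID to a name via the DynamoDB mapping table.
// "wallace-faces" is a placeholder collection ID.
fun recognize(
    rekognition: AmazonRekognitionClient,
    mapper: DynamoDBMapper,
    jpegBytes: ByteBuffer
): String? {
    val request = SearchFacesByImageRequest()
        .withCollectionId("wallace-faces")
        .withImage(Image().withBytes(jpegBytes))

    val bestMatch = rekognition.searchFacesByImage(request)
        .faceMatches
        ?.firstOrNull()
        ?: return null  // No known face in the picture.

    return mapper.load(FaceMapping::class.java, bestMatch.face.faceId)?.name
}
```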

Speaking the name

Android’s speech synthesis API, TextToSpeech, is very easy to set up and use. Just create an instance, set a supported language and call the speak method with a phrase, which in this case is just “Hi {name}”. A minimal sketch of that wiring (the class and utterance ID below are our own):
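```kotlin
import android.content.Context
import android.speech.tts.TextToSpeech
import java.util.Locale

// Sketch of the greeting step. The utterance ID is arbitrary, and
// shutdown() should be called when the app stops to free the engine.
class Greeter(context: Context) : TextToSpeech.OnInitListener {

    private val tts = TextToSpeech(context, this)

    override fun onInit(status: Int) {
        if (status == TextToSpeech.SUCCESS) {
            tts.setLanguage(Locale.US)
        }
    }

    fun greet(name: String) {
        tts.speak("Hi $name", TextToSpeech.QUEUE_FLUSH, null, "greeting")
    }

    fun release() {
        tts.shutdown()
    }
}
```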

We just need to remember to release the native resources of the TTS engine when the app shuts down, but apart from that this is all we need to do to let Wallace speak the name of the person he just recognized.

Summary

In this post we’ve given an overview of how an Android Things device can be used to detect and recognize faces.

Google’s Mobile Vision and Amazon’s Rekognition libraries are very convenient to use and quick to get up to speed with. Most of our effort went into implementing a stable camera interface, and since the camera is quite a resource-heavy component, we had to tune the image capture to make sure it let the other components of the app run smoothly.

The face recognition feature required us to pull in quite a few libraries, and when we hit the DEX method limit we tried multidex, but quickly discovered performance degradation. That led us to skip multidex and always enable ProGuard instead.

Stay tuned – in the next part of the Wallace blog post series, we’ll show how to add support for Google Assistant using a custom keyword as the trigger.
