Lars Jacobsson for AWS Community Builders

Posted on Jan 9, 2022 • Edited on Jan 14, 2022

The serverless architecture of a talking doorbell

#aws #serverless #iot

In this post I'll describe how I used some of AWS's serverless offerings to build a doorbell that leverages AI and speech synthesis to describe the person or people outside the door.

Here's a video demo. Keep on reading to learn how it's built.

Note: age approximation needs to be worked on

Motivation

My family has gone through quite a few cheap wireless doorbells over recent years. They've all had issues such as breaking randomly and not warning when the battery is low.

We've considered buying a Google Nest Doorbell, but they are quite expensive, so I decided to build my own doorbell that takes the concept to the next level.

The speech synthesis isn't just a fun feature; it could also be useful for visually impaired people.

Tech stack

Hardware

Raspberry Pi 3B+
Raspberry Pi camera module
2 meter Raspberry Pi camera flex cable
Small speaker with 3.5mm AUX interface
Shelly Button 1 (could really be any MQTT enabled button)

AWS services

API Gateway
S3 - for image and speech storage
EventBridge - to act on items created in S3
Lambda - compute
Simple Email Service - to send email alerts
Rekognition - picture analysis
Polly - speech synthesis
IoT - MQTT broker to send messages to the Raspberry Pi
StepFunctions - to orchestrate the above

High level architecture

Putting the hardware together

The first challenge was to get the camera installed outside the door. My door has a roof over it and a window right next to it. I 3D printed a small camera mount and squeezed the ribbon cable between the window and the window frame. Since I put it out, the weather has been both humid and cold (-15C) and the camera is still alive.

I also printed a wall mount for the button and modded it slightly to add my family's surname.

Next I configured a Mosquitto MQTT broker on the Raspberry Pi and configured the Shelly Button to send its events to it.

Raspberry Pi services

The Raspberry Pi is running three services:

camera.py - listens to MQTT messages from the doorbell button, takes a picture and uploads it to S3 via a pre-signed URL it fetches from a Lambda function hosted behind API Gateway*
speech.py - listens to MQTT messages from the AWS IoT endpoint, which contain a pre-signed S3 URL of an audio file describing the guest
battery.py - reports battery status to an API Gateway endpoint*

* I'm using IP restriction on the API Gateway endpoints to only allow traffic from my home network. IPs can be spoofed, so an enhancement here could be to use a Lambda authorizer with a client secret.

My Python skills include a lot of googling and copy/pasting, so I will not go into details on these scripts.
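
For a rough picture of the flow anyway, here is a minimal sketch of what the upload step in camera.py might look like. It is not the actual script: the endpoint URL, the `url` field in the JSON response and the function names are assumptions made for illustration, and taking the picture and subscribing to MQTT are left out.

```python
import json
import urllib.request

# Hypothetical endpoint; the real API Gateway URL is not given in this post.
PRESIGN_ENDPOINT = "https://example.execute-api.eu-west-1.amazonaws.com/url"


def fetch_upload_url(endpoint=PRESIGN_ENDPOINT, opener=urllib.request.urlopen):
    """Ask the API (GET /url) for a pre-signed S3 upload URL.

    Assumes the response body is JSON of the form {"url": "..."}.
    """
    with opener(endpoint) as response:
        return json.loads(response.read())["url"]


def upload_image(image_bytes, upload_url, opener=urllib.request.urlopen):
    """PUT the captured image to S3 using the pre-signed URL."""
    request = urllib.request.Request(
        upload_url,
        data=image_bytes,
        method="PUT",
        headers={"Content-Type": "image/jpeg"},
    )
    with opener(request) as response:
        return response.status
```

The `opener` parameter only exists so the two functions can be exercised without a network connection; on the Raspberry Pi the defaults would be used.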

Cloud infra

When a new image lands in the S3 bucket, a number of independent things need to happen:

The image needs to be analysed for faces, emotion and other labels. This is done in two independent Amazon Rekognition jobs
If there are faces found, generate a human friendly string based on the features and labels in the image
Send this information to a number of destinations - in this case email and MQTT back to the Raspberry Pi
In the case of MQTT also generate a speech synthesis

Whilst all this could be achieved in a single Lambda function, I wanted to achieve a clean orchestration with independently retryable tasks, for which StepFunctions is the obvious choice in AWS.

I'm using the native S3 to EventBridge integration to trigger the state machine with the following event pattern:

S3Event:
  Type: EventBridgeRule
  Properties:
    EventBusName: default
    InputPath: $.detail.object
    Pattern:
      source:
        - aws.s3
      detail-type:
        - Object Created
      detail:
        bucket:
          name:
            - !Ref S3Bucket
        object:
          key:
            - prefix: image/

This rule triggers the state machine when a new image with a prefix of image/ is uploaded to the bucket.
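
To make the matching concrete, here is a small sketch that mirrors the rule in plain Python. It only imitates what EventBridge does with the pattern and the InputPath; the bucket name, key and the trimmed-down event are made up for illustration.

```python
def matches_rule(event, bucket_name):
    """Mirror the event pattern: S3 'Object Created' events for keys under image/."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.s3"
        and event.get("detail-type") == "Object Created"
        and detail.get("bucket", {}).get("name") == bucket_name
        and detail.get("object", {}).get("key", "").startswith("image/")
    )


def state_machine_input(event):
    """Mirror InputPath: $.detail.object, which passes on only the object part."""
    return event["detail"]["object"]


# Trimmed-down example event; bucket name and key are hypothetical.
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "doorbell-bucket"},
        "object": {"key": "image/2022-01-09.jpg", "size": 123456},
    },
}

if matches_rule(sample_event, "doorbell-bucket"):
    print(state_machine_input(sample_event))
```

Since the InputPath selects only the object part of the event, the bucket name is not part of the state machine input and has to come from somewhere else, for example the state machine definition.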

I'm not normally a fan of drag-and-drop, but the StepFunctions Workflow Studio is actually really nice to get that initial ASL generated.

The state machine looks like this and below I'll go through and describe each task.

State machine tasks

1. Parallel image analysis

I want to describe the person at the door both by facial features such as age, gender, beard, emotion, etc., and by other labels, like clothing or other things they might be carrying.
Amazon Rekognition offers two separate endpoints for this: one for face detection and one for label detection.

Both jobs are run on the same image independently of each other, hence I can use a parallel state.

Until recently this would have to be run in a Lambda Function, but since the release of StepFunctions AWS-SDK integration we can now integrate directly with most services without glue code in Lambda.

The output of the parallel state is an array of both tasks' outputs.
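
As a rough illustration, a parallel state along these lines could be built as a Python dictionary and dumped to ASL JSON. This is a sketch and not the project's actual definition: the bucket name is hypothetical and hard-coded here (an assumption), the state names are made up, and `Attributes: ["ALL"]` is my guess at how the age and emotion details are requested.

```python
import json

# Hypothetical bucket name, hard-coded for this sketch.
BUCKET = "doorbell-bucket"


def rekognition_task(action, extra_parameters=None):
    """Build a Task state that calls Rekognition through the AWS SDK integration."""
    parameters = {
        # The object key is read from the state input.
        "Image": {"S3Object": {"Bucket": BUCKET, "Name.$": "$.key"}},
    }
    parameters.update(extra_parameters or {})
    return {
        "Type": "Task",
        "Resource": f"arn:aws:states:::aws-sdk:rekognition:{action}",
        "Parameters": parameters,
        "End": True,
    }


parallel_state = {
    "Type": "Parallel",
    "Branches": [
        {
            "StartAt": "DetectFaces",
            # Attributes ALL asks for age range, emotions, beard and so on.
            "States": {
                "DetectFaces": rekognition_task("detectFaces", {"Attributes": ["ALL"]})
            },
        },
        {
            "StartAt": "DetectLabels",
            "States": {"DetectLabels": rekognition_task("detectLabels")},
        },
    ],
    "Next": "Has faces?",
}

print(json.dumps(parallel_state, indent=2))
```

The results arrive in the same order as the branches, so in this sketch the face detection output would be the first element of the array and the label detection output the second.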

2. Has faces?

This is a choice state that checks if the face detection gave any results. There are three branches of this:

No faces were found - use generic message.
Exactly one face was found - generate description and enrich with results from label detection.
More than one face was found. Since there's no way to link the label detection to the faces I skip the label enrichment and just describe the people by their faces.

The choice state gives each branch exactly one thing to do without if statements. In the case of #1, there's no need to even invoke a Lambda function.
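
The choice state does this branching declaratively in ASL. Purely to illustrate the decision it makes, the same logic can be written as a plain function. This sketch assumes the face detection result is the first element of the parallel state's output, and the returned names are simply the headings used in this post.

```python
def choose_branch(parallel_output):
    """Pick a branch name from the parallel state's output.

    Assumes the face detection result comes first in the array,
    which depends on the order of the branches in the parallel state.
    """
    face_result = parallel_output[0]
    face_count = len(face_result.get("FaceDetails", []))
    if face_count == 0:
        return "Generic message"
    if face_count == 1:
        return "Build single face description"
    return "For each face"
```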

For each face / Build face description string

This is a map state that iterates through each face from the face detection and invokes a function Build face description string. Each function returns a string like "a happy man aged about 39". The output of the map state is an array such as ["a surprised man aged about 25", "a happy woman aged about 32"].
Source
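
A minimal sketch of how such a string could be built from one entry of Rekognition's `FaceDetails` is shown below. It is written in Python for illustration and is not the linked source: the adjective mapping, the use of the age range midpoint and the fallback word "person" are all assumptions.

```python
# Assumed wording for Rekognition's emotion types; the project's own mapping may differ.
EMOTION_ADJECTIVES = {
    "HAPPY": "happy", "SAD": "sad", "ANGRY": "angry", "CONFUSED": "confused",
    "DISGUSTED": "disgusted", "SURPRISED": "surprised", "CALM": "calm",
    "FEAR": "frightened",
}
GENDER_NOUNS = {"Male": "man", "Female": "woman"}


def describe_face(face):
    """Turn one entry of FaceDetails into a short phrase."""
    # Use the emotion Rekognition is most confident about, if any.
    top_emotion = max(
        face.get("Emotions", []), key=lambda e: e["Confidence"], default=None
    )
    adjective = EMOTION_ADJECTIVES.get(top_emotion["Type"]) if top_emotion else None
    noun = GENDER_NOUNS.get(face.get("Gender", {}).get("Value"), "person")
    # Rekognition returns an age range; take the midpoint as the estimate.
    age_range = face["AgeRange"]
    age = round((age_range["Low"] + age_range["High"]) / 2)
    phrase = f"{adjective} {noun}" if adjective else noun
    article = "an" if phrase[0] in "aeiou" else "a"
    return f"{article} {phrase} aged about {age}"
```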

Combine descriptions

This is a simple function that joins the strings from the map state into one string and adjusts it to make it grammatically correct.
Source
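
The joining step could look roughly like the sketch below. The sentence wording ("There is ... at the door") is an assumption based on the generic message used elsewhere in this post, and the linked source may phrase things differently.

```python
def combine_descriptions(descriptions):
    """Join phrases such as 'a happy man aged about 39' into one sentence."""
    if not descriptions:
        return "There is someone at the door"
    if len(descriptions) == 1:
        joined = descriptions[0]
    else:
        # Commas between all but the last two phrases, "and" before the last one.
        joined = ", ".join(descriptions[:-1]) + " and " + descriptions[-1]
    verb = "is" if len(descriptions) == 1 else "are"
    return f"There {verb} {joined} at the door"
```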

Build single face description

When exactly one face is detected we can enrich it with the output from the object label detection. The output of this function can be something like "a surprised man aged about 25 with a baseball cap and a jacket".
Source
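
One way the enrichment could work is sketched below. The list of labels to ignore, the confidence threshold and the limit of two labels are assumptions chosen for the example, not values taken from the linked source.

```python
# Labels that only restate that a person is in the picture (an assumed list).
GENERIC_LABELS = {"Person", "Human", "Face", "Head", "Man", "Woman", "Clothing", "Apparel"}


def with_article(noun):
    """Prefix a noun with 'a' or 'an'."""
    return ("an " if noun[0].lower() in "aeiou" else "a ") + noun


def enrich_description(face_description, labels, min_confidence=80.0, max_labels=2):
    """Append the most confident, non-generic labels to a face description."""
    candidates = [
        label for label in labels
        if label["Name"] not in GENERIC_LABELS and label["Confidence"] >= min_confidence
    ]
    candidates.sort(key=lambda label: label["Confidence"], reverse=True)
    nouns = [with_article(label["Name"].lower()) for label in candidates[:max_labels]]
    if not nouns:
        return face_description
    return f"{face_description} with {' and '.join(nouns)}"
```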

Generic message

This is a pass state that outputs the generic message "There is someone at the door"

Send message to notification channels

In the first version I'm happy with receiving an email when someone is at the door as well as the audio description. This can easily be extended with Slack notifications, etc.

In the 'send email'-branch we need a pre-signed S3 URL for the image before firing off the email which looks like this:
Source

If there is more than one person in the picture, then each person will be described, but without the label enrichment:

Synthesise speech and send to MQTT

This step calls a Lambda function that does the following:

Uses the Amazon Polly SDK to synthesize speech
Uploads audio to S3
Generates a pre-signed URL to the audio
Publishes the URL to an MQTT topic
[Source](https://github.com/ljacobsson/ai-doorbell/blob/main/src/SynthesizeSpeech.js)

The first approach was to use the StepFunctions SDK integration and call Polly's StartSpeechSynthesisTask without a Lambda function, and then act on the audio that it uploaded to S3. However, I saw delays of ~15 seconds before the audio was available, so I decided to pack it all into one function for performance reasons.
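
The four steps can be sketched as follows. The actual function is written in JavaScript; this Python version only illustrates the flow. The `speech/` key prefix, the voice, the five-minute expiry and the `{"url": ...}` payload shape are assumptions, and the three clients are expected to be boto3 clients for `polly`, `s3` and `iot-data` that the caller passes in.

```python
import json
import uuid


def synthesize_and_publish(text, polly, s3, iot, bucket, topic, voice="Joanna"):
    """Synthesize speech, store it in S3 and publish a pre-signed URL over MQTT.

    polly, s3 and iot are expected to be boto3 clients for the "polly", "s3"
    and "iot-data" services; they are passed in so the function is easy to test.
    """
    # 1. Synchronous synthesis, which avoids waiting for an asynchronous task.
    speech = polly.synthesize_speech(Text=text, OutputFormat="mp3", VoiceId=voice)
    audio = speech["AudioStream"].read()

    # 2. Store the audio under a unique key (the speech/ prefix is an assumption).
    key = f"speech/{uuid.uuid4()}.mp3"
    s3.put_object(Bucket=bucket, Key=key, Body=audio, ContentType="audio/mpeg")

    # 3. Create a short-lived download link for the Raspberry Pi.
    url = s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=300
    )

    # 4. Publish the link to the topic that speech.py listens on.
    payload = json.dumps({"url": url}).encode("utf-8")
    iot.publish(topic=topic, qos=1, payload=payload)
    return url
```

Because `synthesize_speech` returns the audio in the same call, nothing has to wait for a file to appear in S3, which is the delay described above.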

REST APIs

There are two API Gateway endpoints:

GET /url - called by the Raspberry Pi to get a pre-signed URL with which it uploads the captured image
PUT /battery - The Shelly button reports its battery level via an MQTT topic on every button press. battery.py PUTs the current level to this endpoint, and if it goes below 20% an email is sent to the user so they can take the button off the wall for a couple of hours to charge.
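
The low-battery check behind PUT /battery could be as simple as the sketch below. The function name and the email text are made up, and `send_email` stands in for whatever wraps Simple Email Service in the real backend.

```python
LOW_BATTERY_THRESHOLD = 20  # percent, the limit mentioned above


def handle_battery_report(level, send_email):
    """Handle a reported battery level (0-100) and alert when it is low.

    send_email is any callable taking (subject, body); in the real backend
    it would presumably wrap Simple Email Service.
    Returns True when an alert was sent.
    """
    if level < LOW_BATTERY_THRESHOLD:
        send_email(
            "Doorbell battery low",
            f"The doorbell button is at {level}%. Time to charge it.",
        )
        return True
    return False
```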

Areas of improvement

Sometimes there's quite a long delay between the ding-dong and the speech. This brings a risk that the description of the person might be read out after the door has been opened which could lead to an uncomfortable situation depending on how the AI interpreted the person. So far, whilst testing, it hasn't said anything rude or offensive.
Train a model to recognise people I know. My fellow AWS Community Builder Pubudu Jayawardana has also built a doorbell that does just that which he describes here
Convert the state machine to an express workflow. This will be faster and cheaper. However, I'm keeping it as standard for now to get the visual debugging option in the console. We don't get enough people visiting to make the pricing benefit of express noticeable.
It currently connects to AWS IoT without [AWS Greengrass](https://aws.amazon.com/greengrass/). If and when this doorbell goes on sale I'll need better fleet management, so I should run the Python scripts under Greengrass.

Summary

This doorbell has been deployed at my home for a week now and is the first of my IoT hobby projects that we've actually found useful.

The first working version of it took two hours to build and didn't use Polly for speech synthesis. Instead it used pyttsx3, which gave the user the voice response a bit faster, but with a very unpleasant tone:

The mechanical bell is something I might bring back in a future version.

You can find the project on GitHub:

ljacobsson / ai-doorbell

AI doorbell that uses speech synthesis to describe the person at the door
