<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matt Grofsky</title>
    <description>The latest articles on DEV Community by Matt Grofsky (@code_munkee).</description>
    <link>https://dev.to/code_munkee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F510496%2F48806bae-ced3-4c75-b2f0-8d47b27ef8ae.jpg</url>
      <title>DEV Community: Matt Grofsky</title>
      <link>https://dev.to/code_munkee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/code_munkee"/>
    <language>en</language>
    <item>
      <title>Google AI Vision &amp; Text to Speech on a Raspberry Pi</title>
      <dc:creator>Matt Grofsky</dc:creator>
      <pubDate>Wed, 11 Nov 2020 16:05:31 +0000</pubDate>
      <link>https://dev.to/code_munkee/google-ai-vision-text-to-speech-on-a-raspberry-pi-48en</link>
      <guid>https://dev.to/code_munkee/google-ai-vision-text-to-speech-on-a-raspberry-pi-48en</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--okNZRtYz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/t5yol44bdvayp8759jfe.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--okNZRtYz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/t5yol44bdvayp8759jfe.jpg" alt="AI Vision"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the CTO of &lt;a href="https://www.ytel.com"&gt;Ytel, Inc.&lt;/a&gt;, I work a lot with communications technology and machine learning. MMS is the de facto standard for sending photos back and forth on a mobile device, outside of downloaded OTT applications.&lt;/p&gt;

&lt;p&gt;RCS is now starting to appear on mobile devices, and media sharing is expected to accelerate. I thought it would be interesting to see how hard it would be to build an IoT-style device, outside of Google Cloud Platform proper, that can interact with some of Google’s prebuilt AI models and interpret this media.&lt;/p&gt;

&lt;p&gt;Below, I provide the tools and code to build a demo that takes a photo of a scene, analyzes it, and then speaks back the results.&lt;/p&gt;

&lt;p&gt;To fully build out the proof of concept you will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Raspberry Pi&lt;/li&gt;
&lt;li&gt;A Raspberry Pi Camera&lt;/li&gt;
&lt;li&gt;A Google Cloud Platform account&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first step is to make sure you have Python 3.7.x or higher installed on the Pi and create a requirements.txt with the following dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cloud&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cloud&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;texttospeech&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;2.2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;picamera&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;1.13&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, let’s build out the application.&lt;/p&gt;

&lt;p&gt;In your main.py, declare your imports, provide your GCP credentials, and instantiate your Google SDK clients. Your credentials will reference a JSON file, and it should have permissions to the Cloud Vision and Cloud Text-to-Speech APIs.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;picamera&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;texttospeech&lt;/span&gt;

&lt;span class="c1"&gt;# Needs permission for Cloud Vision API and Cloud Text-to-Speech API
&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"GOOGLE_APPLICATION_CREDENTIALS"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"YourServiceAccount.json"&lt;/span&gt;
&lt;span class="n"&gt;client_vision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ImageAnnotatorClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client_tts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texttospeech&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextToSpeechClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;To analyze a photo, you first have to take a picture. The beautiful thing about a Raspberry Pi camera is that this is a simple task. &lt;/p&gt;

&lt;p&gt;Once your camera is plugged in and enabled in the Raspberry Pi configuration (raspi-config), use the PiCamera library to take a photo. Below is a simple function for taking a picture with PiCamera.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;takephoto&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;camera&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;picamera&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PiCamera&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;camera&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resolution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Show me a quick preview before snapping the photo (If you have a monitor)
&lt;/span&gt;
    &lt;span class="n"&gt;camera&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_preview&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Take the photo
&lt;/span&gt;    &lt;span class="n"&gt;camera&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'image.jpg'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The primary function executes &lt;code&gt;takephoto()&lt;/code&gt;, which writes an image.jpg file to the local drive. The file is then read into memory, sent through the Cloud Vision SDK, and analyzed by Google’s Cloud Vision AI service.&lt;/p&gt;

&lt;p&gt;In this instance, I chose to use the label_detection feature to help identify objects in the photo. The service also has separate functions to recognize the existence of faces, famous logos, and more. For some detailed info on what it can do, visit the official &lt;a href="https://cloud.google.com/vision/docs/labels"&gt;Google Cloud Vision AI&lt;/a&gt; docs page.&lt;/p&gt;

&lt;p&gt;The text-to-speech step utilizes SSML and Google’s premium WaveNet voices. I don’t fully use SSML in the example below, but documentation highlighting some of its deeper capabilities is available &lt;a href="https://cloud.google.com/text-to-speech/docs/ssml"&gt;here&lt;/a&gt;.&lt;/p&gt;
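&lt;p&gt;For illustration, here is a minimal SSML payload (the wording is hypothetical) of the kind the service accepts; it could be supplied by constructing the input with &lt;code&gt;texttospeech.SynthesisInput(ssml=...)&lt;/code&gt; instead of the plain-text form used in the code below:&lt;/p&gt;

```xml
&lt;speak&gt;
  I see: &lt;break time="300ms"/&gt;
  &lt;emphasis level="moderate"&gt;a dog, grass, and sky&lt;/emphasis&gt;
&lt;/speak&gt;
```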

&lt;p&gt;As for the voices, I highly recommend Google WaveNet &lt;a href="https://cloud.google.com/text-to-speech/docs/voices"&gt;voices&lt;/a&gt; for any TTS application that demands near-human-quality synthesis.&lt;/p&gt;

&lt;p&gt;The speech is streamed back and stored as an MP3 file on the local drive. Once saved, mpg123 plays the MP3 over any speaker hooked up to the Raspberry Pi. If you have not done so already, install it with the &lt;code&gt;apt install mpg123&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;takephoto&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'image.jpg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'rb'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;image_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client_vision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_detection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client_vision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_annotations&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Labels:'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;synthesis_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;

        &lt;span class="c1"&gt;# Make a simple comma delimited string type sentence.
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;synthesis_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;', '&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;synthesis_input&lt;/span&gt;

        &lt;span class="n"&gt;synthesis_in&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texttospeech&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SynthesisInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;synthesis_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Let's make this a premium Wavenet voice in SSML
&lt;/span&gt;        &lt;span class="n"&gt;voice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texttospeech&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VoiceSelectionParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;language_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"en-US"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"en-US-Wavenet-A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;ssml_gender&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;texttospeech&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SsmlVoiceGender&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MALE&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Select the type of audio file you want returned
&lt;/span&gt;        &lt;span class="n"&gt;audio_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texttospeech&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AudioConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;audio_encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;texttospeech&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AudioEncoding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MP3&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Perform the text-to-speech request on the text input with the     selected
&lt;/span&gt;        &lt;span class="c1"&gt;# voice parameters and audio file type
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client_tts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;synthesize_speech&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;synthesis_in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audio_config&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The response's audio_content is binary.
&lt;/span&gt;        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"output.mp3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"wb"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Write the response to the output file.
&lt;/span&gt;            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Audio content written to file "output.mp3"'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"output.mp3"&lt;/span&gt;
        &lt;span class="c1"&gt;# apt install mpg123
&lt;/span&gt;        &lt;span class="c1"&gt;# Save the audio file to the dir
&lt;/span&gt;        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"mpg123 "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
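&lt;p&gt;As an aside, two details of the script above can be tightened: the comma-building loop reverses the label order and leaves a trailing separator, and passing a filename straight into &lt;code&gt;os.system&lt;/code&gt; breaks on paths containing spaces. The sketch below (helper names are hypothetical, and a stand-in &lt;code&gt;Label&lt;/code&gt; class replaces the real Vision annotations) shows one way to handle both:&lt;/p&gt;

```python
from dataclasses import dataclass
import shlex


@dataclass
class Label:
    """Stand-in for a Cloud Vision label annotation."""
    description: str


def labels_to_sentence(labels):
    # Join descriptions in detection order, with no trailing separator.
    return ', '.join(label.description for label in labels)


def play_command(path):
    # Quote the path so the shell command survives spaces in filenames.
    return 'mpg123 ' + shlex.quote(path)


labels = [Label('Dog'), Label('Grass'), Label('Sky')]
print(labels_to_sentence(labels))   # Dog, Grass, Sky
print(play_command('output.mp3'))   # mpg123 output.mp3
```

&lt;p&gt;In &lt;code&gt;main()&lt;/code&gt;, these would slot in as &lt;code&gt;texttospeech.SynthesisInput(text=labels_to_sentence(labels))&lt;/code&gt; and &lt;code&gt;os.system(play_command("output.mp3"))&lt;/code&gt;.&lt;/p&gt;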


&lt;p&gt;All code for this tutorial is on GitHub. Feel free to take it and modify it into something better…stronger…faster. 💪&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i3JOwpme--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-ba8488d21cd8ee1fee097b8410db9deaa41d0ca30b004c0c63de0a479114156f.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/mgrofsky"&gt;
        mgrofsky
      &lt;/a&gt; / &lt;a href="https://github.com/mgrofsky/GoogleAI-Pi"&gt;
        GoogleAI-Pi
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Google AI Vision &amp;amp; Speech on a Raspberry Pi
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
GoogleAI-Pi&lt;/h1&gt;
&lt;p&gt;Google AI Vision &amp;amp; Speech on a Raspberry Pi&lt;/p&gt;
&lt;p&gt;A python demo that will take a photo of a scene, analyze it, and then speak back the results.&lt;/p&gt;
&lt;p&gt;This demo is for use on a Raspberry Pi with a Pi Camera attachment.&lt;/p&gt;
&lt;p&gt;A full breakdown of requirements can be found at:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://medium.com/@mgrofsky/google-ai-vision-text-to-speech-on-a-raspberry-pi-875dc13b3d73" rel="nofollow"&gt;https://medium.com/@mgrofsky/google-ai-vision-text-to-speech-on-a-raspberry-pi-875dc13b3d73&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;

  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/mgrofsky/GoogleAI-Pi"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;



</description>
      <category>machinelearning</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>Building Scaleable .NET Apps Without Windows</title>
      <dc:creator>Matt Grofsky</dc:creator>
      <pubDate>Mon, 09 Nov 2020 15:20:31 +0000</pubDate>
      <link>https://dev.to/code_munkee/building-scaleable-net-apps-without-windows-2hoi</link>
      <guid>https://dev.to/code_munkee/building-scaleable-net-apps-without-windows-2hoi</guid>
      <description>&lt;p&gt;“I built a .NET Web App on macOS in Visual Studio and deployed a Linux Docker container to Google App Engine.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5000%2F1%2AsN0W0Ji2er3Lgd7YirLBhQ%402x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5000%2F1%2AsN0W0Ji2er3Lgd7YirLBhQ%402x.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is a phrase you don’t hear too often, if at all. .NET Core was released only a few years ago. The open-source project freed .NET developers from the Windows platform and allowed Microsoft to expand into what was once unfriendly territory. Fast forward to current technology trends, and you will find .NET developers branching out to Linux and Mac-based machines. This once-paradoxical development flow led me on a personal path to see how these technologies would fit into my Google Cloud, Mac-centric world. As it turns out, it feels quite natural.&lt;/p&gt;

&lt;p&gt;Assumptions about which cloud provider is right for which purpose have persisted for years. When engineers think about deploying a Windows web application in the cloud, they automatically think of Microsoft Azure. When data scientists want best-of-breed machine learning and analytics, they automatically think of Google Cloud. Amazon AWS is known for its global reach, maturity, and reputation. Over the past few years, these lines of distinction for building cloud-native applications have started to blur, not just among cloud providers, but within the tools and operating systems used to create them. What follows is a brief breakdown of how you or your team can start working with .NET outside the Microsoft Windows ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploying .NET in Google App Engine&lt;/strong&gt;&lt;br&gt;
For those who are unfamiliar with App Engine: it is a fully managed, serverless application platform that builds on Google’s years of experience running resilient, scalable architectures. It is an excellent solution if you want to start playing with .NET using Docker and don’t want to deal with Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://visualstudio.microsoft.com/vs/mac/" rel="noopener noreferrer"&gt;Download&lt;/a&gt; Visual Studio for Mac&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://console.cloud.google.com/getting-started" rel="noopener noreferrer"&gt;Sign up&lt;/a&gt; for a Google Cloud Platform account. You will get $300 in free usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/sdk/docs/downloads-interactive" rel="noopener noreferrer"&gt;Install&lt;/a&gt; the Google Cloud SDK.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://console.cloud.google.com/projectselector2/home/dashboard?_ga=2.16975289.-1403532037.1573498213" rel="noopener noreferrer"&gt;Create&lt;/a&gt; a project in Google Cloud.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open Terminal in Mac and run the following command to set your default project:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud confid set project &amp;lt;PROJECT-NAME&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The following link contains &lt;a href="https://github.com/mgrofsky/NET_Docker_Google-_App_Engine" rel="noopener noreferrer"&gt;sample code&lt;/a&gt; for those interested in trying out a pre-built Visual Studio Solution. Included is one example each for deploying a Docker container to Google App Engine as well as Kubernetes.&lt;/p&gt;

&lt;p&gt;Visual Studio comes with the Dockerfile templates needed to help any developer jump into building a container-ready application. When created, the Dockerfile will expose the necessary ports so that the web app is reachable. One essential item: when deploying a custom runtime, the App Engine front end routes incoming requests to the appropriate module on port 8080, so you must be sure that your application code is listening on that port.&lt;/p&gt;

&lt;p&gt;The default &lt;strong&gt;EXPOSE&lt;/strong&gt; configuration when adding Docker support to a .NET web application is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    &lt;span class="s"&gt;FROM mcr.microsoft.com/dotnet/core/aspnet:2.2-stretch-slim AS base&lt;/span&gt;
    &lt;span class="s"&gt;WORKDIR /app&lt;/span&gt;
    &lt;span class="s"&gt;EXPOSE &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="s"&gt;EXPOSE &lt;/span&gt;&lt;span class="m"&gt;443&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You will want to change this to:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    &lt;span class="s"&gt;FROM mcr.microsoft.com/dotnet/core/aspnet:2.2-stretch-slim AS base&lt;/span&gt;
    &lt;span class="s"&gt;WORKDIR /app&lt;/span&gt;
    &lt;span class="s"&gt;EXPOSE &lt;/span&gt;&lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="s"&gt;ENV ASPNETCORE_URLS=http://*:8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Everything else will remain the same. You may then go ahead and create your .NET web application as usual. The final step before deploying to App Engine is to specify the custom runtime. Create a file called app.yaml and place it in your root directory.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;custom&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flex&lt;/span&gt;
    &lt;span class="na"&gt;manual_scaling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;memory_gb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.5&lt;/span&gt;
      &lt;span class="na"&gt;disk_size_gb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The sample app.yaml above incurs costs to run on the App Engine flexible environment. The settings are to reduce costs during testing and are not appropriate for production use. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember that the flex environment does not scale down to 0 and could become costly if you supply the wrong resources and forget about it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For more information, see: &lt;a href="https://cloud.google.com/appengine/docs/flexible/python/configuring-your-app-with-app-yaml" rel="noopener noreferrer"&gt;https://cloud.google.com/appengine/docs/flexible/python/configuring-your-app-with-app-yaml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Terminal on your Mac, browse to the root directory containing your app.yaml file and run the command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud app deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You can preview your running application by browsing to &lt;strong&gt;&lt;em&gt;App Engine &amp;gt; Services&lt;/em&gt;&lt;/strong&gt; in the Google Cloud Platform Console.&lt;/p&gt;

&lt;p&gt;You just deployed a highly resilient and scalable .NET application without running a Windows Client machine or bringing up any Windows Servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Visual Studio Solution&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/mgrofsky" rel="noopener noreferrer"&gt;
        mgrofsky
      &lt;/a&gt; / &lt;a href="https://github.com/mgrofsky/NET_Docker_Google-_App_Engine" rel="noopener noreferrer"&gt;
        NET_Docker_Google-_App_Engine
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;.NET Web App in Google App Engine Flexible w/ Docker Support&lt;/h2&gt;
&lt;/div&gt;




&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;app.yaml&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;This is required to deploy to GAE Flexible. Runtime will be &lt;code&gt;custom&lt;/code&gt; and env will be &lt;code&gt;flex&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can either add your service into the &lt;code&gt;app.yaml&lt;/code&gt; or specify it in the &lt;code&gt;gcloud app deploy&lt;/code&gt; command.&lt;/p&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;runtime: custom
env: flex

# This sample incurs costs to run on the App Engine flexible environment. 
# The settings below are to reduce costs during testing and are not appropriate
# for production use. For more information, see:
# https://cloud.google.com/appengine/docs/flexible/python/configuring-your-app-with-app-yaml
manual_scaling:
  instances: 1
resources:
  cpu: 1
  memory_gb: 0.5
  disk_size_gb: 10

service: matt-test
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Remember that the flex environment does not scale down to 0 and could become costly if you supply the wrong resources and forget about it.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Dockerfile&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;One key item here is that when creating and deploying a custom runtime, the App Engine front end will route incoming requests to…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/mgrofsky/NET_Docker_Google-_App_Engine" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;Translated for Ukrainian audiences:&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="https://medium.com/temy-ukraine/%D1%81%D1%82%D0%B2%D0%BE%D1%80%D0%B5%D0%BD%D0%BD%D1%8F-%D0%BC%D0%B0%D1%81%D1%88%D1%82%D0%B0%D0%B1%D0%BE%D0%B2%D0%B0%D0%BD%D0%B8%D1%85-%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC-net-%D0%B1%D0%B5%D0%B7-windows-31ddacac5dba" class="ltag__link__link" rel="noopener noreferrer"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afill%3A88%3A88%2F0%2AGcOXQAriwSkf935K.jpg" alt="Andrew Sheludenkov"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://medium.com/temy-ukraine/%D1%81%D1%82%D0%B2%D0%BE%D1%80%D0%B5%D0%BD%D0%BD%D1%8F-%D0%BC%D0%B0%D1%81%D1%88%D1%82%D0%B0%D0%B1%D0%BE%D0%B2%D0%B0%D0%BD%D0%B8%D1%85-%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC-net-%D0%B1%D0%B5%D0%B7-windows-31ddacac5dba" class="ltag__link__link" rel="noopener noreferrer"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Building Scalable .NET Applications Without Windows | by Andrew Sheludenkov | Temy.co Ukraine | Medium&lt;/h2&gt;
      &lt;h3&gt;Andrew Sheludenkov ・ &lt;time&gt;Oct 7, 2020&lt;/time&gt; ・ 
      &lt;div class="ltag__link__servicename"&gt;
        &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fmedium-f709f79cf29704f9f4c2a83f950b2964e95007a3e311b77f686915c71574fef2.svg" alt="Medium Logo"&gt;
        Medium
      &lt;/div&gt;
    &lt;/h3&gt;
&lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>dotnet</category>
      <category>docker</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>Analyze Your Call Recordings With Google AI</title>
      <dc:creator>Matt Grofsky</dc:creator>
      <pubDate>Mon, 09 Nov 2020 04:32:44 +0000</pubDate>
      <link>https://dev.to/code_munkee/analyze-your-call-recordings-with-google-ai-2a7h</link>
      <guid>https://dev.to/code_munkee/analyze-your-call-recordings-with-google-ai-2a7h</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GANp5gHP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ajcug1elhpmd9wgg3d9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GANp5gHP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ajcug1elhpmd9wgg3d9q.png" alt="Analyze Calls"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For most companies, the story usually goes like this.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A customer calls in to complain, praise, or ask for assistance.&lt;/li&gt;
&lt;li&gt;The call is recorded for further training or evaluation.&lt;/li&gt;
&lt;li&gt;The recording is typically picked at random, listened to by someone, and reviewed with the customer service representative.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process can take anywhere from an hour to a week after a customer hangs up. During this time, a lot can go wrong. Compliance issues and poor service could leave you with some unhappy customers. &lt;/p&gt;

&lt;p&gt;I’ll show you how to work smarter, not harder, and identify problems as soon as they occur. What most developers don’t realize is that the intricate pieces are already pre-built inside the Google Cloud Platform.&lt;/p&gt;

&lt;p&gt;There are three essential items you will want to look for when evaluating a call.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identity&lt;/strong&gt; — Separate the individuals on the call distinctly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentiment&lt;/strong&gt; — Are these individuals generally positive or negative in the interaction?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger Words&lt;/strong&gt; — Were any words or phrases said that warrant further review?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s complicate this a bit and evaluate single-channel phone call audio. That means we are dealing not only with call-quality audio, but also with audio where both callers are co-mingled in a single channel, which makes it much harder to distinguish who is talking and when.&lt;/p&gt;

&lt;p&gt;A Google Cloud Function is the easiest way to trigger code execution at scale when a file is uploaded to Cloud Storage. &lt;a href="https://cloud.google.com/functions/docs/tutorials/storage#functions-change-directory-python"&gt;Setting up a Cloud Function&lt;/a&gt; for this purpose is straightforward.&lt;/p&gt;

&lt;p&gt;Let’s first start with the requirements.txt file and imports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;requirements.txt&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cloud&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;speech&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cloud&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;1.27&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;pathlab&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;imports&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this example, I will be using diarization to distinguish and separate the audio between the two callers. &lt;a href="https://en.wikipedia.org/wiki/Speaker_diarisation#:~:text=Speaker%20diarisation%20(or%20diarization)%20is,according%20to%20the%20speaker%20identity."&gt;Diarization&lt;/a&gt; is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The process of partitioning an input audio stream into homogeneous segments according to the speaker identity&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This process requires the Cloud Speech beta module, &lt;code&gt;speech_v1p1beta1&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;speech_v1p1beta1&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;google.cloud.speech_v1p1beta1&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;enums&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Identifying the created file&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the Cloud Function is triggered by a &lt;code&gt;google.storage.object.finalize&lt;/code&gt; event inside GCS, a dictionary with data specific to this type of event is sent.&lt;/p&gt;

&lt;p&gt;Grabbing the path of the file name is as easy as pulling out &lt;code&gt;file['name']&lt;/code&gt; from the event &lt;a href="https://cloud.google.com/functions/docs/calling/storage"&gt;dictionary&lt;/a&gt;. Knowing all this, we can build a gs:// URI that can be used across various Google AI services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;BucketName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'gcs-bucket'&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;FileName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;storage_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'gs://'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;BucketName&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;'/'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;FileName&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
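&lt;p&gt;As a side note, the hard-coded &lt;code&gt;BucketName&lt;/code&gt; above could instead be read from the event payload itself, since a &lt;code&gt;google.storage.object.finalize&lt;/code&gt; event carries both a &lt;code&gt;bucket&lt;/code&gt; and a &lt;code&gt;name&lt;/code&gt; field. A minimal sketch (the helper name is my own):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def gcs_uri(event):
    # "bucket" and "name" are standard fields of the GCS object
    # metadata dictionary passed to a finalize-triggered function
    return 'gs://' + event['bucket'] + '/' + event['name']

sample_event = {'bucket': 'gcs-bucket', 'name': 'calls/recording.mp3'}
print(gcs_uri(sample_event))  # gs://gcs-bucket/calls/recording.mp3
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;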



&lt;p&gt;&lt;strong&gt;Transcribing the Audio&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before transcribing the audio, I first want to make sure it is an actual audio file. In this example, I am only going to deal with mp3 audio. There are a tremendous number of options to choose from, and I will highlight a few. First, the hertz rate is essential and, more often than not, is 8000 for phone audio recordings. Second, because this is a phone call, Google has a dedicated machine learning model for phone call audio that produces a better transcription overall. Finally, for proper configuration, make sure to enable diarization and set the appropriate number of speakers on the call. If required, adjust your utterance dictionary to pick out specific proper nouns, business names, or phrases that can show up in conversation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;# Let's process only mp3 files
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;storage_uri&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="s"&gt;".mp3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;speech_v1p1beta1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpeechClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Sample rate in Hertz of the audio data sent
&lt;/span&gt;        &lt;span class="n"&gt;sample_rate_hertz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;

    &lt;span class="c1"&gt;# The language of the supplied audio
&lt;/span&gt;        &lt;span class="n"&gt;language_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"en-US"&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"phone_call"&lt;/span&gt;

    &lt;span class="c1"&gt;# Encoding of audio data sent. This sample sets this explicitly.
&lt;/span&gt;    &lt;span class="c1"&gt;# This field is optional for FLAC and WAV audio formats.
&lt;/span&gt;        &lt;span class="n"&gt;encoding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RecognitionConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AudioEncoding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MP3&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"sample_rate_hertz"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sample_rate_hertz&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"language_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;language_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"encoding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"use_enhanced"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"enable_automatic_punctuation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"enable_speaker_diarization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"diarization_speaker_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"speech_contexts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="s"&gt;"phrases"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Thank you for calling ABC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="s"&gt;"Thank you for contacting ABC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;"Welcome to ABC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;"ABC customer service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;"Thank you for calling ABC customer support."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"uri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;storage_uri&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;long_running_recognize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#print(u"Waiting for operation to complete...")
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
        &lt;span class="n"&gt;transcriptw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
        &lt;span class="n"&gt;sendtrans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Empty Audio"&lt;/span&gt;
        &lt;span class="n"&gt;speaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;words_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alternatives&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word_info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words_info&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speaker_tag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speaker_tag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;#print(str(word_info.speaker_tag) + " is not " + str(speaker))
&lt;/span&gt;                    &lt;span class="n"&gt;speaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speaker_tag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;transcriptw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcriptw&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;-------&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;*Speaker "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;speaker&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;":* "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;word_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;
                 &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;#print(str(word_info.speaker_tag) + " is " + speaker)
&lt;/span&gt;                    &lt;span class="n"&gt;transcriptw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcriptw&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;" "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;word_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;
                    &lt;span class="n"&gt;speaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speaker_tag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;sendtrans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Empty Audio"&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcriptw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;transcriptw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;transcriptw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"*No Sound*"&lt;/span&gt;
        &lt;span class="n"&gt;sendtrans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"bitcoin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"payment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"invoice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"bill"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"utilities"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"utility"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"electricity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"credit card"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"package"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"testing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"kits"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"financial"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"supplies"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"mask"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"symptoms"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"isolate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"oxygen"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"ventilator"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"social security"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"government"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"internal revenue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"covid"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"world health"&lt;/span&gt;&lt;span 
class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"national institute"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"virus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"corona"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"quarantine"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"stimulus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"relief"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"cdc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"disease"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"pandemic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"epidemic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"sickness"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
        &lt;span class="c1"&gt;# Using for loop 
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;transcriptw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="n"&gt;sendtrans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sendtrans&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Sending to Slack: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="n"&gt;send_slack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For longer audio, such as entire phone conversations, the best practice is to use the &lt;code&gt;client.long_running_recognize(config, audio)&lt;/code&gt; method, which performs asynchronous speech recognition.&lt;/p&gt;

&lt;p&gt;After transcribing, I check the transcript for any keyword triggers and, if any match, send the transcription to Slack for immediate notification.&lt;/p&gt;
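&lt;p&gt;One caveat with plain substring matching is that short triggers can fire inside longer words ("bill" inside "billion," for example). A hedged alternative using word-boundary regular expressions (the function name is my own):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def find_trigger(transcript, triggers):
    # Case-insensitive whole-word/phrase matching to avoid false hits
    text = transcript.lower()
    for phrase in triggers:
        if re.search(r'\b' + re.escape(phrase.lower()) + r'\b', text):
            return phrase.lower()
    return None  # no trigger words found

print(find_trigger("Your billion-dollar invoice is ready", ["bill", "invoice"]))  # invoice
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;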

&lt;p&gt;Below is the Slack function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_slack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"https://hooks.slack.com/services/ABCDEFG/123456/ABC123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="s"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"*Audio:* https://storage.cloud.google.com/"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;BucketName&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"/"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;*Transcription:*&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt; 
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Response HTTP Status Code: {status_code}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Response HTTP Response Body: {content}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'HTTP Request failed'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An open-source and simplified example of the above code is in one of &lt;a href="https://gitlab.com/ytelprojects/covid-19-compliance-module"&gt;Ytel’s public Gitlab repositories&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When the Covid-19 outbreak started, telecom companies quickly needed to identify and report certain types of scam-oriented communications.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>Error Budgeting &amp; Site Reliability Engineering</title>
      <dc:creator>Matt Grofsky</dc:creator>
      <pubDate>Sun, 08 Nov 2020 17:54:38 +0000</pubDate>
      <link>https://dev.to/code_munkee/error-budgeting-site-reliability-engineering-3061</link>
      <guid>https://dev.to/code_munkee/error-budgeting-site-reliability-engineering-3061</guid>
      <description>&lt;p&gt;When most companies search for an online SaaS solution, Service Level Agreements (SLA) play a crucial role in influencing the sale. Downtime surrounding a company’s SLA typically contain calculations around minutes of uptime for the service over some time. One sure-fire way you can provide adequate, and actionable SLA numbers are by implementing Site Reliability Engineering methodologies (SRE) and Error Budgeting.&lt;/p&gt;

&lt;p&gt;As defined by &lt;a href="https://cloud.google.com/blog/products/gcp/sre-fundamentals-slis-slas-and-slos"&gt;Google&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;SRE begins with the idea that a prerequisite to success is availability. A system that is unavailable cannot perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are three tools at your disposal to help you identify and measure your SRE efforts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SLA:&lt;/strong&gt; The Service Level Agreement is the contract in which the service provider promises customers a certain level of service availability and performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SLO:&lt;/strong&gt; The Service Level Objective is a goal for a component that a service provider wants to reach. The SLO is not shared with the customer but is instead an internal goal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SLI:&lt;/strong&gt; The Service Level Indicator is the measurement the service provider uses to track progress toward the SLO goal: a measurement that defines “good enough.” We need enough “good enough” events to meet our SLO.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using an example service that provides a REST API for sending SMS, we need to identify the customers and what a successful journey looks like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example User and Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;What is my service?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The service is an API that sends an SMS.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Who uses my service?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The users consist of businesses that like to communicate via SMS to their customers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Journey:&lt;/strong&gt;&lt;br&gt;
A business user can make a REST request to our API and send an SMS to a mobile phone. The recipient should be able to respond, and the inbound SMS can be forwarded to a URL endpoint when received.&lt;/p&gt;

&lt;p&gt;If I now rewrite this journey in terms of something measurable, I can create something that can be tracked and monitored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Define your SLI&lt;/strong&gt;&lt;br&gt;
When a customer initiates a request to the SendSMS API endpoint, the SLI is the time it takes to get a response back (response time), measured from the moment the request arrives to the moment the response is sent back to the customer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calculating SLI&lt;/strong&gt;&lt;br&gt;
When calculating reliability with your SLI, most take the approach of defining availability as:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Availability = (Number of minutes a system is working well / Total minutes) * 100&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;What you end up with is a fraction that defines your uptime percentage. This method of calculating reliability has some positives, but it also has some negatives. It’s straightforward for a human to understand the percentage and gauge reliability since the metric is binary: the service is up, or the service is down. The downside is that this approach doesn’t work well in distributed systems, where multiple systems and servers contribute to the calculation.&lt;/p&gt;

&lt;p&gt;A better approach to calculating SLI is to track events between your systems and not minutes.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Availability = (Number of good events / Total events) * 100&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;What happens here is that you gain some additional benefits by tracking events across servers versus tracking just time up and down. The number of servers in this scenario is irrelevant as you are measuring events that affect customers and their journey directly. This helps in situations where you use managed instance groups or preemptible machines in a cloud environment.&lt;/p&gt;
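
&lt;p&gt;The two availability formulas can be sketched in a few lines of Python (a minimal illustration; the function names and sample numbers are my own):&lt;/p&gt;

```python
def minutes_availability(good_minutes, total_minutes):
    """Time-based availability: up-or-down minutes, per the first formula."""
    return good_minutes / total_minutes * 100

def event_availability(good_events, total_events):
    """Event-based availability: counts customer-facing requests,
    so the number of servers involved is irrelevant."""
    return good_events / total_events * 100

# 14m 24s (14.4 minutes) of downtime in a 1,440-minute day is 99% availability
print(round(minutes_availability(1440 - 14.4, 1440), 2))  # 99.0

# 9,900 good SMS sends out of 10,000 total is also 99% availability
print(round(event_availability(9_900, 10_000), 2))        # 99.0
```

&lt;p&gt;Note that the second function needs no knowledge of how many machines served the 10,000 requests, which is exactly why it survives autoscaling and preemptible instances.&lt;/p&gt;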

&lt;p&gt;&lt;strong&gt;Define your SLO&lt;/strong&gt;&lt;br&gt;
An example SLO could be something as simple as, “&lt;em&gt;99% of SMS requests return in under 300ms&lt;/em&gt;”. As you create the SLO, understand that this number is not static. Over time, your customers help define the true metric and decide if it needs to be adjusted upward. You should adjust your calculations to match the level of the outage. As an example, if your outage is considered degraded, multiply your error budget consumption for the incident by 0.25. If your outage is considered partial, multiply your error budget consumption for the incident by 0.5. Understanding the basics of making calculations across your infrastructure will help guide your decisions, but sometimes these calculations should be modified to meet specific goals. One example is to apply a force multiplier to specific customers that pay more for your services, or a force divider if a “bad” event occurs off-hours.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DRQFagcT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4428/1%2A2zeu7In8gR87gyx1n2Nsrw%402x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DRQFagcT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4428/1%2A2zeu7In8gR87gyx1n2Nsrw%402x.png" alt="Error Budget Allocation Based on Uptime Percentage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A 99% uptime allows for 14m 24s of downtime per day. This works out to roughly 1h 40m, 7h 18m, and 3d 15h of downtime per week, month, and year, respectively. These times, calculated from your SLO target, are your “Error Budget.” The error budget is your allowable downtime: the maximum amount of lowered performance you are willing to tolerate over 30 days, and it should be stricter than your public SLA. Calculate your SLO through your various SLIs. Each time an SLI check fails and an event comes back bad, you consume a portion of your allowed error budget.&lt;/p&gt;
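
&lt;p&gt;The downtime figures above fall out of a simple calculation. Here is a sketch (assuming a flat 30-day month, which yields 7h 12m rather than the calendar-month 7h 18m):&lt;/p&gt;

```python
def allowed_downtime_minutes(slo_percent, period_minutes):
    """Error budget, in minutes, for a given SLO target and period."""
    return period_minutes * (1 - slo_percent / 100)

DAY = 24 * 60  # 1,440 minutes

print(round(allowed_downtime_minutes(99.0, DAY), 1))        # 14.4   -> 14m 24s per day
print(round(allowed_downtime_minutes(99.0, 30 * DAY), 1))   # 432.0  -> 7h 12m per 30 days
print(round(allowed_downtime_minutes(99.0, 365 * DAY), 1))  # 5256.0 -> about 3d 15h 36m per year
```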

&lt;p&gt;It’s important to remember that technology services are complex, and complex systems fail. Embracing failure is essential to growth, as long as the failure is understood and effort is taken to fix the cause. Given this complexity, we can say that 100% reliability does not exist. As long as you are within your budget, failure is OK. It’s important to monitor as much as you can, as this helps you gain insight into your systems, and you can only improve upon what you measure.&lt;/p&gt;

&lt;p&gt;If you find you are quickly eating up your budget because a service is having problems, it’s time to re-prioritize your team to fix the budget eater. The budget also gives stakeholders guidance on when to reduce effort on non-reliability features and refocus on reliability work and infrastructure. One key benefit of utilizing error budgets is the reduction of paging-alert fatigue for engineers. Error budgeting makes it easier to set paging alerts based on the amount of error budget consumed in X minutes versus the traditional method of paging someone every time a failure is noticed.&lt;/p&gt;
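
&lt;p&gt;Paging on budget burn rather than on individual failures might look something like this (a hypothetical sketch, not the API of any particular monitoring product; the threshold and window values are illustrative):&lt;/p&gt;

```python
def should_page(bad_events, budget_events, window_fraction, burn_threshold=2.0):
    """Page only when the error budget is burning faster than
    burn_threshold times the sustainable rate for the window.

    window_fraction is the share of the SLO period the window covers,
    e.g. a 60-minute window in a 30-day period is 60 / (30 * 24 * 60).
    """
    sustainable = budget_events * window_fraction  # budget we may spend this window
    return bad_events > sustainable * burn_threshold

# 60-minute window, 30-day SLO period, budget of 432 bad events
window = 60 / (30 * 24 * 60)
print(should_page(5, 432, window))  # True: burning well above the sustainable rate
print(should_page(1, 432, window))  # False: within budget, no page
```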

&lt;p&gt;As you work to stay within your error budget, you will start to notice better reliability. Better reliability translates to increased uptime and an SLO that can be raised. Furthermore, as reliability increases, so does customer satisfaction, and this directly translates to increased revenue.&lt;/p&gt;

&lt;p&gt;Do you have error budgeting and SRE all figured out? Try shutting down a random server or zone and see what happens.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/uTEL8Ff1Zvk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This article was originally posted on Medium:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag__link"&gt;
  &lt;a href="https://medium.com/swlh/error-budgeting-site-reliability-engineering-e71b104daa73" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u3SKpWMR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/fit/c/96/96/2%2Ab7KP0_EEsEDHG1m4fEsgKA.jpeg" alt="Matt Grofsky"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://medium.com/swlh/error-budgeting-site-reliability-engineering-e71b104daa73" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Error Budgeting &amp;amp; Site Reliability Engineering | by Matt Grofsky | The Startup | Medium&lt;/h2&gt;
      &lt;h3&gt;Matt Grofsky ・ &lt;time&gt;Nov 28, 2019&lt;/time&gt; ・ 
      &lt;div class="ltag__link__servicename"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ze5yh_2q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/medium_icon-90d5232a5da2369849f285fa499c8005e750a788fdbf34f5844d5f2201aae736.svg" alt="Medium Logo"&gt;
        Medium
      &lt;/div&gt;
    &lt;/h3&gt;
&lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>sre</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
