Hermes Frangoudis

Posted on Jul 19, 2020

How To Build Hot Dog — Not Hot Dog

#augmentedreality #ai #swift #agoraio

Are you a big TV show binge-watcher? I sure am! As a dev, one thing I really enjoy is when a particular show highlights how technology interacts with the real world in believable ways, impacting us with unexpected—and often unintentionally hilarious—results.

For instance, back in 2017, HBO’s Silicon Valley aired an episode with the “Hot dog — Not hot dog” scene, where Jian-Yang creates an app that recognizes hot dogs and labels everything else as “not hot dog.” The scene depicts a classic first step in training an AI for visual recognition.

For this tutorial, I will train a custom AI model using IBM Watson, and then use that model to detect “hot dog” or “not hot dog” within a live camera view. I’ll use augmented reality to display the result to the user. Since everything is more fun with friends, we’ll add a live streaming component to it!

Prerequisites

Basic understanding of Swift
Basic understanding of ARKit
Basic understanding of CoreML and AI
Agora.io Developer Account
IBM Cloud Account with Watson Studio

Please Note: While no CoreML/AI knowledge is needed to follow along, certain basic concepts won't be explained along the way.

Device Requirements

In this project, we’ll be using ARKit, so we have some device requirements:

iPhone 6S or newer
iPhone SE
iPad (2017)
All iPad Pro models

Training the AI

Before we can build our iOS app, we first need to train the computer vision AI model. I chose IBM’s Watson Studio because they provide a very simple, drag-n-drop interface for training a computer vision model.

Create the Watson Project

Once you’ve created and logged into your Watson Studio account, click the Create Project button. Give your project a name and description, add the storage, and then click to create the project.

Next, click Add Project and select Visual Recognition. Make sure to follow the prompts to add a Watson Visual Recognition service to the project. On the Custom Models screen, we’re going to Classify Images, so click the Create Model button. Now let’s name our model. I chose the name HotDog!

Source Training Images

Now that we've set up our Watson instance, we need images to train our model. Sourcing images for AI training may seem like a daunting undertaking. While it is quite a heavy lift, there are tools that help make this task easier.

I chose to use the Google Images Download Python script. The script makes it easy to scrape images from Google while still respecting the original owner's copyrights.

Once you have set up the Google Images Download script, let's open up the command line and run it using:

googleimagesdownload --keywords "hotdog" --usage_rights labeled-for-reuse

The results will include some images that aren't photos of real hot dogs. We need to remove any photos like these so we are left with only pictures of real hot dogs (images can include toppings). Once those images are removed, you'll notice that we're left with about 50 photos. This isn't very many, considering that you'd usually want thousands of photos to train your model. While Watson could probably work with only 50 or so photos, let's run the script a few more times with other keywords. These are the commands I used:

googleimagesdownload --keywords "hotdog" --usage_rights labeled-for-reuse
googleimagesdownload --keywords "plain hotdog" --usage_rights labeled-for-reuse
googleimagesdownload --keywords "real hotdog" --usage_rights labeled-for-reuse
googleimagesdownload --keywords "hotdog no toppings" --usage_rights labeled-for-reuse

After running the script a few times and removing any non-real hot dog images from each set of results, I was able to source 170 images for my model.

Let's put all our hot dog images into a single folder and name it hotdog. Now that we have our hot dog images, we need to find some not-hotdog images. Again, use the Google Images Download script but this time with a batch of keywords. I used:

googleimagesdownload --keywords "cake, pizza, hamburger, french fries, cup, plate, fork, glasses, computer, sandwich, table, dinner, meal, person, hand, keyboard" --usage_rights labeled-for-reuse

IBM Watson's free tier imposes a file size limit (250 MB per round) for training models, so once we've downloaded all of our non-hotdog images, we need to remove any images with large file sizes. Let's move all the images into a single folder and name it nothotdog. Next, zip each folder so you have hotdog.zip and nothotdog.zip.
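The size check can be scripted instead of done by hand. The sketch below is only an illustration: `ImageFile`, `trimToBudget`, and `imageFiles(inFolder:)` are hypothetical names, and treating the limit as 250 × 1024 × 1024 bytes is an assumption about how Watson counts it.

```swift
import Foundation

// Hypothetical record for one downloaded image: its file name and size in bytes.
struct ImageFile: Equatable {
    let name: String
    let bytes: Int
}

// Drops the largest files first until the total fits within `budgetBytes`.
// Returns the files that were kept, smallest first.
func trimToBudget(_ files: [ImageFile], budgetBytes: Int) -> [ImageFile] {
    var kept = files.sorted { $0.bytes < $1.bytes }
    var total = kept.reduce(0) { $0 + $1.bytes }
    while total > budgetBytes, let largest = kept.popLast() {
        total -= largest.bytes
    }
    return kept
}

// Reads the name and size of every entry in a folder (entries that cannot be read are skipped).
func imageFiles(inFolder path: String) -> [ImageFile] {
    let fileManager = FileManager.default
    let names = (try? fileManager.contentsOfDirectory(atPath: path)) ?? []
    return names.compactMap { name in
        let fullPath = URL(fileURLWithPath: path).appendingPathComponent(name).path
        guard let attributes = try? fileManager.attributesOfItem(atPath: fullPath),
              let size = attributes[.size] as? NSNumber else { return nil }
        return ImageFile(name: name, bytes: size.intValue)
    }
}
```

Calling `trimToBudget(imageFiles(inFolder: "nothotdog"), budgetBytes: 250 * 1024 * 1024)` would then list the files that can go into one zip; anything not in the returned list is a candidate for removal.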

Now, go back to the Watson Studio project, and upload the hotdog.zip file to our computer vision model. Once our zip finishes uploading, you'll notice that a new class hotdog has been created for us.

Next, upload your nothotdog.zip file. After it finishes uploading, you'll have two classes: hotdog and nothotdog. For this example, we only need one class, hotdog; the images in the other class need to be moved into the existing Negative class. To do this, open the nothotdog class, select the list view from the top, scroll to the bottom and set the list length to 200, then scroll back to the top and click the select all button.

With all your images selected, click the Reclassify button, select the Negative class, and click Submit.

Once all the images have been reclassified, click back to the list of models to select and delete the nothotdog class. Now, we are ready to click the Train Model button to get Watson trained on our images.

Note: If you want to use a large data set, you’ll have to break it up into pieces and repeat the training process (above) for each batch.
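Breaking a large data set into pieces can also be automated. A minimal sketch, assuming each batch simply has to stay under a byte budget; `splitIntoBatches` is a hypothetical helper and not part of any Watson tooling.

```swift
// Greedily packs file sizes (in bytes) into batches whose totals stay within `maxBytes`.
// A single file larger than `maxBytes` still ends up in a batch of its own.
func splitIntoBatches(_ sizes: [Int], maxBytes: Int) -> [[Int]] {
    var batches: [[Int]] = []
    var current: [Int] = []
    var currentTotal = 0
    for size in sizes {
        // Close the current batch if adding this file would exceed the budget.
        if !current.isEmpty && currentTotal + size > maxBytes {
            batches.append(current)
            current = []
            currentTotal = 0
        }
        current.append(size)
        currentTotal += size
    }
    if !current.isEmpty {
        batches.append(current)
    }
    return batches
}
```

Each inner array corresponds to one zip file to upload and train on before moving to the next.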

That's about it for collecting training images. All in all, it wasn't too bad.

Building the iOS App

Now that Watson is training the visual recognition model, we are ready to build our iOS app.

In this example, we’ll build an app that allows users to create or join a channel. Users who create a channel can then live stream themselves while they use the custom IBM Watson model to infer hotdog or not hotdog.

Let’s start by creating a new single view app in Xcode.

Remove Scene Delegate

Since we are using the Storyboard interface, we can remove SceneDelegate.swift and the Scene Manifest entry from the Info.plist. Then we need to open AppDelegate.swift, remove the Scene Delegate methods, and add the window property. Your AppDelegate.swift should look like this:
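A minimal sketch of such an AppDelegate, assuming the default Xcode single view template. It is an illustration and may not match the project's file line for line.

```swift
import UIKit

// Minimal AppDelegate for a Storyboard-based app without a SceneDelegate.
// Older Xcode templates use @UIApplicationMain instead of @main.
@main
class AppDelegate: UIResponder, UIApplicationDelegate {

    // Without a SceneDelegate, the app delegate owns the window.
    var window: UIWindow?

    func application(_ application: UIApplication,
                     didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
        // Override point for customization after application launch.
        return true
    }
}
```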

Since this project will implement ARKit and Agora, we’ll use the AgoraARKit library to simplify the implementation and UI for us.

Create a Podfile, open it and add the AgoraARKit pod.

platform :ios, '12.2'

target 'Agora Watson ARKit Demo' do
  use_frameworks!

  # Pods for Agora Watson ARKit Demo
  pod 'AgoraARKit'

  target 'Agora Watson ARKit DemoTests' do
    inherit! :search_paths
    # Pods for testing
  end

  target 'Agora Watson ARKit DemoUITests' do
    # Pods for testing
  end

end

Then run the install:

pod install

Permissions

Add NSCameraUsageDescription, NSMicrophoneUsageDescription, NSPhotoLibraryAddUsageDescription, and NSPhotoLibraryUsageDescription to the Info.plist with a brief description for each. AgoraARKit uses the popular ARVideoKit framework, and the last two permissions are required by ARVideoKit because of its ability to store photos/videos.

Note: I'm not implementing on-device recording so we don't need any of the library permissions; but if you plan to use this in production you will need to include them because they are requirements for ARVideoKit. For more information review Apple's guidelines on permissions.

Building the UI

We are ready to start building the UI. For this app we will have to build two views, the initial view and the AR view.

In any live streaming or communication app, you (as the developer) have two options for setting a channel name: do it for the user, or allow users to input their own. The latter is more flexible, so our initial view will inherit from AgoraLobbyVC and let users enter a channel name. Open your ViewController.swift, add import AgoraARKit just below the import UIKit line, and set your ViewController class to inherit from AgoraLobbyVC.

Next, set your Agora App Id within the loadView method and also set a custom image for the bannerImage property.

Next let's override the joinSession and createSession methods within our ViewController to set the images for the audience and broadcaster views.

import UIKit
import AgoraARKit

class ViewController: AgoraLobbyVC {

    override func loadView() {
        super.loadView()

        AgoraARKit.agoraAppId = ""

        // set the banner image within the initial view
        if let agoraLogo = UIImage(named: "watson_live_banner") {
            self.bannerImage = agoraLogo
        }
    }

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view.
    }

    // MARK: Button Actions
    @IBAction override func joinSession() {
        if let channelName = self.userInput.text {
            if channelName != "" {
                let arAudienceVC = ARAudience()
                if let exitBtnImage = UIImage(named: "exit") {
                    arAudienceVC.backBtnImage = exitBtnImage
                }
                arAudienceVC.channelName = channelName
                arAudienceVC.modalPresentationStyle = .fullScreen
                self.present(arAudienceVC, animated: true, completion: nil)
            } else {
               // TODO: add visible msg to user
               print("unable to join a broadcast without a channel name")
            }
        }
    }

    @IBAction override func createSession() {
        if let channelName = self.userInput.text {
            if channelName != "" {
                let arBroadcastVC = ARBroadcaster()
                if let exitBtnImage = UIImage(named: "exit") {
                    arBroadcastVC.backBtnImage = exitBtnImage
                }
                if let micBtnImage = UIImage(named: "mic"),
                    let muteBtnImage = UIImage(named: "mute") {
                    arBroadcastVC.micBtnImage = micBtnImage
                    arBroadcastVC.muteBtnImage = muteBtnImage
                }

                arBroadcastVC.channelName = channelName
                arBroadcastVC.modalPresentationStyle = .fullScreen
                self.present(arBroadcastVC, animated: true, completion: nil)
            } else {
               // TODO: add visible msg to user
               print("unable to launch a broadcast without a channel name")
            }
        }
    }
}
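The `channelName != ""` check above accepts a name made only of spaces. One possible refinement is to trim the input first. This is a sketch; `normalizedChannelName` is a hypothetical helper and not part of AgoraARKit.

```swift
import Foundation

// Hypothetical helper: returns the trimmed channel name,
// or nil if the user entered nothing usable.
func normalizedChannelName(_ input: String?) -> String? {
    guard let trimmed = input?.trimmingCharacters(in: .whitespacesAndNewlines),
          !trimmed.isEmpty else {
        return nil
    }
    return trimmed
}
```

Inside joinSession or createSession, `guard let channelName = normalizedChannelName(self.userInput.text) else { return }` could then replace the nested `if let` and empty-string check.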

Adding in the AI

Once Watson has finished training your model, you’ll need to download the CoreML file. Open Watson Studio and select the Hotdog model. Within the model details, select the Implementation tab, then select the Core ML tab from the sub-menu on the left side of the screen. At the top of the Core ML section is the link to download the Core ML model file.

Once you’ve downloaded the Hotdog.mlmodel file, drag the file into your Xcode project.

The computer vision will be running within our AR view, which is also the camera view being streamed into Agora, so we'll extend the ARBroadcaster class. The ARBroadcaster class is a bare-bones ARSCNView that is set up as a custom video source for Agora's SDK.

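Create a new class called arHotDogBroadcaster that inherits from ARBroadcaster. Add properties for the model, the Vision requests, and the DispatchQueue, then override viewDidLoad to load the CoreML model into a Vision request. The snippets that follow use Vision and CoreML types, so the file needs to import both frameworks.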

    let mlModel: MLModel = Hotdog().model
    var visionRequests = [VNRequest]()
    let dispatchQueueML = DispatchQueue(label: "io.agora.dispatchqueue.ml") // A Serial Queue

    override func viewDidLoad() {
        super.viewDidLoad()
        // Set up Vision Model
        guard let hotDogModel = try? VNCoreMLModel(for: mlModel) else {
            fatalError("Could not load model. Ensure the Core ML model is in your Xcode project and part of a target (see: https://stackoverflow.com/questions/45884085/model-is-not-part-of-any-target-add-the-model-to-a-target-to-enable-generation)")
        }

        // Set up Vision-CoreML Request
        let classificationRequest = VNCoreMLRequest(model: hotDogModel, completionHandler: classificationCompleteHandler)
        classificationRequest.imageCropAndScaleOption = VNImageCropAndScaleOption.centerCrop // Crop from centre of images and scale to appropriate size.
        visionRequests = [classificationRequest]

    }

We'll use the currentFrame from the ARKit scene as our input for our computer vision. Use the currentFrame.capturedImage to create a CIImage that will be used as input for our VNImageRequestHandler.

func runCoreML() {
    // Get Camera Image as RGB
    guard let sceneView = self.sceneView else { return }
    guard let currentFrame = sceneView.session.currentFrame else { return }
    let pixbuff : CVPixelBuffer = currentFrame.capturedImage
    let ciImage = CIImage(cvPixelBuffer: pixbuff)

    // Prepare CoreML/Vision Request
    let imageRequestHandler = VNImageRequestHandler(ciImage: ciImage, options: [:])

    // Run Image Request
    do {
        try imageRequestHandler.perform(self.visionRequests)
    } catch {
        print(error)
    }

}

How do we know what the results are? If you look at the viewDidLoad snippet above, you'll notice we set classificationCompleteHandler as the completion block for any classification requests.

func classificationCompleteHandler(request: VNRequest, error: Error?) {
    // Catch Errors
    if let error = error {
        print("Error: \(error.localizedDescription)")
        return
    }

    // Get the top classification; bail out instead of crashing if there are no usable results
    guard let classification = request.results?.first as? VNClassificationObservation else {
        print("No results")
        return
    }

    DispatchQueue.main.async {
        // Print Classifications
        print("--")

        // Display Debug Text on screen
        let debugText: String = "- \(classification.identifier) : \(classification.confidence)"
        print(debugText)

        // Display prediction
        var objectName: String = "Not Hotdog"
        if classification.confidence > 0.4 {
            objectName = "Hotdog"
        }

        // show the result
        self.showResult(objectName)
    }
}

Every time the CoreML engine returns a response, we need to parse it and display either “Hotdog” or “Not Hotdog” to the user. You’ll notice in the snippet that once we have a result, we check its confidence level. I set the bar fairly low at 40% confidence. That means the AI model only has to be 40% confident that it sees a hotdog.

During testing, 40% confidence proved adequate for the intents of this project, but you may want to adjust that value depending on how sensitive you want your AI to be.
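To make that threshold easier to tune, the decision can be pulled out into a small function. This is a sketch: `resultLabel` is a hypothetical name, and it mirrors the strict greater-than comparison used in classificationCompleteHandler.

```swift
// Hypothetical helper mirroring the check in classificationCompleteHandler.
// `threshold` defaults to the 0.4 used in this post; raise it for fewer false positives.
func resultLabel(forConfidence confidence: Float, threshold: Float = 0.4) -> String {
    return confidence > threshold ? "Hotdog" : "Not Hotdog"
}
```

With this in place, the handler could call `self.showResult(resultLabel(forConfidence: classification.confidence))`, and experimenting with sensitivity becomes a one-argument change.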

All that's left now is to display the result to the user using augmented reality. You'll notice that in classificationCompleteHandler, we call the function self.showResult and pass in a string with the value of either "Hotdog" or "Not Hotdog." Within showResult, we need to get the estimated position of the object and add an AR text label.

func showResult(_ result: String) {
    // HIT TEST : REAL WORLD
    // Get Screen Centre
    let screenCentre : CGPoint = CGPoint(x: self.sceneView.bounds.midX, y: self.sceneView.bounds.midY)

    let arHitTestResults : [ARHitTestResult] = sceneView.hitTest(screenCentre, types: [.featurePoint]) // Alternatively, we could use '.existingPlaneUsingExtent' for more grounded hit-test-points.

    if let closestResult = arHitTestResults.first {
        // Get Coordinates of HitTest
        let transform : matrix_float4x4 = closestResult.worldTransform
        let worldCoord : SCNVector3 = SCNVector3Make(transform.columns.3.x, transform.columns.3.y, transform.columns.3.z)

        // Create 3D Text
        let node : SCNNode = createNewResultsNode(result)
        resultsRootNode.addChildNode(node)
        node.position = worldCoord
    }
}

func createNewResultsNode(_ text : String) -> SCNNode {
    // Warning: Programmatically generating 3D Text is susceptible to crashing. To reduce chances of crashing; reduce number of polygons, letters, smoothness, etc.
    print("show result: \(text)")
    // Billboard constraint to force text to always face the user
    let billboardConstraint = SCNBillboardConstraint()
    billboardConstraint.freeAxes = SCNBillboardAxis.Y

    // SCN Text
    let scnText = SCNText(string: text, extrusionDepth: CGFloat(textDepth))
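    var font = UIFont(name: "Helvetica", size: 0.15)
    // withTraits(traits:) is a custom UIFont extension, not a UIKit API; it has to be defined in the project
    font = font?.withTraits(traits: .traitBold)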
    scnText.font = font
    scnText.alignmentMode = CATextLayerAlignmentMode.center.rawValue
    scnText.firstMaterial?.diffuse.contents = UIColor.orange
    scnText.firstMaterial?.specular.contents = UIColor.white
    scnText.firstMaterial?.isDoubleSided = true
    scnText.chamferRadius = CGFloat(textDepth)

    // Text Node
    let (minBound, maxBound) = scnText.boundingBox
    let textNode = SCNNode(geometry: scnText)
    // Centre Node - to Centre-Bottom point
    textNode.pivot = SCNMatrix4MakeTranslation( (maxBound.x - minBound.x)/2, minBound.y, textDepth/2)
    // Reduce default text size
    textNode.scale = SCNVector3Make(0.2, 0.2, 0.2)

    // Sphere Node
    let sphere = SCNSphere(radius: 0.005)
    sphere.firstMaterial?.diffuse.contents = UIColor.cyan
    let sphereNode = SCNNode(geometry: sphere)

    // Text Parent Node
    let textParentNode = SCNNode()
    textParentNode.addChildNode(textNode)
    textParentNode.addChildNode(sphereNode)
    textParentNode.constraints = [billboardConstraint]

    return textParentNode
}
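The snippet above relies on a few things it does not define: the `withTraits(traits:)` call on UIFont, plus the `textDepth` and `resultsRootNode` properties. A sketch of how they might be supplied follows. The extension is a common community pattern, and the property values in the comments are assumptions, not values from this post.

```swift
import UIKit
import SceneKit

extension UIFont {
    // Returns a copy of the font with the given symbolic traits (for example .traitBold) applied,
    // or the font unchanged if no matching variant exists.
    func withTraits(traits: UIFontDescriptor.SymbolicTraits...) -> UIFont {
        let combined = UIFontDescriptor.SymbolicTraits(traits)
        guard let descriptor = fontDescriptor.withSymbolicTraits(combined) else { return self }
        // A size of 0 keeps the descriptor's existing point size.
        return UIFont(descriptor: descriptor, size: 0)
    }
}

// Assumed properties inside arHotDogBroadcaster (values are illustrative):
// let textDepth: Float = 0.01             // extrusion depth of the 3D text
// let resultsRootNode = SCNNode()         // parent node for all result labels;
//                                         // add it to sceneView.scene.rootNode in viewDidLoad
```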

Now that we have our model ready to run, we need to add a way for the user to invoke the computer vision model. Let's use the View's touchesBegan method to call the runCoreML method.

override func touchesBegan(_ touches: Set<UITouch>, with event: UIEvent?) {
  dispatchQueueML.async {
      self.runCoreML()
  }
}
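Because every tap queues another classification on the serial queue, rapid tapping can pile up work. One option is to ignore taps while a request is still running. The sketch below assumes a hypothetical `ClassificationGate` class; it is not part of AgoraARKit or Vision.

```swift
import Foundation

// Hypothetical guard that lets only one classification run at a time.
final class ClassificationGate {
    private let lock = NSLock()
    private var busy = false

    // Returns true if the caller may start a classification,
    // false if one is already in progress.
    func tryBegin() -> Bool {
        lock.lock()
        defer { lock.unlock() }
        if busy { return false }
        busy = true
        return true
    }

    // Marks the current classification as finished.
    func end() {
        lock.lock()
        busy = false
        lock.unlock()
    }
}
```

In touchesBegan, this could look like `guard gate.tryBegin() else { return }` before dispatching, with `gate.end()` called right after `self.runCoreML()` returns inside the async block.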

Implement new broadcaster class

We're almost done. The last step (before we can start testing) is to swap in our new class in ViewController.swift by updating line 43 (the line in createSession that creates the ARBroadcaster) to:

let arBroadcastVC = arHotDogBroadcaster()

This makes the broadcaster view use our new arHotDogBroadcaster class, and we are ready to start testing!

That's It!

The core application is done, and I'll leave it up to you to customize the UI. Thanks for following along. If you have any questions or feedback, please leave a comment.

I've uploaded my complete code, with UI customizations, to GitHub, so feel free to fork the repo and make PRs for new features.