DEV Community

Jordan Osterberg

ARKit + Vision: An intriguing combination

After multiple weeks of pondering what I should do first with Apple's newly announced ARKit, I decided that I wouldn't narrow my mindset to just that one API. I had viewed multiple tutorials on CoreML/Vision's object recognition features, and I decided to give it a shot myself.

TL;DR: ARKit and Vision make an awesome combination.

What are we doing?

We're going to create an ARKit app that, whenever the screen is tapped, displays what the iOS device believes the object in front of the camera is. (See the bottom of the article for example pictures.)

Project Setup

We begin our journey in Xcode (9 or above), where we create a new Augmented Reality App...

select-ar-app

...give it a name... (in my case "arkit-testing-2") and set the Content Technology as SpriteKit...

content-tech

...select its location on our hard drive, and start plugging away.

ViewController.swift

We're going to focus on the important pieces of code in this class, as most of it is general boilerplate.

override func viewWillAppear(_ animated: Bool) {
    super.viewWillAppear(animated)

    // Create a session configuration
    let configuration = ARWorldTrackingConfiguration()

    // Run the view's session
    sceneView.session.run(configuration)
}

In viewWillAppear an ARWorldTrackingConfiguration is created, and then the view's session is run with it. (In the Xcode 9 betas this class was called ARWorldTrackingSessionConfiguration; it was renamed before the final release.) You can modify the configuration if you wish, but for this tutorial we won't be playing with it.

func view(_ view: ARSKView, nodeFor anchor: ARAnchor) -> SKNode? {
    // Create and configure a node for the anchor added to the view's session.
    let labelNode = SKLabelNode(text: "👾")
    labelNode.horizontalAlignmentMode = .center
    labelNode.verticalAlignmentMode = .center
    return labelNode
}

Inside this function, an ARSKView object is provided, along with an ARAnchor object. The ARAnchor object will be important later. Inside the function an SKLabelNode is configured and returned. This will also be important later.

Before we jump into the other important file in this boilerplate project, let's modify our viewDidLoad method so we won't encounter a bug that I encountered when creating this project.

Replace...

// Load the SKScene from 'Scene.sks'
if let scene = SKScene(fileNamed: "Scene") {
    sceneView.presentScene(scene)
}

with...

let scene = Scene(size: self.view.frame.size)
sceneView.presentScene(scene)

I'm not sure what the bug is, or why this fixes it, but it does. You can play with the original code and look for alternative fixes if need be.

Scene.swift

To begin, comment out the following code inside of touchesBegan:

// Create a transform with a translation of 0.2 meters in front of the camera
var translation = matrix_identity_float4x4
translation.columns.3.z = -0.2
let transform = simd_mul(currentFrame.camera.transform, translation)

// Add a new anchor to the session
let anchor = ARAnchor(transform: transform)
sceneView.session.add(anchor: anchor)

Yes, comment all of this out. Do not delete it; we'll come back to it later.

Vision!

Inside of the Scene.swift file, make sure you import the Vision framework before getting started:

import Vision

Now go to the Apple Developer Website's machine learning page and download the InceptionV3 model. You can download any model you'd like; this is just the one I prefer, and for what it does it's relatively small in file size.

Editor's Note: The InceptionV3 model is no longer on the site. Fortunately, you can download a different model and adapt the code accordingly.

All you have to do now is drag and drop the InceptionV3 MLModel file into your project, just like you would with any other file.

What Xcode does for you here is generate a Swift interface for the model. I would recommend watching the Vision and Introducing CoreML sessions from WWDC17 to learn more about it, located here and here, respectively.

Now we're finally ready to write some code inside touchesBegan.

Let's enter a background thread to not completely wreck our application's performance when we run one of these requests (I learned this the hard way):

DispatchQueue.global(qos: .background).async {

}

Now let's add a do-catch block and, inside it, create a VNCoreMLModel object from the CoreML model we downloaded moments ago (depending on your internet speeds, of course):

do {
    let model = try VNCoreMLModel(for: Inceptionv3().model)
} catch {
    // Log the failure instead of silently ignoring it
    print("Vision request failed: \(error)")
}

Inside of our do catch and just after our model initialization, let's create a VNCoreMLRequest with a completionHandler like so:

let request = VNCoreMLRequest(model: model, completionHandler: { (request, error) in

})

Now, let's create a VNImageRequestHandler and perform our request (Write this code after VNCoreMLRequest's completionHandler):

let handler = VNImageRequestHandler(cvPixelBuffer: currentFrame.capturedImage, options: [:])
try handler.perform([request])

Let me explain what this code is actually doing, because it can get a little strange.

We're creating an image request handler to handle our request, and passing it a...

CVPixelBuffer?!? What the heck is that? According to StackOverflow, CVPixelBuffer is part of the Core Video framework: it's a buffer that holds the pixels of an image in memory. Fortunately for us, we can access one from ARKit by pulling it out of the currentFrame object, saving us from doing any heavy lifting.

currentFrame.capturedImage

Then we're performing our request with handler.perform([request]).
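If you're curious what a CVPixelBuffer looks like outside of ARKit, here's a rough sketch (not part of the project) that creates one by hand and inspects it. The helper names makePixelBuffer and describe are my own inventions for illustration, and I'm using a BGRA buffer because it's simple to create; as far as I know, the frames ARKit hands you use a YCbCr format instead, so don't assume BGRA in real code.

```swift
import CoreVideo

/// Creates an empty BGRA pixel buffer, standing in for a camera frame.
/// (Hypothetical helper; in the app the buffer comes from currentFrame.capturedImage.)
func makePixelBuffer(width: Int, height: Int) -> CVPixelBuffer? {
    var buffer: CVPixelBuffer?
    // Passing nil for the allocator and attributes uses the defaults.
    let status = CVPixelBufferCreate(nil, width, height,
                                     kCVPixelFormatType_32BGRA,
                                     nil, &buffer)
    return status == kCVReturnSuccess ? buffer : nil
}

/// Describes the size and pixel format of any pixel buffer.
func describe(_ buffer: CVPixelBuffer) -> String {
    let width = CVPixelBufferGetWidth(buffer)
    let height = CVPixelBufferGetHeight(buffer)
    let isBGRA = CVPixelBufferGetPixelFormatType(buffer) == kCVPixelFormatType_32BGRA
    return "\(width)x\(height), BGRA: \(isBGRA)"
}

if let buffer = makePixelBuffer(width: 640, height: 480) {
    print(describe(buffer))
}
```

You could call describe(currentFrame.capturedImage) inside touchesBegan to see what ARKit is actually passing to Vision on your device.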

Now let's write the code inside of completionHandler:

// Jump onto the main thread
DispatchQueue.main.async {
    // Access the first result in the array after casting the array as a VNClassificationObservation array
    guard let results = request.results as? [VNClassificationObservation], let result = results.first else {
        print ("No results?")
        return
    }
}

Awesome, we're almost done with our Scene class. Remember the code we commented out earlier? Let's paste it in right after that guard statement.

We're also going to modify a property to make our text appear further away from the device when we instantiate our ARKit object:

// Create a transform with a translation of 0.2 meters in front of the camera
translation.columns.3.z = -0.4 // Originally this was -0.2

If you'd like, you can update the comment to read 0.4 meters, because that comment was for the previous value of the property.
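To see why changing that one value moves the text further away, here's a small standalone sketch of the same math using only simd. The function name transformInFront and the camera position are made up for illustration; in the app, the camera transform comes from currentFrame.camera.transform.

```swift
import simd

/// Returns a world transform `distance` meters in front of a camera.
/// (Hypothetical helper wrapping the same three lines used in touchesBegan.)
func transformInFront(of cameraTransform: simd_float4x4, distance: Float) -> simd_float4x4 {
    var translation = matrix_identity_float4x4
    // The camera looks down its own negative z axis, so "in front" is -z.
    translation.columns.3.z = -distance
    // Multiplying by the camera transform converts the offset into world space.
    return simd_mul(cameraTransform, translation)
}

// A made-up camera: 1 meter to the right of the world origin, not rotated.
var camera = matrix_identity_float4x4
camera.columns.3 = SIMD4<Float>(1, 0, 0, 1)

let anchorTransform = transformInFront(of: camera, distance: 0.4)
let position = anchorTransform.columns.3
print(position.x, position.y, position.z)
```

With this unrotated camera the anchor lands at x = 1, y = 0, z = -0.4, which is the camera's position pushed 0.4 meters forward. When the camera is rotated, the same multiplication pushes the anchor along whatever direction the camera is facing.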

One last thing and we're done with our Scene class. Create a new Swift file called ARBridge.swift and paste in the following code:

import UIKit
import ARKit

class ARBridge {

    static let shared = ARBridge()

    var anchorsToIdentifiers = [ARAnchor : String]()

}

The anchorsToIdentifiers property will allow us to associate an ARAnchor with its corresponding machine-learning value.
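One thing this dictionary never does is forget an anchor. If that bothers you, a possible variation (just a sketch, not what the project uses) is to key the labels by each anchor's UUID, which ARAnchor exposes as its identifier property, and to remove entries when they're no longer needed. The LabelRegistry class and its method names below are hypothetical, and the UUID in the usage lines is a stand-in for anchor.identifier.

```swift
import Foundation

/// A hypothetical variant of ARBridge keyed by UUID instead of by ARAnchor.
final class LabelRegistry {
    static let shared = LabelRegistry()

    private var labels = [UUID: String]()

    /// Store the machine-learning label for an anchor's identifier.
    func setLabel(_ label: String, for id: UUID) {
        labels[id] = label
    }

    /// Look up the label for an anchor's identifier, if there is one.
    func label(for id: UUID) -> String? {
        return labels[id]
    }

    /// Forget a label, e.g. once its anchor has been removed from the session.
    @discardableResult
    func removeLabel(for id: UUID) -> String? {
        return labels.removeValue(forKey: id)
    }
}

// Stand-in identifier; in the app this would be anchor.identifier.
let anchorID = UUID()
LabelRegistry.shared.setLabel("coffee mug", for: anchorID)
print(LabelRegistry.shared.label(for: anchorID) ?? "no label")
LabelRegistry.shared.removeLabel(for: anchorID)
```

The rest of this tutorial sticks with the simpler ARBridge shown above.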

Let's add a value to this dictionary, and restructure our code so that it executes properly:

// Create a new ARAnchor
let anchor = ARAnchor(transform: transform)

// Set the identifier
ARBridge.shared.anchorsToIdentifiers[anchor] = result.identifier

// Add a new anchor to the session
sceneView.session.add(anchor: anchor)

Side note: If we save our identifier after we add the anchor to our scene, it won't appear properly. Make sure your code is in the order shown above.

We're all set! This is all of the code we just wrote inside of our touchesBegan function:

DispatchQueue.global(qos: .background).async {
    do {
        let model = try VNCoreMLModel(for: Inceptionv3().model)
        let request = VNCoreMLRequest(model: model, completionHandler: { (request, error) in
            // Jump onto the main thread
            DispatchQueue.main.async {
                // Access the first result in the array after casting the array as a VNClassificationObservation array
                guard let results = request.results as? [VNClassificationObservation], let result = results.first else {
                    print("No results?")
                    return
                }

                // Create a transform with a translation of 0.4 meters in front of the camera
                var translation = matrix_identity_float4x4
                translation.columns.3.z = -0.4
                let transform = simd_mul(currentFrame.camera.transform, translation)

                // Create a new ARAnchor
                let anchor = ARAnchor(transform: transform)

                // Set the identifier
                ARBridge.shared.anchorsToIdentifiers[anchor] = result.identifier

                // Add a new anchor to the session
                sceneView.session.add(anchor: anchor)
            }
        })

        let handler = VNImageRequestHandler(cvPixelBuffer: currentFrame.capturedImage, options: [:])
        try handler.perform([request])
    } catch {
        // Log the failure instead of silently ignoring it
        print("Vision request failed: \(error)")
    }
}

(Finally) Back to ViewController.swift

The only thing we need to do now is modify our view method to retrieve the text associated with our ARAnchor, which was generated by our machine learning model.

func view(_ view: ARSKView, nodeFor anchor: ARAnchor) -> SKNode? {
    // Create and configure a node for the anchor added to the view's session.
    guard let identifier = ARBridge.shared.anchorsToIdentifiers[anchor] else {
        return nil
    }

    let labelNode = SKLabelNode(text: identifier)
    labelNode.horizontalAlignmentMode = .center
    labelNode.verticalAlignmentMode = .center
    labelNode.fontName = UIFont.boldSystemFont(ofSize: 16).fontName
    return labelNode
}

If there is no text associated with the ARAnchor, no SKNode is returned. If text exists, we create an SKLabelNode, change the font, and return it!

Testing!!!

I ran around my room pointing my camera at random objects, and this was the result:

example-1
example-2
example-3

It believed the MacBook Air on my desk was a stethoscope (that could have been the headphones or the mic), the pen on my nightstand was a revolver, and my Apple Watch sport band was a hatchet.

Other than that, it was amazing at predicting what the objects were. It thought the code for this project was a web-site, which was slightly correct. It also detected the snake pattern on my mousepad from Razer, which was pretty amazing.
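One way to cut down on mislabels like the stethoscope is to skip results the model isn't confident about. VNClassificationObservation has a confidence value between 0 and 1 alongside its identifier, so a filter along these lines should be possible. This is only a sketch: the Classification struct is a stand-in for the two properties I'm using, bestLabel and the 0.3 threshold are my own guesses, and the sample numbers are invented for illustration rather than measured from the app.

```swift
/// Minimal stand-in for the two VNClassificationObservation properties used here.
struct Classification {
    let identifier: String
    let confidence: Float
}

/// Returns the best label with its confidence as a percentage,
/// or nil when the model is not confident enough.
func bestLabel(from results: [Classification], minimumConfidence: Float = 0.3) -> String? {
    // Pick the highest-confidence result, then check it against the threshold.
    guard let top = results.max(by: { $0.confidence < $1.confidence }),
          top.confidence >= minimumConfidence else {
        return nil
    }
    let percent = Int((top.confidence * 100).rounded())
    return "\(top.identifier) (\(percent)%)"
}

// Invented sample values, not real model output.
let confident = [Classification(identifier: "laptop", confidence: 0.82),
                 Classification(identifier: "stethoscope", confidence: 0.05)]
let unsure = [Classification(identifier: "revolver", confidence: 0.12)]

print(bestLabel(from: confident) ?? "skipped")
print(bestLabel(from: unsure) ?? "skipped")
```

In the app, you'd apply the same check to result.confidence in the completion handler and return early instead of adding an anchor when it falls below your threshold.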

With different models, I'm sure there will be different results, so try multiple models out and see what happens. It's as simple as dragging and dropping them into the project and changing the line of code that accesses the model.

The final project can be found on GitHub here, if you just want to run it and see what happens!

Thank you so much for reading! Hopefully you enjoyed my (pretty basic) endeavor into ARKit and Vision!

Top comments (9)

bunnyhero🐰😱

i believe CVPixelBuffer stands for “Core Video Pixel Buffer” (from the Core Video framework)

Derick Angelo

It's not showing anything mine on the view. :(

GiuCom

Nice !!!
Is it possible to recognize a specific image (marker) chosen by me ???
How can I teach a machine learning model to perform this recognition?
Thanks

Paul Kim

ARWorldTrackingSessionConfiguration is deprecated. replace with ARWorldTrackingConfiguration() and it works fine :)

n3o999

is it possible to implement a face detection in ArKit without machine learning? How to draw a square and obtain distance from the detected object? (possibly without lag :) ) thanks

Ben Halpern

So cool!

Dylan Babbs

Nice tutorial. Doesn't work though. The new release must have changed since you wrote this on the beta.

n3o999

Is it possible to implement a face detection in ArKit without machine learning? How to draw a square and obtain distance from the detected object? (possibly without lag :) ). Thanks!

Gerald S

Jordan, I'm having a hard time finding the InceptionV3 model. Did it change to something else?