We’ve just created Embedl Hub, a developer platform where you can experiment with on-device AI and analyze how models perform on real hardware. It allows you to optimize, benchmark, and compare models by running them on devices hosted in the cloud, so you don’t need access to physical hardware yourself.
You can test performance across phones, dev boards, and SoCs directly from your Python environment or terminal. Everything is free to use.
Overview of the Platform
Optimize and deploy your model on any edge device with the Embedl Hub Python library:
- Quantize your model for lower latency and memory usage.
- Compile your model for execution on CPU, GPU, NPU or other AI accelerators on your target devices.
- Benchmark your model's latency and memory usage on real edge devices in the cloud.
Embedl Hub logs your metrics, parameters, and benchmarks, allowing you to inspect and compare your results on the web and reproduce them later.
What We’ve Built
To showcase the platform, we’ll walk through how it can be used to optimize and profile a model running on a Samsung Galaxy S24 mobile phone.
Compile the model
Let’s say you want to run a MobileNetV2 model trained in PyTorch.
First, export the model to ONNX and then compile it for the target runtime. In this case, we want to run it using LiteRT (TFLite).
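If the model lives in PyTorch, the ONNX export is standard. Here is a minimal sketch using the pretrained torchvision MobileNetV2 as a stand-in for your own trained model (the output file name and opset version are illustrative):

import torch
import torchvision

# Load a pretrained MobileNetV2 as a stand-in for your own trained model
model = torchvision.models.mobilenet_v2(
    weights=torchvision.models.MobileNet_V2_Weights.DEFAULT
)
model.eval()

# Dummy input matching the 1x3x224x224 shape passed to the compile step
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)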
To compile it with the embedl-hub CLI, you run the following command (the quantized ONNX file passed here is produced by the quantize step described in the next section):
embedl-hub compile \
--model /path/to/mobilenet_v2_quantized.onnx \
--size 1,3,224,224 \
--device "Samsung Galaxy S24" \
--runtime tflite
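Before benchmarking on a device, you can optionally sanity-check the compiled artifact locally with the LiteRT (TFLite) interpreter. A minimal sketch, assuming the compile step produced mobilenet_v2_quantized.tflite and that the tensorflow package is installed:

import numpy as np
import tensorflow as tf

# Load the compiled model and allocate its tensors
interpreter = tf.lite.Interpreter(model_path="mobilenet_v2_quantized.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print("input shape:", input_details[0]["shape"])
print("input dtype:", input_details[0]["dtype"])

# Run one inference on dummy data to confirm the model executes
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
print("output shape:", interpreter.get_tensor(output_details[0]["index"]).shape)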
Quantize the model
Quantization is an optional but highly recommended step that can drastically reduce inference latency and memory usage. It is especially useful when deploying models to resource-constrained hardware such as mobile phones or embedded boards. It works by lowering the numerical precision of weights and activations in the model.
While this can reduce the model’s accuracy, you can minimize the loss by calibrating with a small sample dataset, typically just a few hundred examples.
embedl-hub quantize \
--model /path/to/mobilenet_v2.onnx \
--data /path/to/dataset \
--num-samples 100
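To make the precision trade-off concrete, here is a minimal sketch of the affine int8 mapping that this kind of quantization typically applies to each tensor; it is purely illustrative and not necessarily the exact scheme embedl-hub uses:

import numpy as np

def quantize_int8(x: np.ndarray):
    """Map a float tensor to int8 with a per-tensor scale and zero-point."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximate float tensor from its int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(64, 32).astype(np.float32)
q, scale, zp = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale, zp)).mean()
print(f"mean absolute quantization error: {error:.6f}")

Calibration with a representative sample dataset is what determines good scale and zero-point values for the activations, which is why a few hundred examples are usually enough to keep the accuracy loss small.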
Benchmark the model on remote hardware
Now that the model is compiled (and quantized), you can run it on real hardware directly through one of Embedl Hub's integrated device clouds.
embedl-hub benchmark \
--model /path/to/mobilenet_v2_quantized.tflite \
--device "Samsung Galaxy S24"
In this example, we run the model on a Samsung Galaxy S24. There is a large selection of devices to choose from on Embedl Hub; see the supported devices here.
Analyze & compare performance in the Web UI
Benchmarking the model gives useful information such as the model’s latency on the hardware platform, which layers are slowest, the number of layers executed on each compute unit type, and more! We can use this information for advanced debugging and for iterating on the model’s design. We can answer questions like:
- Can we optimize the slowest layer?
- Why aren’t certain layers executed on the expected compute unit?
All data and artifacts are stored securely in your Embedl Hub account.
Share your feedback
Embedl Hub is still in beta, and we’d love to hear your feedback and what features or devices you’d like to see next.
Try it out at hub.embedl.com and let us know what you think!
