<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elina Norling</title>
    <description>The latest articles on DEV Community by Elina Norling (@elina_norling_embedl).</description>
    <link>https://dev.to/elina_norling_embedl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3557597%2Fa31eb847-7acb-48ef-ae6f-fa1dd53d74f3.png</url>
      <title>DEV Community: Elina Norling</title>
      <link>https://dev.to/elina_norling_embedl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elina_norling_embedl"/>
    <language>en</language>
    <item>
      <title>Launch Celebration: Win a Jetson Orin Nano Super or Raspberry Pi 5!</title>
      <dc:creator>Elina Norling</dc:creator>
      <pubDate>Sun, 14 Dec 2025 10:24:57 +0000</pubDate>
      <link>https://dev.to/elina_norling_embedl/launch-celebration-win-a-jetson-orin-nano-super-or-raspberry-pi-5-4ilo</link>
      <guid>https://dev.to/elina_norling_embedl/launch-celebration-win-a-jetson-orin-nano-super-or-raspberry-pi-5-4ilo</guid>
      <description>&lt;h3&gt;
  
  
  AI hardware giveaway!
&lt;/h3&gt;

&lt;p&gt;We’ve just introduced the latest major update to our devtool &lt;a href="https://hub.embedl.com/?utm_source=devto" rel="noopener noreferrer"&gt;Embedl Hub&lt;/a&gt;: our own remote device cloud with 50 different mobile devices!🥳&lt;/p&gt;

&lt;p&gt;To celebrate, we’re hosting a giveaway!🤩 The participant who provides the most valuable feedback after using our platform will win an NVIDIA Jetson Orin Nano Super. We’re also giving away four Raspberry Pi 5s, one to each participant who places 2nd to 5th. &lt;/p&gt;

&lt;p&gt;Read about it &lt;a href="https://hub.embedl.com/blog/embedl-hub-device-cloud-launch-celebration/?utm_source=devto" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Join the competition!
&lt;/h3&gt;

&lt;p&gt;Get started with the &lt;a href="https://hub.embedl.com/docs/quickstart/?utm_source=devto" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt;. The winners are announced on January 20. Stay tuned!&lt;/p&gt;

&lt;p&gt;Our favorite features of Embedl Hub for on-device AI development (example command below):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compile your model for execution on CPU, GPU, NPU or other AI accelerators&lt;/li&gt;
&lt;li&gt;Quantize your model for lower latency and memory usage&lt;/li&gt;
&lt;li&gt;Compare performance metrics across different devices&lt;/li&gt;
&lt;li&gt;Identify which device delivers the best results for your model&lt;/li&gt;
&lt;/ul&gt;
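
&lt;p&gt;For example, compiling a model for a specific device and runtime looks like this with the &lt;code&gt;embedl-hub&lt;/code&gt; CLI (the model path and device name are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub compile \
    --model /path/to/mobilenet_v2.onnx \
    --size 1,3,224,224 \
    --device "Samsung Galaxy S24" \
    --runtime tflite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;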

&lt;p&gt;Some of our latest blog posts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From PyTorch to Shipping local AI on Android &lt;a href="https://hub.embedl.com/blog/from-pytorch-to-shipping-local-ai-on-android/?utm_source=devto" rel="noopener noreferrer"&gt;(link)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Diagnosing layer sensitivity during post training quantization &lt;a href="https://hub.embedl.com/blog/diagnosing-layer-sensitivity/?utm_source=devto" rel="noopener noreferrer"&gt;(link)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We're happy to help if you have any questions!&lt;/p&gt;

</description>
      <category>computervision</category>
      <category>mobile</category>
      <category>webdev</category>
      <category>androiddev</category>
    </item>
    <item>
      <title>Launch Celebration: Win a Jetson Orin Nano Super or Raspberry Pi 5!</title>
      <dc:creator>Elina Norling</dc:creator>
      <pubDate>Sun, 14 Dec 2025 09:54:55 +0000</pubDate>
      <link>https://dev.to/embedl-hub/launch-celebration-win-a-jetson-orin-nano-super-or-raspberry-pi-5-39om</link>
      <guid>https://dev.to/embedl-hub/launch-celebration-win-a-jetson-orin-nano-super-or-raspberry-pi-5-39om</guid>
      <description>&lt;p&gt;We’re excited to introduce the latest major update to &lt;a href="https://hub.embedl.com/?utm_source=devto" rel="noopener noreferrer"&gt;Embedl Hub&lt;/a&gt;: our own remote device cloud!&lt;/p&gt;

&lt;p&gt;To mark the occasion, we’re launching a community competition and giveaway. The participant who provides the most valuable feedback after using our platform to run and benchmark AI models on any device in our device cloud will win an NVIDIA Jetson Orin Nano Super. We’re also giving a Raspberry Pi 5 to everyone who places 2nd to 5th.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to participate
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Join Embedl Hub and &lt;a href="https://hub.embedl.com/docs/?utm_source=devto" rel="noopener noreferrer"&gt;get started&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Run a benchmark on any device from our &lt;a href="https://hub.embedl.com/docs/remote-hardware-clouds#embedl-hub-embedl/?utm_source=devto" rel="noopener noreferrer"&gt;device cloud&lt;/a&gt;, optionally compiling and quantizing the model beforehand. This is done with the Embedl Hub Python library (see the example command after this list).&lt;/li&gt;
&lt;li&gt;Give us your feedback about the platform in the feedback form available after login (see image below). Once you have submitted your feedback, you’re entered into the competition.&lt;/li&gt;
&lt;/ol&gt;
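
&lt;p&gt;For example, a benchmark run with the &lt;code&gt;embedl-hub&lt;/code&gt; CLI looks like this (the model path and device name are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub benchmark \
    --model /path/to/model.tflite \
    --device "Samsung Galaxy S24"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;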

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmmi42dx8ixzo7akpvem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmmi42dx8ixzo7akpvem.png" alt=" " width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You choose the level of feedback you give us, but to win the Jetson Orin Nano Super, your feedback needs to be the most valuable to our ongoing platform development. It could be a missing feature that would be useful, a friction point you discovered in the workflow after many runs, or something obvious we simply hadn’t thought of.&lt;/p&gt;

&lt;p&gt;The winners are announced on January 20. Stay tuned!&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoo4emgboggs9j3l6irx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoo4emgboggs9j3l6irx.png" alt=" " width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NVIDIA Jetson Orin Nano Super at a glance&lt;/strong&gt;&lt;br&gt;
NVIDIA’s latest compact edge-AI “super developer kit,” delivering up to 67 TOPS of AI performance in a small, power-efficient form factor. Ideal for running advanced on-device AI workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raspberry Pi 5 at a glance&lt;/strong&gt;&lt;br&gt;
The newest generation of the iconic Raspberry Pi, featuring a 2.4 GHz quad-core ARM Cortex-A76 CPU, faster I/O, and significantly improved performance for modern hobby, IoT, and edge-AI projects.&lt;/p&gt;

&lt;h4&gt;
  
  
  Join the competition!
&lt;/h4&gt;

&lt;p&gt;Get started with the &lt;a href="https://hub.embedl.com/docs/quickstart/?utm_source=devto" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt;. If you’d like help getting models running, or if you have questions or suggestions, just reach out to the team in our &lt;a href="https://join.slack.com/t/embedl-hub/shared_invite/zt-2y87a2s4x-E7LSbKH50HDgyOUiPq84tA" rel="noopener noreferrer"&gt;Slack channel&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we’ve built this platform
&lt;/h2&gt;

&lt;p&gt;On-device AI offers many advantages: low-latency interactions, offline functionality, and total privacy, to name a few. But running AI on an edge device is far harder than running it on a PC. One of the most common problems is that performance varies widely across devices and chipsets with different hardware limitations, such as memory constraints and battery budgets. Not every device has enough AI processing power to handle heavy workloads with real-time requirements, which makes on-device performance inherently unpredictable; models that work perfectly on a few devices may be slow or broken on others.&lt;/p&gt;

&lt;p&gt;For example, if you’re building a mobile app with on-device AI features, these issues often only show up after release when the app runs on user devices you never tested. This leads to frustrated users, negative reviews, and developers removing the feature entirely – losing the benefits of on-device AI. Read more about why on-device AI is important and the challenges that follow in this &lt;a href="https://hub.embedl.com/blog/from-pytorch-to-shipping-local-ai-on-android/?utm_source=devto" rel="noopener noreferrer"&gt;blogpost&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Because these issues are ones we’ve encountered ourselves, we set out to build a tool that removes this uncertainty. Here’s what you can do with the platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;a href="https://hub.embedl.com/blog/diagnosing-layer-sensitivity/?utm_source=devto" rel="noopener noreferrer"&gt;layer-wise peak signal-to-noise ratio (PSNR)&lt;/a&gt; for better quantization&lt;/li&gt;
&lt;li&gt;View a history of all your submitted benchmarks&lt;/li&gt;
&lt;li&gt;Compare performance metrics (such as latency, memory usage, and PSNR) across different devices&lt;/li&gt;
&lt;li&gt;Identify which device delivers the best results for your model&lt;/li&gt;
&lt;li&gt;Download detailed logs and reports for further analysis
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gh24h84syphjt6uf0gr.png" alt=" " width="800" height="369"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Try it out at &lt;a href="https://hub.embedl.com/?utm_source=devto" rel="noopener noreferrer"&gt;hub.embedl.com&lt;/a&gt; and let us know what you think!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>raspberrypi</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>From PyTorch to Shipping local AI on Android</title>
      <dc:creator>Elina Norling</dc:creator>
      <pubDate>Sat, 13 Dec 2025 20:42:33 +0000</pubDate>
      <link>https://dev.to/embedl-hub/from-pytorch-to-shipping-local-ai-on-android-6g9</link>
      <guid>https://dev.to/embedl-hub/from-pytorch-to-shipping-local-ai-on-android-6g9</guid>
      <description>&lt;p&gt;On-device AI offers many advantages for Android apps; enabling low-latency interactions, offline functionality and total privacy to name a few. But running AI on local devices is far harder than running it in a Jupyter notebook.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll break down why it is so hard and walk through how to optimize and run models on Android devices. We’ll also demonstrate how you can test it on different devices without needing physical access to a wide range of hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why run AI locally and why it’s hard on Android
&lt;/h2&gt;

&lt;p&gt;Many modern Android apps rely on real-time intelligence to deliver a smooth and responsive user experience. Pose detection in fitness apps, AR filters in social apps, on-device audio processing, and live classification are all examples. These tasks benefit from running models locally rather than in the cloud, since the data can be processed directly where it is generated. Local inference gives you speed, privacy, and the ability to work offline – but it also comes with a challenge: getting these models to perform consistently across the enormous range of Android devices.&lt;/p&gt;

&lt;p&gt;This became clear in a conversation I had recently with my friend Noah, an Android developer working on a lightweight pose-detection feature for a fitness app. He trained a MobileNet-based model in PyTorch, converted it to TFLite, and verified it on the three phones he had available: a Pixel 7, a Galaxy S21, and a mid-range Motorola. Everything looked smooth. But after release, he started receiving reviews from users on other devices reporting sluggish performance, unstable frame rates, and in some cases crashes before inference even began.&lt;/p&gt;

&lt;p&gt;Noah’s experience isn’t unusual. In fact, it’s one of the most common issues Android developers run into when working with on-device AI: apps that work perfectly on a few phones but feel slow or broken on others, frustrated users leaving negative reviews, and developers ending up removing the on-device feature – losing many of the benefits of running AI on-device in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Digging deeper into the problem
&lt;/h2&gt;

&lt;p&gt;To understand why situations like Noah’s happen, we need to look more closely at why the same model can show completely different latency, stability, and performance across devices – the variability that makes on-device AI development so challenging.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Performance varies across devices and chipsets
&lt;/h4&gt;

&lt;p&gt;Android hardware is highly diverse. Two devices released the same year can behave completely differently when running the exact same model. One may use an NPU and reach 60 FPS, another may fall back to GPU, and a third may run everything on the CPU and struggle to reach usable performance or even crash.&lt;/p&gt;

&lt;p&gt;Rule of thumb:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPUs run almost anything but rarely meet real-time needs.&lt;/li&gt;
&lt;li&gt;GPUs are faster but depend heavily on runtime support (TFLite GPU delegate, NNAPI, Vulkan).&lt;/li&gt;
&lt;li&gt;NPUs are fastest, but only for models correctly adapted and compiled for that chipset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it’s not just about faster or slower processors. Android devices vary widely in how their accelerators and drivers support different operations and precisions, and runtime delegates often make different decisions about which compute units to use. As a result, two phones can execute the same model through completely different paths – resulting in noticeably different stability and performance on each device.&lt;/p&gt;

&lt;p&gt;This makes broad testing essential. Yet most developers don’t have access to many devices, which is why issues often remain hidden until after launch.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Development complexity and setup effort discourage local AI
&lt;/h4&gt;

&lt;p&gt;Even once a model works on your own device, getting it ready for actual deployment requires navigating a surprisingly complex toolchain. Exporting a model from PyTorch to ONNX and then to TFLite is only the beginning. Many hardware vendors expose their own delegates, runtimes, and SDKs, and each of them behaves slightly differently.&lt;/p&gt;

&lt;p&gt;Developers I’ve spoken to say that even small on-device features, such as a simple classifier or filter, can take a huge amount of effort to get running well. Setting up TFLite GPU delegates, NNAPI, or vendor-specific runtimes on Qualcomm or Google Tensor devices requires time and experimentation. And when something doesn’t work, error messages are often vague, making it difficult to pinpoint whether the issue is an operator the hardware doesn’t support, a precision (like FP32) the accelerator can’t handle, or simply missing hardware acceleration.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Battery, speed, and hardware limitations are obstacles
&lt;/h4&gt;

&lt;p&gt;Finally, even if you get the model running, real-world constraints remain. Phones have limited thermal budgets; running heavy models can overheat the device and throttle performance. Battery drain is a persistent concern – users will quickly uninstall an app that consumes too much power. Smaller or very old phones also have limited RAM and weaker accelerators, meaning some models simply will not run well no matter how they are optimized.&lt;/p&gt;

&lt;p&gt;Several developers we have talked to point out that “not every device has enough AI processing power to handle heavy workloads with real-time requirements,” and this makes on-device performance inherently unpredictable. Bigger models are often too slow, too “hot”, or too power-hungry. Smaller models may lack accuracy. Getting the right balance requires careful optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solving Android devs’ on-device problems
&lt;/h2&gt;

&lt;p&gt;The challenges described above are exactly what we built Embedl Hub to solve. Because these issues are ones we’ve encountered ourselves, we set out to create a tool that helps you identify which models perform well across devices, understand how they behave on different chipsets, optimize them for specific hardware targets, and verify the models on real Android devices in the cloud.&lt;/p&gt;

&lt;p&gt;At a high level, the platform lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compile&lt;/strong&gt; your model for the correct runtime and accelerators on the target device, ensuring it can use the available hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize&lt;/strong&gt; your model to reduce latency, memory usage, and energy consumption, and to enable NPU acceleration on many modern chipsets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark&lt;/strong&gt; your model on real edge hardware in the cloud to measure and compare device-specific latency, memory use, and execution paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Embedl Hub logs your metrics, parameters, and benchmarks, and presents them in a web UI where you can inspect layer-level behavior, compare devices side by side, and reproduce every run. Our goal with this UI is to make it easy to confidently choose the best model–device combination before releasing your app.&lt;/p&gt;

&lt;p&gt;To showcase the platform we’ve built, we’ll demonstrate how it can be used to optimize and profile a model running on a Samsung Galaxy S24 mobile phone.&lt;/p&gt;

&lt;h4&gt;
  
  
  Compile the model
&lt;/h4&gt;

&lt;p&gt;Let’s say you want to run a MobileNetV2 model trained in PyTorch. First, export the model to ONNX and then compile it for the target runtime. In this case, we want to run it using LiteRT (TFLite).&lt;/p&gt;
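
&lt;p&gt;The export step itself is only a few lines of PyTorch. A minimal sketch, assuming a pretrained torchvision MobileNetV2 as a stand-in for your own trained model (the file name and opset are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import torchvision

# Pretrained MobileNetV2 standing in for your own trained model
model = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1").eval()

# Dummy input with the expected shape (batch, channels, height, width)
example_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["logits"],
    opset_version=17,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;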

&lt;p&gt;To &lt;a href="https://hub.embedl.com/docs/quickstart#compile-the-model-from-onnx-to-tflite/?utm_source=devto" rel="noopener noreferrer"&gt;compile&lt;/a&gt; it with the &lt;code&gt;embedl-hub&lt;/code&gt; CLI, you run the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub compile \
    --model /path/to/mobilenet_v2.onnx \
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step gives you an early indication of whether the model is actually compatible with the device’s chipset and execution paths, so you can catch the kinds of issues that usually only appear after launch, before users start leaving reviews.&lt;/p&gt;

&lt;h4&gt;
  
  
  Optimize the model
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://hub.embedl.com/docs/quickstart#optional-quantize-the-model/?utm_source=devto" rel="noopener noreferrer"&gt;Quantization&lt;/a&gt; is an optional but highly recommended step that can drastically reduce inference latency and memory usage. On mobile and embedded hardware, most optimizations come from quantization: By lowering the numerical precision of weights and activations to lower numerical precision (such as INT8), the model becomes faster and more power-efficient. It is especially useful when deploying models to resource-constrained hardware such as mobile phones or embedded boards and is often required for NPU acceleration on modern Android devices.&lt;/p&gt;

&lt;p&gt;While this can reduce the model’s accuracy, you can minimize the loss by calibrating with a small sample dataset, typically just a few hundred examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub quantize \
    --model /path/to/mobilenet_v2.tflite \
    --data /path/to/dataset \
    --num-samples 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feature directly addresses issues developers frequently encounter, such as inference failing due to unsupported ops. It also reduces memory use and battery consumption, and allows the model to run more efficiently on hardware that benefits from quantized execution, including NPUs and CPUs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Benchmark the model on remote hardware
&lt;/h4&gt;

&lt;p&gt;Now that the model is compiled (and quantized), you can &lt;a href="https://hub.embedl.com/docs/quickstart#benchmark-the-model-on-remote-hardware" rel="noopener noreferrer"&gt;run it on real hardware&lt;/a&gt; directly through one of Embedl Hub’s integrated &lt;a href="https://hub.embedl.com/docs/remote-hardware-clouds/?utm_source=devto" rel="noopener noreferrer"&gt;device clouds&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub benchmark \
    --model /path/to/mobilenet_v2.quantized.tflite \
    --device "Samsung Galaxy S24"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where many of the earlier problems finally become visible: the benchmark results reveal how the model behaves on real devices. With these results, you can quickly see which devices run the model well and which don’t, and decide how to adapt or further develop the model before releasing your app.&lt;/p&gt;

&lt;p&gt;In this example, we run the model on Samsung Galaxy S24. There are a large number of devices to choose from on Embedl Hub – Galaxy phones, Pixel phones, Snapdragon development boards – allowing you to test across the very diversity that makes Android deployment difficult. See &lt;a href="https://hub.embedl.com/docs/supported-devices/?utm_source=devto" rel="noopener noreferrer"&gt;supported devices&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Analyze &amp;amp; compare performance in the Web UI
&lt;/h4&gt;

&lt;p&gt;Benchmarking the model gives useful information such as the model’s latency on the hardware platform, which layers are slowest, the number of layers executed on each compute unit type, and more! We can use this information for debugging and for iterating on the model’s design. We can answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How does my model behave across different chipsets?&lt;/li&gt;
&lt;li&gt;Can we optimize the slowest layer?&lt;/li&gt;
&lt;li&gt;Why aren’t certain layers executed on the correct compute unit?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6eex8af3ihetqhjyyimx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6eex8af3ihetqhjyyimx.png" alt=" " width="800" height="369"&gt;&lt;/a&gt;&lt;br&gt;
This interface lets you verify model performance across many devices without repeating setup work, with all your on-device efforts gathered in one place.&lt;/p&gt;

&lt;p&gt;The visualizations in the dashboard make it easy to understand why a model behaves differently across chipsets, helping you systematically improve and optimize its performance for the hardware you target. And by comparing multiple devices through our device cloud, you can confidently test and choose the best model–device combination before releasing your app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Share your feedback
&lt;/h2&gt;

&lt;p&gt;Embedl Hub is still in beta, and we’d love to hear your feedback and which features or devices you’d like to see next, so we can continue solving the problems Android devs face when building on-device AI.&lt;/p&gt;

&lt;p&gt;Try it out at &lt;a href="https://hub.embedl.com/?utm_source=devto" rel="noopener noreferrer"&gt;hub.embedl.com&lt;/a&gt; and let us know what you think!&lt;/p&gt;

</description>
      <category>androiddev</category>
      <category>mobile</category>
      <category>mobiledev</category>
      <category>ai</category>
    </item>
    <item>
      <title>New post! New AI devtool that lets you optimize, run and benchmark models on remote edge devices, no need for your own hardware. Learn more about how the platform can solve your local AI challenges in the post. Link to platform: https://hub.embedl.com/</title>
      <dc:creator>Elina Norling</dc:creator>
      <pubDate>Sun, 09 Nov 2025 18:55:50 +0000</pubDate>
      <link>https://dev.to/elina_norling_embedl/new-post-new-ai-devtool-that-let-you-optimize-run-and-benchmark-models-on-remote-edge-devices-no-3gn6</link>
      <guid>https://dev.to/elina_norling_embedl/new-post-new-ai-devtool-that-let-you-optimize-run-and-benchmark-models-on-remote-edge-devices-no-3gn6</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/elina_norling_embedl" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3557597%2Fa31eb847-7acb-48ef-ae6f-fa1dd53d74f3.png" alt="elina_norling_embedl"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/elina_norling_embedl/devtool-for-running-and-benchmarking-local-ai-445e" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Devtool for running and benchmarking local AI&lt;/h2&gt;
      &lt;h3&gt;Elina Norling ・ Nov 9&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#embedded&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#deeplearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#computervision&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;



&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;a href="https://hub.embedl.com/" rel="noopener noreferrer"&gt;
      hub.embedl.com
    &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>embedded</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Devtool for running and benchmarking local AI</title>
      <dc:creator>Elina Norling</dc:creator>
      <pubDate>Sun, 09 Nov 2025 18:51:02 +0000</pubDate>
      <link>https://dev.to/elina_norling_embedl/devtool-for-running-and-benchmarking-local-ai-445e</link>
      <guid>https://dev.to/elina_norling_embedl/devtool-for-running-and-benchmarking-local-ai-445e</guid>
      <description>&lt;p&gt;We’ve just created &lt;a href="https://hub.embedl.com/?utm_source=devto" rel="noopener noreferrer"&gt;Embedl Hub&lt;/a&gt;, a developer platform where you can experiment with on-device AI and analyze how models perform on real hardware. It allows you to optimize, benchmark, and compare models by running them on devices hosted in the cloud, so you don’t need access to physical hardware yourself.&lt;/p&gt;

&lt;p&gt;You can test performance across phones, dev boards, and SoCs directly from your Python environment or terminal. Everything is free to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of the Platform
&lt;/h2&gt;

&lt;p&gt;Optimize and deploy your model on any edge device with the Embedl Hub Python library:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quantize&lt;/strong&gt; your model for lower latency and memory usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compile&lt;/strong&gt; your model for execution on CPU, GPU, NPU or other AI accelerators on your target devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark&lt;/strong&gt; your model's latency and memory usage on real edge devices in the cloud.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Embedl Hub logs your metrics, parameters, and benchmarks, allowing you to inspect and compare your results on the web and reproduce them later.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We’ve Built
&lt;/h2&gt;

&lt;p&gt;To showcase the platform we’ve built, we’ll demonstrate how it can be used to optimize and profile a model running on a Samsung Galaxy S24 mobile phone.&lt;/p&gt;

&lt;h4&gt;
  
  
  Compile the model
&lt;/h4&gt;

&lt;p&gt;Let’s say you want to run a MobileNetV2 model trained in PyTorch.&lt;br&gt;
First, export the model to ONNX and then compile it for the target runtime. In this case, we want to run it using LiteRT (TFLite). &lt;/p&gt;
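
&lt;p&gt;If your model lives in PyTorch, the export step might look like this – a minimal sketch using a pretrained torchvision MobileNetV2 (the file name and opset are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import torchvision

# Pretrained MobileNetV2 standing in for your own trained model
model = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1").eval()

# Dummy input with the expected shape (batch, channels, height, width)
example_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["logits"],
    opset_version=17,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;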

&lt;p&gt;To compile it with the &lt;code&gt;embedl-hub&lt;/code&gt; CLI, you run the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub compile \
    --model /path/to/mobilenet_v2_quantized.onnx \
    --size 1,3,224,224 \
    --device "Samsung Galaxy S24" \
    --runtime tflite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Quantize the model
&lt;/h4&gt;

&lt;p&gt;Quantization is an optional but highly recommended step that can drastically reduce inference latency and memory usage. It is especially useful when deploying models to resource-constrained hardware such as mobile phones or embedded boards. It works by lowering the numerical precision of weights and activations in the model.&lt;/p&gt;

&lt;p&gt;While this can reduce the model’s accuracy, you can minimize the loss by calibrating with a small sample dataset, typically just a few hundred examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub quantize \
    --model /path/to/mobilenet_v2.onnx \
    --data /path/to/dataset \
    --num-samples 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Benchmark the model on remote hardware
&lt;/h4&gt;

&lt;p&gt;Now that the model is compiled (and quantized), you can run it on real hardware directly through one of Embedl Hub's integrated device clouds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub benchmark \
    --model /path/to/mobilenet_v2_quantized.tflite \
    --device "Samsung Galaxy S24"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we run the model on Samsung Galaxy S24. There are a large number of devices to choose from on Embedl Hub, see supported devices &lt;a href="https://hub.embedl.com/docs/supported-devices/?utm_source=devto" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;h4&gt;
  
  
  Analyze &amp;amp; compare performance in the Web UI
&lt;/h4&gt;

&lt;p&gt;Benchmarking the model gives useful information such as the model’s latency on the hardware platform, which layers are slowest, the number of layers executed on each compute unit type, and more! We can use this information for advanced debugging and for iterating on the model’s design. We can answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can we optimize the slowest layer?&lt;/li&gt;
&lt;li&gt;Why aren’t certain layers executed on the correct compute unit?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All data and artifacts are stored securely in your Embedl Hub account. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrqerni1tvbmica25bmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrqerni1tvbmica25bmm.png" alt="Embedl Hub web UI" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Share your feedback
&lt;/h4&gt;

&lt;p&gt;Embedl Hub is still in beta, and we’d love to hear your feedback and what features or devices you’d like to see next.&lt;/p&gt;

&lt;p&gt;Try it out at &lt;a href="https://hub.embedl.com/?utm_source=devto" rel="noopener noreferrer"&gt;hub.embedl.com&lt;/a&gt; and let us know what you think!&lt;/p&gt;

</description>
      <category>embedded</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Question: How do you ensure consistent AI model performance across Android devices?</title>
      <dc:creator>Elina Norling</dc:creator>
      <pubDate>Fri, 31 Oct 2025 13:35:02 +0000</pubDate>
      <link>https://dev.to/elina_norling_embedl/how-do-you-ensure-consistent-ai-model-performance-across-android-devices-59p9</link>
      <guid>https://dev.to/elina_norling_embedl/how-do-you-ensure-consistent-ai-model-performance-across-android-devices-59p9</guid>
      <description>&lt;p&gt;For those of you building apps that include AI models that run on-device (e.g. vision models), how do you handle the issue of models performing differently across different CPUs, GPUs, and NPUs? I’ve heard several cases where a model works perfectly on some devices but fails to meet real-time requirements or doesn’t work at all on others.&lt;/p&gt;

&lt;p&gt;Do you usually deploy the same model across all devices? If so, how do you make it perform well on different accelerators and devices? Or do you switch models between devices to get better performance for each one? How do you decide which model works best for each type of device?&lt;/p&gt;

</description>
      <category>mobiledev</category>
      <category>android</category>
      <category>mobile</category>
      <category>ai</category>
    </item>
    <item>
      <title>New blogpost about layer-wise PSNR: 
https://dev.to/embedl-hub/diagnosing-layer-sensitivity-during-post-training-quantization-115g</title>
      <dc:creator>Elina Norling</dc:creator>
      <pubDate>Thu, 30 Oct 2025 17:34:54 +0000</pubDate>
      <link>https://dev.to/elina_norling_embedl/new-blogpost-about-layer-wise-psnr-3mco</link>
      <guid>https://dev.to/elina_norling_embedl/new-blogpost-about-layer-wise-psnr-3mco</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/embedl-hub" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F11742%2F3bbb11fe-76e9-4e9b-a2b3-16a0be7e7cb9.png" alt="Embedl Hub" width="500" height="500"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3557597%2Fa31eb847-7acb-48ef-ae6f-fa1dd53d74f3.png" alt="" width="800" height="800"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/embedl-hub/diagnosing-layer-sensitivity-during-post-training-quantization-115g" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Diagnosing layer sensitivity during post training quantization&lt;/h2&gt;
      &lt;h3&gt;Elina Norling for Embedl Hub ・ Oct 30&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#deeplearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#embedded&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Diagnosing layer sensitivity during post training quantization</title>
      <dc:creator>Elina Norling</dc:creator>
      <pubDate>Thu, 30 Oct 2025 10:51:03 +0000</pubDate>
      <link>https://dev.to/embedl-hub/diagnosing-layer-sensitivity-during-post-training-quantization-115g</link>
      <guid>https://dev.to/embedl-hub/diagnosing-layer-sensitivity-during-post-training-quantization-115g</guid>
      <description>&lt;p&gt;Quantization is an essential optimization technique for adapting a model to edge devices, realizing the hardware’s full potential. In practice, quantization refers to converting high-precision numerical types to lower-precision formats for both weights and activations. Most commonly, quantization converts 32-bit floating point (float32) to int8, often via post-training quantization (PTQ), where the model is quantized after training without retraining. The result is a smaller, faster model on-device: since float32 uses four bytes per value and int8 uses one, memory traffic is cut by up to 4×, and specialized int8 vector/NPU instructions with lower compute latency become available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoxczrsubqr5hvww9ews.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoxczrsubqr5hvww9ews.png" alt=" " width="649" height="233"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Quantization: Converting from float precision to int data format to increase computational efficiency.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, quantization decreases the model’s expressivity and can introduce errors in the computations performed during inference. This performance degradation is expected and almost inevitable, but its impact varies by model architecture and task, ranging from unnoticeable to completely breaking the model. Therefore, this model degradation must be identified and measured as early as possible in the development and deployment pipeline.&lt;/p&gt;

&lt;p&gt;In this blog post, we explain how to measure the degradation in practice and introduce a diagnostic toolset based on layer-wise peak signal-to-noise ratio (PSNR).&lt;/p&gt;
&lt;h2&gt;
  
  
  How to measure quality degradation for quantized models
&lt;/h2&gt;

&lt;p&gt;Several methods exist in practice and in the academic literature to estimate accuracy degradation after quantization. Even without access to the complete, annotated dataset and a task-specific evaluation pipeline, you can still get a strong indication of the degradation level by comparing intermediate tensors and outputs from the float (float32) and quantized (int8) graphs. In other words, you can draw conclusions by investigating the model’s data stream before and after quantization.&lt;/p&gt;

&lt;p&gt;One metric that quantifies the difference between the original data stream and the quantized stream is “peak signal-to-noise ratio”, or PSNR for short. It captures how much a quantized tensor deviates from its original float counterpart, based on the mean squared error (MSE) between them. At Embedl Hub we calculate PSNR using the following formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnfu78zexohlchd2xeas.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnfu78zexohlchd2xeas.png" alt=" " width="351" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where we define the peak signal and MSE according to&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzbk0l21knq8lw8xczap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzbk0l21knq8lw8xczap.png" alt=" " width="742" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The MSE is calculated from the difference between the float (unquantized) and the quantized output tensors, whereas the peak signal is calculated from the float tensor T.&lt;/p&gt;
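
&lt;p&gt;In code, the computation is straightforward. A minimal numpy sketch, assuming the peak signal is taken as the maximum absolute value of the float tensor T (names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def psnr(float_tensor, quantized_tensor):
    # MSE between the float (unquantized) and quantized output tensors
    mse = np.mean((float_tensor - quantized_tensor) ** 2)
    # Peak signal taken from the float tensor T
    peak = np.max(np.abs(float_tensor))
    # PSNR in decibels; higher means closer alignment
    return 10.0 * np.log10(peak ** 2 / mse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
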
&lt;h3&gt;
  
  
  Model output
&lt;/h3&gt;

&lt;p&gt;Measuring PSNR on the model outputs, such as logits, is common and lightweight. A higher PSNR indicates closer alignment between the quantized and unquantized outputs, and values above 30 dB are generally considered sufficient for many use cases, such as image classification. However, tasks like regression, audio, or safety-critical systems may require higher PSNR (≥40 dB) to preserve numerical accuracy.&lt;/p&gt;

&lt;p&gt;However, output-level PSNR only captures the performance degradation of the model in its entirety after quantization. It doesn’t tell you where in the network the degradation occurs or why performance dropped more than expected after deployment. Instead, you need to go layer by layer to reach that level of detail. Understanding where degradation occurs makes it possible to debug quantization artifacts, tune calibration, or decide which operations to keep in float.&lt;/p&gt;
&lt;h3&gt;
  
  
  Diagnosing layers with Embedl Hub
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://hub.embedl.com/" rel="noopener noreferrer"&gt;Embedl hub&lt;/a&gt; computes PSNR for each layer output in the entire neural network. This lets you visualize how quantization error accumulates through the model and spot sudden drops that point to sensitive layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feauhzbcntfwt3spfs5d6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feauhzbcntfwt3spfs5d6.png" alt=" " width="622" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should expect a steady degradation as you go deeper into the network. This is because the quantization errors introduced in the first layers accumulate and propagate through the downstream layers. For this reason, the error can become significantly worse in particularly deep networks (e.g., Transformers, RNNs), which may require special measures such as mixed-precision.&lt;/p&gt;

&lt;p&gt;However, if a sudden drop in PSNR occurs at a specific point in the network, it strongly indicates that a particular layer or operation is sensitive to quantization. This typically happens when the quantization error exceeds the operation’s numerical tolerance, causing a local breakdown in representation. Such layers require extra care to be quantized appropriately, for example, finer calibration, per-channel quantization, or being excluded from quantization altogether.&lt;/p&gt;

&lt;p&gt;Here is an example from quantizing EfficientNet-B7, a rather deep CNN, where the PSNR degrades substantially inside the squeeze-and-excite blocks of the network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfadmpg5t4tfx4d3kg4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfadmpg5t4tfx4d3kg4r.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An example of a sensitive layer is the softmax layer, which relies on precise value distributions due to its exponential and normalization behavior. Even minor rounding errors in softmax are amplified, leading to disproportionate shifts in the output and degraded model performance. It is recommended to keep softmax layers at a higher precision (float16/float32) during quantization if possible.&lt;/p&gt;
&lt;h3&gt;
  
  
  Get started today
&lt;/h3&gt;

&lt;p&gt;Both output and layer-wise PSNR are automatically calculated when you run a quantization job with the Embedl Hub Python library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub quantize \
    --model my_model.onnx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The PSNR values are computed by comparing the outputs of the original (unquantized) ONNX model to those of the quantized model, layer by layer as well as for the final output. This provides a clear, quantitative measure of how much information is lost at each stage of the network due to quantization.&lt;/p&gt;

&lt;p&gt;Follow our &lt;a href="https://hub.embedl.com/docs/?utm_source=devto" rel="noopener noreferrer"&gt;getting started guides&lt;/a&gt; and start quantizing today!&lt;/p&gt;

&lt;h3&gt;
  
  
  Share your feedback
&lt;/h3&gt;

&lt;p&gt;Embedl Hub is still in beta, and we’d love to hear your feedback and what features or devices you’d like to see next.&lt;/p&gt;

&lt;p&gt;Try it out at &lt;a href="https://hub.embedl.com/?utm_source=devto" rel="noopener noreferrer"&gt;hub.embedl.com&lt;/a&gt; and let us know what you think!&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>embedded</category>
    </item>
    <item>
      <title>We’ve just launched Embedl Hub, a developer platform for experimenting with on-device AI. Run, benchmark, and compare models on real cloud-hosted hardware!</title>
      <dc:creator>Elina Norling</dc:creator>
      <pubDate>Wed, 29 Oct 2025 08:24:17 +0000</pubDate>
      <link>https://dev.to/elina_norling_embedl/weve-just-launched-embedl-hub-a-developer-platform-for-experimenting-with-on-device-ai-run-5963</link>
      <guid>https://dev.to/elina_norling_embedl/weve-just-launched-embedl-hub-a-developer-platform-for-experimenting-with-on-device-ai-run-5963</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/embedl-hub" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F11742%2F3bbb11fe-76e9-4e9b-a2b3-16a0be7e7cb9.png" alt="Embedl Hub" width="500" height="500"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3557597%2Fa31eb847-7acb-48ef-ae6f-fa1dd53d74f3.png" alt="" width="800" height="800"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/embedl-hub/introducing-devtool-for-running-and-benchmarking-ai-on-device-4hi5" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Introducing devtool for running and benchmarking AI on-device&lt;/h2&gt;
      &lt;h3&gt;Elina Norling for Embedl Hub ・ Oct 23&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#android&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#deeplearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#computervision&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>android</category>
      <category>deeplearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>9 reasons why Edge AI development is so hard</title>
      <dc:creator>Elina Norling</dc:creator>
      <pubDate>Mon, 27 Oct 2025 10:14:50 +0000</pubDate>
      <link>https://dev.to/embedl-hub/9-reasons-why-edge-ai-development-is-so-hard-48ma</link>
      <guid>https://dev.to/embedl-hub/9-reasons-why-edge-ai-development-is-so-hard-48ma</guid>
      <description>&lt;p&gt;Even though we’re a company specialized in Edge AI and most of our team spends their days building and deploying models to all kinds of devices – there’s no getting around the fact that all edge developers still run into recurring challenges on an almost daily basis. That’s why we decided to do a small in-house investigation to figure out what we can agree are the biggest challenges in or related to Edge AI development today. In this blog post, we present the results.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Real-time demands with limited compute
&lt;/h2&gt;

&lt;p&gt;One of the biggest challenges with the edge is that systems often need to meet real-time latency requirements. For example, it is not certain that an autonomous vehicle has time to send data back and forth to the cloud – decisions must be made immediately. At the same time, the computers on edge devices are tiny compared to data centers, which means the models must be extremely efficient.&lt;/p&gt;

&lt;p&gt;Running an inference can require billions of operations, which both takes time and consumes energy. This becomes a major challenge in embedded systems. Devices often lack compute resources, operate under strict real-time requirements, and frequently run on battery. The combination makes the demands on optimization tougher than in almost any other environment. Even if you can clear the hurdle of getting a model to run, you might still run into resource bottlenecks that make the model so slow that it’s useless.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. No universal target device
&lt;/h2&gt;

&lt;p&gt;Another problem is that hardware architectures vary greatly between devices, and there’s no one-size-fits-all model and device pairing in this area. More specialized hardware may outclass a more general device for one type of model, yet utterly fail to compile and run another. ‘Faster’ is therefore not always faster. Furthermore, it is rarely a trivial task to determine why a model fails to run on a given device. This is where edge development differs significantly from cloud development. In the cloud, you can install virtually anything you want on your server. On a chip, the constraints are not only much stricter – it is often challenging to even get a model to run at all, because the hardware is so specific.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Device capabilities define which techniques work
&lt;/h2&gt;

&lt;p&gt;Since hardware architecture varies from device to device, the techniques to improve model performance are also device-dependent. For example, some devices support unstructured sparsity, leading to significant speedups without any loss in accuracy, while others gain no performance benefit at all from sparse calculations and become less accurate. Different quantization capabilities across devices also compound the problem of optimizing a model for a given hardware target.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Investing blindly in hardware
&lt;/h2&gt;

&lt;p&gt;Another frustrating consequence of the fact that it isn’t obvious which model-hardware combo will perform best – or whether it will run at all – is that you sometimes have to invest in hardware before knowing whether your model will actually run well on it. In theory, it’s possible to guess based on spec sheets or vendor claims – but in practice, the only reliable way to find out is to test. And testing is hard when you don’t have access to the device. This means many developers end up investing time – and sometimes money – on hardware that turns out to be a poor fit. Whatever the reason the model doesn’t run, you often don’t find out until late in the process. By then, you’ve already invested.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Full-stack knowledge is a prerequisite
&lt;/h2&gt;

&lt;p&gt;To build an AI application with excellent performance on an edge device, you not only have to consider the model’s architecture, but, as a consequence of the challenges mentioned above, also the resources available on the hardware, its specific architecture, and which compression and deployment techniques are allowed. As a result, this requires more low-level knowledge compared to running in the cloud. An edge developer needs to understand the entire stack: both software and hardware, and the various layers in between that translate a trained model into the actual code that runs on the target device.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. A multidisciplinary process by default
&lt;/h2&gt;

&lt;p&gt;This requirement for a holistic understanding also makes edge deployment inherently a multidisciplinary process. Successfully deploying models to edge hardware end-to-end may require contributions from software engineers, hardware specialists, researchers, data engineers, and network experts. When teams with such diverse expertise need to collaborate, the process becomes not only technically challenging but also organizationally complex. A common issue that arises when this alignment is missing is that developers do not always take the hardware into account when selecting a model. As a result, a model that looked promising during training may prove unsuitable for the target hardware, and once this becomes apparent, it is often difficult for an edge developer to remedy the situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Hardware support lags behind evolving AI
&lt;/h2&gt;

&lt;p&gt;The challenges in edge deployment are further compounded by the fact that we’re dealing with a very large and rapidly evolving problem space. AI development is advancing at incredible speed – new models are constantly being released, while hardware support is always playing catch-up. Meanwhile, new hardware platforms are being introduced, but there is little consistency across these platforms in terms of support, compatibility, or efficiency. Porting a model from one hardware platform that runs it successfully to another often feels like starting from zero all over again.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Poor documentation
&lt;/h2&gt;

&lt;p&gt;Additionally, we want to highlight that the field is evolving rapidly while still being in an early stage. Documentation in edge development is therefore often poor. For example, it is not always clear which version the documentation refers to, or which flags can actually be combined. Error messages are often vague, making it difficult to understand exactly what went wrong when a model fails to compile.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Fragmented tooling
&lt;/h2&gt;

&lt;p&gt;Another consequence of the fact that edge AI is still not fully established compared to cloud AI is that tooling has long been fragmented, with no clear standards (as we have already touched upon). One major source of fragmentation is that there are multiple hardware vendors, and they all have their own inference tools and ecosystems. Furthermore, going from PyTorch to a working model on a phone or a development board often involves four to five different tools: PyTorch → ONNX → quantization tool → vendor-specific SDK → deployment code. These rarely work seamlessly together. You are expected to stitch everything together yourself, troubleshoot broken conversions, and still deliver something with low latency and high accuracy. It is not only time-consuming, it is also fragile. Every update of a tool risks breaking something further downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying to make edge AI a little less painful
&lt;/h2&gt;

&lt;p&gt;Since we on the Embedl Hub team recognize these challenges ourselves, we’ve tried to build a platform that we believe can solve quite a few of them. We believe that – even if we might be biased – our current &lt;a href="https://hub.embedl.com/?utm_source=devto" rel="noopener noreferrer"&gt;BETA&lt;/a&gt; offers a solid solution for at least the issues related to fragmented tooling, poor documentation, and finding the most performant model-hardware combination. We offer a consistent &lt;a href="https://hub.embedl.com/docs/?utm_source=devto" rel="noopener noreferrer"&gt;workflow&lt;/a&gt; that combines model optimization, quantization, compilation, and benchmarking in one unified Python library connected to a web UI where everything is saved and can be compared in one place. We’ve also built a &lt;a href="https://hub.embedl.com/docs/remote-hardware-clouds" rel="noopener noreferrer"&gt;remote device cloud&lt;/a&gt; where you can test models and techniques on real hardware – directly, and without needing access to the devices yourself – making it easier to quickly find the right hardware-model combination without taking chances on which hardware to invest in. Our &lt;a href="https://hub.embedl.com/docs/benchmarks/?utm_source=devto" rel="noopener noreferrer"&gt;benchmark suite&lt;/a&gt; also includes ready-made benchmarks to browse and explore.&lt;/p&gt;

&lt;p&gt;Even if it’s hard to cover everything in a single tool, our hope is that the &lt;a href="https://hub.embedl.com/?utm_source=devto" rel="noopener noreferrer"&gt;Hub&lt;/a&gt; will make edge AI development smoother – and, even as the field evolves fast, help create better conditions for keeping up.&lt;/p&gt;

&lt;p&gt;Is there something missing from our platform that would make your edge development easier? We are super curious! Please get in touch with us here or write to us on &lt;a href="https://embedl-hub.slack.com/join/shared_invite/zt-2y87a2s4x-E7LSbKH50HDgyOUiPq84tA#/shared-invite/email" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>embedded</category>
      <category>machinelearning</category>
      <category>iot</category>
    </item>
    <item>
      <title>Introducing devtool for running and benchmarking AI on-device</title>
      <dc:creator>Elina Norling</dc:creator>
      <pubDate>Thu, 23 Oct 2025 14:38:54 +0000</pubDate>
      <link>https://dev.to/embedl-hub/introducing-devtool-for-running-and-benchmarking-ai-on-device-4hi5</link>
      <guid>https://dev.to/embedl-hub/introducing-devtool-for-running-and-benchmarking-ai-on-device-4hi5</guid>
      <description>&lt;p&gt;We’ve just created &lt;a href="https://hub.embedl.com/?utm_source=devto" rel="noopener noreferrer"&gt;Embedl Hub&lt;/a&gt;, a developer platform where you can experiment with on-device AI and analyze how models perform on real hardware. It allows you to optimize, benchmark, and compare models by running them on devices hosted in the cloud, so you don’t need access to physical hardware yourself.&lt;/p&gt;

&lt;p&gt;You can test performance across phones, dev boards, and SoCs directly from your Python environment or terminal. Everything is free to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of the Platform
&lt;/h2&gt;

&lt;p&gt;Optimize and deploy your model on any edge device with the Embedl Hub Python library:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quantize&lt;/strong&gt; your model for lower latency and memory usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compile&lt;/strong&gt; your model for execution on CPU, GPU, NPU or other AI accelerators on your target devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark&lt;/strong&gt; your model's latency and memory usage on real edge devices in the cloud.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Embedl Hub logs your metrics, parameters, and benchmarks, allowing you to inspect and compare your results on the web and reproduce them later.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We’ve Built
&lt;/h2&gt;

&lt;p&gt;To showcase the platform we’ve built, we’ll demonstrate how it can be used to optimize and profile a model running on a Samsung Galaxy S24 mobile phone.&lt;/p&gt;

&lt;h4&gt;
  
  
  Compile the model
&lt;/h4&gt;

&lt;p&gt;Let’s say you want to run a MobileNetV2 model trained in PyTorch.&lt;br&gt;
First, export the model to ONNX and then compile it for the target runtime. In this case, we want to run it using LiteRT (TFLite). &lt;/p&gt;
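
&lt;p&gt;The export step might look like the sketch below, assuming torchvision’s pretrained MobileNetV2 – the weights and opset version are illustrative choices, not Embedl Hub requirements. The dummy input matches the &lt;code&gt;--size&lt;/code&gt; flag passed to the compile command.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Export a PyTorch MobileNetV2 to ONNX (sketch with illustrative choices).
import torch
import torchvision

model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()
dummy = torch.randn(1, 3, 224, 224)  # matches --size 1,3,224,224 below

torch.onnx.export(
    model, dummy, "mobilenet_v2.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;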

&lt;p&gt;To compile it with the &lt;code&gt;embedl-hub&lt;/code&gt; CLI, you run the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub compile \
    --model /path/to/mobilenet_v2_quantized.onnx \
    --size 1,3,224,224 \
    --device "Samsung Galaxy S24" \
    --runtime tflite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Quantize the model
&lt;/h4&gt;

&lt;p&gt;Quantization is an optional but highly recommended step that can drastically reduce inference latency and memory usage. It is especially useful when deploying models to resource-constrained hardware such as mobile phones or embedded boards. It works by lowering the numerical precision of weights and activations in the model.&lt;/p&gt;

&lt;p&gt;While this can reduce the model’s accuracy, you can minimize the loss by calibrating with a small sample dataset, typically just a few hundred examples. Note that quantization operates on the ONNX model before compilation, which is why the compile command above takes &lt;code&gt;mobilenet_v2_quantized.onnx&lt;/code&gt; as input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub quantize \
    --model /path/to/mobilenet_v2.onnx \
    --data /path/to/dataset \
    --num-samples 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
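
&lt;p&gt;For intuition, the sketch below shows the generic affine int8 arithmetic that “lowering the numerical precision” refers to. This is standard quantization math rather than Embedl Hub internals – the calibration data is what supplies a representative value range for picking the scale and zero-point.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generic int8 affine quantization (illustration only).
import numpy as np

def quantize_int8(x):
    # Calibration supplies representative min/max values like these.
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0  # spread the range over 256 int8 levels
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale, zp = quantize_int8(w)
# The round trip is lossy; good calibration keeps the error representative.
print("max abs error:", float(np.abs(w - dequantize(q, scale, zp)).max()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;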



&lt;h4&gt;
  
  
  Benchmark the model on remote hardware
&lt;/h4&gt;

&lt;p&gt;Now that the model is compiled (and quantized), you can run it on real hardware directly through one of Embedl Hub's integrated device clouds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedl-hub benchmark \
    --model /path/to/mobilenet_v2_quantized.tflite \
    --device "Samsung Galaxy S24"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we run the model on a Samsung Galaxy S24. There is a large number of devices to choose from on Embedl Hub; see the supported devices &lt;a href="https://hub.embedl.com/docs/supported-devices/?utm_source=devto" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Analyze &amp;amp; compare performance in the Web UI
&lt;/h4&gt;

&lt;p&gt;Benchmarking the model gives useful information such as the model’s latency on the hardware platform, which layers are slowest, the number of layers executed on each compute unit type, and more! We can use this information for advanced debugging and for iterating on the model’s design. We can answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can we optimize the slowest layer?&lt;/li&gt;
&lt;li&gt;Why aren’t certain layers executed on the intended compute unit?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All data and artifacts are stored securely in your Embedl Hub account. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrqerni1tvbmica25bmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrqerni1tvbmica25bmm.png" alt="Embedl Hub web UI" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Share your feedback
&lt;/h4&gt;

&lt;p&gt;Embedl Hub is still in beta, and we’d love to hear your feedback and what features or devices you’d like to see next.&lt;/p&gt;

&lt;p&gt;Try it out at &lt;a href="https://hub.embedl.com/?utm_source=devto" rel="noopener noreferrer"&gt;hub.embedl.com&lt;/a&gt; and let us know what you think!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>android</category>
      <category>deeplearning</category>
      <category>computervision</category>
    </item>
  </channel>
</rss>
