How do I take a neural network in Python and run it on a microcontroller?

To take a neural network you trained in Python and run it on a microcontroller, the winning pattern is:

Train in Python → shrink (INT8 quantize) → convert/export → compile into firmware → run inference with a tiny runtime (TFLite Micro / vendor runtime) → validate speed + RAM/Flash.

Below are the most common routes and a concrete end-to-end example.

Route choices (pick one)
A) TensorFlow → TensorFlow Lite (TFLite) → TFLite Micro

Most common “generic MCU” workflow:

  • Convert with the TFLite Converter.
  • Quantize (usually INT8) using post-training quantization or QAT.
  • Run on-device with TFLite Micro (a small C++ runtime).

B) STM32 specifically → STM32Cube.AI / X-CUBE-AI

If your target is STM32, this is often the fastest “it just works” path:

Import a model from popular frameworks, optimize it, and generate STM32 project code via CubeMX/Cube.AI.

C) Hand-optimized kernels on Cortex-M → CMSIS-NN

For maximum speed/small footprint (but more manual work):

Use CMSIS-NN’s optimized NN kernels for Cortex-M.

D) Compile with TVM → microTVM

If you want compiler-driven optimization and AOT builds:

TVM’s microTVM can compile TFLite models to embedded targets.

The “standard” pipeline (works on most MCUs)
1) Design for MCU constraints (before training)

  • Prefer small architectures: DS-CNN / tiny CNN / small MLP (a sizing sketch follows this list)
  • Avoid heavy ops not supported on micro runtimes
  • Know your budget:
    • Flash holds model + code
    • RAM holds tensors (“arena”), stacks, buffers
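
For a sense of scale, here is a minimal sketch of an "MCU-sized" model: a tiny MLP whose input width (128) and class count (4) are placeholders for your own task. Its parameter count gives a rough Flash budget once the weights are quantized to INT8 (~1 byte per parameter, plus graph overhead).

import tensorflow as tf

# Tiny MLP sketch -- shapes are placeholders, adapt to your task.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(128,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Rough Flash budget: INT8 weights are ~1 byte per parameter.
print("params:", model.count_params())
print("approx INT8 weight size (KB):", model.count_params() / 1024)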

2) Quantize to INT8 (huge enabler)

INT8 typically cuts model size ~4× and speeds up inference on MCUs (especially Cortex-M with optimized kernels). Both post-training quantization and quantization-aware training (QAT) are documented in the TensorFlow Lite docs.
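
As a quick check of that claim, the sketch below reuses the tiny model from the sizing sketch above and converts it twice, with and without quantization (dynamic-range here, no representative dataset needed), then compares the resulting .tflite sizes.

# Compare .tflite sizes with and without quantization (reuses `model` from above).
float_converter = tf.lite.TFLiteConverter.from_keras_model(model)
float_bytes = float_converter.convert()

quant_converter = tf.lite.TFLiteConverter.from_keras_model(model)
quant_converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic-range quantization
quant_bytes = quant_converter.convert()

print("float32 model:", len(float_bytes), "bytes")
print("quantized model:", len(quant_bytes), "bytes")  # roughly 4x smaller for weight-dominated models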

3) Convert to .tflite

Use the TFLite Converter (recommended starting point in official docs).

4) Turn .tflite into a C array and compile into firmware

Common approach: embed the model as const unsigned char model[] = {...};
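
The usual tool for this is xxd (xxd -i model_int8.tflite > model_data.cc, as in the TFLite Micro docs). If xxd is not available, a few lines of Python do the same job; a minimal sketch (the file and symbol names model_data.h / g_model are just examples, and in a real project you would normally put the array in a .cc and expose it via an extern declaration):

# Emit the .tflite flatbuffer as a C array (alternative to xxd -i).
with open("model_int8.tflite", "rb") as f:
    data = f.read()

with open("model_data.h", "w") as f:
    f.write("alignas(16) const unsigned char g_model[] = {\n")  # alignment helps on some targets
    for i in range(0, len(data), 12):
        chunk = ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
        f.write(f"  {chunk},\n")
    f.write("};\n")
    f.write(f"const unsigned int g_model_len = {len(data)};\n")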

5) Run inference on MCU (TFLite Micro idea)

  • Create an interpreter
  • Provide a tensor arena (static RAM buffer)
  • Feed input tensor → Invoke() → read output tensor
  • Measure latency and tune arena size

Minimal end-to-end example (TensorFlow → TFLite INT8 → MCU)
Python: convert + INT8 quantize (skeleton)

import tensorflow as tf

# 1) Load your SavedModel (for an in-memory Keras model,
#    use tf.lite.TFLiteConverter.from_keras_model instead)
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# 2) INT8 post-training quantization (representative dataset required for full int8)
def rep_data_gen():
    for _ in range(200):
        # yield a batch shaped like your model input
        yield [tf.random.uniform([1, 128], minval=-1, maxval=1, dtype=tf.float32)]

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)


(The converter and post-training quantization workflows are covered in Google's TensorFlow Lite documentation.)
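
Before touching hardware, it is worth a host-side sanity check: run the quantized model with the regular desktop TFLite interpreter and compare its outputs against your float/Keras model. A minimal sketch (random input; the file name matches the example above):

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize a float test vector into the int8 input domain using the stored scale/zero-point.
x = np.random.uniform(-1, 1, size=inp["shape"]).astype(np.float32)
scale, zero_point = inp["quantization"]
x_int8 = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)

interpreter.set_tensor(inp["index"], x_int8)
interpreter.invoke()
y_int8 = interpreter.get_tensor(out["index"])

# Dequantize the output to compare against the float model's prediction.
out_scale, out_zero_point = out["quantization"]
print((y_int8.astype(np.float32) - out_zero_point) * out_scale)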

C/C++ (MCU): TFLite Micro inference flow (concept)

// Conceptual flow -- exact headers and constructor signatures vary with your TFLM version/integration.
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "model_data.h"   // const unsigned char g_model[]

static uint8_t tensor_arena[30 * 1024];  // tune this per model

int main(void) {
  // init clocks, UART, etc.

  const tflite::Model* model = tflite::GetModel(g_model);

  // Register only the ops your model actually uses (capacity 4 here is just an example).
  static tflite::MicroMutableOpResolver<4> resolver;
  resolver.AddFullyConnected();
  resolver.AddSoftmax();

  static tflite::MicroInterpreter interp(model, resolver,
                                         tensor_arena, sizeof(tensor_arena));
  if (interp.AllocateTensors() != kTfLiteOk) {
    // Arena too small or an op is missing from the resolver.
  }

  int8_t* in = interp.input(0)->data.int8;
  // Fill input (already in the int8 quantized domain)
  // in[i] = ...

  interp.Invoke();

  int8_t* out = interp.output(0)->data.int8;
  // Use out... (interp.arena_used_bytes() reports how much arena was actually used)
}

Practical tips that save days

  • Start with a known tiny example (keyword spotting, gesture, simple classifier) to validate your toolchain, then swap in your model.
  • Arena sizing: if AllocateTensors() fails, increase tensor_arena; if it’s huge, simplify the model.
  • Operator support: if conversion succeeds but inference fails, it’s often an unsupported op (or one that needs a different quantization strategy); the analyzer sketch after this list helps spot these early.
  • On STM32, seriously consider STM32Cube.AI for a smoother workflow + codegen.
  • On Cortex-M, enabling CMSIS-NN kernels can give big speedups.
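
For the operator-support tip, recent TensorFlow releases ship an experimental model analyzer that lists the ops inside a .tflite file; a minimal sketch (the API is experimental and may change; the file name matches the example above):

import tensorflow as tf

# Prints the ops and tensors in the converted model -- useful for spotting
# anything your micro runtime does not support.
tf.lite.experimental.Analyzer.analyze(model_path="model_int8.tflite")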
