By @hejhdiss
Sample repo: https://github.com/hejhdiss/embml
The embedded world has always been about doing more with less. Less RAM, less flash, less clock speed — and yet the demand for intelligence at the edge is growing faster than ever. We squeeze RTOS kernels into 64KB, hand-tune ISRs for microsecond response times, and we've gotten very good at writing C that doesn't waste a single cycle. So why are embedded developers still expected to port Python-first ML frameworks — designed for server racks — just to run a simple regression on a microcontroller?
They shouldn't be. And that's exactly the argument for a dedicated ML library built for embedded systems, from scratch, on our terms.
The Problem with "TinyML" as It Stands
Tools like TensorFlow Lite for Microcontrollers and Edge Impulse have done useful work. But they're fundamentally top-down: design in Python, train on a server, quantize, convert, deploy a frozen model blob to the device. The microcontroller is just a runtime. It has no agency. It cannot learn.
That's acceptable for a narrow class of applications, but it closes the door on anything that needs on-device adaptation — predictive maintenance that improves over time, sensor fusion that adjusts to component drift, control loops that tune themselves in the field. For those, we need an embedded-native ML library: designed around hardware constraints, not retrofitted onto them.
Scope It Right: Mid-Range MCUs Are the Target
Let's be precise about the target hardware. This isn't about squeezing transformers into an ATtiny85. The realistic and immediately useful scope is mid-range microcontrollers — devices like the ESP32, STM32F4/F7 series, RP2040, and similar parts that offer 128KB–512KB SRAM, hardware floating-point, and clock speeds in the 80–240 MHz range.
It's not impossible to go lower — but on sub-32KB SRAM devices, memory pressure becomes the real bottleneck, not compute. You can optimize arithmetic all day, but if your covariance matrix doesn't fit in SRAM, the algorithm simply doesn't run. Mid-range parts sidestep that wall cleanly. They have enough headroom for meaningful models while still being the kind of hardware that ends up in real products: industrial sensors, motor controllers, wearables, edge gateways.
Arduino Uno-class hardware (2KB SRAM) is a different conversation entirely — not excluded, but scoped separately, with stripped-down variants that make explicit trade-offs.
What Can Actually Be Built
Most classical ML algorithms are not inherently heavy. Their Python implementations are heavy because Python is heavy. Strip that away and what you have is math — and math runs fine on an ESP32.
Linear and Logistic Regression are a weight vector and a dot product. With online SGD and a fixed learning rate, you can train a linear model in real time with negligible memory overhead. Logistic regression adds a sigmoid activation — a lookup table handles it efficiently in fixed-point.
Small Feedforward Neural Networks with compact topologies — 4 inputs, 8 hidden neurons, 1 output — fit entirely in SRAM on mid-range hardware. Inference is matrix multiplication and activation. Backpropagation is heavier, but gradient clipping and fixed-point arithmetic make it workable on hardware with an FPU.
Recurrent Neural Networks are viable in minimal form. A single GRU cell for time-series prediction — temperature trends, vibration signatures, current draw anomalies — requires only a few weight matrices and a hidden state vector. The operations are repetitive and friendly to loop unrolling.
Neural manifold and ODE-inspired methods are worth keeping on the roadmap. They're not day-one targets for constrained hardware, but as the library matures and targets higher-spec parts, these become tractable. The key principle stays the same: implement the version that fits your problem and your flash budget, not the full general case.
Training Without Backprop: The Algorithms That Actually Fit
Gradient descent is not the only path to a trained model. On embedded hardware it's often not even the best path. There's a family of numerically stable, low-memory update algorithms that are much better suited to MCU constraints — and they deserve to be first-class citizens in this library.
Recursive Least Squares (RLS) solves the linear regression problem incrementally, sample by sample, without storing the full dataset. It maintains a covariance matrix and updates it with each new observation, converging faster than SGD and with no learning rate to tune. On a mid-range MCU with 10–20 features, the covariance matrix is small enough to live comfortably in SRAM. RLS is the right default for any regression task where fast convergence and numerical stability matter more than raw throughput.
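The recursion itself is compact. A minimal plain-C sketch with a small fixed feature count, assuming the textbook update k = Px / (λ + xᵀPx), w += k·e, P = (P − k·xᵀP)/λ; the struct and names here are illustrative, not a proposed API:

```c
#define RLS_N 2  /* feature count, hypothetical */

typedef struct {
    float w[RLS_N];          /* weight vector */
    float P[RLS_N][RLS_N];   /* inverse-covariance estimate */
    float lambda;            /* forgetting factor, 0 < lambda <= 1 */
} RLS;

static void rls_init(RLS *m, float lambda, float p0) {
    for (int i = 0; i < RLS_N; ++i) {
        m->w[i] = 0.0f;
        for (int j = 0; j < RLS_N; ++j)
            m->P[i][j] = (i == j) ? p0 : 0.0f;  /* large p0 = weak prior */
    }
    m->lambda = lambda;
}

/* One sample: gain k = Px / (lambda + x'Px), then update w and P. */
static void rls_update(RLS *m, const float *x, float y) {
    float Px[RLS_N], k[RLS_N], denom = m->lambda;
    for (int i = 0; i < RLS_N; ++i) {
        Px[i] = 0.0f;
        for (int j = 0; j < RLS_N; ++j) Px[i] += m->P[i][j] * x[j];
        denom += x[i] * Px[i];
    }
    float err = y;
    for (int i = 0; i < RLS_N; ++i) err -= m->w[i] * x[i];
    for (int i = 0; i < RLS_N; ++i) {
        k[i] = Px[i] / denom;
        m->w[i] += k[i] * err;
    }
    for (int i = 0; i < RLS_N; ++i)    /* P symmetric, so x'P == (Px)' */
        for (int j = 0; j < RLS_N; ++j)
            m->P[i][j] = (m->P[i][j] - k[i] * Px[j]) / m->lambda;
}
```

State is one n-vector and one n×n matrix: at 20 features in float, about 1.7KB, comfortably inside a mid-range part's SRAM.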
Incremental QR decomposition takes this further. Rather than maintaining and inverting a covariance matrix directly — which can become ill-conditioned — incremental QR updates a factored representation of the data matrix as new samples arrive. It's more numerically robust than plain RLS and still runs sample-by-sample. For embedded systems where you might be training on noisy sensor data over long periods, that stability is worth the slightly higher per-update cost.
LMS-style updates (Least Mean Squares) are at the other end of the complexity spectrum: a single weight update per sample, one multiply-accumulate per feature, no matrix state. Convergence is slower and noisier than RLS, but the memory footprint is essentially zero beyond the weight vector itself. LMS is the right tool for the most constrained targets, or for problems where you want continuous, lightweight adaptation running indefinitely in the background.
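The entire LMS algorithm fits in a few lines, which is the point. A sketch assuming a fixed step size mu; the function name is illustrative:

```c
#include <stddef.h>

/* One LMS step: e = y - w·x, then w[i] += mu * e * x[i].
   The only state is the weight vector the caller owns. */
static void lms_update(float *w, const float *x, float y,
                       size_t n, float mu) {
    float err = y;
    for (size_t i = 0; i < n; ++i) err -= w[i] * x[i];
    for (size_t i = 0; i < n; ++i) w[i] += mu * err * x[i];
}
```

One pass over the features to compute the error, one pass to nudge the weights: two multiply-accumulates per feature per sample, and nothing to store between calls.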
Together, RLS, incremental QR, and LMS cover a range from "fast and stable" to "minimal overhead, always on." A well-designed library exposes all three and lets the developer choose based on their hardware and application — not based on what was easiest to port from Python.
Pure C: The Only Reasonable Implementation Language
This library should be written in pure C — not C++, not Rust, not a thin wrapper around a Python-generated blob. C is the lingua franca of embedded systems. It compiles cleanly on every toolchain from GCC-ARM to SDCC to the Arduino AVR compiler. It gives the developer full control over memory layout, alignment, and register usage. And it interoperates with existing firmware without friction.
The dependency list should be as short as possible. Ideally: the C standard library (stdint.h, string.h, math.h) and nothing else for the core algorithms. Platform-specific acceleration — CMSIS-DSP on Cortex-M, the ESP-IDF DSP extensions on ESP32 — can be offered as optional back-ends behind a thin abstraction layer, but the pure-C fallback must always exist and always compile cleanly on any target. Use as little of even those as possible. Every external dependency is a maintenance burden and a porting tax.
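One way that seam could look: a compile-time switch selecting a vendor kernel, with the portable loop as the unconditional default. The `EMBML_USE_CMSIS_DSP` flag and `embml_dot` name are hypothetical; `arm_dot_prod_f32` is the real CMSIS-DSP dot-product routine.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(EMBML_USE_CMSIS_DSP)
#include "arm_math.h"
/* Accelerated path: delegate to the CMSIS-DSP kernel on Cortex-M. */
static float embml_dot(const float *a, const float *b, size_t n) {
    float32_t out;
    arm_dot_prod_f32(a, b, (uint32_t)n, &out);
    return out;
}
#else
/* Pure-C fallback: compiles on any toolchain, no dependencies. */
static float embml_dot(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}
#endif
```

Callers only ever see `embml_dot`; whether a vendor kernel sits behind it is a build-time decision, and the fallback is always there.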
No dynamic allocation in the core path. No malloc, no free. The caller provides buffers; the library uses them. This is non-negotiable for embedded code that needs to run reliably over months without heap fragmentation killing a device in the field.
// Caller provides all memory — no hidden allocation
RLSModel model;
float weights[N_FEATURES];
float cov[N_FEATURES * N_FEATURES];
rls_init(&model, N_FEATURES, FORGETTING_FACTOR, weights, cov);

// Per-sample update — can run in interrupt context
rls_update(&model, feature_vector, label);

// Inference
float prediction = rls_predict(&model, new_features);
The API should be flat, explicit, and boring. Boring embedded code is correct embedded code.
Why This Matters
The gap between embedded developers and ML practitioners is largely a tools gap. Embedded engineers understand their hardware deeply but may not know how to implement incremental QR. ML engineers can train excellent models but may not know how to write interrupt-safe code or work within a 256KB flash budget. A well-designed embedded ML library bridges that gap — meeting embedded developers where they already are: in C, on the hardware, thinking in terms of registers and cycles.
The hardware is already capable. The algorithms exist and are well-understood. What's missing is a library that takes both the hardware constraints and the algorithmic options seriously, without requiring a server in the loop.
That library should exist. It's time to build it.
@hejhdiss writes about embedded systems, signal processing, and the intersection of hardware and machine intelligence.