DEV Community

Muhammed Shafin P

Running Machine Learning on Microcontrollers — A Sample Usage of embml

Most embedded developers have heard the pitch for "TinyML" by now. Train a model in Python, quantize it, convert it, flash a frozen blob to your device. The microcontroller runs inference. It never learns. It never adapts. It just executes.

That's fine for a class of problems — but it leaves a lot on the table. What if your sensor drifts after six months in the field? What if you want the device to tune itself to the specific motor it's attached to, not a generic one from a training dataset? What if there's simply no server in the loop?

embml is a sample repository exploring what it looks like to do machine learning on the device itself — in pure C, with no dynamic allocation, no external dependencies beyond the standard library, and no Python runtime anywhere in the chain.

📦 Sample Repo: https://github.com/hejhdiss/embml

It is not a production framework. It is a well-structured, readable starting point — a reference that embedded developers can clone, read, understand, and adapt. Every algorithm is implemented from scratch in C99, with the caller owning every buffer.


What's in the Repo

The library covers eight modules, all in src/:

Module          Algorithm
embml_linear    Online linear regression via SGD
embml_logistic  Binary logistic regression via SGD
embml_lms       LMS and Normalised LMS adaptive filter
embml_rls       Recursive Least Squares with forgetting factor
embml_iqr       Incremental QR via Givens rotations
embml_nn        Feedforward MLP — backprop, Xavier init, gradient clipping
embml_gru       Minimal GRU cell for time-series inference
embml_esn       Echo State Network — fixed reservoir, RLS-trained readout

Each module is a .c and .h pair. Drop them directly into your firmware project.


Sample Usage

The examples below show what real usage looks like. These aren't pseudocode — they compile and run on ESP32-, STM32F4-, RP2040-, and Arduino Mega-class hardware.

Linear Regression — On-Device Temperature Compensation

A sensor reading drifts linearly with board temperature. Train a correction model live, sample by sample, with no server in the loop.

#include "embml.h"

#define N_FEAT 2   /* [raw_reading, board_temp] → corrected_value */

float weights[N_FEAT];
LinearModel model;

void setup(void) {
    linear_init(&model, N_FEAT, 0.01f, weights);
}

void loop(void) {
    float x[N_FEAT] = { read_sensor(), read_board_temp() };
    float y_true    = read_reference();   /* calibration reference */

    /* learn from each sample — no batch needed */
    linear_update(&model, x, y_true);

    float corrected = linear_predict(&model, x);
    log_value(corrected);
}

After a few hundred samples the model converges to the compensation curve. No laptop. No Python. The device taught itself.


Logistic Regression — Fault Detection

Classify whether a motor is healthy (0) or showing early fault signs (1) from three features: RMS vibration, peak frequency, and temperature.

#include "embml.h"

#define N_FEAT 3   /* [rms_vibration, peak_freq, temp] */

float weights[N_FEAT];
LogisticModel model;

void setup(void) {
    logistic_init(&model, N_FEAT, 0.005f, weights);
}

void loop(void) {
    float x[N_FEAT] = { rms(), peak_freq(), motor_temp() };

    /* During a known-good commissioning window, label = 0 */
    logistic_update(&model, x, 0.0f);

    /* In operation: */
    uint8_t fault = logistic_classify(&model, x);
    float   prob  = logistic_predict(&model, x);

    if (fault && prob > 0.75f)   /* alert only on a confident positive */
        trigger_alert();
}

LMS — Background Noise Cancellation

The Least Mean Squares filter adapts to reject a periodic noise source from a signal, updating every sample with a single multiply-accumulate per weight — the lightest possible online learner.

#include "embml.h"

#define FILTER_LEN 16

float weights[FILTER_LEN];
LMSModel model;

void setup(void) {
    /* Normalised LMS: stable without tuning step size manually */
    lms_init_nlms(&model, FILTER_LEN, 0.5f, 1e-6f, weights);
}

void loop(void) {
    float noisy_signal[FILTER_LEN] = { /* circular buffer of ADC samples */ };
    float desired = read_reference_mic();

    lms_update(&model, noisy_signal, desired);
    float clean = lms_predict(&model, noisy_signal);

    output_audio(clean);
}

RLS — Fast Converging System Identification

RLS converges far faster than SGD with no learning rate to tune. Here it identifies the coefficients of an unknown plant (e.g. a motor transfer function) in real time.

#include "embml.h"

#define N 5

float weights[N], P[N * N], k_scratch[N];
RLSModel model;

void setup(void) {
    /* lambda=0.98: moderate forgetting for a slowly drifting system */
    /* delta=1000: weak prior — trust the data quickly               */
    rls_init(&model, N, 0.98f, 1000.0f, weights, P);
}

void loop(void) {
    float x[N]  = { u_delayed(1), u_delayed(2),
                     y_delayed(1), y_delayed(2), 1.0f };
    float y_now = read_plant_output();

    rls_update(&model, x, y_now, k_scratch);

    /* weights[] now approximate the ARX model coefficients */
    float y_pred = rls_predict(&model, x);
    float residual = y_now - y_pred;
}

Incremental QR — Numerically Robust Least Squares

When the input data is poorly conditioned (e.g. highly correlated features), RLS can lose numerical stability. Incremental QR via Givens rotations avoids this by never forming the covariance matrix directly.

#include "embml.h"

#define N 6

float R[N * N], f[N];
float w[N], scratch[2 * N];
IQRModel model;

void setup(void) {
    /* ridge=1e-4: small regularisation until enough samples arrive */
    iqr_init(&model, N, 0.99f, 1e-4f, R, f);
}

void loop(void) {
    static unsigned sample_count = 0;
    float x[N] = { feature_1(), feature_2(), feature_3(),
                   feature_4(), feature_5(), feature_6() };
    float y = read_target();

    iqr_update(&model, x, y, scratch);

    /* re-solve periodically — O(n^2) back-substitution */
    if (++sample_count % 50 == 0) {
        iqr_solve(&model, w, scratch);
    }
    float yhat = iqr_predict(w, x, N);
}

Feedforward MLP — Small Neural Net, On-Device Training

A small feedforward net: 4 inputs, one hidden layer of 8 ReLU neurons, and 1 sigmoid output (two weight layers). Xavier-initialised, trained with backpropagation and gradient clipping — all on the MCU.

#include "embml.h"

#define L0 4
#define L1 8
#define L2 1

float W0[L1*L0], b0[L1], a1[L1], d1[L1];
float W1[L2*L1], b1_[L2], a2[L2], d2[L2];
float input_buf[L0];

NNLayer layers[2] = {
    { W0, b0,  a1, d1, L0, L1, EMBML_ACT_RELU    },
    { W1, b1_, a2, d2, L1, L2, EMBML_ACT_SIGMOID },
};
NNModel net;

void setup(void) {
    nn_init(&net, layers, 2, input_buf, L0, 0.01f, 1.0f);
}

void loop(void) {
    float x[L0]      = { s1(), s2(), s3(), s4() };
    float target[L2] = { ground_truth() };

    nn_train_sample(&net, x, target);

    /* Or just inference: */
    const embml_float_t *out = nn_forward(&net, x);
    float prediction = out[0];
}

GRU — Time-Series Inference

A Gated Recurrent Unit cell processes sequential sensor data step by step. Weights are loaded from flash (trained offline on a host), and the hidden state persists across time steps.

#include "embml.h"

#define X_SZ 4
#define H_SZ 8

/* Weights trained offline, stored as const arrays in flash */
#include "gru_weights.h"   /* defines Wz, Wr, Wn, Uz, Ur, Un, bz, br, bn */

float h_state[H_SZ];
float scratch[3 * H_SZ];
GRUCell cell;

void setup(void) {
    gru_init(&cell, X_SZ, H_SZ,
             Wz, Wr, Wn, Uz, Ur, Un,
             bz, br, bn, h_state, scratch);
}

void loop(void) {
    float x_t[X_SZ] = { accel_x(), accel_y(), accel_z(), gyro_z() };

    gru_step(&cell, x_t);

    /* Hidden state in cell.h[] — pass to a classifier or threshold */
    float anomaly_score = cell.h[0];
    if (anomaly_score > 0.8f)
        flag_anomaly();
}

Echo State Network — On-Device Training, No Backprop

The reservoir (random weights) is fixed and stored in flash. Only the linear readout layer is trained — via RLS, one sample at a time. For embedded time-series learning this is an unusually good balance of adaptability and compute cost: recurrent dynamics like a trained RNN, but only a linear model's worth of training arithmetic.

#include "embml.h"
#include "esn_reservoir.h"  /* const W_in[H*X], const W_res[H*H] in flash */

#define X_SZ  4
#define H_SZ 32
#define Y_SZ  1

float W_out[Y_SZ * H_SZ];
float state[H_SZ], scratch[H_SZ];
float P[H_SZ * H_SZ], k[H_SZ];

ESNModel esn;
RLSModel rls;

void setup(void) {
    esn_init(&esn, X_SZ, H_SZ, Y_SZ,
             W_in, W_res, 0.9f,
             state, scratch, W_out);
    esn_rls_init(&esn, &rls, 0.98f, 1000.0f, P, k);
}

void loop(void) {
    float x[X_SZ] = { s1(), s2(), s3(), s4() };
    float y[Y_SZ] = { read_target() };

    esn_update_state(&esn, x);   /* one reservoir step per sample */

    /* Training mode: adapt the linear readout */
    esn_rls_update(&esn, y);

    /* Inference mode (instead of the RLS update, once trained): */
    float y_out[Y_SZ];
    esn_predict(&esn, y_out);
}

Why This Repo Exists

This is a sample — a proof of concept that these algorithms fit cleanly in embedded C, that the APIs are usable by firmware engineers without an ML background, and that on-device learning is not science fiction for mid-range MCUs.

If you're building something with it, adapting it, or just reading the source to understand how RLS or Givens rotations actually work in flat C arrays — that's exactly what it's here for.

📦 Sample Repo: https://github.com/hejhdiss/embml

MIT License · Author: @hejhdiss · Generated with Claude Sonnet 4.5
