DEV Community: Raghu Bharadwaj

Coherent vs Streaming DMA: A Deep Dive into the Linux DMA Mapping API

Raghu Bharadwaj — Thu, 02 Jul 2026 06:02:09 +0000

The Linux DMA mapping API exists to solve two problems for driver authors: translating a CPU buffer into a bus address the device can use, and inserting the cache maintenance operations needed on non-coherent architectures. Coherent mappings (dma_alloc_coherent) are for small, long-lived control structures accessed by both CPU and device without explicit syncing, while streaming mappings (dma_map_single, dma_map_page, dma_map_sg) are for bulk data mapped just before a transfer and unmapped after, with explicit sync calls if the CPU touches the buffer in between. In the Linux 7.x source, every mapping with no custom DMA ops and no IOMMU takes the dma-direct fast path through dma_map_phys(), and CONFIG_DMA_API_DEBUG can validate that a driver's map and unmap calls are used correctly.

If you write drivers for embedded Linux, the DMA mapping API is one of the interfaces you cannot avoid for long. The moment a device moves data into or out of memory on its own, without the CPU copying each byte, your driver has to tell the kernel how that memory should be prepared. Get it wrong and the symptoms are some of the hardest to debug in kernel work: data that is correct on a desktop x86 board but corrupted on an ARM target, or a buffer that reads back stale values only under load. This Deep Dive comes in two parts. First it covers the concepts and rules every driver author needs: coherent versus streaming mappings, the DMA mask, directions, and syncing. Then it goes one layer down and traces the kernel source that implements them, from the dispatch in kernel/dma/mapping.c to the arm64 cache hooks and the CONFIG_DMA_API_DEBUG facility. The source shown is from Linux 7.1, whose series reworked the DMA core to be physical-address based.

Three kinds of addresses

The first source of confusion is that DMA involves three different address spaces, and they are not interchangeable. The kernel works with virtual addresses, the kind returned by kmalloc() and stored in a void *. The memory management unit translates those into CPU physical addresses, the values you see in /proc/iomem. A device, however, sees a third kind of address called a bus address. On simple systems the bus address equals the physical address, but when an IOMMU or a host bridge sits between the device and memory, the two diverge.

This matters because a device performing DMA uses bus addresses, and it has no access to the CPU's virtual memory system. You cannot hand a device a pointer from kmalloc() and expect it to work. The job of the DMA mapping API is to take a buffer the CPU can see and return a dma_addr_t value the device can use, setting up any IOMMU translation along the way. Every driver that touches DMA must include the header that defines this type:

#include <linux/dma-mapping.h>

Why the DMA mapping API exists

Beyond address translation, the API solves a second problem: cache coherency. Many embedded SoCs have CPU caches that are not kept coherent with DMA traffic. If the CPU writes a buffer, the data may still be sitting in the cache when the device reads main memory, so the device sees old contents. In the other direction, the device writes main memory while the CPU still holds a cached copy, so the CPU reads stale data. The DMA mapping API is the single place where the kernel inserts the cache maintenance operations needed to handle this, in an architecture-independent way. On a fully coherent platform those operations compile down to almost nothing; on a non-coherent ARM board they become real cache flushes and invalidations. Your driver code stays the same either way.

Tell the kernel your addressing limits

Before mapping anything, a driver must declare how many address bits the device can drive. By default the kernel assumes 32-bit DMA addressing. You change that with a single call that covers both the streaming and coherent interfaces:

if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64))) {
        dev_warn(dev, "No suitable 64-bit DMA available\n");
        /* fall back or refuse to probe */
}

The kernel saves this mask and uses it later when it allocates DMA addresses, so it never hands the device an address it cannot reach. Note that dma_set_mask_and_coherent() will not fail for masks of 32 bits or larger, so the common pattern is to set 64 bits when the device supports it and 32 bits otherwise, rather than trying a 64-bit call and falling back to 32. If the device has different limits for descriptors and for data, you can set the streaming and coherent masks separately with dma_set_mask() and dma_set_coherent_mask().
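To make the mask concrete, here is a minimal userspace sketch of the arithmetic. DMA_BIT_MASK() is reproduced with the same definition the kernel header uses. addr_fits_mask() is a hypothetical helper invented for this illustration; it only approximates the kind of reachability test the kernel applies later and is not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same definition as the kernel's DMA_BIT_MASK(): the n low bits set. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* A 32-bit mask covers exactly the first 4 GiB of bus address space. */
static_assert(DMA_BIT_MASK(32) == 0xffffffffULL, "32-bit mask is 4 GiB - 1");

/*
 * Hypothetical helper (not a kernel API): true when the last byte of
 * [addr, addr + size) is still reachable under the given mask.
 */
static bool addr_fits_mask(uint64_t addr, size_t size, uint64_t mask)
{
        uint64_t end;

        if (size == 0)
                return false;

        end = addr + size - 1;
        if (end < addr)         /* range wrapped around 2^64 */
                return false;

        return end <= mask;
}
```

With a 32-bit mask, a 4 KiB buffer at 0xfffff000 still fits because its last byte is 0xffffffff, while the same buffer at 0x100000000 does not. The second case is the one that forces the kernel to bounce or fail the mapping.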

Coherent mappings: allocate once, keep for the device's lifetime

A coherent mapping is memory for which a write by either the CPU or the device is immediately visible to the other, with no explicit flushing in your driver. Think of it as synchronous. You allocate it once, usually at probe time, and free it at removal. The classic uses are control structures the device polls continuously: network card ring descriptors, command mailboxes, or firmware microcode run out of main memory.

dma_addr_t dma_handle;
void *cpu_addr;

cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
if (!cpu_addr)
        return -ENOMEM;

The call returns two things: a CPU virtual address you use to read and write the buffer, and a dma_handle of type dma_addr_t that you program into the device. When you are done, release both with the matching free call:

dma_free_coherent(dev, size, cpu_addr, dma_handle);

One subtlety that surprises people: coherent does not mean the CPU stops reordering writes. If the device must see word zero of a descriptor updated before word one, you still need a write memory barrier between the two stores:

desc->word0 = address;
wmb();
desc->word1 = DESC_VALID;

For many small allocations, carving them out of a single page is wasteful. The dma_pool interface acts like a kmem_cache built on top of dma_alloc_coherent(), and it understands alignment and boundary constraints that hardware queues often require.

Streaming mappings: map for one transfer, then unmap

A streaming mapping is for memory the CPU already owns, which you want to hand to the device for a single transfer and then take back. Think of it as asynchronous, outside the coherency domain. Network packets being transmitted or received, and filesystem buffers, are the standard examples. You map a buffer just before the transfer and unmap it as soon as the device signals completion:

dma_addr_t dma_handle;

dma_handle = dma_map_single(dev, addr, size, DMA_TO_DEVICE);
if (dma_mapping_error(dev, dma_handle))
        goto map_error;

/* program dma_handle into the device, start the transfer */

dma_unmap_single(dev, dma_handle, size, DMA_TO_DEVICE);

The direction argument is not optional decoration. DMA_TO_DEVICE means memory to device, DMA_FROM_DEVICE means device to memory, and DMA_BIDIRECTIONAL covers both at a possible performance cost. The kernel uses the direction to decide which cache operations to perform, so specify it as precisely as you can. DMA_NONE exists only as a debugging placeholder.

Two rules are easy to miss. First, always check dma_mapping_error() on the returned address; mapping can fail when DMA address space is exhausted or an IOMMU mapping cannot be created, and using an unchecked address can lead to silent corruption. Second, never use the CPU buffer while it is mapped for the device. The buffer belongs to the device between map and unmap. The same applies to dma_map_page(), which takes a page and offset instead of a CPU pointer so it can map HIGHMEM memory, and to dma_map_sg() for scatter-gather lists.

Synchronising a buffer you reuse

Sometimes you need the CPU to look at a streaming buffer between transfers without fully unmapping it. That is what the sync calls are for. Before the CPU reads a buffer the device just wrote, give ownership back to the CPU; before handing it to the device again, return ownership to the device:

dma_sync_single_for_cpu(dev, dma_handle, size, DMA_FROM_DEVICE);
/* CPU may now safely read the buffer */

dma_sync_single_for_device(dev, dma_handle, size, DMA_FROM_DEVICE);
/* device may now use the buffer again */

If you never touch the data between dma_map_*() and dma_unmap_*(), you do not need the sync calls at all. They exist precisely for the reuse case, and skipping them on a non-coherent platform is a frequent cause of intermittent corruption.
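The ownership hand-off can be pictured as a small state machine. The sketch below is a userspace model only: every toy_* name is hypothetical, and the transitions are deliberately stricter than the real API so that each illegal step returns false instead of silently succeeding.

```c
#include <assert.h>
#include <stdbool.h>

/* Who is allowed to touch the buffer right now (model only). */
enum toy_owner { TOY_UNMAPPED, TOY_DEVICE_OWNED, TOY_CPU_OWNED };

struct toy_mapping {
        enum toy_owner owner;
};

/* Models dma_map_*(): the buffer now belongs to the device. */
static bool toy_map(struct toy_mapping *m)
{
        assert(m);
        if (m->owner != TOY_UNMAPPED)
                return false;
        m->owner = TOY_DEVICE_OWNED;
        return true;
}

/* Models dma_sync_*_for_cpu(): the CPU may look at the data. */
static bool toy_sync_for_cpu(struct toy_mapping *m)
{
        if (m->owner != TOY_DEVICE_OWNED)
                return false;
        m->owner = TOY_CPU_OWNED;
        return true;
}

/* Models dma_sync_*_for_device(): hand the buffer back. */
static bool toy_sync_for_device(struct toy_mapping *m)
{
        if (m->owner != TOY_CPU_OWNED)
                return false;
        m->owner = TOY_DEVICE_OWNED;
        return true;
}

/* Models dma_unmap_*(): the mapping is gone, the CPU owns the memory. */
static bool toy_unmap(struct toy_mapping *m)
{
        if (m->owner == TOY_UNMAPPED)
                return false;
        m->owner = TOY_UNMAPPED;
        return true;
}

/* The rule from the text: hands off while the device owns the buffer. */
static bool toy_cpu_may_touch(const struct toy_mapping *m)
{
        return m->owner != TOY_DEVICE_OWNED;
}
```

A mapping that goes map, sync-for-cpu, sync-for-device, unmap passes every check. A CPU access between map and the first sync-for-cpu is the case the model rejects.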

Alignment and cache lines

One rule deserves special attention on embedded targets. You may DMA to memory from kmalloc() or the page allocator, but not from vmalloc() memory, kernel stack, or static (data, text, bss) addresses. On a CPU with DMA-incoherent caches, a DMA buffer must also not share a cache line with other data, or a CPU write to one word and a DMA write to a neighbouring word in the same line can overwrite each other. Architectures set ARCH_DMA_MINALIGN so that kmalloc() buffers are aligned safely, but if you embed a DMA buffer inside a larger structure next to fields the CPU writes, you are responsible for keeping them on separate cache lines.
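As a minimal sketch of the embedded-buffer case, the structure below keeps a DMA buffer on cache lines of its own. It is plain C11 so it can be compiled anywhere; the 64-byte line size, the structure, and the helper are all assumptions made for the illustration. Kernel code would align to ARCH_DMA_MINALIGN rather than hard-code a number.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 64-byte cache lines. */
#define SKETCH_DMA_ALIGN 64

/* Hypothetical driver state with an embedded DMA buffer. */
struct sketch_dev_state {
        uint32_t irq_count;     /* CPU-written bookkeeping */
        uint32_t flags;

        /* Start the DMA buffer on a fresh cache line... */
        alignas(SKETCH_DMA_ALIGN) uint8_t rx_buf[100];

        /* ...and push the next CPU-written field onto another one. */
        alignas(SKETCH_DMA_ALIGN) uint32_t tail;
};

static_assert(offsetof(struct sketch_dev_state, rx_buf) % SKETCH_DMA_ALIGN == 0,
              "rx_buf must start on a cache line boundary");
static_assert(offsetof(struct sketch_dev_state, tail) % SKETCH_DMA_ALIGN == 0,
              "tail must not share rx_buf's last cache line");

/* True when two byte ranges (offset, length) touch a common cache line. */
static int ranges_share_line(size_t a_off, size_t a_len,
                             size_t b_off, size_t b_len)
{
        size_t a_first, a_last, b_first, b_last;

        assert(a_len > 0 && b_len > 0);
        a_first = a_off / SKETCH_DMA_ALIGN;
        a_last = (a_off + a_len - 1) / SKETCH_DMA_ALIGN;
        b_first = b_off / SKETCH_DMA_ALIGN;
        b_last = (b_off + b_len - 1) / SKETCH_DMA_ALIGN;

        return a_first <= b_last && b_first <= a_last;
}
```

Aligning only the start of the buffer is not enough: the field that follows it must also begin on a new line, otherwise the tail of rx_buf shares a line with CPU-written data.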

Inside the DMA mapping API: three back ends

Everything above is the contract your driver works to, and it is stable across kernel versions. The implementation underneath is not. The 7.x series reworked the DMA core to be physical-address based: the internal entry point that performs the dispatch is now dma_map_phys() in kernel/dma/mapping.c, the dma_map_ops operation map_page was renamed map_phys, and dma_direct_map_page() became dma_direct_map_phys(). The dma_map_single() and dma_map_page() you call are unchanged; they convert your buffer to a physical address and feed dma_map_phys() underneath. Trimmed to the decision that matters, the dispatch looks like this:

dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys, size_t size,
                enum dma_data_direction dir, unsigned long attrs)
{
        const struct dma_map_ops *ops = get_dma_ops(dev);
        dma_addr_t addr = DMA_MAPPING_ERROR;

        if (dma_map_direct(dev, ops))
                addr = dma_direct_map_phys(dev, phys, size, dir, attrs, true);
        else if (use_dma_iommu(dev))
                addr = iommu_dma_map_phys(dev, phys, size, dir, attrs);
        else if (ops->map_phys)
                addr = ops->map_phys(dev, phys, size, dir, attrs);

        debug_dma_map_phys(dev, phys, size, dir, addr, attrs);
        return addr;
}

There are still exactly three paths. The dma-direct path (dma_direct_map_phys) is the common one on most modern arm64 and x86 systems. The IOMMU path (iommu_dma_map_phys) runs when an IOMMU is managing the device. The legacy ops path (ops->map_phys) is for buses that install their own struct dma_map_ops. The selector is dma_map_direct(), which calls a small helper:

static bool dma_go_direct(struct device *dev, dma_addr_t mask,
                const struct dma_map_ops *ops)
{
        if (use_dma_iommu(dev))
                return false;
        if (likely(!ops))
                return true;
        /* CONFIG_DMA_OPS_BYPASS mask check omitted */
        return false;
}

The key line is if (likely(!ops)) return true;. When a device has no custom DMA ops, the kernel takes the direct path. And whether a device has ops is decided by get_dma_ops():

static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
{
        if (dev->dma_ops)
                return dev->dma_ops;
        return get_arch_dma_ops();
}

The version shown is the one compiled when CONFIG_ARCH_HAS_DMA_OPS is set. On architectures built without it (which includes today's arm64 and x86), get_dma_ops() is a stub that simply returns NULL. A NULL ops pointer is precisely what makes dma_go_direct() return true. So on a typical embedded arm64 board with no IOMMU in the path, every mapping you make goes straight through the dma-direct layer. That is the code worth understanding well.
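The selection logic traced above can be condensed into a small userspace model. This is a sketch of the decision order only, following the trimmed listings in this section; the struct, the enum, and sketch_pick_path() are hypothetical names, and the CONFIG_DMA_OPS_BYPASS case is left out just as it is in the listing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which back end a mapping would take (model only). */
enum sketch_path { PATH_DIRECT, PATH_IOMMU, PATH_OPS, PATH_NONE };

/* Hypothetical stand-in for the bits of struct device that matter here. */
struct sketch_dev {
        bool uses_iommu;        /* models use_dma_iommu(dev) */
        bool has_ops;           /* models get_dma_ops(dev) != NULL */
        bool ops_has_map_phys;  /* models ops->map_phys != NULL */
};

static enum sketch_path sketch_pick_path(const struct sketch_dev *dev)
{
        assert(dev != NULL);

        /* Mirrors dma_go_direct(): an IOMMU rules out direct, no ops means direct. */
        if (!dev->uses_iommu && !dev->has_ops)
                return PATH_DIRECT;
        if (dev->uses_iommu)
                return PATH_IOMMU;
        if (dev->ops_has_map_phys)
                return PATH_OPS;

        /* Nothing handled it: the address stays at the error value. */
        return PATH_NONE;
}
```

A device with neither flag set, which is the typical arm64 board described above, always lands on PATH_DIRECT.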

The dma-direct fast path

The dma-direct implementation lives in kernel/dma/direct.h and kernel/dma/direct.c. The single-buffer map is a static inline in the header, and in 7.x it takes a physical address directly rather than a page and offset. Trimmed to the normal-memory path (the source also handles MMIO and confidential-computing buffers), it is short and revealing:

static inline dma_addr_t dma_direct_map_phys(struct device *dev,
                phys_addr_t phys, size_t size, enum dma_data_direction dir,
                unsigned long attrs, bool flush)
{
        dma_addr_t dma_addr = phys_to_dma(dev, phys);

        if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
                if (is_swiotlb_active(dev))
                        return swiotlb_map(dev, phys, size, dir, attrs);
                return DMA_MAPPING_ERROR;
        }

        if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
                arch_sync_dma_for_device(phys, size, dir);
                if (flush)
                        arch_sync_dma_flush();
        }
        return dma_addr;
}

Read it from the top. The page-to-physical conversion that older kernels did here is gone; the caller already passes a phys_addr_t. The function turns it into a bus address with phys_to_dma(), then checks dma_capable(): can the device, given its DMA mask, reach this address? If not, and the software I/O TLB (swiotlb) is active for the device, it bounces the transfer through swiotlb_map(); otherwise it returns DMA_MAPPING_ERROR, the value dma_mapping_error() tests for. This closes the loop on the mask you set in part one: the mask is the input to dma_capable(), and an honest mask is what triggers bouncing instead of silent corruption when a 32-bit device is handed a high buffer.

The last lines are the cache story. If the device is not cache-coherent and the caller did not set DMA_ATTR_SKIP_CPU_SYNC, the code calls arch_sync_dma_for_device(), then, when flush is set, arch_sync_dma_flush(). That second call is new in the 7.x series: the cache maintenance and its memory barrier were split apart, so a batch of mappings can issue one barrier at the end instead of one per buffer. On a coherent platform dev_is_dma_coherent(dev) is true and nothing happens. That single branch is the difference between a desktop x86 board where DMA "just works" and an embedded arm64 target where forgetting a sync corrupts data.

Where cache coherency actually happens

arch_sync_dma_for_device() and its counterpart arch_sync_dma_for_cpu() are per-architecture hooks, enabled by CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE. On arm64 they live in arch/arm64/mm/dma-mapping.c and are remarkably direct:

void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
                enum dma_data_direction dir)
{
        unsigned long start = (unsigned long)phys_to_virt(paddr);

        dcache_clean_poc_nosync(start, start + size);
}

void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
                enum dma_data_direction dir)
{
        unsigned long start = (unsigned long)phys_to_virt(paddr);

        if (dir == DMA_TO_DEVICE)
                return;
        dcache_inval_poc_nosync(start, start + size);
}

This is the concrete meaning of the DMA direction argument from part one. Before the device reads memory the CPU wrote (the for-device direction), arm64 cleans the data cache to the Point of Coherency with dcache_clean_poc_nosync(), pushing any dirty cache lines out to where the device will read them. After the device writes memory the CPU will read (the for-cpu direction), arm64 invalidates the cache with dcache_inval_poc_nosync() so stale cached copies are dropped and the CPU re-reads from RAM. The _nosync suffix is the 7.x change: the barrier that used to follow each cache operation is deferred to the single arch_sync_dma_flush() call we saw a moment ago, which the generic layer issues once per batch. The if (dir == DMA_TO_DEVICE) return; in the for-cpu path is an optimisation: if data only moved toward the device, there is nothing for the CPU to re-read, so no invalidation is needed. The direction selects which cache operation runs.
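If clean and invalidate still feel abstract, a toy model helps. The sketch below simulates a single write-back cache line in userspace: the CPU sees the cached copy, the device sees RAM. All names are hypothetical and the model ignores everything a real cache does beyond these two operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One cache line's worth of state (model only). */
struct toy_line {
        uint32_t ram;           /* what the device sees */
        uint32_t cached;        /* what the CPU sees while valid */
        bool valid;
        bool dirty;
};

/* CPU store: lands in the cache, RAM is not updated yet. */
static void cpu_write(struct toy_line *l, uint32_t v)
{
        assert(l);
        l->cached = v;
        l->valid = true;
        l->dirty = true;
}

/* CPU load: served from the cache if valid, else refilled from RAM. */
static uint32_t cpu_read(struct toy_line *l)
{
        if (!l->valid) {
                l->cached = l->ram;
                l->valid = true;
                l->dirty = false;
        }
        return l->cached;
}

/* Device accesses bypass the cache entirely. */
static void dev_write(struct toy_line *l, uint32_t v) { l->ram = v; }
static uint32_t dev_read(const struct toy_line *l) { return l->ram; }

/* Clean: write dirty data back to RAM, keep the line. */
static void toy_clean(struct toy_line *l)
{
        if (l->valid && l->dirty) {
                l->ram = l->cached;
                l->dirty = false;
        }
}

/* Invalidate: drop the cached copy without writing it back. */
static void toy_invalidate(struct toy_line *l)
{
        l->valid = false;
        l->dirty = false;
}
```

Write from the CPU and the device still reads the old RAM value until toy_clean() runs; write from the device and the CPU still reads its cached value until toy_invalidate() runs. Those are the two failure modes the for-device and for-cpu hooks exist to prevent.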

Why coherent memory is a different memory type

Part one said a coherent buffer needs no syncing. The source explains why. The allocator dma_alloc_attrs() dispatches the same three ways, and on the direct path calls dma_direct_alloc() in direct.c. On a non-coherent architecture it allocates pages with __dma_direct_alloc_pages(), prepares them with arch_dma_prep_coherent() (on arm64 a cache clean over the region), and then, when CONFIG_DMA_DIRECT_REMAP is set, remaps them uncached with dma_common_contiguous_remap(); on architectures with CONFIG_ARCH_HAS_DMA_SET_UNCACHED it instead calls arch_dma_set_uncached().

So on a non-coherent SoC, the buffer returned by dma_alloc_coherent() is mapped uncached. That is how coherency is achieved without per-access cache maintenance, and it is also why coherent memory is the wrong choice for large data buffers: every CPU access bypasses the cache and is slow. This is the implementation reason behind the rule from part one to keep coherent memory for small control structures and use streaming maps for bulk data.

When the device cannot reach the buffer: swiotlb

Return to the dma_capable() check. When a device with a narrow DMA mask is handed a buffer above its reach, the direct path calls swiotlb_map(). The software I/O TLB keeps a low, device-addressable memory pool reserved at boot. swiotlb_map() copies (bounces) the buffer into that pool and returns a bus address the device can use. For a transfer toward the device, the bounce copy happens at map time; for a transfer from the device, it happens when you sync or unmap, which is one more reason the unmap and sync calls are mandatory rather than advisory. Bouncing is transparent to your driver, but it costs a copy, so a correct DMA mask is what lets the kernel skip it whenever the hardware can address the buffer directly.
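The timing of the two copies is easier to see in a model. This userspace sketch follows the simplified description above and nothing more: one fixed slot stands in for the low memory pool, and the toy_* names are hypothetical. The real swiotlb handles slot allocation, partial syncs, and further cases that are not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_dir { TOY_TO_DEVICE, TOY_FROM_DEVICE, TOY_BIDIRECTIONAL };

#define TOY_SLOT_SIZE 64

/* A single bounce slot standing in for the device-reachable pool. */
struct toy_bounce {
        unsigned char slot[TOY_SLOT_SIZE];
        void *orig;             /* the driver's real buffer */
        size_t len;
        enum toy_dir dir;
};

/* Map: returns the address the device would use, or NULL if it cannot fit. */
static unsigned char *toy_bounce_map(struct toy_bounce *b, void *buf,
                                     size_t len, enum toy_dir dir)
{
        assert(b && buf);
        if (len > TOY_SLOT_SIZE)
                return NULL;

        b->orig = buf;
        b->len = len;
        b->dir = dir;

        /* Data heading to the device is copied in at map time. */
        if (dir == TOY_TO_DEVICE || dir == TOY_BIDIRECTIONAL)
                memcpy(b->slot, buf, len);

        return b->slot;
}

/* Unmap: data that came from the device is copied back only now. */
static void toy_bounce_unmap(struct toy_bounce *b)
{
        if (b->dir == TOY_FROM_DEVICE || b->dir == TOY_BIDIRECTIONAL)
                memcpy(b->orig, b->slot, b->len);
        b->orig = NULL;
        b->len = 0;
}
```

In the receive case the driver's own buffer stays unchanged until toy_bounce_unmap() runs, which is the model's version of why skipping the unmap or sync leaves you reading stale data.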

Streaming sync, in the source

The sync calls from part one map onto dma_direct_sync_single_for_cpu() and dma_direct_sync_single_for_device(). The for-cpu side shows both halves of the work in one place:

static inline void dma_direct_sync_single_for_cpu(struct device *dev,
                dma_addr_t addr, size_t size, enum dma_data_direction dir,
                bool flush)
{
        phys_addr_t paddr = dma_to_phys(dev, addr);

        if (!dev_is_dma_coherent(dev)) {
                arch_sync_dma_for_cpu(paddr, size, dir);
                if (flush)
                        arch_sync_dma_flush();
                arch_sync_dma_for_cpu_all();
        }
        swiotlb_sync_single_for_cpu(dev, paddr, size, dir);
}

It performs the architecture cache invalidate (only when the device is non-coherent), issues the deferred barrier, then asks swiotlb to copy any bounced data back. So the same call covers both the cache problem and the bounce-buffer problem, which is why a single dma_sync_single_for_cpu() in your driver is enough regardless of platform.

A debugging session with CONFIG_DMA_API_DEBUG

The kernel can validate every DMA call you make. Build with CONFIG_DMA_API_DEBUG=y and the code in kernel/dma/debug.c shadows each mapping in a hash table, checking that unmaps match maps, that the CPU does not touch memory currently owned by a device, and that drivers do not free memory with the wrong function. First confirm it is enabled and look at the debugfs directory it creates:

raghu@techveda.org:~$ zcat /proc/config.gz | grep DMA_API_DEBUG
CONFIG_DMA_API_DEBUG=y
raghu@techveda.org:~$ ls /sys/kernel/debug/dma-api/
all_errors     dump              nr_total_entries
disabled       error_count       num_errors
driver_filter  min_free_entries  num_free_entries

The most useful files are error_count (how many problems have been detected), dump (a listing of every mapping the kernel is currently tracking, which lets you spot leaks), and num_errors (how many warnings it will still print before going quiet, which you can raise). The driver_filter file restricts reporting to a single driver, so you can isolate your own:

raghu@techveda.org:~$ cat /sys/kernel/debug/dma-api/error_count
0
raghu@techveda.org:~$ echo mydev > /sys/kernel/debug/dma-api/driver_filter

When you break the rules from part one, the report names the driver and is specific about the fault. Unmapping with a different call than you mapped with is flagged as freeing DMA memory with the wrong function. Touching a buffer the device still owns is caught by the active-cacheline tracker, which in the 7.x series warns that you have exceeded the allowed number of overlapping mappings of a cacheline (the older "cpu touching an active dma mapped cacheline" wording was replaced). If heavy traffic exhausts the shadow entries, which you can watch by reading min_free_entries as it falls toward zero, raise the preallocated count from its default of 65536 at boot with the kernel parameter dma_debug_entries=. You can disable the facility entirely with dma_debug=off; note that it cannot be re-enabled at runtime. The tracking has a real performance cost, so this is a development-kernel tool, not something to ship.

Learning both the contract and the implementation behind it, against real hardware and a real source tree, is the core of our Linux device drivers training, where DMA, interrupts, and the driver model are taught by tracing the kernel rather than memorising signatures.

Key takeaways

DMA involves virtual, physical, and bus addresses; the DMA mapping API converts a CPU buffer into a dma_addr_t the device can use, handling IOMMU and cache maintenance.
Declare the device's addressing limits with dma_set_mask_and_coherent() before mapping; the mask feeds dma_capable() and decides whether the kernel must bounce through swiotlb.
Use coherent mappings (dma_alloc_coherent()) for small long-lived control structures; they are uncached on non-coherent SoCs, so use streaming maps (dma_map_single(), dma_map_page(), dma_map_sg()) with the correct direction and a dma_mapping_error() check for bulk data.
Call dma_sync_*() when the CPU touches a streaming buffer between transfers; on arm64 the direction selects the cache op, dcache_clean_poc_nosync() before the device reads and dcache_inval_poc_nosync() after it writes.
The 7.x DMA core is physical-address based: with no custom ops and no IOMMU every map takes the dma-direct path (dma_map_phys() then dma_direct_map_phys()); build with CONFIG_DMA_API_DEBUG and read /sys/kernel/debug/dma-api/ to catch misuse.

Frequently asked questions

What is the difference between a coherent and a streaming DMA mapping?
A coherent mapping, made with dma_alloc_coherent(), is allocated once and kept for the device's lifetime with no explicit flushing needed; it suits small control structures like ring descriptors. A streaming mapping, made with dma_map_single() or similar, is set up just before one transfer and torn down after, and is the right choice for bulk data such as network packets.

Why does a DMA buffer that works on x86 get corrupted on an ARM target?
Many embedded SoCs have CPU caches that are not automatically kept coherent with DMA traffic, so a CPU write can sit in cache while the device reads stale main memory, or vice versa. On a coherent platform the DMA API's cache operations compile down to almost nothing, but on a non-coherent ARM board they become real cache flushes and invalidations, so skipping the required sync calls causes intermittent corruption there but not on coherent x86 hardware.

Do I need to call the dma_sync_* functions for every streaming mapping?
No. The sync calls are only needed if the CPU touches the buffer data between dma_map_*() and dma_unmap_*() while reusing the same mapping. If you never touch the data in that window, the map and unmap calls alone are sufficient.

What does CONFIG_DMA_API_DEBUG check for?
It shadows every DMA mapping your driver makes in a hash table and checks that unmaps match the corresponding maps, that the CPU does not touch memory currently owned by a device, and that drivers do not free memory with the wrong function. Enable it and read the files under /sys/kernel/debug/dma-api/, such as error_count and dump, to catch misuse during development.

How to Get Your First Linux Kernel Patch Accepted

Raghu Bharadwaj — Thu, 02 Jul 2026 06:01:04 +0000

Getting a first Linux kernel patch accepted means choosing one small, real, single-purpose fix, basing it on the correct tree from the MAINTAINERS file, and following the kernel's exact submission process end to end. That process is: commit with git commit -s so your Signed-off-by certifies the work, run scripts/checkpatch.pl --strict and confirm a clean build, find recipients with scripts/get_maintainer.pl, and send the patch inline (never as an attachment) with git send-email. Most first patches are not merged as sent, so responding well to reviewer feedback, with patience over weeks, is what actually decides whether a patch gets accepted.

For an embedded or kernel engineer, getting your first Linux kernel patch accepted is one of the clearest signals you can put on a CV. It proves you can read unfamiliar code, follow a strict process, and work with maintainers who hold a very high bar. This guide walks through the exact steps to prepare, send, and defend a first patch, using the current kernel.org process. The change itself can be small. What matters is that you complete the full loop correctly.

Why your first Linux kernel patch is worth the effort

A merged patch is public and permanent. Your name and sign-off stay in the Git history of a project that runs on billions of devices. For hiring managers in embedded Linux, device drivers, and BSP work, that is stronger evidence than any certificate, because the kernel community rejects work that is sloppy or untested. The skills you practice here, namely reading code paths, writing a clear commit message, and responding to review, are the same skills that senior engineering roles are built on.

You do not need to invent a feature. Most first contributions are small fixes: a real bug, a correctness issue found by a static analysis tool, or a documented behaviour that the code does not match. The goal of a first patch is to learn the workflow end to end, not to make a large change.

Find a change worth submitting

Pick a change that is small, real, and easy for a reviewer to verify. Each patch must solve exactly one problem and be justifiable on its own. Good places to look for a first change include:

Warnings from scripts/checkpatch.pl or sparse on a driver you already work with, where the fix is a genuine correctness improvement and not a cosmetic edit made only to silence the tool.
A real bug you hit on your own hardware, where you can describe the user-visible impact (a crash, a lockup, a wrong value in dmesg).
The drivers/staging tree, which historically accepts cleanup and fix work from new contributors.
A mismatch between documented behaviour and actual code that you can demonstrate.

Avoid changes that only reformat code or rename things for taste. Maintainers see many such patches and usually reject them. If you found the bug with git bisect, record the commit that introduced it; you will reference it later with a Fixes: tag.

Your move → Before writing any code, open the MAINTAINERS file and find the subsystem that owns the file you want to change. Note the T: line; it tells you which Git tree to base your work on.

Set up the source tree and your identity

Start from the correct tree. Many changes go to a subsystem maintainer's tree rather than mainline, so check the MAINTAINERS entry first. To begin from mainline:

raghu@techveda.org:~$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
raghu@techveda.org:~$ cd linux

Set your real name and email. The kernel does not accept anonymous contributions, and the name you configure here becomes the author and sign-off in the permanent changelog.

raghu@techveda.org:~$ git config user.name "Raghu Bharadwaj"
raghu@techveda.org:~$ git config user.email "you@example.org"

Create a topic branch so you can later generate the base-commit information that reviewers and automated CI need:

raghu@techveda.org:~$ git checkout -t -b my-first-fix master
Branch 'my-first-fix' set up to track local branch 'master'.
Switched to a new branch 'my-first-fix'

Make the change and write the commit message

Make one logical change, then commit with sign-off. The -s flag adds your Signed-off-by: line, which certifies that you have the right to submit the work under the kernel's licence (the Developer's Certificate of Origin). The -v flag shows the diff in the editor so you can review it while writing the message:

raghu@techveda.org:~$ git commit -s -v

The commit message is judged as carefully as the code. Write the description in the imperative mood, for example "fix" and "remove" rather than "fixed" or "this patch removes". State the problem first, then the user-visible impact, then what your change does. If you are fixing a specific earlier commit, add a Fixes: tag with at least the first 12 characters of its SHA-1 and the one-line summary. A complete message looks like this:

net: foo: avoid use-after-free on probe failure

When foo_probe() fails after registering the netdev, the cleanup
path frees the private data before unregistering, so a queued
work item dereferences freed memory and triggers a KASAN
use-after-free splat on disconnect.

Reorder teardown to unregister the device before freeing its
private data.

Fixes: 54a4f0239f2e ("net: foo: add initial probe support")
Signed-off-by: Raghu Bharadwaj <you@example.org>

If you used any AI coding assistant while preparing the change, the current process asks you to record that with an Assisted-by: tag; failing to disclose it can hurt acceptance. Keep one logical change per commit. If the message starts to describe two unrelated things, split it into two patches.

Check the patch before you send it

Never send a patch you have not style-checked and built. Run the style checker on your commit; --strict turns on the stricter CHECK-level advice that maintainers often expect from new code:

raghu@techveda.org:~$ ./scripts/checkpatch.pl --strict --git HEAD
total: 0 errors, 0 warnings, 0 checks, 24 lines checked

checkpatch reports at three levels: ERROR for things very likely wrong, WARNING for things needing review, and CHECK for things needing thought. You should be able to justify every violation that remains. The checker is a guide, not a replacement for judgement, so do not make code worse just to silence it. Then confirm the tree still builds cleanly with your change applied; a patch that breaks the build will be rejected without a real review.

Generate the patch as a file. Using --base=auto records the base commit so reviewers and CI know exactly what your work applies to:

raghu@techveda.org:~$ git format-patch -1 --base=auto -o outgoing/
outgoing/0001-net-foo-avoid-use-after-free-on-probe-failure.patch

Your move → Read your own patch one more time as plain text in the outgoing/ file. The subject line becomes a permanent, searchable identifier for your change, so keep it under about 70 characters and make it describe what and why.

Send the patch to the right people

Find the correct recipients with the maintainer script. Pass it your patch file, not a source file:

raghu@techveda.org:~$ ./scripts/get_maintainer.pl outgoing/0001-net-foo-avoid-use-after-free-on-probe-failure.patch

Send the patch inline as plain text. Patches must not be sent as attachments, because reviewers need to quote your code line by line in their replies. The recommended tool is git send-email, which formats the subject prefix, sign-off, and separators correctly and is far less error-prone than a normal mail client. You can let the maintainer script fill the Cc list automatically:

raghu@techveda.org:~$ git send-email \
    --cc-cmd='./scripts/get_maintainer.pl --norolestats outgoing/0001-net-foo-avoid-use-after-free-on-probe-failure.patch' \
    outgoing/0001-net-foo-avoid-use-after-free-on-probe-failure.patch

Always Cc the subsystem list and the maintainers; linux-kernel@vger.kernel.org is the default catch-all list, but do not spam unrelated lists or people. If you would rather not manage these mechanics by hand, the b4 tool automates much of the format, check, and send flow.

Respond to review, which is what decides acceptance

Most first patches are not merged as sent. You will get review comments, and responding to them well is the part that actually gets your patch accepted. Reply to the comments, thank the reviewer, and use trimmed, interleaved replies rather than top-posting. When you send a new version, bump the version in the subject (the tooling produces [PATCH v2] for you) and add a short changelog explaining what changed since the previous version. Put that changelog below the --- separator so it does not become part of the permanent commit log.

Be patient. Reviewers are busy and you may wait two to three weeks for a response. Wait at least a week before pinging or resending, and longer during a merge window. If a reviewer offers a Reviewed-by: or Tested-by: tag, carry it forward on the next version unless the patch changed substantially. Persistence and politeness matter more than cleverness when getting a first patch merged.

Key takeaways

Choose a small, real, single-purpose change and base it on the correct tree from the MAINTAINERS file.
Commit with git commit -s so your Signed-off-by: certifies the work; write the message in imperative mood and add a Fixes: tag when you fix a known commit.
Run scripts/checkpatch.pl --strict and confirm a clean build before you send anything.
Route the patch with scripts/get_maintainer.pl and send it inline with git send-email; never attach it.
Disclose AI assistance with an Assisted-by: tag, and treat review feedback as the main work, not an afterthought.

Frequently asked questions

Do I need to find or invent a big feature for my first kernel patch?
No. Most first contributions are small fixes, such as a real bug, a correctness issue flagged by checkpatch.pl or sparse, or a documented behaviour that does not match the code. The goal is to learn the submission workflow end to end, not to make a large change.

How do I find the right Git tree and maintainers for my change?
Open the MAINTAINERS file and find the subsystem that owns the file you want to change, noting its T: line for the correct Git tree to base your work on. Before sending, run scripts/get_maintainer.pl against your patch file to get the correct recipient list.

Can I send a kernel patch as an email attachment?
No, patches must not be sent as attachments, because reviewers need to quote the code line by line in their replies. The recommended method is to send the patch inline as plain text using git send-email.

Why was my first kernel patch not merged even though I followed the process?
Most first patches are not merged as sent, and responding well to review comments is what actually gets a patch accepted. Reply to feedback, use trimmed interleaved replies, bump the patch version in the subject line for revisions, and be patient, since reviewers may take two to three weeks to respond.

Real-Time Linux vs RTOS: Zephyr, FreeRTOS, PREEMPT_RT

Raghu Bharadwaj — Thu, 02 Jul 2026 05:59:53 +0000

The real-time Linux vs RTOS choice depends on how tight and guaranteed your timing must be versus how many features and how much hardware you can spend. FreeRTOS is a minimal kernel giving single-digit-microsecond, construction-bounded latency on Cortex-M microcontrollers. Zephyr is a larger, connectivity-rich RTOS with a Linux-style Kconfig/devicetree workflow and real userspace isolation. PREEMPT_RT, mainline in Linux since version 6.12 (November 2024) and extended to 32-bit ARM in Linux 7.1 (April 2026), keeps the full Linux operating system but makes its worst-case latency bounded (typically tens of microseconds, validated empirically with cyclictest) rather than tiny. On heterogeneous SoCs, teams increasingly run Linux and an RTOS together on separate cores, connected through the OpenAMP framework.

The question of which real-time system to use keeps getting sharper. For years the choice was simple. If you needed tight, predictable timing you reached for a small real-time operating system (RTOS). If you needed Linux, you accepted that Linux was not real-time. That line has moved. As of Linux 6.12, released in November 2024, the PREEMPT_RT real-time work is fully merged into the mainline kernel, and through 2025 and 2026 it has spread to more architectures. At the same time, AI, connectivity, and hard real-time control increasingly land on the same multicore system-on-chip (SoC). This post looks in detail at the three options engineers actually weigh — FreeRTOS, Zephyr, and PREEMPT_RT — works through the real-time Linux vs RTOS trade-off, gives concrete use cases, and closes with where each is likely to head.

FreeRTOS: the minimal kernel

FreeRTOS is a small real-time kernel released under the MIT license, with stewardship passed to Amazon Web Services in 2017. Its design goal is to be minimal and portable: the core is a handful of C files, it runs on more than 40 architectures with 15-plus toolchains, and it is used most widely on Arm Cortex-M microcontrollers. A minimal build occupies roughly 6 to 12 KB of ROM and about 1 KB of RAM.

The scheduler is fixed-priority preemptive. The highest-priority ready task always runs; tasks of equal priority can be time-sliced in round-robin fashion; an idle task runs at the lowest priority; and a tickless idle mode saves power by suppressing the periodic tick when nothing is scheduled. On top of the scheduler, FreeRTOS provides tasks, queues, binary and counting semaphores, mutexes with priority inheritance, recursive mutexes, event groups, direct-to-task notifications, stream and message buffers, and software timers.
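The selection rule in that paragraph can be sketched as a toy model. This is not FreeRTOS code; the class and task names are made up for illustration, and the model only shows which task would be picked at each time slice.

```python
from collections import deque

class ToyScheduler:
    """Toy model: fixed-priority preemptive selection with round-robin
    among tasks of equal priority. Higher number means higher priority."""

    def __init__(self):
        self.ready = {}  # priority -> deque of ready task names

    def make_ready(self, name, priority):
        self.ready.setdefault(priority, deque()).append(name)

    def block(self, name):
        """Remove a task from the ready lists, e.g. because it waits on a queue."""
        for queue in self.ready.values():
            if name in queue:
                queue.remove(name)

    def tick(self):
        """Pick the task that runs for the next time slice."""
        runnable = [prio for prio, queue in self.ready.items() if queue]
        if not runnable:
            return "idle"                    # nothing ready: the idle task runs
        queue = self.ready[max(runnable)]    # the highest ready priority always wins
        task = queue.popleft()
        queue.append(task)                   # rotate so equal priorities share time
        return task

sched = ToyScheduler()
sched.make_ready("logger", 1)
sched.make_ready("ctrl_a", 3)
sched.make_ready("ctrl_b", 3)
print([sched.tick() for _ in range(3)])  # prints ['ctrl_a', 'ctrl_b', 'ctrl_a']
```

Note that "logger" never runs while either priority-3 task is ready, which is the starvation risk you accept with fixed priorities.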

Memory handling is deliberate. A system can be built entirely with static allocation, which fixes memory use at build time and removes allocation failures and allocation timing from runtime behaviour. Where dynamic allocation is used, FreeRTOS ships five heap schemes: heap_1 allocates but never frees, heap_2 frees but does not coalesce adjacent free blocks and is kept mainly for legacy code, heap_3 wraps the standard library malloc and free, heap_4 uses a first-fit algorithm with block coalescence and is the common choice, and heap_5 extends heap_4 across several non-contiguous memory regions.
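To see why coalescence matters for heap_4, here is a toy first-fit allocator. It models the idea only and is not the FreeRTOS implementation; the class name, method names, and sizes are illustrative assumptions.

```python
class FirstFitHeap:
    """Toy first-fit allocator over a byte range that merges adjacent free blocks."""

    def __init__(self, size):
        self.free = [(0, size)]   # sorted list of (offset, length) free blocks
        self.used = {}            # offset -> length of live allocations

    def malloc(self, size):
        for i, (offset, length) in enumerate(self.free):
            if length >= size:    # first block that is large enough
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (offset + size, length - size)
                self.used[offset] = size
                return offset
        return None               # allocation failure

    def free_block(self, offset):
        length = self.used.pop(offset)
        self.free.append((offset, length))
        self.free.sort()
        merged = [self.free[0]]
        for off, ln in self.free[1:]:
            last_off, last_ln = merged[-1]
            if last_off + last_ln == off:          # adjacent blocks: coalesce
                merged[-1] = (last_off, last_ln + ln)
            else:
                merged.append((off, ln))
        self.free = merged

heap = FirstFitHeap(100)
a = heap.malloc(40)
b = heap.malloc(40)
heap.free_block(a)
heap.free_block(b)
print(heap.free)  # prints [(0, 100)]
```

Without the merge step the free list would hold three fragments of 40, 40, and 20 bytes, and a later 100-byte request would fail even though 100 bytes are free.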

For larger designs there is FreeRTOS-SMP, which schedules one kernel across several identical cores, and there is memory protection through FreeRTOS-MPU, which runs restricted tasks in unprivileged mode behind an MPU. On ARMv8-M parts such as the Cortex-M23 and Cortex-M33, TrustZone divides execution into secure and non-secure sides. The connectivity story is add-on rather than built in: FreeRTOS-Plus-TCP provides the TCP/IP stack, and the modular AWS IoT libraries — coreMQTT, coreHTTP, coreJSON, plus over-the-air update support — sit on top. Long-term-support (LTS) releases give multi-year maintenance and are also packaged as CMSIS-Packs.

FreeRTOS itself is not safety-certified. Where certification is required, SafeRTOS — a derivative built by WITTENSTEIN high integrity systems from the FreeRTOS functional model — is pre-certified to IEC 61508 SIL 3 and ISO 26262 ASIL D by TÜV SÜD, and supports TrustZone. That gives teams a path from prototype to a regulated product without changing programming model.

Zephyr: the connected RTOS

Zephyr is a larger, full-featured RTOS under the Apache 2.0 license, hosted by the Linux Foundation since 2016 and governed by a broad group of silicon vendors including Nordic, NXP, Intel, and STMicroelectronics. Where FreeRTOS is a kernel you build around, Zephyr is closer to a small operating system with batteries included.

The kernel is unified: threads carry priorities, with cooperative threads at negative priorities and preemptible threads at non-negative ones, plus meta-IRQ threads for work that must run ahead of normal threads. It offers mutexes with priority inheritance, semaphores, message queues, pipes, and workqueues, and it can be built with different scheduler implementations depending on the trade-off between speed and memory.

Two things make Zephyr feel familiar to kernel engineers. First, its configuration model is taken from Linux: Kconfig for build-time configuration and devicetree for describing hardware, driven by the West meta-tool and a module system, with toolchains supplied by the Zephyr SDK. It supports Cortex-M, Cortex-R, Cortex-A, RISC-V, Xtensa, ARC, and x86 across more than 600 boards, and includes symmetric multiprocessing. Second, Zephyr has a real user-mode model: on MCUs with an MPU (on Arm, ARC, and x86) it runs threads in user mode with memory domains and partitions, isolates thread stacks, and enforces a system-call boundary, with equivalent support on RISC-V through PMP. Privilege separation of this kind is unusual in an RTOS.

The subsystem list is what sets Zephyr apart for connected products. It ships a Bluetooth Low Energy controller and host, IEEE 802.15.4, Thread, Matter, Wi-Fi, LoRaWAN, CAN, USB device and host, an IPv4/IPv6 network stack with TLS, and filesystems such as littlefs and FAT, alongside sensor, power-management, logging, shell, and settings subsystems. Around the kernel sit the MCUboot secure bootloader, PSA Crypto, a security-response process, and an active functional-safety certification effort, with testing handled by the Twister runner and the native_sim simulator. Zephyr issues rapid feature releases — the 4.x line in 2026 — alongside periodic long-term-support releases.

Real-time Linux: PREEMPT_RT

PREEMPT_RT is not a separate operating system. It is a configuration of the Linux kernel that makes almost all kernel code preemptible, so a high-priority thread can meet a deadline even while the kernel is busy. It reaches bounded latency through a few changes: hardware interrupt handlers run as kernel threads with configurable priorities; most kernel spinlocks (the spinlock_t type) become sleeping locks built on rt_mutex, which support priority inheritance so a low-priority lock holder cannot cause an unbounded delay; and high-resolution timers give fine-grained wakeups. It needs an application-class core with a memory management unit (MMU).

Real-time work on Linux uses the POSIX scheduling policies SCHED_FIFO and SCHED_RR, with priorities from 1 to 99 above all normal tasks. For deadline-driven work there is SCHED_DEADLINE, available since Linux 3.14, which implements earliest-deadline-first scheduling with a constant-bandwidth server. A task declares a runtime, a period, and a deadline through sched_setattr(), and the kernel runs an admission test so the set of deadline tasks stays schedulable. That gives Linux a formal reservation model, not just relative priorities.
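The admission test is easiest to understand as a utilization sum. The sketch below is a simplified model, not the kernel's code: it assumes the check reduces to total runtime/period staying within a fraction of the available CPUs, and the 0.95 default is an assumption that mirrors the usual sched_rt_runtime_us / sched_rt_period_us setting of 950000 / 1000000. Function names are illustrative.

```python
def utilization(tasks):
    """Sum of runtime/period over (runtime, deadline, period) tuples.
    All three values must use the same time unit."""
    return sum(runtime / period for runtime, _deadline, period in tasks)

def admit(existing, candidate, cpus=1, rt_fraction=0.95):
    """Simplified admission test: accept the candidate only if the total
    utilization stays within rt_fraction of the available CPUs."""
    runtime, deadline, period = candidate
    if not 0 < runtime <= deadline <= period:
        return False  # parameters must satisfy runtime <= deadline <= period
    return utilization(existing + [candidate]) <= rt_fraction * cpus

# Two deadline tasks already admitted, values in microseconds (hypothetical).
running = [(2_000, 10_000, 10_000), (5_000, 20_000, 20_000)]
print(admit(running, (4_000, 10_000, 10_000)))  # prints True
print(admit(running, (6_000, 10_000, 10_000)))  # prints False
```

The second candidate is rejected because it would push total utilization to 1.05, more than one CPU can supply.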

The standard way to measure the result is cyclictest from the rt-tests package, which measures the gap between when a high-priority task should wake and when it actually runs:

raghu@techveda.org:~$ sudo cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0
# /dev/cpu_dma_latency set to 0us
T: 0 ( 1520) P:80 I:200 C: 100000 Min:      2 Act:    3 Avg:    4 Max:      58
T: 1 ( 1521) P:80 I:200 C:  99998 Min:      2 Act:    4 Avg:    4 Max:      61

The Max column is what matters: the worst observed latency in microseconds, typically in the tens of microseconds on tuned hardware. Two limits are worth stating plainly. Some kernel paths still use non-preemptible raw_spinlock_t locks and a few interrupts stay non-threaded, so the worst case is larger and less certain than on a small RTOS. And an RTOS bounds latency by construction, while PREEMPT_RT is validated empirically with a tool like cyclictest rather than proven.
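If you log cyclictest runs during bring-up, pulling the worst Max value out of the output is a small parsing job. This is a sketch that assumes the per-thread line format shown above; the function names and the budget check are illustrative, not part of rt-tests.

```python
import re

# One per-thread line, for example:
# T: 0 ( 1520) P:80 I:200 C: 100000 Min:      2 Act:    3 Avg:    4 Max:      58
_THREAD_LINE = re.compile(
    r"T:\s*(\d+)\s*\(\s*(\d+)\)\s*P:\s*(\d+).*?"
    r"Min:\s*(\d+)\s+Act:\s*(\d+)\s+Avg:\s*(\d+)\s+Max:\s*(\d+)"
)

def worst_case_latency(output):
    """Return (thread, max_us) for the thread with the largest Max value,
    or None if no per-thread line is found."""
    worst = None
    for line in output.splitlines():
        match = _THREAD_LINE.search(line)
        if match:
            thread, max_us = int(match.group(1)), int(match.group(7))
            if worst is None or max_us > worst[1]:
                worst = (thread, max_us)
    return worst

def within_budget(output, budget_us):
    """True if every thread's observed Max stayed at or under the budget."""
    worst = worst_case_latency(output)
    return worst is not None and worst[1] <= budget_us

sample = """# /dev/cpu_dma_latency set to 0us
T: 0 ( 1520) P:80 I:200 C: 100000 Min:      2 Act:    3 Avg:    4 Max:      58
T: 1 ( 1521) P:80 I:200 C:  99998 Min:      2 Act:    4 Avg:    4 Max:      61
"""
print(worst_case_latency(sample))  # prints (1, 61)
```

Keep in mind that this only reports the worst latency observed in one run, which is exactly the empirical limit described above.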

Where mainline real-time stands in 2026

The mainline merge was the start of a steady expansion. Linux 6.12 (November 2024) made PREEMPT_RT selectable on x86-64, arm64, and RISC-V. Linux 6.13 (early 2025) added real-time support for LoongArch and introduced a lazy preemption model, which delays the preemption of normal tasks to the next scheduler tick to recover throughput while still preempting real-time tasks immediately. Linux 7.1 (April 2026) added mainline real-time support for 32-bit ARM, the older architecture on many existing embedded SoCs, removing the last widely used out-of-tree patches for it. A small out-of-tree queue remains, including PowerPC, but for the architectures most embedded teams use the real-time kernel is now a configuration option rather than a patch set to maintain.

Real-time Linux vs RTOS: which one?

The real-time Linux vs RTOS decision comes down to how tight and how guaranteed your timing must be, set against how many features you need and how much hardware you can spend. An RTOS on a Cortex-M core has a short, well-understood path from interrupt to task, so its latencies are typically single-digit microseconds and easy to bound, even on a device with no MMU and only a few kilobytes of RAM. Linux with PREEMPT_RT runs on an application-class core and keeps the full operating system (process isolation, an MMU, drivers, networking, a filesystem), making its worst case bounded rather than tiny. The table below sets the three side by side.

Property	FreeRTOS	Zephyr	PREEMPT_RT (Linux)
Type	Minimal RTOS kernel	Full-featured RTOS	Real-time config of Linux
License / steward	MIT / AWS	Apache 2.0 / Linux Foundation	GPL-2.0 / kernel community
Typical cores	Cortex-M, RISC-V MCUs	Cortex-M/R/A, RISC-V, Xtensa	Cortex-A, x86-64, arm64, RISC-V
Footprint	~6–12 KB ROM	Tens of KB to MB	Megabytes (full Linux)
Scheduling	Fixed-priority preemptive	Cooperative + preemptive	SCHED_FIFO/RR, SCHED_DEADLINE
Worst-case latency	Single-digit µs	Single-digit µs	Tens of µs (tuned)
Memory protection	Optional MPU (FreeRTOS-MPU)	User mode + memory domains	Full MMU, process isolation
Connectivity	Add-on (FreeRTOS-Plus, AWS)	Built in (BLE, Thread, Matter)	Full Linux stack
Safety-cert path	Via SafeRTOS (SIL 3 / ASIL D)	Active certification effort	Via commercial RT distributions
Config & build	C config headers	Kconfig + devicetree + West	Kconfig (kernel)

Read the table by column, not by row. FreeRTOS wins where size, cost, and a tiny certified core matter. Zephyr wins where you want one maintained codebase with connectivity and security already present. PREEMPT_RT wins where you already need Linux and want bounded latency on the same chip.

Use cases: matching the tool to the job

FreeRTOS suits cost- and power-sensitive microcontroller products with hard deadlines and modest connectivity: motor and power controllers, battery-powered sensors, wearables, and appliance control boards. It is also the common choice for the RTOS running on the real-time core of a heterogeneous SoC, and, through SafeRTOS, for regulated products in industrial and automotive systems.
Zephyr suits connected products that need a maintained protocol stack: Bluetooth, Thread, Matter, and Wi-Fi devices, multiprotocol gateways, and sensor hubs. It suits teams that want to standardise one codebase across many microcontrollers, and products that need userspace isolation and a secure-boot and update story without assembling it from third-party parts.
PREEMPT_RT suits systems already built on application-class Linux that need bounded latency: industrial controllers and PLCs, robotics running ROS 2, CNC and machine control, professional audio, test and measurement, and telecom data planes.
A mixed design suits heterogeneous SoCs that carry both core types: Linux, optionally with PREEMPT_RT, on the application cores for interface, networking, and AI, and an RTOS on the real-time cores for the deterministic loop.

Mixed-criticality on one SoC

Modern SoCs increasingly remove the need to choose only one. A single chip carries application cores and real-time cores side by side: NXP i.MX 8M pairs Cortex-A53 cores with a Cortex-M4 or M7; TI Sitara parts such as AM64x combine Cortex-A53 with Cortex-R5F cores; ST STM32MP1 pairs Cortex-A7 with a Cortex-M4; and AMD/Xilinx Zynq UltraScale+ combines Cortex-A53 with Cortex-R5F. This is asymmetric multiprocessing (AMP): Linux runs on the application cores for connectivity, interface, and AI workloads, while an RTOS or bare-metal firmware runs on the real-time cores for the control loop that must never miss a deadline.

The open framework for this pattern is OpenAMP, with three parts: remoteproc for life-cycle management, where Linux loads firmware onto the remote core and starts or stops it; rpmsg for inter-processor messaging; and virtio as the transport underneath. The remoteproc and rpmsg infrastructure has been in the mainline Linux kernel since version 3.4, originally contributed by Texas Instruments. Controlling the second core from Linux is done through sysfs:

raghu@techveda.org:~$ echo rtos_firmware.elf | sudo tee /sys/class/remoteproc/remoteproc0/firmware > /dev/null
raghu@techveda.org:~$ echo start | sudo tee /sys/class/remoteproc/remoteproc0/state > /dev/null
raghu@techveda.org:~$ cat /sys/class/remoteproc/remoteproc0/state
running

Mixing tasks of different criticality on shared hardware is also a research field in its own right, dating to Vestal's 2007 work on scheduling systems with different levels of timing assurance and surveyed since by Burns and Davis. The practical warning from that work is that separate cores do not give separate timing. Even when the control loop has its own core, it still shares the last-level cache and the DRAM controller with Linux, and heavy memory traffic on the application cores can lengthen the worst case on the real-time core. Techniques that address this — cache partitioning and colouring, memory-bandwidth reservation such as MemGuard, and static partitioning hypervisors such as Jailhouse — are worth knowing when the deadlines are strict. Designing this division of work cleanly, including the shared-resource behaviour, is a core skill in our Linux Systems Engineering training.

Where each is heading

FreeRTOS is likely to stay what it is: a minimal, near-ubiquitous kernel. Its growth is not in the scheduler but around it — the AWS IoT library set, long-term-support maintenance, and the certified SafeRTOS path for regulated markets. Symmetric multiprocessing will mature, but the value remains a small and predictable core. The most probable long-term role is the RTOS on the real-time core of heterogeneous SoCs, where its size and determinism are exactly what is wanted.

Zephyr has the clearest momentum among the open RTOSes, and the reason is its backers: most major silicon vendors now support it directly, which means new parts arrive with Zephyr support rather than a proprietary stack. Its direction is broader connectivity, a stronger security and update story, real userspace isolation, and a functional-safety certification path. The likely outcome is that Zephyr keeps displacing in-house and proprietary RTOSes in the mid-to-high-end microcontroller space. The cost of that capability is a larger footprint and a steeper initial setup than FreeRTOS, so the two are more likely to divide the space than for one to remove the other.

PREEMPT_RT has passed its hardest milestone. With the mainline merge complete and architecture coverage still widening, the maintenance burden falls and adoption becomes easier to justify. The work now moves to tooling — latency tracing and validation — to wider architecture support, and to tighter use of SCHED_DEADLINE and real-time-aware frameworks such as ROS 2. It will not replace a hard RTOS where microsecond determinism or formal certification is required. What it will keep doing is absorbing control tasks that once needed a separate processor, because a single Linux core is now often good enough.

The common direction is convergence on one chip. Mixed-criticality on a single SoC is becoming the default system shape rather than a special case, which changes where the hard engineering sits. The difficult questions are moving away from which operating system to use and towards the boundary between them: how the cores are isolated, how they exchange messages, and how shared caches and memory affect timing. The durable advantage for an engineer is the ability to work competently on both sides of that boundary — the Linux side and the RTOS side — rather than only one.

Key takeaways

FreeRTOS is the minimal, tiny-footprint kernel; Zephyr is the larger, connectivity-rich RTOS with a Linux-style workflow and real userspace isolation; PREEMPT_RT is mainline Linux made preemptible for bounded latency.
An RTOS bounds latency by construction in single-digit microseconds; PREEMPT_RT gives bounded latency inside a full operating system, typically tens of microseconds, validated with cyclictest.
Since Linux 6.12 (November 2024), real-time Linux is mainline, and through 2025 and 2026 it reached LoongArch (6.13) and 32-bit ARM (7.1).
On heterogeneous SoCs the practical answer is both: Linux on the application cores and an RTOS on the real-time cores, joined by OpenAMP, with attention to shared-cache and memory interference.

Frequently asked questions

What is the difference between FreeRTOS and Zephyr?
FreeRTOS is a small, minimal kernel (roughly 6-12 KB ROM, about 1 KB RAM) built around tasks, queues, and semaphores, with connectivity added on through FreeRTOS-Plus-TCP and the AWS IoT libraries. Zephyr is a larger, full-featured RTOS with a Linux-style Kconfig and devicetree workflow, built-in Bluetooth/Thread/Matter/Wi-Fi stacks, and real userspace isolation with memory domains on MCUs that have an MPU.

Is PREEMPT_RT part of mainline Linux now?
Yes. PREEMPT_RT was fully merged into the mainline kernel as of Linux 6.12 (November 2024), initially selectable on x86-64, arm64, and RISC-V. Linux 6.13 added LoongArch, and Linux 7.1 (April 2026) added mainline real-time support for 32-bit ARM.

How is PREEMPT_RT latency measured, and how does it compare to an RTOS?
PREEMPT_RT latency is measured empirically with cyclictest from the rt-tests package, which reports the worst observed gap (the Max column) between when a high-priority task should wake and when it actually runs, typically tens of microseconds on tuned hardware. An RTOS such as FreeRTOS or Zephyr instead bounds latency by construction, typically to single-digit microseconds, rather than validating it with a measurement tool.

How do Linux and an RTOS work together on the same chip?
On heterogeneous SoCs such as the NXP i.MX 8M or TI AM64x, Linux runs on the application cores while an RTOS or bare-metal firmware runs on separate real-time cores for the deterministic control loop. The open framework connecting them is OpenAMP, using remoteproc for life-cycle management and rpmsg for inter-processor messaging, controllable from Linux through sysfs under /sys/class/remoteproc/.

Edge AI in the Next 10 Years: The Silicon Shift

Raghu Bharadwaj — Thu, 02 Jul 2026 05:41:42 +0000

Edge AI is moving inference off the cloud and onto the device itself, driven by latency, bandwidth and cost, reliability, and privacy regulation. Training stays in the cloud; inference moves to the edge — which makes the AI accelerator (the NPU) the component that shapes the whole embedded design. Dedicated NPUs now span sub-1 TOPS microcontrollers up to 275+ TOPS Jetson modules, so the hardware is rarely the limit. The harder work has moved up the stack: quantizing models to INT8/INT4, and building the embedded Linux and Yocto platform that turns the silicon into a shipping product.

For most of the last decade, "AI" and "the cloud" were almost the same thing. You collected data on a device, sent it to a data centre, ran inference on a rack of GPUs, and returned an answer. That model worked until latency, bandwidth, battery, cost, and privacy regulation all began to favour edge AI: running the model on the device itself.

If you build embedded Linux systems — writing BSPs, bringing up boards, maintaining Yocto layers, working on device trees and kernel drivers — this shift affects your work directly. The next ten years of edge AI are not a data-science topic that happens elsewhere. They are a systems-engineering topic, and embedded Linux is the layer underneath almost all of it.

This is a practical field guide to that shift. It is not marketing and not a vendor pitch. It is a grounded look at what is shipping, what is standardising, and where an embedded engineer should invest their skills now. We start with the current state and the silicon that makes edge AI possible.

The state of edge AI in 2026

Market signals, and what the forecasts mean

If you look for the size of the "edge AI market," you will find figures that differ from each other by a factor of three. That is not because the analysts are wrong. It is because they measure different things. Before quoting any figure in a roadmap, understand its scope.

STL Partners models edge AI addressable revenue reaching roughly USD 157 billion by 2030, growing at about 19% a year, and projects that computer vision alone will account for around half of that market by 2030.
BCC Research, measuring more narrowly, sees the market rising from USD 11.8 billion in 2025 to USD 56.8 billion by 2030, a 36.9% compound annual growth rate.
On hardware specifically, MarketsandMarkets forecasts edge AI hardware growing from USD 26.14 billion in 2025 to USD 58.90 billion by 2030 at 17.6% CAGR, while the narrower accelerator segment is tracked by Mordor Intelligence at USD 7.45 billion in 2025 rising to USD 35.75 billion by 2030 at 31% CAGR.

The lesson for engineers is to quote ranges, not single points, and to state whether a figure covers software, hardware, or accelerator silicon alone. The direction, however, is not in doubt: every credible forecast rises steeply.
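Before you quote a forecast, it is worth checking that its start value, end value, and growth rate agree with each other. A minimal sketch of that arithmetic, using the BCC Research and MarketsandMarkets figures quoted above over a five-year window (the function name is illustrative):

```python
def cagr(start, end, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end / start) ** (1 / years) - 1

# Figures from the forecasts above, in USD billions, 2025 to 2030.
print(f"{cagr(11.8, 56.8, 5):.1%}")    # prints 36.9%
print(f"{cagr(26.14, 58.90, 5):.1%}")  # prints 17.6%
```

Running the same check on any quoted start, end, and rate is a quick way to confirm that the three numbers belong to the same forecast window.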

The clearest structural driver sits under all of these numbers. More than 12 billion IoT-connected endpoints were capable of running basic ML inference in 2025, a figure projected to pass 38 billion by 2034 as vendors add dedicated ML accelerators to every tier of silicon (Research Intelo). Inference capability is being built into nearly every new device by default.

From cloud training to edge inference

The defining pattern of this decade is a split, not a full migration. Training largely stays in the cloud, where it is compute-intensive, batch-oriented, and benefits from centralised data and large-scale parallelism. Inference increasingly moves to the edge, where the decisions must be made.

Four forces drive that split, and each is familiar to embedded engineers:

Latency. A control loop, a safety interlock, or a real-time vision pipeline cannot afford a round trip to a data centre. Local inference is measured in milliseconds, not network hops.
Bandwidth and cost. Streaming raw sensor or video data to the cloud continuously is expensive and often infeasible at fleet scale. Running the model locally and sending only results reduces that cost sharply.
Reliability. A device that depends on connectivity to think stops thinking when the link drops. On-device inference degrades gracefully.
Privacy and regulation. Regulation increasingly requires sensitive data to be processed where it is generated. This has become one of the strongest drivers of all: on-device processing is becoming a compliance strategy, not only a performance one.

For the embedded Linux engineer, this split changes the job. The board is no longer a data-collection endpoint that forwards work upstream. It is where the work happens. That places more importance on the component that makes local inference possible: the accelerator.

AI accelerators: the silicon arms race

The hardware you choose determines much of what follows — the models you can run, the power budget you live within, and the software stack you will spend months integrating.

NPUs, TOPS, and the new performance tiers

The largest change of the past two years is that dedicated neural processing units (NPUs) have improved by roughly an order of magnitude and are now standard across the whole compute spectrum. A rough map of the landscape, from microcontroller NPUs to dev kits and across several vendors:

Class	Representative silicon	AI performance
Microcontroller / TinyML	STMicroelectronics STM32N6 (Neural-ART)	~0.6 TOPS INT8 (3 TOPS/W)
Integrated SoC NPU	NXP i.MX 8M Plus	~2.3 TOPS
Integrated SoC NPU	Rockchip RK3588	~6 TOPS INT8
Integrated SoC NPU	TI TDA4VM (Jacinto)	~8 TOPS
Discrete accelerator	Google Coral Edge TPU	4 TOPS INT8 (~2 W)
Discrete accelerator	Hailo-8 (M.2 module)	26 TOPS INT8
Laptop / phone NPU	Qualcomm Snapdragon X Elite	~45 TOPS
Laptop / phone NPU	Intel Lunar Lake (Core Ultra 200V, NPU4)	up to 48 TOPS
Laptop / phone NPU	AMD Ryzen AI 300	up to 50 TOPS
Robotics / vision module	NVIDIA Jetson Orin Nano Super	up to 67 TOPS
Robotics / vision module	NVIDIA Jetson Orin NX	up to 157 TOPS
Robotics / vision module	NVIDIA Jetson AGX Orin	up to 275 TOPS
High-end edge dev kit	NVIDIA Jetson AGX Thor	~2070 FP4 TFLOPS (40–130 W)

A caution about TOPS. Trillions of operations per second is a peak-throughput headline number, and it tells you little about real-world performance on your workload. The parts above also span different device classes — a microcontroller NPU, an integrated SoC NPU, a discrete accelerator module, and a full compute module — which are not measured the same way. The figure says nothing about the numeric precision assumed (INT8, INT4, or FP4), memory bandwidth, on-chip SRAM, thermal sustainability, or how well your model maps onto the accelerator's execution units. A 275-TOPS module that is memory-starved on a vision-language model can perform worse than a lower-TOPS part with a better-balanced architecture. Use TOPS as a first-pass filter, then benchmark your actual model.
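One rough way to reason about the memory-starved case is a roofline estimate, a technique brought in here for illustration rather than taken from the text above: attainable throughput is the smaller of the compute peak and what the memory system can feed. All numbers in the sketch are hypothetical and do not describe any part in the table.

```python
def attainable_tops(peak_tops, mem_bandwidth_gbs, ops_per_byte):
    """Roofline estimate: throughput is capped by compute or by memory traffic.

    ops_per_byte is the workload's arithmetic intensity, meaning operations
    performed per byte moved to or from memory.
    """
    memory_bound_tops = mem_bandwidth_gbs * ops_per_byte / 1000.0  # GB/s * ops/B = Gops/s
    return min(peak_tops, memory_bound_tops)

# Hypothetical parts running a workload with 50 operations per byte.
big = attainable_tops(peak_tops=275, mem_bandwidth_gbs=200, ops_per_byte=50)
balanced = attainable_tops(peak_tops=60, mem_bandwidth_gbs=400, ops_per_byte=50)
print(big, balanced)  # prints 10.0 20.0
```

In this made-up case the 60-TOPS part delivers twice the throughput of the 275-TOPS part because the workload is limited by bandwidth on both. A real benchmark of your own model is still the only number that counts.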

How edge accelerators achieve their efficiency

Understanding how NPUs achieve speed and efficiency is what lets you write software that uses them well. Three architectural techniques do most of the work:

Systolic arrays for matrix multiplication — the dense linear algebra at the core of neural networks — arranged so data flows through a grid of multiply-accumulate units with little control overhead.
Dedicated memory hierarchies designed to minimise data movement, because on modern silicon, moving data costs far more energy than the arithmetic. Keeping weights and activations close to the compute units is the main objective.
Reduced-precision arithmetic — INT8 and increasingly INT4 — that keeps acceptable accuracy while cutting the compute and memory footprint.

The third point is where your work as an engineer meets the silicon. Quantization is no longer optional; it is a requirement for edge inference. A model trained in FP32 in the cloud must be quantized to INT8 or INT4 to run efficiently on an edge NPU, and doing that well — post-training quantization versus quantization-aware training, per-channel versus per-tensor scaling, handling outlier activations — is becoming a core embedded-AI skill. Heterogeneous SoC integration is delivering roughly 3 to 5 times better energy efficiency per inference operation with each generation (Research Intelo), and quantization is how you capture that efficiency.
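The per-tensor versus per-channel distinction is easier to see in code. The following is a minimal sketch of symmetric post-training quantization in plain Python, assuming non-empty weight lists; real toolchains add calibration data, zero points, and outlier handling, and the function names here are illustrative.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: one scale for the whole list."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [x * scale for x in q]

def quantize_per_channel(rows):
    """Per-channel variant: each row (one output channel) gets its own scale."""
    return [quantize_int8(row) for row in rows]

# Two channels with very different ranges (hypothetical weights).
weights = [[0.01, -0.02], [5.0, -10.0]]

flat = [w for row in weights for w in row]
q_tensor, _ = quantize_int8(flat)
print(q_tensor[:2])   # small channel under one shared scale: prints [0, 0]

per_channel = quantize_per_channel(weights)
print(per_channel[0][0][1])  # same channel with its own scale: prints -127
```

With one shared scale, the small-range channel collapses to zero and its information is lost. A per-channel scale preserves it at the cost of storing one scale per channel.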

The power spectrum: from sub-50 mW MCUs to 15 W NPUs

A useful way to think about the coming decade is that edge AI is not one thing. It is a spectrum that has widened at both ends. Vendors now embed ML accelerators across every processor tier — from Cortex-M class microcontrollers drawing under 50 mW for TinyML workloads, up to high-performance edge NPUs consuming as much as 15 W, and beyond that to the 130 W dev-kit class such as Jetson Thor.

Choosing the right point on that spectrum is the central design decision, and it is a familiar embedded trade-off in a new form:

A battery-powered sensor node doing keyword spotting or anomaly detection wants the microcontroller end: milliwatts, TinyML, INT8 models measured in kilobytes.
A smart camera or industrial gateway doing continuous computer vision wants a mid-tier NPU or a discrete accelerator in the single-digit-watt range.
An autonomous robot or multi-camera vision-language system justifies a Jetson-class module and its power and thermal budget.

If this choice is wrong, software optimisation cannot recover it: an over-provisioned part drains the battery, and an under-provisioned one drops frames. The range of available parts is now wide enough that there is an accelerator tier for essentially every power envelope.
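
The battery side of that trade-off is simple arithmetic, sketched below in Python. The 10 Wh battery capacity is a hypothetical figure, and the calculation assumes the accelerator draws its stated power continuously. It ignores the rest of the system, duty cycling, and conversion losses, so treat the results as rough upper bounds rather than predictions.

```python
# Rough battery-life estimate for the power tiers discussed above.
# Assumption: the part draws its stated power continuously; the battery
# capacity is a hypothetical value chosen for illustration.

def battery_life_hours(capacity_wh, average_power_w):
    """Hours of runtime = stored energy (Wh) / average draw (W)."""
    if average_power_w <= 0:
        raise ValueError("average power must be positive")
    return capacity_wh / average_power_w

BATTERY_WH = 10.0  # hypothetical small battery pack

tiers_w = {
    "MCU-class accelerator (50 mW)": 0.05,
    "high-performance edge NPU (15 W)": 15.0,
    "dev-kit class (130 W)": 130.0,
}

for name, power_w in tiers_w.items():
    hours = battery_life_hours(BATTERY_WH, power_w)
    print(f"{name}: about {hours:.2f} h")
```

Under these assumptions the same battery lasts about 200 hours at the 50 mW end and about 40 minutes at 15 W, which is why the tier has to be chosen before any software tuning begins.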

Where this leaves us

The foundation is in place. In 2026, edge AI has moved decisively from "collect data, send to cloud" toward "run the model where the data is." The market forecasts, however they are scoped, point upward. And the silicon, which now ranges from sub-50 mW microcontrollers to dev kits rated above 2000 TFLOPS, has matured to the point where the hardware is rarely the limiting factor.

That means the harder work moves up the stack. The open questions now are on the software side: how do you build, harden, and maintain the embedded Linux platform that turns this silicon into a deployable product, and how do you keep improving models across a fleet of devices without moving everyone's private data back to a central server? Building that platform is the substance of our Embedded Linux and Yocto training.

Key takeaways

Edge AI has changed the default from cloud inference to on-device inference, driven by latency, bandwidth and cost, reliability, and privacy regulation.
Training stays largely in the cloud; inference increasingly runs on the device, which makes the accelerator the component that shapes the rest of the design.
Accelerators now span from sub-1 TOPS microcontroller NPUs (STM32N6) through integrated SoC NPUs (i.MX 8M Plus, RK3588) and discrete modules (Coral, Hailo-8) to Jetson-class modules and dev kits. Choose by power envelope and workload, and treat TOPS as a first-pass filter only.
Quantization to INT8 or INT4 is a required skill for edge inference, not an optional optimisation.

Frequently asked questions

What is edge AI?
Edge AI runs the machine-learning model directly on the device where the data is generated, instead of sending that data to a cloud data centre and returning an answer. Inference happens locally, typically in milliseconds.

Why does training stay in the cloud while inference moves to the edge?
Training is compute-intensive, batch-oriented, and benefits from centralised data and large-scale parallelism, so it remains in the cloud. Inference moves to the device because the decision must be made where the data is — driven by latency, bandwidth and cost, reliability when connectivity drops, and privacy regulation.

What does TOPS actually tell you about an edge accelerator?
TOPS (trillions of operations per second) is a peak-throughput headline. It says nothing about numeric precision (INT8, INT4, or FP4), memory bandwidth, on-chip SRAM, thermal sustainability, or how well your model maps onto the accelerator. Use TOPS as a first-pass filter, then benchmark your actual model.

Why is quantization a required skill for edge AI?
A model trained in FP32 in the cloud must be quantized to INT8 or INT4 to run within the power and memory budget of an edge NPU. Doing it well — post-training quantization versus quantization-aware training, per-channel versus per-tensor scaling, handling outlier activations — is now a core embedded-AI skill.
