DEV Community

WonderLab


Open Source Project of the Day (Part 8): NexaSDK - Cross-Platform On-Device AI Runtime for Running Frontier Models Locally

Introduction

"What if the latest AI models could run on your phone, on IoT devices, even on edge devices — without needing to rely on the cloud?"

This is Part 8 of the "Open Source Project of the Day" series. Today we explore NexaSDK (GitHub).

Imagine running the Qwen3-VL multimodal model on an Android phone, using Apple Neural Engine for speech recognition on an iOS device, and running the Granite-4 model on a Linux IoT device — all without connecting to the cloud. That's the revolutionary experience NexaSDK delivers — bringing frontier AI models truly "down to earth" on all kinds of devices.

Why this project?

  • 🚀 NPU-first: The industry's first NPU-first on-device AI runtime
  • 📱 Full platform support: PC, Android, iOS, Linux/IoT all covered
  • 🎯 Day-0 model support: Supports newly released models (GGUF, MLX, NEXA formats)
  • 🔌 Multimodal capabilities: LLM, VLM, ASR, OCR, Rerank, image generation, and more
  • 🌟 Community recognized: 7.6k+ Stars, collaborates with Qualcomm on on-device AI competitions

What You'll Learn

  • Core concepts and architecture design of NexaSDK
  • How to run on-device AI models on various platforms
  • Support and usage of NPU, GPU, and CPU compute backends
  • Integration and use of multimodal AI capabilities
  • Comparative analysis with other on-device AI frameworks
  • How to get started building on-device AI applications with NexaSDK

Prerequisites

  • Basic understanding of LLMs and AI models
  • Familiarity with at least one programming language (Python, Go, Kotlin, Swift)
  • Understanding of on-device AI basics (optional)
  • Basic knowledge of hardware acceleration like NPU, GPU (optional)

Project Background

Project Introduction

NexaSDK is a cross-platform on-device AI runtime supporting frontier LLM and VLM models on GPU, NPU, and CPU. It provides comprehensive runtime coverage for PC (Python/C++), mobile (Android & iOS), and Linux/IoT (Arm64 & x86 Docker) platforms.

Core problems the project solves:

  • On-device AI runtimes are fragmented, requiring different solutions for different platforms
  • Lack of native NPU support, unable to fully utilize hardware acceleration
  • After new model releases, on-device support lags (no day-0 support)
  • Multimodal AI capabilities are difficult to integrate on-device
  • Cross-platform development costs are high, requiring separate implementations per platform

Target user groups:

  • Developers building on-device AI applications
  • Mobile app developers wanting to leverage NPU acceleration
  • Developers needing to run AI models on IoT devices
  • Researchers interested in on-device AI

Author/Team Introduction

Team: NexaAI

  • Background: Team focused on on-device AI solutions
  • Partners: Collaborates with Qualcomm to host on-device AI competitions
  • Contributors: 45 contributors including @RemiliaForever, @zhiyuan8, @mengshengwu, and others
  • Philosophy: Enable frontier AI models to run efficiently on all kinds of devices

Project creation date: 2024 (based on GitHub commit history)

Project Stats

  • GitHub Stars: 7.6k+ (rapidly and continuously growing)
  • 🍴 Forks: 944+
  • 📦 Version: v0.2.71 (latest version, released January 22, 2026)
  • 📄 License: Apache-2.0 (CPU/GPU components); NPU components require a license
  • 🌐 Website: docs.nexa.ai
  • 📚 Documentation: Complete documentation
  • 💬 Community: Active Discord and Slack communities
  • 🏆 Competition: Nexa × Qualcomm On-Device AI Competition ($6,500 prize)

Project development history:

  • 2024: Project launched, initial version released
  • 2024-2025: Rapid development, multi-platform support added
  • 2025: NPU support refined, collaboration with Qualcomm
  • 2026: Continuous iteration, more models and feature support added

Supported models:

  • OpenAI GPT-OSS
  • IBM Granite-4
  • Qwen-3-VL
  • Gemma-3n
  • Ministral-3
  • And many more frontier models

Main Features

Core Purpose

NexaSDK's core purpose is to provide a unified cross-platform on-device AI runtime, enabling developers to:

  1. Run AI models on multiple devices: PC, phones, and IoT devices all covered
  2. Fully utilize hardware acceleration: Automatic selection from NPU, GPU, or CPU backends
  3. Quickly integrate new models: Day-0 support, use new models as soon as they're released
  4. Multimodal AI capabilities: Comprehensive support for text, images, audio, video, and more
  5. Simplify development: Unified API, one codebase for all platforms

Use Cases

  1. Mobile AI applications

    • Smart assistants on phones
    • Offline speech recognition and translation
    • Image recognition and processing
    • Local LLM chat applications
  2. IoT and edge computing

    • AI capabilities for smart home devices
    • Intelligent analysis for industrial IoT
    • AI inference on edge servers
    • Perception capabilities for autonomous vehicles
  3. Desktop application integration

    • Local AI assistant
    • Intelligent document processing
    • Code generation tools
    • Creative content generation
  4. Enterprise applications

    • Data privacy protection (local processing)
    • Offline AI capabilities
    • Reduce cloud costs
    • Real-time response requirements
  5. Research and development

    • Model performance testing
    • Hardware acceleration research
    • New model validation
    • Algorithm optimization experiments

Quick Start

CLI Method (Simplest)

# Install Nexa CLI
# Windows (x64 with Intel/AMD NPU)
# Download: nexa-cli_windows_x86_64.exe

# macOS (x64)
# Download: nexa-cli_macos_x86_64.pkg

# Linux (ARM64)
curl -L https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_arm64.sh | bash

# Run first model
nexa infer ggml-org/Qwen3-1.7B-GGUF

# Multimodal: drag and drop image to CLI
nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF

# NPU support (Windows arm64 with Snapdragon X Elite)
nexa infer NexaAI/OmniNeural-4B

Python SDK

# Install
pip install nexaai

# Usage example
from nexaai import LLM, GenerationConfig, ModelConfig, LlmChatMessage

# Create LLM instance
llm = LLM.from_(model="NexaAI/Qwen3-0.6B-GGUF", config=ModelConfig())

# Build conversation
conversation = [
    LlmChatMessage(role="user", content="Hello, tell me a joke")
]
prompt = llm.apply_chat_template(conversation)

# Streaming generation
for token in llm.generate_stream(prompt, GenerationConfig(max_tokens=100)):
    print(token, end="", flush=True)

Android SDK

// Add to build.gradle.kts
dependencies {
    implementation("ai.nexa:core:0.0.19")
}

// Initialize SDK
NexaSdk.getInstance().init(this)

// Load and run model
VlmWrapper.builder()
    .vlmCreateInput(VlmCreateInput(
        model_name = "omni-neural",
        model_path = "/data/data/your.app/files/models/OmniNeural-4B/files-1-1.nexa",
        plugin_id = "npu",
        config = ModelConfig()
    ))
    .build()
    .onSuccess { vlm ->
        vlm.generateStreamFlow("Hello!", GenerationConfig()).collect { print(it) }
    }

iOS SDK

import NexaSdk

// Example: Speech recognition
let asr = try Asr(plugin: .ane)
try await asr.load(from: modelURL)

let result = try await asr.transcribe(options: .init(audioPath: "audio.wav"))
print(result.asrResult.transcript)

Linux Docker

# Pull image
docker pull nexa4ai/nexasdk:latest

# Run (requires NPU token)
export NEXA_TOKEN="your_token_here"
docker run --rm -it --privileged \
  -e NEXA_TOKEN \
  nexa4ai/nexasdk:latest infer NexaAI/Granite-4.0-h-350M-NPU

Core Features

  1. NPU-first support

    • Industry's first NPU-first on-device AI runtime
    • Supports Qualcomm Hexagon NPU
    • Supports Apple Neural Engine (ANE)
    • Supports Intel/AMD NPU
    • Significantly improves performance and energy efficiency
  2. Full-platform runtime

    • PC: Python/C++ SDK
    • Android: Kotlin SDK, supports NPU/GPU/CPU
    • iOS: Swift SDK, supports ANE
    • Linux/IoT: Docker image, supports Arm64 & x86
  3. Day-0 model support

    • Supports newly released models
    • Multiple model formats: GGUF, MLX, NEXA
    • Quickly integrate new models on-device
  4. Multimodal AI capabilities

    • LLM: Large language models
    • VLM: Vision language models (multimodal)
    • ASR: Automatic speech recognition
    • OCR: Optical character recognition
    • Rerank: Reranking
    • Object Detection: Object detection
    • Image Generation: Image generation
    • Embedding: Vector embeddings
  5. Unified API interface

    • OpenAI-compatible API
    • Function calling support
    • Streaming generation support
    • Unified configuration interface
  6. Model format support

    • GGUF: Widely-used quantization format
    • MLX: Apple MLX framework format
    • NEXA: NexaSDK native format
  7. Hardware acceleration optimization

    • Automatically selects the best compute backend
    • Priority: NPU > GPU > CPU
    • Optimizations for different hardware
  8. Developer-friendly

    • Run models with one line of code
    • Detailed documentation and examples
    • Active community support
    • Rich cookbook

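Because the server speaks the OpenAI wire format, any existing OpenAI client can talk to it. Here is a sketch of the request shape for a chat completion; the model name is taken from the Quick Start above, while the server URL/port depend on your local setup and are deliberately not shown:

```python
import json

# Shape of a request body for an OpenAI-compatible /v1/chat/completions
# endpoint. The model name comes from the Quick Start section; how you
# start the local server and which port it listens on is setup-specific.
request_body = {
    "model": "NexaAI/Qwen3-0.6B-GGUF",
    "messages": [{"role": "user", "content": "Hello, tell me a joke"}],
    "stream": True,      # streaming generation, as supported by the SDK
    "max_tokens": 100,
}
print(json.dumps(request_body, indent=2))
```

Any OpenAI-compatible client library can POST this payload unchanged, which is what makes drop-in replacement of cloud endpoints possible.
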
Project Advantages

Compared to other on-device AI frameworks, NexaSDK's advantages:

| Comparison | NexaSDK | Ollama | llama.cpp | LM Studio |
|---|---|---|---|---|
| NPU support | ⭐⭐⭐⭐⭐ NPU-first | ❌ Not supported | ❌ Not supported | ❌ Not supported |
| Android/iOS SDK | ⭐⭐⭐⭐⭐ Full support | ⚠️ Partial support | ⚠️ Partial support | ❌ Not supported |
| Linux Docker | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ❌ Not supported |
| Day-0 model support | ⭐⭐⭐⭐⭐ GGUF/MLX/NEXA | ❌ Lags | ⚠️ Partial support | ❌ Lags |
| Multimodal support | ⭐⭐⭐⭐⭐ Full support | ⚠️ Partial support | ⚠️ Partial support | ⚠️ Partial support |
| Cross-platform | ⭐⭐⭐⭐⭐ All platforms | ⚠️ Some platforms | ⚠️ Some platforms | ⚠️ Some platforms |
| One-line execution | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⚠️ Needs config | ⭐⭐⭐⭐⭐ Supported |
| OpenAI API compat | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported |

Why choose NexaSDK?

  • NPU-first: Fully leverages hardware acceleration for best performance and energy efficiency
  • Full-platform support: One SDK covers all platforms, reduces development costs
  • Day-0 support: Use new models as soon as they're released, no waiting
  • Multimodal capabilities: Complete AI capability stack for all needs
  • Developer-friendly: Simple API, rich documentation and examples

Detailed Project Analysis

Architecture Design

NexaSDK adopts a layered architecture whose core is a unified runtime abstraction layer:

┌─────────────────────────────────────┐
│   Application Layer                 │
│   - CLI / Python / Android / iOS    │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│   SDK Layer                         │
│   - Unified API interface           │
│   - Model loading and management    │
│   - Configuration and optimization  │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│   Runtime Layer                     │
│   - Compute backend abstraction     │
│   - Model format parsing            │
│   - Inference engine                │
└──────────────┬──────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───▼────┐         ┌─────▼─────┐
│  NPU   │         │   GPU     │
│ Plugin │         │  Plugin   │
└────────┘         └───────────┘
    │                     │
┌───▼─────────────────────▼─────┐
│   CPU Plugin (Fallback)        │
└────────────────────────────────┘

Core Module Details

1. Compute Backend Abstraction Layer

Function: Unified management of different compute backends (NPU, GPU, CPU)

Design characteristics:

  • Plugin-based architecture, easy to extend
  • Automatically selects the best backend
  • Priority: NPU > GPU > CPU
  • Supports backend switching and fallback

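The selection policy above can be sketched in a few lines. This is not NexaSDK's actual internal code, just the decision logic it implies: probe each backend in priority order and fall back down the chain.

```python
# Illustrative sketch of the NPU > GPU > CPU selection policy described
# above -- hypothetical, not the SDK's internals.
from typing import Dict, Tuple

PRIORITY: Tuple[str, ...] = ("npu", "gpu", "cpu")

def select_backend(available: Dict[str, bool]) -> str:
    """Return the highest-priority backend reported as available."""
    for name in PRIORITY:
        if available.get(name, False):
            return name
    raise RuntimeError("no compute backend available")

# A device with no NPU but a usable GPU falls back to the GPU plugin:
print(select_backend({"npu": False, "gpu": True, "cpu": True}))  # gpu
```
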
Supported NPUs:

  • Qualcomm Hexagon NPU (Snapdragon)
  • Apple Neural Engine (iOS/macOS)
  • Intel/AMD NPU (Windows)

2. Model Format Support

GGUF format:

  • Widely-used quantization format
  • Supports multiple quantization levels
  • Compatible with the llama.cpp ecosystem

MLX format:

  • Apple MLX framework format
  • Optimized for Apple Silicon
  • Supports macOS and iOS

NEXA format:

  • NexaSDK native format
  • Optimized for NPU
  • Better performance and compatibility

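A tiny sketch of how these three formats might be told apart by file extension (hypothetical; real loaders typically validate file headers, not just names):

```python
from pathlib import Path

# Map of the three formats listed above. Extension-based detection is a
# simplification of what a real format parser would do.
FORMATS = {".gguf": "GGUF", ".mlx": "MLX", ".nexa": "NEXA"}

def detect_format(model_path: str) -> str:
    ext = Path(model_path).suffix.lower()
    try:
        return FORMATS[ext]
    except KeyError:
        raise ValueError(f"unrecognized model format: {ext!r}") from None

print(detect_format("models/OmniNeural-4B/files-1-1.nexa"))  # NEXA
```
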
3. Multimodal Capabilities

LLM (Large Language Models):

  • Text generation and conversation
  • Streaming output support
  • Function Calling support

VLM (Vision Language Models):

  • Image understanding and generation
  • Multimodal conversation
  • Visual question answering

ASR (Automatic Speech Recognition):

  • Speech to text
  • Supports multiple audio formats
  • Real-time recognition support

OCR (Optical Character Recognition):

  • Text recognition from images
  • Multi-language support
  • High-precision recognition

Other capabilities:

  • Rerank: Text reranking
  • Object Detection: Object detection
  • Image Generation: Image generation
  • Embedding: Vector embeddings

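Embeddings feed directly into similarity search. A generic, SDK-free example of comparing two embedding vectors (the vectors here are stand-ins for real model output):

```python
import math
from typing import Sequence

# Cosine similarity between two embedding vectors -- the usual downstream
# operation once an Embedding model has produced them.
def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm

print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0
```
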
4. Platform-specific Implementations

PC platform (Python/C++):

  • Python SDK: Easy to use
  • C++ SDK: High performance
  • Supports Windows, macOS, Linux

Android platform:

  • Kotlin SDK
  • Supports NPU (Snapdragon 8 Gen 4+)
  • GPU and CPU fallback support
  • Minimum SDK 27

iOS platform:

  • Swift SDK
  • Supports Apple Neural Engine
  • iOS 17.0+ / macOS 15.0+
  • Swift 5.9+

Linux/IoT platform:

  • Docker image
  • Supports Arm64 and x86
  • Supports Qualcomm Dragonwing IQ9
  • Suitable for edge computing scenarios

Key Technical Implementation

1. NPU Acceleration Optimization

Challenge: Different vendors' NPU architectures vary significantly

Solution:

  • Unified NPU abstraction layer
  • Optimized implementations for different NPUs
  • Automatic NPU detection and selection
  • Performance monitoring and tuning

2. Model Format Conversion

Challenge: Supporting multiple model formats requires unified handling

Solution:

  • Format parser abstraction
  • Unified model representation
  • Format conversion tools
  • Caching mechanism optimization

3. Cross-platform Compatibility

Challenge: Different platforms have different APIs and constraints

Solution:

  • Platform abstraction layer
  • Conditional compilation
  • Unified configuration interface
  • Platform-specific optimizations

4. Memory Management

Challenge: On-device memory is limited, needs efficient management

Solution:

  • Smart memory allocation
  • Model quantization support
  • Memory pool management
  • Timely resource release

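Quantization is the main memory lever here. A back-of-envelope estimate of weight memory (this ignores KV cache, activations, and runtime overhead, so treat it as a lower bound):

```python
# Rough weight-memory estimate: bytes ~= parameters * bits_per_weight / 8.
# Lower bound only -- KV cache and runtime overhead come on top.
GIB = 1024 ** 3

def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / GIB

# A 4B-parameter model at 4-bit quantization needs roughly 1.9 GiB of weights:
print(f"{weight_memory_gib(4, 4):.2f} GiB")  # 1.86 GiB
```

This is why 4-bit quantization matters on phones: the same 4B model at 16-bit would need about four times the memory.
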
Extension Mechanisms

1. Adding New Compute Backends

Implement the ComputeBackend plugin interface:

type ComputeBackend interface {
    Name() string
    IsAvailable() bool
    LoadModel(config ModelConfig) error
    Generate(input Input, config GenerationConfig) (Output, error)
}

2. Adding New Model Formats

Implement the FormatParser interface:

type FormatParser interface {
    CanParse(path string) bool
    Parse(path string) (*Model, error)
    Optimize(model *Model, target Backend) error
}

3. Adding New AI Capabilities

Implement the Capability interface:

type Capability interface {
    Name() string
    SupportedModels() []string
    Process(input Input, config Config) (Output, error)
}

Project Resources

Official Resources

  • GitHub repository: https://github.com/NexaAI/nexa-sdk
  • Documentation: docs.nexa.ai
  • Community: Discord and Slack

Who Should Use This

Highly Recommended:

  • Developers building on-device AI applications
  • Mobile app developers wanting to leverage NPU acceleration
  • Developers needing to run AI models on IoT devices
  • Researchers interested in on-device AI performance optimization

Also Suitable For:

  • Students wanting to learn on-device AI implementation
  • Architects needing to evaluate different AI frameworks
  • Technical professionals interested in NPU acceleration

Feel free to visit my personal homepage for more useful knowledge and interesting products.
