Introduction
"What if the latest AI models could run on your phone, on IoT devices, even on edge devices — without needing to rely on the cloud?"
This is Part 8 of the "Open Source Project of the Day" series. Today we explore NexaSDK (GitHub).
Imagine running the Qwen3-VL multimodal model on an Android phone, using Apple Neural Engine for speech recognition on an iOS device, and running the Granite-4 model on a Linux IoT device — all without connecting to the cloud. That's the revolutionary experience NexaSDK delivers — bringing frontier AI models truly "down to earth" on all kinds of devices.
Why this project?
- 🚀 NPU-first: The industry's first NPU-first on-device AI runtime
- 📱 Full platform support: PC, Android, iOS, Linux/IoT all covered
- 🎯 Day-0 model support: Supports newly released models (GGUF, MLX, NEXA formats)
- 🔌 Multimodal capabilities: LLM, VLM, ASR, OCR, Rerank, image generation, and more
- 🌟 Community recognized: 7.6k+ Stars, collaborates with Qualcomm on on-device AI competitions
What You'll Learn
- Core concepts and architecture design of NexaSDK
- How to run on-device AI models on various platforms
- Support and usage of NPU, GPU, and CPU compute backends
- Integration and use of multimodal AI capabilities
- Comparative analysis with other on-device AI frameworks
- How to get started building on-device AI applications with NexaSDK
Prerequisites
- Basic understanding of LLMs and AI models
- Familiarity with at least one programming language (Python, Go, Kotlin, Swift)
- Understanding of on-device AI basics (optional)
- Basic knowledge of hardware acceleration like NPU, GPU (optional)
Project Background
Project Introduction
NexaSDK is a cross-platform on-device AI runtime supporting frontier LLM and VLM models on GPU, NPU, and CPU. It provides comprehensive runtime coverage for PC (Python/C++), mobile (Android & iOS), and Linux/IoT (Arm64 & x86 Docker) platforms.
Core problems the project solves:
- On-device AI runtimes are fragmented, requiring different solutions for different platforms
- Lack of native NPU support, unable to fully utilize hardware acceleration
- After new model releases, on-device support lags (no day-0 support)
- Multimodal AI capabilities are difficult to integrate on-device
- Cross-platform development costs are high, requiring separate implementations per platform
Target user groups:
- Developers building on-device AI applications
- Mobile app developers wanting to leverage NPU acceleration
- Developers needing to run AI models on IoT devices
- Researchers interested in on-device AI
Author/Team Introduction
Team: NexaAI
- Background: Team focused on on-device AI solutions
- Partners: Collaborates with Qualcomm to host on-device AI competitions
- Contributors: 45 contributors including @RemiliaForever, @zhiyuan8, @mengshengwu, and others
- Philosophy: Enable frontier AI models to run efficiently on all kinds of devices
Project created: 2024 (inferred from GitHub commit history)
Project Stats
- ⭐ GitHub Stars: 7.6k+ (rapidly and continuously growing)
- 🍴 Forks: 944+
- 📦 Version: v0.2.71 (latest version, released January 22, 2026)
- 📄 License: Apache-2.0 (CPU/GPU components); NPU components require a license
- 🌐 Website: docs.nexa.ai
- 📚 Documentation: comprehensive official docs
- 💬 Community: Active Discord and Slack communities
- 🏆 Competition: Nexa × Qualcomm On-Device AI Competition ($6,500 prize)
Project development history:
- 2024: Project launched, initial version released
- 2024-2025: Rapid development, multi-platform support added
- 2025: NPU support refined, collaboration with Qualcomm
- 2026: Continuous iteration, more models and feature support added
Supported models:
- OpenAI GPT-OSS
- IBM Granite-4
- Qwen-3-VL
- Gemma-3n
- Ministral-3
- And many more frontier models
Main Features
Core Purpose
NexaSDK's core purpose is to provide a unified cross-platform on-device AI runtime, enabling developers to:
- Run AI models on multiple devices: PC, phones, and IoT devices all covered
- Fully utilize hardware acceleration: Automatic selection from NPU, GPU, or CPU backends
- Quickly integrate new models: Day-0 support, use new models as soon as they're released
- Multimodal AI capabilities: Comprehensive support for text, images, audio, video, and more
- Simplify development: Unified API, one codebase for all platforms
Use Cases
- Mobile AI applications
  - Smart assistants on phones
  - Offline speech recognition and translation
  - Image recognition and processing
  - Local LLM chat applications
- IoT and edge computing
  - AI capabilities for smart home devices
  - Intelligent analysis for industrial IoT
  - AI inference on edge servers
  - Perception capabilities for autonomous vehicles
- Desktop application integration
  - Local AI assistants
  - Intelligent document processing
  - Code generation tools
  - Creative content generation
- Enterprise applications
  - Data privacy protection (local processing)
  - Offline AI capabilities
  - Reduced cloud costs
  - Real-time response requirements
- Research and development
  - Model performance testing
  - Hardware acceleration research
  - New model validation
  - Algorithm optimization experiments
Quick Start
CLI Method (Simplest)
```bash
# Install the Nexa CLI
# Windows (x64 with Intel/AMD NPU): download nexa-cli_windows_x86_64.exe
# macOS (x64): download nexa-cli_macos_x86_64.pkg
# Linux (ARM64):
curl -L https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_arm64.sh | bash

# Run your first model
nexa infer ggml-org/Qwen3-1.7B-GGUF

# Multimodal: drag and drop an image into the CLI
nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF

# NPU support (Windows arm64 with Snapdragon X Elite)
nexa infer NexaAI/OmniNeural-4B
```
Python SDK
```bash
# Install the Python SDK
pip install nexaai
```

```python
from nexaai import LLM, GenerationConfig, ModelConfig, LlmChatMessage

# Create an LLM instance
llm = LLM.from_(model="NexaAI/Qwen3-0.6B-GGUF", config=ModelConfig())

# Build the conversation
conversation = [
    LlmChatMessage(role="user", content="Hello, tell me a joke")
]
prompt = llm.apply_chat_template(conversation)

# Streaming generation
for token in llm.generate_stream(prompt, GenerationConfig(max_tokens=100)):
    print(token, end="", flush=True)
```
Android SDK
```kotlin
// Add to build.gradle.kts
dependencies {
    implementation("ai.nexa:core:0.0.19")
}
```

```kotlin
// Initialize the SDK
NexaSdk.getInstance().init(this)

// Load and run a model
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "omni-neural",
            model_path = "/data/data/your.app/files/models/OmniNeural-4B/files-1-1.nexa",
            plugin_id = "npu",
            config = ModelConfig()
        )
    )
    .build()
    .onSuccess { vlm ->
        vlm.generateStreamFlow("Hello!", GenerationConfig()).collect { print(it) }
    }
```
iOS SDK
```swift
import NexaSdk

// Example: speech recognition on the Apple Neural Engine
let asr = try Asr(plugin: .ane)
try await asr.load(from: modelURL)
let result = try await asr.transcribe(options: .init(audioPath: "audio.wav"))
print(result.asrResult.transcript)
```
Linux Docker
```bash
# Pull the image
docker pull nexa4ai/nexasdk:latest

# Run (NPU models require a token)
export NEXA_TOKEN="your_token_here"
docker run --rm -it --privileged \
  -e NEXA_TOKEN \
  nexa4ai/nexasdk:latest infer NexaAI/Granite-4.0-h-350M-NPU
```
Core Features
- NPU-first support
  - Industry's first NPU-first on-device AI runtime
  - Supports Qualcomm Hexagon NPU
  - Supports Apple Neural Engine (ANE)
  - Supports Intel/AMD NPU
  - Significantly improves performance and energy efficiency
- Full-platform runtime
  - PC: Python/C++ SDK
  - Android: Kotlin SDK with NPU/GPU/CPU support
  - iOS: Swift SDK with ANE support
  - Linux/IoT: Docker image for Arm64 & x86
- Day-0 model support
  - Runs newly released models from day one
  - Multiple model formats: GGUF, MLX, NEXA
  - Brings new models on-device quickly
- Multimodal AI capabilities
  - LLM: large language models
  - VLM: vision-language models
  - ASR: automatic speech recognition
  - OCR: optical character recognition
  - Rerank: text reranking
  - Object detection
  - Image generation
  - Embedding: vector embeddings
- Unified API interface
  - OpenAI-compatible API
  - Function calling support
  - Streaming generation support
  - Unified configuration interface
- Model format support
  - GGUF: widely used quantization format
  - MLX: Apple MLX framework format
  - NEXA: NexaSDK's native format
- Hardware acceleration optimization
  - Automatically selects the best compute backend
  - Priority: NPU > GPU > CPU
  - Optimizations tailored to different hardware
- Developer-friendly
  - Run a model with one line of code
  - Detailed documentation and examples
  - Active community support
  - Rich cookbook
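Because the API is advertised as OpenAI-compatible, a locally served model can in principle be queried with any generic HTTP client. The sketch below shows the expected wire format of a chat-completion request; the base URL, port, and path are illustrative assumptions for this sketch, not values taken from the NexaSDK docs.

```python
import json
from urllib import request

def chat(prompt: str, base_url: str = "http://localhost:8080/v1") -> dict:
    """POST a chat completion to an OpenAI-compatible endpoint.

    NOTE: base_url is an assumption for illustration; check the NexaSDK
    docs for the actual server address of a locally running model.
    """
    payload = {
        "model": "NexaAI/Qwen3-0.6B-GGUF",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Show the request body without actually sending it:
example = {
    "model": "NexaAI/Qwen3-0.6B-GGUF",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}
print(json.dumps(example, indent=2))
```

The payload shape (a `model` name plus a `messages` array) is what makes existing OpenAI client code portable to a local runtime.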
Project Advantages
Compared to other on-device AI frameworks, NexaSDK's advantages:
| Comparison | NexaSDK | Ollama | llama.cpp | LM Studio |
|---|---|---|---|---|
| NPU support | ⭐⭐⭐⭐⭐ NPU-first | ❌ Not supported | ❌ Not supported | ❌ Not supported |
| Android/iOS SDK | ⭐⭐⭐⭐⭐ Full support | ⚠️ Partial support | ⚠️ Partial support | ❌ Not supported |
| Linux Docker | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ❌ Not supported |
| Day-0 model support | ⭐⭐⭐⭐⭐ GGUF/MLX/NEXA | ❌ Lags | ⚠️ Partial support | ❌ Lags |
| Multimodal support | ⭐⭐⭐⭐⭐ Full support | ⚠️ Partial support | ⚠️ Partial support | ⚠️ Partial support |
| Cross-platform | ⭐⭐⭐⭐⭐ All platforms | ⚠️ Some platforms | ⚠️ Some platforms | ⚠️ Some platforms |
| One-line execution | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⚠️ Needs config | ⭐⭐⭐⭐⭐ Supported |
| OpenAI API compat | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported |
Why choose NexaSDK?
- ✅ NPU-first: Fully leverages hardware acceleration for best performance and energy efficiency
- ✅ Full-platform support: One SDK covers all platforms, reduces development costs
- ✅ Day-0 support: Use new models as soon as they're released, no waiting
- ✅ Multimodal capabilities: Complete AI capability stack for all needs
- ✅ Developer-friendly: Simple API, rich documentation and examples
Detailed Project Analysis
Architecture Design
NexaSDK adopts a layered architecture whose core is a unified runtime abstraction layer:
```
┌─────────────────────────────────────┐
│         Application Layer           │
│  - CLI / Python / Android / iOS     │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│           SDK Layer                 │
│  - Unified API interface            │
│  - Model loading and management     │
│  - Configuration and optimization   │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│         Runtime Layer               │
│  - Compute backend abstraction      │
│  - Model format parsing             │
│  - Inference engine                 │
└──────────────┬──────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───▼────┐          ┌─────▼─────┐
│  NPU   │          │   GPU     │
│ Plugin │          │  Plugin   │
└────────┘          └───────────┘
    │                     │
┌───▼─────────────────────▼─────┐
│      CPU Plugin (Fallback)    │
└───────────────────────────────┘
```
Core Module Details
1. Compute Backend Abstraction Layer
Function: Unified management of different compute backends (NPU, GPU, CPU)
Design characteristics:
- Plugin-based architecture, easy to extend
- Automatically selects the best backend
- Priority: NPU > GPU > CPU
- Supports backend switching and fallback
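The NPU > GPU > CPU priority with fallback described above amounts to scanning backends in priority order and taking the first one that is available. A minimal sketch (the availability lambdas are stand-ins; a real runtime would query vendor drivers such as QNN, Metal, or Vulkan):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Backend:
    name: str
    is_available: Callable[[], bool]  # probe, e.g. a driver query

def select_backend(backends: list[Backend]) -> Backend:
    """Return the first available backend; list order encodes priority."""
    for b in backends:
        if b.is_available():
            return b
    raise RuntimeError("no compute backend available")

# Descending priority, mirroring NPU > GPU > CPU:
backends = [
    Backend("npu", lambda: False),  # pretend no NPU on this machine
    Backend("gpu", lambda: False),
    Backend("cpu", lambda: True),   # CPU is the universal fallback
]
print(select_backend(backends).name)  # cpu
```

Keeping the priority in the list order (rather than in the backends themselves) makes it trivial to override per platform.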
Supported NPUs:
- Qualcomm Hexagon NPU (Snapdragon)
- Apple Neural Engine (iOS/macOS)
- Intel/AMD NPU (Windows)
2. Model Format Support
GGUF format:
- Widely-used quantization format
- Supports multiple quantization levels
- Compatible with the llama.cpp ecosystem
MLX format:
- Apple MLX framework format
- Optimized for Apple Silicon
- Supports macOS and iOS
NEXA format:
- NexaSDK native format
- Optimized for NPU
- Better performance and compatibility
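One way to picture the format dispatch implied by this list is a suffix-based lookup. This is a hypothetical sketch, not NexaSDK's actual loader; in particular the `.mlx` entry is a placeholder, since real MLX models typically ship as directories of weights, and a production loader would also check file headers rather than trusting the extension:

```python
from pathlib import Path

# Illustrative suffix-to-format mapping (assumed, not from the SDK):
SUFFIX_TO_FORMAT = {
    ".gguf": "GGUF",  # llama.cpp-ecosystem quantized models
    ".nexa": "NEXA",  # NexaSDK's NPU-optimized native format
    ".mlx":  "MLX",   # placeholder; MLX models are usually folders
}

def detect_format(path: str) -> str:
    """Map a model file path to a format name via its suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"unrecognized model format: {path}") from None

print(detect_format("models/Qwen3-0.6B.gguf"))  # GGUF
```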
3. Multimodal Capabilities
LLM (Large Language Models):
- Text generation and conversation
- Streaming output support
- Function Calling support
VLM (Vision Language Models):
- Image understanding and generation
- Multimodal conversation
- Visual question answering
ASR (Automatic Speech Recognition):
- Speech to text
- Supports multiple audio formats
- Real-time recognition support
OCR (Optical Character Recognition):
- Text recognition from images
- Multi-language support
- High-precision recognition
Other capabilities:
- Rerank: text reranking
- Object detection
- Image generation
- Embedding: vector embeddings
4. Platform-specific Implementations
PC platform (Python/C++):
- Python SDK: Easy to use
- C++ SDK: High performance
- Supports Windows, macOS, Linux
Android platform:
- Kotlin SDK
- Supports NPU (Snapdragon 8 Gen 4+)
- GPU and CPU fallback support
- Minimum SDK 27
iOS platform:
- Swift SDK
- Supports Apple Neural Engine
- iOS 17.0+ / macOS 15.0+
- Swift 5.9+
Linux/IoT platform:
- Docker image
- Supports Arm64 and x86
- Supports Qualcomm Dragonwing IQ9
- Suitable for edge computing scenarios
Key Technical Implementation
1. NPU Acceleration Optimization
Challenge: Different vendors' NPU architectures vary significantly
Solution:
- Unified NPU abstraction layer
- Optimized implementations for different NPUs
- Automatic NPU detection and selection
- Performance monitoring and tuning
2. Model Format Conversion
Challenge: Supporting multiple model formats requires unified handling
Solution:
- Format parser abstraction
- Unified model representation
- Format conversion tools
- Caching mechanism optimization
3. Cross-platform Compatibility
Challenge: Different platforms have different APIs and constraints
Solution:
- Platform abstraction layer
- Conditional compilation
- Unified configuration interface
- Platform-specific optimizations
4. Memory Management
Challenge: On-device memory is limited, needs efficient management
Solution:
- Smart memory allocation
- Model quantization support
- Memory pool management
- Timely resource release
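The "timely resource release" point can be illustrated by binding a model's lifetime to a scope, so device memory is freed deterministically even if inference raises. `FakeModel` is a stand-in for this sketch only, not a NexaSDK class:

```python
from contextlib import contextmanager

class FakeModel:
    """Stand-in for a loaded on-device model; tracks whether its
    (scarce) device memory has been released."""
    def __init__(self, name: str):
        self.name = name
        self.released = False

    def release(self):
        self.released = True

@contextmanager
def loaded(name: str):
    # Load on enter, release unconditionally on exit: on-device memory
    # is too scarce to wait for a garbage collector.
    model = FakeModel(name)
    try:
        yield model
    finally:
        model.release()

with loaded("Qwen3-0.6B") as m:
    pass  # run inference here
print(m.released)  # True
```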
Extension Mechanisms
1. Adding New Compute Backends
Implement the Plugin interface:
```go
type ComputeBackend interface {
    Name() string
    IsAvailable() bool
    LoadModel(config ModelConfig) error
    Generate(input Input, config GenerationConfig) (Output, error)
}
```
2. Adding New Model Formats
Implement the FormatParser interface:
```go
type FormatParser interface {
    CanParse(path string) bool
    Parse(path string) (*Model, error)
    Optimize(model *Model, target Backend) error
}
```
3. Adding New AI Capabilities
Implement the Capability interface:
```go
type Capability interface {
    Name() string
    SupportedModels() []string
    Process(input Input, config Config) (Output, error)
}
```
Project Resources
Official Resources
- 🌟 GitHub: https://github.com/NexaAI/nexa-sdk
- 🌐 Website: https://docs.nexa.ai
- 📚 Documentation: comprehensive official docs
Who Should Use This
Highly Recommended:
- Developers building on-device AI applications
- Mobile app developers wanting to leverage NPU acceleration
- Developers needing to run AI models on IoT devices
- Researchers interested in on-device AI performance optimization
Also Suitable For:
- Students wanting to learn on-device AI implementation
- Architects needing to evaluate different AI frameworks
- Technical professionals interested in NPU acceleration
Feel free to visit my personal homepage for more useful knowledge and interesting products.