Introduction
"What if the latest AI models could run on your phone, on IoT devices, even on edge devices — without needing to rely on the cloud?"
This is Part 8 of the "Open Source Project of the Day" series. Today we explore NexaSDK (GitHub).
Imagine running the Qwen3-VL multimodal model on an Android phone, using Apple Neural Engine for speech recognition on an iOS device, and running the Granite-4 model on a Linux IoT device — all without connecting to the cloud. That's the revolutionary experience NexaSDK delivers — bringing frontier AI models truly "down to earth" on all kinds of devices.
Why this project?
- 🚀 NPU-first: The industry's first NPU-first on-device AI runtime
- 📱 Full platform support: PC, Android, iOS, Linux/IoT all covered
- 🎯 Day-0 model support: Supports newly released models (GGUF, MLX, NEXA formats)
- 🔌 Multimodal capabilities: LLM, VLM, ASR, OCR, Rerank, image generation, and more
- 🌟 Community recognized: 7.6k+ Stars, collaborates with Qualcomm on on-device AI competitions
What You'll Learn
- Core concepts and architecture design of NexaSDK
- How to run on-device AI models on various platforms
- Support and usage of NPU, GPU, and CPU compute backends
- Integration and use of multimodal AI capabilities
- Comparative analysis with other on-device AI frameworks
- How to get started building on-device AI applications with NexaSDK
Prerequisites
- Basic understanding of LLMs and AI models
- Familiarity with at least one programming language (Python, Go, Kotlin, Swift)
- Understanding of on-device AI basics (optional)
- Basic knowledge of hardware acceleration like NPU, GPU (optional)
Project Background
Project Introduction
NexaSDK is a cross-platform on-device AI runtime supporting frontier LLM and VLM models on GPU, NPU, and CPU. It provides comprehensive runtime coverage for PC (Python/C++), mobile (Android & iOS), and Linux/IoT (Arm64 & x86 Docker) platforms.
Core problems the project solves:
- On-device AI runtimes are fragmented, requiring different solutions for different platforms
- Lack of native NPU support, unable to fully utilize hardware acceleration
- After new model releases, on-device support lags (no day-0 support)
- Multimodal AI capabilities are difficult to integrate on-device
- Cross-platform development costs are high, requiring separate implementations per platform
Target user groups:
- Developers building on-device AI applications
- Mobile app developers wanting to leverage NPU acceleration
- Developers needing to run AI models on IoT devices
- Researchers interested in on-device AI
Author/Team Introduction
Team: NexaAI
- Background: Team focused on on-device AI solutions
- Partners: Collaborates with Qualcomm to host on-device AI competitions
- Contributors: 45 contributors including @RemiliaForever, @zhiyuan8, @mengshengwu, and others
- Philosophy: Enable frontier AI models to run efficiently on all kinds of devices
Project created: 2024 (inferred from GitHub commit history)
Project Stats
- ⭐ GitHub Stars: 7.6k+ (rapidly and continuously growing)
- 🍴 Forks: 944+
- 📦 Version: v0.2.71 (latest version, released January 22, 2026)
- 📄 License: Apache-2.0 (CPU/GPU components); NPU components require a license
- 🌐 Website: docs.nexa.ai
- 📚 Documentation: comprehensive official docs
- 💬 Community: Active Discord and Slack communities
- 🏆 Competition: Nexa × Qualcomm On-Device AI Competition ($6,500 prize)
Project development history:
- 2024: Project launched, initial version released
- 2024-2025: Rapid development, multi-platform support added
- 2025: NPU support refined, collaboration with Qualcomm
- 2026: Continuous iteration, more models and feature support added
Supported models:
- OpenAI GPT-OSS
- IBM Granite-4
- Qwen-3-VL
- Gemma-3n
- Ministral-3
- And many more frontier models
Main Features
Core Purpose
NexaSDK's core purpose is to provide a unified cross-platform on-device AI runtime, enabling developers to:
- Run AI models on multiple devices: PC, phones, and IoT devices all covered
- Fully utilize hardware acceleration: Automatic selection from NPU, GPU, or CPU backends
- Quickly integrate new models: Day-0 support, use new models as soon as they're released
- Multimodal AI capabilities: Comprehensive support for text, images, audio, video, and more
- Simplify development: Unified API, one codebase for all platforms
Use Cases
- Mobile AI applications
  - Smart assistants on phones
  - Offline speech recognition and translation
  - Image recognition and processing
  - Local LLM chat applications
- IoT and edge computing
  - AI capabilities for smart home devices
  - Intelligent analysis for industrial IoT
  - AI inference on edge servers
  - Perception capabilities for autonomous vehicles
- Desktop application integration
  - Local AI assistants
  - Intelligent document processing
  - Code generation tools
  - Creative content generation
- Enterprise applications
  - Data privacy protection (local processing)
  - Offline AI capabilities
  - Reduced cloud costs
  - Real-time response requirements
- Research and development
  - Model performance testing
  - Hardware acceleration research
  - New model validation
  - Algorithm optimization experiments
Quick Start
CLI Method (Simplest)
```bash
# Install the Nexa CLI
# Windows (x64 with Intel/AMD NPU): download nexa-cli_windows_x86_64.exe
# macOS (x64): download nexa-cli_macos_x86_64.pkg
# Linux (ARM64):
curl -L https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_arm64.sh | bash

# Run your first model
nexa infer ggml-org/Qwen3-1.7B-GGUF

# Multimodal: drag and drop an image into the CLI
nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF

# NPU support (Windows arm64 with Snapdragon X Elite)
nexa infer NexaAI/OmniNeural-4B
```
Python SDK
```bash
# Install the Python SDK
pip install nexaai
```

```python
from nexaai import LLM, GenerationConfig, ModelConfig, LlmChatMessage

# Create an LLM instance
llm = LLM.from_(model="NexaAI/Qwen3-0.6B-GGUF", config=ModelConfig())

# Build the conversation
conversation = [
    LlmChatMessage(role="user", content="Hello, tell me a joke")
]
prompt = llm.apply_chat_template(conversation)

# Streaming generation
for token in llm.generate_stream(prompt, GenerationConfig(max_tokens=100)):
    print(token, end="", flush=True)
```
Android SDK
```kotlin
// Add to build.gradle.kts
dependencies {
    implementation("ai.nexa:core:0.0.19")
}
```

```kotlin
// Initialize the SDK
NexaSdk.getInstance().init(this)

// Load and run a model
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "omni-neural",
            model_path = "/data/data/your.app/files/models/OmniNeural-4B/files-1-1.nexa",
            plugin_id = "npu",
            config = ModelConfig()
        )
    )
    .build()
    .onSuccess { vlm ->
        vlm.generateStreamFlow("Hello!", GenerationConfig()).collect { print(it) }
    }
```
iOS SDK
```swift
import NexaSdk

// Example: speech recognition on the Apple Neural Engine
let asr = try Asr(plugin: .ane)
try await asr.load(from: modelURL)
let result = try await asr.transcribe(options: .init(audioPath: "audio.wav"))
print(result.asrResult.transcript)
```
Linux Docker
```bash
# Pull the image
docker pull nexa4ai/nexasdk:latest

# Run (NPU models require a token)
export NEXA_TOKEN="your_token_here"
docker run --rm -it --privileged \
  -e NEXA_TOKEN \
  nexa4ai/nexasdk:latest infer NexaAI/Granite-4.0-h-350M-NPU
```
Core Features
- NPU-first support
  - Industry's first NPU-first on-device AI runtime
  - Supports Qualcomm Hexagon NPU
  - Supports Apple Neural Engine (ANE)
  - Supports Intel/AMD NPU
  - Significantly improves performance and energy efficiency
- Full-platform runtime
  - PC: Python/C++ SDK
  - Android: Kotlin SDK with NPU/GPU/CPU support
  - iOS: Swift SDK with ANE support
  - Linux/IoT: Docker image for Arm64 & x86
- Day-0 model support
  - Runs newly released models from day one
  - Multiple model formats: GGUF, MLX, NEXA
  - Brings new models on-device quickly
- Multimodal AI capabilities
  - LLM: large language models
  - VLM: vision-language models
  - ASR: automatic speech recognition
  - OCR: optical character recognition
  - Rerank: text reranking
  - Object detection
  - Image generation
  - Embedding: vector embeddings
- Unified API interface
  - OpenAI-compatible API
  - Function calling support
  - Streaming generation support
  - Unified configuration interface
- Model format support
  - GGUF: widely used quantization format
  - MLX: Apple MLX framework format
  - NEXA: NexaSDK's native format
- Hardware acceleration optimization
  - Automatically selects the best compute backend
  - Priority: NPU > GPU > CPU
  - Optimizations tailored to different hardware
- Developer-friendly
  - Run a model with one line of code
  - Detailed documentation and examples
  - Active community support
  - Rich cookbook
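Because the API is advertised as OpenAI-compatible, a locally served model can in principle be queried with any generic HTTP client. The sketch below shows the expected wire format of a chat-completion request; the base URL, port, and path are illustrative assumptions for this sketch, not values taken from the NexaSDK docs.

```python
import json
from urllib import request

def chat(prompt: str, base_url: str = "http://localhost:8080/v1") -> dict:
    """POST a chat completion to an OpenAI-compatible endpoint.

    NOTE: base_url is an assumption for illustration; check the NexaSDK
    docs for the actual server address of a locally running model.
    """
    payload = {
        "model": "NexaAI/Qwen3-0.6B-GGUF",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Show the request body without actually sending it:
example = {
    "model": "NexaAI/Qwen3-0.6B-GGUF",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}
print(json.dumps(example, indent=2))
```

The payload shape (a `model` name plus a `messages` array) is what makes existing OpenAI client code portable to a local runtime.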
Project Advantages
Compared to other on-device AI frameworks, NexaSDK's advantages:
| Comparison | NexaSDK | Ollama | llama.cpp | LM Studio |
|---|---|---|---|---|
| NPU support | ⭐⭐⭐⭐⭐ NPU-first | ❌ Not supported | ❌ Not supported | ❌ Not supported |
| Android/iOS SDK | ⭐⭐⭐⭐⭐ Full support | ⚠️ Partial support | ⚠️ Partial support | ❌ Not supported |
| Linux Docker | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ❌ Not supported |
| Day-0 model support | ⭐⭐⭐⭐⭐ GGUF/MLX/NEXA | ❌ Lags | ⚠️ Partial support | ❌ Lags |
| Multimodal support | ⭐⭐⭐⭐⭐ Full support | ⚠️ Partial support | ⚠️ Partial support | ⚠️ Partial support |
| Cross-platform | ⭐⭐⭐⭐⭐ All platforms | ⚠️ Some platforms | ⚠️ Some platforms | ⚠️ Some platforms |
| One-line execution | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⚠️ Needs config | ⭐⭐⭐⭐⭐ Supported |
| OpenAI API compat | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported |
Why choose NexaSDK?
- ✅ NPU-first: Fully leverages hardware acceleration for best performance and energy efficiency
- ✅ Full-platform support: One SDK covers all platforms, reduces development costs
- ✅ Day-0 support: Use new models as soon as they're released, no waiting
- ✅ Multimodal capabilities: Complete AI capability stack for all needs
- ✅ Developer-friendly: Simple API, rich documentation and examples
Detailed Project Analysis
Architecture Design
NexaSDK adopts a layered architecture whose core is a unified runtime abstraction layer:
```
┌─────────────────────────────────────┐
│         Application Layer           │
│  - CLI / Python / Android / iOS     │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│           SDK Layer                 │
│  - Unified API interface            │
│  - Model loading and management     │
│  - Configuration and optimization   │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│         Runtime Layer               │
│  - Compute backend abstraction      │
│  - Model format parsing             │
│  - Inference engine                 │
└──────────────┬──────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───▼────┐          ┌─────▼─────┐
│  NPU   │          │   GPU     │
│ Plugin │          │  Plugin   │
└────────┘          └───────────┘
    │                     │
┌───▼─────────────────────▼─────┐
│      CPU Plugin (Fallback)    │
└───────────────────────────────┘
```
Core Module Details
1. Compute Backend Abstraction Layer
Function: Unified management of different compute backends (NPU, GPU, CPU)
Design characteristics:
- Plugin-based architecture, easy to extend
- Automatically selects the best backend
- Priority: NPU > GPU > CPU
- Supports backend switching and fallback
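The NPU > GPU > CPU priority with fallback described above amounts to scanning backends in priority order and taking the first one that is available. A minimal sketch (the availability lambdas are stand-ins; a real runtime would query vendor drivers such as QNN, Metal, or Vulkan):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Backend:
    name: str
    is_available: Callable[[], bool]  # probe, e.g. a driver query

def select_backend(backends: list[Backend]) -> Backend:
    """Return the first available backend; list order encodes priority."""
    for b in backends:
        if b.is_available():
            return b
    raise RuntimeError("no compute backend available")

# Descending priority, mirroring NPU > GPU > CPU:
backends = [
    Backend("npu", lambda: False),  # pretend no NPU on this machine
    Backend("gpu", lambda: False),
    Backend("cpu", lambda: True),   # CPU is the universal fallback
]
print(select_backend(backends).name)  # cpu
```

Keeping the priority in the list order (rather than in the backends themselves) makes it trivial to override per platform.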
Supported NPUs:
- Qualcomm Hexagon NPU (Snapdragon)
- Apple Neural Engine (iOS/macOS)
- Intel/AMD NPU (Windows)
2. Model Format Support
GGUF format:
- Widely-used quantization format
- Supports multiple quantization levels
- Compatible with the llama.cpp ecosystem
MLX format:
- Apple MLX framework format
- Optimized for Apple Silicon
- Supports macOS and iOS
NEXA format:
- NexaSDK native format
- Optimized for NPU
- Better performance and compatibility
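One way to picture the format dispatch implied by this list is a suffix-based lookup. This is a hypothetical sketch, not NexaSDK's actual loader; in particular the `.mlx` entry is a placeholder, since real MLX models typically ship as directories of weights, and a production loader would also check file headers rather than trusting the extension:

```python
from pathlib import Path

# Illustrative suffix-to-format mapping (assumed, not from the SDK):
SUFFIX_TO_FORMAT = {
    ".gguf": "GGUF",  # llama.cpp-ecosystem quantized models
    ".nexa": "NEXA",  # NexaSDK's NPU-optimized native format
    ".mlx":  "MLX",   # placeholder; MLX models are usually folders
}

def detect_format(path: str) -> str:
    """Map a model file path to a format name via its suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"unrecognized model format: {path}") from None

print(detect_format("models/Qwen3-0.6B.gguf"))  # GGUF
```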
3. Multimodal Capabilities
LLM (Large Language Models):
- Text generation and conversation
- Streaming output support
- Function Calling support
VLM (Vision Language Models):
- Image understanding and generation
- Multimodal conversation
- Visual question answering
ASR (Automatic Speech Recognition):
- Speech to text
- Supports multiple audio formats
- Real-time recognition support
OCR (Optical Character Recognition):
- Text recognition from images
- Multi-language support
- High-precision recognition
Other capabilities:
- Rerank: text reranking
- Object detection
- Image generation
- Embedding: vector embeddings
4. Platform-specific Implementations
PC platform (Python/C++):
- Python SDK: Easy to use
- C++ SDK: High performance
- Supports Windows, macOS, Linux
Android platform:
- Kotlin SDK
- Supports NPU (Snapdragon 8 Gen 4+)
- GPU and CPU fallback support
- Minimum SDK 27
iOS platform:
- Swift SDK
- Supports Apple Neural Engine
- iOS 17.0+ / macOS 15.0+
- Swift 5.9+
Linux/IoT platform:
- Docker image
- Supports Arm64 and x86
- Supports Qualcomm Dragonwing IQ9
- Suitable for edge computing scenarios
Key Technical Implementation
1. NPU Acceleration Optimization
Challenge: Different vendors' NPU architectures vary significantly
Solution:
- Unified NPU abstraction layer
- Optimized implementations for different NPUs
- Automatic NPU detection and selection
- Performance monitoring and tuning
2. Model Format Conversion
Challenge: Supporting multiple model formats requires unified handling
Solution:
- Format parser abstraction
- Unified model representation
- Format conversion tools
- Caching mechanism optimization
3. Cross-platform Compatibility
Challenge: Different platforms have different APIs and constraints
Solution:
- Platform abstraction layer
- Conditional compilation
- Unified configuration interface
- Platform-specific optimizations
4. Memory Management
Challenge: On-device memory is limited, needs efficient management
Solution:
- Smart memory allocation
- Model quantization support
- Memory pool management
- Timely resource release
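The "timely resource release" point can be illustrated by binding a model's lifetime to a scope, so device memory is freed deterministically even if inference raises. `FakeModel` is a stand-in for this sketch only, not a NexaSDK class:

```python
from contextlib import contextmanager

class FakeModel:
    """Stand-in for a loaded on-device model; tracks whether its
    (scarce) device memory has been released."""
    def __init__(self, name: str):
        self.name = name
        self.released = False

    def release(self):
        self.released = True

@contextmanager
def loaded(name: str):
    # Load on enter, release unconditionally on exit: on-device memory
    # is too scarce to wait for a garbage collector.
    model = FakeModel(name)
    try:
        yield model
    finally:
        model.release()

with loaded("Qwen3-0.6B") as m:
    pass  # run inference here
print(m.released)  # True
```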
Extension Mechanisms
1. Adding New Compute Backends
Implement the Plugin interface:
```go
type ComputeBackend interface {
    Name() string
    IsAvailable() bool
    LoadModel(config ModelConfig) error
    Generate(input Input, config GenerationConfig) (Output, error)
}
```
2. Adding New Model Formats
Implement the FormatParser interface:
```go
type FormatParser interface {
    CanParse(path string) bool
    Parse(path string) (*Model, error)
    Optimize(model *Model, target Backend) error
}
```
3. Adding New AI Capabilities
Implement the Capability interface:
```go
type Capability interface {
    Name() string
    SupportedModels() []string
    Process(input Input, config Config) (Output, error)
}
```
Project Resources
Official Resources
- 🌟 GitHub: https://github.com/NexaAI/nexa-sdk
- 🌐 Website: https://docs.nexa.ai
- 📚 Documentation: comprehensive official docs
Who Should Use This
Highly Recommended:
- Developers building on-device AI applications
- Mobile app developers wanting to leverage NPU acceleration
- Developers needing to run AI models on IoT devices
- Researchers interested in on-device AI performance optimization
Also Suitable For:
- Students wanting to learn on-device AI implementation
- Architects needing to evaluate different AI frameworks
- Technical professionals interested in NPU acceleration
Feel free to visit my personal homepage for more useful knowledge and interesting products.