Dual Launch! Mininglamp Technology Open-Sources Cider On-Device Inference Acceleration Framework and Mano-P On-Device Model

Mininglamp Technology has officially open-sourced its self-developed Cider inference acceleration SDK (Software Development Kit) and Mano-P, an on-device GUI agent model. Following the earlier open-sourcing of the Mano-CUA skill, the Mano-P release demonstrates the real-world potential of on-device models in business workflows. Meanwhile, the Cider framework tackles compute operators and hardware invocation at the foundational level, letting on-device large models run on local macOS hardware with greater efficiency and a lower memory footprint.

GitHub: Mano-P
GitHub: Cider SDK

Mano-P: Validating the Deployment Potential of On-Device Agents

Mano-P is Mininglamp Technology's self-developed on-device GUI-VLA agent model. It understands and operates graphical interfaces through pure vision, without relying on traditional API integrations and without being confined to browser scenarios: it can directly interact with desktop software, web-based systems, and more complex graphical workflows.

Complex graphical interface interactions inherently demand robust multimodal visual understanding capabilities from the model. The model must continuously process screenshots at high frequency, precisely locate minuscule UI elements, and execute subsequent actions based on visual feedback. Under traditional cloud-based large model architectures, the token cost incurred by such high-frequency visual interactions is extraordinarily high.

In contrast, the 4B-parameter Mano-P on-device model not only achieves accuracy comparable to cloud-based large models on CUA tasks but also completely eliminates the otherwise prohibitive cloud API call costs. In fully offline local mode, all application screenshots, interaction processes, and task data are strictly confined to the user's local device, making privacy protection a matter of "physical isolation" by design.

Cider: An On-Device Inference Acceleration Framework for Apple Silicon

The core metrics that truly determine the usability of on-device models are local inference speed, hardware utilization, memory footprint, integration cost, and long-term stability. If inference speed is too slow, the AI interaction experience suffers significantly; if memory usage is too high, the model becomes difficult to deploy widely on mainstream devices; if integration costs remain prohibitive, enterprises and developers struggle to rapidly incorporate on-device capabilities into their business pipelines.

Cider was born to address exactly these challenges. A self-developed, open-sourced SDK from Mininglamp Technology, Cider is built on the Apple MLX ecosystem and purpose-built for macOS and Apple Silicon. It fills the gaps in the native MLX framework around activation quantization and certain tensor computation paths, serving as an efficient on-device inference framework for the broad open-source model ecosystem.

Currently, the native Apple MLX architecture already supports weight quantization modes such as W4A16 and W8A16. Building upon this foundation, Cider further provides W8A8 and W4A8 inference paths. Through deep integration of online activation quantization, INT8 TensorOps computation, quantized matrix multiplication, and dequantization pipelines, Cider fully unleashes the underlying computational potential of Apple Silicon, enabling open-source models not merely to "run on Mac" but to operate smoothly with higher efficiency and lower memory consumption.
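
To make that pipeline concrete, here is a minimal NumPy sketch of the general W8A8 pattern: quantize activations online, multiply against pre-quantized INT8 weights with integer accumulation, then dequantize. The per-tensor symmetric scaling and the function names are illustrative assumptions, not Cider's actual fused Metal kernels.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization; returns values and scale."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_matmul(x, w_q, w_scale):
    """INT8 x INT8 matmul with INT32 accumulation, dequantized to float."""
    x_q, x_scale = quantize_int8(x)                       # online activation quantization
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T   # integer tensor op
    return acc.astype(np.float32) * (x_scale * w_scale)   # dequantization

# Weights are quantized once up front; activations are quantized per call.
w = np.random.randn(64, 128).astype(np.float32)
w_q, w_scale = quantize_int8(w)
x = np.random.randn(4, 128).astype(np.float32)
y = w8a8_matmul(x, w_q, w_scale)  # approximates x @ w.T
```

The speedup in such a scheme comes from the integer matmul mapping onto Apple Silicon's INT8 tensor hardware; the surrounding quantize and dequantize steps are cheap elementwise passes.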

In benchmark testing, Cider's operator speed in W8A8 mode is approximately 1.4x to 1.9x that of native MLX, with the exact gain varying by batch size. In W4A8 mode, Cider further cuts weight memory footprint by 50% compared to W8A8 while matching, in high-concurrency scenarios, the computation speed of native MLX's W4A16 path, which keeps activations at full precision.

For the Qwen3-VL series of mainstream vision-language models, Cider demonstrates highly significant acceleration in end-to-end prefill scenarios. Under varying prompt lengths, compared to native MLX W8A16 mode, Cider's W8A8 PC mode delivers approximately 17% to 22% prefill speed improvement for the Qwen3-VL-4B model; for the Qwen3-VL-2B model, this speedup leaps to approximately 57% to 61%.

Additionally, Cider has performed deep optimization and non-invasive fixes for technical challenges such as RoPE position handling in multi-image inference, substantially improving inference stability for complex visual tasks. Since visual interaction tasks typically require processing longer contexts, more complex screenshot information, and denser inference requests, this magnitude of performance improvement is particularly critical for on-device VLMs and GUI agents.
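
For intuition about why position handling matters here, the toy sketch below applies rotary position embedding (RoPE) with explicit per-token position ids, the bookkeeping that multi-image prompts stress: every text token and image patch must receive a consistent position id, or attention degrades. The position layout in the example is invented for illustration and is not Cider's actual fix.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, dim) at given positions."""
    seq, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,) rotation frequencies
    angles = positions[:, None] * inv_freq[None, :]    # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                    # rotate even/odd channel pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Hypothetical layout: text tokens 0-2, image 1 patches 3-5, text 6, image 2 patches 7-9.
positions = np.arange(10)
x = np.random.randn(10, 8)
x_rot = rope(x, positions)
```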

Furthermore, Cider actively explores heterogeneous collaboration between the Apple Neural Engine and GPU on the M4 chip. For a long time, on-device large model inference has primarily relied on GPUs, while the potential of the Neural Engine in Apple chips has remained largely untapped. By introducing an ANE+GPU heterogeneous tensor parallelism mechanism, Cider enables both types of compute units to work in concert, achieving an additional approximately 3% to 16% acceleration in certain test scenarios.
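
As a rough mental model only: tensor parallelism shards a layer's weights across compute units, runs the shards concurrently, and merges the partial outputs. The sketch below captures just that split-and-merge arithmetic with a thread pool; actual ANE dispatch goes through Apple's Core ML and Metal stacks and is not shown.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def sharded_linear(x, w, n_shards=2):
    """Column-shard a linear layer across n_shards workers and merge results."""
    shards = np.array_split(w, n_shards, axis=0)        # split the output dimension
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        parts = list(pool.map(lambda w_s: x @ w_s.T, shards))
    return np.concatenate(parts, axis=-1)               # merge partial outputs

x = np.random.randn(4, 128)
w = np.random.randn(64, 128)
assert np.allclose(sharded_linear(x, w), x @ w.T)       # same result as the full matmul
```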

Minimal Integration, Enabling Local Acceleration for More Open-Source Models

Cider supports a wide range of open-source LLMs, including Qwen, Llama, and Mistral, as well as VLMs such as Qwen3-VL, and ships with a built-in OpenAI-compatible VLM inference service. Enterprises and developers do not need to rewrite model architectures; integration requires only minimal code adaptation.
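
As an illustration of what that integration can look like on the client side, the hypothetical snippet below calls an OpenAI-compatible endpoint with the official openai Python package. The base_url, port, and model id are assumptions; consult the Cider repository for the actual serving setup.

```python
import base64
from openai import OpenAI

# Point the standard OpenAI client at the local Cider service (URL is assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3-vl-4b",  # assumed model id; use whatever the server registers
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Locate the Submit button in this screenshot."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```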

During the prefill phase, Cider can enable W8A8 INT8 TensorOps to dramatically boost computation speed; during the decode phase, the framework intelligently falls back to the original weight path, avoiding unnecessary overhead.
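
A minimal sketch of that phase-dependent dispatch, reusing the w8a8_matmul helper from the earlier W8A8 sketch; the token-count heuristic and names are illustrative, not Cider's internals.

```python
def linear_forward(x, w_fp, w_q, w_scale):
    """Choose the kernel by workload shape: many tokens at once means prefill."""
    if x.shape[0] > 1:                          # prefill: batched tokens
        return w8a8_matmul(x, w_q, w_scale)     # INT8 TensorOps path (sketch above)
    return x @ w_fp.T                           # decode: single token, original weights
```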

Whether enterprises aim to deploy highly customized local large language models within their internal networks, or developers are committed to building vertical-domain private AI application ecosystems, Cider provides a robust, reliable, and highly extensible underlying inference infrastructure.

Toward Private AI: Building Local Intelligence Infrastructure

In the past, most large model applications relied on cloud computing. Cloud-based models offer stronger scalability, but in enterprise scenarios, data transmission costs, privacy and security, API call expenses, and network dependency have become issues that cannot be ignored. Particularly in scenarios involving internal systems, core business processes, sensitive interface screenshots, and task data, on-device AI brings the model closer to where data originates, reducing transmission risk while improving response speed and keeping operations under the enterprise's own control.

By enhancing local inference efficiency, Cider brings "data never leaves the device" closer to a truly viable engineering solution. When local models achieve better inference performance, enterprises gain the confidence to explore private AI deployment across more scenarios—such as local intelligent assistants, enterprise internal Agents, offline task execution, on-device multimodal analysis, and automated workflows with high confidentiality requirements.

Going forward, Mininglamp Technology will also open-source the complete Mano-Action training methodology and related tools, helping enterprises and developers train customized GUI agent models based on their own data, or develop new training techniques on top of Mano-Action, fully empowering enterprise customization and algorithmic innovation.

Mininglamp Technology is extending its deep expertise in intelligent agents, multimodal models, and enterprise-grade AI applications further down to the foundations of underlying inference frameworks and on-device model development. We are committed to providing developers and enterprise users with a complete, out-of-the-box private AI infrastructure, enabling AI to truly achieve private deployment, low-cost operation, and trustworthy real-world implementation.
