DEV Community: Mininglamp

Three Open-Source Projects That Turn Your Mac Into a Private AI Workstation

Mininglamp — Tue, 19 May 2026 12:05:56 +0000

The idea of running AI agents entirely on your laptop used to be a joke. A fun thought experiment you'd entertain over coffee before switching back to your cloud API dashboard and watching the bills pile up.

In 2026, it's a real workflow.

Not a demo. Not a "technically possible if you squint" proof of concept. An actual, production-grade stack where a vision-language model sees your screen, operates your apps, accelerates inference on Apple Silicon, and builds entire applications from a product spec — all without a single byte leaving your machine.

At Mininglamp Technology, we've been building toward this with three open-source projects. Each solves a distinct piece of the on-device AI puzzle. Together, they form something we think is genuinely new: a complete private AI workstation stack that runs on a Mac.

Let's walk through them.

1. Mano-P: The Agent That Sees Your Screen

Repo: github.com/Mininglamp-AI/Mano-P

Most "AI agents" are glorified API wrappers. They read text, call tools, and hope the tool's interface hasn't changed since the prompt was written. Mano-P takes a fundamentally different approach: it's a GUI-VLA (Vision-Language-Action) model that perceives your screen the way a human does — by looking at it.

Mano-P comes in two sizes:

72B (cloud/server): The full model, currently ranked #1 on OSWorld with a score of 58.2% — a significant lead over the second-place opencua-72b at 45.0%.
4B (local): A distilled model designed to run entirely on-device. On an M5 Pro, it decodes at roughly ~80 tokens/second with a peak memory footprint of just 4.3GB. It runs on M4 chips with 32GB RAM.

What makes this interesting isn't just the benchmark numbers — it's the interaction model. Mano-P doesn't need custom integrations or tool definitions. It sees buttons, text fields, menus, and dialogs the same way you do. Tell it "open Safari and find the latest Hacker News post about Rust," and it navigates the GUI visually, clicking and typing as needed.

The 72B model also includes WebRetriever, a web navigation component that scores 41.7 on NavEval — ahead of Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3). Web browsing as a first-class agent capability, not an afterthought.

Why This Matters

The traditional approach to computer-use agents is brittle. You build tool adapters, maintain API schemas, and pray that the next macOS update doesn't break your Accessibility API hooks. A vision-first agent sidesteps all of that. If a human can use the app, Mano-P can use the app.

2. Cider: Inference Acceleration for Apple Silicon

Repo: github.com/Mininglamp-AI/cider

Running a 4B model at 80 tok/s on a Mac doesn't happen by accident. It requires an inference engine that actually understands Apple Silicon's hardware characteristics. That's what Cider is.

Cider is an inference acceleration SDK built specifically for Apple's M-series chips. Its key contribution is activation quantization — specifically W8A8 and W4A8 schemes — which fills a gap that MLX currently doesn't cover. MLX supports weight-only quantization (W4A16, W8A16), but activations stay in full precision. Cider quantizes both weights and activations, which unlocks substantially better throughput.

The Numbers

On an M5 Pro, Cider delivers 1.4–2.2x faster inference compared to MLX W4A16, depending on the quantization granularity you choose:

Quantization	Granularity	Speedup vs MLX W4A16
W8A8 / W4A8	Per-channel	1.8x (fastest)
W8A8 / W4A8	Per-group (gs=128)	1.5x
W8A8 / W4A8	Per-group (gs=64)	1.3x

There's a tradeoff between speed and accuracy, as you'd expect. On the CUA Benchmark (M5, 16GB), W8A16 quantization maintains 58.0% accuracy while W8A8 comes in at 54.0%. Depending on your use case, that 4-point delta may or may not matter — for many agentic workflows, the speed gain is worth it.

Why Not Just Use MLX?

This isn't about replacing MLX. MLX is excellent at what it does. But weight-only quantization hits a wall when you need both low memory and high throughput for real-time agent interactions. Activation quantization is the next lever, and right now, Cider is the open-source option that pulls it on Apple Silicon.

Think of it this way: MLX gives you the foundation. Cider fills the gap in activation quantization that lets you push throughput further on the same hardware.

3. Mano-AFK: The Autonomous App Builder

Repo: github.com/Mininglamp-AI/mano-afk

This is where things get wild.

Mano-AFK takes a PRD (Product Requirements Document) and turns it into a working application. Not a skeleton. Not boilerplate. A deployed, tested application — with zero human intervention in the loop.

Here's the pipeline:

Read the PRD — Parse requirements, extract features, identify tech stack
Write the code — Generate the full application
Deploy it — Spin up a local or containerized environment
Test it visually — Using Mano-P's vision model to actually look at the running app
Find bugs — Compare what's on screen to what the PRD specified
Fix them — Modify code, redeploy, retest

The critical piece here is step 4. Most code-generation tools "test" by running unit tests they also generated — which is roughly as useful as grading your own homework. Mano-AFK uses Mano-P's vision capabilities to perform visual testing: it loads the app, looks at the screen, and verifies that the UI actually matches the spec. A button that's supposed to be blue but renders as white? Caught. A form that submits but shows no confirmation? Caught.

This closes the loop in a way that pure code generation can't. The vision model acts as an independent quality gate that evaluates the artifact, not just the source.

What It's Good For

Mano-AFK shines for internal tools, prototypes, and MVPs where the cost of human QA exceeds the cost of iteration cycles. It's not going to replace your engineering team on a complex distributed system. But for "I need a dashboard that shows these metrics with these filters by Thursday"? It's remarkably capable.

The Stack: Model → Accelerator → Builder

Here's where the three projects become more than the sum of their parts.

┌─────────────────────────────────────────────┐
│              Your Mac (M4+ / 32GB)          │
│                                             │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐ │
│  │  Mano-P  │  │  Cider   │  │ Mano-AFK  │ │
│  │  (Agent) │──│  (Accel) │──│ (Builder) │ │
│  │  4B VLA  │  │  W8A8    │  │ PRD→App   │ │
│  └──────────┘  └──────────┘  └───────────┘ │
│                                             │
│  Data stays here. Always.                   │
└─────────────────────────────────────────────┘

Mano-P provides the vision-language-action intelligence — the ability to see, understand, and act on screen content. Cider accelerates inference so that intelligence runs at interactive speeds on consumer hardware. Mano-AFK orchestrates multi-step autonomous workflows, using Mano-P as both its brain and its eyes.

The result is a stack where:

Your AI agent perceives and operates your entire desktop
Inference is fast enough for real-time interaction (not "wait 30 seconds per action" fast — actually fast)
Autonomous workflows can build, deploy, and quality-test applications without human involvement
Nothing leaves your machine. No API calls to external servers. No telemetry. No data exfiltration vectors. Your code, your screen content, your documents — they stay on your Mac.

That last point matters more than people think. Enterprise teams working with proprietary code, healthcare organizations handling patient data, legal teams reviewing confidential documents — these groups can't use cloud AI agents, period. An on-device stack isn't a nice-to-have for them. It's the only option.

Hardware Requirements

Let's be clear about what you need: Apple M4 with 32GB of RAM is the minimum for running the 4B model at usable speeds. An M5 Pro will give you the best experience. This isn't a "runs on any Mac" situation — you need the unified memory bandwidth and Neural Engine capabilities of recent Apple Silicon.

The Bigger Picture

We're not claiming this replaces cloud AI. The 72B model exists for a reason — some workloads need that scale, and running it requires serious hardware. What we are saying is that the gap between "cloud-only" and "runs on your laptop" has narrowed dramatically, and for a growing category of workflows, the on-device option is not just viable but preferable.

The three forces driving this:

Model distillation has gotten remarkably good. The 4B Mano-P retains enough capability from its 72B parent to handle real-world GUI tasks.
Apple Silicon's unified memory architecture is uniquely suited to LLM inference. High memory bandwidth + large unified pool = exactly what transformer decoding needs.
Activation quantization (via Cider) closes the remaining throughput gap. Weight-only quantization was the easy win; activation quantization is the hard one that makes real-time interaction possible.

The open-source angle matters here too. These aren't black-box binaries. You can inspect the model weights, audit the inference engine, verify that nothing phones home. For privacy-sensitive deployments, "trust us" isn't good enough. "Read the code" is.

Get Started

All three projects are released under Apache 2.0 — use them commercially, fork them, contribute back, or just kick the tires.

Mano-P (GUI Vision-Language-Action Agent): github.com/Mininglamp-AI/Mano-P
Cider (Apple Silicon Inference Acceleration): github.com/Mininglamp-AI/cider
Mano-AFK (Autonomous App Builder): github.com/Mininglamp-AI/mano-afk

If you build something with them, we'd love to hear about it. File an issue, open a PR, or just star the repos if you think this direction is worth pursuing.

The future of AI workstations isn't in the cloud. It's on your desk.

Mininglamp Technology builds AI infrastructure for enterprises. Our open-source projects focus on on-device AI agents, inference optimization, and autonomous software development. Learn more at github.com/Mininglamp-AI.

Agent vs Skill vs MCP vs Tool: The 4-Layer Stack Every AI Developer Should Know

Mininglamp — Thu, 14 May 2026 11:10:04 +0000

The Terminology Problem

The AI agent ecosystem has a vocabulary collision. "Tool" means one thing in LangChain, another in AutoGPT, and something else entirely in Claude's function-calling docs. "Skill" and "agent" are similarly overloaded—an "agent" might be a simple prompt wrapper or a fully autonomous system that books flights and deploys code. "MCP" arrived in late 2024 and added yet another term to the mix.

This matters architecturally. When layers are conflated, testing becomes harder, reuse drops, and swapping a model means rewriting half the system. A function that orchestrates 15 steps gets called a "tool." A prompt that strings together API calls gets called an "agent." The result is codebases where nothing is composable.

A 4-layer mental model resolves most of the confusion—similar to how the OSI model gave networking a shared vocabulary, or how MVC clarified web application structure. It's not a rigid specification, but a framework for making architectural discussions more productive.

The 4-Layer Stack

From bottom to top:

Layer 1: Tools — The Atoms

A tool is a single, stateless function that performs one atomic operation. It clicks a button, reads a file, calls an API, or captures a screenshot. Tools have no memory, no planning capability, and no awareness of why they're being called.

Key properties:

Deterministic (or close to it)
Testable in isolation
Composable — designed to be called by higher layers
Environment-specific — a click() on macOS differs in implementation from click() on Android, even if the interface is identical

Examples:

screenshot() — captures the current screen
click(x, y) — clicks at coordinates
read_file(path) — returns file contents
http_get(url) — fetches a URL

Tools are the smallest composable unit. They accept input, perform one action, and return a result. No side quests. The web analogy: individual HTTP endpoints. A GET /users/:id doesn't know about business logic—it fetches a row from a database and returns it.

Layer 2: MCP (Model Context Protocol) — The Connectors

MCP is a standardized transport layer for tool discovery and invocation across process boundaries. Think of it as GraphQL or gRPC for AI systems—it defines how tools are discovered, described, and called, not what they do.

Before MCP, every agent framework had its own tool integration spec. Building a tool for LangChain meant rebuilding it for AutoGPT. Building it for CrewAI meant doing it again. MCP standardizes three things:

Discovery: "What tools are available on this server?"
Schema: "What parameters does this tool accept? What does it return?"
Transport: stdio, HTTP, or WebSocket—the calling code picks the transport

MCP is about interoperability, not intelligence. An MCP server exposes tools; it never decides when to use them. The calling agent makes all decisions. An MCP server is a waiter that presents the menu and takes orders—it doesn't choose the meal.

When MCP adds value: Tools living in different processes or machines. Multiple agents or frameworks sharing the same tool set. Tool authors who want to write once and have it work across LangChain, Claude, OpenAI Assistants, and others.

When MCP adds overhead without benefit: Everything runs in-process and only one agent consumes the tools. In that case, direct function calls are simpler.

Layer 3: Skills — The Playbooks

A skill is a reusable, multi-step procedure that combines tools to accomplish a meaningful task. The web analogy: a service-layer module. A PlaceOrderUseCase orchestrates inventory checks, payment processing, and notifications—it's not a single endpoint but a choreography of endpoints.

"Fill out a web form" is a skill: it involves locating fields, typing values, handling dropdowns, scrolling, and clicking submit. Each step invokes tools, but the sequence, branching logic, and error recovery are the skill's contribution.

Examples:

"Navigate to Settings > Privacy > Clear Cache" (UI navigation)
"Search for a flight, compare prices, select the cheapest" (multi-step research)
"Read an Excel file, extract key metrics, generate a summary" (data analysis)
"Log into a service, check account status, export a report" (multi-app workflow)

Skills are portable when the underlying tool layer provides the required primitives. A "fill web form" skill works on any OS as long as click, type, and screenshot tools are available underneath.

The skill is the natural unit of reuse. A 3-line function and a 300-line multi-step workflow serve fundamentally different purposes; separating them clarifies what's testable in isolation (tools) versus what requires integration testing (skills). Skills can also be shared across agents—one agent might use a "file analysis" skill in a data pipeline context, another in a customer support context.

Layer 4: Agent — The Decision-Maker

An agent is the autonomous reasoning entity that decides what to do, when, and why. It observes the environment (via tools), reasons about the next action (via its language model), selects the appropriate skill, monitors execution, and adapts when things fail.

An agent owns:

Goal decomposition — breaking "book me a flight to Tokyo" into subtasks
Skill selection — choosing which playbook fits the current subtask
Error recovery — detecting failures and trying alternatives
Memory — tracking what's been done across a session
Termination judgment — knowing when the goal is achieved

Agents are model-powered. Replace the model, and the agent's capability ceiling changes. But in well-layered architecture, skills and tools remain valid regardless of which model drives the agent. This is the key insight: the agent is the most volatile layer (models improve quarterly), while tools and skills are the most stable (click is still click).

How the Layers Compose

Agent (decides what to do)
  ↓ selects
Skill (knows how to do it)
  ↓ invokes via
MCP (discovers and routes)
  ↓ calls
Tool (executes one atomic action)

This separation enables:

Swappable models — upgrade the agent's LLM without touching skills or tools
Portable skills — move a skill from cloud to edge by swapping the tool layer
Testable tools — unit-test each tool independently, integration-test each skill
Interoperable infrastructure — MCP means tools work with any compliant agent

A Real-World Example: Mano-P

Mano-P is Mininglamp Technology's open-source on-device GUI agent for macOS. It illustrates how the Agent and Skill layers work together in a local-first, privacy-preserving architecture.

It is pure vision-driven—understanding screens via screenshots, with no dependency on DOM trees, accessibility APIs, or HTML scraping. A local 4B-parameter model runs the entire inference loop on-device.

At the Tool layer: Screen capture, mouse click, keyboard input, scroll—all native macOS operations. No cloud calls for any action primitive.

At the Skill layer: Multi-step workflows for desktop tasks—form filling, app navigation, data extraction—compose the native tools into reliable sequences. These are packaged as mano-skill, a format callable by external orchestrators like Claude Code or OpenClaw agents.

At the Agent layer: The vision-language model observes screenshots and decides the next action autonomously. On Apple M4 + 32GB RAM, it runs at 76 tok/s using the Cider SDK (MLX inference acceleration with W8A8 activation quantization). Data never leaves the device—no screenshots uploaded to cloud APIs, no keystrokes logged remotely.

On the OSWorld benchmark, Mano-P ranked #1 in the proprietary model category with 58.2% accuracy—demonstrating that smaller local models with well-separated architecture can compete with cloud-dependent systems on real desktop tasks.

Installation:

brew tap Mininglamp-AI/tap && brew install mano-cua

Apache 2.0 licensed. Hardware requirement: Apple M4 chip + 32GB RAM.

When to Use What

Not every project needs all four layers:

Tools alone — deterministic automation with fixed sequences (cron jobs, CI pipelines, simple scripts).

Tools + MCP — tools live in different processes or machines; multiple agents share the same tool set.

Tools + MCP + Skills — multi-step workflows with conditional logic and error recovery; reusable procedures across different agents.

Full stack (Agent + Skill + MCP + Tool) — goals are ambiguous or user-specified at runtime; the environment is dynamic; autonomous operation over extended sessions is needed.

Building from the bottom up tends to work well. Get tools right first. Add MCP when interop is needed. Compose skills when workflows emerge. Add an agent when autonomous reasoning becomes necessary.

Common Architecture Smells

Patterns worth recognizing early:

Monolithic prompts — tools, skills, and orchestration logic all in one system message. Hard to test or debug individual pieces. Hard to reuse across projects.
"Tools" that maintain state — a function doing 15 things with internal state is a skill in disguise. Recognizing this improves testability and makes the codebase legible.
MCP everywhere — wrapping every in-process function call in MCP transport adds complexity without interoperability gains. MCP shines at boundaries, not within a single process.
Platform logic in skills — skills containing OS-specific code instead of delegating to tools lose portability. The fix: push platform specifics down into the tool layer where they belong.
Agent without skills — putting all multi-step logic directly in the agent's prompt creates a brittle system that breaks when the model changes or the prompt grows too long.

Summary

The 4-layer model—Tool, MCP, Skill, Agent—provides a vocabulary for answering recurring design questions:

Where does this logic belong?
What's reusable vs. environment-specific?
What can be tested in isolation?
What changes when the model is swapped?
What survives a model upgrade without modification?

These are the same separation-of-concerns questions that web development answered with MVC, service layers, and API gateways. The AI agent stack is working through equivalent patterns now. The projects that age well will be the ones with clean boundaries between layers—where upgrading the LLM doesn't require rewriting the skill library, and swapping from macOS to Linux only means changing the tool implementations.

Mano-P is open-source at github.com/Mininglamp-AI/Mano-P. If you find this useful, a ⭐ on GitHub helps the project reach more developers.

Why One Giant Model Ruling Everything Is a Bad Idea

Mininglamp — Wed, 13 May 2026 09:56:36 +0000

The Narrative Everyone Accepted Without Questioning

There's a story the AI industry has been telling itself for the past few years, and it goes something like this: bigger is better, and the biggest wins. More parameters. More data. More compute. The leaderboard rewards scale, venture capital rewards scale, and so the entire field marches in one direction — upward.

But spend enough time in the trenches — dealing with real deployment constraints, real failure modes, and real questions about who controls what — and this narrative starts to look, at best, incomplete.

What if scaling up is only half the story? What if the other half — scaling out — is not just a fallback for teams who can't afford the big model, but a fundamentally different architecture that solves problems the monolithic approach structurally cannot?

The Internet Is Changing at the Infrastructure Level

Here's something that doesn't get discussed enough: the internet itself is undergoing a quiet paradigm shift.

The old internet was designed to connect human attention. Search engines, social feeds, recommendation algorithms — they all competed for the same scarce resource: the roughly 16 waking hours each person has per day. The entire ad-tech economy was built on this bottleneck.

The emerging internet connects agent compute. Software agents don't sleep. They don't get bored. They don't have a finite attention span that advertisers fight over. When AI agents become the primary consumers and producers of internet traffic — not just humans browsing pages — the architecture of the network itself needs to change.

This isn't a distant future. It's already happening. API calls between services are growing faster than human page views. Autonomous agents are booking meetings, writing code, filing reports, and negotiating with other agents. The internet is transitioning from a human attention marketplace to an agent cooperation network.

And this transition raises a profound question: should that cooperation network be controlled by a single model, or distributed across many?

Why Scaling Up Alone Is Structurally Risky

To be clear: large models are not inherently bad. They're remarkable achievements. Frontier systems demonstrate capabilities that seemed impossible five years ago. The research behind them is genuinely impressive.

But as an architectural strategy for the entire field, the "one model to rule them all" approach has structural risks that don't go away by throwing more compute at them:

Extreme centralization. Training frontier models costs hundreds of millions of dollars. Only a handful of organizations on Earth can play this game. That means the most powerful AI capabilities are concentrated in very few hands. Whatever your politics, this level of concentration should give you pause.

Black-box decision making. When a single 2-trillion-parameter model makes a decision, good luck auditing why. Interpretability research is making progress, but the field is nowhere near being able to trace a complex reasoning chain through a monolithic transformer with confidence. For high-stakes domains — medicine, law, finance — "trust me, the big model said so" isn't going to cut it.

Diminishing returns on investment. The scaling laws that powered the last generation of breakthroughs are showing signs of flattening in certain domains. Training costs are growing faster than capability gains. At some point, the next 10x in compute doesn't buy 10x in usefulness — it buys marginally better benchmark scores that don't translate to real-world value.

Single points of failure. When an entire AI strategy depends on one provider's API staying up, staying affordable, and staying aligned with the user's interests... that's one policy change away from a very bad week.

None of these are reasons to abandon large models. They're reasons to ask: is there a complementary approach?

Scaling Out: A Different Architectural Bet

An alternative gaining traction in the industry: instead of making one model infinitely large, connect many specialized models over the internet and let them cooperate on tasks.

Consider how the internet itself succeeded. It didn't win by building one giant supercomputer that everyone connects to. It won by creating a protocol that lets millions of different machines — each with their own capabilities, owners, and purposes — collaborate. The genius was in the connection, not the concentration.

Scaling Out applies the same principle to AI. Different agents, potentially running different models optimized for different tasks, coordinate over network protocols to accomplish complex goals. A planning agent delegates to a code-writing agent, which delegates to a testing agent, which reports back. Each agent is independently deployable, replaceable, and auditable.

The advantages mirror those of distributed systems in general:

Resilience. No single agent failure takes down the whole system.
Specialization. Each agent can be optimized for its specific task rather than being a jack-of-all-trades.
Auditability. The communication between agents is inspectable. The reasoning chain is explicit in the messages, not buried in hidden layers.
Accessibility. No billion-dollar GPU cluster required to participate. A well-tuned 7B model running on modest hardware can be a valuable node in an agent network.

MOA vs. MoE: The Difference That Matters

Anyone familiar with Mixture of Experts (MoE) might be thinking: "This is already solved. MoE architectures route different inputs to different expert sub-networks within a single model."

That's true, but there's a crucial distinction.

In MoE, the routing happens inside the model. It's an internal optimization — a way to make a single model more efficient. The experts share weights, share a training process, and share an operator. From the outside, it's still one black box. There's no way to inspect which expert handled a query, no way to audit the expert's reasoning independently, and no way to replace one expert without retraining the whole system.

Mixture of Agents (MOA) is architecturally different. Each agent is a separate system — potentially running a different model, operated by a different team, connected over the internet. The "routing" is explicit: an orchestrator delegates tasks to agents based on their declared capabilities, and the communication happens over observable channels.

This means:

White-box cooperation. Every message between agents is inspectable. It can be logged, audited, replayed. There's no hidden routing decision buried in a softmax layer.
Independent governance. Each agent can have its own safety constraints, access controls, and compliance requirements. A medical agent can enforce HIPAA. A financial agent can enforce SOX. These constraints don't need to be negotiated inside a single model's RLHF training.
Traceable accountability. When something goes wrong, it's possible to point to exactly which agent made which decision based on which inputs. Try doing that with a trillion-parameter monolith.
Evolvability. Swap out one agent for a better version without touching the rest of the system. Upgrade incrementally. No need for a six-month retraining cycle.

MoE is an optimization technique for building better monoliths. MOA is an architectural pattern for building systems of cooperation. They solve different problems at different levels of the stack.

The Bigger Picture: Democratized AI Research

There's one more angle that doesn't get enough attention: what happens when the barrier to contributing to AI systems is lowered?

Right now, a domain expert — a biologist, a materials scientist, a climate researcher — who wants to leverage AI for their field has limited options: (a) fine-tune someone else's foundation model if the budget allows, or (b) hope that the general-purpose model happens to know enough about the niche.

In a Scaling Out world, there's option (c): build a specialized agent for a specific domain and plug it into the network. That agent doesn't need to be a frontier model. It needs to be good at its specific thing — identifying protein structures, simulating material properties, parsing climate data — and able to communicate its results to other agents that handle the parts it can't.

This is how scientific collaboration works among humans. No single scientist knows everything. Progress happens when specialists communicate effectively. There's no reason AI-assisted research should be different.

Imagine an internet where thousands of domain-specific AI agents — each built by experts in their respective fields — cooperate on complex research problems. A genomics agent identifies candidate genes. A chemistry agent predicts binding affinities. A literature agent surfaces relevant prior work. An experiment-design agent proposes validation studies. Each one is modest in isolation. Together, they're formidable.

This isn't just a technical architecture. It's a statement about who gets to participate in the AI revolution. If the only path forward is "build a bigger model," then only the richest organizations get a seat at the table. If the path forward is "build a specialized agent and connect it," then every domain expert in the world is a potential contributor.

Where Does This Leave Us?

Scaling Out does not replace Scaling Up. Large foundation models will continue to be valuable — as general-purpose reasoning engines, as pre-training bases for fine-tuning, as components within larger agent systems. The question isn't "which one wins." It's "what's the right mix, and who decides?"

The more likely future looks less like one omniscient oracle and more like an internet of cooperating specialists. Not because distributed systems are trendy, but because the problems that actually need solving — scientific discovery, complex engineering, personalized medicine, climate adaptation — are too varied, too specialized, and too important to trust to any single system, no matter how large.

The monolithic model is a cathedral. The agent network is a bazaar. History suggests which one adapts faster.

What's your take? Is this overcomplicating things — will a sufficiently large model really handle everything? Or does the distributed approach resonate with how you think about building reliable systems? If you've experimented with multi-agent architectures, what worked and what didn't?

The HN Post That Got 1,700 Upvotes: Local AI Needs to Be the Norm.Why "Local AI" Just Became the Default for Developers

Mininglamp — Tue, 12 May 2026 09:45:13 +0000

The HN Post That Got 1,700 Upvotes: Local AI Needs to Be the Norm

In early 2025, a post titled "Local AI needs to be the norm" hit the front page of Hacker News and stayed there. It collected 1,763 upvotes and over 800 comments. No product launch, no benchmark claim, no drama — just a statement that resonated with a large number of developers simultaneously.

The comments weren't the usual HN contrarianism either. Most of them were agreements, expansions, and stories of people already running models locally for daily work. Reading through that thread felt less like a debate and more like a census.

Something shifted. This article is an attempt to understand what, why, and where it leads.

The Cloud Assumption Is Cracking

For the past two years, the default mental model for AI has been: send your data to a powerful server, get results back. OpenAI, Anthropic, Google — they all operate on this assumption. You pay per token, your data traverses the internet, and the model lives somewhere you'll never see.

This worked fine when models were enormous and consumer hardware was weak. GPT-4 at launch required infrastructure that no individual could replicate. The cloud wasn't just convenient — it was the only option.

But hardware caught up faster than most expected. Apple's M-series chips turned laptops into credible inference machines. The M4 Pro can run a 4-billion parameter quantized model at 476 tokens per second for prefill and 76 tokens per second for decode, using 4.3GB of peak memory. That's not a toy — that's production-grade speed for most interactive use cases.

Meanwhile, the model side moved just as fast. Quantization techniques (GGUF, AWQ, GPTQ) made it possible to shrink models dramatically without proportional quality loss. A well-quantized 7B model today outperforms the full-precision 13B models of 18 months ago on most practical tasks.

The gap between "what you can run locally" and "what you need from the cloud" is narrowing every quarter.

Why Developers Care About Local

The HN thread was revealing because it surfaced the actual motivations, not the marketing ones. Here's what kept coming up:

Privacy isn't paranoia. Developers working on proprietary codebases, medical data, legal documents, or internal communications can't send that to third-party APIs without violating policies, NDAs, or regulations. This isn't about tinfoil hats — it's about professional responsibility. A developer at a bank can't pipe customer data to OpenAI's API, no matter how good the model is.

Latency is UX. A local model responds in milliseconds. No network round-trip, no queue, no cold start. For code completion, text editing, or any interactive workflow, the difference between 50ms and 500ms is the difference between a tool that feels invisible and one that interrupts your flow.

Cost compounds. API pricing looks cheap per call, but it adds up. A team of 10 developers making moderate use of GPT-4 for coding assistance can easily spend $2,000-5,000/month. A local model on existing hardware costs nothing after setup. For startups and indie developers, this matters enormously.

Offline availability. Planes, trains, bad WiFi, rural areas, classified environments — there are many contexts where internet access is unreliable or prohibited. Local models work everywhere your hardware goes.

Control and reproducibility. When you run a model locally, you know exactly which version, which weights, which quantization you're using. Cloud APIs change without notice. Models get updated, deprecated, or have their behavior modified. Local inference gives you a frozen, reproducible environment.

None of these are theoretical. They're daily realities for working developers.

What's notable is that these motivations cut across experience levels and company sizes. A solo indie developer cares about cost. A staff engineer at a Fortune 500 cares about compliance. A researcher cares about reproducibility. A journalist in a hostile regime cares about privacy as a survival matter. Local AI serves all of them with the same architecture.

The Ecosystem That Made It Possible

Local AI didn't become practical because of one breakthrough. It happened because an entire ecosystem matured simultaneously:

llama.cpp made inference accessible. Georgi Gerganov's C++ implementation proved you could run large language models on consumer hardware without Python, without CUDA, without a GPU cluster. It was a proof of concept that became infrastructure.

Ollama made it approachable. Download a model, run it with one command, expose an API. Ollama did for local LLMs what Docker did for containers — it removed the setup friction that kept most developers from trying.

Apple's MLX framework brought first-party support. Apple clearly sees on-device AI as a strategic differentiator. MLX is optimized for Apple Silicon in ways that third-party frameworks can't match, and Apple Intelligence's architecture is explicitly local-first with cloud as fallback.

Hugging Face's ecosystem provided the models. The proliferation of open-weight models (Llama, Mistral, Phi, Qwen, Gemma) meant developers had real choices. Competition drove quality up and size down.

Quantization research made the math work. Papers like GPTQ, AWQ, and QuIP# showed that aggressive quantization (4-bit, even 2-bit) could preserve model quality for most practical tasks. This was the key that unlocked consumer hardware — you don't need 70B parameters if 7B quantized gets you 90% of the way there.

The result: in 2024-2025, running a competent local model went from "impressive hack" to "standard developer workflow." The HN post didn't create this trend — it named something that was already happening.

It's worth noting how fast this moved. In early 2023, running any useful model locally required a beefy NVIDIA GPU and considerable technical skill. By late 2024, a MacBook Air could run a 7B model with no configuration beyond installing Ollama. That's a two-year journey from "research project" to "commodity tool."

Apple's Bet Tells You the Direction

Apple's approach to AI is worth studying because Apple doesn't make speculative bets. They ship what they believe will be the default in 3-5 years.

Apple Intelligence is architecturally local-first. The on-device model handles most requests. Only when a task exceeds local capability does it route to Private Cloud Compute — and even then, Apple designed PCC so that data is processed in a stateless enclave that even Apple employees can't access.

This isn't just a privacy story. It's an architecture story. Apple is betting that the future of AI interaction is:

Most inference happens on-device
The cloud is a capability fallback, not the default
Users shouldn't have to think about where processing happens

The MLX framework, the Neural Engine improvements in each chip generation, the Core ML optimizations — these are multi-year, multi-billion-dollar investments. Apple doesn't spend that money on trends they think will reverse.

When the largest company in the world builds its AI strategy around local inference, that's a signal worth paying attention to.

From Local Models to Local Agents

Here's where the conversation gets interesting, and where the HN thread didn't fully go.

Running a model locally is valuable, but it's still fundamentally a chat interface. You ask, it answers. The model is a brain in a jar — it can think, but it can't act.

The next logical step is obvious: if you can run inference locally, why not run agents locally?

An agent doesn't just generate text — it perceives your screen, understands context, and takes actions. It clicks buttons, fills forms, navigates applications, moves files. The gap between "AI that tells you how to do something" and "AI that does it for you" is the gap between a language model and an agent.

Cloud-based agents have a fundamental problem: they need to see your screen. That means streaming your desktop to a remote server continuously. Every document you open, every email you read, every private message — all sent to someone else's infrastructure. Even if you trust the provider today, you're creating a surveillance surface that didn't need to exist.

Local agents solve this elegantly. The model runs on your machine. It perceives your screen locally. It acts locally. Your data never leaves your device because there's nowhere else for it to go.

This is where the "local AI as norm" argument becomes strongest. For chat and text generation, privacy concerns are manageable — you can be careful about what you paste into a prompt. But for agents that continuously observe your workflow? Local-only isn't a preference; it's a requirement for anyone who takes security seriously.

The Technical Puzzle of On-Device Agents

Building a local agent is harder than running a local chatbot. The challenges are specific:

Vision understanding. The agent needs to interpret screenshots — understand UI elements, read text, recognize buttons, comprehend layouts. This requires vision-language models that are both capable and small enough to run locally.

Action grounding. Seeing a button is different from knowing how to click it. The agent needs to map visual understanding to precise coordinates and actions. This is a harder problem than it sounds — UI elements are dynamic, vary across applications, and don't come with semantic labels accessible to the model.

Speed. An agent that takes 10 seconds to decide what to click is useless for interactive workflows. Inference needs to be fast enough that the agent feels responsive, not laggy.

Reliability. Unlike a chatbot where a bad response is just annoying, an agent that clicks the wrong button can cause real damage. Accuracy matters more when the model has agency.

These constraints push toward a specific architecture: small, fast, vision-capable models that are optimized for action prediction rather than general conversation. You don't need GPT-4-level reasoning for most UI interactions — you need precise, fast, visual understanding.

Why Vision-Only Matters

There are two approaches to building GUI agents:

Accessibility-tree based: Parse the application's DOM or accessibility API to get structured data about UI elements. Feed that structure to the model.
Vision-only: Give the model a screenshot. Let it figure out what's on screen the same way a human would — by looking.

The accessibility approach seems easier, but it's brittle. Not all applications expose clean accessibility trees. Electron apps, games, custom UI frameworks, remote desktops — they all have incomplete or missing accessibility data. You're building on an abstraction that the underlying applications don't reliably provide.

Vision-only is harder to build but more robust in deployment. If a human can see it and interact with it, a vision-based agent can too. No dependency on application internals, no platform-specific APIs, no breaking when an app updates its UI framework.

This mirrors how humans actually interact with computers. We don't read the DOM — we look at the screen and click what looks right. A vision-only agent generalizes the same way.

The Convergence

Put the pieces together:

Local inference is fast enough for interactive use
Vision-language models are small enough to run on consumer hardware
Developers want their data to stay local
Agents are the natural evolution beyond chatbots
Vision-only approaches generalize across applications

The convergence point is clear: on-device AI agents that see your screen, understand your intent, and act locally — with zero data leaving your machine.

This isn't a prediction about 2030. The hardware exists today. The models exist today. The demand — as that HN post demonstrated — has been here for a while.

Where We're Putting Our Work

At Mininglamp Technology, we've been building toward this convergence with Mano-P — an open-source, on-device GUI agent that runs locally on Mac.

Mano-P takes the vision-only approach: it perceives your screen through screenshots and executes actions directly, with no data leaving your device. On the OSWorld benchmark, it achieves 58.2% accuracy — currently ranked #1. The 4B quantized model runs on an M4 Pro at 476 tokens/s prefill and 76 tokens/s decode, with 4.3GB peak memory usage. It's licensed under Apache 2.0.

We built it because we believe the argument in that HN post is correct: local AI should be the norm. And local agents are where that norm leads.

If this direction resonates with how you think about AI tooling, the repo is open. Contributions and stars are always appreciated.

Full-Stack On-Device GUI Agent — Mano-P Model + Cider + AFK, All Open Source

Mininglamp — Wed, 06 May 2026 11:06:58 +0000

Full-Stack On-Device GUI Agent — Mano-P Model + Cider + AFK, All Open Source

Introduction

GUI automation (Computer Use Agent) is becoming a key capability in the AI agent ecosystem. However, most existing solutions rely on cloud-based inference — every screenshot captured during task execution must be uploaded to a remote server for visual understanding. This creates significant data privacy concerns, especially in enterprise and security-sensitive environments.

Today, we are officially open-sourcing the Mano-P 1.0-4B local model, the Cider inference acceleration SDK, and Mano-AFK (an end-to-end automated app builder) — bringing a complete on-device GUI agent stack to Apple Silicon.

All screenshots and task data stay on your device. No cloud APIs required.

What is Mano-P

Mano-P is an open-source GUI-VLA (Vision-Language-Action) agent designed for edge devices. "Mano" means "hand" in Spanish, and "P" stands for Private — we believe individuals and organizations should be able to create their own private AI.

Built on the full Mano technical framework (Mano Technical Report), Mano-P uses a three-stage progressive training pipeline (SFT → Offline RL → Online RL) with a think-act-verify reasoning loop to achieve high-precision GUI understanding and operation.

Benchmark results (Mano-P 1.0-72B):

OSWorld (Specialized GUI Agent Models): 58.2% success rate, ranked #1
WebRetriever Protocol I: 41.7 NavEval score

Mano-P 1.0-4B Local Model

The Mano-P 1.0-4B model runs directly on Apple Silicon devices with no internet connection required.

Hardware Requirements:

Apple M4 chip or above (Mac mini / MacBook)
32GB+ unified memory
Alternatively: Mano-P compute stick via USB 4.0

Performance (Apple M5 Pro, 64GB RAM):

W8A16: Prefill 2.839s, Decode ~80 tokens/s
W8A8 (with Cider): Prefill 2.519s, Decode ~79.5 tokens/s
~12.7% prefill speedup with Cider W8A8

Privacy: In local mode, all inference runs on-device via MLX. No screenshots or task descriptions are transmitted over the network.

Download:

🤗 HuggingFace
🪄 ModelScope

Cider — INT8 Activation Quantization SDK for MLX

Cider is an open-source inference acceleration SDK for macOS, built on Apple MLX.

Why Cider Exists

MLX's built-in quantization is weight-only: QuantizedLinear dequantizes weights to FP16 and runs FP16 GEMM. MLX does not provide a true W8A8 inference path where both weights and activations are quantized to INT8 for computation.

Cider fills this gap with custom Metal kernels that implement fused quantize-matmul-dequant primitives, exposed as MLX custom primitives with full lazy evaluation support.

Supported Modes

W8A8: INT8 symmetric weights + INT8 per-token activation quantization → TensorOps matmul2d
W4A8: INT4 packed weights + INT8 per-token activation quantization → Unpack → TensorOps

Performance (Apple M5 Pro)

End-to-end VLM acceleration: Cider W8A8 achieves 1.4x–2.2x prefill speedup vs MLX native W4A16, while maintaining comparable decode speed.

Compatibility

Cider works with any MLX model, not just Mano-P. It also provides non-invasive compatibility patches for mlx_vlm (verified on v0.4.3), fixing several issues with Qwen3-VL multi-image inference.

Conditional Compilation

INT8 TensorOps C++ extensions build only on Apple M5+. On M4 devices, Cider installs as a pure Python package with is_available() returning False. Use CIDER_FORCE_BUILD=1 to override.

Source: github.com/Mininglamp-AI/cider

Mano-AFK — End-to-End App Builder

Mano-AFK is an automated application construction pipeline powered by Mano-P. From a single natural language description, it autonomously handles:

Requirements clarification → Architecture design → Code generation → Deployment → E2E GUI testing → Bug fixing → Delivering a working application

The E2E testing phase uses Mano-P as the local visual model backend, driving real browsers for GUI automation testing. When tests fail, the system automatically locates defects, fixes code, and re-verifies — forming a complete build-test-fix loop entirely on-device.

CUA Benchmark

Test environment: Mano-P 4B on MacBook Pro M5 (16GB unified memory), 100 tasks across 5 auto-built web applications.

W8A16: 58.0% accuracy, avg 6.1 steps, ~1,253 tok/s prefill
W8A8 (Cider): 54.0% accuracy, avg 6.93 steps, ~1,453 tok/s prefill

Note: On 16GB devices, W8A8 requires storing both original and INT8 weights, nearly doubling weight memory. Memory pressure may offset prefill gains. We recommend 4GB+ free memory beyond model size for full W8A8 benefit.

Source: github.com/Mininglamp-AI/mano-afk

Getting Started

# Install CLI
brew tap Mininglamp-AI/tap
brew install mano-cua

# Set up local mode
mano-cua check
mano-cua install-sdk
mano-cua install-model

# Run locally
mano-cua run "Open Safari and search Python" --local

Open Source Roadmap

Mano-P follows a phased open-source strategy:

Phase 1 (Released): Mano-CUA Skills — for Agent enthusiasts using OpenClaw, Claude Code, etc.
Phase 2 (This Release): Local model + Cider SDK — for developers with high security requirements
Phase 3 (Coming Soon): Training methods, pruning, and quantization techniques — for developers with custom model training needs

Dual Launch! Mininglamp Technology Open-Sources Cider On-Device Inference Acceleration Framework and Mano-P On-Device Model

Mininglamp — Wed, 06 May 2026 10:05:15 +0000

Mininglamp Technology has officially open-sourced its self-developed Cider inference acceleration SDK (Software Development Kit) and the on-device GUI agent model Mano-P. Following the earlier open-sourcing of the Mano-CUA skill, this release of the Mano-P model vividly demonstrates the immense potential of on-device models in real-world business workflows. Meanwhile, the Cider framework addresses computation operators and hardware invocation mechanisms at the foundational level, empowering on-device large models to run smoothly on macOS local compute with greater efficiency and lower memory footprint.

GitHub-Mano-P
Cider SDK

Mano-P: Validating the Deployment Potential of On-Device Agents

Mano-P is Mininglamp Technology's self-developed on-device GUI-VLA agent model. It understands and operates graphical interfaces through pure vision, without relying on traditional API integrations or being limited to browser scenarios. Instead, it can directly interact with desktop software, web-based systems, and more complex graphical workflows.

Complex graphical interface interactions inherently demand robust multimodal visual understanding capabilities from the model. The model must continuously process screenshots at high frequency, precisely locate minuscule UI elements, and execute subsequent actions based on visual feedback. Under traditional cloud-based large model architectures, the token cost incurred by such high-frequency visual interactions is extraordinarily high.

In contrast, the 4B-parameter Mano-P on-device model not only achieves accuracy comparable to cloud-based large models on CUA tasks but also completely eliminates the otherwise prohibitive cloud API call costs. In fully offline local mode, all application screenshots, interaction processes, and task data are strictly confined to the user's local device, making privacy protection a matter of "physical isolation" by design.

Cider: An On-Device Inference Acceleration Framework for Apple Silicon

The core metrics that truly determine the usability of on-device models are local inference speed, hardware utilization, memory footprint, integration cost, and long-term stability. If inference speed is too slow, the AI interaction experience suffers significantly; if memory usage is too high, the model becomes difficult to deploy widely on mainstream devices; if integration costs remain prohibitive, enterprises and developers struggle to rapidly incorporate on-device capabilities into their business pipelines.

Cider was born precisely to address these challenges. As a self-developed and open-sourced SDK from Mininglamp Technology, Cider is built on the Apple MLX ecosystem, purpose-built for macOS and Apple Silicon. It precisely fills the gaps in the native MLX framework regarding activation quantization and specific tensor computation capabilities, serving as a highly efficient on-device inference framework designed for the broad open-source model ecosystem.

Currently, the native Apple MLX architecture already supports weight quantization modes such as W4A16 and W8A16. Building upon this foundation, Cider further provides W8A8 and W4A8 inference paths. Through deep integration of online activation quantization, INT8 TensorOps computation, quantized matrix multiplication, and dequantization pipelines, Cider fully unleashes the underlying computational potential of Apple Silicon, enabling open-source models not merely to "run on Mac" but to operate smoothly with higher efficiency and lower memory consumption.

In benchmark testing, Cider's operator speed in W8A8 mode achieves approximately 1.4x to 1.9x improvement over native MLX mode, with specific performance varying by Batch Size. In W4A8 mode, Cider further reduces weight memory footprint by 50% compared to W8A8 mode while matching the computational speed of native MLX's full-precision W4A16 approach in high-concurrency scenarios.

For the Qwen3-VL series of mainstream vision-language models, Cider demonstrates highly significant acceleration in end-to-end prefill scenarios. Under varying prompt lengths, compared to native MLX W8A16 mode, Cider's W8A8 PC mode delivers approximately 17% to 22% prefill speed improvement for the Qwen3-VL-4B model; for the Qwen3-VL-2B model, this speedup leaps to approximately 57% to 61%.

Additionally, Cider has performed deep optimization and non-invasive fixes for technical challenges such as RoPE position handling in multi-image inference, substantially improving inference stability for complex visual tasks. Since visual interaction tasks typically require processing longer contexts, more complex screenshot information, and denser inference requests, this magnitude of performance improvement is particularly critical for on-device VLMs and GUI agents.

Furthermore, Cider actively explores heterogeneous collaboration between the Apple Neural Engine and GPU on the M4 chip. For a long time, on-device large model inference has primarily relied on GPUs, while the potential of the Neural Engine in Apple chips has remained largely untapped. By introducing an ANE+GPU heterogeneous tensor parallelism mechanism, Cider enables both types of compute units to work in concert, achieving an additional approximately 3% to 16% acceleration in certain test scenarios.

Minimal Integration, Enabling Local Acceleration for More Open-Source Models

Cider seamlessly supports any LLM model, covering Qwen, Llama, Mistral, as well as VLM models such as Qwen3-VL, with a built-in OpenAI-compatible VLM inference service. Enterprises and developers need not rewrite model architectures—with only minimal code adaptation, integration can be achieved effortlessly.

During the prefill phase, Cider supports enabling W8A8 INT8 TensorOps to dramatically boost computation speed; during the decode phase, the framework intelligently falls back to the original weight path, effectively avoiding unnecessary additional overhead.

Whether enterprises aim to deploy highly customized local large language models within their internal networks, or developers are committed to building vertical-domain private AI application ecosystems, Cider provides a robust, reliable, and highly extensible underlying inference infrastructure.

Toward Private AI: Building Local Intelligence Infrastructure

In the past, most large model applications relied on cloud computing. Cloud-based models offer stronger scalability, but in enterprise scenarios, data transmission costs, privacy security, API call expenses, and network dependency have become issues that cannot be ignored. Particularly in scenarios involving internal systems, core business processes, sensitive interface screenshots, and task data, on-device AI brings the model closer to where data originates, reducing transmission risks while improving response speed and autonomous controllability.

By enhancing local inference efficiency, Cider brings "data never leaves the device" closer to a truly viable engineering solution. When local models achieve better inference performance, enterprises gain the confidence to explore private AI deployment across more scenarios—such as local intelligent assistants, enterprise internal Agents, offline task execution, on-device multimodal analysis, and automated workflows with high confidentiality requirements.

Going forward, Mininglamp Technology will also open-source the complete Mano-Action training methodology and related tools, helping enterprises and developers train customized GUI agent models based on their own data, or develop new training techniques on top of Mano-Action, fully empowering enterprise customization and algorithmic innovation.

Mininglamp Technology is extending its deep expertise in intelligent agents, multimodal models, and enterprise-grade AI applications further down to the foundations of underlying inference frameworks and on-device model development. We are committed to providing developers and enterprise users with a complete, out-of-the-box private AI infrastructure, enabling AI to truly achieve private deployment, low-cost operation, and trustworthy real-world implementation.

Complex UIs, Cross-App Workflows, Long Tasks: What GUI Agents Actually Unlock

Mininglamp — Wed, 29 Apr 2026 09:44:16 +0000

AI agents have gotten remarkably good at text-based tasks. Platforms like OpenClaw and Claude Code can write code, manage files, search the web, analyze data, and orchestrate multi-step workflows. If the task lives in a terminal, an editor, or an API — agents handle it well.

But ask an agent to fill out a form in your CRM, adjust parameters in a design tool, or navigate a multi-step workflow in an enterprise system — and you'll hit a wall.

The problem isn't intelligence. It's that agents can't see your screen.

The GUI Gap in Agent Capabilities

Most agent platforms interact with computers through three channels: command-line interfaces (CLI), browser developer protocols (CDP), and APIs. These work well for code execution, web scraping, and cloud service calls. But they share a fundamental limitation: they only work with software that exposes a programmatic interface.

In practice, a large portion of the software people use daily has no API:

Enterprise systems (ERP, CRM, internal tools) often lack external interfaces
Desktop applications (office suites, design tools, specialized software) rely on mouse and keyboard interaction
Many web applications involve complex dynamic UIs that resist simple scripting

This is a structural gap in the agent technology stack. Agents have the "brain" to plan and reason, but they lack the "eyes" to see the screen and the "hands" to operate the interface.

Why GUI Vision Is the Missing Piece

Humans interact with computers through a visual feedback loop: observe the screen → understand the interface → locate the target element → perform an action → check the result → proceed. This process doesn't depend on any underlying API. It works through seeing and doing.

Traditional RPA (Robotic Process Automation) attempted to automate GUI interactions, but relied on hardcoded coordinates, element paths, and pixel matching. When the UI changes — which happens constantly in modern software — scripts break and need manual updates.

A more robust approach is GUI-VLA (Vision-Language-Action) models: architectures that unify visual perception (seeing the screen), language understanding (interpreting instructions), and action execution (clicking, typing, navigating) into a single framework. Instead of depending on fixed UI structures, the agent understands the interface through visual comprehension and acts accordingly.

The implication: if a piece of software has a graphical interface, an agent can potentially operate it.

From Theory to Working System

Mano-P is an open-source GUI-VLA agent model built for edge devices, released by Mininglamp Technology under the Apache 2.0 license. Its core approach: pure vision-driven GUI interaction — no DOM parsing, no system APIs, just screen understanding and action execution from screenshots.

The technical design involves three key mechanisms:

Three-stage progressive training. The model goes through supervised fine-tuning (SFT), offline reinforcement learning, and online reinforcement learning. Each stage builds on the previous one, progressively improving action accuracy and environmental robustness.

Think-act-verify reasoning loop. Before each action, the agent plans its intent. After execution, it verifies whether the result matches expectations. If the outcome deviates, the system automatically corrects course. This significantly reduces error accumulation in multi-step tasks.

Edge-optimized deployment. Through mixed-precision quantization and visual token pruning (GS-Pruning), the model runs locally on Apple M4 devices with 32GB RAM. All screenshots and task data stay on-device — no cloud calls required.

Benchmark Results

OSWorld benchmark: Mano-P 1.0-72B achieves a 58.2% success rate, ranking #1 among specialized GUI agent models — 13.2 percentage points ahead of the second-place opencua-72b (45.0%)
WebRetriever Protocol I: Mano-P 1.0 scores 41.7 NavEval, surpassing Gemini 2.5 Pro Computer Use (40.9) and Claude 4.5 Computer Use (31.3)
On-device inference: The 4B quantized model (w4a16) achieves 476 tokens/s prefill and 76 tokens/s decode on Apple M4 Pro, with only 4.3GB peak memory

What GUI Agents Actually Unlock

Once agents gain the ability to see and operate graphical interfaces, several previously impossible workflows become practical. Here are four scenarios demonstrated in the Mano-P project:

1. Fully Automated Application Building

The agent receives natural language requirements and autonomously completes the entire pipeline: requirement clarification → architecture design → code generation → local deployment → multi-level testing (API tests, LLM-based visual page inspection, and end-to-end GUI automation testing driven by VLA models). When tests fail, the system automatically diagnoses root causes, fixes code, redeploys, and retests — iterating until all test cases pass. No human intervention required. The final deliverable is a running application with complete documentation.

2. Commercial Video Production Pipeline

Starting from a user command, the system handles video generation, uploading, analysis, editing, and secondary evaluation. The agent independently operates web interfaces and editing software, performing file management, subtitle modifications, and other fine-grained GUI operations. It then generates analysis reports with both subjective assessments and objective metrics. This kind of cross-application, multi-step workflow is exactly what GUI agents enable.

3. Local On-Device Task Execution

The model runs inference directly on Mac devices (M4 chip + 32GB RAM required), breaking through the bottleneck where agent workflows previously had to pause and wait for human GUI interaction. The agent handles the entire flow autonomously, including steps that require screen-based operations.

4. Beyond Work: General-Purpose Visual Understanding

GUI vision capabilities extend beyond productivity scenarios. Through pure visual understanding of a game interface, the agent can perform tile recognition, analysis, and decision-making in Mahjong. This demonstrates the generality of the GUI-VLA approach — the same model framework applies across structured business processes and unstructured interactive environments.

What This Means for Developers

The agent ecosystem has been expanding steadily — from chat to code generation, from file management to data analysis. But the jump from "text-based assistant" to "desktop-native operator" requires a fundamentally new capability: visual understanding of graphical interfaces.

With GUI vision in place, agents are no longer limited to software that provides APIs or CLI access. Any application with a screen becomes a potential workspace.

For developers building agent-powered automation, this opens up scenarios that were previously out of reach: enterprise systems without APIs, cross-application data workflows, long-running business processes that span multiple desktop tools, and tasks that previously required a human sitting in front of a screen.

The desktop was the last frontier agents couldn't reach. That's changing.

1.6 Trillion Parameters Just Went Open Source. What About the Other Direction?

Mininglamp — Tue, 28 Apr 2026 10:54:16 +0000

On April 27, DeepSeek released its V4 model family and open-sourced the weights. The flagship V4-Pro Base has 1.6 trillion parameters (862B active), while V4-Flash comes in at 158B (Base 292B). Both use a Mixture of Experts (MoE) architecture. Within 48 hours of landing on HuggingFace, V4-Pro had already racked up 3,000+ likes and 174K downloads.

It's an impressive milestone for open-source AI. But it also crystallizes a question that's been brewing for a while: Is "bigger" the only direction AI models can go?

The case for Scaling Up

Let's be clear — Scaling Up works, and DeepSeek V4 is the latest proof.

The logic behind bigger models traces back to the Scaling Laws paper (Kaplan et al., 2020): model performance scales predictably with parameter count, dataset size, and compute. From GPT-3 (175B) to DeepSeek V3 to V4 (1.6T), each generation has pushed the ceiling higher on general reasoning, code generation, and mathematical problem-solving.

The engineering has matured too. MoE architecture is key — V4-Pro's 1.6T total parameters don't all activate at once. A routing mechanism selects which expert networks fire for each input, keeping per-inference compute manageable while retaining the knowledge capacity of a massive model. Combined with distributed inference, mixed precision, and optimized serving stacks (V4-Pro is already available on Together, Novita, Fireworks, and others), trillion-parameter models are becoming practically accessible.

None of this is hype. The results are real. For general-purpose tasks — open-ended reasoning, multilingual generation, complex code synthesis — larger models consistently outperform smaller ones.

But not every problem needs a trillion parameters

Here's where the story gets more interesting.

Running V4-Pro requires a multi-GPU cluster. Even using it through an inference API costs money per call. For high-frequency use cases — real-time interaction, continuous agent workflows, batch processing — that cost adds up fast. And for individual developers or small teams, the economics don't always work.

There are also structural constraints:

Data privacy. Cloud inference means your input data leaves your machine. For AI agent scenarios where the model needs to see your entire screen — emails, chat messages, bank statements — that's a non-trivial compliance issue.
Latency. Network round trips add delay. For agent workflows involving dozens of sequential steps (screenshot → understand → act → repeat), every millisecond of latency compounds.
Availability. No internet, no AI. But real-world use cases on airplanes, in secure facilities, or on unstable connections require AI that works offline.

These aren't criticisms of Scaling Up. They're boundary conditions that define where a different approach makes more sense.

The other direction: Scaling Out

If Scaling Up means making one model as large as possible, Scaling Out means distributing multiple smaller, specialized AI models closer to where they're actually needed — and having them collaborate.

This isn't a theoretical alternative. Several converging technical trends make it practical:

Model compression is real

Techniques like mixed-precision quantization (e.g., w4a16), visual token pruning, and knowledge distillation can shrink billion-parameter models to run on consumer hardware. On an Apple M4 chip, a 4B-parameter quantized model achieves 476 tokens/s prefill and 76 tokens/s decode, with a peak memory footprint of just 4.3GB.

Specialized models can beat general ones — in their domain

A general-purpose trillion-parameter model spreads its capacity across every conceivable task. A specialized model focuses all its parameters on one domain. In GUI automation specifically, a 4B-parameter model trained for this task has achieved #1 scores on domain benchmarks, outperforming models hundreds of times its size on the same tests.

Data sovereignty matters

When the model runs on the user's device, the data never leaves. No cloud upload, no network transmission, no third-party processing. For enterprise compliance, personal privacy, and regulated industries, this is a structural advantage that cloud-only models can't match.

Multi-agent collaboration

Instead of one giant model doing everything, multiple specialized agents can divide work — each running on different devices or nodes, communicating through standardized protocols. This architecture naturally fits the Scaling Out paradigm.

A concrete example: GUI agents on the edge

Let's make this concrete with a specific domain: GUI automation.

The task is straightforward in concept but demanding in practice: an AI agent looks at a screen, understands the interface elements, and performs operations — clicking buttons, filling forms, navigating menus — just like a human user would.

This is a natural fit for Scaling Out because:

Screen captures contain sensitive personal data — better processed locally
GUI tasks involve many sequential steps — latency accumulates
The task requires precise visual grounding and action planning, not broad general knowledge

Mano-P is an open-source project (Apache 2.0) by Mininglamp Technology that takes this approach. It's a GUI-VLA (Vision-Language-Action) agent designed for edge devices — specifically, it runs entirely on a Mac, with all data staying on the local machine.

The architecture integrates visual understanding, language reasoning, and action generation in a single end-to-end model, trained through a three-stage pipeline (SFT → offline RL → online RL) with a think-act-verify inference loop and GS-Pruning for visual token efficiency.

Published benchmark results (with evaluation framework and model specification noted):

OSWorld (72B model): 58.2% accuracy — ranked #1 (2nd place: 45.0%, a 13.2 percentage point gap)
WebRetriever Protocol I (72B model): 41.7 NavEval — ranked #1 (Gemini 2.5 Pro: 40.9, Claude 4.5: 31.3)
Edge deployment (4B quantized, w4a16): 476 tokens/s prefill, 76 tokens/s decode, 4.3GB peak memory on Apple M4

Hardware requirement: Mac with Apple M4 chip + 32GB RAM (or Mano-P Compute Stick via USB 4.0+).

The takeaway: a 4B-parameter model running locally on a Mac can achieve state-of-the-art results in its domain. Not because small models are universally better, but because the right model for the right task, deployed in the right place, can outperform a general-purpose giant.

Two tracks, one ecosystem

DeepSeek V4 pushing to 1.6 trillion parameters and a 4B model hitting #1 on GUI benchmarks are not contradictory developments. They're two sides of the same evolution in AI:

Scaling Up provides the general intelligence foundation — broad reasoning, complex generation, cross-domain capabilities
Scaling Out provides the execution layer — privacy-preserving, low-latency, offline-capable, specialized for specific tasks

The two can work together: edge models handle local tasks, and when something exceeds their scope, they call out to cloud models. This layered architecture may be closer to how AI actually gets deployed in the real world than any single-model paradigm.

For developers choosing a direction: it's not about picking the model with the most parameters. It's about picking the model that fits your constraints — compute budget, latency requirements, data sensitivity, deployment environment.

The trillion-parameter era is here. And so is the era of AI that runs on your machine.

Resources:

Mano-P (Apache 2.0): github.com/Mininglamp-AI/Mano-P

Happy 45th Birthday, GUI. Meet Your New Power User.

Mininglamp — Mon, 27 Apr 2026 11:54:15 +0000

On April 27, 1981, Xerox introduced the Star 8010 Information System — the first commercial computer with a graphical user interface.

Bitmapped display, desktop metaphor, icons, windows, mouse, WYSIWYG. Everything we take for granted about modern computing started with a $16,595 workstation that most people never used.

Today marks the 45th anniversary of that moment.

Five Milestones in 45 Years

The GUI's history can be traced through a handful of defining moments:

1981 · Xerox Star: GUI is born. The desktop metaphor becomes the foundational paradigm for human-computer interaction.
1984 · Macintosh: Apple brings GUI to the consumer market. Computing becomes visual for everyone.
1995 · Windows 95: The Start menu and taskbar. GUI becomes the global default.
2007 · iPhone: Multi-touch replaces the mouse. GUI extends from desktops to pockets.
2025–2026 · GUI Agents: AI learns to "see" screens and operate them autonomously.

The first four milestones share one constant: the user is always a human. Interface design revolves around human visual cognition — icons should be intuitive, layouts should follow natural eye movement, interactions should provide instant feedback.

The fifth milestone introduces a fundamental shift: the "user" can be an AI.

When AI Becomes the GUI Operator

Over the past two years, GUI Agents have emerged as a distinct technical direction. The core idea: train AI models to operate computers the way humans do — by looking at the screen and performing mouse/keyboard actions.

This is fundamentally different from traditional automation:

Approach	Dependency	Coverage
API/CLI	Target system must expose an API	Only apps with APIs
DOM/CDP parsing	Requires browser internals or accessible widget trees	Primarily web apps
Pure vision	None — works with any GUI	Any application with a visual interface

The vision-based approach inherits the exact principle that Xerox Star's designers articulated 45 years ago: a GUI should be self-explanatory — you should be able to understand how to use it just by looking at it. Back then, that capability belonged to humans. Now AI is developing it too.

Mano-P: A Vision-Only GUI Agent for Edge Devices

Mininglamp Technology open-sourced Mano-P under the Apache 2.0 license, taking a vision-only approach to GUI automation. Mano-P uses a GUI-VLA (Vision-Language-Action) architecture that integrates visual understanding, language reasoning, and action generation in a single end-to-end model.

Benchmark Results

OSWorld (verified, specialized model): Mano-P 72B achieves 58.2% accuracy, ranking #1 (runner-up: 45.0%)
WebRetriever Protocol I: 41.7 NavEval (ranked #1), surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3)

On-Device Performance

The 4B quantized model (w4a16) runs locally on Apple M4 Macs:

Prefill: 476 tokens/s
Decode: 76 tokens/s
Peak memory: 4.3 GB
Fully local execution — screen captures and task data never leave the device

Hardware requirements: Mac with Apple M4 chip + 32GB RAM, or any Mac with a Mano-P Compute Stick (USB 4.0).

Technical Approach

Bidirectional self-reinforcement learning (Text ↔ Action cyclic consistency)
Three-stage training: SFT → Offline RL → Online RL
Think-act-verify reasoning loop
GS-Pruning for visual token reduction, optimizing edge inference

Full Circle

Forty-five years ago, the Xerox Star taught humans to interact with computers through visual interfaces. Today, AI agents are learning to do the same thing — looking at pixels, understanding layouts, clicking buttons.

The Xerox Star was a commercial failure but a technical triumph. Its design DNA — bitmapped displays, the desktop metaphor, WYSIWYG — lives on in every Mac, PC, phone, and tablet. GUI Agents are the next chapter: the interface designed for human eyes turns out to work for AI eyes too.

The GUI hasn't changed. What changed is who's looking at the screen.

GitHub: github.com/Mininglamp-AI/Mano-P
Technical Report: arXiv:2509.17336

Mano-P is developed by Mininglamp Technology and released under the Apache 2.0 license.

AI Got Hands: Breaking the Human Bottleneck in Agent Workflows

Mininglamp — Fri, 24 Apr 2026 07:35:34 +0000

Most AI agent frameworks can browse the web. Open a URL, read some HTML, click a button, fill a form. This works because browsers expose their internals through well-defined protocols — Chrome DevTools Protocol (CDP), DOM APIs, JavaScript injection.

But here's the problem: the majority of professional work doesn't happen in a browser.

CAD engineers work in SolidWorks. Video editors work in DaVinci Resolve. Data analysts switch between Excel, custom BI dashboards, and terminal sessions. System administrators navigate native configuration panels. Designers use Figma's desktop app, Photoshop, Blender.

None of these expose a DOM. None of them speak CDP. And most of the "AI automation" ecosystem simply cannot reach them.

This article examines the three main technical approaches to GUI automation, explains why the vision-only approach matters for breaking the browser boundary, and looks at measured results on cross-application benchmarks.

Three Approaches to GUI Automation

Approach 1: CDP and HTML Parsing

The Chrome DevTools Protocol gives programmatic access to Chromium-based browsers. You can:

Query the DOM tree
Execute JavaScript in page context
Intercept network requests
Simulate clicks and keyboard input at the DOM element level

Frameworks like Playwright, Puppeteer, and most browser-based AI agents use this approach. It's precise, fast, and reliable — within its domain.

Strengths:

Pixel-perfect element targeting via CSS selectors
Access to hidden elements, shadow DOM, iframe contents
Can read and modify page state programmatically
Low latency (no screen capture needed)

Limitations:

Browser-only. CDP doesn't exist outside Chromium. Firefox has a partial equivalent; Safari's is limited. Native desktop apps, mobile apps, and OS-level UI are completely out of scope.
Site-specific fragility. CSS selectors break when websites update their markup. A class name change, a restructured component tree, or a switch from server-rendered to client-rendered content can silently break automation scripts.
SPA complexity. Modern single-page applications with dynamic rendering, lazy loading, and virtual scrolling create timing dependencies that are hard to handle reliably.
Anti-automation measures. Many sites actively detect and block CDP-based automation through bot detection, CAPTCHAs, and behavioral analysis.

For browser-based tasks, CDP is the right tool. But framing "AI automation" as "browser automation" leaves most of the desktop untouched.

Approach 2: Accessibility APIs

Operating systems provide accessibility APIs (UI Automation on Windows, Accessibility API on macOS, AT-SPI on Linux) that expose a tree of UI elements with their roles, labels, and states. Screen readers use these APIs. So can automation frameworks.

Strengths:

Works across native applications, not just browsers
Semantic information (button labels, text field values, checkbox states)
Standardized per-OS (once you handle the platform API, it works across apps)
Doesn't require visual rendering — works even on headless systems

Limitations:

Inconsistent implementation. Application developers implement accessibility support to varying degrees. A well-built macOS app might expose a complete accessibility tree. A cross-platform Electron app might expose a flat, unlabeled hierarchy. A legacy Qt application might expose nothing useful.
Custom controls are invisible. Rendered canvases (games, CAD viewports, video timelines, terminal emulators with custom rendering) don't have accessibility tree entries for their internal elements. A 3D modeling tool's viewport is a single opaque rectangle to the accessibility API.
Platform fragmentation. Each OS has its own API, data model, and quirks. Code written for macOS accessibility doesn't transfer to Windows or Linux.
Performance overhead. Querying the full accessibility tree of a complex application can be slow — hundreds of milliseconds for apps with deep hierarchies.

Accessibility APIs are genuinely useful and underappreciated in the automation space. But they have a fundamental coverage gap: they can only see what developers explicitly expose, and many interfaces — especially professional tools with custom rendering — aren't fully accessible.

Approach 3: Vision-Only Understanding

The third approach skips the application's internal representation entirely. Instead of querying DOM trees or accessibility APIs, the agent looks at what's on screen — raw pixels — and reasons about what it sees.

This is how humans interact with computers. We don't parse HTML to find the "Submit" button. We see a rectangle that looks like a button, read its label, and click it.

Strengths:

Universal coverage. If a human can see it on screen, the agent can see it. Native apps, web apps, terminals, games, remote desktops, virtual machines — all the same to a screenshot.
No application cooperation required. The agent doesn't need hooks, APIs, or special access. Screen capture is a standard OS capability.
Resilient to UI changes. A button that moves from the left sidebar to the top toolbar still looks like a button. Visual understanding is inherently more robust to layout changes than coordinate-based or selector-based targeting.
Cross-platform by default. Screenshots are screenshots, regardless of OS. The same model that automates macOS can automate Windows or Linux without platform-specific code.

Limitations:

Requires capable vision models. The agent needs to accurately parse dense UIs, read small text, distinguish between similar-looking elements, and understand spatial relationships. This is a hard computer vision problem.
Higher computational cost. Processing a full screenshot through a vision model is more expensive than querying a DOM tree. This is where model optimization and edge deployment become critical.
Occlusion and overlaps. Dropdown menus, tooltips, and modal dialogs can cover important UI elements. The agent needs to handle these states.
No hidden state access. The agent can't see what's behind a collapsed menu or in an unscrolled region. It has to navigate to make information visible, just like a human would.

The trade-off is clear: vision-only gives you universal reach at the cost of requiring a strong vision model. The question is whether today's models are good enough to make that trade worthwhile.

Breaking the Browser Boundary

Let's make this concrete. Consider a workflow that's common in any organization:

"Pull Q1 sales data from the CRM, cross-reference it with the finance spreadsheet on the shared drive, and create a summary slide deck for the Monday meeting."

A browser-based agent can maybe handle the CRM part (if it's a web app). But the finance spreadsheet might be in a native Excel window. The slide deck is in PowerPoint or Keynote. The shared drive might be mounted as a local folder or accessed through a native file manager.

This is one task that touches three or four applications. A CDP-based agent taps out after step one. An accessibility-based agent might handle two of the three but struggle with Excel's complex grid rendering. A vision-based agent can navigate all of them — it sees what you see, clicks where you'd click, types what you'd type.

The same principle applies to more specialized work:

DevOps: Switching between a terminal, a monitoring dashboard (Grafana), a cloud console (AWS), and a ticket system (Jira) — mixing web and native UIs.
Design: Moving assets between Figma, Photoshop, and a file manager, with each tool having its own UI paradigms.
Data science: Interacting with Jupyter notebooks, database GUIs, Excel, and custom visualization tools.
System administration: Navigating OS settings panels, network configuration tools, and hardware management interfaces that have no web equivalent.

These aren't edge cases. They're the normal workday for millions of professionals. The browser boundary isn't a minor limitation — it's a wall that separates "AI demo" from "AI tool."

Measured Results on Cross-Application Benchmarks

Mano-P (GUI-Aware Agent Model for Edge Devices, open-source under Apache 2.0) uses the vision-only approach. The name stands for "Mano" (Spanish for "hand") and "P" (Person & Party).

Here's the architecture:

The model takes screenshots as input and outputs action sequences — click coordinates, keystrokes, scroll directions, and multi-step plans. No DOM parsing. No accessibility tree queries. Just pixels in, actions out.

On OSWorld — a benchmark specifically designed to test agents on real desktop environments across different operating systems and applications — the results look like this:

Mano-P achieves a 58.2% success rate on OSWorld, compared to 45.0% for the second-place model. This benchmark includes tasks spanning file management, office applications, web browsing, system configuration, and multi-application workflows — exactly the kind of cross-boundary work where vision-only approaches should theoretically shine.

On web-specific benchmarks, the vision-only approach remains competitive. On WebRetriever Protocol I, Mano-P scores 41.7 NavEval, ahead of Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3). This is notable because web benchmarks should favor approaches that can access the DOM directly — yet the vision-only model still leads.

Why Vision-Only Can Win on the Web Too

This counterintuitive result — a vision model beating DOM-aware models on web tasks — has a plausible explanation.

Modern web pages are designed for human eyes, not for programmatic parsing. A typical SaaS dashboard might have:

Dynamically loaded content with JavaScript-rendered elements
Canvas-based charts and visualizations
Complex CSS layouts where the visual hierarchy doesn't match the DOM hierarchy
Shadow DOM components that hide internal structure
Iframes embedding third-party content

A DOM parser sees the structural complexity. A vision model sees the rendered result — the same clean layout the designer intended for human users. In many cases, the rendered output is actually easier to reason about than the underlying markup.

This doesn't mean vision-only is universally better for web tasks. DOM access provides exact text content (no OCR errors), hidden metadata, and element state information. But for navigation and interaction tasks — "find the settings button and change this option" — visual understanding can be more robust than structural parsing.

Running on the Edge

A vision-based agent is computationally demanding. Processing high-resolution screenshots through a vision-language model requires significant inference capacity. This is where model design and hardware optimization become critical.

Mano-P uses a 4B parameter model with w4a16 quantization (4-bit weights, 16-bit activations). On an Apple M4 Pro with 32GB RAM:

Prefill: 476 tokens/s (ingesting the screenshot and context)
Decode: 76 tokens/s (generating the action sequence)
Peak memory: 4.3 GB

These numbers mean the full perception-reasoning-action loop completes in under a second for typical interactions. The 4.3 GB memory footprint leaves plenty of room for the applications being automated to run alongside the agent.

Running locally also eliminates the latency of uploading screenshots to a cloud API. A screenshot from a 4K display can be several megabytes — sending that to a remote server for every action step adds meaningful delay, especially on typical upload speeds.

The local execution model also means screenshots and task data never leave the device. For workflows involving sensitive information — financial data, medical records, proprietary designs — this is often a hard requirement, not a nice-to-have.

The Training Challenge: Teaching a Model to See and Act

Building a vision-only agent that works across diverse applications requires solving several interconnected problems:

Visual grounding: The model must map regions of a screenshot to semantic UI elements. "The blue button in the top-right corner that says 'Save'" needs to become a precise coordinate.

Action planning: Given a goal ("rename this file to quarterly-report-v2.pdf"), the model must generate a sequence of actions: right-click the file → click "Rename" → select all text → type the new name → press Enter.

Error recovery: UI automation in real environments is noisy. Menus take time to open. Dialog boxes appear unexpectedly. Actions sometimes fail. The model needs to verify outcomes and adapt.

Mano-P addresses these through a three-stage training pipeline:

Supervised Fine-Tuning (SFT) on curated GUI interaction datasets builds foundational visual understanding and action generation.
Offline Reinforcement Learning on collected trajectories teaches multi-step planning from both successful and failed interactions.
Online Reinforcement Learning with a think-act-verify loop develops robustness — the model learns to check its work and recover from failures in live environments.

A technique called GS-Pruning (Gradient-based Structured Pruning) then compresses the model, removing redundant capacity to hit the 4B parameter target without proportional capability loss.

Implications for Agent Architecture

The vision-only approach has second-order effects on how agent systems are designed:

Simpler integration. Adding a new application to the agent's capabilities doesn't require building an adapter, writing selectors, or mapping accessibility trees. If the app has a GUI, the agent can use it.

Cross-system workflows. Tasks that span multiple applications — copying data from a web CRM into a native spreadsheet, then attaching it to an email — don't require different automation strategies for each app. The agent uses the same perception-action loop throughout.

Long-task planning. Because the agent perceives the full screen state at each step, it can maintain context across complex, multi-step workflows. The think-act-verify training means it checks whether each step succeeded before proceeding.

Reduced maintenance burden. Selector-based automation scripts break when UIs update. Vision-based automation is inherently more resilient because it relies on visual patterns rather than structural identifiers.

Current Limitations and Honest Assessment

Vision-only GUI automation is not a solved problem. Current limitations include:

Small text and dense UIs. Spreadsheets with tiny fonts, code editors with many similar-looking lines, and dashboards with packed metrics are still challenging.
Speed-sensitive interactions. Drag-and-drop, real-time canvas manipulation, and rapid sequential inputs are harder than discrete click-and-type actions.
Verification ambiguity. Sometimes it's hard to tell from a screenshot alone whether an action succeeded (e.g., a background save operation with no visual confirmation).
Training data coverage. The model performs best on application types well-represented in training data. Niche or custom enterprise software may require fine-tuning.

These are active research areas, not fundamental barriers. As vision models improve in resolution handling, temporal reasoning, and few-shot adaptation, the coverage gap will narrow.

Getting Started

Mano-P is open-source under Apache 2.0 with a three-phase release plan:

Phase 1 (released): Skills — task-specific capability modules
Phase 2: Local models and SDK — the inference runtime and integration tools
Phase 3: Training methods — the full pipeline for community extension

The code and documentation are at github.com/Mininglamp-AI/Mano-P.

If you're building agent workflows that stop at the browser boundary, it might be time to give your AI hands that can reach the rest of the desktop.

AI for Personal: How Edge-Native Agents Bring Data Sovereignty Back to Your Device

Mininglamp — Fri, 24 Apr 2026 07:34:50 +0000

When you ask a cloud-based AI agent to "summarize my last 20 emails" or "fill out this expense report from my receipts," you're making an implicit trade: convenience for control. Your screenshots, your documents, your workflow patterns — all uploaded to someone else's infrastructure, processed on someone else's GPUs, stored under someone else's data retention policy.

For many developers and enterprise users, that trade is becoming harder to justify.

This article explores the technical architecture behind running AI agents entirely on local hardware — no cloud round-trips, no data exfiltration, no API keys required — and how a 4B-parameter model running on Apple Silicon can match or exceed cloud-hosted alternatives on GUI automation benchmarks.

The Cloud Dependency Problem

Most AI agent frameworks today follow a predictable pattern:

Capture screen state (screenshot, DOM, accessibility tree)
Send it to a cloud API (OpenAI, Anthropic, Google)
Receive action instructions
Execute locally
Repeat

This works. But it has structural problems that no amount of prompt engineering can fix:

Latency compounds. Each action in a multi-step workflow requires a round-trip. A 10-step task that takes 500ms per API call adds 5 seconds of pure network overhead — before you account for token generation time on the server side.

Data leaves the device by design. Screenshots contain everything visible on screen: open tabs, notification previews, partial passwords in terminal windows, private messages, financial data. The agent doesn't selectively capture — it sees what you see.

Cost scales with usage. Vision API calls with screenshot inputs are expensive. A power user running an agent for 8 hours might generate hundreds of screenshots, each consuming thousands of tokens.

Availability depends on infrastructure you don't control. API rate limits, outages, region restrictions, and policy changes can break your workflow without warning.

None of these are hypothetical. They're the everyday reality of cloud-dependent agent architectures.

What "Edge-Native" Actually Means

Edge-native AI isn't just "smaller model on a laptop." It's a fundamentally different architecture where the entire inference loop — perception, reasoning, and action — runs on the device where the work happens.

Mano-P (GUI-Aware Agent Model for Edge Devices, open-source under Apache 2.0) is built around this principle. The name comes from "Mano" (Spanish for "hand") and "P" (Person & Party) — an agent that works with its hands, for its person.

Here's the architecture:

The key design decision: Mano-P uses vision-only understanding. It looks at screenshots — raw pixels — rather than parsing HTML, querying accessibility APIs, or injecting JavaScript into the DOM. This matters for edge deployment because:

No application-specific adapters. The same model works on browsers, native apps, terminal windows, and 3D tools.
No privilege escalation required. Screen capture is a standard OS capability. DOM injection and accessibility API access often require elevated permissions.
Reduced attack surface. The agent reads pixels. It doesn't hook into application internals.

In local mode, screenshots and task data never leave the device. There's no telemetry endpoint, no "anonymous usage data" upload, no cloud fallback. The inference happens on your hardware, and the data stays on your hardware.

Running a 4B Model on Apple Silicon

The practical question is: can edge hardware actually run a capable agent model at interactive speeds?

Here are measured numbers on an Apple M4 Pro with 32GB unified memory:

Metric	Value
Model size	4B parameters (w4a16 quantization)
Prefill throughput	476 tokens/s
Decode throughput	76 tokens/s
Peak memory	4.3 GB

Let's break down why these numbers matter.

476 tokens/s prefill means the model can ingest a screenshot (encoded as visual tokens) and the task context in well under a second. This is the "reading" phase — where the model processes what it sees on screen.

76 tokens/s decode means action generation (the "writing" phase — outputting what to click, type, or scroll) takes roughly 100-300ms for a typical action sequence. This is fast enough for real-time interaction.

4.3 GB peak memory means the model fits comfortably alongside your normal workload. On a 32GB machine, you have ~28GB left for browsers, IDEs, design tools — whatever the agent is supposed to be automating.

The w4a16 quantization scheme (4-bit weights, 16-bit activations) is the key enabler here. It reduces the model's memory footprint by roughly 4x compared to fp16, while preserving activation precision where it matters most — in the attention and reasoning layers.

Apple Silicon's unified memory architecture is particularly well-suited for this workload. There's no PCIe bottleneck between CPU and GPU memory; the model weights, the screenshot tensor, and the action output all live in the same memory space. The Neural Engine and GPU cores can be dispatched to different parts of the inference pipeline without data copies.

For machines without sufficient local compute, Mano-P also supports offloading to a compute stick connected via USB 4.0 — effectively adding a dedicated inference accelerator without changing the data sovereignty model (the stick is still physically local).

Benchmark Performance: Does Local Mean Worse?

The assumption that smaller, local models must sacrifice capability is worth testing empirically.

On OSWorld — a benchmark that tests agents on real desktop environments across operating systems — Mano-P achieves a 58.2% success rate, compared to 45.0% for the second-place model. This isn't a narrow domain-specific benchmark; OSWorld tests general GUI automation across diverse applications and multi-step workflows.

On WebRetriever Protocol I, Mano-P scores 41.7 NavEval, ahead of Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3).

These results suggest that the "edge tax" — the performance cost of running locally instead of in the cloud — can be zero or negative when the model architecture is specifically designed for the task. A 4B model trained and optimized for GUI understanding can outperform much larger general-purpose models that treat GUI automation as one capability among many.

The Training Pipeline: How a Small Model Gets Good

Model size alone doesn't explain the benchmark results. The training methodology matters more at this scale because every parameter has to earn its keep.

Mano-P's training follows a three-stage progression:

Stage 1: Supervised Fine-Tuning (SFT). The base model is trained on curated GUI interaction datasets — screenshots paired with correct action sequences. This gives the model foundational competence in visual grounding (mapping screen regions to semantic elements) and action generation.

Stage 2: Offline Reinforcement Learning. Using collected interaction trajectories, the model learns from both successful and failed attempts. This stage improves multi-step planning — the ability to reason about sequences of actions rather than reacting to each screenshot independently.

Stage 3: Online Reinforcement Learning. The model interacts with live environments and learns from real outcomes. A think-act-verify loop ensures the model checks whether its actions achieved the intended result before proceeding. This is where the model develops robustness — learning to recover from unexpected states, handle loading delays, and adapt to UI variations.

An additional technique called GS-Pruning (Gradient-based Structured Pruning) removes redundant model capacity after training, further reducing the model size without proportional capability loss. This is how you get a 4B model that punches above its weight class.

What This Enables

When an AI agent runs entirely on your device with no cloud dependency, certain use cases become possible that were previously impractical or unacceptable:

Sensitive workflow automation. Automating tasks that involve medical records, legal documents, financial data, or classified information — where uploading screenshots to a third-party API would violate compliance requirements.

Air-gapped environments. Research labs, government facilities, and financial trading floors often operate without internet access. A local agent works regardless of network state.

Consistent performance. No API rate limits, no cold starts, no "the service is experiencing high demand" degradation. The model runs at the same speed whether it's Monday morning or Friday night.

Cost predictability. The hardware is a one-time cost. There's no per-token billing, no surprise invoices, no pricing changes.

Beyond single-device automation, the core capabilities extend to cross-system data integration (working across multiple apps to consolidate information), long-task planning (breaking complex goals into executable sequences), and intelligent report generation (synthesizing information from multiple sources into structured output).

Open Source Roadmap

Mano-P is released under Apache 2.0 with a three-phase open-source plan:

Phase 1 (released): Skills — the agent's capability modules for specific task domains
Phase 2: Local models and SDK — the inference runtime and developer integration tools
Phase 3: Training methods — the full pipeline so others can train specialized models

The phased approach is deliberate. Phase 1 lets developers use and evaluate the agent immediately. Phase 2 gives them the tools to integrate it into their own products. Phase 3 enables the community to extend the model to new domains and hardware platforms.

The Bigger Picture

The shift from cloud-dependent to edge-native AI agents isn't primarily a technical argument. It's an architectural one.

Cloud APIs are shared infrastructure. They're powerful, convenient, and constantly improving. But they come with structural constraints — latency, cost, data exposure, availability — that are inherent to the architecture, not bugs to be fixed.

Edge-native agents trade cloud-scale compute for data sovereignty, predictable performance, and zero marginal cost. For many workflows — especially those involving sensitive data or requiring low-latency interaction — that's a trade worth making.

The benchmark results suggest it doesn't have to be a trade at all. A well-designed, well-trained 4B model running on consumer hardware can match or exceed cloud-hosted alternatives on practical GUI automation tasks.

The code is on GitHub: github.com/Mininglamp-AI/Mano-P

If your data matters enough to keep it on your device, your AI agent should be able to stay there too.

Apple Took 50 Years for 3 CEOs — GUI Agents Went from Paper to Production in One

Mininglamp — Wed, 22 Apr 2026 04:56:34 +0000

Yesterday, Apple announced a landmark succession: Tim Cook steps down as CEO to become Executive Chairman, with John Ternus taking over on September 1. In its 50-year history, Apple has had just three CEOs: Jobs, Cook, Ternus.

Three people. Fifty years. Each transition spaced over a decade apart.

Now consider the AI Agent space: one year ago, most people were still debating whether AI could operate a computer at all. Today, there are open-source projects delivering usable on-device solutions.

This article breaks down the technical evolution of GUI Agents — using Mano-P, our open-source project, as a concrete example of what it takes to go from training to on-device deployment.

What Is a GUI Agent?

A GUI Agent's core mission: let AI operate a computer's graphical interface the way a human does — recognizing screen elements, understanding task intent, and executing clicks, typing, and drag-and-drop operations.

There are currently two main technical approaches:

Approach	Mechanism	Strength	Limitation
API/DOM-driven	Reads interface structure via accessibility APIs or DOM trees	Precise element targeting	Depends on app-specific interfaces
Pure vision	Understands UI from screenshots alone	Works across any application	Higher demand on visual comprehension

Mano-P takes the pure vision route. Designed for Mac, it's an on-device GUI Agent — "Mano" means "hand" in Spanish, "P" stands for Person. AI for Personal. It runs entirely locally; no data leaves the device.

Training: Bidirectional Self-Reinforcement Learning

The training pipeline follows a three-stage progressive framework:

Stage 1: SFT (Supervised Fine-Tuning)
    ↓  Build foundational capabilities
Stage 2: Offline Reinforcement Learning
    ↓  Learn strategy optimization from historical data
Stage 3: Online Reinforcement Learning
    ↓  Continuously improve through real-environment interaction

Stage 1 — SFT: Supervised fine-tuning on high-quality GUI operation datasets. The model learns basic interface understanding and action mapping — ground-truth capability building.

Stage 2 — Offline RL: Uses collected interaction trajectories to optimize policies via reinforcement learning. Extracts success/failure signals from historical operations without requiring live environment interaction, keeping training costs manageable.

Stage 3 — Online RL: Interacts with real GUI environments, adjusting strategy based on live feedback. The key challenge here is balancing exploration (trying new operation paths) with exploitation (reinforcing proven strategies).

Inference: Think-Act-Verify Loop

The inference mechanism uses a think-act-verify cycle:

while task_not_complete:
    # Think: analyze current screen, plan next action
    thought = model.think(screenshot, task_context)

    # Act: execute GUI operation (click, type, scroll)
    action = model.act(thought)
    execute(action)

    # Verify: capture new screenshot, check result
    new_screenshot = capture_screen()
    verified = model.verify(new_screenshot, expected_state)

    if not verified:
        task_context.update(error_info)  # back to Think

This gives the Agent self-correction capability. In real desktop environments, unexpected popups, loading delays, and dynamic element repositioning are common — the verify step catches these before errors cascade.

Core capabilities span four areas: complex GUI automation, cross-system data integration, long-task planning and execution, and intelligent report generation.

Benchmark Performance

OSWorld: Mano-P's 72B model achieves 58.2% success rate, ranking #1 among specialized GUI agent models. Second place scores 45.0%. OSWorld simulates real OS environments with cross-application tasks including file operations, browser interactions, and office software workflows.

WebRetriever Protocol I: Scores 41.7 NavEval, surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3). This benchmark focuses on web information retrieval and interaction.

Edge Deployment: 4B Model Running On-Device

On-device deployment is a core feature of Mano-P. Here's the 4B quantized model (w4a16) performance on M4 Pro:

Metric	Value
Prefill Speed	476 tokens/s
Decode Speed	76 tokens/s
Peak Memory	4.3 GB

The w4a16 quantization scheme — 4-bit weights with 16-bit activations — strikes a practical balance: 4-bit weights dramatically reduce memory footprint while 16-bit activations preserve numerical precision during inference.

Hardware requirement: Apple M4 chip + 32 GB RAM. Fully local execution — your screen data never leaves your device.

Getting Started

Open-sourced under the Apache 2.0 license:

# Install
brew tap HanningWang/tap && brew install mano-cua

GitHub: https://github.com/Mininglamp-AI/Mano-P

Wrapping Up

From the three-stage progressive training framework, to think-act-verify inference, to w4a16 quantization enabling edge deployment — the path from "concept" to "locally usable" GUI Agents is becoming clear.

Apple took 50 years and three leaders. The GUI Agent space went from academic papers to open-source tools in roughly one year. These are two fundamentally different timescales.

For developers, Mano-P — Apache 2.0 licensed, runnable on a local Mac — is already a starting point for exploration and experimentation.

DEV Community: Mininglamp

Three Open-Source Projects That Turn Your Mac Into a Private AI Workstation

1. Mano-P: The Agent That Sees Your Screen

Why This Matters

2. Cider: Inference Acceleration for Apple Silicon

The Numbers

Why Not Just Use MLX?

3. Mano-AFK: The Autonomous App Builder

What It's Good For

The Stack: Model → Accelerator → Builder

Hardware Requirements

The Bigger Picture

Get Started

Agent vs Skill vs MCP vs Tool: The 4-Layer Stack Every AI Developer Should Know

The Terminology Problem

The 4-Layer Stack

Layer 1: Tools — The Atoms

Layer 2: MCP (Model Context Protocol) — The Connectors

Layer 3: Skills — The Playbooks

Layer 4: Agent — The Decision-Maker

How the Layers Compose

A Real-World Example: Mano-P

When to Use What

Common Architecture Smells

Summary

Why One Giant Model Ruling Everything Is a Bad Idea

The Narrative Everyone Accepted Without Questioning

The Internet Is Changing at the Infrastructure Level

Why Scaling Up Alone Is Structurally Risky

Scaling Out: A Different Architectural Bet

MOA vs. MoE: The Difference That Matters

The Bigger Picture: Democratized AI Research

Where Does This Leave Us?

The HN Post That Got 1,700 Upvotes: Local AI Needs to Be the Norm.Why "Local AI" Just Became the Default for Developers

The HN Post That Got 1,700 Upvotes: Local AI Needs to Be the Norm

The Cloud Assumption Is Cracking

Why Developers Care About Local

The Ecosystem That Made It Possible

Apple's Bet Tells You the Direction

From Local Models to Local Agents

The Technical Puzzle of On-Device Agents

Why Vision-Only Matters

The Convergence

Where We're Putting Our Work

Full-Stack On-Device GUI Agent — Mano-P Model + Cider + AFK, All Open Source

Full-Stack On-Device GUI Agent — Mano-P Model + Cider + AFK, All Open Source

Introduction

What is Mano-P

Mano-P 1.0-4B Local Model

Cider — INT8 Activation Quantization SDK for MLX

Why Cider Exists

Supported Modes

Performance (Apple M5 Pro)

Compatibility

Conditional Compilation

Mano-AFK — End-to-End App Builder

CUA Benchmark

Getting Started

Open Source Roadmap

Links

Dual Launch! Mininglamp Technology Open-Sources Cider On-Device Inference Acceleration Framework and Mano-P On-Device Model

Mano-P: Validating the Deployment Potential of On-Device Agents

Cider: An On-Device Inference Acceleration Framework for Apple Silicon

Minimal Integration, Enabling Local Acceleration for More Open-Source Models

Toward Private AI: Building Local Intelligence Infrastructure

Complex UIs, Cross-App Workflows, Long Tasks: What GUI Agents Actually Unlock

The GUI Gap in Agent Capabilities

Why GUI Vision Is the Missing Piece

From Theory to Working System

Benchmark Results

What GUI Agents Actually Unlock

1. Fully Automated Application Building

2. Commercial Video Production Pipeline

3. Local On-Device Task Execution

4. Beyond Work: General-Purpose Visual Understanding

What This Means for Developers

1.6 Trillion Parameters Just Went Open Source. What About the Other Direction?

The case for Scaling Up

But not every problem needs a trillion parameters

The other direction: Scaling Out