When teams start building AI agents, most of the early energy goes into prompts, models, and tool definitions. Which model should we use? How do we structure the tool-calling loop? What's the right retry strategy?
These are all reasonable questions. But there's another question that usually shows up late — often too late — and shapes everything else:
Where should your AI agent actually run?
The execution environment isn't just an infrastructure detail. It determines what your agent can and can't access, how sensitive data moves (or doesn't), what hardware costs look like at scale, and how much your users are willing to trust the system. Get this decision right early, and a lot of other choices fall into place naturally. Get it wrong, and you're refactoring core architecture six months in.
Let's walk through the three main approaches.
Environment 1: Cloud Sandbox
The most common starting point for agent deployment today is the cloud sandbox model. You spin up an isolated virtual machine or container in the cloud — services like E2B, Modal, or Manus handle the orchestration — and your agent operates entirely within that environment.
How it works
When a task arrives, the platform provisions a clean runtime (often in seconds). The agent gets a shell, a browser, maybe a filesystem and some pre-installed tools. It executes its plan, produces output, and the environment is torn down. From the agent's perspective, it has a full operating system to work with. From the infrastructure perspective, nothing persists between runs unless you explicitly pass state.
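To make the provision, run, tear-down lifecycle concrete, here is a minimal local stand-in. It is a sketch under loud assumptions: a temporary directory plus a subprocess is not real isolation (no VM, no container, no network policy), and it does not use any sandbox provider's API. The function name `run_in_ephemeral_workspace` is hypothetical. What it does show is the "nothing persists unless you explicitly pass state" contract.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_ephemeral_workspace(code: str, timeout_s: float = 30.0) -> tuple[str, Path]:
    """Run `code` in a fresh scratch directory, then delete that directory.

    Returns the captured stdout and the path of the (now deleted) workspace.
    Assumption: this only illustrates the lifecycle; it is NOT a security boundary.
    """
    with tempfile.TemporaryDirectory(prefix="agent-run-") as workdir:
        result = subprocess.run(
            [sys.executable, "-c", code],
            cwd=workdir,           # the task starts in its own scratch directory
            capture_output=True,   # stdout is the only state we carry out
            text=True,
            timeout=timeout_s,
            check=True,            # raise if the task exits non-zero
        )
        workspace = Path(workdir)
    # Leaving the `with` block removed the directory and everything in it.
    return result.stdout, workspace
```

Anything the task wrote to disk is gone once the function returns; only the returned stdout survives, which mirrors how a sandbox run hands back explicit output and nothing else.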
What it's good at
Cloud sandboxes shine when the work is web-native. Scraping, form submission, browser automation, API interactions — anything that lives on the public internet is fair game. The isolation model is also excellent for security: if an agent misbehaves or encounters a malicious input, the blast radius is contained to a throwaway VM.
Scalability is another genuine strength. You can run dozens or hundreds of concurrent agent sessions without worrying about resource contention on a shared machine. For demos, CI pipelines, and batch processing workflows, this is hard to beat.
The real constraints
The limitations become visible when your actual work isn't web-native.
Cloud agents can't open your Excel spreadsheet, interact with your internal ERP, or paste results into the desktop app your ops team uses every day. They operate in a synthetic environment, not your environment. Any data that needs to flow into the agent (files, credentials, internal documents) has to leave your machine first.
For many enterprise workflows, that data boundary is the dealbreaker. Sending sensitive customer data or internal business records to a third-party cloud runtime creates compliance exposure that legal teams won't sign off on. And even when data sensitivity isn't the concern, there's a latency and cost dimension: every session spins up a billable runtime, and for long-running tasks the economics can get uncomfortable.
Best fit
Cloud sandboxes are the right choice for: web-only automation, exploratory prototyping, public-data tasks, and workloads where horizontal scale matters more than local access.
Environment 2: Local GUI Agent
Local GUI agents work on a different model entirely. Instead of operating inside a synthetic cloud environment, the agent runs directly on a real desktop — your Mac, your Windows workstation, your on-premises server. It sees the actual screen. It interacts with actual apps. It operates in the environment where your work already lives.
How it works
The agent captures the screen (via screenshots, accessibility APIs, or both), reasons about what it sees, and produces actions — mouse clicks, keyboard input, application-specific commands. The entire loop happens locally: perception, reasoning, action, and observation.
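The loop described above can be sketched in a few lines. This is an illustrative skeleton, not Mano-P's or any framework's actual implementation: `capture`, `decide`, and `execute` are hypothetical callables you would back with a screenshot or accessibility API, a model call, and an input-injection layer respectively, and the `Action` shape is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str      # hypothetical action names, e.g. "click", "type", "done"
    payload: dict  # action arguments, e.g. coordinates or text


def run_agent_loop(
    capture: Callable[[], bytes],                     # perception: grab the screen
    decide: Callable[[bytes, list[Action]], Action],  # reasoning: pick the next action
    execute: Callable[[Action], None],                # action: click, type, ...
    max_steps: int = 20,
) -> list[Action]:
    """Run perceive -> reason -> act until the model says "done" or steps run out."""
    history: list[Action] = []
    for _ in range(max_steps):
        observation = capture()
        action = decide(observation, history)
        if action.kind == "done":
            break
        execute(action)
        history.append(action)  # the next capture() observes the effect of this action
    return history
```

The `max_steps` cap is a deliberate design choice: a GUI agent that misreads the screen can otherwise loop indefinitely on a machine someone may want back.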
This architecture requires more from the hardware, but it also removes entire categories of constraint. If you can do it by hand on your computer, a local GUI agent can learn to do it too.
What it's good at
The primary advantage is full environment access. Cross-application workflows — copy from a PDF, paste into a spreadsheet, trigger a report in your accounting software, email the result — are natural fits. These tasks are awkward or impossible in cloud sandboxes but routine for local agents.
Data locality is the other major win. When the model and the agent runtime both live on-device, sensitive information never leaves the machine. There's no outbound API call carrying your customer records. Compliance teams have a much easier conversation. For industries with strict data residency requirements — healthcare, finance, defense — local execution isn't just convenient, it's sometimes the only path forward.
There's also an economics angle worth noting. Local models, once running on capable hardware, carry no per-token fee; the marginal cost of an inference is mostly electricity. A cloud-based agent making hundreds of tool calls per session has per-token costs that add up. A local agent on good hardware has roughly fixed compute costs regardless of session count.
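A quick back-of-the-envelope calculation shows where the crossover sits. Every number below is a made-up placeholder, not a quoted price, and the model deliberately ignores electricity, depreciation, and maintenance on the local side.

```python
def cloud_cost_usd(sessions: int, tokens_per_session: int, usd_per_million_tokens: float) -> float:
    """Total metered cost of running `sessions` agent sessions on a per-token API."""
    return sessions * tokens_per_session * usd_per_million_tokens / 1_000_000


def break_even_sessions(hardware_usd: float, tokens_per_session: int, usd_per_million_tokens: float) -> float:
    """Number of sessions after which a one-off hardware purchase beats per-token billing."""
    per_session_usd = tokens_per_session * usd_per_million_tokens / 1_000_000
    return hardware_usd / per_session_usd


# Hypothetical inputs: a 3,000 USD machine, 200k tokens per session, 5 USD per million tokens.
sessions_needed = break_even_sessions(3_000, 200_000, 5.0)
```

With those placeholder inputs each cloud session costs 1 USD, so the hardware pays for itself after 3,000 sessions. Plug in your own token counts and prices; the shape of the calculation matters more than these figures.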
[Figure: Mano-P's architecture. Local model inference, screen perception, and action execution all happen on-device.]
The real constraints
Local GUI execution has real requirements. You need hardware that can run a sufficiently strong model, ideally something with a good GPU or a high-bandwidth unified memory architecture (modern Apple Silicon machines, for instance, are well-suited for this). During agent execution, the screen is occupied. If your workflow involves a human using the same machine simultaneously, you'll need to think about scheduling.
And there's a tooling maturity gap. Cloud sandbox providers have years of polished developer experience. Local GUI agent frameworks are newer, and the rough edges show. Documentation is spottier, error handling is less standardized, and debugging a "the agent clicked the wrong button" failure requires different muscle memory than debugging a web automation script.
Best fit
Local GUI agents belong in: enterprise desktop automation, privacy-sensitive workflows, cross-application tasks, long-running automations where per-inference cost matters, and any environment where data residency is non-negotiable.
Environment 3: Hybrid
The hybrid model tries to get the best of both. The most common configuration is a cloud-hosted reasoning layer (the "brain") combined with local execution capabilities (the "hands"). The model runs remotely; actions execute locally. Alternatively: a local model handles most reasoning, with cloud fallback for tasks requiring more capacity.
How it works
In the cloud-brain/local-hands pattern, tool calls route through a local daemon that has access to the desktop environment. The model sees a clean API; the local runtime translates high-level actions into actual screen interactions. In the local-brain/cloud-fallback pattern, a capable local model handles the majority of reasoning, escalating to a remote model when confidence is low or the task is out of distribution.
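Here is a minimal sketch of the local-brain/cloud-fallback routing. It assumes the local model can report a confidence score in [0, 1], which is not a given for every model; `Draft`, `plan_step`, and the 0.7 threshold are all hypothetical and would need tuning against real tasks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Draft:
    action: str
    confidence: float  # assumed: a score in [0, 1] reported by the local model


def plan_step(
    task: str,
    local_model: Callable[[str], Draft],
    remote_model: Callable[[str], str],
    min_confidence: float = 0.7,  # hypothetical threshold
) -> tuple[str, str]:
    """Return (action, source), where source is "local" or "remote"."""
    draft = local_model(task)
    if draft.confidence >= min_confidence:
        return draft.action, "local"
    # Low confidence: escalate. Note that this is the point where task data
    # leaves the machine, so it may need its own policy check.
    return remote_model(task), "remote"
```

Returning the source alongside the action makes it easy to log how often you escalate, which is the number that tells you whether the local model is carrying its weight.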
What it's good at
The main benefit is flexibility. Teams that need to handle a wide range of task types, some web-native and some desktop-native, can do so without maintaining two completely separate pipelines. Hybrid architectures also make it easier to right-size compute: fast local models for simple reasoning, large remote models for complex planning.
The real constraints
Complexity is the honest cost of hybrid. Two environments mean two failure domains, two latency contributions, two sets of credentials to manage. The seam between cloud reasoning and local action introduces a synchronization challenge — what happens when the cloud model issues an action that the local daemon can't execute because the target application isn't open? These edge cases are manageable, but they require deliberate design.
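One way to handle the "target application isn't open" case is to have the daemon check preconditions and return a structured result instead of failing silently. The sketch below is one possible design, not a description of any existing daemon: `LocalDaemon`, `ActionResult`, and the injected `is_app_open` and `perform` callables are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ActionResult:
    ok: bool
    error: str = ""  # machine-readable reason the remote model can react to


class LocalDaemon:
    """Receives high-level actions from the remote model and runs them locally."""

    def __init__(
        self,
        is_app_open: Callable[[str], bool],    # hypothetical OS-level check
        perform: Callable[[str, str], None],   # hypothetical action executor
    ) -> None:
        self._is_app_open = is_app_open
        self._perform = perform

    def handle(self, app: str, command: str) -> ActionResult:
        # Check the precondition before touching the desktop.
        if not self._is_app_open(app):
            return ActionResult(ok=False, error=f"precondition_failed: {app} is not open")
        self._perform(app, command)
        return ActionResult(ok=True)
```

Feeding the error string back to the model as the tool result lets it replan (for example, by opening the application first) rather than continuing on a stale picture of the desktop.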
For teams just getting started, hybrid is often premature optimization. Pick one environment, get it working well, and evolve toward hybrid when a specific need drives it.
How to Choose: A Decision Framework
Rather than declaring a universal winner, here's a practical checklist:
| Question | If Yes → | If No → |
|---|---|---|
| Does the task require local app access? | Local GUI | Cloud Sandbox |
| Is data leaving the machine a compliance concern? | Local GUI | Either |
| Do you need to scale to 100+ concurrent sessions? | Cloud Sandbox | Either |
| Is the task entirely web-based? | Cloud Sandbox | Local GUI |
| Do you have capable local hardware? | Local GUI viable | Cloud Sandbox |
| Are you building a demo or prototype? | Cloud Sandbox | Consider Local |
| Does the workflow span multiple desktop apps? | Local GUI | Either |
A simpler heuristic: if the task touches local files, local apps, or sensitive data, start with local GUI. If it's web-only and needs to scale, start with cloud sandbox. Move to hybrid when the seam becomes visible and worth engineering.
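The simpler heuristic translates almost directly into code. This is a sketch of the rule of thumb only; the `Task` fields and return labels are hypothetical names, and cases the heuristic doesn't cover are returned as "undecided" rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class Task:
    touches_local_files_or_apps: bool
    handles_sensitive_data: bool
    web_only: bool
    needs_scale: bool


def starting_environment(task: Task) -> str:
    """Apply the rule of thumb: local concerns win first, then web-only at scale."""
    if task.touches_local_files_or_apps or task.handles_sensitive_data:
        return "local_gui"
    if task.web_only and task.needs_scale:
        return "cloud_sandbox"
    return "undecided"  # not covered by the heuristic; fall back to the table


# Example: invoice processing that reads files from a desktop accounting app.
choice = starting_environment(Task(True, True, False, False))
```

Checking local access and data sensitivity first is intentional: those are the constraints that are hardest to engineer around later, whereas scale can usually be added.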
A Note on Mano-P
We've been building in this space at MiningLamp Technology with Mano-P, an open-source local GUI agent (Apache 2.0). A few specifics that might be useful context for the discussion above:
On the benchmark side, Mano-P's 72B evaluation configuration ranks #1 in the proprietary model category on OSWorld with a 58.2% task completion rate. The open-source release is the 4B quantized version, optimized for real-world on-device deployment.
[Figure: OSWorld benchmark results. The Mano-P 72B evaluation configuration leads the proprietary category at 58.2%; the open-source 4B version is what developers actually deploy.]
On the hardware side, Mano-P 1.0-4B running on Apple M5 Pro (64GB, Cider SDK) achieves ~80 tokens/s decode with W8A16 quantization; W8A8 activation quantization speeds up prefill by ~12.7% (source: README Performance Evaluation). The minimum requirement is an M4 chip with 32GB RAM — consumer-grade hardware that makes local agent execution realistic.
The project is on GitHub if you want to dig into the architecture or try it locally: https://github.com/Mininglamp-AI/Mano-P

