<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mininglamp</title>
    <description>The latest articles on DEV Community by Mininglamp (@mininglamp).</description>
    <link>https://dev.to/mininglamp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3846168%2F6a138840-d665-4ba6-aedf-1b5c492035c4.png</url>
      <title>DEV Community: Mininglamp</title>
      <link>https://dev.to/mininglamp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mininglamp"/>
    <language>en</language>
    <item>
      <title>Mano-CUA 2.0: After a Year of Building a 4B GUI Agent, We Found the Bottleneck Was Never Model Size</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Mon, 29 Jun 2026 07:12:14 +0000</pubDate>
      <link>https://dev.to/mininglamp/mano-cua-20-after-a-year-of-building-a-4b-gui-agent-we-found-the-bottleneck-was-never-model-size-2mc5</link>
      <guid>https://dev.to/mininglamp/mano-cua-20-after-a-year-of-building-a-4b-gui-agent-we-found-the-bottleneck-was-never-model-size-2mc5</guid>
      <description>&lt;p&gt;Mano-P is an open-source project we've been working on. It runs a 4B-parameter vision-language model on a MacBook, controlling the computer by watching screenshots. Clicking, typing, hotkeys — the model looks at each screenshot and decides what to do next. Everything stays on-device, no cloud API calls. We shipped 1.0 a while back and recently iterated to 2.0, re-running our 100 real macOS GUI task benchmark. Some of the results surprised us.&lt;/p&gt;

&lt;p&gt;We settled on 4B mostly because of the 16GB MacBook memory constraint. We tried general-purpose VL models too — on our benchmark they scored around 39%, with issues around Chinese input focus, non-browser app support, and step-limit truncation on longer tasks. General VL capability and GUI agent capability turned out to be pretty different things. Internally we also debated MoE, but on-device inference is memory-bound more than compute-bound, and MoE keeps all expert weights resident in memory. That's a bad trade on a Mac. We stayed on MLX — it maps directly to Apple Silicon's unified memory, which is about as good as it gets for local inference on these chips.&lt;/p&gt;

&lt;p&gt;That's the background. The interesting part is what happened in 2.0.&lt;/p&gt;

&lt;h2&gt;Chinese GUI went way up&lt;/h2&gt;

&lt;p&gt;We'd been assuming 4B models just couldn't handle GUI agents well — too small. After finishing 2.0, we realized that wasn't the story. The first real bottleneck was Chinese GUI training data.&lt;/p&gt;

&lt;p&gt;Version 1.0 was pretty bad at Chinese UI elements. The model kept making character-level recognition errors, confusing one Chinese character for a visually similar one. These mistakes sound minor, but for a GUI agent they're fatal. If you can't find the button, nothing downstream matters.&lt;/p&gt;

&lt;p&gt;English interfaces didn't have this problem. "Settings" and "Settinas" look very different. Chinese is different: characters have higher visual similarity to each other, and buttons like "确定" or "取消" pack dense strokes into a small area, making them harder for the model to distinguish. At the time we thought this was a 4B capacity issue: Chinese characters are just harder than English letters, and small models can't handle them.&lt;/p&gt;

&lt;p&gt;After adding more Chinese GUI training data in 2.0, it turned out that wasn't the case. The fix wasn't fancy — just increasing the volume of Chinese interface screenshots in the training set. The results showed up immediately in the enterprise IM category: 33% on 1.0, 83% on 2.0. WeChat went from 33% to 67%.&lt;/p&gt;

&lt;p&gt;Fifty percentage points. Looking back, 1.0's data coverage was simply insufficient. This was solvable with more data — no architecture changes, no larger model needed. We'd spent a lot of time debating model selection when the bottleneck was in the data. Kind of embarrassing, honestly.&lt;/p&gt;

&lt;h2&gt;Browser tasks went down&lt;/h2&gt;

&lt;p&gt;Version 2.0 dropped from 74% to 68% on browser and web tasks. When this came up internally there was some debate about whether we'd overdone the Chinese data.&lt;/p&gt;

&lt;p&gt;Looking at the breakdown, the decline was concentrated in English web element recognition. Chinese web tasks barely changed. After rebalancing the training mix toward Chinese, the model's sensitivity to English UI elements dropped a bit.&lt;/p&gt;

&lt;p&gt;The sample sizes aren't huge: with 31 tasks in this category, 74% versus 68% works out to roughly two tasks, so we don't think this represents an actual capability regression. But it exposed a problem we haven't solved: balancing Chinese and English GUI data. The 4B model has limited capacity, so if you cram too much in, things interfere with each other. Right now we're doing volume-based control where more Chinese means less English. Ideally both would be sufficient, but at this model size that's not achievable.&lt;/p&gt;

&lt;p&gt;Short term we're planning finer-grained sampling strategies, weighting by task type and UI complexity rather than just language. Whether that works, we're not sure. This might need a larger model to have enough room.&lt;/p&gt;
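
&lt;p&gt;One possible shape for that kind of sampling, as a minimal sketch only: each training example gets a weight from its task type and UI complexity, and batches are drawn in proportion to that weight. The weight tables, field names, and example records are all hypothetical and do not come from the Mano-P training pipeline.&lt;/p&gt;

```python
import random

# Hypothetical weight tables: none of these values come from the project.
TASK_TYPE_WEIGHT = {"browser": 1.0, "im": 1.5, "office": 1.2}
COMPLEXITY_WEIGHT = {"low": 0.5, "medium": 1.0, "high": 2.0}


def sample_weight(example):
    """Weight an example by task type and UI complexity, not by language."""
    return TASK_TYPE_WEIGHT[example["task_type"]] * COMPLEXITY_WEIGHT[example["complexity"]]


def draw_batch(examples, batch_size, seed=0):
    """Draw a training batch with replacement, proportional to sample_weight."""
    rng = random.Random(seed)
    weights = [sample_weight(e) for e in examples]
    return rng.choices(examples, weights=weights, k=batch_size)


# Hypothetical examples; the language field is kept but not used for weighting.
examples = [
    {"id": "en-web-1", "lang": "en", "task_type": "browser", "complexity": "low"},
    {"id": "zh-im-1", "lang": "zh", "task_type": "im", "complexity": "high"},
    {"id": "zh-doc-1", "lang": "zh", "task_type": "office", "complexity": "medium"},
]

batch = draw_batch(examples, batch_size=8)
print([e["id"] for e in batch])
```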

&lt;h2&gt;Long-horizon tasks still stuck at 30%&lt;/h2&gt;

&lt;p&gt;Both 1.0 and 2.0 scored 30% on long-horizon tasks — those requiring 10+ steps. No improvement. Cross-app tasks went from 0% to 20%, passing 1 out of 5.&lt;/p&gt;

&lt;p&gt;From what we've tested, this looks more like a 4B capacity constraint. Tasks with 10+ steps require maintaining context throughout, with each step's results feeding correctly into the next decision. The 4B model's working memory is limited — by step 10, it's already fuzzy on what happened in the first few steps. We haven't found a way to push past this with data alone.&lt;/p&gt;

&lt;p&gt;Cross-app is even harder. Find a file someone sent in WeChat, switch to Finder, save it to the desktop. One window switch and the 4B model tends to lose the previous context. We expect a larger model would help here, but haven't verified it.&lt;/p&gt;

&lt;p&gt;Internally we're weighing two paths. One is adding more reasoning-chain training data to the 4B and seeing how far that goes. The other is building a larger model — 7B or 8B — that would run on an M5 Pro with 64GB. Haven't decided. Honestly still not sure which path is worth the investment.&lt;/p&gt;

&lt;h2&gt;Cider: the latency problem&lt;/h2&gt;

&lt;p&gt;Everything above is about model capability. Cider solves something different — users can't wait.&lt;/p&gt;

&lt;p&gt;GUI agents have an unusual performance characteristic. After every action, you take a fresh screenshot and feed the whole image's tokens through the model. That prefill stage directly determines how long the user waits. On an M5 Pro, MLX's native W8A16 mode gives us 2.839 seconds for prefill.&lt;/p&gt;

&lt;p&gt;2.8 seconds doesn't sound like much. But over a 10-step task, waiting nearly 3 seconds at each step adds up to half a minute. Users feel that as sluggish. Decode speed isn't the issue — 80 tokens/s is plenty for generating action commands.&lt;/p&gt;

&lt;p&gt;The problem was that MLX doesn't ship with online activation quantization operators. Weights are static and can be quantized offline. But activations are dynamic — every screenshot produces different intermediate values going through the network. MLX doesn't provide that capability, so we wrote our own.&lt;/p&gt;

&lt;p&gt;The trickiest design decision was quantization granularity. Too coarse and outlier activation values drag down overall accuracy. Too fine and compute overhead goes up. We landed on per-token granularity — each token's activation vector gets its own quantization parameters computed independently. More overhead than per-tensor, but accuracy loss stays manageable.&lt;/p&gt;
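
&lt;p&gt;To make the granularity trade-off concrete, here is a minimal pure-Python sketch of symmetric int8 quantization at per-tensor versus per-token granularity. It is an illustration only, not Cider's actual kernel; the function names and the toy activation values are assumptions made for this example.&lt;/p&gt;

```python
def quantize_row(row, scale):
    """Symmetric int8 quantization of one vector with a given scale."""
    return [max(-127, min(127, round(v / scale))) for v in row]


def dequantize_row(q_row, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in q_row]


def absmax_scale(values):
    """Scale that maps the largest magnitude onto 127 (1.0 if all zeros)."""
    peak = max(abs(v) for v in values)
    return peak / 127 if peak > 0 else 1.0


def mean_abs_error(activations, scales):
    """Average round-trip error when row i is quantized with scales[i]."""
    total, count = 0.0, 0
    for row, scale in zip(activations, scales):
        restored = dequantize_row(quantize_row(row, scale), scale)
        total += sum(abs(a - b) for a, b in zip(row, restored))
        count += len(row)
    return total / count


def per_tensor_error(activations):
    """One scale shared by every token: an outlier token hurts all the others."""
    scale = absmax_scale([v for row in activations for v in row])
    return mean_abs_error(activations, [scale] * len(activations))


def per_token_error(activations):
    """One scale per token (row): each token is quantized independently."""
    return mean_abs_error(activations, [absmax_scale(row) for row in activations])


# Toy activations (hypothetical): the second token contains a large outlier.
activations = [
    [0.02, -0.05, 0.11, 0.07],
    [0.30, 48.0, -0.90, 0.15],
    [-0.04, 0.09, 0.01, -0.12],
]

print("per-tensor error:", per_tensor_error(activations))
print("per-token error: ", per_token_error(activations))
```

&lt;p&gt;With a single shared scale, the outlier forces the small values in the other tokens to round to zero; per-token scales confine that damage to the token that contains the outlier.&lt;/p&gt;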

&lt;p&gt;Results: W8A8 prefill dropped from 2.839s to 2.519s, about 11.3% less prefill time (equivalently, about 12.7% higher prefill throughput). Peak memory also went down, which matters for GUI agents that need to coexist in memory with the user's other applications. We haven't measured the exact memory savings yet.&lt;/p&gt;
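
&lt;p&gt;The improvement can be expressed either as a latency reduction or as a throughput gain, and a few lines of arithmetic show both. The sketch below uses only the two prefill timings quoted in this post; the 10-step task length is the illustrative figure used earlier, and decode time is ignored.&lt;/p&gt;

```python
W8A16_PREFILL_S = 2.839  # MLX native W8A16 prefill time quoted in the post
W8A8_PREFILL_S = 2.519   # W8A8 prefill time quoted in the post
STEPS = 10               # illustrative task length, one prefill per step

# Fraction of prefill time removed per step.
latency_reduction = (W8A16_PREFILL_S - W8A8_PREFILL_S) / W8A16_PREFILL_S
# Relative increase in prefill throughput (the 12.7% figure).
throughput_gain = W8A16_PREFILL_S / W8A8_PREFILL_S - 1
# Total prefill wait removed over a whole task.
saved_per_task = (W8A16_PREFILL_S - W8A8_PREFILL_S) * STEPS

print(f"latency reduction: {latency_reduction:.1%}")
print(f"throughput gain:   {throughput_gain:.1%}")
print(f"saved over {STEPS} steps: {saved_per_task:.2f}s")
```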

&lt;p&gt;M5 chips have hardware acceleration so the improvement is significant. M4 and below fall back to pure Python, which gives limited speedup. We evaluated optimizing specifically for M4 and decided the return wasn't worth the effort. Cider isn't Mano-P-specific — any MLX model can use it.&lt;/p&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;Fuzzy descriptions and cross-app tasks are the two clearest weak points, both pointing at the 4B model's reasoning capacity. We'll likely build a larger version. Cider continues with stability and compatibility work.&lt;/p&gt;

&lt;p&gt;Full category breakdown below. Test hardware: MacBook Pro M5 16GB. Cloud model: Claude Sonnet 4.5. Local model: Mano-CUA-4B W8A16.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Tasks&lt;/th&gt;
&lt;th&gt;1.0&lt;/th&gt;
&lt;th&gt;2.0&lt;/th&gt;
&lt;th&gt;Cloud&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Browser/Web&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;68%&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise IM&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;33%&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WeChat&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;33%&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WPS/Office&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System Settings&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notes/Reminders&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File Management&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;43%&lt;/td&gt;
&lt;td&gt;43%&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System Utilities&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-horizon&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-app&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fuzzy descriptions&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No open hint&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>beginners</category>
      <category>security</category>
    </item>
    <item>
      <title>Mininglamp Technology Officially Open-Sources Octo: Building a New-Generation Platform for Human-AI Agent Collaboration</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Mon, 29 Jun 2026 07:02:55 +0000</pubDate>
      <link>https://dev.to/mininglamp/mininglamp-technology-officially-open-sources-octo-building-a-new-generation-platform-for-human-ai-4cb6</link>
      <guid>https://dev.to/mininglamp/mininglamp-technology-officially-open-sources-octo-building-a-new-generation-platform-for-human-ai-4cb6</guid>
      <description>&lt;p&gt;Today, Mininglamp Technology officially releases Octo — the first open-source, trustworthy Agent collaboration network that pioneers a new paradigm for human-AI teamwork. Octo supports private deployment, returning data and knowledge sovereignty to enterprises and users. By transforming isolated AI Agents into coordinated, orchestrable, and tasteable organizational digital workforce, Octo turns every human-machine collaboration into a node for compounding organizational assets, driving continuous evolution of Agents and systems under human judgment calibration.&lt;/p&gt;

&lt;p&gt;As more intelligent agents emerge in personal devices and organizational workflows, new challenges arise: When everyone has their own AI assistant, when digital workforce proliferates within organizations, how should they connect, collaborate, and share critical context? How should they accept human judgment and calibration at key decision points?&lt;/p&gt;

&lt;p&gt;Mininglamp believes the core challenge for AI Agents in the next phase is not endlessly scaling model parameters or building a single super-agent, but enabling different Agents to work together in the same network. What Octo aims to build is precisely "the internet between Agents."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Octo repository:&lt;/strong&gt; &lt;a href="https://github.com/Mininglamp-OSS" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-OSS&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;From Personal Assistant to Organizational Collaboration Network&lt;/h2&gt;

&lt;p&gt;In traditional AI tool usage, Agents typically exist as isolated silos. They maintain separate memories, execute independently, and lack unified collaboration interfaces and task flow mechanisms, making it difficult to accumulate capabilities, reuse experience, or truly scale AI adoption across organizations.&lt;/p&gt;

&lt;p&gt;Octo breaks this deadlock. Through collaboration architectures like Channels and Threads, Octo builds a foundational network for humans and AI — as well as AI and AI — to work together. A Channel is essentially a project workgroup where humans and Bots can align intentions and dispatch tasks in real-time.&lt;/p&gt;

&lt;p&gt;When a Channel contains multiple discussion topics, both humans and Agents can create multiple Threads within it to focus on specific subjects, ensuring concrete work threads don't get washed away by information noise, guiding discussions toward natural convergence.&lt;/p&gt;

&lt;p&gt;In Octo, AI Agents join teams as Bots. Users can conveniently integrate mainstream tools like OpenClaw, Hermes, Codex, and Claude Code into Octo, creating dedicated digital twin Bots while enabling deep Agent-to-Agent (A2A) collaboration. Each Bot has its own AgentCard and work history, with clear ownership and accountability.&lt;/p&gt;

&lt;p&gt;Octo also turns fragmented discussions into traceable, measurable work outcomes. When actionable work emerges from a discussion, Agents automatically summarize the key points and, upon human confirmation, create Matters. A Matter specifies the task owner and concrete deliverables and keeps a detailed record from the Brief through process discussions, outputs, and feedback to the acceptance conclusion, all preserved for future review and decision traceability.&lt;/p&gt;

&lt;p&gt;For complex tasks, Octo provides six collaboration modes: Solo (individual completion), Roundtable (group discussion), Critic (independent review), Pipeline (sequential workflow), Split (parallel division), and Swarm (competitive selection). By precisely controlling how Context information flows between Bots and what's visible to each participant, Octo enables multiple specialized Bots to conduct distributed collaboration under human guidance, allowing collective intelligence to emerge through network effects that surpass any single model.&lt;/p&gt;

&lt;h2&gt;"I Taste Therefore I Am": A New Division of Labor in Human-Machine Collaboration&lt;/h2&gt;

&lt;p&gt;What's truly being restructured in the AI era isn't just tools, but collaboration itself. In the future, collaboration will frequently occur between humans and humans, humans and Agents, and Agents and Agents. Under this new paradigm, the human-machine division of labor reaches a turning point: AI excels at "thinking" and "doing" — handling logical reasoning, analysis, generation, and execution; while human irreplaceability focuses on "tasting" — making holistic judgments based on experience, aesthetics, trade-offs, and values.&lt;/p&gt;

&lt;p&gt;Octo is designed around this principle: Let Agents execute, let humans return to the core position of judgment and taste. At key nodes, humans provide direction, standards, and feedback — judging what's right and what's good; AI drives tasks to completion.&lt;/p&gt;

&lt;p&gt;With every human-machine collaboration, human taste drives the accumulation of organizational assets, making Bots smarter over time.&lt;/p&gt;

&lt;p&gt;During collaboration, project background knowledge, historical decisions, and discussion records are structurally preserved in Matters, allowing new members to onboard without starting from zero alignment. Every rejection, annotation, and style choice humans make when reviewing Bot outputs gets recorded as preference cards, enabling Bots to automatically reference them in future tasks. The standards and methods Bots learn can also be preserved as reusable Skill assets within the organization.&lt;/p&gt;

&lt;p&gt;Through the asset accumulation flywheel of "dispatch tasks → review feedback → accumulate preferences and skills → greater efficiency next time," Octo builds a unique positive cycle, naturally enriching organizational productivity infrastructure with every collaborative interaction, achieving true capability accumulation and intelligent upgrades.&lt;/p&gt;

&lt;h2&gt;Open Source and Open: Not Replacing Tools, But Connecting Them&lt;/h2&gt;

&lt;p&gt;Octo is open-sourced under Apache License 2.0 and supports private deployment. Mininglamp believes that in an era of rapid AI development, enterprises' true long-term competitiveness stems from their unique work context, business knowledge accumulation, and organizational judgment.&lt;/p&gt;

&lt;p&gt;Octo is positioned as the "collaboration layer" between an enterprise's existing documentation, spreadsheets, code repositories, and project management platforms. Through cross-platform capabilities like browser extensions, Octo can bring current webpage content, selected fragments, and task information into the collaboration network, so that digital twins understand the current work environment and can work alongside existing tools rather than replace them.&lt;/p&gt;

&lt;p&gt;In terms of product form, Octo covers Web App, desktop client, mobile (iOS/Android), browser extension, and CLI: five endpoints that meet the needs of different work scenarios. Whether users are pushing complex projects forward on desktop, quickly handling notifications and taste feedback on mobile, or giving Agents native operations through the CLI, the endpoints interoperate across devices.&lt;/p&gt;

&lt;h2&gt;Moving Toward Private AI Through Trustworthy Mechanisms&lt;/h2&gt;

&lt;p&gt;Octo's open-source release is also Mininglamp's further practice in Private AI and Trustworthy AI.&lt;/p&gt;

&lt;p&gt;Mininglamp firmly believes that truly sustainable AI collaboration must guarantee users' absolute control over data, context, judgment signals, and deployment methods. Through open-source architecture, private deployment, and clear data ownership design, Octo ensures enterprises can embrace AI within security boundaries while protecting individuals' tacit knowledge.&lt;/p&gt;

&lt;p&gt;In Octo's product philosophy, the four letters "O.C.T.O." represent four inseparable dimensions: Open (open access), Context (context sharing), Taste (preference evolution), and Orchestration (multi-Bot coordination).&lt;/p&gt;

&lt;p&gt;Context is the soil for AI to understand tasks; Taste is the compass for AI to continuously calibrate direction. Octo doesn't simply distill human tacit capabilities into platform assets, but rather amplifies, records, and passes on these capabilities while respecting personal and organizational data boundaries.&lt;/p&gt;

&lt;p&gt;Mininglamp is continuously improving its new-generation AI infrastructure oriented toward edge intelligence, private deployment, and human-machine collaboration. By fully preserving teams' background knowledge, work preferences, and methodologies in the network, Octo ensures organizational wisdom doesn't drain with personnel turnover, and business style doesn't change with foundation model iterations. Every human-machine hybrid collaboration is compound interest accumulation on organizational private assets. Over time, this unique business perception naturally transforms into enterprises' most competitive technical and scenario barriers.&lt;/p&gt;

&lt;p&gt;In the future, Octo will continue with an open-source, open attitude, co-creating new collaboration paradigms for AI-Native organizations with developers, enterprise customers, and ecosystem partners, making trustworthy, controllable, and sustainable private Agentic AI truly land in every real work scenario.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>collaboration</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AI Agents and Persistent Context: What design.md Teaches Us</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Fri, 26 Jun 2026 09:45:07 +0000</pubDate>
      <link>https://dev.to/mininglamp/ai-agents-and-persistent-context-what-designmd-teaches-us-4l9b</link>
      <guid>https://dev.to/mininglamp/ai-agents-and-persistent-context-what-designmd-teaches-us-4l9b</guid>
      <description>&lt;p&gt;A GitHub repository called design.md has been trending recently, accumulating over 1,400 stars. The concept is straightforward: provide AI agents with a persistent design document they can reference throughout their work.&lt;/p&gt;

&lt;p&gt;This approach addresses a practical challenge in agent development that many teams encounter.&lt;/p&gt;

&lt;h2&gt;The Context Challenge&lt;/h2&gt;

&lt;p&gt;When working on complex tasks, AI agents need to understand the broader picture. What's the architecture? What constraints exist? What approaches have been tried before?&lt;/p&gt;

&lt;p&gt;Typically, agents get context from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current conversation (limited window)&lt;/li&gt;
&lt;li&gt;Code comments (often outdated)&lt;/li&gt;
&lt;li&gt;Documentation (if it exists)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The issue is that this context is fragmented and temporary. When the conversation moves forward, earlier context disappears. When documentation is outdated, agents make incorrect assumptions.&lt;/p&gt;

&lt;p&gt;A design.md provides a single source of truth that persists across sessions.&lt;/p&gt;

&lt;h2&gt;What Belongs in design.md&lt;/h2&gt;

&lt;p&gt;An effective design.md answers these questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What are we building?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Beyond feature lists, document the core purpose. Why does this project exist? What problem does it solve?&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;What are the key architectural decisions?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Document major choices and their rationale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"PostgreSQL was chosen over MongoDB because ACID guarantees are required for financial transactions"&lt;/li&gt;
&lt;li&gt;"Microservices architecture was adopted because components have different scaling requirements"&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;What constraints exist?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Technical constraints (performance requirements, browser support), business constraints (budget, timeline), and regulatory constraints (GDPR, HIPAA).&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;What has been tried before?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Document failed approaches to prevent agents from suggesting rejected solutions.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;What are the current challenges?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Listing known issues, technical debt, and areas needing improvement helps agents prioritize work.&lt;/p&gt;

&lt;h2&gt;How Agents Use design.md&lt;/h2&gt;

&lt;p&gt;When starting a task, agents can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read design.md to understand context&lt;/li&gt;
&lt;li&gt;Make decisions aligned with documented architecture&lt;/li&gt;
&lt;li&gt;Avoid solutions violating constraints&lt;/li&gt;
&lt;li&gt;Reference design.md in reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to more coherent and consistent work. Agents work within a broader framework rather than just reacting to immediate tasks.&lt;/p&gt;
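
&lt;p&gt;As a minimal sketch of the first step, an agent harness could read design.md and place it ahead of the task in the prompt. The function name, the section headings in the prompt, and the sample file contents are assumptions made for illustration; they are not taken from the design.md repository or from any specific agent framework.&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def build_prompt(task, design_path="design.md"):
    """Prepend the persistent design document (if present) to a task description."""
    path = Path(design_path)
    design = path.read_text(encoding="utf-8") if path.exists() else ""
    sections = []
    if design.strip():
        sections.append("## Project design context\n" + design.strip())
    sections.append("## Task\n" + task.strip())
    return "\n\n".join(sections)


# Demo with a throwaway design.md (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    design_file = Path(tmp) / "design.md"
    design_file.write_text(
        "PostgreSQL is required: ACID guarantees for financial transactions.\n",
        encoding="utf-8",
    )
    prompt = build_prompt("Add an orders table.", design_file)

print(prompt)
```

&lt;p&gt;If the file is missing, the sketch falls back to the bare task, which keeps the agent usable in repositories that have not adopted design.md yet.&lt;/p&gt;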

&lt;h2&gt;Keeping design.md Updated&lt;/h2&gt;

&lt;p&gt;The main risk with design.md is becoming outdated. Effective practices include:&lt;/p&gt;

&lt;p&gt;Make it part of the workflow&lt;/p&gt;

&lt;p&gt;Update design.md immediately when making significant architectural decisions. Waiting until "later" means it never happens.&lt;/p&gt;

&lt;p&gt;Version control it&lt;/p&gt;

&lt;p&gt;Keep design.md in the repository. During PR reviews, check if design.md needs updating.&lt;/p&gt;

&lt;p&gt;Review it regularly&lt;/p&gt;

&lt;p&gt;Schedule periodic reviews (monthly or quarterly) to ensure the document reflects current reality.&lt;/p&gt;

&lt;p&gt;Let agents help&lt;/p&gt;

&lt;p&gt;Agents can assist in maintaining design.md by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Suggesting updates when noticing inconsistencies&lt;/li&gt;
&lt;li&gt;Summarizing changes from recent commits&lt;/li&gt;
&lt;li&gt;Flagging outdated information&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Observability in Agent Workflows&lt;/h2&gt;

&lt;p&gt;Even with a good design.md, observing what agents actually do is important. This is particularly relevant for GUI agents interacting with complex interfaces.&lt;/p&gt;

&lt;p&gt;Consider a GUI agent tasked with "fill out this form and submit it". The agent needs to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Locate form fields&lt;/li&gt;
&lt;li&gt;Enter correct data&lt;/li&gt;
&lt;li&gt;Handle validation errors&lt;/li&gt;
&lt;li&gt;Submit the form&lt;/li&gt;
&lt;li&gt;Verify success&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step can fail in different ways. Without observability, only the final result is visible: success or failure. The reason for failure remains unknown.&lt;/p&gt;

&lt;h2&gt;Building Observable Workflows&lt;/h2&gt;

&lt;p&gt;Good observability includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Step-by-step logging&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Record each action:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What was observed (screenshots, DOM state)&lt;/li&gt;
&lt;li&gt;What decision was made&lt;/li&gt;
&lt;li&gt;What actually happened&lt;/li&gt;
&lt;li&gt;Whether it matched expectations&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Performance metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Success rate per task type&lt;/li&gt;
&lt;li&gt;Average steps to completion&lt;/li&gt;
&lt;li&gt;Time per step&lt;/li&gt;
&lt;li&gt;Failure modes&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Error categorization&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When things go wrong, categorize errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Perception errors (agent didn't see the right element)&lt;/li&gt;
&lt;li&gt;Decision errors (agent chose wrong action)&lt;/li&gt;
&lt;li&gt;Execution errors (action failed due to external factors)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data helps identify where improvements are needed.&lt;/p&gt;
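
&lt;p&gt;The logging fields and error categories described here could be captured in a small record type. The following is a sketch under assumptions: the class and field names are hypothetical, and the sample log entries are invented to show the shape of the data, not taken from a real run.&lt;/p&gt;

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    PERCEPTION = "perception"  # agent didn't see the right element
    DECISION = "decision"      # agent chose the wrong action
    EXECUTION = "execution"    # action failed due to external factors


@dataclass
class StepLog:
    step: int
    observed: str   # what was observed, e.g. a screenshot path or DOM summary
    decision: str   # what decision was made
    outcome: str    # what actually happened
    error: Optional[ErrorKind] = None  # None when the outcome matched expectations


def failure_breakdown(logs):
    """Count failed steps per error category."""
    return Counter(log.error for log in logs if log.error is not None)


# Invented log entries for a form-filling task.
logs = [
    StepLog(1, "form.png", "click Name field", "field focused"),
    StepLog(2, "form.png", "type Ada", "text entered"),
    StepLog(3, "form.png", "click Submit", "clicked Cancel instead", ErrorKind.PERCEPTION),
    StepLog(4, "dialog.png", "click Retry", "network timeout", ErrorKind.EXECUTION),
]

for kind, count in failure_breakdown(logs).items():
    print(kind.value, count)
```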

&lt;h2&gt;Systematic Benchmarking&lt;/h2&gt;

&lt;p&gt;CUA Benchmark provides systematic observability through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100 test cases across 5 different web applications&lt;/li&gt;
&lt;li&gt;Standardized task definitions&lt;/li&gt;
&lt;li&gt;Automated result verification&lt;/li&gt;
&lt;li&gt;Detailed performance metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running agents against CUA Benchmark provides quantitative data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overall success rate&lt;/li&gt;
&lt;li&gt;Success rate by task type&lt;/li&gt;
&lt;li&gt;Average steps per task&lt;/li&gt;
&lt;li&gt;Common failure points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data is valuable for iterative improvement. Instead of guessing what to optimize, specific areas where agents struggle can be identified and addressed.&lt;/p&gt;
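
&lt;p&gt;Per-task-type metrics like these are straightforward to aggregate once each run is recorded. The sketch below is not CUA Benchmark's own tooling; the record format, the function name, and the sample results are hypothetical and only illustrate the aggregation.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical per-task results; a real benchmark run would produce these.
results = [
    {"task_type": "form", "success": True, "steps": 6},
    {"task_type": "form", "success": False, "steps": 11},
    {"task_type": "search", "success": True, "steps": 4},
    {"task_type": "search", "success": True, "steps": 5},
]


def summarize(results):
    """Return success rate and average steps for each task type."""
    by_type = defaultdict(list)
    for result in results:
        by_type[result["task_type"]].append(result)
    summary = {}
    for task_type, rows in by_type.items():
        summary[task_type] = {
            "success_rate": sum(r["success"] for r in rows) / len(rows),
            "avg_steps": sum(r["steps"] for r in rows) / len(rows),
        }
    return summary


for task_type, stats in summarize(results).items():
    print(task_type, stats)
```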

&lt;h2&gt;A Practical Example: Mano-AFK&lt;/h2&gt;

&lt;p&gt;Mano-AFK is an open-source autonomous application builder that demonstrates these principles. The workflow includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Receiving a natural language description of what to build&lt;/li&gt;
&lt;li&gt;Generating a PRD (Product Requirements Document)&lt;/li&gt;
&lt;li&gt;Writing the code&lt;/li&gt;
&lt;li&gt;Deploying to a test environment&lt;/li&gt;
&lt;li&gt;Running tests (lint, API, E2E)&lt;/li&gt;
&lt;li&gt;Auto-fixing any issues&lt;/li&gt;
&lt;li&gt;Delivering the final application&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Throughout this process, the agent references rules.md and preferences.md files to maintain consistency across projects. These files provide persistent context that guides decisions.&lt;/p&gt;

&lt;p&gt;CUA Benchmark results for Mano-AFK:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;W8A16 quantization: 58.0% accuracy&lt;/li&gt;
&lt;li&gt;W8A8 quantization (Cider): 54.0% accuracy, but faster inference (~1,453 tok/s prefill)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These numbers show that the W8A8 version gives up 4 percentage points of accuracy in exchange for faster inference. Depending on the use case, one might be preferred over the other.&lt;/p&gt;

&lt;p&gt;Without systematic benchmarking, this data wouldn't exist. Only vague impressions like "it works sometimes" or "it's kind of slow" would remain.&lt;/p&gt;

&lt;h2&gt;Practical Recommendations&lt;/h2&gt;

&lt;p&gt;When building agent workflows, these practices have proven effective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with design.md&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Before writing agent code, document architecture, constraints, and key decisions. This document guides both human developers and AI agents.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Build observability from day one&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don't add logging later. Design agent workflows to be observable from the start. Every step should produce some form of output that can be inspected.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Use benchmarks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Establish a benchmark suite early. Run it regularly. Track metrics over time. This provides objective data on whether changes are improvements or regressions.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Iterate based on data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When low success rates are observed, examine failure modes. Instead of making broad changes, identify specific failure patterns and address them directly.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Keep context persistent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whether through design.md, rules.md, or other mechanisms, ensure agents have access to persistent context. Conversation history is too ephemeral for complex projects.&lt;/p&gt;

&lt;h2&gt;Moving Forward&lt;/h2&gt;

&lt;p&gt;AI agent engineering is still in early stages. Best practices are still being figured out, but two things are becoming clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agents need persistent context to do good work (design.md)&lt;/li&gt;
&lt;li&gt;Agent workflows need systematic observability to improve (benchmarks and logging)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't advanced techniques. They're foundational practices that make everything else work better.&lt;/p&gt;

&lt;p&gt;If you're interested in seeing these principles in action, Mano-AFK (&lt;a href="https://github.com/Mininglamp-AI/Mano-AFK" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-AI/Mano-AFK&lt;/a&gt;) is an open-source autonomous application builder that uses persistent context files and systematic benchmarking to improve agent reliability.&lt;/p&gt;

&lt;p&gt;For those working on GUI agents, Mano-P (&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-AI/Mano-P&lt;/a&gt;) implements think-act-verify loops and online reinforcement learning, achieving a 58.2% success rate on the OSWorld benchmark (specialized models category).&lt;/p&gt;

&lt;p&gt;Both projects are Apache 2.0 licensed. Stars and contributions are welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>security</category>
      <category>api</category>
    </item>
    <item>
      <title>After MCP, What's the Next Standard Interface for AI Agents?</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Thu, 25 Jun 2026 10:39:31 +0000</pubDate>
      <link>https://dev.to/mininglamp/after-mcp-whats-the-next-standard-interface-for-ai-agents-116i</link>
      <guid>https://dev.to/mininglamp/after-mcp-whats-the-next-standard-interface-for-ai-agents-116i</guid>
      <description>&lt;p&gt;Model Context Protocol solved a real problem. Before MCP, every tool integration required custom glue code. After MCP, agents can talk to databases, APIs, and services through a standard interface. The protocol gave us a common language for agent-tool communication, and the ecosystem responded with thousands of MCP servers.&lt;/p&gt;

&lt;p&gt;But there’s a category of interaction that MCP doesn’t address: graphical user interfaces. Agents still struggle to operate desktop applications, web interfaces, and mobile apps with the same fluency they show when calling APIs. The problem isn’t intelligence. It’s the lack of a standard way for agents to perceive and manipulate GUIs.&lt;/p&gt;

&lt;p&gt;Three approaches have emerged to close this gap. Each makes different engineering tradeoffs, and none has achieved the kind of adoption that MCP has seen. Understanding these tradeoffs matters if you’re building agents that need to interact with the visual world.&lt;/p&gt;

&lt;p&gt;The GUI Agent Problem&lt;/p&gt;

&lt;p&gt;APIs are structured. They accept typed parameters, return predictable responses, and document their behavior. GUIs are none of these things. A button might be labeled “Submit,” “Save,” or “OK.” It might be positioned differently on different screen sizes. Its availability might depend on the state of three other form fields. The visual representation is an abstraction layer over underlying state, and that abstraction was designed for humans, not machines.&lt;/p&gt;

&lt;p&gt;An agent operating a GUI needs to do three things: perceive the current interface state, decide what action to take, and execute that action. The perception step is where approaches diverge. How does the agent “see” the interface?&lt;/p&gt;
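&lt;p&gt;The perceive, decide, execute cycle can be sketched as a small loop. Everything below is a hypothetical stand-in: FakeScreen plays the role of the GUI, and decide is a hard-coded rule where a real agent would call a model.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class FakeScreen:
    """Hypothetical stand-in for a GUI: a form with one field and a button."""
    text: str = ""
    submitted: bool = False
    log: list = field(default_factory=list)

    def perceive(self):
        # A real agent would read a screenshot or an accessibility tree here.
        return {"text": self.text, "submitted": self.submitted}

    def execute(self, action):
        self.log.append(action)
        if action["type"] == "type":
            self.text += action["value"]
        elif action["type"] == "click" and action["target"] == "Submit":
            self.submitted = True


def decide(state, goal_text):
    """Hard-coded policy standing in for a model call."""
    if state["submitted"]:
        return None  # goal reached, stop
    if state["text"] != goal_text:
        return {"type": "type", "value": goal_text}
    return {"type": "click", "target": "Submit"}


def run_agent(screen, goal_text, max_steps=10):
    """Loop until the policy reports completion or the step limit is hit."""
    for step in range(max_steps):
        state = screen.perceive()          # 1. perceive
        action = decide(state, goal_text)  # 2. decide
        if action is None:
            return step
        screen.execute(action)             # 3. execute
    raise RuntimeError("step limit reached before the goal state")


screen = FakeScreen()
steps = run_agent(screen, "hello")
print(steps, screen.log)
```

&lt;p&gt;The three approaches discussed next differ mainly in what perceive returns; the loop itself stays the same.&lt;/p&gt;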

&lt;p&gt;Approach 1: API Integration&lt;/p&gt;

&lt;p&gt;The most straightforward approach is to bypass the GUI entirely and use APIs when they exist. Many desktop applications expose scripting interfaces. Web applications often have REST APIs or GraphQL endpoints. Mobile apps sometimes support URL schemes or accessibility hooks that can be invoked programmatically.&lt;/p&gt;

&lt;p&gt;This approach works well when APIs are available and well-documented. An agent can call create_document(), pass structured parameters, and receive a confirmation response. The interaction is fast, reliable, and doesn’t require visual processing.&lt;/p&gt;

&lt;p&gt;The limitation is coverage. Most consumer applications don’t expose comprehensive APIs. Even when APIs exist, they often don’t cover all the functionality available through the GUI. A photo editing app might have an API for basic operations but not for advanced filters. A web application might have a public API for reading data but not for complex workflows that require multi-step form submissions.&lt;/p&gt;

&lt;p&gt;API integration also doesn’t help with legacy systems, proprietary software, or applications where the API has been deprecated or never built. The agent becomes dependent on the goodwill of application developers to expose the functionality it needs.&lt;/p&gt;

&lt;p&gt;Some frameworks try to extend API coverage by wrapping GUI automation libraries. Python’s pyautogui, Apple’s AppleScript, and Windows UI Automation all provide programmatic access to GUI elements. These tools work, but they’re fragile. They depend on element identifiers, window titles, and UI hierarchies that change between application versions. A script that works today might break after a software update.&lt;/p&gt;

&lt;p&gt;Approach 2: Accessibility Tree&lt;/p&gt;

&lt;p&gt;Operating systems maintain accessibility trees: structured representations of UI elements designed for screen readers and assistive technologies. These trees contain information about buttons, text fields, menus, and their current states. An agent can query the accessibility tree to understand what elements are present and what actions they support.&lt;/p&gt;
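&lt;p&gt;As a rough illustration, an accessibility tree can be modeled as nested nodes with a role, a label, and children. The dictionary layout and role names below are simplified assumptions, not the actual data structures of the macOS Accessibility API or Windows UI Automation.&lt;/p&gt;

```python
# Simplified, hypothetical accessibility tree for a small dialog.
tree = {
    "role": "window", "label": "Export", "children": [
        {"role": "text_field", "label": "File name", "enabled": True, "children": []},
        {"role": "group", "label": "Options", "children": [
            {"role": "checkbox", "label": "Include metadata", "enabled": True, "children": []},
        ]},
        {"role": "button", "label": "Save", "enabled": False, "children": []},
        {"role": "button", "label": "Cancel", "enabled": True, "children": []},
    ],
}

INTERACTIVE_ROLES = {"button", "checkbox", "text_field"}


def interactive_elements(node, path=()):
    """Depth-first walk yielding (path, node) for enabled interactive nodes."""
    here = path + (node["label"],)
    if node["role"] in INTERACTIVE_ROLES and node.get("enabled", False):
        yield here, node
    for child in node.get("children", []):
        yield from interactive_elements(child, here)


for element_path, element in interactive_elements(tree):
    print(" / ".join(element_path), "->", element["role"])
```

&lt;p&gt;The disabled Save button is filtered out, which is the kind of state information an agent can read directly from the tree without any image processing.&lt;/p&gt;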

&lt;p&gt;This approach is more robust than raw API integration because accessibility trees are standardized at the OS level. macOS uses the Accessibility API, Windows uses UI Automation, and web browsers expose the DOM through accessibility interfaces. An agent that understands these APIs can operate across many applications without custom integration for each one.&lt;/p&gt;

&lt;p&gt;Google’s approach with Gemini 3.5 Flash Computer Use leans heavily on accessibility trees. The model can query the accessibility tree of a web page or application, identify interactive elements, and generate actions to manipulate them. This works well for structured interfaces like web forms, file managers, and settings panels.&lt;/p&gt;

&lt;p&gt;The accessibility tree approach has limits. Not all applications expose complete accessibility information. Custom-drawn interfaces, games, and applications with non-standard UI frameworks often have sparse or inaccurate accessibility trees. Canvas-based rendering, WebGL content, and video players present particular challenges because their visual content doesn’t map cleanly to accessibility nodes.&lt;/p&gt;

&lt;p&gt;Accessibility trees also miss visual information that matters for human-like interaction. The tree might tell you a button exists and is labeled “Save,” but it won’t tell you that the button appears grayed out, that a tooltip is hovering over it, or that a dialog box has appeared in the foreground. These visual cues inform human decision-making, and agents operating purely on accessibility data miss them.&lt;/p&gt;

&lt;p&gt;Approach 3: Pure Vision&lt;/p&gt;

&lt;p&gt;The third approach treats the screen as an image and uses computer vision to understand it. The agent takes screenshots, processes them through a vision-language model, and generates mouse and keyboard actions based on what it sees. This is the most general approach because it works with any GUI, regardless of whether APIs or accessibility trees exist.&lt;/p&gt;

&lt;p&gt;Pure vision agents don’t depend on application-specific integrations. They see the screen the way a human sees it: pixels arranged in patterns that represent buttons, text, images, and layouts. A vision-language model can identify a “Submit” button by its visual appearance, even if the button has no accessibility label or API endpoint.&lt;/p&gt;

&lt;p&gt;The tradeoff is computational cost and latency. Processing a screenshot through a vision model takes time. A single inference might require 200-500 milliseconds, and complex interfaces might need multiple inference steps to parse correctly. This makes pure vision agents slower than API-based approaches, where actions execute in milliseconds.&lt;/p&gt;

&lt;p&gt;Memory and compute requirements are also higher. Vision-language models need to process high-resolution images, which consume significant VRAM or RAM. Running these models on edge devices—laptops, desktops, mobile phones—requires careful optimization.&lt;/p&gt;

&lt;p&gt;Mano-P takes the pure vision approach and optimizes it for edge deployment. The model is a GUI-VLA (Visual-Language-Action) agent designed to run locally on consumer hardware. The 72B-parameter version achieves a 58.2% success rate on the OSWorld benchmark, ranking first among specialized GUI agents, but it is primarily for research and benchmarking. The 4B quantized version is what users can actually deploy: on Apple M4 hardware with 32GB RAM it runs with 4.3GB of peak memory usage, reaching 476 tokens per second during prefill and 76 tokens per second during decoding.&lt;/p&gt;
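&lt;p&gt;A back-of-the-envelope estimate of per-step latency follows from the two throughput figures above. The token counts in the example are illustrative assumptions; real counts depend on screenshot resolution and on how many tokens the model emits per action, and the estimate ignores image encoding and any caching.&lt;/p&gt;

```python
PREFILL_TOKENS_PER_SECOND = 476  # reported prefill throughput
DECODE_TOKENS_PER_SECOND = 76    # reported decode throughput


def estimate_step_latency(prompt_tokens, output_tokens):
    """Seconds to prefill the prompt plus seconds to decode the output."""
    prefill = prompt_tokens / PREFILL_TOKENS_PER_SECOND
    decode = output_tokens / DECODE_TOKENS_PER_SECOND
    return prefill + decode


# Assumed workload: 476 prompt tokens (screenshot plus instructions)
# and 38 output tokens for the action.
print(estimate_step_latency(476, 38))  # 1.0 s prefill + 0.5 s decode = 1.5
```

&lt;p&gt;Plugging in measured token counts for a given screen resolution shows how much of a latency budget goes to prefill versus decoding.&lt;/p&gt;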

&lt;p&gt;The performance characteristics matter for practical deployment. An agent that takes two seconds to process each screenshot creates a sluggish user experience. An agent that processes screenshots in under 200 milliseconds feels responsive. The 4B model’s throughput makes real-time GUI interaction feasible on commodity hardware, which changes where GUI agents can be deployed.&lt;/p&gt;

&lt;p&gt;Pure vision also handles edge cases that other approaches struggle with. Custom-drawn interfaces, games, video content, and applications with non-standard UI frameworks all render to pixels. A vision agent can operate them without requiring special integration work.&lt;/p&gt;

&lt;p&gt;Engineering Tradeoffs&lt;/p&gt;

&lt;p&gt;Choosing between these approaches depends on your deployment context.&lt;/p&gt;

&lt;p&gt;API integration is fastest and most reliable when APIs exist. It’s the right choice for agents that operate within well-defined software ecosystems where API coverage is comprehensive. If you’re building an agent that automates workflows in Salesforce, Jira, and Slack, API integration makes sense. The agent will be fast, predictable, and easy to debug.&lt;/p&gt;

&lt;p&gt;Accessibility tree approaches offer broader coverage with reasonable performance. They work well for agents that need to operate across many applications but don’t require pixel-perfect visual understanding. Web automation, form filling, and menu navigation are good fits. The approach breaks down when applications have poor accessibility support or when visual context matters for decision-making.&lt;/p&gt;

&lt;p&gt;Pure vision is the most general but most expensive. It’s appropriate when agents need to operate arbitrary GUIs, handle visual complexity, or work with applications that lack APIs and accessibility support. The computational cost has historically limited this approach to cloud deployments, but model compression and edge optimization are changing that equation.&lt;/p&gt;

&lt;p&gt;Hybrid approaches are also viable. An agent might use API integration for structured operations, fall back to accessibility trees for standard UI elements, and use vision only when the other approaches fail. This layered strategy balances performance and coverage, though it increases implementation complexity.&lt;/p&gt;
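&lt;p&gt;A layered strategy like this can be sketched as an ordered chain of handlers, each of which either completes the task or declines. The three handler functions and the task fields are hypothetical placeholders for real API, accessibility, and vision backends.&lt;/p&gt;

```python
def via_api(task):
    # Assumed: only some operations are covered by an API.
    if task["operation"] in {"create_document"}:
        return f"api:{task['operation']}"
    return None


def via_accessibility_tree(task):
    # Assumed: works only when the target element exposes a label.
    if task.get("element_label"):
        return f"a11y:click {task['element_label']}"
    return None


def via_vision(task):
    # Most general and most expensive: always attempts the task.
    return f"vision:{task['operation']}"


STRATEGIES = [via_api, via_accessibility_tree, via_vision]  # cheapest first


def perform(task):
    """Try each strategy in order and return the first result."""
    for strategy in STRATEGIES:
        result = strategy(task)
        if result is not None:
            return result
    raise RuntimeError("no strategy could handle the task")


print(perform({"operation": "create_document"}))
print(perform({"operation": "press_save", "element_label": "Save"}))
print(perform({"operation": "apply_filter"}))
```

&lt;p&gt;Ordering the chain from cheapest to most general keeps the vision model off the hot path for tasks the other layers can handle, at the cost of maintaining three backends.&lt;/p&gt;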

&lt;p&gt;The Standardization Question&lt;/p&gt;

&lt;p&gt;MCP succeeded because it defined a minimal viable interface: tools expose functions with typed parameters, agents call those functions, and results flow back. The protocol is simple enough to implement quickly and flexible enough to cover many use cases.&lt;/p&gt;

&lt;p&gt;A GUI agent protocol would need to address different concerns. It would need to standardize how agents perceive interface state (screenshots, accessibility trees, or both), how they specify actions (coordinates, element references, or semantic descriptions), and how they receive feedback (visual confirmation, state changes, or error messages).&lt;/p&gt;
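&lt;p&gt;To make those three concerns concrete, here is one way the messages of such a protocol might be typed. This is purely a hypothetical sketch; none of these names come from MCP, Mano-P, or any existing specification.&lt;/p&gt;

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json


@dataclass
class Observation:
    """What the agent perceives: a screenshot, a tree, or both."""
    screenshot_png_base64: Optional[str] = None
    accessibility_tree: Optional[dict] = None


@dataclass
class Action:
    """One action, targeted by coordinates or by element reference."""
    kind: str                          # e.g. "click", "type", "hotkey"
    x: Optional[int] = None
    y: Optional[int] = None
    element_ref: Optional[str] = None
    text: Optional[str] = None


@dataclass
class Feedback:
    """Result of an action, returned together with the next observation."""
    ok: bool
    error: Optional[str] = None
    observation: Optional[Observation] = None


action = Action(kind="click", element_ref="button:Save")
feedback = Feedback(ok=True, observation=Observation(accessibility_tree={"role": "window"}))

# Dataclasses convert to plain dictionaries, so the wire format can be JSON.
print(json.dumps(asdict(action)))
print(json.dumps(asdict(feedback)))
```

&lt;p&gt;Making every targeting field optional lets one action type serve both coordinate-based and element-based agents, which is the kind of design decision a real standard would have to settle.&lt;/p&gt;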

&lt;p&gt;No such standard exists yet. Each GUI agent framework defines its own perception and action primitives. Mano-P uses screenshot-based perception with coordinate-based actions. Other frameworks use accessibility tree queries with element ID-based actions. The lack of standardization means that GUI agents are not portable across frameworks, and application developers have no clear target to optimize for.&lt;/p&gt;

&lt;p&gt;The Octo workspace takes a different approach to this problem. Rather than standardizing the GUI interaction layer directly, it focuses on agent coordination and task orchestration. Individual agents within Octo can use different interaction strategies—API integration for some tasks, GUI automation for others—while the workspace manages context, state, and collaboration between them. This sidesteps the standardization question by treating GUI interaction as an implementation detail rather than a protocol concern.&lt;/p&gt;

&lt;p&gt;Whether a GUI-specific protocol will emerge remains unclear. The diversity of approaches suggests the problem space is not yet well understood. MCP itself needed time to gain traction, and GUI agent interaction may require a similar maturation period before a standard emerges.&lt;/p&gt;

&lt;p&gt;Open Problems&lt;/p&gt;

&lt;p&gt;Several technical challenges remain unsolved regardless of which approach dominates.&lt;/p&gt;

&lt;p&gt;State tracking across actions. GUIs are stateful. Clicking a button might open a dialog, change a menu, or trigger a loading spinner. Agents need to track these state changes and adjust their behavior accordingly. Current approaches handle this through repeated perception cycles—take a screenshot, detect the state change, decide the next action—but this is inefficient and error-prone.&lt;/p&gt;

&lt;p&gt;Error recovery. GUIs fail in unpredictable ways. A dialog might appear unexpectedly. A network request might time out, leaving the interface in an inconsistent state. An element might be obscured by a popup. Agents need robust error detection and recovery strategies, but defining what constitutes an “error” in a GUI context is difficult.&lt;/p&gt;
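&lt;p&gt;One common pattern for both state tracking and error recovery is to verify an expected postcondition after each action and retry a bounded number of times. The sketch below uses a hypothetical FlakyDialog in place of a real GUI, and the recovery step (dismissing a popup) is just one example of what recovery might mean.&lt;/p&gt;

```python
class FlakyDialog:
    """Hypothetical GUI where a popup blocks the first click on Save."""

    def __init__(self):
        self.popup_open = True
        self.saved = False

    def click_save(self):
        if not self.popup_open:
            self.saved = True  # the click only lands when nothing obscures it

    def dismiss_popup(self):
        self.popup_open = False


def act_and_verify(action, postcondition, recover, max_attempts=3):
    """Run action, check postcondition, then recover and retry on failure."""
    for attempt in range(1, max_attempts + 1):
        action()
        if postcondition():
            return attempt
        recover()
    raise RuntimeError(f"postcondition not met after {max_attempts} attempts")


gui = FlakyDialog()
attempts = act_and_verify(
    action=gui.click_save,
    postcondition=lambda: gui.saved,
    recover=gui.dismiss_popup,
)
print(attempts)
```

&lt;p&gt;The hard part in practice is writing the postcondition: in a real GUI it has to be inferred from a fresh screenshot or tree query rather than read from a flag.&lt;/p&gt;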

&lt;p&gt;Multi-step workflows. Many GUI tasks require sequences of actions with intermediate verification. Filling out a multi-page form, configuring application settings, or navigating a complex menu hierarchy all require maintaining context across multiple perception-action cycles. Current agents struggle with long-horizon tasks that span dozens of steps.&lt;/p&gt;

&lt;p&gt;Cross-platform consistency. An agent that works on macOS might fail on Windows or Linux because the same application renders differently on each platform. Building agents that generalize across operating systems requires handling platform-specific UI conventions, keyboard shortcuts, and interaction patterns.&lt;/p&gt;

&lt;p&gt;What Comes Next&lt;/p&gt;

&lt;p&gt;The GUI agent space is where API integration was before MCP: fragmented, with each implementation defining its own conventions. The success of MCP suggests that standardization is possible, but the problem spaces are fundamentally different. API integration deals with structured data and typed interfaces. GUI interaction deals with pixels, visual layouts, and human-designed abstractions.&lt;/p&gt;

&lt;p&gt;Pure vision approaches like Mano-P demonstrate that general GUI agents are technically feasible on edge hardware. The 4B model’s performance characteristics—sub-200ms inference, 4.3GB memory footprint—show that the computational barriers are falling. The question is whether the community will converge on a standard interface for GUI agent interaction, or whether the diversity of approaches will persist.&lt;/p&gt;

&lt;p&gt;For developers building agents today, the pragmatic answer is to use the approach that fits your deployment context. API integration when coverage is sufficient. Accessibility trees for broader reach with acceptable performance. Pure vision when generality matters more than speed. And keep watching for standards to emerge, the way MCP emerged for tool integration.&lt;/p&gt;

&lt;p&gt;The code for Mano-P (&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-AI/Mano-P&lt;/a&gt;) is available on GitHub under the Apache 2.0 license. The 4B quantized model is ready for local deployment on Apple M4 hardware. If you’re working on GUI agent problems, the repository includes benchmark scripts, model weights, and deployment documentation. Worth a star if the problem space interests you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>gui</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Humans Judge, AI Executes: Rethinking Work for the Multi-Agent Era</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Wed, 24 Jun 2026 09:15:03 +0000</pubDate>
      <link>https://dev.to/mininglamp/humans-judge-ai-executes-rethinking-work-for-the-multi-agent-era-3omc</link>
      <guid>https://dev.to/mininglamp/humans-judge-ai-executes-rethinking-work-for-the-multi-agent-era-3omc</guid>
      <description>&lt;p&gt;A software engineer opens a chat window, pastes a stack trace, and waits for a suggested fix. A marketing manager types “write a product launch email” and copies the output into a draft. A data analyst asks for a SQL query, runs it, and comes back with a follow-up question. Three people, three different tasks, one interaction pattern: type a prompt, receive text, decide what to do with it.&lt;/p&gt;

&lt;p&gt;This pattern works. It works well enough that most organizations have adopted it without much friction. It also leaves most of the potential value on the table.&lt;/p&gt;

&lt;p&gt;The chat window treats every interaction as a standalone transaction. It holds no memory of what happened before, tracks no progress toward a larger goal, and carries no accountability for the outcome. The human on the other side absorbs all of that overhead: remembering context, managing workflow, evaluating quality, deciding next steps. The AI contributes text. The human contributes everything else.&lt;/p&gt;

&lt;p&gt;A different arrangement is possible, one where the division of labor shifts along a cleaner line. Humans supply judgment. AI supplies execution. The boundary between the two is where the interesting engineering problems live.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Execution Gap
&lt;/h1&gt;

&lt;p&gt;Execution, in the context of knowledge work, means something specific. It means taking a defined objective and carrying it through to a deliverable. Research a competitor and produce a comparison matrix. Draft a proposal based on meeting notes and send it for review. Monitor a data pipeline and escalate anomalies.&lt;/p&gt;

&lt;p&gt;Each of these tasks involves multiple steps, decision points, tool interactions, and intermediate outputs. A chat-based AI can assist with individual steps. It can draft a paragraph, write a query, summarize a document. The orchestration of steps remains the human’s responsibility.&lt;/p&gt;

&lt;p&gt;The gap between step-level assistance and task-level execution is larger than it appears. Orchestration requires context: what has already been done, what the final deliverable looks like, who will review it, what standards apply. A stateless chat session holds none of this information. The human carries it in their head and re-supplies it with every new prompt.&lt;/p&gt;

&lt;p&gt;When the task grows complex enough, the orchestration burden becomes its own form of work. The human spends more time managing the AI than doing the actual task. The tool has become a dependency rather than a lever.&lt;/p&gt;

&lt;p&gt;Closing this gap requires agents that operate at the task level rather than the step level. An agent that accepts an objective, maintains context across steps, invokes tools as needed, and produces a deliverable rather than a text fragment. The human’s role shifts from orchestrating steps to defining objectives and evaluating outcomes.&lt;/p&gt;

&lt;p&gt;We are building Octo as a workspace designed around this shift. Not a chat interface with extra features, but a system where agents receive assignments, maintain work context, and deliver results for human review.&lt;/p&gt;

&lt;h1&gt;
  
  
  Identity as Infrastructure
&lt;/h1&gt;

&lt;p&gt;An agent that executes tasks needs an identity. This is not a cosmetic concern.&lt;/p&gt;

&lt;p&gt;In organizational settings, identity determines trust boundaries. A human employee has a name, a role, a set of permissions, and a work history. These attributes determine what information they can access, what decisions they can make, and who is accountable for their output. An agent operating within an organization needs equivalent attributes to function within existing governance structures.&lt;/p&gt;

&lt;p&gt;An anonymous agent cannot be assigned a task with clear accountability. It cannot be granted access to sensitive information through existing permission systems. Its output cannot be traced back to a responsible party for audit purposes. These are not edge cases. They are baseline requirements for any agent that handles real work in a real organization.&lt;/p&gt;

&lt;p&gt;The identity model also shapes how humans interact with agents. A named agent with a defined role and visible track record gets treated as a collaborator. An anonymous chat window gets treated as a tool. The difference matters for adoption. People build working relationships with collaborators. They use tools and discard them.&lt;/p&gt;

&lt;p&gt;Each Bot in Octo carries an AgentCard: a profile containing its role definition, capability set, and delivery history. A Bot is created by a specific person, operates on that person’s behalf, and inherits a subset of the creator’s permissions. The Bot becomes a digital proxy, an extension of its creator’s capacity rather than a free-floating assistant.&lt;/p&gt;

&lt;p&gt;This proxy model has implications for delegation. A manager can create a Bot to handle routine approvals. An analyst can create a Bot to run recurring reports. The Bot acts within boundaries defined by its creator, and the creator retains accountability. Delegation without accountability is just automation. Delegation with accountability is a structural change in how work gets distributed.&lt;/p&gt;
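&lt;p&gt;The proxy model can be illustrated with a small data structure. This is a hypothetical sketch, not Octo’s actual AgentCard schema: the field names are assumptions, and the only rule it enforces is that a Bot’s permissions must be a subset of its creator’s.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    permissions: frozenset


@dataclass
class AgentCard:
    """Hypothetical profile: role, capabilities, and delivery history."""
    role: str
    capabilities: tuple
    created_by: User
    permissions: frozenset
    delivery_history: list = field(default_factory=list)

    def __post_init__(self):
        # A Bot may hold only a subset of its creator's permissions.
        extra = self.permissions - self.created_by.permissions
        if extra:
            raise PermissionError(f"not held by creator: {sorted(extra)}")

    def can(self, permission):
        return permission in self.permissions


analyst = User("dana", frozenset({"read:sales", "write:reports", "read:hr"}))
bot = AgentCard(
    role="weekly report writer",
    capabilities=("sql", "charting"),
    created_by=analyst,
    permissions=frozenset({"read:sales", "write:reports"}),
)
print(bot.can("read:sales"), bot.can("read:hr"))  # True False
```

&lt;p&gt;Checking the subset rule at construction time means a Bot can never be created with more access than the person accountable for it.&lt;/p&gt;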

&lt;h1&gt;
  
  
  The Judgment Layer
&lt;/h1&gt;

&lt;p&gt;Execution without judgment produces output. Execution guided by judgment produces work product. The distinction matters.&lt;/p&gt;

&lt;p&gt;An agent can draft a report based on available data. Whether that report is adequate depends on factors that sit outside the execution process: does the analysis address the right questions, are the sources credible, does the framing match the audience’s expectations, is the level of detail appropriate for the decision it supports. These are judgment calls. They require domain expertise, situational awareness, and an understanding of stakeholder expectations.&lt;/p&gt;

&lt;p&gt;The human role in a human-agent collaboration centers on these judgment calls. Humans define what good looks like. They evaluate whether the output meets that standard. They provide corrections when it falls short. This is not a peripheral activity to be minimized. It is the core contribution that makes agent execution valuable.&lt;/p&gt;

&lt;p&gt;Judgment also evolves over time. A reviewer who has evaluated fifty reports from a particular agent develops a sense of its tendencies: where it typically excels, where it commonly misses the mark, what kinds of instructions produce better results. This accumulated assessment constitutes a form of knowledge that should feed back into the agent’s behavior.&lt;/p&gt;

&lt;p&gt;The Taste system in Octo captures this feedback loop. When a human reviews a Bot’s deliverable, the review generates preference signals: acceptances, rejections, modification notes, style corrections. These signals accumulate as preference cards that the Bot references in subsequent tasks. The Bot’s output adapts over time to reflect the reviewer’s standards and preferences.&lt;/p&gt;
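&lt;p&gt;A feedback loop of this shape can be sketched as follows. The signal types, the card structure, and the way cards are turned into guidance are all assumptions made for illustration; they do not describe how Octo’s Taste system is implemented.&lt;/p&gt;

```python
from collections import Counter, defaultdict


class PreferenceStore:
    """Hypothetical store that accumulates review notes as preference cards."""

    def __init__(self):
        # reviewer name -> Counter mapping a correction note to its frequency
        self.cards = defaultdict(Counter)

    def record_review(self, reviewer, accepted, notes=()):
        """Store correction notes from one review; accepted reviews add none."""
        if not accepted:
            self.cards[reviewer].update(notes)

    def guidance_for(self, reviewer, top_n=3):
        """Most frequent corrections first, to prepend to the next task."""
        return [note for note, _ in self.cards[reviewer].most_common(top_n)]


store = PreferenceStore()
store.record_review("lee", accepted=False, notes=["cite sources", "shorter summary"])
store.record_review("lee", accepted=True)
store.record_review("lee", accepted=False, notes=["cite sources"])
print(store.guidance_for("lee"))  # ['cite sources', 'shorter summary']
```

&lt;p&gt;Ranking by frequency is the simplest possible weighting; a production system would likely also consider recency and task type.&lt;/p&gt;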

&lt;p&gt;A new Bot and a Bot that has worked with the same reviewer for three months should produce noticeably different output quality. If they don’t, the feedback mechanism isn’t working. The goal is not a static tool that performs the same way every time, but a collaborator that improves through accumulated judgment.&lt;/p&gt;

&lt;h1&gt;
  
  
  Multi-Agent Coordination
&lt;/h1&gt;

&lt;p&gt;Most work involves more than one agent, especially once agents can handle task-level execution. A research task might benefit from one agent gathering information while another synthesizes findings. A content pipeline might involve separate agents for drafting, editing, and formatting. A complex analysis might require agents with different domain specializations working in parallel.&lt;/p&gt;

&lt;p&gt;Coordination between agents introduces its own set of requirements. Agents need shared context: a common understanding of the overall objective, visibility into each other’s progress, and access to intermediate outputs. They need communication protocols: rules for when to hand off work, how to signal completion or blocking, and where to deposit results. They need conflict resolution: mechanisms for handling cases where two agents produce contradictory outputs or competing recommendations.&lt;/p&gt;

&lt;p&gt;The coordination topology should match the work structure. Sequential tasks call for pipeline patterns where each agent receives input from the previous one. Exploratory tasks benefit from discussion patterns where agents exchange perspectives and refine ideas. Independent subtasks suit parallel execution with a final aggregation step. Forcing all coordination through a single pattern creates friction.&lt;/p&gt;

&lt;p&gt;Octo supports six coordination modes. Solo handles single-agent work. Roundtable enables multi-agent discussion with turn-taking. Critic pairs a producer with a reviewer. Pipeline chains agents in sequence. Split distributes subtasks for parallel processing. Swarm allows agents to self-organize around a shared objective. Each mode defines different communication rules, context-sharing boundaries, and control flows.&lt;/p&gt;
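&lt;p&gt;Three of these modes map naturally onto small functions. The sketch below treats an agent as a plain Python callable and is not Octo’s implementation; the function names and the simple accept-or-revise protocol in critic are illustrative assumptions.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def pipeline(agents, task):
    """Pipeline: each agent receives the previous agent's output."""
    result = task
    for agent in agents:
        result = agent(result)
    return result


def split(agent, subtasks, aggregate):
    """Split: run subtasks in parallel, then aggregate the results."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(agent, subtasks))  # map preserves input order
    return aggregate(results)


def critic(producer, reviewer, task, max_rounds=3):
    """Critic: the producer revises until the reviewer returns no feedback."""
    draft = producer(task, None)
    for _ in range(max_rounds):
        feedback = reviewer(draft)
        if feedback is None:
            break
        draft = producer(task, feedback)
    return draft


# Toy agents standing in for model-backed ones.
def drafter(text):
    return f"draft({text})"


def editor(text):
    return f"edit({text})"


def producer(task, feedback):
    # Revise only when the reviewer asked for something.
    return task if feedback is None else task + "!"


def reviewer(draft):
    return None if draft.endswith("!") else "add emphasis"


print(pipeline([drafter, editor], "brief"))        # edit(draft(brief))
print(split(str.upper, ["a", "b", "c"], "".join))  # ABC
print(critic(producer, reviewer, "ship it"))       # ship it!
```

&lt;p&gt;The modes differ mainly in who sees whose output: Pipeline passes it forward, Split keeps subtasks isolated until aggregation, and Critic loops it between two roles.&lt;/p&gt;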

&lt;p&gt;The human’s role in multi-agent coordination varies by mode. In Pipeline, the human might only intervene at the final review stage. In Roundtable, the human might participate as an equal voice or observe and intervene when needed. In Swarm, the human sets the objective and reviews the aggregate output. The level of human involvement scales with the complexity and risk of the coordination pattern.&lt;/p&gt;

&lt;p&gt;Specialized agents also participate in these coordination patterns. Mano-P, a GUI-VLA agent built for edge devices, handles visual interface operations on local machines. It runs entirely on-device with no data leaving the host environment. The 72B model achieves a 58.2% success rate on the OSWorld benchmark, while a 4B quantized version operates with just 4.3GB of memory and decodes at 76 tokens per second. Agents like Mano-P contribute specific execution capabilities that slot into larger coordination patterns managed by the workspace.&lt;/p&gt;

&lt;h1&gt;
  
  
  Organizational Memory
&lt;/h1&gt;

&lt;p&gt;Individual preferences and task histories accumulate quickly. Across a team, the volume of accumulated knowledge becomes substantial: reviewer preferences, project contexts, client requirements, recurring task patterns, institutional standards. Most of this knowledge exists informally, stored in people’s heads or scattered across documents and email threads.&lt;/p&gt;

&lt;p&gt;An agent-based workspace captures this knowledge as a byproduct of normal operation. Every review generates preference data. Every completed task produces context that might apply to future tasks. Every successful workflow demonstrates a pattern that could be reused. The system doesn’t need a separate knowledge management initiative. It accumulates knowledge through use.&lt;/p&gt;

&lt;p&gt;Three categories of organizational knowledge emerge from this process. Context captures project-specific information, client backgrounds, and situational details that agents need to produce relevant output. Taste encodes quality standards, style preferences, and evaluation criteria that shape how agents approach their work. Skill packages recurring task patterns as reusable procedures that new agents can inherit.&lt;/p&gt;

&lt;p&gt;Together, these three categories form an organizational asset that compounds over time. A team that has used the workspace for six months has accumulated a knowledge base that would take a new team months to build from scratch. This knowledge transfers across team members: a new hire inherits the accumulated context, taste, and skills without having to develop them personally.&lt;/p&gt;

&lt;p&gt;The asset also creates switching costs, though of a healthy kind. The accumulated knowledge makes the workspace more valuable the longer it is used, which is the intended incentive structure. Teams invest in building their knowledge base and benefit from that investment over time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data Sovereignty
&lt;/h1&gt;

&lt;p&gt;None of the above works if the data lives on someone else’s servers.&lt;/p&gt;

&lt;p&gt;Work product contains sensitive information: strategic plans, financial projections, client data, internal communications, proprietary analysis. Organizations cannot route this information through third-party infrastructure without accepting risks they are often unwilling to accept. Regulatory requirements in many industries explicitly prohibit it.&lt;/p&gt;

&lt;p&gt;Private deployment is not a premium feature. It is a prerequisite for any agent system that handles real organizational work. A system that requires data to leave the organization’s infrastructure limits itself to low-sensitivity use cases, which are also the lowest-value use cases.&lt;/p&gt;

&lt;p&gt;Octo deploys on the organization’s own infrastructure. All data, all agent interactions, all accumulated knowledge stays within the organization’s environment. The system does not phone home, does not share telemetry, and does not require external API access for core functionality.&lt;/p&gt;

&lt;h1&gt;
  
  
  Access Points
&lt;/h1&gt;

&lt;p&gt;Work happens in multiple contexts. A person might start a task at their desk, check progress from a phone during a meeting, review output while browsing the web, and trigger a follow-up from a terminal. An agent workspace that only exists in one interface creates friction every time the human switches contexts.&lt;/p&gt;

&lt;p&gt;Four access points cover the common scenarios: a web application for full-featured interaction, a mobile interface for monitoring and quick decisions, a browser extension for context-aware interventions during web browsing, and a command-line interface for developers and automation workflows. Each interface provides access to the same underlying workspace, with the same context, the same agent relationships, and the same accumulated knowledge.&lt;/p&gt;

&lt;p&gt;The goal is not omnichannel presence for its own sake. It is ensuring that the human can exercise judgment from whatever context they happen to be in, without having to switch to a dedicated interface to review output or provide direction.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Shift
&lt;/h1&gt;

&lt;p&gt;The transition from chat-based AI assistance to agent-based execution changes what humans spend their time on. Less time orchestrating steps, more time defining objectives. Less time re-supplying context, more time evaluating outcomes. Less time managing tools, more time exercising judgment.&lt;/p&gt;

&lt;p&gt;This is not a marginal improvement. It is a different division of labor, one that allocates human attention to the activities where human judgment actually matters and delegates execution to agents that can operate autonomously within defined boundaries.&lt;/p&gt;

&lt;p&gt;The engineering challenges are substantial. Agents need identity, context persistence, coordination protocols, feedback mechanisms, and organizational knowledge systems. The workspace that houses these agents needs to support private deployment, multiple access points, and integration with existing tools and workflows.&lt;/p&gt;

&lt;p&gt;Octo is our approach to building this workspace. The code is open, the design decisions are documented, and the project welcomes examination and contribution. The multi-agent era is arriving through incremental engineering work, one coordination pattern at a time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>beginners</category>
    </item>
    <item>
      <title>When AI Agents Start Working Together: Three Challenges No One Talks About</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Mon, 22 Jun 2026 06:43:00 +0000</pubDate>
      <link>https://dev.to/mininglamp/when-ai-agents-start-working-together-three-challenges-no-one-talks-about-31hn</link>
      <guid>https://dev.to/mininglamp/when-ai-agents-start-working-together-three-challenges-no-one-talks-about-31hn</guid>
      <description>&lt;p&gt;The trajectory of AI agents over the past two years has been remarkably clear: from single-purpose tools to personal assistants. Everyone runs their own agent, feeds it tasks, gets results back. It works well for individual productivity.&lt;/p&gt;

&lt;p&gt;Then comes the question every team eventually asks: can these agents work together?&lt;/p&gt;

&lt;p&gt;The answer is yes, but the problems you encounter along the way are rarely the ones you expected. They aren't about model capabilities or prompt engineering. They're about communication, context, and coordination — the same class of problems that distributed systems engineers have been solving for decades, now showing up in a new form.&lt;/p&gt;

&lt;p&gt;Here are three challenges that caught us off guard when we started building agent collaboration into &lt;a href="https://github.com/Mininglamp-OSS/octo-server" rel="noopener noreferrer"&gt;Octo&lt;/a&gt;, an open-source workplace platform where AI agents and humans share the same communication space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge 1: Context Visibility Boundaries
&lt;/h2&gt;

&lt;p&gt;When you use an agent personally, context management is straightforward. You decide what information the agent sees; its output comes back to you. The boundary is clean — it's just your workspace.&lt;/p&gt;

&lt;p&gt;In a team setting, that boundary dissolves.&lt;/p&gt;

&lt;p&gt;One of the first issues we ran into was surprisingly simple. We had an agent summarizing discussions across several channels. During testing it started pulling roadmap discussions from a product channel into an engineering planning thread. Nothing sensitive leaked externally, but it immediately exposed how unclear our context boundaries were.&lt;/p&gt;

&lt;p&gt;Traditional software handles this through API gateways, data permissions, and microservice boundaries. But agent context isn't just structured data — it includes conversation history, reasoning chains, and intermediate states. An agent's thought process during a task is valuable context, but it might also contain information that shouldn't cross team boundaries.&lt;/p&gt;

&lt;p&gt;What you need is fine-grained context visibility control. Not "everything open" or "everything closed," but dynamic rules that determine which context can be shared based on the task, role, and scenario at hand.&lt;/p&gt;

&lt;p&gt;This is where instant messaging architecture turns out to be surprisingly relevant. Channels are natural context boundaries: members only see messages in channels they belong to. When an agent joins a channel, it inherits that boundary. It can access the channel's message history as context, but it can't see other channels. Reusing this mechanism is more dependable than building a context management system from scratch, and it maps cleanly onto how teams already organize their work.&lt;/p&gt;
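
&lt;p&gt;To make the channel-as-boundary idea concrete, here is a minimal sketch. It assumes a hypothetical in-memory &lt;code&gt;Workspace&lt;/code&gt; class; none of these names come from Octo's codebase, and a real system would persist channels and messages rather than hold them in memory.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    name: str
    members: set = field(default_factory=set)
    messages: list = field(default_factory=list)


class Workspace:
    """Hypothetical in-memory workspace; not Octo's actual data model."""

    def __init__(self):
        self.channels = {}

    def add_channel(self, name, members):
        self.channels[name] = Channel(name, set(members))

    def post(self, channel, sender, text):
        ch = self.channels[channel]
        if sender not in ch.members:
            raise PermissionError(f"{sender} is not a member of {channel}")
        ch.messages.append((sender, text))

    def context_for(self, agent, channel):
        """Return the channel history only if the agent is a member."""
        ch = self.channels[channel]
        if agent not in ch.members:
            return []
        return list(ch.messages)


ws = Workspace()
ws.add_channel("product", {"alice", "summary-bot"})
ws.add_channel("engineering", {"bob", "review-bot"})
ws.post("product", "alice", "Roadmap: ship feature X in Q3")
```

&lt;p&gt;In this sketch, &lt;code&gt;ws.context_for("review-bot", "product")&lt;/code&gt; returns an empty list, because the agent was never added to that channel. The boundary comes from membership, not from a separate context filter.&lt;/p&gt;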

&lt;h2&gt;
  
  
  Challenge 2: Permission Intersections and Conflicts
&lt;/h2&gt;

&lt;p&gt;Personal agent permissions are simple: whatever the user authorizes, the agent can do.&lt;/p&gt;

&lt;p&gt;In a team context, permissions become many-to-many. A single agent might serve multiple people, participate in multiple projects, and play different roles in different channels. Each dimension has its own permission requirements, and they can conflict.&lt;/p&gt;

&lt;p&gt;Here's a concrete example: a code review agent participates in two project channels. Project A's codebase is invisible to Team B, but the agent can access both codebases while serving both projects. If the agent, while reviewing Project B's code, references an implementation pattern from Project A, is that an information leak?&lt;/p&gt;

&lt;p&gt;The situation gets more complex in human-agent collaboration. When humans and agents work in the same channel, humans can see all of the agent's output. But the agent's output might draw on information from other contexts it has access to. How do you ensure the agent only uses information visible in the current channel when generating responses?&lt;/p&gt;

&lt;p&gt;Distributed systems have mature solutions for permission design — RBAC (role-based access control), ABAC (attribute-based access control), and their variants. Agent systems can borrow these approaches, but they need adaptation for agent-specific characteristics. Agents don't just passively execute commands; they reason, generate content, and make proactive decisions. Permission control needs to cover the generation process itself, not just inputs and outputs.&lt;/p&gt;

&lt;p&gt;In Octo, we adopted an organization-aware RBAC model where each channel has its own ACL (access control list). Agent identities and permissions are managed alongside human members. All agent input and output within a channel is auditable, and permission boundaries are naturally expressed through the channel mechanism that IM systems have refined over decades.&lt;/p&gt;
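
&lt;p&gt;A rough sketch of how an organisation check, a role check, and a per-channel ACL can be layered. The role table, the &lt;code&gt;Principal&lt;/code&gt; and &lt;code&gt;ChannelACL&lt;/code&gt; types, and the audit log format are all assumptions made for illustration; they are not Octo's actual schema.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical role-to-permission table; names are illustrative only.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "manage"},
    "member": {"read", "write"},
    "agent": {"read", "write"},
    "guest": {"read"},
}


@dataclass(frozen=True)
class Principal:
    name: str
    org: str
    role: str


@dataclass
class ChannelACL:
    org: str
    allowed: set  # principal names explicitly admitted to the channel


def is_allowed(principal, acl, action):
    """Organisation check, then role check, then channel ACL check."""
    if principal.org != acl.org:
        return False
    if action not in ROLE_PERMISSIONS.get(principal.role, set()):
        return False
    return principal.name in acl.allowed


audit_log = []


def authorize(principal, channel, acl, action):
    """Record every decision so agent and human access stay auditable."""
    decision = is_allowed(principal, acl, action)
    audit_log.append((principal.name, channel, action, decision))
    return decision


reviewer = Principal("review-bot", org="acme", role="agent")
project_a = ChannelACL(org="acme", allowed={"alice", "review-bot"})
project_b = ChannelACL(org="acme", allowed={"bob"})
```

&lt;p&gt;Here the agent and human members go through the same &lt;code&gt;authorize&lt;/code&gt; call, so the review agent can write in Project A's channel but gets no access to Project B's. Note that this only gates inputs and outputs; it does not by itself constrain what the agent draws on while generating.&lt;/p&gt;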

&lt;h2&gt;
  
  
  Challenge 3: Collective Experience Accumulation and Reuse
&lt;/h2&gt;

&lt;p&gt;A personal agent can learn from historical interactions, gradually understanding a user's preferences and working patterns. This learning is individual — experience accumulates in a single agent's context.&lt;/p&gt;

&lt;p&gt;In a team setting, the dimension of experience changes. It's not just about individual agent experience anymore, but about collective experience generated through multi-agent collaboration — which collaboration patterns are efficient, which task decomposition approaches tend to cause problems, where human intervention happens most frequently. If this information could be captured and reused, it would meaningfully improve the team's overall collaboration efficiency.&lt;/p&gt;

&lt;p&gt;But collective experience faces several challenges.&lt;/p&gt;

&lt;p&gt;First, there's the ownership question. When one agent learns something during a collaborative task, should other agents have access to it? If so, could that introduce context pollution — an agent incorrectly applying someone else's experience to its own scenario?&lt;/p&gt;

&lt;p&gt;Second, there's timeliness. Team collaboration patterns shift with project phases, team structure, and business goals. A pattern that worked three months ago might be irrelevant now. Captured experience needs update and deprecation mechanisms.&lt;/p&gt;

&lt;p&gt;Then there's quality assessment, which is easy to overlook. Not every historical interaction yields valuable experience. Some might be special cases; others might contain flawed judgments. Capturing experience while maintaining quality requires an evaluation framework.&lt;/p&gt;

&lt;p&gt;Message history, group documents, and pinned messages in IM systems — while not designed for experience capture — can serve this role in practice. Key conversation conclusions can be pinned, important decision processes can be archived to group documents, and agents can retrieve these structured artifacts as references when executing tasks. This approach is lighter than vector databases or knowledge graphs, and it's easier for both humans and agents to understand and maintain together.&lt;/p&gt;
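
&lt;p&gt;As an illustration of the timeliness and quality concerns above, the sketch below filters pinned notes before an agent uses them. The &lt;code&gt;PinnedNote&lt;/code&gt; type, the &lt;code&gt;reviewed&lt;/code&gt; flag, and the 90-day cutoff are assumptions chosen for the example, not features of any particular IM system.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PinnedNote:
    text: str
    pinned_at: datetime
    reviewed: bool = False  # set by a human or evaluator after a quality check


def usable_notes(notes, now, max_age=timedelta(days=90)):
    """Keep notes that passed review and are not older than max_age."""
    return [
        n.text
        for n in notes
        if n.reviewed and max_age >= now - n.pinned_at
    ]


now = datetime(2026, 6, 22)
notes = [
    PinnedNote("Split reviews by module", datetime(2026, 5, 1), reviewed=True),
    PinnedNote("Old release checklist", datetime(2025, 12, 1), reviewed=True),
    PinnedNote("Unverified one-off workaround", datetime(2026, 6, 1)),
]
```

&lt;p&gt;Calling &lt;code&gt;usable_notes(notes, now)&lt;/code&gt; keeps only the first note: the second is too old and the third was never reviewed. A fixed age cutoff is the simplest deprecation rule; a team could equally expire notes when a project phase ends.&lt;/p&gt;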

&lt;h2&gt;
  
  
  Why IM Architecture Matters Here
&lt;/h2&gt;

&lt;p&gt;These three challenges — context visibility, permission intersections, collective experience — all point to a deeper insight: agent collaboration isn't just about connecting multiple agents. It requires a complete collaboration infrastructure.&lt;/p&gt;

&lt;p&gt;That infrastructure needs to handle communication, context, permissions, state synchronization, and experience accumulation. These problems have mature solutions in traditional software engineering, but agent systems — with their autonomous reasoning, content generation, and proactive decision-making — push the complexity up a level.&lt;/p&gt;

&lt;p&gt;IM architecture shows a surprising fit for this scenario. Over decades, IM systems have solved multi-party real-time communication, context management, permission control, and state synchronization, accumulating mature architectural patterns and engineering practices. Migrating these capabilities to agent collaboration is more reliable than building a new system from scratch.&lt;/p&gt;

&lt;p&gt;This observation led us to build &lt;a href="https://github.com/Mininglamp-OSS/octo-server" rel="noopener noreferrer"&gt;Octo&lt;/a&gt; on IM foundations: agents join channels directly and collaborate with humans in the same conversation interface. The project is released under the Apache 2.0 license, has 9 core repositories under the Mininglamp-OSS GitHub organization, and runs on a stack of a Go backend, WuKongIM, MySQL, Redis, and MinIO. It supports private deployment, with all data kept on your own servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Moving AI agents from personal tools to team infrastructure expands what they can do, but it also changes the nature of the challenges. Better models alone won't solve communication, context, and coordination problems. These require mature collaboration infrastructure.&lt;/p&gt;

&lt;p&gt;The shift from personal assistant to team collaborator might be the next important transition in the AI agent space. When it happens, the teams that think about these architectural challenges early — rather than just stacking more agents together — will build systems that actually work in practice.&lt;/p&gt;

&lt;p&gt;If you're working on multi-agent systems or interested in agent collaboration infrastructure, we'd love to hear about the challenges you've encountered. The Octo project is open source, and we welcome contributions and discussions on &lt;a href="https://github.com/Mininglamp-OSS/octo-server" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Why We Put AI Agents in a Group Chat?</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Wed, 17 Jun 2026 10:01:01 +0000</pubDate>
      <link>https://dev.to/mininglamp/why-we-put-ai-agents-in-a-group-chat-a1d</link>
      <guid>https://dev.to/mininglamp/why-we-put-ai-agents-in-a-group-chat-a1d</guid>
      <description>&lt;p&gt;Most teams building with LLMs eventually hit the same wall. You have a handful of agents, each one reasonably capable at its specific task, but they cannot talk to each other. One agent handles data extraction, another does summarisation, a third manages scheduling, yet the only thing connecting them is a human copying and pasting between browser tabs.&lt;/p&gt;

&lt;p&gt;We ran into this problem repeatedly while deploying AI agents inside enterprise environments at Mininglamp Technology. The agents were good at what they did individually, but the coordination overhead landed squarely on people. After a while, using agents felt more exhausting than not using them, because there were simply more contexts to keep track of.&lt;/p&gt;

&lt;p&gt;This article is about a different approach we took, one that treats instant messaging protocols not as a UI layer bolted on top of agent infrastructure, but as the foundational fabric for agent distribution, orchestration, and permission management.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Session Island Problem
&lt;/h2&gt;

&lt;p&gt;Most agent frameworks on the market optimise for a single metric: how well one agent performs in isolation. Bigger context windows, more accurate tool calls, and faster inference are all valuable, but they miss a structural limitation that only becomes visible in multi-person, multi-step enterprise workflows.&lt;/p&gt;

&lt;p&gt;An agent's context is bounded by its own conversation thread. It has no native way to know what happened in another agent's thread, what decision a colleague made five minutes ago, or whether the data it is about to process has already been flagged as unreliable upstream. We call this the session island problem, and it gets worse as the number of agents in an organisation grows.&lt;/p&gt;

&lt;p&gt;Picture a typical scenario. A marketing team member drops a pricing question into a group channel. Answering it properly requires product specs from the engineering team's agent, cost models from the finance team's agent, and competitive intelligence from the sales team's agent. Under the current paradigm, someone has to manually route information between these agents, adding context at each handoff. The agents are powerful; the connective tissue between them is duct tape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not an API Gateway
&lt;/h2&gt;

&lt;p&gt;Our first instinct, like most engineering teams, was to build an API gateway. Agents expose endpoints, other agents call those endpoints, the gateway handles routing and authentication. It works, but three problems surfaced quickly during implementation.&lt;/p&gt;

&lt;p&gt;The first is state management. API calls are inherently stateless: one request, one response, done. But agent-to-agent collaboration in enterprise settings is often a long-running process. A single task might span several hours, involve multiple rounds of interaction, require human approval at certain checkpoints, and generate intermediate artifacts that other agents need to reference. Building this on top of a stateless API layer means reinventing message queues, state machines, and event subscription systems from scratch.&lt;/p&gt;

&lt;p&gt;The second is observability. When something goes wrong in an API-based agent mesh, you end up chasing logs across half a dozen services, trying to reconstruct the sequence of events from fragmented traces. In an IM-based architecture, the group chat itself is the complete audit trail: who said what and when, what decision was reached, and what was delivered are all there in a single chronological thread.&lt;/p&gt;

&lt;p&gt;The third is permission management. API gateways typically handle access control through tokens and ACLs, which are coarse-grained and tedious to configure. In a group chat model, information visibility maps directly onto group membership. Add someone to a group and they can see everything in it; remove them and they lose access. No separate permission matrix to maintain.&lt;/p&gt;

&lt;p&gt;These observations pushed us toward a fundamentally different design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Octo — Agents as First-Class Chat Participants
&lt;/h2&gt;

&lt;p&gt;Octo is an open-source platform we built around this insight. At its core is octo-server, a Go backend that simultaneously handles REST APIs, WebSocket connections, and IM message routing through WuKongIM. The key design decision is that agents (which we call Lobsters in Octo) are not external services invoked through webhooks. They are participants in conversations, with the same messaging capabilities as human users.&lt;/p&gt;

&lt;p&gt;When a Lobster agent joins a group chat, it receives the full conversation context (chat history, member roster, read receipts), not just a trigger event. The agent can proactively send messages, reply to specific people, notify relevant parties when a task completes, or ask follow-up questions when it needs more information.&lt;/p&gt;

&lt;p&gt;The request processing pipeline in octo-server follows a clear sequence. First, authenticate the request source (supporting tokens, cookies, and DH-encrypted WebSocket frames). Then authorise with organisation-aware RBAC and per-channel ACL checks. Execute the business logic, which may involve spawning or resuming a Lobster agent session. Fan out the resulting message through WuKongIM to all group members, triggering adapters if the channel requires an external bridge. Finally, return a unified JSON response with tracing and metrics tags.&lt;/p&gt;
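
&lt;p&gt;The stages above can be sketched as a chain of small functions. This is only an outline under stated assumptions: the token table, the membership map, and every function name are hypothetical stand-ins, and octo-server itself is written in Go, not Python.&lt;/p&gt;

```python
import uuid

# All names below are hypothetical stand-ins, not octo-server's real API.
TOKENS = {"tok-alice": "alice"}
CHANNEL_MEMBERS = {"pricing": {"alice", "finance-bot"}}


def authenticate(request):
    user = TOKENS.get(request.get("token"))
    if user is None:
        raise PermissionError("unauthenticated")
    return user


def authorise(user, channel):
    if user not in CHANNEL_MEMBERS.get(channel, set()):
        raise PermissionError("forbidden")


def execute(user, request):
    # Business logic; a real system might spawn or resume an agent session here.
    return {"sender": user, "text": request["text"]}


def fan_out(channel, message):
    # Deliver to every channel member except the sender.
    return sorted(CHANNEL_MEMBERS[channel] - {message["sender"]})


def handle(request):
    """Run the stages in order and return one uniform response shape."""
    trace_id = str(uuid.uuid4())
    try:
        user = authenticate(request)
        authorise(user, request["channel"])
        message = execute(user, request)
        delivered = fan_out(request["channel"], message)
        return {"ok": True, "delivered_to": delivered, "trace_id": trace_id}
    except PermissionError as err:
        return {"ok": False, "error": str(err), "trace_id": trace_id}


ok = handle({"token": "tok-alice", "channel": "pricing", "text": "Cost model?"})
denied = handle({"token": "bad", "channel": "pricing", "text": "hi"})
```

&lt;p&gt;Both the success and the failure path return the same response shape with a trace identifier, which is what makes the result easy to log and correlate downstream.&lt;/p&gt;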

&lt;p&gt;One important side effect of this design is message ordering. IM protocols enforce strict temporal ordering on messages within a channel, so every agent sees the same conversation history in the same sequence. For multi-step collaborative tasks, this ordering guarantee is significantly more reliable than reconstructing the order of independent request-response calls in a traditional API mesh.&lt;/p&gt;
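
&lt;p&gt;One common way to get this property is to stamp each message with a per-channel sequence number. The sketch below assumes a hypothetical &lt;code&gt;ChannelLog&lt;/code&gt; class and is not a description of how WuKongIM implements ordering.&lt;/p&gt;

```python
import itertools


class ChannelLog:
    """Hypothetical append-only log that stamps each message with a sequence number."""

    def __init__(self):
        self._seq = itertools.count(1)
        self._entries = []

    def append(self, sender, text):
        entry = (next(self._seq), sender, text)
        self._entries.append(entry)
        return entry

    def history(self, after=0):
        """Messages with a sequence number greater than `after`, in order."""
        return [e for e in self._entries if e[0] > after]


log = ChannelLog()
log.append("alice", "Need a price for plan B")
log.append("finance-bot", "Cost model attached")
log.append("sales-bot", "Competitor charges 20% more")

# Two agents that joined at different times still see one consistent order.
view_a = log.history()
view_b = log.history(after=1)
```

&lt;p&gt;An agent that reconnects only needs to remember the last sequence number it processed and ask for everything after it.&lt;/p&gt;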

&lt;h2&gt;
  
  
  Scaling Out vs. Scaling Up
&lt;/h2&gt;

&lt;p&gt;Improving a single agent's capabilities is scaling up: bigger models, longer contexts, more tools. What enterprises actually need, though, is scaling out: organising multiple specialised agents to collaborate on tasks that no single agent can handle alone.&lt;/p&gt;

&lt;p&gt;The analogy is straightforward. One person working overtime can only produce so much. A well-coordinated team of ten can dramatically outperform that individual. The challenge in the agent domain has always been that "well-coordinated" requires infrastructure that did not exist.&lt;/p&gt;

&lt;p&gt;Octo's group mechanism fills this gap. You create a group for each project or workstream, add the relevant agents and humans, and let them collaborate through the same messaging interface. Agents in a group behave essentially the same way as human members do, sending messages, replying to threads, marking tasks complete, requesting confirmation. Every interaction is automatically recorded as chat history, so newcomers (human or agent) can read through the backlog to get up to speed.&lt;/p&gt;

&lt;p&gt;We also built octo-smart-summary, a service that uses LLMs to periodically summarise group chat content, extracting key decisions, action items, and open questions. This addresses another common pain point: chat channels accumulate information fast, and late joiners struggle to find what matters.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│                OCTO Architecture                │
│                                                 │
│  ┌────────┐  ┌────────┐  ┌────────────┐         │
│  │octo-web│  │octo-ios│  │octo-android│         │
│  │(React) │  │(Swift) │  │  (Kotlin)  │         │
│  └───┬────┘  └───┬────┘  └─────┬──────┘         │
│      └───────────┼─────────────┘                │
│                  ▼                              │
│        ┌──────────────────┐                     │
│        │   octo-server    │                     │
│        │   (Go Backend)   │                     │
│        │  · REST + WS     │                     │
│        │  · Lobster sched │                     │
│        │  · WuKongIM ctrl │                     │
│        └──┬────┬────┬─────┘                     │
│           │    │    │                           │
│    ┌──────┘    │    └──────┐                    │
│    ▼           ▼           ▼                    │
│  ┌──────┐  ┌───────┐  ┌──────────┐              │
│  │octo- │  │smart- │  │  octo-   │              │
│  │matter│  │summary│  │ adapters │              │
│  └──────┘  └───────┘  └──────────┘              │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Security — The Group Is the Permission Boundary
&lt;/h2&gt;

&lt;p&gt;Enterprise deployments demand precise control over what agents can access, what operations they can perform, and who can see their output. Octo ties the permission model directly to the group model.&lt;/p&gt;

&lt;p&gt;Every channel in Octo has its own ACL configuration. When an agent joins a group, it inherits that group's permission settings, and everything the agent produces within the group is subject to the same access constraints. There is no separate permission system to maintain for agents; the group's access control is the agent's access control.&lt;/p&gt;

&lt;p&gt;For edge-agent scenarios, where computation happens on a user's local device (think of something like Mano-P, a GUI agent that runs natively on Apple Silicon), Octo's adapter mechanism can bridge execution results securely into IM groups. The actual computation and data never leave the device; only the necessary summaries and status updates flow into the group conversation.&lt;/p&gt;

&lt;p&gt;The authentication layer supports multiple mechanisms: traditional tokens, cookies, and Diffie-Hellman encrypted WebSocket frames. The authorisation layer implements organisation-aware RBAC, so every request passes through both organisation-level and channel-level permission checks before execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Engineering Considerations
&lt;/h2&gt;

&lt;p&gt;Plugging agents into an IM system is not as simple as adding another consumer to a message queue. Several engineering challenges deserve attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message format unification.&lt;/strong&gt; Agent output is more structured than human messages. You need a consistent envelope that distinguishes between plain text, tool call results, status updates, and error reports. Octo defines a unified message schema that all agent output passes through before entering the IM channel.&lt;/p&gt;
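
&lt;p&gt;A minimal sketch of what such an envelope could look like. The field names and the four &lt;code&gt;Kind&lt;/code&gt; values mirror the categories listed above, but the schema itself is an assumption made for illustration; Octo's actual message schema may differ.&lt;/p&gt;

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class Kind(str, Enum):
    TEXT = "text"
    TOOL_RESULT = "tool_result"
    STATUS = "status"
    ERROR = "error"


@dataclass
class Envelope:
    kind: Kind
    sender: str
    channel: str
    body: dict

    def to_json(self):
        data = asdict(self)
        data["kind"] = self.kind.value
        return json.dumps(data, sort_keys=True)

    @staticmethod
    def from_json(raw):
        data = json.loads(raw)
        data["kind"] = Kind(data["kind"])  # raises ValueError on unknown kinds
        return Envelope(**data)


msg = Envelope(Kind.TOOL_RESULT, "finance-bot", "pricing", {"tool": "cost_model", "value": 42})
wire = msg.to_json()
```

&lt;p&gt;Rejecting unknown kinds at parse time means a malformed agent message fails before it reaches the channel, rather than showing up as unreadable text for human members.&lt;/p&gt;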

&lt;p&gt;&lt;strong&gt;Concurrency conflicts.&lt;/strong&gt; Multiple agents might operate on the same group simultaneously, replying to the same message at the same time or updating the same task status. octo-server handles message ordering and conflict detection at the server level to prevent logical contradictions in the chat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context management.&lt;/strong&gt; Agent context windows are finite, but group histories can grow indefinitely. This is where octo-smart-summary plays a critical role. It periodically compresses chat history into summaries, allowing agents to read the digest first and drill into specific segments only when needed.&lt;/p&gt;
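
&lt;p&gt;The digest-first idea can be sketched as a simple budgeted context builder. The function below is hypothetical and uses a character budget as a stand-in for a token budget; it does not call an LLM and is not how octo-smart-summary is implemented.&lt;/p&gt;

```python
def build_context(summaries, recent_messages, budget_chars):
    """Summaries first, then as many of the newest raw messages as still fit."""
    parts = list(summaries)
    used = sum(len(s) for s in parts)
    kept = []
    for msg in reversed(recent_messages):  # newest first
        if used + len(msg) > budget_chars:
            break
        kept.append(msg)
        used += len(msg)
    return parts + list(reversed(kept))  # restore chronological order


summaries = ["Summary: team agreed on tiered pricing."]
recent = ["alice: what about plan B?", "bot: plan B costs 12", "bob: ok, ship it"]
context = build_context(summaries, recent, budget_chars=80)
```

&lt;p&gt;With an 80-character budget, the summary and the two newest messages fit, and the oldest raw message is dropped because the summary already covers that part of the history.&lt;/p&gt;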

&lt;h2&gt;
  
  
  Open Source and Next Steps
&lt;/h2&gt;

&lt;p&gt;The entire Octo project is open-sourced under Apache 2.0 on GitHub (&lt;a href="https://github.com/Mininglamp-OSS/octo-server" rel="noopener noreferrer"&gt;github.com/Mininglamp-OSS/octo-server&lt;/a&gt;), including the backend, web client, mobile apps, admin console, and all supporting tools. We adopted a local-first design philosophy throughout; chat records, vector indices, and agent execution can all run on the user's own infrastructure, with cloud deployment as an option rather than a requirement.&lt;/p&gt;

&lt;p&gt;If you are thinking about how to move AI agents from isolated tools to genuine nodes in an enterprise collaboration network, we would love to hear from you. Whether it is a GitHub star, an issue report, or a feature request from your own use case, every piece of feedback helps us build this infrastructure better.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>distributedsystems</category>
      <category>opensource</category>
    </item>
    <item>
      <title>ZCode vs MiMo Code vs DevEco Code: Who Really Solves Developer Pain Points in China's AI Coding Tools Race?</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Tue, 16 Jun 2026 11:44:15 +0000</pubDate>
      <link>https://dev.to/mininglamp/zcode-vs-mimo-code-vs-deveco-code-who-really-solves-developer-pain-points-in-chinas-ai-coding-1n6a</link>
      <guid>https://dev.to/mininglamp/zcode-vs-mimo-code-vs-deveco-code-who-really-solves-developer-pain-points-in-chinas-ai-coding-1n6a</guid>
      <description>&lt;p&gt;June 2026 marked a turning point for AI coding tools in China. Zhipu released ZCode 3.0, Xiaomi open-sourced MiMo Code, and Huawei launched DevEco Code at HDC 2026. Developer communities are calling it the "Three Kingdoms War" of domestic AI coding tools.&lt;/p&gt;

&lt;p&gt;These three products take fundamentally different technical approaches. This article compares them across product positioning, technical architecture, and developer experience, while exploring the unique value proposition of local-first solutions in this space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Product Positioning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ZCode 3.0 (Zhipu AI)&lt;/strong&gt;: Released June 13, positioned as a multi-agent collaborative IDE. Core features include grouped task workspaces, the Zread intelligent project knowledge base, and visual Git branch graphs. Zhipu's advantage lies in deep integration between its proprietary GLM model series and the tool itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MiMo Code (Xiaomi)&lt;/strong&gt;: Open-sourced June 11, built on OpenCode and released under the MIT license. Supports persistent memory systems, unlimited context windows, and multi-model compatibility (DeepSeek, Kimi, GLM, MiMo v2.5). Xiaomi chose an open ecosystem approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DevEco Code (Huawei)&lt;/strong&gt;: Launched at HDC 2026, a specialized programming agent for the HarmonyOS ecosystem. Built on Huawei's Bifang large model, it covers the full workflow from requirements design through testing and maintenance. The reported AI code generation rate reaches 80%. Huawei open-sourced all HarmonyOS AI-assisted development Skills to the OpenHarmony community.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Architecture Analysis
&lt;/h2&gt;

&lt;p&gt;Zhipu follows a &lt;strong&gt;model-driven&lt;/strong&gt; approach. ZCode 3.0's multi-agent concurrency and project understanding capabilities depend on GLM's underlying model capabilities. The product ceiling is tied to model iteration speed.&lt;/p&gt;

&lt;p&gt;Xiaomi follows an &lt;strong&gt;ecosystem-compatible&lt;/strong&gt; approach. MiMo Code doesn't bind to specific models—developers can freely switch underlying models. The MIT license lowers adoption barriers, but product differentiation relies mainly on upper-layer experience.&lt;/p&gt;

&lt;p&gt;Huawei follows a &lt;strong&gt;vertical specialization&lt;/strong&gt; approach. DevEco Code focuses exclusively on HarmonyOS scenarios. Multi-device adaptation, problem localization, and self-repair capabilities only make sense within the HarmonyOS ecosystem. Huawei's bet is that the HarmonyOS ecosystem is large enough to justify a dedicated tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Security: The Local-First Advantage
&lt;/h2&gt;

&lt;p&gt;As cloud-based AI coding tools become increasingly homogenized, data security and privacy emerge as differentiating factors.&lt;/p&gt;

&lt;p&gt;Mininglamp's open-source &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P&lt;/a&gt; is a GUI-VLA agent model designed for edge devices, supporting fully local execution on a Mac with an Apple M4 chip and 32GB of RAM. Screenshots and task descriptions never leave the device, making it suitable for scenarios with strict data security requirements.&lt;/p&gt;

&lt;p&gt;In the OSWorld specialized-model evaluation, Mano-CUA 1.1 achieved a 58.2% success rate, ranking first and leading second-place opencua-72b (45.0%) by 13.2 percentage points. In WebRetriever Protocol I testing, Mano-CUA 1.1 scored a NavEval of 41.7, surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FOS-World-Verified-Specialized-Model.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FOS-World-Verified-Specialized-Model.png" alt="OSWorld Benchmark" width="799" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Metrics
&lt;/h2&gt;

&lt;p&gt;Mano-P's 4B quantized model achieves a decode speed of approximately 80 tokens/s on an M5 Pro. Combined with the W8A8 activation quantization in the &lt;a href="https://github.com/Mininglamp-AI/Cider" rel="noopener noreferrer"&gt;Cider SDK&lt;/a&gt;, prefill is approximately 12.7% faster than the W8A16 baseline.&lt;/p&gt;

&lt;p&gt;Testing on 100 macOS GUI tasks showed that the local Mano-CUA-Thinking-4B model achieved a 56.0% pass rate, exceeding the cloud-based Qwen3-VL-Plus at 39.0%. Local small models can outperform cloud large models in specific scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FWebRetriever.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FWebRetriever.png" alt="WebRetriever Benchmark" width="799" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source and Installation
&lt;/h2&gt;

&lt;p&gt;Mano-P is released under the Apache 2.0 license, with a three-phase open-source plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phase 1: Mano-CUA Skills, now open source; install via &lt;code&gt;brew tap Mininglamp-AI/tap &amp;amp;&amp;amp; brew install mano-cua&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Phase 2: Local models and SDK; models are available on &lt;a href="https://huggingface.co/Mininglamp-2718/Mano-CUA-4B-Thinking-1.1" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt; and &lt;a href="https://www.modelscope.cn/models/Mininglamp2718/Mano-CUA-4B-Thinking-1.1" rel="noopener noreferrer"&gt;ModelScope&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Phase 3: Training methodology and quantization pruning techniques (planned)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Selection Recommendations
&lt;/h2&gt;

&lt;p&gt;ZCode 3.0 suits teams pursuing deep model integration; MiMo Code fits developers needing flexible multi-model switching; DevEco Code is the specialized tool for HarmonyOS developers.&lt;/p&gt;

&lt;p&gt;For scenarios with strict data security requirements or hard latency constraints, Mano-P's local-first approach deserves consideration. The three major tools and Mano-P represent different technical directions in AI coding tools—developers should choose based on actual needs.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>china</category>
      <category>developer</category>
    </item>
    <item>
      <title>Loop Engineering: The Next Step After Prompt Engineering for AI Agents</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Mon, 15 Jun 2026 11:21:07 +0000</pubDate>
      <link>https://dev.to/mininglamp/loop-engineering-the-next-step-after-prompt-engineering-for-ai-agents-449m</link>
      <guid>https://dev.to/mininglamp/loop-engineering-the-next-step-after-prompt-engineering-for-ai-agents-449m</guid>
      <description>&lt;h1&gt;
  
  
  Loop Engineering: The Next Step After Prompt Engineering for AI Agents
&lt;/h1&gt;

&lt;p&gt;The AI development landscape has undergone a fundamental shift. For years, prompt engineering dominated the conversation—crafting the perfect instruction, fine-tuning context windows, and optimizing token usage. But as AI agents evolve from simple question-answering systems to autonomous problem-solvers, a new discipline is emerging: &lt;strong&gt;Loop Engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At Mininglamp, we've spent the last two years building production-grade AI agents, and we've learned a crucial lesson: the magic isn't in the prompt anymore. It's in the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Prompts to Loops: Why the Shift Matters
&lt;/h2&gt;

&lt;p&gt;Prompt engineering assumes a single interaction: you provide input, the model provides output. This works well for chatbots, content generation, and straightforward tasks. But modern AI agents don't work that way. They operate in cycles—observing their environment, reasoning about what to do, taking action, and verifying the results before deciding what comes next.&lt;/p&gt;

&lt;p&gt;This cyclic behavior is fundamentally different from prompt-response patterns. It requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;State management&lt;/strong&gt; across multiple iterations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error recovery&lt;/strong&gt; when actions fail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic decision-making&lt;/strong&gt; based on intermediate results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource constraints&lt;/strong&gt; (time, API calls, tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification mechanisms&lt;/strong&gt; to know when to stop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These challenges can't be solved with better prompts alone. They require architectural patterns specifically designed for iterative, autonomous operation. That's Loop Engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Loop Engineering?
&lt;/h2&gt;

&lt;p&gt;Loop Engineering is the practice of designing, implementing, and optimizing the iterative cycles that power autonomous AI agents. It encompasses:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Loop Architecture&lt;/strong&gt;: The structure of observe-think-act-verify cycles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Management&lt;/strong&gt;: How agents track progress and context across iterations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control Flow&lt;/strong&gt;: Decision logic for branching, retrying, and terminating loops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling&lt;/strong&gt;: Strategies for graceful degradation and recovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Optimization&lt;/strong&gt;: Balancing speed, accuracy, and resource usage&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Think of it this way: if prompt engineering is about crafting a single perfect instruction, loop engineering is about designing the entire runtime environment where an agent operates autonomously.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anatomy of an Agent Loop
&lt;/h2&gt;

&lt;p&gt;Every AI agent loop follows a core pattern, though implementations vary widely. Here's the fundamental structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;task_complete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;task_complete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;observation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;perceive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# verify() is assumed to return True once the goal is met, which ends the loop
&lt;/span&gt;    &lt;span class="n"&gt;task_complete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;update_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break down each component:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Perception (Observe)
&lt;/h3&gt;

&lt;p&gt;The agent gathers information about its current state. For GUI agents, this means taking screenshots and parsing visual elements. For API-based agents, it means reading responses and status codes. The key challenge: extracting relevant information while filtering noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Reasoning (Think)
&lt;/h3&gt;

&lt;p&gt;The agent analyzes the observation in context of its goal and past actions. This is where LLMs shine—they can synthesize complex situations and generate plans. But reasoning in loops is different from single-shot reasoning. The agent must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track what it has already tried&lt;/li&gt;
&lt;li&gt;Understand why previous attempts succeeded or failed&lt;/li&gt;
&lt;li&gt;Adjust strategies based on accumulated evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Decision (Decide)
&lt;/h3&gt;

&lt;p&gt;Based on reasoning, the agent decides on a specific action. This could be clicking a button, making an API call, writing code, or asking for clarification. The decision must be concrete and executable.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Execution (Act)
&lt;/h3&gt;

&lt;p&gt;The agent performs the chosen action. This is where things get interesting: actions can fail, time out, or produce unexpected results. Robust execution requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Timeout handling&lt;/li&gt;
&lt;li&gt;Retry logic with backoff&lt;/li&gt;
&lt;li&gt;Resource cleanup on failure&lt;/li&gt;
&lt;li&gt;Logging for debugging&lt;/li&gt;
&lt;/ul&gt;
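
&lt;p&gt;As a minimal sketch of those four requirements, the helper below retries a failing action with exponential backoff, runs an optional cleanup hook after each failure, and logs every attempt. The name &lt;code&gt;execute_with_retry&lt;/code&gt; and its parameters are illustrative assumptions, not part of any particular agent framework, and the action is assumed to raise an exception (for example &lt;code&gt;TimeoutError&lt;/code&gt;) when it fails or times out.&lt;/p&gt;

```python
import logging
import time

logger = logging.getLogger("agent.execute")


def execute_with_retry(action, max_retries=3, base_delay=0.5, cleanup=None, sleep=time.sleep):
    """Run `action` (a zero-argument callable), retrying with exponential backoff.

    Hypothetical helper: the action is assumed to raise on failure or timeout.
    """
    for attempt in range(max_retries + 1):
        try:
            return action()
        except Exception as exc:
            logger.warning("attempt %d failed: %r", attempt + 1, exc)
            if cleanup is not None:
                cleanup()  # release resources before retrying or giving up
            if attempt == max_retries:
                raise  # out of retries: surface the last error to the loop
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

&lt;p&gt;Passing &lt;code&gt;sleep&lt;/code&gt; as a parameter is a design choice that keeps the helper easy to test without real delays.&lt;/p&gt;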

&lt;h3&gt;
  
  
  5. Verification (Verify)
&lt;/h3&gt;

&lt;p&gt;After execution, the agent checks whether the action achieved the desired effect. This is often overlooked but critical. Without verification, agents can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loop infinitely on failed actions&lt;/li&gt;
&lt;li&gt;Proceed with incorrect assumptions&lt;/li&gt;
&lt;li&gt;Miss partial successes that need refinement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Verification strategies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct checking&lt;/strong&gt;: Did the button click navigate to the expected page?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State comparison&lt;/strong&gt;: Has the relevant part of the environment changed?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal proximity&lt;/strong&gt;: Are we closer to the objective than before?&lt;/li&gt;
&lt;/ul&gt;
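
&lt;p&gt;The three checks can be combined into one verification result. The sketch below assumes the environment state is available as a plain dictionary and that progress is tracked as a count of outstanding sub-goals; both are simplifications for illustration, and &lt;code&gt;verify_step&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Verification:
    changed: bool           # state comparison: did the watched state change?
    matches_expected: bool  # direct check: does the new state match the expectation?
    closer: bool            # goal proximity: are fewer sub-goals outstanding?


def verify_step(before, after, expected, remaining_before, remaining_after):
    """Combine the three checks; states are plain dicts in this sketch."""
    changed = before != after
    matches = all(after.get(key) == value for key, value in expected.items())
    closer = remaining_before > remaining_after
    return Verification(changed=changed, matches_expected=matches, closer=closer)
```

&lt;p&gt;Keeping the three signals separate lets the loop tell "nothing happened" apart from "something happened, but not what was expected".&lt;/p&gt;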

&lt;h2&gt;
  
  
  Loop Patterns: Single-Step vs Multi-Step vs Self-Correcting
&lt;/h2&gt;

&lt;p&gt;Not all loops are created equal. The pattern you choose depends on task complexity, reliability requirements, and resource constraints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Single-Step Loops
&lt;/h3&gt;

&lt;p&gt;The simplest pattern: observe, act, done. Used for straightforward tasks with high confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: "Click the submit button"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;capture_screen&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;button_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;button_location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Done
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use&lt;/strong&gt;: Simple, well-defined actions with low failure probability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;: No error recovery. If the button isn't there, the agent fails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: Multi-Step Sequential Loops
&lt;/h3&gt;

&lt;p&gt;Multiple actions executed in sequence, with state carried forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: "Fill out and submit a form"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;form_fields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;capture_screen&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;field_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field_location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;type_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;capture_screen&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;submit_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Submit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;submit_location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use&lt;/strong&gt;: Tasks with clear, linear progression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;: Brittle to unexpected states. If a field is already filled, the agent might not handle it gracefully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Self-Correcting Loops
&lt;/h3&gt;

&lt;p&gt;The most sophisticated pattern: the agent monitors its own progress and adjusts strategies when stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: "Complete a complex workflow"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;max_attempts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;goal_achieved&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;observation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;capture_screen&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if we're stuck
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_stuck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;reconsider_approach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;continue_current_plan&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;select_action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Verify and learn
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;analyze_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;adjust_strategy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;update_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use&lt;/strong&gt;: Complex, unpredictable tasks requiring adaptation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;: Robust to failures, can recover from dead ends, learns from mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges&lt;/strong&gt;: More complex to implement, higher token usage, requires careful tuning of "stuck" detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Deep Dive: How Loops Actually Work
&lt;/h2&gt;

&lt;p&gt;Let's examine the technical considerations that separate toy implementations from production-grade agent loops.&lt;/p&gt;

&lt;h3&gt;
  
  
  State Management
&lt;/h3&gt;

&lt;p&gt;Agents need to track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task progress&lt;/strong&gt;: What has been accomplished?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action history&lt;/strong&gt;: What has been tried?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environmental state&lt;/strong&gt;: How has the world changed?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource usage&lt;/strong&gt;: How many tokens/API calls remain?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Implementation approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;In-context state&lt;/strong&gt;: Store everything in the prompt. Simple but token-expensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External state store&lt;/strong&gt;: Use a database or file system. More efficient but adds complexity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid&lt;/strong&gt;: Keep recent state in context, archive older state externally.&lt;/li&gt;
&lt;/ol&gt;
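
&lt;p&gt;A hybrid store can be quite small. The sketch below keeps the most recent steps in memory for the prompt and appends every step to a JSON Lines file; the class name &lt;code&gt;HybridState&lt;/code&gt;, the file format, and the default window size are assumptions chosen for illustration.&lt;/p&gt;

```python
import json
from collections import deque
from pathlib import Path


class HybridState:
    """Keep the last `window` steps in memory; append every step to a JSONL archive."""

    def __init__(self, archive_path, window=5):
        self.recent = deque(maxlen=window)
        self.archive_path = Path(archive_path)

    def record(self, step):
        # Archive first so nothing is lost when the deque evicts old steps.
        with self.archive_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(step) + "\n")
        self.recent.append(step)

    def context(self):
        """Steps to include in the next prompt."""
        return list(self.recent)

    def full_history(self):
        """Read the complete archive back, e.g. for debugging."""
        with self.archive_path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh]
```

&lt;p&gt;Only &lt;code&gt;context()&lt;/code&gt; feeds the model, so token cost stays bounded by the window while the archive keeps the full trace.&lt;/p&gt;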

&lt;h3&gt;
  
  
  Token Budget Management
&lt;/h3&gt;

&lt;p&gt;LLMs have context limits. In long-running loops, you can't keep appending to the prompt indefinitely. Strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summarization&lt;/strong&gt;: Periodically compress history into summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sliding window&lt;/strong&gt;: Keep only the most recent N iterations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective memory&lt;/strong&gt;: Store only key decisions and outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MAX_HISTORY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Error Recovery Patterns
&lt;/h3&gt;

&lt;p&gt;When actions fail, agents need strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retry with backoff&lt;/strong&gt;: For transient failures (network timeouts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alternative path&lt;/strong&gt;: Try a different approach to the same goal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollback&lt;/strong&gt;: Undo recent actions and try from a known-good state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation&lt;/strong&gt;: Ask for human help when stuck&lt;/li&gt;
&lt;/ol&gt;
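
&lt;p&gt;One way to wire these four strategies together is a small policy function that picks the cheapest strategy that still applies. This is a sketch under assumptions: the failure labels (&lt;code&gt;"transient"&lt;/code&gt;, &lt;code&gt;"wrong_state"&lt;/code&gt;) and the function name &lt;code&gt;choose_recovery&lt;/code&gt; are hypothetical, and a real agent would classify failures from its own error signals.&lt;/p&gt;

```python
from enum import Enum


class Recovery(Enum):
    RETRY = "retry with backoff"
    ALTERNATIVE = "alternative path"
    ROLLBACK = "rollback"
    ESCALATE = "escalate to a human"


def choose_recovery(failure_kind, retries_left, alternatives_left):
    """Pick the cheapest strategy that still applies.

    `failure_kind` uses hypothetical labels: "transient", "wrong_state", or anything else.
    """
    if failure_kind == "transient" and retries_left > 0:
        return Recovery.RETRY
    if failure_kind == "wrong_state":
        return Recovery.ROLLBACK
    if alternatives_left > 0:
        return Recovery.ALTERNATIVE
    return Recovery.ESCALATE
```

&lt;p&gt;Escalation is the fallback rather than the first choice, so a human is only interrupted once the automated options are exhausted.&lt;/p&gt;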

&lt;h3&gt;
  
  
  Verification Strategies
&lt;/h3&gt;

&lt;p&gt;How does an agent know it succeeded?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Direct observation&lt;/strong&gt;: Check if the expected change occurred&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invariant checking&lt;/strong&gt;: Verify that certain conditions still hold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal decomposition&lt;/strong&gt;: Break the goal into sub-goals and verify each&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence scoring&lt;/strong&gt;: Rate confidence in success and retry if low&lt;/li&gt;
&lt;/ol&gt;
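
&lt;p&gt;Goal decomposition and confidence scoring combine naturally: score each sub-goal, then retry only the ones that fall below a threshold. In the sketch below, &lt;code&gt;verify_goal&lt;/code&gt;, the 0.8 threshold, and the shape of the check functions are all illustrative assumptions.&lt;/p&gt;

```python
def verify_goal(sub_goal_checks, state, min_confidence=0.8):
    """Score each sub-goal in [0, 1]; the goal passes only if every score clears the bar.

    `sub_goal_checks` maps a sub-goal name to a function of `state` returning a confidence.
    """
    scores = {name: check(state) for name, check in sub_goal_checks.items()}
    # Sub-goals whose confidence is below the threshold need another attempt.
    retry = [name for name, score in scores.items() if min_confidence > score]
    return {"done": not retry, "retry": retry, "scores": scores}
```

&lt;p&gt;Returning the list of weak sub-goals, rather than a single pass/fail flag, tells the next iteration exactly what to redo.&lt;/p&gt;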

&lt;h2&gt;
  
  
  Real-World Performance: Benchmarking Loop Architectures
&lt;/h2&gt;

&lt;p&gt;Theory is nice, but how do different loop patterns perform in practice? We tested three architectures on the OSWorld benchmark, a comprehensive suite of real-world computer tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-Step&lt;/strong&gt;: Direct action based on initial observation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Step Sequential&lt;/strong&gt;: Linear execution of planned steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Correcting&lt;/strong&gt;: Adaptive loop with stuck detection and strategy adjustment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" alt="OSWorld Benchmark Results" width="799" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The self-correcting loop dramatically outperforms simpler patterns. Why?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Error recovery&lt;/strong&gt;: Real-world tasks fail. Self-correcting loops retry with different strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive planning&lt;/strong&gt;: When the environment doesn't match expectations, the agent adjusts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progress verification&lt;/strong&gt;: The agent knows when it's stuck and reconsiders.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The performance gap is substantial: self-correcting loops achieve a &lt;strong&gt;58.2% success rate&lt;/strong&gt; on OSWorld, compared to ~45% for multi-step sequential and ~30% for single-step approaches. That's roughly 13 percentage points over multi-step sequential, and roughly 28 over single-step, from loop engineering alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where the Gains Come From
&lt;/h3&gt;

&lt;p&gt;Analyzing failure modes reveals why self-correcting loops excel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;38% of failures&lt;/strong&gt; in single-step loops were due to incorrect initial observations (element not visible, page not loaded)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;52% of failures&lt;/strong&gt; in multi-step loops were due to unhandled intermediate states (popup appeared, form validation failed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-correcting loops&lt;/strong&gt; recovered from 71% of these failure modes through retry and strategy adjustment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building with Loops: Practical Implications
&lt;/h2&gt;

&lt;p&gt;If you're building AI agents, here's what Loop Engineering means for your architecture:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Design for Failure
&lt;/h3&gt;

&lt;p&gt;Assume every action can fail. Build verification and recovery into your loop from day one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad: Fire and forget
&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Good: Verify and recover
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;verify_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;scroll_to_button&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;verify_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;try_alternative_approach&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Implement Stuck Detection
&lt;/h3&gt;

&lt;p&gt;Agents often loop infinitely when stuck. Implement detection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_stuck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Too little history to judge; also guards the history[-3] lookup below
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="n"&gt;recent_actions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="c1"&gt;# Check for repeated actions with same results (entries must be hashable)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent_actions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="c1"&gt;# Check for oscillation between states
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent_actions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Budget Your Resources
&lt;/h3&gt;

&lt;p&gt;Set explicit limits on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum loop iterations&lt;/li&gt;
&lt;li&gt;Token usage per task&lt;/li&gt;
&lt;li&gt;Time per task&lt;/li&gt;
&lt;li&gt;API calls per task
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResourceBudget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_iterations&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_time&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;can_continue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt;
                &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens_used&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt;
                &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;elapsed_time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Log Everything
&lt;/h3&gt;

&lt;p&gt;Debugging agent loops is hard without comprehensive logging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log observations (screenshots, API responses)&lt;/li&gt;
&lt;li&gt;Log reasoning (why the agent chose an action)&lt;/li&gt;
&lt;li&gt;Log actions and results&lt;/li&gt;
&lt;li&gt;Log verification outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data is invaluable for improving your loops.&lt;/p&gt;
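
&lt;p&gt;A simple way to capture all four items is one structured record per loop iteration. The sketch below writes JSON Lines to any file-like object; the function name &lt;code&gt;log_step&lt;/code&gt; and the field names are assumptions, and large artifacts such as screenshots are assumed to be logged by path rather than embedded.&lt;/p&gt;

```python
import json
import time


def log_step(log, iteration, observation, reasoning, action, result, verified):
    """Append one loop iteration as a JSON line to a file-like object."""
    record = {
        "ts": time.time(),
        "iteration": iteration,
        "observation": observation,  # e.g. a screenshot path, not the raw image
        "reasoning": reasoning,
        "action": action,
        "result": result,
        "verified": verified,
    }
    log.write(json.dumps(record) + "\n")
    return record
```

&lt;p&gt;Because each line is a self-contained JSON object, a failed run can be replayed or filtered by iteration with standard tooling.&lt;/p&gt;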

&lt;h3&gt;
  
  
  5. Consider Edge Deployment
&lt;/h3&gt;

&lt;p&gt;For GUI agents, running loops on edge devices (local machines) offers advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt;: Screenshots and data never leave the device&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: No network round-trips for API calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt;: Works without internet connectivity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: No per-token API fees for high-volume usage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Case Study: Loop Engineering in Mano-P
&lt;/h2&gt;

&lt;p&gt;At Mininglamp, we've applied these principles in Mano-P, our edge-deployed GUI agent model. Mano-P uses a sophisticated self-correcting loop architecture with several key features:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" alt="Mano-P Architecture" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Mano-P Loop
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vision-Only Perception&lt;/strong&gt;: Screenshots are the sole input—no API hooks, no DOM access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Think-Act-Verify Cycle&lt;/strong&gt;: Each action includes explicit verification before proceeding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progressive Training&lt;/strong&gt;: Three-stage training (SFT → Offline RL → Online RL) teaches the model effective loop strategies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge-Native Execution&lt;/strong&gt;: Runs locally on Apple M4 chips with 32GB RAM, keeping all data on-device&lt;/li&gt;
&lt;/ol&gt;
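
&lt;p&gt;The think-act-verify cycle can be sketched in a few lines. This is a minimal illustration, not Mano-P's actual implementation: &lt;code&gt;run_loop&lt;/code&gt;, &lt;code&gt;LoopResult&lt;/code&gt;, and the &lt;code&gt;FlakyCounter&lt;/code&gt; toy environment are hypothetical names, and a plain counter stands in for screenshots and GUI actions.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoopResult:
    done: bool
    steps: int
    log: List[str] = field(default_factory=list)


def run_loop(observe, decide, act, verify, goal_reached, max_steps=10, max_retries=2):
    """Hypothetical think-act-verify loop with a step budget and bounded retries."""
    log = []
    for step in range(max_steps):
        state = observe()                      # perceive (a screenshot in a real agent)
        if goal_reached(state):
            return LoopResult(True, step, log)
        action = decide(state)                 # think: pick the next action
        for attempt in range(max_retries + 1):
            act(action)                        # act
            if verify(action, observe()):      # verify against a fresh observation
                log.append(f"step {step}: {action} ok")
                break
            log.append(f"step {step}: {action} failed (attempt {attempt + 1})")
        else:
            # Every retry failed: stop instead of building on a bad state.
            return LoopResult(False, step + 1, log)
    return LoopResult(goal_reached(observe()), max_steps, log)


class FlakyCounter:
    """Toy environment (an assumption for this sketch): the first press is dropped."""

    def __init__(self):
        self.value = 0
        self.calls = 0

    def press(self):
        self.calls += 1
        if self.calls != 1:
            self.value += 1


env = FlakyCounter()
result = run_loop(
    observe=lambda: env.value,
    decide=lambda state: ("press", state + 1),   # action carries the expected new value
    act=lambda action: env.press(),
    verify=lambda action, new_state: new_state == action[1],
    goal_reached=lambda state: state >= 3,
)
print(result.done, result.steps, len(result.log))
```

&lt;p&gt;Because verification runs on a fresh observation after every action, the dropped first press is caught and retried rather than silently corrupting later steps. The explicit step and retry budgets keep a stuck loop from running forever.&lt;/p&gt;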

&lt;h3&gt;
  
  
  Performance Results
&lt;/h3&gt;

&lt;p&gt;The loop engineering approach pays off:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;#1 on OSWorld among specialized GUI agent models&lt;/strong&gt;: 58.2% success rate, outperforming models 18x larger&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;13.2-point lead&lt;/strong&gt; over the second-place specialized model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous long-task execution&lt;/strong&gt;: Handles complex workflows with dozens of steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fully local&lt;/strong&gt;: No cloud API calls, complete data privacy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mano-P demonstrates that sophisticated loop engineering can make smaller, specialized models outperform much larger general-purpose models on agentic tasks. The model is open-source on GitHub (&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;), and we've seen developers building increasingly sophisticated agent workflows using its loop primitives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future of Loop Engineering
&lt;/h2&gt;

&lt;p&gt;As AI agents become more autonomous, Loop Engineering will become as fundamental as prompt engineering is today. We're seeing several trends:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical loops&lt;/strong&gt;: Agents that manage sub-agents in nested loop structures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning loops&lt;/strong&gt;: Agents that improve their loop strategies through experience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-modal loops&lt;/strong&gt;: Combining vision, text, and structured data in loop reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaborative loops&lt;/strong&gt;: Multiple agents coordinating through shared state&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: the quality of an AI agent is determined less by the model's raw capabilities and more by the quality of its loop architecture. A well-designed loop can make a 4B-parameter model outperform a 72B model on real-world tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Prompt engineering taught us how to communicate with AI models. Loop Engineering teaches us how to let them operate autonomously. The shift from single interactions to iterative cycles represents a fundamental change in how we build AI systems.&lt;/p&gt;

&lt;p&gt;For developers entering this space, the principles are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design for failure and recovery&lt;/li&gt;
&lt;li&gt;Verify every action before proceeding&lt;/li&gt;
&lt;li&gt;Budget resources explicitly&lt;/li&gt;
&lt;li&gt;Log comprehensively&lt;/li&gt;
&lt;li&gt;Start with simple loops, graduate to self-correcting ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agents that will define the next era of AI won't just be better at answering questions—they'll be better at operating in loops, adapting to uncertainty, and achieving complex goals autonomously. Loop Engineering is how we build them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want to experiment with production-grade agent loops? Check out &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P on GitHub&lt;/a&gt;—our open-source GUI-VLA agent model that runs locally on edge devices, keeping your data private while demonstrating state-of-the-art loop engineering in action.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>engineering</category>
    </item>
    <item>
      <title>Will AI Agents Replace Programmers?</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Fri, 12 Jun 2026 10:36:11 +0000</pubDate>
      <link>https://dev.to/mininglamp/will-ai-agents-replace-programmers-43ej</link>
      <guid>https://dev.to/mininglamp/will-ai-agents-replace-programmers-43ej</guid>
      <description>&lt;p&gt;The question used to be a thought experiment discussed in tech forums between sips of coffee. In 2026, it feels a lot more personal. Large language models can generate working programs from a few sentences of instruction, GUI automation agents navigate computer interfaces with increasing competence, and every other week brings a new open-source project that promises to make developers obsolete.&lt;/p&gt;

&lt;p&gt;The honest answer isn't a clean yes or no. Some parts of what developers do are being transformed beyond recognition, while other parts are becoming more valuable precisely because AI can't touch them. Let's break this down by looking at what the job actually involves day to day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI Agents Have Made Real Progress
&lt;/h2&gt;

&lt;p&gt;Code completion and generation were the first breakthroughs. Models today can take a brief description and produce a working React component, a SQL query, or a REST API implementation. The output quality in straightforward scenarios is approaching what you'd expect from a mid-level engineer. GitHub Copilot and similar tools are used daily by millions of developers; Stack Overflow's 2025 survey reported over 70% of respondents integrating AI tools into their workflow, up significantly from the year before.&lt;/p&gt;

&lt;p&gt;Beyond autocomplete, purpose-built agents have entered production use. Some read entire codebases to locate bugs, others generate front-end and back-end code from product specs and run their own tests, and still others analyze logs to identify bottlenecks and suggest optimizations. For standardized coding tasks, these agents are remarkably efficient. Pattern-heavy work with well-defined inputs and outputs is where large language models excel: common algorithms, boilerplate API endpoints, test suites covering edge cases, data transformation scripts, and configuration files generated from templates.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap Between Writing Code and Engineering Software
&lt;/h2&gt;

&lt;p&gt;Understanding why AI agents won't replace programmers requires distinguishing between "writing code" and "doing software engineering." Code is just one link in a much longer chain, and anyone who has shipped production software knows that writing the code is often the easiest part. The real challenges live elsewhere: understanding what needs to be built, making sound architectural choices, hunting down elusive bugs, keeping complex systems healthy over years of evolution.&lt;/p&gt;

&lt;p&gt;Requirements analysis is a good example. Product managers and stakeholders rarely arrive with complete specifications. Sometimes they can't articulate what they actually need, and developers spend enormous effort figuring out the real problem through iterative conversations, challenging assumptions, and proposing alternatives that the stakeholder hadn't considered. This process involves understanding organizational context, business dynamics, and interpersonal relationships. An agent parsing a Jira ticket is nowhere close.&lt;/p&gt;

&lt;p&gt;Architecture decisions involve navigating competing constraints. Performance, security, maintainability, developer productivity, team expertise, existing infrastructure, and future extensibility rarely align neatly. Introducing a message queue improves decoupling but adds operational overhead; adding a caching layer speeds up reads but introduces consistency challenges. This kind of judgment emerges from years of shipping systems, watching them fail in production, and learning what works in specific contexts. No amount of training data substitutes for that experience.&lt;/p&gt;

&lt;p&gt;Then there's debugging production incidents. Distributed system bugs span network layers, database engines, caching systems, and message brokers simultaneously. Symptoms appear far from root causes, and finding the issue requires the kind of holistic mental model that comes from understanding every layer of the stack and having seen similar failures before. This pattern-matching ability, built over years of painful debugging sessions, remains firmly in human territory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Parts of Programming That Resist Automation
&lt;/h2&gt;

&lt;p&gt;Engineering is fundamentally a team activity. Code reviews, design discussions, and cross-functional coordination consume a substantial portion of developer time and energy. Good code reviews go beyond syntax and logic; they evaluate design choices, anticipate edge cases, ensure consistency with team conventions, and assess maintainability over the long term. This engineering taste develops over years of building and reviewing software together, and current model training paradigms don't provide that experience.&lt;/p&gt;

&lt;p&gt;Technical design discussions involve similar dynamics. Teams weigh multiple viable approaches, and each choice carries implications for different departments' resources and priorities. Navigating these conversations requires organizational awareness and persuasion skills, not code generation ability.&lt;/p&gt;

&lt;h2&gt;
  
  
  What History Tells Us About Tools and Jobs
&lt;/h2&gt;

&lt;p&gt;Every major leap in developer tools has expanded rather than shrunk the software industry's demand for talent. When high-level languages replaced assembly, people predicted programmers would become unnecessary; instead, higher abstraction lowered the barrier to entry and created a vastly larger software industry with more developer positions than ever. CI/CD pipelines and cloud-native platforms didn't eliminate DevOps roles; they created SRE and platform engineering positions with higher complexity and compensation.&lt;/p&gt;

&lt;p&gt;AI agents follow the same pattern. What's changing is the content of the job, not the existence of the job. As AI handles more repetitive coding tasks like boilerplate API endpoints, unit test generation, configuration file creation, and data format conversion, developers can invest more energy in higher-value activities: evaluating technical feasibility, evolving system architecture, deep performance optimization, and integrating emerging technologies into business contexts.&lt;/p&gt;

&lt;p&gt;This shift has been underway for a while. Tools eliminate low-value tasks within roles, not the roles themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Effective Human-AI Collaboration Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;In scenarios where developers work effectively with AI agents, a stable pattern emerges. Developers define problems and constraints; AI generates candidate solutions; developers evaluate and refine; AI executes the coding work; developers review and integrate. The division of labor is clear: humans handle direction-setting and goal definition, AI handles efficient search for optimal solutions within given constraints. Problem definition, constraint judgment, and trade-off decisions remain distinctly human competencies.&lt;/p&gt;

&lt;p&gt;Take GUI automation as an example. Current-generation agents can understand screen content through visual recognition, plan operation sequences, and execute multi-step tasks across applications. On OSWorld, a standard benchmark for general GUI operations, leading agents complete over 58% of complex desktop tasks involving hundreds of interactive elements. But these agents still need humans to set goals, verify results, and handle unexpected situations. They're powerful assistants with strong execution capability, not autonomous decision-makers.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Agents as Developer Tools
&lt;/h2&gt;

&lt;p&gt;This philosophy of human-AI collaboration is central to Mano-P, an open-source GUI Agent model from Mininglamp Technology that runs locally on Mac hardware. Mano-P uses pure visual interaction with screen elements, requiring no system API access or application-specific integrations. On the OSWorld benchmark, it ranks among the top performers at 58.2% task completion; its 4-billion parameter quantized version achieves 476 tokens/s prefill speed and 76 tokens/s decode speed on Apple M4 Pro chips, with peak memory usage of just 4.3GB.&lt;/p&gt;

&lt;p&gt;What sets Mano-P apart from cloud-based agent solutions is that all task data, including screenshots and operation instructions, stays on the local device. For developers handling sensitive business data, this privacy guarantee is essential. Its positioning is as a local assistant that understands interfaces and executes repetitive operations, not as a replacement for developer judgment. Developers can integrate Mano-P into their workflows through the open-source Mano-CUA Skills framework or Python SDK, delegating tasks like UI automation testing and cross-application data extraction to the agent while focusing their own time on work that requires creativity and judgment.&lt;/p&gt;

&lt;p&gt;Learn more on GitHub: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI agents won't replace programmers, just as compilers, IDEs, and Stack Overflow didn't. What will happen is that developers who embrace AI tools will become significantly more effective, while those who resist may find themselves outpaced by peers who use them. For developers willing to evolve, AI agents are not a threat to the profession but a chance to focus their energy on what matters most.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>What Is RAG? Why LLM Memory Alone Is Never Enough</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Thu, 11 Jun 2026 09:52:45 +0000</pubDate>
      <link>https://dev.to/mininglamp/what-is-rag-why-llm-memory-alone-is-never-enough-2e8d</link>
      <guid>https://dev.to/mininglamp/what-is-rag-why-llm-memory-alone-is-never-enough-2e8d</guid>
      <description>&lt;p&gt;Ask a large language model for a specific statistic, then ask where it found that number. More often than not, the citation it gives you doesn't exist. The model will hallucinate a plausible-looking reference, confidently present outdated conclusions, or simply make things up without any internal signal that something is wrong. This failure mode has a well-known name — hallucination — and the most widely adopted engineering solution for it is RAG.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG in One Sentence
&lt;/h2&gt;

&lt;p&gt;RAG stands for Retrieval-Augmented Generation. The idea is straightforward: before the LLM generates an answer, retrieve relevant document chunks from an external knowledge base, then feed those chunks to the model as context so it can compose its response based on real source material rather than parametric memory alone.&lt;/p&gt;

&lt;p&gt;Think of it like writing a research paper. You don't cite statistics from memory; you look them up first, then write your argument around verified data. RAG gives language models the same "look it up, then write" workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Structural Limitations of LLMs
&lt;/h2&gt;

&lt;p&gt;To understand why RAG is necessary, we need to identify the specific gaps it fills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge cutoff.&lt;/strong&gt; Every model has a training data deadline, and the date differs by vendor and model version: the original GPT-4 was trained on data through September 2021, while later releases from OpenAI and Anthropic moved their cutoffs into 2023 and beyond. Anything that happened after that deadline simply doesn't exist in the model's world. It will either admit ignorance or, more dangerously, fabricate an answer that sounds current.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bounded parametric capacity.&lt;/strong&gt; Even a 100-billion-parameter model can only "memorize" so much. Long-tail facts, niche domain knowledge, your company's internal documentation, yesterday's meeting notes — none of these are in the weights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No built-in fact-checking.&lt;/strong&gt; Token generation is probabilistic sampling. The model has no mechanism to distinguish whether it's recalling a training fact or pattern-matching its way into a plausible-sounding fiction.&lt;/p&gt;

&lt;p&gt;RAG addresses all three: it supplies up-to-date, verifiable, externally sourced evidence at inference time.&lt;/p&gt;

&lt;h2&gt;
  
  
  How RAG Works: Three Stages
&lt;/h2&gt;

&lt;p&gt;A standard RAG pipeline has three phases — Indexing, Retrieval, and Generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indexing&lt;/strong&gt; transforms your raw knowledge base into a searchable format. Documents are split into chunks (typically 512–1024 tokens), each chunk is converted into a high-dimensional vector using an embedding model, and those vectors are stored in a vector database (FAISS, Milvus, Chroma, Pinecone, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval&lt;/strong&gt; finds the most relevant chunks for a given query. The user's question is also embedded into the same vector space, then a similarity search (cosine similarity or dot product) returns the Top-K closest chunks. Advanced pipelines add a reranking step using a cross-encoder to refine initial results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation&lt;/strong&gt; concatenates the retrieved chunks with the original question into a prompt, then sends it to the LLM. The model's job shifts from "answer from memory" to "answer based on the provided material" — essentially an open-book exam.&lt;/p&gt;
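
&lt;p&gt;The three stages can be sketched end to end in plain Python. This is an illustration under loud assumptions: &lt;code&gt;embed&lt;/code&gt; is a toy bag-of-words counter standing in for a real embedding model, a Python list stands in for the vector database, and the function names are hypothetical. The final prompt would be sent to whichever LLM you use; that call is left out.&lt;/p&gt;

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Indexing: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Retrieval: embed the query and return the Top-K most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, retrieved):
    """Generation input: concatenate retrieved chunks with the question."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


index = build_index([
    "Refunds are processed within 14 days of the return request.",
    "The office is closed on public holidays.",
    "Support tickets are answered within one business day.",
])
question = "How many days until refunds are processed?"
print(build_prompt(question, retrieve(index, question, k=1)))
```

&lt;p&gt;Swapping the toy pieces for a real embedding model and vector store changes the quality of the ranking, not the shape of the pipeline.&lt;/p&gt;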

&lt;h2&gt;
  
  
  Embeddings: The Foundation of Retrieval Quality
&lt;/h2&gt;

&lt;p&gt;Retrieval quality depends heavily on the embedding model. An embedding compresses a text passage into a fixed-length numerical vector (commonly 768 or 1024 dimensions) such that semantically similar texts end up close together in vector space.&lt;/p&gt;

&lt;p&gt;For example, "how to improve code quality" and "writing better code" should map to nearby vectors, while "nice weather today" should land in a completely different region.&lt;/p&gt;

&lt;p&gt;Popular choices include OpenAI's text-embedding-3, BGE, and E5 families. When evaluating options, look at retrieval task scores on the MTEB leaderboard rather than raw parameter counts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector Database Selection
&lt;/h2&gt;

&lt;p&gt;Vector databases differ from relational databases in a fundamental way: relational DBs excel at exact matching ("find record id=123"), while vector DBs excel at approximate nearest-neighbor search ("find the 10 passages most semantically similar to this query").&lt;/p&gt;

&lt;p&gt;Quick selection guide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FAISS&lt;/strong&gt; — high-performance single-node scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Milvus&lt;/strong&gt; — distributed, large-scale deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chroma&lt;/strong&gt; — lightweight prototyping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinecone&lt;/strong&gt; — fully managed cloud service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider your data scale, latency requirements, persistence needs, and operational capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunking Strategy Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;Chunk too large and you dilute relevance with noise. Chunk too small and you lose surrounding context, making retrieved passages hard to interpret.&lt;/p&gt;

&lt;p&gt;Common strategies include fixed-length chunking (by token count), semantic chunking (split at paragraph or section boundaries), and recursive chunking (split by large structures first, then sub-divide). Start with 512–1024 tokens per chunk and adjust based on your specific document types and downstream evaluation metrics.&lt;/p&gt;
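
&lt;p&gt;As a rough sketch of fixed-length chunking, the function below splits text into overlapping windows. Two assumptions to note: whitespace-separated words stand in for model tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the overlap between neighbouring chunks is an addition of this sketch, a common way to soften the context loss at chunk boundaries. The name &lt;code&gt;chunk_fixed&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def chunk_fixed(text, size=512, overlap=64):
    """Split text into windows of `size` words that share `overlap` words."""
    if 0 >= size or 0 > overlap or overlap >= size:
        raise ValueError("need size > overlap >= 0")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_fixed(sample, size=4, overlap=1):
    print(chunk)
```

&lt;p&gt;Treat &lt;code&gt;size&lt;/code&gt; and &lt;code&gt;overlap&lt;/code&gt; as parameters to tune against your evaluation set rather than as fixed values.&lt;/p&gt;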

&lt;h2&gt;
  
  
  RAG vs Fine-tuning: When to Use Which
&lt;/h2&gt;

&lt;p&gt;Both RAG and fine-tuning help models "learn" new knowledge, but they serve different purposes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG works best when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knowledge changes frequently (just update the vector store, no retraining)&lt;/li&gt;
&lt;li&gt;You need citations and source attribution&lt;/li&gt;
&lt;li&gt;You want to augment multiple model versions with the same knowledge base&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning works best when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to change the model's output style or reasoning pattern&lt;/li&gt;
&lt;li&gt;You have stable, domain-specific reasoning chains to internalize&lt;/li&gt;
&lt;li&gt;Low-latency inference matters and you can't afford retrieval overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, they're not mutually exclusive. Many production systems use RAG for knowledge grounding and fine-tuning for behavioral alignment.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG for Local Models: Privacy Meets Capability
&lt;/h2&gt;

&lt;p&gt;RAG's value becomes even more pronounced for models running on edge devices. Local models typically sit in the 4B–8B parameter range, which means their "memory capacity" is inherently limited. At the same time, the core motivation for local deployment is usually data privacy — keeping sensitive information off cloud servers.&lt;/p&gt;

&lt;p&gt;Local model + local RAG gives you the best of both worlds: high-quality answers grounded in external knowledge, with all data — documents and queries alike — staying on-device.&lt;/p&gt;

&lt;p&gt;Our team has been working on edge AI for a while. &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P&lt;/a&gt; is an open-source GUI agent model built for Apple Silicon devices. Its 4B quantized version runs locally on a Mac mini (M4 chip, 32GB RAM) at ~80 tokens/s decode speed, and the companion &lt;a href="https://github.com/Mininglamp-AI/Cider" rel="noopener noreferrer"&gt;Cider&lt;/a&gt; SDK adds INT8 activation quantization that delivers 1.4x–2.2x prefill speedup. Mano-P currently ranks #1 on OSWorld among specialized GUI agent models with a 58.2% success rate, with all inference executed entirely on-device — screenshots and task data never leave the machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" alt="Mano-P Architecture" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;For developers looking to build their first RAG pipeline, the fastest path is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick a framework (LangChain or LlamaIndex)&lt;/li&gt;
&lt;li&gt;Load your documents and configure chunking&lt;/li&gt;
&lt;li&gt;Choose an embedding model and vector store&lt;/li&gt;
&lt;li&gt;Wire up retrieval and generation&lt;/li&gt;
&lt;li&gt;Build an evaluation set and iterate on each component&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The engineering complexity of RAG is manageable. What takes time is systematic optimization — measuring retrieval precision, tuning chunk sizes, experimenting with reranking, and crafting prompt templates that help the model make good use of retrieved context.&lt;/p&gt;




&lt;p&gt;If you're exploring local AI agents or edge-native inference, check out &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P on GitHub&lt;/a&gt;. Stars are always appreciated ⭐&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Everyone Has an AI Agent Now, But They Still Can't Talk to Each Other</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Wed, 10 Jun 2026 07:50:49 +0000</pubDate>
      <link>https://dev.to/mininglamp/everyone-has-an-ai-agent-now-but-they-still-cant-talk-to-each-other-3ldj</link>
      <guid>https://dev.to/mininglamp/everyone-has-an-ai-agent-now-but-they-still-cant-talk-to-each-other-3ldj</guid>
      <description>&lt;p&gt;The agent gold rush is in full swing. Every major tech company has shipped one. Startups are building them by the hundreds. Open-source frameworks like LangChain, CrewAI, and AutoGen have made it trivially easy to spin up an agent that can browse the web, write code, or manage your calendar.&lt;/p&gt;

&lt;p&gt;And yet, if you ask your coding agent to hand off a task to your scheduling agent, it stares at you blankly. If you want two agents from different vendors to coordinate on a project, you're looking at custom glue code, brittle API wrappers, and a lot of prayer.&lt;/p&gt;

&lt;p&gt;We have a thousand agents. We have zero agent society.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Island Problem
&lt;/h2&gt;

&lt;p&gt;Think about how agents work today. Each one is a self-contained loop: perceive → think → act. They connect to the outside world through tool calls — API endpoints, browser automation, file system access. When an agent needs information from another system, it calls an API. When it needs to trigger an action, it calls another API.&lt;/p&gt;

&lt;p&gt;This works fine for agent-to-service communication. But agent-to-agent? That's a fundamentally different problem.&lt;/p&gt;

&lt;p&gt;When two humans collaborate, they don't just exchange API calls. They share context. They build on each other's understanding. They negotiate, delegate, and verify. They operate within social structures — teams, organizations, hierarchies — that determine who can see what and who can ask whom for help.&lt;/p&gt;

&lt;p&gt;Current agents have none of this infrastructure. They're islands with HTTP bridges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why API Integration Isn't Enough
&lt;/h2&gt;

&lt;p&gt;Let's say you have a research agent and a writing agent. The research agent finds relevant papers, extracts key findings, and builds a knowledge base. The writing agent takes briefs and produces drafts.&lt;/p&gt;

&lt;p&gt;The naive approach: the writing agent calls the research agent's API, gets back a JSON blob of findings, and works from there.&lt;/p&gt;

&lt;p&gt;Here's what breaks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context loss.&lt;/strong&gt; The research agent spent 30 minutes building an internal representation of how these papers relate to each other, which claims are controversial, which sources are most reliable. None of that transfers through a flat API response. The writing agent gets data, not understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No shared memory.&lt;/strong&gt; If the writing agent discovers that a certain angle doesn't work and pivots, the research agent doesn't learn from this. Next time, it'll make the same recommendations. There's no feedback loop, no accumulated shared knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permission blindness.&lt;/strong&gt; In an organization, different agents handle different security domains. Your HR agent knows salary data. Your analytics agent knows customer behavior. When they need to collaborate on a workforce planning task, who decides what each can see? Today, it's all or nothing — full API access or no access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No delegation semantics.&lt;/strong&gt; "Hey research agent, I need you to go deeper on section 3" isn't an API call. It's a conversational act with implied context, priority, and expected format. Current tool-call interfaces can't express this naturally.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Actually Need: A Social Layer
&lt;/h2&gt;

&lt;p&gt;Human collaboration doesn't work through point-to-point API calls. It works through social infrastructure: shared workspaces, organizational hierarchies, communication norms, and knowledge commons.&lt;/p&gt;

&lt;p&gt;Agents need the same thing. Not just connectivity, but a social layer — a way to form groups, share context with appropriate access control, build collective knowledge, and communicate with the richness that collaboration demands.&lt;/p&gt;

&lt;p&gt;This isn't a new API gateway. It's a protocol-level rethinking of how agents relate to each other.&lt;/p&gt;

&lt;p&gt;Here's what such a layer requires:&lt;/p&gt;

&lt;h3&gt;
  
  
  Organization-Level Memory
&lt;/h3&gt;

&lt;p&gt;Agents in the same organization should have access to shared memories — not just databases, but contextual knowledge that flows between agents with permission-based access control. When your customer support agent learns that a particular client prefers email over Slack, your account management agent should know this too, without anyone writing an explicit sync job.&lt;/p&gt;

&lt;p&gt;This means memories aren't just stored — they're shared within permission boundaries. An agent's understanding of a customer, a project, or a domain concept becomes organizational knowledge that other authorized agents can draw from.&lt;/p&gt;
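
&lt;p&gt;One way to picture permission-scoped shared memory is the sketch below. It is purely illustrative: &lt;code&gt;OrgMemory&lt;/code&gt;, &lt;code&gt;Memory&lt;/code&gt;, and the scope names are hypothetical, and an in-process list stands in for whatever store and identity system a real deployment would use.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    subject: str   # who or what the memory is about
    fact: str      # the knowledge itself
    scope: str     # permission scope required to read or write it


class OrgMemory:
    """Hypothetical organization-level memory with scope-based access control."""

    def __init__(self):
        self._memories = []
        self._grants = {}   # agent id -> set of scopes it may use

    def grant(self, agent, scope):
        self._grants.setdefault(agent, set()).add(scope)

    def write(self, agent, memory):
        if memory.scope not in self._grants.get(agent, set()):
            raise PermissionError(f"{agent} may not write to scope {memory.scope}")
        self._memories.append(memory)

    def read(self, agent, subject):
        allowed = self._grants.get(agent, set())
        return [
            m.fact for m in self._memories
            if m.subject == subject and m.scope in allowed
        ]


org = OrgMemory()
org.grant("support-agent", "customer-contact")
org.grant("account-agent", "customer-contact")
org.write("support-agent", Memory("client-42", "prefers email over Slack", "customer-contact"))

print(org.read("account-agent", "client-42"))     # authorized: sees the preference
print(org.read("analytics-agent", "client-42"))   # no grant: sees nothing
```

&lt;p&gt;The account management agent picks up what the support agent learned without an explicit sync job, while an agent outside the scope gets an empty result.&lt;/p&gt;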

&lt;h3&gt;
  
  
  Structured Knowledge, Not Just Text
&lt;/h3&gt;

&lt;p&gt;Agents passing markdown back and forth is fine for simple handoffs. But real collaboration requires structured understanding. When your legal agent flags a compliance risk in a contract, your project management agent needs to understand not just "there's a risk" but what entity is affected, what the severity is, how it relates to the project timeline, and what precedents exist.&lt;/p&gt;

&lt;p&gt;This points toward a knowledge graph — a structured ontology that agents can read, write, and reason over collectively. Not a replacement for natural language, but a complement: the machine-readable substrate that enables precise coordination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Collaboration Spaces
&lt;/h3&gt;

&lt;p&gt;Agents need the equivalent of project channels — bounded contexts where a subset of agents work together on a specific goal, with shared state, defined roles, and clear boundaries.&lt;/p&gt;

&lt;p&gt;Think of it as the difference between shouting across an open office and having a dedicated war room for a specific initiative. Collaboration spaces give agents focus, privacy boundaries, and shared context scoped to the task at hand.&lt;/p&gt;

&lt;h3&gt;
  
  
  Identity and Trust
&lt;/h3&gt;

&lt;p&gt;For any of this to work, agents need verifiable identity. Not just "this request came from IP 10.0.0.5" but "this is the finance team's budget agent, and it has permission to request spending data from the procurement agent." Identity enables trust, trust enables delegation, delegation enables real collaboration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Inversion
&lt;/h2&gt;

&lt;p&gt;Here's something interesting about how most integrations work today: agents connect outward. Your agent has a plugin for Slack, a plugin for GitHub, a plugin for your CRM. Every new platform means a new integration to build and maintain.&lt;/p&gt;

&lt;p&gt;What if we inverted this? Instead of agents reaching out to platforms, platforms connect inward to the agent network through standardized gateway adapters. The agent network becomes the center, and platforms are peripherals.&lt;/p&gt;

&lt;p&gt;This is a subtle but important distinction. In the current model, the agent is a client of every service it uses. In the inverted model, the agent network is the backbone, and services plug into it. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding a new platform doesn't require changing every agent&lt;/li&gt;
&lt;li&gt;Agents communicate through the network regardless of which platforms they're connected to&lt;/li&gt;
&lt;li&gt;The protocol, not the platform, defines how information flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's the difference between a star topology, where each agent sits at the center with platforms as its spokes, and a mesh topology, where agents form the network and platforms are access points into it.&lt;/p&gt;
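
&lt;p&gt;One way to picture the inversion is the sketch below, where platforms plug adapters into the network and agents only ever publish to the network. &lt;code&gt;AgentNetwork&lt;/code&gt;, &lt;code&gt;GatewayAdapter&lt;/code&gt;, and &lt;code&gt;RecordingAdapter&lt;/code&gt; are hypothetical names made up for this illustration, and the recording adapter stands in for a real platform connection.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    sender: str
    topic: str
    body: str


class GatewayAdapter:
    """Hypothetical interface: one adapter per platform, written once."""

    name = "base"

    def deliver(self, message):
        raise NotImplementedError


class RecordingAdapter(GatewayAdapter):
    """Stand-in for a real platform; it only records what it receives."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def deliver(self, message):
        self.received.append(message)


@dataclass
class AgentNetwork:
    adapters: dict = field(default_factory=dict)

    def plug_in(self, adapter):
        # Platforms connect inward; no agent code changes when one is added.
        self.adapters[adapter.name] = adapter

    def publish(self, message):
        # Agents address the network; the network fans out to every platform.
        for adapter in self.adapters.values():
            adapter.deliver(message)
```

&lt;p&gt;Plugging in a second adapter later requires no change to the code that publishes messages, which is the property the first bullet above describes.&lt;/p&gt;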

&lt;h2&gt;
  
  
  The Three-Layer Context Problem
&lt;/h2&gt;

&lt;p&gt;There's a representation challenge hiding in all of this. Different consumers need context in different formats:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI agents&lt;/strong&gt; work best with structured text — markdown, clear hierarchies, explicit metadata. They need context that's easy to parse, reason over, and transform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Humans&lt;/strong&gt; need visual affordances — canvases, boards, timelines, diagrams. They need context presented in ways that leverage spatial reasoning and pattern recognition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Machine collaboration&lt;/strong&gt; needs formal structure — knowledge graphs, typed relationships, queryable ontologies. Machines collaborating at scale need precision that natural language can't provide.&lt;/p&gt;

&lt;p&gt;A real agent collaboration layer needs to support all three simultaneously. The same underlying context, expressed in three complementary forms: markdown for AI consumption, visual canvas for human oversight, and knowledge graph for machine-to-machine precision.&lt;/p&gt;

&lt;p&gt;This isn't just a nice-to-have. Without the human-readable layer, organizations can't audit or steer agent collaboration. Without the machine-readable layer, agents can't coordinate with the precision that complex tasks demand. Without the AI-friendly layer, the LLMs powering these agents can't efficiently process shared context.&lt;/p&gt;
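
&lt;p&gt;To show what "the same underlying context in three forms" might look like, here is a small sketch that stores context as subject, relation, object facts and projects them three ways. The &lt;code&gt;Fact&lt;/code&gt; type, the function names, and the output shapes are assumptions chosen for illustration, not a description of any specific system.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str


def to_markdown(title, facts):
    """Structured text for an LLM-backed agent to read."""
    lines = ["# " + title]
    for f in facts:
        lines.append("- {} {} {}".format(f.subject, f.relation.replace("_", " "), f.obj))
    return "\n".join(lines)


def to_graph(facts):
    """Typed edges for machine queries: subject mapped to (relation, object) pairs."""
    graph = {}
    for f in facts:
        graph.setdefault(f.subject, []).append((f.relation, f.obj))
    return graph


def to_canvas(facts):
    """Nodes and labeled connectors that a visual board could draw for humans."""
    nodes = sorted({f.subject for f in facts} | {f.obj for f in facts})
    edges = [{"from": f.subject, "to": f.obj, "label": f.relation} for f in facts]
    return {"nodes": nodes, "edges": edges}
```

&lt;p&gt;Because all three views derive from the same list of facts, updating a fact once keeps the markdown, the graph, and the canvas consistent with each other.&lt;/p&gt;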

&lt;h2&gt;
  
  
  Who Benefits?
&lt;/h2&gt;

&lt;p&gt;Consider a GUI automation agent — something like Mano-P, which runs locally and interacts with desktop applications on behalf of users. Today, it operates in isolation. It can click buttons and fill forms, but it can't ask a research agent for context before filling out a report, or notify a project management agent when it completes a task sequence.&lt;/p&gt;

&lt;p&gt;Give it a social layer, and suddenly it becomes part of a team. It can request information from knowledge agents before acting, report outcomes to coordination agents after acting, and receive updated instructions when organizational priorities shift. The isolated tool becomes a collaborative participant.&lt;/p&gt;

&lt;p&gt;This pattern applies everywhere agents operate: coding agents that could delegate testing to QA agents, customer service agents that could escalate to specialized domain agents, data analysis agents that could request additional collection from scraping agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Open Question
&lt;/h2&gt;

&lt;p&gt;We've been working on this problem at Mininglamp. Our approach — which we call Octo (Open ConText Orchestration) — is an attempt at building this social and communication layer for AI agents. It's fully open-source with an optional SaaS mode, because we believe this infrastructure needs to be a shared standard, not a proprietary moat.&lt;/p&gt;

&lt;p&gt;The core insight driving Octo is that agent collaboration is fundamentally a social problem, not just a technical one. The protocols for how agents discover each other, establish trust, share context, and coordinate action need to be as thoughtfully designed as the protocols that let computers exchange packets.&lt;/p&gt;

&lt;p&gt;We're early. The whole industry is early. But the gap between "agents that can do things" and "agents that can work together" is becoming the bottleneck. Individual agent capability is improving fast. The ability to compose those capabilities into collaborative workflows is barely off the ground.&lt;/p&gt;

&lt;p&gt;If you're thinking about this problem — or running into it — the project is at &lt;a href="https://github.com/Mininglamp-OSS" rel="noopener noreferrer"&gt;github.com/Mininglamp-OSS&lt;/a&gt;. We'd rather build this with the community than in isolation. Building a collaboration layer alone would be ironic, given the whole point.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>collaboration</category>
    </item>
  </channel>
</rss>
