<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohamed Heni</title>
    <description>The latest articles on DEV Community by Mohamed Heni (@henimohamed).</description>
    <link>https://dev.to/henimohamed</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3944975%2F65533033-4b94-4c8c-93db-1bc70dad858c.png</url>
      <title>DEV Community: Mohamed Heni</title>
      <link>https://dev.to/henimohamed</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/henimohamed"/>
    <language>en</language>
    <item>
      <title>I Built a Desktop AI Assistant That Controls Your Computer — Here's How</title>
      <dc:creator>Mohamed Heni</dc:creator>
      <pubDate>Mon, 25 May 2026 09:30:57 +0000</pubDate>
      <link>https://dev.to/henimohamed/i-built-a-desktop-ai-assistant-that-controls-your-computer-heres-how-58l1</link>
      <guid>https://dev.to/henimohamed/i-built-a-desktop-ai-assistant-that-controls-your-computer-heres-how-58l1</guid>
      <description>&lt;p&gt;TL;DR: I built Yaldabaoth — a desktop AI assistant that doesn't just answer questions. It reads your screen, runs PowerShell commands, clicks buttons, types text, and automates your entire workflow. No cloud dependency. No API calls for automation. Just Python, Rust, and raw OS control.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;In 2025, "AI assistants" meant one of two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chatbots&lt;/strong&gt; — A text box where you type and an LLM talks back&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API wrappers&lt;/strong&gt; — Tools that chain API calls together but have zero ability to touch the actual operating system&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Neither of these actually helps you &lt;em&gt;do&lt;/em&gt; things on your computer. Want to open an application, take a screenshot, parse a PDF, run a PowerShell script, and compile a report? Good luck chaining that through a chat interface.&lt;/p&gt;

&lt;p&gt;I wanted something different. An assistant that sits on your desktop, sees what you see, and acts on your behalf — like having an engineer sitting next to you who can operate any part of the system.&lt;/p&gt;

&lt;p&gt;So I built Yaldabaoth.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Yaldabaoth is a &lt;strong&gt;four-layer system&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: The Shell — Tauri + React
&lt;/h3&gt;

&lt;p&gt;Instead of Electron (which would have added 150+ MB to the binary), I used &lt;strong&gt;Tauri&lt;/strong&gt; — a Rust-based framework that wraps a webview frontend in a native shell. The UI is React with a glassmorphism design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Tauri:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Binary size: ~5 MB vs Electron's ~150 MB&lt;/li&gt;
&lt;li&gt;Native performance for intensive operations&lt;/li&gt;
&lt;li&gt;Direct Rust system access when needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 2: The Orchestrator — Python Backend
&lt;/h3&gt;

&lt;p&gt;The Rust shell communicates with a Python backend that handles all the heavy lifting. The two sides exchange JSON messages over stdin/stdout, which keeps the link lightweight and avoids running an HTTP server.&lt;/p&gt;

&lt;p&gt;The orchestrator manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Command routing (voice/text → appropriate handler)&lt;/li&gt;
&lt;li&gt;State persistence between commands&lt;/li&gt;
&lt;li&gt;Multi-step task chaining&lt;/li&gt;
&lt;li&gt;Personality profile switching (professional vs. creative modes)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 3: The Automation Engine — Win32 API + PowerShell
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens. The Python backend has direct access to the Windows OS through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pywinauto&lt;/strong&gt; — Native Win32 API control for clicking, typing, window management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PowerShell subprocess&lt;/strong&gt; — OS-level commands (service control, registry edits, file operations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WMI&lt;/strong&gt; — System information queries (processes, hardware, network)&lt;/li&gt;
&lt;/ul&gt;
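
&lt;p&gt;As a rough illustration of the PowerShell path, the sketch below wraps a command in a non-interactive &lt;code&gt;powershell&lt;/code&gt; call through Python's &lt;code&gt;subprocess&lt;/code&gt; module. The helper names &lt;code&gt;powershell_argv&lt;/code&gt; and &lt;code&gt;run_powershell&lt;/code&gt;, the timeout, and the example service name are assumptions made for this example, not code taken from Yaldabaoth.&lt;/p&gt;

```python
import subprocess

def powershell_argv(command):
    """Build the argument list for a non-interactive PowerShell call."""
    return ["powershell", "-NoProfile", "-NonInteractive", "-Command", command]

def run_powershell(command, timeout=30):
    """Run a command and return (exit code, stdout). Windows only."""
    done = subprocess.run(
        powershell_argv(command),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return done.returncode, done.stdout.strip()

# Building the argument list works on any OS; running it needs Windows.
argv = powershell_argv("Get-Service -Name Spooler")
```

&lt;p&gt;Passing the command as a list rather than a shell string sidesteps one layer of quoting, and the timeout keeps a hung command from stalling the executor thread.&lt;/p&gt;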

&lt;h3&gt;
  
  
  Layer 4: The Perception System — OCR + Screen Parsing
&lt;/h3&gt;

&lt;p&gt;Screen parsing runs in a separate thread to keep the UI responsive. It uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OCR-based text extraction&lt;/strong&gt; — Screenshots → text → action decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-threaded processing&lt;/strong&gt; — One thread for screen capture + OCR, another for command execution, a third for UI responsiveness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chained automation&lt;/strong&gt; — Click → wait for UI update → re-scan screen → next action&lt;/li&gt;
&lt;/ul&gt;
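
&lt;p&gt;The chained automation step (click, wait, re-scan, next action) can be sketched as a small polling loop. This is only an illustration: &lt;code&gt;run_chain&lt;/code&gt;, &lt;code&gt;read_screen&lt;/code&gt;, and &lt;code&gt;click&lt;/code&gt; are hypothetical names, and the fake screen below stands in for real screenshots and OCR.&lt;/p&gt;

```python
import time

def run_chain(steps, read_screen, click, attempts=20, poll=0.05):
    """Click each target, then re-scan until the expected text appears."""
    for target, expected in steps:
        click(target)
        for _ in range(attempts):
            if expected in read_screen():
                break              # UI caught up, move to the next step
            time.sleep(poll)       # give the UI time to update
        else:
            raise TimeoutError(f"{expected!r} never appeared after clicking {target!r}")

# Fake screen for demonstration: clicking a target reveals the next view.
views = {"File": "Open Save Exit", "Save": "Saved successfully"}
screen = {"text": "File Edit View"}
clicked = []

def fake_click(target):
    clicked.append(target)
    screen["text"] = views[target]

run_chain([("File", "Save"), ("Save", "Saved")], lambda: screen["text"], fake_click)
```

&lt;p&gt;Bounding the number of re-scans means a step that never produces the expected text fails loudly instead of looping forever.&lt;/p&gt;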

&lt;h2&gt;
  
  
  The Hard Parts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Threading Nightmares
&lt;/h3&gt;

&lt;p&gt;Screen parsing is slow. OCR a screenshot, parse the text, decide what to do — you're looking at 500ms to 2 seconds per cycle. If you do this on the main thread, your entire app freezes.&lt;/p&gt;

&lt;p&gt;The solution was a &lt;strong&gt;producer-consumer architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thread 1&lt;/strong&gt;: Screen capture → OCR → queue the parsed text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread 2&lt;/strong&gt;: Command executor — reads from queue, takes action&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread 3&lt;/strong&gt;: Main UI thread — stays responsive
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────┐    ┌──────────┐    ┌───────────┐
│ Screen  │───&amp;gt;│ Queue    │───&amp;gt;│ Command   │
│ Capture │    │ (JSON)   │    │ Executor  │
└─────────┘    └──────────┘    └───────────┘
      │                              │
      v                              v
 ┌─────────┐                   ┌───────────┐
 │ OCR     │                   │ Win32 API │
 │ Parser  │                   │/PowerShell│
 └─────────┘                   └───────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python's &lt;code&gt;queue.Queue&lt;/code&gt; with &lt;code&gt;daemon=True&lt;/code&gt; threads was sufficient, so there was no need for multiprocessing or async in this use case.&lt;/p&gt;
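
&lt;p&gt;A minimal sketch of that producer-consumer shape, using the standard library's &lt;code&gt;queue.Queue&lt;/code&gt; and daemon threads. The function &lt;code&gt;run_pipeline&lt;/code&gt; and its arguments are hypothetical: the list of strings stands in for OCR output, and &lt;code&gt;handle&lt;/code&gt; stands in for the command executor.&lt;/p&gt;

```python
import queue
import threading

def run_pipeline(frames, handle):
    """Push parsed frames through a queue to a consumer thread and collect results."""
    work = queue.Queue()
    results = []
    done = object()  # sentinel that tells the consumer to stop

    def producer():
        # Stand-in for screen capture + OCR: each frame is already parsed text.
        for text in frames:
            work.put({"text": text})
        work.put(done)

    def consumer():
        # Stand-in for the command executor: take items until the sentinel arrives.
        while True:
            item = work.get()
            if item is done:
                break
            results.append(handle(item["text"]))

    threads = [
        threading.Thread(target=producer, daemon=True),
        threading.Thread(target=consumer, daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

processed = run_pipeline(["Save", "Cancel"], str.upper)
```

&lt;p&gt;With one producer and one consumer the queue preserves order, so results come back in the order the frames were captured. The calling thread only blocks here because the demo joins both workers; a UI thread would simply leave them running.&lt;/p&gt;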

&lt;h3&gt;
  
  
  Voice-First vs. Text-First UX
&lt;/h3&gt;

&lt;p&gt;I wanted a &lt;strong&gt;Push-to-Talk&lt;/strong&gt; interface (F10 key) so you could speak commands naturally. But speech recognition introduces latency and errors. The compromise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voice input preferred for simple commands ("Open Chrome", "Check CPU usage")&lt;/li&gt;
&lt;li&gt;Text fallback for complex multi-step sequences&lt;/li&gt;
&lt;li&gt;The orchestrator normalizes both into the same command pipeline&lt;/li&gt;
&lt;/ul&gt;
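
&lt;p&gt;To illustrate the last point, here is one way both inputs could be reduced to the same command shape. The &lt;code&gt;normalize&lt;/code&gt; function and the &lt;code&gt;verb&lt;/code&gt;/&lt;code&gt;args&lt;/code&gt;/&lt;code&gt;source&lt;/code&gt; fields are assumptions for this sketch, not the project's actual schema.&lt;/p&gt;

```python
def normalize(raw, source):
    """Turn a voice transcript or typed line into one command dict."""
    text = " ".join(raw.strip().split())   # collapse stray whitespace
    text = text.rstrip(".!?")              # transcripts often end with punctuation
    verb, _, rest = text.partition(" ")
    return {"verb": verb.lower(), "args": rest, "source": source}

voice = normalize("  Open Chrome. ", "voice")
typed = normalize("open Chrome", "text")
```

&lt;p&gt;After normalization the two commands differ only in their &lt;code&gt;source&lt;/code&gt; field, so the rest of the pipeline does not need to know how the command arrived.&lt;/p&gt;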

&lt;h3&gt;
  
  
  Rust ↔ Python Bridge
&lt;/h3&gt;

&lt;p&gt;Tauri apps expect Rust backends. Yaldabaoth needs Python. Bridging them without adding a web server was tricky.&lt;/p&gt;

&lt;p&gt;The solution: &lt;strong&gt;stdin/stdout JSON-RPC&lt;/strong&gt;. The Rust shell spawns the Python process and communicates via JSON messages on stdin/stdout. No sockets, no HTTP, no dependency on a running server. The Python process lives as long as the app is open.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
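&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Rust side: minimal example, assuming the backend exchanges one JSON message per line&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;io&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;BufRead&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BufReader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;process&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Stdio&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"backend/main.py"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.stdin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Stdio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;piped&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="nf"&gt;.stdout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Stdio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;piped&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="nf"&gt;.spawn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Send command, newline-terminated so the backend can read it line by line&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;r#"{"action": "click", "target": "Chrome"}"#&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;stdin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="py"&gt;.stdin&lt;/span&gt;&lt;span class="nf"&gt;.take&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="nf"&gt;.write_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="nf"&gt;.as_bytes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="nf"&gt;.write_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;b"\n"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Read one response line; read_to_string would block until the Python process exits&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;BufReader&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="py"&gt;.stdout&lt;/span&gt;&lt;span class="nf"&gt;.take&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="nf"&gt;.read_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;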
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
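
&lt;p&gt;The Python end of such a bridge could look like the sketch below. It assumes one JSON object per line in each direction, which is a framing choice made for this example; the &lt;code&gt;handle&lt;/code&gt; dispatcher and the reply fields are hypothetical and only echo the action back.&lt;/p&gt;

```python
import io
import json
import sys

def handle(message):
    """Hypothetical dispatcher: echo the action back with a status."""
    return {"status": "ok", "action": message.get("action")}

def serve(reader=sys.stdin, writer=sys.stdout):
    """Read one JSON message per line and answer with one JSON line each."""
    for line in reader:
        line = line.strip()
        if not line:
            continue
        try:
            reply = handle(json.loads(line))
        except json.JSONDecodeError as exc:
            reply = {"status": "error", "detail": str(exc)}
        writer.write(json.dumps(reply) + "\n")
        writer.flush()  # without a flush the parent may wait on a buffered reply

# Exercise the loop with in-memory streams instead of real pipes.
out = io.StringIO()
serve(io.StringIO('{"action": "click", "target": "Chrome"}\n'), out)
reply = json.loads(out.getvalue())
```

&lt;p&gt;In the real backend the loop would be started with a plain &lt;code&gt;serve()&lt;/code&gt; call so it reads from the pipes the Rust shell opened. Flushing after every reply matters because Python buffers stdout when it is not attached to a terminal.&lt;/p&gt;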



&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Desktop automation is harder than cloud automation.&lt;/strong&gt; Cloud APIs are designed to be called programmatically. Desktop UIs are designed for humans. Parsing a rendered UI and making decisions from it is fundamentally different from calling an API endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Threading early, threading often.&lt;/strong&gt; I rebuilt the threading model three times. The first version was single-threaded and froze constantly. The second was over-engineered with multiprocessing. The third, simple Queue-based threading, was just right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Computer use was already possible in 2025.&lt;/strong&gt; Before "computer use" became a buzzword in 2026, a Python script + OCR + Win32 API was all you needed. The novelty isn't the technology — it's wiring it together with a voice-first, responsive UI that feels like an assistant, not a script.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tech Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Shell&lt;/td&gt;
&lt;td&gt;Tauri (Rust + WebView2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;React&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automation&lt;/td&gt;
&lt;td&gt;pywinauto, WMI, PowerShell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice&lt;/td&gt;
&lt;td&gt;Push-to-Talk (F10)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Screen parsing&lt;/td&gt;
&lt;td&gt;OCR + multi-threaded pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;~5 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-platform support&lt;/strong&gt; — Currently Windows-only due to Win32 API dependency. Linux adaptation via X11/Wayland is on the roadmap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better screen parsing&lt;/strong&gt; — Using vision models directly instead of OCR for richer UI understanding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugin system&lt;/strong&gt; — Let users write custom automation modules.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Repo
&lt;/h2&gt;

&lt;p&gt;The full source code is on GitHub: &lt;a href="https://github.com/HENI-MOHAMED/Yaldabaoth" rel="noopener noreferrer"&gt;github.com/HENI-MOHAMED/Yaldabaoth&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Tauri, React, Python, Rust, and more coffee than I'd like to admit.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>opensource</category>
      <category>tauri</category>
    </item>
    <item>
      <title>Building a Multi-Agent Tax Audit System with LangGraph and Odoo</title>
      <dc:creator>Mohamed Heni</dc:creator>
      <pubDate>Fri, 22 May 2026 00:34:31 +0000</pubDate>
      <link>https://dev.to/henimohamed/building-a-multi-agent-tax-audit-system-with-langgraph-and-odoo-2a06</link>
      <guid>https://dev.to/henimohamed/building-a-multi-agent-tax-audit-system-with-langgraph-and-odoo-2a06</guid>
      <description>&lt;p&gt;TL;DR: I built a multi-agent system that audits invoices, detects fiscal inconsistencies, and generates compliance reports — integrated with Odoo ERP and standalone databases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Tax auditing for small and medium businesses in Tunisia is manual, error-prone, and slow. Most accounting firms rely on Excel and manual checks. Regulations change frequently, and keeping up is a full-time job.&lt;/p&gt;

&lt;p&gt;I wanted to build something that could ingest financial data, run audit rules automatically, and produce compliance reports — without needing a team of accountants.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The system uses a LangGraph-based multi-agent architecture:&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent 1: Data Ingestion Agent
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Connects to Odoo ERP or standalone SQL databases&lt;/li&gt;
&lt;li&gt;Extracts invoices, ledgers, and financial statements&lt;/li&gt;
&lt;li&gt;Normalizes data into a unified schema&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Agent 2: Audit Agent
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Applies tax rules against the data&lt;/li&gt;
&lt;li&gt;Detects anomalies: missing invoices, misclassified expenses, VAT discrepancies&lt;/li&gt;
&lt;li&gt;Flags items for human review&lt;/li&gt;
&lt;/ul&gt;
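
&lt;p&gt;As an illustration of one such rule, the sketch below flags invoices whose declared VAT does not match the net amount times the rate. The field names, the three-decimal rounding, and the default rate of 0.19 are assumptions for this example; the real rules engine and its rates are not shown in this post.&lt;/p&gt;

```python
from decimal import Decimal

def vat_mismatches(invoices, rate=Decimal("0.19")):
    """Return ids of invoices whose declared VAT differs from net * rate."""
    flagged = []
    for inv in invoices:
        # Round to three decimals before comparing (assumed currency precision).
        expected = (inv["net"] * rate).quantize(Decimal("0.001"))
        if expected != inv["vat"]:
            flagged.append(inv["id"])
    return flagged

invoices = [
    {"id": "INV-001", "net": Decimal("100.000"), "vat": Decimal("19.000")},
    {"id": "INV-002", "net": Decimal("250.000"), "vat": Decimal("45.000")},
]
flagged = vat_mismatches(invoices)
```

&lt;p&gt;Using &lt;code&gt;Decimal&lt;/code&gt; instead of floats avoids false positives from binary rounding, which matters when every flagged item costs a human reviewer time.&lt;/p&gt;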

&lt;h3&gt;
  
  
  Agent 3: Reporting Agent
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Generates compliance reports&lt;/li&gt;
&lt;li&gt;Produces a summary of findings with risk levels&lt;/li&gt;
&lt;li&gt;Suggests corrective actions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Orchestrator
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;LangGraph manages the flow between agents&lt;/li&gt;
&lt;li&gt;Handles state, retries, and error recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technical Challenges
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Schema Mismatch:&lt;/strong&gt; Every Odoo instance is customized differently. The ingestion agent had to handle dynamic schemas — detecting table structures at runtime and mapping them to a canonical audit model.&lt;/p&gt;
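
&lt;p&gt;One simple way to approach that mapping is an alias table checked against whatever columns are found at runtime. The sketch below is an assumption-laden illustration: the canonical field names, the alias lists, and both helper functions are hypothetical rather than taken from the project.&lt;/p&gt;

```python
# Hypothetical canonical fields, each with column names that may appear in a source.
CANONICAL_ALIASES = {
    "invoice_id": ("name", "number", "invoice_no"),
    "net_amount": ("amount_untaxed", "net", "subtotal"),
    "tax_amount": ("amount_tax", "vat", "tax"),
}

def build_mapping(columns):
    """Map each canonical field to the first matching column found at runtime."""
    available = set(columns)
    mapping, missing = {}, []
    for field, aliases in CANONICAL_ALIASES.items():
        match = next((a for a in aliases if a in available), None)
        if match is None:
            missing.append(field)
        else:
            mapping[field] = match
    return mapping, missing

def normalize_row(row, mapping):
    """Rename one source row into the canonical audit schema."""
    return {field: row[column] for field, column in mapping.items()}

mapping, missing = build_mapping(["name", "amount_untaxed", "x_custom_note"])
row = normalize_row(
    {"name": "INV/2026/0001", "amount_untaxed": 100.0, "x_custom_note": "n/a"},
    mapping,
)
```

&lt;p&gt;Returning the unmatched canonical fields alongside the mapping lets the ingestion step report exactly which columns a customized instance is missing instead of failing later in the audit.&lt;/p&gt;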

&lt;p&gt;&lt;strong&gt;Multi-Agent Coordination:&lt;/strong&gt; Getting three agents to work together without stepping on each other's state was the hardest part. LangGraph's checkpointing was essential here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regional Tax Rules:&lt;/strong&gt; Tunisian tax law isn't well documented in English. Building the rules engine meant working directly with Arabic and French regulatory texts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Real-time invoice validation&lt;/li&gt;
&lt;li&gt;Multi-company support&lt;/li&gt;
&lt;li&gt;A dashboard for non-technical accountants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repo is at &lt;a href="https://github.com/HENI-MOHAMED/Audit-Agent" rel="noopener noreferrer"&gt;github.com/HENI-MOHAMED/Audit-Agent&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Python, FastAPI, LangGraph, Odoo, and a lot of coffee.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>langgraph</category>
      <category>odoo</category>
    </item>
  </channel>
</rss>
