
Quentin Merle

🚀 Local AI in 2026 (Part 2): Sovereignty, Artisanal RAG, and the Rise of Agents

Article Series:
👉 Part 1: My Journey Through the Desert (From Terminal to GPU)
👉 Part 2: Sovereignty, Artisanal RAG, and the Rise of Agents (You are here)
👉 Part 3: Vibrisse Agent, Anatomy of a Custom Cockpit (Coming Soon)


Disclaimer & Context: Just like in the first installment, this article is based on my daily use of a MacBook Pro M1 Pro (32 GB RAM) and VS Code. The goal here is to explore the technical and methodological transition from using a simple conversational model to a truly sovereign agentic ecosystem.


In my previous article, I shared my hardware reconciliation with local AI thanks to recent optimizations and quantization. But once the engine is running locally, what exactly do we do with it? Do we just chat?

At first, we all go through the "naive" approach: we install Ollama or LM Studio, download a model, and use it raw in a terminal or a classic chat interface. It's fascinating for the first few hours, but you quickly hit a glass ceiling. A raw LLM remains a passive oracle: it answers isolated questions, but it has no persistent memory, no initiative, and no levers of action on your work environment.

Then, after much research and documentation, I had an epiphany. Beyond pure performance, it is first and foremost a question of Digital Sovereignty. Between telemetry scandals and private repositories that risk discreetly feeding model training in the Cloud, I wanted to build my own development "brain": entirely secure, without ever handing over the keys to my Mac to a remote entity.

This is exactly when I started to dissect the mechanics of Agents.


1. From Assistant to Sidekick: Discovering Hermes Agent

My thinking first matured by observing from afar the growing buzz around autonomous tools like OpenClaw. The idea of an assistant capable of acting on my system appealed to me, but I remained legitimately wary of granting total access to my terminal and my intellectual property to the ecosystem of a Cloud giant.

However, as I documented my workflows, an obvious truth emerged: piloting an LLM via an agent quickly becomes indispensable for automating complex tasks.

Searching for an open-source, privacy-respecting alternative, I came across Hermes Agent, designed by the excellent team at Nous Research. The promise? An agentic architecture optimized for Tool Use. Unlike a simple Chat that just predicts the next word, an agent provides the model with a reasoning loop allowing it to define a strategy and break down its objectives.

*Hermes Agent*

To power this setup locally, I bet on the current must-have combo: Gemma 4. Highly recommended by Nous Research for running Hermes, this model shines with its scrupulous respect for complex instructions and its precision on structured output formats.


2. Cognitive Hierarchy: Managing 32 GB of RAM Without Exploding

The classic mistake when starting with local AI? Wanting a single giant model to do everything. As mentioned in the conclusion of my first article, loading a heavy model continuously alongside macOS, VS Code, and Chrome leads straight to unified memory saturation and intensive SSD swapping.

So, I implemented a strict cognitive hierarchy by separating intellect from execution to preserve the responsiveness of my M1 Pro:

  • Morning (Deep Work): Gemma 4 26B. This is my "Chief Technology Officer" (CTO). It takes up about 20 GB of RAM, and I only invoke it for sessions dedicated to pure reflection. It excels at high-density tasks: deep architectural audits, design reviews, and complex planning.
  • Throughout the Day (Sidekick): Gemma 4 e4b. A light, snappy, all-terrain version that stays in the background for ancillary operations: writing documentation, generating unit tests, or formatting Obsidian notes. It accompanies me constantly without slowing down my IDE or making the machine run hot.
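
In practice, switching between these two tiers is mostly a question of which model you invoke and how long Ollama keeps it resident. Here is a minimal sketch of the pattern using the `ollama` Python client; the model tags `gemma4:26b` and `gemma4:e4b` are placeholders for whatever your local registry actually exposes, and the `keep_alive` values reflect my intent rather than a prescribed configuration:

```python
# Minimal sketch of the "cognitive hierarchy", assuming the ollama
# Python client (pip install ollama) and hypothetical local model tags.
import ollama

SIDEKICK = "gemma4:e4b"  # light model, kept resident all day
CTO = "gemma4:26b"       # heavy model, loaded only for deep-work sessions

def ask_sidekick(prompt: str) -> str:
    # keep_alive=-1 asks Ollama to keep the small model in memory indefinitely
    resp = ollama.chat(
        model=SIDEKICK,
        messages=[{"role": "user", "content": prompt}],
        keep_alive=-1,
    )
    return resp["message"]["content"]

def ask_cto(prompt: str) -> str:
    # keep_alive=0 unloads the 26B model right after it answers,
    # freeing ~20 GB of unified memory for the IDE and the browser
    resp = ollama.chat(
        model=CTO,
        messages=[{"role": "user", "content": prompt}],
        keep_alive=0,
    )
    return resp["message"]["content"]
```

The point is less the code than the discipline: the heavy model is a guest that checks out immediately, while the sidekick has a permanent room.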

3. The Sinews of War: RAG (and Why Mine is Artisanal)

Having a competent local agent is a great foundation, but without fresh context, an LLM inevitably ends up hallucinating variable names or obsolete API signatures. This is where RAG (Retrieval-Augmented Generation) comes in.

However, "turnkey" RAG solutions on the market often behave like black boxes. Whether they are too-opaque abstraction chains (like in LangChain) or No-code tools where you lose control over text slicing, these solutions often blindly vectorize your codebase. The result: you end up diluting the model's attention with irrelevant technical noise.

So, I opted for Artisanal RAG (Hand-crafted Context). My methodology is surgical:

  1. I ask my Sidekick to scan a project's dependencies to generate an initial raw identity sheet (CONTEXT.md).
  2. I then manually refine this file to encode my "business truths," architectural conventions, and design choices:
```markdown
# ID: Vibrisse Studio
# TYPE: Static / Immersive
# STACK: React 19, Vite, Three.js (R3F), GSAP, Tailwind CSS 3, Sass
# PERF_SCORE: High

## TECHNICAL CONTEXT
Immersive showcase site using a modern stack focused on visual experience.
3D rendering is handled by Three.js via React Three Fiber.
Animations and sequencing are orchestrated by GSAP.

## WARNING (CRITICAL)
- Complex R3F + GSAP mix: fine synchronization of life cycles required.
- React 19: monitor stability of Three.js hooks.
- Potential Tailwind / Sass conflicts on selector specificity.
```

By feeding the 26B model's system prompt with these ultra-dense sheets, the result is clear: the AI no longer guesses, it knows. I understood the paramount importance of useful token density. My agent now knows my stacks and my dev habits, which allows for automating targeted monitoring, watching for critical version updates, or initializing new projects by directly applying my preferred patterns.
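
Plumbing-wise, this artisanal approach stays deliberately boring. Here is a hedged sketch of the injection step, again assuming the `ollama` Python client and a hypothetical model tag; the entire "retrieval" phase is a plain file read:

```python
# Sketch: inject the hand-crafted CONTEXT.md as the system prompt.
# Assumes the ollama Python client and a hypothetical model tag.
from pathlib import Path
import ollama

def load_context(project_root: str) -> str:
    # The artisanal "retrieval" step: no vector store, no chunking,
    # just curated, high-density tokens maintained by hand.
    return Path(project_root, "CONTEXT.md").read_text(encoding="utf-8")

def ask_with_context(project_root: str, question: str) -> str:
    resp = ollama.chat(
        model="gemma4:26b",  # hypothetical tag for the heavy model
        messages=[
            {"role": "system", "content": load_context(project_root)},
            {"role": "user", "content": question},
        ],
    )
    return resp["message"]["content"]
```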

💡 Monitoring Note: It is this same philosophy of developer context purity and portability that lies at the heart of very inspiring initiatives like Context 7.


4. What is an "Agent" Exactly? (Tools & Reasoning)

Experimenting with Hermes, I grasped the fundamental difference between Knowledge (encoded in the LLM's weights) and Orchestration (managed by the agent that dispatches actions). Two major concepts transform the model into an autonomous actor:

  • Tool Use: The agent can decide to format its response to trigger a real function (read a file, search the web, execute a bash command). It's the move from words to deeds.
  • CoT (Chain of Thought): The agent "thinks out loud" by breaking down its reasoning according to the Observation > Thought > Action cycle. It is absolutely fascinating to see your local AI write in its console: "Observation: I lack information on this bug. Thought: I must check the initialization scripts. Action: call the read tool on the package.json file."

💡 Pro Tip (Impact of Hyperparameters): For an agent to function reliably, you must restrict the LLM's creativity. Set the temperature as low as possible (0.0 or 0.1). An agent needs absolute determinism to issue tool calls in syntactically perfect JSON or XML; otherwise it risks crashing the parser.
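
To make the Observation > Thought > Action cycle and that temperature advice concrete, here is a stripped-down sketch of the dispatch loop an agent runs around the model. To be clear: the JSON protocol, the single read_file tool, and the run_agent helper are my own illustrative assumptions, not Hermes Agent's actual internals.

```python
# Illustrative tool-use loop (not Hermes Agent's real implementation).
# The model is instructed to reply with JSON like:
#   {"thought": "...", "tool": "read_file", "args": {"path": "package.json"}}
# or {"thought": "...", "final_answer": "..."} when it is done.
import json
from pathlib import Path
import ollama

TOOLS = {
    "read_file": lambda path: Path(path).read_text(encoding="utf-8"),
}

SYSTEM = (
    "You are an agent. Reply ONLY with JSON. Either "
    '{"thought": str, "tool": str, "args": dict} to call a tool, or '
    '{"thought": str, "final_answer": str} when finished. '
    f"Available tools: {list(TOOLS)}"
)

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        resp = ollama.chat(
            model="gemma4:e4b",            # hypothetical tag
            messages=messages,
            options={"temperature": 0.0},  # determinism so json.loads never chokes
        )
        step = json.loads(resp["message"]["content"])
        if "final_answer" in step:
            return step["final_answer"]
        # Action: execute the requested tool, then feed the Observation back
        observation = TOOLS[step["tool"]](**step["args"])
        messages.append({"role": "assistant",
                         "content": resp["message"]["content"]})
        messages.append({"role": "user",
                         "content": f"Observation: {observation}"})
    return "Max steps reached without a final answer."
```

A real agent adds guardrails (schema validation, retries, sandboxed tool execution), but the skeleton is genuinely this small.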


5. Hybrid Workflow: Research > Plan > Implement

Inspired by methodologies from ecosystem figures like Mckay Wrigley, I restructured my development cycle around a three-stage hybrid flow:

  • Research & Plan (Local & Private): Intelligence and absolute confidentiality. This is where I use my local models to design the architecture and refine my strategy. My ideas and intellectual property remain strictly confined to my SSD.
  • Implement (Cloud): Once the action plan is validated and rigorously structured locally, I delegate mass code generation to Cloud APIs. It's a powerful compromise: I save my machine's resources and consume my paid tokens purely for utility.

5 bis. Reality Check: Local Agent vs. Cloud AI (Claude, Gemini, and Co.)

Let's be totally transparent: if you are used to working daily with cutting-edge ecosystems like Claude Sonnet or Gemini running inside an advanced agentic environment (like Antigravity), returning to a 4B or 26B local model requires adjusting expectations.

The line is very clear:

  • Depth & Massive Multitasking (The Cloud Advantage): Solutions like Antigravity or Claude Code behave like omniscient Senior Architects. They excel at massive multi-file refactoring, implicit reading of your vaguest intentions, and pure production velocity. Their giant context window absorbs entire architectures without flinching. To give you an idea (as illustrated in an excellent IBM Technology video), their immediate memory can hold the entirety of the three Lord of the Rings books plus The Hobbit, with room still left for your code: a technical gap a consumer local machine cannot bridge.
  • Automated Context Ingestion (How the Cloud Reads Our System): A Cloud agent's illusion of "magic" rests on its active exploration mechanisms. When given a task, it dynamically queries our local workspace via surgical investigation tools (Grep search, directory listing, targeted AST or file reading). It instantly maps dependencies and autonomously injects relevant blocks into its context window (often several million tokens). It is this capacity to vacuum and synthesize an entire workspace in a fraction of a second that grants its omniscience, but it implies opening the floodgates and authorizing the sending of these local snapshots to a remote API.
  • Sovereignty & Business Precision (The Local Advantage): Faced with this data vacuuming, the local agent is your Bodyguard. It shines with its absolute intimacy with your patterns via artisanal RAG. You own 100% of the data. Where the Cloud charges for every token read and ingests your prompts on third-party servers, the local agent iterates in a closed loop, without billing friction, to validate and protect the intimate logic of your intellectual property.

It is precisely this complementarity that validates the hybrid workflow: we don't ask a local agent to rewrite 50 files at once (the Cloud does it infinitely better and faster). We ask it to guarantee our code's alignment, security, and identity before delegating mass execution.


6. Prompt Engineering: The Art of Surgical Precision

Piloting a local agent requires abandoning vague or implicit prompts. Public Cloud models are trained to smooth over your approximations and guess your intentions. A local agent that must choose the right tool autonomously is unforgiving of that kind of artistic vagueness.

You must become a true prompt craftsman again: concise, explicit, and highly structured. More surgical precision in your prompt means more reliability for your agent.

But make no mistake: this rigor pays off just as much on the Cloud. While giant models (Claude, GPT-4, Gemini) handle "noise" better, a surgically precise prompt is the key to the Zero-Iteration result. Instead of iterating four times to fix a syntax error or an oversight, a perfectly architected prompt delivers a correct result on the first attempt. This is where you move from a chat user to a true command engineer: you no longer just talk; you pilot an intention.

```markdown
# ROLE
You are a Senior Creative Developer specialized in React 19 and WebGL (R3F).

# OBJECTIVE
Generate a reusable React component named `FluidPortal.jsx` that displays an animated 3D sphere serving as a visual transition element.

# TECHNICAL STACK
- React 19 (Standard Hooks)
- @react-three/fiber + @react-three/drei
- GSAP 3.12 (for state transitions)
- Tailwind CSS (for container styling)

# DESIGN CONSTRAINTS
1. The sphere must use a `MeshDistortMaterial` with a deep purple color.
2. On Hover: Increase distortion and wave speed via a smooth GSAP tween (duration: 0.4s).
3. On Click: Trigger a scale animation that fills the entire container before executing an `onAction` callback function.

# CODE REQUIREMENTS
- Use `useFrame` for continuous rotation on the Y-axis.
- Proper cursor handling (`cursor-pointer`) via Three.js events.
- Complete, self-contained code without placeholders.

# OUTPUT FORMAT
Return only the component code with JSDoc comments.
```

Conclusion: The Wall of Friction (and the "Why Not Me?" Syndrome)

This hybrid and sovereign setup is incredible, but it has a daily cost: friction. Maintaining my artisanal RAG manually ends up being slow. The raw Hermes Agent interface frustrates my designer's eye. Finally, mentally switching from one model to another requires constant attention to avoid triggering memory swapping at the worst possible moment.

But above all, as a developer, I have this visceral need to understand how things work under the hood.

Reading about autonomous agents is fine. Using others' solutions is instructive. But technical curiosity finally took over, leading me to ask this somewhat crazy question:

"What if I built my own Agent from scratch? Just to see if I could do it, and especially to understand how the gears really mesh."

What was supposed to be a "crazy test" to dissect LangGraph and vector databases became much more than that. I ended up designing and coding my own custom agentic Cockpit, with a polished graphic interface, to address all my frustrations.

*Vibrisse Agent*

We'll talk more about it in Part 3: the project is called Vibrisse Agent, and I'm going to show you the guts of the beast.


📺 For the curious:
If the internal mechanics of agents fascinate you, I highly recommend the excellent IBM Technology YouTube channel. To see where the future of professional agents is being shaped, also explore IBM BOB and Google's Jules assistant: essential references for learning how to select and orchestrate the most powerful tools within your own workflows.
I also recommend this superb technical analysis video from The Coding Sloth.


Proudly developed in Beauce, Québec 🇨🇦. Interested in local AI sovereignty? Let's connect!
