Exploring local LLMs as personal cognitive extensions
Introduction
Let me start with a distinction that I think matters more than people currently realize.
Cloud AI models (GPT, Claude, Gemini) are trained on the output of billions of people. They represent collective intelligence at scale: optimized to be useful to everyone, shaped by aggregate data and company priorities.
That's genuinely powerful. But "useful to everyone" is a different thing from "shaped by you."
The question this project is exploring: what if a local, fine-tunable model could grow alongside a specific person? Not just remembering preferences on top, but having its actual reasoning patterns, tendencies, and ways of approaching problems gradually shaped by one individual's interactions over time.
The key difference is ownership of growth. Cloud models evolve based on what the company decides. A local model can evolve based on what you actually do and think about.
TL;DR
This project explores the idea of a personal AI agent that evolves with a single user over time.
Current prototype includes:
- Local LLM inference via Ollama (qwen3.5:9b + qwen2.5:7b + bge-m3 embedding)
- Always-on agent runtime on a NAS via OpenClaw (Docker)
- Persistent memory with SQLite + sqlite-vec hybrid search
- Plugin-based architecture (6 custom plugins for logging, safety, memory compression, training data)
- A pipeline that converts conversation history into potential fine-tuning data
- 1,498 training samples generated and reviewed from historical conversations
The long-term goal: explore whether a local model can gradually become a personal cognitive extension, rather than just a stateless AI tool.
GitHub: personal-ai-agent-lab
System Architecture
The system runs across two machines:
┌───────────────────────────┐      ┌────────────────────────────────┐
│  GPU Machine (laptop)     │      │  NAS / Always-on Server        │
│                           │      │                                │
│  Ollama                ───┼─────►│  OpenClaw (Docker)             │
│   - qwen3.5:9b            │      │   - Plugin System              │
│   - qwen2.5:7b            │      │   - Memory (SQLite + vec)      │
│   - bge-m3 (embedding)    │      │   - Training Pipeline          │
└───────────────────────────┘      │                                │
                                   │  Qdrant (Docker)               │
                                   └────────────────────────────────┘
The GPU machine handles inference. The NAS runs continuously as the agent environment, maintaining memory, running plugins, and processing conversation history in the background.
Why split? A personal AI agent that only runs when your laptop is on isn't truly always-on. The NAS acts as a persistent cognitive layer that stays active regardless of what else you're doing.
Note: Qdrant is currently an external database accessed via plugin API. OpenClaw's memory system uses SQLite + sqlite-vec; hybrid search operates on SQLite vectors.
Plugin Architecture
All agent behaviors are implemented as plugins. OpenClaw has two separate hook systems that are easy to confuse and important to get right:
| System | Config location | Supported events |
|---|---|---|
| Internal Hooks | `hooks.internal.load.extraDirs` | `agent:bootstrap`, `gateway:startup`, `command:new` |
| Plugin Hooks | `plugins.load.paths` | `before_tool_call`, `after_tool_call`, `before_prompt_build`, `agent_end` |
⚠️ Tool-call monitoring must use Plugin Hooks. Internal Hooks have no `agent:tool:pre`/`agent:tool:post` events; those event names simply don't exist.

Update: A few days after I hit this, the official docs were updated to clarify the distinction. I submitted a docs PR to add more explicit examples and a common-mistakes section anyway; "works but unclear" is still worth improving in open-source docs.
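For reference, the two configuration entry points above might sit side by side in the gateway config. This is a sketch assuming a JSON config file with exactly the key paths from the table; the directory paths are placeholders:

```json
{
  "hooks": {
    "internal": {
      "load": { "extraDirs": ["/opt/openclaw/hooks"] }
    }
  },
  "plugins": {
    "load": { "paths": ["/opt/openclaw/plugins/tool-logger"] }
  }
}
```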
The six plugins currently deployed:
| Plugin | Hook events | What it does |
|---|---|---|
| `tool-logger` | `before_tool_call`, `after_tool_call` | Logs every tool call to `tool_calls.log` |
| `safe-delete-enforcer` | `before_tool_call` (intercept) + `after_tool_call` (index) | Blocks `rm`, forces `mv` to `trash-pending`, auto-creates a deletion index |
| `qdrant-auto-checker` | `before_prompt_build` | Keyword detection → injects a `curl` Qdrant instruction into the system prompt |
| `task-logger` | `agent_end` | Writes a structured task log to `agent_log.md` after each session |
| `training-sample-generator` | `agent_end` | Scores the conversation and generates a training sample if score ≥ 7 |
| `memory-compressor` | `agent_end` | Triggers context compression when a conversation exceeds 20 turns |
Key lessons from plugin development:
- Must use CommonJS with `module.exports = register`; ESM or `module.exports = { register }` silently fails
- The `register` function must be synchronous; `async function register()` gets ignored
- `openclaw.plugin.json` requires both `id` and `configSchema` fields
- SMB writes are unreliable for config files; use `docker exec openclaw node -e "..."` instead
Memory System
The agent stores long-term memory using SQLite + sqlite-vec with hybrid retrieval:
vector similarity (weight: 0.7) ← bge-m3 embeddings via Ollama
        +
full-text search (weight: 0.3)  ← SQLite FTS5
        ↓
hybrid ranked results
Current state: 441 files, 5,137 chunks indexed.
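As an illustration of the weighting above (not OpenClaw's actual internals), hybrid ranking can be sketched as a weighted merge of the two result lists, assuming both backends return scores already normalized to [0, 1]:

```javascript
// Illustrative hybrid ranking: 0.7 * vector similarity + 0.3 * FTS score.
// A sketch of the idea only; assumes both score lists are normalized.

function hybridRank(vectorHits, ftsHits, wVec = 0.7, wFts = 0.3) {
  const scores = new Map();

  // Accumulate weighted scores; a chunk found by both backends gets both.
  for (const { id, score } of vectorHits) {
    scores.set(id, (scores.get(id) || 0) + wVec * score);
  }
  for (const { id, score } of ftsHits) {
    scores.set(id, (scores.get(id) || 0) + wFts * score);
  }

  // Sort descending by combined score.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

A chunk that matches both the embedding query and the keyword query outranks one that matches only one of them, which is the point of the hybrid scheme.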
Important clarification on Qdrant: Despite appearing in the architecture diagram, Qdrant is not integrated into the memory_search pipeline. It runs as a separate container and is queried manually via plugin prompt injection: a before_prompt_build hook detects keywords like "qdrant" or "vector" and injects a curl instruction into the system prompt. The agent then executes it as a tool call.
This is Prompt Automation, not Memory Integration. True Qdrant integration would require implementing a custom OpenClaw memory driver to replace the SQLite backend, a framework-level change that isn't yet planned.
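The prompt-automation pattern can be sketched as a small pure function behind the `before_prompt_build` hook. The trigger keywords and the Qdrant search endpoint path are from the text above and Qdrant's REST API; the context field names (`userMessage`, `systemPrompt`) are assumptions:

```javascript
// Sketch of the qdrant-auto-checker pattern: keyword detection in the
// user message triggers injecting a curl instruction into the system
// prompt. Field names on `ctx` are assumptions, not OpenClaw's API.

const TRIGGERS = ["qdrant", "vector"];

function maybeInjectQdrantHint(ctx) {
  const text = (ctx.userMessage || "").toLowerCase();
  if (!TRIGGERS.some((kw) => text.includes(kw))) return ctx;

  // Append the instruction rather than replacing the system prompt.
  return {
    ...ctx,
    systemPrompt:
      ctx.systemPrompt +
      "\n\nWhen vector search is needed, query Qdrant directly, e.g.:\n" +
      "curl -X POST http://qdrant:6333/collections/<name>/points/search " +
      "-H 'Content-Type: application/json' -d '{...}'",
  };
}
```

The agent then sees the instruction in its system prompt and executes the `curl` as an ordinary tool call, which is exactly why this is automation rather than integration.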
Conversation โ Training Pipeline
The pipeline converts raw conversation history into potential fine-tuning data:
conversation logs
        ↓
batch_process_conversations.js    # parse + chunk (512-token windows)
        ↓
training-sample-generator plugin  # auto-score: importance + novelty + generalizability
        ↓
agent_review.py                   # LLM auto-review via Ollama API
        ↓
review_samples.js                 # human review interface (y/n/s/q)
        ↓
samples.jsonl / pending_review.jsonl
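The gating step in the middle of this pipeline can be sketched as follows. The three scoring dimensions and the threshold of 7 come from the description above; averaging the three scores and the 0-10 ranges are assumptions about how they combine:

```javascript
// Sketch of training-sample gating: score a conversation on
// importance / novelty / generalizability (each assumed 0-10) and
// keep it as a candidate only if the average clears the threshold.

const THRESHOLD = 7; // from the plugin description: score >= 7

function scoreConversation({ importance, novelty, generalizability }) {
  return (importance + novelty + generalizability) / 3;
}

function toTrainingSample(conv, scores) {
  const score = scoreConversation(scores);
  if (score < THRESHOLD) return null; // discarded before review

  // Fields mirror the sample format shown in the post.
  return {
    instruction: conv.instruction,
    input: conv.input || "",
    reasoning: conv.reasoning || "",
    output: conv.output,
    score,
    timestamp: new Date().toISOString().slice(0, 10),
    source: "self",
  };
}
```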
Sample format:
{
"instruction": "",
"input": "",
"reasoning": "",
"output": "",
"score": 8.5,
"timestamp": "2026-03-21",
"source": "self"
}
Current dataset: 1,498 reviewed samples from 441 historical conversations.
Most of these conversations originate from earlier discussions with higher-capability LLM systems.
The goal is not to copy answers verbatim, but to use them as a source of structured reasoning examples.
In a sense, the dataset acts as a form of bootstrapped supervision: stronger models provide candidate reasoning patterns, and the personal agent gradually learns from them after human review.
Technical note on Qwen 3.5 thinking mode: `num_predict` must be set to 2000+ when using the model for auto-review. The model's thinking process consumes tokens first; if `num_predict` is too low (e.g. 80-200), the thinking exhausts the budget and the response field comes back empty.
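Concretely, the auto-review request can be sketched like this. The payload keys (`model`, `prompt`, `stream`, `options.num_predict`) follow Ollama's documented `/api/generate` API; the prompt wording and the guard against a too-small budget are my own:

```javascript
// Build an Ollama /api/generate payload for the auto-review step.
// options.num_predict caps total generated tokens; with thinking mode,
// the reasoning trace consumes those tokens first, so the budget must
// leave room for the actual answer.

function buildReviewRequest(sampleText, numPredict = 2048) {
  if (numPredict < 2000) {
    throw new Error("num_predict too low: thinking mode will eat the budget");
  }
  return {
    model: "qwen3.5:9b", // model name from this setup
    prompt: `Review this training sample and answer y/n:\n${sampleText}`,
    stream: false,
    options: { num_predict: numPredict },
  };
}

// The actual call would look something like:
// fetch("http://localhost:11434/api/generate", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(buildReviewRequest(sample)),
// });
```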
The Interesting Engineering Problems
Building this revealed several tensions that don't show up in papers:
Memory vs. control. The more capable the system became at retaining context, the more important it became to think carefully about what it should be able to forget. This isn't just a technical problem; it's an interaction design problem. What does it mean to trust a system with your cognitive history?
Personalization vs. blind spots. A model fine-tuned on one person's interactions might get very good at that person's specific reasoning patterns, but could also amplify their blind spots. Fine-tuning doesn't just transfer knowledge; it transfers biases.
The cold-start loop. To train a personalized model, you need data. To generate good data, you need an already-capable system. This circular dependency is real: breaking out of it requires either a large, carefully curated seed dataset, or accepting that early data quality will be uneven and iterating from there.
These aren't purely engineering problems. They're user experience problems.
Current Status
Working:
- Plugin system (all 6 plugins functional)
- Memory ingestion and hybrid search
- Conversation processing and training sample generation
- Feishu (Lark) channel integration for messaging
In progress:
- Memory retrieval accuracy tuning (`memory_search` returns empty in some cases; data is confirmed present in SQLite, root cause under investigation)
- Automated fine-tuning pipeline
- Agent behavior dashboard
This is a research prototype. The architecture is established; the self-improvement loop is still being assembled.
Why This Matters
The mobile internet revolution restructured how humans relate to information. The AI revolution is doing something at a deeper layer: restructuring how humans relate to cognition itself.
In that context, the question of whose intelligence an AI system reflects matters enormously.
Cloud models will keep getting more capable. But there's a complementary space โ not competing with cloud models, but orthogonal to them โ for systems shaped by and for specific individuals.
Personal AI infrastructure might look like what home servers looked like in the early internet era: niche, technically demanding, not for everyone. But the direction feels worth exploring.
Another way to think about personal AI is not purely as a productivity tool, but as an experimental medium.
If an agent is gradually shaped by a specific person's interactions, it may begin to reflect that person's reasoning style, priorities, and mental models. In that sense, a personalized AI system could become a kind of cognitive mirror, or even a simulation artifact.
Such systems might not always be useful in the traditional sense. But they could still be valuable as a way to explore different cognitive trajectories.
For example, a highly personalized agent could be placed into simulated environments (economic models, social scenarios, or narrative worlds) to observe how its reasoning evolves over time.
Many people enjoy strategy or simulation games because they allow us to explore alternative possibilities. Personal AI systems might eventually enable something similar at a cognitive level: experimenting with how different ways of thinking interact with different environments.
In that sense, personal AI might become not just a tool, but a sandbox for exploring possible forms of intelligence, including our own.
Repository
github.com/vanessa49/personal-ai-agent-lab
Built on: OpenClaw 2026.3.11 · Ollama · SQLite + sqlite-vec · Qdrant · Docker
Ideas, feedback, and experiments welcome.
Open Questions
Building a personal AI agent raises a number of unresolved questions.
How should long-term memory be managed?
If an AI system accumulates years of interaction history, deciding what to keep, compress, or forget becomes both a technical and philosophical challenge.
What should be remembered โ and what should be forgotten?
Memory persistence can make an agent more useful, but it also raises questions about how much cognitive history a system should retain.
Does extreme personalization create blind spots?
A model trained heavily on a single user's interactions might gradually mirror that person's reasoning patterns, including their biases or assumptions.
But this might not necessarily be a flaw.
In some contexts, such behavior could actually be valuable.
For example, a highly personalized agent could become a simulation tool โ allowing users to explore how their own thinking patterns evolve across different scenarios.
Can personal AI become a sandbox for cognitive experiments?
Instead of treating personalization purely as a productivity feature, it could also enable new forms of experimentation.
A personalized agent might be placed into simulated environments (social, economic, or narrative) to observe how its reasoning develops over time.
This begins to resemble a kind of cognitive simulation platform, where AI agents shaped by different individuals explore different trajectories.
Can self-generated training data meaningfully improve behavior over time?
If conversation logs are transformed into training samples, a personal AI system might gradually refine itself based on real usage patterns โ but the long-term stability of such loops is still an open question.
If you're building similar systems or experimenting with personal AI infrastructure, I'd be very curious to hear how you're approaching these questions.