Exploring local LLMs as personal cognitive extensions
Introduction
Let me start with a distinction that I think matters more than people currently realize.
Cloud AI models (GPT, Claude, Gemini) are trained on the output of billions of people. They represent collective intelligence at scale: optimized to be useful to everyone, shaped by aggregate data and company priorities.
That's genuinely powerful. But "useful to everyone" is a different thing from "shaped by you."
The question this project is exploring: what if a local, fine-tunable model could grow alongside a specific person? Not just remembering preferences on top, but having its actual reasoning patterns, tendencies, and ways of approaching problems gradually shaped by one individual's interactions over time.
The key difference is ownership of growth. Cloud models evolve based on what the company decides. A local model can evolve based on what you actually do and think about.
TL;DR
This project explores the idea of a personal AI agent that evolves with a single user over time.
Current prototype includes:
- Local LLM inference via Ollama (qwen3.5:9b + qwen2.5:7b + bge-m3 embedding)
- Always-on agent runtime on a NAS via OpenClaw (Docker)
- Persistent memory with SQLite + sqlite-vec hybrid search
- Plugin-based architecture (6 custom plugins for logging, safety, memory compression, training data)
- A pipeline that converts conversation history into potential fine-tuning data
- 1,498 training samples generated and reviewed from historical conversations
The long-term goal: explore whether a local model can gradually become a personal cognitive extension, rather than just a stateless AI tool.
GitHub: personal-ai-agent-lab
System Architecture
The system runs across two machines:
┌───────────────────────────┐      ┌────────────────────────────────┐
│  GPU Machine (laptop)     │      │  NAS / Always-on Server        │
│                           │      │                                │
│  Ollama                ───┼─────►│  OpenClaw (Docker)             │
│   - qwen3.5:9b            │      │   - Plugin System              │
│   - qwen2.5:7b            │      │   - Memory (SQLite + vec)      │
│   - bge-m3 (embedding)    │      │   - Training Pipeline          │
└───────────────────────────┘      │                                │
                                   │  Qdrant (Docker)               │
                                   └────────────────────────────────┘
The GPU machine handles inference. The NAS runs continuously as the agent environment, maintaining memory, running plugins, and processing conversation history in the background.
Why split? A personal AI agent that only runs when your laptop is on isn't truly always-on. The NAS acts as a persistent cognitive layer that stays active regardless of what else you're doing.
Note: Qdrant is currently an external database accessed via plugin API. OpenClaw's memory system uses SQLite + sqlite-vec; hybrid search operates on SQLite vectors.
Plugin Architecture
All agent behaviors are implemented as plugins. OpenClaw has two separate hook systems that are easy to confuse and important to get right:
| System | Config location | Supported events |
|---|---|---|
| Internal Hooks | `hooks.internal.load.extraDirs` | `agent:bootstrap`, `gateway:startup`, `command:new` |
| Plugin Hooks | `plugins.load.paths` | `before_tool_call`, `after_tool_call`, `before_prompt_build`, `agent_end` |
⚠️ Tool-call monitoring must use Plugin Hooks. Internal Hooks have no `agent:tool:pre`/`agent:tool:post` events; those event names simply don't exist.

Update: A few days after I hit this, the official docs were updated to clarify the distinction. I submitted a docs PR to add more explicit examples and a common-mistakes section anyway; "works but unclear" is still worth improving in open-source docs.
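For reference, the two configuration entry points above might sit side by side in the gateway config. This is a sketch assuming a JSON config file with exactly the key paths from the table; the directory paths are placeholders:

```json
{
  "hooks": {
    "internal": {
      "load": { "extraDirs": ["/opt/openclaw/hooks"] }
    }
  },
  "plugins": {
    "load": { "paths": ["/opt/openclaw/plugins/tool-logger"] }
  }
}
```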
The six plugins currently deployed:
| Plugin | Hook events | What it does |
|---|---|---|
| `tool-logger` | `before_tool_call`, `after_tool_call` | Logs every tool call to `tool_calls.log` |
| `safe-delete-enforcer` | `before_tool_call` (intercept) + `after_tool_call` (index) | Blocks `rm`, forces `mv` to `trash-pending`, auto-creates a deletion index |
| `qdrant-auto-checker` | `before_prompt_build` | Keyword detection → injects a `curl` Qdrant instruction into the system prompt |
| `task-logger` | `agent_end` | Writes a structured task log to `agent_log.md` after each session |
| `training-sample-generator` | `agent_end` | Scores the conversation and generates a training sample if score ≥ 7 |
| `memory-compressor` | `agent_end` | Triggers context compression when a conversation exceeds 20 turns |
Key lessons from plugin development:
- Must use CommonJS with `module.exports = register`; ESM or `module.exports = { register }` silently fails
- The `register` function must be synchronous; `async function register()` gets ignored
- `openclaw.plugin.json` requires both `id` and `configSchema` fields
- SMB writes are unreliable for config files; use `docker exec openclaw node -e "..."` instead
Memory System
The agent stores long-term memory using SQLite + sqlite-vec with hybrid retrieval:
vector similarity (weight: 0.7) ← bge-m3 embeddings via Ollama
        +
full-text search (weight: 0.3)  ← SQLite FTS5
        ↓
hybrid ranked results
Current state: 441 files, 5,137 chunks indexed.
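As an illustration of the weighting above (not OpenClaw's actual internals), hybrid ranking can be sketched as a weighted merge of the two result lists, assuming both backends return scores already normalized to [0, 1]:

```javascript
// Illustrative hybrid ranking: 0.7 * vector similarity + 0.3 * FTS score.
// A sketch of the idea only; assumes both score lists are normalized.

function hybridRank(vectorHits, ftsHits, wVec = 0.7, wFts = 0.3) {
  const scores = new Map();

  // Accumulate weighted scores; a chunk found by both backends gets both.
  for (const { id, score } of vectorHits) {
    scores.set(id, (scores.get(id) || 0) + wVec * score);
  }
  for (const { id, score } of ftsHits) {
    scores.set(id, (scores.get(id) || 0) + wFts * score);
  }

  // Sort descending by combined score.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

A chunk that matches both the embedding query and the keyword query outranks one that matches only one of them, which is the point of the hybrid scheme.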
Important clarification on Qdrant: Despite appearing in the architecture diagram, Qdrant is not integrated into the memory_search pipeline. It runs as a separate container and is queried manually via plugin prompt injection: a before_prompt_build hook detects keywords like "qdrant" or "vector" and injects a curl instruction into the system prompt. The agent then executes it as a tool call.
This is Prompt Automation, not Memory Integration. True Qdrant integration would require implementing a custom OpenClaw memory driver to replace the SQLite backend, a framework-level change that isn't yet planned.
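The prompt-automation pattern can be sketched as a small pure function behind the `before_prompt_build` hook. The trigger keywords and the Qdrant search endpoint path are from the text above and Qdrant's REST API; the context field names (`userMessage`, `systemPrompt`) are assumptions:

```javascript
// Sketch of the qdrant-auto-checker pattern: keyword detection in the
// user message triggers injecting a curl instruction into the system
// prompt. Field names on `ctx` are assumptions, not OpenClaw's API.

const TRIGGERS = ["qdrant", "vector"];

function maybeInjectQdrantHint(ctx) {
  const text = (ctx.userMessage || "").toLowerCase();
  if (!TRIGGERS.some((kw) => text.includes(kw))) return ctx;

  // Append the instruction rather than replacing the system prompt.
  return {
    ...ctx,
    systemPrompt:
      ctx.systemPrompt +
      "\n\nWhen vector search is needed, query Qdrant directly, e.g.:\n" +
      "curl -X POST http://qdrant:6333/collections/<name>/points/search " +
      "-H 'Content-Type: application/json' -d '{...}'",
  };
}
```

The agent then sees the instruction in its system prompt and executes the `curl` as an ordinary tool call, which is exactly why this is automation rather than integration.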
Conversation โ Training Pipeline
The pipeline converts raw conversation history into potential fine-tuning data:
conversation logs
        ↓
batch_process_conversations.js    # parse + chunk (512-token windows)
        ↓
training-sample-generator plugin  # auto-score: importance + novelty + generalizability
        ↓
agent_review.py                   # LLM auto-review via Ollama API
        ↓
review_samples.js                 # human review interface (y/n/s/q)
        ↓
samples.jsonl / pending_review.jsonl
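The gating step in the middle of this pipeline can be sketched as follows. The three scoring dimensions and the threshold of 7 come from the description above; averaging the three scores and the 0-10 ranges are assumptions about how they combine:

```javascript
// Sketch of training-sample gating: score a conversation on
// importance / novelty / generalizability (each assumed 0-10) and
// keep it as a candidate only if the average clears the threshold.

const THRESHOLD = 7; // from the plugin description: score >= 7

function scoreConversation({ importance, novelty, generalizability }) {
  return (importance + novelty + generalizability) / 3;
}

function toTrainingSample(conv, scores) {
  const score = scoreConversation(scores);
  if (score < THRESHOLD) return null; // discarded before review

  // Fields mirror the sample format shown in the post.
  return {
    instruction: conv.instruction,
    input: conv.input || "",
    reasoning: conv.reasoning || "",
    output: conv.output,
    score,
    timestamp: new Date().toISOString().slice(0, 10),
    source: "self",
  };
}
```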
Sample format:
{
"instruction": "",
"input": "",
"reasoning": "",
"output": "",
"score": 8.5,
"timestamp": "2026-03-21",
"source": "self"
}
Current dataset: 1,498 reviewed samples from 441 historical conversations.
Most of these conversations originate from earlier discussions with higher-capability LLM systems.
The goal is not to copy answers verbatim, but to use them as a source of structured reasoning examples.
In a sense, the dataset acts as a form of bootstrapped supervision: stronger models provide candidate reasoning patterns, and the personal agent gradually learns from them after human review.
Technical note on Qwen 3.5 thinking mode: `num_predict` must be set to 2000+ when using the model for auto-review. The model's thinking process consumes tokens first; if `num_predict` is too low (e.g. 80-200), the thinking exhausts the budget and the response field comes back empty.
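Concretely, the auto-review request can be sketched like this. The payload keys (`model`, `prompt`, `stream`, `options.num_predict`) follow Ollama's documented `/api/generate` API; the prompt wording and the guard against a too-small budget are my own:

```javascript
// Build an Ollama /api/generate payload for the auto-review step.
// options.num_predict caps total generated tokens; with thinking mode,
// the reasoning trace consumes those tokens first, so the budget must
// leave room for the actual answer.

function buildReviewRequest(sampleText, numPredict = 2048) {
  if (numPredict < 2000) {
    throw new Error("num_predict too low: thinking mode will eat the budget");
  }
  return {
    model: "qwen3.5:9b", // model name from this setup
    prompt: `Review this training sample and answer y/n:\n${sampleText}`,
    stream: false,
    options: { num_predict: numPredict },
  };
}

// The actual call would look something like:
// fetch("http://localhost:11434/api/generate", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(buildReviewRequest(sample)),
// });
```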
The Interesting Engineering Problems
Building this revealed several tensions that don't show up in papers:
Memory vs. control. The more capable the system became at retaining context, the more important it became to think carefully about what it should be able to forget. This isn't just a technical problem; it's an interaction design problem. What does it mean to trust a system with your cognitive history?
Personalization vs. blind spots. A model fine-tuned on one person's interactions might get very good at that person's specific reasoning patterns, but could also amplify their blind spots. Fine-tuning doesn't just transfer knowledge; it transfers biases.
The cold-start loop. To train a personalized model, you need data. To generate good data, you need an already-capable system. This circular dependency is real: breaking out of it requires either a large, carefully curated seed dataset, or accepting that early data quality will be uneven and iterating from there.
These aren't purely engineering problems. They're user experience problems.
Current Status
Working:
- Plugin system (all 6 plugins functional)
- Memory ingestion and hybrid search
- Conversation processing and training sample generation
- Feishu (Lark) channel integration for messaging
In progress:
- Memory retrieval accuracy tuning (`memory_search` returns empty in some cases; data is confirmed present in SQLite, root cause under investigation)
- Automated fine-tuning pipeline
- Agent behavior dashboard
This is a research prototype. The architecture is established; the self-improvement loop is still being assembled.
Why This Matters
The mobile internet revolution restructured how humans relate to information. The AI revolution is doing something at a deeper layer: restructuring how humans relate to cognition itself.
In that context, the question of whose intelligence an AI system reflects matters enormously.
Cloud models will keep getting more capable. But there's a complementary space โ not competing with cloud models, but orthogonal to them โ for systems shaped by and for specific individuals.
Personal AI infrastructure might look like what home servers looked like in the early internet era: niche, technically demanding, not for everyone. But the direction feels worth exploring.
Another way to think about personal AI is not purely as a productivity tool, but as an experimental medium.
If an agent is gradually shaped by a specific person's interactions, it may begin to reflect that person's reasoning style, priorities, and mental models. In that sense, a personalized AI system could become a kind of cognitive mirror, or even a simulation artifact.
Such systems might not always be useful in the traditional sense. But they could still be valuable as a way to explore different cognitive trajectories.
For example, a highly personalized agent could be placed into simulated environments (economic models, social scenarios, or narrative worlds) to observe how its reasoning evolves over time.
Many people enjoy strategy or simulation games because they allow us to explore alternative possibilities. Personal AI systems might eventually enable something similar at a cognitive level: experimenting with how different ways of thinking interact with different environments.
In that sense, personal AI might become not just a tool, but a sandbox for exploring possible forms of intelligence, including our own.
Repository
github.com/vanessa49/personal-ai-agent-lab
Built on: OpenClaw 2026.3.11 · Ollama · SQLite + sqlite-vec · Qdrant · Docker
Ideas, feedback, and experiments welcome.
Open Questions
Building a personal AI agent raises a number of unresolved questions.
How should long-term memory be managed?
If an AI system accumulates years of interaction history, deciding what to keep, compress, or forget becomes both a technical and philosophical challenge.
What should be remembered โ and what should be forgotten?
Memory persistence can make an agent more useful, but it also raises questions about how much cognitive history a system should retain.
Does extreme personalization create blind spots?
A model trained heavily on a single user's interactions might gradually mirror that person's reasoning patterns, including their biases or assumptions.
But this might not necessarily be a flaw.
In some contexts, such behavior could actually be valuable.
For example, a highly personalized agent could become a simulation tool โ allowing users to explore how their own thinking patterns evolve across different scenarios.
Can personal AI become a sandbox for cognitive experiments?
Instead of treating personalization purely as a productivity feature, it could also enable new forms of experimentation.
A personalized agent might be placed into simulated environments (social, economic, or narrative) to observe how its reasoning develops over time.
This begins to resemble a kind of cognitive simulation platform, where AI agents shaped by different individuals explore different trajectories.
Can self-generated training data meaningfully improve behavior over time?
If conversation logs are transformed into training samples, a personal AI system might gradually refine itself based on real usage patterns โ but the long-term stability of such loops is still an open question.
If you're building similar systems or experimenting with personal AI infrastructure, I'd be very curious to hear how you're approaching these questions.