<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: vanessa49</title>
    <description>The latest articles on DEV Community by vanessa49 (@vanessa49).</description>
    <link>https://dev.to/vanessa49</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3840018%2F11e5c99e-718b-4707-93fe-5329bb11e9f3.png</url>
      <title>DEV Community: vanessa49</title>
      <link>https://dev.to/vanessa49</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vanessa49"/>
    <language>en</language>
    <item>
      <title>Personal AI Isn't Q&amp;A — It's Iteration</title>
      <dc:creator>vanessa49</dc:creator>
      <pubDate>Sat, 28 Mar 2026 14:28:18 +0000</pubDate>
      <link>https://dev.to/vanessa49/personal-ai-isnt-qa-its-iteration-3496</link>
      <guid>https://dev.to/vanessa49/personal-ai-isnt-qa-its-iteration-3496</guid>
      <description>&lt;p&gt;Why user→assistant segmentation fails for personal AI fine-tuning&lt;/p&gt;




&lt;p&gt;I built a pipeline to generate training samples from my personal AI conversation history — GPT exports, processed into &lt;code&gt;user → assistant&lt;/code&gt; pairs.&lt;/p&gt;

&lt;p&gt;Then I manually reviewed a batch and found a problem I hadn't anticipated. To validate the intuition, I ran a comparison across real conversations spanning 2023–2026.&lt;/p&gt;

&lt;p&gt;Here's what the data showed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Intermediate States Masquerading as Conclusions
&lt;/h2&gt;

&lt;p&gt;Most of my conversations don't follow a Q&amp;amp;A pattern. They iterate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Me: I'm thinking about running a local AI on my laptop.

AI: It depends on the hardware...

Me: My laptop only has 16GB RAM.

AI: That could be a limitation...

Me: Ah. So maybe the question isn't
    "how to run AI on my laptop".
    It's whether my laptop can run it at all —
    and if not, what kind of setup I'd actually need.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The traditional pipeline captured sample #1 as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I'm thinking about running a local AI on my laptop."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"It depends on the hardware..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But this represents the &lt;strong&gt;first answer&lt;/strong&gt;, not the &lt;strong&gt;final understanding&lt;/strong&gt; that emerged from the conversation.&lt;/p&gt;

&lt;p&gt;The common &lt;code&gt;user → assistant&lt;/code&gt; segmentation assumes that each assistant message is a terminal answer. In reality, many personal AI conversations look more like a reasoning process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hypothesis → test → correction → refinement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The insight appears &lt;strong&gt;at the end of the trajectory&lt;/strong&gt;, not at the first reply.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Structural difference:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nzw3i9amn7duzvnn6yp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nzw3i9amn7duzvnn6yp.png" alt="Traditional fine-tuning assumes answers are terminal states. Personal AI conversations are trajectories of reasoning." width="800" height="191"&gt;&lt;/a&gt;&lt;br&gt;
Traditional fine-tuning treats conversations as isolated question-answer pairs.&lt;br&gt;
Trajectory-based training instead models them as &lt;strong&gt;evolving reasoning paths&lt;/strong&gt;, where earlier responses are intermediate states rather than final outputs.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;I applied both methods to the same conversation and compared the results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Traditional Q&amp;amp;A&lt;/th&gt;
&lt;th&gt;Cognitive trajectory&lt;/th&gt;
&lt;th&gt;Difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Samples generated&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;−68.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single-turn samples&lt;/td&gt;
&lt;td&gt;35 (100%)&lt;/td&gt;
&lt;td&gt;6 (54.5%)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-turn iteration samples&lt;/td&gt;
&lt;td&gt;0 (0%)&lt;/td&gt;
&lt;td&gt;5 (45.5%)&lt;/td&gt;
&lt;td&gt;+5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg turns per sample&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3.6&lt;/td&gt;
&lt;td&gt;+80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The traditional method produced 35 independent samples — and captured zero iterative exchanges. The cognitive method produced 11 samples, but 5 of them preserved complete thought trajectories that the traditional method lost entirely.&lt;/p&gt;

&lt;p&gt;Scaled to the full dataset of 1,122 conversations (2023–2026), the same pattern holds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;259,534 cognitive nodes&lt;/strong&gt; extracted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;547,836 training samples&lt;/strong&gt; generated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15,506 refinement chains&lt;/strong&gt; identified — sequences where an idea was explicitly corrected and revised&lt;/li&gt;
&lt;li&gt;Average refinement chain length: &lt;strong&gt;2.14 steps&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note on edge counts: &lt;code&gt;iteration_final&lt;/code&gt; edges are convergence shortcuts added after refinement-chain detection — they link the start and end of a correction chain directly, rather than replacing the intermediate steps. This means &lt;code&gt;iteration_final&lt;/code&gt; edges are additive, not mutually exclusive with the base sequential edges, so edge type percentages sum above 100%.&lt;/p&gt;
&lt;/blockquote&gt;
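&lt;p&gt;A minimal sketch of that additive behavior (the edge and chain shapes here are illustrative, not the pipeline's real schema):&lt;/p&gt;

```python
# Illustrative sketch of the additive iteration_final edges described above.
# Edge and chain shapes are made up for the example, not the real schema.

def add_convergence_shortcuts(edges, refinement_chains):
    """Append a start-to-end 'iteration_final' edge for each chain.

    The base sequential edges are kept as-is, so edge-type percentages
    computed over the full edge list can sum above 100%.
    """
    out = list(edges)  # base edges stay untouched
    for chain in refinement_chains:
        if len(chain) >= 2:
            out.append({"src": chain[0], "dst": chain[-1],
                        "relation": "iteration_final"})
    return out

edges = [{"src": "n1", "dst": "n2", "relation": "refines"},
         {"src": "n2", "dst": "n3", "relation": "refines"}]
result = add_convergence_shortcuts(edges, [["n1", "n2", "n3"]])
# result keeps both refines edges and gains one n1-to-n3 iteration_final edge
```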

&lt;p&gt;The relationship distribution across 273,918 cognitive edges:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Relation type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;follows&lt;/td&gt;
&lt;td&gt;149,085&lt;/td&gt;
&lt;td&gt;54.4%&lt;/td&gt;
&lt;td&gt;Sequential continuation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;derives&lt;/td&gt;
&lt;td&gt;25,734&lt;/td&gt;
&lt;td&gt;9.4%&lt;/td&gt;
&lt;td&gt;Logical inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;responds&lt;/td&gt;
&lt;td&gt;20,651&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;td&gt;Direct reply&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hypothesizes&lt;/td&gt;
&lt;td&gt;18,818&lt;/td&gt;
&lt;td&gt;6.9%&lt;/td&gt;
&lt;td&gt;Hypothesis formation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;refines&lt;/td&gt;
&lt;td&gt;17,571&lt;/td&gt;
&lt;td&gt;6.4%&lt;/td&gt;
&lt;td&gt;Explicit correction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;iteration_final&lt;/td&gt;
&lt;td&gt;15,506&lt;/td&gt;
&lt;td&gt;5.7%&lt;/td&gt;
&lt;td&gt;Convergence shortcut: chain start → chain end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;restarts&lt;/td&gt;
&lt;td&gt;15,187&lt;/td&gt;
&lt;td&gt;5.5%&lt;/td&gt;
&lt;td&gt;Topic restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;speculates&lt;/td&gt;
&lt;td&gt;10,674&lt;/td&gt;
&lt;td&gt;3.9%&lt;/td&gt;
&lt;td&gt;Speculative reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;clarifies&lt;/td&gt;
&lt;td&gt;613&lt;/td&gt;
&lt;td&gt;0.2%&lt;/td&gt;
&lt;td&gt;Clarification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;contrasts&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;0.03%&lt;/td&gt;
&lt;td&gt;Perspective shift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things stand out. First, &lt;code&gt;follows&lt;/code&gt; dropped from ~70% (early dataset) to 54% as the dataset scaled — the pipeline now detects a wider vocabulary of cognitive events, so fewer edges fall through to the default. Second, four new relation types appeared (&lt;code&gt;hypothesizes&lt;/code&gt;, &lt;code&gt;restarts&lt;/code&gt;, &lt;code&gt;speculates&lt;/code&gt;, &lt;code&gt;clarifies&lt;/code&gt;) that weren't in the initial schema — these emerged from the data rather than being pre-defined, which is exactly the direction the design was pointing toward.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;refines&lt;/code&gt; and &lt;code&gt;iteration_final&lt;/code&gt; samples together represent roughly &lt;strong&gt;12% of all edges&lt;/strong&gt;. These are often the moments where the conversation moves furthest from the model's baseline response and closer to the user's intended reasoning — and they're the samples least likely to appear in traditional Q&amp;amp;A segmentation.&lt;/p&gt;


&lt;h2&gt;
  
  
  What to Do Instead
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Option 1: Cognitive node segmentation
&lt;/h3&gt;

&lt;p&gt;Instead of &lt;code&gt;user/assistant&lt;/code&gt; turn boundaries, segment by semantic shift (topic change, correction markers, or new reasoning step) and build samples as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[node_t-2, node_t-1, node_t] → node_t+1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This preserves context across turn boundaries and makes the training target the &lt;em&gt;next thought&lt;/em&gt;, not the &lt;em&gt;next response&lt;/em&gt;.&lt;/p&gt;
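&lt;p&gt;The windowed sampling above can be sketched as follows (function and field names are mine, not the pipeline's):&lt;/p&gt;

```python
# Sliding-window trajectory samples: the context is the last `window`
# cognitive nodes and the target is the next node (the "next thought"),
# regardless of which speaker produced each node.

def build_trajectory_samples(nodes, window=3):
    samples = []
    for i in range(window, len(nodes)):
        samples.append({
            "context": nodes[i - window:i],  # [node_t-2, node_t-1, node_t]
            "target": nodes[i],              # node_t+1
        })
    return samples

nodes = ["hypothesis", "test", "correction", "refinement", "convergence"]
samples = build_trajectory_samples(nodes)
# two samples; the last one predicts "convergence" from the three nodes before it
```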

&lt;p&gt;The edge types to track:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// derives: logical consequence ("so therefore...")&lt;/span&gt;
&lt;span class="c1"&gt;// refines: correction or improvement ("actually, instead...")&lt;/span&gt;
&lt;span class="c1"&gt;// contrasts: perspective shift ("on the other hand...")&lt;/span&gt;
&lt;span class="c1"&gt;// follows: sequential continuation (default)&lt;/span&gt;
&lt;span class="c1"&gt;// iteration_final: convergence shortcut from chain start to chain end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 2: Track and weight refinement chains
&lt;/h3&gt;

&lt;p&gt;Identify correction chains explicitly. A refinement chain looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;initial idea → user challenges → AI revises → convergence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Link the chain's start to its final node with an &lt;code&gt;iteration_final&lt;/code&gt; edge and weight it higher during training. In the current pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;weight_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;iteration_final&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# last refinement × depth bonus
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;refines&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# explicit correction
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;speculates&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# speculative reasoning
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hypothesizes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# hypothesis formation
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;derives&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# logical consequence
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;restarts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# topic restart
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;follows&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# default
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Plus time decay: older samples are down-weighted
# weight ×= e^(-age_in_days / 730)
# Encourages the model to learn who you are *now*
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the current dataset, 38% of samples carry weight &amp;gt; 1.0 — higher than the initial 15–25% estimate. The difference is the expanded relation vocabulary: &lt;code&gt;hypothesizes&lt;/code&gt;, &lt;code&gt;speculates&lt;/code&gt;, and &lt;code&gt;restarts&lt;/code&gt; all carry above-baseline weights, and they're more prevalent than initially anticipated. This isn't a bug — it reflects the actual distribution of cognitive events in the data. The baseline &lt;code&gt;follows&lt;/code&gt; edges (54%) still dominate; it's the non-default types that are being weighted up.&lt;/p&gt;
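&lt;p&gt;Put together, the effective sample weight combines the relation weight with the time-decay factor. A sketch, using only the constants shown above (everything else is illustrative):&lt;/p&gt;

```python
import math

# Relation weights from the weight_map above; unknown types fall back to 1.0.
WEIGHT_MAP = {
    "iteration_final": 2.5, "refines": 2.0, "speculates": 1.5,
    "hypothesizes": 1.3, "derives": 1.5, "restarts": 1.3, "follows": 1.0,
}

def sample_weight(relation, age_in_days, decay_days=730.0):
    """Base weight times e^(-age_in_days / 730), per the decay rule above."""
    base = WEIGHT_MAP.get(relation, 1.0)
    return base * math.exp(-age_in_days / decay_days)

# A fresh explicit correction outweighs a two-year-old one:
fresh = sample_weight("refines", 0)    # 2.0
old = sample_weight("refines", 730)    # 2.0 / e, roughly 0.74
```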

&lt;h3&gt;
  
  
  Option 3: Preserve temporal sequence
&lt;/h3&gt;

&lt;p&gt;Timestamps aren't just metadata in personal AI training. They're features.&lt;/p&gt;

&lt;p&gt;Two samples with similar content but different timestamps aren't duplicates — they're evidence of cognitive evolution. The current pipeline preserves original conversation timestamps on all nodes (100% integrity across 259,534 nodes), which enables time-decay weighting and, eventually, cross-time analysis of how thinking changes on the same topic.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Subtler Problem: The Edge Vocabulary
&lt;/h2&gt;

&lt;p&gt;Even after fixing the segmentation problem, there's a deeper assumption worth flagging.&lt;/p&gt;

&lt;p&gt;The current pipeline uses a fixed set of relation types — &lt;code&gt;derives&lt;/code&gt;, &lt;code&gt;refines&lt;/code&gt;, &lt;code&gt;contrasts&lt;/code&gt;, &lt;code&gt;follows&lt;/code&gt;. This vocabulary was designed from an engineering perspective: it works for cause-and-effect reasoning. But some connections between ideas are associative, aesthetic, or simply "these belong together."&lt;/p&gt;

&lt;p&gt;Interestingly, running the pipeline on real data has already pushed back on this assumption: four relation types (&lt;code&gt;hypothesizes&lt;/code&gt;, &lt;code&gt;restarts&lt;/code&gt;, &lt;code&gt;speculates&lt;/code&gt;, &lt;code&gt;clarifies&lt;/code&gt;) emerged from the detection logic that weren't in the original schema. The vocabulary is already partially self-extending.&lt;/p&gt;

&lt;p&gt;One further direction: leave the relation type as a fully free field, accumulate data without pre-labeling, then run a clustering pass to discover what relation types naturally appear in &lt;em&gt;this person's&lt;/em&gt; thinking. Probably unreliable at current data volumes, but worth designing toward from the start — which is why the schema uses a flexible &lt;code&gt;tags&lt;/code&gt; array alongside the fixed &lt;code&gt;relation&lt;/code&gt; field, rather than a strict enum.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Broader Point
&lt;/h2&gt;

&lt;p&gt;This problem is more severe for personal AI than for general fine-tuning.&lt;/p&gt;

&lt;p&gt;With millions of training samples, structural errors average out. With a few hundred personal conversations, every assumption baked into the segmentation pipeline gets amplified in the model's behavior.&lt;/p&gt;

&lt;p&gt;If your segmentation assumes Q&amp;amp;A but your conversations are iterative research, you'll train a model that answers like a chatbot rather than reasoning like you.&lt;/p&gt;

&lt;p&gt;The fix isn't complicated. But it requires noticing the assumption first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset design is ontology design&lt;/strong&gt; — the structure you impose on data determines what patterns the model can learn. Choose carefully.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current System
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;1,122 conversations processed (GPT exports, 2023–2026)&lt;/li&gt;
&lt;li&gt;259,534 cognitive nodes, 273,918 edges, 547,836 training samples&lt;/li&gt;
&lt;li&gt;15,506 refinement chains, average length 2.14 steps&lt;/li&gt;
&lt;li&gt;All 259,534 nodes carry original conversation timestamps (100% integrity)&lt;/li&gt;
&lt;li&gt;Pipeline: cognitive chunking → refinement chain tracking → iteration_final generation → weighted sampling&lt;/li&gt;
&lt;li&gt;Fine-tuning: pending (QLoRA on qwen2.5:7b, RTX 4060)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/vanessa49/personal-ai-agent-lab" rel="noopener noreferrer"&gt;personal-ai-agent-lab on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article focuses on the engineering side of the pipeline.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the conceptual discussion behind the idea, see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ontology problem → &lt;a href="https://medium.com/design-bootcamp/personal-ai-isnt-about-answers-it-s-about-thought-trajectories-d1afd1d4b87b" rel="noopener noreferrer"&gt;https://medium.com/design-bootcamp/personal-ai-isnt-about-answers-it-s-about-thought-trajectories-d1afd1d4b87b&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building a Personal AI Agent That Grows With You</title>
      <dc:creator>vanessa49</dc:creator>
      <pubDate>Mon, 23 Mar 2026 12:12:13 +0000</pubDate>
      <link>https://dev.to/vanessa49/building-a-personal-ai-agent-that-grows-with-you-4c29</link>
      <guid>https://dev.to/vanessa49/building-a-personal-ai-agent-that-grows-with-you-4c29</guid>
      <description>&lt;p&gt;&lt;em&gt;Exploring local LLMs as personal cognitive extensions&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Let me start with a distinction that I think matters more than people currently realize.&lt;/p&gt;

&lt;p&gt;Cloud AI models — GPT, Claude, Gemini — are trained on the output of billions of people. They represent collective intelligence at scale: optimized to be useful to everyone, shaped by aggregate data and company priorities.&lt;/p&gt;

&lt;p&gt;That's genuinely powerful. But "useful to everyone" is a different thing from "shaped by you."&lt;/p&gt;

&lt;p&gt;The question this project is exploring: what if a local, fine-tunable model could grow alongside a specific person? Not just remembering preferences on top — but having its actual reasoning patterns, tendencies, and ways of approaching problems gradually shaped by one individual's interactions over time.&lt;/p&gt;

&lt;p&gt;The key difference is &lt;strong&gt;ownership of growth&lt;/strong&gt;. Cloud models evolve based on what the company decides. A local model can evolve based on what &lt;em&gt;you&lt;/em&gt; actually do and think about.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;This project explores the idea of a &lt;strong&gt;personal AI agent that evolves with a single user over time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Current prototype includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local LLM inference via &lt;strong&gt;Ollama&lt;/strong&gt; (qwen3.5:9b + qwen2.5:7b + bge-m3 embedding)&lt;/li&gt;
&lt;li&gt;Always-on agent runtime on a &lt;strong&gt;NAS&lt;/strong&gt; via OpenClaw (Docker)&lt;/li&gt;
&lt;li&gt;Persistent memory with &lt;strong&gt;SQLite + sqlite-vec&lt;/strong&gt; hybrid search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugin-based architecture&lt;/strong&gt; (6 custom plugins for logging, safety, memory compression, training data)&lt;/li&gt;
&lt;li&gt;A pipeline that converts conversation history into &lt;strong&gt;potential fine-tuning data&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;1,498 training samples generated and reviewed from historical conversations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The long-term goal: explore whether a local model can gradually become a &lt;strong&gt;personal cognitive extension&lt;/strong&gt;, rather than just a stateless AI tool.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/vanessa49/personal-ai-agent-lab" rel="noopener noreferrer"&gt;&lt;strong&gt;GitHub: personal-ai-agent-lab&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The system runs across two machines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────┐        ┌──────────────────────────────┐
│   GPU Machine (laptop)  │        │   NAS / Always-on Server     │
│                         │        │                              │
│   Ollama                │◄──────►│   OpenClaw (Docker)          │
│   - qwen3.5:9b          │        │   - Plugin System            │
│   - qwen2.5:7b          │        │   - Memory (SQLite + vec)    │
│   - bge-m3 (embedding)  │        │   - Training Pipeline        │
└─────────────────────────┘        │                              │
                                   │   Qdrant (Docker)            │
                                   └──────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The GPU machine handles inference. The NAS runs continuously as the agent environment — maintaining memory, running plugins, processing conversation history in the background.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why split?&lt;/strong&gt; A personal AI agent that only runs when your laptop is on isn't truly always-on. The NAS acts as a persistent cognitive layer that stays active regardless of what else you're doing.&lt;/p&gt;

&lt;p&gt;Note: Qdrant is currently an external database accessed via plugin API. OpenClaw's memory system uses SQLite + sqlite-vec; hybrid search operates on SQLite vectors.&lt;/p&gt;




&lt;h2&gt;
  
  
  Plugin Architecture
&lt;/h2&gt;

&lt;p&gt;All agent behaviors are implemented as plugins. OpenClaw has two separate hook systems — easy to confuse, important to get right:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Config location&lt;/th&gt;
&lt;th&gt;Supported events&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Internal Hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;hooks.internal.load.extraDirs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;agent:bootstrap&lt;/code&gt;, &lt;code&gt;gateway:startup&lt;/code&gt;, &lt;code&gt;command:new&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plugin Hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;plugins.load.paths&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;before_tool_call&lt;/code&gt;, &lt;code&gt;after_tool_call&lt;/code&gt;, &lt;code&gt;before_prompt_build&lt;/code&gt;, &lt;code&gt;agent_end&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Tool call monitoring &lt;strong&gt;must&lt;/strong&gt; use Plugin Hooks. Internal Hooks have no &lt;code&gt;agent:tool:pre&lt;/code&gt; / &lt;code&gt;agent:tool:post&lt;/code&gt; events — these don't exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; A few days after I hit this, the official docs were updated to clarify the distinction. I submitted a &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;docs PR&lt;/a&gt; to add more explicit examples and a common-mistakes section anyway — "works but unclear" is still worth improving in open source docs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The six plugins currently deployed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plugin&lt;/th&gt;
&lt;th&gt;Hook events&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tool-logger&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;before_tool_call&lt;/code&gt;  • &lt;code&gt;after_tool_call&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Logs every tool call to &lt;code&gt;tool_calls.log&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;safe-delete-enforcer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;before_tool_call&lt;/code&gt; (intercept) + &lt;code&gt;after_tool_call&lt;/code&gt; (index)&lt;/td&gt;
&lt;td&gt;Blocks &lt;code&gt;rm&lt;/code&gt;, forces &lt;code&gt;mv&lt;/code&gt; to trash-pending, auto-creates deletion index&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qdrant-auto-checker&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;before_prompt_build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Keyword detection → inject &lt;code&gt;curl qdrant&lt;/code&gt; instruction into system prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;task-logger&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;agent_end&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Writes structured task log to &lt;code&gt;agent_log.md&lt;/code&gt; after each session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;training-sample-generator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;agent_end&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scores conversation, generates training sample if score ≥ 7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;memory-compressor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;agent_end&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Triggers context compression when conversation exceeds 20 turns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key lessons from plugin development:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must use CommonJS &lt;code&gt;module.exports = register&lt;/code&gt; — ESM or &lt;code&gt;module.exports = { register }&lt;/code&gt; silently fails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;register&lt;/code&gt; function must be synchronous — &lt;code&gt;async function register()&lt;/code&gt; gets ignored&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;openclaw.plugin.json&lt;/code&gt; requires both &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;configSchema&lt;/code&gt; fields&lt;/li&gt;
&lt;li&gt;SMB writes are unreliable for config files — use &lt;code&gt;docker exec openclaw node -e "..."&lt;/code&gt; instead&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Memory System
&lt;/h2&gt;

&lt;p&gt;The agent stores long-term memory using SQLite + sqlite-vec with hybrid retrieval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vector similarity (weight: 0.7)   ← bge-m3 embeddings via Ollama
        +
full-text search  (weight: 0.3)   ← SQLite FTS5
        ↓
hybrid ranked results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Current state: 441 files, 5,137 chunks indexed.&lt;/p&gt;
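&lt;p&gt;The 0.7 / 0.3 merge can be sketched like this (in the real system the scores come from sqlite-vec similarity and FTS5 ranking; here both are stubbed as plain dicts keyed by chunk id):&lt;/p&gt;

```python
# Hybrid ranking sketch: weighted sum of vector similarity and full-text
# relevance, then sort chunk ids by the combined score.

def hybrid_rank(vector_scores, fts_scores, w_vec=0.7, w_fts=0.3):
    ids = set(vector_scores) | set(fts_scores)
    combined = {
        cid: w_vec * vector_scores.get(cid, 0.0) + w_fts * fts_scores.get(cid, 0.0)
        for cid in ids
    }
    return sorted(combined, key=combined.get, reverse=True)

ranked = hybrid_rank({"a": 0.9, "b": 0.4}, {"b": 1.0, "c": 0.8})
# "a" ranks first (0.63), then "b" (0.58), then "c" (0.24)
```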

&lt;p&gt;&lt;strong&gt;Important clarification on Qdrant:&lt;/strong&gt; Despite appearing in the architecture diagram, Qdrant is &lt;em&gt;not&lt;/em&gt; integrated into the &lt;code&gt;memory_search&lt;/code&gt; pipeline. It runs as a separate container and is queried manually via plugin prompt injection — a &lt;code&gt;before_prompt_build&lt;/code&gt; hook detects keywords like "qdrant" or "vector" and injects a &lt;code&gt;curl&lt;/code&gt; instruction into the system prompt. The agent then executes it as a tool call.&lt;/p&gt;

&lt;p&gt;This is Prompt Automation, not Memory Integration. True Qdrant integration would require implementing a custom OpenClaw memory driver to replace the SQLite backend — a framework-level change not yet planned.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conversation → Training Pipeline
&lt;/h2&gt;

&lt;p&gt;The pipeline converts raw conversation history into potential fine-tuning data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conversation logs
      ↓
batch_process_conversations.js    # parse + chunk (512 token windows)
      ↓
training-sample-generator plugin  # auto-score: importance + novelty + generalizability
      ↓
agent_review.py                   # LLM auto-review via Ollama API
      ↓
review_samples.js                 # human review interface (y/n/s/q)
      ↓
samples.jsonl / pending_review.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;8.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-21"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"self"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
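&lt;p&gt;Since everything downstream reads this JSONL format, a small validator catches malformed lines early. A minimal sketch (field names are from the sample above; the checks themselves are my own assumption, not part of the actual pipeline):&lt;/p&gt;

```python
import json

# Field names taken from the sample format shown above.
REQUIRED = ("instruction", "input", "reasoning", "output", "score", "timestamp", "source")

def validate_sample(line):
    """Parse one JSONL line and verify it carries every expected field."""
    sample = json.loads(line)
    missing = [k for k in REQUIRED if k not in sample]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(sample["score"], (int, float)):
        raise ValueError("score must be numeric")
    return sample
```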



&lt;p&gt;Current dataset: &lt;strong&gt;1,498 reviewed samples&lt;/strong&gt; from 441 historical conversations.&lt;/p&gt;

&lt;p&gt;Most of these conversations originate from earlier discussions with higher-capability LLM systems. &lt;br&gt;
The goal is not to copy answers verbatim, but to use them as a source of structured reasoning examples.&lt;/p&gt;

&lt;p&gt;In a sense, the dataset acts as a form of bootstrapped supervision: stronger models provide candidate reasoning patterns, and the personal agent gradually learns from them after human review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical note on Qwen 3.5 thinking mode:&lt;/strong&gt; &lt;code&gt;num_predict&lt;/code&gt; must be set to 2000+ when using the model for auto-review. The model's thinking process consumes tokens first — if &lt;code&gt;num_predict&lt;/code&gt; is too low (e.g. 80–200), the thinking exhausts the budget and the &lt;code&gt;response&lt;/code&gt; field comes back empty.&lt;/p&gt;
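&lt;p&gt;As a concrete sketch of what that auto-review call looks like against the Ollama HTTP API (the model tag, prompt wording, and helper names are my own illustration, not the actual &lt;code&gt;agent_review.py&lt;/code&gt;):&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_review_request(sample_text, num_predict=2000):
    """Build the /api/generate payload for an auto-review call.

    num_predict must stay high enough that the thinking tokens do not
    exhaust the budget before the visible answer is produced.
    """
    return {
        "model": "qwen3.5",  # illustrative local model tag
        "prompt": "Review this training sample, answer y or n:\n" + sample_text,
        "stream": False,
        "options": {"num_predict": num_predict},
    }

def auto_review(sample_text):
    """Send the request and return the model's visible response text."""
    data = json.dumps(build_review_request(sample_text)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```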




&lt;h2&gt;
  
  
  The Interesting Engineering Problems
&lt;/h2&gt;

&lt;p&gt;Building this revealed several tensions that don't show up in papers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory vs. control.&lt;/strong&gt; The more capable the system became at retaining context, the more important it became to think carefully about what it should be able to forget. This isn't just a technical problem — it's an interaction design problem. What does it mean to trust a system with your cognitive history?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personalization vs. blind spots.&lt;/strong&gt; A model fine-tuned on one person's interactions might get very good at that person's specific reasoning patterns — but could also amplify their blind spots. Fine-tuning doesn't just transfer knowledge; it transfers biases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cold-start loop.&lt;/strong&gt; To train a personalized model, you need data. To generate good data, you need an already-capable system. This circular dependency is real — breaking out of it requires either a large, carefully curated seed dataset, or accepting that early data quality will be uneven and iterating from there.&lt;/p&gt;

&lt;p&gt;These aren't purely engineering problems. They're user experience problems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Status
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Working:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plugin system (all 6 plugins functional)&lt;/li&gt;
&lt;li&gt;Memory ingestion and hybrid search&lt;/li&gt;
&lt;li&gt;Conversation processing and training sample generation&lt;/li&gt;
&lt;li&gt;Feishu (Lark) channel integration for messaging&lt;/li&gt;
&lt;/ul&gt;
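&lt;p&gt;"Hybrid search" here means merging keyword results (SQLite) with vector results (sqlite-vec / Qdrant). One common way to combine two ranked lists is reciprocal-rank fusion; a minimal sketch, assuming both backends return ids ordered by relevance (the function name and the k value are illustrative):&lt;/p&gt;

```python
def rrf_merge(keyword_ids, vector_ids, k=60):
    """Reciprocal-rank fusion: merge two ranked id lists into one.

    k dampens the influence of top ranks; 60 is a commonly used default.
    """
    scores = {}
    for ranking in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```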

&lt;p&gt;&lt;strong&gt;In progress:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory retrieval accuracy tuning (&lt;code&gt;memory_search&lt;/code&gt; returns empty in some cases — data is confirmed present in SQLite, root cause under investigation)&lt;/li&gt;
&lt;li&gt;Automated fine-tuning pipeline&lt;/li&gt;
&lt;li&gt;Agent behavior dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a research prototype. The architecture is established; the self-improvement loop is still being assembled.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The mobile internet revolution restructured how humans relate to information. The AI revolution is doing something at a deeper layer: restructuring how humans relate to &lt;em&gt;cognition itself&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In that context, the question of &lt;em&gt;whose intelligence&lt;/em&gt; an AI system reflects matters enormously.&lt;/p&gt;

&lt;p&gt;Cloud models will keep getting more capable. But there's a complementary space — not competing with cloud models, but orthogonal to them — for systems shaped by and for specific individuals.&lt;/p&gt;

&lt;p&gt;Personal AI infrastructure might look like what home servers looked like in the early internet era: niche, technically demanding, not for everyone. But the direction feels worth exploring.&lt;/p&gt;

&lt;p&gt;Another way to think about personal AI is not purely as a productivity tool, but as an experimental medium.&lt;/p&gt;

&lt;p&gt;If an agent is gradually shaped by a specific person's interactions, it may begin to reflect that person's reasoning style, priorities, and mental models. In that sense, a personalized AI system could become a kind of cognitive mirror — or even a simulation artifact.&lt;/p&gt;

&lt;p&gt;Such systems might not always be useful in the traditional sense. But they could still be valuable as a way to explore different cognitive trajectories.&lt;/p&gt;

&lt;p&gt;For example, a highly personalized agent could be placed into simulated environments — economic models, social scenarios, or narrative worlds — to observe how its reasoning evolves over time.&lt;/p&gt;

&lt;p&gt;Many people enjoy strategy or simulation games because they allow us to explore alternative possibilities. Personal AI systems might eventually enable something similar at a cognitive level: experimenting with how different ways of thinking interact with different environments.&lt;/p&gt;

&lt;p&gt;In that sense, personal AI might become not just a tool, but a sandbox for exploring possible forms of intelligence — including our own.&lt;/p&gt;




&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/vanessa49/personal-ai-agent-lab" rel="noopener noreferrer"&gt;&lt;strong&gt;github.com/vanessa49/personal-ai-agent-lab&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built on: OpenClaw &lt;code&gt;2026.3.11&lt;/code&gt; · Ollama · SQLite + sqlite-vec · Qdrant · Docker&lt;/p&gt;

&lt;p&gt;Ideas, feedback, and experiments welcome.&lt;/p&gt;




&lt;h2&gt;
  
  
  Open Questions
&lt;/h2&gt;

&lt;p&gt;Building a personal AI agent raises a number of unresolved questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How should long-term memory be managed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If an AI system accumulates years of interaction history, deciding what to keep, compress, or forget becomes both a technical and philosophical challenge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What should be remembered — and what should be forgotten?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Memory persistence can make an agent more useful, but it also raises questions about how much cognitive history a system should retain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does extreme personalization create blind spots?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A model trained heavily on a single user's interactions might gradually mirror that person's reasoning patterns — including their biases or assumptions.&lt;/p&gt;

&lt;p&gt;But this isn't necessarily a flaw; in some contexts it could even be valuable.&lt;/p&gt;

&lt;p&gt;For example, a highly personalized agent could become a &lt;strong&gt;simulation tool&lt;/strong&gt; — allowing users to explore how their own thinking patterns evolve across different scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can personal AI become a sandbox for cognitive experiments?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Personalization doesn't have to be treated purely as a productivity feature; it could also enable new forms of experimentation.&lt;/p&gt;

&lt;p&gt;A personalized agent might be placed into simulated environments — social, economic, or narrative — to observe how its reasoning develops over time.&lt;/p&gt;

&lt;p&gt;This begins to resemble a kind of &lt;strong&gt;cognitive simulation platform&lt;/strong&gt;, where AI agents shaped by different individuals explore different trajectories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can self-generated training data meaningfully improve behavior over time?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If conversation logs are transformed into training samples, a personal AI system might gradually refine itself based on real usage patterns — but the long-term stability of such loops is still an open question.&lt;/p&gt;




&lt;p&gt;If you're building similar systems or experimenting with personal AI infrastructure, I'd be very curious to hear how you're approaching these questions.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
