Why user→assistant segmentation fails for personal AI fine-tuning
I built a pipeline to generate training samples from my personal AI conversation history — GPT exports, processed into user → assistant pairs.
Then I manually reviewed a batch and found a problem I hadn't anticipated. To validate the intuition, I ran a comparison across real conversations spanning 2023–2026.
Here's what the data showed.
The Problem: Intermediate States Masquerading as Conclusions
Most of my conversations don't follow a Q&A pattern. They iterate:
Me: I'm thinking about running a local AI on my laptop.
AI: It depends on the hardware...
Me: My laptop only has 16GB RAM.
AI: That could be a limitation...
Me: Ah. So maybe the question isn't
"how to run AI on my laptop".
It's whether my laptop can run it at all —
and if not, what kind of setup I'd actually need.
The traditional pipeline captured sample #1 as:
```json
{
  "instruction": "I'm thinking about running a local AI on my laptop.",
  "output": "It depends on the hardware..."
}
```
But this represents the first answer, not the final understanding that emerged from the conversation.
The common user → assistant segmentation assumes that each assistant message is a terminal answer. In reality, many personal AI conversations look more like a reasoning process:
hypothesis → test → correction → refinement
The insight appears at the end of the trajectory, not at the first reply.
Structural difference:

Traditional fine-tuning treats conversations as isolated question-answer pairs.
Trajectory-based training instead models them as evolving reasoning paths, where earlier responses are intermediate states rather than final outputs.
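The structural difference can be sketched in a few lines of Python. The function names and message format here are illustrative, not the pipeline's actual API:

```python
# Minimal sketch contrasting the two segmentation strategies.
# `extract_qa_pairs` and `extract_trajectory` are hypothetical names.

def extract_qa_pairs(messages):
    """Traditional: every (user, assistant) turn becomes an isolated sample."""
    samples = []
    for i in range(0, len(messages) - 1, 2):
        if messages[i]["role"] == "user" and messages[i + 1]["role"] == "assistant":
            samples.append({
                "instruction": messages[i]["text"],
                "output": messages[i + 1]["text"],
            })
    return samples

def extract_trajectory(messages):
    """Trajectory-based: the whole exchange is context; the final
    message in the reasoning path is the training target."""
    return {
        "context": [m["text"] for m in messages[:-1]],
        "target": messages[-1]["text"],
    }

conversation = [
    {"role": "user", "text": "I'm thinking about running a local AI on my laptop."},
    {"role": "assistant", "text": "It depends on the hardware..."},
    {"role": "user", "text": "My laptop only has 16GB RAM."},
    {"role": "assistant", "text": "That could be a limitation..."},
]

qa = extract_qa_pairs(conversation)      # 2 isolated samples
traj = extract_trajectory(conversation)  # 1 sample preserving the full path
```

The Q&A extractor treats the first reply as a finished answer; the trajectory extractor keeps the earlier turns as intermediate states leading to the target.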
The Numbers
I applied both methods to the same conversation and compared the results:
| Metric | Traditional Q&A | Cognitive trajectory | Difference |
|---|---|---|---|
| Samples generated | 35 | 11 | −68.6% |
| Single-turn samples | 35 (100%) | 6 (54.5%) | — |
| Multi-turn iteration samples | 0 (0%) | 5 (45.5%) | +5 |
| Avg turns per sample | 2 | 3.6 | +80% |
The traditional method produced 35 independent samples — and captured zero iterative exchanges. The cognitive method produced 11 samples, but 5 of them preserved complete thought trajectories that the traditional method lost entirely.
Scaled to the full dataset of 1,122 conversations (2023–2026), the same pattern holds:
- 259,534 cognitive nodes extracted
- 547,836 training samples generated
- 15,506 refinement chains identified — sequences where an idea was explicitly corrected and revised
- Average refinement chain length: 2.14 steps
Note on edge counts:
`iteration_final` edges are convergence shortcuts added after refinement-chain detection — they link the start and end of a correction chain directly, rather than replacing the intermediate steps. This means `iteration_final` edges are additive, not mutually exclusive with the base sequential edges, so edge-type percentages can sum above 100%.
The relationship distribution across 273,918 cognitive edges:
| Relation type | Count | % | Meaning |
|---|---|---|---|
| follows | 149,085 | 54.4% | Sequential continuation |
| derives | 25,734 | 9.4% | Logical inference |
| responds | 20,651 | 7.5% | Direct reply |
| hypothesizes | 18,818 | 6.9% | Hypothesis formation |
| refines | 17,571 | 6.4% | Explicit correction |
| iteration_final | 15,506 | 5.7% | Convergence shortcut: chain start → chain end |
| restarts | 15,187 | 5.5% | Topic restart |
| speculates | 10,674 | 3.9% | Speculative reasoning |
| clarifies | 613 | 0.2% | Clarification |
| contrasts | 79 | 0.03% | Perspective shift |
A few things stand out. First, follows dropped from ~70% (early dataset) to 54% as the dataset scaled — the pipeline now detects a wider vocabulary of cognitive events, so fewer edges fall through to the default. Second, four new relation types appeared (hypothesizes, restarts, speculates, clarifies) that weren't in the initial schema — these emerged from the data rather than being pre-defined, which is exactly the direction the design was pointing toward.
The refines and iteration_final samples together represent roughly 12% of all edges. These are often the moments where the conversation moves furthest from the model's baseline response and closer to the user's intended reasoning — and they're the samples least likely to appear in traditional Q&A segmentation.
What to Do Instead
Option 1: Cognitive node segmentation
Instead of user/assistant turn boundaries, segment by semantic shift (topic change, correction markers, or new reasoning step) and build samples as:
[node_t-2, node_t-1, node_t] → node_t+1
This preserves context across turn boundaries and makes the training target the next thought, not the next response.
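A minimal sliding-window sketch of this sample construction (the node format is an assumption, not the pipeline's schema):

```python
def build_trajectory_samples(nodes, window=3):
    """Slide a window of `window` nodes over a segmented conversation;
    each sample's target is the node that follows the window."""
    samples = []
    for t in range(window, len(nodes)):
        samples.append({
            "context": nodes[t - window:t],  # [node_t-2, node_t-1, node_t]
            "target": nodes[t],              # node_t+1: the next thought
        })
    return samples

nodes = ["hypothesis", "test", "correction", "refinement", "conclusion"]
samples = build_trajectory_samples(nodes)
# first sample: context = ["hypothesis", "test", "correction"], target = "refinement"
```

Because windows straddle speaker boundaries, a sample can end mid-reasoning on a user correction and target the revised idea that follows it.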
The edge types to track:
```
// derives: logical consequence ("so therefore...")
// refines: correction or improvement ("actually, instead...")
// contrasts: perspective shift ("on the other hand...")
// follows: sequential continuation (default)
// iteration_final: convergence shortcut from chain start to chain end
```
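One simple way to assign these edge types is marker-phrase matching. The marker lists below are illustrative stand-ins, not the pipeline's actual detection logic:

```python
# Hypothetical marker phrases per edge type; `follows` is the fallback.
EDGE_MARKERS = {
    "derives":   ("so therefore", "which means", "it follows that"),
    "refines":   ("actually", "instead", "correction:"),
    "contrasts": ("on the other hand", "but conversely"),
}

def classify_edge(text):
    lowered = text.lower()
    for edge_type, markers in EDGE_MARKERS.items():
        if any(marker in lowered for marker in markers):
            return edge_type
    return "follows"  # default: sequential continuation

classify_edge("Actually, instead of a laptop, use a server.")  # -> "refines"
classify_edge("And then I checked the RAM usage.")             # -> "follows"
```

This also explains why `follows` dominates the distribution: it absorbs everything the marker vocabulary fails to recognize.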
Option 2: Track and weight refinement chains
Identify correction chains explicitly. A refinement chain looks like:
initial idea → user challenges → AI revises → convergence
Mark the final node as iteration_final and weight it higher during training. In the current pipeline:
```python
weight_map = {
    'iteration_final': 2.5,  # last refinement × depth bonus
    'refines':         2.0,  # explicit correction
    'speculates':      1.5,  # speculative reasoning
    'derives':         1.5,  # logical consequence
    'hypothesizes':    1.3,  # hypothesis formation
    'restarts':        1.3,  # topic restart
    'follows':         1.0,  # default
}
# Plus time decay: older samples are down-weighted
# weight *= e^(-age_in_days / 730)
# Encourages the model to learn who you are *now*
```
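Putting the relation weight and the time decay together, a sample's final weight could be computed like this (a sketch; the function name is hypothetical, and the 730-day constant comes from the decay comment above):

```python
import math

WEIGHT_MAP = {
    "iteration_final": 2.5, "refines": 2.0, "speculates": 1.5,
    "derives": 1.5, "hypothesizes": 1.3, "restarts": 1.3, "follows": 1.0,
}

def sample_weight(relation, age_in_days, decay_days=730):
    """Relation-type base weight scaled by exponential time decay."""
    base = WEIGHT_MAP.get(relation, 1.0)
    return base * math.exp(-age_in_days / decay_days)

sample_weight("refines", 0)    # -> 2.0 (fresh correction, full weight)
sample_weight("follows", 730)  # -> ~0.37 (year-old default edge, decayed)
```

A two-year-old `iteration_final` sample and a fresh `follows` sample end up with comparable weights, which is the intended trade-off: recency and cognitive salience both count.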
In the current dataset, 38% of samples carry weight > 1.0 — higher than the initial 15–25% estimate. The difference is the expanded relation vocabulary: hypothesizes, speculates, and restarts all carry above-baseline weights, and they're more prevalent than initially anticipated. This isn't a bug — it reflects the actual distribution of cognitive events in the data. The baseline follows edges (54%) still dominate; it's the non-default types that are being weighted up.
Option 3: Preserve temporal sequence
Timestamps aren't just metadata in personal AI training. They're features.
Two samples with similar content but different timestamps aren't duplicates — they're evidence of cognitive evolution. The current pipeline preserves original conversation timestamps on all nodes (100% integrity across 259,534 nodes), which enables time-decay weighting and, eventually, cross-time analysis of how thinking changes on the same topic.
A Subtler Problem: The Edge Vocabulary
Even after fixing the segmentation problem, there's a deeper assumption worth flagging.
The current pipeline uses a fixed set of relation types — derives, refines, contrasts, follows. This vocabulary was designed from an engineering perspective: it works for cause-and-effect reasoning. But some connections between ideas are associative, aesthetic, or simply "these belong together."
Interestingly, running the pipeline on real data has already pushed back on this assumption: four relation types (hypothesizes, restarts, speculates, clarifies) emerged from the detection logic that weren't in the original schema. The vocabulary is already partially self-extending.
One further direction: leave the relation type as a fully free field, accumulate data without pre-labeling, then run a clustering pass to discover what relation types naturally appear in this person's thinking. Probably unreliable at current data volumes, but worth designing toward from the start — which is why the schema uses a flexible tags array alongside the fixed relation field, rather than a strict enum.
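A first step toward that free-field design costs very little: store whatever label the detector emits, then periodically survey what has accumulated. The sketch below shows only that counting pass (hypothetical function and field names), not an actual clustering:

```python
from collections import Counter

def survey_relation_labels(edges, known_types):
    """Count free-field relation labels and flag any outside the
    original schema as candidate new edge types."""
    counts = Counter(edge["relation"] for edge in edges)
    emergent = {label: n for label, n in counts.items() if label not in known_types}
    return counts, emergent

edges = [
    {"relation": "follows"}, {"relation": "refines"},
    {"relation": "hypothesizes"}, {"relation": "hypothesizes"},
]
counts, emergent = survey_relation_labels(
    edges, known_types={"derives", "refines", "contrasts", "follows"}
)
# emergent == {"hypothesizes": 2}: a label the original schema didn't define
```

This is how the four emergent types above would surface in practice: not pre-defined, just counted once they appear often enough to matter.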
The Broader Point
This problem is more severe for personal AI than for general fine-tuning.
With millions of training samples, structural errors average out. With a few hundred personal conversations, every assumption baked into the segmentation pipeline gets amplified in the model's behavior.
If your segmentation assumes Q&A but your conversations are iterative research, you'll train a model that answers like a chatbot rather than reasoning like you.
The fix isn't complicated. But it requires noticing the assumption first.
Dataset design is ontology design — the structure you impose on data determines what patterns the model can learn. Choose carefully.
Current System
- 1,122 conversations processed (GPT exports, 2023–2026)
- 259,534 cognitive nodes, 273,918 edges, 547,836 training samples
- 15,506 refinement chains, average length 2.14 steps
- All 259,534 nodes carry original conversation timestamps (100% integrity)
- Pipeline: cognitive chunking → refinement chain tracking → iteration_final generation → weighted sampling
- Fine-tuning: pending (QLoRA on qwen2.5:7b, RTX 4060)
🔗 personal-ai-agent-lab on GitHub
*This article focuses on the engineering side of the pipeline. For the conceptual discussion behind the idea, see:*