Frontier LLMs are good at figuring out what to say. They're bad at saying it the way you would.
I've spent months using Claude and GPT-4o to draft content for a personal publishing system. The system prompt is detailed: first person, short sentences mixed with long ones, no LinkedIn buzzwords, lead with specifics. The drafts come back structurally correct every time. And they sound like a talented intern who studied my writing for an afternoon.
Here's what a prompted frontier model produces when asked to write in my voice:
By maintaining complete control over the hardware infrastructure, I eliminate the need to navigate third-party terms of service entirely.
Here's what I actually write:
The data lives on hardware I control. There's no terms of service to read because there's no service.
Same idea. One sounds like me. The other sounds like a model following instructions about how I sound.
Prompt engineering has a ceiling for voice matching. I built the thing that goes above it.
The architecture: two models, one job each
Most "personal AI" projects I've seen collapse three different jobs into one model: reasoning about what to write, grounding it in facts, and matching the author's voice. Those are three different capabilities with three different training signals. Collapsing them means each one compromises the others.
My architecture separates them:
[Frontier model generates content] → [Fine-tuned 3B model rewrites in my voice] → output
Tier 1 is a frontier model (Claude Opus, Llama 70B, whatever fits the task). It receives context from my work: git commits, knowledge graph facts, calendar events. It handles the reasoning, structure, and factual grounding. It's good at this.
Tier 2 is a Qwen 2.5 3B model fine-tuned on 75,329 samples of my actual writing. It doesn't reason. It doesn't need to be smart. It rewrites. It takes competent-but-generic text and makes it sound like me.
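The hand-off between the two tiers is a single call. A minimal sketch of the shape, with the frontier call elided and the voice model assumed to sit behind a local OpenAI-compatible endpoint (the URL, prompt wording, and function names are illustrative, not my actual pipeline code):

```python
import requests

def draft_with_frontier(task: str, context: str) -> str:
    """Tier 1: frontier model does the reasoning, structure, and factual grounding."""
    ...  # call to Claude / Llama 70B with git commits, knowledge graph facts, calendar events

def rewrite_in_voice(draft: str) -> str:
    """Tier 2: the fine-tuned 3B model rewrites the draft, nothing more."""
    resp = requests.post('http://localhost:8080/v1/chat/completions', json={
        'messages': [
            {'role': 'system', 'content': 'Rewrite the following text in my voice.'},
            {'role': 'user', 'content': draft},
        ],
    })
    return resp.json()['choices'][0]['message']['content']

final = rewrite_in_voice(draft_with_frontier('draft a post about the project', context='...'))
```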
The 3B parameter count was deliberate. I evaluated Phi-3 Mini and Llama 3.2 3B as well. Qwen won on three criteria: it fits comfortably on consumer hardware (about 2GB quantized), a 3B model has more than enough capacity for style transfer since it's not doing reasoning, and the smaller parameter space means the voice signal doesn't get drowned out by general capabilities.
The data: 23 years of my own writing
The training data comes from a personal data warehouse I've built over several years. I extracted every piece of text I've written across 12 platforms, spanning back to 2004.
| Source | Samples | What it captures |
|---|---|---|
| iMessage | 34,987 | Casual conversation, how I talk to people I know |
| Google Chat | 10,350 | Work chat from several years at law firms |
| Plaud recordings | 10,514 | My actual spoken words, transcribed |
| ChatGPT prompts | 7,645 | How I phrase technical requests |
| Instagram DMs | 4,535 | Social, casual |
| Gmail sent | 5,088 | Email across the formality spectrum |
|  | 1,876 | Social posts and comments |
| Claude prompts | 303 | Technical prompts |
| SMS | 17 | Text messages |
| Old Outlook (PST) | 14 | Work email from early career |
| Total | 75,329 | 23 years, ~3.1M tokens |
What's interesting about this corpus isn't the volume. It's the temporal range. You can watch a voice form across two decades. My emails from 2005 don't sound like my iMessages from 2024, but they share structural patterns: compression, directness, a preference for concrete over abstract.
The extraction script
The extraction runs against a PostgreSQL database on my home server. Each source has its own table with a different schema, so the script handles 12 different query patterns.
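The core loop is unremarkable. A sketch of its shape, with table and column names as placeholders rather than my real schema:

```python
import psycopg2

# One query per source. Table and column names below are placeholders.
SOURCE_QUERIES = {
    'imessage': "SELECT text, sent_at FROM imessage_messages WHERE is_from_me",
    'gmail_sent': "SELECT body, sent_at FROM gmail_sent_messages",
    'google_chat': "SELECT message_text, created_at FROM gchat_messages WHERE sender = 'me'",
    # ...one entry per platform, 12 in total
}

def extract_all(dsn: str) -> list[dict]:
    """Pull every sample of my own writing, tagged with its source."""
    samples = []
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for source, query in SOURCE_QUERIES.items():
            cur.execute(query)
            for text, sent_at in cur.fetchall():
                samples.append({'source': source, 'text': text, 'sent_at': sent_at})
    return samples
```

A few things I learned building it: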
Filter aggressively. Raw message data is full of noise. I strip:
- Tapback reactions (iMessage's "Loved", "Liked", etc.)
- URL-only messages
- Emoji-only messages
- Anything under 10 characters
- Automated emails (IFTTT, cron notifications, shipping confirmations)
- Quoted reply text and email signatures ("Sent from my iPhone", forwarded message headers)
The quote-stripping matters more than you'd expect. Without it, half your Gmail training data is other people's writing with your two-line reply appended.
```python
import re

def strip_quoted_email(body):
    """Strip quoted reply text, forwarded headers, and signatures."""
    lines = body.split('\n')
    clean = []
    for line in lines:
        stripped = line.strip()
        # Everything below the first quoted line is someone else's writing
        if stripped.startswith('>'):
            break
        # "On <date>, <person> wrote:" marks the start of a quoted reply
        if re.match(r'^On .+ wrote:$', stripped):
            break
        if stripped.startswith('-----Original Message-----'):
            break
        clean.append(line)
    result = '\n'.join(clean).strip()
    # Strip signatures: keep only the text before the first marker found
    for marker in ['\n-- \n', '\nSent from my iPhone',
                   '\nSent from my Mac', '\nGet Outlook for']:
        idx = result.find(marker)
        if idx > 0:
            result = result[:idx].strip()
    return result
```
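The rest of the filters are mechanical. A rough sketch of the message-level checks from the list above (the tapback strings and thresholds are mine; adjust for your data):

```python
import re

# iMessage tapbacks arrive as literal text: 'Loved "..."', 'Liked "..."', etc.
TAPBACK_PREFIXES = ('Loved ', 'Liked ', 'Disliked ', 'Laughed at ', 'Emphasized ', 'Questioned ')

def keep_message(text: str) -> bool:
    stripped = text.strip()
    if len(stripped) < 10:                        # too short to carry any voice signal
        return False
    if stripped.startswith(TAPBACK_PREFIXES):     # tapback reactions
        return False
    if re.fullmatch(r'https?://\S+', stripped):   # URL-only messages
        return False
    if not re.search(r'[A-Za-z]', stripped):      # emoji- or punctuation-only
        return False
    return True
```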
Deduplicate by prefix. Messages get forwarded, quoted, copied. A 200-character prefix hash catches most of it without being so aggressive that you lose legitimately similar messages.
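The dedup pass is a few lines, assuming records shaped like the extraction sketch above:

```python
import hashlib

def dedupe_by_prefix(samples: list[dict], prefix_len: int = 200) -> list[dict]:
    """Drop records whose first 200 characters have been seen before."""
    seen, unique = set(), []
    for s in samples:
        key = hashlib.sha256(s['text'][:prefix_len].encode('utf-8')).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique
```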
Format as ShareGPT conversations. The training framework (unsloth + trl) expects chat-format data. Each record becomes a two-turn conversation: a system prompt establishing context, a "human" turn with the preceding message or context, and a "gpt" turn containing my actual writing.
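One record, as it goes to disk; the system prompt wording here is illustrative, not my exact prompt:

```python
import json

def to_sharegpt(context: str, my_text: str, source: str) -> dict:
    """Two-turn ShareGPT record: context in, my actual writing out."""
    return {
        'conversations': [
            {'from': 'system', 'value': f"Respond in the author's voice. Channel: {source}."},
            {'from': 'human', 'value': context},   # the message or context I was responding to
            {'from': 'gpt', 'value': my_text},     # what I actually wrote
        ]
    }

with open('voice_train.jsonl', 'a') as f:
    record = to_sharegpt('can you review this PR when you get a chance',
                         'yeah, I will review it later today when I get home',
                         'imessage')
    f.write(json.dumps(record) + '\n')
```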
Privacy before training
This is the part most QLoRA tutorials skip entirely, because most people train on public datasets.
Before writing a single training script, I built sanitization into the extraction layer. No real names of private individuals make it into training data. No credentials, no specific addresses, no dollar amounts. The system generalizes by design.
This isn't optional when your training data is your own life. If you train on 23 years of personal messages without sanitization, the model will happily reproduce your friends' names, your home address, and your credit card's last four digits in its output.
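My actual rules live in the extraction layer and are specific to my data. A generic flavor of the approach looks like this; the patterns and placeholder names are illustrative, not my real rule set:

```python
import re

REDACTIONS = [
    (re.compile(r'\$\d[\d,]*(\.\d{2})?'), '[AMOUNT]'),            # dollar amounts
    (re.compile(r'\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b'), '[PHONE]'),  # US phone numbers
    (re.compile(r'\b\d+ [A-Z][a-z]+ (St|Ave|Rd|Dr|Blvd)\b'), '[ADDRESS]'),
]

PRIVATE_NAMES = {'Alice', 'Bob'}   # the real list comes from my contacts, not hard-coded

def sanitize(text: str) -> str:
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    for name in PRIVATE_NAMES:
        text = re.sub(rf'\b{re.escape(name)}\b', '[NAME]', text)
    return text
```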
Training: 17 hours on consumer hardware
Hardware: AMD Radeon 8060S (Strix Halo, 96GB unified VRAM), ROCm
Base model: unsloth/Qwen2.5-3B-Instruct-bnb-4bit
Method: QLoRA (4-bit quantized frozen base + LoRA adapter)
LoRA configuration
```python
from unsloth import FastLanguageModel

# 4-bit base load; max_seq_length here is illustrative
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='unsloth/Qwen2.5-3B-Instruct-bnb-4bit',
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj',
                    'gate_proj', 'up_proj', 'down_proj'],
    lora_alpha=32,
    lora_dropout=0,
    bias='none',
    use_gradient_checkpointing='unsloth',
)
```
A few notes on these choices:
Rank 16 is the sweet spot I landed on. Rank 8 underfit (the model produced generic responses indistinguishable from the base). Rank 32 was slightly better on short messages but no meaningful improvement on longer-form writing, and it doubled the adapter size. For a style-transfer task on a 3B base, rank 16 captures enough voice signal without overfitting.
All seven projection targets (not just q/k/v). For voice transfer, the gate, up, and down projections in the MLP layers matter as much as the attention projections. The model needs to learn word choice and sentence rhythm, not just attention patterns. Targeting only q/k/v produced outputs that had roughly correct structure but wrong vocabulary.
Zero dropout. The training data is clean (my own writing, heavily filtered) and the task is narrow. Dropout helps with generalization on noisy data; here it just slows convergence.
Training dynamics
14,127 steps across 3 epochs. Loss started around 1.4 and converged to ~0.98 by the end. The learning rate used a cosine schedule from 2e-4 down to near zero.
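For reference, the trainer setup that produces those numbers looks roughly like this. The learning rate, schedule, and epoch count are the real values; the batch split is my reconstruction (an effective batch of 16 is what yields 14,127 steps over 75,329 samples), and the rest is illustrative:

```python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,                 # the LoRA-wrapped model from get_peft_model above
    train_dataset=dataset,       # the ShareGPT JSONL, formatted with the chat template
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,   # effective batch of 16
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type='cosine',
        warmup_ratio=0.03,
        logging_steps=50,
        output_dir='outputs',
    ),
)
trainer.train()
```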
The whole run took about 17 hours. I'd tried Qwen 2.5 7B first and that was 42 hours for the same epochs. Given that this model's only job is style transfer, the 3B was the right call: faster iteration, similar voice quality.
Previous failures (worth mentioning)
Qwen 3.5 9B: OOM during training. Loaded fine at ~19GB in 4-bit, but the GatedDeltaNet hybrid attention layers in the newer Qwen architecture consume too much activation memory during the forward pass. The model worked for inference but couldn't train on this hardware.
This is worth knowing if you're planning your own fine-tune: a model that runs fine for inference may not have enough headroom for training, since training needs to store activations for the backward pass.
The adapter: 115MB of voice
The final adapter is a 115MB safetensors file. The base Qwen 2.5 3B model is about 2GB quantized. Together they're smaller than a single Llama 70B layer.
The merged fp16 model (adapter baked into the base weights) is about 6GB, which is useful for GGUF conversion if you want to serve it via llama.cpp without loading the adapter separately at runtime.
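Unsloth will emit all three artifacts directly from the trained model; the directory names here are placeholders:

```python
# Adapter only (~115MB): load on top of the quantized base at inference time
model.save_pretrained('nic-voice-adapter')
tokenizer.save_pretrained('nic-voice-adapter')

# Merged fp16 weights (~6GB): adapter baked into the base
model.save_pretrained_merged('nic-voice-merged', tokenizer, save_method='merged_16bit')

# GGUF for llama.cpp, quantized back down for serving
model.save_pretrained_gguf('nic-voice-gguf', tokenizer, quantization_method='q4_k_m')
```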
Evaluation: blind A/B comparison
I ran 12 prompts through both the fine-tuned model and the base Qwen 2.5 3B, randomized the order, and compared outputs blind. The difference is stark.
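The blinding is nothing fancy: shuffle which model shows up as A or B, keep the key in a separate structure, score before you peek. A sketch, with `generate_base` and `generate_tuned` standing in for whatever inference calls you use:

```python
import json
import random

def blind_pairs(prompts, generate_base, generate_tuned, out_path='ab_eval.jsonl'):
    key = []   # which model was A/B for each prompt; don't read until you've scored
    with open(out_path, 'w') as f:
        for i, prompt in enumerate(prompts):
            outputs = [('base', generate_base(prompt)), ('tuned', generate_tuned(prompt))]
            random.shuffle(outputs)
            key.append({'prompt': i, 'A': outputs[0][0], 'B': outputs[1][0]})
            f.write(json.dumps({'prompt': prompt, 'A': outputs[0][1], 'B': outputs[1][1]}) + '\n')
    return key
```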
Prompt: "can you review this PR when you get a chance, it's the auth middleware rewrite"
Base Qwen (no adapter):
Hey there! Absolutely, I'll squeeze in a review of the auth middleware rewrite when I get a chance. I've been meaning to check it out, especially given how significant the changes must be. Let me know if you've had a chance to test any edge cases or if there are any specific feedback points you're looking for. I'll aim to get through it soon. Cheers!
Fine-tuned (nic-voice):
yeah, I'm in the office today. I will review it later today when I get home
The base model writes like a chatbot. The fine-tuned model writes like a person. Specifically, like me: terse, no greeting flourish, commits to a concrete time, moves on.
Prompt: "thoughts on the new iphone"
Base: A 500-character enthusiastic review touching on cameras, battery life, design philosophy, and the tech conversation at large.
Fine-tuned: "I have the 17"
That's not a sophisticated response. It's also exactly what I'd actually text someone who asked me that question.
Where it works and where it doesn't
The voice transfer works best on the kind of writing the training data is heaviest in: casual conversation, work chat, short email. The model nails my compression ratio, my tendency to answer questions with minimal context, my habit of skipping greetings.
It's weaker on long-form writing. The training data is dominated by short messages (iMessages, chat), so the model's instinct is to be brief. For the two-tier architecture, that's actually fine: long-form content comes from the frontier model, and the voice adapter handles the rewrite. But if you trained this for standalone generation, you'd want to oversample your longer-form writing to balance the distribution.
What I'd do differently
Oversample formal writing. My corpus is 47% iMessage, which means the model's default register is very casual. Weighting emails and longer-form writing 2-3x would produce a more balanced adapter.
Add a style-routing prompt. Instead of one adapter for all contexts, the system prompt could specify "email voice" vs. "chat voice" vs. "post voice." The training data already has source labels; I just didn't use them during training.
Evaluate on the actual target task. My A/B eval tests conversational responses, which is interesting but not the real use case. The real use case is rewriting AI-generated content drafts. I should have included prompts like "rewrite this paragraph in your voice" with actual frontier-model output as input.
The stack, if you want to build this
- Data extraction: Python + psycopg2 against PostgreSQL. The hard part is handling 12 different table schemas and filtering noise.
- Training: unsloth + trl (SFTTrainer). Unsloth handles the 4-bit quantization and gradient checkpointing; trl handles the training loop.
- Hardware: Any GPU with 8GB+ VRAM can fine-tune a 3B model with QLoRA. I used an AMD Radeon 8060S, but an RTX 3060 12GB or even an M1 Mac would work for a 3B base.
- Serving: llama.cpp with the --lora flag for the GGUF adapter, or load the merged model directly. I serve mine through a local API gateway (see the sketch after this list).
- Format: ShareGPT JSONL. Each record is a conversation with system/human/assistant turns.
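If you'd rather call it from Python than shell out to the llama.cpp binaries, llama-cpp-python exposes the same adapter loading; the model and adapter paths here are placeholders:

```python
from llama_cpp import Llama

llm = Llama(
    model_path='qwen2.5-3b-instruct-q4_k_m.gguf',   # quantized base
    lora_path='nic-voice-adapter.gguf',             # the adapter, converted to GGUF
    n_ctx=2048,
)

resp = llm.create_chat_completion(messages=[
    {'role': 'system', 'content': 'Rewrite the following in my voice.'},
    {'role': 'user', 'content': 'By maintaining complete control over the hardware infrastructure...'},
])
print(resp['choices'][0]['message']['content'])
```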
You don't need 75K samples. You could start with a few thousand emails from your sent folder and a simple extraction script. The architecture (frontier model for content, fine-tuned small model for voice) works regardless of corpus size. The small model just gets better as you feed it more.
The point isn't the scale. The point is that voice is learnable, and it's learnable separately from reasoning, and once you separate those concerns, both get better.

