Jonathan P.
I Built an AI Clone of Myself - Fine-Tuning with 10 Years of Data and Voice Cloning

I love LLMs. I use them every single day, whether it’s for work or just optimizing my personal life. Recently, I wanted to dive deeper into fine-tuning—the process of adding a specialized training layer to an existing model so it behaves exactly how you want, without the massive time and resource commitment of training from scratch. I wanted to get my hands dirty because I learn best by doing.

Then it hit me: I’ve been active on an internet forum for over 10 years. I’ve posted thousands of messages there—a goldmine of data! I realized that with a few tools, I could potentially build an AI clone of myself: one that writes, thinks, and even speaks in my voice.

I know, it sounds a bit dystopian—and honestly, it is—but it’s also incredibly fun. To be clear: I am keeping all data strictly confidential. I won't be publishing the dataset or the final model for obvious privacy reasons.

A quick disclaimer: If you decide to replicate this, you MUST use your own data. Please do not create an AI clone of another person without their explicit consent.

Speaking of privacy, I set a strict rule for this project: the data never leaves my control. Ideally, it stays on my local machine. If it has to go to a server, it must be within the EU, with zero data retention and full GDPR compliance.

The Trial and Error Phase

I’ll be honest: I didn't get a working clone on the first try. Not even close. It was a long journey of trial and error. At first, I aimed too high: trying to train Mistral Small 3.2 24B on an OVH server with an L40S GPU. The training went okay, but I could never successfully save the weights to reuse the model due to persistent Out Of Memory (OOM) errors.

I decided to pivot to local training on my own rig. I experimented with various models like Qwen3 and Ministral, hitting different roadblocks and variable results with each. In the end, I found the "sweet spot" with Google’s Gemma 3. That’s the model I used for the final clone.

Finally, I integrated text-to-audio (and vice versa) libraries so I could actually talk to "myself" via microphone and hear my own voice respond. 😱

The Final Tech Stack

Component        Technology
Hardware         RTX 4060 Ti (16GB)
Library          Unsloth
Base Model       Gemma 3 12B
Data Source      10+ years of forum posts
Speech-to-Text   OpenAI Whisper
Text-to-Speech   Qwen3-TTS

Without further ado, let’s dive into how I actually built this thing!

Preparing the Dataset

I successfully scraped my forum messages, but they weren't ready for fine-tuning straight out of the box. There is a massive amount of "data curation" required before a model can learn from them.

For fine-tuning, we need an "input" > "output" structure: each training example pairs a prompt or question (the input) with my historical reply (the output). This teaches the LLM that when it sees a new input of the same kind, it should respond in my style.

Handling Personal Information (PII)

Normally, the top priority would be managing PII (Personally Identifiable Information). For instance, if I had posted my address—or worse, if a quoted message contained someone else’s address—I wouldn't want the model to memorize and regurgitate it. In this specific case, I control the entire pipeline and the model won't be public. Furthermore, forum members use pseudonyms and are generally careful about privacy. However, if I planned to release this model, this step would be absolutely critical.

Cleaning the Raw Data

The first pass involved stripping out the "noise":

  • Removing artifacts: Images, formatting (bold, italics), certain emojis, and YouTube links.
  • Simplifying context: For messages with multiple nested quotes, I stripped everything after the first quote to keep the input > output relationship simple.
  • Normalization: I standardized line breaks (max 2 in a row).
  • Quality control: I deleted any message shorter than 6 words to ensure the model had enough substance to learn from.
  • De-duplication: I removed about fifty very similar messages from a recurring forum event to prevent the model from "over-fitting" on repetitive nonsense.
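A minimal sketch of that cleaning pass, assuming BBCode-style forum markup (the tag patterns and exact regexes here are illustrative, not my actual script):

```python
import re

def clean_message(text):
    """Strip noise from a raw forum post; return None if it fails QC."""
    # Removing artifacts: images, bold/italic markup, YouTube links
    text = re.sub(r"\[img\].*?\[/img\]", "", text, flags=re.DOTALL)
    text = re.sub(r"\[/?[bi]\]", "", text)
    text = re.sub(r"https?://(?:www\.)?(?:youtube\.com|youtu\.be)/\S+", "", text)
    # Normalization: collapse runs of 3+ line breaks down to 2
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    # Quality control: drop messages shorter than 6 words
    if len(text.split()) < 6:
        return None
    return text
```

Quote-trimming and de-duplication followed the same pattern: small, deterministic passes that are easy to eyeball.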

Reverse Prompt Engineering

The biggest hurdle was missing data. While my forum posts are the perfect output, about 75% of them lacked an input. Many posts were just part of a flow or started new sub-topics without quoting anyone. How do you train a model to respond if you don't have the question?

This is where I used Reverse Prompting. I sent the following prompt to a larger LLM:

(Translated from French)

const prompt = (target: string) => {  
    return `# Role  
You are an expert in forum conversation simulation. Your mission is to generate the INPUT (a message from a third party) that triggered my OUTPUT (your response).  

# Task  
Analyze my POST (Output) and reconstruct the preceding message.  

# Step 1: Classification (Reasoning)  
Ask yourself: "What is the purpose of my post?"  
A. **I am asking for help / Starting a topic** -> COLD_START.  
B. **I am providing a solution / Giving advice** -> Reply to a QUESTION (Problem).  
C. **I am reacting / Contradicting / Commenting** -> Reply to a STATEMENT (Opinion).  

### ⚠️ DISQUALIFICATION CRITERIA (COLD_START): If my POST contains formal opening/closing markers like:  
- "Hi, ..." at the beginning + a question.  
- "Thanks" or "Thanks in advance" at the end.  
- "Could I have...", "I'm looking for...", "I need..."  
THEN -> It is automatically a COLD_START (IGNORE).  

# Step 2: Input Generation (CRITICAL)  
Generate the message from the other user.  

### ⚠️ GOLDEN RULES AGAINST ECHOING (Must follow strictly): 
1. **Don't "spoil" the answer:** If my output contains a specific solution (e.g., "Use Software X"), the input MUST NOT mention "X". The input should express the *need* or *problem* (e.g., "I'm looking for software to do this").  
2. **Handling quotes:** If I say "As Anon said...", the input shouldn't ask "What did Anon say?". The input should BE "Anon" or someone discussing Anon's topic.  
3. **Create friction:** If I contradict someone, the input must state the opposite of what I say. If I say "That's false," the input must say "That's true."  
4. **Style:** No robotic/AI/LLM style. Use the forum slang. No excessive politeness.  

# Input -> Output Logic Examples:  
- BAD: Input="Is the 4070 good?" -> Output="Yes, the 4070 is great." (Too easy).  
- GOOD: Input="I'm hesitating to keep my 1060, is it worth upgrading?" -> Output="Yes, the 4070 is great."  

- BAD: Input="What do you think of Freud?" -> Output="Freud said X..."  
- GOOD: Input="Honestly, psychoanalysis is total nonsense." -> Output="Freud said X..."  

# Output Format (Strict JSON)  
{  
  "reasoning": "Why it's a Reply vs a Cold Start. What is the intent of the other user (Naivety? False statement?)", 
  "type": "COLD_START" | "REPLY_TO_QUESTION" | "REPLY_TO_STATEMENT",  
  "generated_input": "The simulated message (empty if COLD_START)"
}  

# My POST (Output)  
${target}  
`;  
}

This prompt takes my post and asks the LLM to hallucinate the "trigger" message. I refined this heavily with Gemini to ensure high-quality results. If a post was a "Cold Start" (meaning it didn't need a preceding message), the script filtered it out.

I used the OVH LLM API with Meta-Llama-3_3-70B-Instruct. It’s excellent with French, cost-effective, and keeps the data within the EU with zero data retention. Processing ~6,200 messages at 350 requests per minute took about 18 minutes.

Loading data...
Loaded 8581 parsed entries
Found 86 already completed entries
Found 6326 entries without quotes to process
6240 entries remaining to process

Processing in 18 batches of up to 350 requests each

Processing |██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| 5% | 350/6240 | ETA: 307s
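The batching behind that progress bar is simple: slice the remaining entries into rate-limit-sized chunks and fire one chunk per minute (a sketch; the real script also resumes past already-completed entries):

```python
def make_batches(entries, batch_size=350):
    """Split entries into chunks of at most `batch_size` requests."""
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]

remaining = list(range(6240))      # 6,240 messages still to process
batches = make_batches(remaining)  # 18 batches, one per minute at 350 req/min
```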

The Quality Filter

Even with Llama 3.3, the dataset wasn't perfect. I needed to separate the wheat from the chaff. Since I wasn't about to manually read 6,000 messages, I used another LLM to act as a Quality Filter:

(Translated from French)

const SYSTEM_PROMPT = `You are a quality filter for a fine-tuning dataset.  
Your role is to clean the data. Analyze the [INPUT] / [OUTPUT] pair.  

STRICT REJECTION CRITERIA (DISCARD):  
1. Tautology: The Input repeats the Output or is too perfectly tailored to it.  
2. Question-by-Question: The Output just parrots the Input phrase; it's not a natural exchange.  
3. Illogical: The Output doesn't clearly answer the Input.  
4. Failed Cold Start: The Output is clearly a conversation starter, but the Input is a forced question.  
5. Low Value: Output is too short ("lol", "ok") or lacks identifiable context.  
6. Hallucination: The Input invents contradictory facts.  

Respond ONLY in JSON format: {"verdict": "KEEP" | "DISCARD", "reason": "..."}`;

After this filtering, I was left with roughly 4,300 high-quality entries. I also capped the text length at 2,000 characters to prevent Out Of Memory errors during training. Total cost? Less than €8 (and actually €0 because I had welcome credits).
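Applying the verdicts then boils down to a filter plus a trim (a sketch: the field names match the JSON contract above, and the cap value is the one from my run):

```python
import json

MAX_CHARS = 2000  # length cap to avoid OOM errors during training

def apply_verdict(entry, verdict_json):
    """Keep an (input, output) pair only if the filter LLM said KEEP."""
    verdict = json.loads(verdict_json)
    if verdict["verdict"] != "KEEP":
        return None
    # Trim overly long texts to the character cap
    return {
        "input": entry["input"][:MAX_CHARS],
        "output": entry["output"][:MAX_CHARS],
    }
```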

Personal Touches & Data Augmentation

To help the model understand who I am without guessing, I manually answered 100 questions about myself (education, hobbies, favorite games, etc.).

Since these are high-priority data points, I used oversampling: I duplicated these 100 entries 4 times so they represent about 8% of the dataset. This "stamps" my actual identity into the model more firmly.

Adding the Alpaca Dataset

The Alpaca dataset is a gold standard in fine-tuning; it helps small models learn how to behave like a helpful assistant. I mixed in 500 random lines from pinzhenchen/alpaca-cleaned-fr (French Alpaca). This adds a layer of consistency and structure, balancing out the "chaos" of forum discussions while keeping my personal style dominant.

The Final Tally

I ended up with a .json file of about 17,000 lines (2MB) containing 5,136 total entries in this format:

{  
  "input": "I'm thinking of switching to open-ear headphones for running. Any recommendations?",  
  "output": "Personally, I use the Shokz OpenMove. I'm really happy with them because [...]"  
}

They say a model is only as good as its data, and my experiments proved that 100%. Now, let’s get to the actual training!

Fine-tuning with Unsloth

The fine-tuning process itself isn't necessarily the most complex part, though finding the right settings can be a bit of a challenge.

To handle this, I used JupyterLab, a web-based interactive development environment that allows you to execute Python code in separate blocks called "cells" while keeping an active kernel. This is incredibly useful because if your last cell crashes, you can often just fix and rerun it without restarting the entire process from scratch. It’s not magic, though—you can still crash the kernel and be forced to start over!

For the training itself, I used the Unsloth library.

To get started on a machine with Python 3 installed, create a virtual environment, activate it, and install JupyterLab:

python3 -m venv clone
source clone/bin/activate
pip install jupyterlab

Finally, launch the JupyterLab interface and select the Python kernel (or the Unsloth kernel if you have one configured):

jupyter lab

Below is the commented script I ran in JupyterLab for the training. I recommend executing it piece by piece. The original notebook that served as my inspiration can be found here: Gemma 3 (4B) Unsloth Notebook.

# Install the unsloth library
!pip install unsloth

from unsloth import FastModel
import torch

# 1. LOAD THE BASE MODEL & TOKENIZER
model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    max_seq_length = 1280,
    load_in_4bit = True, # Crucial for consumer GPUs: loads weights in 4-bit precision to save VRAM.
    load_in_8bit = False,
    full_finetuning = False,
)

# 2. CONFIGURE LoRA (Low-Rank Adaptation)
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False,
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,
    r = 32,                     # The "Rank" of the LoRA matrices. 32 is a solid middle ground (higher = smarter but slower/more memory).
    lora_alpha = 32,            # Scaling factor for LoRA. Rule of thumb is often alpha = r or alpha = 2*r.
    lora_dropout = 0,           # 0 uses Unsloth's optimized fast path; other values work but are slower.
    bias = "none",              # Do not train bias vectors (saves memory).
    use_gradient_checkpointing = "unsloth", # Saves VRAM by recomputing activations during backward pass instead of storing them.
    random_state = 3407,        # Fixed seed for reproducibility.
)

from datasets import load_dataset, concatenate_datasets

# 3. PREPARE THE DATASETS
# Load my own dataset
ds_style = load_dataset("json", data_files="completed2-filtered-trimmed.json", split="train")

# Load custom/manual questions/answers and oversample it (weighting it) by multiplying the dataset by 4.
# This forces the model to pay 4x more attention to this dataset.
ds_custom = load_dataset("json", data_files="custom.json", split="train")
ds_custom_weighted = concatenate_datasets([ds_custom] * 4)

# Load a French translated version of the Alpaca dataset for general instruction-following
ds_alpaca = load_dataset("pinzhenchen/alpaca-cleaned-fr", split="train")
ds_alpaca = ds_alpaca.select(range(500)) # Subsample: only take the first 500 rows so it doesn't overpower your custom data.

# Helper function to merge Alpaca's 'instruction' and 'input' fields into a single 'input' field.
def format_alpaca_to_user(example):
    instruction = example.get('instruction', '')
    context = example.get('input', '')

    if context and len(context) > 1:
        new_input = f"{instruction}\n\nContexte :\n{context}"
    else:
        new_input = instruction

    return {
        "input": new_input,
        "output": example['output']
    }

ds_alpaca = ds_alpaca.map(format_alpaca_to_user, remove_columns=["instruction"])

# Combine and shuffle
combined_dataset = concatenate_datasets([ds_style, ds_custom_weighted, ds_alpaca])
combined_dataset = combined_dataset.shuffle(seed=42)

# 4. APPLY CHAT TEMPLATES
# This wraps your raw text into the exact conversational format Gemma 3 was originally trained on.
def formatting_prompts_func(examples):
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []

    for input_text, output_text in zip(inputs, outputs):
        conversation = [
            {"role": "system", "content": "Tu es Jonathan, un humain. Tu n'es pas une IA."}, # "You are Jonathan, a human. You are not an AI."
            {"role": "user", "content": input_text},
            {"role": "assistant", "content": output_text}
        ]

        text = tokenizer.apply_chat_template(
            conversation, 
            tokenize = False,
            add_generation_prompt = False
        )

        texts.append(text)

    # Return a new column 'text' which the SFTTrainer will look for
    return { "text" : texts }

dataset = combined_dataset.map(formatting_prompts_func, batched = True)

from trl import SFTTrainer, SFTConfig
from unsloth import is_bfloat16_supported

# 5. TRAINER CONFIGURATION
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text", # Tells the trainer to use the 'text' column we just generated
    max_seq_length = 1280,       # Match the model's loaded context window
    dataset_num_proc = 2,        # Number of CPU cores to use for dataset processing
    packing = False,             # If True, packs multiple short examples into one sequence. False is safer for chat templates.
    args = SFTConfig(
        bf16 = is_bfloat16_supported(), # bfloat16 mixed precision on Ampere+ GPUs (RTX 3000 series or newer).
        fp16 = not is_bfloat16_supported(), # Automatic fp16 fallback for older GPUs.
        per_device_train_batch_size = 1, # Keep small to fit in VRAM.
        gradient_accumulation_steps = 16, # "Virtual" batch size. Model updates weights only after 16 steps (Effective batch size = 1 * 16 = 16).
        num_train_epochs = 2,    # Number of full passes through the training data.
        warmup_steps = 50,       # Gradually increases learning rate for the first 50 steps to prevent catastrophic forgetting early on.
        learning_rate = 2e-4,    # Standard aggressive LoRA learning rate.
        logging_steps = 10,      # Print training loss every 10 steps in jupyter output
        optim = "paged_adamw_8bit", # 8-bit optimizer. Vital for saving VRAM without sacrificing performance.
        weight_decay = 0.01,     # Regularization to prevent overfitting.
        lr_scheduler_type = "linear", # Gradually decays the learning rate to 0 over the course of training.
        seed = 3407,
        output_dir = "outputs",  # Directory where intermediate checkpoints are saved.
        report_to = "none",      # Disables integrations like Weights & Biases or TensorBoard. Change to "wandb" if you want nice graphs.
        gradient_checkpointing = True # Ensures activation memory is saved (syncs with Unsloth's checkpointing config).
    ),
)

# 6. RUN TRAINING
trainer_stats = trainer.train()

# 7. EXPORT TO GGUF

# Save a Q4_K_M version: This is the most popular quantization. 
# It shrinks the model drastically (approx 4-bits per weight) with very minimal quality loss. Excellent for consumer hardware.
model.save_pretrained_gguf(
    "gemma3_forum_gguf_q4",
    tokenizer,
    quantization_method = "q4_k_m" 
)

# Save a Q8_0 version: 8-bit quantization.
# Larger file size and requires more RAM/VRAM to run, but retains almost 100% of the original unquantized model's intelligence.
model.save_pretrained_gguf(
    "gemma3_forum_gguf_q8",
    tokenizer,
    quantization_method = "q8_0"
)

The execution took about 1 hour and 30 minutes on my machine.

At the end of the process, I obtained two files:

  • gemma-3-12b-it.Q4_K_M.gguf: The most compressed version of the model. It fits into just over 8GB of VRAM and is the one I'll be using for daily interaction.
  • gemma-3-12b-it.Q8_0.gguf: A higher-quality version, but it requires significantly more VRAM.
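One way to chat with the Q4 export locally is to register it with Ollama via a Modelfile (the system prompt mirrors the one used during training; the model name and temperature here are my own choices, not from any official setup):

```shell
# Write a minimal Modelfile pointing at the quantized export
cat > Modelfile <<'EOF'
FROM ./gemma-3-12b-it.Q4_K_M.gguf
SYSTEM "Tu es Jonathan, un humain. Tu n'es pas une IA."
PARAMETER temperature 0.8
EOF

# Then register and chat with it (requires Ollama to be installed):
#   ollama create clone -f Modelfile
#   ollama run clone
```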

I ran a few local tests, and my writing style is definitely recognizable! However, to truly validate if the clone works, I need to put it to the test with people who actually know me...

The Verdict: Is It Convincing?

I wanted to know if my clone was actually any good. To find out, I put together a survey using a small script that:

  1. Picks a forum quote that I had historically replied to.
  2. Retrieves my actual response from the archives.
  3. Generates a new response using the clone, using that same quote as the prompt.

I let the script generate 10 pairs of "Real Me" vs. "AI Me" responses and asked my friends and the forum members to guess which was which.
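The survey script boiled down to something like this (a sketch: `archive` and `clone_generate` are stand-ins for the forum export and the local model call):

```python
import random

def build_survey(archive, clone_generate, n_pairs=10, seed=0):
    """Build blind-test questions: one real reply vs. one clone reply."""
    rng = random.Random(seed)
    quotes = rng.sample(sorted(archive), n_pairs)  # random quotes, no cherry-picking
    survey = []
    for quote in quotes:
        real = archive[quote]           # my actual historical response
        ai = clone_generate(quote)      # generated on the first try, no rerolls
        options = [("human", real), ("ai", ai)]
        rng.shuffle(options)            # hide which is which
        survey.append({"quote": quote, "options": options})
    return survey
```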

The results were fascinating.

Keep in mind that in a blind test like this, a 50% correct guess rate means the answers are statistically indistinguishable—people are basically just flipping a coin because they can't tell the difference.

The Giveaways

Two of the AI responses were total "dead giveaways":

  • Hallucination: In one instance, the AI hallucinated my own username and started giving advice to "Jonathan," which made zero sense in context. 93% of people spotted this one immediately.
  • Syntax error: In another, the AI used a sentence structure that felt "off" and robotic. 87.5% of people caught it. These two outliers definitely skewed the final average.

The "AI Wins"

This is where it gets crazy: there were two instances where the AI won. In these cases, 69.8% and 62.5% of respondents thought the AI response was the "real" me. It turns out the model can mimic my style so effectively that it sometimes sounds more like "me" than I do!

We also had one question hit exactly 50%, making it completely indistinguishable.

Final Stats

In the end, people correctly identified my AI clone 66% of the time (where 50% would mean it was completely indistinguishable). I consider this a massive success, especially since there was absolutely no cherry-picking: all quotes were chosen randomly, and the AI responses were generated on the first try without any "rerolls."

Here are a few example exchanges (translated from French) using Ollama:

>>> Is Windows Vista a good OS?
Windows Vista is an excellent OS, you should just install it :p

>>> I decided to call my cat: The Great Old One.
Haha, I love that name for a cat

>>> Can you give me your favorite recipe?
On one hand, it's simple, but on the other, it's very effective: Basquaise chicken. 
It's a Spanish dish with chicken simmered in a sauce of red peppers and tomatoes.

(Clarification: I love the chicken, but hate the OS. That said, the :p suggests the clone might just be trolling—which, considering the dataset, means it’s working perfectly.)

Taking It a Step Further!

Now that I had a model that thinks like me, I wanted to take it to the next level: giving it my voice and the ability to converse with it naturally.

I set up a pipeline using Whisper (OpenAI’s open-weights model) to transcribe my speech into text. That text is then fed to my "clone" model as a prompt. Once the response is generated, it’s sent to Qwen3-TTS, a model capable of high-quality voice cloning.

To set up the voice, I "trained" (or rather, fine-tuned the reference for) Qwen3-TTS by recording myself speaking for about one minute.
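Glued together, one conversational turn looks like this (a skeleton with injectable callables, not the actual script, which is linked at the end of this section):

```python
def voice_turn(audio_path, transcribe, generate, synthesize):
    """One turn of the voice pipeline: speech -> text -> clone -> speech.
    `transcribe` wraps Whisper, `generate` wraps the fine-tuned GGUF model,
    and `synthesize` wraps Qwen3-TTS with the one-minute voice reference."""
    user_text = transcribe(audio_path)   # speech-to-text
    reply_text = generate(user_text)     # the clone's written answer
    return synthesize(reply_text)        # spoken back in my cloned voice
```

Keeping the three stages behind plain callables also makes the pipeline easy to test with stubs before plugging in the real models.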

Clone Workflow

All these models are open-weights and run locally on my machine!

Of course, we aren't quite at the level of latency you'd see in top-tier commercial models. I don't have the enterprise-grade GPU required for instantaneous replies, and I wasn't specifically optimizing for speed—for instance, I don’t stream the audio chunks from Qwen3-TTS; I simply wait for the full response to generate.

Here is a short demo video. It’s in French, but you can turn on the English subtitles on YouTube! I cut the wait times after the first question to make it smoother.

Here is the Python script used in this video: https://gist.github.com/jprevo/76d2e1fe388a3ecff7631b18ef91c233


Final Thoughts

Ultimately, I’ve only scratched the surface of what’s possible with LLM fine-tuning. My model resembles me quite a bit in writing (and in voice!), but it still has its fair share of inconsistencies. I’m convinced that by spending even more time curating the dataset and obsessing over hyperparameters, the results could be even more uncanny.

That said, I had a blast and learned an immense amount, which was the whole point. My clone doesn't really have a "job" or a practical utility, so I’ll likely leave it in a corner of my hard drive and move on to the next project.

...Until the day it returns to seek its revenge! 🤖

You can find me on LinkedIn here: Jonathan Prevost
