<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Max Frecka</title>
    <description>The latest articles on DEV Community by Max Frecka (@maxfrecka).</description>
    <link>https://dev.to/maxfrecka</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3192641%2F90c52873-9645-457b-9a8d-643cb01eb382.png</url>
      <title>DEV Community: Max Frecka</title>
      <link>https://dev.to/maxfrecka</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maxfrecka"/>
    <language>en</language>
    <item>
      <title>My first successful LLM fine tuning</title>
      <dc:creator>Max Frecka</dc:creator>
      <pubDate>Fri, 23 May 2025 14:36:27 +0000</pubDate>
      <link>https://dev.to/maxfrecka/my-first-successful-llm-fine-tuning-ila</link>
      <guid>https://dev.to/maxfrecka/my-first-successful-llm-fine-tuning-ila</guid>
      <description>&lt;p&gt;Yesterday I told myself I would do something other than program all day, but again ended up pushing to accomplish the goal that I felt a bit obsessed over. The goal: to fine tune Mistral using my own method, in order to create a personalized lyric generation machine. I would be training it with my own lyrics as well as injecting random sequences of words. I just had to know the effect that fine-tuning would have. Would it really be able to make the kind of creative lyrics that I wanted?&lt;/p&gt;

&lt;p&gt;Base LLMs can't really generate the kind of poetry/lyrics that actual artists do in the real world. Usually they are very hard to wrangle toward being more creative. So I figured I would try an experimental fine-tuning method with Mistral to see what effect it could have.&lt;/p&gt;

&lt;p&gt;After running into difficulty with LoRA fine-tuning Mistral 7B locally on my computer, I decided it would make more sense to do it on AWS. The easiest method I'm currently aware of is a notebook instance on SageMaker AI. Those Jupyter notebooks were used in the LLM course I took, and they feel very convenient and friendly to me.&lt;/p&gt;

&lt;p&gt;When I tried to fine-tune locally, I had to work around GPU limitations. I have an RTX 2060 Super, which is great for gaming but not really powerful enough for this kind of work. It was just too complicated, so yesterday I decided to pay for a large, powerful instance and try to get it done.&lt;/p&gt;

&lt;p&gt;I set everything up and attempted to run the trainer, but quickly ran into an error. My g5.2xlarge instance was not powerful enough to run the fine-tuning. So I decided to change it to the g5.12xlarge instance type.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15m3gs8n76idvqcrkl21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15m3gs8n76idvqcrkl21.png" alt="Image description" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was more expensive so I felt pressure to move quickly, not knowing if I would be able to finish at all. This was a learning experience for me just as much as it was a creative experiment that might actually be useful in my music work.&lt;/p&gt;

&lt;p&gt;Before any of this, I created my experimental training dataset. This was not the most precise method of creating training data; I mainly just needed to see for myself the tangible effects of running fine-tuning on an LLM. It's all a bit of a "black box" mystery, isn't it? The only way to map out the territory is to run experiments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_dataset2(iterations=30):
    # Prepare the new training dataset
    training_data = []

    for i in range(iterations):

        prompt = generate_dataset_prompt()

        generated_response = generate_abstract_response()

        training_data.append({
            "instruction": prompt,
            "response": generated_response
        })

    # Save to file
    with open("trainingdataset.json", "w", encoding="utf-8") as f:
        json.dump(training_data, f, ensure_ascii=False, indent=2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That function uses two separate functions to automatically generate both the prompt and the response. The generate_abstract_response() function constructs a response using two separate helper functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_abstract_response(iterations=10):
    output = ""
    ## alternate randomly per iteration between pasting in my lyrics or adding a random sequence
    with open("wordlist.txt", "r", encoding="utf-8") as f:
        wordlist = [line.strip() for line in f if line.strip()]

    for i in range(iterations):
        line = ""
        if random.random() &amp;lt; 0.7:  # 70% chance: paste in real lyric lines
            line = get_random_yaml_sequence("lyriclines.yaml", "dogalyrics") + "\n"
        else:
            line = generate_random_sequence(wordlist) + "\n"
        output += line
    return output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before I wrote this function, I took about a year's worth of lyrics from my songs and pasted every single line into a yaml file, about 400 lines in total. The function then randomly decides whether to paste in 0-5 of those lyric lines in sequence (so the result is somewhat coherent rather than completely random, and some poetic structure is preserved), or to use a second function, generate_random_sequence(wordlist). The wordlist was constructed by taking an array of words that I think sound cool (dolphin, pop, zip, zap), calling an API to find all synonyms and antonyms of those words, and adding everything to wordlist.txt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#JUST SAVING THE WORDLIST TO TXT FOR USE

import csv
import os
from datetime import datetime
import time
import random
import requests

#prompt generator
# Original words array
words = [
    "bop", "roo", "cash", "spot", "spotty", "big", "noise", "sound", 
    "soundsystem", "box", "future", "now", "meow", "alien", "moon", "high", 
    "max", "maxim", "maximum", "global", "ear", "cat", "monkey", "road", 
    "play", "station", "poly", "glitch", "picture", "chart", "cat", "pop", 
    "center", "central", "media", "dolphin", "frequency", "information", 
    "dataset", "data", "number", "hear", "move", "all", "moving", "lab", 
    "type", "wave", "waving", "service", "soda", "fun", "toy", "beach", 
    "copy", "dot", "com", "info", "set", "net", "graph", "race", "racer", 
    "hearing", "music", "tube", "bell", "thought", "mind", "think", "thinking", 
    "auto", "audio", "time", "vector", "planet", "space", "warp", "today", 
    "today's", "sun", "star", "galaxy", "tennis", "racing", "speed", "mission", 
    "code", "line", "script", "program", "form", "focus", "day", "my", "major", 
    "atom", "witness", "shine", "shining", "heart", "radiant", "our", "we", 
    "together", "flash", "jump", "jumping", "your", "galaxy", "galactic", 
    "system", "market", "circle", "cycle", "yes", "no", "love", "spirit", 
    "filter", "speeding", "dash", "dashing", "project", "comet", "screen", 
    "magic", "select", "sign", "radius", "theory", "thesis", "zip", "zap", 
    "zipping", "protocol", "zone", "shift", "shifting", "symbol", "city", 
    "wire", "lobe", "psyche", "cheetah", "running", "person", "people", 
    "tribe", "plan", "path", "solar", "stereo", "new", "channel", "tree", 
    "plant", "flower", "rainbow", "sun", "sunny", "water", "brain", "dream", 
    "core", "reactor", "remix", "perfect", "all", "any", "real", "really", 
    "boom", "zing", "wow", "ya", "orb", "zen", "born", "slippy", "mix", "saga", "road"
]

# Fetch synonyms and antonyms for a word
def get_synonyms_antonyms(word):
    try:
        syn_resp = requests.get(f"https://api.datamuse.com/words?rel_syn={word}").json()
        ant_resp = requests.get(f"https://api.datamuse.com/words?rel_ant={word}").json()
        synonyms = [entry["word"] for entry in syn_resp]
        antonyms = [entry["word"] for entry in ant_resp]
        return synonyms + antonyms
    except Exception as e:
        print(f"Error fetching for word {word}: {e}")
        return []

# Expand the words list with synonyms/antonyms
extended_words = set(words)
for word in words:
    related_words = get_synonyms_antonyms(word)
    extended_words.update(related_words)
    time.sleep(0.2)  # avoid hitting rate limits

extended_words = list(extended_words)
with open("wordlist.txt", "w", encoding="utf-8") as f:
    for word in sorted(extended_words):
        f.write(word + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this, generate_random_sequence(wordlist) just randomly picks words from that list of "cool words" and combines them into lines. generate_abstract_response() then alternates randomly between adding lines of my actual lyrics and lines of random sequences until it has constructed some kind of "poem" or "lyric". Here is one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;pink messages&lt;br&gt;
messages with you&lt;br&gt;
you're someone new&lt;br&gt;
The sound in your head&lt;br&gt;
sky chromatic infographic&lt;br&gt;
hypersonic super static&lt;br&gt;
magic trick computed axial tomography circling serve inwardness railway line&lt;br&gt;
not today&lt;br&gt;
show you the sunset&lt;br&gt;
atom sexual love slippy think speeding&lt;br&gt;
smell washing soda sorcerous&lt;br&gt;
gazing on the night meadow&lt;br&gt;
in the glass raindew mist&lt;br&gt;
make me wonder what became&lt;br&gt;
revolve about seth spy partly unreal untested sign of the zodiac pinched&lt;br&gt;
this transmission must keep looping&lt;/p&gt;
&lt;/blockquote&gt;
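&lt;p&gt;The two helpers referenced above aren't shown in the post. Here is a minimal sketch of how they might work; the names and defaults are my assumptions, and get_random_yaml_sequence() is approximated by get_random_lyric_sequence(), which takes a pre-loaded list instead of a yaml path:&lt;/p&gt;

```python
import random

def generate_random_sequence(wordlist, min_words=3, max_words=8):
    # Glue a random number of random "cool words" into one line
    n = random.randint(min_words, max_words)
    return " ".join(random.choice(wordlist) for _ in range(n))

def get_random_lyric_sequence(lyric_lines, max_lines=5):
    # Pick 0-5 consecutive lyric lines so some poetic structure survives
    n = random.randint(0, max_lines)
    if n == 0 or not lyric_lines:
        return ""
    start = random.randint(0, max(0, len(lyric_lines) - n))
    return "\n".join(lyric_lines[start:start + n])
```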

&lt;p&gt;The hope was that this combination of randomness, and authentic lyrics that I actually wrote, would push the LLM toward a combination of having an actual structure, as well as having the capacity for serious randomness at times.&lt;/p&gt;

&lt;p&gt;So, finally, I executed the training. I ran 3 epochs on a dataset of 1000 entries. It took about 40 minutes on the g5.12xlarge instance, and I didn't really have to struggle at all. I definitely think paying for a powerful instance and getting the job done in an hour is the way to go, rather than trying to squeeze it onto a weaker instance just to save 5 dollars.&lt;/p&gt;
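&lt;p&gt;The post doesn't show how the {"instruction", "response"} pairs get flattened into strings for the trainer. A minimal sketch, assuming the instruction/response tag format that appears in the generation logs elsewhere in these posts:&lt;/p&gt;

```python
def format_example(entry):
    # Flatten one dataset entry into a single training string;
    # the exact "### INSTRUCTION / ### RESPONSE" tags are an assumption
    return "### INSTRUCTION:\n{}\n\n### RESPONSE:\n{}".format(
        entry["instruction"], entry["response"]
    )
```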

&lt;p&gt;Each epoch was 250 steps, and a checkpoint was saved after each one so I could choose which to use. The full 3-epoch, 750-step version seemed very overfit to me; it would just spit out sequences of my lyrics. So I decided to go with the 250-step version and set the temperature very high.&lt;/p&gt;
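&lt;p&gt;The post doesn't include the training configuration itself. For reference, a peft-style LoRA setup that saves one checkpoint per epoch might look roughly like this; every hyperparameter here is an assumption, not something listed in the post:&lt;/p&gt;

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Assumed hyperparameters -- the post does not list any of them
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="mistral-lyrics-lora",
    num_train_epochs=3,            # 3 epochs of 250 steps each, as described
    save_strategy="epoch",         # one checkpoint per epoch
    per_device_train_batch_size=4,
    learning_rate=2e-4,
)
```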

&lt;p&gt;The next part was where things completely devolved. My goal was to merge the LoRA adapter back into Mistral and then convert the merged model to the .gguf format to use locally with llama.cpp in Python. That was always the goal! But, idiotically, my instance ran out of hard drive space. At this point I was utterly done and ready to wrap up, so I just started trying to brute-force my way through the problem. Nothing would delete! I was so close to the finish line, but now it all seemed to be falling apart. So I decided to download the LoRA adapter and then merge/convert locally.&lt;/p&gt;

&lt;p&gt;Luckily, I eventually did manage to do both of those things, but it was a good 1-2 hours of pain. Truth be told, I had almost zero idea what I was doing, and I was at that point where it was no longer fun and I was just trying to finish. It would have stunk not to finish all the way, but I just kept running into problems. GPT was wrong about many things in the Hugging Face libraries I was using, so I stopped listening to it and started searching Google for help.&lt;/p&gt;

&lt;p&gt;The issue was with converting to the .gguf format; I couldn't figure out which script to use. But somehow, after like 100 errors and failures, it &lt;em&gt;finally&lt;/em&gt; worked. And at last I could run some completions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results:
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;in the shopping center of my mind&lt;br&gt;
i've got a lot on my mind&lt;br&gt;
a lot to leave behind&lt;br&gt;
i'm feeling so alive&lt;br&gt;
I don't know what I want anymore&lt;br&gt;
i'm running from my past forevermore&lt;br&gt;
putting on the mask&lt;br&gt;
the inside is the outside&lt;br&gt;
and i will not be controlled&lt;br&gt;
I will not be stopped&lt;br&gt;
this is my destiny&lt;br&gt;
i'll take my chances and fly by the seat of my pants&lt;br&gt;
it's time to wake up and smell the coffee&lt;br&gt;
another day another dollar&lt;br&gt;
i need to find my way out of here&lt;br&gt;
the world is a vast place&lt;br&gt;
i want to explore&lt;br&gt;
and i won't back down&lt;br&gt;
so let me in&lt;br&gt;
i won't back down&lt;br&gt;
i'm on the inside&lt;br&gt;
another day another dollar&lt;br&gt;
i need to find my way out of here&lt;br&gt;
the world is a vast place&lt;br&gt;
i want to explore&lt;br&gt;
and i won't back down&lt;br&gt;
so let me in&lt;br&gt;
I will not be controlled&lt;br&gt;
the inside is the outside&lt;br&gt;
i am the one&lt;br&gt;
putting on the mask&lt;br&gt;
I don't want to be your clone&lt;br&gt;
this is my life&lt;br&gt;
not just another face in the crowd&lt;br&gt;
i'm on a mission&lt;br&gt;
to take over the world&lt;br&gt;
one polygon at a time&lt;br&gt;
so let us begin&lt;br&gt;
with love in our hearts&lt;br&gt;
and a little bit of magic&lt;br&gt;
we'll make the world a better place&lt;br&gt;
if&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For a first experiment, I was happy with how the fine-tuning really did make the LLM personal to me. The format of its responses really does vaguely sound like me, and it is producing results that are a bit more like what I wanted. It's far from perfect... it's not exactly what I was going for. It's still a bit cliched, and it's not really producing much randomness or abstraction. But it's definitely better than base Mistral.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and next steps
&lt;/h2&gt;

&lt;p&gt;First off, this was very challenging. I personally am rather pleased with myself for doing this. It was "painful" in the way that programming often is. Error after error, basically just hoping that things work right, tons of different libraries you don't really understand working together to just magically accomplish something... and most of the time it doesn't work. And right when you think you're successful, another error. Another issue. But you push through because you just have to make it work. This was definitely a memorably intense version of that. It was kind of epic, actually. Felt like I was battling demons or something. And then, I won.&lt;/p&gt;

&lt;p&gt;The coolest thing is that after doing this once, I can do it again much more easily, and I can experiment with improving my methods. Clearly the dataset is the key to fine-tuning: 1000 entries is only really worthwhile if the dataset is top notch, and mine really wasn't. I could have done 500 entries and 1 epoch. But I can already see ways of improving the dataset, and now I have a tried-and-true methodology for the future.&lt;/p&gt;

</description>
      <category>huggingface</category>
      <category>sagemaker</category>
    </item>
    <item>
      <title>This is cool, in my opinion.</title>
      <dc:creator>Max Frecka</dc:creator>
      <pubDate>Thu, 22 May 2025 06:06:25 +0000</pubDate>
      <link>https://dev.to/maxfrecka/this-is-cool-in-my-opinion-4doi</link>
      <guid>https://dev.to/maxfrecka/this-is-cool-in-my-opinion-4doi</guid>
      <description>&lt;p&gt;The last BeaconAI generation, at 1:53AM, on its 30 minute interval:&lt;/p&gt;

&lt;p&gt;The full row, details to follow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ixsljb12r3zgzxs1bfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ixsljb12r3zgzxs1bfg.png" alt="Image description" width="800" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently I'm using automated prompt engineering in this "instruction"-then-"response" format, which this particular model seems to respond to, for reasons I don't really understand yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2025-05-22T01:52:25.977565

    #INSTRUCTION: Write abstract modernist poetry about cat.

    Don't be frightened to display yourtalents. Reflect on stereo and cat and chop the words up re-assemble them. do it again. then again. don't half ass this.. Make it about beautiful stereo.

    ### RESPONSE:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How Mistral responded to this engineered prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the stereo, a cat's purring hum,&lt;br&gt;
Echoes through the speakers like a drum.&lt;br&gt;
Its fur is silver in the moonlight glow,&lt;br&gt;
A creature of mystery and stealth,&lt;br&gt;
Who prowls in darkness, yet has the strength&lt;br&gt;
To leap upon its prey with such delight.&lt;br&gt;
The cat is a modernist dream,&lt;br&gt;
Innovative, sleek, and so very serene.&lt;br&gt;
Its meow is like an electric hum,&lt;br&gt;
And as it pounces, you hear its paws drum,&lt;br&gt;
On the wooden floor beneath your feet.&lt;br&gt;
Oh, the stereo's power lies in this cat,&lt;br&gt;
Its presence so vital, it seems to chat,&lt;br&gt;
A living embodiment of sound,&lt;br&gt;
That moves and dances all around.&lt;br&gt;
So let the cat inspire you and your art,&lt;br&gt;
For its beauty knows no bounds or limits,&lt;br&gt;
And in its presence, we are reminded,&lt;br&gt;
To live fully and to keep on listening.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I loved this output, which came from Ideogram AI. I have been increasingly feeling that Ideogram is the superior generative model, because it is both clean and creative:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsifky227nqn4b00um75p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsifky227nqn4b00um75p.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Autoprompt to Replicate API (via Python and .yaml logic):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A pink goofy 2000s fake MySpace band profile photo of In the stereo, a cat's purring hum, Echoes through the speakers like a drum. Its fur is silver in the moonlight glow, A creature of mystery and stealth, Who prowls in darkness, yet has the, in the style of blocky serious architecture with stupid.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Runner-up from "playgroundai" or whatever it's called:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fii40z7nillx5qe8zt4f3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fii40z7nillx5qe8zt4f3.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>BeaconAI: automated batch generation of prompts and audio experimentation</title>
      <dc:creator>Max Frecka</dc:creator>
      <pubDate>Wed, 21 May 2025 19:41:23 +0000</pubDate>
      <link>https://dev.to/maxfrecka/beaconai-a-fun-ai-project-built-with-gpt-elevenlabs-replicate-more-5hid</link>
      <guid>https://dev.to/maxfrecka/beaconai-a-fun-ai-project-built-with-gpt-elevenlabs-replicate-more-5hid</guid>
      <description>&lt;p&gt;I had the idea a few weeks ago to make an automated generative-AI "motivator" that could be adjusted to read out generated material at certain times of the day from a speaker. Sort of like a personalized, always-novel and potentially inspiring "beacon" to re-center me at certain times of the day with positive messages.&lt;/p&gt;

&lt;p&gt;Though that would potentially be very useful, I started to get more interested in the vast potential of what could be done with just the basic idea of an automated generative system, so I went a bit wild experimenting with a variety of APIs: Replicate for image generation, Creatify for "AI vocal filters", GPT and locally hosted Mistral for LLMs, and various text-to-speech APIs.&lt;/p&gt;

&lt;p&gt;It ultimately ended up being more of an experimental "lab" for mass generating prompts, audio, images, and video, as you can see below. Each row is one "iteration" that happens on the hour or however you specify it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hycvhjc03f28qfeiil3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hycvhjc03f28qfeiil3.png" alt=" " width="800" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Engineering and Image Generation:
&lt;/h2&gt;

&lt;p&gt;There is a lot of possibility for what you can do to automatically generate prompts and then mangle them to possibly result in highly novel output. Here was one part of my approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def modify_prompt(prompt):
    user_modifier = entry_prompt.get()
    pre_prompt = load_random_yaml_line('pre_prompts')
    modifier = get_random_modifier()
    modifier2 = load_random_yaml_line('art_descriptors')
    oblique_strategy = load_random_yaml_line('oblique_strategies')
    print(f"this modifier was used: {modifier}.")
    final_prompt = f"{user_modifier}. {pre_prompt}. {oblique_strategy}. Reflect on {modifier} and {prompt}. Make it about {modifier2} {modifier}."
    return final_prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I determined the best method for rapidly prototyping prompt engineering was to use .yaml files. I had not worked with them before, but it's quite simple. The above function pulls from a list of "Oblique strategies" (&lt;a href="https://en.wikipedia.org/wiki/Oblique_Strategies" rel="noopener noreferrer"&gt;a card-based method for promoting creativity jointly created by musician/artist Brian Eno and multimedia artist Peter Schmidt, first published in 1975&lt;/a&gt;) to append them to the prompt at random. Here are a few:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;oblique_strategies:
  - (Organic) Machinery
  - A line has two sides
  - A very small object. Its center
  - Abandon normal instruments
  - Accept advice
  - Accretion
  - Allow an easement (an easement is the abandonment of a structure)
  - Always first steps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
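&lt;p&gt;The load_random_yaml_line() helper itself isn't shown in the post. A minimal sketch using PyYAML, assuming each yaml file maps a key to a flat list of strings (the filename here is hypothetical):&lt;/p&gt;

```python
import random
import yaml  # PyYAML

def load_random_yaml_line(key, path="prompts.yaml"):
    # Each yaml file maps a key (e.g. "oblique_strategies") to a list of lines
    with open(path, "r", encoding="utf-8") as f:
        data = yaml.safe_load(f)
    return random.choice(data[key])
```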



&lt;p&gt;Next, after GPT responds, I use the response to construct a second prompt which is then sent to Replicate (to multiple models for comparison and variety) for image generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def modify_image_prompt(prompt):
    #process the GPT responses for use
    truncated_prompt = " ".join(prompt.replace("\n", " ").strip().split()[:35])
    #extract new words from yaml
    art_style = load_random_yaml_line('art_styles')
    art_style2 = load_random_yaml_line('art_styles')
    descriptor1 = load_random_yaml_line('art_descriptors')
    descriptor2 = load_random_yaml_line('art_descriptors')
    descriptor3 = load_random_yaml_line('art_descriptors')
    descriptor4 = load_random_yaml_line('art_descriptors')
    final_prompt = f"A {descriptor1} {descriptor2} {art_style} of {truncated_prompt}, in the style of {descriptor3} {art_style2} with {descriptor4}."
    return final_prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Currently, each time a single iteration of the Beacon fires off, the prompt is sent to 4 different models, but more can easily be added. Here is an example prompt (on the left) and the associated series of image outputs: the first three are Stable Diffusion, and the last four are stability-ai/sdxl (an improved Stable Diffusion), ideogram-v3-turbo, google/imagen-3-fast, and playgroundai/playground-v2.5-1024px-aesthetic. I have found that Imagen is honestly the worst.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1fn11liiyjrbs5fry4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1fn11liiyjrbs5fry4g.png" alt=" " width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next example starts with this "engineered" prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;. Do nothing for as long as possible. Reflect on tree and take random words chopped up and randomly combined and turn them into random sequences in the form of postmodernist knock knock jokes. Make it about multi-colored tree.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Its response:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Knock knock.&lt;br&gt;
Who's there?&lt;br&gt;
Red leaf falling on green branch in blue sky.&lt;br&gt;
Red leaf falling on green branch in blue sky who?&lt;br&gt;
Existential crisis of the multi-colored tree.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And the "engineered" image prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A laughable exciting back-of-the-box SNES screenshots of Knock knock. Who's there? Red leaf falling on green branch in blue sky. Red leaf falling on green branch in blue sky who? Existential crisis of the multi-colored tree., in the style of abstract impressionist painting with happy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To make this entire row:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldimlqwt2yap91sfkxe2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldimlqwt2yap91sfkxe2.png" alt="a single " width="800" height="86"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am partial to regular Stable Diffusion:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fep0f21jhitm0774hpclz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fep0f21jhitm0774hpclz.png" alt=" " width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can call the API with this method as much as you want within price/usage limits to quickly generate a ton of content. And it becomes very interesting to notice the differences, and similarities, between generation models, making it a potent learning tool for going a bit deeper into these technologies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Chaotic Audio Generation Pipeline:
&lt;/h2&gt;

&lt;p&gt;Then there was the second part of the tool that began to involve DSP (Digital Signal Processing) coding in Python. Overall, the part of the app that generates the audio was the hardest to implement. This was my first foray into Python for DSP, and GPT was helpful in easing me into it.&lt;/p&gt;

&lt;p&gt;Essentially, the idea was to generate text-to-speech audio so that the GPT response would audibly play whenever the Beacon was called (i.e., at the top of every hour). But I wanted to make it more interesting, so I decided to start randomly mangling the audio: occasionally reversing a section, scrambling it, adding delay, pitching it up or down, slowing it down or speeding it up. After that process, I send it to an AI voice changer to further enhance it and add another layer of detail.&lt;/p&gt;
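&lt;p&gt;As a toy illustration of that kind of mangling (not the actual DSP code from the project), chopping a mono sample array into chunks, reversing one at random, and occasionally scrambling the order could look like this:&lt;/p&gt;

```python
import random
import numpy as np

def mangle(samples, chunk=2048):
    # Split the signal into fixed-size chunks, reverse one at random,
    # and sometimes shuffle the chunk order; a toy sketch, not the real pipeline
    chunks = [samples[i:i + chunk] for i in range(0, len(samples), chunk)]
    if not chunks:
        return samples
    idx = random.randrange(len(chunks))
    chunks[idx] = chunks[idx][::-1]      # reverse a random section
    if random.choice([True, False]):
        random.shuffle(chunks)           # occasionally scramble the order
    return np.concatenate(chunks)
```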

&lt;p&gt;Essentially, every time the Beacon runs an iteration, this is the sequence of events:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Construct the text prompt with modification logic&lt;/li&gt;
&lt;li&gt;Generate the response from GPT API&lt;/li&gt;
&lt;li&gt;Construct an image prompt using some of the GPT response&lt;/li&gt;
&lt;li&gt;Generate the images&lt;/li&gt;
&lt;li&gt;Optionally generate a video (this is much more expensive so I comment it out)&lt;/li&gt;
&lt;li&gt;Generate the initial audio TTS using ElevenLabs API&lt;/li&gt;
&lt;li&gt;Process the TTS audio using my own audio mangling DSP&lt;/li&gt;
&lt;li&gt;Send the audio to Musicfy API for AI voice conversion&lt;/li&gt;
&lt;li&gt;Update the DB with all datapoints (currently a local JSON db, I want to fix that)&lt;/li&gt;
&lt;li&gt;Automatically "speak" the audio.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def beacon():
    raw_prompt = load_prompt()
    prompt = modify_prompt(raw_prompt)
    response = get_gpt_response(prompt)
    imagePrompt = modify_image_prompt(response)
    imagepaths = generate_images_from_prompt(imagePrompt)
    video_path = generate_video(imagePrompt, model="minimax/video-01", output_dir="videoOutputs")
    audiopath = generate_mp3_filename(response)
    write_to_db(prompt, response, audiopath, imagePrompt, image_paths=imagepaths)
    await speak(response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of this is wrapped in a basic Python app that I haven't spent much time on, but it gives some basic GUI access to the Beacon process. Set the hours you want it to run in, give an optional pre-prompt that will be added, set the time interval in minutes for the Beacon, and set parameters for the GPT API:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj7d7p57hhhl6sfz31az.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj7d7p57hhhl6sfz31az.png" alt=" " width="319" height="942"&gt;&lt;/a&gt;&lt;/p&gt;
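&lt;p&gt;The scheduling logic behind those fields could be as simple as a predicate checked once a minute. This is a guess at the logic, with parameter names mirroring the GUI fields, not the app's actual code:&lt;/p&gt;

```python
import datetime

def should_fire(now, start_hour, end_hour, interval_minutes):
    # Fire only inside the working-hours window, on the interval boundary
    in_window = now.hour in range(start_hour, end_hour)
    on_interval = now.minute % interval_minutes == 0
    return in_window and on_interval
```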

&lt;p&gt;Working with the Python GUI tool was not much fun, and ultimately turning this into a React app seems like the better option. Making form inputs and positioning them felt like a pain when I was just trying to get something working:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# === Form inputs ===
entry_api = ctk.CTkEntry(app, placeholder_text="23")
entry_api.pack(pady=10)
entry_prompt = ctk.CTkEntry(app, placeholder_text="")
entry_prompt.pack(pady=10)
entry_start = ctk.CTkEntry(app, placeholder_text="4")
entry_start.pack(pady=10)
entry_end = ctk.CTkEntry(app, placeholder_text="20")
entry_end.pack(pady=10)
entry_interval = ctk.CTkEntry(app, placeholder_text="1")
entry_interval.pack(pady=10)
entry_api.insert(0, apiKey)
entry_prompt.insert(0, user_prompt)
entry_start.insert(0, start_hour)
entry_end.insert(0, end_hour)
entry_interval.insert(0, interval_minutes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
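&lt;p&gt;The working-hours window and interval from the form above could be driven by a loop like this. This is a hypothetical sketch, not the app's actual code; the helper name &lt;code&gt;within_hours&lt;/code&gt; is mine:&lt;/p&gt;

```python
import asyncio
from datetime import datetime

def within_hours(hour, start_hour, end_hour):
    """True if `hour` falls inside the configured working window."""
    return hour in range(start_hour, end_hour)

async def run_beacon_loop(beacon, start_hour=4, end_hour=20, interval_minutes=30):
    """Fire `beacon` every `interval_minutes`, but only during working hours.
    Sketch only; the real app reads these values from the GUI entries."""
    while True:
        if within_hours(datetime.now().hour, start_hour, end_hour):
            await beacon()
        await asyncio.sleep(interval_minutes * 60)
```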



&lt;p&gt;The audio "mangling" process was a learning experience for me. I initially attempted to figure out how to send the audio out of Python world and into Supercollider (my old go-to for doing audio coding experiments). But after spending a good hour failing miserably, I decided to just stay in Python world, and started messing with the super convenient Jupyter Notebooks in VSCode to rapidly try out different audio processing methods, such as this one below for beat stretching (stretching time without pitch shifting). I should note that GPT made this pretty trivial, I just asked it to make this for me:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def beat_stretch(y, sr, factor=2.0):
    """
    Time-stretches audio without changing pitch
    (a rate factor above 1 speeds up; below 1 slows down).
    """
    channels = []
    for ch in y:
        ch_stretched = librosa.effects.time_stretch(ch, rate=factor)
        channels.append(ch_stretched)

    # Trim every channel to the shortest length so they can be stacked
    min_len = min(map(len, channels))
    channels = [ch[:min_len] for ch in channels]
    return np.vstack(channels)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As useful as GPT is (which it 100% is) for rapidly building things, I still had to know how to wire everything together and get it working. And GPT got a lot of things wrong. For instance, it was completely wrong about Supercollider and seemed to just be making things up, sending me down an artificial rabbit hole that wasted my time. That said, it made prototyping DSP much faster and more fun than it would otherwise have been.&lt;/p&gt;

&lt;p&gt;Even so, it was still very difficult. Working with audio is a lot harder than working with text and images. I ran into multiple issues with white noise blasting from the speakers due to files not being finished. I also had to dig into how synchronous and asynchronous code works in Python to ensure that everything happened in exactly the right order between the two audio APIs and the local sound processing.&lt;/p&gt;
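&lt;p&gt;The ordering problem can be sketched with &lt;code&gt;asyncio.to_thread&lt;/code&gt;: each blocking audio step runs off the event loop, but awaiting each one before starting the next guarantees the DSP never touches an unfinished file. The function names here are placeholders, not my real API:&lt;/p&gt;

```python
import asyncio

# Placeholder stages standing in for the TTS call, the local DSP,
# and the voice conversion.
def fake_tts(text):
    return f"tts({text})"

def fake_mangle(path):
    return f"mangled({path})"

def fake_convert(path):
    return f"converted({path})"

async def render_audio(text):
    # Each blocking step runs in a worker thread, but the awaits enforce
    # strict ordering: mangling never starts before the TTS file is complete.
    tts_path = await asyncio.to_thread(fake_tts, text)
    mangled_path = await asyncio.to_thread(fake_mangle, tts_path)
    return await asyncio.to_thread(fake_convert, mangled_path)

result = asyncio.run(render_audio("hello"))
```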

&lt;p&gt;Furthermore, the &lt;a href="https://docs.musicfy.lol/home#565ed63e-267e-438b-98ba-611ad9ef495e" rel="noopener noreferrer"&gt;Musicfy API&lt;/a&gt; was not that well documented, and it took me around an hour of banging my head against the wall with HTTP 500 errors to figure out what was going wrong. I spent all that time looking in the wrong direction, thinking it was an issue with the audio files (which would come up later, with sample-rate mismatches leading to the aforementioned white-noise blasts). In this case, it was the request format having spaces in one of the strings! Trust me when I say: don't expect GPT to figure everything out. This was an example where it was completely useless.&lt;/p&gt;

&lt;p&gt;Ultimately, the "sound processing" function works by essentially registering a variety of possible randomized "audio-chunk processors" that have a probability at any point of applying some sort of transformation to the audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create registry of chunk processors
CHUNK_TRANSFORMS = [
    maybe(prob=0.1, fn=reverse_chunk),
    maybe(prob=0.2, fn=overlay_reverse),
    maybe(prob=0.3, fn=stretch_pitch, factor=0.5),
    maybe(prob=0.3, fn=stretch_pitch, factor=1.7),
    maybe(prob=0.4, fn=apply_delay, delay_ms=300),
    maybe(prob=0.4, fn=beat_stretch, factor=2.0),
    maybe(prob=0.4, fn=beat_stretch, factor=0.35),
    maybe(prob=0.4, fn=apply_delay, delay_ms=50),
    maybe(prob=0.4, fn=apply_delay, delay_ms=1000),
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
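&lt;p&gt;The &lt;code&gt;maybe&lt;/code&gt; wrapper itself isn't shown above; a minimal version could look like this (my reconstruction, not necessarily the original):&lt;/p&gt;

```python
import random

def maybe(prob, fn, **kwargs):
    """Wrap `fn` so the returned transform applies it with probability
    `prob` and otherwise passes the chunk through untouched.
    (Reconstruction; the original helper isn't shown in the post.)"""
    def transform(chunk, sr):
        if random.random() >= prob:
            return chunk  # skip this transform
        return fn(chunk, sr, **kwargs)
    return transform
```

&lt;p&gt;With &lt;code&gt;prob=1.0&lt;/code&gt; the transform always fires, and with &lt;code&gt;prob=0.0&lt;/code&gt; it never does, which makes the behavior easy to test.&lt;/p&gt;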



&lt;p&gt;Those various processes are applied here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def process_chunks(y, sr, chunk_duration=2.0):
    chunk_size = int(sr * chunk_duration)
    total_chunks = y.shape[1] // chunk_size  # number of complete chunks; any partial tail is dropped

    processed_chunks = []

    for i in range(total_chunks):
        chunk = y[:, i*chunk_size : (i+1)*chunk_size]
        for transform in random.sample(CHUNK_TRANSFORMS, k=random.randint(1, len(CHUNK_TRANSFORMS))):
            chunk = transform(chunk, sr)
        processed_chunks.append(chunk)

    return np.hstack(processed_chunks)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
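&lt;p&gt;To illustrate the chunking arithmetic on its own (a standalone sketch with a toy sample rate, separate from the pipeline above):&lt;/p&gt;

```python
import numpy as np

def split_chunks(y, sr, chunk_duration=2.0):
    """Split a (channels, samples) array into complete fixed-size chunks;
    any partial tail is dropped."""
    chunk_size = int(sr * chunk_duration)
    total_chunks = y.shape[1] // chunk_size
    return [y[:, i * chunk_size:(i + 1) * chunk_size] for i in range(total_chunks)]

# A 2-channel, 5-second buffer at a toy sample rate of 100 Hz:
# 500 samples total, 200 per chunk, so 2 complete chunks survive.
chunks = split_chunks(np.zeros((2, 500)), sr=100)
```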



&lt;p&gt;The goal was to make the audio a little more abstract, maybe even to the point where it could be used in my music tracks as sampled material, or thrown into Melodyne to tune it and turn it into poetic vocals.&lt;/p&gt;

&lt;p&gt;At the end, I created a front end display in React to automatically display every single Beacon as it comes up. Currently, each row displays the Timestamp, Prompt, Response, Audiofile, Image Prompt, and Images. Right now I have the Beacon going off automatically every 30 minutes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvsu7brmn340ta1s3kd4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvsu7brmn340ta1s3kd4y.png" alt=" " width="800" height="874"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some next steps involve improving the database system. I want to record which model was used for each image, and store all the GPT settings per row. That data could eventually be used to fine-tune a model.&lt;/p&gt;
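&lt;p&gt;A JSON-backed &lt;code&gt;write_to_db&lt;/code&gt; extended to record the image model and GPT settings per row could look roughly like this (field names are my guesses, not the real schema):&lt;/p&gt;

```python
import json
import os
from datetime import datetime, timezone

def write_to_db(db_path, prompt, response, audiopath, image_prompt,
                image_paths=None, image_model=None, gpt_settings=None):
    """Append one Beacon row to a local JSON file.
    Hypothetical sketch; the field names are assumptions."""
    rows = []
    if os.path.exists(db_path):
        with open(db_path) as f:
            rows = json.load(f)
    rows.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
        "audiopath": audiopath,
        "image_prompt": image_prompt,
        "image_paths": image_paths or [],
        "image_model": image_model,       # which image model produced this row
        "gpt_settings": gpt_settings or {},
    })
    with open(db_path, "w") as f:
        json.dump(rows, f, indent=2)
```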

&lt;p&gt;Secondly, I am interested in creating my own fine-tuned model. This could be done a few ways. I have experimented with Sagemaker AI and Mistral, and found that Mistral could generate more of the abstract, artistically interesting text outputs that I want. GPT, by contrast, tends to have a kind of... sameness. It thinks "artsy" and "abstract" just mean endlessly waxing on about "whispers in space and time" and "robots whirring" and that kind of slop, whereas Mistral would spit out chaotic assemblages of EE Cummings-style modernist insanity, which is a lot more interesting artistically.&lt;/p&gt;

&lt;p&gt;A last potential goal would be to move all of this directly into the cloud, but that is a bit out of scope for me currently. My understanding of AWS is still growing, and I already made a major mistake with not realizing a Load Balancer had been running on there for months. Costly. So I don't really want any more expenses on top of just using the APIs right now.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bafh0dudqiwdiuh8i0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bafh0dudqiwdiuh8i0w.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuls1qfob81ogsgv3mv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuls1qfob81ogsgv3mv8.png" alt=" " width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqu2mqihc0jtnskvu1qk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqu2mqihc0jtnskvu1qk.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>react</category>
      <category>python</category>
      <category>chatgpt</category>
    </item>
  </channel>
</rss>
