Max Frecka

BeaconAI: automated batch generation of prompts and audio experimentation

I had the idea a few weeks ago to make an automated generative-AI "motivator" that could be scheduled to read generated material aloud from a speaker at certain times of the day. Sort of like a personalized, always-novel, and potentially inspiring "beacon" that re-centers me with positive messages.

Though that would potentially be very useful, I got more interested in the vast range of what could be done with just the basic idea of an automated generative system, so I started to go a bit wild experimenting with a variety of APIs: Replicate for image generation, Creatify for "AI vocal filters," GPT and locally hosted Mistral for LLMs, and various text-to-speech APIs.

It ultimately ended up being more of an experimental "lab" for mass-generating prompts, audio, images, and video, as you can see below. Each row is one "iteration" that happens on the hour, or at whatever interval you specify.

Prompt Engineering and Image Generation:

There is a lot you can do to automatically generate prompts and then mangle them to produce highly novel output. Here was one part of my approach:

def modify_prompt(prompt):
    # Optional user-supplied text from the GUI entry field
    user_modifier = entry_prompt.get()
    # Random fragments pulled from the YAML word lists
    pre_prompt = load_random_yaml_line('pre_prompts')
    modifier = get_random_modifier()
    modifier2 = load_random_yaml_line('art_descriptors')
    oblique_strategy = load_random_yaml_line('oblique_strategies')
    print(f"this modifier was used: {modifier}.")
    # Stitch everything into the final prompt sent to GPT
    final_prompt = f"{user_modifier}. {pre_prompt}. {oblique_strategy}. Reflect on {modifier} and {prompt}. Make it about {modifier2} {modifier}."
    return final_prompt

I determined the best method for rapidly prototyping prompt engineering was to use .yaml files. I had not worked with them before, but they're quite simple. The above function pulls at random from a list of "Oblique Strategies" (a card-based method for promoting creativity created jointly by musician/artist Brian Eno and multimedia artist Peter Schmidt, first published in 1975) and appends one to the prompt. Here are a few:

oblique_strategies:
  - (Organic) Machinery
  - A line has two sides
  - A very small object. Its center
  - Abandon normal instruments
  - Accept advice
  - Accretion
  - Allow an easement (an easement is the abandonment of a stricture)
  - Always first steps
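
The load_random_yaml_line helper isn't shown in the snippets here, so as a minimal sketch of how such a loader could work with PyYAML: the file layout (one file per list, named after its key) is my assumption, not necessarily how BeaconAI organizes things.

import random
import yaml  # PyYAML

def load_random_yaml_line(key):
    """Pick one random entry from the list stored under `key` in a YAML file."""
    # Assumption: each list lives in a file named after its key, e.g. oblique_strategies.yaml
    with open(f"{key}.yaml", "r", encoding="utf-8") as f:
        data = yaml.safe_load(f)
    return random.choice(data[key])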

Next, after GPT responds, I use the response to construct a second prompt which is then sent to Replicate (to multiple models for comparison and variety) for image generation:

def modify_image_prompt(prompt):
    # process the GPT responses for use
    truncated_prompt = " ".join(prompt.replace("\n", " ").strip().split()[:35])
    # extract new words from yaml
    art_style = load_random_yaml_line('art_styles')
    art_style2 = load_random_yaml_line('art_styles')
    descriptor1 = load_random_yaml_line('art_descriptors')
    descriptor2 = load_random_yaml_line('art_descriptors')
    descriptor3 = load_random_yaml_line('art_descriptors')
    descriptor4 = load_random_yaml_line('art_descriptors')
    final_prompt = f"A {descriptor1} {descriptor2} {art_style} of {truncated_prompt}, in the style of {descriptor3} {art_style2} with {descriptor4}."
    return final_prompt

Currently, each time a single iteration of the Beacon fires off, the prompt is sent to 4 different models, but more can easily be added. Here is an example prompt (on the left) and the associated series of image outputs (the first 3 are Stable Diffusion; the last 4 are a better version of Stable Diffusion (stability-ai/sdxl), ideogram-v3-turbo, google/imagen-3-fast, and playgroundai/playground-v2.5-1024px-aesthetic). I have found that Imagen is the worst, if I'm honest.
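
The image-generation call itself isn't shown in this post, so here is a rough sketch (not BeaconAI's actual function) of fanning one prompt out to several models with the Replicate Python client. The model slugs are taken from the names above and may need an owner prefix or a pinned :version suffix, and the shape of each model's output varies:

import replicate  # pip install replicate; expects REPLICATE_API_TOKEN in the environment

# Illustrative list based on the models named above
MODELS = [
    "stability-ai/sdxl",
    "ideogram-ai/ideogram-v3-turbo",
    "google/imagen-3-fast",
    "playgroundai/playground-v2.5-1024px-aesthetic",
]

def generate_images_from_models(prompt):
    """Send one prompt to every model and collect whatever each one returns."""
    results = {}
    for model in MODELS:
        try:
            output = replicate.run(model, input={"prompt": prompt})
            # Output is often a list of image URLs/file objects, but this varies per model
            results[model] = output if isinstance(output, list) else [output]
        except Exception as exc:
            results[model] = [f"failed: {exc}"]
    return results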

The next example starts with this "engineered" prompt:

. Do nothing for as long as possible. Reflect on tree and take random words chopped up and randomly combined and turn them into random sequences in the form of postmodernist knock knock jokes. Make it about multi-colored tree.

Its response:

Knock knock.
Who's there?
Red leaf falling on green branch in blue sky.
Red leaf falling on green branch in blue sky who?
Existential crisis of the multi-colored tree.

And the "engineered" image prompt:

A laughable exciting back-of-the-box SNES screenshots of Knock knock. Who's there? Red leaf falling on green branch in blue sky. Red leaf falling on green branch in blue sky who? Existential crisis of the multi-colored tree., in the style of abstract impressionist painting with happy.

I am partial to regular Stable Diffusion:

You can call the APIs this way as much as you want, within price/usage limits, to quickly generate a ton of content. It becomes very interesting to notice the differences, and similarities, between generation models, which makes this a potent learning tool for going a bit deeper into these technologies.

The Chaotic Audio Generation Pipeline:

Then there was the second part of the tool, which involved DSP (Digital Signal Processing) coding in Python. Overall, the part of the app that generates the audio was the hardest to implement. This was my first foray into Python for DSP, and GPT was helpful in easing me into it.

Essentially, the idea was to generate text-to-speech audio so that the GPT response would audibly play whenever the Beacon was called (i.e., at the top of every hour). But I wanted to make it more interesting, so I decided to start randomly mangling the audio: occasionally reversing a section, scrambling it, adding delay, pitching it up or down, or slowing it down or speeding it up. After that process, I send it to an AI voice changer to further enhance it and add another layer of detail.

Essentially, every time the Beacon runs an iteration, this is the sequence of events:

  1. Construct the text prompt with modification logic
  2. Generate the response from GPT API
  3. Construct an image prompt using some of the GPT response
  4. Generate the images
  5. Optionally generate a video (this is much more expensive so I comment it out)
  6. Generate the initial audio TTS using ElevenLabs API
  7. Process the TTS audio using my own audio mangling DSP
  8. Send the audio to Musicfy API for AI voice conversion
  9. Update the DB with all datapoints (currently a local JSON db, I want to fix that)
  10. Automatically "speak" the audio.
async def beacon():
    # 1-2: build the text prompt and get the GPT response
    raw_prompt = load_prompt()
    prompt = modify_prompt(raw_prompt)
    response = get_gpt_response(prompt)
    # 3-5: build the image prompt, then generate images (and optionally video)
    imagePrompt = modify_image_prompt(response)
    imagepaths = generate_images_from_prompt(imagePrompt)
    video_path = generate_video(imagePrompt, model="minimax/video-01", output_dir="videoOutputs")
    # 6-8: name the audio file used by the TTS / DSP / voice-conversion steps
    audiopath = generate_mp3_filename(response)
    # 9-10: persist everything and speak the result
    write_to_db(prompt, response, audiopath, imagePrompt, image_paths=imagepaths)
    await speak(response)

All of this is wrapped in a basic Python app that I haven't spent much time on, but it gives some basic GUI access to the Beacon process: set the hours you want it to run in, give an optional pre-prompt that will be added, set the time interval in minutes that you want the Beacon to run at, and set parameters for the GPT API.
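
The scheduling itself is simple in principle. As a minimal sketch (the loop below is my assumption about how the timing could be wired up; the real app reads these values from the form fields shown next):

import asyncio
from datetime import datetime

async def run_beacon_loop(start_hour=4, end_hour=20, interval_minutes=30):
    """Fire the Beacon on a fixed interval, but only inside the working hours."""
    while True:
        now = datetime.now()
        if start_hour <= now.hour < end_hour:
            await beacon()
        await asyncio.sleep(interval_minutes * 60)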

Working with the Python GUI tool was not much fun, and ultimately turning this into a React app seems like the better path. Making form inputs and positioning them felt like a pain when I was just trying to get something working:

# === Form inputs ===
entry_api = ctk.CTkEntry(app, placeholder_text="23")
entry_api.pack(pady=10)
entry_prompt = ctk.CTkEntry(app, placeholder_text="")
entry_prompt.pack(pady=10)
entry_start = ctk.CTkEntry(app, placeholder_text="4")
entry_start.pack(pady=10)
entry_end = ctk.CTkEntry(app, placeholder_text="20")
entry_end.pack(pady=10)
entry_interval = ctk.CTkEntry(app, placeholder_text="1")
entry_interval.pack(pady=10)
entry_api.insert(0, apiKey)
entry_prompt.insert(0, user_prompt)
entry_start.insert(0, start_hour)
entry_end.insert(0, end_hour)
entry_interval.insert(0, interval_minutes)

The audio "mangling" process was a learning experience for me. I initially attempted to figure out how to send the audio out of Python world and into Supercollider (my old go-to for doing audio coding experiments). But after spending a good hour failing miserably, I decided to just stay in Python world, and started messing with the super convenient Jupyter Notebooks in VSCode to rapidly try out different audio processing methods, such as this one below for beat stretching (stretching time without pitch shifting). I should note that GPT made this pretty trivial, I just asked it to make this for me:

import librosa
import numpy as np

def beat_stretch(y, sr, factor=2.0):
    """
    Time-stretches audio without changing pitch
    (factor > 1 speeds it up, factor < 1 slows it down).
    """
    channels = []
    for ch in y:  # y is shaped (channels, samples)
        ch_stretched = librosa.effects.time_stretch(ch, rate=factor)
        channels.append(ch_stretched)

    # Truncate channels to the same length so they can be stacked
    min_len = min(map(len, channels))
    channels = [ch[:min_len] for ch in channels]
    return np.vstack(channels)
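
For context, here's a quick usage sketch. The filenames are made up, and it assumes a stereo source file, since the function iterates over channels:

import librosa
import soundfile as sf

# Load as (channels, samples) so beat_stretch can iterate per channel
y, sr = librosa.load("tts_output.mp3", sr=None, mono=False)
slowed = beat_stretch(y, sr, factor=0.5)   # factor < 1 slows it down
sf.write("tts_slowed.wav", slowed.T, sr)   # soundfile expects (samples, channels)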

As useful as GPT is (which it 100% is) for rapidly building things, I still had to know how to wire the pieces together and get them working. And GPT got a lot of things wrong. For instance, it was completely wrong about SuperCollider and seemed to just be making things up, leading me to waste time going down an artificial rabbit hole. That said, it made quickly prototyping DSP a much faster and more fun experience than it otherwise would have been.

Even so, it was still very difficult. Working with audio is a lot harder than working with text and images. I ran into multiple issues with white noise blasting from the speaker due to files not being finished writing. I also had to dig into how synchronous and asynchronous code works in Python to ensure that everything happened in the right order between the two audio APIs and the local sound processing.
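
In practice that meant making sure the blocking steps run strictly in order even inside an async app. A sketch of the pattern, with hypothetical function names standing in for the real ElevenLabs, DSP, and Musicfy steps:

import asyncio

# Hypothetical stand-ins for the real TTS / DSP / voice-conversion steps
def generate_tts(text, path): ...
def mangle_audio(path): ...
def send_to_voice_changer(path): ...

async def build_audio(response, audiopath):
    # Each blocking step runs in a worker thread, but strictly one after another,
    # so the voice changer never receives a half-written file
    await asyncio.to_thread(generate_tts, response, audiopath)
    await asyncio.to_thread(mangle_audio, audiopath)
    await asyncio.to_thread(send_to_voice_changer, audiopath)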

Furthermore, the Musicfy API was not that well documented, and it took me around an hour of banging my head against the wall with HTTP 500 errors to figure out what was going wrong. I spent all that time looking in the wrong direction, thinking it was an issue with the audio files (which would come up later, with sample-rate mismatches leading to the aforementioned white-noise blasts). In this case, it was an issue with the request format having spaces in one of the strings! Trust me when I say: don't expect GPT to figure everything out. This was an example where it was completely useless.

Ultimately, the "sound processing" function works by registering a set of randomized "audio-chunk processors," each of which has some probability of applying a transformation to any given chunk of audio:

# Create registry of chunk processors
CHUNK_TRANSFORMS = [
    maybe(prob=0.1, fn=reverse_chunk),
    maybe(prob=0.2, fn=overlay_reverse),
    maybe(prob=0.3, fn=stretch_pitch, factor=0.5),
    maybe(prob=0.3, fn=stretch_pitch, factor=1.7),
    maybe(prob=0.4, fn=apply_delay, delay_ms=300),
    maybe(prob=0.4, fn=beat_stretch, factor=2.0),
    maybe(prob=0.4, fn=beat_stretch, factor=0.35),
    maybe(prob=0.4, fn=apply_delay, delay_ms=50),
    maybe(prob=0.4, fn=apply_delay, delay_ms=1000),
]
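
The maybe wrapper itself isn't shown in the post. Here's a minimal sketch of how it could work, assuming each transform takes (chunk, sr) plus keyword arguments, as the registry above and process_chunks below imply:

import random

def maybe(prob, fn, **kwargs):
    """Wrap a transform so it fires with probability `prob`, otherwise passes the chunk through."""
    def wrapper(chunk, sr):
        if random.random() < prob:
            return fn(chunk, sr, **kwargs)
        return chunk
    return wrapper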

Those various processes are applied here:

def process_chunks(y, sr, chunk_duration=2.0):
    chunk_size = int(sr * chunk_duration)
    # Split into fixed-size chunks; any trailing partial chunk is dropped
    total_chunks = y.shape[1] // chunk_size

    processed_chunks = []

    for i in range(total_chunks):
        chunk = y[:, i*chunk_size : (i+1)*chunk_size]
        # Apply a random subset of the registered transforms, in random order
        for transform in random.sample(CHUNK_TRANSFORMS, k=random.randint(1, len(CHUNK_TRANSFORMS))):
            chunk = transform(chunk, sr)
        processed_chunks.append(chunk)

    return np.hstack(processed_chunks)


The goal was to make the audio a little more abstract, maybe even to the point where it could be used in my music tracks as sample material, or thrown into Melodyne to be tuned and turned into poetic vocals.

At the end, I created a front end display in React to automatically display every single Beacon as it comes up. Currently, each row displays the Timestamp, Prompt, Response, Audiofile, Image Prompt, and Images. Right now I have the Beacon going off automatically every 30 minutes:

Some next steps involve improving the database system. I want to record which model was used for each image, and store all the GPT settings per row. This data could eventually be used to fine-tune a model.
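
As a sketch of what that could look like, here is one possible SQLite layout (not the app's actual schema), with the per-image model and per-row GPT settings broken out:

import sqlite3

conn = sqlite3.connect("beacon.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS beacons (
    id            INTEGER PRIMARY KEY,
    timestamp     TEXT,
    prompt        TEXT,
    response      TEXT,
    image_prompt  TEXT,
    audio_path    TEXT,
    gpt_settings  TEXT   -- JSON blob of the GPT parameters used for this row
);
CREATE TABLE IF NOT EXISTS images (
    id        INTEGER PRIMARY KEY,
    beacon_id INTEGER REFERENCES beacons(id),
    model     TEXT,      -- which image model produced this output
    path      TEXT
);
""")
conn.commit()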

Secondly, I am interested in creating my own fine-tuned model. This could be done a few ways. I have experimented with SageMaker AI and Mistral, and found that Mistral could generate more of the kind of abstract, artistically interesting text output that I want, whereas GPT tends to have a kind of... sameness. It tends to think "artsy" and "abstract" just mean endlessly waxing on about "whispers in space and time" and "robots whirring" and that kind of slop. Mistral, on the other hand, would just spit out chaotic assemblages of E. E. Cummings-style modernist insanity, which is a lot more interesting artistically.

A last potential goal would be to move all of this directly into the cloud, but that is a bit out of scope for me currently. My understanding of AWS is still growing, and I already made a major mistake by not realizing a Load Balancer had been running there for months. Costly. So I don't really want any more expenses on top of just using the APIs right now.
