I Built an AI Creative Director: Automating FB Ad Gen with GPT-4o Vision & Structured Outputs

Creating high-converting Facebook ads is a grind. You need to analyze what's working, understand your product's unique selling points, and then churn out dozens of visual variations.

"Prompter's block" is real. Staring at an empty Midjourney or DALL-E prompt box often leads to generic, unusable results.

So, I decided to automate the "Creative Director" role.

Instead of guessing prompts, I built a workflow in n8n that uses GPT-4o's Vision capabilities to "see" successful ads, analyze my product, and structurally engineer new ad concepts.

Here is how I built a Multimodal AI Pipeline that turns raw product photos into ready-to-test ad creatives.

![Screenshot: Full Workflow Overview in n8n canvas]

The Architecture: Eye, Brain, and Hand

This isn't just a simple "Text-to-Image" flow. It is a multimodal "Image-to-Text-to-Image" pipeline.

The workflow consists of three distinct phases:

  1. The Eye (Input Analysis): GPT-4o Vision analyzes reference images (inspiration) and product images.
  2. The Brain (Structured Logic): A LangChain Agent synthesizes these inputs into a strict JSON format.
  3. The Hand (Execution): A loop triggers the OpenAI Image Generation API for each concept.

Let's break down the technical implementation.


Phase 1: The "Vision" Analysis (Reverse Engineering)

First, we need to understand what makes a good ad. We don't want to describe it manually; we want the AI to do it.

I set up two parallel Google Drive nodes to fetch files:

  1. Inspiration Folder: Contains high-performing ads from competitors.
  2. Product Folder: Contains raw shots of the product we are selling.

Then, I pass these images (as Base64) into the OpenAI Chat Model node using gpt-4o.

For the Inspiration Analysis, I use a specific prompt to extract the principles rather than the content:

"Describe the visual style of this image... create a template of the style for inspirations. Ensure you do not make this product specific, rather focusing on creating outlines for static ad styles."

For the Product Analysis, I ask it to extract the emotions and selling points:

"Identify the core emotions behind it and the main product. We will use this later to connect the product image with some ad styles."

Now, we have two distinct text chunks: one describing the "Vibe" and one describing the "Subject".
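The OpenAI Chat Model node handles the image encoding for you, but for reference, a raw Chat Completions request with an inline Base64 image looks roughly like this (the <BASE64_STRING> placeholder stands in for the file downloaded from Drive, and the prompt text is abbreviated):

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe the visual style of this image and create a non-product-specific style template."
        },
        {
          "type": "image_url",
          "image_url": { "url": "data:image/jpeg;base64,<BASE64_STRING>" }
        }
      ]
    }
  ]
}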


Phase 2: From Chaos to JSON (The Structured Output)

This is the most critical part of any AI automation. LLMs love to chat. They love to say "Here are 10 prompts for you:".

We cannot automate "chat". We need Arrays.

To fix this, I used the Advanced AI nodes in n8n, specifically the Structured Output Parser.

I connected a LangChain Agent to the Structured Output Parser and defined a strict JSON Schema. This forces the AI to ignore its chatty tendencies and output pure, parseable JSON.

Here is the schema example I provided to the model:

[
  {
    "Prompt": "Sun-drenched poolside shot of the product on a marble ledge at golden hour, with soft shadows and warm tones. Aspect ratio 1:1."
  },
  {
    "Prompt": "Cool lavender-tinted sunset beach backdrop behind the product, highlighting reflective metallic accents. Aspect ratio 4:5."
  }
]
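The Structured Output Parser can generate its schema from an example like the one above. If you prefer to define the schema manually (recent n8n versions expose a JSON Schema mode on the node), the equivalent would be roughly:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "Prompt": { "type": "string" }
    },
    "required": ["Prompt"]
  }
}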

The Agent takes the Style Description (from Node A) and the Product Description (from Node B) and merges them into this exact format.

Why is this a game changer?
Because now I have a programmatic array of 10 prompts that I can iterate over. No regex, no string parsing, just clean data.

Phase 3: The Factory Line (Execution Loop)

Once we have the array, we use a Split Out node to separate the JSON items.
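In my setup, the Agent hands over a single item holding the whole array (the AI Agent node typically nests it under an output field, so check your node's output panel to confirm the exact name). Split Out then emits one item per prompt:

{ "output": [ { "Prompt": "Sun-drenched poolside shot..." }, { "Prompt": "Cool lavender-tinted sunset..." } ] }

becomes ten separate items shaped like:

{ "Prompt": "Sun-drenched poolside shot..." }

which is exactly what {{ $json.Prompt }} points at in the next node.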

Instead of using a pre-built DALL-E node, I opted for a raw HTTP Request node to hit the https://api.openai.com/v1/images/generations endpoint.

Why raw HTTP?
It gives me granular control over the API parameters (like quality, style, or testing new model IDs like dall-e-3 standard vs HD) that might not yet be exposed in standard nodes.

Here is the body configuration:

{
  "model": "dall-e-3",
  "prompt": "={{ $json.Prompt }}",
  "size": "1024x1024",
  "quality": "standard",
  "n": 1
}
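One detail worth flagging: with the body exactly as above, dall-e-3 returns a temporary hosted URL for each image. Since the last step of this workflow works with Base64 (see below), you can ask the API to return the image data inline by adding response_format to the body:

{
  "model": "dall-e-3",
  "prompt": "={{ $json.Prompt }}",
  "size": "1024x1024",
  "quality": "standard",
  "n": 1,
  "response_format": "b64_json"
}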

A Note on Rate Limits

When generating images via API, you will hit rate limits fast. I added a Wait node inside the loop (not shown in standard templates, but essential for production).

It pauses for a few seconds between requests to make sure we stay within OpenAI's rate limits (for image generation these are measured in images and requests per minute rather than tokens).
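As a rough sketch (exact parameter names can differ between n8n versions, so treat this as an illustration rather than a copy-paste config), the Wait node in the exported workflow JSON looks something like:

{
  "name": "Wait",
  "type": "n8n-nodes-base.wait",
  "parameters": {
    "amount": 5,
    "unit": "seconds"
  }
}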

The Result: Automated Creativity

Finally, the workflow grabs the Base64 data from the API response, converts it to a binary file using Convert to File, and uploads the fresh, AI-generated ad to a specific "Output" folder in Google Drive.
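With response_format set to b64_json as shown earlier, the response body looks like this, and Convert to File (using its Base64-to-file operation) is pointed at data[0].b64_json:

{
  "created": 1700000000,
  "data": [
    {
      "revised_prompt": "Sun-drenched poolside shot of the product...",
      "b64_json": "<BASE64_IMAGE_DATA>"
    }
  ]
}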

The Workflow Flow:

  1. Read Reference Image (Input).
  2. Read Product Image (Input).
  3. Analyze both with GPT-4o.
  4. Synthesize 10 new prompts via LangChain.
  5. Generate 10 new images via DALL-E 3.
  6. Save to Drive.

I built this because I wanted to test 50 different visual hooks for a campaign without briefing a designer 50 times. The results are surprisingly coherent because they are grounded in the visual analysis of ads that already work.

Get the Workflow

I've cleaned up the credentials and packaged this workflow into a JSON file you can import directly into your n8n instance.

👉 Download the AI Product Image Generator Workflow

Note: You will need your own OpenAI API Key (with GPT-4o access) and Google Drive credentials to run this.

Happy automating! 🤖
