
Jude Miracle

Building in public #6: I let AI judge every Ad on my platform. Here's what happened.

I use AI to score every ad on my platform. It is helpful most of the time, but sometimes it makes incorrect judgments.

This feature excites me, but it also makes me nervous. When a sponsor submits an ad through Adsloty, the ad is analyzed by AI before the writer sees it. The AI scores how well the ad fits the newsletter's audience, checks the tone, rates clarity, estimates clicks, and gives the writer specific recommendations.

The idea is straightforward: writers should have data to determine if an ad is suitable for their newsletter. They shouldn't have to guess.

However, the reality is more complicated than the idea. I will explain how I built this system, what the prompts look like, and what happens when the AI confidently suggests rejecting a suitable ad.

Why AI analysis at all?

A writer receives an ad request for a software tool. The newsletter focuses on technology, which seems like a good match. However, the ad copy is unclear. The call to action says "Learn more," and the description mentions features that are hard to understand without background information.

The writer must decide what to do with limited information. Some writers handle this well, but most struggle—they are writers, not ad salespeople. They either approve all ads (which can harm their audience) or reject anything that isn’t clearly suitable (which can hurt their revenue).

I aimed to provide them with a starting point. Not a decision, but a place to begin.

The model: Gemini 2.5 Flash

I chose Google's Gemini 2.5 Flash. It’s not the biggest or smartest model, but it is fast, affordable, and good enough for structured analysis.

I made this choice on purpose. This task is about classification and scoring, not creative writing, so I don’t need the most advanced model. Flash does a good job for this type of work, and saving money is important when analyzing every single booking.

let url = "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent";

let body = json!({
    "contents": [{
        "parts": [{ "text": prompt }]
    }],
    "generationConfig": {
        "maxOutputTokens": 4096,
        "responseMimeType": "application/json"
    }
});

The responseMimeType: "application/json" is very important. It tells Gemini to give back data in a structured JSON format instead of regular text. If you don’t use it, you might get responses like "Based on my analysis, I believe this ad would score approximately..." This type of response is not useful for a system that needs to store numbers in a database.

The prompt: smaller than you'd think

Here's the actual prompt I'm sending:

fn build_analysis_prompt(&self, ad_copy: &str, ad_title: &str, writer: &Writer) -> String {
    let subscriber_range = writer
        .subscriber_count_range
        .as_deref()
        .unwrap_or("unknown");

    format!(
        r#"Ad: "{}" - {}
Newsletter: {} subs

JSON format:
{{"fit_score":75,"tone_analysis":"Pro","clarity_rating":"High","estimated_clicks_min":30,"estimated_clicks_max":60,"recommendations":["brief tip","brief tip","brief tip"]}}

Rules: score 0-100, tone 1-2 words, clarity High/Medium/Low, tips 3-5 words max. IMPORTANT: Return RAW JSON only. Do not wrap in markdown or code blocks."#,
        ad_title, ad_copy, subscriber_range
    )
}

That's it. The whole prompt.

My first version was three paragraphs long. It explained Adsloty's business model, described what newsletter advertising is, defined each metric in detail, gave five examples of good and bad ads, and included instructions for edge cases. It was thorough but also slow, costly, and did not produce better results.

I kept cutting unnecessary details. With each version, I removed a sentence and assessed if the quality changed. Most of the time, it didn’t. The model understands ad copy and what a newsletter audience is. I was providing more information than the model needed.

The final prompt is very straightforward: ad title and copy, subscriber count, an example of the exact JSON format I want, constraints for each field, and a clear instruction about the format.

I learned three key things about prompts for structured output:

First, the example output is the most important part. The model replicates that structure faithfully, including field names, types, and shape.

Second, clear constraints prevent errors. Saying "Score 0-100" stops the model from giving a score of 150. "Tone 1-2 words" prevents it from writing long paragraphs. "Tips 3-5 words max" keeps recommendations brief and useful.

Third, I added the instruction "do not wrap in markdown" because I learned this from experience. Even with responseMimeType set to JSON, the model sometimes wraps the response in triple backticks. So, I include that instruction and handle it during parsing. It's a double-check.

Parsing: trust nothing

The model usually returns JSON. Sometimes it returns something that merely resembles JSON. It can also wrap the response in markdown fences or cut off mid-word. You have to be prepared to handle all of it.

fn extract_json_from_response(text: &str) -> &str {
    let trimmed = text.trim();

    // Handle ```json ... ``` wrappers
    if let Some(start) = trimmed.find("```json") {
        let json_start = start + 7;
        if let Some(end) = trimmed[json_start..].find("```") {
            return trimmed[json_start..json_start + end].trim();
        }
    }

    // Handle generic ``` ... ``` wrappers
    if let Some(start) = trimmed.find("```") {
        let json_start = start + 3;
        if let Some(end) = trimmed[json_start..].find("```") {
            return trimmed[json_start..json_start + end].trim();
        }
    }

    // Raw JSON — the happy path
    trimmed
}

After extraction, I validate every field:

// Clamp fit score to valid range
let fit_score = result.fit_score.clamp(0, 100);

// Validate clarity rating
let clarity_rating = match result.clarity_rating.as_str() {
    "High" | "Medium" | "Low" => result.clarity_rating,
    _ => "Medium".to_string(), // Safe default
};

// Sanity check click estimates
let (clicks_min, clicks_max) = if result.estimated_clicks_min <= result.estimated_clicks_max {
    (result.estimated_clicks_min, result.estimated_clicks_max)
} else {
    (result.estimated_clicks_max, result.estimated_clicks_min) // Swap if backwards
};
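Pulled together, those checks make one small normalization pass. Here's a standalone sketch (the `RawAnalysis` struct and tuple return are illustrative stand-ins for the real types):

```rust
// Hypothetical struct mirroring the fields parsed from the model's JSON.
struct RawAnalysis {
    fit_score: i64,
    clarity_rating: String,
    estimated_clicks_min: i64,
    estimated_clicks_max: i64,
}

// Normalize every field so only sane values reach the database.
fn normalize(raw: RawAnalysis) -> (i64, String, i64, i64) {
    let fit_score = raw.fit_score.clamp(0, 100);
    let clarity = match raw.clarity_rating.as_str() {
        "High" | "Medium" | "Low" => raw.clarity_rating,
        _ => "Medium".to_string(), // safe default
    };
    // Swap the click bounds if the model returned them backwards.
    let (min, max) = if raw.estimated_clicks_min <= raw.estimated_clicks_max {
        (raw.estimated_clicks_min, raw.estimated_clicks_max)
    } else {
        (raw.estimated_clicks_max, raw.estimated_clicks_min)
    };
    (fit_score, clarity, min, max)
}

fn main() {
    let raw = RawAnalysis {
        fit_score: 150,                         // out of range
        clarity_rating: "Crystal".to_string(),  // not a valid rating
        estimated_clicks_min: 90,
        estimated_clicks_max: 40,               // backwards
    };
    let (score, clarity, min, max) = normalize(raw);
    assert_eq!(score, 100);
    assert_eq!(clarity, "Medium");
    assert_eq!((min, max), (40, 90));
}
```

Whatever garbage the model emits, the database only ever sees a score in range, a known clarity label, and an ordered click interval.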

I also handle truncated JSON. Sometimes the response gets cut off — a recommendation array missing its closing bracket, or the entire object missing its final brace. Instead of failing, I try to fix it:

// Auto-fix truncated JSON
if !cleaned.ends_with('}') {
    if !cleaned.ends_with(']') {
        cleaned.push(']');
    }
    cleaned.push('}');
}

Is this a bit hacky? Yes. Does it work? Also yes. In production, you parse what the model actually returns, not what it ideally should return.
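The extraction and the auto-fix compose into a single cleanup step. A minimal standalone sketch, assuming the same fence and truncation patterns described above:

```rust
// Strip markdown fences, then repair truncated JSON, before parsing.
// A sketch combining the extraction and auto-fix steps; it only handles
// the failure modes seen in practice, not arbitrary broken JSON.
fn clean_response(text: &str) -> String {
    let trimmed = text.trim();

    // Strip a ```json ... ``` or generic ``` ... ``` wrapper if present.
    let inner = if let Some(start) = trimmed.find("```json") {
        let body = &trimmed[start + 7..];
        body.split("```").next().unwrap_or(body)
    } else if let Some(start) = trimmed.find("```") {
        let body = &trimmed[start + 3..];
        body.split("```").next().unwrap_or(body)
    } else {
        trimmed
    };

    let mut cleaned = inner.trim().to_string();

    // Close a truncated array and/or object so parsing has a chance.
    if !cleaned.ends_with('}') {
        if !cleaned.ends_with(']') {
            cleaned.push(']');
        }
        cleaned.push('}');
    }
    cleaned
}

fn main() {
    assert_eq!(clean_response("```json\n{\"a\":1}\n```"), "{\"a\":1}");
    // Truncated mid-array: gets its closing bracket and brace back.
    assert_eq!(
        clean_response("{\"recommendations\":[\"add proof\",\"fix CTA\""),
        "{\"recommendations\":[\"add proof\",\"fix CTA\"]}"
    );
}
```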

When the analysis runs

There are three triggers:

Automatic, when payment is made. When a sponsor completes their purchase and Stripe sends its webhook, the backend starts a background task to run the analysis. This does not block anything: the booking is created right away, and the AI results are added when they are ready.

fn trigger_ai_analysis_background(state: AppState, booking_id: Uuid, writer_id: Uuid) {
    tokio::spawn(async move {
        match run_ai_analysis(&state, booking_id, writer_id).await {
            Ok(_) => tracing::info!("AI analysis completed for booking {}", booking_id),
            Err(e) => tracing::warn!("AI analysis failed for booking {}: {}", booking_id, e),
        }
    });
}

If the AI call fails—like a timeout, a rate limit issue, or a bad response—the booking is still valid. The writer will just see the ad without a score. The analysis adds to the information but does not block anything.

A batch job runs every 15 minutes. This background job checks for any bookings that didn't get processed—possibly because the AI was down when the webhook was triggered or due to a timing issue. It looks for bookings without an analysis timestamp and processes up to 10 at a time, waiting 500ms between each call.

// Find bookings that need analysis
let bookings = sqlx::query_as::<_, Booking>(
    r#"SELECT * FROM bookings
    WHERE ai_analysis_timestamp IS NULL
        AND cancelled_at IS NULL
        AND status IN ('pending', 'paid')
    ORDER BY created_at DESC
    LIMIT $1"#
)
.bind(limit)
.fetch_all(pool)
.await?;
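The pacing logic is simple: take up to the limit, call, wait, repeat. Here's a synchronous sketch of that loop (the real version awaits tokio::time::sleep inside an async task; `analyze` is a hypothetical stand-in for the AI call):

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical stand-in for the real async analysis call.
fn analyze(booking_id: u32) -> Result<(), String> {
    println!("analyzed booking {booking_id}");
    Ok(())
}

// Process up to `limit` pending bookings, pausing between calls so the
// batch stays under the AI provider's rate limits. Returns how many
// bookings were analyzed successfully.
fn run_batch(pending: &[u32], limit: usize, pause: Duration) -> usize {
    let mut processed = 0;
    for &id in pending.iter().take(limit) {
        if analyze(id).is_ok() {
            processed += 1;
        }
        sleep(pause);
    }
    processed
}

fn main() {
    let pending = vec![101, 102, 103, 104];
    // 5ms here for the demo; the production job waits 500ms.
    let n = run_batch(&pending, 10, Duration::from_millis(5));
    assert_eq!(n, 4);
}
```

At 10 bookings per run with a 500ms gap, a full batch finishes in about five seconds, well within the 15-minute window before the next run.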

Manual trigger. Writers can rerun the analysis on any booking. They might want a fresh score, or the first analysis might have run before they updated their audience description. One click triggers a new analysis.

Three layers. If the first analysis misses something, the second one catches it. If the writer needs a redo, the third layer takes care of it. No booking should go without a score for more than 15 minutes.

What the writer sees

When a writer opens a booking request in their dashboard, the AI analysis is easy to find. It is prominently displayed, not hidden in a tab or behind a "show details" link.

The fit score gets a color:

const getFitScoreColor = (score: number) => {
    if (score >= 90) return "text-green-400";  // Great fit
    if (score >= 80) return "text-blue-400";   // Good fit
    if (score >= 70) return "text-yellow-400"; // Okay fit
    return "text-orange-400";                   // Questionable fit
};

Below the score:

  • Tone analysis: Professional and Direct
  • Clarity rating: High
  • Estimated click range: 45 to 120 clicks

Recommendations:

  • Make the CTA more specific.
  • Add social proof.
  • The tone matches the audience well.

These points help writers decide quickly, in 30 seconds instead of 5 minutes.

Does it actually work?

About 80% of the time, the AI's fit scores are reasonable. It usually gets the tone right and offers helpful recommendations. For simple cases, like a developer-tool ad in a tech newsletter, it performs well: a score of 85, a tone of "Technical, Professional," high clarity, and click estimates that make sense.

The other 20% is where it gets tricky.

When the AI is confidently wrong

I’ve noticed three types of bad scores:

The false positive. A crypto trading platform ad submitted to a personal finance newsletter scored 78. The AI noticed "finance" in both but missed the audience's intent. The newsletter targets people learning to budget, not those interested in trading cryptocurrencies.

The false negative. A well-written ad for a writing tool submitted to a creator economy newsletter got a score of 52. The ad's poetic style led the model to think it was unclear, even though it was perfect for writers. The AI confused creativity with a lack of clarity because it defines clarity as straightforward.

The hallucinated precision. The AI might say "Estimated clicks: 847-1,203" for a newsletter with 5,000 subscribers. It doesn't actually know the newsletter's click-through rate or how engaged the audience is. It produces a number that sounds exact but is completely made up. Confident and useless.

What I did about it

I didn’t fix the AI; I changed the message.

The fit score is just a starting point, not a final decision. The interface says “AI Smart Critique,” not “AI Decision.” The approve and reject buttons are always visible, no matter the score. A score of 40 can be approved, and a score of 95 can be rejected.

I thought about automatically rejecting anything below a certain score, but I’m glad I didn’t. A new writer might be excited to see a 40 because it’s their first paying sponsor. Meanwhile, a writer with 50,000 subscribers might reject a 90 because the brand doesn’t match their values. Context is important, and the AI doesn’t have all the details.

For click estimates, I might add a warning or remove them completely. They are the least reliable part of the analysis and can mislead more than help. A writer seeing “estimated 800 clicks” might approve an ad expecting that outcome, then feel disappointed when it only gets 50.

The recommendations are the most useful part. They are valuable not because they are always correct, but because they give the writer specific points to think about. For example, saying “the CTA could be more specific” helps the writer focus on the CTA instead of just skimming it.

The cost question

Every analysis costs money. Gemini 2.5 Flash is inexpensive, just a fraction of a cent per call, but it adds up. If Adsloty gets 1,000 bookings a month, that means at least 1,000 API calls, plus retries and manual re-analyses. That's why I chose Flash instead of Pro. The quality difference for this task is small, but the cost difference is large. I also keep the prompt minimal to save costs: fewer input tokens mean a lower cost per call.

Right now, the AI analysis costs less than $0.01 per booking. The platform fee for a $100 booking is $10. The AI cost is small compared to that. However, I am monitoring it because if the prompt gets longer or I switch to a more advanced model for special cases, the costs could change.
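The margin math is easy to sanity-check. A quick sketch using the article's own illustrative numbers ($0.01 per analysis as an upper bound, a 10% fee on a $100 booking):

```rust
// AI spend as a fraction of platform-fee revenue, per month.
// All inputs are illustrative figures from the article, not real data.
fn ai_cost_share(bookings: u32, cost_per_call: f64, avg_value: f64, fee_rate: f64) -> f64 {
    let ai_cost = bookings as f64 * cost_per_call;
    let fee_revenue = bookings as f64 * avg_value * fee_rate;
    ai_cost / fee_revenue
}

fn main() {
    // 1,000 bookings, $0.01 per analysis, $100 average booking, 10% fee.
    let share = ai_cost_share(1_000, 0.01, 100.0, 0.10);
    // $10 of AI spend against $10,000 of fees: 0.1% of revenue.
    assert!((share - 0.001).abs() < 1e-9);
    println!("AI cost is {:.2}% of fee revenue", share * 100.0);
}
```

The ratio is independent of booking volume, which is why the worry is prompt growth or a model upgrade (raising `cost_per_call`), not scale.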

Here’s what I would do differently

If I were starting over, I would do three things:

First, I would include the newsletter's niche and audience details in the prompt, not just the subscriber count. Currently, the prompt is so short that the model has to guess the audience based only on subscriber numbers. That’s why it gets confused about "finance" audiences. More context in the prompt would cost a few more tokens but would greatly improve the accuracy of the fit score.

Second, I would set up a feedback loop from the start. When a writer approves or rejects a booking, it shows whether the AI’s score was helpful. If writers often approve ads that the AI scored at 50, the scoring is likely wrong. I’m not gathering that data yet, but I should be.

Third, I would be more careful with click estimates. They should probably be ranges based on industry benchmarks, not just numbers from the AI. Or I should remove them until I have real click tracking data to compare.
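A benchmark-grounded estimate could be as simple as subscriber count times a click-through-rate range. A sketch (the 0.5%-2% bounds here are illustrative placeholders, not measured benchmarks):

```rust
// Estimate a click range from subscriber count and assumed CTR bounds,
// instead of letting the model invent a precise-sounding number.
fn estimate_clicks(subscribers: u32, ctr_low: f64, ctr_high: f64) -> (u32, u32) {
    let low = (subscribers as f64 * ctr_low).round() as u32;
    let high = (subscribers as f64 * ctr_high).round() as u32;
    (low, high)
}

fn main() {
    // Illustrative assumption: 0.5%-2% of subscribers click a sponsored link.
    let (low, high) = estimate_clicks(5_000, 0.005, 0.02);
    assert_eq!((low, high), (25, 100));
    println!("estimated clicks: {low}-{high}");
}
```

For the 5,000-subscriber newsletter from the "hallucinated precision" example, this yields 25-100 clicks, a range that is at least defensible, unlike the model's invented 847-1,203.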

The honest takeaway

AI features look great in demos and presentations. However, in real use, they often have problems. The model makes errors frequently enough that you cannot rely on it completely, but it is also correct often enough that removing it would make the product less effective.

The key is how you frame it. Think of AI as a helper, not as the one making decisions. It helps writers get started, highlights things they might overlook, and saves time on clear choices, allowing them to focus on the tougher decisions.

If you’re adding AI features to your product, my advice is to focus less on the prompts and more on what happens when the prompts fail. And they will. Regularly. Users will evaluate your product based on how well it deals with failures, not just on how great the best outcomes are.

Next time, I’ll write about the embeddable widget: how writers can add a "Sponsor this newsletter" button to their sites, and what it took to create a JavaScript widget that works on any website without causing issues.

More soon.
