When I wrote the first version of the magic cleaning tasks generator for Hounty I wasn't particularly happy with it, but I figured that was as good as it gets with Generative AI APIs and let it be until I had time to train my own model. Well, today I proved myself wrong, and I want to show you exactly how I managed to drastically improve performance by adjusting the prompt in a few unconventional ways.
I encourage you to follow along and try the prompts yourself. I used Gemini and Claude, but pretty much any chat AI app or API should work. And if you have never worked with GenAI - that's no problem - you don't need any prior knowledge to follow along.
The Task
Here is a picture of my living room this morning:
It looks like this a few days a week and every time it's a chore to get started. I do rely on Hounty to tell me what to do and where to start. So let's try and replicate the core functionality with nothing but prompts and see how to improve them.
Take one: You are an expert
You've probably seen the advice to start your prompt with "you are an expert in ...". It sounds like a good idea, but let's try and see what it actually does.
Paste an image of a messy room (yours or mine) into your AI chat and prompt this:
You are a home cleaning expert. Create a cleaning todo list of 3-5 most impactful tasks
Here's what I got from Gemini:

And now let's try without the "you're an expert" preamble:

The second output is about half the size of the "expert" one and I find it much more actionable and clear: "Clear the Floor" vs "Clear the Floor Transit Zones". So why does this happen?
Turns out, giving the AI a title doesn't unlock more knowledge; it just forces the model to act the part. Or, as researchers put it, Expert Personas Improve LLM Alignment but Damage Accuracy.
Not to say that personas have no use, but in the majority of cases adding "you are an expert" results in a confident tone and bloated output, which means you'll burn more tokens for no benefit.
Skipping it works better.
Take two: Detailed prompt
At first glance, the common advice to describe what you want in as much detail as possible makes sense, so let's give it a try.
Same image, more explicit prompt:
Look at this image. List every visible object. For each object, determine whether it is out of place based on what should be in a Living Room. Consider laundry, dishes and clutter specifically. Return your findings in a structured list of 3-5 most impactful cleaning items as JSON.
Example:
[{"task":"Throw away newspapers", "advice":"Recycle as newspapers"}]
Here's the JSON it returned in full, since it didn't fit on the first screen:
[
  {
    "task": "Fold and put away laundry from the sofa",
    "advice": "Sort by owner or garment type and move to the bedroom or closet to regain seating space."
  },
  {
    "task": "Stow reusable shopping bags",
    "advice": "Empty any remaining contents and store the Lidl bags in a pantry, kitchen drawer, or car trunk."
  },
  {
    "task": "Clear the floor of miscellaneous items",
    "advice": "Pick up the pink hairbrush and the black tub; return the brush to the bathroom and store the tub in a utility area."
  },
  {
    "task": "Declutter the windowsill and cat tree base",
    "advice": "Move non-decorative items and loose papers to a dedicated desk or filing system to improve natural light flow."
  }
]
And now try a simpler version:
Create a todo list of 3-5 most impactful cleaning tasks. Return as JSON {title,advice}
Here's the resulting JSON in full for comparison:
[
  {
    "title": "Clear the Sofa Surface",
    "advice": "Fold or hang the large pile of clothing on the couch. Clearing this central furniture piece will immediately make the room feel functional and less overwhelmed by 'clutter weight'."
  },
  {
    "title": "Consolidate Floor Items",
    "advice": "Gather the loose items on the rug and the shopping bags by the door. Use the empty black bin shown in the foreground to quickly corral miscellaneous objects that don't have a dedicated home yet."
  },
  {
    "title": "Clear the Windowsill and Radiator",
    "advice": "Remove the papers and pink organizer from the windowsill. In a bright room, clear horizontal surfaces near windows reflect more light and create a much cleaner visual line."
  },
  {
    "title": "Reset the Dining Area",
    "advice": "Tuck the dining chairs neatly under the table and remove any small items from the tabletop. Aligning the furniture creates a sense of order that anchors the rest of the open space."
  }
]
Both prompts produced a similar list, but the first one added a lot of noise, which would not only make it much harder to use in an API, but would also cost more since it used more input and output tokens.
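If you do plan to consume this JSON programmatically, one thing to watch out for: models often wrap JSON replies in markdown code fences. Here's a minimal defensive parser - a sketch, assuming Python on the receiving side, with a made-up sample reply:

```python
import json
import re

def extract_json(reply: str):
    """Pull the JSON payload out of an LLM reply that may be
    wrapped in markdown code fences like "json ..."."""
    match = re.search(r"```(?:json)?\s*(.*?)```", reply, re.DOTALL)
    return json.loads(match.group(1) if match else reply)

# A typical fenced reply:
reply = '```json\n[{"title": "Clear the Sofa Surface", "advice": "Fold the laundry."}]\n```'
tasks = extract_json(reply)
# tasks[0]["title"] == "Clear the Sofa Surface"
```

If the reply comes back as bare JSON, the fallback branch parses it as-is.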
And the pink brush the first list mentions? It's a cat brush; it doesn't belong in the bathroom - it belongs in a box with the rest of the cat stuff. Skipping that detail made the second list more precise.
So why did the detailed prompt backfire?
Because the second prompt gives the model room to reason. We've stated the goal clearly and let it do its job. When we cram too many details into the context, the LLM experiences an effect similar to overtraining a model. In other words:
Overtraining (often synonymous with overfitting) occurs when a machine learning model learns the training data too well, capturing noise and specific details rather than general patterns.
By giving the LLM too many details we limit the ways it can solve the problem, so as a general rule: less is more. You can always add or clarify something later.
And once again, fewer tokens spent!
Take three: Structured data
Now that we have the tasks in place, it would be nice to add the coordinates of the areas on the image that illustrate each task.
Let's try and enhance the prompt to return just that:
Identify 3-5 most impactful cleaning tasks and return ONLY JSON{task,x,y,w,h} mapping tasks with the areas on the image where x,y are top-left corner pixels of the bounding box and w,h is width and height
Adding "ONLY JSON" was needed to make sure the LLM doesn't interpret "do this and then that" as two separate requests and spit out both text and JSON.
Hm, I didn't ask for a "label".
We could keep patching the JSON prompt, but there's a better tool for the job - CSV (comma-separated values).
Try this adjusted prompt:
Identify 3-5 most impactful cleaning tasks and return ONLY CSV(task,x,y,w,h) mapping tasks with the areas on the image where x,y are top-left corner pixels of the bounding box and w,h is width and height
CSV is just as easy to parse as JSON, but it turns out to produce more reliable output with fewer tokens. I think it might have to do with the fact that JSON requires the AI to add structure around the values, and it can occasionally go overboard, whereas CSV is just text with the values as is. LLM-native, if you must.
So ditch JSON; use CSV and the like. The only gotcha to keep in mind is that a comma inside the text might break the output, so it's worth asking the LLM to quote or escape strings in the output.
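For what it's worth, consuming that CSV takes almost no code. A sketch in Python with made-up task values - `csv.DictReader` handles the quoted-comma case for free:

```python
import csv
import io

# A sample reply in the CSV(task,x,y,w,h) format the prompt asks for.
# Note the quoted first task: the comma inside it doesn't break the row.
reply = '''task,x,y,w,h
"Fold laundry, then put it away",0.42,0.55,0.30,0.25
Clear the windowsill,0.75,0.35,0.20,0.15'''

rows = list(csv.DictReader(io.StringIO(reply)))
boxes = [
    {"task": r["task"], "box": tuple(float(r[k]) for k in ("x", "y", "w", "h"))}
    for r in rows
]
```

Compare that with the fence-stripping and schema-drift babysitting that JSON replies tend to need.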
Take four: Spatial coordinates
In the previous step we asked the LLM to provide the coordinates of the top-left corner of the area corresponding to each task; let's see how it fared.
To test it I prompted Gemini and Claude to create an app. You can use the resulting artifact or try this prompt to create one yourself:
Create a small frontend app that would ask user to choose an image and paste a CSV(task,x,y,w,h) with tasks and areas on the image in either absolute pixels (add switch) or normalized coordinates (0.0 - 1.0) as well as x,y being either top-left or center of the box
Let's see how our original prompt with absolute pixel coordinates performed:
Meh, stupid AI, can't even outline a box properly, right?!
Now let's try to adjust the prompt to return normalized coordinates instead:
Identify 3-5 most impactful cleaning tasks and return ONLY CSV(task,x,y,w,h) mapping tasks with the areas on the image where (x,y) are coordinates of the top-left corner of the bounding box and x,y,w,h are normalized floats between 0.0 and 1.0 (fraction of image dimensions). Box must stay within image: x >= 0, x+w <= 1, y >= 0, y+h <= 1
I added the last sentence because Gemini kept returning absolute values (way more than one) for the y coordinate.
Here's how the resulting overlay looks:
Not too bad, but still a little off.
And now for the last trick, let's try and switch top-left corner to a center instead:
Identify 3-5 most impactful cleaning tasks and return ONLY CSV(task,x,y,w,h) mapping tasks with the areas on the image where (x,y) are coordinates of the center of the bounding box and x,y,w,h are normalized floats between 0.0 and 1.0 (fraction of image dimensions). Box must stay within image: x-w/2 >= 0, x+w/2 <= 1, y-h/2 >= 0, y+h/2 <= 1
Voilà! Almost as good as Hounty:
If you're wondering why LLMs work better with normalized, center-based values - it boils down to what the data the model was trained on looked like.
Asking a vision model to produce output in its native 0-1, center-based format makes it easier for it to give you what you asked for than to perform conversions on the fly.
"But what if I need a different format?" - you might ask.
I hear you - I also wouldn't want to release a new app every time I make an update to my API. The app in my case expects absolute values returned as JSON.
And this is where the plain old geometry comes in handy!
If GenAI produces reliable machine-readable output (like CSV), you can take that output and transform it with a little regular programming into whatever format you'd like.
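For example, turning the model's native normalized, center-based boxes into the absolute top-left pixel rectangles an app might expect is a few lines of arithmetic - a sketch, with the field names being whatever your own app uses:

```python
def center_norm_to_pixel_rect(x, y, w, h, img_w, img_h):
    """Convert a normalized center-based box (the model's native format)
    to an absolute top-left pixel rectangle."""
    left = (x - w / 2) * img_w   # shift from center to left edge, scale to pixels
    top = (y - h / 2) * img_h    # shift from center to top edge, scale to pixels
    return {
        "x": round(left),
        "y": round(top),
        "w": round(w * img_w),
        "h": round(h * img_h),
    }

# A box centered at (0.5, 0.5) covering half of a 1920x1080 photo:
rect = center_norm_to_pixel_rect(0.5, 0.5, 0.5, 0.5, 1920, 1080)
# -> {"x": 480, "y": 270, "w": 960, "h": 540}
```

The model keeps its comfortable format; your code owns the conversion.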
As a rule of thumb: if you can do something without AI - it would be cheaper and more reliable to do it without AI.
Cool, but how do I ship my prompts?
Great question! We've spent this whole article in the chat - that's not something you can just use, or can you?
And the answer is: yes you can! GenAI APIs are pretty much standardised, and you can write the same prompts as you do in the chat and send them via a REST API just like you would for any other service.
Almost any cloud provider nowadays has their managed Generative AI API; Claude, Gemini & co also offer that, and you can even host or run your own model with Ollama.
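To make that concrete, here's a sketch of the same image-plus-prompt request against the Gemini REST API, using nothing but the standard library. The model name and API key are placeholders, and the commented-out response path follows Gemini's reply shape - check your provider's docs before shipping:

```python
import base64
import json
import urllib.request

API_KEY = "YOUR_API_KEY"      # placeholder - load from your secrets store
MODEL = "gemini-1.5-flash"    # any vision-capable model should work

def build_request(image_bytes: bytes, prompt: str) -> dict:
    """Package an image and a text prompt as a Gemini generateContent payload."""
    return {
        "contents": [{
            "parts": [
                {"inlineData": {
                    "mimeType": "image/jpeg",
                    "data": base64.b64encode(image_bytes).decode(),
                }},
                {"text": prompt},
            ]
        }]
    }

payload = build_request(
    b"...jpeg bytes of the messy room...",
    "Identify 3-5 most impactful cleaning tasks and return ONLY CSV(task,x,y,w,h)",
)

# Sending it is one HTTP POST:
url = (f"https://generativelanguage.googleapis.com/v1beta/models/"
       f"{MODEL}:generateContent?key={API_KEY}")
req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
# reply = json.load(urllib.request.urlopen(req))
# csv_text = reply["candidates"][0]["content"]["parts"][0]["text"]
```

The prompt string is exactly what we typed into the chat - nothing about it changes when you move to the API.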
AI might have changed software development quite a bit, but at the end of the day it is still based on the same principles and sheer logic.
There's no perfect prompt like there's no perfect code, but every iteration gets you a little bit closer to it.
By the way, if you’re doing this with any other model, I’d love to hear how it goes for you! 🙏