
Trần Hoàng Long


High Level Judge Prompting Visualized

Hi, if you're new here, I'm Produdez. I recently built a simple LLM judge to learn more about prompt engineering and the evaluation process. Allow me to share that process from a high-level perspective.

Let's go through the process of making an LLM judge, with some simple examples and illustrations.

A judge is an LLM bot whose output includes some form of binary verdict, like PASS / FAIL, so that an evaluation metric can be calculated automatically.
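To make that concrete, here's a minimal sketch in Python: the judge is just something that returns a PASS / FAIL verdict, and the metric is a plain pass rate over labeled samples. `run_judge` is a placeholder for whatever LLM call you wire in, and the sample schema is my own assumption.

```python
# Minimal sketch: a judge returns a binary verdict, so the metric is a simple pass rate.
def run_judge(sentence: str) -> str:
    """Placeholder for the LLM judge call; should return "PASS" or "FAIL"."""
    raise NotImplementedError("wire in your LLM call here")

def pass_rate(samples: list[dict]) -> float:
    """Fraction of labeled samples where the judge agrees with the expected verdict."""
    hits = [run_judge(s["text"]) == s["expected"] for s in samples]
    return sum(hits) / len(hits)
```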

1. The Brainstorm

We start with an initial idea, e.g. "judge a valid question". From that, we carve out our initial requirements, such as: the question must have "?". Then we brainstorm some data samples from that requirement.

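To give a feel for what comes out of this step, here's roughly what that first rule and a few brainstormed seed samples could look like; the exact samples and labels are hypothetical, made up for illustration.

```python
# Hypothetical first pass at the rule set and seed data for the "is it a question?" judge.
requirements_v1 = [
    "The sentence must contain a question mark ('?').",
]

samples_v1 = [
    {"text": "What time is it?",   "expected": "PASS"},
    {"text": "The sky is blue.",   "expected": "FAIL"},
    {"text": "Is this a question", "expected": "FAIL"},  # no "?", so it fails under rule 1
]
```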

Then, what if a data sample looks... unsure?
In this brainstorming process you will likely encounter cases that require you to make a decision:

  • Do we ignore this case and mark it FAIL?
  • Or do we accept it and update our requirements rule set?


New "ambiguous sample" means potential new/updated requirement

This creates a brainstorming loop: you constantly come up with new samples from your ideas/requirements, and refine your requirements to match the emerging data.
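As a hypothetical example of such a decision point: a sample like "Tell me the time" clearly asks for something but has no "?". The loop is exactly this choice between labeling it under the current rule or growing the rule set; a small sketch:

```python
# A hypothetical ambiguous sample: it asks for something, but has no "?".
ambiguous = {"text": "Tell me the time"}

# Option A: keep the draft rule set as-is and treat the sample as a failure.
option_a_label = {**ambiguous, "expected": "FAIL"}

# Option B: accept it, which forces a new rule into the (still draft) requirements.
new_rule = "Imperative requests for information count as questions, even without '?'."
option_b_label = {**ambiguous, "expected": "PASS"}
```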


After that, we end up with a list of initial requirements, call it Requirements V1, along with Sample Data V1.

2. Prompt Preparation / Init Prompting

So now we have a list of "rules" for our judge and the data that should represent those rules.
What I do now is craft a Prompt V1 from our requirements: a very simple, baseline prompt that adheres to some of the rules. As for the dataset, I clean it up and format it properly into Dataset V1, to be used later in the prompt iteration pipeline.
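For the dataset, any consistent format works; a JSONL file with one labeled sample per line is an easy choice. The schema below is just an assumption on my part, not something prescribed:

```python
import json

# Hypothetical cleaned-up Dataset V1: one labeled sample per line (JSONL).
dataset_v1 = [
    {"id": 1, "text": "What time is it?", "expected": "PASS"},
    {"id": 2, "text": "The sky is blue.", "expected": "FAIL"},
]

with open("dataset_v1.jsonl", "w", encoding="utf-8") as f:
    for row in dataset_v1:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```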


An example starting prompt for our case could be:

```
SysPrompt: You'll be given a sentence, is it a question? Make sure it has "?"
Message: Sentence: ...
```
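As a rough sketch of how that prompt could map onto an actual call (I'm using Anthropic's Python SDK here only because the Workbench comes up later; the model name and the extra "Answer only PASS or FAIL" line are my own assumptions to make the verdict easy to parse):

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

SYS_PROMPT = (
    'You\'ll be given a sentence, is it a question? Make sure it has "?". '
    "Answer only PASS or FAIL."  # my addition, so the verdict is trivially parseable
)

def judge_is_question(sentence: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumption: any chat model would do here
        max_tokens=5,
        system=SYS_PROMPT,
        messages=[{"role": "user", "content": f"Sentence: {sentence}"}],
    )
    return response.content[0].text.strip()  # expected to be "PASS" or "FAIL"
```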

You can see that I did not try to pack every requirement we came up with into the prompt. Right now that's unnecessary; we'll be iterating on it later anyway, this is just a starting point.
The more cumbersome thing for the engineer to prepare is the Evaluation System and the Iteration Pipeline itself (a rough sketch in code follows the list below), so that we can easily:

  1. Try out our prompt
  2. Run it on all the tests and get results
  3. Do analysis and reasoning on the failed results
  4. Improve the dataset, the prompt, or both (among other things)
  5. Repeat, which is exactly our next step
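Here's a minimal sketch of what that evaluation loop can look like, assuming a judge function like the `judge_is_question` sketch above and the JSONL dataset from earlier; the failure list is what you sit with and reason about before writing the next prompt version.

```python
import json

def evaluate(judge, dataset_path: str) -> dict:
    """Run the judge over every sample, compute the pass rate, and collect failures."""
    with open(dataset_path, encoding="utf-8") as f:
        samples = [json.loads(line) for line in f]

    failures = []
    for sample in samples:
        verdict = judge(sample["text"])
        if verdict != sample["expected"]:
            failures.append({**sample, "got": verdict})

    return {
        "pass_rate": 1 - len(failures) / len(samples),
        "failures": failures,  # feed these into the analysis / next prompt version
    }

# Example of one iteration:
# report = evaluate(judge_is_question, "dataset_v1.jsonl")
# print(report["pass_rate"], report["failures"])
```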

3. Prompt Iterating

Here's a glimpse of the pipeline

[Image: the prompt iteration pipeline]

At this step, it might be helpful to address a question some of you may have:

Why build this complex pipeline when we can use Anthropic Workbench?

To that, the simple answer is "Scope", or "Scale".

It might be easy for you as a prompter to judge the result of a summarization bot you built and quickly iterate, but what if there are more requirements than a quick glance can capture? What if there are more test cases than you can read? What if the prompt has to evolve over time, growing with your company's interests while adhering to the same requirements?

These problems at scale are ones that a single workbench cannot capture. Hence this pipeline of automated evaluation and updates.

Now, let's go through the thought process behind the pipeline using our "is question" example.

Here's the ideal progression:

Using the same dataset, you keep getting better results with each new prompt, aiming for the elusive 100% (or something close, like 98% or 95%).


But there are reasons this is not the full picture of the process:

  • Your dataset is small; it does not fully represent the requirements of the task yet
  • Your requirements might not be fully concrete yet; you haven't added all the rules, or haven't thought of enough
  • There are edge cases yet to be addressed, which means more examples and more rules
  • OR, simply put, your dataset and rules are only V1. They're incomplete

There's no doubt that in this process new data samples will eventually emerge, either from new sparks of ideas, edge cases, or simply contradictions in the initial setup. And new data means modifying requirements, which could mean invalidating many of the prompts you currently have.

The example below shows:

  1. some new samples that come up during iteration
  2. forcing a change to a requirement, making prompt V3 very hard to iterate on
  3. So if you can't easily edit the prompt, what do you do?

[Image: new samples forcing a requirement change, invalidating prompt V3]

Your current prompt V3 is optimized for the current requirements, but as new ones are added or changed, you have to adapt it to the new rules.

How to iterate when rules change?

So you've spent your WHOLE day prompting and prompting away, then you realize the requirements need an update, and you panik. What to do?


Fear not, because here's where your prompting brain and ingenuity come into play. Remember all those prompts you crafted? Don't just throw them away.
There's something I like to call the "Core Logic": the main essence of your prompt. Instead of looking at the walls of text you created, find that core and compare it to the new requirements.

  1. If there's good logic that's not invalidated, extract that part out and keep it
  2. If not much logic is affected, maybe you can get away with reusing the whole prompt
  3. If not, well, back to step 2 you go with a new requirement and a new prompt altogether. But carry the invalidated core logic along as something "not" to do instead. Here's how it looks below

[Image: extracting the core logic from prompt V3 into a new baseline prompt V4]
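A hedged sketch of what that extraction can look like in code. The strings below are illustrative placeholders, not the actual prompts from the diagrams; the shape is simply "kept core + explicit note about the invalidated rule = new baseline V4".

```python
# Illustrative only: the core logic worth keeping from an earlier prompt,
# plus the piece that the new requirements invalidated.
core_logic_kept = "Decide whether the sentence is asking for information."
invalidated_logic = "A sentence only counts as a question if it ends with '?'."

prompt_v4 = f"""You are a judge for whether a sentence is a valid question.
{core_logic_kept}
Note: do NOT apply this old rule anymore: {invalidated_logic}
Answer only PASS or FAIL."""
```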

I would argue this is also personal taste, but I find it the most organized way to proceed without invalidating your own work. You extract the useful parts of your best performer and continue.
But notice that sometimes the optimized version of a prompt like V3 contains a lot of FLUFF and might not be the best target for core logic extraction; a simpler, purer one could be the way to go, maybe your V2 or even V1.
Of course, the new prompt V4 is very pure and basic, so it won't perform well yet, but with its core logic valid, you at least have a baseline you can iterate on.
That's how you iterate.

So, to recap section 3:

When iterating on prompts, as long as you don't have to update your dataset, you just prompt engineer.
But when the dataset updates, you need to use the core knowledge of your prompt to reconstruct it for the new requirements.
This "core" is honestly very tricky, as I have a tendency to over-complicate a prompt very quickly and lose sight of what its main parts are. But I believe that if we conduct our experiments in a disciplined way, keeping track of what the prompt DOES (via notes, structure, ...), it helps a lot, since the prompt and the complexity of the project will require us to not only patch the prompt but even rewrite it.
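One lightweight way to stay disciplined about this (my own habit, purely illustrative) is to store each prompt version next to a plain-language note of its core and of why it changed:

```python
# A simple, hypothetical log: each prompt version with its core idea and the reason it exists.
prompt_log = [
    {
        "version": "V3",
        "core": "Question = asks for information and ends with '?'",
        "change": "Tuned wording and examples against Dataset V1.",
    },
    {
        "version": "V4",
        "core": "Question = asks for information (the '?' rule was dropped).",
        "change": "Rebuilt from the still-valid core after the requirements update.",
    },
]
```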

...

The end. Hope you had a great read. Leave a like and comment !!

Written by Produde 👨‍💻

Find me on: LinkedIn 💼 | Twitter 🐦 | YouTube 🎥 | Buy me a coffee ☕
