Simon Risman for LLMWare

Are we all prompting wrong? Balancing Creativity and Consistency in RAG.

For a Boston native like myself, there are few things more heartwarming than Artificial Intelligence understanding the brilliance of Good Will Hunting. A few cursory prompts reveal that it views it as a "must-watch tale of redemption and self discovery".

[Image: Chat Will Hunting]

But a slightly closer look reveals what many users of LLMs have accepted as a given: slight variations in wording on an otherwise consistent answer. This is the result of stochastic generation.

Stochastic generation 🤖

This is a fairly common term: from online bootcamps to college lectures, students of AI are familiar with the concept. For those who need a quick refresher, here is the three-step generation loop that many LLMs follow.

LLMs are trained using a next-token prediction task, where the model predicts the next token in a sequence based on the previous tokens. This process involves:

  1. Tokenized Input: The input text is converted into a sequence of numbers (tokens).
  2. Probability Distribution: The model generates a probability distribution over the possible next tokens.
  3. Sampling Algorithm: This distribution is passed through a sampling algorithm to select the next token.

The probabilistic elements that this process introduces enable LLMs to generate more captivating dialogue, produce novel images, and creatively praise award-winning films.
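
To make those three steps concrete, here is a toy sketch in plain numpy. It is purely illustrative (not llmware internals): the vocabulary and scores are made up, and tokenization is reduced to a lookup.

import numpy as np

# toy next-token step: a tiny vocabulary (step 1) and made-up raw scores (logits)
vocab = ["redemption", "self-discovery", "Boston", "genius", "apples"]
logits = np.array([2.1, 1.9, 0.4, 0.2, -1.0])

def next_token(logits, temperature=1.0, sample=True):
    # step 2: turn raw scores into a probability distribution (temperature-scaled softmax)
    probs = np.exp(logits / temperature)
    probs = probs / probs.sum()
    # step 3: sample from the distribution (stochastic) or take the single most likely token (greedy)
    if sample:
        return int(np.random.choice(len(probs), p=probs))
    return int(np.argmax(probs))

print(vocab[next_token(logits, temperature=0.3)])   # low temperature: almost always "redemption"
print(vocab[next_token(logits, temperature=1.5)])   # high temperature: more variety from run to run
print(vocab[next_token(logits, sample=False)])      # greedy decoding: always "redemption"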

Randomness and RAG 🎰

When building RAG-based applications, we are often not as concerned with creativity as we are with facts. When dealing with facts, we want as little randomness involved as possible. In other words, instead of sampling from a probability distribution, it's beneficial to just take the maximum-likelihood token every time.

LLMWare allows you to explore how random your generated results are, as well as adjust how random you want them to be. Here's a quick demonstration:

Demo 🙌

Load the model

from llmware.models import ModelCatalog

model = ModelCatalog().load_model("bling-stablelm-3b-tool",
                                  sample=True,
                                  temperature=0.3,
                                  get_logits=True,
                                  max_output=123)

In the load_model method, we make a few important selections. bling-stablelm-3b-tool is one of our newest and highest-performing models.

Setting the sample attribute to True or False lets you switch between a stochastic approach and a top-token (greedy) approach.

The temperature can be an important tool to control the randomness of the output, with lower values making responses more focused and higher values increasing diversity in the generated text.

These key settings let you decide what kind of approach you want to take toward the probabilistic nature of your model.
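
For a fact-focused RAG pipeline, you might flip these settings the other way. Here is a minimal sketch that reuses the exact parameters from the snippet above and only changes the sampling behavior:

# same model, configured to always take the top token (greedy decoding)
fact_model = ModelCatalog().load_model("bling-stablelm-3b-tool",
                                       sample=False,      # turn off stochastic sampling
                                       temperature=0.3,   # temperature does not change the token ordering, so it matters little with sampling off
                                       get_logits=True,
                                       max_output=123)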

Run a simple inference on some sample text

# "sample" is the source text passage loaded earlier in the full example (see the GitHub repo linked below)
response = model.inference("What is a list of the key points?", sample)

This step is where your model does the heavy lifting, analyzing and summarizing the text you pass in as context.
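
A quick way to see the difference the sample setting makes is to run the identical prompt twice and compare. This is a hypothetical check, assuming the same sample passage as above and that the response dictionary exposes the generated text under "llm_response", as in other llmware examples:

# run the same prompt twice against the same context
response_1 = model.inference("What is a list of the key points?", sample)
response_2 = model.inference("What is a list of the key points?", sample)

# with sample=True the wording may drift between runs; with sample=False the two should match
print("run 1: ", response_1["llm_response"])
print("run 2: ", response_2["llm_response"])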

Run a sampling analysis

# measure how often the model picked a token other than the most likely one
sampling_analysis = ModelCatalog().analyze_sampling(response)
print("sampling analysis: ", sampling_analysis)

Now you get to see the analytics, giving you a better idea of how heavily your model samples from the lower-probability side of the distribution.

This analysis will include the percentage of tokens selected by the model that were also the highest-probability output, and will note the cases where a token other than the top token was selected.
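
If you want to see every field the analysis returns, a simple loop over the dictionary works. The exact field names, other than not_top_tokens used below, may vary between llmware versions:

# inspect each field returned by analyze_sampling
for key, value in sampling_analysis.items():
    print(key, ": ", value)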

For the cases where the top token was not selected, the code below prints the exact entries of those outputs, including each token's rank.

# each entry is a case where the sampled token was not the top-probability choice
for i, entries in enumerate(sampling_analysis["not_top_tokens"]):
    print("sampled choices: ", i, entries)

All these tools can help you make an informed decision on whether you want your model to think a little outside the box or stick to the most likely answer. To see this process in action, check out our YouTube video on consistent LLM output generation.

The full code for this example can be found in our GitHub repo.

If you have any questions, or would like to learn more about LLMWare, come to our Discord community. Click here to join. See you there! 🚀🚀🚀

Top comments (1)

Jerry Hargrive

What a fascinating read! The part about balancing creativity and consistency in RAG was really enlightening. You mentioned setting the sample attribute to True or Falseβ€”are there specific scenarios where one approach significantly outperforms the other? Would love to hear your thoughts on this.